FASTA format requirements

A sequence in FASTA format is expressed in two or more lines of text. The first line is an identifying header; the remaining lines (one or more) contain the sequence itself.
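As an illustration of this layout, the following minimal sketch splits FASTA text into records of one header and its sequence lines. The function name parse_fasta is hypothetical, and skipping blank lines is an assumption; the requirements above do not say how blank lines are treated.

```python
def parse_fasta(text):
    """Split FASTA text into (header, sequence_lines) records."""
    records = []
    header, lines = None, []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, lines))
            header, lines = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first header line")
        else:
            lines.append(line)
    if header is not None:
        records.append((header, lines))
    # Every record needs a header plus at least one sequence line.
    for name, seq_lines in records:
        if not seq_lines:
            raise ValueError(f"record {name!r} has a header but no sequence lines")
    return records
```

Each returned record keeps its sequence lines separate so that per-line rules can still be checked afterwards.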

The header

The header line starts with a greater-than symbol (>) followed by at most 200 ASCII characters. Allowed characters are "A" to "Z", "a" to "z", "0" to "9", "_", "-", ".", ",", ";" and "|"; spaces are allowed between them. Because the header identifies the sequence, it must be unique for each sequence in the reference.
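The header rules can be checked with a short sketch like the one below. The names HEADER_RE and check_headers are hypothetical. It assumes that the 200-character limit does not count the leading ">" and that an empty header is invalid; neither point is stated explicitly above.

```python
import re

# Allowed characters: letters, digits, "_", "-", ".", ",", ";", "|" and spaces.
# Assumption: 1 to 200 characters, not counting the leading ">".
HEADER_RE = re.compile(r"[A-Za-z0-9_.,;| -]{1,200}")


def check_headers(headers):
    """Return a list of problems found in header strings (without the '>')."""
    problems = []
    seen = set()
    for header in headers:
        if not HEADER_RE.fullmatch(header):
            problems.append(f"invalid header: {header!r}")
        if header in seen:
            problems.append(f"duplicate header: {header!r}")
        seen.add(header)
    return problems
```

An empty result means every header uses only allowed characters, fits the length limit, and is unique within the list.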

The sequence

The only characters accepted for representing a nucleotide sequence are "A", "C", "G", "T" and "N". Lowercase versions are also allowed and represent low-complexity regions.

The sequence of a contig can be written on a single line or on multiple lines after the header. If a contig sequence spans multiple lines, every line must have the same length, except for the last line, which can be shorter or longer than the previous lines. It is customary to use lines of 50 or 60 characters for readability.

If the sequence is written as one long single line, that line must not exceed 65,535 bases. A single-line sequence longer than this is not supported and will trigger an error.
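A minimal sketch of these sequence checks follows. The names check_sequence_lines, SEQ_RE and MAX_SINGLE_LINE are hypothetical. It assumes the 65,535-base limit applies only when the whole sequence sits on one line, which is how the rule is worded above.

```python
import re

MAX_SINGLE_LINE = 65_535  # limit for a sequence written as one line
SEQ_RE = re.compile(r"[ACGTNacgtn]+")


def check_sequence_lines(lines):
    """Return a list of problems found in the sequence lines of one record."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not SEQ_RE.fullmatch(line):
            problems.append(f"line {number}: characters other than A, C, G, T, N")
    if len(lines) == 1:
        if len(lines[0]) > MAX_SINGLE_LINE:
            problems.append("single-line sequence longer than 65,535 bases")
    else:
        # All lines except the last must share one length; the last may differ.
        widths = {len(line) for line in lines[:-1]}
        if len(widths) > 1:
            problems.append("lines before the last one differ in length")
    return problems
```

The function takes the lines of a single record, so it would be called once per header.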

Size of a sequence

The minimum length of a sequence is 160 bp: at least 60 bp for the insert, plus 50 bp upstream and 50 bp downstream as a design buffer for primer positioning during amplicon design. The recommended upstream and downstream context buffer sequence for optimal designs is 1,000 bp.
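The arithmetic behind the 160 bp minimum (50 + 60 + 50) can be written out as a small sketch. The function name classify_length and its return strings are hypothetical, and the code assumes the recommended 1,000 bp buffer applies on each side of the insert.

```python
MIN_INSERT = 60            # minimum insert length in bp
MIN_FLANK = 50             # minimum buffer on each side in bp
RECOMMENDED_FLANK = 1000   # assumption: recommended buffer on each side
MIN_LENGTH = MIN_FLANK + MIN_INSERT + MIN_FLANK  # 160 bp


def classify_length(seq_len, insert_len=MIN_INSERT):
    """Classify a sequence length against the minimum and recommended sizes."""
    if insert_len < MIN_INSERT:
        raise ValueError("insert must be at least 60 bp")
    if seq_len < insert_len + 2 * MIN_FLANK:
        return "too short"
    if seq_len >= insert_len + 2 * RECOMMENDED_FLANK:
        return "meets recommended buffer"
    return "meets minimum only"
```

With the default 60 bp insert, a 160 bp sequence just meets the minimum, and 2,060 bp is the shortest length that leaves 1,000 bp on both sides under this assumption.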

Ensure that the sequence of each record is not redundant with the sequence in any other record of the FASTA file (that is, it does not overlap in genomic space with the sequence in any other record). Redundant sequences interfere with calculations of primer specificity and can lead to missed regions in the solutions. Sequences that overlap in genomic space must be combined by the customer into a single FASTA record containing the non-redundant sequence for the combined genomic regions.
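If the genomic coordinates of each record are known, overlapping records can be flagged before the file is submitted. The sketch below is only an illustration: the function name find_overlaps and the input tuples (name, chromosome, start, end) are hypothetical, and coordinates are assumed to be 0-based with an exclusive end. It flags overlaps but does not combine the sequences.

```python
def find_overlaps(regions):
    """Flag records whose genomic intervals overlap an earlier record.

    regions: iterable of (name, chromosome, start, end) tuples.
    Returns a list of (earlier_name, later_name) pairs.
    """
    overlaps = []
    ordered = sorted(regions, key=lambda r: (r[1], r[2], r[3]))
    current = None  # region reaching furthest right on the current chromosome
    for region in ordered:
        name, chrom, start, end = region
        same_chrom = current is not None and current[1] == chrom
        if same_chrom and start < current[3]:
            overlaps.append((current[0], name))
        # Track the furthest-reaching region; reset on a new chromosome.
        if not same_chrom or end > current[3]:
            current = region
    return overlaps
```

Every record that overlaps an earlier one is reported once, paired with the earlier record that extends furthest; regions that merely touch end to start are not counted as overlapping.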