Help on DCSE ali format

The DCSE alignment format contains sequence and structure information as used in the program DCSE. Sequences are stored as an alignment, and secondary structure is indicated by the insertion of special symbols between the sequence or gap characters. It can be used to store either DNA, RNA or protein alignments.

Extensive description

An alignment file usually has the extension ".ali". It consists of four info lines followed by several sequence lines.

The first info line shows two numbers: the first position shown in the alignment and the last position. The difference between the two must be equal to the number of positions in the alignment. These numbers can be preceded by a 'P' for a protein alignment, a 'D' for a DNA alignment or an 'R' for an RNA alignment. When none of these is present, the alignment is assumed to be an RNA alignment. The other info lines can contain any text. The second info line is usually empty. The third and fourth lines usually contain an indication of the position.

A sequence line consists of the entire sequence (including gaps), followed by a space, a number five characters long, another space, and the species name of at most 40 characters. The number is not essential, but there must be exactly 7 characters between the sequence and its name. All sequences should have equal length. Each sequence line consists of symbols for nucleotides or gaps, alternating with positions that are either blank or contain special symbols, e.g. symbols delimiting secondary structure elements.
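The layout described above can be sketched in Python as follows. This is only an illustration, not DCSE's own reader: the names `Header`, `Record`, `parse_header` and `parse_sequence_line` are hypothetical, the exact spacing accepted on the first info line is an assumption, and the width of the sequence field (`seq_width`) is assumed to be known to the caller. Because the sequence itself may contain blanks, the line is sliced by column rather than split on whitespace.

```python
import re
from typing import NamedTuple

# One-letter prefixes as described in the text; the dict name is hypothetical.
ALIGNMENT_TYPES = {"P": "protein", "D": "DNA", "R": "RNA"}


class Header(NamedTuple):
    kind: str    # "protein", "DNA" or "RNA"
    first: int   # first position shown in the alignment
    last: int    # last position shown in the alignment


class Record(NamedTuple):
    sequence: str  # includes gaps and the interleaved symbol positions
    number: str    # content of the 5-character number field (may be empty)
    name: str      # species name


def parse_header(line):
    """Parse the first info line: optional P/D/R, then two positions."""
    match = re.match(r"\s*([PDR])?\s*(\d+)\s+(\d+)", line)
    if match is None:
        raise ValueError(f"not a valid first info line: {line!r}")
    letter = match.group(1) or "R"  # RNA is assumed when no letter is given
    return Header(ALIGNMENT_TYPES[letter],
                  int(match.group(2)), int(match.group(3)))


def parse_sequence_line(line, seq_width):
    """Split a sequence line into sequence, number and name by column."""
    line = line.rstrip("\n")
    sequence = line[:seq_width]
    # Exactly 7 characters separate sequence and name:
    # a space, the 5-character number, and another space.
    separator = line[seq_width:seq_width + 7]
    name = line[seq_width + 7:].rstrip()
    return Record(sequence, separator[1:6].strip(), name)
```

For example, `parse_header("1 100")` yields an RNA header, since no letter precedes the positions.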

Symbols for nucleotides

Completely identified nucleotides are indicated using the standard codes: U, C, A, G.

The standard ambiguity codes also apply for partially identified nucleotides:

Y    U or C
R    A or G
M    A or C
K    U or G
W    U or A
S    C or G
B    U, C, or G
D    U, A, or G
H    U, C, or A
V    C, A, or G
N    U, C, A, or G
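As an illustration, the standard (IUPAC) ambiguity codes can be held in a lookup table and used to test whether two symbols could stand for the same nucleotide. The names `AMBIGUITY`, `possible_bases` and `compatible` are hypothetical and are not part of DCSE.

```python
# Standard IUPAC ambiguity codes, written with U as for RNA.
AMBIGUITY = {
    "Y": "UC", "R": "AG", "M": "AC", "K": "UG", "W": "UA", "S": "CG",
    "B": "UCG", "D": "UAG", "H": "UCA", "V": "CAG", "N": "UCAG",
}


def possible_bases(symbol):
    """Return the set of nucleotides a symbol may stand for."""
    symbol = symbol.upper()
    if symbol in {"U", "C", "A", "G"}:
        return {symbol}
    if symbol in AMBIGUITY:
        return set(AMBIGUITY[symbol])
    raise ValueError(f"not a nucleotide symbol: {symbol!r}")


def compatible(a, b):
    """True if the two symbols share at least one possible nucleotide."""
    return bool(possible_bases(a) & possible_bases(b))
```

With this table, `compatible("R", "G")` is true, while `compatible("Y", "R")` is false because pyrimidines and purines do not overlap.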

A problem arises because the symbol "N", used by authors when publishing or submitting sequences, can have two different meanings:

  1. A residue could not be properly identified on a sequencing gel. This can be due to template heterogeneity (e.g. in the case of reverse transcriptase sequencing) or to ambiguity in the polymerase reaction. In this case the number of unidentified nucleotides, although not their identity, is known.

  2. The sequence was only partially determined and, after alignment with a complete, related sequence, the undetermined areas were padded with N's. In this case both the number and the identity of the nucleotides are unknown.

Unfortunately, most authors do not explicitly mention which case applies. We assume, somewhat arbitrarily, that case 1 (known number of nucleotides, unidentifiable on a sequencing gel) applies if a single N, or a row of 5 N's at most, is found intercalated between known nucleotides. Rows of more than 5 N's are treated as unsequenced areas of unknown length in a partially sequenced RNA. Two different symbols are used to distinguish the two cases, as follows:

N
unidentified nucleotide, length of unidentified area probably known.
o
unidentified nucleotide, length of unidentified area unknown. In this case we intercalate a number of "o" symbols matching the number of nucleotides in the most closely related species.
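The rule of thumb for runs of N's can be sketched as below. The function name `classify_n_runs` and its parameter are hypothetical. The sketch classifies runs by length only: it does not check that a short run really lies between known nucleotides, and it does not decide how many "o" symbols to write, since that number comes from the most closely related species.

```python
import re


def classify_n_runs(sequence, max_known_run=5):
    """Return (start, length, symbol) for every run of N's.

    Runs of at most `max_known_run` N's keep the symbol "N";
    longer runs are marked "o" (area of unknown length).
    """
    runs = []
    for match in re.finditer(r"N+", sequence):
        length = match.end() - match.start()
        symbol = "N" if length <= max_known_run else "o"
        runs.append((match.start(), length, symbol))
    return runs
```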

Symbols for amino acids

The standard one letter codes apply:
D    aspartic acid (Asp)
E    glutamic acid (Glu)
G    glycine (Gly)
N    asparagine (Asn)
Q    glutamine (Gln)
C    cysteine (Cys)
S    serine (Ser)
T    threonine (Thr)
Y    tyrosine (Tyr)
A    alanine (Ala)
V    valine (Val)
L    leucine (Leu)
I    isoleucine (Ile)
P    proline (Pro)
F    phenylalanine (Phe)
M    methionine (Met)
W    tryptophan (Trp)
K    lysine (Lys)
R    arginine (Arg)
H    histidine (His)

Symbol for gaps

The symbol "-" is used to denote the presence of a gap at an alignment position. Note that in areas of undetermined sequence, the placement of symbols for nucleotides ("o") and gaps ("-") is hypothetical. The pattern of "o" and "-" is matched to that of the most closely related known sequence. The presence of these symbols in areas of undetermined sequence is required to allow DCSE to check the consistency of the postulated secondary structure patterns.
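Matching the pattern of "o" and "-" to a related sequence can be sketched as follows. The function name `placeholder_from_relative` is hypothetical, and the input is assumed to hold only sequence and gap characters for the region in question, with the interleaved symbol positions already removed.

```python
def placeholder_from_relative(relative_region):
    """Copy the gap pattern of a related sequence.

    Gaps stay gaps; every other character becomes "o".
    """
    return "".join("-" if ch == "-" else "o" for ch in relative_region)
```

For a related region `AC--GU-A` this gives `oo--oo-o`.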

Symbols describing secondary structure

The following symbols are used to indicate secondary structure elements:
[ and ]
beginning and end of one strand of a helix.
^
symbolizes ][, a new helix starting immediately after the previous one.
{ and }
beginning and end of an internal loop or bulge loop interrupting a helix strand.
( and )
enclose a base forming part of a non-standard pair (any pair other than G.C, A.U, or G.U).
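A simple well-formedness check on the paired symbols can be written with a stack. This is a generic delimiter-matching sketch, not the consistency check that DCSE itself performs, and the name `delimiters_balanced` is hypothetical. It assumes that every opening symbol is closed on the same line and that pairs are properly nested. All other characters are ignored.

```python
# Closing symbol -> matching opening symbol.
PAIRS = {"]": "[", "}": "{", ")": "("}
OPENERS = set(PAIRS.values())


def delimiters_balanced(line):
    """True if [ ], { } and ( ) are properly nested and closed."""
    stack = []
    for ch in line:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closing symbol must match the most recent opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # nothing may remain open at the end of the line
```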

Helix numbering

To allow the identification of secondary structure elements, "helix numbering lines" are intercalated between the sequences. The name of such a line must begin with "Helix numbering". These lines contain the helix names, but otherwise have an empty sequence (only gap characters). The 5'-strand of a helix is indicated by the helix name, and the 3'-strand by the helix name followed by a prime (').
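Since helix numbering lines are recognized by their name, they can be separated from ordinary sequences with a prefix test. The sketch below assumes the records are already available as (sequence, name) pairs; the function name `split_records` is hypothetical.

```python
def split_records(records):
    """Separate ordinary sequences from helix numbering lines.

    `records` is an iterable of (sequence, name) pairs.
    """
    sequences, helix_lines = [], []
    for sequence, name in records:
        if name.startswith("Helix numbering"):
            helix_lines.append((sequence, name))
        else:
            sequences.append((sequence, name))
    return sequences, helix_lines
```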

Other symbols

The characters A-Z are not allowed between the sequence characters, since they are used for describing the sequence. Currently only the symbols describing secondary structure elements (see above) and the asterisk (*) have a special meaning. The user can choose other symbols to indicate other things in an alignment, e.g. '' for an alpha helix and '#' for a beta sheet in protein alignments.

Schematic representation

In the following schematic representation, spaces are replaced by bullet characters (•). Characters are indicated by X. Options which have to be filled in are surrounded by question marks (?). Continuations with a similar format are indicated by "....".

?P, D, or R??first position??last position?

10      ....   x
....|....|      ....   ....|
XXX[XXX]XXXX      ....   X[XXX]X•••••1•?sequence name 1?
XXX[X{X}X]XXXX      ....   X[XX]XX•••••2•?sequence name 2?
XXX[XXX]XXXX      ....   X[XXX]X•••••3•?sequence name 3?
XXX[X(X)X]XXXX      ....   X[X(X)X]X•••••4•?sequence name 4?