Help on distrib format

Each sequence in our databases is stored in a separate file in the distribution format, together with information about the sequence such as taxonomy, accession number, and literature reference. If your browser supports forms, you need not worry about this format: you can choose from a number of formats in which to receive your LSU or SSU sequences, and the sequences will be converted on the fly.

The distribution format

The sequences in the distribution format use the standard letter codes for nucleotides. They also include gap symbols, and secondary structure elements are indicated by the insertion of special symbols. For example, an SSU rRNA sequence could have the following structure:
acc:X53497                             (accession no.)
src:NoData                             (source)
str:MUCL 29800, ATCC 18804, CBS 562    (strain info.)
ta1:Eukarya                            (taxonomic info.)
ta2:Fungi                              (taxonomic info.)
ta3:Eumycota                           (taxonomic info.)
ta4:Ascomycotina                       (taxonomic info.)
ta5:Hemiascomycetes                    (taxonomic info.)
chg:this sequence is not in EMBL       (changes other than del with regards to original EMBL entry)
rem:this is just an example            (remarks about the entry)
aut:person 1, person 2                 (authors)
ttl:The SSU rRNA sequence of a species (title)
jou:Journal name                       (journal)
dat:1989                               (journal year)
vol:12                                 (journal volume)
pgs:223-229                            (journal pages)
mty:SSU                                (type of rRNA)
del:500 AUG 800 AAAA                   (deletions made to keep alignment size down: <sequence position> <deleted> ...)
seq:organism name                      (organism name)
--------------------------------------------------------------------------------
--------------------------UAU[CUGGU]U-----GA[UCCU^GCCAG^UAGU{-C}AUA-UGCU]--[UGUC
]UCAAAG--AU-UAA[GCC{A-}UGC]A-UGUCUA-[A-GU{A-UAA-}GC]A---------------------------
-----------------------AUUUAU-AC------------------------------------------------
------------------------------------------A[G-U{-G--AA}AC-U]GCGAA--UGG[C-UC]AUUA
---AAU-[CAG{UU}AU{--CG}U-U{UA--UU}UGA]UAG--UA--CC--------------------------UU-AC
-UA[C(U)UG(G)-AU{AACCG-}UGG]UAAU-U[CUA{-GAGCUA}AU(A)-CA(U)G]CUU------AAA-[AUCCC{
G-A}CU]--------------------------------------------GUUU-------------------------
.....*
When a sequence consists of several fragments resulting from processing, or of several exons, the sequence of each part ends with an asterisk, and has its own header. However, the different segments are stored in the same file.
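
As a rough illustration of the layout described above, the following Python sketch splits the contents of one file into its parts. It rests on several assumptions that are not stated in this help page: that every header line has the form tag:value with a three-character tag, that a line consisting only of hyphens separates the header from the sequence, and that the parenthesised explanations in the example are annotations for the reader rather than part of the file. The function names parse_records and bare_sequence are hypothetical and do not belong to any distributed software.

```python
import re

# Assumed header layout: a three-character tag (e.g. acc, ta1, seq), a colon, a value.
HEADER_RE = re.compile(r"^([a-z][a-z0-9]{2}):(.*)$")


def parse_records(text):
    """Return a list of (header_dict, sequence) pairs, one per part in the file."""
    records = []
    header, seq_lines = {}, []
    in_sequence = False
    for line in text.splitlines():
        line = line.rstrip()
        if not in_sequence:
            match = HEADER_RE.match(line)
            if match:
                header[match.group(1)] = match.group(2).strip()
            elif line and set(line) == {"-"}:
                # Assumed separator: a line made up only of hyphens.
                in_sequence = True
            continue
        seq_lines.append(line)
        if line.endswith("*"):
            # The asterisk closes this part; drop it from the stored sequence.
            records.append((header, "".join(seq_lines)[:-1]))
            header, seq_lines, in_sequence = {}, [], False
    return records


def bare_sequence(aligned):
    """Drop gap and structure symbols, keeping only the nucleotide letters."""
    return re.sub(r"[^A-Za-z]", "", aligned)
```

Applied to a file with two parts, parse_records would return two pairs, each with its own header dictionary. bare_sequence is only a convenience for recovering the unaligned nucleotides; it discards the secondary structure information.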

Sometimes parts of a sequence have no homologue in any of the other sequences in the database. Each such part would require a global insert in the alignment (gaps in all other sequences). To keep the size of the alignment down, these parts have sometimes been deleted from the alignment. In that case, the "del" field is present and lists the position of each deletion in the sequence together with the sequence that was deleted. The "del" field is present only in the header of the first part.
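
The "del" field from the example header (500 AUG 800 AAAA) can be read as alternating positions and deleted sequences. The Python sketch below splits such a value into pairs. It assumes that the items are separated by whitespace and that every position is followed by exactly one deleted sequence. How positions are counted (for instance, whether they start at 1, or whether gaps are included) is not stated here, so the sketch does not try to reinsert the deleted parts. The name parse_del_field is hypothetical.

```python
def parse_del_field(value):
    """Split a "del" value into (position, deleted_sequence) pairs."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating positions and sequences")
    # Even-indexed tokens are positions, odd-indexed tokens are deleted sequences.
    return [(int(pos), seq) for pos, seq in zip(tokens[0::2], tokens[1::2])]
```

For the example header, this gives the pairs (500, "AUG") and (800, "AAAA").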

File names

The names of the files in the database are produced from the species names by taking three characters of the genus name and three characters of the species name. When more than one sequence is known for the same species, a number is added to the name. If the resulting file name is not unique, a unique file name close to the original one is constructed. The file names have an extension based on a code (see below) describing the phylogenetic group in which the species is classified in our database. This makes it possible for FTP users either to retrieve specific sequences using the full file name, or to retrieve a set of sequences belonging to a certain phylogenetic group using wildcards.
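
The naming scheme and the wildcard selection can be sketched in Python as follows. This is an illustration only. It assumes that the three characters are the first three of each name and that names are lower case, neither of which is stated above, and it does not reproduce the step that makes a clashing name unique. The function names, the species name, and the extensions used in the usage note are placeholders, not actual names or codes from the database.

```python
from fnmatch import fnmatchcase


def candidate_base_name(genus, species, number=None):
    """Build a base file name from genus and species (assumed: first three letters)."""
    base = (genus[:3] + species[:3]).lower()
    # A number is appended when several sequences exist for the same species.
    return base if number is None else f"{base}{number}"


def select_group(file_names, extension):
    """Keep the file names whose extension matches the given group code."""
    return [name for name in file_names if fnmatchcase(name, f"*.{extension}")]
```

For instance, candidate_base_name("Candida", "albicans", 2) yields "canalb2" under these assumptions, and select_group(["canalb.aaa", "saccer.bbb"], "aaa") keeps only "canalb.aaa", which mirrors what a wildcard such as *.aaa would select in an FTP client.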

Helix numbering

The helix numbering lines indicate the positions of helices in the alignment. In these lines, the positions of the 5' and 3' strands of helix i are indicated as i and i'. The helix numbering system corresponds to that shown in the figures of the papers by Van de Peer et al. (1994) for the SSU rRNA, and by De Rijk et al. (1994) for the LSU rRNA. Researchers wishing to use the secondary structure information can include these helix numbering lines in their alignment. These lines are stored in a format different from that of the rRNA sequences (as one line in the DCSE alignment format).

Since certain helices may be absent in organisms belonging to one or more domains, separate helix numbering files may be used for each domain.

The following Helix numbering lines are available:

The appropriate helix numbering lines are inserted automatically when sequences are transferred in the DCSE or printable alignment format.