
Help on DCSE refformat

Files in the reference format are used by DCSE to store the plain sequences and reference information about sequences.

Extensive description

A reference file usually has the extension ".ref". Its format is modelled after the format of EMBL or GENBANK files, so the sequence itself can be copied directly from those files. The header is different, however, since it must indicate the exact position of the sequence you are interested in. It also gives some other information about the sequence. In general, a reference file consists of a reference header followed by the data in a number of sequence blocks, each closed by a line containing "//".

#reference file DCSE references info
#data
Header
Sequence 
//
Header
Sequence
//
...
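
The overall layout above can be split mechanically. The following Python sketch is not part of DCSE; the function name split_reference_file is hypothetical, and it assumes that the first line starts with "#reference file" (matched case-insensitively), that the field definitions sit between that line and "#data", and that every block ends with a "//" line.

```python
def split_reference_file(text):
    """Split a reference file into field-definition lines and sequence blocks.

    Assumptions (not stated by DCSE itself): the marker line is matched
    case-insensitively, and blank lines in the reference header are ignored.
    """
    lines = text.splitlines()
    if not lines or not lines[0].lower().startswith("#reference file"):
        raise ValueError("missing '#reference file' marker on the first line")

    # Locate the "#data" line that ends the reference header.
    data_at = None
    for i, line in enumerate(lines):
        if line.strip() == "#data":
            data_at = i
            break
    if data_at is None:
        raise ValueError("missing '#data' line")

    field_defs = [line for line in lines[1:data_at] if line.strip()]

    # Collect the lines of each block; a "//" line closes the current block.
    blocks, current = [], []
    for line in lines[data_at + 1:]:
        if line.strip() == "//":
            blocks.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        blocks.append(current)  # tolerate a final block without "//"
    return field_defs, blocks
```

Each returned block is a list of raw lines (sequence header followed by sequence lines) that can be processed further.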

Header

The reference header tells which fields can be present in the sequence headers. Every line describes one field and has the following format:

?field ident (3 characters)?:?field type:S or N?:?description?:?default value?
e.g.: acc:S:Accession number:NoAccn

The three-letter field ident is also used in the sequence headers to identify the information that follows it. Current field types are 'S' for string and 'N' for number. The 'description' tells what the field is used for. The 'default value' is the value that is given for a sequence when this field is not present in its header. A default value can refer to another field by means of a '%'. For example, when a field has '%org' set as its default, it returns the contents of the org field whenever the field itself is not present.
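
As an illustration of the field definition line and of '%' defaults, here is a minimal Python sketch. It is not DCSE code: the names FieldDef, parse_field_def and field_value are hypothetical, and the check for circular defaults is an added safeguard that the format description does not mention.

```python
from typing import NamedTuple


class FieldDef(NamedTuple):
    ident: str        # three-character field identifier, e.g. "acc"
    ftype: str        # "S" (string) or "N" (number)
    description: str
    default: str


def parse_field_def(line):
    """Split one reference-header line into its four colon-separated parts."""
    # maxsplit=3 keeps any further colons inside the default value.
    ident, ftype, description, default = line.rstrip("\n").split(":", 3)
    if len(ident) != 3:
        raise ValueError(f"field ident must be 3 characters: {ident!r}")
    if ftype not in ("S", "N"):
        raise ValueError(f"unknown field type: {ftype!r}")
    return FieldDef(ident, ftype, description, default)


def field_value(name, header, defs):
    """Return a field's value for one sequence, following '%' defaults.

    header maps field idents to the values found in one sequence header;
    defs maps field idents to their FieldDef.
    """
    seen = set()
    while name not in header:
        if name in seen:
            raise ValueError(f"circular default for field {name!r}")
        seen.add(name)
        default = defs[name].default
        if not default.startswith("%"):
            return default       # plain default value
        name = default[1:]       # '%org' means: use the org field instead
    return header[name]
```

For example, with a hypothetical field defined as "nam:S:Display name:%org", a sequence header that only contains an org field would yield the organism for nam as well.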

Every sequence block consists of a sequence header, which can contain any of the fields described in the reference header, followed by the essential sequence information. The lines in the sequence header are given in the following format:

?field ident (3 characters)?:?value?
e.g.: acc:X52949

When a sequence header contains the same item several times, the data following each occurrence of this item is put in one concatenated line.
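
The sequence header lines can be collected with a few lines of Python. This is only a sketch: the function name parse_sequence_header is hypothetical, and since the separator used when repeated items are concatenated is not specified here, a single space is assumed (the joiner parameter).

```python
def parse_sequence_header(lines, joiner=" "):
    """Collect 'ident:value' lines of one sequence header into a dict.

    Reading stops at the 'seq:' line, which starts the essential sequence
    information. Values of repeated idents are concatenated with `joiner`
    (an assumption; the separator is not defined by the format text).
    """
    fields = {}
    for line in lines:
        ident, sep, value = line.rstrip("\n").partition(":")
        if not sep or len(ident) != 3:
            raise ValueError(f"not a header line: {line!r}")
        if ident == "seq":
            break
        if ident in fields:
            fields[ident] += joiner + value
        else:
            fields[ident] = value
    return fields
```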

Essential sequence information

The start of the essential sequence information is indicated by the field 'seq:' and it has the following format:

seq: ?sequence name (< 40 characters)? ?n1? ?n2? ?n3?/?n4? [?P/D/R?] ?sequence?

The sequence name identifies the sequence and forms the link between the reference files and the alignment files. One sequence can be divided over several sequence blocks having the same sequence name. The order of the different pieces is shown in the last line of the header. This is highly useful when a sequence consists of several exons or of different fragments. ?n1? gives the first position of the sequence in the block. ?n2? gives the last position. When the last position is given first (?n1? is bigger than ?n2?), the complementary sequence of the one in the block is used. ?n3? tells you which part of the total sequence this block contains, and ?n4? says of how many parts the total sequence consists. After the number of parts, the type of sequence can be given by a P (=protein), D (=DNA) or R (=RNA). If this identifier is omitted, the sequence is assumed to be an RNA sequence.
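
The 'seq:' information can be picked apart with a regular expression, as in the following Python sketch. The names SEQ_LINE, SeqInfo and parse_seq_info are hypothetical. The sketch assumes that the items may be separated by spaces or line breaks, that the sequence name may itself contain spaces, and that "last position given first" means the first number is the larger one.

```python
import re
from dataclasses import dataclass

# name, n1, n2, n3/n4 and an optional type letter, separated by whitespace
SEQ_LINE = re.compile(
    r"seq:\s*(?P<name>\S.*?)\s+(?P<n1>\d+)\s+(?P<n2>\d+)"
    r"\s+(?P<n3>\d+)/(?P<n4>\d+)(?:\s+(?P<type>[PDR]))?\s*"
)


@dataclass
class SeqInfo:
    name: str
    first: int     # ?n1?
    last: int      # ?n2?
    part: int      # ?n3?
    parts: int     # ?n4?
    seqtype: str   # "P", "D" or "R"

    @property
    def complemented(self):
        # Assumed reading: positions given in descending order
        # request the complementary sequence.
        return self.first > self.last


def parse_seq_info(text):
    """Parse the text from 'seq:' up to and including the optional type."""
    m = SEQ_LINE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a valid seq: entry: {text!r}")
    name = m["name"]
    if len(name) >= 40:
        raise ValueError("sequence name must be shorter than 40 characters")
    return SeqInfo(
        name,
        int(m["n1"]),
        int(m["n2"]),
        int(m["n3"]),
        int(m["n4"]),
        m["type"] or "R",  # RNA is the default when no type is given
    )
```

The sequence characters themselves are not handled here; only the descriptive part before them is parsed.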

Sequence

The sequence is given on the following lines. Characters can be given in upper or lower case. Usually the format of EMBL is adopted. In this format every line contains 60 characters, in blocks of ten, separated by a space. The first block of characters of every line is preceded by 5 spaces. In GENBANK format, the characters are preceded by a gap of 10 characters, which contains a number. It is not necessary to follow either format strictly, however. DCSE simply starts reading characters from the first sequence line. All non-alphabetic characters are ignored. The first ?n1? characters are skipped, and the following characters are read until the ?n2?th character has been read.
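
The reading rule can be sketched in Python as follows. This is an illustration, not DCSE's own code: the function name extract_region is hypothetical, and the positions are treated as 1-based and inclusive, which is an assumption (read literally, "the first ?n1? characters are skipped" would start one character later). Complementing a sequence whose positions are given in descending order is left out.

```python
def extract_region(lines, first, last):
    """Return the letters between two positions of a sequence.

    Everything that is not an ASCII letter (spaces, digits, line numbers)
    is dropped before counting. Positions are assumed to be 1-based and
    inclusive; if they are given in descending order, the same region is
    returned without complementing it.
    """
    letters = "".join(
        ch for line in lines for ch in line if ch.isascii() and ch.isalpha()
    )
    lo, hi = min(first, last), max(first, last)
    if lo < 1 or hi > len(letters):
        raise ValueError("positions fall outside the sequence")
    return letters[lo - 1:hi]
```

Because only letters are counted, an EMBL-style line and a GENBANK-style line holding the same characters give the same result.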

Schematic representation

In the following schematic representation, spaces are replaced by bullet characters (·). Sequence characters are indicated by x. Options which have to be filled in are surrounded by question marks (?). Continuations with a similar format are indicated by an ellipsis (...).
#Reference file DCSE
?field ident (3 characters)?:?field type:S or N?:?description?:?default value?
...
#data
?field ident (3 characters)?:?field value?
...
seq:
?sequence name 1?
?first position?·?last position?·?part number?/?number of parts?[·?P, D, or R?]
·····xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx
·····xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx·xxxxxxxxxx
     .....
//
...