Files
File Input/Output.
SeqAn supports the input and output of files in different file formats.
The simplest file format is Raw, which loads a file "as is" into a string or writes a string unchanged to a file, e.g.:
fstream fstrm;
fstrm.open("input.txt", ios_base::in | ios_base::binary);
CharString str;
read(fstrm, str, Raw());
fstrm.close();
//saving
FILE * cstrm = fopen("output.txt", "wb"); //binary mode, see the note below
write(cstrm, str, Raw());
fclose(cstrm);
In this example, the tag Raw() can also be omitted, since Raw is the default file format.
Instead of using read and write for raw data, one can also use the operators << and >>.
A file can be an instance of a standard stream class,
a C-style stream (i.e. FILE *), or a SeqAn File object (see below).
Note that the files should always be opened in binary mode.
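The effect of binary mode can be illustrated without SeqAn. The following is a minimal sketch that uses only the C++ standard library; loadRaw and saveRaw are hypothetical helper names, not SeqAn functions. They copy bytes between a file and a string unchanged, which is the behavior the text describes for Raw.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not a SeqAn function): load a whole file "as is".
// Returns false if the file cannot be opened.
bool loadRaw(const std::string& path, std::string& out)
{
    // Binary mode disables newline translation, so bytes arrive unchanged.
    std::ifstream in(path, std::ios_base::in | std::ios_base::binary);
    if (!in)
        return false;
    out.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    return true;
}

// Hypothetical helper: write a string to a file without any translation.
bool saveRaw(const std::string& path, const std::string& data)
{
    std::ofstream out(path, std::ios_base::out | std::ios_base::binary);
    if (!out)
        return false;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    out.flush();
    return static_cast<bool>(out);
}
```

In text mode, some platforms translate line endings on the way in or out, so a string that is saved and loaded again may not compare equal to the original. Binary mode avoids this.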
Apart from Raw, SeqAn offers other file formats especially for bioinformatics, like Fasta, EMBL, or Genbank.
These file formats consist of one or more data records.
To load all records, call read repeatedly, for example:
fstrm.open("ests.fa", ios_base::in | ios_base::binary);
String<Dna> est;
while (read(fstrm, est, Fasta()))
{
//use sequence data in est
}
The function goNext skips the current record and proceeds to the next record.
Each record contains a piece of data (i.e. a sequence or an alignment) and, optionally, some additional metadata.
The metadata of a record can be loaded with readMeta, but only before (not after) the actual data of that record is loaded.
The function fills a string with the unparsed metadata.
Example:
goNext(cstrm, Embl()); //skip first data record
String<Dna> dna_sequence;
read(cstrm, dna_sequence, Embl()); //reads second record
String<char> meta_data;
readMeta(cstrm, meta_data, Embl()); //reads meta data of third record
read(cstrm, dna_sequence, Embl()); //reads third record
fclose(cstrm);
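The record-wise access pattern above (skip a record, read data, read metadata before data) can be sketched in plain C++ for the FASTA format. This is an illustration under stated assumptions, not SeqAn's implementation: readFastaMeta, readFastaData, and skipFastaRecord are hypothetical names, and the sketch assumes that every record starts with a header line beginning with '>'.

```cpp
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this on in-memory data
#include <string>

// Drop a trailing carriage return left over from Windows line endings.
static void stripCarriageReturn(std::string& line)
{
    if (!line.empty() && line.back() == '\r')
        line.pop_back();
}

// Hypothetical helper: read the header of the current record without
// consuming its sequence data. Only works before the data has been read.
bool readFastaMeta(std::istream& in, std::string& meta)
{
    meta.clear();
    if (in.peek() != '>')
        return false;              // not positioned at a record header
    in.get();                      // consume '>'
    std::getline(in, meta);
    stripCarriageReturn(meta);
    return true;
}

// Hypothetical helper: read the sequence data of the current record and
// leave the stream at the start of the next record.
bool readFastaData(std::istream& in, std::string& seq)
{
    seq.clear();
    if (in.peek() == std::char_traits<char>::eof())
        return false;              // no record left
    std::string line;
    if (in.peek() == '>')
        std::getline(in, line);    // header was not requested: skip it
    while (in.peek() != '>' && std::getline(in, line)) {
        stripCarriageReturn(line);
        seq += line;               // join the lines of a wrapped sequence
    }
    return true;
}

// Hypothetical helper: skip the current record entirely.
bool skipFastaRecord(std::istream& in)
{
    std::string ignored;
    return readFastaData(in, ignored);
}
```

Because the header precedes the sequence in the file and the stream only moves forward, the metadata is no longer reachable once the data of a record has been consumed. This is one plausible reason for the "before (not after)" rule.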
write is used to write a record into a file.
Depending on the file format, a suitable metadata string must be passed to write.
Example: the following program fragment:
FILE * cstrm = fopen("genomic_data.fa", "w");
write(cstrm, "acgt", "the metadata", Fasta());
fclose(cstrm);
creates the following file "genomic_data.fa":
>the metadata
ACGT
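In the FASTA format, the metadata string becomes the header line of the record, which starts with '>', and the sequence follows on the next lines. A minimal standard C++ sketch of such a record writer is shown here; writeFastaRecord and its lineWidth parameter are hypothetical and not part of SeqAn.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>   // std::ostringstream, handy for inspecting the output
#include <string>

// Hypothetical helper (not SeqAn's write): emit one FASTA record.
// The sequence is wrapped after lineWidth characters (an assumed default).
void writeFastaRecord(std::ostream& out, const std::string& seq,
                      const std::string& meta, std::size_t lineWidth = 60)
{
    assert(lineWidth > 0);
    out << '>' << meta << '\n';    // header line carries the metadata
    for (std::size_t pos = 0; pos < seq.size(); pos += lineWidth)
        out << seq.substr(pos, lineWidth) << '\n';
}
```

Unlike the SeqAn call above, this sketch writes the sequence characters exactly as they are passed in.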
The easiest way to get read-only access to sequence data stored in a file is a file reader string.
A file reader string implements the container concept, i.e. it implements common functions like length or begin.
It has minimal memory consumption because each part of the sequence data is loaded only when it is needed.
Example:
cout << length(fr); //prints length of the sequence
The constructor of the file reader string can also take a file from which the sequences will be loaded.
For example, the following code will read the second sequence in the file:
goNext(cstrm, Embl());
String<char> meta_data;
readMeta(cstrm, meta_data, Embl()); //reads meta data of second record
String<Dna, FileReader<Embl> > fr(cstrm); //reads sequence data of second record
fclose(cstrm);
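The idea of loading data only when it is needed can be sketched in standard C++. The class below is a hypothetical illustration, not SeqAn's FileReader: it handles only raw files, reports the length without reading any sequence data, and fetches a single byte from disk on each access.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical class: a read-only view of a raw file that fetches bytes
// on demand instead of loading the whole file into memory.
class LazyFileString
{
public:
    explicit LazyFileString(const std::string& path)
        : in_(path, std::ios_base::in | std::ios_base::binary), size_(0)
    {
        if (in_) {
            // Determine the size by seeking to the end; no data is read.
            in_.seekg(0, std::ios_base::end);
            std::streamoff end = in_.tellg();
            size_ = end < 0 ? 0 : static_cast<std::size_t>(end);
        }
    }

    // Number of bytes in the file.
    std::size_t length() const { return size_; }

    // Fetch one byte from disk. Precondition: pos < length().
    char operator[](std::size_t pos)
    {
        assert(pos < size_);
        in_.clear();                                  // reset any EOF state
        in_.seekg(static_cast<std::streamoff>(pos));
        return static_cast<char>(in_.get());
    }

private:
    std::ifstream in_;
    std::size_t size_;
};
```

A practical implementation would buffer blocks of the file rather than seek for every character; the sketch trades speed for brevity.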
SeqAn - Sequence Analysis Library - www.seqan.de