SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
The IO module provides stream handling and formatted I/O.
Modules | |
SAM File | |
Provides files and formats for handling read mapping data. | |
Sequence File | |
Provides files and formats for handling sequence data. | |
Stream | |
The stream sub-module contains data structures and functions for streaming and tokenization. | |
Structure File | |
Provides files and formats for handling structure data. | |
Views | |
IO related views. | |
Classes | |
struct | seqan3::fields< fs > |
A class template that holds a choice of seqan3::field. | |
struct | seqan3::file_open_error |
Thrown if there is an unspecified filesystem or stream error while opening, e.g. a permission problem. | |
struct | seqan3::format_error |
Thrown if information given to an output format did not match expectations. | |
struct | seqan3::io_error |
Thrown if there is an I/O error in low-level I/O operations, such as std::basic_streambuf operations. | |
struct | seqan3::parse_error |
Thrown if there is a parse error, such as reading an unexpected character from an input stream. | |
struct | seqan3::record< field_types, field_ids > |
The class template that file records are based on; behaves like a std::tuple. | |
struct | seqan3::unexpected_end_of_input |
Thrown if I/O was expecting more input (e.g. a delimiter or a new line), but the end of input was reached. | |
struct | seqan3::unhandled_extension_error |
Thrown if there is no format that accepts a given file extension. | |
SeqAn has the notion of files and formats. A file is one abstraction level higher than a format: it describes a common use case and typically supports multiple formats. The developer needs to know which kind of file they want to read or write; this choice is made at compile time. The format, on the other hand, is detected automatically based on the file that the user provides to the program.
For example, seqan3::sequence_file_input handles reading sequence files. It can be created directly from an input stream, or from a file name. After opening the file it will detect whether the format is seqan3::format_fasta or seqan3::format_fastq (or another supported format) automatically – normally by comparing the extension.
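The idea of choosing a format from the file extension can be illustrated with a small standalone sketch. It uses only the C++17 standard library and is not SeqAn's implementation: the names `sketch_format` and `detect_format` are hypothetical, and the extension lists are illustrative assumptions rather than the lists that seqan3::format_fasta or seqan3::format_fastq actually accept.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical stand-ins for format tags; these are not SeqAn types.
enum class sketch_format { fasta, fastq };

// Maps a file name to a format by comparing its extension.
// The accepted extensions below are assumptions made for this sketch.
inline std::optional<sketch_format> detect_format(std::filesystem::path const & file)
{
    std::string const ext = file.extension().string(); // includes the leading dot

    if (ext == ".fasta" || ext == ".fa")
        return sketch_format::fasta;
    if (ext == ".fastq" || ext == ".fq")
        return sketch_format::fastq;

    return std::nullopt; // no known format accepts this extension
}

// Usage: known extensions select a format, unknown ones select none.
inline void detect_format_example()
{
    assert(detect_format("reads.fastq") == sketch_format::fastq);
    assert(detect_format("genome.fa") == sketch_format::fasta);
    assert(!detect_format("notes.txt").has_value());
}
```

Where the sketch returns an empty optional, SeqAn documents a dedicated exception, seqan3::unhandled_extension_error, for the case that no format accepts a given file extension.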
Some formats are available in multiple files, e.g. seqan3::format_sam can be read by seqan3::sequence_file_input and by seqan3::sam_file_input. This represents different use-cases of the same file format.
Typically formats are supported for reading and writing, but this does not always have to be the case. See the above links for more information.
The main file interface that SeqAn offers is record-based, i.e. every file is conceptually a range of records, and each record in turn behaves as a tuple of fields.
The record type of all files is based on seqan3::record, but the composition of fields is different per file.
In particular this means:
Please have a look at the tutorial for Sequence File Input and Output and the API docs for seqan3::sequence_file_input to learn about this design in practice.
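The "range of records, each a tuple of fields" model can be sketched without SeqAn at all. In the following standalone C++17 example, `sketch_record`, `sketch_file` and `ids_of_long_records` are hypothetical names: a `std::vector` stands in for a file and a `std::tuple` for a record. It only illustrates the access pattern and says nothing about how seqan3::record is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record type: a tuple of (id, sequence). Not a SeqAn type.
using sketch_record = std::tuple<std::string, std::string>;

// A "file" is modelled as a range of records; a std::vector stands in for it.
using sketch_file = std::vector<sketch_record>;

// Collects the ids of all records whose sequence has at least min_length characters.
inline std::vector<std::string> ids_of_long_records(sketch_file const & file, std::size_t const min_length)
{
    std::vector<std::string> ids;

    for (auto const & [id, seq] : file) // each record decomposes like a tuple
        if (seq.size() >= min_length)
            ids.push_back(id);

    return ids;
}

// Usage: iterate over the "file" record by record.
inline void ids_of_long_records_example()
{
    sketch_file const file{sketch_record{"read1", "ACGT"}, sketch_record{"read2", "AC"}};
    auto const ids = ids_of_long_records(file, 3);
    assert(ids.size() == 1 && ids.front() == "read1");
}
```

Because the file is a range, it works with range-based for loops, and because the record is tuple-like, structured bindings give each field a name at the point of use.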
SeqAn works with regular iostreams as provided by the standard library, but it also handles compressed streams:
Format | Extension | Dependency | Description |
---|---|---|---|
GZip | .gz ¹ | zlib | GNU-Zip, most common format on UNIX |
BGZF | .gz , .bgzf ² | zlib | Blocked GZip, compatible extension to GZip, features parallelisation |
BZip2 | .bz2 | libbz2 | Stronger compression than GZip, slower to compress |
¹ SeqAn always assumes GZip and does not handle pure .Z.
² Some file formats like .bam or .bcf are implicitly BGZF-compressed without showing this in the extension.
Support for these compression formats is optional and depends on whether the respective dependency is available when you build your program (if you use CMake, this should happen automatically).
SeqAn file types apply compression/decompression streams transparently, i.e. if the given file extension or "magic header" of a file suggests this, the respective stream is automatically (de-)compressed.
The (de-)compression stream wrappers are currently only used internally and are not part of the API.
The number of threads used for (de-)compression of BGZF streams can be adjusted by setting seqan3::contrib::bgzf_thread_count.
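To make the "magic header" idea concrete, here is a minimal standalone sketch that inspects the first bytes of a stream. It is not SeqAn's internal detection code; `sketch_compression` and `detect_compression` are hypothetical names. The byte values come from the public format definitions: gzip members start with 0x1f 0x8b, bzip2 streams start with "BZh", and BGZF blocks are gzip members with the FEXTRA flag set and a "BC" extra subfield. The sketch assumes that the "BC" subfield is the first extra subfield.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch; not a SeqAn type.
enum class sketch_compression { none, gzip, bgzf, bzip2 };

// Guesses the compression format from the first bytes ("magic header") of a stream.
inline sketch_compression detect_compression(std::string_view const head)
{
    auto byte = [&](std::size_t const i) { return static_cast<unsigned char>(head[i]); };

    // bzip2 streams start with the ASCII characters "BZh".
    if (head.size() >= 3 && head.substr(0, 3) == "BZh")
        return sketch_compression::bzip2;

    // gzip members start with the two bytes 0x1f 0x8b.
    if (head.size() >= 2 && byte(0) == 0x1f && byte(1) == 0x8b)
    {
        // BGZF: FEXTRA flag (bit 2 of byte 3) set and subfield id "BC" at bytes 12 and 13.
        // Assumption of this sketch: "BC" is the first extra subfield.
        bool const has_extra = head.size() >= 14 && (byte(3) & 0x04) != 0;
        if (has_extra && head[12] == 'B' && head[13] == 'C')
            return sketch_compression::bgzf;

        return sketch_compression::gzip;
    }

    return sketch_compression::none; // treat as uncompressed
}

// Usage: plain text and a bzip2 header are told apart by their first bytes.
inline void detect_compression_example()
{
    assert(detect_compression("BZh91AY") == sketch_compression::bzip2);
    assert(detect_compression(">seq1\nACGT") == sketch_compression::none);
}
```

Checking the header in addition to the extension is what allows implicitly compressed formats, which do not show their compression in the extension, to be recognised.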
Besides formatted I/O, which is realised via files and formats, SeqAn also supports object-level serialisation. This enables you to store data structures like indexes or sequences directly on disk.
We use the cereal library to accomplish this. For more information see cereal's documentation or our tutorial on Indexing and searching with SeqAn which contains an example.
enum class seqan3::field
An enumerator for the fields used in file formats.
Some of the fields are shared between formats.
The following table shows the usage of fields in the respective files (Note that each valid format for a file must handle all of its fields):
Field | Sequence IO | Alignment IO | Structure IO |
---|---|---|---|
seq | ✅ | ✅ | ✅ |
id | ✅ | ✅ | ✅ |
qual | ✅ | ✅ | ✅ |
seq_qual | ✅ | | |
offset | | ✅ | ✅ |
bpp | | | ✅ |
structure | | | ✅ |
structured_seq | | | ✅ |
energy | | | ✅ |
react | | | ✅ |
react_err | | | ✅ |
comment | | | ✅ |
alignment | | ✅ | |
ref_id | | ✅ | |
ref_seq | | ✅ | |
ref_offset | | ✅ | |
header_ptr | | ✅ | |
flag | | ✅ | |
mate | | ✅ | |
mapq | | ✅ | |
cigar | | ✅ | |
tags | | ✅ | |
bit_score | | ✅ | |
evalue | | ✅ | |
Enumerator | Description |
---|---|
seq | The "sequence", usually a range of nucleotides or amino acids. |
id | The identifier, usually a string. |
qual | The qualities, usually in Phred score notation. |
offset | Sequence (seqan3::field::seq) relative start position (0-based), unsigned value. |
bpp | Base pair probability matrix of interactions, usually a matrix of float numbers. |
structure | Fixed interactions, usually a string of structure alphabet characters. |
structured_seq | Sequence and fixed interactions combined in one range. |
energy | Energy of a folded sequence, represented by one float number. |
react | Reactivity values of the sequence characters given in a vector of float numbers. |
react_err | Reactivity error values given in a vector corresponding to seqan3::field::react. |
comment | Comment field of arbitrary content, usually a string. |
alignment | The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment. |
ref_id | The identifier of the (reference) sequence that seqan3::field::seq was aligned to. |
ref_seq | The (reference) "sequence" information, usually a range of nucleotides or amino acids. |
ref_offset | Sequence (seqan3::field::ref_seq) relative start position (0-based), unsigned value. |
header_ptr | A pointer to the seqan3::sam_file_header object storing header information. |
flag | The alignment flag (bit information). |
mate | The mate pair information given as a std::tuple of reference name, offset and template length. |
mapq | The mapping quality of the seqan3::field::seq alignment, usually a Phred-scaled score. |
cigar | The cigar vector (std::vector<seqan3::cigar>) representing the alignment in SAM/BAM format. |
tags | The optional tags in the SAM format, stored in a dictionary. |
bit_score | The bit score (statistical significance indicator), unsigned value. |
evalue | The e-value (length-normalized bit score). |
user_defined_0 | Identifier for user defined file formats and specialisations. |
user_defined_1 | Identifier for user defined file formats and specialisations. |
user_defined_2 | Identifier for user defined file formats and specialisations. |
user_defined_3 | Identifier for user defined file formats and specialisations. |
user_defined_4 | Identifier for user defined file formats and specialisations. |
user_defined_5 | Identifier for user defined file formats and specialisations. |
user_defined_6 | Identifier for user defined file formats and specialisations. |
user_defined_7 | Identifier for user defined file formats and specialisations. |
user_defined_8 | Identifier for user defined file formats and specialisations. |
user_defined_9 | Identifier for user defined file formats and specialisations. |
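The enumerators above act as compile-time identifiers for the fields of a record. The following standalone C++17 sketch shows one way such identifiers can be combined with a tuple so that a field is looked up by name instead of by position. It is an illustration only: `sketch_field`, `sketch_fields` and `sketch_record` are hypothetical names and do not reproduce the definitions of seqan3::field, seqan3::fields or seqan3::record.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>

// Hypothetical field identifiers, named after a few of the enumerators above.
enum class sketch_field { seq, id, qual };

// Holds a compile-time choice of field identifiers.
template <sketch_field... fs>
struct sketch_fields
{
    static constexpr std::array<sketch_field, sizeof...(fs)> as_array{fs...};

    // Position of f in the selection, or the number of fields if f is absent.
    static constexpr std::size_t index_of(sketch_field const f)
    {
        for (std::size_t i = 0; i < sizeof...(fs); ++i)
            if (as_array[i] == f)
                return i;
        return sizeof...(fs);
    }

    static constexpr bool contains(sketch_field const f)
    {
        return index_of(f) != sizeof...(fs);
    }
};

// A record sketch: a tuple of values plus the identifiers that name its elements.
template <typename tuple_t, typename ids_t>
struct sketch_record
{
    tuple_t values;

    // Returns the element that belongs to field f; fails to compile if f was not selected.
    template <sketch_field f>
    auto & get()
    {
        static_assert(ids_t::contains(f), "field not selected for this record");
        return std::get<ids_t::index_of(f)>(values);
    }
};

// Usage: the order of identifiers defines which tuple element a field maps to.
inline void sketch_record_example()
{
    using ids = sketch_fields<sketch_field::id, sketch_field::seq>;
    using values_t = std::tuple<std::string, std::string>;

    sketch_record<values_t, ids> rec{values_t{"read1", "ACGT"}};
    assert(rec.get<sketch_field::seq>() == "ACGT");
    static_assert(!ids::contains(sketch_field::qual), "qual was not selected");
}
```

Resolving the identifier to a tuple index at compile time means that asking for a field that was not selected is a compile error rather than a run-time failure.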