SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
IO

The IO module provides stream handling and formatted I/O.

Modules

 SAM File
 Provides files and formats for handling read mapping data.
 
 Sequence File
 Provides files and formats for handling sequence data.
 
 Stream
 The stream sub-module contains data structures and functions for streaming and tokenization.
 
 Structure File
 Provides files and formats for handling structure data.
 
 Views
 IO related views.
 

Classes

struct  seqan3::fields< fs >
 A class template that holds a choice of seqan3::field.
 
struct  seqan3::file_open_error
 Thrown if there is an unspecified filesystem or stream error while opening, e.g. a permission problem.
 
struct  seqan3::format_error
 Thrown if information given to the output format did not match expectations.
 
struct  seqan3::io_error
 Thrown if there is an I/O error in low-level I/O operations, such as std::basic_streambuf operations.
 
struct  seqan3::parse_error
 Thrown if there is a parse error, such as reading an unexpected character from an input stream.
 
struct  seqan3::record< field_types, field_ids >
 The class template that file records are based on; behaves like a std::tuple.
 
struct  seqan3::unexpected_end_of_input
 Thrown if I/O was expecting more input (e.g. a delimiter or a new line), but the end of input was reached.
 
struct  seqan3::unhandled_extension_error
 Thrown if there is no format that accepts a given file extension.
 

Enumerations

enum class  seqan3::field {
  seqan3::field::seq , seqan3::field::id , seqan3::field::qual , seqan3::field::offset ,
  seqan3::field::bpp , seqan3::field::structure , seqan3::field::structured_seq , seqan3::field::energy ,
  seqan3::field::react , seqan3::field::react_err , seqan3::field::comment , seqan3::field::alignment ,
  seqan3::field::ref_id , seqan3::field::ref_seq , seqan3::field::ref_offset , seqan3::field::header_ptr ,
  seqan3::field::flag , seqan3::field::mate , seqan3::field::mapq , seqan3::field::cigar ,
  seqan3::field::tags , seqan3::field::bit_score , seqan3::field::evalue , seqan3::field::user_defined_0 ,
  seqan3::field::user_defined_1 , seqan3::field::user_defined_2 , seqan3::field::user_defined_3 , seqan3::field::user_defined_4 ,
  seqan3::field::user_defined_5 , seqan3::field::user_defined_6 , seqan3::field::user_defined_7 , seqan3::field::user_defined_8 ,
  seqan3::field::user_defined_9
}
 An enumerator for the fields used in file formats.
 

Detailed Description

The IO module provides stream handling and formatted I/O.

Formatted I/O

Files and formats

SeqAn has the notion of files and formats. A file is one abstraction level higher than a format: it describes a common use-case and typically supports multiple formats. The developer needs to know which kind of file they want to read or write; this choice is made at compile-time. The format, on the other hand, is detected automatically based on the file that the user provides to the program.

For example, seqan3::sequence_file_input handles reading sequence files. It can be created directly from an input stream, or from a file name. After opening the file it will detect whether the format is seqan3::format_fasta or seqan3::format_fastq (or another supported format) automatically – normally by comparing the extension.
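The extension-based detection described above can be illustrated with a small standalone sketch. This is not SeqAn's implementation: the enum `sequence_format`, the function `detect_format` and the extension lists are hypothetical names and values chosen for illustration, and SeqAn represents formats as types (e.g. seqan3::format_fasta) rather than as enum values.

```cpp
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical format tags for this sketch only.
enum class sequence_format { embl, fasta, fastq, genbank, sam };

// Chooses a format by looking at the file extension only.
// The extension lists are illustrative, not SeqAn's authoritative lists.
inline std::optional<sequence_format> detect_format(std::filesystem::path const & file)
{
    std::string const ext = file.extension().string();

    if (ext == ".fa" || ext == ".fasta")
        return sequence_format::fasta;
    if (ext == ".fq" || ext == ".fastq")
        return sequence_format::fastq;
    if (ext == ".embl")
        return sequence_format::embl;
    if (ext == ".gbk" || ext == ".genbank")
        return sequence_format::genbank;
    if (ext == ".sam")
        return sequence_format::sam;

    // No format accepts this extension. SeqAn signals this situation with
    // seqan3::unhandled_extension_error; the sketch returns an empty optional.
    return std::nullopt;
}
```

Calling `detect_format("reads.fastq")` yields `sequence_format::fastq`, while an unknown extension such as `notes.txt` yields an empty optional.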

File                            Formats
seqan3::sam_file_input          seqan3::format_sam, seqan3::format_bam
seqan3::sam_file_output         seqan3::format_sam, seqan3::format_bam
seqan3::sequence_file_input     seqan3::format_embl, seqan3::format_fasta, seqan3::format_fastq, seqan3::format_genbank, seqan3::format_sam
seqan3::sequence_file_output    seqan3::format_embl, seqan3::format_fasta, seqan3::format_fastq, seqan3::format_genbank, seqan3::format_sam
seqan3::structure_file_input    seqan3::format_vienna
seqan3::structure_file_output   seqan3::format_vienna

Some formats are available in multiple files, e.g. seqan3::format_sam can be read by seqan3::sequence_file_input and by seqan3::sam_file_input. This represents different use-cases of the same file format.

Typically formats are supported for reading and writing, but this does not always have to be the case. See the above links for more information.

Records and fields

The main file interface that SeqAn offers is record-based, i.e. every file is conceptually a range of records, and each record in turn behaves as a tuple of fields.

The record type of all files is based on seqan3::record, but the composition of fields is different per file.

To see what this means in practice, have a look at the tutorial for Sequence File Input and Output and the API docs for seqan3::sequence_file_input.
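The idea of a record that behaves as a tuple of fields can be sketched with the standard library alone. The code below is a simplified stand-in, not the definition of seqan3::record or seqan3::fields: the `field` enum is reduced to three values, and the `values` member, the `get<f>()` accessor, `index_of` and the `sequence_record` alias are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <tuple>

// Reduced stand-in for seqan3::field (three values only).
enum class field { seq, id, qual };

// Compile-time list of field identifiers, loosely modelled on seqan3::fields.
template <field... fs>
struct fields
{
    // Position of `f` in the list, or sizeof...(fs) if `f` is absent.
    static constexpr std::size_t index_of(field f)
    {
        std::array<field, sizeof...(fs)> const ids{fs...};
        for (std::size_t i = 0; i < ids.size(); ++i)
            if (ids[i] == f)
                return i;
        return ids.size();
    }
};

// A record pairs a tuple of values with the identifiers of its fields,
// so that elements can be addressed by field name instead of by index.
template <typename field_types, typename field_ids>
struct record
{
    field_types values;

    template <field f>
    auto & get()
    {
        constexpr std::size_t i = field_ids::index_of(f);
        static_assert(i < std::tuple_size_v<field_types>, "field is not part of this record");
        return std::get<i>(values);
    }
};

// Example composition: an identifier followed by a sequence, both stored as strings.
using sequence_record = record<std::tuple<std::string, std::string>, fields<field::id, field::seq>>;
```

With this sketch, `r.get<field::seq>()` on a `sequence_record` returns a reference to the second tuple element, and requesting `field::qual` fails to compile because that field is not part of the record. Different files can then use the same record template with different field compositions.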

Streams and (de-)compression

SeqAn works with regular iostreams as provided by the standard library, but it also handles compressed streams:

Format   Extension      Dependency   Description
GZip     .gz¹           zlib         GNU-Zip, most common format on UNIX
BGZF     .gz, .bgzf²    zlib         Blocked GZip, compatible extension to GZip, features parallelisation
BZip2    .bz2           libbz2       Stronger compression than GZip, slower to compress

¹ SeqAn always assumes GZip and does not handle pure .Z.
² Some file formats like .bam or .bcf are implicitly BGZF-compressed without showing this in the extension.

Support for these compression formats is optional and depends on whether the respective dependency is available when you build your program (if you use CMake, this should happen automatically).

SeqAn file types apply compression/decompression streams transparently, i.e. if the given file extension or the "magic header" of a file suggests it, the respective stream is automatically (de-)compressed.

The (de-)compression stream wrappers are currently only used internally and are not part of the API.

The number of threads used for (de-)compression of BGZF streams can be adjusted by setting seqan3::contrib::bgzf_thread_count.

Serialisation

Besides formatted I/O, which is realised via files and formats, SeqAn also supports object-level serialisation. This enables you to store data structures like indexes or sequences directly to disk.

We use the cereal library to accomplish this. For more information see cereal's documentation or our tutorial on Indexing and searching with SeqAn which contains an example.

Enumeration Type Documentation

◆ field

enum class seqan3::field

An enumerator for the fields used in file formats.

Some of the fields are shared between formats.

The following table shows the usage of fields in the respective files (Note that each valid format for a file must handle all of its fields):

Field Sequence IO Alignment IO Structure IO
seq
id
qual
seq_qual
offset
bpp
structure
structured_seq
energy
react
react_err
comment
alignment
ref_id
ref_seq
ref_offset
header_ptr
flag
mate
mapq
cigar
tags
bit_score
evalue
Enumerator
seq 

The "sequence", usually a range of nucleotides or amino acids.

id 

The identifier, usually a string.

qual 

The qualities, usually in Phred score notation.

offset 

Sequence (seqan3::field::seq) relative start position (0-based), unsigned value.

bpp 

Base pair probability matrix of interactions, usually a matrix of float numbers.

structure 

Fixed interactions, usually a string of structure alphabet characters.

structured_seq 

Sequence and fixed interactions combined in one range.

energy 

Energy of a folded sequence, represented by one float number.

react 

Reactivity values of the sequence characters given in a vector of float numbers.

react_err 

Reactivity error values given in a vector corresponding to seqan3::field::react.

comment 

Comment field of arbitrary content, usually a string.

alignment 

The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment.

ref_id 

The identifier of the (reference) sequence that seqan3::field::seq was aligned to.

ref_seq 

The (reference) "sequence" information, usually a range of nucleotides or amino acids.

ref_offset 

Sequence (seqan3::field::ref_seq) relative start position (0-based), unsigned value.

header_ptr 

A pointer to the seqan3::sam_file_header object storing header information.

flag 

The alignment flag (bit information), uint16_t value.

mate 

The mate pair information given as a std::tuple of reference name, offset and template length.

mapq 

The mapping quality of the seqan3::field::seq alignment, usually a Phred-scaled score.

cigar 

The cigar vector (std::vector<seqan3::cigar>) representing the alignment in SAM/BAM format.

tags 

The optional tags in the SAM format, stored in a dictionary.

bit_score 

The bit score (statistical significance indicator), unsigned value.

evalue 

The e-value (length normalized bit score), double value.

user_defined_0 

Identifier for user defined file formats and specialisations.

user_defined_1 

Identifier for user defined file formats and specialisations.

user_defined_2 

Identifier for user defined file formats and specialisations.

user_defined_3 

Identifier for user defined file formats and specialisations.

user_defined_4 

Identifier for user defined file formats and specialisations.

user_defined_5 

Identifier for user defined file formats and specialisations.

user_defined_6 

Identifier for user defined file formats and specialisations.

user_defined_7 

Identifier for user defined file formats and specialisations.

user_defined_8 

Identifier for user defined file formats and specialisations.

user_defined_9 

Identifier for user defined file formats and specialisations.
