SeqAn3  3.0.1
The Modern C++ library for sequence analysis.
IO

The IO module provides stream handling and formatted I/O.

Modules

 Alignment File
 Provides files and formats for handling alignment data.
 
 Sequence File
 Provides files and formats for handling sequence data.
 
 Stream
 The stream sub-module contains data structures and functions for streaming and tokenization.
 
 Structure File
 Provides files and formats for handling structure data.
 

Classes

struct  seqan3::fields< fs >
 A class template that holds a choice of seqan3::field.
 
struct  seqan3::file_open_error
 Thrown if there is an unspecified filesystem or stream error while opening, e.g. a permission problem.
 
struct  seqan3::format_error
 Thrown if information given to an output format didn't match expectations.
 
struct  seqan3::io_error
 Thrown if there is an I/O error in low-level I/O operations, such as in std::basic_streambuf operations.
 
struct  seqan3::parse_error
 Thrown if there is a parse error, such as reading an unexpected character from an input stream.
 
struct  seqan3::record< field_types, field_ids >
 The class template that file records are based on; behaves like a std::tuple.
 
struct  seqan3::unexpected_end_of_input
 Thrown if I/O was expecting more input (e.g. a delimiter or a new line), but the end of input was reached.
 
struct  seqan3::unhandled_extension_error
 Thrown if there is no format that accepts a given file extension.
 

Enumerations

enum  seqan3::field {
  seqan3::field::seq, seqan3::field::id, seqan3::field::qual, seqan3::field::seq_qual,
  seqan3::field::offset, seqan3::field::bpp, seqan3::field::structure, seqan3::field::structured_seq,
  seqan3::field::energy, seqan3::field::react, seqan3::field::react_err, seqan3::field::comment,
  seqan3::field::alignment, seqan3::field::ref_id, seqan3::field::ref_seq, seqan3::field::ref_offset,
  seqan3::field::header_ptr, seqan3::field::flag, seqan3::field::mate, seqan3::field::mapq,
  seqan3::field::cigar, seqan3::field::tags, seqan3::field::bit_score, seqan3::field::evalue,
  seqan3::field::user_defined_0, seqan3::field::user_defined_1, seqan3::field::user_defined_2, seqan3::field::user_defined_3,
  seqan3::field::user_defined_4, seqan3::field::user_defined_5, seqan3::field::user_defined_6, seqan3::field::user_defined_7,
  seqan3::field::user_defined_8, seqan3::field::user_defined_9, seqan3::field::SEQ = seq, seqan3::field::ID = id,
  seqan3::field::QUAL = qual, seqan3::field::SEQ_QUAL = seq_qual, seqan3::field::OFFSET = offset, seqan3::field::BPP = bpp,
  seqan3::field::STRUCTURE = structure, seqan3::field::STRUCTURED_SEQ = structured_seq, seqan3::field::ENERGY = energy, seqan3::field::REACT = react,
  seqan3::field::REACT_ERR = react_err, seqan3::field::COMMENT = comment, seqan3::field::ALIGNMENT = alignment, seqan3::field::REF_ID = ref_id,
  seqan3::field::REF_SEQ = ref_seq, seqan3::field::REF_OFFSET = ref_offset, seqan3::field::HEADER_PTR = header_ptr, seqan3::field::FLAG = flag,
  seqan3::field::MATE = mate, seqan3::field::MAPQ = mapq, seqan3::field::CIGAR = cigar, seqan3::field::TAGS = tags,
  seqan3::field::BIT_SCORE = bit_score, seqan3::field::EVALUE = evalue, seqan3::field::USER_DEFINED_0 = user_defined_0, seqan3::field::USER_DEFINED_1 = user_defined_1,
  seqan3::field::USER_DEFINED_2 = user_defined_2, seqan3::field::USER_DEFINED_3 = user_defined_3, seqan3::field::USER_DEFINED_4 = user_defined_4, seqan3::field::USER_DEFINED_5 = user_defined_5,
  seqan3::field::USER_DEFINED_6 = user_defined_6, seqan3::field::USER_DEFINED_7 = user_defined_7, seqan3::field::USER_DEFINED_8 = user_defined_8, seqan3::field::USER_DEFINED_9 = user_defined_9
}
 An enumerator for the fields used in file formats.
 

Variables

template<typename t >
SEQAN3_CONCEPT fields_specialisation = is_value_specialisation_of_v<t, fields>
 Auxiliary concept that checks whether a type is a specialisation of seqan3::fields.
 

Detailed Description

The IO module provides stream handling and formatted I/O.

Streams and (de-)compression

SeqAn works with regular iostreams as provided by the standard library, but it also handles compressed streams:

Format  Extension     Dependency  Description
GZip    .gz¹          zlib        GNU-Zip, most common format on UNIX
BGZF    .gz, .bgzf²   zlib        Blocked GZip, compatible extension to GZip, features parallelisation
BZip2   .bz2          libbz2      Stronger compression than GZip, slower to compress

¹ SeqAn always assumes GZip and does not handle pure .Z.
² Some file formats like .bam or .bcf are implicitly BGZF-compressed without showing this in the extension.

Support for these compression formats is optional and depends on whether the respective dependency is available when you build your program (if you use CMake, this should happen automatically).

SeqAn file types apply compression/decompression streams transparently, i.e. if the file extension or the "magic header" of a file suggests a compressed format, the respective stream is automatically (de-)compressed.

The (de-)compression stream wrappers are currently only used internally and are not part of the API.
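
To illustrate what detection by "magic header" means, here is a minimal, standard-library-only sketch. It is not SeqAn's internal implementation (which is not part of the API); the names `compression` and `detect_compression` are hypothetical. It relies only on the published byte signatures: gzip data starts with the bytes 0x1f 0x8b, bzip2 data starts with "BZh", and a BGZF block is a gzip member whose extra field begins with the subfield identifier "BC".

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical enum for this sketch; not a SeqAn type.
enum class compression { none, gzip, bgzf, bzip2 };

// Guess the compression format from the first bytes of a stream.
compression detect_compression(std::string_view header)
{
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(header[i]); };

    // bzip2 streams start with the ASCII characters "BZh".
    if (header.size() >= 3 && header.substr(0, 3) == "BZh")
        return compression::bzip2;

    // gzip members start with the two bytes 0x1f 0x8b.
    if (header.size() >= 2 && byte(0) == 0x1f && byte(1) == 0x8b)
    {
        // gzip layout: 0-1 magic, 2 CM, 3 FLG, 4-7 MTIME, 8 XFL, 9 OS,
        // 10-11 XLEN, 12.. extra field (only present if FLG bit 2 is set).
        bool const has_extra = header.size() >= 14 && (byte(3) & 0x04);
        if (has_extra && byte(12) == 'B' && byte(13) == 'C')
            return compression::bgzf;

        return compression::gzip;
    }

    return compression::none;
}
```

Because BGZF is a compatible extension of GZip, the sketch checks the generic gzip signature first and only then looks for the BGZF-specific subfield.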

Formatted I/O

Files and formats

SeqAn has the notion of files and formats. A file is one abstraction level higher than a format: it describes a common use-case and typically supports multiple formats. The developer needs to know which kind of file they want to read or write; this choice is made at compile-time. The format, on the other hand, is automatically detected based on the file provided by the user to the program.

For example, seqan3::sequence_file_input handles reading sequence files. It can be created directly from an input stream, or from a file name. After opening the file it will detect whether the format is seqan3::format_fasta or seqan3::format_fastq (or another supported format) automatically – normally by comparing the extension.
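
The extension-based detection described above can be sketched with the standard library alone. This is only an illustration of the idea, not SeqAn's code: `format_info` and `detect_format` are hypothetical names, and the extension lists passed in by the caller are assumptions rather than the lists SeqAn actually uses.

```cpp
#include <algorithm>
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a format: a name plus the extensions it accepts.
struct format_info
{
    std::string name;
    std::vector<std::string> extensions;
};

// Return the name of the first format that lists the file's extension.
std::optional<std::string> detect_format(std::filesystem::path const & file,
                                         std::vector<format_info> const & formats)
{
    std::string ext = file.extension().string(); // e.g. ".fasta"
    if (!ext.empty())
        ext.erase(0, 1); // drop the leading dot

    for (auto const & fmt : formats)
        if (std::find(fmt.extensions.begin(), fmt.extensions.end(), ext) != fmt.extensions.end())
            return fmt.name;

    // No format accepts this extension; SeqAn reports this case by
    // throwing seqan3::unhandled_extension_error instead.
    return std::nullopt;
}
```

The set of candidate formats is fixed by the file type chosen at compile-time, while the lookup itself happens at run-time on the name supplied by the user.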

File                            Formats
seqan3::alignment_file_input    seqan3::format_sam, seqan3::format_bam
seqan3::alignment_file_output   seqan3::format_sam, seqan3::format_bam
seqan3::sequence_file_input     seqan3::format_embl, seqan3::format_fasta, seqan3::format_fastq, seqan3::format_genbank, seqan3::format_sam
seqan3::sequence_file_output    seqan3::format_embl, seqan3::format_fasta, seqan3::format_fastq, seqan3::format_genbank, seqan3::format_sam
seqan3::structure_file_input    seqan3::format_vienna
seqan3::structure_file_output   seqan3::format_vienna

Some formats are available in multiple files, e.g. seqan3::format_sam can be read by seqan3::sequence_file_input and by seqan3::alignment_file_input. This represents different use-cases of the same file format.

Typically formats are supported for reading and writing, but this does not always have to be the case. See the above links for more information.

Records and fields

The main file interface that SeqAn offers is record-based, i.e. every file is conceptually a range of records, and each record in turn behaves as a tuple of fields.

The record type of all files is based on seqan3::record, but the composition of fields is different per file.

In particular this means:

Please have a look at the tutorial for Sequence File Input and Output and the API docs for seqan3::sequence_file_input to learn about this design in practice.
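
As a rough illustration of "a record behaves as a tuple of fields", the following standard-library-only sketch models a record as a std::tuple whose elements can also be addressed by a field identifier. It is a simplified stand-in, not the actual seqan3::record: the reduced `field` enum, the template parameter order and the accessor `get_field` are assumptions made for this example.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>
#include <tuple>

// Reduced, hypothetical field enum (the real seqan3::field has many more values).
enum class field { seq, id, qual };

// Holds a compile-time choice of fields, in the spirit of seqan3::fields.
template <field... fs>
struct fields
{
    // Position of f in the list; equals sizeof...(fs) if f is not contained.
    static constexpr std::size_t index_of(field f)
    {
        std::size_t i = 0;
        for (field candidate : {fs...})
        {
            if (candidate == f)
                return i;
            ++i;
        }
        return i;
    }
};

// A record is a tuple that additionally knows which field each element represents.
template <typename field_ids, typename... field_types>
struct record : std::tuple<field_types...>
{
    using base_t = std::tuple<field_types...>;
    using base_t::base_t; // inherit the tuple constructors
};

// Access an element by field ID instead of by position (hypothetical helper).
template <field f, typename field_ids, typename... field_types>
auto & get_field(record<field_ids, field_types...> & r)
{
    constexpr std::size_t i = field_ids::index_of(f);
    static_assert(i < sizeof...(field_types), "field not contained in this record");
    return std::get<i>(static_cast<std::tuple<field_types...> &>(r));
}
```

With a record type such as `record<fields<field::id, field::seq>, std::string, std::string>`, both `std::get<1>(rec)` and `get_field<field::seq>(rec)` refer to the same element, which is what makes a different composition of fields per file possible on top of one record template.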

Serialisation

Besides formatted I/O, which is realised via files and formats, SeqAn also supports object-level serialisation. This enables you to store data structures like indexes or sequences directly to disk.

We use the cereal library to accomplish this. For more information see cereal's documentation or our tutorial on Indexing and searching with SeqAn which contains an example.

Enumeration Type Documentation

field

enum seqan3::field [strong]

An enumerator for the fields used in file formats.

Some of the fields are shared between formats.

The following table shows the usage of fields in the respective files (Note that each valid format for a file must handle all of its fields):

Field Sequence IO Alignment IO Structure IO
seq
id
qual
seq_qual
offset
bpp
structure
structured_seq
energy
react
react_err
comment
alignment
ref_id
ref_seq
ref_offset
header_ptr
flag
mate
mapq
cigar
tags
bit_score
evalue
Enumerator
seq 

The "sequence", usually a range of nucleotides or amino acids.

id 

The identifier, usually a string.

qual 

The qualities, usually in phred-score notation.

seq_qual 

Sequence and qualities combined in one range.

offset 

Sequence (SEQ) relative start position (0-based), unsigned value.

bpp 

Base pair probability matrix of interactions, usually a matrix of float numbers.

structure 

Fixed interactions, usually a string of structure alphabet characters.

structured_seq 

Sequence and fixed interactions combined in one range.

energy 

Energy of a folded sequence, represented by one float number.

react 

Reactivity values of the sequence characters given in a vector of float numbers.

react_err 

Reactivity error values given in a vector corresponding to REACT.

comment 

Comment field of arbitrary content, usually a string.

alignment 

The (pairwise) alignment stored in a seqan3::alignment object.

ref_id 

The identifier of the (reference) sequence that SEQ was aligned to.

ref_seq 

The (reference) "sequence" information, usually a range of nucleotides or amino acids.

ref_offset 

Sequence (REF_SEQ) relative start position (0-based), unsigned value.

header_ptr 

A pointer to the seqan3::alignment_file_header object storing header information.

flag 

The alignment flag (bit information), uint16_t value.

mate 

The mate pair information given as a std::tuple of reference name, offset and template length.

mapq 

The mapping quality of the SEQ alignment, usually a phred-scaled score.

cigar 

The cigar vector (std::vector<seqan3::cigar>) representing the alignment in SAM/BAM format.

tags 

The optional tags in the SAM format, stored in a dictionary.

bit_score 

The bit score (statistical significance indicator), unsigned value.

evalue 

The e-value (length-normalized bit score), double value.

user_defined_0 

Identifier for user defined file formats and specialisations.

user_defined_1 

Identifier for user defined file formats and specialisations.

user_defined_2 

Identifier for user defined file formats and specialisations.

user_defined_3 

Identifier for user defined file formats and specialisations.

user_defined_4 

Identifier for user defined file formats and specialisations.

user_defined_5 

Identifier for user defined file formats and specialisations.

user_defined_6 

Identifier for user defined file formats and specialisations.

user_defined_7 

Identifier for user defined file formats and specialisations.

user_defined_8 

Identifier for user defined file formats and specialisations.

user_defined_9 

Identifier for user defined file formats and specialisations.

SEQ 

Please use the field name in lower case.

ID 

Please use the field name in lower case.

QUAL 

Please use the field name in lower case.

SEQ_QUAL 

Please use the field name in lower case.

OFFSET 

Please use the field name in lower case.

BPP 

Please use the field name in lower case.

STRUCTURE 

Please use the field name in lower case.

STRUCTURED_SEQ 

Please use the field name in lower case.

ENERGY 

Please use the field name in lower case.

REACT 

Please use the field name in lower case.

REACT_ERR 

Please use the field name in lower case.

COMMENT 

Please use the field name in lower case.

ALIGNMENT 

Please use the field name in lower case.

REF_ID 

Please use the field name in lower case.

REF_SEQ 

Please use the field name in lower case.

REF_OFFSET 

Please use the field name in lower case.

HEADER_PTR 

Please use the field name in lower case.

FLAG 

Please use the field name in lower case.

MATE 

Please use the field name in lower case.

MAPQ 

Please use the field name in lower case.

CIGAR 

Please use the field name in lower case.

TAGS 

Please use the field name in lower case.

BIT_SCORE 

Please use the field name in lower case.

EVALUE 

Please use the field name in lower case.

USER_DEFINED_0 

Please use the field name in lower case.

USER_DEFINED_1 

Please use the field name in lower case.

USER_DEFINED_2 

Please use the field name in lower case.

USER_DEFINED_3 

Please use the field name in lower case.

USER_DEFINED_4 

Please use the field name in lower case.

USER_DEFINED_5 

Please use the field name in lower case.

USER_DEFINED_6 

Please use the field name in lower case.

USER_DEFINED_7 

Please use the field name in lower case.

USER_DEFINED_8 

Please use the field name in lower case.

USER_DEFINED_9 

Please use the field name in lower case.
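
The upper-case names are declared as aliases of the lower-case enumerators (e.g. `SEQ = seq` in the enum listing), so both spellings denote the same value. The sketch below reproduces this with a reduced, hypothetical copy of the enum containing only three fields; `field_name` is a helper invented for this example.

```cpp
#include <string_view>

// Reduced, hypothetical copy of the enum: three fields plus their upper-case aliases.
enum class field
{
    seq,
    id,
    qual,
    SEQ = seq,
    ID = id,
    QUAL = qual
};

// An alias has the same value as the enumerator it refers to.
static_assert(field::SEQ == field::seq);

// Only the lower-case names are listed as cases; an alias could not be added
// as a further case label because it would duplicate an existing value.
constexpr std::string_view field_name(field f)
{
    switch (f)
    {
        case field::seq:  return "seq";
        case field::id:   return "id";
        case field::qual: return "qual";
    }
    return "unknown";
}
```

Code written against the lower-case names therefore also handles values spelled with the upper-case names, which is why switching to the lower-case spelling does not change behaviour.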