SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
IO

The IO module provides stream handling and formatted I/O. More...


Modules

 SAM File
 Provides files and formats for handling read mapping data.
 
 Sequence File
 Provides files and formats for handling sequence data.
 
 Stream
 The stream sub-module contains data structures and functions for streaming and tokenization.
 
 Structure File
 Provides files and formats for handling structure data.
 
 Views
 IO related views.
 

Concepts

concept  seqan3::detail::fields_specialisation
 Auxiliary concept that checks whether a type is a specialisation of seqan3::fields.
 

Classes

struct  seqan3::detail::bgzf_compression
 A tag signifying a bgzf compressed file. More...
 
struct  seqan3::detail::bz2_compression
 A tag signifying a bz2 compressed file. More...
 
struct  seqan3::fields< fs >
 A class template that holds a choice of seqan3::field. More...
 
struct  seqan3::file_open_error
 Thrown if there is an unspecified filesystem or stream error while opening, e.g. permission problem. More...
 
struct  seqan3::format_error
 Thrown if information given to output format didn't match expectations. More...
 
struct  seqan3::detail::gz_compression
 A tag signifying a gz compressed file. More...
 
class  seqan3::detail::ignore_output_iterator
 An output iterator that emulates writing to a null-stream in order to dispose of the output. More...
 
class  seqan3::detail::in_file_iterator< file_type >
 Input iterator necessary for providing a range-like interface in input file. More...
 
struct  seqan3::io_error
 Thrown if there is an io error in low level io operations such as in std::basic_streambuf operations. More...
 
struct  seqan3::detail::is_derived_from_record< record_t >
 Helper struct to implement seqan3::detail::record_like. More...
 
class  seqan3::detail::out_file_iterator< file_type >
 Output iterator necessary for providing a range-like interface in output file. More...
 
struct  seqan3::parse_error
 Thrown if there is a parse error, such as reading an unexpected character from an input stream. More...
 
struct  seqan3::record< field_types, field_ids >
 The class template that file records are based on; behaves like a std::tuple. More...
 
interface  record_like
 The concept for a type that models a record. More...
 
class  seqan3::detail::safe_filesystem_entry
 A safeguard to manage a filesystem entry, e.g. a file or a directory. More...
 
struct  seqan3::detail::select_types_with_ids< field_types, field_types_as_ids, selected_field_ids, field_no, return_types >
 Exposes a subset of types as a seqan3::type_list selected based on their IDs. More...
 
struct  seqan3::unexpected_end_of_input
 Thrown if I/O was expecting more input (e.g. a delimiter or a new line), but the end of input was reached. More...
 
struct  seqan3::unhandled_extension_error
 Thrown if there is no format that accepts a given file extension. More...
 
struct  seqan3::detail::variant_from_tags< list_t, output_t >
 Base class to deduce the std::variant type from format tags. More...
 
struct  seqan3::detail::variant_from_tags< type_list< ts... >, output_t >
 Transfers a list of format tags (...ts) onto a std::variant by specialising output_t with each. More...
 
struct  seqan3::detail::zstd_compression
 A tag signifying a zstd compressed file. More...
 

Typedefs

using seqan3::detail::compression_formats = pack_traits::drop_front< void >
 A seqan3::type_list containing the available compression formats.
 

Enumerations

enum class  seqan3::field {
  seqan3::field::seq , seqan3::field::id , seqan3::field::qual , seqan3::field::offset ,
  seqan3::field::bpp , seqan3::field::structure , seqan3::field::structured_seq , seqan3::field::energy ,
  seqan3::field::react , seqan3::field::react_err , seqan3::field::comment , seqan3::field::alignment ,
  seqan3::field::ref_id , seqan3::field::ref_seq , seqan3::field::ref_offset , seqan3::field::header_ptr ,
  seqan3::field::flag , seqan3::field::mate , seqan3::field::mapq , seqan3::field::cigar ,
  seqan3::field::tags , seqan3::field::bit_score , seqan3::field::evalue , seqan3::field::user_defined_0 ,
  seqan3::field::user_defined_1 , seqan3::field::user_defined_2 , seqan3::field::user_defined_3 , seqan3::field::user_defined_4 ,
  seqan3::field::user_defined_5 , seqan3::field::user_defined_6 , seqan3::field::user_defined_7 , seqan3::field::user_defined_8 ,
  seqan3::field::user_defined_9
}
 An enumerator for the fields used in file formats. More...
 

Functions

template<field f, typename field_types , typename field_ids , typename or_type >
decltype(auto) seqan3::detail::get_or (record< field_types, field_ids > &r, or_type &&or_value)
 Access an element in a std::tuple or seqan3::record; return or_value if not contained.
 
template<field f, typename field_types , typename field_ids >
auto & seqan3::detail::get_or_ignore (record< field_types, field_ids > &r)
 Access an element in a std::tuple or seqan3::record; return reference to std::ignore if not contained.
 
template<builtin_character char_t>
auto seqan3::detail::make_secondary_istream (std::basic_istream< char_t > &primary_stream, std::filesystem::path &filename) -> std::unique_ptr< std::basic_istream< char_t >, std::function< void(std::basic_istream< char_t > *)> >
 Depending on the magic bytes of the given stream, return a decompression stream or forward the primary stream.
 
template<builtin_character char_t>
auto seqan3::detail::make_secondary_ostream (std::basic_ostream< char_t > &primary_stream, std::filesystem::path &filename) -> std::unique_ptr< std::basic_ostream< char_t >, std::function< void(std::basic_ostream< char_t > *)> >
 Depending on the given filename/extension, create a compression stream or just forward the primary stream.
 
auto seqan3::detail::range_wrap_ignore (ignore_t const &)
 If the argument is std::ignore, return an infinite range of std::ignore values.
 
template<std::ranges::input_range rng_t>
auto & seqan3::detail::range_wrap_ignore (rng_t &range)
 Pass through the reference to the argument in case the argument satisfies std::ranges::input_range.
 
template<typename format_variant_type >
void seqan3::detail::set_format (format_variant_type &format, std::filesystem::path const &file_name)
 Sets the file format according to the file name extension.
 
template<std::ranges::forward_range ref_t, std::ranges::forward_range query_t>
requires std::equality_comparable_with<std::ranges::range_reference_t<ref_t>, std::ranges::range_reference_t<query_t>>
bool seqan3::detail::starts_with (ref_t &&reference, query_t &&query)
 Check whether the query range is a prefix of the reference range.
 
template<typename formats_t >
std::vector< std::string > seqan3::detail::valid_file_extensions ()
 Returns a list of valid file extensions.
 
void seqan3::detail::fast_ostreambuf_iterator< char_t, traits_t >::write_end_of_line (bool const add_cr)
 Write "\n" or "\r\n" to the stream buffer, depending on arguments.
 
template<std::output_iterator< char > it_t>
constexpr void seqan3::detail::write_eol (it_t &it, bool const add_cr)
 Write "\n" or "\r\n" to the stream iterator, depending on arguments.
 

Variables

template<typename list_t >
constexpr bool seqan3::detail::has_member_file_extensions = false
 Helper function to determine if all types in a format type list have a static member file_extensions.
 
template<typename query_t >
constexpr bool seqan3::detail::has_type_valid_formats = false
 Helper function to determine if a type has a static member valid_formats.
 

Detailed Description

The IO module provides stream handling and formatted I/O.

Formatted I/O

Files and formats

SeqAn has the notion of files and formats. A file is one abstraction level higher than a format: it describes a common use case and typically supports multiple formats. The developer needs to know which kind of file they want to read or write; this choice is made at compile time. The format, on the other hand, is detected automatically based on the file that the user provides to the program.

For example, seqan3::sequence_file_input handles reading sequence files. It can be created directly from an input stream, or from a file name. After opening the file it will detect whether the format is seqan3::format_fasta or seqan3::format_fastq (or another supported format) automatically – normally by comparing the extension.

File                             Formats
seqan3::sam_file_input           seqan3::format_sam, seqan3::format_bam
seqan3::sam_file_output          seqan3::format_sam, seqan3::format_bam
seqan3::sequence_file_input      seqan3::format_embl, seqan3::format_fasta, seqan3::format_fastq, seqan3::format_genbank, seqan3::format_sam
seqan3::sequence_file_output     seqan3::format_embl, seqan3::format_fasta, seqan3::format_fastq, seqan3::format_genbank, seqan3::format_sam
seqan3::structure_file_input     seqan3::format_vienna
seqan3::structure_file_output    seqan3::format_vienna

Some formats are available in multiple files, e.g. seqan3::format_sam can be read by seqan3::sequence_file_input and by seqan3::sam_file_input. This represents different use-cases of the same file format.

Typically formats are supported for reading and writing, but this does not always have to be the case. See the above links for more information.

Records and fields

The main file interface that SeqAn offers is record-based, i.e. every file is conceptually a range of records, and each record in turn behaves as a tuple of fields.

The record type of all files is based on seqan3::record, but the composition of fields is different per file.

In particular this means:

Please have a look at the tutorial for Sequence File Input and Output and the API docs for seqan3::sequence_file_input to learn about this design in practice.
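
The idea of a record as a tuple whose elements are addressed by field identifiers can be sketched with the standard library alone. The following is a minimal illustration, not SeqAn's implementation: the reduced field enumeration, the index_of helper and the get_field function are hypothetical names introduced for this example only.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <tuple>

// Hypothetical stand-ins for seqan3::field, seqan3::fields and seqan3::record.
enum class field { seq, id, qual };

// Holds a compile-time list of field identifiers.
template <field... fs>
struct fields
{
    static constexpr std::array<field, sizeof...(fs)> ids{fs...};

    // Position of f in the list, or sizeof...(fs) if f is absent.
    static constexpr std::size_t index_of(field f)
    {
        for (std::size_t i = 0; i < ids.size(); ++i)
            if (ids[i] == f)
                return i;
        return ids.size();
    }
};

template <typename field_types, typename field_ids>
struct record;

// A record is a std::tuple that additionally remembers which field each element holds.
template <typename... ts, field... fs>
struct record<std::tuple<ts...>, fields<fs...>> : std::tuple<ts...>
{
    using base_t = std::tuple<ts...>;
    using base_t::base_t; // inherit the tuple constructors
};

// Access an element by field identifier instead of by position (hypothetical helper).
template <field f, typename... ts, field... fs>
auto & get_field(record<std::tuple<ts...>, fields<fs...>> & r)
{
    constexpr std::size_t i = fields<fs...>::index_of(f);
    static_assert(i < sizeof...(fs), "field not contained in record");
    return std::get<i>(static_cast<std::tuple<ts...> &>(r));
}
```

With a record type declared over std::tuple<std::string, std::string> and fields<field::seq, field::id>, get_field<field::id>(r) and std::get<1>(r) refer to the same element, which is the sense in which a record "behaves like a tuple".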

Streams and (de-)compression

SeqAn works with regular iostreams as provided by the standard library, but it also handles compressed streams:

Format   Extension      Dependency   Description
GZip     .gz¹           zlib         GNU-Zip, most common format on UNIX
BGZF     .gz, .bgzf²    zlib         Blocked GZip, compatible extension to GZip, features parallelisation
BZip2    .bz2           libbz2       Stronger compression than GZip, slower to compress

¹ SeqAn always assumes GZip and does not handle pure .Z.
² Some file formats like .bam or .bcf are implicitly BGZF-compressed without showing this in the extension.

Support for these compression formats is optional and depends on whether the respective dependency is available when you build your program (if you use CMake, this should happen automatically).

SeqAn file types apply compression/decompression streams transparently, i.e. if the given file extension or the "magic header" of a file suggests this, the respective stream is automatically (de-)compressed.

The (de-)compression stream wrappers are currently only used internally and are not part of the API.

The number of threads used for (de-)compression of BGZF-streams can be adjusted via setting seqan3::contrib::bgzf_thread_count.
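
To make the "magic header" idea concrete, here is a small sketch that classifies the first bytes of a stream. It is not SeqAn's detection code: the compression enumeration and detect_compression are hypothetical names, and only the well-known GZip (0x1f 0x8b) and BZip2 ("BZh") signatures are checked. Since BGZF is a compatible extension to GZip, this sketch reports BGZF data as gz.

```cpp
#include <string_view>

// Hypothetical result type for this sketch.
enum class compression { none, gz, bz2 };

// Classify a stream by the first bytes that were read from it.
inline compression detect_compression(std::string_view const head)
{
    if (head.substr(0, 2) == "\x1f\x8b") // GZip (and BGZF) signature
        return compression::gz;
    if (head.substr(0, 3) == "BZh")      // BZip2 signature
        return compression::bz2;
    return compression::none;            // treat as uncompressed
}
```

A caller would read the first few bytes of the primary stream, pass them to such a function, and then either wrap the stream in a decompressing stream or use it unchanged.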

Serialisation

Besides formatted I/O which is realised via files and formats, SeqAn also supports object-level serialisation. This enables you to store data structures like indexes or sequences directly to disk.

We use the cereal library to accomplish this. For more information see cereal's documentation or our tutorial on Indexing and searching with SeqAn which contains an example.

Enumeration Type Documentation

◆ field

enum class seqan3::field
strong

An enumerator for the fields used in file formats.

Some of the fields are shared between formats.

The following table shows the usage of fields in the respective files (Note that each valid format for a file must handle all of its fields):

Field Sequence IO Alignment IO Structure IO
seq
id
qual
seq_qual
offset
bpp
structure
structured_seq
energy
react
react_err
comment
alignment
ref_id
ref_seq
ref_offset
header_ptr
flag
mate
mapq
cigar
tags
bit_score
evalue
Enumerator
seq 

The "sequence", usually a range of nucleotides or amino acids.

id 

The identifier, usually a string.

qual 

The qualities, usually in Phred score notation.

offset 

Sequence (seqan3::field::seq) relative start position (0-based), unsigned value.

bpp 

Base pair probability matrix of interactions, usually a matrix of float numbers.

structure 

Fixed interactions, usually a string of structure alphabet characters.

structured_seq 

Sequence and fixed interactions combined in one range.

energy 

Energy of a folded sequence, represented by one float number.

react 

Reactivity values of the sequence characters given in a vector of float numbers.

react_err 

Reactivity error values given in a vector corresponding to seqan3::field::react.

comment 

Comment field of arbitrary content, usually a string.

alignment 

The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment.

ref_id 

The identifier of the (reference) sequence that seqan3::field::seq was aligned to.

ref_seq 

The (reference) "sequence" information, usually a range of nucleotides or amino acids.

ref_offset 

Sequence (seqan3::field::ref_seq) relative start position (0-based), unsigned value.

header_ptr 

A pointer to the seqan3::sam_file_header object storing header information.

flag 

The alignment flag (bit information), uint16_t value.

mate 

The mate pair information given as a std::tuple of reference name, offset and template length.

mapq 

The mapping quality of the seqan3::field::seq alignment, usually a Phred-scaled score.

cigar 

The cigar vector (std::vector<seqan3::cigar>) representing the alignment in SAM/BAM format.

tags 

The optional tags in the SAM format, stored in a dictionary.

bit_score 

The bit score (statistical significance indicator), unsigned value.

evalue 

The e-value (length normalized bit score), double value.

user_defined_0 

Identifier for user defined file formats and specialisations.

user_defined_1 

Identifier for user defined file formats and specialisations.

user_defined_2 

Identifier for user defined file formats and specialisations.

user_defined_3 

Identifier for user defined file formats and specialisations.

user_defined_4 

Identifier for user defined file formats and specialisations.

user_defined_5 

Identifier for user defined file formats and specialisations.

user_defined_6 

Identifier for user defined file formats and specialisations.

user_defined_7 

Identifier for user defined file formats and specialisations.

user_defined_8 

Identifier for user defined file formats and specialisations.

user_defined_9 

Identifier for user defined file formats and specialisations.

Function Documentation

◆ make_secondary_istream()

template<builtin_character char_t>
auto seqan3::detail::make_secondary_istream ( std::basic_istream< char_t > &  primary_stream,
std::filesystem::path filename 
) -> std::unique_ptr<std::basic_istream<char_t>, std::function<void(std::basic_istream<char_t> *)>>
inline

Depending on the magic bytes of the given stream, return a decompression stream or forward the primary stream.

Parameters
[in]      primary_stream  The primary (device) stream for reading.
[in,out]  filename        The associated filename; compression extensions will be stripped. [optional]
Returns
A pointer to the secondary stream with a default deleter or a nop-deleter.
Exceptions
seqan3::file_open_error  If the magic bytes suggest compression, but it is not supported/available.

◆ make_secondary_ostream()

template<builtin_character char_t>
auto seqan3::detail::make_secondary_ostream ( std::basic_ostream< char_t > &  primary_stream,
std::filesystem::path filename 
) -> std::unique_ptr<std::basic_ostream<char_t>, std::function<void(std::basic_ostream<char_t> *)>>
inline

Depending on the given filename/extension, create a compression stream or just forward the primary stream.

Parameters
[in]      primary_stream  The primary (uncompressed) stream for writing.
[in,out]  filename        The associated filename; compression extensions will be stripped.
Returns
A pointer to the secondary stream with a default deleter or a nop-deleter.
Exceptions
seqan3::file_open_error  If a compression extension is used, but it is not supported/available.

◆ range_wrap_ignore()

auto seqan3::detail::range_wrap_ignore ( ignore_t const &  )
inline

If the argument is std::ignore, return an infinite range of std::ignore values.

This function can be used in combination with seqan3::detail::get_or_ignore to ensure same dimensionality of the returned type, even for fields not present in the record / tuple.

◆ set_format()

template<typename format_variant_type >
void seqan3::detail::set_format ( format_variant_type &  format,
std::filesystem::path const &  file_name 
)

Sets the file format according to the file name extension.

Template Parameters
format_variant_type  The variant type of the format to set.
Parameters
[out]  format     The format to set.
[in]   file_name  The file name to extract the extension from.
Exceptions
seqan3::unhandled_extension_error  If the extension in file_name does not occur in any valid extensions of the formats specified in the format_variant_type template argument list.
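
The behaviour described above can be sketched as a walk over the alternatives of a std::variant. This is an illustration under assumptions, not the library's code: fasta_tag, fastq_tag and set_format_sketch are hypothetical, each tag is assumed to expose a static file_extensions member, and std::invalid_argument stands in for seqan3::unhandled_extension_error.

```cpp
#include <algorithm>
#include <cstddef>
#include <filesystem>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical format tags, each listing the extensions it accepts.
struct fasta_tag { static inline std::vector<std::string> const file_extensions{"fa", "fasta"}; };
struct fastq_tag { static inline std::vector<std::string> const file_extensions{"fq", "fastq"}; };

// Try alternative I of the variant; on a mismatch, continue with I + 1.
template <std::size_t I = 0, typename variant_t>
void set_format_sketch(variant_t & format, std::filesystem::path const & file_name)
{
    if constexpr (I == std::variant_size_v<variant_t>)
    {
        // No alternative accepted the extension (stand-in for unhandled_extension_error).
        throw std::invalid_argument{"No format accepts the extension of " + file_name.string()};
    }
    else
    {
        using tag_t = std::variant_alternative_t<I, variant_t>;

        std::string ext = file_name.extension().string();
        if (!ext.empty())
            ext.erase(0, 1); // drop the leading dot

        auto const & exts = tag_t::file_extensions;
        if (std::find(exts.begin(), exts.end(), ext) != exts.end())
            format.template emplace<I>(); // select this format
        else
            set_format_sketch<I + 1>(format, file_name);
    }
}
```

Given a std::variant<fasta_tag, fastq_tag>, calling the sketch with "reads.fq" selects the second alternative, while "notes.txt" ends in the exception branch.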

◆ starts_with()

template<std::ranges::forward_range ref_t, std::ranges::forward_range query_t>
requires std::equality_comparable_with<std::ranges::range_reference_t<ref_t>, std::ranges::range_reference_t<query_t>>
bool seqan3::detail::starts_with ( ref_t &&  reference,
query_t &&  query 
)
inline

Check whether the query range is a prefix of the reference range.

Parameters
[in]  reference  The range that is expected to be the longer one.
[in]  query      The range that is expected to be the shorter one.

◆ valid_file_extensions()

template<typename formats_t >
std::vector< std::string > seqan3::detail::valid_file_extensions ( )
inline

Returns a list of valid file extensions.

Template Parameters
formats_t  The list of formats to parse, i.e. a seqan3::type_list; seqan3::detail::all_formats_have_file_extensions must return true.
Returns
std::vector<std::string> with all valid file extensions specified by valid_formats.

Complexity

Linear in the number of file extensions.

Thread-safety

Thread-safe.

Exception

Strong exception guarantee. No input is modified. Might throw std::bad_alloc.
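
As a rough illustration of the documented behaviour, the sketch below concatenates the extension lists of several format tags. It is not the library's implementation: the tags and valid_file_extensions_sketch are hypothetical, the formats are passed as a plain parameter pack instead of a seqan3::type_list, and each tag is assumed to expose a static file_extensions member.

```cpp
#include <string>
#include <vector>

// Hypothetical format tags, each listing the extensions it accepts.
struct fasta_tag { static inline std::vector<std::string> const file_extensions{"fa", "fasta"}; };
struct fastq_tag { static inline std::vector<std::string> const file_extensions{"fq", "fastq"}; };

// Collect the extensions of all given formats, in the order of the formats.
template <typename... formats_t>
std::vector<std::string> valid_file_extensions_sketch()
{
    std::vector<std::string> result;
    // Fold expression: append each format's extensions in turn.
    (result.insert(result.end(), formats_t::file_extensions.begin(), formats_t::file_extensions.end()), ...);
    return result;
}
```

The work done is one copy per extension, which matches the linear complexity stated above.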

◆ write_end_of_line()

template<typename char_t , typename traits_t = std::char_traits<char_t>>
void seqan3::detail::fast_ostreambuf_iterator< char_t, traits_t >::write_end_of_line ( bool const  add_cr)
inline

Write "\n" or "\r\n" to the stream buffer, depending on arguments.

Parameters
add_cr  Whether to add carriage return, too.

◆ write_eol()

template<std::output_iterator< char > it_t>
constexpr void seqan3::detail::write_eol ( it_t &  it,
bool const  add_cr 
)
constexpr

Write "\n" or "\r\n" to the stream iterator, depending on arguments.

Template Parameters
it_t  Type of the iterator; must satisfy std::output_iterator with char.
Parameters
it      The iterator.
add_cr  Whether to add carriage return, too.
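
The effect of add_cr can be sketched in a few lines. This is a minimal illustration, not the library's code; write_eol_sketch is a hypothetical name and the template parameter is left unconstrained here.

```cpp
#include <iterator>
#include <string>

// Write "\n", or "\r\n" if add_cr is set, through an output iterator.
template <typename it_t>
constexpr void write_eol_sketch(it_t & it, bool const add_cr)
{
    if (add_cr)
    {
        *it = '\r';
        ++it;
    }
    *it = '\n';
    ++it;
}
```

Used with a std::back_inserter on a std::string, one call with add_cr set to false followed by one call with add_cr set to true leaves the string holding "\n\r\n".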

Variable Documentation

◆ has_member_file_extensions

template<typename list_t >
constexpr bool seqan3::detail::has_member_file_extensions = false
inline constexpr

Helper function to determine if all types in a format type list have a static member file_extensions.

Template Parameters
list_t  The type of the template parameter list.
Returns
true if type::file_extensions for all expanded types of list_t is valid, otherwise false.

◆ has_type_valid_formats

template<typename query_t >
constexpr bool seqan3::detail::has_type_valid_formats = false
inline constexpr

Helper function to determine if a type has a static member valid_formats.

Template Parameters
query_t  The type to query.
Returns
true if query_t::valid_formats is valid, otherwise false.
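
A variable template that defaults to false and becomes true for types with a given member can be sketched with a partial specialisation. The code below is an illustration under assumptions, not the library's definition: has_valid_formats_sketch and the two example types are hypothetical, and valid_formats is assumed to be a member type.

```cpp
#include <type_traits>

// Primary template: false unless the specialisation below matches.
template <typename query_t, typename = void>
constexpr bool has_valid_formats_sketch = false;

// Chosen only if query_t::valid_formats names a type.
template <typename query_t>
constexpr bool has_valid_formats_sketch<query_t, std::void_t<typename query_t::valid_formats>> = true;

// Hypothetical example types.
struct with_formats { using valid_formats = int; };
struct without_formats {};

static_assert(has_valid_formats_sketch<with_formats>);
static_assert(!has_valid_formats_sketch<without_formats>);
```

The primary template mirrors the "= false" default shown in the declaration above; the specialisation is what flips the value for matching types.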