SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
The IO module provides stream handling and formatted I/O.
Modules | |
SAM File | |
Provides files and formats for handling read mapping data. | |
Sequence File | |
Provides files and formats for handling sequence data. | |
Stream | |
The stream sub-module contains data structures and functions for streaming and tokenization. | |
Structure File | |
Provides files and formats for handling structure data. | |
Views | |
IO related views. | |
Concepts | |
concept | seqan3::detail::fields_specialisation |
Auxiliary concept that checks whether a type is a specialisation of seqan3::fields. | |
Classes | |
struct | seqan3::detail::bgzf_compression |
A tag signifying a bgzf compressed file. | |
struct | seqan3::detail::bz2_compression |
A tag signifying a bz2 compressed file. | |
struct | seqan3::fields< fs > |
A class template that holds a choice of seqan3::field. | |
struct | seqan3::file_open_error |
Thrown if there is an unspecified filesystem or stream error while opening, e.g. permission problem. | |
struct | seqan3::format_error |
Thrown if information given to output format didn't match expectations. | |
struct | seqan3::detail::gz_compression |
A tag signifying a gz compressed file. | |
class | seqan3::detail::ignore_output_iterator |
An output iterator that emulates writing to a null stream in order to dispose of the output. | |
class | seqan3::detail::in_file_iterator< file_type > |
Input iterator necessary for providing a range-like interface in input file. | |
struct | seqan3::io_error |
Thrown if there is an I/O error in low-level I/O operations such as in std::basic_streambuf operations. | |
struct | seqan3::detail::is_derived_from_record< record_t > |
Helper struct to implement seqan3::detail::record_like. | |
class | seqan3::detail::out_file_iterator< file_type > |
Output iterator necessary for providing a range-like interface in output file. | |
struct | seqan3::parse_error |
Thrown if there is a parse error, such as reading an unexpected character from an input stream. | |
struct | seqan3::record< field_types, field_ids > |
The class template that file records are based on; behaves like a std::tuple. | |
interface | record_like |
The concept for a type that models a record. | |
class | seqan3::detail::safe_filesystem_entry |
A safeguard to manage a filesystem entry, e.g. a file or a directory. | |
struct | seqan3::detail::select_types_with_ids< field_types, field_types_as_ids, selected_field_ids, field_no, return_types > |
Exposes a subset of types as a seqan3::type_list selected based on their IDs. | |
struct | seqan3::unexpected_end_of_input |
Thrown if I/O was expecting more input (e.g. a delimiter or a new line), but the end of input was reached. | |
struct | seqan3::unhandled_extension_error |
Thrown if there is no format that accepts a given file extension. | |
struct | seqan3::detail::variant_from_tags< list_t, output_t > |
Base class to deduce the std::variant type from format tags. | |
struct | seqan3::detail::variant_from_tags< type_list< ts... >, output_t > |
Transfers a list of format tags (ts...) onto a std::variant by specialising output_t with each. | |
struct | seqan3::detail::zstd_compression |
A tag signifying a zstd compressed file. | |
Typedefs | |
using | seqan3::detail::compression_formats = pack_traits::drop_front< void > |
A seqan3::type_list containing the available compression formats. | |
Functions | |
template<field f, typename field_types , typename field_ids , typename or_type > | |
decltype(auto) | seqan3::detail::get_or (record< field_types, field_ids > &r, or_type &&or_value) |
Access an element in a std::tuple or seqan3::record; return or_value if not contained. | |
template<field f, typename field_types , typename field_ids > | |
auto & | seqan3::detail::get_or_ignore (record< field_types, field_ids > &r) |
Access an element in a std::tuple or seqan3::record; return reference to std::ignore if not contained. | |
template<builtin_character char_t> | |
auto | seqan3::detail::make_secondary_istream (std::basic_istream< char_t > &primary_stream, std::filesystem::path &filename) -> std::unique_ptr< std::basic_istream< char_t >, std::function< void(std::basic_istream< char_t > *)> > |
Depending on the magic bytes of the given stream, return a decompression stream or forward the primary stream. | |
template<builtin_character char_t> | |
auto | seqan3::detail::make_secondary_ostream (std::basic_ostream< char_t > &primary_stream, std::filesystem::path &filename) -> std::unique_ptr< std::basic_ostream< char_t >, std::function< void(std::basic_ostream< char_t > *)> > |
Depending on the given filename/extension, create a compression stream or just forward the primary stream. | |
auto | seqan3::detail::range_wrap_ignore (ignore_t const &) |
If the argument is std::ignore, return an infinite range of std::ignore values. | |
template<std::ranges::input_range rng_t> | |
auto & | seqan3::detail::range_wrap_ignore (rng_t &range) |
Pass through the reference to the argument in case the argument satisfies std::ranges::input_range. | |
template<typename format_variant_type > | |
void | seqan3::detail::set_format (format_variant_type &format, std::filesystem::path const &file_name) |
Sets the file format according to the file name extension. | |
template<std::ranges::forward_range ref_t, std::ranges::forward_range query_t> requires std::equality_comparable_with<std::ranges::range_reference_t<ref_t>, std::ranges::range_reference_t<query_t>> | |
bool | seqan3::detail::starts_with (ref_t &&reference, query_t &&query) |
Check whether the query range is a prefix of the reference range. | |
template<typename formats_t > | |
std::vector< std::string > | seqan3::detail::valid_file_extensions () |
Returns a list of valid file extensions. | |
void | seqan3::detail::fast_ostreambuf_iterator< char_t, traits_t >::write_end_of_line (bool const add_cr) |
Write "\n" or "\r\n" to the stream buffer, depending on arguments. | |
template<std::output_iterator< char > it_t> | |
constexpr void | seqan3::detail::write_eol (it_t &it, bool const add_cr) |
Write "\n" or "\r\n" to the stream iterator, depending on arguments. | |
Variables | |
template<typename list_t > | |
constexpr bool | seqan3::detail::has_member_file_extensions = false |
Helper function to determine if all types in a format type list have a static member file_extensions. | |
template<typename query_t > | |
constexpr bool | seqan3::detail::has_type_valid_formats = false |
Helper function to determine if a type has a static member valid_formats. | |
The IO module provides stream handling and formatted I/O.
SeqAn has the notion of files and formats. A file is one abstraction level higher than a format: it describes a common use-case and typically supports multiple formats. The developer needs to know which kind of file they want to read or write; this choice is made at compile-time. The format, on the other hand, is automatically detected based on the file provided by the user to the program.
For example, seqan3::sequence_file_input handles reading sequence files. It can be created directly from an input stream, or from a file name. After opening the file it will detect whether the format is seqan3::format_fasta or seqan3::format_fastq (or another supported format) automatically – normally by comparing the extension.
Some formats are available in multiple files, e.g. seqan3::format_sam can be read by seqan3::sequence_file_input and by seqan3::sam_file_input. This represents different use-cases of the same file format.
Typically formats are supported for reading and writing, but this does not always have to be the case. See the above links for more information.
The main file interface that SeqAn offers is record-based, i.e. every file is conceptually a range of records, and each record in turn behaves as a tuple of fields.
The record type of all files is based on seqan3::record, but the composition of fields is different per file.
In particular this means:
Please have a look at the tutorial for Sequence File Input and Output and the API docs for seqan3::sequence_file_input to learn about this design in practice.
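To make the "record behaves as a tuple of fields" idea concrete, here is a minimal standalone sketch. It does not use SeqAn: the names `field`, `fields`, `record` and `get_field` below are simplified, hypothetical stand-ins (the real seqan3::record takes its template parameters in a different form), and the sketch only illustrates how a compile-time list of field IDs can map each ID onto a tuple position.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>

// Hypothetical stand-in for seqan3::field, reduced to three enumerators.
enum class field { seq, id, qual };

// Hypothetical stand-in for seqan3::fields: a compile-time list of field IDs.
template <field... fs>
struct fields
{
    static constexpr std::array<field, sizeof...(fs)> as_array{fs...};

    // Position of `f` in the list, or the list size if `f` is absent.
    static constexpr std::size_t index_of(field f)
    {
        for (std::size_t i = 0; i < as_array.size(); ++i)
            if (as_array[i] == f)
                return i;
        return as_array.size();
    }
};

// A record stores one value per field; the i-th value belongs to the i-th field ID.
template <typename field_ids, typename... field_types>
struct record
{
    std::tuple<field_types...> data;
};

// Access a record element by field ID instead of by position.
template <field f, typename field_ids, typename... field_types>
auto & get_field(record<field_ids, field_types...> & r)
{
    constexpr std::size_t pos = field_ids::index_of(f);
    static_assert(pos < sizeof...(field_types), "The record does not contain this field.");
    return std::get<pos>(r.data);
}
```

With `record<fields<field::id, field::seq>, std::string, std::string>`, a call to `get_field<field::seq>(rec)` returns the second tuple element, while asking for `field::qual` fails at compile time.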
SeqAn works with regular iostreams as provided by the standard library, but it also handles compressed streams:
Format | Extension | Dependency | Description |
---|---|---|---|
GZip | .gz ¹ | zlib | GNU-Zip, most common format on UNIX |
BGZF | .gz , .bgzf ² | zlib | Blocked GZip, compatible extension to GZip, features parallelisation |
BZip2 | .bz2 | libbz2 | Stronger compression than GZip, slower to compress |
¹ SeqAn always assumes GZip and does not handle pure .Z.
² Some file formats like .bam or .bcf are implicitly BGZF-compressed without showing this in the extension.
Support for these compression formats is optional and depends on whether the respective dependency is available when you build your program (if you use CMake, this should happen automatically).
SeqAn file types apply compression/decompression streams transparently, i.e. if the given file extension or "magic header" of a file suggests this, the respective stream is automatically (de-)compressed.
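As an illustration of what "magic header" detection means, the following standalone sketch compares the first bytes of a buffer against the well-known signatures of gzip (`1f 8b`), bzip2 (`BZh`) and zstd (`28 b5 2f fd`). It is not SeqAn's internal implementation; `compression`, `has_magic` and `detect_compression` are hypothetical names. BGZF data also starts with the gzip signature, so this simplified sketch reports it as gzip.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type; not part of SeqAn.
enum class compression { none, gz, bz2, zstd };

// Returns true if `data` begins with the `n` bytes stored in `magic`.
template <std::size_t n>
bool has_magic(std::string const & data, std::array<unsigned char, n> const & magic)
{
    if (data.size() < n)
        return false;

    for (std::size_t i = 0; i < n; ++i)
        if (static_cast<unsigned char>(data[i]) != magic[i])
            return false;

    return true;
}

// Guess the compression format from the first bytes of a stream.
compression detect_compression(std::string const & first_bytes)
{
    if (has_magic(first_bytes, std::array<unsigned char, 2>{0x1f, 0x8b}))
        return compression::gz; // also matches BGZF, which is gzip-compatible
    if (has_magic(first_bytes, std::array<unsigned char, 3>{'B', 'Z', 'h'}))
        return compression::bz2;
    if (has_magic(first_bytes, std::array<unsigned char, 4>{0x28, 0xb5, 0x2f, 0xfd}))
        return compression::zstd;

    return compression::none;
}
```

A real reader would peek at the stream without consuming the bytes, so that an uncompressed file can still be parsed from its first character.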
The (de-)compression stream wrappers are currently only used internally and are not part of the API.
The number of threads used for (de-)compression of BGZF-streams can be adjusted via setting seqan3::contrib::bgzf_thread_count.
Besides formatted I/O which is realised via files and formats, SeqAn also supports object-level serialisation. This enables you to store data structures like indexes or sequences directly to disk.
We use the cereal library to accomplish this. For more information see cereal's documentation or our tutorial on Indexing and searching with SeqAn which contains an example.
enum class seqan3::field | strong |
An enumerator for the fields used in file formats.
Some of the fields are shared between formats.
The following table shows the usage of fields in the respective files (Note that each valid format for a file must handle all of its fields):
Field | Sequence IO | Alignment IO | Structure IO |
---|---|---|---|
seq | ✅ | ✅ | ✅ |
id | ✅ | ✅ | ✅ |
qual | ✅ | ✅ | ✅ |
seq_qual | ✅ | | |
offset | | ✅ | ✅ |
bpp | | | ✅ |
structure | | | ✅ |
structured_seq | | | ✅ |
energy | | | ✅ |
react | | | ✅ |
react_err | | | ✅ |
comment | | | ✅ |
alignment | | ✅ | |
ref_id | | ✅ | |
ref_seq | | ✅ | |
ref_offset | | ✅ | |
header_ptr | | ✅ | |
flag | | ✅ | |
mate | | ✅ | |
mapq | | ✅ | |
cigar | | ✅ | |
tags | | ✅ | |
bit_score | | ✅ | |
evalue | | ✅ | |
Enumerator | Description |
---|---|
seq | The "sequence", usually a range of nucleotides or amino acids. |
id | The identifier, usually a string. |
qual | The qualities, usually in Phred score notation. |
offset | Sequence (seqan3::field::seq) relative start position (0-based), unsigned value. |
bpp | Base pair probability matrix of interactions, usually a matrix of float numbers. |
structure | Fixed interactions, usually a string of structure alphabet characters. |
structured_seq | Sequence and fixed interactions combined in one range. |
energy | Energy of a folded sequence, represented by one float number. |
react | Reactivity values of the sequence characters given in a vector of float numbers. |
react_err | Reactivity error values given in a vector corresponding to seqan3::field::react. |
comment | Comment field of arbitrary content, usually a string. |
alignment | The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment. |
ref_id | The identifier of the (reference) sequence that seqan3::field::seq was aligned to. |
ref_seq | The (reference) "sequence" information, usually a range of nucleotides or amino acids. |
ref_offset | Sequence (seqan3::field::ref_seq) relative start position (0-based), unsigned value. |
header_ptr | A pointer to the seqan3::sam_file_header object storing header information. |
flag | The alignment flag (bit information). |
mate | The mate pair information given as a std::tuple of reference name, offset and template length. |
mapq | The mapping quality of the seqan3::field::seq alignment, usually a Phred-scaled score. |
cigar | The cigar vector (std::vector<seqan3::cigar>) representing the alignment in SAM/BAM format. |
tags | The optional tags in the SAM format, stored in a dictionary. |
bit_score | The bit score (statistical significance indicator), unsigned value. |
evalue | The e-value (length normalized bit score). |
user_defined_0 | Identifier for user defined file formats and specialisations. |
user_defined_1 | Identifier for user defined file formats and specialisations. |
user_defined_2 | Identifier for user defined file formats and specialisations. |
user_defined_3 | Identifier for user defined file formats and specialisations. |
user_defined_4 | Identifier for user defined file formats and specialisations. |
user_defined_5 | Identifier for user defined file formats and specialisations. |
user_defined_6 | Identifier for user defined file formats and specialisations. |
user_defined_7 | Identifier for user defined file formats and specialisations. |
user_defined_8 | Identifier for user defined file formats and specialisations. |
user_defined_9 | Identifier for user defined file formats and specialisations. |
seqan3::detail::make_secondary_istream() | inline |
Depending on the magic bytes of the given stream, return a decompression stream or forward the primary stream.
[in] | primary_stream | The primary (device) stream for reading. |
[in,out] | filename | The associated filename; compression extensions will be stripped. [optional] |
seqan3::file_open_error | If the magic bytes suggest a compression format that is not supported/available. |
seqan3::detail::make_secondary_ostream() | inline |
Depending on the given filename/extension, create a compression stream or just forward the primary stream.
[in] | primary_stream | The primary (uncompressed) stream for writing. |
[in,out] | filename | The associated filename; compression extensions will be stripped. |
seqan3::file_open_error | If a compression extension is used that is not supported/available. |
seqan3::detail::range_wrap_ignore() | inline |
If the argument is std::ignore, return an infinite range of std::ignore values.
This function can be used in combination with seqan3::detail::get_or_ignore to ensure same dimensionality of the returned type, even for fields not present in the record / tuple.
void seqan3::detail::set_format (format_variant_type &format, std::filesystem::path const &file_name)
Sets the file format according to the file name extension.
format_variant_type | The variant type of the format to set. |
[out] | format | The format to set. |
[in] | file_name | The file name to extract the extension from. |
seqan3::unhandled_extension_error | If the extension in file_name does not occur in any valid extensions of the formats specified in the format_variant_type template argument list. |
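The extension lookup described above can be sketched without SeqAn as follows. Everything here is a simplified assumption for illustration: `fasta_tag`, `fastq_tag`, `extension_of` and `set_format_sketch` are hypothetical names, the file name is a plain `std::string` rather than a `std::filesystem::path`, and `std::runtime_error` stands in for seqan3::unhandled_extension_error.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical format tags; each one lists the extensions it accepts.
struct fasta_tag
{
    static inline std::vector<std::string> const file_extensions{"fa", "fasta"};
};

struct fastq_tag
{
    static inline std::vector<std::string> const file_extensions{"fq", "fastq"};
};

// Extension without the dot; empty if the name contains no dot (simplified).
std::string extension_of(std::string const & file_name)
{
    auto const dot = file_name.find_last_of('.');
    return dot == std::string::npos ? std::string{} : file_name.substr(dot + 1);
}

// Select the first variant alternative whose extension list contains the file's extension.
template <typename... formats_t>
void set_format_sketch(std::variant<formats_t...> & format, std::string const & file_name)
{
    std::string const ext = extension_of(file_name);
    bool found = false;

    auto try_format = [&](auto tag)
    {
        using format_t = decltype(tag);

        if (found)
            return;

        for (auto const & candidate : format_t::file_extensions)
        {
            if (candidate == ext)
            {
                format = tag;
                found = true;
                return;
            }
        }
    };

    (try_format(formats_t{}), ...); // visit every alternative in declaration order

    if (!found)
        throw std::runtime_error{"No valid format found for extension: " + ext};
}
```

Calling `set_format_sketch(format, "reads.fastq")` on a `std::variant<fasta_tag, fastq_tag>` leaves the variant holding `fastq_tag`; an unknown extension such as `"notes.txt"` raises the exception.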
seqan3::detail::starts_with() | inline |
Check whether the query range is a prefix of the reference range.
[in] | reference | The range that is expected to be the longer one. |
[in] | query | The range that is expected to be the shorter one. |
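A prefix check of this kind can be sketched with plain iterators. The function below is a hypothetical, simplified C++17 variant named `starts_with_sketch`; it does not use the range concepts of the real declaration and only shows the element-wise comparison.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Returns true if every element of `query` matches the element at the same
// position in `reference`; an empty query is a prefix of any range.
template <typename ref_t, typename query_t>
bool starts_with_sketch(ref_t const & reference, query_t const & query)
{
    auto ref_it = std::begin(reference);
    auto const ref_end = std::end(reference);
    auto query_it = std::begin(query);
    auto const query_end = std::end(query);

    for (; query_it != query_end; ++query_it, ++ref_it)
    {
        // The reference ran out first, or the elements differ.
        if (ref_it == ref_end || !(*ref_it == *query_it))
            return false;
    }

    return true;
}
```

Pass `std::string` objects rather than string literals: a literal is a `char` array whose trailing null terminator would take part in the comparison.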
seqan3::detail::valid_file_extensions() | inline |
Returns a list of valid file extensions.
formats_t | The list of formats to parse, i.e. a seqan3::type_list; seqan3::detail::all_formats_have_file_extensions must return true. |
Returns a std::vector<std::string> with all valid file extensions specified by valid_formats.
Linear in the number of file extensions.
Thread-safe.
Strong exception guarantee. No input is modified. Might throw std::bad_alloc.
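Collecting the extensions of several formats into one list could look roughly like the sketch below. It takes the formats as a template parameter pack instead of a seqan3::type_list, and `fasta_tag`, `fastq_tag` and `valid_file_extensions_sketch` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical format tags with a static list of accepted extensions.
struct fasta_tag
{
    static inline std::vector<std::string> const file_extensions{"fa", "fasta"};
};

struct fastq_tag
{
    static inline std::vector<std::string> const file_extensions{"fq", "fastq"};
};

// Concatenate the extension lists of all formats, in declaration order.
template <typename... formats_t>
std::vector<std::string> valid_file_extensions_sketch()
{
    std::vector<std::string> extensions;

    // Fold over the pack: append each format's extensions to the result.
    (extensions.insert(extensions.end(),
                       formats_t::file_extensions.begin(),
                       formats_t::file_extensions.end()),
     ...);

    return extensions;
}
```

The work is linear in the total number of extensions, and the only expected failure mode is an allocation failure while the result vector grows.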
seqan3::detail::fast_ostreambuf_iterator< char_t, traits_t >::write_end_of_line() | inline |
Write "\n" or "\r\n" to the stream buffer, depending on arguments.
add_cr | Whether to add carriage return, too. |
seqan3::detail::write_eol() | constexpr |
Write "\n" or "\r\n" to the stream iterator, depending on arguments.
it_t | Type of the iterator; must satisfy std::output_iterator with char. |
it | The iterator. |
add_cr | Whether to add carriage return, too. |
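The behaviour is small enough to sketch in a few lines. The function below, `write_eol_sketch`, is a hypothetical C++17 illustration that accepts any output iterator for `char` without the concept check of the real declaration.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Write "\r\n" if add_cr is true, otherwise "\n", advancing the iterator.
template <typename it_t>
void write_eol_sketch(it_t & it, bool const add_cr)
{
    if (add_cr)
    {
        *it = '\r';
        ++it;
    }

    *it = '\n';
    ++it;
}
```

Used with `std::back_inserter` on a `std::string`, one call with `add_cr == false` followed by one with `add_cr == true` appends three characters in total.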
seqan3::detail::has_member_file_extensions | inline constexpr |
Helper function to determine if all types in a format type list have a static member file_extensions.
list_t | The type of the template parameter list. |
true if type::file_extensions for all expanded types of list_t is valid, otherwise false.
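A check of this shape can be written with the C++17 detection idiom. The sketch below is not SeqAn's implementation: `type_list`, `has_file_extensions`, `all_have_file_extensions` and the tag types are hypothetical names. It follows the same layout as the declaration above, with a primary variable template that is `false` and a specialisation for type lists.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Minimal hypothetical type list.
template <typename... ts>
struct type_list
{};

// Detection idiom: true only if `t::file_extensions` is a valid expression.
template <typename t, typename = void>
struct has_file_extensions : std::false_type
{};

template <typename t>
struct has_file_extensions<t, std::void_t<decltype(t::file_extensions)>> : std::true_type
{};

// Primary template: anything that is not a type_list yields false.
template <typename list_t>
inline constexpr bool all_have_file_extensions = false;

// Specialisation: every type in the list must provide file_extensions.
template <typename... ts>
inline constexpr bool all_have_file_extensions<type_list<ts...>> = (has_file_extensions<ts>::value && ...);

// Hypothetical format tags used to exercise the check.
struct fasta_tag
{
    static inline std::vector<std::string> const file_extensions{"fa", "fasta"};
};

struct fastq_tag
{
    static inline std::vector<std::string> const file_extensions{"fq", "fastq"};
};
```

Because the result is a compile-time constant, it can also be used in a `static_assert` or as a constraint on a function template.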
seqan3::detail::has_type_valid_formats | inline constexpr |
Helper function to determine if a type has a static member valid_formats.
query_t | The type to query. |
true if query_t::valid_formats is valid, otherwise false.