A class for reading alignment files, e.g. SAM, BAM, BLAST ...
#include <seqan3/io/sam_file/input.hpp>
Public Types | |
Template arguments | |
Exposed as member types for public access. | |
using | traits_type = traits_type_ |
A traits type that defines aliases and template for storage of the fields. | |
using | selected_field_ids = selected_field_ids_ |
A seqan3::fields list with the fields selected for the record. | |
using | valid_formats = valid_formats_ |
A seqan3::type_list with the possible formats. | |
using | stream_char_type = char |
Character type of the stream(s). | |
Range associated types | |
The types necessary to facilitate the behaviour of an input range (used in record-wise reading). | |
using | value_type = record_type |
The value_type is the record_type. | |
using | reference = record_type & |
The reference type. | |
using | const_reference = void |
The const_reference type is void because files are not const-iterable. | |
using | size_type = size_t |
An unsigned integer type, usually std::size_t. | |
using | difference_type = std::make_signed_t< size_t > |
A signed integer type, usually std::ptrdiff_t. | |
using | iterator = detail::in_file_iterator< sam_file_input > |
The iterator type of this view (an input iterator). | |
using | const_iterator = void |
The const iterator type is void because files are not const-iterable. | |
using | sentinel = std::default_sentinel_t |
The type returned by end(). | |
Public Member Functions | |
header_type & | header () |
Access the file's header. | |
Constructors, destructor and assignment | |
sam_file_input ()=delete | |
Default constructor is explicitly deleted; you need to give a stream or file name. | |
sam_file_input (sam_file_input const &)=delete | |
Copy construction is explicitly deleted because you cannot have multiple access to the same file. | |
sam_file_input & | operator= (sam_file_input const &)=delete |
Copy assignment is explicitly deleted because you cannot have multiple access to the same file. | |
sam_file_input (sam_file_input &&)=default | |
Move construction is defaulted. | |
sam_file_input & | operator= (sam_file_input &&)=default |
Move assignment is defaulted. | |
~sam_file_input ()=default | |
Destructor is defaulted. | |
sam_file_input (std::filesystem::path filename, selected_field_ids const &fields_tag=selected_field_ids{}) | |
Construct from filename. | |
template<input_stream stream_t, sam_file_input_format file_format> | |
sam_file_input (stream_t &stream, file_format const &format_tag, selected_field_ids const &fields_tag=selected_field_ids{}) | |
Construct from an existing stream and with specified format. | |
template<input_stream stream_t, sam_file_input_format file_format> | |
sam_file_input (stream_t &&stream, file_format const &format_tag, selected_field_ids const &fields_tag=selected_field_ids{}) | |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
sam_file_input (std::filesystem::path filename, typename traits_type::ref_ids &ref_ids, typename traits_type::ref_sequences &ref_sequences, selected_field_ids const &fields_tag=selected_field_ids{}) | |
Construct from filename and given additional reference information. | |
template<input_stream stream_t, sam_file_input_format file_format> | |
sam_file_input (stream_t &stream, typename traits_type::ref_ids &ref_ids, typename traits_type::ref_sequences &ref_sequences, file_format const &format_tag, selected_field_ids const &fields_tag=selected_field_ids{}) | |
Construct from an existing stream and with specified format. | |
template<input_stream stream_t, sam_file_input_format file_format> | |
sam_file_input (stream_t &&stream, typename traits_type::ref_ids &ref_ids, typename traits_type::ref_sequences &ref_sequences, file_format const &format_tag, selected_field_ids const &fields_tag=selected_field_ids{}) | |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
Range interface | |
Provides functions for record based reading of the file. | |
iterator | begin () |
Returns an iterator to current position in the file. | |
sentinel | end () noexcept |
Returns a sentinel for comparison with iterator. | |
reference | front () noexcept |
Return the record we are currently at in the file. | |
Public Attributes | |
sam_file_input_options< typename traits_type::sequence_legal_alphabet > | options |
The options are public and their members can be set directly. | |
Related Functions | |
(Note that these are not member functions.) | |
Type deduction guides | |
template<input_stream stream_type, sam_file_input_format file_format, detail::fields_specialisation selected_field_ids> | |
sam_file_input (stream_type &&stream, file_format const &, selected_field_ids const &) -> sam_file_input< typename sam_file_input<>::traits_type, selected_field_ids, type_list< file_format >> | |
Deduce selected fields, file_format, and default the rest. | |
template<input_stream stream_type, sam_file_input_format file_format, detail::fields_specialisation selected_field_ids> | |
sam_file_input (stream_type &stream, file_format const &, selected_field_ids const &) -> sam_file_input< typename sam_file_input<>::traits_type, selected_field_ids, type_list< file_format >> | |
Deduce selected fields, file_format, and default the rest. | |
template<input_stream stream_type, sam_file_input_format file_format> | |
sam_file_input (stream_type &&stream, file_format const &) -> sam_file_input< typename sam_file_input<>::traits_type, typename sam_file_input<>::selected_field_ids, type_list< file_format >> | |
Deduce file_format, and default the rest. | |
template<input_stream stream_type, sam_file_input_format file_format> | |
sam_file_input (stream_type &stream, file_format const &) -> sam_file_input< typename sam_file_input<>::traits_type, typename sam_file_input<>::selected_field_ids, type_list< file_format >> | |
Deduce file_format, and default the rest. | |
template<std::ranges::forward_range ref_ids_t, std::ranges::forward_range ref_sequences_t, detail::fields_specialisation selected_field_ids> | |
sam_file_input (std::filesystem::path path, ref_ids_t &, ref_sequences_t &, selected_field_ids const &) -> sam_file_input< sam_file_input_default_traits< std::remove_reference_t< ref_sequences_t >, std::remove_reference_t< ref_ids_t >>, selected_field_ids, typename sam_file_input<>::valid_formats > | |
Deduce selected fields, ref_sequences_t and ref_ids_t, default the rest. | |
template<std::ranges::forward_range ref_ids_t, std::ranges::forward_range ref_sequences_t> | |
sam_file_input (std::filesystem::path path, ref_ids_t &, ref_sequences_t &) -> sam_file_input< sam_file_input_default_traits< std::remove_reference_t< ref_sequences_t >, std::remove_reference_t< ref_ids_t >>, typename sam_file_input<>::selected_field_ids, typename sam_file_input<>::valid_formats > | |
Deduce ref_sequences_t and ref_ids_t, default the rest. | |
template<input_stream stream_type, std::ranges::forward_range ref_ids_t, std::ranges::forward_range ref_sequences_t, sam_file_input_format file_format, detail::fields_specialisation selected_field_ids> | |
sam_file_input (stream_type &&stream, ref_ids_t &, ref_sequences_t &, file_format const &, selected_field_ids const &) -> sam_file_input< sam_file_input_default_traits< std::remove_reference_t< ref_sequences_t >, std::remove_reference_t< ref_ids_t >>, selected_field_ids, type_list< file_format >> | |
Deduce selected fields, ref_sequences_t and ref_ids_t, and file format. | |
template<input_stream stream_type, std::ranges::forward_range ref_ids_t, std::ranges::forward_range ref_sequences_t, sam_file_input_format file_format, detail::fields_specialisation selected_field_ids> | |
sam_file_input (stream_type &stream, ref_ids_t &, ref_sequences_t &, file_format const &, selected_field_ids const &) -> sam_file_input< sam_file_input_default_traits< std::remove_reference_t< ref_sequences_t >, std::remove_reference_t< ref_ids_t >>, selected_field_ids, type_list< file_format >> | |
Deduce selected fields, ref_sequences_t and ref_ids_t, and file format. | |
template<input_stream stream_type, std::ranges::forward_range ref_ids_t, std::ranges::forward_range ref_sequences_t, sam_file_input_format file_format> | |
sam_file_input (stream_type &&stream, ref_ids_t &, ref_sequences_t &, file_format const &) -> sam_file_input< sam_file_input_default_traits< std::remove_reference_t< ref_sequences_t >, std::remove_reference_t< ref_ids_t >>, typename sam_file_input<>::selected_field_ids, type_list< file_format >> | |
Deduce ref_sequences_t and ref_ids_t, and file format. | |
template<input_stream stream_type, std::ranges::forward_range ref_ids_t, std::ranges::forward_range ref_sequences_t, sam_file_input_format file_format> | |
sam_file_input (stream_type &stream, ref_ids_t &, ref_sequences_t &, file_format const &) -> sam_file_input< sam_file_input_default_traits< std::remove_reference_t< ref_sequences_t >, std::remove_reference_t< ref_ids_t >>, typename sam_file_input<>::selected_field_ids, type_list< file_format >> | |
Deduce ref_sequences_t and ref_ids_t, and file format. | |
Field types and record type | |
These types are relevant for record/row-based reading; they may be manipulated via the traits_type to achieve different storage behaviour. | |
using | sequence_type = typename traits_type::template sequence_container< typename traits_type::sequence_alphabet > |
The type of field::seq (default std::vector<seqan3::dna5>). | |
using | id_type = typename traits_type::template id_container< char > |
The type of field::id (std::string by default). | |
using | offset_type = int32_t |
The type of field::offset is fixed to int32_t. | |
using | ref_sequence_type = std::conditional_t< std::same_as< typename traits_type::ref_sequences, ref_info_not_given >, dummy_ref_type, ref_sequence_sliced_type > |
The type of field::ref_seq (default depends on construction). | |
using | ref_id_type = std::optional< int32_t > |
The type of field::ref_id is fixed to std::optional<int32_t>. | |
using | ref_offset_type = std::optional< int32_t > |
The type of field::ref_offset is fixed to an std::optional<int32_t>. | |
using | mapq_type = uint8_t |
The type of field::mapq is fixed to uint8_t. | |
using | quality_type = typename traits_type::template quality_container< typename traits_type::quality_alphabet > |
The type of field::qual (default std::vector<seqan3::phred42>). | |
using | flag_type = sam_flag |
The type of field::flag is fixed to seqan3::sam_flag. | |
using | cigar_type = std::vector< cigar > |
The type of field::cigar is fixed to std::vector<cigar>. | |
using | mate_type = std::tuple< ref_id_type, ref_offset_type, int32_t > |
The type of field::mate is fixed to std::tuple<ref_id_type, ref_offset_type, int32_t>. | |
using | e_value_type = double |
The type of field::evalue is fixed to double. | |
using | bitscore_type = double |
The type of field::bitscore is fixed to double. | |
using | header_type = sam_file_header< typename traits_type::ref_ids > |
The type of field::header_ptr (default: sam_file_header<typename traits_type::ref_ids>). | |
using | alignment_type = std::tuple< gap_decorator< ref_sequence_type >, alignment_query_type > |
The type of field::alignment: a std::tuple of the gapped reference (gap_decorator<ref_sequence_type>) and the gapped query (alignment_query_type). | |
using | field_types = type_list< sequence_type, id_type, offset_type, ref_sequence_type, ref_id_type, ref_offset_type, alignment_type, std::vector< cigar >, mapq_type, quality_type, flag_type, mate_type, sam_tag_dictionary, e_value_type, bitscore_type, header_type * > |
The previously defined types aggregated in a seqan3::type_list. | |
using | field_ids = fields< field::seq, field::id, field::offset, field::ref_seq, field::ref_id, field::ref_offset, field::alignment, field::cigar, field::mapq, field::qual, field::flag, field::mate, field::tags, field::evalue, field::bit_score, field::header_ptr > |
The subset of seqan3::field tags valid for this file; order corresponds to the types in field_types. | |
using | record_type = sam_record< detail::select_types_with_ids_t< field_types, field_ids, selected_field_ids >, selected_field_ids > |
The type of the record, a specialisation of seqan3::record; acts as a tuple of the selected field types. | |
static constexpr bool | is_default_selected_field_ids = selected_field_ids::size == field_ids::size |
Does selected_field_ids contain all fields, as in the default case? | |
A class for reading alignment files, e.g. SAM, BAM, BLAST ...
traits_type | An auxiliary type that defines certain member types and constants, must model seqan3::sam_file_input_traits. |
selected_field_ids | A seqan3::fields type with the list and order of desired record entries; all fields must be in seqan3::sam_file_input::field_ids. |
valid_formats | A seqan3::type_list of the selectable formats (each must meet seqan3::sam_file_input_format). |
Alignment files are primarily used to store pairwise alignments of two biological sequences and often come with a lot of additional information. Well-known formats include the SAM/BAM format, used to store read mapping data, and the BLAST format, which stores the results of a query search against a database.
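The SAM file abstraction supports reading 15 different fields: seqan3::field::seq, seqan3::field::id, seqan3::field::offset, seqan3::field::ref_seq, seqan3::field::ref_id, seqan3::field::ref_offset, seqan3::field::alignment, seqan3::field::cigar, seqan3::field::mapq, seqan3::field::qual, seqan3::field::flag, seqan3::field::mate, seqan3::field::tags, seqan3::field::evalue and seqan3::field::bit_score.
There exists one more field for SAM files, the seqan3::field::header_ptr, but this field is mostly used internally. Please see the seqan3::sam_file_input::header member function for details on how to access the seqan3::sam_file_header of the file.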
All of these fields are retrieved by default (and in that order). Note that some of the fields are specific to the SAM format (e.g. seqan3::field::flag) while others are specific to the BLAST format (e.g. seqan3::field::bit_score). Please see the corresponding formats for more details.
This class comes with four constructors: construction from a file name and construction from an existing stream with a known format, each with or without additional reference information.
Constructing from a file name automatically picks the format based on the extension of the file name. Constructing from a stream can be used if you have a non-file stream, like std::cin or std::istringstream, that you want to read from and/or if you cannot use file-extension based detection, but know that your input file has a certain format.
The reference information is specific to the SAM format. The SAM format only stores a "semi-alignment" meaning that it has the query sequence and the cigar string representing the gap information but not the reference information. If you want to retrieve valid/full alignments, you need to pass the corresponding reference information:
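As a library-independent illustration of why the reference is needed, the following minimal sketch (standard library only, not SeqAn3's implementation) expands a query and its CIGAR operations into two gapped rows. The names `cigar_op`, `toy_alignment` and `expand_alignment` are hypothetical, and only the operations M, =, X, I, D, S and H are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one CIGAR element (e.g. "3M").
struct cigar_op
{
    std::uint32_t count;
    char op;
};

// Hypothetical result type: both rows have equal length, '-' marks a gap.
struct toy_alignment
{
    std::string ref_row;
    std::string query_row;
};

// Expands query + CIGAR into gapped rows. The reference row can only be
// filled because the reference sequence and a 0-based offset are passed in.
inline toy_alignment expand_alignment(std::string const & reference,
                                      std::size_t ref_offset,
                                      std::string const & query,
                                      std::vector<cigar_op> const & cigar)
{
    toy_alignment result;
    std::size_t ref_pos = ref_offset;
    std::size_t query_pos = 0;

    for (auto const & [count, op] : cigar)
    {
        switch (op)
        {
        case 'M': case '=': case 'X': // consumes reference and query
            result.ref_row.append(reference, ref_pos, count);
            result.query_row.append(query, query_pos, count);
            ref_pos += count;
            query_pos += count;
            break;
        case 'I': // insertion: gap in the reference row
            result.ref_row.append(count, '-');
            result.query_row.append(query, query_pos, count);
            query_pos += count;
            break;
        case 'D': // deletion: gap in the query row
            result.ref_row.append(reference, ref_pos, count);
            result.query_row.append(count, '-');
            ref_pos += count;
            break;
        case 'S': // soft clip: query bases present but not aligned
            query_pos += count;
            break;
        case 'H': // hard clip: bases are not stored in the query at all
            break;
        default:
            throw std::invalid_argument{"CIGAR operation not handled in this sketch"};
        }
    }
    return result;
}
```

For the reference `ACGTACGT`, a 0-based offset of 1, the query `CGAAC` and the CIGAR `2M1I1D2M`, this yields the rows `CG-TAC` and `CGA-AC`. Without the reference string, the first row could not be filled in.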
In most cases the template parameters are deduced automatically:
Reading from an std::istringstream:
Note that this is not the same as writing sam_file_input<> (with angle brackets). In the latter case the template parameters are explicitly set to their default values; in the former case automatic deduction happens, which chooses different parameters depending on the constructor arguments. For opening from file, sam_file_input<> would have also worked, but for opening from stream it would not have.
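The difference between deduction and explicitly empty angle brackets is a general C++17 mechanism. The following sketch uses a hypothetical `toy_file_input` with stand-in tags (`toy_format_sam`, `toy_format_bam`, `toy_type_list`); it mirrors the shape of the deduction guides listed above but is not the library's code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical stand-ins for format tags and a type list.
struct toy_format_sam {};
struct toy_format_bam {};

template <typename... formats_t>
struct toy_type_list {};

// Default template argument: every known format is a candidate.
template <typename valid_formats_t = toy_type_list<toy_format_sam, toy_format_bam>>
class toy_file_input
{
public:
    // File name constructor: the format would be chosen from the extension.
    explicit toy_file_input(std::string /*filename*/) {}

    // Stream constructor: the caller names the format explicitly.
    template <typename stream_t, typename format_t>
    toy_file_input(stream_t & /*stream*/, format_t const & /*format_tag*/) {}
};

// Deduction guide: a stream plus a format tag narrows the list to that format.
template <typename stream_t, typename format_t>
toy_file_input(stream_t &, format_t const &) -> toy_file_input<toy_type_list<format_t>>;

// No angle brackets: the guide applies and the format list is narrowed.
inline bool stream_deduction_narrows_formats()
{
    std::istringstream stream{"@HD\tVN:1.6\n"};
    toy_file_input deduced{stream, toy_format_sam{}};
    return std::is_same_v<decltype(deduced), toy_file_input<toy_type_list<toy_format_sam>>>;
}

// Empty angle brackets: no deduction, the defaults are used as written.
inline bool empty_brackets_keep_defaults()
{
    toy_file_input<> explicit_defaults{std::string{"reads.sam"}};
    return std::is_same_v<decltype(explicit_defaults),
                          toy_file_input<toy_type_list<toy_format_sam, toy_format_bam>>>;
}
```

In this toy model the two spellings produce different types from the same class template, which is the effect described in the note above.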
You can define your own traits type to further customise the types used by and returned by this class; see seqan3::sam_file_input_default_traits for more details. As mentioned above, specifying at least one template parameter yourself means that you lose automatic deduction. The following is equivalent to the automatic type deduction example with a stream from above:
You can iterate over this file record-wise:
In the above example, rec has the type record_type, which is a specialisation of seqan3::record and behaves like an std::tuple (that's why we can access it via get). Instead of using the seqan3::field based interface on the record, you could also use std::get<0> or even std::get<dna4_vector> to retrieve the sequence, but this is not recommended because it is more error-prone.
Note: It is important to write auto & and not just auto, otherwise you will copy the record on every iteration. Since the buffer gets "refilled" on every iteration, you can also move the data out of the record if you want to store it somewhere without copying:
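The buffering idea can be sketched without the library. The `toy_reader` below is a hypothetical single-record reader over whitespace-separated `id sequence` pairs, not SeqAn3's implementation; it shows why binding the record by reference avoids a copy and why moving a field out is safe when the buffer is overwritten on the next read.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical two-field record.
struct toy_seq_record
{
    std::string id;
    std::string seq;
};

// Hypothetical single-pass reader that owns exactly one record buffer.
class toy_reader
{
public:
    explicit toy_reader(std::istream & in) : in_{&in} { next(); }

    bool at_end() const noexcept { return at_end_; }

    // Returns a reference to the buffer, so the caller does not copy.
    toy_seq_record & front() noexcept { return buffer_; }

    // Overwrites both fields of the buffer with the next record.
    void next()
    {
        if (!(*in_ >> buffer_.id >> buffer_.seq))
            at_end_ = true;
    }

private:
    std::istream * in_;
    toy_seq_record buffer_{};
    bool at_end_{false};
};

// Collects all sequences by moving them out of the buffer.
inline std::vector<std::string> collect_sequences(std::istream & in)
{
    std::vector<std::string> sequences;
    for (toy_reader reader{in}; !reader.at_end(); reader.next())
    {
        auto & rec = reader.front();             // reference: no copy of the record
        sequences.push_back(std::move(rec.seq)); // moved-from field is refilled by next()
    }
    return sequences;
}
```

Writing `auto rec = reader.front();` instead would copy both strings on every iteration, and moving from that copy would still have paid for the copy first.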
If you want to skip specific fields from the record you can pass a non-empty fields trait object to the seqan3::sam_file_input constructor to select the fields that should be read from the input. For example, you may only be interested in the mapping flag and mapping quality of your SAM data to get some statistics. The following snippets demonstrate the usage of such a fields trait object.
When reading a file, all fields not present in the file (but requested implicitly or via the selected_field_ids parameter) are ignored and the respective value in the record stays empty.
Instead of using get on the record, you can also use structured bindings to decompose the record into its elements. Considering the example of reading only the flag and mapping quality like before, you can also write:
In this case you immediately get the two elements of the tuple: flag of flag_type and mapq of mapq_type. But beware: with structured bindings you do need to get the order of elements right!
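The ordering caveat applies to any tuple-like type. As a library-independent sketch, the hypothetical `flag_mapq_record` below is a plain std::tuple standing in for a record with only the flag and mapping quality selected; `count_confident_mapped` is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for flag_type and mapq_type.
using toy_flag_type = std::uint16_t;
using toy_mapq_type = std::uint8_t;

// Tuple-like record: flag first, mapping quality second.
using flag_mapq_record = std::tuple<toy_flag_type, toy_mapq_type>;

// Counts records that are mapped and reach a minimum mapping quality.
inline std::size_t count_confident_mapped(std::vector<flag_mapq_record> const & records,
                                          toy_mapq_type min_mapq)
{
    constexpr toy_flag_type unmapped_bit = 0x4; // SAM flag bit for "segment unmapped"
    std::size_t count = 0;

    // The names are bound by position, so they must follow the tuple order.
    for (auto const & [flag, mapq] : records)
    {
        if ((flag & unmapped_bit) == 0 && mapq >= min_mapq)
            ++count;
    }
    return count;
}
```

Swapping the names to `[mapq, flag]` would still compile here, because both elements are integers, but the function would then test the wrong values. That is the kind of silent error the warning above refers to.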
Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records based on certain criteria, e.g. minimum length of the sequence field:
You can check whether a file is at its end by comparing begin() and end() (if they are the same, the file is at its end).
We currently support reading the following formats:
| no-api |
The subset of seqan3::field tags valid for this file; order corresponds to the types in field_types.
The SAM file abstraction supports reading 15 different fields, given by the tags in field_ids up to and including seqan3::field::bit_score.
There exists one more field for SAM files, the seqan3::field::header_ptr, but this field is mostly used internally. Please see the seqan3::sam_file_input::header member function for details on how to access the seqan3::sam_file_header of the file.
| no-api |
The type of field::ref_id is fixed to std::optional<int32_t>.
To be consistent with the BAM format, the field::ref_id will hold the index to the actual reference information stored in the header. If a read is unmapped, the optional will remain valueless.
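A minimal sketch of how such an optional index might be resolved to a name, using only the standard library. The function `reference_name` and its `header_ref_ids` parameter are hypothetical and merely stand in for the reference ids stored in the header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Resolves an optional reference index against a list of reference names.
inline std::string reference_name(std::optional<std::int32_t> ref_id,
                                  std::vector<std::string> const & header_ref_ids)
{
    if (!ref_id.has_value())
        return "*"; // SAM writes '*' when no reference name is available

    // Precondition of this sketch: the index refers to an existing entry.
    assert(*ref_id >= 0 && static_cast<std::size_t>(*ref_id) < header_ref_ids.size());
    return header_ref_ids[static_cast<std::size_t>(*ref_id)];
}
```

Checking `has_value()` first matters: dereferencing a valueless std::optional is undefined behaviour.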
| no-api |
The type of field::ref_offset is fixed to an std::optional<int32_t>.
The SAM format is 1-based and a 0 in the ref_offset field indicates an unmapped read. Since we convert 1-based positions to 0-based positions when reading the SAM format, we model the ref_offset_type as an std::optional. If the input value is 0, the std::optional will remain valueless.
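The conversion described above can be written down in a few lines. This is a sketch of the rule only, with a hypothetical function name `to_ref_offset`; it is not taken from the library's parser.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Converts a 1-based SAM position into a 0-based optional offset.
// A position of 0 means "unmapped" and yields an empty optional.
inline std::optional<std::int32_t> to_ref_offset(std::int32_t sam_pos)
{
    assert(sam_pos >= 0); // negative positions are not handled in this sketch

    if (sam_pos == 0)
        return std::nullopt;

    return sam_pos - 1;
}
```

Using std::optional avoids a sentinel such as -1, which would otherwise result from blindly subtracting 1 from an unmapped position.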
| no-api |
The type of field::ref_seq (default depends on construction).
If no reference information is given on construction, this type deduces to a sized view that throws on access (since there is nothing to access anyway). If reference information is given, the type is deduced to a view over the given input reference sequence type, such that no sequence information is copied.
| no-api inline |
Construct from filename.
[in] | filename | Path to the file you wish to open. |
[in] | fields_tag | A seqan3::fields tag. [optional] |
seqan3::file_open_error | If the file could not be opened, e.g. non-existent, non-readable, unknown format. |
In addition to the file name, you may specify a custom seqan3::fields object (e.g. seqan3::fields<seqan3::field::seq>{}), which may be easier than defining all the template parameters.
This constructor transparently applies a decompression stream on top of the file stream in case the file is detected as being compressed. See the section on compression and decompression for more information.
| no-api inline |
Construct from an existing stream and with specified format.
stream_t | The stream type; must model seqan3::input_stream. |
file_format | The format of the file in the stream, must model seqan3::sam_file_input_format. |
[in] | stream | The stream to operate on; must be derived from std::basic_istream. |
[in] | format_tag | The file format tag. |
[in] | fields_tag | A seqan3::fields tag. [optional] |
In addition to the stream and the format, you may specify a custom seqan3::fields object (e.g. seqan3::fields<seqan3::field::seq>{}), which may be easier than defining all the template parameters.
This constructor transparently applies a decompression stream on top of the stream in case it is detected as being compressed. See the section on compression and decompression for more information.
| no-api inline |
Construct from filename and given additional reference information.
[in] | filename | Path to the file you wish to open. |
[in] | ref_ids | A range containing the reference ids that correspond to the SAM/BAM file. |
[in] | ref_sequences | A range containing the reference sequences that correspond to the SAM/BAM file. |
[in] | fields_tag | A seqan3::fields tag. [optional] |
seqan3::file_open_error | If the file could not be opened, e.g. non-existent, non-readable, unknown format. |
The reference information given by the ids (names) and sequences will be used to construct a proper alignment when reading in SAM or BAM files. If you are not interested in the full alignment, call the constructor without these parameters.
In addition to the file name and reference information, you may specify a custom seqan3::fields object (e.g. seqan3::fields<seqan3::field::seq>{}), which may be easier than defining all the template parameters.
This constructor transparently applies a decompression stream on top of the file stream in case the file is detected as being compressed. See the section on compression and decompression for more information.
| no-api inline |
Construct from an existing stream and with specified format.
stream_t | The stream type; must model seqan3::input_stream. |
file_format | The format of the file in the stream; must model seqan3::sam_file_input_format. |
[in] | stream | The stream to operate on; must be derived from std::basic_istream. |
[in] | ref_ids | A range containing the reference ids that correspond to the SAM/BAM file. |
[in] | ref_sequences | A range containing the reference sequences that correspond to the SAM/BAM file. |
[in] | format_tag | The file format tag. |
[in] | fields_tag | A seqan3::fields tag. [optional] |
The reference information given by the ids (names) and sequences will be used to construct a proper alignment when reading in SAM or BAM files. If you are not interested in the full alignment, you do not need to specify that information.
In addition to the stream, reference information and format, you may specify a custom seqan3::fields object (e.g. seqan3::fields<seqan3::field::seq>{}), which may be easier than defining all the template parameters.
This constructor transparently applies a decompression stream on top of the stream in case it is detected as being compressed. See the section on compression and decompression for more information.
| no-api inline |
Returns an iterator to current position in the file.
The returned iterator equals end() if the file is at end.
Constant.
Throws seqan3::format_error if the first record could not be read into the buffer.
| no-api inline noexcept |
Returns a sentinel for comparison with iterator.
This element acts as a placeholder; attempting to dereference it results in undefined behaviour.
Constant.
No-throw guarantee.
| no-api inline noexcept |
Return the record we are currently at in the file.
This function returns a reference to the currently buffered record. It is identical to dereferencing begin(), and begin() always points to the current record on single-pass input ranges:
In most situations, using the iterator interface or a range-based for-loop is preferable to using front(), because you can only move to the next record via the iterator.
In any case, don't forget the reference! If you want to save the data from the record elsewhere, use move:
Constant.
No-throw guarantee.
| no-api inline |
Access the file's header.
You can access the header directly after the construction with reference information of the file object.