Provides files and formats for handling read mapping data.
More...
|
enum class | seqan3::sam_flag : uint16_t {
seqan3::sam_flag::none = 0
, seqan3::sam_flag::paired = 0x1
, seqan3::sam_flag::proper_pair = 0x2
, seqan3::sam_flag::unmapped = 0x4
,
seqan3::sam_flag::mate_unmapped = 0x8
, seqan3::sam_flag::on_reverse_strand = 0x10
, seqan3::sam_flag::mate_on_reverse_strand = 0x20
, seqan3::sam_flag::first_in_pair = 0x40
,
seqan3::sam_flag::second_in_pair = 0x80
, seqan3::sam_flag::secondary_alignment = 0x100
, seqan3::sam_flag::failed_filter = 0x200
, seqan3::sam_flag::duplicate = 0x400
,
seqan3::sam_flag::supplementary_alignment = 0x800
} |
| An enum flag that describes the properties of an aligned read (given as a SAM record). More...
|
|
Provides files and formats for handling read mapping data.
Introduction
SAM/BAM files are primarily used to store pairwise alignments of read mapping data.
- Note
- For a step-by-step guide take a look at our tutorial: SAM Input and Output in SeqAn.
The SAM file abstraction supports reading 10 different fields:
- seqan3::field::seq
- seqan3::field::id
- seqan3::field::ref_id
- seqan3::field::ref_offset
- seqan3::field::cigar
- seqan3::field::mapq
- seqan3::field::qual
- seqan3::field::flag
- seqan3::field::mate
- seqan3::field::tags
There exists one more field for SAM files, the seqan3::field::header_ptr, but this field is mostly used internally. Please see the seqan3::sam_file_output::header member function for details on how to access the seqan3::sam_file_header of the file.
All of these fields are retrieved by default (and in that order).
Reading SAM files
Construction and specialisation
The
seqan3::sam_file_input class comes with four constructors: One for construction from a file name, one for construction from an existing stream and a known format and both of the former with or without additional reference information. Constructing from a file name automatically picks the format based on the extension of the file name. Constructing from a stream can be used if you have a non-file stream, like
std::cin or
std::istringstream. It also comes in handy, if you cannot use file-extension based detection, but know that your input file has a certain format.
Passing reference information, e.g.
- ref_ids: The name of the references, e.g. "chr1", "chr2", ...
- ref_sequences: The reference sequence information in the same order as the ref_ids.
comes in handy once you want to convert the CIGAR string, read from your file, into an actual alignment. This will be covered in the section "Transforming the CIGAR information into an actual alignment".In most cases the template parameters are deduced automatically:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
tmp_stream << sam_file_raw;
}
T temp_directory_path(T... args)
Reading from a
std::istringstream:
auto input = R"(@HD VN:1.6 SO:coordinate
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *)";
int main()
{
}
Note that this is not the same as writing
sam_file_input<>
(with angle brackets). In the latter case they are explicitly set to their default values, in the former case
automatic deduction happens which chooses different parameters depending on the constructor arguments. For opening from file,
sam_file_input<>
would have also worked, but for opening from stream it would not have.
You can define your own traits type to further customise the types used by and returned by this class, see
seqan3::sam_file_input_default_traits for more details. As mentioned above, specifying at least one template parameter yourself means that you loose automatic deduction. The following is equivalent to the automatic type deduction example with a stream from above:
auto input = R"(@HD VN:1.6 SO:coordinate
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *)";
int main()
{
default_fields,
}
@ flag
The alignment flag (bit information), uint16_t value.
@ ref_offset
Sequence (seqan3::field::ref_seq) relative start position (0-based), unsigned value.
@ cigar
The cigar vector (std::vector<seqan3::cigar>) representing the alignment in SAM/BAM format.
@ mapq
The mapping quality of the seqan3::field::seq alignment, usually a Phred-scaled score.
@ mate
The mate pair information given as a std::tuple of reference name, offset and template length.
@ header_ptr
A pointer to the seqan3::sam_file_header object storing header information.
@ ref_id
The identifier of the (reference) sequence that seqan3::field::seq was aligned to.
@ id
The identifier, usually a string.
@ tags
The optional tags in the SAM format, stored in a dictionary.
@ seq
The "sequence", usually a range of nucleotides or amino acids.
@ qual
The qualities, usually in Phred score notation.
A class template that holds a choice of seqan3::field.
Definition record.hpp:125
Type that contains multiple types.
Definition type_list.hpp:26
Provides seqan3::type_list.
Reading record-wise
You can iterate over this file record-wise:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
for (auto & rec : fin)
{
}
}
Provides seqan3::debug_stream and related types.
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition debug_stream.hpp:37
In the above example,
rec
has the type
seqan3::sam_file_input::record_type which is a specialisation of
seqan3::record and behaves like a
std::tuple (that's why we can access it via
get
). Instead of using the
seqan3::field based interface on the record, you could also use
std::get<0>
or even
std::get<dna4_vector>
to retrieve the sequence, but it is not recommended, because it is more error-prone.
- Note
- It is important to write
auto &
and not just auto
, otherwise you will copy the record on every iteration. Since the buffer gets "refilled" on every iteration, you can also move the data out of the record if you want to store it somewhere without copying:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
using record_type = typename decltype(fin)::record_type;
for (auto & rec : fin)
}
SeqAn specific customisations in the standard namespace.
Reading record-wise (custom fields)
If you want to skip specific fields from the record you can pass a non-empty fields trait object to the
seqan3::sam_file_input constructor to select the fields that should be read from the input. For example, you may only be interested in the mapping flag and mapping quality of your SAM data to get some statistics. The following snippets demonstrate the usage of such a fields trait object.
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
for (auto & rec : fin)
{
}
}
When reading a file, all fields not present in the file (but requested implicitly or via the
selected_field_ids
parameter) are ignored and the respective value in the record stays empty.
Reading record-wise (decomposed records)
Instead of using
get
on the record, you can also use
structured bindings to decompose the record into its elements. Considering the example of reading only the flag and mapping quality like before you can also write:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
for (auto & [flag, mapq] : fin)
{
}
}
In this case you immediately get the two elements of the tuple:
flag
of
seqan3::sam_file_input::flag_type and
mapq
of
seqan3::sam_file_input::mapq_type.
- Note
- But beware: with structured bindings you do need to get the order of elements correctly!
Transforming the CIGAR information into an actual alignment
In SeqAn, we represent an alignment as a tuple of two
seqan3::aligned_sequence
s.The conversion from a CIGAR string to an alignment can be done with the function
seqan3::alignment_from_cigar
. You need to pass the reference sequence with the position the read was aligned to and the read sequence. All of it is already in the
record
when reading a SAM file:
auto sam_file_raw = R"(@HD VN:1.6
@SQ SN:ref LN:34
read1 41 ref 1 61 1S1M1D1M1I ref 10 300 ACGT !##$ AS:i:2 NM:i:7
read2 42 ref 2 62 1H7M1D1M1S2H ref 10 300 AGGCTGNAG !##$&'()* xy:B:S,3,4,5
read3 43 ref 3 63 1S1M1P1M1I1M1I1D1M1S ref 10 300 GGAGTATA !!*+,-./
)";
int main()
{
seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;
for (auto && rec : fin)
{
alignment_from_cigar(rec.cigar_sequence(), reference, rec.reference_position().value(), rec.sequence());
}
}
Provides the function seqan3::alignment_from_cigar.
auto alignment_from_cigar(std::vector< cigar > const &cigar_vector, reference_type const &reference, uint32_t const zero_based_reference_start_position, sequence_type const &query)
Construct an alignment from a CIGAR string and the corresponding sequences.
Definition alignment_from_cigar.hpp:81
@ alignment
The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment.
Meta-header for the IO / SAM File submodule .
The SeqAn namespace for literals.
The code will print the following:
(ACT-,C-GT)
(CTGATCGAG,AGGCTGN-A)
(T-G-A-TC,G-AGTA-T)
Views on files
Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records based on certain criteria, e.g. minimum length of the sequence field:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
auto minimum_length10_filter = std::views::filter(
[](auto const & rec)
{
return std::ranges::size(rec.sequence()) >= 10;
});
for (auto & rec : fin | minimum_length10_filter)
}
The main SeqAn3 namespace.
Definition aligned_sequence_concept.hpp:26
End of file
You can check whether a file is at its end by comparing
begin()
and
end()
(if they are the same, the file is at its end).
Formats
We currently support reading the following formats:
Writing SAM files
Construction and specialisation
The
seqan3::sam_file_output class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like
std::cout or
std::ostringstream, that you want to read from and/or if you cannot use file-extension based detection, but know that your output file has a certain format.
In most cases the template parameters are deduced completely automatically:
int main()
{
}
A class for writing SAM files, both SAM and its binary representation BAM are supported.
Definition io/sam_file/output.hpp:71
Provides seqan3::sam_file_output and corresponding traits classes.
Writing to
std::cout:
Note that this is not the same as writing
sam_file_output<>
(with angle brackets). In the latter case they are explicitly set to their default values, in the former case
automatic deduction happens which chooses different parameters depending on the constructor arguments. For opening from file,
sam_file_output<>
would have also worked, but for opening from stream it would not have.
Writing record-wise
int main()
{
using alignment_type =
alignment_type dummy_alignment{};
record_type rec{read,
ref_id, dummy_alignment};
fout.push_back(rec);
fout.push_back(record_type{read,
ref_id, dummy_alignment});
}
Provides seqan3::dna5, container aliases and string literals.
The class template that file records are based on; behaves like a std::tuple.
Definition record.hpp:190
The easiest way to write to a SAM/BAM file is to use the
push_back()
member functions. These work similarly to how they work on a
std::vector. You may also use a tuple like interface or the
emplace_back()
function but this is not recommended since one would have to keep track of the correct order of many fields (14 in total). For the record based interface using
push_back()
please also see the
seqan3::record documentation on how to specify a record with the correct field and type lists.
You may also use the output file's iterator for writing, however, this rarely provides an advantage.
Writing record-wise (custom fields)
If you want to omit non-required parameter or change the order of the parameters, you can pass a non-empty fields trait object to the
seqan3::sam_file_output constructor to select the fields that are used for interpreting the arguments.
The following snippet demonstrates the usage of such a
field_traits
object.
int main()
{
unsigned mapping_pos{1300};
fout.emplace_back(mapping_pos, flag);
fout.push_back(
std::tie(mapping_pos, flag));
}
sam_flag
An enum flag that describes the properties of an aligned read (given as a SAM record).
Definition sam_flag.hpp:73
@ none
None of the flags below are set.
A different way of passing custom fields to the file is to pass a
seqan3::record – instead of a tuple – to
push_back()
. The
seqan3::record clearly indicates which of its elements has which
seqan3::field so
the file will use that information instead of the template argument. This is especially handy when reading from one file and writing to another, because you don't have to configure the output file to match the input file, it will just work:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
for (auto & r : fin)
fout.push_back(r);
}
This will copy the
seqan3::field::flag and
seqan3::field::ref_offset value into the new output file.
- Note
- Note that the other SAM columns in the output file will have a default value, so unless you specify to read all SAM columns (see seqan3::format_sam) the output file will not be equal to the input file.
Writing record-wise in batches
You can write multiple records at once, by assigning to the file:
int main()
{
{"NATA"_dna5, "2nd"},
{"GATA"_dna5, "Third"}};
fout = range;
range | fout;
}
File I/O pipelines
Record-wise writing in batches also works for writing from input files directly to output files, because input files are also input ranges in SeqAn:
auto sam_file_raw = R"(First 0 * 0 0 * * 0 0 ACGT *
2nd 0 * 0 0 * * 0 0 NATA *
Third 0 * 0 0 * * 0 0 GATA *
)";
int main()
{
fout = fin;
}
This can be combined with file-based views to create I/O pipelines:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 * = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 29 30 * * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r003 2064 ref 29 17 * * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 * = 7 -39 CAGCGGCAT * NM:i:1
)";
int main()
{
input_file | std::views::take(3)
}
Formats
We currently support writing the following formats:
◆ sam_flag
An enum flag that describes the properties of an aligned read (given as a SAM record).
- See also
- seqan3::enum_bitwise_operators enables combining enum values.
The SAM flag are bitwise flags, which means that each value corresponds to a specific bit that is set and that they can be combined and tested using binary operations. See this tutorial for an introduction on bitwise operations on enum flags.
Example:
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG !!!!!!!!!!!!!!!!!
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA !!!!!!!!!!! SA:Z:ref,29,-,6H5M,17,0;
r003 4 * 29 17 * * 0 0 TAGGC @@@@@ SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT !!!!!!!!! NM:i:1
)";
int main()
{
for (auto & rec : fin)
{
std::cout <<
"Read " << rec.id() <<
" is unmapped\n";
{
}
}
}
Quality type for traditional Sanger and modern Illumina Phred scores.
Definition phred42.hpp:44
constexpr auto assign_char_to
Assign a character to an alphabet object.
Definition alphabet/concept.hpp:517
@ duplicate
The read is marked as a PCR duplicate or optical duplicate.
@ failed_filter
The read alignment failed a filter, e.g. quality controls.
@ unmapped
The read is not mapped to a reference (unaligned).
Adapted from the SAM specifications are the following additional information to some flag values:
- For each read/contig in a SAM file, it is required that one and only one line associated with the read has neither the seqan3::sam_flag::secondary_alignment nor the seqan3::sam_flag::supplementary_alignment flag value set (satisfies
FLAG & 0x900 == 0
). This line is called the primary alignment of the read.
- seqan3::sam_flag::secondary_alignment (bit
0x100
) marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.
- seqan3::sam_flag::supplementary_alignment (bit
0x800
) indicates that the corresponding alignment line is part of a chimeric alignment. If the SAM/BAM file corresponds to long reads (nanopore/pacbio) this happens when reads are split before being aligned and the best matching part is marked as primary, while all other aligned parts are marked supplementary.
- seqan3::sam_flag::unmapped (bit
0x4
) is the only reliable place to tell whether the read is unmapped. If seqan3::sam_flag::unmapped is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and seqan3::sam_flag::proper_pair, seqan3::sam_flag::secondary_alignment, and seqan3::sam_flag::supplementary_alignment (bits 0x2
, 0x100
, and 0x800
).
- seqan3::sam_flag::on_reverse_strand (bit
0x10
) indicates whether the read sequence has been reverse complemented and the quality string is reversed. When bit seqan3::sam_flag::unmapped (0x4
) is unset, this corresponds to the strand to which the segment has been mapped: seqan3::sam_flag::on_reverse_strand (bit 0x10
) unset indicates the forward strand, while set indicates the reverse strand. When seqan3::sam_flag::unmapped (0x4
) is set, this indicates whether the unmapped read is stored in its original orientation as it came off the sequencing machine.
- seqan3::sam_flag::first_in_pair and seqan3::sam_flag::second_in_pair (bits
0x40
and 0x80
) reflect the read ordering within each template inherent in the sequencing technology used. If seqan3::sam_flag::first_in_pair and seqan3::sam_flag::second_in_pair (0x40
and 0x80
) are both set, the read is part of a linear template, but it is neither the first nor the last read. If both are unset, the index of the read in the template is unknown. This may happen for a non-linear template or when this information is lost during data processing.
- If seqan3::sam_flag::paired (bit
0x1
) is unset, no assumptions can be made about seqan3::sam_flag::proper_pair, seqan3::sam_flag::mate_unmapped, seqan3::sam_flag::mate_on_reverse_strand, seqan3::sam_flag::first_in_pair and seqan3::sam_flag::second_in_pair (bits 0x2
, 0x8
, 0x20
, 0x40
and 0x80
).
- See also
- https://broadinstitute.github.io/picard/explain-flags.html
Enumerator |
---|
none | None of the flags below are set.
|
paired | The aligned read is paired (paired-end sequencing).
|
proper_pair | The two aligned reads in a pair have a proper distance between each other.
|
unmapped | The read is not mapped to a reference (unaligned).
|
mate_unmapped | The mate of this read is not mapped to a reference (unaligned).
|
on_reverse_strand | The read sequence has been reverse complemented before being mapped (aligned).
|
mate_on_reverse_strand | The mate sequence has been reverse complemented before being mapped (aligned).
|
first_in_pair | Indicates the ordering (see details in the seqan3::sam_flag description).
|
second_in_pair | Indicates the ordering (see details in the seqan3::sam_flag description).
|
secondary_alignment | This read alignment is an alternative (possibly suboptimal) to the primary.
|
failed_filter | The read alignment failed a filter, e.g. quality controls.
|
duplicate | The read is marked as a PCR duplicate or optical duplicate.
|
supplementary_alignment | This sequence is part of a split alignment and is not the primary alignment.
|
◆ operator""_tag() [1/2]
constexpr uint16_t seqan3::literals::operator""_tag |
( |
| ) |
|
|
constexpr |
The SAM tag literal, such that tags can be used in constant expressions.
- Template Parameters
-
char_t | The char type. Usually char . Parameter pack ...s must be of length 2 since SAM tags consist of two letters (char0 and char1). |
- Returns
- The unique identifier of the SAM tag computed by char0 * 128 + char1.
A SAM tag consists of two letters, initialized via the string literal ""_tag, which delegate to its unique id.
uint16_t tag_id = "NM"_tag;
Provides the seqan3::sam_tag_dictionary class and auxiliaries.
The purpose of those tags is to fill or query the seqan3::sam_tag_dictionary for a specific key (tag_id) and retrieve the corresponding value.
◆ operator""_tag() [2/2]
constexpr uint16_t operator""_tag |
( |
| ) |
|
|
related |
The SAM tag literal, such that tags can be used in constant expressions.
- Template Parameters
-
char_t | The char type. Usually char . Parameter pack ...s must be of length 2 since SAM tags consist of two letters (char0 and char1). |
- Returns
- The unique identifier of the SAM tag computed by char0 * 128 + char1.
A SAM tag consists of two letters, initialized via the string literal ""_tag, which delegate to its unique id.
uint16_t tag_id = "NM"_tag;
The purpose of those tags is to fill or query the seqan3::sam_tag_dictionary for a specific key (tag_id) and retrieve the corresponding value.