SeqAn3 3.3.0
The Modern C++ library for sequence analysis.
|
The CIGAR Conversion submodule contains utility functions to convert a CIGAR to an alignment or vice versa. More...
Classes | |
struct | seqan3::cigar_clipped_bases |
Helper struct to specialise soft and hard clipping when using seqan3::cigar_from_alignment. More... | |
Functions | |
template<typename reference_type , typename sequence_type > | |
auto | seqan3::alignment_from_cigar (std::string const &cigar_string, reference_type const &reference, uint32_t const zero_based_reference_start_position, sequence_type const &query) |
template<typename reference_type , typename sequence_type > | |
auto | seqan3::alignment_from_cigar (std::vector< cigar > const &cigar_vector, reference_type const &reference, uint32_t const zero_based_reference_start_position, sequence_type const &query) |
Construct an alignment from a CIGAR string and the corresponding sequences. | |
template<typename alignment_type > | |
auto | seqan3::cigar_from_alignment (alignment_type const &alignment, cigar_clipped_bases const &clipped_bases={}, bool const extended_cigar=false) |
Creates a CIGAR string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequence s. | |
The CIGAR Conversion submodule contains utility functions to convert a CIGAR to an alignment or vice versa.
The CIGAR string is a compact representation of an aligned read against a reference and was introduced by the SAM format. The SAM format stores the result of mapping short/long read sequences from a sequencing experiment (e.g., Illumina/Nanopore) against a reference (e.g., hg38).
To create an alignment from a CIGAR String, you can do the following:
To create a CIGAR String from an alignment, you can do the following:
|
no-apiinline |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
|
no-apiinline |
Construct an alignment from a CIGAR string and the corresponding sequences.
reference_type | The type of the reference sequence for a SAM record. |
sequence_type | The type of the read sequence for a SAM record. |
[in] | cigar_vector | The CIGAR information to convert to an alignment. |
[in] | reference | The reference sequence to which the query was aligned to, the alignment being represented by cigar_vector . |
[in] | zero_based_reference_start_position | The start position of the alignment in the reference sequence. The position is zero-based. When using our seqan3::sam_file_input, note that the seqan3::sam_file_input::record_type::reference_position() is always zero-based. |
[in] | query | The query or read sequence of the alignment represented by cigar_vector . |
std::tuple
of size 2 holding 2 seqan3::gap_decorator
s. At position 0 is the aligned reference sequence and at position 1 the aligned read sequence.The CIGAR string is a compact representation of an aligned read against a reference and was introduced by the SAM format. The SAM format stores the result of mapping short/long read sequences from a sequencing experiment (e.g., Illumina/Nanopore) against a reference (e.g., hg38).
You can reconstruct a full alignment from a CIGAR string, if you have the respective sequences at hand:
In seqan3, an alignment is represented by a std::tuple
of size 2 that holds 2 seqan3::aligned_sequence
s.
The data structure that we use most often to model seqan3::aligned_sequence
is the seqan3::gap_decorator
. It is a lightweight data structure that only holds a view on the sequence (no copy is made) and on top can hold seqan3::gap
s.
In the above example, the read sequence ACGT
is aligned to the reference with one gap: AC-GA
where -
represents a gap. In the CIGAR string, the gap in the query/read is represented by 1D
.
The full alignment consist of two aligned sequences (read and reference). In the above example, the alignment
is represented by a tuple of the aligned reference at the first position (std::get<0>) and the aligned read at the second position (std::get<1>): (ACTGA,AC-GA)
.
A more realistic example is extracting the information directly from a SAM file:
|
no-apiinline |
Creates a CIGAR string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequence
s.
alignment_type | Must model seqan3::detail::pairwise_alignment. |
alignment | The alignment, represented by a pair of aligned sequences, to be transformed into CIGAR vector based on the second (query) sequence. |
clipped_bases | Provides information on whether the query sequence was cropped (hard clipping) before the alignment or whether part of the query sequence does not take part (soft clipping) in the alignment. |
extended_cigar | Whether to print the extended CIGAR alphabet or not. See seqan3::cigar::operation . |
cigar_vector
is based on the query sequence, which is the second sequence in the alignment
pair.Given the following alignment, reference sequence on top and the query sequence at the bottom:
In this case, the function seqan3::cigar_from_alignment
will return the following CIGAR vector:
The extended CIGAR string would look like this:
The terms soft and hard clipping were introduced by the SAM specifications. A SAM file is only storing a semi alignment represented by the CIGAR string. The semi alignment of a query sequence is most often the result of a read mapping.
Before aligning a query or read to the reference, some tools crop the query sequence because the quality is bad at one end (e.g., Illumina reads tend to display a bad sequence quality towards the end of the read).
To inform the user of a SAM file that query sequences were altered, hard-clipping information is appended to the CIGAR string. E.g. 100M50H
indicates that a read of former length 150
has been cropped at the right end by 50
bases. The sequence in the SAM file will thus only be 100 bases long.
In contrast to hard clipping, soft clipping indicates that the read was cropped and the respective bases do not participate in the alignment, but they are still part of the reported sequence. E.g., 100M50S
indicates that a read of length 150
has been aligned without the rightmost 50
bases. The sequence in the SAM file will still be 150 bases long.
seqan3::cigar_from_alignment
You can add the respective clipping information by passing an instance of seqan3::cigar_clipped_bases: