SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
Loading...
Searching...
No Matches
CIGAR Conversion

The CIGAR Conversion submodule contains utility functions to convert a CIGAR to an alignment or vice versa. More...

+ Collaboration diagram for CIGAR Conversion:

Classes

struct  seqan3::cigar_clipped_bases
 Helper struct to specialise soft and hard clipping when using seqan3::cigar_from_alignment. More...
 

Functions

template<typename reference_type , typename sequence_type >
auto seqan3::alignment_from_cigar (std::string const &cigar_string, reference_type const &reference, uint32_t const zero_based_reference_start_position, sequence_type const &query)
 
template<typename reference_type , typename sequence_type >
auto seqan3::alignment_from_cigar (std::vector< cigar > const &cigar_vector, reference_type const &reference, uint32_t const zero_based_reference_start_position, sequence_type const &query)
 Construct an alignment from a CIGAR string and the corresponding sequences.
 
template<typename alignment_type >
auto seqan3::cigar_from_alignment (alignment_type const &alignment, cigar_clipped_bases const &clipped_bases={}, bool const extended_cigar=false)
 Creates a CIGAR string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequences.
 

Detailed Description

The CIGAR Conversion submodule contains utility functions to convert a CIGAR to an alignment or vice versa.

Quick background on the CIGAR string

The CIGAR string is a compact representation of an aligned read against a reference and was introduced by the SAM format. The SAM format stores the result of mapping short/long read sequences from a sequencing experiment (e.g., Illumina/Nanopore) against a reference (e.g., hg38).

Creating an alignment from a CIGAR String

To create an alignment from a CIGAR String, you can do the following:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
using namespace seqan3::literals;
auto sam_file_raw = R"(@HD VN:1.6
@SQ SN:ref LN:34
read1 41 ref 1 61 1S1M1D1M1I ref 10 300 ACGT !##$ AS:i:2 NM:i:7
read2 42 ref 2 62 1H7M1D1M1S2H ref 10 300 AGGCTGNAG !##$&'()* xy:B:S,3,4,5
read3 43 ref 3 63 1S1M1P1M1I1M1I1D1M1S ref 10 300 GGAGTATA !!*+,-./
)";
int main()
{
// The reference sequence might be read from a different file.
seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;
// You will probably read it from a file, e.g., like this:
// seqan3::sam_file_input fin{"test.sam"};
for (auto && rec : fin)
{
auto alignment =
alignment_from_cigar(rec.cigar_sequence(), reference, rec.reference_position().value(), rec.sequence());
}
// prints:
// (ACT-,C-GT)
// (CTGATCGAG,AGGCTGN-A)
// (T-G-A-TC,G-AGTA-T)
}
Provides the function seqan3::alignment_from_cigar.
The SAM format (tag).
Definition format_sam.hpp:105
A class for reading SAM files, both SAM and its binary representation BAM are supported.
Definition sam_file/input.hpp:236
Provides seqan3::debug_stream and related types.
auto alignment_from_cigar(std::vector< cigar > const &cigar_vector, reference_type const &reference, uint32_t const zero_based_reference_start_position, sequence_type const &query)
Construct an alignment from a CIGAR string and the corresponding sequences.
Definition alignment_from_cigar.hpp:81
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition debug_stream.hpp:37
@ alignment
The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment.
Meta-header for the IO / SAM File submodule .
The SeqAn namespace for literals.

Creating a CIGAR string from an alignment

To create a CIGAR String from an alignment, you can do the following:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
using namespace seqan3::literals;
seqan3::dna5_vector reference = "ATGGCGTAGAGCTTCCCCCCCCCCCCCCCCC"_dna5;
seqan3::dna5_vector read = "ATGCCCCGTTGCTT"_dna5; // length 14
// Let's say, we want to ignore the last 2 bases of the query because the quality is low.
// We thus only align the first 12 bases, the last two will be soft-clipped bases in the CIGAR string.
seqan3::gap_decorator aligned_reference{reference | seqan3::views::slice(0, 12)};
seqan3::gap_decorator aligned_query{read | seqan3::views::slice(0, 12)};
// insert gaps
seqan3::insert_gap(aligned_reference, aligned_reference.begin() + 4, 2);
seqan3::insert_gap(aligned_query, aligned_query.begin() + 11, 2);
auto cigar_sequence =
seqan3::cigar_from_alignment(std::tie(aligned_reference, aligned_query),
{.hard_front = 1, .hard_back = 0, .soft_front = 0, .soft_back = 2});
seqan3::debug_stream << cigar_sequence << '\n'; // prints [1H,4M,2I,5M,2D,1M,2S]
}
Includes the aligned_sequence and the related insert_gap and erase_gap functions to enable stl contai...
Provides the function seqan3::cigar_from_alignment and a helper struct seqan3::cigar_clipped_bases.
A gap decorator allows the annotation of sequences with gap symbols while leaving the underlying sequ...
Definition gap_decorator.hpp:78
Provides seqan3::dna5, container aliases and string literals.
Provides seqan3::gap_decorator.
Provides seqan3::gapped.
auto cigar_from_alignment(alignment_type const &alignment, cigar_clipped_bases const &clipped_bases={}, bool const extended_cigar=false)
Creates a CIGAR string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seq...
Definition cigar_from_alignment.hpp:111
constexpr auto slice
A view adaptor that returns a half-open interval on the underlying range.
Definition slice.hpp:175
Provides seqan3::views::slice.
T tie(T... args)
See also
seqan3::alignment_from_cigar
seqan3::cigar_from_alignment

Function Documentation

◆ alignment_from_cigar() [1/2]

template<typename reference_type , typename sequence_type >
auto seqan3::alignment_from_cigar ( std::string const &  cigar_string,
reference_type const &  reference,
uint32_t const  zero_based_reference_start_position,
sequence_type const &  query 
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

◆ alignment_from_cigar() [2/2]

template<typename reference_type , typename sequence_type >
auto seqan3::alignment_from_cigar ( std::vector< cigar > const &  cigar_vector,
reference_type const &  reference,
uint32_t const  zero_based_reference_start_position,
sequence_type const &  query 
)
inline

Construct an alignment from a CIGAR string and the corresponding sequences.

Template Parameters
reference_typeThe type of the reference sequence for a SAM record.
sequence_typeThe type of the read sequence for a SAM record.
Parameters
[in]cigar_vectorThe CIGAR information to convert to an alignment.
[in]referenceThe reference sequence to which the query was aligned to, the alignment being represented by cigar_vector.
[in]zero_based_reference_start_positionThe start position of the alignment in the reference sequence. The position is zero-based. When using our seqan3::sam_file_input, note that the seqan3::sam_file_input::record_type::reference_position() is always zero-based.
[in]queryThe query or read sequence of the alignment represented by cigar_vector.
Returns
An alignment represented by a std::tuple of size 2 holding 2 seqan3::gap_decorators. At position 0 is the aligned reference sequence and at position 1 the aligned read sequence.

Quick background on the CIGAR string

The CIGAR string is a compact representation of an aligned read against a reference and was introduced by the SAM format. The SAM format stores the result of mapping short/long read sequences from a sequencing experiment (e.g., Illumina/Nanopore) against a reference (e.g., hg38).

Conversion to an alignment

You can reconstruct a full alignment from a CIGAR string, if you have the respective sequences at hand:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
using namespace seqan3::literals;
int main()
{
// CIGAR string = 2M1D2M
std::vector<seqan3::cigar> cigar_vector{{2, 'M'_cigar_operation},
{1, 'D'_cigar_operation},
{2, 'M'_cigar_operation}};
uint32_t reference_start_position{0}; // The read is aligned at the start of the reference.
seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;
seqan3::dna5_vector query = "ACGA"_dna5;
auto alignment = alignment_from_cigar(cigar_vector, reference, reference_start_position, query);
seqan3::debug_stream << alignment << '\n'; // prints (ACTGA,AC-GA)
}
Provides the seqan3::cigar alphabet.

Quick explanation of the alignment representation

In seqan3, an alignment is represented by a std::tuple of size 2 that holds 2 seqan3::aligned_sequences.

The data structure that we use most often to model seqan3::aligned_sequence is the seqan3::gap_decorator. It is a lightweight data structure that only holds a view on the sequence (no copy is made) and on top can hold seqan3::gaps.

In the above example, the read sequence ACGT is aligned to the reference with one gap: AC-GA where - represents a gap. In the CIGAR string, the gap in the query/read is represented by 1D.

The full alignment consist of two aligned sequences (read and reference). In the above example, the alignment

position 01234
reference ACTGA
read AC-GA

is represented by a tuple of the aligned reference at the first position (std::get<0>) and the aligned read at the second position (std::get<1>): (ACTGA,AC-GA).

IO Example

A more realistic example is extracting the information directly from a SAM file:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
using namespace seqan3::literals;
auto sam_file_raw = R"(@HD VN:1.6
@SQ SN:ref LN:34
read1 41 ref 1 61 1S1M1D1M1I ref 10 300 ACGT !##$ AS:i:2 NM:i:7
read2 42 ref 2 62 1H7M1D1M1S2H ref 10 300 AGGCTGNAG !##$&'()* xy:B:S,3,4,5
read3 43 ref 3 63 1S1M1P1M1I1M1I1D1M1S ref 10 300 GGAGTATA !!*+,-./
)";
int main()
{
// The reference sequence might be read from a different file.
seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;
// You will probably read it from a file, e.g., like this:
// seqan3::sam_file_input fin{"test.sam"};
for (auto && rec : fin)
{
auto alignment =
alignment_from_cigar(rec.cigar_sequence(), reference, rec.reference_position().value(), rec.sequence());
}
// prints:
// (ACT-,C-GT)
// (CTGATCGAG,AGGCTGN-A)
// (T-G-A-TC,G-AGTA-T)
}
See also
seqan3::sam_file_input
seqan3::cigar

◆ cigar_from_alignment()

template<typename alignment_type >
auto seqan3::cigar_from_alignment ( alignment_type const &  alignment,
cigar_clipped_bases const &  clipped_bases = {},
bool const  extended_cigar = false 
)
inline

Creates a CIGAR string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequences.

Template Parameters
alignment_typeMust model seqan3::detail::pairwise_alignment.
Parameters
alignmentThe alignment, represented by a pair of aligned sequences, to be transformed into CIGAR vector based on the second (query) sequence.
clipped_basesProvides information on whether the query sequence was cropped (hard clipping) before the alignment or whether part of the query sequence does not take part (soft clipping) in the alignment.
extended_cigarWhether to print the extended CIGAR alphabet or not. See seqan3::cigar::operation.
Returns
A std::vector<seqan3::cigar> representing the alignment.
Note
The resulting cigar_vector is based on the query sequence, which is the second sequence in the alignment pair.

Example:

Given the following alignment, reference sequence on top and the query sequence at the bottom:

ATGG--CGTAGAGCTT
|||X |||X| |||
ATGCCCCGTTG--CTT

In this case, the function seqan3::cigar_from_alignment will return the following CIGAR vector:

[('M',4),('I',2),('M',5),('D',2),('M',3)]

The extended CIGAR string would look like this:

[('=',3)('X',1)('I',2)('=',3)('X',1)('=',1)('D',2)('=',3)]
// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
using namespace seqan3::literals;
seqan3::dna5_vector reference = "ATGGCGTAGAGCTTCCCCCCCCCCCCCCCCC"_dna5;
seqan3::dna5_vector read = "ATGCCCCGTTGCTT"_dna5; // length 14
// Align the full query against the first 14 bases of the reference.
seqan3::gap_decorator aligned_reference{reference | seqan3::views::slice(0, 14)};
seqan3::gap_decorator aligned_read{read};
// Insert gaps to represent the alignment:
seqan3::insert_gap(aligned_read, aligned_read.begin() + 11, 2);
seqan3::insert_gap(aligned_reference, aligned_reference.begin() + 4, 2);
seqan3::debug_stream << aligned_reference << '\n' << aligned_read << '\n';
// prints:
// ATGG--CGTAGAGCTT
// ATGCCCCGTTG--CTT
auto cigar_sequence = seqan3::cigar_from_alignment(std::tie(aligned_reference, aligned_read));
seqan3::debug_stream << cigar_sequence << '\n'; // prints [4M,2I,5M,2D,3M]
}

Soft and Hard clipping

The terms soft and hard clipping were introduced by the SAM specifications. A SAM file is only storing a semi alignment represented by the CIGAR string. The semi alignment of a query sequence is most often the result of a read mapping.

Hard clipping

Before aligning a query or read to the reference, some tools crop the query sequence because the quality is bad at one end (e.g., Illumina reads tend to display a bad sequence quality towards the end of the read).

To inform the user of a SAM file that query sequences were altered, hard-clipping information is appended to the CIGAR string. E.g. 100M50H indicates that a read of former length 150 has been cropped at the right end by 50 bases. The sequence in the SAM file will thus only be 100 bases long.

Soft clipping

In contrast to hard clipping, soft clipping indicates that the read was cropped and the respective bases do not participate in the alignment, but they are still part of the reported sequence. E.g., 100M50S indicates that a read of length 150 has been aligned without the rightmost 50 bases. The sequence in the SAM file will still be 150 bases long.

Adding soft and hard clipping with seqan3::cigar_from_alignment

You can add the respective clipping information by passing an instance of seqan3::cigar_clipped_bases:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
using namespace seqan3::literals;
seqan3::dna5_vector reference = "ATGGCGTAGAGCTTCCCCCCCCCCCCCCCCC"_dna5;
seqan3::dna5_vector read = "ATGCCCCGTTGCTT"_dna5; // length 14
// Let's say, we want to ignore the last 2 bases of the query because the quality is low.
// We thus only align the first 12 bases, the last two will be soft-clipped bases in the CIGAR string.
seqan3::gap_decorator aligned_reference{reference | seqan3::views::slice(0, 12)};
seqan3::gap_decorator aligned_query{read | seqan3::views::slice(0, 12)};
// insert gaps
seqan3::insert_gap(aligned_reference, aligned_reference.begin() + 4, 2);
seqan3::insert_gap(aligned_query, aligned_query.begin() + 11, 2);
auto cigar_sequence =
seqan3::cigar_from_alignment(std::tie(aligned_reference, aligned_query),
{.hard_front = 1, .hard_back = 0, .soft_front = 0, .soft_back = 2});
seqan3::debug_stream << cigar_sequence << '\n'; // prints [1H,4M,2I,5M,2D,1M,2S]
}
See also
seqan3::cigar_clipped_bases
seqan3::sam_file_output
seqan3::cigar
seqan3::aligned_sequence
Hide me