SeqAn3  3.0.3
The Modern C++ library for sequence analysis.
SAM Input and Output in SeqAn

Learning Objective:
You will get an overview of how to read and write SAM/BAM files. This tutorial is a walk-through with links into the API documentation and also meant as a source for copy-and-paste code.

Difficulty: Medium
Duration: 60 min
Prerequisite tutorials: Quick Setup (using CMake), Alphabets in SeqAn, Sequence File Input and Output
Recommended reading

Introduction

SAM files store pairwise alignments between two (biological) sequences. Other output formats, such as BLAST, can also store sequence alignments, but this tutorial focuses on SAM/BAM files. In addition to the alignment, these formats store information such as start positions and mapping qualities. SAM files are a little more complex than sequence files, but the basic design is the same. If you are new to SeqAn, we strongly recommend working through the tutorial Sequence File Input and Output first.

SAM/BAM file formats

SAM format

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section (see the official SAM specifications).

Here is an example of a SAM file:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA *
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 5M * 0 0 TAGGC *
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1

The following table summarises the columns of a SAM file:

#   SAM Column ID  Description
1   QNAME          Query template NAME
2   FLAG           bitwise FLAG
3   RNAME          Reference sequence NAME
4   POS            1-based leftmost mapping POSition
5   MAPQ           MAPping Quality
6   CIGAR          CIGAR string
7   RNEXT          Reference name of the mate/next read
8   PNEXT          Position of the mate/next read
9   TLEN           observed Template LENgth
10  SEQ            segment SEQuence
11  QUAL           ASCII of Phred-scaled base QUALity+33

If you want to read more about the SAM format, take a look at the official specifications.

BAM format

BAM is the binary version of the SAM format. It provides the same data as SAM; the differences are subtle and negligible in most use cases.

SAM file fields

To make things clearer, here is the table of SAM columns and the corresponding fields of a SAM file record:

#   SAM Column ID  FIELD name                                               seqan3::field
1   QNAME          seqan3::sam_record::id                                   seqan3::field::id
2   FLAG           seqan3::sam_record::flag                                 seqan3::field::flag
3   RNAME          seqan3::sam_record::reference_id                         seqan3::field::ref_id
4   POS            seqan3::sam_record::reference_position                   seqan3::field::ref_offset
5   MAPQ           seqan3::sam_record::mapping_quality                      seqan3::field::mapq
6   CIGAR          implicitly stored in seqan3::sam_record::alignment       seqan3::field::alignment
                   explicitly stored in seqan3::sam_record::cigar_sequence  seqan3::field::cigar
7   RNEXT          seqan3::sam_record::mate_reference_id                    seqan3::field::mate
8   PNEXT          seqan3::sam_record::mate_position                        seqan3::field::mate
9   TLEN           seqan3::sam_record::template_length                      seqan3::field::mate
10  SEQ            seqan3::sam_record::sequence                             seqan3::field::seq
11  QUAL           seqan3::sam_record::base_qualities                       seqan3::field::qual

SAM files provide the following additional fields:

File extensions

The formats introduced above can be identified by the following file name extensions (this is important for automatic format detection from a file name, as you will learn in the next section).

File Format File Extensions
SAM .sam
BAM .bam

You can access and modify the valid file extensions via the file_extensions member variable of a format tag:

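#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/all.hpp>

int main()
{
    // print the extensions currently registered for the SAM format
    seqan3::debug_stream << seqan3::format_sam::file_extensions << '\n';
    // register an additional extension (the value "sm" is only an example)
    seqan3::format_sam::file_extensions.push_back("sm");
    seqan3::debug_stream << seqan3::format_sam::file_extensions << '\n';
}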

Reading SAM files

Before we start, you should copy and paste this example file into a file location of your choice (we use the current path in the examples, so make sure you adjust your path).

Attention
Make sure the file you copied is tab delimited!

Construction

Construction works analogously to sequence files: you can pass a file name, in which case all template parameters are deduced automatically (from the file name extension), or you can pass a stream (e.g. std::cin or std::stringstream), in which case you need to know and specify the format beforehand:

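#include <sstream>

#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/std/filesystem>

int main()
{
    auto filename = std::filesystem::current_path() / "my.sam";
    seqan3::sam_file_input fin_from_filename{filename};

    // from a stream: the format tag must be passed explicitly
    std::istringstream stream{"r001\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*\n"};
    seqan3::sam_file_input fin_from_stream{stream, seqan3::format_sam{}};

    return 0;
}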

Accessing individual record members

You can access a record member like this:

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/std/filesystem>

int main()
{
    auto filename = std::filesystem::current_path() / "example.sam";
    seqan3::sam_file_input fin{filename};

    // for (seqan3::sam_record record : fin) // this will copy the record
    for (auto && record : fin)               // this will pass the record by reference (no copy)
    {
        seqan3::debug_stream << record.id() << '\n';
        seqan3::debug_stream << record.sequence() << '\n';
        seqan3::debug_stream << record.flag() << '\n';
    }
}

See seqan3::sam_record for all data accessors.

Assignment 1: Accumulating mapping qualities

Let's assume we want to compute the average mapping quality of a SAM file.

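For this purpose, write a small program that reads the mapping quality of every record in a SAM file and prints the average of all mapping qualities.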

Use the following file to test your program:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1

It should output:

Average: 27.4

Solution

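#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/std/algorithm> // std::ranges::for_each
#include <seqan3/std/filesystem>
#include <seqan3/std/ranges>

int main()
{
    auto filename = std::filesystem::current_path() / "my.sam";
    seqan3::sam_file_input fin{filename};

    double sum{};
    size_t count{};

    // accumulate the mapping quality of every record
    std::ranges::for_each(fin, [&sum, &count] (auto & record)
    {
        sum += record.mapping_quality();
        ++count;
    });

    seqan3::debug_stream << "Average: " << (sum / count) << '\n';
}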

Alignment representation in the SAM format

In SeqAn, we represent an alignment as a tuple of two seqan3::aligned_sequences.

The SAM format is the common output format of read mappers, which align short read sequences to one or more large reference sequences. In fact, the SAM format stores this alignment information only partially: it does not store the reference sequence, only the read sequence and a CIGAR string representing the alignment based on the read.

Take this SAM record as an example:

r003 73 ref 3 17 1M1D4M * 0 0 TAGGC *

The record gives you the following information: A read with name r003 has been mapped to a reference with name ref at position 3 (in the reference, counting from 1) with a quality of 17 (Phred scaled). The flag has a value of 73 which indicates that the read is paired, the first in pair, but the mate is unmapped (see this website for a nice explanation of SAM flags). Fields set to 0 or * indicate empty fields and contain no valuable information.

The cigar string is 1M1D4M which represents the following alignment:

     1 2 3 4 5 6 7 8 9 ...
ref  N N N N N N N N N ...
read     T - A G G C

where the reference sequence is not known (represented by N). You will learn in the next section how to handle additional reference sequence information.

If you want to read up more about cigar strings, take a look at the SAM specifications or the SAMtools paper.

Reading the CIGAR string

By default, the seqan3::sam_file_input will always read the seqan3::sam_record::cigar_sequence and store it into a std::vector<seqan3::cigar>:

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/std/filesystem>

int main()
{
    auto filename = std::filesystem::current_path() / "my.sam";
    seqan3::sam_file_input fin{filename}; // default fields

    for (auto & record : fin)
        seqan3::debug_stream << record.cigar_sequence() << '\n'; // access cigar vector
}

Reading the CIGAR information into an actual alignment

In SeqAn, the conversion from a CIGAR string to an alignment (two seqan3::aligned_sequences) is done automatically for you. You can retrieve the alignment via seqan3::sam_record::alignment:

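#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/std/filesystem>

int main()
{
    auto filename = std::filesystem::current_path() / "example.sam";
    seqan3::sam_file_input fin{filename};

    // print only the aligned read (second element of the alignment tuple)
    for (auto & record : fin)
        seqan3::debug_stream << record.id() << ": " << std::get<1>(record.alignment()) << '\n';
}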

In the example above, you can only safely access the aligned read.

Attention
The unknown aligned reference sequence at the first position in the alignment tuple cannot be accessed (e.g. via the operator[]). It is represented by a dummy type that throws on access.

Although the SAM format does not store reference sequence information, you can provide this information to the seqan3::sam_file_input, which then fills the alignment object automatically. You can pass reference ids and reference sequences as additional constructor parameters:

#include <string>
#include <vector>

#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/std/filesystem>

int main()
{
    using namespace seqan3::literals;

    auto filename = std::filesystem::current_path() / "example.sam";

    std::vector<std::string> ref_ids{"ref"}; // list of one reference name
    std::vector<seqan3::dna5_vector> ref_sequences{"AGAGTTCGAGATCGAGGACTAGCGACGAGGCAGCGAGCGATCGAT"_dna5};

    seqan3::sam_file_input fin{filename, ref_ids, ref_sequences};

    for (auto & record : fin)
        seqan3::debug_stream << record.alignment() << '\n'; // Now you can print the whole alignment!
}

The code will print the following:

(CGAGATCG--AGGACTAG,TTAGATAAAGGATA-CTG)
(AGATCG,AGCTAA)
(GGACTAGCGACGAGGCAGCGAGCGA,ATAGCT--------------TCAGC)
(GGCAG,TAGGC)
(GCGATCGAT,CAGCGGCAT)

Assignment 2: Combining sequence and alignment files

Read the following reference sequence FASTA file (see the sequence file tutorial if you need a reminder):

>chr1
ACAGCAGGCATCTATCGGCGGATCGATCAGGCAGGCAGCTACTGG
>chr2
ACAGCAGGCATCTATCGGCGGATCGATCAGGCAGGCAGCTACTGTAATGGCATCAAAATCGGCATG

Then read the following SAM file while providing the reference sequence information.

@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:45
@SQ SN:chr2 LN:66
r001 99 chr1 7 60 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r003 0 chr1 9 60 5S6M * 0 0 GCCTAAGCTAA *
r004 0 chr2 16 60 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 chr2 18 10 5M * 0 0 TAGGC *

Only use

With that information do the following:

  • Filter the alignment records and only take those with a mapping quality >= 30. (Take a look at the tutorial Reading paired-end reads for a reminder of how to use views on files.)
  • For the resulting alignments, print which read was mapped against which reference id and the number of seqan3::gaps in each sequence (aligned reference and read sequence).

Your program should print the following:

r001 mapped against 0 with 1 gaps in the read sequence and 2 gaps in the reference sequence.
r003 mapped against 0 with 0 gaps in the read sequence and 0 gaps in the reference sequence.
r004 mapped against 1 with 14 gaps in the read sequence and 0 gaps in the reference sequence.

Solution

#include <optional>
#include <string>
#include <vector>

#include <seqan3/alphabet/gap/gap.hpp>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/input.hpp>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/std/algorithm> // std::ranges::count
#include <seqan3/std/filesystem>
#include <seqan3/std/ranges>    // std::views::filter

int main()
{
    auto current_path = std::filesystem::current_path();

    // read in reference information
    seqan3::sequence_file_input reference_file{current_path / "reference.fasta"};

    std::vector<std::string> reference_ids{};
    std::vector<seqan3::dna5_vector> reference_sequences{};

    for (auto && record : reference_file)
    {
        reference_ids.push_back(std::move(record.id()));
        reference_sequences.push_back(std::move(record.sequence()));
    }

    // filter out alignments
    seqan3::sam_file_input mapping_file{current_path / "mapping.sam", reference_ids, reference_sequences};

    auto mapq_filter = std::views::filter([] (auto & record) { return record.mapping_quality() >= 30; });

    for (auto & record : mapping_file | mapq_filter)
    {
        // as loop
        size_t sum_reference{};
        for (auto const & char_reference : std::get<0>(record.alignment()))
            if (char_reference == seqan3::gap{})
                ++sum_reference;

        // or via std::ranges::count
        size_t sum_read = std::ranges::count(std::get<1>(record.alignment()), seqan3::gap{});

        // The reference_id is ZERO based and an optional. -1 is represented by std::nullopt (= reference not known).
        std::optional reference_id = record.reference_id();

        seqan3::debug_stream << record.id() << " mapped against "
                             << (reference_id ? std::to_string(reference_id.value()) : "unknown reference")
                             << " with "
                             << sum_read << " gaps in the read sequence and "
                             << sum_reference << " gaps in the reference sequence.\n";
    }
}

Writing alignment files

Writing records

When writing a SAM file without any further specification, the default file assumes that all fields are provided. Since alignment files have quite a lot of fields, we usually want to write only a subset of the data stored in the SAM format and default the rest.

For this purpose, you can use the seqan3::sam_record to write out a partial record.

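#include <string>
#include <utility>
#include <vector>

#include <seqan3/alphabet/gap/gapped.hpp>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/io/sam_file/output.hpp>
#include <seqan3/std/filesystem>

using aligned_sequence_type = std::vector<seqan3::gapped<seqan3::dna5>>;
using alignment_type = std::pair<aligned_sequence_type, aligned_sequence_type>;

// One possible choice of types and fields that matches the members set in main().
using types = seqan3::type_list<std::vector<seqan3::dna5>, std::string, alignment_type>;
using fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::alignment>;

int main()
{
    using namespace seqan3::literals;

    auto filename = std::filesystem::current_path() / "out.sam";

    seqan3::sam_file_output fout{filename};

    using sam_record_type = seqan3::sam_record<types, fields>;

    // write the following to the file
    // r001 0 * 0 0 4M2I2M2D * 0 0 ACGTACGT *
    sam_record_type record{};
    record.id() = "r001";
    record.sequence() = "ACGTACGT"_dna5;
    auto & [reference_sequence, read_sequence] = record.alignment();

    // ACGT--GTTT
    seqan3::assign_unaligned(reference_sequence, "ACGTGTTT"_dna5);
    seqan3::insert_gap(reference_sequence, reference_sequence.begin() + 4, 2);

    // ACGTACGT--
    seqan3::assign_unaligned(read_sequence, record.sequence());
    seqan3::insert_gap(read_sequence, read_sequence.end(), 2);

    fout.push_back(record);
}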

Note that this only works because all fields are optional in the SAM format. If we provide fewer fields when writing, default values are written for the rest.

Assignment 3: Writing id and sequence information

Create a small program that writes the following unmapped (see seqan3::sam_flag) read ids and sequences:

read1: ACGATCGACTAGCTACGATCAGCTAGCAG
read2: AGAAAGAGCGAGGCTATTTTAGCGAGTTA

Your ids can be of type std::string and your sequences of type std::vector<seqan3::dna4>.

Your resulting SAM file should look like this:

read1 4 * 0 0 * * 0 0 ACGATCGACTAGCTACGATCAGCTAGCAG *
read2 4 * 0 0 * * 0 0 AGAAAGAGCGAGGCTATTTTAGCGAGTTA *

Solution

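#include <string>
#include <vector>

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/io/sam_file/output.hpp>
#include <seqan3/std/filesystem>

int main()
{
    using namespace seqan3::literals;

    auto filename = std::filesystem::current_path() / "my.sam";

    std::vector<std::string> ids{"read1", "read2"};
    std::vector<std::vector<seqan3::dna4>> seqs{"ACGATCGACTAGCTACGATCAGCTAGCAG"_dna4,
                                                "AGAAAGAGCGAGGCTATTTTAGCGAGTTA"_dna4};

    seqan3::sam_file_output fout{filename};

    // types and fields in the order in which the record is constructed below
    using types = seqan3::type_list<std::string, std::vector<seqan3::dna4>, seqan3::sam_flag>;
    using fields = seqan3::fields<seqan3::field::id, seqan3::field::seq, seqan3::field::flag>;
    using sam_record_type = seqan3::sam_record<types, fields>;

    for (size_t i = 0; i < ids.size(); ++i)
    {
        fout.push_back(sam_record_type{ids[i], seqs[i], seqan3::sam_flag::unmapped});
    }
}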