Introduction

Alignment files are used to store pairwise alignments between two (biological) sequences. Common file formats are the Sequence Alignment/Map format (SAM) and the BLAST output format. In addition to the alignment itself, these formats store further information such as start positions or mapping qualities. Alignment files are a little more complex than sequence files, but the basic design is the same. If you are new to SeqAn3, we strongly recommend completing the tutorial Sequence File Input and Output first.

Alignment file formats

SAM format

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section (see the official SAM specifications).

Here is an example of a SAM file:

@HD VN:1.6  SO:coordinate
@SQ SN:ref  LN:45
r001    99  ref 7   30  8M2I4M1D3M  =   37  39  TTAGATAAAGGATACTG   *
r003    0   ref 9   30  5S6M    *   0   0   GCCTAAGCTAA *
r004    0   ref 16  30  6M14N5M *   0   0   ATAGCTTCAGC *
r003    2064    ref 29  17  5M  *   0   0   TAGGC   *
r001    147 ref 37  30  9M  =   7   -39 CAGCGGCAT   *   NM:i:1

The following table summarises the columns of a SAM file:

#	SAM Column ID	Description
1	QNAME	Query template NAME
2	FLAG	bitwise FLAG
3	RNAME	Reference sequence NAME
4	POS	1-based leftmost mapping POSition
5	MAPQ	MAPping Quality
6	CIGAR	CIGAR string
7	RNEXT	Reference name of the mate/next read
8	PNEXT	Position of the mate/next read
9	TLEN	observed Template LENgth
10	SEQ	segment SEQuence
11	QUAL	ASCII of Phred-scaled base QUALity+33

If you want to read more about the SAM format, take a look at the official specifications.
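
To make the column layout concrete, here is a minimal sketch in plain C++ (no SeqAn3 involved) that splits one alignment line at its TAB characters. The names `sam_column` and `split_sam_line` are hypothetical helpers invented for this illustration; they are not part of any library.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Zero-based indices of the 11 mandatory SAM columns (hypothetical helper enum).
enum sam_column : std::size_t { QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL };

// Splits one alignment line at TAB characters.
// Any columns after QUAL are optional tags such as NM:i:1.
std::vector<std::string> split_sam_line(std::string_view const line)
{
    std::vector<std::string> columns;
    std::size_t start = 0;
    while (true)
    {
        std::size_t const tab = line.find('\t', start);
        if (tab == std::string_view::npos)
        {
            columns.emplace_back(line.substr(start)); // last column
            break;
        }
        columns.emplace_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return columns;
}
```

Calling `split_sam_line` on the first alignment line of the example file yields 11 columns, and `columns[CIGAR]` holds `8M2I4M1D3M`. A line that is separated by spaces instead of TABs ends up as a single column, which is why the files in this tutorial must be TAB-delimited.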

BAM format

BAM is the binary format version of SAM. It provides the same data as the SAM format with the only difference that the header is mandatory.

Alignment file fields

The Alignment file abstraction supports writing the following fields:

field::SEQ
field::ID
field::OFFSET
field::REF_SEQ
field::REF_ID
field::REF_OFFSET
field::ALIGNMENT
field::MAPQ
field::FLAG
field::QUAL
field::MATE
field::TAGS
field::EVALUE
field::BIT_SCORE

There is an additional field called seqan3::field::HEADER_PTR. It is used to transfer header information from seqan3::alignment_file_input to seqan3::alignment_file_output, but you needn't deal with this field manually.

Note that some of the fields are specific to the SAM format, while some are specific to BLAST. To make things clearer, here is the table of SAM columns connected to the corresponding alignment file field:

#	SAM Column ID	FIELD name
1	QNAME	seqan3::field::ID
2	FLAG	seqan3::field::FLAG
3	RNAME	seqan3::field::REF_ID
4	POS	seqan3::field::REF_OFFSET
5	MAPQ	seqan3::field::MAPQ
6	CIGAR	implicitly stored in seqan3::field::ALIGNMENT
7	RNEXT	seqan3::field::MATE (tuple pos 0)
8	PNEXT	seqan3::field::MATE (tuple pos 1)
9	TLEN	seqan3::field::MATE (tuple pos 2)
10	SEQ	seqan3::field::SEQ
11	QUAL	seqan3::field::QUAL

File extensions

The formats introduced above can be identified by the following file name extensions (this is important for automatic format detection from a file name, as you will learn in the next section).

File Format	File Extensions
SAM	`.sam`
BAM	`.bam`
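
The idea behind extension-based format detection can be sketched with the standard library alone. This is only an illustration of the principle, not how SeqAn3 implements it; `alignment_format` and `detect_format` are hypothetical names.

```cpp
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical tag for the two formats covered in this tutorial.
enum class alignment_format { sam, bam };

// Maps a file name to a format by looking at its extension only.
// Returns std::nullopt if the extension is not recognised.
std::optional<alignment_format> detect_format(std::filesystem::path const & file)
{
    std::string const extension = file.extension().string(); // includes the leading dot
    if (extension == ".sam")
        return alignment_format::sam;
    if (extension == ".bam")
        return alignment_format::bam;
    return std::nullopt;
}
```

With this sketch, `detect_format("/tmp/my.sam")` yields `alignment_format::sam`, while a path such as `reads.fastq` yields an empty optional. A stream has no file name, so no detection of this kind is possible for it.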

You can access and modify the valid file extensions via the file_extensions member variable in a format tag:

debug_stream << format_sam::file_extensions << std::endl; // prints [sam]

format_sam::file_extensions.push_back("sm");

Reading alignment files

Before we start, you should copy and paste this example file into a file location of your choice (we use /tmp/ in the examples, so make sure you adjust your path).

Attention: Make sure the file you copied is tab delimited!

Construction

Construction works analogously to sequence files. You can pass a file name, in which case all template parameters are deduced automatically from the file name extension. Alternatively, you can pass a stream (e.g. std::cin or std::stringstream), but then you need to specify the format yourself:

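#include <iostream> // for std::cin
#include <seqan3/io/alignment_file/all.hpp>
using namespace seqan3;
int main()
{
    // The format is deduced from the file name extension (.sam).
    alignment_file_input fin_from_filename{"/tmp/my.sam"};
    // A stream has no file name, so the format must be given explicitly.
    alignment_file_input fin_from_stream{std::cin, format_sam{}};
}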

Reading custom fields

In many cases you are not interested in all of the information in a file. For this purpose, you can select specific seqan3::field values for a file. The file will then read only those fields and fill the record accordingly.

You can select fields by providing a seqan3::fields object as an extra parameter to the constructor:

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/alignment_file/all.hpp>
#include <seqan3/std/filesystem>
using namespace seqan3;
int main()
{
    auto filename = std::filesystem::temp_directory_path()/"example.sam";
    // Only ID, SEQ and FLAG are read from the file.
    alignment_file_input fin{filename, fields<field::ID, field::SEQ, field::FLAG>{}};
    for (auto & [id, seq, flag /*order!*/] : fin)
    {
        debug_stream << id << std::endl;
        debug_stream << seq << std::endl;
        debug_stream << flag << std::endl;
    }
}

Attention: The order in which you specify the selected fields determines the order of elements in the seqan3::record.

In the example above we only select the id, sequence and flag information, so the seqan3::record object has three tuple elements that are decomposed using structured bindings.

Note that this is possible for all seqan3 file objects.

Exercise: Accumulating mapping qualities

Let's assume we want to compute the average mapping quality of a SAM file.

For this purpose, write a small program that

only reads the mapping quality (field::MAPQ) out of a SAM file and
computes the average of all qualities.

Use the following file to test your program:

@HD VN:1.6  SO:coordinate
@SQ SN:ref  LN:45
r001    99  ref 7   30  8M2I4M1D3M  =   37  39  TTAGATAAAGGATACTG   *
r003    0   ref 9   30  5S6M    *   0   0   GCCTAAGCTAA *   SA:Z:ref,29,-,6H5M,17,0;
r004    0   ref 16  30  6M14N5M *   0   0   ATAGCTTCAGC *
r003    2064    ref 29  17  5M  *   0   0   TAGGC   *   SA:Z:ref,9,+,5S6M,30,1;
r001    147 ref 37  30  9M  =   7   -39 CAGCGGCAT   *   NM:i:1

It should output:

Average: 27.4

Solution

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/alignment_file/all.hpp>
#include <seqan3/std/filesystem>
#include <seqan3/range/view/get.hpp>
#include <seqan3/std/ranges>
using namespace seqan3;
int main()
{
    std::filesystem::path tmp_dir = std::filesystem::temp_directory_path(); // get the temp directory
    alignment_file_input fin{tmp_dir/"my.sam", fields<field::MAPQ>{}};
    double sum{};
    size_t c{};
    std::ranges::for_each(fin.begin(), fin.end(), [&sum, &c] (auto & rec) { sum += get<field::MAPQ>(rec); ++c; });
    debug_stream << "Average: " << (sum/c) << std::endl;
}

Alignment representation in the SAM format

In SeqAn3 we represent an alignment as a tuple of two aligned_sequences, as you have probably learned by now from the alignment tutorial.

The SAM format is the common output format of read mappers, where you align short read sequences to one or more large reference sequences. In fact, the SAM format stores this alignment information only partially: It does not store the reference sequence but only the read sequence and a CIGAR string representing the alignment based on the read.

Take this SAM record as an example:

r003 73 ref 3 17 1M1D4M * 0 0 TAGGC *

The record gives you the following information: A read with name r003 has been mapped to a reference with name ref at position 3 (in the reference, counting from 1) with a quality of 17 (Phred scaled). The flag has a value of 73 which indicates that the read is paired, the first in pair, but the mate is unmapped (see this website for a nice explanation of SAM flags). Fields set to 0 or * are defaulted and contain no information.
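
The flag value can be decoded by testing its individual bits. The following sketch uses only the standard library; the bit values are those defined in the SAM specification, while the function name `describe_flag` and the wording of the descriptions are made up for this example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Returns a description for every bit that is set in a SAM FLAG value.
std::vector<std::string> describe_flag(std::uint16_t const flag)
{
    struct flag_bit
    {
        std::uint16_t mask;
        char const * meaning;
    };

    // Bit values as listed in the SAM specification.
    static constexpr flag_bit bits[] = {
        {0x1,   "paired"},
        {0x2,   "proper pair"},
        {0x4,   "unmapped"},
        {0x8,   "mate unmapped"},
        {0x10,  "reverse strand"},
        {0x20,  "mate reverse strand"},
        {0x40,  "first in pair"},
        {0x80,  "second in pair"},
        {0x100, "secondary alignment"},
        {0x200, "failed quality checks"},
        {0x400, "duplicate"},
        {0x800, "supplementary alignment"}
    };

    std::vector<std::string> set_bits;
    for (flag_bit const & b : bits)
        if (flag & b.mask)
            set_bits.emplace_back(b.meaning);
    return set_bits;
}
```

For the record above, 73 = 1 + 8 + 64, so `describe_flag(73)` returns "paired", "mate unmapped" and "first in pair".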

The cigar string is 1M1D4M which represents the following alignment:

      1 2 3 4 5 6 7 8 9 ...
ref   N N N N N N N N N ...
read      T - A G G C

where the reference sequence is not known (represented by N). You will learn in the next section how to handle additional reference sequence information.

If you want to read up more about cigar strings, take a look at the SAM specifications or the SAMtools paper.
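
To see how a CIGAR string turns into two gapped rows, consider the following standard-library sketch. It is not SeqAn3's implementation: `expand_cigar` is a hypothetical helper, unknown reference positions are written as N, and treating the skipped-region operation N like a deletion D is an assumption of this sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Expands a CIGAR string into {reference row, read row}.
// Soft-clipped bases (S) are dropped; H and P add nothing to either row.
std::pair<std::string, std::string> expand_cigar(std::string_view const cigar, std::string_view const read)
{
    std::string ref_row, read_row;
    std::size_t read_pos = 0, count = 0;
    bool have_count = false;

    // Consumes `n` read bases, throwing if the read is too short.
    auto take = [&] (std::size_t const n)
    {
        if (n > read.size() - read_pos)
            throw std::invalid_argument{"CIGAR consumes more bases than the read has"};
        std::string_view const bases = read.substr(read_pos, n);
        read_pos += n;
        return bases;
    };

    for (char const c : cigar)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            count = count * 10 + static_cast<std::size_t>(c - '0');
            have_count = true;
            continue;
        }
        if (!have_count)
            throw std::invalid_argument{"CIGAR operation without a length"};

        switch (c)
        {
            case 'M': case '=': case 'X': // match or mismatch: both rows advance
                read_row.append(take(count));
                ref_row.append(count, 'N');
                break;
            case 'I':                     // insertion: gap in the reference row
                read_row.append(take(count));
                ref_row.append(count, '-');
                break;
            case 'D': case 'N':           // deletion or skipped region: gap in the read row
                ref_row.append(count, 'N');
                read_row.append(count, '-');
                break;
            case 'S':                     // soft clipping: bases are not part of the alignment
                take(count);
                break;
            case 'H': case 'P':           // hard clipping, padding: nothing to do here
                break;
            default:
                throw std::invalid_argument{"unknown CIGAR operation"};
        }
        count = 0;
        have_count = false;
    }
    if (have_count)
        throw std::invalid_argument{"CIGAR ends with a length but no operation"};
    return {ref_row, read_row};
}
```

For the record above, `expand_cigar("1M1D4M", "TAGGC")` gives the reference row `NNNNNN` and the read row `T-AGGC`, which matches the alignment drawn above starting at reference position 3.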

Completing reference information

In SeqAn3 the conversion from a CIGAR string to an alignment (two aligned_sequences) is done automatically for you. You can just pass the alignment object to the alignment file:

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/alignment_file/all.hpp>
#include <seqan3/std/filesystem>
using namespace seqan3;
int main()
{
    auto filename = std::filesystem::temp_directory_path()/"example.sam";
    alignment_file_input fin{filename, fields<field::ID, field::ALIGNMENT>{}};
    for (auto & [ id, alignment ] : fin)
    {
        // get<1> is the aligned read; get<0> is the (unknown) aligned reference.
        debug_stream << id << ": " << get<1>(alignment) << std::endl;
    }
}

In the example above, you can only safely access the aligned read.

Attention: The unknown aligned reference sequence at the first position in the alignment tuple cannot be accessed (e.g. via the operator[]). It is represented by a dummy type that throws on access.

Although the SAM format does not handle reference sequence information, you can provide this information to the seqan3::alignment_file_input, which then fills the alignment object automatically. You can pass reference ids and reference sequences as additional constructor parameters:

#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/alignment_file/all.hpp>
#include <seqan3/std/filesystem>
#include <string>
#include <vector>
using namespace seqan3;
int main()
{
    auto filename = std::filesystem::temp_directory_path()/"example.sam";
    std::vector<std::string> ref_ids{"ref"}; // list of one reference name
    std::vector<dna5_vector> ref_sequences{"AGAGTTCGAGATCGAGGACTAGCGACGAGGCAGCGAGCGATCGAT"_dna5};
    alignment_file_input fin{filename, ref_ids, ref_sequences, fields<field::ALIGNMENT>{}};
    for (auto & [ alignment ] : fin)
    {
        debug_stream << alignment << std::endl; // Now you can print the whole alignment!
    }
}

Exercise: Combining sequence and alignment files

Read in the following reference sequence FASTA file (see the sequence file tutorial if you need a reminder):

>chr1
ACAGCAGGCATCTATCGGCGGATCGATCAGGCAGGCAGCTACTGG
>chr2
ACAGCAGGCATCTATCGGCGGATCGATCAGGCAGGCAGCTACTGTAATGGCATCAAAATCGGCATG

Then read in the following SAM file while providing the reference sequence information. Only read in the id, reference id, mapping quality, and alignment.

@HD VN:1.6  SO:coordinate
@SQ SN:chr1 LN:45
@SQ SN:chr2 LN:66
r001    99  chr1    7   60  8M2I4M1D3M  =   37  39  TTAGATAAAGGATACTG   *
r003    0   chr1    9   60  5S6M    *   0   0   GCCTAAGCTAA *
r004    0   chr2    16  60  6M14N5M *   0   0   ATAGCTTCAGC *
r003    2064    chr2    18  10  5M  *   0   0   TAGGC   *

With this information, do the following:

Filter the alignment records and only take those with a mapping quality >= 30. (Take a look at the tutorial Reading paired-end reads for a reminder of how to use views on files.)
For the resulting alignments, print which read was mapped against which reference id and the number of seqan3::gap symbols in each sequence (aligned reference and read sequence).

Note: reference ids (field::REF_ID) are given as an index of type std::optional<int32_t> that denotes the position of the reference id in the ref_ids vector passed to the alignment file.
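
If you want to turn such an index back into a name, a small lookup is enough. The sketch below uses only the standard library. `reference_name` is a hypothetical helper, and returning `*` for an empty optional follows the SAM convention for an unset RNAME rather than anything prescribed by SeqAn3.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Translates an optional reference index into the reference name.
// An empty optional means "no reference", written as "*" in SAM.
std::string reference_name(std::optional<std::int32_t> const ref_index,
                           std::vector<std::string> const & ref_ids)
{
    if (!ref_index.has_value())
        return "*";
    if (*ref_index < 0 || static_cast<std::size_t>(*ref_index) >= ref_ids.size())
        throw std::out_of_range{"reference index does not refer to a known reference"};
    return ref_ids[static_cast<std::size_t>(*ref_index)];
}
```

With `ref_ids` set to `{"chr1", "chr2"}`, the index 1 maps to `chr2` and an empty optional maps to `*`.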

Your program should print the following:

r001 mapped against 0 with 1 gaps in the read sequence and 2 gaps in the reference sequence.
r003 mapped against 0 with 0 gaps in the read sequence and 0 gaps in the reference sequence.
r004 mapped against 1 with 14 gaps in the read sequence and 0 gaps in the reference sequence.

Solution

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/alignment_file/all.hpp>
#include <seqan3/io/sequence_file/all.hpp>
#include <seqan3/alphabet/gap/gap.hpp>
#include <seqan3/std/filesystem>
#include <seqan3/std/ranges>
using namespace seqan3;
struct my_traits : public sequence_file_input_default_traits_dna
{
    template <typename _sequence_container>
    using sequence_container_container = std::vector<_sequence_container>;
};
int main()
{
    std::filesystem::path tmp_dir = std::filesystem::temp_directory_path(); // get the temp directory
    // read in reference information
    sequence_file_input<my_traits> reference_file{tmp_dir/"reference.fasta"};
    concatenated_sequences<std::string> ref_ids = get<field::ID>(reference_file);
    std::vector<std::vector<dna5>> ref_seqs = get<field::SEQ>(reference_file);
    alignment_file_input mapping_file{tmp_dir/"mapping.sam",
                                      ref_ids,
                                      ref_seqs,
                                      fields<field::ID,field::REF_ID, field::MAPQ, field::ALIGNMENT>{}};
    auto mapq_filter = std::view::filter([] (auto & rec) { return get<field::MAPQ>(rec) >= 30; });
    for (auto & [id, ref_id, mapq, alignment] : mapping_file | mapq_filter)
    {
        // aligned reference: gaps mark insertions (I) in the read
        auto & ref = get<0>(alignment);
        size_t sum_ref{};
        std::ranges::for_each(ref.begin(), ref.end(), [&sum_ref] (auto c) { if (c == gap{}) ++sum_ref; });
        // aligned read: gaps mark deletions (D) and skipped regions (N)
        auto & read = get<1>(alignment);
        size_t sum_read{};
        std::ranges::for_each(read.begin(), read.end(), [&sum_read] (auto c) { if (c == gap{}) ++sum_read; });
        debug_stream << id << " mapped against " << ref_id << " with "
                     << sum_read << " gaps in the read sequence and "
                     << sum_ref  << " gaps in the reference sequence." << std::endl;
    }
}

Writing alignment files

Writing records

When writing a SAM file without any further specifications, the default file assumes that all fields are provided. Since those are quite a lot for alignment files, we usually want to write only a subset of the data stored in the SAM format and default the rest.

For this purpose, you can also select specific fields by giving an additional seqan3::fields object to the constructor.

Attention: The order of the field tags in your seqan3::fields object will determine the order of values stored in the record type!

#include <seqan3/io/alignment_file/all.hpp>
#include <seqan3/std/filesystem>
#include <tuple> // for std::tie
using namespace seqan3;
int main()
{
    auto filename = std::filesystem::temp_directory_path()/"out.sam";
    alignment_file_output fout{filename, fields<field::FLAG, field::MAPQ>{}};
    size_t mymapq{60};
    uint8_t flag{0};
    // ...
    // The argument order must match the order of the selected fields: FLAG, then MAPQ.
    fout.emplace_back(flag, mymapq);
    // or:
    fout.push_back(std::tie(flag, mymapq));
}

Note that this only works because in the SAM format all fields are optional. So if we provide fewer fields when writing, default values are printed.
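
What "default values are printed" means can be illustrated without SeqAn3. In the sketch below, every column starts out with a placeholder and only the columns you set carry information. The struct `sam_record_sketch` and the function `to_sam_line` are hypothetical; the defaults (`*` for text columns, 0 for numeric ones) are chosen to match the output shown in this tutorial.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One SAM alignment line in which every column has a default value.
struct sam_record_sketch
{
    std::string   qname{"*"};
    std::uint16_t flag{0};
    std::string   rname{"*"};
    std::int32_t  pos{0};
    std::uint8_t  mapq{0};
    std::string   cigar{"*"};
    std::string   rnext{"*"};
    std::int32_t  pnext{0};
    std::int32_t  tlen{0};
    std::string   seq{"*"};
    std::string   qual{"*"};
};

// Formats the record as one TAB-delimited line (without a trailing newline).
std::string to_sam_line(sam_record_sketch const & r)
{
    std::ostringstream out;
    out << r.qname << '\t' << r.flag << '\t' << r.rname << '\t' << r.pos << '\t'
        << static_cast<unsigned>(r.mapq) << '\t' // print the number, not a character
        << r.cigar << '\t' << r.rnext << '\t' << r.pnext << '\t' << r.tlen << '\t'
        << r.seq << '\t' << r.qual;
    return out.str();
}
```

Setting only `qname` and `seq` on a default-constructed `sam_record_sketch` produces a line in which all remaining columns hold `0` or `*`.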

Exercise: Writing id and sequence information

Write a small program that writes the following read ids + sequences:

read1: ACGATCGACTAGCTACGATCAGCTAGCAG

read2: AGAAAGAGCGAGGCTATTTTAGCGAGTTA

Your ids can be of type std::string and your sequences of type std::vector<seqan3::dna4>.

Your resulting SAM file should look like this:

read1 0 * 0 0 * * 0 0 ACGATCGACTAGCTACGATCAGCTAGCAG *

read2 0 * 0 0 * * 0 0 AGAAAGAGCGAGGCTATTTTAGCGAGTTA *

Solution

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/io/alignment_file/all.hpp>
#include <seqan3/std/filesystem>
using namespace seqan3;
int main()
{
    std::vector<std::string> ids = {"read1", "read2"};
    std::vector<std::vector<dna4>> seqs = {"ACGATCGACTAGCTACGATCAGCTAGCAG"_dna4, "AGAAAGAGCGAGGCTATTTTAGCGAGTTA"_dna4};
    auto tmp_dir = std::filesystem::temp_directory_path();
    alignment_file_output fout{tmp_dir/"my.sam", fields<field::ID, field::SEQ>{}};
    for (size_t i = 0; i < ids.size(); ++i)
    {
        fout.emplace_back(ids[i], seqs[i]);
    }
}

Difficulty	Medium
Duration	60 min
Prerequisite tutorials	Quick Setup (using CMake), Alphabets in SeqAn3, Sequence File Input and Output
