SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
|
Learning Objective:
Learning Objective:
You will get an overview of how to read and write SAM/BAM files. This tutorial is a walk-through with links into the API documentation and is also meant as a source for copy-and-paste code.
Difficulty | Medium |
---|---|
Duration | 60 min |
Prerequisite tutorials | Quick Setup (using CMake), Alphabets in SeqAn, Sequence File Input and Output |
Recommended reading |
SAM files are used to store pairwise alignments between two (biological) sequences. There are also other output formats, like BLAST, which can store sequence alignments, but in this tutorial we will focus on SAM/BAM files. In addition to the alignment, these formats store information such as the start positions or mapping qualities. SAM files are a little more complex than sequence files but the basic design is the same. If you are new to SeqAn, we strongly recommend completing the tutorial Sequence File Input and Output first.
SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section (see the official SAM specifications).
Here is an example of a SAM file:
The following table summarises the columns of a SAM file:
# | SAM Column ID | Description |
---|---|---|
1 | QNAME | Query template NAME |
2 | FLAG | bitwise FLAG |
3 | RNAME | Reference sequence NAME |
4 | POS | 1-based leftmost mapping POSition |
5 | MAPQ | MAPping Quality |
6 | CIGAR | CIGAR string |
7 | RNEXT | Reference name of the mate/next read |
8 | PNEXT | Position of the mate/next read |
9 | TLEN | observed Template LENgth |
10 | SEQ | segment SEQuence |
11 | QUAL | ASCII of Phred-scaled base QUALity+33 |
If you want to read more about the SAM format, take a look at the official specifications.
BAM is the binary format version of SAM. It provides the same data as the SAM format with negligible and subtle differences in most use cases.
To make things clearer, here is the table of SAM columns and the corresponding fields of a SAM file record:
SAM files provide following additional fields:
The formerly introduced formats can be identified by the following file name extensions (this is important for automatic format detection from a file name as you will learn in the next section).
File Format | File Extensions |
---|---|
SAM | .sam |
BAM | .bam |
You can access and modify the valid file extensions via the file_extension
member variable in a format tag:
Before we start, you should copy and paste this example file into a file location of your choice (we use the current path in the examples, so make sure you adjust your path).
The construction works analogously to sequence files by passing a file name, in which case, all template parameters are automatically deduced (by the file name extension). Or you can pass a stream (e.g. std::cin or std::stringstream), but then you need to know your format beforehand:
You can access a record member like this:
See seqan3::sam_record for all data accessors.
Let's assume we want to compute the average mapping quality of a SAM file.
For this purpose, write a small program that
Use the following file to test your program:
It should output:
The SAM format is the common output format of read mappers where you align short read sequences to one or more large reference sequences. In fact, the SAM format stores those alignment information only partially: It does not store the reference sequence but only the query/read sequence and a CIGAR string representing the alignment based on the read.
Take this SAM record as an example:
The record gives you the following information: A read with name r003
has been mapped to a reference with name ref
at position 3
(in the reference, counting from 1) with a quality of 17
(Phred scaled). The flag has a value of 73
which indicates that the read is paired, the first in pair, but the mate is unmapped (see this website for a nice explanation of SAM flags). Fields set to 0
or *
indicate empty fields and contain no valuable information.
The cigar string is 1M1D4M
which represents the following alignment:
where the reference sequence is not known (represented by N
). You will learn in the next section how to handle additional reference sequence information.
If you want to read up more about cigar strings, take a look at the SAM specifications or the SAMtools paper.
By default, the seqan3::sam_file_input will always read the seqan3::sam_record::cigar_sequence and store it into a std::vector<seqan3::cigar>:
In SeqAn, we represent an alignment as a tuple of two seqan3::aligned_sequence
s.
The conversion from a CIGAR string to an alignment can be done with the function seqan3::alignment_from_cigar
. You need to pass the reference sequence with the position the read was aligned to and the read sequence:
The code will print the following:
Read the following reference sequence FASTA file (see the sequence file tutorial if you need a reminder):
Then read the following SAM file while providing the reference sequence information.
Only use
With that information do the following:
seqan3::gap
s in each sequence (aligned reference and read sequence).
record.reference_id()
which gives you the position of the respective reference sequence. Your program should print the following:
When writing a SAM file without any further specifications, the default file assumes that all fields are provided. Since those are quite a lot for alignment files, we usually want to write only a subset of the data stored in the SAM format and default the rest.
For this purpose, you can use the seqan3::sam_record to write out a partial record.
Note that this only works because in the SAM format all fields are optional. If we provide fewer fields when writing, default values are written.
Create a small program that writes the following unmapped (see seqan3::sam_flag) read ids and sequences:
Your ids can be of type std::string
and your sequences of type std::vector<seqan3::dna4>
.
Your resulting SAM file should look like this: