SeqAn3  3.0.2
The Modern C++ library for sequence analysis.
Alphabets in SeqAn

Learning Objective:

In this tutorial we look at alphabets and you will learn how to work with nucleotides and amino acids in SeqAn. We guide you through the most important properties of SeqAn's alphabets and show you the different implemented types. After completion, you will be able to use the alphabets inside of STL containers and to compare alphabet values.

Difficulty: Easy
Duration: 45 min
Prerequisite tutorials: Quick Setup (using CMake)
Recommended reading: None

The links on this page mostly point straight into the API documentation which you should use as a reference. The code examples and assignments are designed to provide some practical experience with our interface as well as a code basis for your own program development.

Introduction

An alphabet is the set of symbols of which a biological sequence – or in general a text – is composed. SeqAn implements specific and optimised alphabets not only for sequences of RNA, DNA and protein components, but also for quality, secondary structure and gap annotation as well as combinations of the aforementioned.

Task

Read the section Detailed Description of the API reference page for Alphabet. This is a detailed introduction to the alphabet module and demonstrates its main advantages.

The nucleotide alphabets

Nucleotides are the components of (Deoxy)Ribonucleic acid (DNA/RNA) and contain one of the nucleobases Adenine (A), Cytosine (C), Guanine (G), Thymine (T, only DNA) and Uracil (U, only RNA). In SeqAn the alphabets seqan3::dna4 and seqan3::rna4 contain exactly the four respective nucleotides. The trailing number in an alphabet's name is the number of entities the alphabet holds; we call this number the alphabet size. For instance, the alphabet seqan3::dna5 represents five entities as it contains the additional symbol 'N' to refer to an unknown nucleotide.

Construction and assignment of alphabet symbols

Let's look at some example code which demonstrates how objects of the seqan3::dna4 alphabet are assigned from characters.

#include <seqan3/alphabet/all.hpp> // for working with alphabets directly

using seqan3::operator""_dna4;

int main ()
{
    // Two objects of seqan3::dna4 alphabet constructed with a char literal.
    seqan3::dna4 ade = 'A'_dna4;
    seqan3::dna4 gua = 'G'_dna4;

    // Two additional objects assigned explicitly from char or rank.
    seqan3::dna4 cyt, thy;
    cyt.assign_char('C');
    thy.assign_rank(3);

    // Further code here...

    return 0;
}

We have shown three ways of assigning variables of alphabet type.

  1. Construction by character literal, i.e. appending the literal operator _dna4 to the respective char symbol.
    This is the handiest way as it can also be used to create a temporary object.
  2. Assignment by char via the member function assign_char (also available as the global function seqan3::assign_char_to).
    This is useful if the assignment target already exists, e.g. in a sequence vector.
  3. Assignment by rank via the member function assign_rank (also available as the global function seqan3::assign_rank_to).
    May be used when the rank is known.

The rank of an alphabet symbol

The rank of a symbol is a number in range [0..alphabet_size) where each number is paired with an alphabet symbol by a bijective function. In SeqAn the rank is always determined by the lexicographical order of the underlying characters. For instance, in seqan3::dna4 the bijection is
'A'_dna4 ⟼ 0
'C'_dna4 ⟼ 1
'G'_dna4 ⟼ 2
'T'_dna4 ⟼ 3.

SeqAn provides the function seqan3::to_rank for converting a symbol to its rank value as demonstrated in the following code example. Note that the data type of the rank is usually the smallest possible unsigned type that is required for storing the values of the alphabet.

    // Get the rank type of the alphabet (here uint8_t).
    using rank_type = seqan3::alphabet_rank_t<seqan3::dna4>;

    // Retrieve the numerical representation (rank) of the objects.
    rank_type rank_a = ade.to_rank(); // => 0
    rank_type rank_g = gua.to_rank(); // => 2
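
To make the bijection between symbols and ranks concrete, here is a minimal sketch in plain C++ that does not use SeqAn at all. The names toy_dna4_chars, toy_rank_to_char and toy_char_to_rank are hypothetical and only illustrate the idea; this is not how SeqAn implements its alphabets.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy stand-in for a four-letter alphabet; NOT SeqAn's implementation.
// The characters are stored in lexicographical order, so the array index is the rank.
inline constexpr std::array<char, 4> toy_dna4_chars{'A', 'C', 'G', 'T'};

// Rank -> character: a plain array lookup (the rank must be smaller than 4).
constexpr char toy_rank_to_char(std::uint8_t rank)
{
    return toy_dna4_chars[rank];
}

// Character -> rank: the inverse mapping for the four valid characters.
// Any other character yields the alphabet size (4) as an "invalid" marker.
constexpr std::uint8_t toy_char_to_rank(char c)
{
    for (std::size_t i = 0; i < toy_dna4_chars.size(); ++i)
        if (toy_dna4_chars[i] == c)
            return static_cast<std::uint8_t>(i);

    return static_cast<std::uint8_t>(toy_dna4_chars.size());
}
```

Because each rank belongs to exactly one character and vice versa, converting a valid character to its rank and back returns the same character, e.g. toy_rank_to_char(toy_char_to_rank('C')) yields 'C'.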

The char representation of an alphabet symbol

Our alphabets also have a character representation because it is more intuitive to work with them than using the rank. Each alphabet symbol is represented by its respective character whenever possible (A ⟼ 'A'). Analogously to the rank, SeqAn provides the function seqan3::to_char for converting a symbol to its character representation.

    // Get the character type of the alphabet (here char).
    using char_type = seqan3::alphabet_char_t<seqan3::dna4>;

    // Retrieve the character representation.
    char_type char_a = ade.to_char(); // => 'A'
    char_type char_g = gua.to_char(); // => 'G'

Above you have seen that you can assign an alphabet symbol from a character with assign_char (or the global function seqan3::assign_char_to). In contrast to the rank interface, this assignment is not a bijection because the whole spectrum of available chars is mapped to values inside the alphabet. For instance, assigning to seqan3::dna4 from any character other than C, G or T results in the value 'A'_dna4 and assigning from any character except A, C, G or T to seqan3::dna5 results in the value 'N'_dna5. You can avoid the implicit conversion by using seqan3::assign_char_strictly_to, which throws seqan3::invalid_char_assignment on invalid characters.

    // Assign from character with value check.
    seqan3::assign_char_strictly_to('C', cyt);
    // seqan3::assign_char_strictly_to('X', thy); // would throw seqan3::invalid_char_assignment

You can test the validity of a character by calling seqan3::char_is_valid_for. It returns true if the character is valid and false otherwise.
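
The difference between lenient assignment, strict assignment and a validity check can be sketched in plain C++ as follows. This is a simplified model under stated assumptions, not SeqAn's implementation: the function names are hypothetical, only upper-case characters are recognised, and std::invalid_argument stands in for seqan3::invalid_char_assignment.

```cpp
#include <stdexcept>

// Toy model of character assignment for a four-letter alphabet; NOT SeqAn's implementation.
// Simplification: only upper-case characters are recognised.

// True only for the four characters that have their own rank.
constexpr bool toy_char_is_valid(char c)
{
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

// Lenient conversion: every char maps to some rank, unknown ones fall back to rank 0 ('A').
constexpr unsigned char toy_assign_char(char c)
{
    switch (c)
    {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0; // 'A' and every other character
    }
}

// Strict conversion: refuses characters that would otherwise be converted implicitly.
constexpr unsigned char toy_assign_char_strictly(char c)
{
    if (!toy_char_is_valid(c))
        throw std::invalid_argument{"character is not part of the alphabet"};

    return toy_assign_char(c);
}
```

Here toy_assign_char('X') silently yields rank 0, whereas toy_assign_char_strictly('X') throws. Checking with toy_char_is_valid first lets a caller decide how to treat unexpected input without relying on exceptions.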

Obtaining the alphabet size

You can retrieve the alphabet size by accessing the class member variable alphabet_size which is implemented in most seqan3::alphabet instances.

// Get the alphabet size as class member of the alphabet.
uint8_t const size1 = seqan3::dna4::alphabet_size; // => 4
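
The remark that the rank type is usually the smallest unsigned type able to store all values can be illustrated with a small compile-time type selection. The sketch below uses only the standard library; toy_min_uint_t and toy_dna5 are hypothetical names for illustration and are not part of SeqAn.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not a SeqAn name): the smallest unsigned integer type that can hold `value`.
template <std::size_t value>
using toy_min_uint_t =
    std::conditional_t<(value <= 0xFFu),       std::uint8_t,
    std::conditional_t<(value <= 0xFFFFu),     std::uint16_t,
    std::conditional_t<(value <= 0xFFFFFFFFu), std::uint32_t, std::uint64_t>>>;

// Toy alphabet with a compile-time size, mirroring the `alphabet_size` member shown above.
struct toy_dna5
{
    static constexpr toy_min_uint_t<5> alphabet_size = 5;
};
```

With this helper a five-letter alphabet gets a one-byte size and rank type, while an alphabet with, say, 300 values would need two bytes.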

Containers over alphabets

In SeqAn you can use the STL containers to model e.g. sequences, sets or mappings with our alphabets. The following example shows some exemplary contexts for their use. For sequences we recommend the std::vector with one of SeqAn's alphabet types. Please note how easily a sequence can be created via the string literal.

using seqan3::operator""_dna5;
// Examples of different container types with SeqAn's alphabets.
std::vector<seqan3::dna5> dna_sequence{"GATTANAG"_dna5};
std::set<seqan3::dna4> pyrimidines{'C'_dna4, 'T'_dna4};

Comparability

All alphabets in SeqAn are regular and comparable. This means that you can use the <, <=, > and >= operators to compare the values based on the rank. For instance, this is useful for sorting a text over the alphabet. Regular implies that the equality and inequality of two values can be tested with == and !=.

// Equality and comparison of seqan3::dna4 symbols.
bool eq = (cyt == 'C'_dna4); // true
bool neq = (thy != 'C'_dna4); // true
bool geq = (cyt >= 'C'_dna4); // true
bool gt = (thy > 'C'_dna4); // true
bool seq = (cyt <= 'C'_dna4); // true
bool st = (ade < 'C'_dna4); // true
// Sort a vector of symbols.
std::vector<seqan3::dna4> some_nucl{"GTA"_dna4};
std::sort(some_nucl.begin(), some_nucl.end()); // some_nucl: "AGT"
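
As an illustration of what "comparable based on the rank" means, the following sketch defines a toy symbol type whose six comparison operators simply delegate to the stored rank. The type toy_symbol is hypothetical and not SeqAn's implementation.

```cpp
#include <cstdint>

// Toy symbol type (hypothetical, not SeqAn's implementation): a symbol is just its rank.
struct toy_symbol
{
    std::uint8_t rank; // position of the symbol in the alphabet
};

// Equality and ordering are both defined purely in terms of the rank.
constexpr bool operator==(toy_symbol lhs, toy_symbol rhs) { return lhs.rank == rhs.rank; }
constexpr bool operator!=(toy_symbol lhs, toy_symbol rhs) { return lhs.rank != rhs.rank; }
constexpr bool operator< (toy_symbol lhs, toy_symbol rhs) { return lhs.rank <  rhs.rank; }
constexpr bool operator<=(toy_symbol lhs, toy_symbol rhs) { return lhs.rank <= rhs.rank; }
constexpr bool operator> (toy_symbol lhs, toy_symbol rhs) { return lhs.rank >  rhs.rank; }
constexpr bool operator>=(toy_symbol lhs, toy_symbol rhs) { return lhs.rank >= rhs.rank; }
```

Any type with these operators can be sorted with std::sort or stored in a std::set, which is why alphabet symbols can be used in such contexts.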

Example

To wrap up this section on the nucleotide alphabet, the following assignment lets you practise the use of a SeqAn alphabet and its related functions. It will also show you a handy advantage of using a vector over an alphabet instead of using std::string: the rank representation can be used directly as an array index (as opposed to using a map with logarithmic access times, for example).

Assignment 1: GC content of a sequence

An important property of DNA and RNA sequences is the GC content, which is the percentage of nucleobases that are either Guanine or Cytosine. Given the nucleotide counts $n_A$, $n_T$, $n_G$, $n_C$ the GC content $c$ is calculated as

\[ c = \frac{n_G + n_C}{n_A + n_T + n_G + n_C} \]

Write a program that

  1. reads a sequence as command line argument into a vector of seqan3::dna5,
  2. counts the number of occurrences for each nucleotide in an array of size alphabet size and
  3. calculates the GC content.

The seqan3::dna5 type ensures that invalid characters in the input sequence are converted to 'N'. Note that these characters should not influence the GC content.

Pass the sequences CATTACAG (3/8 = 37.5%) and ANNAGAT (1/5 = 20%) to your program and check if your results are correct.

Solution

#include <array>  // std::array
#include <string> // std::string
#include <vector> // std::vector

#include <seqan3/alphabet/all.hpp>        // seqan3::dna5 and the alphabet functions
#include <seqan3/argument_parser/all.hpp> // seqan3::argument_parser
#include <seqan3/core/debug_stream.hpp>   // seqan3::debug_stream
#include <seqan3/range/views/all.hpp>     // optional: use views to convert the input string to a dna5 sequence

using seqan3::operator""_dna5;

int main (int argc, char * argv[])
{
    std::string input{};
    seqan3::argument_parser parser("GC-Content", argc, argv);
    parser.add_positional_option(input, "Specify an input sequence.");

    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext) // the input is invalid
    {
        seqan3::debug_stream << "[PARSER ERROR] " << ext.what() << '\n';
        return 0;
    }

    // Convert the input to a dna5 sequence; invalid characters become 'N'.
    std::vector<seqan3::dna5> sequence{};
    for (char c : input)
        sequence.push_back(seqan3::assign_char_to(c, seqan3::dna5{}));

    // Optional: use views for the conversion. Views will be introduced in the next chapter.
    //std::vector<seqan3::dna5> sequence = input | seqan3::views::char_to<seqan3::dna5> | seqan3::views::to<std::vector>;

    // Initialise an array with count values for dna5 symbols.
    std::array<size_t, seqan3::dna5::alphabet_size> count{}; // default initialised with zeroes

    // Increase the symbol count according to the sequence.
    for (seqan3::dna5 symbol : sequence)
        ++count[symbol.to_rank()];

    // Calculate the GC content: (#G + #C) / (#A + #T + #G + #C).
    size_t gc = count['C'_dna5.to_rank()] + count['G'_dna5.to_rank()];
    size_t atgc = input.size() - count['N'_dna5.to_rank()];
    float gc_content = 1.0f * gc / atgc;

    seqan3::debug_stream << "The GC content of " << sequence << " is " << 100 * gc_content << "%.\n";

    return 0;
}
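
If you want to cross-check the expected values without SeqAn, the same counting scheme can be sketched with the standard library only. The sketch assumes a toy five-letter alphabet ordered A, C, G, N, T, upper-case input, and that every other character counts as 'N'; all names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Rank of a character in a toy five-letter alphabet ordered A, C, G, N, T.
// Every unknown character is treated as 'N' (rank 3). Upper-case input only.
constexpr std::size_t toy_dna5_rank(char c)
{
    switch (c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 4;
        default:  return 3;
    }
}

struct gc_counts
{
    std::size_t gc;   // n_G + n_C
    std::size_t acgt; // n_A + n_T + n_G + n_C
};

// Count symbols in an array indexed by rank, then derive both sums.
constexpr gc_counts count_gc(std::string_view text)
{
    std::array<std::size_t, 5> count{}; // one counter per rank, initialised with zeroes
    for (char c : text)
        ++count[toy_dna5_rank(c)];

    return {count[1] + count[2], text.size() - count[3]};
}

// GC content as a fraction in [0, 1]; returns 0 if the text holds no A, C, G or T.
constexpr double gc_fraction(std::string_view text)
{
    gc_counts const n = count_gc(text);
    return n.acgt == 0 ? 0.0 : static_cast<double>(n.gc) / static_cast<double>(n.acgt);
}
```

For CATTACAG this gives 3 of 8 symbols (a fraction of 0.375), and for ANNAGAT it gives 1 of 5, because the two N symbols are excluded from the denominator.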

Other alphabets

Until now, we have focused on alphabets for nucleotides to introduce the properties of SeqAn's alphabet on a specific example. SeqAn implements, however, many more alphabets. In this section, we want to give you an overview of the other existing alphabets.

The amino acid alphabet

Proteins consist of one or more long chains of amino acids. The so-called primary structure of a protein is expressed as sequences over an amino acid alphabet. The seqan3::aa27 alphabet contains the standard one-letter code of the 20 canonical amino acids as well as two further proteinogenic amino acids (selenocysteine and pyrrolysine), a termination symbol and some wildcard characters. For details read the Aminoacid page.

Structure and quality alphabets

The alphabets for structure and quality are sequence annotations since they describe additional properties of the respective sequence. We distinguish between three types:

  1. Quality alphabet for nucleotides. The values are produced by sequencing machines and represent the probability that a nucleobase was recorded incorrectly. The characters are most commonly found in FASTQ files. See Quality for details.
  2. RNA structure alphabets. They describe RNA nucleobases as unpaired or up-/downstream paired and can be found in annotated RNA sequence and alignment files (e.g. Stockholm format). Currently we provide the Dot Bracket and WUSS formats.
  3. Protein structure alphabet. The DSSP format represents secondary structure elements like alpha helices and turns.

You can build an Alphabet Tuple Composite with a nucleotide and quality alphabet, or nucleotide / amino acid and structure alphabet that stores both information together. For the use cases just described we offer pre-defined composites (seqan3::qualified, seqan3::structured_rna, seqan3::structured_aa). See our API documentation for a detailed description of each.

Gap alphabet

The seqan3::gap alphabet is the smallest alphabet in SeqAn, consisting only of the gap character. It is most often used in an Alphabet Variant with a nucleotide or amino acid alphabet to represent gapped sequences, e.g. in alignments. To create a gapped alphabet simply use seqan3::gapped<> with the alphabet type you want to refine.

    // Assign a gap symbol to a gapped RNA alphabet.
    seqan3::gapped<seqan3::rna5> sym = seqan3::gap{}; // => -

    using seqan3::operator""_rna5;
    // Each seqan3::rna5 symbol is still valid.
    sym = 'U'_rna5; // => U

    // The alphabet size is six (AUGCN-).
    uint8_t const size2 = seqan3::gapped<seqan3::rna5>::alphabet_size; // => 6
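
Why the gapped alphabet has size six can be seen in a small stdlib-only model: the sizes of the combined alphabets add up. In this sketch the five RNA symbols keep their ranks and the gap is appended as one extra rank; that placement, like all names used here, is an assumption made for illustration and not a statement about SeqAn's internals.

```cpp
#include <array>
#include <cstddef>

// Toy model of a "gapped" alphabet (hypothetical, not SeqAn's implementation):
// the five RNA symbols keep their ranks and the gap is appended as one extra rank.
inline constexpr std::array<char, 5> toy_rna5_chars{'A', 'C', 'G', 'N', 'U'};
inline constexpr std::size_t toy_gap_size = 1;

// The size of the combined alphabet is the sum of the sizes of its parts.
inline constexpr std::size_t toy_gapped_rna5_size = toy_rna5_chars.size() + toy_gap_size;

// Ranks 0..4 map to the RNA characters, rank 5 maps to the gap character.
constexpr char toy_gapped_rna5_to_char(std::size_t rank)
{
    return rank < toy_rna5_chars.size() ? toy_rna5_chars[rank] : '-';
}
```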