Provides the different DNA and RNA alphabet types.

Classes
class	seqan3::dna15
	The 15 letter DNA alphabet, containing all IUPAC symbols minus the gap.

class	seqan3::dna16sam
	A 16 letter DNA alphabet, containing all IUPAC symbols minus the gap and plus an equality sign ('=').

class	seqan3::dna3bs
	The three letter reduced DNA alphabet for bisulfite sequencing mode (A,G,T(=C)).

class	seqan3::dna4
	The four letter DNA alphabet of A,C,G,T.

class	seqan3::dna5
	The five letter DNA alphabet of A,C,G,T and the unknown character N.

interface	nucleotide_alphabet
	A concept that indicates whether an alphabet represents nucleotides.

class	seqan3::nucleotide_base< derived_type, size >
	A CRTP-base that refines seqan3::alphabet_base and is used by the nucleotides.

class	seqan3::rna15
	The 15 letter RNA alphabet, containing all IUPAC symbols minus the gap.

class	seqan3::rna4
	The four letter RNA alphabet of A,C,G,U.

class	seqan3::rna5
	The five letter RNA alphabet of A,C,G,U and the unknown character N.

Function objects (Nucleotide)
constexpr auto	seqan3::complement = detail::adl_only::complement_cpo{}
	Return the complement of a nucleotide object.

Detailed Description

Provides the different DNA and RNA alphabet types.

See also: Alphabet

Introduction

Nucleotide sequences are at the core of most bioinformatic data processing and while it is possible to represent them in a regular std::string, it makes sense to have specialised data structures in most cases. This sub-module offers multiple nucleotide alphabets that can be used with regular containers and ranges.

Letter	Description	seqan3::dna15	seqan3::dna5	seqan3::dna4	seqan3::dna3bs	seqan3::rna15	seqan3::rna5	seqan3::rna4
A	Adenine	A	A	A	A	A	A	A
C	Cytosine	C	C	C	T	C	C	C
G	Guanine	G	G	G	G	G	G	G
T	Thymine (DNA)	T	T	T	T	U	U	U
U	Uracil (RNA)	T	T	T	T	U	U	U
M	A or C	M	N	A	A	M	N	A
R	A or G	R	N	A	A	R	N	A
W	A or T	W	N	A	A	W	N	A
Y	C or T	Y	N	C	T	Y	N	C
S	C or G	S	N	C	T	S	N	C
K	G or T	K	N	G	G	K	N	G
V	A or C or G	V	N	A	A	V	N	A
H	A or C or T	H	N	A	A	H	N	A
D	A or G or T	D	N	A	A	D	N	A
B	C or G or T	B	N	C	T	B	N	C
N	A or C or G or T	N	N	A	A	N	N	A
Size		15	5	4	3	15	5	4

Keep in mind that while we think of "the nucleotide alphabet" as consisting of four bases, more characters are defined, with different levels of ambiguity. Depending on your application, it makes sense either to preserve this ambiguity or to discard it to save space and/or optimise computations. SeqAn offers several distinct nucleotide alphabet types to accommodate this.

The specialised RNA alphabets are provided for convenience; the DNA alphabets can also be assigned a 'U' character. See below for details.

Which alphabet to choose?

in most cases, take seqan3::dna15 (includes all IUPAC characters)
if you are memory constrained and sequence data is actually the main memory consumer, use seqan3::dna5
if you use specialised algorithms that profit from a 2-bit representation, use seqan3::dna4
if you are doing only RNA input/output, use the respective seqan3::rna4, seqan3::rna5, seqan3::rna15 type
to actually save space from using smaller alphabets, you need a compressed container (e.g. seqan3::bitpacked_sequence)
if you are working with bisulfite data use seqan3::dna3bs
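
The remark about compressed containers can be illustrated with a little arithmetic. The following is a minimal standalone sketch, not SeqAn code: the helper names are hypothetical, and it assumes that a letter stored in a plain container occupies one byte while a bit-packed container uses the minimum number of bits per letter. seqan3::bitpacked_sequence may organise its storage differently.

```cpp
#include <cstddef>

// Hypothetical helper (not part of SeqAn): number of bits needed to store
// one letter of an alphabet with `alphabet_size` distinct values.
constexpr std::size_t bits_per_letter(std::size_t alphabet_size)
{
    std::size_t bits = 1;
    while ((std::size_t{1} << bits) < alphabet_size)
        ++bits;
    return bits;
}

// Bytes needed for `length` letters when every letter is bit-packed.
constexpr std::size_t packed_bytes(std::size_t length, std::size_t alphabet_size)
{
    return (length * bits_per_letter(alphabet_size) + 7) / 8;
}

// Bytes needed for `length` letters in a plain container,
// assuming each letter object occupies exactly one byte.
constexpr std::size_t unpacked_bytes(std::size_t length)
{
    return length;
}
```

Under these assumptions, 1000 letters need 250 bytes for a 4-letter alphabet, 375 bytes for a 5-letter alphabet and 500 bytes for a 15-letter alphabet when packed, compared to 1000 bytes unpacked in every case.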

Printing and conversion to char

As with all alphabets in SeqAn, none of the nucleotide alphabets can be directly converted to char or printed. You need to explicitly call seqan3::to_char to convert to char. The only exception is seqan3::debug_stream which does this conversion to char automatically.

T and U are represented by the same rank and you cannot differentiate between them. The only difference between e.g. seqan3::dna4 and seqan3::rna4 is the output when calling to_char().
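
The relation between rank and character can be modelled in a few lines. This is a standalone sketch rather than the SeqAn implementation: the function names are hypothetical, and the rank order A, C, G, T/U is an assumption made for illustration.

```cpp
// Hypothetical stand-ins for calling to_char on four-letter DNA and RNA values.
// Assumption: ranks 0..3 correspond to A, C, G and T/U in that order.
// The rank is masked to two bits so that the lookup always stays in bounds.
constexpr char dna4_rank_to_char(unsigned rank)
{
    return "ACGT"[rank & 3u];
}

constexpr char rna4_rank_to_char(unsigned rank)
{
    return "ACGU"[rank & 3u];
}
```

In this model, both functions agree for ranks 0 to 2 and differ only for rank 3, which is printed as 'T' by the DNA variant and as 'U' by the RNA variant.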

Assignment and conversions between nucleotide types

Nucleotide types defined here are implicitly convertible to each other if they have the same size (e.g. seqan3::dna4 ↔ seqan3::rna4).
Other nucleotide types are explicitly convertible to each other through their character representation.
None of the nucleotide alphabets can be directly converted from, or assigned from, char. You need to explicitly call assign_char or use a literal (see below).
Ranges of nucleotides can be converted to each other by using std::views::transform. See our cookbook for an example.

When assigning from char or converting from a larger nucleotide alphabet to a smaller one, information can be lost because some letters are not available in the target alphabet. When converting to seqan3::dna5 or seqan3::rna5, non-canonical bases (letters other than A, C, G, T, U) are converted to 'N' to preserve ambiguity at that position, while for seqan3::dna4 and seqan3::rna4 they are converted to the first of the possibilities they represent (because there is no letter 'N' to represent ambiguity). See the table at the top for an overview of which conversions take place; entries that differ from the letter in the first column are converted values.

char values that are none of the IUPAC symbols, e.g. 'P', are always converted to the equivalent of assigning 'N', i.e. they result in 'A' for seqan3::dna4 and seqan3::rna4, and in 'N' for the other alphabets. If this conversion of IUPAC characters to seqan3::dna4 is not your desired behavior, refer to the cookbook entry "A custom dna4 alphabet that converts all unknown characters to A", which shows how to change the conversion behavior.
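
The conversion rules can be written down as two small lookup functions. This is a standalone model of the DNA columns of the table at the top, not SeqAn code: the function names are hypothetical, and only upper-case input is handled in this sketch.

```cpp
// Hypothetical model of assigning a char to a five-letter DNA alphabet.
// The result is always one of A, C, G, T, N.
constexpr char model_dna5_from_char(char c)
{
    switch (c)
    {
        case 'A': case 'C': case 'G': case 'T': return c;
        case 'U': return 'T'; // U and T share a rank
        default:  return 'N'; // ambiguous IUPAC letters and non-IUPAC chars
    }
}

// Hypothetical model of assigning a char to a four-letter DNA alphabet.
// Ambiguous letters become the first base they stand for; anything else is
// treated like 'N', which in turn becomes 'A'.
constexpr char model_dna4_from_char(char c)
{
    switch (c)
    {
        case 'A': case 'C': case 'G': case 'T': return c;
        case 'U': return 'T';
        case 'Y': case 'S': case 'B': return 'C';
        case 'K': return 'G';
        default:  return 'A'; // M, R, W, V, H, D, N and non-IUPAC chars
    }
}
```

For example, 'R' (A or G) maps to 'N' in the five-letter model and to 'A' in the four-letter model, and the non-IUPAC character 'P' behaves exactly like 'N' in both.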

Literals

To avoid writing dna4{}.assign_char('C') every time, you may instead use the literal 'C'_dna4. All nucleotide types defined here have character literals (e.g. 'A'_dna4) and string literals (e.g. "ACGT"_dna4); the latter return a vector of the respective type.

Concept

The nucleotide submodule defines seqan3::nucleotide_alphabet which encompasses all the alphabets defined in the submodule and refines seqan3::alphabet. The only additional requirement is that their values can be complemented, see below.

Complement

Letter	Description	Complement
A	Adenine	T
C	Cytosine	G
G	Guanine	C
T	Thymine (DNA)	A
U	Uracil (RNA)	A
M	A or C	K
R	A or G	Y
W	A or T	W
Y	C or T	R
S	C or G	S
K	G or T	M
V	A or C or G	B
H	A or C or T	D
D	A or G or T	H
B	C or G or T	V
N	A or C or G or T	N

In the typical structure of DNA molecules (or double-stranded RNA), each nucleotide has a complement that it pairs with. To generate the complement value of a nucleotide letter, you can call an implementation of seqan3::nucleotide_alphabet::complement() on it.

The only exception to this table is the seqan3::dna3bs alphabet. The complement for 'G' is defined as 'T' since 'C' and 'T' are treated as the same letter. However, using the complement of seqan3::dna3bs is not recommended; instead, compute the complement in another DNA alphabet and transform the result into seqan3::dna3bs afterwards.

For the ambiguous letters, the complement is the (possibly also ambiguous) letter that represents the set of the individual complements. For example, M (A or C) has the complement K (G or T), because T complements A and G complements C.
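
The complement table translates directly into a lookup function. The following is a standalone sketch on plain characters, not SeqAn code: the function name is hypothetical, only upper-case letters are handled, and 'U' is treated like 'T'.

```cpp
// Hypothetical model of the complement table on upper-case IUPAC characters.
constexpr char iupac_complement(char c)
{
    switch (c)
    {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': case 'U': return 'A';
        case 'M': return 'K';
        case 'R': return 'Y';
        case 'Y': return 'R';
        case 'K': return 'M';
        case 'V': return 'B';
        case 'H': return 'D';
        case 'D': return 'H';
        case 'B': return 'V';
        // W, S and N are their own complements; in this sketch any other
        // character is also returned unchanged.
        default:  return c;
    }
}
```

Applying the function twice returns the original letter for every DNA letter in the table, which is a quick sanity check for any hand-written complement mapping.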

Variable Documentation

complement

constexpr auto seqan3::complement = detail::adl_only::complement_cpo{}

inline constexpr

Return the complement of a nucleotide object.

Template Parameters

your_type Type of the argument.

Parameters

nucl	The nucleotide object for which you want to receive the complement.

Returns: The complement character of nucl, e.g. 'C' for 'G'.

This is a function object. Invoke it with the parameter(s) specified above.

It acts as a wrapper and looks for three possible implementations (in this order):

A static member function complement(your_type const a) of the class seqan3::custom::alphabet<your_type>.
A free function complement(your_type const a) in the namespace of your type (or as friend).
A member function called complement().

Functions are only considered for one of the above cases if they are marked noexcept (constexpr is not required, but recommended) and if the returned type is your_type.

Every nucleotide alphabet type must provide one of the above.

Example

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/alphabet/nucleotide/concept.hpp>
#include <seqan3/alphabet/nucleotide/rna5.hpp>
 
using namespace seqan3::literals;
 
int main()
{
    auto r1 = 'A'_rna5.complement();        // calls member function rna5::complement(); r1 == 'U'_rna5
    auto r2 = seqan3::complement('A'_rna5); // invokes the seqan3::complement function object on the rna5 object; r2 == 'U'_rna5
}

Customisation point

This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.

This entity is experimental and subject to change in the future. Implementation 2 (free function) is not stable.

This entity is stable. Since version 3.1. The name seqan3::complement, Implementation 1, and Implementation 3 are stable and will not change.
