Modules | |
Adaptation | |
Provides alphabet adaptions of some standard char and uint types. | |
Aminoacid | |
Provides the amino acid alphabets and functionality for translation from nucleotide. | |
CIGAR | |
Provides the CIGAR operation alphabet, along with the CIGAR cartesian composition. | |
Composite | |
Provides data structures joining multiple alphabets into a single alphabet. | |
Gap | |
Provides the gap alphabet and functionality to make an alphabet a gapped alphabet. | |
Mask | |
Provides the mask alphabet and functionality for creating masked composites. | |
Nucleotide | |
Provides the different DNA and RNA alphabet types. | |
Quality | |
Provides the various quality score types. | |
Structure | |
The structure module contains alphabets for RNA and protein structure. | |
Classes | |
interface | seqan3::Alphabet |
The generic alphabet concept that covers most data types used in ranges. | |
class | seqan3::alphabet_base< derived_type, size, char_t > |
A CRTP-base that makes defining a custom alphabet easier. | |
class | seqan3::alphabet_base< derived_type, 1ul, char_t > |
Specialisation of seqan3::alphabet_base for alphabets of size 1. | |
class | seqan3::alphabet_proxy< derived_type, alphabet_type > |
A CRTP-base that eases the definition of proxy types returned in place of regular alphabets. | |
struct | std::hash< alphabet_t > |
Struct for hashing a character. | |
struct | std::hash< urng_t > |
Struct for hashing a range of characters. | |
interface | seqan3::Semialphabet |
The basis for seqan3::Alphabet, but requires only rank interface (not char). | |
interface | seqan3::WritableAlphabet |
Refines seqan3::Alphabet and adds assignability. | |
interface | seqan3::WritableSemialphabet |
A refinement of seqan3::Semialphabet that adds assignability. | |
Typedefs | |
template<typename alphabet_type > | |
using | seqan3::alphabet_char_t = decltype(seqan3::to_char(std::declval< alphabet_type const >())) |
The char_type of the alphabet; defined as the return type of seqan3::to_char. | |
template<typename semi_alphabet_type > | |
using | seqan3::alphabet_rank_t = decltype(seqan3::to_rank(std::declval< semi_alphabet_type >())) |
The rank_type of the semi-alphabet; defined as the return type of seqan3::to_rank. | |
Functions | |
size_t | std::hash< alphabet_t >::operator() (alphabet_t const character) const noexcept |
Compute the hash for a character. | |
size_t | std::hash< urng_t >::operator() (urng_t const &range) const noexcept |
Compute the hash for a range of characters. | |
Variables | |
template<typename alph_t > | |
constexpr auto | seqan3::alphabet_size = detail::adl::only::alphabet_size_obj<alph_t>() |
A type trait that holds the size of a (semi-)alphabet. | |
Function objects | |
constexpr auto | seqan3::to_rank = detail::adl::only::to_rank_fn{} |
Return the rank representation of a (semi-)alphabet object. | |
constexpr auto | seqan3::assign_rank_to = detail::adl::only::assign_rank_to_fn{} |
Assign a rank to an alphabet object. | |
constexpr auto | seqan3::to_char = detail::adl::only::to_char_fn{} |
Return the char representation of an alphabet object. | |
constexpr auto | seqan3::assign_char_to = detail::adl::only::assign_char_to_fn{} |
Assign a character to an alphabet object. | |
template<typename alph_t > | |
constexpr auto | seqan3::char_is_valid_for = detail::adl::only::char_is_valid_for_fn<alph_t>{} |
Returns whether a character is in the valid set of a seqan3::Alphabet (usually implies a bijective mapping to an alphabet value). | |
constexpr auto | seqan3::assign_char_strictly_to = detail::adl::only::assign_char_strictly_to_fn{} |
Assign a character to an alphabet object, throw if the character is not valid. | |
Alphabets are a core component in SeqAn. They enable us to represent the smallest unit of biological sequence data, e.g. a nucleotide or an amino acid.
In theory, these could just be represented as a char and this is how many people perceive them, but it makes sense to use a smaller, stricter and well-defined alphabet in almost all cases, because:
- Most alphabets need far fewer values than a char, e.g. a char can have 256 values and thus must be represented by 8 bits of memory, but a DNA character could be represented by 2 bits, because it only has four values in the smallest representation ('A', 'C', 'G', 'T').
- The letters of an alphabet can be given a rank, e.g. 'A', 'C', 'G', 'T' can be ranked 0, 1, 2, 3 respectively. In fact the rank representation is used a lot more often than the visual representation which is only used in input/output.
In SeqAn there are alphabet types for typical sequence alphabets like DNA and amino acid, but also for qualities, RNA structures and alignment gaps. In addition there are templates for combining alphabet types into new alphabets, and wrappers for existing data types like the canonical char.
In addition to concrete alphabet types, SeqAn provides multiple concepts that describe groups of alphabets by their properties and can be used to constrain templates so that they only work with certain alphabet types. See the Tutorial on Concepts for a gentle introduction to the topic.
All alphabets in SeqAn have a fixed size. It can be queried via the seqan3::alphabet_size type trait and optionally also the alphabet_size static member of the alphabet (see below for "members VS free/global functions").
In some areas we provide alphabet types with different sizes for the same purpose, e.g. seqan3::dna4 ('A', 'C', 'G', 'T'), seqan3::dna5 (plus 'N') and seqan3::dna15 (plus ambiguous characters defined by IUPAC). By convention most of our alphabets carry their size in their name (seqan3::dna4 has size 4, and so on).
A main reason for choosing a smaller alphabet over a bigger one is the possibility of optimising for space efficiency. Note, however, that a single letter by itself can never be smaller than a byte for architectural reasons. Actual space improvements are realised via secondary structures, e.g. when using a seqan3::bitcompressed_vector<seqan3::dna4> instead of std::vector<seqan3::dna4>. Also the single letter quality composite seqan3::qualified<seqan3::dna4, seqan3::phred42> fits into one byte, because the product of the alphabet sizes (4 * 42) is smaller than 256; whereas the same composite with seqan3::dna15 requires two bytes per letter (15 * 42 > 256).
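The size arithmetic above can be checked with a few lines of standard C++. The following is only a sketch and does not use SeqAn: bits_per_letter, bytes_per_letter and pack4 are hypothetical helper names introduced for illustration, and the packing layout (first rank in the lowest bits) is an assumption, not the layout of seqan3::bitcompressed_vector.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Smallest number of bits that can distinguish `alphabet_size` different values.
constexpr std::size_t bits_per_letter(std::size_t alphabet_size)
{
    std::size_t bits = 0;
    while ((std::size_t{1} << bits) < alphabet_size)
        ++bits;
    return bits;
}

// Whole bytes needed to store one letter on its own (never less than one byte).
constexpr std::size_t bytes_per_letter(std::size_t alphabet_size)
{
    std::size_t const bits = bits_per_letter(alphabet_size);
    return bits == 0 ? 1 : (bits + 7) / 8;
}

// Pack four 2-bit ranks (each in 0..3) into one byte, first rank in the lowest bits.
constexpr std::uint8_t pack4(std::array<std::uint8_t, 4> const & ranks)
{
    std::uint8_t packed = 0;
    for (std::size_t i = 0; i < 4; ++i)
        packed |= static_cast<std::uint8_t>((ranks[i] & 0x3u) << (2 * i));
    return packed;
}

static_assert(bits_per_letter(4) == 2);         // an alphabet of size 4 needs 2 bits
static_assert(bytes_per_letter(4 * 42) == 1);   // 168 combined values fit into one byte
static_assert(bytes_per_letter(15 * 42) == 2);  // 630 combined values need two bytes
```

A single letter still occupies at least one byte, so the 2-bit figure only pays off once several letters share a byte, as pack4 does for four ranks.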
As mentioned above, we typically think of alphabets in their character representation, but we also require them in "rank representation" as programmers. In C and C++ it is quite difficult to cleanly differentiate between these, because the char type is considered an integral type and can be used to index an array (e.g. my_array['A'] translates to my_array[65]). Moreover the sign of char is implementation-defined and on many platforms the smallest integer types int8_t and uint8_t are literally the same types as signed char and unsigned char respectively.
This leads to ambiguity when assigning and retrieving values:
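As an illustration in plain C++ (not SeqAn code), the sketch below streams the same numeric value through different static types. The helper name streamed is hypothetical, and the uint8_t behaviour shown in the usage note assumes a platform where uint8_t is an alias for unsigned char.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Stream a value the way operator<< would and return the resulting text.
template <typename value_t>
std::string streamed(value_t const value)
{
    std::ostringstream os;
    os << value;
    return os.str();
}

// On ASCII-compatible platforms the letter 'A' is just the integer 65.
static_assert(static_cast<int>('A') == 65, "sketch assumes an ASCII-compatible character set");
```

With these definitions, streamed<int>(65) yields "65", whereas streamed<std::uint8_t>(65) yields "A" on such platforms: the value is identical, but whether it is treated as a number or as a character depends only on its type.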
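To solve this problem, alphabets in SeqAn define two interfaces:
1. a rank based interface with
   - seqan3::to_rank to return the rank representation;
   - seqan3::assign_rank_to to assign from the rank representation;
   - the rank representation is an integral type, e.g. size_t; the exact type can be retrieved via seqan3::alphabet_rank_t.
2. a character based interface with
   - seqan3::to_char to return the char representation;
   - seqan3::assign_char_to to assign from the char representation;
   - seqan3::char_is_valid_for to check whether a character is in the valid set of the alphabet;
   - the char representation is a character type (usually char, but could be char16_t or char32_t, as well); the exact type can be retrieved via seqan3::alphabet_char_t.
To prevent the aforementioned ambiguity, you can neither assign from the rank or char representation via operator=, nor can you cast the alphabet to either of its representation forms; you need to explicitly use the interfaces: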
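A minimal sketch of such a type in plain C++ follows. mini_dna4 is a hypothetical stand-in written for this illustration, not seqan3::dna4; mapping unknown characters to 'A' and leaving out-of-range ranks unchecked are assumptions of the sketch.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical four-letter alphabet; the rank is stored, the char is derived from it.
class mini_dna4
{
public:
    using rank_type = std::uint8_t;
    using char_type = char;

    // Rank interface.
    constexpr rank_type to_rank() const noexcept { return rank_; }

    constexpr mini_dna4 & assign_rank(rank_type const r) noexcept
    {
        rank_ = r; // precondition in this sketch: r < 4
        return *this;
    }

    // Character interface: the conversion goes through a small lookup table.
    constexpr char_type to_char() const noexcept { return "ACGT"[rank_]; }

    constexpr mini_dna4 & assign_char(char_type const c) noexcept
    {
        switch (c)
        {
            case 'C': case 'c': rank_ = 1; break;
            case 'G': case 'g': rank_ = 2; break;
            case 'T': case 't': rank_ = 3; break;
            default:            rank_ = 0; break; // everything else becomes 'A' in this sketch
        }
        return *this;
    }

private:
    rank_type rank_ = 0;
};

// Neither implicit conversion nor assignment from or to the raw representations is possible.
static_assert(!std::is_convertible_v<char, mini_dna4>);
static_assert(!std::is_convertible_v<mini_dna4, char>);
static_assert(!std::is_assignable_v<mini_dna4 &, char>);
static_assert(!std::is_assignable_v<mini_dna4 &, int>);
```

Because there is no converting constructor and no conversion operator, every transition between letter, rank and character has to go through a named function, which is what removes the ambiguity.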
For efficiency, the representation saved internally is normally the rank representation, and the character representation is generated via conversion tables. This is, however, not required as long as both interfaces are provided and all functions operate in constant time.
The same applies for printing characters although seqan3::debug_stream provides some convenience:
To reduce the burden of calling assign_char often, most alphabets in SeqAn3 provide custom literals for the alphabet and sequences over the alphabet:
Note, however, that literals are not required by the concept.
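How such literals can be built with standard user-defined literal operators is sketched below. The type mini_dna4, the helper mini_from_char and the suffix _mini are all hypothetical names chosen for this example; they are not the literals that SeqAn ships.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical four-letter alphabet that only stores a rank.
struct mini_dna4
{
    unsigned char rank = 0;
};

// Convert a character to a letter; unknown characters become rank 0 in this sketch.
constexpr mini_dna4 mini_from_char(char const c) noexcept
{
    switch (c)
    {
        case 'C': return mini_dna4{1};
        case 'G': return mini_dna4{2};
        case 'T': return mini_dna4{3};
        default:  return mini_dna4{0};
    }
}

// Character literal, e.g. 'G'_mini, produces a single letter.
constexpr mini_dna4 operator""_mini(char c) noexcept
{
    return mini_from_char(c);
}

// String literal, e.g. "ACGT"_mini, produces a sequence over the alphabet.
inline std::vector<mini_dna4> operator""_mini(char const * s, std::size_t n)
{
    std::vector<mini_dna4> sequence;
    sequence.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        sequence.push_back(mini_from_char(s[i]));
    return sequence;
}
```

With these operators, 'G'_mini creates one letter and "ACGT"_mini creates a vector of four letters without any explicit assignment call.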
All types that have valid implementations of the functions/functors described above model the concept seqan3::WritableAlphabet. This is the strongest (i.e. most refined) general case concept. There are more refined concepts for specific biological applications (like seqan3::NucleotideAlphabet), and there are less refined concepts that only model part of an alphabet:
- seqan3::Semialphabet and derived concepts only require the rank interface;
- seqan3::Alphabet (without Writable*) and derived concepts only require readability and not assignability.
Typically you will use seqan3::Alphabet in "read-only" situations (e.g. const parameters) and seqan3::WritableAlphabet whenever the values might be changed. Semi-alphabets are less useful in application code.
| | Semialphabet | WritableSemialphabet | Alphabet | WritableAlphabet | Aux |
|---|---|---|---|---|---|
| alphabet_size | ✅ | ✅ | ✅ | ✅ | |
| to_rank | ✅ | ✅ | ✅ | ✅ | |
| alphabet_rank_t | ✅ | ✅ | ✅ | ✅ | 🔗 |
| assign_rank_to | | ✅ | | ✅ | |
| to_char | | | ✅ | ✅ | |
| alphabet_char_t | | | ✅ | ✅ | 🔗 |
| assign_char_to | | | | ✅ | |
| char_is_valid_for | | | | ✅ | |
| assign_char_strictly_to | | | | ✅ | 🔗 |
The above table shows all alphabet concepts and related functions and type traits. The entities marked as "auxiliary" (column Aux) provide shortcuts to the other "essential" entities. This difference is only relevant if you want to create your own alphabet (you do not need to provide an implementation for the "auxiliary" entities, they are provided automatically).
The alphabet concept (as most concepts in SeqAn) looks for free/global functions, i.e. you need to be able to call seqan3::to_rank(my_letter); however, most alphabets also provide a member function, i.e. my_letter.to_rank(). The same is true for the type trait seqan3::alphabet_size vs the static data member alphabet_size.
Members are provided for convenience and if you are an application developer who works with a single concrete alphabet type you are fine with using the member functions. If you, however, implement a generic function that accepts different alphabet types, you need to use the free function / type trait interface, because it is the only interface guaranteed to exist (member functions are not required/enforced by the concept).
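The difference can be sketched in plain C++ as follows. The namespaces demo_a and demo_b, the types two_letter and three_letter and the function rank_sum are hypothetical; the unqualified to_rank call relies on ordinary argument-dependent lookup rather than on the seqan3::to_rank function object.

```cpp
#include <cstddef>
#include <vector>

namespace demo_a
{
// Hypothetical alphabet that offers ONLY a free function, no member function.
struct two_letter
{
    bool value = false;
};

constexpr unsigned to_rank(two_letter const l) noexcept { return l.value ? 1u : 0u; }
} // namespace demo_a

namespace demo_b
{
// Hypothetical alphabet that offers a member function and a free function.
struct three_letter
{
    unsigned char rank = 0;

    constexpr unsigned to_rank() const noexcept { return rank; }
};

constexpr unsigned to_rank(three_letter const l) noexcept { return l.to_rank(); }
} // namespace demo_b

// Generic code uses the free function only; it is found via argument-dependent lookup.
template <typename alphabet_t>
std::size_t rank_sum(std::vector<alphabet_t> const & sequence)
{
    std::size_t sum = 0;
    for (alphabet_t const letter : sequence)
        sum += to_rank(letter);
    return sum;
}
```

Had rank_sum called letter.to_rank() instead, it would compile for three_letter but not for two_letter, which is why generic code should stick to the free function interface.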
In SeqAn3 it is recommended you use the STL container classes like std::vector for storing sequence data, but you can use other class templates if they satisfy the respective seqan3::Container concept, e.g. std::deque or folly::fbvector or even Qt::QVector.
std::basic_string is also supported; however, we recommend against using it, because it is not safe (and not useful) to call certain members like .c_str() if our alphabets are used as value type.
We provide specialised containers with certain properties in the Range module.
size_t std::hash< alphabet_t >::operator() (alphabet_t const character) const noexcept [inline, noexcept]
Compute the hash for a character.
[in] | character | The character to process. Must model seqan3::Semialphabet. |
size_t std::hash< urng_t >::operator() (urng_t const &range) const noexcept [inline, noexcept]
Compute the hash for a range of characters.
[in] | range | The input range to process. Must model std::ranges::InputRange and the reference type of the range must model seqan3::Semialphabet. |
|
inline |
A type trait that holds the size of a (semi-)alphabet.
your_type | The (semi-)alphabet type being queried. |
This type trait is implemented as a global variable template.
It is only defined for types that provide one of the following (checked in this order):
alphabet_size(your_type const &)
in the namespace of your type (or as friend
) that returns the size as an integral value. The function must be marked constexpr
and noexcept
and the return type needs to be implicitly convertible to size_t
. The value of the argument to the function shall be ignored, it is only used to select the function via argument-dependent lookup.alphabet_size(your_type const &)
in namespace seqan3::custom
that returns the size as an integral value. The same restrictions apply as above.static constexpr
data member called alphabet_size
that is the size. It must be implicitly convertible to size_t
.Every (semi-)alphabet type must provide one of the above.
Note that if the (semi-)alphabet type with cvref removed is not std::is_nothrow_default_constructible or not seqan3::is_constexpr_default_constructible, this object will instead look for alphabet_size(std::type_identity<your_type> const &)
with the same semantics (in cases 1. and 2.).
For an example of a full alphabet definition with free function implementations (solution 1. above), see seqan3::Alphabet.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.
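The "checked in this order" lookup can be imitated in standard C++17 with the detection idiom. The sketch below is not SeqAn's implementation: the namespace sketch, the traits has_free_size and has_member_size, the variable template size_of_alphabet and the example types coin and die are all hypothetical, and the seqan3::custom step (option 2) is left out for brevity.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

namespace sketch
{
// Detects a free function alphabet_size(T const &) reachable via argument-dependent lookup.
template <typename T, typename = void>
struct has_free_size : std::false_type {};

template <typename T>
struct has_free_size<T, std::void_t<decltype(alphabet_size(std::declval<T const &>()))>>
    : std::true_type {};

// Detects a static data member T::alphabet_size.
template <typename T, typename = void>
struct has_member_size : std::false_type {};

template <typename T>
struct has_member_size<T, std::void_t<decltype(T::alphabet_size)>> : std::true_type {};

// Tries the free function first and falls back to the static member.
template <typename T>
constexpr std::size_t size_of_alphabet_impl() noexcept
{
    if constexpr (has_free_size<T>::value)
    {
        return static_cast<std::size_t>(alphabet_size(T{})); // the argument value is ignored
    }
    else
    {
        static_assert(has_member_size<T>::value, "type provides no alphabet size");
        return static_cast<std::size_t>(T::alphabet_size);
    }
}

template <typename T>
constexpr std::size_t size_of_alphabet = size_of_alphabet_impl<T>();
} // namespace sketch

namespace lib_a
{
struct coin {}; // provides the size as a free function (option 1)
constexpr std::size_t alphabet_size(coin const &) noexcept { return 2; }
} // namespace lib_a

namespace lib_b
{
struct die // provides the size as a static data member (option 3)
{
    static constexpr std::size_t alphabet_size = 6;
};
} // namespace lib_b

static_assert(sketch::size_of_alphabet<lib_a::coin> == 2);
static_assert(sketch::size_of_alphabet<lib_b::die> == 6);
```

The free-function branch default-constructs T in a constant expression, which hints at why the note above requires an alternative overload for types that are not constexpr default constructible.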
constexpr auto seqan3::assign_char_strictly_to = detail::adl::only::assign_char_strictly_to_fn{} [inline]
Assign a character to an alphabet object, throw if the character is not valid.
your_type | Type of the target object. |
chr | The character being assigned; must be of the seqan3::alphabet_char_t of the target object. |
alph | The target object; its type must model seqan3::Alphabet. |
Returns a reference to alph if alph was given as lvalue, otherwise a copy.
Throws:
seqan3::invalid_char_assignment | If seqan3::char_is_valid_for<decltype(alph)>(chr) == false. |
This is a function object. Invoke it with the parameters specified above.
Note that this is not a customisation point and it cannot be "overloaded". It simply invokes seqan3::char_is_valid_for and seqan3::assign_char_to.
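The "validate, then assign" behaviour described above can be sketched in plain C++ as follows. binary_letter and assign_char_strictly are hypothetical names, the sketch calls member functions directly instead of going through the function objects, and std::invalid_argument merely stands in for seqan3::invalid_char_assignment.

```cpp
#include <stdexcept>

// Hypothetical two-letter alphabet used only for this sketch.
struct binary_letter
{
    bool value = false;

    static constexpr bool char_is_valid(char const c) noexcept { return c == '0' || c == '1'; }

    constexpr binary_letter & assign_char(char const c) noexcept
    {
        value = (c == '1'); // every character other than '1' is treated as '0'
        return *this;
    }

    constexpr char to_char() const noexcept { return value ? '1' : '0'; }
};

// Check validity first, then delegate to the non-throwing assignment.
template <typename alphabet_t>
alphabet_t & assign_char_strictly(char const c, alphabet_t & letter)
{
    if (!alphabet_t::char_is_valid(c))
        throw std::invalid_argument{"character is not valid for this alphabet"};
    return letter.assign_char(c);
}
```

The target object is left untouched when the check fails, because the exception is thrown before any assignment takes place.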
constexpr auto seqan3::assign_char_to = detail::adl::only::assign_char_to_fn{} [inline]
Assign a character to an alphabet object.
your_type | Type of the target object. |
chr | The character being assigned; must be of the seqan3::alphabet_char_t of the target object. |
alph | The target object; its type must model seqan3::Alphabet. |
Returns a reference to alph if alph was given as lvalue, otherwise a copy.
This is a function object. Invoke it with the parameter(s) specified above.
It acts as a wrapper and looks for three possible implementations (in this order):
1. A free function assign_char_to(char_type const chr, your_type & a) in the namespace of your type (or as friend). The function must be marked noexcept (constexpr is not required, but recommended) and the return type must be your_type &.
2. A free function assign_char_to(char_type const chr, your_type & a) in namespace seqan3::custom. The same restrictions apply as above.
3. A member function called assign_char(char_type const chr) (not assign_char_to). It must be marked noexcept (constexpr is not required, but recommended) and the return type must be your_type &.
Every alphabet type must provide one of the above. Note that temporaries of your_type are handled by this function object and do not require an additional overload.
For an example of a full alphabet definition with free function implementations (solution 1. above), see seqan3::Alphabet.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.
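As a sketch of option 1, the following plain C++ type defines assign_char_to as a friend and checks the two stated requirements (noexcept, return type your_type &) at compile time. The namespace my_lib and the type flag are hypothetical, and the calls below use ordinary argument-dependent lookup rather than the seqan3::assign_char_to function object.

```cpp
#include <type_traits>
#include <utility>

namespace my_lib
{
// Hypothetical two-letter alphabet with friend functions instead of public members.
class flag
{
public:
    // Found by argument-dependent lookup because it is a friend of flag.
    friend constexpr flag & assign_char_to(char const chr, flag & a) noexcept
    {
        a.set_ = (chr == '1');
        return a;
    }

    friend constexpr char to_char(flag const a) noexcept { return a.set_ ? '1' : '0'; }

private:
    bool set_ = false;
};
} // namespace my_lib

// The function is noexcept and returns a reference to the target type.
static_assert(noexcept(assign_char_to('1', std::declval<my_lib::flag &>())));
static_assert(std::is_same_v<decltype(assign_char_to('1', std::declval<my_lib::flag &>())),
                             my_lib::flag &>);
```

Returning the target by reference allows calls to be chained, e.g. to_char(assign_char_to('1', f)).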
constexpr auto seqan3::assign_rank_to = detail::adl::only::assign_rank_to_fn{} [inline]
Assign a rank to an alphabet object.
your_type | Type of the target object. |
chr | The rank being assigned; must be of the seqan3::alphabet_rank_t of the target object. |
alph | The target object. |
Returns a reference to alph if alph was given as lvalue, otherwise a copy.
This is a function object. Invoke it with the parameter(s) specified above.
It acts as a wrapper and looks for three possible implementations (in this order):
1. A free function assign_rank_to(rank_type const chr, your_type & a) in the namespace of your type (or as friend). The function must be marked noexcept (constexpr is not required, but recommended) and the return type must be your_type &.
2. A free function assign_rank_to(rank_type const chr, your_type & a) in namespace seqan3::custom. The same restrictions apply as above.
3. A member function called assign_rank(rank_type const chr) (not assign_rank_to). It must be marked noexcept (constexpr is not required, but recommended) and the return type must be your_type &.
Every (semi-)alphabet type must provide one of the above. Note that temporaries of your_type are handled by this function object and do not require an additional overload.
For an example of a full alphabet definition with free function implementations (solution 1. above), see seqan3::Alphabet.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.
template<typename alph_t >
constexpr auto seqan3::char_is_valid_for = detail::adl::only::char_is_valid_for_fn<alph_t>{} [inline]
Returns whether a character is in the valid set of a seqan3::Alphabet (usually implies a bijective mapping to an alphabet value).
your_type | The alphabet type being queried. |
chr | The character being checked; must be convertible to seqan3::alphabet_char_t<your_type>. |
alph | The target object; its type must model seqan3::Alphabet. |
Returns true or false.
This is a function object. Invoke it with the parameter(s) specified above.
It acts as a wrapper and looks for three possible implementations (in this order):
1. A free function char_is_valid_for(char_type const chr, your_type const &) in the namespace of your type (or as friend). The function must be marked noexcept (constexpr is not required, but recommended) and the return type must be bool. The value of the second argument to the function shall be ignored; it is only used to select the function via argument-dependent lookup.
2. A free function char_is_valid_for(char_type const chr, your_type const &) in namespace seqan3::custom. The same restrictions apply as above.
3. A static member function called char_is_valid(char_type) (not char_is_valid_for). It must be marked noexcept (constexpr is not required, but recommended) and the return type must be bool.
An alphabet type may provide one of the above. If none is provided, this function will declare every character c as valid for which it holds that seqan3::to_char(seqan3::assign_char_to(c, alph_t{})) == c, i.e. converting back and forth results in the same value.
Note that if the alphabet type with cvref removed is not std::is_nothrow_default_constructible, this function object will instead look for char_is_valid_for(char_type const chr, std::type_identity<your_type> const &) with the same semantics. In that case the "fallback" above also does not work and you are required to provide such an implementation.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.
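The round-trip fallback described above can be sketched in plain C++. mini_dna4 and round_trip_valid are hypothetical names, and the sketch uses member functions directly rather than seqan3::to_char and seqan3::assign_char_to; accepting lower-case input and mapping unknown characters to 'A' are assumptions of this example.

```cpp
// Hypothetical four-letter alphabet that accepts lower-case input as well.
struct mini_dna4
{
    unsigned char rank = 0;

    constexpr mini_dna4 & assign_char(char const c) noexcept
    {
        switch (c)
        {
            case 'C': case 'c': rank = 1; break;
            case 'G': case 'g': rank = 2; break;
            case 'T': case 't': rank = 3; break;
            default:            rank = 0; break; // everything else becomes 'A'
        }
        return *this;
    }

    constexpr char to_char() const noexcept { return "ACGT"[rank]; }
};

// A character counts as valid if converting it to a letter and back yields the same character.
template <typename alphabet_t>
constexpr bool round_trip_valid(char const c) noexcept
{
    return alphabet_t{}.assign_char(c).to_char() == c;
}

static_assert(round_trip_valid<mini_dna4>('G'));  // 'G' -> rank 2 -> 'G'
static_assert(!round_trip_valid<mini_dna4>('g')); // 'g' -> rank 2 -> 'G', which differs from 'g'
static_assert(!round_trip_valid<mini_dna4>('N')); // 'N' -> rank 0 -> 'A', which differs from 'N'
```

Under this rule a character that is accepted by the assignment but converted to another character (such as 'g' here) is not reported as valid, which matches the "bijective mapping" wording above.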
constexpr auto seqan3::to_char = detail::adl::only::to_char_fn{} [inline]
Return the char representation of an alphabet object.
your_type | Type of the argument. |
alph | The alphabet object. |
Returns the char representation of alph (typically of type char).
This is a function object. Invoke it with the parameter(s) specified above.
It acts as a wrapper and looks for three possible implementations (in this order):
1. A free function to_char(your_type const a) in the namespace of your type (or as friend). The function must be marked noexcept (constexpr is not required, but recommended) and the return type must be of the respective char representation (usually a small integral type).
2. A free function to_char(your_type const a) in namespace seqan3::custom. The same restrictions apply as above.
3. A member function called to_char(). It must be marked noexcept (constexpr is not required, but recommended) and the return type must be of the respective char representation.
Every alphabet type must provide one of the above.
For an example of a full alphabet definition with free function implementations (solution 1. above), see seqan3::Alphabet.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.
constexpr auto seqan3::to_rank = detail::adl::only::to_rank_fn{} [inline]
Return the rank representation of a (semi-)alphabet object.
your_type | Type of the argument. |
alph | The (semi-)alphabet object. |
This is a function object. Invoke it with the parameter(s) specified above.
It acts as a wrapper and looks for three possible implementations (in this order):
1. A free function to_rank(your_type const a) in the namespace of your type (or as friend). The function must be marked noexcept (constexpr is not required, but recommended) and the return type must be of the respective rank representation (usually a small integral type).
2. A free function to_rank(your_type const a) in namespace seqan3::custom. The same restrictions apply as above.
3. A member function called to_rank(). It must be marked noexcept (constexpr is not required, but recommended) and the return type must be of the respective rank representation.
Every (semi-)alphabet type must provide one of the above.
For an example of a full alphabet definition with free function implementations (solution 1. above), see seqan3::Alphabet.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.