SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
Loading...
Searching...
No Matches
How to write your own alphabet

This HowTo documents how to write a custom alphabet that can be used with the algorithms and data structures in SeqAn.

DifficultyModerate
Duration45 min
Prerequisite tutorialsC++ Concepts, Alphabets in SeqAn
Recommended readingAlphabet

Motivation

In the alphabet tutorial you have learned that alphabets are a core component in SeqAn and represent the smallest unit of biological sequence data. We introduced the common alphabets for nucleotide, amino acid and gap as well as structure and quality annotation. However, in your SeqAn application you may want to implement a custom alphabet in order to work efficiently with SeqAn's algorithms. To achieve this, your custom alphabet must meet certain requirements, which are defined in concepts.

For detailed information on the alphabet concepts please read the Alphabet page. In the following sections we demonstrate how to write an alphabet that models them.

A brief summary of the concepts used in this HowTo:

  • seqan3::semialphabet requires your type to have a numerical representation (rank), as well as an alphabet size and comparison operators.
  • seqan3::writable_semialphabet additionally requires being able to change the value of the object via the rank representation.
  • seqan3::alphabet requires that your type has a visual representation in addition to the numerical representation. Usually this is a character type like char.
  • seqan3::writable_alphabet additionally requires being able to change the value of the object via the char representation.

Step by step: Create your own alphabet

In the alphabet tutorial we have calculated the GC content of a nucleotide sequence. Guanine and Cytosine are complementary nucleobases, which pair in a DNA molecule by building 3 hydrogen bonds. Adenine and Thymine pair with only 2 hydrogen bonds. As a consequence, we denote Guanine and Cytosine as strong (S) and Adenine and Thymine as weak (W) nucleobases. In this section we want to implement a seqan3::alphabet that consists of the characters S and W to represent strong and weak nucleobases.

Let's start with a simple struct that only holds the alphabet's numerical representation, namely the rank value. It is not specified how the value of an alphabet is stored internally, however alphabets typically store the rank as a member variable.

#include <seqan3/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
};
Core alphabet concept and free function/type trait wrappers.
Note
The type of the member variable is typically the smallest unsigned integer type that can hold the maximum rank. In our case we have only two values so bool would be sufficient. However bool is not actually smaller than uint8_t and does not always behave like an unsigned integral type so we use uint8_t here. You only need a different type if your alphabet has an alphabet size >=255.

If you want SeqAn's algorithms to accept it as an alphabet, you need to make sure that your type satisfies the requirements of the seqan3::alphabet concept. A quick check can reveal that this is not the case:

static_assert(seqan3::alphabet<dna2> == false); // NOT an alphabet
The generic alphabet concept that covers most data types used in ranges.
Note
A static_assert() will only tell you whether the expression is true or not. If you want the compiler to tell you why the concept is not modelled, you can construct a dummy requirement like this:
template <seqan3::alphabet check_this_type>
void foo()
{}
int main()
{
foo<dna2>();
}
This will fail with a (slightly long and need-to-get-used-to) error message telling you that foo() cannot be called with dna2 because constraints are not satisfied ....

Prerequisites

A look at the documentation of seqan3::alphabet will reveal that it is actually a refinement of other concepts, more precisely seqan3::semialphabet which in turn refines std::copy_constructible and std::totally_ordered. Let's check those:

static_assert(std::copy_constructible<dna2>); // ok
static_assert(std::totally_ordered<dna2> == false); // NO comparison operators
static_assert(seqan3::semialphabet<dna2> == false); // NOT a semialphabet
static_assert(seqan3::alphabet<dna2> == false); // NOT an alphabet
The basis for seqan3::alphabet, but requires only rank interface (not char).

You should see that your type models only the std::copy_constructible concept. Let's have a look at the documentation for std::totally_ordered. Can you guess what it describes?

It describes the requirements for types that are comparable via <, <=, > and >=. This is useful so that you can sort a text over the alphabet, for example. Additionally, std::totally_ordered is again a refinement of std::equality_comparable, which requires the == and != operators, so let's first implement those.

#include <seqan3/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
// Equality operator
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
};

As you can see we chose to implement them as friend functions. You can also implement them as free functions in the namespace of your class, but you should avoid regular member functions, because they result in different conversion rules for left-hand-side and right-hand-side.

Excercise

Implement the inequality operator (!=) for dna2.
Solution

#include <seqan3/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
// Equality and inequality operators
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
friend bool operator!=(dna2 const & lhs, dna2 const & rhs) noexcept
{
return !(lhs == rhs);
}
};
static_assert(std::equality_comparable<dna2>); // ok
T operator!=(T... args)

We see that our type now models std::equality_comparable, which is a prerequisite of std::totally_ordered.

Excercise

Implement the four comparison operators and verify that your type models std::totally_ordered.
Solution

#include <seqan3/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
// Equality and inequality operators
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
friend bool operator!=(dna2 const & lhs, dna2 const & rhs) noexcept
{
return !(lhs == rhs);
}
// Comparison operators
friend bool operator<(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank < rhs.rank;
}
friend bool operator<=(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank <= rhs.rank;
}
friend bool operator>(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank > rhs.rank;
}
friend bool operator>=(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank >= rhs.rank;
}
};
static_assert(std::equality_comparable<dna2>); // ok
static_assert(std::totally_ordered<dna2>); // ok

semialphabet

Let's move on to the more interesting concepts. seqan3::semialphabet constitutes the rank interface that we introduced in the alphabet tutorial. Have a look at the API reference again. Beyond the conceptional requirements, it also requires that seqan3::alphabet_size and seqan3::to_rank can be called on your alphabet.

There are different ways to satisfy seqan3::alphabet_size and seqan3::to_rank, have a look at the respective API reference and also the documentation on customisation points.

In this case we choose to implement the functionality as member functions:

#include <cassert>
#include <seqan3/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
// semialphabet
static constexpr size_t alphabet_size = 2;
uint8_t to_rank() const noexcept
{
return rank;
}
dna2 & assign_rank(uint8_t const rk) noexcept
{
assert(rk < alphabet_size);
rank = rk;
return *this;
}
// Equality and inequality operators ...
// Comparison operators ...
};

As you can see from the static_assert our dna2 alphabet now models seqan3::semialphabet and seqan3::writable_semialphabet:

static_assert(seqan3::semialphabet<dna2>); // ok
A refinement of seqan3::semialphabet that adds assignability.

You can also try out the customisation points directly (they appear like free functions). Here is an example:

int main()
{
dna2 chr{};
seqan3::assign_rank_to(1, chr); // chr is assigned rank 1
uint8_t rnk = seqan3::to_rank(chr); // query rank value
std::cout << static_cast<uint16_t>(rnk) << '\n'; // 1
}
constexpr auto assign_rank_to
Assign a rank to an alphabet object.
Definition alphabet/concept.hpp:288
constexpr auto to_rank
Return the rank representation of a (semi-)alphabet object.
Definition alphabet/concept.hpp:152

alphabet

Now that you have a feeling for concepts, have a look at seqan3::alphabet and seqan3::writable_alphabet and make your type also model these concepts.

Excercise

  1. Implement the member function char to_char() which returns the char 'S' or 'W' depending on the rank value.
  2. Implement the member function dna2 & assign_char(char const).
  3. There is a function object that tells us which characters are valid for a given alphabet: seqan3::char_is_valid_for. By default, all characters are "valid" that are preserved when being assigned from and then be converted back. But in some cases you want to change the default behaviour, e.g. declaring lower-case letters to be valid, as well.
    Optional task: Implement the static member function bool char_is_valid(char const) in order to allow also 's' and 'w' as valid characters.
Solution
#include <cassert>
#include <seqan3/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
// semialphabet
static constexpr size_t alphabet_size = 2;
uint8_t to_rank() const noexcept
{
return rank;
}
dna2 & assign_rank(uint8_t const rk) noexcept
{
assert(rk < alphabet_size);
rank = rk;
return *this;
}
// alphabet
char to_char() const noexcept
{
// map 0 => 'S' and 1 => 'W'
char const rank_to_char[2]{'S', 'W'};
return rank_to_char[rank];
}
dna2 & assign_char(char const ch) noexcept
{
switch (ch)
{
case 'W':
case 'w':
rank = 1;
break; // allow assignment from uppercase and lowercase
default:
rank = 0; // unknown characters are mapped to 0 (=> 'S')
}
return *this;
}
// Optional: Can be omitted.
static bool char_is_valid(char const ch) noexcept
{
return (ch == dna2{}.assign_char(ch).to_char());
}
// Equality and inequality operators
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
friend bool operator!=(dna2 const & lhs, dna2 const & rhs) noexcept
{
return !(lhs == rhs);
}
// Comparison operators
friend bool operator<(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank < rhs.rank;
}
friend bool operator<=(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank <= rhs.rank;
}
friend bool operator>(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank > rhs.rank;
}
friend bool operator>=(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank >= rhs.rank;
}
};
constexpr auto to_char
Return the char representation of an alphabet object.
Definition alphabet/concept.hpp:381

At this point the seqan3::alphabet concept should be modelled successfully and even seqan3::writable_alphabet is fine because we implemented assign_char.

static_assert(seqan3::alphabet<dna2>); // ok
static_assert(seqan3::writable_alphabet<dna2>); // ok
Refines seqan3::alphabet and adds assignability.

Shortcut: alphabet base template

Often it is not required to implement the entire class yourself, instead you can derive from seqan3::alphabet_base which defines most things for you if you provide certain conversion tables. Read the documentation of seqan3::alphabet_base for details. This implementation deduces bool as the smallest possible rank type.

#include <array> // std::array
#include <seqan3/alphabet/alphabet_base.hpp> // alphabet_base
#include <seqan3/alphabet/concept.hpp> // alphabet concept checks
#include <seqan3/utility/char_operations/transform.hpp> // seqan3::to_lower
// derive from alphabet_base
struct dna2 : public seqan3::alphabet_base<dna2, 2>
{
private:
// make the base class a friend so it can access the tables:
// map 0 => 'S' and 1 => 'W'
static constexpr char_type rank_to_char(rank_type const rank)
{
// via a lookup table
return rank_to_char_table[rank];
// or via an arithmetic expression
return rank == 1 ? 'W' : 'S';
}
static constexpr rank_type char_to_rank(char_type const chr)
{
// via a lookup table
return char_to_rank_table[static_cast<index_t>(chr)];
// or via an arithmetic expression
return seqan3::to_lower(chr) == 'w' ? 1 : 0;
}
private:
// === lookup-table implementation detail ===
// map 0 => 'S' and 1 => 'W'
static constexpr char_type rank_to_char_table[alphabet_size]{'S', 'W'};
static constexpr std::array<rank_type, 256> char_to_rank_table{
// initialise with an immediately evaluated lambda expression:
[]() constexpr
{
std::array<rank_type, 256> ret{}; // initialise all values with 0 (=> 'S')
ret['W'] = 1; // only 'W' and 'w' result in rank 1
ret['w'] = 1;
return ret;
}()};
};
// check the concepts
static_assert(seqan3::alphabet<dna2>); // ok
static_assert(seqan3::writable_alphabet<dna2>); // ok
Provides seqan3::alphabet_base.
A CRTP-base that makes defining a custom alphabet easier.
Definition alphabet_base.hpp:54
constexpr auto alphabet_size
A type trait that holds the size of a (semi-)alphabet.
Definition alphabet/concept.hpp:834
constexpr char_type to_lower(char_type const c) noexcept
Converts 'A'-'Z' to 'a'-'z' respectively; other characters are returned as is.
Definition transform.hpp:74
Provides utilities for modifying characters.

Further examples

Implementation as enum class

This is an example of a minimal custom alphabet that provides implementations for all necessary customisation points.

As an enum class the values already have an order and therefore the class models std::totally_ordered without defining the (in)equality and comparison operators. Opposed to the examples above, we use only free functions to model the functionality.

#include <cstddef> // for size_t
#include <seqan3/alphabet/concept.hpp> // for seqan3::alphabet
namespace my_namespace
{
enum class my_alph
{
ZERO,
ONE,
TWO
};
constexpr size_t alphabet_size(my_alph const &) noexcept
{
return 3;
}
constexpr size_t to_rank(my_alph const a) noexcept
{
return static_cast<size_t>(a);
}
constexpr my_alph & assign_rank_to(size_t const r, my_alph & a) noexcept
{
switch (r)
{
case 0:
a = my_alph::ZERO;
return a;
case 1:
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
constexpr char to_char(my_alph const a) noexcept
{
switch (a)
{
case my_alph::ZERO:
return '0';
case my_alph::ONE:
return '1';
default:
return '2';
}
}
constexpr my_alph & assign_char_to(char const c, my_alph & a) noexcept
{
switch (c)
{
case '0':
a = my_alph::ZERO;
return a;
case '1':
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
} // namespace my_namespace
constexpr auto assign_char_to
Assign a character to an alphabet object.
Definition alphabet/concept.hpp:517

Adaptation of a third party type

This example is similar to the previous one, but assuming that you cannot add anything to the namespace of the type that you wish to adapt. In that case, you need to specialise the seqan3::custom::alphabet class template and provide the required functionality as static members.

#include <cstddef> // for size_t
#include <seqan3/alphabet/concept.hpp> // for seqan3::alphabet
// this is from some other library:
namespace third_party_ns
{
enum class third_party_type
{
zero,
one,
two
};
} // namespace third_party_ns
// ------------------------------------------------------------------------------------
// this is in your code (no namespace):
template <>
struct seqan3::custom::alphabet<third_party_ns::third_party_type>
{
using alphabet_t = third_party_ns::third_party_type;
static constexpr size_t alphabet_size = 3;
static constexpr size_t to_rank(alphabet_t const a) noexcept
{
return static_cast<size_t>(a);
}
static constexpr alphabet_t & assign_rank_to(size_t const r, alphabet_t & a) noexcept
{
switch (r)
{
case 0:
a = alphabet_t::zero;
return a;
case 1:
a = alphabet_t::one;
return a;
default:
a = alphabet_t::two;
return a;
}
}
static constexpr char to_char(alphabet_t const a) noexcept
{
switch (a)
{
case alphabet_t::zero:
return '0';
case alphabet_t::one:
return '1';
default:
return '2';
}
}
static constexpr alphabet_t & assign_char_to(char const c, alphabet_t & a) noexcept
{
switch (c)
{
case '0':
a = alphabet_t::zero;
return a;
case '1':
a = alphabet_t::one;
return a;
default:
a = alphabet_t::two;
return a;
}
}
};
A type that can be specialised to provide customisation point implementations so that third party typ...
Definition alphabet/concept.hpp:46

Implementation of a non-default-constructible class

This is an example of a custom alphabet that is not default-constructible and that has a non-default overload for seqan3::char_is_valid_for.

Please note that for the overloads of seqan3::alphabet_size and seqan3::char_is_valid_for our alphabet type has to be wrapped into std::type_identity<> to be recognised by the customisation point objects, because our type does not model std::is_nothrow_default_constructible after we have deleted the default constructor.

With the overload of seqan3::char_is_valid_for we allow assignment to the underlying boolean type from '1', 't' and 'T' for value 1 as well as from '0', 'f' and 'F' for value 0.

#include <cstddef> // for size_t
#include <type_traits> // for std::type_identity
#include <seqan3/alphabet/concept.hpp> // for seqan3::alphabet
namespace my_namespace
{
class my_alph
{
public:
bool rank;
my_alph() = delete;
constexpr my_alph(my_alph const &) = default;
constexpr my_alph & operator=(my_alph const &) = default;
constexpr my_alph(bool rank) : rank{rank}
{}
constexpr friend bool operator==(my_alph lhs, my_alph rhs)
{
return lhs.rank == rhs.rank;
}
constexpr friend bool operator!=(my_alph lhs, my_alph rhs)
{
return lhs.rank != rhs.rank;
}
constexpr friend bool operator<=(my_alph lhs, my_alph rhs)
{
return lhs.rank <= rhs.rank;
}
constexpr friend bool operator>=(my_alph lhs, my_alph rhs)
{
return lhs.rank >= rhs.rank;
}
constexpr friend bool operator<(my_alph lhs, my_alph rhs)
{
return lhs.rank < rhs.rank;
}
constexpr friend bool operator>(my_alph lhs, my_alph rhs)
{
return lhs.rank > rhs.rank;
}
};
constexpr size_t alphabet_size(std::type_identity<my_alph> const &) noexcept
{
return 2;
}
constexpr bool to_rank(my_alph const a) noexcept
{
return a.rank;
}
constexpr my_alph & assign_rank_to(bool const r, my_alph & a) noexcept
{
a.rank = r;
return a;
}
constexpr char to_char(my_alph const a) noexcept
{
if (a.rank)
return '1';
else
return '0';
}
constexpr my_alph & assign_char_to(char const c, my_alph & a) noexcept
{
switch (c)
{
case '0':
case 'F':
case 'f':
a.rank = 0;
return a;
default:
a.rank = 1;
return a;
}
}
constexpr bool char_is_valid_for(char const c, std::type_identity<my_alph> const &) noexcept
{
switch (c)
{
case '0':
case 'F':
case 'f':
case '1':
case 'T':
case 't':
return true;
default:
return false;
}
}
} // namespace my_namespace
static_assert(seqan3::alphabet_size<my_namespace::my_alph> == 2);
static_assert(seqan3::char_is_valid_for<my_namespace::my_alph>('T'));
static_assert(!seqan3::char_is_valid_for<my_namespace::my_alph>('!'));
constexpr auto char_is_valid_for
Returns whether a character is in the valid set of a seqan3::alphabet (usually implies a bijective ma...
Definition alphabet/concept.hpp:661
Note
You should really make your alphabet types no-throw-default-constructible if you can!
Hide me