This HowTo documents how to write a custom alphabet that can be used with the algorithms and data structures in SeqAn.
Motivation
In the alphabet tutorial you have learned that alphabets are a core component in SeqAn and represent the smallest unit of biological sequence data. We introduced the common alphabets for nucleotide, amino acid and gap as well as structure and quality annotation. However, in your SeqAn application you may want to implement a custom alphabet in order to work efficiently with SeqAn's algorithms. To achieve this, your custom alphabet must meet certain requirements, which are defined in concepts.
For detailed information on the alphabet concepts please read the Alphabet page. In the following sections we demonstrate how to write an alphabet that models them.
A brief summary of the concepts used in this HowTo:
- seqan3::semialphabet requires your type to have a numerical representation (rank), as well as an alphabet size and comparison operators.
- seqan3::writable_semialphabet additionally requires being able to change the value of the object via the rank representation.
- seqan3::alphabet requires that your type has a visual representation in addition to the numerical representation. Usually this is a character type like char.
- seqan3::writable_alphabet additionally requires being able to change the value of the object via the char representation.
Step by step: Create your own alphabet
In the alphabet tutorial we have calculated the GC content of a nucleotide sequence. Guanine and Cytosine are complementary nucleobases, which pair in a DNA molecule by building 3 hydrogen bonds. Adenine and Thymine pair with only 2 hydrogen bonds. As a consequence, we denote Guanine and Cytosine as strong (S) and Adenine and Thymine as weak (W) nucleobases. In this section we want to implement a seqan3::alphabet that consists of the characters S
and W
to represent strong and weak nucleobases.
Let's start with a simple struct that only holds the alphabet's numerical representation, namely the rank value. It is not specified how the value of an alphabet is stored internally, however alphabets typically store the rank as a member variable.
struct dna2
{
uint8_t rank{};
};
Core alphabet concept and free function/type trait wrappers.
- Note
- The type of the member variable is typically the smallest unsigned integer type that can hold the maximum rank. In our case we have only two values so
bool
would be sufficient. However bool
is not actually smaller than uint8_t
and does not always behave like an unsigned integral type so we use uint8_t
here. You only need a different type if your alphabet has an alphabet size >=255.
If you want SeqAn's algorithms to accept it as an alphabet, you need to make sure that your type satisfies the requirements of the seqan3::alphabet concept. A quick check can reveal that this is not the case:
A "pretty printer" for most SeqAn data structures and related types.
Definition debug_stream_type.hpp:79
- Note
- A
static_assert()
will only tell you whether the expression is true or not. If you want the compiler to tell you why the concept is not modelled, you can construct a dummy requirement like this: template <seqan3::alphabet check_this_type>
void foo()
{}
int main()
{
foo<dna2>();
}
This will fail with a (slightly long and need-to-get-used-to) error message telling you that foo()
cannot be called with dna2
because constraints are not satisfied ...
.
Prerequisites
A look at the documentation of seqan3::alphabet will reveal that it is actually a refinement of other concepts, more precisely seqan3::semialphabet which in turn refines std::copy_constructible and std::totally_ordered. Let's check those:
static_assert(std::copy_constructible<dna2>);
static_assert(std::totally_ordered<dna2> == false);
You should see that your type models only the std::copy_constructible concept. Let's have a look at the documentation for std::totally_ordered. Can you guess what it describes?
It describes the requirements for types that are comparable via <
, <=
, >
and >=
. This is useful so that you can sort a text over the alphabet, for example. Additionally, std::totally_ordered is again a refinement of std::equality_comparable, which requires the ==
and !=
operators, so let's first implement those.
struct dna2
{
uint8_t rank{};
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
};
As you can see we chose to implement them as friend functions. You can also implement them as free functions in the namespace of your class, but you should avoid regular member functions, because they result in different conversion rules for left-hand-side and right-hand-side.
Excercise
Implement the inequality operator (!=) for dna2
.
Solution
struct dna2
{
uint8_t rank{};
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
friend bool operator!=(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return !(lhs == rhs);
}
};
static_assert(std::equality_comparable<dna2>);
We see that our type now models std::equality_comparable, which is a prerequisite of std::totally_ordered.
Excercise
Implement the four comparison operators and verify that your type models std::totally_ordered.
Solution
struct dna2
{
uint8_t rank{};
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
friend bool operator!=(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return !(lhs == rhs);
}
friend bool operator<(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank < rhs.rank;
}
friend bool operator<=(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return lhs.rank <= rhs.rank;
}
friend bool operator>(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return lhs.rank > rhs.rank;
}
friend bool operator>=(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return lhs.rank >= rhs.rank;
}
};
static_assert(std::equality_comparable<dna2>);
static_assert(std::totally_ordered<dna2>);
semialphabet
Let's move on to the more interesting concepts. seqan3::semialphabet constitutes the rank interface that we introduced in the alphabet tutorial. Have a look at the API reference again. Beyond the conceptional requirements, it also requires that seqan3::alphabet_size and seqan3::to_rank can be called on your alphabet.
There are different ways to satisfy seqan3::alphabet_size and seqan3::to_rank, have a look at the respective API reference and also the documentation on customisation points.
In this case we choose to implement the functionality as member functions:
struct dna2
{
uint8_t rank{};
static constexpr size_t alphabet_size = 2;
uint8_t to_rank() const noexcept
{
return rank;
}
dna2 & assign_rank(uint8_t const rk) noexcept
{
assert(rk < alphabet_size);
rank = rk;
return *this;
}
};
As you can see from the static_assert
our dna2 alphabet now models seqan3::semialphabet and seqan3::writable_semialphabet:
You can also try out the customisation points directly (they appear like free functions). Here is an example:
int main()
{
dna2 chr{};
std::cout << static_cast<uint16_t>(rnk) << '\n';
}
constexpr auto assign_rank_to
Assign a rank to an alphabet object.
Definition alphabet/concept.hpp:288
constexpr auto to_rank
Return the rank representation of a (semi-)alphabet object.
Definition alphabet/concept.hpp:152
alphabet
Now that you have a feeling for concepts, have a look at seqan3::alphabet and seqan3::writable_alphabet and make your type also model these concepts.
Excercise
- Implement the member function
char to_char()
which returns the char 'S' or 'W' depending on the rank value.
- Implement the member function
dna2 & assign_char(char const)
.
- There is a function object that tells us which characters are valid for a given alphabet: seqan3::char_is_valid_for. By default, all characters are "valid" that are preserved when being assigned from and then be converted back. But in some cases you want to change the default behaviour, e.g. declaring lower-case letters to be valid, as well.
Optional task: Implement the static member function bool char_is_valid(char const)
in order to allow also 's' and 'w' as valid characters.
Solution
struct dna2
{
uint8_t rank{};
static constexpr size_t alphabet_size = 2;
uint8_t to_rank() const noexcept
{
return rank;
}
dna2 & assign_rank(uint8_t const rk) noexcept
{
assert(rk < alphabet_size);
rank = rk;
return *this;
}
{
char const rank_to_char[2]{'S', 'W'};
return rank_to_char[rank];
}
dna2 & assign_char(char const ch) noexcept
{
switch (ch)
{
case 'W':
case 'w':
rank = 1;
break;
default:
rank = 0;
}
return *this;
}
static bool char_is_valid(char const ch) noexcept
{
return (ch == dna2{}.assign_char(ch).
to_char());
}
friend bool operator==(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank == rhs.rank;
}
friend bool operator!=(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return !(lhs == rhs);
}
friend bool operator<(dna2 const & lhs, dna2 const & rhs) noexcept
{
return lhs.rank < rhs.rank;
}
friend bool operator<=(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return lhs.rank <= rhs.rank;
}
friend bool operator>(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return lhs.rank > rhs.rank;
}
friend bool operator>=(dna2
const & lhs, dna2
const & rhs)
noexcept
{
return lhs.rank >= rhs.rank;
}
};
constexpr auto to_char
Return the char representation of an alphabet object.
Definition alphabet/concept.hpp:381
At this point the seqan3::alphabet concept should be modelled successfully and even seqan3::writable_alphabet is fine because we implemented assign_char
.
Shortcut: alphabet base template
Often it is not required to implement the entire class yourself, instead you can derive from seqan3::alphabet_base which defines most things for you if you provide certain conversion tables. Read the documentation of seqan3::alphabet_base for details. This implementation deduces bool
as the smallest possible rank type.
{
private:
static constexpr char_type rank_to_char(rank_type const rank)
{
return rank_to_char_table[rank];
return rank == 1 ? 'W' : 'S';
}
static constexpr rank_type char_to_rank(char_type const chr)
{
return char_to_rank_table[static_cast<index_t>(chr)];
}
private:
static constexpr char_type rank_to_char_table[
alphabet_size]{
'S',
'W'};
[]() constexpr
{
ret['W'] = 1;
ret['w'] = 1;
return ret;
}()};
};
Provides seqan3::alphabet_base.
A CRTP-base that makes defining a custom alphabet easier.
Definition alphabet_base.hpp:54
constexpr auto alphabet_size
A type trait that holds the size of a (semi-)alphabet.
Definition alphabet/concept.hpp:834
constexpr char_type to_lower(char_type const c) noexcept
Converts 'A'-'Z' to 'a'-'z' respectively; other characters are returned as is.
Definition transform.hpp:74
Further examples
Implementation as enum class
This is an example of a minimal custom alphabet that provides implementations for all necessary customisation points.
As an enum class the values already have an order and therefore the class models std::totally_ordered without defining the (in)equality and comparison operators. Opposed to the examples above, we use only free functions to model the functionality.
namespace my_namespace
{
enum class my_alph
{
ZERO,
ONE,
TWO
};
constexpr size_t alphabet_size(my_alph const &) noexcept
{
return 3;
}
constexpr size_t to_rank(my_alph
const a)
noexcept
{
return static_cast<size_t>(a);
}
constexpr my_alph &
assign_rank_to(
size_t const r, my_alph & a)
noexcept
{
switch (r)
{
case 0:
a = my_alph::ZERO;
return a;
case 1:
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
constexpr char to_char(my_alph
const a)
noexcept
{
switch (a)
{
case my_alph::ZERO:
return '0';
case my_alph::ONE:
return '1';
default:
return '2';
}
}
constexpr my_alph &
assign_char_to(
char const c, my_alph & a)
noexcept
{
switch (c)
{
case '0':
a = my_alph::ZERO;
return a;
case '1':
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
}
constexpr auto assign_char_to
Assign a character to an alphabet object.
Definition alphabet/concept.hpp:517
Adaptation of a third party type
This example is similar to the previous one, but assuming that you cannot add anything to the namespace of the type that you wish to adapt. In that case, you need to specialise the seqan3::custom::alphabet class template and provide the required functionality as static members.
namespace third_party_ns
{
enum class third_party_type
{
zero,
one,
two
};
}
template <>
{
using alphabet_t = third_party_ns::third_party_type;
static constexpr size_t to_rank(alphabet_t
const a)
noexcept
{
return static_cast<size_t>(a);
}
static constexpr alphabet_t &
assign_rank_to(
size_t const r, alphabet_t & a)
noexcept
{
switch (r)
{
case 0:
a = alphabet_t::zero;
return a;
case 1:
a = alphabet_t::one;
return a;
default:
a = alphabet_t::two;
return a;
}
}
static constexpr char to_char(alphabet_t
const a)
noexcept
{
switch (a)
{
case alphabet_t::zero:
return '0';
case alphabet_t::one:
return '1';
default:
return '2';
}
}
static constexpr alphabet_t &
assign_char_to(
char const c, alphabet_t & a)
noexcept
{
switch (c)
{
case '0':
a = alphabet_t::zero;
return a;
case '1':
a = alphabet_t::one;
return a;
default:
a = alphabet_t::two;
return a;
}
}
};
A type that can be specialised to provide customisation point implementations so that third party typ...
Definition alphabet/concept.hpp:46
Implementation of a non-default-constructible class
This is an example of a custom alphabet that is not default-constructible and that has a non-default overload for seqan3::char_is_valid_for.
Please note that for the overloads of seqan3::alphabet_size and seqan3::char_is_valid_for our alphabet type has to be wrapped into std::type_identity<>
to be recognised by the customisation point objects, because our type does not model std::is_nothrow_default_constructible after we have deleted the default constructor.
With the overload of seqan3::char_is_valid_for we allow assignment to the underlying boolean type from '1', 't' and 'T' for value 1 as well as from '0', 'f' and 'F' for value 0.
namespace my_namespace
{
class my_alph
{
public:
bool rank;
my_alph() = delete;
constexpr my_alph(my_alph const &) = default;
constexpr my_alph & operator=(my_alph const &) = default;
constexpr my_alph(bool rank) : rank{rank}
{}
constexpr friend bool operator==(my_alph lhs, my_alph rhs)
{
return lhs.rank == rhs.rank;
}
constexpr friend bool operator!=(my_alph lhs, my_alph rhs)
{
return lhs.rank != rhs.rank;
}
constexpr friend bool operator<=(my_alph lhs, my_alph rhs)
{
return lhs.rank <= rhs.rank;
}
constexpr friend bool operator>=(my_alph lhs, my_alph rhs)
{
return lhs.rank >= rhs.rank;
}
constexpr friend bool operator<(my_alph lhs, my_alph rhs)
{
return lhs.rank < rhs.rank;
}
constexpr friend bool operator>(my_alph lhs, my_alph rhs)
{
return lhs.rank > rhs.rank;
}
};
{
return 2;
}
constexpr bool to_rank(my_alph
const a)
noexcept
{
return a.rank;
}
constexpr my_alph &
assign_rank_to(
bool const r, my_alph & a)
noexcept
{
a.rank = r;
return a;
}
constexpr char to_char(my_alph
const a)
noexcept
{
if (a.rank)
return '1';
else
return '0';
}
constexpr my_alph &
assign_char_to(
char const c, my_alph & a)
noexcept
{
switch (c)
{
case '0':
case 'F':
case 'f':
a.rank = 0;
return a;
default:
a.rank = 1;
return a;
}
}
{
switch (c)
{
case '0':
case 'F':
case 'f':
case '1':
case 'T':
case 't':
return true;
default:
return false;
}
}
}
constexpr auto char_is_valid_for
Returns whether a character is in the valid set of a seqan3::alphabet (usually implies a bijective ma...
Definition alphabet/concept.hpp:661
- Note
- You should really make your alphabet types no-throw-default-constructible if you can!