SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
Provides the various quality score types.
Classes
Declaration | Description |
---|---|
class seqan3::phred42 | Quality type for traditional Sanger and modern Illumina Phred scores. |
class seqan3::phred63 | Quality type for traditional Sanger and modern Illumina Phred scores. |
class seqan3::phred68solexa | Quality type for Solexa and deprecated Illumina formats. |
class seqan3::phred94 | Quality type for PacBio Phred scores of HiFi reads. |
class seqan3::phred_base< derived_type, size > | A CRTP-base that refines seqan3::alphabet_base and is used by the quality alphabets. |
class seqan3::qualified< sequence_alphabet_t, quality_alphabet_t > | Joins an arbitrary alphabet with a quality alphabet. |
interface quality_alphabet | A concept that indicates whether an alphabet represents quality scores. |
interface writable_quality_alphabet | A concept that indicates whether a writable alphabet represents quality scores. |
Typedefs
Declaration | Description |
---|---|
template<typename alphabet_type > using seqan3::alphabet_phred_t = decltype(seqan3::to_phred(std::declval< alphabet_type >())) | The phred_type of the alphabet; defined as the return type of seqan3::to_phred. |
using seqan3::dna15q = qualified< dna15, phred42 > | An alphabet that stores a seqan3::dna15 letter and a seqan3::phred42 letter at each position. |
using seqan3::dna4q = qualified< dna4, phred42 > | An alphabet that stores a seqan3::dna4 letter and a seqan3::phred42 letter at each position. |
using seqan3::dna5q = qualified< dna5, phred42 > | An alphabet that stores a seqan3::dna5 letter and a seqan3::phred42 letter at each position. |
using seqan3::rna15q = qualified< rna15, phred42 > | An alphabet that stores a seqan3::rna15 letter and a seqan3::phred42 letter at each position. |
using seqan3::rna4q = qualified< rna4, phred42 > | An alphabet that stores a seqan3::rna4 letter and a seqan3::phred42 letter at each position. |
using seqan3::rna5q = qualified< rna5, phred42 > | An alphabet that stores a seqan3::rna5 letter and a seqan3::phred42 letter at each position. |
Function objects (Quality)
Declaration | Description |
---|---|
constexpr auto seqan3::to_phred = detail::adl_only::to_phred_cpo{} | The public getter function for the Phred representation of a quality score. |
constexpr auto seqan3::assign_phred_to = detail::adl_only::assign_phred_to_cpo{} | Assign a Phred score to a quality alphabet object. |
Quality score sequences are usually output together with the DNA (or RNA) sequence by sequencing machines like the Illumina Genome Analyzer. The quality score of a nucleotide is also known as the Phred score: an integer that is logarithmically related to the probability \(p\) that the base call is incorrect. Roughly speaking, the higher the Phred score, the higher the probability that the corresponding nucleotide is correct at that position. There exist two common variants of its computation:
- Phred (Sanger): \(Q = -10 \log_{10} p\)
- Solexa: \(Q = -10 \log_{10} \frac{p}{1-p}\)
The two variants agree closely for small \(p\) but diverge as \(p\) grows. Thus, although conversion between different quality types is supported, for very low quality levels the scores vary significantly and need to be corrected by an offset before being compared. For easy handling of the Phred score in file formats and console output, it is mapped to a single ASCII character. The sequencing machine (e.g. HiSeq, PacBio) dictates which Phred format is used. DNA sequences and their quality scores are usually stored in the FASTQ format, indicated by the file extensions fastq or fq. This sub-module provides multiple quality alphabets that can be used in combination with regular containers and ranges.
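To make the difference between the two variants concrete, the following worked example evaluates both for a few error probabilities. It assumes the commonly cited definitions \(Q_{\text{Phred}} = -10 \log_{10} p\) and \(Q_{\text{Solexa}} = -10 \log_{10} \frac{p}{1-p}\); the sample probabilities are chosen purely for illustration and the results are rounded to two decimals.

\[
\begin{array}{llll}
p = 0.001: & Q_{\text{Phred}} = -10 \log_{10} 0.001 = 30.00, & Q_{\text{Solexa}} = -10 \log_{10} \frac{0.001}{0.999} \approx 30.00 \\
p = 0.1: & Q_{\text{Phred}} = -10 \log_{10} 0.1 = 10.00, & Q_{\text{Solexa}} = -10 \log_{10} \frac{0.1}{0.9} \approx 9.54 \\
p = 0.5: & Q_{\text{Phred}} = -10 \log_{10} 0.5 \approx 3.01, & Q_{\text{Solexa}} = -10 \log_{10} \frac{0.5}{0.5} = 0.00 \\
p = 0.75: & Q_{\text{Phred}} = -10 \log_{10} 0.75 \approx 1.25, & Q_{\text{Solexa}} = -10 \log_{10} \frac{0.75}{0.25} \approx -4.77
\end{array}
\]

For reliable base calls the two scores are practically identical, whereas for \(p > 0.5\) the Solexa score becomes negative. This is consistent with the Solexa-style alphabet starting at -5 rather than 0.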
Standard Use Case | Format | Encoding | Alphabet Type | Phred Score Range | Rank Range | ASCII Range |
---|---|---|---|---|---|---|
Sanger, Illumina | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred42 | [0 .. 41] | [0 .. 41] | [33 .. 74] ['!' .. 'J'] |
Sanger, Illumina | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred63 | [0 .. 62] | [0 .. 62] | [33 .. 95] ['!' .. '_'] |
PacBio | Sanger, Illumina 1.8+ | Phred+33 | seqan3::phred94 | [0 .. 93] | [0 .. 93] | [33 .. 126] ['!' .. '~'] |
Solexa | Solexa, Illumina [1.0; 1.8[ | Phred+64 | seqan3::phred68solexa | [-5 .. 62] | [0 .. 67] | [59 .. 126] [';' .. '~'] |
The most widely used format is the Sanger or Illumina 1.8+ format. Although typical Phred scores for Illumina machines range from 0 to 41, processed reads may reach higher scores. If you do not intend to handle Phred scores larger than 41, we recommend using seqan3::phred42 due to its more space-efficient implementation (see below). If you want to store PacBio HiFi reads, we recommend using seqan3::phred94, as these use the full range of the Phred quality scores. For other formats, like Solexa and Illumina 1.0 to 1.7, the type seqan3::phred68solexa is provided. To also cover the Solexa format, the Phred score is stored as a signed integer starting at -5.
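The relationship between Phred score, rank and ASCII character in the table above is plain offset arithmetic. The following is a minimal standalone sketch of that arithmetic; it does not use SeqAn3, and `quality_encoding`, the `*_like` constants and the helper functions are hypothetical names introduced only for illustration.

```cpp
// Hypothetical description of one table row (not a SeqAn3 type).
struct quality_encoding
{
    int min_phred;    // smallest representable Phred score
    int max_phred;    // largest representable Phred score
    int ascii_offset; // 33 for Phred+33, 64 for Phred+64
};

// Values copied from the table above.
inline constexpr quality_encoding phred42_like{0, 41, 33};
inline constexpr quality_encoding phred63_like{0, 62, 33};
inline constexpr quality_encoding phred94_like{0, 93, 33};
inline constexpr quality_encoding solexa_like{-5, 62, 64};

// Number of distinct scores in the encoding.
constexpr int alphabet_size(quality_encoding const e) noexcept
{
    return e.max_phred - e.min_phred + 1;
}

// Rank: zero-based position of the score within the encoding.
constexpr int to_rank(quality_encoding const e, int const phred) noexcept
{
    return phred - e.min_phred;
}

// ASCII character as written to FASTQ-style output.
constexpr char to_char(quality_encoding const e, int const phred) noexcept
{
    return static_cast<char>(phred + e.ascii_offset);
}

// Inverse of to_char: recover the Phred score from a character.
constexpr int from_char(quality_encoding const e, char const c) noexcept
{
    return static_cast<int>(c) - e.ascii_offset;
}

// Spot checks against the table: upper bound of phred42 and lower bound of Solexa.
static_assert(to_char(phred42_like, 41) == 'J');
static_assert(to_char(solexa_like, -5) == ';');
static_assert(to_rank(solexa_like, -5) == 0);
```

Note that the same character can mean different scores depending on the encoding: `';'` decodes to 26 under Phred+33 but to -5 under Phred+64, which is why the format has to be known before reading a file.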
The following figure gives a graphical explanation of the different Alphabet Types:
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS....................................................
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM...............................
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
..........................OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|.........................|..............|....................|..............................|
33........................59............73....................95............................126
0_______________________________________40.....................................................
0_______________________________________40____________________62...............................
0_______________________________________40____________________62_____________________________93
.........................-5____0________9____________________________________________________62
S - Sanger, Illumina 1.8+ - phred42
M - Sanger, Illumina 1.8+ - phred63
P - Sanger, Illumina 1.8+ - phred94 (PacBio)
O - Solexa - phred68solexa
Graphic was inspired by https://en.wikipedia.org/wiki/FASTQ_format#Encoding (last access 28.01.2021).
Quality values are usually paired together with nucleotides. Therefore, it stands to reason to combine both alphabets into a new data structure. In SeqAn, this can be done with seqan3::qualified. It represents the cross product between a nucleotide and a quality alphabet and is the go-to choice when compression is of interest.
The following combinations still fit into a single byte:
- seqan3::qualified<seqan3::dna4, seqan3::phred42> (alphabet size: 4 x 42 = 168)
- seqan3::qualified<seqan3::dna4, seqan3::phred63> (alphabet size: 4 x 63 = 252)
- seqan3::qualified<seqan3::dna5, seqan3::phred42> (alphabet size: 5 x 42 = 210)
Using seqan3::qualified can halve the storage usage compared to storing qualities and nucleotides separately. Note that any combination of seqan3::phred94 with another alphabet will cause seqan3::qualified to use at least 2 bytes. While we used DNA alphabets in this example, the same properties hold true for RNA alphabets.
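The size arithmetic behind these statements can be checked with a few lines of standalone C++. This is only a sketch: the helper names are hypothetical, 8-bit bytes are assumed, and the rank layout in `combined_rank` is an assumption made for illustration rather than a statement about how seqan3::qualified stores its value internally.

```cpp
#include <cstddef>

// Size of a combined alphabet: every sequence letter is paired with every quality value.
constexpr std::size_t combined_size(std::size_t const sequence_size, std::size_t const quality_size) noexcept
{
    return sequence_size * quality_size;
}

// Smallest number of whole bytes that can distinguish `size` values (assumes 8-bit bytes).
constexpr std::size_t bytes_needed(std::size_t const size) noexcept
{
    std::size_t bytes = 1;
    std::size_t capacity = 256;
    while (size > capacity)
    {
        capacity *= 256;
        ++bytes;
    }
    return bytes;
}

// Assumed layout for illustration: the sequence rank is the major index.
constexpr std::size_t combined_rank(std::size_t const sequence_rank,
                                    std::size_t const quality_rank,
                                    std::size_t const quality_size) noexcept
{
    return sequence_rank * quality_size + quality_rank;
}

// dna4 x phred42 fits into one byte; dna4 x phred94 does not.
static_assert(bytes_needed(combined_size(4, 42)) == 1);
static_assert(bytes_needed(combined_size(4, 94)) == 2);
```

Even the smallest nucleotide alphabet combined with a 94-value quality alphabet yields 376 values, which exceeds the 256 values a single byte can hold.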
The quality submodule defines the seqan3::writable_quality_alphabet concept, which encompasses all the alphabets defined in the submodule and refines seqan3::writable_alphabet by providing Phred score assignment and conversion operations. Additionally, this submodule defines seqan3::quality_alphabet, which only requires readability and not assignability.
Quality alphabets can be converted to their char and rank representation via seqan3::to_char and seqan3::to_rank respectively (like all other alphabets). Additionally, they can be converted to their Phred representation via seqan3::to_phred.
Likewise, assignment happens via seqan3::assign_char_to, seqan3::assign_rank_to and seqan3::assign_phred_to. Phred values outside the representable range, but inside the legal range, are converted to the closest Phred score, e.g. assigning 60 to a seqan3::phred42 will result in a Phred score of 41. Assigning Phred values outside the legal range results in undefined behaviour.
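The "closest Phred score" rule for assignment can be pictured as a clamp to the representable range. The following standalone sketch uses a hypothetical class `toy_phred42` that mimics the described behaviour of a Phred+33 alphabet with scores 0 to 41; it is not the SeqAn3 implementation, and clamping in `assign_char` is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstdint>

// Toy stand-in for a Phred+33 alphabet with scores 0..41 (not the SeqAn3 class).
class toy_phred42
{
public:
    static constexpr int min_phred = 0;
    static constexpr int max_phred = 41;
    static constexpr int ascii_offset = 33;

    // Assign a Phred score; values outside [0, 41] are mapped to the closest bound.
    constexpr toy_phred42 & assign_phred(int const phred) noexcept
    {
        rank = static_cast<std::uint8_t>(std::clamp(phred, min_phred, max_phred) - min_phred);
        return *this;
    }

    // Assign from the ASCII representation (assumption: clamps like assign_phred).
    constexpr toy_phred42 & assign_char(char const c) noexcept
    {
        return assign_phred(static_cast<int>(c) - ascii_offset);
    }

    constexpr int to_phred() const noexcept { return rank + min_phred; }
    constexpr std::uint8_t to_rank() const noexcept { return rank; }
    constexpr char to_char() const noexcept { return static_cast<char>(to_phred() + ascii_offset); }

private:
    std::uint8_t rank{0}; // only the rank is stored
};

// Assigning 60 yields the largest representable score.
static_assert(toy_phred42{}.assign_phred(60).to_phred() == 41);
```

Storing only the rank keeps the object one byte wide; the Phred score and the character are derived on demand.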
All quality alphabets are explicitly convertible to each other via their Phred representation. Values not present in one alphabet are mapped to the closest value in the target alphabet (e.g. a seqan3::phred63 letter with value 60 will convert to a seqan3::phred42 letter of score 41; this also applies to seqan3::phred94).
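Conversion via the Phred representation can be sketched with a small class template that has an explicit converting constructor. The type `toy_quality` and its aliases are hypothetical and only model the behaviour described above; the real SeqAn3 classes are not used here.

```cpp
#include <algorithm>

// Toy quality type parameterised by its Phred range (hypothetical, for illustration only).
template <int min_phred, int max_phred>
class toy_quality
{
public:
    constexpr toy_quality() noexcept = default;

    // Construct from a raw Phred score; out-of-range values map to the closest bound.
    explicit constexpr toy_quality(int const phred) noexcept :
        rank{std::clamp(phred, min_phred, max_phred) - min_phred}
    {}

    // Explicit conversion from another toy quality type, going through the Phred score.
    template <int other_min, int other_max>
    explicit constexpr toy_quality(toy_quality<other_min, other_max> const other) noexcept :
        toy_quality{other.to_phred()}
    {}

    constexpr int to_phred() const noexcept { return rank + min_phred; }

private:
    int rank{0};
};

using toy_phred42 = toy_quality<0, 41>;
using toy_phred63 = toy_quality<0, 62>;
using toy_solexa = toy_quality<-5, 62>;

// A score of 60 is not representable in the smaller alphabet and becomes 41.
static_assert(toy_phred42{toy_phred63{60}}.to_phred() == 41);
```

Because the constructor is `explicit`, a statement such as `toy_phred42 q = toy_phred63{60};` does not compile in this sketch; the conversion has to be spelled out, which makes a potentially lossy step visible in the code.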
template<typename alphabet_type> using seqan3::alphabet_phred_t = decltype(seqan3::to_phred(std::declval<alphabet_type>()))
The phred_type of the alphabet; defined as the return type of seqan3::to_phred.
using seqan3::dna15q = qualified<dna15, phred42>
An alphabet that stores a seqan3::dna15 letter and a seqan3::phred42 letter at each position.
using seqan3::dna4q = qualified<dna4, phred42>
An alphabet that stores a seqan3::dna4 letter and a seqan3::phred42 letter at each position.
using seqan3::dna5q = qualified<dna5, phred42>
An alphabet that stores a seqan3::dna5 letter and a seqan3::phred42 letter at each position.
using seqan3::rna15q = qualified<rna15, phred42>
An alphabet that stores a seqan3::rna15 letter and a seqan3::phred42 letter at each position.
using seqan3::rna4q = qualified<rna4, phred42>
An alphabet that stores a seqan3::rna4 letter and a seqan3::phred42 letter at each position.
using seqan3::rna5q = qualified<rna5, phred42>
An alphabet that stores a seqan3::rna5 letter and a seqan3::phred42 letter at each position.
constexpr auto seqan3::assign_phred_to = detail::adl_only::assign_phred_to_cpo{} (inline constexpr)
Assign a Phred score to a quality alphabet object.
Template parameters:
your_type | The type of the target object. Must model the seqan3::quality_alphabet. |
Parameters:
chr | The Phred score being assigned; must be of the seqan3::alphabet_phred_t of the target object. |
Returns: a reference to the target object alph if alph was given as an lvalue, otherwise a copy.
This is a function object. Invoke it with the parameter(s) specified above.
It acts as a wrapper and looks for three possible implementations (in this order):
1. A static member function assign_phred_to(phred_type const chr, your_type & a) of the class seqan3::custom::alphabet<your_type>.
2. A free function assign_phred_to(phred_type const chr, your_type & a) in the namespace of your type (or as friend).
3. A member function called assign_phred(phred_type const chr) (not assign_phred_to).
Functions are only considered for one of the above cases if they are marked noexcept (constexpr is not required, but recommended) and if the returned type is your_type &.
Every writable quality alphabet type must provide one of the above. Note that temporaries of your_type are handled by this function object and do not require an additional overload.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.
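As an illustration of the member-function variant, the following standalone sketch shows a hypothetical user type `my_namespace::my_quality` with scores 0 to 9 that provides `assign_phred` with the stated properties (`noexcept`, returning `my_quality &`). It only demonstrates these two requirements; a type that actually models seqan3::writable_quality_alphabet needs the remaining alphabet interface as well, which is omitted here.

```cpp
#include <algorithm>
#include <type_traits>
#include <utility>

namespace my_namespace
{
// Hypothetical user-defined quality type with scores 0..9.
class my_quality
{
public:
    using phred_type = int;

    // Member function variant: called assign_phred (not assign_phred_to),
    // marked noexcept and returning a reference to the object itself.
    constexpr my_quality & assign_phred(phred_type const chr) noexcept
    {
        value = std::clamp(chr, 0, 9);
        return *this;
    }

    constexpr phred_type to_phred() const noexcept { return value; }

private:
    phred_type value{0};
};
} // namespace my_namespace

// Compile-time checks of the two stated requirements.
static_assert(noexcept(std::declval<my_namespace::my_quality &>().assign_phred(0)),
              "assign_phred must be noexcept");
static_assert(std::is_same_v<decltype(std::declval<my_namespace::my_quality &>().assign_phred(0)),
                             my_namespace::my_quality &>,
              "assign_phred must return your_type &");
```

Checks of this kind catch a missing `noexcept` or a wrong return type at compile time, before the function is silently skipped by the lookup described above.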
constexpr auto seqan3::to_phred = detail::adl_only::to_phred_cpo{} (inline constexpr)
The public getter function for the Phred representation of a quality score.
Template parameters:
your_type | The type of alphabet. Must model the seqan3::quality_alphabet. |
Parameters:
chr | The quality value to convert into the Phred score. |
This is a function object. Invoke it with the parameter(s) specified above.
It acts as a wrapper and looks for three possible implementations (in this order):
1. A static member function to_phred(your_type const a) of the class seqan3::custom::alphabet<your_type>.
2. A free function to_phred(your_type const a) in the namespace of your type (or as friend).
3. A member function called to_phred().
Functions are only considered for one of the above cases if they are marked noexcept (constexpr is not required, but recommended) and if the returned type is convertible to size_t.
Every quality alphabet type must provide one of the above.
This is a customisation point (see Customisation). To specify the behaviour for your own alphabet type, simply provide one of the three functions specified above.