 |
SeqAn3
3.0.2
The Modern C++ library for sequence analysis.
|
|
Learning Objective:
In this tutorial you will learn how to combine the components of previous tutorials to create your very first SeqAn application: a read mapper!
Difficulty | High |
Duration | 90 Minutes |
Prerequisite tutorials | All |
Recommended reading | |
Introduction
Read mapping is a common task in bioinformatics and is often the first step of an in-depth analysis of Next Generation Sequencing data. Its aim is to identify positions where a query sequence (read) matches with up to e
errors to a reference sequence.
In this example we will implement a read mapper step by step and make use of what we have learned in the previous tutorials. As it is common practice with read mappers, we will first create an indexer that creates an index from the reference and stores it to disk. After this, we will implement the actual read mapper that will use the stored index and map the reads.
Agenda
- Implementing an indexer
- Parse arguments
- Read input files
- Create and store index
- Implementing a read mapper
- Parse arguments
- Read and load input, search for approximate matches
- Align the search results
- Write final results into a SAM file
The data
We provide an example reference and an example query file.
The indexer
Step 1 - Parsing arguments
As a first step, we want to parse command line arguments for our indexer. If you get into trouble, you can take a peek at the Argument Parser Tutorial or the API documentation of the seqan3::argument_parser for help.
Assignment 1: Parsing arguments
Let's start our application by setting up the argument parser with the following options:
- The path to the reference file
- An output path for the index
Follow the best practice and create:
- A function
run_program
that prints the parsed arguments
- A struct
cmd_arguments
that stores the arguments
- A function
initialise_argument_parser
to add meta data and options to the parser
- A
main
function that parses the arguments and calls run_program
Use validators where applicable!
Your main
may look like this:
Hint
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.index_path);
return 0;
}
Solution
{
}
struct cmd_arguments
{
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.index_path,
'o',
"output",
"The output index file path.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.index_path);
return 0;
}
Step 2 - Reading the input
As a next step, we want to use the parsed file name to read in our reference data. This will be done using seqan3::sequence_file_input class. As a guide, you can take a look at the Sequence I/O Tutorial.
Assignment 2: Reading the input
Extend your program to store the sequence information contained in the reference file into a struct.
To do this, you should create:
- A struct
reference_storage_t
that stores the sequence information for both reference and query information within member variables
- A function
read_reference
that fills a reference_storage_t
object with information from the files and prints the reference IDs
You should also perform the following changes in run_program
:
- Construct of an object
storage
of type reference_storage_t
- Add a call to
read_reference
This is the signature of read_reference
:
Hint
reference_storage_t & storage)
This is the reference_storage_t
:
Hint
struct reference_storage_t
{
};
Solution
struct reference_storage_t
{
};
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
{
reference_storage_t storage{};
read_reference(reference_path, storage);
}
Here is the complete program:
Hint
struct reference_storage_t
{
};
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
{
reference_storage_t storage{};
read_reference(reference_path, storage);
}
struct cmd_arguments
{
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.index_path,
'o',
"output",
"The output index file path.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.index_path);
return 0;
}
Step 3 - Index
Now that we have the necessary sequence information, we can create an index and store it. Read up on the Index Tutorial if you have any questions.
Assignment 3: Index
We want to create a new function
create_index
:
- It takes
index_path
and storage
as parameters
- Creates a bi_fm_index
- Stores the bi_fm_index
We also need to change:
run_program
to now also call create_index
run_program
and read_reference
to not print the debug output anymore
This is the signature of create_index
:
Hint
reference_storage_t & storage)
Solution
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
reference_storage_t & storage)
{
{
cereal::BinaryOutputArchive oarchive{os};
oarchive(index);
}
}
{
reference_storage_t storage{};
read_reference(reference_path, storage);
create_index(index_path, storage);
}
Here is the complete program:
Hint
#include <cereal/archives/binary.hpp>
struct reference_storage_t
{
};
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
reference_storage_t & storage)
{
{
cereal::BinaryOutputArchive oarchive{os};
oarchive(index);
}
}
{
reference_storage_t storage{};
read_reference(reference_path, storage);
create_index(index_path, storage);
}
struct cmd_arguments
{
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.index_path,
'o',
"output",
"The output index file path.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.index_path);
return 0;
}
The read mapper
Step 1 - Parsing arguments
Again, we want to parse command line arguments for our read mapper as a first step. If you get into trouble, you can take a peek at the Argumet Parser Tutorial or the API documentation of the seqan3::argument_parser for help.
Assignment 4: Parsing arguments
Let's start our application by setting up the argument parser with the following options:
- The path to the reference file
- The path to the query file
- The path to the index file
- An output path
- The maximum number of errors we want to allow (between 0 and 4)
Follow the best practice and create:
- A function
run_program
that prints the parsed arguments
- A struct
cmd_arguments
that stores the arguments
- A function
initialise_argument_parser
to add meta data and options to the parser
- A
main
function that parses the arguments and calls run_program
Use validators where applicable!
Your main
may look like this:
Hint
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
return 0;
}
Solution
uint8_t const errors)
{
}
struct cmd_arguments
{
uint8_t errors{0};
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.query_path,
'q',
"query",
"The path to the query.",
parser.
add_option(args.index_path,
'i',
"index",
"The path to the index.",
parser.
add_option(args.sam_path,
'o',
"output",
"The output SAM file path.",
parser.
add_option(args.errors,
'e',
"error",
"Maximum allowed errors.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
return 0;
}
Step 2 - Reading the input and searching
We also want to read the reference in the read mapper. This is done the same way as for the indexer. We can now load the index and conduct a search. Read up on the Search Tutorial if you have any questions.
Assignment 5: Reading the input
Extend your program to read the reference file the same way the indexer does. After this you can load the index and print results of a search.
To do this, you should:
- Carry over the
read_reference
function and the reference_storage_t
struct from the indexer
- Create a function
map_reads
that loads the index and prints the results of the search (allowing all error types) for the first 20 queries
You should also perform the following changes in run_program
:
- Remove the debug output
- Construct an object
storage
of type reference_storage_t
- Add a call to
read_reference
and map_reads
This is the signature of map_reads
:
Hint
reference_storage_t & storage,
uint8_t const errors)
Solution
struct reference_storage_t
{
};
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
reference_storage_t & storage,
uint8_t const errors)
{
{
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}
#if SEQAN3_WORKAROUND_GCC_93983
for (auto && record : query_file_in )
#else // ^^^ workaround / no workaround vvv
#endif // SEQAN3_WORKAROUND_GCC_93983
{
for (
auto && result :
search(seqan3::get<seqan3::field::seq>(record), index, search_config))
}
(void) sam_path;
(void) storage;
}
uint8_t const errors)
{
reference_storage_t storage{};
read_reference(reference_path, storage);
map_reads(query_path, index_path, sam_path, storage, errors);
}
Here is the complete program:
Hint
#include <cereal/archives/binary.hpp>
struct reference_storage_t
{
};
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
reference_storage_t & storage,
uint8_t const errors)
{
{
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}
#if SEQAN3_WORKAROUND_GCC_93983
for (auto && record : query_file_in )
#else // ^^^ workaround / no workaround vvv
#endif // SEQAN3_WORKAROUND_GCC_93983
{
for (
auto && result :
search(seqan3::get<seqan3::field::seq>(record), index, search_config))
}
(void) sam_path;
(void) storage;
}
uint8_t const errors)
{
reference_storage_t storage{};
read_reference(reference_path, storage);
map_reads(query_path, index_path, sam_path, storage, errors);
}
struct cmd_arguments
{
uint8_t errors{0};
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.query_path,
'q',
"query",
"The path to the query.",
parser.
add_option(args.index_path,
'i',
"index",
"The path to the index.",
parser.
add_option(args.sam_path,
'o',
"output",
"The output SAM file path.",
parser.
add_option(args.errors,
'e',
"error",
"Maximum allowed errors.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
return 0;
}
Step 3 - Alignment
We can now use the obtained positions to align each query against the reference. Refer to the Alignment Tutorial if you have any questions.
Assignment 6: Alignment
We want to extend
map_reads
to:
- Use the output of the search to align the query against the reference
- Print the query ID, alignment score, subrange of the reference sequence and the query (for the first 20 queries)
This is the alignment config:
Solution
reference_storage_t & storage,
uint8_t const errors)
{
{
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}
#if SEQAN3_WORKAROUND_GCC_93983
for (auto && record : query_file_in )
#else // ^^^ workaround / no workaround vvv
#endif // SEQAN3_WORKAROUND_GCC_93983
{
auto & query = seqan3::get<seqan3::field::seq>(record);
for (
auto && result :
search(query, index, search_config))
{
size_t start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
std::span text_view{
std::data(storage.seqs[result.reference_id()]) + start, query.size() + 1};
{
auto && [aligned_database, aligned_query] = alignment.alignment();
}
}
}
(void) sam_path;
}
Here is the complete program:
Hint
#include <cereal/archives/binary.hpp>
struct reference_storage_t
{
};
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
reference_storage_t & storage,
uint8_t const errors)
{
{
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}
#if SEQAN3_WORKAROUND_GCC_93983
for (auto && record : query_file_in )
#else // ^^^ workaround / no workaround vvv
#endif // SEQAN3_WORKAROUND_GCC_93983
{
auto & query = seqan3::get<seqan3::field::seq>(record);
for (
auto && result :
search(query, index, search_config))
{
size_t start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
std::span text_view{
std::data(storage.seqs[result.reference_id()]) + start, query.size() + 1};
{
auto && [aligned_database, aligned_query] = alignment.alignment();
}
}
}
(void) sam_path;
}
uint8_t const errors)
{
reference_storage_t storage{};
read_reference(reference_path, storage);
map_reads(query_path, index_path, sam_path, storage, errors);
}
struct cmd_arguments
{
uint8_t errors{0};
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.query_path,
'q',
"query",
"The path to the query.",
parser.
add_option(args.index_path,
'i',
"index",
"The path to the index.",
parser.
add_option(args.sam_path,
'o',
"output",
"The output SAM file path.",
parser.
add_option(args.errors,
'e',
"error",
"Maximum allowed errors.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
return 0;
}
Step 4 - Alignment output
Finally, we can write our results into a SAM file.
Assignment 7: SAM out
We further need to extend
map_reads
to write the alignment results into a SAM file. Additionally, there should be no more debug output.
Try to write all available information into the SAM file. We can introduce a naive mapping quality by using mapping quality = 60 + alignment score
.
This is the alignment_file_output construction:
Hint
seqan3::field::ref_id,
seqan3::field::ref_offset,
seqan3::field::qual,
Solution
reference_storage_t & storage,
uint8_t const errors)
{
{
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}
seqan3::field::ref_id,
seqan3::field::ref_offset,
seqan3::field::qual,
for (auto && record : query_file_in)
{
auto & query = seqan3::get<seqan3::field::seq>(record);
for (
auto && result :
search(query, index, search_config))
{
size_t start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
std::span text_view{
std::data(storage.seqs[result.reference_id()]) + start, query.size() + 1};
{
auto aligned_seq = alignment.alignment();
size_t ref_offset = alignment.sequence1_begin_position() + 2 + start;
size_t map_qual = 60u + alignment.score();
sam_out.emplace_back(query,
seqan3::get<seqan3::field::id>(record),
storage.ids[result.reference_id()],
ref_offset,
aligned_seq,
seqan3::get<seqan3::field::qual>(record),
map_qual);
}
}
}
}
Here is the complete program:
Hint
#include <cereal/archives/binary.hpp>
struct reference_storage_t
{
};
reference_storage_t & storage)
{
for (
auto & [
seq,
id, qual] : reference_in)
{
}
}
reference_storage_t & storage,
uint8_t const errors)
{
{
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}
seqan3::field::ref_id,
seqan3::field::ref_offset,
seqan3::field::qual,
for (auto && record : query_file_in)
{
auto & query = seqan3::get<seqan3::field::seq>(record);
for (
auto && result :
search(query, index, search_config))
{
size_t start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
std::span text_view{
std::data(storage.seqs[result.reference_id()]) + start, query.size() + 1};
{
auto aligned_seq = alignment.alignment();
size_t ref_offset = alignment.sequence1_begin_position() + 2 + start;
size_t map_qual = 60u + alignment.score();
sam_out.emplace_back(query,
seqan3::get<seqan3::field::id>(record),
storage.ids[result.reference_id()],
ref_offset,
aligned_seq,
seqan3::get<seqan3::field::qual>(record),
map_qual);
}
}
}
}
uint8_t const errors)
{
reference_storage_t storage{};
read_reference(reference_path, storage);
map_reads(query_path, index_path, sam_path, storage, errors);
}
struct cmd_arguments
{
uint8_t errors{0};
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.query_path,
'q',
"query",
"The path to the query.",
parser.
add_option(args.index_path,
'i',
"index",
"The path to the index.",
parser.
add_option(args.sam_path,
'o',
"output",
"The output SAM file path.",
parser.
add_option(args.errors,
'e',
"error",
"Maximum allowed errors.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
}
{
return -1;
}
run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
return 0;
}
Provides seqan3::debug_stream and related types.
Meta-header for the search module.
A strong type representing free_end_gaps_sequence1_trailing of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:85
Provides pairwise alignment function.
Configuration element to receive all hits with the lowest number of errors within the error bounds.
Definition: hit.hpp:56
The SeqAn command line parser.
Definition: argument_parser.hpp:154
A strong type representing free_end_gaps_sequence2_leading of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:75
void add_option(option_type &value, char const short_id, std::string const &long_id, std::string const &desc, option_spec const spec=option_spec::DEFAULT, validator_type option_validator=validator_type{})
Adds an option to the seqan3::argument_parser.
Definition: argument_parser.hpp:237
A strong type representing free_end_gaps_sequence1_leading of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:65
Configuration element that represents the number or rate of total errors.
Definition: max_error.hpp:36
@ DEFAULT
The default were no checking or special displaying is happening.
Definition: auxiliary.hpp:234
@ id
The identifier, usually a string.
A class for writing alignment files, e.g. SAM, BAL, BLAST, ...
Definition: output.hpp:163
constexpr configuration edit_scheme
Shortcut for edit distance configuration.
Definition: align_config_edit.hpp:51
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:138
A validator that checks whether a number is inside a given range.
Definition: validators.hpp:106
Provides the bidirectional seqan3::bi_fm_index.
The SeqAn Bidirectional FM Index.
Definition: bi_fm_index.hpp:66
A class template that holds a choice of seqan3::field.
Definition: record.hpp:166
constexpr sequenced_policy seq
Global execution policy object for sequenced execution policy.
Definition: execution.hpp:54
Configures the alignment result to output the alignment.
Definition: align_config_output.hpp:168
Collection of elements to configure an algorithm.
Definition: configuration.hpp:82
Configures the alignment result to output the begin positions.
Definition: align_config_output.hpp:129
auto const move
A view that turns lvalue-references into rvalue-references.
Definition: move.hpp:68
A strong type of underlying type uint8_t that represents the number of errors.
Definition: max_error_common.hpp:30
@ mapq
The mapping quality of the SEQ alignment, usually a ohred-scaled score.
@ REQUIRED
Definition: auxiliary.hpp:235
A strong type representing free_end_gaps_sequence2_trailing of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:95
Meta-header for the alignment configuration module .
constexpr auto take
A view adaptor that returns the first size elements from the underlying range (or less if the underly...
Definition: take.hpp:611
Configures the alignment result to output the score.
Definition: align_config_output.hpp:43
Meta-Header for the argument parser module.
Sets the global alignment method.
Definition: align_config_method.hpp:106
Argument parser exception that is thrown whenever there is an error while parsing the command line ar...
Definition: exceptions.hpp:49
argument_parser_meta_data info
Aggregates all parser related meta data (see seqan3::argument_parser_meta_data struct).
Definition: argument_parser.hpp:530
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:42
void parse()
Initiates the actual command line parsing.
Definition: argument_parser.hpp:387
auto search(queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration)
Search a query or a range of queries in an index.
Definition: search.hpp:105
A validator that checks if a given path is a valid output file.
Definition: validators.hpp:601
@ alignment
The (pairwise) alignment stored in an seqan3::alignment object.
Provides seqan3::alignment_file_output and corresponding traits classes.