Provides files and formats for handling sequence data.
Reading Sequence Files
Sequence files are the most generic and common biological files. Well-known formats include FASTA and FASTQ, but some may also be interested in treating SAM or BAM files as sequence files, discarding the alignment.
The Sequence file abstraction supports reading three different fields:
- seqan3::field::seq
- seqan3::field::id
- seqan3::field::qual
All three fields are retrieved by default (and in that order). A combined field that stores sequence and qualities together in a more memory-efficient container may be selected instead; if you select it, you may not also select seqan3::field::seq or seqan3::field::qual.
Construction and specialisation
This class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like
std::cin or
std::istringstream, that you want to read from and/or if you cannot use file-extension based detection, but know that your input file has a certain format.
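How a constructor might pick a format from the file name can be pictured with a small sketch. It uses only the standard library; the sketch_format enum, the detect_format function and the extension lists are assumptions made for illustration and are not SeqAn's implementation or its actual extension tables.

```cpp
#include <optional>
#include <string_view>

// Hypothetical format tag for this sketch; not a SeqAn type.
enum class sketch_format
{
    fasta,
    fastq
};

// Maps a file name to a format by its extension, or std::nullopt if the
// extension is missing or unknown. The extension lists are illustrative only.
std::optional<sketch_format> detect_format(std::string_view file_name)
{
    auto const dot = file_name.rfind('.');
    if (dot == std::string_view::npos)
        return std::nullopt; // no extension, so nothing to detect

    std::string_view const extension = file_name.substr(dot + 1);
    if (extension == "fasta" || extension == "fa")
        return sketch_format::fasta;
    if (extension == "fastq" || extension == "fq")
        return sketch_format::fastq;

    return std::nullopt; // unknown extension
}
```

When detection like this cannot succeed, for example for a stream without a name, the format has to be named explicitly, which is the role of the second constructor.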
In most cases the template parameters are deduced completely automatically:
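#include <filesystem>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/io/sequence_file/output.hpp>
int main()
{
    using namespace seqan3::literals;
    auto fasta_file = std::filesystem::current_path() / "my.fasta"; // illustrative file name
    {
        seqan3::sequence_file_output fout{fasta_file}; // create a small FASTA file to read from
        fout.emplace_back("AGGCTGA"_dna4, "Test2");
        fout.emplace_back("GGAGTATAATATATATATATATAT"_dna4, "Test3");
    }
    seqan3::sequence_file_input fin{fasta_file}; // FASTA is deduced from the ".fasta" extension
}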
Reading from a
std::istringstream:
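#include <sstream>
#include <seqan3/io/sequence_file/input.hpp>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
    // The format is passed as a tag because a stream has no file extension to inspect.
    seqan3::sequence_file_input fin{std::istringstream{input}, seqan3::format_fasta{}};
}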
Note that writing sequence_file_input without angle brackets is not the same as writing sequence_file_input<>. In the latter case the template parameters are explicitly set to their default values; in the former case automatic deduction happens, which chooses different parameters depending on the constructor arguments. For opening from file, sequence_file_input<> would have also worked, but for opening from stream it would not have. In some cases, you do need to specify the arguments, e.g. if you want to read amino acids:
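#include <sstream>
#include <seqan3/io/sequence_file/input.hpp>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
    // The amino acid traits replace the default DNA traits.
    seqan3::sequence_file_input<seqan3::sequence_file_input_default_traits_aa> fin{std::istringstream{input},
                                                                                   seqan3::format_fasta{}};
}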
You can define your own traits type to further customise the types used by and returned by this class, see seqan3::sequence_file_input_default_traits_dna for more details. As mentioned above, specifying at least one template parameter yourself means that you lose automatic deduction, so if you want to read amino acids and want to read from a string stream you need to give all types yourself:
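#include <sstream>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/utility/type_list/type_list.hpp>
auto input = R"(>TEST1
FQTWE
>Test2
KYRTW
>Test3
EEYQTWEEFARAAEKLYLTDPMKV)";
int main()
{
    // All template parameters are given explicitly: traits, selected fields and valid formats.
    using sequence_file_input_type =
        seqan3::sequence_file_input<seqan3::sequence_file_input_default_traits_aa,
                                    seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::qual>,
                                    seqan3::type_list<seqan3::format_fasta>>;
    sequence_file_input_type fin{std::istringstream{input}, seqan3::format_fasta{}};
}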
Reading record-wise
You can iterate over this file record-wise:
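#include <sstream>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sequence_file/input.hpp>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
    seqan3::sequence_file_input fin{std::istringstream{input}, seqan3::format_fasta{}};
    for (auto & record : fin)
    {
        seqan3::debug_stream << "ID:  " << record.id() << '\n';
        seqan3::debug_stream << "SEQ: " << record.sequence() << '\n';
        // The quality field exists as well, but it is empty for FASTA input.
    }
}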
In the above example, record has the type seqan3::sequence_file_input::record_type, which is seqan3::sequence_record.
Note: It is important to write auto & and not just auto, otherwise you will copy the record on every iteration. Since the buffer gets "refilled" on every iteration, you can also move the data out of the record if you want to store it somewhere without copying:
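#include <sstream>
#include <utility>
#include <vector>
#include <seqan3/io/sequence_file/input.hpp>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
    seqan3::sequence_file_input fin{std::istringstream{input}, seqan3::format_fasta{}};
    using record_type = typename decltype(fin)::record_type;
    std::vector<record_type> records{};
    for (auto & rec : fin)
        records.push_back(std::move(rec)); // move the record out of the buffer instead of copying it
}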
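The cost difference between auto and auto & can be made visible with a stand-in record type that counts its copies. This is a sketch using only the standard library; counted_record and both functions are hypothetical names for illustration and do not model SeqAn's record or buffer.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a file record; counts how often it is copied.
struct counted_record
{
    std::string id;
    std::string sequence;
    static inline int copies = 0;

    counted_record(std::string i, std::string s) : id{std::move(i)}, sequence{std::move(s)}
    {}
    counted_record(counted_record const & other) : id{other.id}, sequence{other.sequence}
    {
        ++copies;
    }
    counted_record(counted_record &&) noexcept = default;
    counted_record & operator=(counted_record const &) = default;
    counted_record & operator=(counted_record &&) noexcept = default;
};

// Iterating with `auto` copies every element once; returns the number of copies made.
int copies_when_iterating_by_value(std::vector<counted_record> const & buffer)
{
    counted_record::copies = 0;
    for (auto rec : buffer)
        (void)rec; // the copy already happened when rec was initialised
    return counted_record::copies;
}

// Iterating with `auto &` and moving each element out makes no copies.
std::vector<counted_record> move_all_out(std::vector<counted_record> & buffer)
{
    counted_record::copies = 0;
    std::vector<counted_record> stored;
    stored.reserve(buffer.size());
    for (auto & rec : buffer)
        stored.push_back(std::move(rec));
    return stored;
}
```

After move_all_out the elements left in buffer are in a moved-from state, which matches the situation described above: the data may be taken because the buffer is refilled before it is read again.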
Reading record-wise (decomposed records)
Instead of using member accessors on the record, you can also use structured bindings to decompose the record into its elements:
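#include <sstream>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sequence_file/input.hpp>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
    seqan3::sequence_file_input fin{std::istringstream{input}, seqan3::format_fasta{}};
    for (auto & [sequence, id, quality] : fin)
    {
        seqan3::debug_stream << "ID:  " << id << '\n';
        seqan3::debug_stream << "SEQ: " << sequence << '\n';
        seqan3::debug_stream << "QUAL: " << quality << '\n'; // empty for FASTA input
    }
}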
In this case you immediately get the elements of the tuple: sequence of seqan3::sequence_file_input::sequence_type, id of seqan3::sequence_file_input::id_type, and quality, which is empty for FASTA input.
But beware: with structured bindings you do need to get the order of elements right!
Reading record-wise (custom fields)
If you want to skip specific fields from the record, you can pass a non-empty fields trait object to the sequence_file_input constructor to select the fields that should be read from the input, for example to choose a combined field for SEQ and QUAL (see above), or to never actually read QUAL if you don't need it. The following snippet demonstrates the usage of such a fields trait object.
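#include <sstream>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sequence_file/input.hpp>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
    seqan3::sequence_file_input fin{std::istringstream{input},
                                    seqan3::format_fasta{},
                                    seqan3::fields<seqan3::field::id, seqan3::field::seq, seqan3::field::qual>{}};
    // The order is now id, seq, qual because the fields object specifies it that way.
    for (auto & [id, seq, qual] : fin)
    {
        seqan3::debug_stream << "ID:  " << id << '\n';
        seqan3::debug_stream << "SEQ: " << seq << '\n';
        seqan3::debug_stream << "QUAL: " << qual << '\n';
    }
}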
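The reason the order matters is that structured bindings assign names by position, never by meaning. The following sketch shows this with a plain std::tuple; the function name is hypothetical and the tuple merely stands in for a record.

```cpp
#include <string>
#include <tuple>

// Binding names are arbitrary labels: the first name always receives the first
// element, whatever that element actually holds.
std::string first_element_regardless_of_name(std::tuple<std::string, std::string> const & record)
{
    auto const & [id, sequence] = record; // "id" is bound to element 0
    (void)sequence;                       // unused in this sketch
    return id;
}
```

If the tuple stores the sequence first, the variable called id silently holds the sequence. The compiler cannot detect this when both elements have compatible types, so the binding order has to follow the selected field order.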
When reading a file, all fields not present in the file (but requested implicitly or via the selected_field_ids parameter) are ignored.
Views on files
Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records based on certain criteria, e.g. minimum length of the sequence field:
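#include <ranges>
#include <sstream>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sequence_file/input.hpp>
auto input = R"(>TEST1
ACGT
>Test2
AGGCTGA
>Test3
GGAGTATAATATATATATATATAT)";
int main()
{
    seqan3::sequence_file_input fin{std::istringstream{input}, seqan3::format_fasta{}};
    auto minimum_length5_filter = std::views::filter(
        [](auto const & rec)
        {
            return std::ranges::size(rec.sequence()) >= 5;
        });
    // Only records whose sequence has at least 5 elements reach the loop body.
    for (auto & rec : fin | minimum_length5_filter)
    {
        seqan3::debug_stream << "IDs of seq_length >= 5: " << rec.id() << '\n';
    }
}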
End of file
You can check whether a file is at end by comparing begin() and end() (if they are the same, the file is at end).
Formats
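We currently support reading the following formats:
- seqan3::format_fasta
- seqan3::format_fastq
- seqan3::format_embl
- seqan3::format_genbank
- seqan3::format_sam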
Writing Sequence Files
Sequence files are the most generic and common biological files. Well-known formats include FASTA and FASTQ, but some may also be interested in treating SAM or BAM files as sequence files, discarding the alignment.
The Sequence file abstraction supports writing three different fields:
- seqan3::field::seq
- seqan3::field::id
- seqan3::field::qual
The member functions accept any combination of these fields. If the field ID of an argument cannot be deduced, it is assumed to correspond to the field ID of the respective template parameter.
Construction and specialisation
This class comes with two constructors, one for construction from a file name and one for construction from an existing stream and a known format. The first one automatically picks the format based on the extension of the file name. The second can be used if you have a non-file stream, like std::cout or std::ostringstream, that you want to write to and/or if you cannot use file-extension based detection, but know that your output file has a certain format.
In most cases the template parameters are deduced completely automatically, both when constructing from a file name and when writing to a stream such as std::cout.
Note that writing sequence_file_output without angle brackets is not the same as writing sequence_file_output<>. In the latter case the template parameters are explicitly set to their default values; in the former case automatic deduction happens, which chooses different parameters depending on the constructor arguments. Prefer deduction over explicit defaults.
Writing record-wise
You can write to this file record-wise:
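#include <sstream>
#include <string>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/io/sequence_file/output.hpp>
int main()
{
    using namespace seqan3::literals;
    seqan3::sequence_file_output fout{std::ostringstream{}, seqan3::format_fasta{}};
    for (int i = 0; i < 5; ++i)
    {
        std::string id{"test_id"}; // illustrative ID
        seqan3::dna5_vector seq{"ACGT"_dna5};
        fout.emplace_back(seq, id); // default field order: seq first, then id
    }
}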
The easiest way to write to a sequence file is to use the push_back() or emplace_back() member functions. These work similarly to how they work on a std::vector. If you pass a tuple to push_back() or give arguments to emplace_back(), the seqan3::field ID of the i-th tuple-element/argument is assumed to be the i-th value of selected_field_ids, i.e. by default the first is assumed to be seqan3::field::seq, the second seqan3::field::id and the third one seqan3::field::qual. You may give fewer fields than are selected if the actual format you are writing to can cope with fewer (e.g. for FASTA it is sufficient to write seqan3::field::seq and seqan3::field::id, even if selected_field_ids also contains seqan3::field::qual at the third position). You may also use the output file's iterator for writing; however, this rarely provides an advantage.
Writing record-wise (custom fields)
If you want to change the order of the parameters, you can pass a non-empty fields trait object to the sequence_file_output constructor to select the fields that are used for interpreting the arguments. The following snippet demonstrates the usage of such a fields trait object.
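#include <sstream>
#include <string>
#include <tuple>
#include <vector>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/alphabet/quality/phred42.hpp>
#include <seqan3/alphabet/quality/qualified.hpp>
#include <seqan3/io/sequence_file/output.hpp>
#include <seqan3/utility/views/elements.hpp>
int main()
{
    using namespace seqan3::literals;
    seqan3::sequence_file_output fout{std::ostringstream{},
                                      seqan3::format_fastq{},
                                      seqan3::fields<seqan3::field::id, seqan3::field::seq, seqan3::field::qual>{}};
    for (int i = 0; i < 5; i++)
    {
        std::string id{"test_id"}; // illustrative ID
        // Sequence and quality stored together, one combined letter per position.
        std::vector<seqan3::qualified<seqan3::dna5, seqan3::phred42>> seq_qual{{'A'_dna5, '1'_phred42},
                                                                                {'C'_dna5, '3'_phred42}};
        auto view_on_seq = seqan3::views::elements<0>(seq_qual);
        auto view_on_qual = seqan3::views::elements<1>(seq_qual);
        // The argument order follows the fields object above: id, seq, qual.
        fout.emplace_back(id, view_on_seq, view_on_qual);
        // Alternatively, pass the same fields as a tuple:
        fout.push_back(std::tie(id, view_on_seq, view_on_qual));
    }
}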
A different way of passing custom fields to the file is to pass a
seqan3::record – instead of a tuple – to push_back(). The
seqan3::record clearly indicates which of its elements has which
seqan3::field ID so the file will use that information instead of the template argument. This is especially handy when reading from one file and writing to another, because you don't have to configure the output file to match the input file, it will just work:
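#include <sstream>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/io/sequence_file/output.hpp>
auto input = R"(@TEST1
ACGT
+
##!#
@Test2
AGGCTGA
+
##!#!!!
@Test3
GGAGTATAATATATATATATATAT
+
##!###!###!###!###!###!#)";
int main()
{
    seqan3::sequence_file_input fin{std::istringstream{input}, seqan3::format_fastq{}};
    // The output file does not have to be configured to match the input file.
    seqan3::sequence_file_output fout{std::ostringstream{}, seqan3::format_fasta{}};
    for (auto & r : fin)
    {
        if (true) // placeholder for a real filter criterion on r
            fout.push_back(r);
    }
}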
Writing record-wise in batches
You can write multiple records at once by assigning a range of records to the file or by piping the range into it:
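#include <sstream>
#include <string>
#include <tuple>
#include <vector>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/io/sequence_file/output.hpp>
int main()
{
    using namespace seqan3::literals;
    seqan3::sequence_file_output fout{std::ostringstream{}, seqan3::format_fasta{}};
    // A range of "records", each holding a sequence and an ID.
    std::vector<std::tuple<seqan3::dna5_vector, std::string>> range{{"NATA"_dna5, "2nd"},
                                                                    {"GATA"_dna5, "Third"}};
    fout = range; // iterates over the records and writes them
    range | fout; // the same in pipe notation
}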
File I/O pipelines
Record-wise writing in batches also works for writing from input files directly to output files, because input files are also input ranges in SeqAn:
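#include <sstream>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/io/sequence_file/output.hpp>
auto input = R"(@TEST1
ACGT
+
##!#
@Test2
AGGCTGA
+
##!#!!!
@Test3
GGAGTATAATATATATATATATAT
+
##!###!###!###!###!###!#)";
int main()
{
    seqan3::sequence_file_input fin{std::istringstream{input}, seqan3::format_fastq{}};
    seqan3::sequence_file_output fout{std::ostringstream{}, seqan3::format_fasta{}};
    fout = fin; // converts FASTQ to FASTA by writing every record that is read
}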
This can be combined with file-based views to create I/O pipelines:
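#include <ranges>
#include <sstream>
#include <seqan3/alphabet/quality/concept.hpp>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/io/sequence_file/output.hpp>
auto input = R"(@TEST1
ACGT
+
##!#
@Test2
AGGCTGA
+
##!#!!!
@Test3
GGAGTATAATATATATATATATAT
+
##!###!###!###!###!###!#)";
int main()
{
    auto minimum_sequence_length_filter = std::views::filter(
        [](auto const & rec)
        {
            return std::ranges::distance(rec.sequence()) >= 50;
        });
    auto minimum_average_quality_filter = std::views::filter(
        [](auto const & record)
        {
            double qual_sum{0}; // sum of the Phred scores of all base qualities
            for (auto chr : record.base_qualities())
                qual_sum += seqan3::to_phred(chr);
            // Keep the record if its average quality is at least 20.
            return qual_sum / (std::ranges::distance(record.base_qualities())) >= 20;
        });
    seqan3::sequence_file_input input_file{std::istringstream{input}, seqan3::format_fastq{}};
    input_file | minimum_average_quality_filter | minimum_sequence_length_filter | std::views::take(3)
        | seqan3::sequence_file_output{std::ostringstream{}, seqan3::format_fasta{}};
}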
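To make the average-quality criterion concrete, the arithmetic can be sketched on a raw FASTQ quality string. This uses only the standard library and assumes the common Phred+33 ASCII encoding; average_phred33 and passes_average_quality are hypothetical helpers, not SeqAn functions.

```cpp
#include <string_view>

// Average Phred score of a FASTQ quality string, assuming Phred+33 encoding
// ('!' is 0, '#' is 2, 'I' is 40). Returns 0.0 for an empty string so that the
// division is always defined.
double average_phred33(std::string_view qualities)
{
    if (qualities.empty())
        return 0.0;

    double sum{0};
    for (char const c : qualities)
        sum += c - '!'; // offset of 33 removed

    return sum / static_cast<double>(qualities.size());
}

// Predicate in the spirit of the average-quality filter (hypothetical helper).
bool passes_average_quality(std::string_view qualities, double minimum)
{
    return average_phred33(qualities) >= minimum;
}
```

For the quality string of TEST1, "##!#", the scores are 2, 2, 0 and 2, giving an average of 1.5. Every record in the sample input averages below 20 and is shorter than 50 letters, so with thresholds of 20 and 50 no record of this input passes both filters.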
Column-based writing
The record-based interface treats the file as a range of tuples (the records), but in certain situations you might have the data as columns, i.e. a tuple-of-ranges, instead of range-of-tuples. In that case you can use column-based writing, which uses operator=() and seqan3::views::zip():
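#include <sstream>
#include <string>
#include <seqan3/alphabet/container/concatenated_sequences.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/io/sequence_file/output.hpp>
#include <seqan3/utility/views/zip.hpp>
struct data_storage_t
{
    // One column per field; fill these with your own data.
    seqan3::concatenated_sequences<seqan3::dna4_vector> sequences{};
    seqan3::concatenated_sequences<std::string> ids{};
};
int main()
{
    data_storage_t data_storage{};
    seqan3::sequence_file_output fout{std::ostringstream{}, seqan3::format_fasta{}};
    // zip() turns the two columns into a range of (sequence, id) tuples.
    fout = seqan3::views::zip(data_storage.sequences, data_storage.ids);
}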
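What zipping columns into records means can be sketched without SeqAn. The version below copies the elements into pairs, whereas a zip view would only refer to them; column_storage and zip_columns are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical column storage: one container per field (tuple-of-ranges).
struct column_storage
{
    std::vector<std::string> sequences;
    std::vector<std::string> ids;
};

// Pairs the i-th sequence with the i-th id (range-of-tuples), stopping at the
// end of the shorter column.
std::vector<std::pair<std::string, std::string>> zip_columns(column_storage const & columns)
{
    std::size_t const count = std::min(columns.sequences.size(), columns.ids.size());

    std::vector<std::pair<std::string, std::string>> records;
    records.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        records.emplace_back(columns.sequences[i], columns.ids[i]);

    return records;
}
```

Each resulting pair has the sequence first and the id second, which corresponds to the default field order of the output file.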
Formats
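We currently support writing the following formats:
- seqan3::format_fasta
- seqan3::format_fastq
- seqan3::format_embl
- seqan3::format_genbank
- seqan3::format_sam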
See also:
- IO
- Sequence File Input and Output