Raptor
A fast and space-efficient pre-filter
You will learn how to construct a Raptor index of large collections of nucleotide sequences.
| Difficulty | Easy |
|---|---|
| Duration | 30 min |
| Prerequisite tutorials | First steps with Raptor; for using the HIBF: Create a layout with Raptor |
| Recommended reading | Interleaved Bloom Filter (IBF), Hierarchical Interleaved Bloom Filter (HIBF) |
Raptor consists of two methods, `raptor build` and `raptor search`. The former creates an index over a given dataset so that it can be searched efficiently with `raptor search`. In this tutorial, we will look at `raptor build` in more detail.
Raptor can be used as a pre-filter for applications where searching the complete dataset is not feasible. For example, in read mapping you might want to compare your genome to the genomes of 100'000 other people. It can also be used for metagenomic classification, i.e., determining which microbes are present in a sample.
Regardless of the application, we start with a dataset of different nucleotide sequences over which we want to build an index. These sequences are typically available as FASTA or FASTQ files, but they can be in any format supported by `seqan3::sequence_file_input`. We summarise these in a list of paths and can then create a first index using the default values of Raptor.
A list of all parameters can be accessed as usual by typing `raptor build -h` or `raptor build --help`.

To build the index, create a file `all_paths.txt` that lists the paths of the sequence files and run `raptor build`. Note that it is sometimes better to use absolute paths instead of relative ones. `all_paths.txt` might look like:
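For illustration, `all_paths.txt` could contain one sequence file per line; the filenames here are placeholders, not files shipped with this tutorial:

```
mini_1.fasta
mini_2.fasta
mini_3.fasta
```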
And you should have run:
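A minimal invocation might look like the following; `--input` and `--output` are the flags listed in the help text, and the output name `raptor.index` is our choice:

```bash
raptor build --input all_paths.txt --output raptor.index
```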
Your directory should look like this:
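Assuming the placeholder files above, the directory could contain:

```
tutorial_directory/
├── all_paths.txt
├── mini_1.fasta
├── mini_2.fasta
├── mini_3.fasta
└── raptor.index
```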
The built index is needed by `raptor search`. Therefore, we recommend not deleting the files, including the built indexes.
Before explaining parameters, we would like to briefly explain the general idea of the Raptor index.
If we want to check whether a query is contained in a sample, we can use a Bloom Filter (BF) to create an index for the sample. Although this only gives us a probabilistic answer, it saves us a time-consuming complete mapping. If our dataset consists of many samples, there is a BF for each sample. By default, Raptor uses an Interleaved Bloom Filter (IBF), an efficient way to store these many Bloom Filters; the result is called the Raptor index. Another possibility is to use the Hierarchical Interleaved Bloom Filter (HIBF); more about this later in IBF vs HIBF.
To create the index, the individual samples of the dataset are chopped up into k-mers, which are passed through the hash functions to set bits in the sample's so-called bin of the Bloom Filter. This means that a k-mer from sample `i` sets, with `j` hash functions, `j` bits to `1` in bin `i`. When a query is searched, its k-mers are passed through the same hash functions, and we check in which bins all addressed positions contain ones. This can also result in false positives. Thus, the result only indicates that the query is probably part of a sample.
With `--kmer` you can specify the length of the k-mers, which should be long enough to avoid random hits. Using multiple hash functions (`--hash`) can sometimes further reduce the false positive rate. We found a useful Bloom Filter Calculator to estimate whether this helps; as it is not ours, we do not guarantee its accuracy. For this calculator, the number of inserted elements is the number of k-mers in a single bin; to be safe, use the number of k-mers in the biggest bin. Each Bloom Filter has a bit vector length which, summed over all Bloom Filters, gives the size of the Interleaved Bloom Filter; this size is automatically inferred. The lower the false positive rate, the bigger the index.
Let us build a second index with adjusted parameters and name it `raptor2.index`.
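A sketch of such a call; the parameter values (k-mer size 20, 3 hash functions) are illustrative assumptions, not recommendations:

```bash
raptor build --input all_paths.txt --kmer 20 --hash 3 --output raptor2.index
```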
Your directory should look like this:
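Continuing the placeholder example, the directory could now additionally contain the new index:

```
tutorial_directory/
├── all_paths.txt
├── mini_1.fasta
├── mini_2.fasta
├── mini_3.fasta
├── raptor.index
└── raptor2.index
```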
The k-mers can also be stored as minimisers, which saves space. To do this, first use `raptor prepare`. A minimiser works with windows, which means that you also have to define their size with `--window`. A window is always larger than a k-mer because it combines several k-mers. `raptor prepare` produces `*.minimiser` files.
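A sketch of such a preprocessing step; the window size of 24 and the output directory name `precomputed_minimisers` are illustrative assumptions:

```bash
raptor prepare --input all_paths.txt --kmer 20 --window 24 --output precomputed_minimisers
```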
Your directory should look like this:
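Assuming the preprocessing call above, the output directory could look like:

```
precomputed_minimisers/
├── mini_1.minimiser
├── mini_2.minimiser
└── mini_3.minimiser
```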
When hashing a sequence, there may be positions that do not count towards the final hash value. A shape offers an easy way to define such patterns: `--shape`.
This and other advanced options can be seen by typing `raptor build -hh` or `raptor build --advanced-help`. Raptor supports parallelisation: by specifying `--threads`, for example, the FASTQ records are processed simultaneously.
To reduce the overall memory consumption of the search, the index can be divided into multiple parts (a power of two). This is done by passing `--parts n`, where `n` is the number of parts to create. This creates `n` files, each representing one part of the index, and reduces the memory consumption of `raptor build` and `raptor search` by roughly a factor of `n`, since only one part is in memory at any given time. `raptor search` automatically detects the parts and needs no special parameters.
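For example, splitting the index into four parts could look like this; the part count and output name are illustrative:

```bash
raptor build --input all_paths.txt --parts 4 --output raptor_parted.index
```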
Raptor works with the Interleaved Bloom Filter by default. A newer feature is the Hierarchical Interleaved Bloom Filter (HIBF) (`raptor::hierarchical_interleaved_bloom_filter`), which is enabled when a layout file from `raptor layout` is used as input. It uses a more space-efficient method of storing the bins: it distinguishes between user bins, which represent the individual samples as before, and so-called technical bins, which may group several bins together. This is especially useful when the samples vary greatly in size.
To use the HIBF, a layout must be created before building the index. We have written an extra tutorial for this: Create a layout with Raptor. The layout file replaces the list of paths: instead of `--input all_bin_path.txt`, pass `--input layout.txt`.
We can set the desired false positive rate with `--fpr`. A call might then, for example, look like this:
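A sketch of such a call, with an assumed false positive rate of 0.05 and an output name of our choice:

```bash
raptor build --input layout.txt --fpr 0.05 --output hibf.index
```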
Use the data of the `1024` folder and the two layouts we created in the Create a layout with Raptor tutorial: `layout.txt` and `binning2.layout`. Let us use the HIBF with the default parameters and call the new indexes `hibf.index` and `hibf2.index`. The layouts were created with a k-mer size of 16, 3 hash functions and a false positive rate of 0.1.
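One possible solution, assuming the layout files from the layout tutorial and build parameters matching those used for the layouts:

```bash
raptor build --input layout.txt --fpr 0.1 --output hibf.index
raptor build --input binning2.layout --fpr 0.1 --output hibf2.index
```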
Your directory should look like this:
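Assuming the calls above, the directory could contain:

```
tutorial_directory/
├── layout.txt
├── binning2.layout
├── hibf.index
└── hibf2.index
```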
For implementation details, have a look at the `raptor::hierarchical_interleaved_bloom_filter` API.