Raptor
A fast and space-efficient pre-filter
|
raptor prepare
] -> raptor layout
-> raptor build
-> raptor search
raptor prepare
] -> raptor build
-> raptor search
Recommendation: HIBF
Cases in which to consider the IBF:
(k,k) | (w,k) | |
---|---|---|
Index size | ⬆️ | ⬇️ |
Runtime | ⬆️ | ⬇️ |
RAM usage | ⬆️ | ⬇️ |
Thresholding¹ | Exact | Heuristic |
¹ When searching with a given number of errors.
Recommendation: (w,k) with gentle compression (w-k=4)
Requirements:
w > k
w ≤ query length
Recommendation:
w - k = 4
w << query length
Examples:
w = 24
, k = 20
w = 28
, k = 24
Also see the figure in usage_w_vs_k_figure.
Requirements:
k ≤ query length
Recommendation:
k << query length
Examples (for two errors):
k = 20
k = 32
Depending on the number of errors that should be accounted for when searching, the kmer-size
(k
) has to be chosen such that the k-mer lemma still has a positive threshold.
K-mer counting lemma: For a given k
and number of errors e
, there are p
and an approximate occurrence of p
in text T
has to share at least
For example, when searching reads of length 100 and allowing 4 errors, k has to be at most 20 (100 − 20 + 1 − 4 · 20 = 1).
Furthermore, k shall be such that a random k-mer match in the database is unlikely. For example, we chose k = 32 for the RefSeq data set. In general, there is no drawback in choosing the (currently supported) maximum k of 32, as long as the aforementioned requirements are fulfilled.