The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins.
More...
template<data_layout data_layout_mode_ = data_layout::uncompressed>
class seqan3::interleaved_bloom_filter< data_layout_mode_ >
The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins.
- Template Parameters
-
data_layout_mode_ | Indicates whether the underlying data type is compressed. See seqan3::data_layout. |
Binning Directory
A binning directory is a data structure that can be used to determine set membership for elements. For example, a common use case is dividing a database into a fixed number (e.g. 1024) bins by some means of clustering (e.g. taxonomic binning or k-mer similarity clustering for genomic sequences). For a query, the binning directory can now answer in which bins the query (probably) occurs. In SeqAn we provide the Interleaved Bloom Filter (IBF) that can answer these queries efficiently.
Interleaved Bloom Filter (IBF)
The Interleaved Bloom Filter is a probabilistic data structure that extends the Bloom Filter. A Bloom Filter can be thought of as a bitvector of length n
and h
hash functions and is used to determine set membership. To insert data, the data is hashed by the h
hash functions (returning values in [0, n)
) and the corresponding h
positions in the bitvector are set to 1
. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h
hash functions and the corresponding positions are checked. If all h
positions contain a 1
, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true
if the query was inserted, i.e. there may be false positives, but no false negatives.
The Interleaved Bloom Filter now applies the concept of a Bloom Filter to multiple sets and provides a global data structure to determine set membership of a query in b
data sets/bins. Conceptually, a Bloom Filter is created for each bin using the same fixed length and fixed hash functions for each filter. The resulting b
Bloom Filters are then interleaved such that the i
'th bit if each Bloom Filter are adjacent to each other:
Bloom Filter 0 Bloom Filter 1 Bloom Filter 2 Bloom Filter 3
|0.0|0.1|0.2|0.3| |1.0|1.1|1.2|1.3| |2.0|2.1|2.2|2.3| |3.0|3.1|3.2|3.3|
Where x.y
denotes the y
'th bit of the x
'th Bloom Filter.
Interleaved Bloom Filter
|0.0|1.0|2.0|3.0|0.1|1.1|2.1|3.1|0.2|1.2|2.2|3.2|0.3|1.3|2.3|3.3|
A query can now be searched in all b
bins by computing the h
hash functions, retrieving the h
sub-bitvectors of length b
starting at the positions indicated by the hash functions. The bitwise AND of these sub-bitvectors yields the binningvector, a bitvector of length b
where the i
'th bit indicates set membership in the i
'th bin.
Querying
To query the Interleaved Bloom Filter for a value, call seqan3::interleaved_bloom_filter::membership_agent() and use the returned seqan3::interleaved_bloom_filter::membership_agent.
Compression
The Interleaved Bloom Filter can be compressed by passing data_layout::compressed
as template argument. The compressed seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>
can only be constructed from a seqan3::interleaved_bloom_filter
, in which case the underlying bitvector is compressed. The compressed Interleaved Bloom Filter is immutable, i.e. only querying is supported.
Thread safety
The Interleaved Bloom Filter promises the basic thread-safety by the STL that all calls to const
member functions are safe from multiple threads (as long as no thread calls a non-const
member function at the same time).
Additionally, concurrent calls to emplace
are safe iff each thread handles a multiple of wordsize (=64) many bins. For example, calls to emplace
from multiple threads are safe if thread_1
accesses bins 0-63, thread_2
bins 64-127, and so on.