The Bloom Filter. A data structure that efficiently answers set-membership queries.
More...
template<
data_layout data_layout_mode_ = data_layout::uncompressed>
class seqan3::bloom_filter< data_layout_mode_ >
The Bloom Filter. A data structure that efficiently answers set-membership queries.
- Template Parameters
-
data_layout_mode_ | Indicates whether the underlying data type is compressed. See seqan3::data_layout. |
Bloom Filter (BF)
The Bloom Filter is a probabilistic data structure. A Bloom Filter can be thought of as a bitvector of length n
and h
hash functions and is used to determine set membership. To insert data, the data is hashed by the h
hash functions (returning values in [0, n)
) and the corresponding h
positions in the bitvector are set to 1
. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h
hash functions and the corresponding positions are checked. If all h
positions contain a 1
, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true
if the query was inserted, i.e. there may be false positives, but no false negatives.
Querying
To query the Bloom Filter for a value, call seqan3::bloom_filter::contains
which returns true if the k-mer hash is present in the index, and false if the hash is not present. The value is a hash value of the k-mer to check membership for.
To query the Bloom Filter for a range of values, call seqan3::bloom_filter::count
which returns the number of k-mer hits in the Bloom Filter for the given range of values.
Please note the results are based on a heuristic data structure and with a certain probability (depending on the selected size of the bit vector) you may receive a false positive result.
Differences to the Interleaved Bloom Filter (IBF)
While the Bloom Filter provides a single linear bit vector to represent the underlying data, the Interleaved Bloom Filter provides a data structure that combines a set of Bloom Filters to enable efficient queries to multiple fractions of the data. In doing so, the Interleaved Bloom Filter can not only answer whether a hash value is present in the data, but also provides information in which fraction of the data it occurs. The design of the Interleaved Bloom Filter is particularly useful when the underlying data is systematically structured; for example, if each fraction of the data represents a specific set of organisms. Important applications of the Interleaved Bloom Filter include taxonomic classification of sequencing data, or prefiltering of specific fractions of an input data set to enable more efficient in-depth analysis. The Bloom Filter, on the other hand, is useful if the database does not contain any underlying structure, or it is not relevant for the analysis. A typical application is the removal of host sequences or different types of contamination where it is usually not of interest which part of the database was matched. In such cases, the Bloom Filter provides a lighter data structure and a more simple interface (for example, the use of agents for determining and counting membership is not necessary in this case).
Compression
The Bloom Filter can be compressed by passing seqan3::data_layout::compressed
as template argument. The compressed seqan3::bloom_filter<seqan3::data_layout::compressed>
can only be constructed from a seqan3::bloom_filter
, in which case the underlying bitvector is compressed. The compressed Bloom Filter is immutable, i.e. only querying is supported.
Thread safety
The Bloom Filter promises the basic thread-safety by the STL that all calls to const
member functions are safe from multiple threads (as long as no thread calls a non-const
member function at the same time).
- See also
- seqan3::interleaved_bloom_filter