SeqAn3 3.4.0-rc.4
The Modern C++ library for sequence analysis.
|
The Bloom Filter. A data structure that efficiently answers set-membership queries. More...
#include <seqan3/utility/bloom_filter/bloom_filter.hpp>
Public Member Functions | |
template<cereal_archive archive_t> | |
void | serialize (archive_t &archive) |
Serialisation support function. | |
Constructors, destructor and assignment | |
bloom_filter ()=default | |
Defaulted. | |
bloom_filter (bloom_filter const &)=default | |
Defaulted. | |
bloom_filter & | operator= (bloom_filter const &)=default |
Defaulted. | |
bloom_filter (bloom_filter &&)=default | |
Defaulted. | |
bloom_filter & | operator= (bloom_filter &&)=default |
Defaulted. | |
~bloom_filter ()=default | |
Defaulted. | |
bloom_filter (seqan3::bin_size size, seqan3::hash_function_count funs=seqan3::hash_function_count{2u}) | |
Construct an uncompressed Bloom Filter. | |
bloom_filter (bloom_filter< data_layout::uncompressed > const &bf) | |
Construct a compressed Bloom Filter. | |
Modifiers | |
void | emplace (size_t const value) noexcept |
Inserts a value into the Bloom Filter. | |
void | reset () noexcept |
Remove all values from the Bloom Filter by setting all bits to 0. | |
Lookup | |
bool | contains (size_t const value) const noexcept |
Check whether a value is present in the Bloom Filter. | |
Counting | |
template<std::ranges::range value_range_t> | |
size_t | count (value_range_t &&values) const noexcept |
Counts the occurrences for all values in a range. | |
Capacity | |
size_t | hash_function_count () const noexcept |
Returns the number of hash functions used in the Bloom Filter. | |
size_t | bit_size () const noexcept |
Returns the size of the underlying bitvector. | |
Access | |
constexpr data_type & | raw_data () noexcept |
Provides direct, unsafe access to the underlying data structure. | |
constexpr data_type const & | raw_data () const noexcept |
Provides direct, unsafe access to the underlying data structure. | |
Static Public Attributes | |
static constexpr data_layout | data_layout_mode = data_layout_mode_ |
Indicates whether the Bloom Filter is compressed. | |
Private Types | |
using | data_type = std::conditional_t< data_layout_mode_==data_layout::uncompressed, seqan3::contrib::sdsl::bit_vector, seqan3::contrib::sdsl::sd_vector<> > |
The underlying datatype to use. | |
Private Member Functions | |
constexpr size_t | hash_and_fit (size_t h, size_t const seed) const |
Perturbs a value and fits it into the vector. | |
Private Attributes | |
data_type | data {} |
The bitvector. | |
size_t | hash_funs {} |
The number of hash functions. | |
size_t | hash_shift {} |
The number of bits to shift the hash value before doing multiplicative hashing. | |
size_t | size_in_bits {} |
The size of the underlying bit vector in bits. | |
Static Private Attributes | |
static constexpr std::array< size_t, 5 > | hash_seeds |
Precalculated seeds for multiplicative hashing. We use large irrational numbers for a uniform hashing. | |
Friends | |
Comparison operators | |
bool | operator== (bloom_filter const &lhs, bloom_filter const &rhs) noexcept |
Test for equality. | |
bool | operator!= (bloom_filter const &lhs, bloom_filter const &rhs) noexcept |
Test for inequality. | |
The Bloom Filter. A data structure that efficiently answers set-membership queries.
data_layout_mode_ | Indicates whether the underlying data type is compressed. See seqan3::data_layout. |
The Bloom Filter is a probabilistic data structure. A Bloom Filter can be thought of as a bitvector of length n
and h
hash functions and is used to determine set membership. To insert data, the data is hashed by the h
hash functions (returning values in [0, n)
) and the corresponding h
positions in the bitvector are set to 1
. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h
hash functions and the corresponding positions are checked. If all h
positions contain a 1
, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true
if the query was inserted, i.e. there may be false positives, but no false negatives.
To query the Bloom Filter for a value, call seqan3::bloom_filter::contains
which returns true if the k-mer hash is present in the index, and false if the hash is not present. The value is a hash value of the k-mer to check membership for.
To query the Bloom Filter for a range of values, call seqan3::bloom_filter::count
which returns the number of k-mer hits in the Bloom Filter for the given range of values.
Please note the results are based on a heuristic data structure and with a certain probability (depending on the selected size of the bit vector) you may receive a false positive result.
While the Bloom Filter provides a single linear bit vector to represent the underlying data, the Interleaved Bloom Filter provides a data structure that combines a set of Bloom Filters to enable efficient queries to multiple fractions of the data. In doing so, the Interleaved Bloom Filter can not only answer whether a hash value is present in the data, but also provides information in which fraction of the data it occurs. The design of the Interleaved Bloom Filter is particularly useful when the underlying data is systematically structured; for example, if each fraction of the data represents a specific set of organisms. Important applications of the Interleaved Bloom Filter include taxonomic classification of sequencing data, or prefiltering of specific fractions of an input data set to enable more efficient in-depth analysis. The Bloom Filter, on the other hand, is useful if the database does not contain any underlying structure, or it is not relevant for the analysis. A typical application is the removal of host sequences or different types of contamination where it is usually not of interest which part of the database was matched. In such cases, the Bloom Filter provides a lighter data structure and a more simple interface (for example, the use of agents for determining and counting membership is not necessary in this case).
The Bloom Filter can be compressed by passing seqan3::data_layout::compressed
as template argument. The compressed seqan3::bloom_filter<seqan3::data_layout::compressed>
can only be constructed from a seqan3::bloom_filter
, in which case the underlying bitvector is compressed. The compressed Bloom Filter is immutable, i.e. only querying is supported.
The Bloom Filter promises the basic thread-safety by the STL that all calls to const
member functions are safe from multiple threads (as long as no thread calls a non-const
member function at the same time).
|
no-apiinline |
Construct an uncompressed Bloom Filter.
size | The bit vector size (in bits). |
funs | The number of hash functions. Default 2. At least 1, at most 5. |
|
no-apiinline |
Construct a compressed Bloom Filter.
[in] | bf | The uncompressed seqan3::bloom_filter. |
|
no-apiinlinenoexcept |
Returns the size of the underlying bitvector.
|
no-apiinlinenoexcept |
Check whether a value is present in the Bloom Filter.
[in] | value | The raw numeric value to process. |
|
no-apiinlinenoexcept |
Counts the occurrences for all values in a range.
value_range_t | The type of the range of values. Must model std::ranges::input_range. The reference type must model std::unsigned_integral. |
[in] | values | The range of values to process. |
Concurrent invocations of this function are thread safe.
|
no-apiinlinenoexcept |
Inserts a value into the Bloom Filter.
[in] | value | The raw numeric value to process. |
|
no-apiinlineconstexprprivate |
Perturbs a value and fits it into the vector.
h | The value to process. |
seed | The seed to use. |
data
.
|
no-apiinlinenoexcept |
Returns the number of hash functions used in the Bloom Filter.
|
no-apiinlineconstexprnoexcept |
Provides direct, unsafe access to the underlying data structure.
|
no-apiinlineconstexprnoexcept |
Provides direct, unsafe access to the underlying data structure.
|
no-apiinlinenoexcept |
Remove all values from the Bloom Filter by setting all bits to 0.
While all values are removed from the vector, the size of the Bloom Filter is not changed.
|
no-apiinline |
Serialisation support function.
archive_t | Type of archive ; must satisfy seqan3::cereal_archive. |
[in] | archive | The archive being serialised from/to. |
|
no-apifriend |
Test for inequality.
[in] | lhs | A seqan3::bloom_filter . |
[in] | rhs | seqan3::bloom_filter to compare to. |
true
if unequal, false
otherwise.
|
no-apifriend |
Test for equality.
[in] | lhs | A seqan3::bloom_filter . |
[in] | rhs | seqan3::bloom_filter to compare to. |
true
if equal, false
otherwise.
|
no-apistaticconstexprprivate |
Precalculated seeds for multiplicative hashing. We use large irrational numbers for a uniform hashing.