SeqAn3 3.4.0-rc.3
The Modern C++ library for sequence analysis.
|
The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins. More...
#include <seqan3/search/dream_index/interleaved_bloom_filter.hpp>
Classes | |
class | counting_agent_type |
Manages counting ranges of values for the seqan3::interleaved_bloom_filter. More... | |
class | membership_agent_type |
Manages membership queries for the seqan3::interleaved_bloom_filter. More... | |
Public Member Functions | |
Constructors, destructor and assignment | |
interleaved_bloom_filter ()=default | |
Defaulted. | |
interleaved_bloom_filter (interleaved_bloom_filter const &)=default | |
Defaulted. | |
interleaved_bloom_filter & | operator= (interleaved_bloom_filter const &)=default |
Defaulted. | |
interleaved_bloom_filter (interleaved_bloom_filter &&)=default | |
Defaulted. | |
interleaved_bloom_filter & | operator= (interleaved_bloom_filter &&)=default |
Defaulted. | |
~interleaved_bloom_filter ()=default | |
Defaulted. | |
interleaved_bloom_filter (seqan3::bin_count bins_, seqan3::bin_size size, seqan3::hash_function_count funs=seqan3::hash_function_count{2u}) | |
Construct an uncompressed Interleaved Bloom Filter. | |
interleaved_bloom_filter (interleaved_bloom_filter< data_layout::compressed > const &ibf) | |
Construct an uncompressed Interleaved Bloom Filter from a compressed one. | |
interleaved_bloom_filter (interleaved_bloom_filter< data_layout::uncompressed > const &ibf) | |
Construct a compressed Interleaved Bloom Filter. | |
Modifiers | |
void | emplace (size_t const value, bin_index const bin) noexcept |
Inserts a value into a specific bin. | |
void | clear (bin_index const bin) noexcept |
Clears a specific bin. | |
template<typename rng_t > requires (data_layout_mode == data_layout::uncompressed) | |
void | clear (rng_t &&bin_range) noexcept |
Clears a range of bins. | |
void | increase_bin_number_to (bin_count const new_bins_) |
Increases the number of bins stored in the Interleaved Bloom Filter. | |
Lookup | |
membership_agent_type | membership_agent () const |
Returns a seqan3::interleaved_bloom_filter::membership_agent_type to be used for lookup. | |
template<typename value_t = uint16_t> | |
counting_agent_type< value_t > | counting_agent () const |
Returns a seqan3::interleaved_bloom_filter::counting_agent_type to be used for counting. | |
Capacity | |
size_t | hash_function_count () const noexcept |
Returns the number of hash functions used in the Interleaved Bloom Filter. | |
size_t | bin_count () const noexcept |
Returns the number of bins that the Interleaved Bloom Filter manages. | |
size_t | bin_size () const noexcept |
Returns the size of a single bin that the Interleaved Bloom Filter manages. | |
size_t | bit_size () const noexcept |
Returns the size of the underlying bitvector. | |
Access | |
constexpr data_type & | raw_data () noexcept |
Provides direct, unsafe access to the underlying data structure. | |
constexpr data_type const & | raw_data () const noexcept |
Provides direct, unsafe access to the underlying data structure. | |
Static Public Attributes | |
static constexpr data_layout | data_layout_mode = data_layout_mode_ |
Indicates whether the Interleaved Bloom Filter is compressed. | |
Friends | |
Comparison operators | |
bool | operator== (interleaved_bloom_filter const &lhs, interleaved_bloom_filter const &rhs) noexcept |
Test for equality. | |
bool | operator!= (interleaved_bloom_filter const &lhs, interleaved_bloom_filter const &rhs) noexcept |
Test for inequality. | |
The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins.
data_layout_mode_ | Indicates whether the underlying data type is compressed. See seqan3::data_layout. |
A binning directory is a data structure that can be used to determine set membership for elements. For example, a common use case is dividing a database into a fixed number (e.g. 1024) bins by some means of clustering (e.g. taxonomic binning or k-mer similarity clustering for genomic sequences). For a query, the binning directory can now answer in which bins the query (probably) occurs. In SeqAn we provide the Interleaved Bloom Filter (IBF) that can answer these queries efficiently.
The Interleaved Bloom Filter is a probabilistic data structure that extends the Bloom Filter. A Bloom Filter can be thought of as a bitvector of length n
and h
hash functions and is used to determine set membership. To insert data, the data is hashed by the h
hash functions (returning values in [0, n)
) and the corresponding h
positions in the bitvector are set to 1
. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h
hash functions and the corresponding positions are checked. If all h
positions contain a 1
, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true
if the query was inserted, i.e. there may be false positives, but no false negatives.
The Interleaved Bloom Filter now applies the concept of a Bloom Filter to multiple sets and provides a global data structure to determine set membership of a query in b
data sets/bins. Conceptually, a Bloom Filter is created for each bin using the same fixed length and fixed hash functions for each filter. The resulting b
Bloom Filters are then interleaved such that the i
'th bit if each Bloom Filter are adjacent to each other:
Where x.y
denotes the y
'th bit of the x
'th Bloom Filter.
A query can now be searched in all b
bins by computing the h
hash functions, retrieving the h
sub-bitvectors of length b
starting at the positions indicated by the hash functions. The bitwise AND of these sub-bitvectors yields the binningvector, a bitvector of length b
where the i
'th bit indicates set membership in the i
'th bin.
To query the Interleaved Bloom Filter for a value, call seqan3::interleaved_bloom_filter::membership_agent() and use the returned seqan3::interleaved_bloom_filter::membership_agent_type.
To count the occurrences of a range of values in the Interleaved Bloom Filter, call seqan3::interleaved_bloom_filter::counting_agent() and use the returned seqan3::interleaved_bloom_filter::counting_agent_type.
The Interleaved Bloom Filter can be compressed by passing data_layout::compressed
as template argument. The compressed seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed>
can only be constructed from a seqan3::interleaved_bloom_filter
, in which case the underlying bitvector is compressed. The compressed Interleaved Bloom Filter is immutable, i.e. only querying is supported.
The Interleaved Bloom Filter promises the basic thread-safety by the STL that all calls to const
member functions are safe from multiple threads (as long as no thread calls a non-const
member function at the same time).
Additionally, concurrent calls to emplace
are safe iff each thread handles a multiple of wordsize (=64) many bins. For example, calls to emplace
from multiple threads are safe if thread_1
accesses bins 0-63, thread_2
bins 64-127, and so on.
|
inline |
Construct an uncompressed Interleaved Bloom Filter.
bins_ | The number of bins. |
size | The bitvector size. |
funs | The number of hash functions. Default 2. At least 1, at most 5. |
|
inline |
Construct an uncompressed Interleaved Bloom Filter from a compressed one.
[in] | ibf | The compressed seqan3::interleaved_bloom_filter. |
|
inline |
Construct a compressed Interleaved Bloom Filter.
[in] | ibf | The uncompressed seqan3::interleaved_bloom_filter. |
|
inlinenoexcept |
Returns the number of bins that the Interleaved Bloom Filter manages.
|
inlinenoexcept |
Returns the size of a single bin that the Interleaved Bloom Filter manages.
|
inlinenoexcept |
Returns the size of the underlying bitvector.
|
inlinenoexcept |
Clears a specific bin.
[in] | bin | The bin index to clear. |
|
inlinenoexcept |
Clears a range of bins.
rng_t | The type of the range. Must model std::ranges::forward_range and the reference type must be seqan3::bin_index. |
[in] | bin_range | The range of bins to clear. |
|
inline |
Returns a seqan3::interleaved_bloom_filter::counting_agent_type to be used for counting.
seqan3::interleaved_bloom_filter::counting_agent_type
s constructed for this Interleaved Bloom Filter.
|
inlinenoexcept |
Inserts a value into a specific bin.
[in] | value | The raw numeric value to process. |
[in] | bin | The bin index to insert into. |
|
inlinenoexcept |
Returns the number of hash functions used in the Interleaved Bloom Filter.
|
inline |
Increases the number of bins stored in the Interleaved Bloom Filter.
[in] | new_bins_ | The new number of bins. |
std::invalid_argument | If passed number of bins is smaller than current number of bins. |
The resulting seqan3::interleaved_bloom_filter
has an increased size proportional to the increase in the bin_words
(the number of 64-bit words needed to represent bins
many bins), e.g. resizing a seqan3::interleaved_bloom_filter
with 40 bins to 73 bins also increases the bin_words
from 1 to 2 and hence the new seqan3::interleaved_bloom_filter
will be twice the size. This increase in size is necessary to avoid invalidating all computed hash functions. If you want to add more bins while keeping the size constant, you need to rebuild the seqan3::interleaved_bloom_filter
.
|
inline |
Returns a seqan3::interleaved_bloom_filter::membership_agent_type to be used for lookup.
seqan3::interleaved_bloom_filter::membership_agent_type
s constructed for this Interleaved Bloom Filter.
|
inlineconstexprnoexcept |
Provides direct, unsafe access to the underlying data structure.
|
inlineconstexprnoexcept |
Provides direct, unsafe access to the underlying data structure.
|
friend |
Test for inequality.
[in] | lhs | A seqan3::interleaved_bloom_filter . |
[in] | rhs | seqan3::interleaved_bloom_filter to compare to. |
true
if unequal, false
otherwise.
|
friend |
Test for equality.
[in] | lhs | A seqan3::interleaved_bloom_filter . |
[in] | rhs | seqan3::interleaved_bloom_filter to compare to. |
true
if equal, false
otherwise.