SeqAn3 3.1.0
The Modern C++ library for sequence analysis.
seqan3::interleaved_bloom_filter< data_layout_mode_ > Class Template Reference

The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins. More...

#include <seqan3/search/dream_index/interleaved_bloom_filter.hpp>

+ Inheritance diagram for seqan3::interleaved_bloom_filter< data_layout_mode_ >:

Classes

class  counting_agent_type
 Manages counting ranges of values for the seqan3::interleaved_bloom_filter. More...
 
class  membership_agent
 Manages membership queries for the seqan3::interleaved_bloom_filter. More...
 

Public Member Functions

Constructors, destructor and assignment
 interleaved_bloom_filter ()=default
 Defaulted.
 
 interleaved_bloom_filter (interleaved_bloom_filter const &)=default
 Defaulted.
 
interleaved_bloom_filteroperator= (interleaved_bloom_filter const &)=default
 Defaulted.
 
 interleaved_bloom_filter (interleaved_bloom_filter &&)=default
 Defaulted.
 
interleaved_bloom_filteroperator= (interleaved_bloom_filter &&)=default
 Defaulted.
 
 ~interleaved_bloom_filter ()=default
 Defaulted.
 
 interleaved_bloom_filter (seqan3::bin_count bins_, seqan3::bin_size size, seqan3::hash_function_count funs=seqan3::hash_function_count{2u})
 Construct an uncompressed Interleaved Bloom Filter. More...
 
 interleaved_bloom_filter (interleaved_bloom_filter< data_layout::uncompressed > const &ibf)
 Construct a compressed Interleaved Bloom Filter. More...
 
Modifiers
void emplace (size_t const value, bin_index const bin) noexcept
 Inserts a value into a specific bin. More...
 
void clear (bin_index const bin) noexcept
 Clears a specific bin. More...
 
template<typename rng_t >
void clear (rng_t &&bin_range) noexcept
 Clears a range of bins. More...
 
void increase_bin_number_to (bin_count const new_bins_)
 Increases the number of bins stored in the Interleaved Bloom Filter. More...
 
Lookup
membership_agent membership_agent () const
 Returns a seqan3::interleaved_bloom_filter::membership_agent to be used for lookup. More...
 
template<typename value_t = uint16_t>
counting_agent_type< value_t > counting_agent () const
 Returns a seqan3::interleaved_bloom_filter::counting_agent_type to be used for counting. More...
 
Capacity
size_t hash_function_count () const noexcept
 Returns the number of hash functions used in the Interleaved Bloom Filter. More...
 
size_t bin_count () const noexcept
 Returns the number of bins that the Interleaved Bloom Filter manages. More...
 
size_t bin_size () const noexcept
 Returns the size of a single bin that the Interleaved Bloom Filter manages. More...
 
size_t bit_size () const noexcept
 Returns the size of the underlying bitvector. More...
 
Access
constexpr data_typeraw_data () noexcept
 Provides direct, unsafe access to the underlying data structure. More...
 
constexpr data_type const & raw_data () const noexcept
 Provides direct, unsafe access to the underlying data structure. More...
 

Static Public Attributes

static constexpr data_layout data_layout_mode = data_layout_mode_
 Indicates whether the Interleaved Bloom Filter is compressed.
 

Friends

Comparison operators
bool operator== (interleaved_bloom_filter const &lhs, interleaved_bloom_filter const &rhs) noexcept
 Test for equality. More...
 
bool operator!= (interleaved_bloom_filter const &lhs, interleaved_bloom_filter const &rhs) noexcept
 Test for inequality. More...
 

Detailed Description

template<data_layout data_layout_mode_ = data_layout::uncompressed>
class seqan3::interleaved_bloom_filter< data_layout_mode_ >

The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins.

Template Parameters
data_layout_mode_Indicates whether the underlying data type is compressed. See seqan3::data_layout.

Binning Directory

A binning directory is a data structure that can be used to determine set membership for elements. For example, a common use case is dividing a database into a fixed number (e.g. 1024) bins by some means of clustering (e.g. taxonomic binning or k-mer similarity clustering for genomic sequences). For a query, the binning directory can now answer in which bins the query (probably) occurs. In SeqAn we provide the Interleaved Bloom Filter (IBF) that can answer these queries efficiently.

Interleaved Bloom Filter (IBF)

The Interleaved Bloom Filter is a probabilistic data structure that extends the Bloom Filter. A Bloom Filter can be thought of as a bitvector of length n and h hash functions and is used to determine set membership. To insert data, the data is hashed by the h hash functions (returning values in [0, n)) and the corresponding h positions in the bitvector are set to 1. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h hash functions and the corresponding positions are checked. If all h positions contain a 1, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true if the query was inserted, i.e. there may be false positives, but no false negatives.

The Interleaved Bloom Filter now applies the concept of a Bloom Filter to multiple sets and provides a global data structure to determine set membership of a query in b data sets/bins. Conceptually, a Bloom Filter is created for each bin using the same fixed length and fixed hash functions for each filter. The resulting b Bloom Filters are then interleaved such that the i'th bit if each Bloom Filter are adjacent to each other:

Bloom Filter 0 Bloom Filter 1 Bloom Filter 2 Bloom Filter 3
|0.0|0.1|0.2|0.3| |1.0|1.1|1.2|1.3| |2.0|2.1|2.2|2.3| |3.0|3.1|3.2|3.3|

Where x.y denotes the y'th bit of the x'th Bloom Filter.

Interleaved Bloom Filter
|0.0|1.0|2.0|3.0|0.1|1.1|2.1|3.1|0.2|1.2|2.2|3.2|0.3|1.3|2.3|3.3|

A query can now be searched in all b bins by computing the h hash functions, retrieving the h sub-bitvectors of length b starting at the positions indicated by the hash functions. The bitwise AND of these sub-bitvectors yields the binningvector, a bitvector of length b where the i'th bit indicates set membership in the i'th bin.

Querying

To query the Interleaved Bloom Filter for a value, call seqan3::interleaved_bloom_filter::membership_agent() and use the returned seqan3::interleaved_bloom_filter::membership_agent.

To count the occurrences of a range of values in the Interleaved Bloom Filter, call seqan3::interleaved_bloom_filter::counting_agent() and use the returned seqan3::interleaved_bloom_filter::counting_agent_type.

Compression

The Interleaved Bloom Filter can be compressed by passing data_layout::compressed as template argument. The compressed seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed> can only be constructed from a seqan3::interleaved_bloom_filter, in which case the underlying bitvector is compressed. The compressed Interleaved Bloom Filter is immutable, i.e. only querying is supported.

Thread safety

The Interleaved Bloom Filter promises the basic thread-safety by the STL that all calls to const member functions are safe from multiple threads (as long as no thread calls a non-const member function at the same time).

Additionally, concurrent calls to emplace are safe iff each thread handles a multiple of wordsize (=64) many bins. For example, calls to emplace from multiple threads are safe if thread_1 accesses bins 0-63, thread_2 bins 64-127, and so on.

Constructor & Destructor Documentation

◆ interleaved_bloom_filter() [1/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
seqan3::interleaved_bloom_filter< data_layout_mode_ >::interleaved_bloom_filter ( seqan3::bin_count  bins_,
seqan3::bin_size  size,
seqan3::hash_function_count  funs = seqan3::hash_function_count{2u} 
)
inline

Construct an uncompressed Interleaved Bloom Filter.

Parameters
bins_The number of bins.
sizeThe bitvector size.
funsThe number of hash functions. Default 2. At least 1, at most 5.
Attention
This constructor can only be used to construct uncompressed Interleaved Bloom Filters.

Example

int main()
{
// Construct an Interleaved Bloom Filter that contains 43 bins, each using 8192 bits, and 3 hash functions.
// Construct an Interleaved Bloom Filter that contains 43 bins, each using 256 KiBits,
// and the default of 2 hash functions.
}
The IBF binning directory. A data structure that efficiently answers set-membership queries for multi...
Definition: interleaved_bloom_filter.hpp:134
Provides seqan3::interleaved_bloom_filter.
A strong type that represents the number of bins for the seqan3::interleaved_bloom_filter.
Definition: interleaved_bloom_filter.hpp:36
A strong type that represents the number of bits for each bin in the seqan3::interleaved_bloom_filter...
Definition: interleaved_bloom_filter.hpp:43
A strong type that represents the number of hash functions for the seqan3::interleaved_bloom_filter.
Definition: interleaved_bloom_filter.hpp:50

◆ interleaved_bloom_filter() [2/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
seqan3::interleaved_bloom_filter< data_layout_mode_ >::interleaved_bloom_filter ( interleaved_bloom_filter< data_layout::uncompressed > const &  ibf)
inline

Construct a compressed Interleaved Bloom Filter.

Parameters
[in]ibfThe uncompressed seqan3::interleaved_bloom_filter.
Attention
This constructor can only be used to construct compressed Interleaved Bloom Filters.

Example

int main()
{
// Construct an uncompressed Interleaved Bloom Filter.
// Fill `ibf` with data.
// Construct an immutable, compressed Interleaved Bloom Filter.
}

Member Function Documentation

◆ bin_count()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
size_t seqan3::interleaved_bloom_filter< data_layout_mode_ >::bin_count ( ) const
inlinenoexcept

Returns the number of bins that the Interleaved Bloom Filter manages.

Returns
The number of bins.

◆ bin_size()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
size_t seqan3::interleaved_bloom_filter< data_layout_mode_ >::bin_size ( ) const
inlinenoexcept

Returns the size of a single bin that the Interleaved Bloom Filter manages.

Returns
The size in bits of a single bin.

◆ bit_size()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
size_t seqan3::interleaved_bloom_filter< data_layout_mode_ >::bit_size ( ) const
inlinenoexcept

Returns the size of the underlying bitvector.

Returns
The size in bits of the underlying bitvector.

◆ clear() [1/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
void seqan3::interleaved_bloom_filter< data_layout_mode_ >::clear ( bin_index const  bin)
inlinenoexcept

Clears a specific bin.

Parameters
[in]binThe bin index to clear.
Attention
This function is only available for uncompressed Interleaved Bloom Filters.

Example

using namespace seqan3::literals;
int main()
{
auto const sequence1 = "ACTGACTGACTGATC"_dna4;
auto const sequence2 = "GTGACTGACTGACTCG"_dna4;
auto const sequence3 = "AAAAAAACGATCGACA"_dna4;
auto hash_adaptor = seqan3::views::kmer_hash(seqan3::ungapped{5u});
// Insert all 5-mers of sequence1 into bin 0
for (auto && value : sequence1 | hash_adaptor)
ibf.emplace(value, seqan3::bin_index{0u});
// Insert all 5-mers of sequence2 into bin 4
for (auto && value : sequence2 | hash_adaptor)
ibf.emplace(value, seqan3::bin_index{4u});
// Insert all 5-mers of sequence3 into bin 7
for (auto && value : sequence3 | hash_adaptor)
ibf.emplace(value, seqan3::bin_index{7u});
auto agent = ibf.counting_agent();
// Count all 5-mers of sequence1 for all bins
seqan3::debug_stream << agent.bulk_count(sequence1 | hash_adaptor) << '\n'; // [11,0,0,0,9,0,0,0]
// Clear bin 0
ibf.clear(seqan3::bin_index{0u});
// After clearing, no 5-mers are found in bin 0
seqan3::debug_stream << agent.bulk_count(sequence1 | hash_adaptor) << '\n'; // [0,0,0,0,9,0,0,0]
// Search for specific values
seqan3::debug_stream << agent.bulk_count(std::views::iota(0u, 1024u)) << '\n'; // [0,0,0,0,7,0,0,10]
// Clear bin 4 and 7
// After clearing, nothing is found
seqan3::debug_stream << agent.bulk_count(std::views::iota(0u, 1024u)) << '\n'; // [0,0,0,0,0,0,0,0]
}
Provides seqan3::debug_stream and related types.
Provides seqan3::dna4, container aliases and string literals.
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:37
constexpr auto kmer_hash
Computes hash values for each position of a range via a given shape.
Definition: kmer_hash.hpp:785
Provides seqan3::views::kmer_hash.
The SeqAn namespace for literals.
A strong type that represents the bin index for the seqan3::interleaved_bloom_filter.
Definition: interleaved_bloom_filter.hpp:57
A strong type of underlying type uint8_t that represents the ungapped shape size.
Definition: shape.hpp:25

◆ clear() [2/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
template<typename rng_t >
void seqan3::interleaved_bloom_filter< data_layout_mode_ >::clear ( rng_t &&  bin_range)
inlinenoexcept

Clears a range of bins.

Template Parameters
rng_tThe type of the range. Must model std::ranges::forward_range and the reference type must be seqan3::bin_index.
Parameters
[in]bin_rangeThe range of bins to clear.
Attention
This function is only available for uncompressed Interleaved Bloom Filters.

Example

using namespace seqan3::literals;
int main()
{
auto const sequence1 = "ACTGACTGACTGATC"_dna4;
auto const sequence2 = "GTGACTGACTGACTCG"_dna4;
auto const sequence3 = "AAAAAAACGATCGACA"_dna4;
auto hash_adaptor = seqan3::views::kmer_hash(seqan3::ungapped{5u});
// Insert all 5-mers of sequence1 into bin 0
for (auto && value : sequence1 | hash_adaptor)
ibf.emplace(value, seqan3::bin_index{0u});
// Insert all 5-mers of sequence2 into bin 4
for (auto && value : sequence2 | hash_adaptor)
ibf.emplace(value, seqan3::bin_index{4u});
// Insert all 5-mers of sequence3 into bin 7
for (auto && value : sequence3 | hash_adaptor)
ibf.emplace(value, seqan3::bin_index{7u});
auto agent = ibf.counting_agent();
// Count all 5-mers of sequence1 for all bins
seqan3::debug_stream << agent.bulk_count(sequence1 | hash_adaptor) << '\n'; // [11,0,0,0,9,0,0,0]
// Clear bin 0
ibf.clear(seqan3::bin_index{0u});
// After clearing, no 5-mers are found in bin 0
seqan3::debug_stream << agent.bulk_count(sequence1 | hash_adaptor) << '\n'; // [0,0,0,0,9,0,0,0]
// Search for specific values
seqan3::debug_stream << agent.bulk_count(std::views::iota(0u, 1024u)) << '\n'; // [0,0,0,0,7,0,0,10]
// Clear bin 4 and 7
// After clearing, nothing is found
seqan3::debug_stream << agent.bulk_count(std::views::iota(0u, 1024u)) << '\n'; // [0,0,0,0,0,0,0,0]
}

◆ counting_agent()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
template<typename value_t = uint16_t>
counting_agent_type< value_t > seqan3::interleaved_bloom_filter< data_layout_mode_ >::counting_agent ( ) const
inline

Returns a seqan3::interleaved_bloom_filter::counting_agent_type to be used for counting.

Attention
Calling seqan3::interleaved_bloom_filter::increase_bin_number_to invalidates all seqan3::interleaved_bloom_filter::counting_agent_types constructed for this Interleaved Bloom Filter.

Example

int main()
{
// Construct an Interleaved Bloom Filter to be used with the counting_agent.
// The counting_agent can now be constructed by calling `counting_agent` on the Interleaved Bloom Filter.
auto agent = ibf.counting_agent();
// Calling `increase_bin_number_to` invalidates the agent.
ibf.increase_bin_number_to(seqan3::bin_count{60u});
// So make sure to construct a new counting_agent.
agent = ibf.counting_agent();
}
See also
seqan3::interleaved_bloom_filter::counting_agent_type::bulk_count

◆ emplace()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
void seqan3::interleaved_bloom_filter< data_layout_mode_ >::emplace ( size_t const  value,
bin_index const  bin 
)
inlinenoexcept

Inserts a value into a specific bin.

Parameters
[in]valueThe raw numeric value to process.
[in]binThe bin index to insert into.
Attention
This function is only available for uncompressed Interleaved Bloom Filters.

Example

int main()
{
// Insert the values `126`, `712` and `237` into bins `0`, `3` and `9` of the Interleaved Bloom Filter.
ibf.emplace(126, seqan3::bin_index{0u});
ibf.emplace(712, seqan3::bin_index{3u});
ibf.emplace(237, seqan3::bin_index{9u});
}

◆ hash_function_count()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
size_t seqan3::interleaved_bloom_filter< data_layout_mode_ >::hash_function_count ( ) const
inlinenoexcept

Returns the number of hash functions used in the Interleaved Bloom Filter.

Returns
The number of hash functions.

◆ increase_bin_number_to()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
void seqan3::interleaved_bloom_filter< data_layout_mode_ >::increase_bin_number_to ( bin_count const  new_bins_)
inline

Increases the number of bins stored in the Interleaved Bloom Filter.

Parameters
[in]new_bins_The new number of bins.
Exceptions
std::invalid_argumentIf passed number of bins is smaller than current number of bins.
Attention
This function is only available for uncompressed Interleaved Bloom Filters.
The new number of bins must be greater or equal to the current number of bins.
This function invalidates all seqan3::interleaved_bloom_filter::membership_agent constructed for this Interleaved Bloom Filter.

The resulting seqan3::interleaved_bloom_filter has an increased size proportional to the increase in the bin_words (the number of 64-bit words needed to represent bins many bins), e.g. resizing a seqan3::interleaved_bloom_filter with 40 bins to 73 bins also increases the bin_words from 1 to 2 and hence the new seqan3::interleaved_bloom_filter will be twice the size. This increase in size is necessary to avoid invalidating all computed hash functions. If you want to add more bins while keeping the size constant, you need to rebuild the seqan3::interleaved_bloom_filter.

Example

int main()
{
ibf.emplace(126, seqan3::bin_index{0u});
ibf.emplace(712, seqan3::bin_index{3u});
ibf.emplace(237, seqan3::bin_index{9u});
ibf.increase_bin_number_to(seqan3::bin_count{18u});
// Be sure to get the agent after `increase_bin_number_to` as it invalidates all agents!
auto agent = ibf.membership_agent();
// The content of the bins which were already present before the resize does not change
seqan3::debug_stream << agent.bulk_contains(126) << '\n'; // prints [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
seqan3::debug_stream << agent.bulk_contains(712) << '\n'; // prints [0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
seqan3::debug_stream << agent.bulk_contains(237) << '\n'; // prints [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]
}

◆ membership_agent()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
membership_agent seqan3::interleaved_bloom_filter< data_layout_mode_ >::membership_agent ( ) const
inline

Returns a seqan3::interleaved_bloom_filter::membership_agent to be used for lookup.

Attention
Calling seqan3::interleaved_bloom_filter::increase_bin_number_to invalidates all seqan3::interleaved_bloom_filter::membership_agents constructed for this Interleaved Bloom Filter.

Example

int main()
{
// Construct an Interleaved Bloom Filter to be used with the membership_agent.
// The membership_agent can now be constructed by calling `membership_agent` on the Interleaved Bloom Filter.
auto agent = ibf.membership_agent();
// Calling `increase_bin_number_to` invalidates the agent.
ibf.increase_bin_number_to(seqan3::bin_count{60u});
// So make sure to construct a new membership_agent.
agent = ibf.membership_agent();
}
See also
seqan3::interleaved_bloom_filter::membership_agent::bulk_contains

◆ raw_data() [1/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
constexpr data_type const & seqan3::interleaved_bloom_filter< data_layout_mode_ >::raw_data ( ) const
inlineconstexprnoexcept

Provides direct, unsafe access to the underlying data structure.

Returns
A reference to an SDSL bitvector.

This entity is not part of the SeqAn API. Do not rely on it in your applications. The exact representation of the data is implementation defined.

◆ raw_data() [2/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
constexpr data_type & seqan3::interleaved_bloom_filter< data_layout_mode_ >::raw_data ( )
inlineconstexprnoexcept

Provides direct, unsafe access to the underlying data structure.

Returns
A reference to an SDSL bitvector.

This entity is not part of the SeqAn API. Do not rely on it in your applications. The exact representation of the data is implementation defined.

Friends And Related Function Documentation

◆ operator!=

template<data_layout data_layout_mode_ = data_layout::uncompressed>
bool operator!= ( interleaved_bloom_filter< data_layout_mode_ > const &  lhs,
interleaved_bloom_filter< data_layout_mode_ > const &  rhs 
)
friend

Test for inequality.

Parameters
[in]lhsA seqan3::interleaved_bloom_filter.
[in]rhsseqan3::interleaved_bloom_filter to compare to.
Returns
true if unequal, false otherwise.

◆ operator==

template<data_layout data_layout_mode_ = data_layout::uncompressed>
bool operator== ( interleaved_bloom_filter< data_layout_mode_ > const &  lhs,
interleaved_bloom_filter< data_layout_mode_ > const &  rhs 
)
friend

Test for equality.

Parameters
[in]lhsA seqan3::interleaved_bloom_filter.
[in]rhsseqan3::interleaved_bloom_filter to compare to.
Returns
true if equal, false otherwise.

The documentation for this class was generated from the following file: