SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
Loading...
Searching...
No Matches
seqan3::bloom_filter< data_layout_mode_ > Class Template Reference

The Bloom Filter. A data structure that efficiently answers set-membership queries. More...

#include <seqan3/utility/bloom_filter/bloom_filter.hpp>

+ Inheritance diagram for seqan3::bloom_filter< data_layout_mode_ >:

Public Member Functions

Constructors, destructor and assignment
 bloom_filter ()=default
 Defaulted.
 
 bloom_filter (bloom_filter const &)=default
 Defaulted.
 
bloom_filteroperator= (bloom_filter const &)=default
 Defaulted.
 
 bloom_filter (bloom_filter &&)=default
 Defaulted.
 
bloom_filteroperator= (bloom_filter &&)=default
 Defaulted.
 
 ~bloom_filter ()=default
 Defaulted.
 
 bloom_filter (seqan3::bin_size size, seqan3::hash_function_count funs=seqan3::hash_function_count{2u})
 Construct an uncompressed Bloom Filter.
 
 bloom_filter (bloom_filter< data_layout::uncompressed > const &bf)
 Construct a compressed Bloom Filter.
 
Modifiers
void emplace (size_t const value) noexcept
 Inserts a value into the Bloom Filter.
 
void reset () noexcept
 Remove all values from the Bloom Filter by setting all bits to 0.
 
Lookup
bool contains (size_t const value) const noexcept
 Check whether a value is present in the Bloom Filter.
 
Counting
template<std::ranges::range value_range_t>
size_t count (value_range_t &&values) const noexcept
 Counts the occurrences for all values in a range.
 
Capacity
size_t hash_function_count () const noexcept
 Returns the number of hash functions used in the Bloom Filter.
 
size_t bit_size () const noexcept
 Returns the size of the underlying bitvector.
 
Access
constexpr data_typeraw_data () noexcept
 Provides direct, unsafe access to the underlying data structure.
 
constexpr data_type const & raw_data () const noexcept
 Provides direct, unsafe access to the underlying data structure.
 

Static Public Attributes

static constexpr data_layout data_layout_mode = data_layout_mode_
 Indicates whether the Bloom Filter is compressed.
 

Friends

Comparison operators
bool operator== (bloom_filter const &lhs, bloom_filter const &rhs) noexcept
 Test for equality.
 
bool operator!= (bloom_filter const &lhs, bloom_filter const &rhs) noexcept
 Test for inequality.
 

Detailed Description

template<data_layout data_layout_mode_ = data_layout::uncompressed>
class seqan3::bloom_filter< data_layout_mode_ >

The Bloom Filter. A data structure that efficiently answers set-membership queries.

Template Parameters
data_layout_mode_Indicates whether the underlying data type is compressed. See seqan3::data_layout.

Bloom Filter (BF)

The Bloom Filter is a probabilistic data structure. A Bloom Filter can be thought of as a bitvector of length n and h hash functions and is used to determine set membership. To insert data, the data is hashed by the h hash functions (returning values in [0, n)) and the corresponding h positions in the bitvector are set to 1. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h hash functions and the corresponding positions are checked. If all h positions contain a 1, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true if the query was inserted, i.e. there may be false positives, but no false negatives.

Querying

To query the Bloom Filter for a value, call seqan3::bloom_filter::contains which returns true if the k-mer hash is present in the index, and false if the hash is not present. The value is a hash value of the k-mer to check membership for.

To query the Bloom Filter for a range of values, call seqan3::bloom_filter::count which returns the number of k-mer hits in the Bloom Filter for the given range of values.

Please note the results are based on a heuristic data structure and with a certain probability (depending on the selected size of the bit vector) you may receive a false positive result.

Differences to the Interleaved Bloom Filter (IBF)

While the Bloom Filter provides a single linear bit vector to represent the underlying data, the Interleaved Bloom Filter provides a data structure that combines a set of Bloom Filters to enable efficient queries to multiple fractions of the data. In doing so, the Interleaved Bloom Filter can not only answer whether a hash value is present in the data, but also provides information in which fraction of the data it occurs. The design of the Interleaved Bloom Filter is particularly useful when the underlying data is systematically structured; for example, if each fraction of the data represents a specific set of organisms. Important applications of the Interleaved Bloom Filter include taxonomic classification of sequencing data, or prefiltering of specific fractions of an input data set to enable more efficient in-depth analysis. The Bloom Filter, on the other hand, is useful if the database does not contain any underlying structure, or it is not relevant for the analysis. A typical application is the removal of host sequences or different types of contamination where it is usually not of interest which part of the database was matched. In such cases, the Bloom Filter provides a lighter data structure and a more simple interface (for example, the use of agents for determining and counting membership is not necessary in this case).

Compression

The Bloom Filter can be compressed by passing seqan3::data_layout::compressed as template argument. The compressed seqan3::bloom_filter<seqan3::data_layout::compressed> can only be constructed from a seqan3::bloom_filter, in which case the underlying bitvector is compressed. The compressed Bloom Filter is immutable, i.e. only querying is supported.

Thread safety

The Bloom Filter promises the basic thread-safety by the STL that all calls to const member functions are safe from multiple threads (as long as no thread calls a non-const member function at the same time).

See also
seqan3::interleaved_bloom_filter

Constructor & Destructor Documentation

◆ bloom_filter() [1/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
seqan3::bloom_filter< data_layout_mode_ >::bloom_filter ( seqan3::bin_size  size,
seqan3::hash_function_count  funs = seqan3::hash_function_count{2u} 
)
inline

Construct an uncompressed Bloom Filter.

Parameters
sizeThe bit vector size (in bits).
funsThe number of hash functions. Default 2. At least 1, at most 5.
Attention
This constructor can only be used to construct uncompressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
// Construct a Bloom Filter that contains 8192 bits and 3 hash functions.
// Construct a Bloom Filter that contains 256 KiBits and the default of 2
// hash functions.
}
Provides seqan3::bloom_filter.
The Bloom Filter. A data structure that efficiently answers set-membership queries.
Definition bloom_filter.hpp:82
A strong type that represents the number of bits for each bin in the seqan3::interleaved_bloom_filter...
Definition interleaved_bloom_filter.hpp:40
A strong type that represents the number of hash functions for the seqan3::interleaved_bloom_filter.
Definition interleaved_bloom_filter.hpp:47

◆ bloom_filter() [2/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
seqan3::bloom_filter< data_layout_mode_ >::bloom_filter ( bloom_filter< data_layout::uncompressed > const &  bf)
inline

Construct a compressed Bloom Filter.

Parameters
[in]bfThe uncompressed seqan3::bloom_filter.
Attention
This constructor can only be used to construct compressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
// Construct an uncompressed Bloom Filter.
// Fill `bf` with data.
// Construct an immutable, compressed Bloom Filter.
}

Member Function Documentation

◆ bit_size()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
size_t seqan3::bloom_filter< data_layout_mode_ >::bit_size ( ) const
inlinenoexcept

Returns the size of the underlying bitvector.

Returns
The size in bits of the underlying bitvector.

◆ contains()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
bool seqan3::bloom_filter< data_layout_mode_ >::contains ( size_t const  value) const
inlinenoexcept

Check whether a value is present in the Bloom Filter.

Parameters
[in]valueThe raw numeric value to process.
Attention
This function is only available for uncompressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
bf.emplace(126);
bf.emplace(712);
bf.emplace(237);
// Query the Bloom Filter.
// A return of `false` guarantees the query not being present in the Bloom Filter.
// A return of `true` indicates the (probable) presence of the query in the Bloom Filter.
// Note that there may be false positive results!
bool result = bf.contains(712);
seqan3::debug_stream << result << '\n'; // prints 1
}
Provides seqan3::debug_stream and related types.
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition debug_stream.hpp:37

◆ count()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
template<std::ranges::range value_range_t>
size_t seqan3::bloom_filter< data_layout_mode_ >::count ( value_range_t &&  values) const
inlinenoexcept

Counts the occurrences for all values in a range.

Template Parameters
value_range_tThe type of the range of values. Must model std::ranges::input_range. The reference type must model std::unsigned_integral.
Parameters
[in]valuesThe range of values to process.

Example

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
using namespace seqan3::literals;
auto const sequence1 = "ACTGACTGACTGATC"_dna4;
auto const sequence2 = "GTGACTGACTGACTCG"_dna4;
auto const sequence3 = "AAAAAAACGATCGACA"_dna4;
// Insert all 5-mers of sequence1
for (auto && value : sequence1 | kmers)
bf.emplace(value);
// Insert all 5-mers of sequence3
for (auto && value : sequence3 | kmers)
bf.emplace(value);
// Count all 5-mers of sequence2
seqan3::debug_stream << bf.count(sequence2 | kmers) << '\n'; // 9
}
void emplace(size_t const value) noexcept
Inserts a value into the Bloom Filter.
Definition bloom_filter.hpp:204
Provides seqan3::dna4, container aliases and string literals.
constexpr auto kmer_hash
Computes hash values for each position of a range via a given shape.
Definition kmer_hash.hpp:766
Provides seqan3::views::kmer_hash.
The SeqAn namespace for literals.
A strong type of underlying type uint8_t that represents the ungapped shape size.
Definition shape.hpp:22

Thread safety

Concurrent invocations of this function are thread safe.

◆ emplace()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
void seqan3::bloom_filter< data_layout_mode_ >::emplace ( size_t const  value)
inlinenoexcept

Inserts a value into the Bloom Filter.

Parameters
[in]valueThe raw numeric value to process.
Attention
This function is only available for uncompressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
// Insert the values `126`, `712` and `237` into the Bloom Filter.
bf.emplace(126);
bf.emplace(712);
bf.emplace(237);
}

◆ hash_function_count()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
size_t seqan3::bloom_filter< data_layout_mode_ >::hash_function_count ( ) const
inlinenoexcept

Returns the number of hash functions used in the Bloom Filter.

Returns
The number of hash functions.

◆ raw_data() [1/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
constexpr data_type const & seqan3::bloom_filter< data_layout_mode_ >::raw_data ( ) const
inlineconstexprnoexcept

Provides direct, unsafe access to the underlying data structure.

Returns
A reference to an SDSL bitvector.

This entity is not part of the SeqAn API. Do not rely on it in your applications. The exact representation of the data is implementation defined.

◆ raw_data() [2/2]

template<data_layout data_layout_mode_ = data_layout::uncompressed>
constexpr data_type & seqan3::bloom_filter< data_layout_mode_ >::raw_data ( )
inlineconstexprnoexcept

Provides direct, unsafe access to the underlying data structure.

Returns
A reference to an SDSL bitvector.

This entity is not part of the SeqAn API. Do not rely on it in your applications. The exact representation of the data is implementation defined.

◆ reset()

template<data_layout data_layout_mode_ = data_layout::uncompressed>
void seqan3::bloom_filter< data_layout_mode_ >::reset ( )
inlinenoexcept

Remove all values from the Bloom Filter by setting all bits to 0.

Attention
This function is only available for uncompressed Bloom Filters.

While all values are removed from the vector, the size of the Bloom Filter is not changed.

Example

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
int main()
{
using namespace seqan3::literals;
auto const sequence1 = "ACTGACTGACTGATC"_dna4;
auto const sequence2 = "GTGACTGACTGACTCG"_dna4;
auto const sequence3 = "AAAAAAACGATCGACA"_dna4;
// Insert all 5-mers of sequence1
for (auto && value : sequence1 | kmers)
bf.emplace(value);
// Insert all 5-mers of sequence3
for (auto && value : sequence3 | kmers)
bf.emplace(value);
// Count all 5-mers of sequence2
seqan3::debug_stream << bf.count(sequence2 | kmers) << '\n'; // 9
// Reset the Bloom Filter
bf.reset();
// After reset, no 5-mers are found
seqan3::debug_stream << bf.count(sequence2 | kmers) << '\n'; // 0
}

Friends And Related Symbol Documentation

◆ operator!=

template<data_layout data_layout_mode_ = data_layout::uncompressed>
bool operator!= ( bloom_filter< data_layout_mode_ > const &  lhs,
bloom_filter< data_layout_mode_ > const &  rhs 
)
friend

Test for inequality.

Parameters
[in]lhsA seqan3::bloom_filter.
[in]rhsseqan3::bloom_filter to compare to.
Returns
true if unequal, false otherwise.

◆ operator==

template<data_layout data_layout_mode_ = data_layout::uncompressed>
bool operator== ( bloom_filter< data_layout_mode_ > const &  lhs,
bloom_filter< data_layout_mode_ > const &  rhs 
)
friend

Test for equality.

Parameters
[in]lhsA seqan3::bloom_filter.
[in]rhsseqan3::bloom_filter to compare to.
Returns
true if equal, false otherwise.

The documentation for this class was generated from the following file:
Hide me