The Bloom Filter. A data structure that efficiently answers set-membership queries. More...

#include <seqan3/utility/bloom_filter/bloom_filter.hpp>

Inheritance diagram for seqan3::bloom_filter< data_layout_mode_ >:

Public Member Functions
Constructors, destructor and assignment
	bloom_filter ()=default
	Defaulted.

	bloom_filter (bloom_filter const &)=default
	Defaulted.

bloom_filter &	operator= (bloom_filter const &)=default
	Defaulted.

	bloom_filter (bloom_filter &&)=default
	Defaulted.

bloom_filter &	operator= (bloom_filter &&)=default
	Defaulted.

	~bloom_filter ()=default
	Defaulted.

	bloom_filter (seqan3::bin_size size, seqan3::hash_function_count funs=seqan3::hash_function_count{2u})
	Construct an uncompressed Bloom Filter.

	bloom_filter (bloom_filter< data_layout::uncompressed > const &bf)
	Construct a compressed Bloom Filter.

Modifiers
void	emplace (size_t const value) noexcept
	Inserts a value into the Bloom Filter.

void	reset () noexcept
	Remove all values from the Bloom Filter by setting all bits to 0.

Lookup
bool	contains (size_t const value) const noexcept
	Check whether a value is present in the Bloom Filter.

Counting
template<std::ranges::range value_range_t>
size_t	count (value_range_t &&values) const noexcept
	Counts the occurrences for all values in a range.

Capacity
size_t	hash_function_count () const noexcept
	Returns the number of hash functions used in the Bloom Filter.

size_t	bit_size () const noexcept
	Returns the size of the underlying bitvector.

Access
constexpr data_type &	raw_data () noexcept
	Provides direct, unsafe access to the underlying data structure.

constexpr data_type const &	raw_data () const noexcept
	Provides direct, unsafe access to the underlying data structure.

Static Public Attributes
static constexpr data_layout	data_layout_mode = data_layout_mode_
	Indicates whether the Bloom Filter is compressed.

Friends
Comparison operators
bool	operator== (bloom_filter const &lhs, bloom_filter const &rhs) noexcept
	Test for equality.

bool	operator!= (bloom_filter const &lhs, bloom_filter const &rhs) noexcept
	Test for inequality.

Detailed Descriptionno-api

template<data_layout data_layout_mode_ = data_layout::uncompressed>
class seqan3::bloom_filter< data_layout_mode_ >

The Bloom Filter. A data structure that efficiently answers set-membership queries.

Template Parameters

data_layout_mode_ Indicates whether the underlying data type is compressed. See seqan3::data_layout.

Bloom Filter (BF)

The Bloom Filter is a probabilistic data structure. A Bloom Filter can be thought of as a bitvector of length n and h hash functions and is used to determine set membership. To insert data, the data is hashed by the h hash functions (returning values in [0, n)) and the corresponding h positions in the bitvector are set to 1. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h hash functions and the corresponding positions are checked. If all h positions contain a 1, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true if the query was inserted, i.e. there may be false positives, but no false negatives.

Querying

To query the Bloom Filter for a value, call seqan3::bloom_filter::contains which returns true if the k-mer hash is present in the index, and false if the hash is not present. The value is a hash value of the k-mer to check membership for.

To query the Bloom Filter for a range of values, call seqan3::bloom_filter::count which returns the number of k-mer hits in the Bloom Filter for the given range of values.

Please note the results are based on a heuristic data structure and with a certain probability (depending on the selected size of the bit vector) you may receive a false positive result.

Differences to the Interleaved Bloom Filter (IBF)

While the Bloom Filter provides a single linear bit vector to represent the underlying data, the Interleaved Bloom Filter provides a data structure that combines a set of Bloom Filters to enable efficient queries to multiple fractions of the data. In doing so, the Interleaved Bloom Filter can not only answer whether a hash value is present in the data, but also provides information in which fraction of the data it occurs. The design of the Interleaved Bloom Filter is particularly useful when the underlying data is systematically structured; for example, if each fraction of the data represents a specific set of organisms. Important applications of the Interleaved Bloom Filter include taxonomic classification of sequencing data, or prefiltering of specific fractions of an input data set to enable more efficient in-depth analysis. The Bloom Filter, on the other hand, is useful if the database does not contain any underlying structure, or it is not relevant for the analysis. A typical application is the removal of host sequences or different types of contamination where it is usually not of interest which part of the database was matched. In such cases, the Bloom Filter provides a lighter data structure and a more simple interface (for example, the use of agents for determining and counting membership is not necessary in this case).

Compression

The Bloom Filter can be compressed by passing seqan3::data_layout::compressed as template argument. The compressed seqan3::bloom_filter<seqan3::data_layout::compressed> can only be constructed from a seqan3::bloom_filter, in which case the underlying bitvector is compressed. The compressed Bloom Filter is immutable, i.e. only querying is supported.

Thread safety

The Bloom Filter promises the basic thread-safety by the STL that all calls to const member functions are safe from multiple threads (as long as no thread calls a non-const member function at the same time).

See also: seqan3::interleaved_bloom_filter

Constructor & Destructor Documentation

◆ bloom_filter() [1/2] [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

seqan3::bloom_filter< data_layout_mode_ >::bloom_filter	(	seqan3::bin_size	size,
		seqan3::hash_function_count	funs = `seqan3::hash_function_count{2u}`
	)

no-apiinline

Construct an uncompressed Bloom Filter.

Parameters

size	The bit vector size (in bits).
funs	The number of hash functions. Default 2. At least 1, at most 5.

Attention: This constructor can only be used to construct uncompressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/utility/bloom_filter/bloom_filter.hpp>
 
int main()
{
    // Construct a Bloom Filter that contains 8192 bits and 3 hash functions.
    seqan3::bloom_filter bf{seqan3::bin_size{8192u}, seqan3::hash_function_count{3}};
 
    // Construct a Bloom Filter that contains 256 KiBits and the default of 2
    // hash functions.
    seqan3::bloom_filter bf2{seqan3::bin_size{1ULL << 20}};
}

◆ bloom_filter() [2/2] [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

seqan3::bloom_filter< data_layout_mode_ >::bloom_filter ( bloom_filter< data_layout::uncompressed > const & bf )

no-apiinline

Construct a compressed Bloom Filter.

Parameters

[in] bf The uncompressed seqan3::bloom_filter.

Attention: This constructor can only be used to construct compressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/utility/bloom_filter/bloom_filter.hpp>
 
int main()
{
    // Construct an uncompressed Bloom Filter.
    seqan3::bloom_filter bf{seqan3::bin_size{128u}, seqan3::hash_function_count{3}};
 
    // Fill `bf` with data.
 
    // Construct an immutable, compressed Bloom Filter.
    seqan3::bloom_filter<seqan3::data_layout::compressed> bf2{bf};
}

Member Function Documentation

◆ bit_size() [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

size_t seqan3::bloom_filter< data_layout_mode_ >::bit_size ( ) const

no-apiinlinenoexcept

Returns the size of the underlying bitvector.

Returns: The size in bits of the underlying bitvector.

◆ contains() [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

bool seqan3::bloom_filter< data_layout_mode_ >::contains ( size_t const value ) const

no-apiinlinenoexcept

Check whether a value is present in the Bloom Filter.

Parameters

[in] value The raw numeric value to process.

Attention: This function is only available for uncompressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/utility/bloom_filter/bloom_filter.hpp>
 
int main()
{
    seqan3::bloom_filter bf{seqan3::bin_size{1024u}};
    bf.emplace(126);
    bf.emplace(712);
    bf.emplace(237);
 
    // Query the Bloom Filter.
    // A return of `false` guarantees the query not being present in the Bloom Filter.
    // A return of `true` indicates the (probable) presence of the query in the Bloom Filter.
    // Note that there may be false positive results!
    bool result = bf.contains(712);
    seqan3::debug_stream << result << '\n'; // prints 1
}

◆ count() [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

template<std::ranges::range value_range_t>

size_t seqan3::bloom_filter< data_layout_mode_ >::count ( value_range_t && values ) const

no-apiinlinenoexcept

Counts the occurrences for all values in a range.

Template Parameters

value_range_t The type of the range of values. Must model std::ranges::input_range. The reference type must model std::unsigned_integral.

Parameters

[in] values The range of values to process.

Example

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/views/kmer_hash.hpp>
#include <seqan3/utility/bloom_filter/bloom_filter.hpp>
 
int main()
{
    using namespace seqan3::literals;
 
    seqan3::bloom_filter bf{seqan3::bin_size{8192u}, seqan3::hash_function_count{2u}};
 
    auto const sequence1 = "ACTGACTGACTGATC"_dna4;
    auto const sequence2 = "GTGACTGACTGACTCG"_dna4;
    auto const sequence3 = "AAAAAAACGATCGACA"_dna4;
    auto kmers = seqan3::views::kmer_hash(seqan3::ungapped{5u});
 
    // Insert all 5-mers of sequence1
    for (auto && value : sequence1 | kmers)
        bf.emplace(value);
 
    // Insert all 5-mers of sequence3
    for (auto && value : sequence3 | kmers)
        bf.emplace(value);
 
    // Count all 5-mers of sequence2
    seqan3::debug_stream << bf.count(sequence2 | kmers) << '\n'; // 9
}

Thread safety

Concurrent invocations of this function are thread safe.

◆ emplace() [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

void seqan3::bloom_filter< data_layout_mode_ >::emplace ( size_t const value )

no-apiinlinenoexcept

Inserts a value into the Bloom Filter.

Parameters

[in] value The raw numeric value to process.

Attention: This function is only available for uncompressed Bloom Filters.

Example

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/utility/bloom_filter/bloom_filter.hpp>
 
int main()
{
    seqan3::bloom_filter bf{seqan3::bin_size{8192u}};
 
    // Insert the values `126`, `712` and `237` into the Bloom Filter.
    bf.emplace(126);
    bf.emplace(712);
    bf.emplace(237);
}

◆ hash_function_count() [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

size_t seqan3::bloom_filter< data_layout_mode_ >::hash_function_count ( ) const

no-apiinlinenoexcept

Returns the number of hash functions used in the Bloom Filter.

Returns: The number of hash functions.

◆ raw_data() [1/2] [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

constexpr data_type const & seqan3::bloom_filter< data_layout_mode_ >::raw_data ( ) const

no-apiinlineconstexprnoexcept

Provides direct, unsafe access to the underlying data structure.

Returns: A reference to an SDSL bitvector.

This entity is not part of the SeqAn API. Do not rely on it in your applications. The exact representation of the data is implementation defined.

◆ raw_data() [2/2] [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

constexpr data_type & seqan3::bloom_filter< data_layout_mode_ >::raw_data ( )

no-apiinlineconstexprnoexcept

Provides direct, unsafe access to the underlying data structure.

Returns: A reference to an SDSL bitvector.

This entity is not part of the SeqAn API. Do not rely on it in your applications. The exact representation of the data is implementation defined.

◆ reset() [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

void seqan3::bloom_filter< data_layout_mode_ >::reset ( )

no-apiinlinenoexcept

Remove all values from the Bloom Filter by setting all bits to 0.

Attention: This function is only available for uncompressed Bloom Filters.

While all values are removed from the vector, the size of the Bloom Filter is not changed.

Example

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/views/kmer_hash.hpp>
#include <seqan3/utility/bloom_filter/bloom_filter.hpp>
 
int main()
{
    using namespace seqan3::literals;
 
    seqan3::bloom_filter bf{seqan3::bin_size{8192u}, seqan3::hash_function_count{2u}};
 
    auto const sequence1 = "ACTGACTGACTGATC"_dna4;
    auto const sequence2 = "GTGACTGACTGACTCG"_dna4;
    auto const sequence3 = "AAAAAAACGATCGACA"_dna4;
    auto kmers = seqan3::views::kmer_hash(seqan3::ungapped{5u});
 
    // Insert all 5-mers of sequence1
    for (auto && value : sequence1 | kmers)
        bf.emplace(value);
 
    // Insert all 5-mers of sequence3
    for (auto && value : sequence3 | kmers)
        bf.emplace(value);
 
    // Count all 5-mers of sequence2
    seqan3::debug_stream << bf.count(sequence2 | kmers) << '\n'; // 9
 
    // Reset the Bloom Filter
    bf.reset();
 
    // After reset, no 5-mers are found
    seqan3::debug_stream << bf.count(sequence2 | kmers) << '\n'; // 0
}

Friends And Related Symbol Documentation

◆ operator!= [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

bool operator!=	(	bloom_filter< data_layout_mode_ > const &	lhs,
		bloom_filter< data_layout_mode_ > const &	rhs
	)

no-apifriend

Test for inequality.

Parameters

[in]	lhs	A `seqan3::bloom_filter`.
[in]	rhs	`seqan3::bloom_filter` to compare to.

Returns: true if unequal, false otherwise.

◆ operator== [no-api]

template<data_layout data_layout_mode_ = data_layout::uncompressed>

bool operator==	(	bloom_filter< data_layout_mode_ > const &	lhs,
		bloom_filter< data_layout_mode_ > const &	rhs
	)

no-apifriend

Test for equality.

Parameters

[in]	lhs	A `seqan3::bloom_filter`.
[in]	rhs	`seqan3::bloom_filter` to compare to.

Returns: true if equal, false otherwise.

The documentation for this class was generated from the following file:

bloom_filter.hpp

Public Member Functions

Static Public Attributes

Friends

Detailed Descriptionno-api

Bloom Filter (BF)

Querying

Differences to the Interleaved Bloom Filter (IBF)

Compression

Thread safety

Constructor & Destructor Documentation

◆ bloom_filter() [1/2] [no-api]

Example

◆ bloom_filter() [2/2] [no-api]

Example

Member Function Documentation

◆ bit_size() [no-api]

◆ contains() [no-api]

Example

◆ count() [no-api]

Example

Thread safety

◆ emplace() [no-api]

Example

◆ hash_function_count() [no-api]

◆ raw_data() [1/2] [no-api]

◆ raw_data() [2/2] [no-api]

◆ reset() [no-api]

Example

Friends And Related Symbol Documentation

◆ operator!= [no-api]

◆ operator== [no-api]