HIBF 1.0.0-rc.1
All Classes Namespaces Files Functions Variables Typedefs Friends Macros Modules Pages Concepts
seqan::hibf::config Struct Reference

The configuration used to build an (H)IBF. More...

#include <hibf/config.hpp>

Public Member Functions

constexpr bool operator== (config const &other) const
 Two configs are equal if all options, except seqan::hibf::config::input_fn, are equal.
 
void read_from (std::istream &stream)
 
void validate_and_set_defaults ()
 Checks several variables of seqan::hibf::config and sets default values if necessary.
 
void write_to (std::ostream &stream) const
 

Public Attributes

General Configuration
std::function< void(size_t const, insert_iterator &&) input_fn ) {}
 A function for how to hash your input [REQUIRED].
 
size_t number_of_user_bins {}
 The number of user bins.
 
size_t number_of_hash_functions {2}
 The number of hash functions for the underlying Bloom Filters.
 
double maximum_fpr {0.05}
 The desired maximum false positive rate of the underlying Bloom Filters. [RECOMMENDED_TO_ADAPT].
 
double relaxed_fpr {0.3}
 Allow a higher FPR in non-accuracy-critical parts of the HIBF structure.
 
size_t threads {1u}
 The number of threads to use during construction. [RECOMMENDED_TO_ADAPT].
 
HIBF Layout Configuration
uint8_t sketch_bits {12}
 The number of bits for HyperLogLog sketches.
 
size_t tmax {}
 The maximum number of technical bins of each IBF in the HIBF.
 
double empty_bin_fraction {}
 The percentage of empty bins in the layout.
 
double alpha {1.2}
 A scaling factor to influence the amount of merged bins produced by the layout algorithm.
 
double max_rearrangement_ratio {0.5}
 The maximal cardinality ratio in the clustering intervals of the layout rearrangement algorithm.
 
bool disable_estimate_union {false}
 Whether to disable union estimate of user bins to improve the layout.
 
bool disable_rearrangement {false}
 Whether to disable rearranging user bins based on their content similarity.
 

Friends

class cereal::access
 

Detailed Description

The configuration used to build an (H)IBF.

The (H)IBF config

The configuration can be used to construct an HIBF or IBF.

When constructing an IBF, only the members General Configuration are considered, layout parameters from the section HIBF Layout Configuration are ignored.

Here is the list of all configs options:

Type Option Name Default Note
General seqan::hibf::config::input_fn - [REQUIRED]
General seqan::hibf::config::number_of_user_bins - [REQUIRED]
General seqan::hibf::config::number_of_hash_functions 2
General seqan::hibf::config::maximum_fpr 0.05 [RECOMMENDED_TO_ADAPT]
General seqan::hibf::config::relaxed_fpr 0.3
General seqan::hibf::config::threads 1 [RECOMMENDED_TO_ADAPT]
Layout seqan::hibf::config::sketch_bits 12
Layout seqan::hibf::config::tmax 0 0 indicates unset
Layout seqan::hibf::config::empty_bin_fraction 0.0 Dynamic Layout
Layout seqan::hibf::config::max_rearrangement_ratio 0.5
Layout seqan::hibf::config::alpha 1.2
Layout seqan::hibf::config::disable_estimate_union false
Layout seqan::hibf::config::disable_rearrangement false

As a copy and paste source, here are all config options with their defaults:

// SPDX-FileCopyrightText: 2006-2025, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <cstddef> // for size_t
#include <functional> // for function
#include <vector> // for vector
#include <hibf/config.hpp> // for config, insert_iterator
#include <hibf/hierarchical_interleaved_bloom_filter.hpp> // for hierarchical_interleaved_bloom_filter
int main()
{
// 2 user bins:
std::vector<std::vector<size_t>> hashes{{1u, 2u, 3u, 4u, 5u, 6u, 7u, 8u, 9u, 10u}, {1u, 2u, 3u, 4u, 5u}};
// input just passes hashes:
auto my_input = [&](size_t const user_bin_id, seqan::hibf::insert_iterator it)
{
for (auto const hash : hashes[user_bin_id])
it = hash;
};
seqan::hibf::config config{.input_fn = my_input, // required
.number_of_user_bins = 2, // required
.number_of_hash_functions = 2,
.maximum_fpr = 0.05, // recommended to adapt
.threads = 1, // recommended to adapt
.sketch_bits = 12,
.tmax = 0, // triggers default copmutation
.alpha = 1.2,
.max_rearrangement_ratio = 0.5,
.disable_estimate_union = false,
.disable_rearrangement = false};
// construct the HIBF
}
The Hierarchical Interleaved Bloom Filter (HIBF) - Fast answers to set-membership queries for multipl...
Definition hierarchical_interleaved_bloom_filter.hpp:140
Definition insert_iterator.hpp:25
The configuration used to build an (H)IBF.
Definition config.hpp:75
std::function< void(size_t const, insert_iterator &&) input_fn)
A function for how to hash your input [REQUIRED].
Definition config.hpp:110

The HIBF takes too long to construct?

Check the documentation of the following options that influence the runtime:

The HIBF memory consumption is too high?

Check the documentation of the following options that influence the memory consumption:

Validation

Checks several variables of seqan::hibf::config and sets default values if necessary.

See seqan::hibf::config::validate_and_set_defaults

Member Function Documentation

◆ validate_and_set_defaults()

void seqan::hibf::config::validate_and_set_defaults ( )

Checks several variables of seqan::hibf::config and sets default values if necessary.

Required options:

Constrains:

Modifications:

Member Data Documentation

◆ input_fn

std::function<void(size_t const, insert_iterator &&) seqan::hibf::config::input_fn) {}

A function for how to hash your input [REQUIRED].

Attention
This option is required!

To efficiently construct the hierarchical structure of the HIBF, the input needs to be given as a function object. The IBF construction can be done in another way (see seqan::hibf::interleaved_bloom_filter) but has the config construction for consistency.

// SPDX-FileCopyrightText: 2006-2025, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <cstddef> // for size_t
#include <functional> // for function
#include <hibf/config.hpp> // for config, insert_iterator
int main()
{
auto my_input = [&](size_t const /* user_bin_id */, seqan::hibf::insert_iterator it) // fixed parameters!
{
it = 42; // assign something that is convertible to uint64_t
};
}

It is important that the function object, a lambda in the example, has exactly two parameters: (1) A size_t const which reflects the user bin ID to add hashes/values to in the (H)IBF. (2) A seqan::hibf::insert_iterator that inserts hashes via it = hash where hash must be convertible to uint64_t.

The above example would just insert a single hash (42) into each user bin.

Let's look at two more realistic examples.

Inserting from a vector of values:

// SPDX-FileCopyrightText: 2006-2025, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <cstddef> // for size_t
#include <cstdint> // for uint64_t
#include <functional> // for function
#include <vector> // for vector
#include <hibf/config.hpp> // for insert_iterator, config
struct dna
{
int rank{0}; // 0 = A, C = 1, G = 2, T = 3
};
int main()
{
// user_bins stores one dna sequence per user bin
// You can look at https://docs.seqan.de/seqan3/3-master-user/group__alphabet__nucleotide.html for dna alphabets
std::vector<std::vector<dna>> user_bins{{{0}, {0}, {0} /*AAA*/}, {{1}, {1}, {1} /*CCC*/}};
auto my_input = [&](size_t const user_bin_id, seqan::hibf::insert_iterator it)
{
auto const & seq = user_bins[user_bin_id];
for (size_t i = 0; i < seq.size() - 1; ++i)
{
// compute 2-mer hashes
uint64_t hash = 4 * seq[i].rank + seq[i + 1].rank;
// You can also look at the seqan3::kmer_hash view for hashing
// https://docs.seqan.de/seqan3/3-master-user/group__search__views.html#ga6e598d6a021868f704d39df73252974f
it = hash;
}
};
seqan::hibf::config config{.input_fn = my_input};
}

Inserting from a file:

#include <hibf/config.hpp> // for insert_iterator, config
int main()
{
std::vector<std::filesystem::path> filenames{file1.path(), file2.path()};
auto my_input = [&](size_t const user_bin_id, seqan::hibf::insert_iterator it)
{
std::ifstream infile{filenames[user_bin_id]};
std::getline(infile, line); // assume there is a sequence in the first line
// Look at https://docs.seqan.de/seqan3/3-master-user/group__io__sequence__file.html for e.g. FASTA File I/O
for (size_t i = 0; i < line.size() - 1; ++i)
{
// compute 2-mer hashes based on the character value
uint64_t hash = 4 * line[i] + line[i + 1];
// You can also look at the seqan3::kmer_hash view for hashing
// https://docs.seqan.de/seqan3/3-master-user/group__search__views.html#ga6e598d6a021868f704d39df73252974f
it = hash;
}
};
seqan::hibf::config config{.input_fn = my_input};
}
Definition insert_iterator.hpp:25
T getline(T... args)
T size(T... args)
The configuration used to build an (H)IBF.
Definition config.hpp:75
std::function< void(size_t const, insert_iterator &&) input_fn)
A function for how to hash your input [REQUIRED].
Definition config.hpp:110
Attention
Since the number of user bins cannot be inferred from the function object, you need to pass them by a separate config member: seqan::hibf::config::number_of_user_bins.

◆ number_of_user_bins

size_t seqan::hibf::config::number_of_user_bins {}

The number of user bins.

Since the data to construct the (H)IBF is given by a function object seqan::hibf::config::input_fn, the number of user bins to consider must be given via this option.

Value must be neither 0 nor std::numeric_limits<uint64_t>::max().

// SPDX-FileCopyrightText: 2006-2025, Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025, Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <cstddef> // for size_t
#include <functional> // for function
#include <hibf/config.hpp> // for config, insert_iterator
int main()
{
auto my_input = [&](size_t const /* user_bin_id */, seqan::hibf::insert_iterator it) // fixed parameters!
{
it = 42; // assign something that is convertible to uint64_t
};
seqan::hibf::config config{.input_fn = my_input, .number_of_user_bins = 12};
}

In this example, 12 user bins would be inserted into the (H)IBF, each only storing the hash 42.

◆ number_of_hash_functions

size_t seqan::hibf::config::number_of_hash_functions {2}

The number of hash functions for the underlying Bloom Filters.

The (H)IBF is based on the Bloom Filter data structure which requires a set of hash functions. The number of hash functions used influences the speed and space but the optimal number of hash functions is data dependent (see Bloom Filter Calculator).

Based on our experiments, we recommend a value of 2 (default).

Be sure to experiment with this option with your data before changing it.

◆ maximum_fpr

double seqan::hibf::config::maximum_fpr {0.05}

The desired maximum false positive rate of the underlying Bloom Filters. [RECOMMENDED_TO_ADAPT].

We ensure that when querying a single hash value in the (H)IBF, the probability of getting a false positive answer will not exceed the value set for seqan::hibf::config::maximum_fpr. The internal Bloom Filters will be configured accordingly. Individual Bloom Filters might have a different but always lower false positive rate (FPR).

Value must be in range (0.0,1.0). Recommendation: default value (0.05)

The FPR influences the memory consumption of the (H)IBF:

  • A lower FPR limits the number of false-positive results, but increases the index size.
  • A higher FPR can help to reduce memory consumption in cases where false-positive answers have little effect.
See also
Bloom Filter Calculator.

◆ relaxed_fpr

double seqan::hibf::config::relaxed_fpr {0.3}

Allow a higher FPR in non-accuracy-critical parts of the HIBF structure.

Some parts in the hierarchical structure are not critical to ensure the seqan::hibf::config::maximum_fpr. These can be allowed to have a higher FPR to reduce the overall space consumption, while only minimally affecting the runtime performance.

Value must be in range (0.0,1.0). Value must be equal to or larger than seqan::hibf::config::maximum_fpr. Recommendation: default value (0.3)

Technical details

Merged bins in an HIBF layout will always be followed by one or more lower-level IBFs that will have split bins or single bins (split = 1) to recover the original user bins. Thus, the FPR of merged bins does not determine the seqan::hibf::config::maximum_fpr, but is independent. Choosing a higher FPR for merged bins can lower the memory requirement but increases the runtime. Experiments show that the decrease in memory is significant, while the runtime suffers only slightly. The accuracy of the results is not affected by this parameter.

Note: For each IBF there is a limit to how high the FPR of merged bins can be. Specifically, the FPR for merged bins can never decrease the IBF size more than what is needed to ensure the seqan::hibf::config::maximum_fpr for split bins. This means that, at some point, choosing even higher values for this parameter will have no effect anymore.

See also
Bloom Filter Calculator.

◆ threads

size_t seqan::hibf::config::threads {1u}

The number of threads to use during construction. [RECOMMENDED_TO_ADAPT].

Using more threads increases the memory consumption during construction because the threads hold local data. It can be beneficial to try a lower number of threads if you have limited RAM but many threads.

Currently, the following parts of the HIBF construction process are parallelized:

  • Sketch computation during layouting. Each thread processes a single user bin, keeping the total amount of hashes in memory before computing the sketch.
  • The hierarchical, bottom-up build process. Lower level IBFs of the HIBF that are independent of each other may be built in parallel. The threads hold local hash sets needed for higher level merged bins.

◆ sketch_bits

uint8_t seqan::hibf::config::sketch_bits {12}

The number of bits for HyperLogLog sketches.

The HIBF layout algorithm estimates the user bin sizes by computing HyperLogLog sketches.

A key parameter of HyperLogLog sketches is the number of bits of a hash value to consider for sketching. Fewer bits accelerate the sketching process but decrease the accuracy.

Value must be in range [5,32]. Recommendation: default value (12).

Be sure to experiment with this option with your data before changing it.

◆ tmax

size_t seqan::hibf::config::tmax {}

The maximum number of technical bins of each IBF in the HIBF.

One of the key methods of the HIBF for increasing query speed is limiting the number of (technical) bins of the IBF data structure used within the HIBF.

Choosing a good tmax is not trivial.

The smaller tmax, the more levels the layout needs to represent the data. This results in a higher space consumption of the index. While querying each individual level is cheap, querying many levels might also lead to an increased runtime.

A good tmax is usually the square root of the number of user bins/samples rounded to the next multiple of 64. Note that your tmax will always be rounded to the next multiple of 64.

Value must be in range [0,max(size_t)]. Recommendation: default value (the default will compute ≈sqrt(samples)).

◆ empty_bin_fraction

double seqan::hibf::config::empty_bin_fraction {}

The percentage of empty bins in the layout.

Note
Do not set this option unless you are developing an application that requires empty technical bins.

Certain applications, e.g., dynamic indices, require empty technical bins in the layout. This option allows you to specify the fraction of tmax that should be empty bins. The empty bins will be present in each IBF of the generated layout.

For example, if tmax is 64 and empty_bin_fraction is 0.10, then 6 bins will be empty, i.e., not designated to contain any data. The resulting layout will be very similar to a layout with tmax set to 58 and no empty bins.

Value must be in range [0.0,1.0). Recommendation: default value (0.0). This option is not recommended for general use.

◆ alpha

double seqan::hibf::config::alpha {1.2}

A scaling factor to influence the amount of merged bins produced by the layout algorithm.

The layout algorithm optimizes the space consumption of the resulting HIBF, but currently has no means of optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins influences the query time because a merged bin always triggers another search on a lower level. To influence this ratio, alpha can be used.

The higher alpha, the less merged bins are chosen in the layout. This improves query times but leads to a bigger index.

Value must be in range [0.0,max(double)]. Recommendation: default value (1.2). disable_estimate_union Be sure to experiment with this option with your data before changing it.

◆ max_rearrangement_ratio

double seqan::hibf::config::max_rearrangement_ratio {0.5}

The maximal cardinality ratio in the clustering intervals of the layout rearrangement algorithm.

This option can influence the layout rearrangement algorithm. The layout rearrangement algorithm improves the layout by rearranging user bins with similar content into close proximity s.t. their shared hash values reduces the overall index size. The algorithm only rearranges the order of user bins in fixed intervals. The higher –max-rearrangement-ratio, the larger the intervals. This potentially improves the layout, but increases the runtime of the layout algorithm.

Value must be in range [0.0,1.0]. Recommendation: default value (0.5).

◆ disable_estimate_union

bool seqan::hibf::config::disable_estimate_union {false}

Whether to disable union estimate of user bins to improve the layout.

The layout algorithm usually estimates the union size of user bins based on their HyperLogLog sketches in order to precisely estimate the resulting HIBF memory consumption. It improves the layout quality but is computationally expensive.

If you are constructing an HIBF with more than 100,000 samples, we recommend considering disabling this feature for faster HIBF construction time.

◆ disable_rearrangement

bool seqan::hibf::config::disable_rearrangement {false}

Whether to disable rearranging user bins based on their content similarity.

The layout rearrangement algorithm improves the layout by rearranging user bins with similar content into close proximity s.t. their shared hash values reduce the overall index size. It improves the layout quality but is computationally expensive.

If you are constructing an HIBF with more than 100,000 samples, we recommend considering disabling this feature for faster HIBF construction time.


The documentation for this struct was generated from the following file: