HIBF 1.0.0-rc.1
The configuration used to build an (H)IBF.
#include <hibf/config.hpp>
Public Member Functions | |
constexpr bool | operator== (config const &other) const |
Two configs are equal if all options, except seqan::hibf::config::input_fn, are equal. | |
void | read_from (std::istream &stream) |
void | validate_and_set_defaults () |
Checks several variables of seqan::hibf::config and sets default values if necessary. | |
void | write_to (std::ostream &stream) const |
Public Attributes | |
General Configuration | |
std::function< void(size_t const, insert_iterator &&)> | input_fn {} |
A function for how to hash your input [REQUIRED]. | |
size_t | number_of_user_bins {} |
The number of user bins. | |
size_t | number_of_hash_functions {2} |
The number of hash functions for the underlying Bloom Filters. | |
double | maximum_fpr {0.05} |
The desired maximum false positive rate of the underlying Bloom Filters. [RECOMMENDED_TO_ADAPT]. | |
double | relaxed_fpr {0.3} |
Allow a higher FPR in non-accuracy-critical parts of the HIBF structure. | |
size_t | threads {1u} |
The number of threads to use during construction. [RECOMMENDED_TO_ADAPT]. | |
HIBF Layout Configuration | |
uint8_t | sketch_bits {12} |
The number of bits for HyperLogLog sketches. | |
size_t | tmax {} |
The maximum number of technical bins of each IBF in the HIBF. | |
double | empty_bin_fraction {} |
The percentage of empty bins in the layout. | |
double | alpha {1.2} |
A scaling factor to influence the amount of merged bins produced by the layout algorithm. | |
double | max_rearrangement_ratio {0.5} |
The maximal cardinality ratio in the clustering intervals of the layout rearrangement algorithm. | |
bool | disable_estimate_union {false} |
Whether to disable union estimate of user bins to improve the layout. | |
bool | disable_rearrangement {false} |
Whether to disable rearranging user bins based on their content similarity. | |
Friends | |
class | cereal::access |
The configuration used to build an (H)IBF.
The configuration can be used to construct an HIBF or an IBF.
When constructing an IBF, only the members in the section General Configuration are considered; the layout parameters in the section HIBF Layout Configuration are ignored.
Here is the list of all config options:
Type | Option Name | Default | Note |
---|---|---|---|
General | seqan::hibf::config::input_fn | - | [REQUIRED] |
General | seqan::hibf::config::number_of_user_bins | - | [REQUIRED] |
General | seqan::hibf::config::number_of_hash_functions | 2 | |
General | seqan::hibf::config::maximum_fpr | 0.05 | [RECOMMENDED_TO_ADAPT] |
General | seqan::hibf::config::relaxed_fpr | 0.3 | |
General | seqan::hibf::config::threads | 1 | [RECOMMENDED_TO_ADAPT] |
Layout | seqan::hibf::config::sketch_bits | 12 | |
Layout | seqan::hibf::config::tmax | 0 | 0 indicates unset |
Layout | seqan::hibf::config::empty_bin_fraction | 0.0 | Dynamic Layout |
Layout | seqan::hibf::config::max_rearrangement_ratio | 0.5 | |
Layout | seqan::hibf::config::alpha | 1.2 | |
Layout | seqan::hibf::config::disable_estimate_union | false | |
Layout | seqan::hibf::config::disable_rearrangement | false | |
As a copy and paste source, here are all config options with their defaults:
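A minimal sketch of these defaults follows. So that it compiles without the library, it uses two stand-in types, `toy_insert_iterator` and `toy_config`. Both names are hypothetical; the real types are seqan::hibf::insert_iterator and seqan::hibf::config from `<hibf/config.hpp>`. The member names and default values are taken from the table above.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for seqan::hibf::insert_iterator: collects hashes in a vector.
struct toy_insert_iterator
{
    std::vector<std::uint64_t> * sink;
    toy_insert_iterator & operator=(std::uint64_t const hash)
    {
        sink->push_back(hash);
        return *this;
    }
};

// Hypothetical stand-in that mirrors the documented members and defaults of seqan::hibf::config.
struct toy_config
{
    // General Configuration
    std::function<void(std::size_t const, toy_insert_iterator &&)> input_fn{}; // [REQUIRED]
    std::size_t number_of_user_bins{};                                         // [REQUIRED]
    std::size_t number_of_hash_functions{2};
    double maximum_fpr{0.05};                                                  // [RECOMMENDED_TO_ADAPT]
    double relaxed_fpr{0.3};
    std::size_t threads{1u};                                                   // [RECOMMENDED_TO_ADAPT]

    // HIBF Layout Configuration
    std::uint8_t sketch_bits{12};
    std::size_t tmax{};                                                        // 0 indicates unset
    double empty_bin_fraction{};
    double alpha{1.2};
    double max_rearrangement_ratio{0.5};
    bool disable_estimate_union{false};
    bool disable_rearrangement{false};
};
```

With the real library, the same member names would be set on a seqan::hibf::config object; input_fn and number_of_user_bins have no usable default and must always be provided.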
Check the documentation of the following options that influence the runtime:
Check the documentation of the following options that influence the memory consumption:
void seqan::hibf::config::validate_and_set_defaults () |
Checks several variables of seqan::hibf::config and sets default values if necessary.
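As a rough illustration of such checks, the sketch below tests the value ranges that the individual option descriptions on this page state explicitly. It is not the library's implementation: `toy_config` and `check_constraints` are hypothetical names, only a subset of the members is covered, and throwing std::invalid_argument is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-in with a subset of the documented members of seqan::hibf::config.
struct toy_config
{
    std::size_t number_of_user_bins{};
    double maximum_fpr{0.05};
    double relaxed_fpr{0.3};
    std::uint8_t sketch_bits{12};
    double empty_bin_fraction{};
    double max_rearrangement_ratio{0.5};
};

// Throws on the first violated constraint (exception type is an assumption).
void check_constraints(toy_config const & c)
{
    if (c.number_of_user_bins == 0u || c.number_of_user_bins == std::numeric_limits<std::uint64_t>::max())
        throw std::invalid_argument{"number_of_user_bins must be neither 0 nor max(uint64_t)"};
    if (!(c.maximum_fpr > 0.0 && c.maximum_fpr < 1.0))
        throw std::invalid_argument{"maximum_fpr must be in (0.0,1.0)"};
    if (!(c.relaxed_fpr > 0.0 && c.relaxed_fpr < 1.0))
        throw std::invalid_argument{"relaxed_fpr must be in (0.0,1.0)"};
    if (c.relaxed_fpr < c.maximum_fpr)
        throw std::invalid_argument{"relaxed_fpr must be equal to or larger than maximum_fpr"};
    if (c.sketch_bits < 5u || c.sketch_bits > 32u)
        throw std::invalid_argument{"sketch_bits must be in [5,32]"};
    if (!(c.empty_bin_fraction >= 0.0 && c.empty_bin_fraction < 1.0))
        throw std::invalid_argument{"empty_bin_fraction must be in [0.0,1.0)"};
    if (!(c.max_rearrangement_ratio >= 0.0 && c.max_rearrangement_ratio <= 1.0))
        throw std::invalid_argument{"max_rearrangement_ratio must be in [0.0,1.0]"};
}
```

A default-constructed `toy_config` fails the first check because number_of_user_bins is still 0, which matches its [REQUIRED] status.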
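Required options:
- seqan::hibf::config::input_fn must be set.
- seqan::hibf::config::number_of_user_bins must be neither 0 nor std::numeric_limits<uint64_t>::max().
Constraints:
- seqan::hibf::config::number_of_hash_functions must be in [1,5].
- seqan::hibf::config::maximum_fpr must be in (0.0,1.0).
- seqan::hibf::config::relaxed_fpr must be in (0.0,1.0).
- seqan::hibf::config::threads must be greater than 0.
- seqan::hibf::config::sketch_bits must be in [5,32].
- seqan::hibf::config::tmax must be at most 18446744073709551552.
- seqan::hibf::config::empty_bin_fraction must be in [0.0,1.0).
- seqan::hibf::config::max_rearrangement_ratio must be in [0.0,1.0].
Modifications:
- Setting seqan::hibf::config::max_rearrangement_ratio to 0.0 also enables seqan::hibf::config::disable_rearrangement.
- Setting seqan::hibf::config::tmax to 0 results in a default tmax of std::ceil(std::sqrt(number_of_user_bins)) being used.
std::function<void(size_t const, insert_iterator &&)> seqan::hibf::config::input_fn {} |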
A function for how to hash your input [REQUIRED].
To efficiently construct the hierarchical structure of the HIBF, the input needs to be given as a function object. The IBF construction can be done in another way (see seqan::hibf::interleaved_bloom_filter) but has the config construction for consistency.
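A minimal sketch of such a function object is shown below. To keep it compilable without the library, it uses a stand-in `toy_insert_iterator` (hypothetical) in place of seqan::hibf::insert_iterator, and a helper `collect_user_bins` (also hypothetical) that imitates how construction might call the function once per user bin.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for seqan::hibf::insert_iterator: `it = hash` appends to a vector.
struct toy_insert_iterator
{
    std::vector<std::uint64_t> * sink;
    toy_insert_iterator & operator=(std::uint64_t const hash)
    {
        sink->push_back(hash);
        return *this;
    }
};

// Same shape as the documented type of seqan::hibf::config::input_fn.
using input_fn_t = std::function<void(std::size_t const, toy_insert_iterator &&)>;

// Returns a function object that inserts the single hash 42 into every user bin.
input_fn_t make_single_hash_input()
{
    return [](std::size_t const /*user_bin_id*/, toy_insert_iterator && it)
    {
        it = 42;
    };
}

// Hypothetical driver: calls input_fn once for each user bin ID in [0, number_of_user_bins).
std::vector<std::vector<std::uint64_t>> collect_user_bins(input_fn_t const & input_fn,
                                                          std::size_t const number_of_user_bins)
{
    std::vector<std::vector<std::uint64_t>> bins(number_of_user_bins);
    for (std::size_t id = 0; id < number_of_user_bins; ++id)
        input_fn(id, toy_insert_iterator{&bins[id]});
    return bins;
}
```

Calling `collect_user_bins(make_single_hash_input(), 12)` yields 12 user bins that each contain only the value 42.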
It is important that the function object, a lambda in the example, has exactly two parameters: (1) A size_t const which reflects the user bin ID to add hashes/values to in the (H)IBF. (2) A seqan::hibf::insert_iterator that inserts hashes via it = hash where hash must be convertible to uint64_t.
The above example would just insert a single hash (42) into each user bin.
Let's look at two more realistic examples.
Inserting from a vector of values:
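One way this could look, assuming the values of each user bin are already held in a `std::vector<std::vector<uint64_t>>` indexed by user bin ID. `toy_insert_iterator` and `make_vector_input` are hypothetical names; the iterator is a stand-in for seqan::hibf::insert_iterator so that the sketch compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for seqan::hibf::insert_iterator.
struct toy_insert_iterator
{
    std::vector<std::uint64_t> * sink;
    toy_insert_iterator & operator=(std::uint64_t const hash)
    {
        sink->push_back(hash);
        return *this;
    }
};

// Builds an input function that inserts every value stored for the requested user bin.
std::function<void(std::size_t const, toy_insert_iterator &&)>
make_vector_input(std::vector<std::vector<std::uint64_t>> user_bins)
{
    return [data = std::move(user_bins)](std::size_t const user_bin_id, toy_insert_iterator && it)
    {
        for (std::uint64_t const hash : data[user_bin_id])
            it = hash;
    };
}
```

The lambda owns a copy of the data, so it stays valid however long the configuration is kept around. In this setup, number_of_user_bins would be the size of the outer vector.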
Inserting from a file:
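A possible sketch, assuming one file per user bin and a file layout of whitespace-separated unsigned integers. Both the layout and the names `toy_insert_iterator` and `make_file_input` are assumptions made for illustration; real input would typically be sequence data that is hashed first.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for seqan::hibf::insert_iterator.
struct toy_insert_iterator
{
    std::vector<std::uint64_t> * sink;
    toy_insert_iterator & operator=(std::uint64_t const hash)
    {
        sink->push_back(hash);
        return *this;
    }
};

// Builds an input function that reads the file belonging to the requested user bin.
std::function<void(std::size_t const, toy_insert_iterator &&)>
make_file_input(std::vector<std::string> paths)
{
    return [files = std::move(paths)](std::size_t const user_bin_id, toy_insert_iterator && it)
    {
        std::ifstream in{files[user_bin_id]};
        if (!in)
            throw std::runtime_error{"cannot open " + files[user_bin_id]};

        std::uint64_t hash{};
        while (in >> hash) // assumed layout: whitespace-separated unsigned integers
            it = hash;
    };
}
```

Each call opens its own stream, so calls for different user bins do not share any state.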
seqan::hibf::config::number_of_user_bins.
size_t seqan::hibf::config::number_of_user_bins {} |
The number of user bins.
Since the data to construct the (H)IBF is given by a function object seqan::hibf::config::input_fn, the number of user bins to consider must be given via this option.
Value must be neither 0 nor std::numeric_limits<uint64_t>::max().
In this example, 12 user bins would be inserted into the (H)IBF, each only storing the hash 42.
size_t seqan::hibf::config::number_of_hash_functions {2} |
The number of hash functions for the underlying Bloom Filters.
The (H)IBF is based on the Bloom Filter data structure which requires a set of hash functions. The number of hash functions used influences the speed and space but the optimal number of hash functions is data dependent (see Bloom Filter Calculator).
Based on our experiments, we recommend a value of 2 (default).
Be sure to experiment with this option with your data before changing it.
double seqan::hibf::config::maximum_fpr {0.05} |
The desired maximum false positive rate of the underlying Bloom Filters. [RECOMMENDED_TO_ADAPT].
We ensure that when querying a single hash value in the (H)IBF, the probability of getting a false positive answer will not exceed the value set for seqan::hibf::config::maximum_fpr. The internal Bloom Filters will be configured accordingly. Individual Bloom Filters might have a different but always lower false positive rate (FPR).
Value must be in range (0.0,1.0). Recommendation: default value (0.05)
The FPR influences the memory consumption of the (H)IBF:
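To get a feeling for the magnitude, the textbook size formula for a single Bloom filter can be used: with n elements, h hash functions and FPR p, the filter needs about -h·n / ln(1 - p^(1/h)) bits. The sketch below evaluates this formula. It is only an approximation of the trend; how the (H)IBF sizes its internal filters may differ in detail, and the function name `bloom_filter_bits` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Approximate number of bits for one Bloom filter (standard formula, not the library's code).
double bloom_filter_bits(std::size_t const elements, std::size_t const hash_functions, double const fpr)
{
    double const h = static_cast<double>(hash_functions);
    double const n = static_cast<double>(elements);
    return -h * n / std::log(1.0 - std::pow(fpr, 1.0 / h));
}
```

For 1000 elements and 2 hash functions, lowering the FPR from 0.05 to 0.01 raises the estimate from roughly 7,900 bits to roughly 19,000 bits.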
double seqan::hibf::config::relaxed_fpr {0.3} |
Allow a higher FPR in non-accuracy-critical parts of the HIBF structure.
Some parts in the hierarchical structure are not critical to ensure the seqan::hibf::config::maximum_fpr. These can be allowed to have a higher FPR to reduce the overall space consumption, while only minimally affecting the runtime performance.
Value must be in range (0.0,1.0). Value must be equal to or larger than seqan::hibf::config::maximum_fpr. Recommendation: default value (0.3)
Merged bins in an HIBF layout will always be followed by one or more lower-level IBFs that will have split bins or single bins (split = 1) to recover the original user bins. Thus, the FPR of merged bins does not determine the seqan::hibf::config::maximum_fpr, but is independent. Choosing a higher FPR for merged bins can lower the memory requirement but increases the runtime. Experiments show that the decrease in memory is significant, while the runtime suffers only slightly. The accuracy of the results is not affected by this parameter.
Note: For each IBF there is a limit to how high the FPR of merged bins can be. Specifically, the FPR for merged bins can never decrease the IBF size more than what is needed to ensure the seqan::hibf::config::maximum_fpr for split bins. This means that, at some point, choosing even higher values for this parameter will have no effect anymore.
size_t seqan::hibf::config::threads {1u} |
The number of threads to use during construction. [RECOMMENDED_TO_ADAPT].
Using more threads increases the memory consumption during construction because the threads hold local data. It can be beneficial to try a lower number of threads if you have limited RAM but many threads.
Currently, the following parts of the HIBF construction process are parallelized:
uint8_t seqan::hibf::config::sketch_bits {12} |
The number of bits for HyperLogLog sketches.
The HIBF layout algorithm estimates the user bin sizes by computing HyperLogLog sketches.
A key parameter of HyperLogLog sketches is the number of bits of a hash value to consider for sketching. Fewer bits accelerate the sketching process but decrease the accuracy.
Value must be in range [5,32]. Recommendation: default value (12).
Be sure to experiment with this option with your data before changing it.
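The trade-off can be estimated with the standard HyperLogLog analysis, in which b bits select one of 2^b registers and the relative standard error is about 1.04 / sqrt(2^b). The sketch below assumes that sketch_bits plays the role of b; the library's sketch implementation may deviate from this textbook estimate, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Number of registers of a HyperLogLog sketch that uses `bits` bits for the register index.
std::size_t hll_registers(std::uint8_t const bits)
{
    return std::size_t{1} << bits;
}

// Textbook relative standard error of the cardinality estimate.
double hll_relative_error(std::uint8_t const bits)
{
    return 1.04 / std::sqrt(static_cast<double>(hll_registers(bits)));
}
```

With the default of 12 bits this gives 4096 registers and an expected error of about 1.6%; with 5 bits the error grows to about 18%.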
size_t seqan::hibf::config::tmax {} |
The maximum number of technical bins of each IBF in the HIBF.
One of the key methods of the HIBF for increasing query speed is limiting the number of (technical) bins of the IBF data structure used within the HIBF.
Choosing a good tmax is not trivial.
The smaller tmax, the more levels the layout needs to represent the data. This results in a higher space consumption of the index. While querying each individual level is cheap, querying many levels might also lead to an increased runtime.
A good tmax is usually the square root of the number of user bins/samples rounded to the next multiple of 64. Note that your tmax will always be rounded to the next multiple of 64.
Value must be in range [0,max(size_t)]. Recommendation: default value (the default will compute ≈sqrt(samples)).
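The default and the rounding described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that a value which already is a multiple of 64 stays unchanged; overflow for values close to the maximum of size_t is not handled.

```cpp
#include <cmath>
#include <cstddef>

// Rounds up to a multiple of 64; multiples of 64 are returned unchanged (assumption).
std::size_t next_multiple_of_64(std::size_t const value)
{
    return ((value + 63u) >> 6) << 6;
}

// tmax == 0 means "unset": fall back to ceil(sqrt(number_of_user_bins)), then round.
std::size_t effective_tmax(std::size_t tmax, std::size_t const number_of_user_bins)
{
    if (tmax == 0u)
        tmax = static_cast<std::size_t>(std::ceil(std::sqrt(static_cast<double>(number_of_user_bins))));
    return next_multiple_of_64(tmax);
}
```

For 10,000 user bins and an unset tmax, the square root is 100, which rounds up to 128 technical bins per IBF.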
double seqan::hibf::config::empty_bin_fraction {} |
The percentage of empty bins in the layout.
Certain applications, e.g., dynamic indices, require empty technical bins in the layout. This option allows you to specify the fraction of tmax that should be empty bins. The empty bins will be present in each IBF of the generated layout.
For example, if tmax is 64 and empty_bin_fraction is 0.10, then 6 bins will be empty, i.e., not designated to contain any data. The resulting layout will be very similar to a layout with tmax set to 58 and no empty bins.
Value must be in range [0.0,1.0). Recommendation: default value (0.0). This option is not recommended for general use.
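The arithmetic of the example above can be reproduced with a short sketch. It assumes that the number of empty bins is the product of tmax and empty_bin_fraction rounded down, which is consistent with the numbers in the example but not confirmed as the library's exact rule; `empty_bins` is a hypothetical helper.

```cpp
#include <cstddef>

// Number of technical bins kept empty per IBF, rounding down (assumption).
std::size_t empty_bins(std::size_t const tmax, double const empty_bin_fraction)
{
    return static_cast<std::size_t>(static_cast<double>(tmax) * empty_bin_fraction);
}
```

For tmax = 64 and a fraction of 0.10, this leaves 6 bins empty and 58 bins available for data.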
double seqan::hibf::config::alpha {1.2} |
A scaling factor to influence the amount of merged bins produced by the layout algorithm.
The layout algorithm optimizes the space consumption of the resulting HIBF, but currently has no means of optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins influences the query time because a merged bin always triggers another search on a lower level. To influence this ratio, alpha can be used.
The higher alpha, the fewer merged bins are chosen in the layout. This improves query times but leads to a bigger index.
Value must be in range [0.0,max(double)]. Recommendation: default value (1.2). Be sure to experiment with this option with your data before changing it.
double seqan::hibf::config::max_rearrangement_ratio {0.5} |
The maximal cardinality ratio in the clustering intervals of the layout rearrangement algorithm.
This option can influence the layout rearrangement algorithm. The layout rearrangement algorithm improves the layout by rearranging user bins with similar content into close proximity such that their shared hash values reduce the overall index size. The algorithm only rearranges the order of user bins in fixed intervals. The higher max_rearrangement_ratio, the larger the intervals. This potentially improves the layout, but increases the runtime of the layout algorithm.
Value must be in range [0.0,1.0]. Recommendation: default value (0.5).
bool seqan::hibf::config::disable_estimate_union {false} |
Whether to disable union estimate of user bins to improve the layout.
The layout algorithm usually estimates the union size of user bins based on their HyperLogLog sketches in order to precisely estimate the resulting HIBF memory consumption. It improves the layout quality but is computationally expensive.
If you are constructing an HIBF with more than 100,000 samples, we recommend considering disabling this feature for faster HIBF construction time.
bool seqan::hibf::config::disable_rearrangement {false} |
Whether to disable rearranging user bins based on their content similarity.
The layout rearrangement algorithm improves the layout by rearranging user bins with similar content into close proximity s.t. their shared hash values reduce the overall index size. It improves the layout quality but is computationally expensive.
If you are constructing an HIBF with more than 100,000 samples, we recommend considering disabling this feature for faster HIBF construction time.