SeqAn3  3.2.0
The Modern C++ library for sequence analysis.
seqan3::interleaved_bloom_filter< data_layout_mode_ > Class Template Reference

The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins. More...

#include <seqan3/search/dream_index/interleaved_bloom_filter.hpp>

+ Inheritance diagram for seqan3::interleaved_bloom_filter< data_layout_mode_ >:

Classes

class  counting_agent_type
 Manages counting ranges of values for the seqan3::interleaved_bloom_filter. More...
 
class  membership_agent_type
 Manages membership queries for the seqan3::interleaved_bloom_filter. More...
 

Public Member Functions

Constructors, destructor and assignment
 interleaved_bloom_filter ()=default
 Defaulted.
 
 interleaved_bloom_filter (interleaved_bloom_filter const &)=default
 Defaulted.
 
interleaved_bloom_filteroperator= (interleaved_bloom_filter const &)=default
 Defaulted.
 
 interleaved_bloom_filter (interleaved_bloom_filter &&)=default
 Defaulted.
 
interleaved_bloom_filteroperator= (interleaved_bloom_filter &&)=default
 Defaulted.
 
 ~interleaved_bloom_filter ()=default
 Defaulted.
 

Static Public Attributes

static constexpr data_layout data_layout_mode = data_layout_mode_
 Indicates whether the Interleaved Bloom Filter is compressed.
 

Detailed Description

template<data_layout data_layout_mode_ = data_layout::uncompressed>
class seqan3::interleaved_bloom_filter< data_layout_mode_ >

The IBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins.

Template Parameters
data_layout_mode_Indicates whether the underlying data type is compressed. See seqan3::data_layout.

Binning Directory

A binning directory is a data structure that can be used to determine set membership for elements. For example, a common use case is dividing a database into a fixed number (e.g. 1024) bins by some means of clustering (e.g. taxonomic binning or k-mer similarity clustering for genomic sequences). For a query, the binning directory can now answer in which bins the query (probably) occurs. In SeqAn we provide the Interleaved Bloom Filter (IBF) that can answer these queries efficiently.

Interleaved Bloom Filter (IBF)

The Interleaved Bloom Filter is a probabilistic data structure that extends the Bloom Filter. A Bloom Filter can be thought of as a bitvector of length n and h hash functions and is used to determine set membership. To insert data, the data is hashed by the h hash functions (returning values in [0, n)) and the corresponding h positions in the bitvector are set to 1. To query data, i.e. to determine whether the query belongs to the set the Bloom Filter was built for, the query is hashed by the same h hash functions and the corresponding positions are checked. If all h positions contain a 1, the query is (probably) in the data set. Since the Bloom Filter has variable length, the hashing is not bijective, i.e. it may return true for a set membership query even though the query was never inserted into the Bloom Filter. Note that the Bloom Filter will always return true if the query was inserted, i.e. there may be false positives, but no false negatives.

The Interleaved Bloom Filter now applies the concept of a Bloom Filter to multiple sets and provides a global data structure to determine set membership of a query in b data sets/bins. Conceptually, a Bloom Filter is created for each bin using the same fixed length and fixed hash functions for each filter. The resulting b Bloom Filters are then interleaved such that the i'th bit if each Bloom Filter are adjacent to each other:

Bloom Filter 0 Bloom Filter 1 Bloom Filter 2 Bloom Filter 3
|0.0|0.1|0.2|0.3| |1.0|1.1|1.2|1.3| |2.0|2.1|2.2|2.3| |3.0|3.1|3.2|3.3|

Where x.y denotes the y'th bit of the x'th Bloom Filter.

Interleaved Bloom Filter
|0.0|1.0|2.0|3.0|0.1|1.1|2.1|3.1|0.2|1.2|2.2|3.2|0.3|1.3|2.3|3.3|

A query can now be searched in all b bins by computing the h hash functions, retrieving the h sub-bitvectors of length b starting at the positions indicated by the hash functions. The bitwise AND of these sub-bitvectors yields the binningvector, a bitvector of length b where the i'th bit indicates set membership in the i'th bin.

Querying

To query the Interleaved Bloom Filter for a value, call seqan3::interleaved_bloom_filter::membership_agent() and use the returned seqan3::interleaved_bloom_filter::membership_agent_type.

To count the occurrences of a range of values in the Interleaved Bloom Filter, call seqan3::interleaved_bloom_filter::counting_agent() and use the returned seqan3::interleaved_bloom_filter::counting_agent_type.

Compression

The Interleaved Bloom Filter can be compressed by passing data_layout::compressed as template argument. The compressed seqan3::interleaved_bloom_filter<seqan3::data_layout::compressed> can only be constructed from a seqan3::interleaved_bloom_filter, in which case the underlying bitvector is compressed. The compressed Interleaved Bloom Filter is immutable, i.e. only querying is supported.

Thread safety

The Interleaved Bloom Filter promises the basic thread-safety by the STL that all calls to const member functions are safe from multiple threads (as long as no thread calls a non-const member function at the same time).

Additionally, concurrent calls to emplace are safe iff each thread handles a multiple of wordsize (=64) many bins. For example, calls to emplace from multiple threads are safe if thread_1 accesses bins 0-63, thread_2 bins 64-127, and so on.


The documentation for this class was generated from the following file: