Corpus ID: 203838515

RAMBO: Repeated And Merged Bloom Filter for Multiple Set Membership Testing (MSMT) in Sub-linear time

Authors: Gaurav Gupta, Benjamin Coleman, Tharun Medini, Vijai Mohan, Anshumali Shrivastava

Approximate set membership is a common problem with wide applications in databases, networking, and search. Given a set S and a query q, the task is to determine whether q ∈ S. The Bloom Filter (BF) is a popular data structure for approximate membership testing because of its simplicity: it consists of a bit array that can be updated incrementally. This paper addresses a related problem, Multiple Set Membership Testing (MSMT). Here we are given K different sets, and…
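
The Bloom filter described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class name `BloomFilter`, the default sizes, the SHA-256-based hashing, and the helper `msmt_linear` are all assumptions made for the example. The helper shows the naive MSMT baseline, which probes one filter per set and therefore costs time linear in K.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions (sizes are illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Assumption: derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Incremental update: set the k bits for this item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True for every inserted item; may also be True for others (false positive).
        return all(self.bits[pos] for pos in self._positions(item))


def msmt_linear(filters, query):
    """Naive MSMT baseline: probe all K filters and return the indices that match."""
    return [k for k, bf in enumerate(filters) if query in bf]


if __name__ == "__main__":
    sets = [{"apple", "pear"}, {"pear", "plum"}, {"fig"}]
    filters = []
    for members in sets:
        bf = BloomFilter()
        for item in members:
            bf.add(item)
        filters.append(bf)
    # The result always includes sets 0 and 1; false positives could add more.
    print(msmt_linear(filters, "pear"))
```

A Bloom filter never returns a false negative, so the baseline always reports every set that truly contains the query; the only error is an occasional extra set.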


Building Fast and Compact Sketches for Approximately Multi-Set Multi-Membership Querying

A novel Circular Shift and Coalesce (CSC) framework is proposed for approximate MS-MMQ. It encodes all n sets into a compact sketch and retrieves only a few bytes of the sketch per query, which yields high memory efficiency and makes queries several times faster.

To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics

The fundamentals of the most impactful probabilistic and signal processing algorithms are reviewed and more recent advances are highlighted to augment previous reviews in these areas that have taken a broader approach.

Sub-linear Sequence Search via a Repeated And Merged Bloom Filter (RAMBO)

RAMBO (Repeated And Merged Bloom Filter) is proposed. Because its number of Bloom filter probes scales sub-linearly, it needs significantly fewer probes than BIGSI at the same false-positive rate, and it significantly improves on BIGSI's query time when evaluated on real genome datasets.
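
One way to read "repeated and merged" is sketched below. It is a rough illustration under stated assumptions, not the authors' code: each of R repetitions partitions the K sets into B buckets, the sets in a bucket share one merged Bloom filter, and a query intersects the candidate sets across repetitions. The names `TinyBloom`, `RamboSketch`, `_hash`, and all default sizes are hypothetical.

```python
import hashlib


def _hash(*parts, mod):
    # Assumption: SHA-256 over the joined parts stands in for the hash family.
    digest = hashlib.sha256(":".join(map(str, parts)).encode()).digest()
    return int.from_bytes(digest[:8], "big") % mod


class TinyBloom:
    """Small Bloom filter used as the per-bucket building block."""

    def __init__(self, num_bits=2048, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def add(self, item):
        for i in range(self.num_hashes):
            self.bits[_hash("bf", i, item, mod=self.num_bits)] = 1

    def __contains__(self, item):
        return all(
            self.bits[_hash("bf", i, item, mod=self.num_bits)]
            for i in range(self.num_hashes)
        )


class RamboSketch:
    """R repetitions of B merged Bloom filters (illustrative defaults)."""

    def __init__(self, num_buckets=4, num_reps=3):
        self.B, self.R = num_buckets, num_reps
        self.filters = [[TinyBloom() for _ in range(num_buckets)] for _ in range(num_reps)]
        self.members = [[set() for _ in range(num_buckets)] for _ in range(num_reps)]

    def add(self, set_id, item):
        # Each repetition assigns the set to one bucket and merges it in.
        for r in range(self.R):
            b = _hash("part", r, set_id, mod=self.B)
            self.filters[r][b].add(item)
            self.members[r][b].add(set_id)

    def query(self, item):
        # Union the sets of matching buckets per repetition, then intersect.
        candidates = None
        for r in range(self.R):
            hit = set()
            for b in range(self.B):
                if item in self.filters[r][b]:
                    hit |= self.members[r][b]
            candidates = hit if candidates is None else candidates & hit
        return candidates or set()


if __name__ == "__main__":
    sketch = RamboSketch()
    for set_id in range(20):
        sketch.add(set_id, f"kmer-{set_id}")
    # Probes B * R = 12 filters instead of 20; the result always contains 7.
    print(sketch.query("kmer-7"))
```

In this sketch a query probes B × R filters rather than K, which is where the saving comes from when B × R is much smaller than K. The cost is extra false positives from merging, which the intersection across repetitions is meant to reduce.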



Ultra-fast search of all deposited bacterial and viral genomic data

This work indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods and produced a searchable data structure named BItsliced Genomic Signature Index (BIGSI).

Extreme Classification in Log Memory

MACH is a generic K-classification algorithm with provable theoretical guarantees. It requires O(log K) memory without any assumption on the relationship between classes, and it provides a theoretical quantification of the discriminability-memory tradeoff.

OMASS: One Memory Access Set Separation

The One Memory Access Set Separation (OMASS) scheme is designed so that for a given element x, the corresponding Bloom filter bits for each set map to different positions in the memory word, which ensures that the false positive rates for the Bloom filters for element x under other sets are not affected.

One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon

UnivMon is a framework for flow monitoring that leverages recent theoretical advances to demonstrate that generality and high accuracy can be achieved together. Trace-driven evaluations show that it offers comparable (and sometimes better) accuracy relative to custom sketching solutions.

Exact and approximate membership testers

The question of how much space is needed to represent a set is considered, given a finite universe U and some subset V and a procedure that for each element s in U determines if s is in V.

Beating CountSketch for heavy hitters in insertion streams

One can achieve $O(\log n \log\log n)$ bits of space for the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \ge \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$.
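
To make the heavy-hitter definition concrete, the sketch below computes the ℓ2-heavy hitters of a small stream exactly, by counting every item. It only illustrates the threshold (frequency at least ε times the square root of F2) and is not the small-space streaming algorithm the summary refers to; the function name `l2_heavy_hitters` and the sample stream are made up for the example.

```python
from collections import Counter
from math import sqrt


def l2_heavy_hitters(stream, eps):
    """Exact (non-streaming) l2-heavy hitters: items with f_j >= eps * sqrt(F2)."""
    freq = Counter(stream)                      # f_j for every distinct item j
    f2 = sum(c * c for c in freq.values())      # F2 = sum of squared frequencies
    threshold = eps * sqrt(f2)
    return {item for item, c in freq.items() if c >= threshold}


if __name__ == "__main__":
    # Frequencies: a=8, b=3, c=d=e=1, so F2 = 64 + 9 + 3 = 76 and sqrt(F2) is about 8.72.
    stream = ["a"] * 8 + ["b"] * 3 + ["c", "d", "e"]
    print(l2_heavy_hitters(stream, eps=0.5))    # threshold about 4.36 -> {'a'}
```

Lowering ε to 0.3 drops the threshold to about 2.62, so item `b` (frequency 3) also qualifies.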

Finding Frequent Items in Data Streams

This work presents a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space, which achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies.

Compressed bloom filters

A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Although Bloom filters allow false positives, for many applications

Managing Gigabytes: Compressing and Indexing Documents and Images

A guide to the MG system and its applications, as well as a comparison to the NZDL reference index, are provided.