Conjunctive Filter: Breaking the Entropy Barrier

@inproceedings{Okanohara2010ConjunctiveFB,
  title={Conjunctive Filter: Breaking the Entropy Barrier},
  author={Daisuke Okanohara and Yuichi Yoshida},
  booktitle={ALENEX},
  year={2010}
}
We consider the problem of storing a map that associates a key with a set of values. To store n values from a universe of size m, log2 (m choose n) bits of space are required, which can be approximated as (1.44 + log2(m/n)) n bits when n ≪ m. If we allow an ε fraction of errors in the outputs, we can store the map with roughly n log2(1/ε) bits, which matches the entropy bound. The Bloom filter is a well-known example of such a data structure. Our objective is to break this entropy bound and construct more space-efficient…
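
To make these bounds concrete, the following minimal Python sketch (not part of the paper) evaluates them for example parameters; the variable names and the values of n, m, and ε below are illustrative assumptions.

# Space bounds quoted in the abstract, evaluated for illustrative parameters.
import math

n, m, eps = 1_000, 1_000_000, 0.01              # example values only

# Exact bound: log2 of the binomial coefficient C(m, n), computed via log-gamma.
exact_bits = (math.lgamma(m + 1) - math.lgamma(n + 1)
              - math.lgamma(m - n + 1)) / math.log(2)

# Approximation n * (1.44 + log2(m/n)), valid when n is much smaller than m
# (1.44 is approximately log2 e).
approx_bits = n * (1.44 + math.log2(m / n))

# Entropy bound when an eps fraction of errors is allowed: n * log2(1/eps).
eps_bits = n * math.log2(1 / eps)

print(f"exact bound    : {exact_bits:,.0f} bits")
print(f"approximation  : {approx_bits:,.0f} bits")
print(f"eps-error bound: {eps_bits:,.0f} bits")

For these values the exact bound and its approximation both come out at roughly 11,400 bits, while the ε = 0.01 entropy bound drops to about 6,600 bits.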

Citations

Faster upper bounding of intersection sizes
TLDR
A new data structure is described, a Cardinality Filter, to quickly compute an upper bound on the size of a set intersection, which can be used to accelerate many applications such as top-k query processing in text mining.
On Gapped Set Intersection Size Estimation
TLDR
This paper considers a generalized problem for integer sets where, given a gap parameter δ, two elements are deemed a match if their numeric difference equals δ or is within δ; it can be used to model applications in database systems, data mining, and information retrieval.
Efficient Identification of Local Keyword Patterns in Microblogging Platforms
TLDR
To handle the high-volume microblog stream and the large number of queries issued against it, novel data structures are developed to maintain the data stream, and efficient algorithms with theoretical underpinnings are proposed to process LFP and LKFP queries.

References

SHOWING 1-10 OF 13 REFERENCES
Succinct Data Structures for Retrieval and Approximate Membership
TLDR
It is shown that for any k, query time O(k) can be achieved using space that is within a factor 1 + e^(-k) of optimal, asymptotically for large n.
An Optimal Bloom Filter Replacement Based on Matrix Solving
TLDR
This work suggests a method for holding a dictionary data structure that maps keys to values, in the spirit of Bloom filters: the proposed structure requires only nk bits of space, has O(n) preprocessing time, and has O(log n) query time.
Fast Evaluation of Union-Intersection Expressions
TLDR
A novel combination of approximate set representations and word-level parallelism is used to represent sets in a linear-space data structure such that expressions involving unions and intersections of sets can be computed in a worst-case efficient way.
The Bloomier filter: an efficient data structure for static support lookup tables
TLDR
The Bloomier filter is introduced, a data structure for compactly encoding a function with static support in order to support approximate evaluation queries, and lower bounds are provided to prove the (near) optimality of the constructions.
Bloomier Filters: A second look
TLDR
This article gives a simple construction of a Bloomier filter, a space-efficient structure for storing static sets, where the space efficiency is gained at the expense of a small probability of false positives.
Space/time trade-offs in hash coding with allowable errors
TLDR
Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
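
As a concrete illustration of this trade-off (not taken from the paper above), here is a minimal Bloom filter sketch in Python; the class name, the SHA-256-based double hashing, and the parameter values are illustrative assumptions.

# Minimal Bloom filter: an m-bit array plus k hash probes per item. Membership
# tests may return false positives (items never added reported as present) but
# never false negatives, which is what permits the much smaller "hash area".
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k probe positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

bf = BloomFilter(m_bits=8 * 1024, k_hashes=5)
bf.add("alice")
print("alice" in bf)   # True
print("bob" in bf)     # almost certainly False; True would be a false positive

Enlarging the bit array lowers the false-positive rate at the cost of space, which is the trade-off described in the summary above.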
Storing a Compressed Function with Constant Time Access
We consider the problem of representing, in a space-efficient way, a function f: S → Σ such that any function value can be computed in constant time on a RAM. Specifically, our aim is to achieve space…
Succinct indexable dictionaries with applications to encoding k-ary trees and multisets
TLDR
A structure that supports both operations in O(1) time on the RAM model, and an information-theoretically optimal representation for cardinal trees and multisets where (appropriate generalisations of) the select and rank operations can be supported in O(1) time.
Secondary indexing in one dimension: beyond b-trees and bitmap indexes
TLDR
This paper gives the first theoretically optimal data structure for the secondary indexing problem and shows how to bound the size of the data structure in terms of the 0th-order entropy of x.
Probability Inequalities for sums of Bounded Random Variables
Upper bounds are derived for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt. It is assumed that the range of each summand of S…
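
For reference, in the bounded-range case treated there, with each independent summand X_i confined to an interval [a_i, b_i], the resulting tail bound (Hoeffding's inequality) can be written in LaTeX as:

\Pr\{\, S - \mathrm{E}S \ge nt \,\} \;\le\; \exp\!\left( \frac{-2 n^{2} t^{2}}{\sum_{i=1}^{n} (b_i - a_i)^{2}} \right)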
...