BitFunnel: Revisiting Signatures for Search

@article{Goodwin2017BitFunnelRS,
  title={BitFunnel: Revisiting Signatures for Search},
  author={Bob Goodwin and Michael Hopcroft and Danh Nguyen Luu and Alex Clemmer and Mihaela Curmei and Sameh Elnikety and Yuxiong He},
  journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2017}
}
  • B. Goodwin, M. Hopcroft, Yuxiong He
  • Published 7 August 2017
  • Computer Science
  • Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index. The driving factor behind the shift away from the inverted index was operational cost savings. This paper describes algorithmic innovations and changes in the cloud… 
COBS: a Compact Bit-Sliced Signature Index
TLDR
COBS' compact but simple data structure outperforms the other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index
TLDR
This work proposes a hybrid method which uses both the recently published mapping-matrix-style index BitFunnel (BF) for search efficiency and the state-of-the-art Partitioned Elias-Fano (PEF) inverted-index compression method to minimize time while satisfying a fixed space constraint.
Index Compression for BitFunnel Query Processing
TLDR
A dictionary-based compression approach is proposed for the recently proposed bitwise data structure BitFunnel, which makes use of a Bloom filter, and a docID reordering strategy is introduced to improve compression.
The Potential of Learned Index Structures for Index Compression
TLDR
This work investigates how a learned model can replace document postings of an inverted index, and then evaluates the compromises such an approach might have, and the potential gains that can be achieved in terms of memory requirements.
Document Reordering for Faster Intersection
TLDR
This paper defines the problem of minimizing the cost of queries given an inverted index and a query distribution, relates it to work on adaptive set intersection, and proposes a heuristic algorithm for finding a document reordering that minimizes query processing costs under suitable cost models.
Efficient In-Memory, List-Based Text Inversion
TLDR
This work addresses three main techniques for improving the performance of an in-memory, list-based inverted file indexer: list chunking, in-chunk postings compression, and use of virtual memory "Large Pages".
Seesaw Counting Filter: An Efficient Guardian for Vulnerable Negative Keys During Dynamic Filtering
TLDR
This work proposes the Seesaw Counting Filter (SSCF), which encapsulates the vulnerable negative keys into a unified counter array, named the seesaw counter array, and dynamically modulates (or varies) the applied hash functions to guard the encapsulated keys from being misidentified.
On Tradeoffs Between Document Signature Methods for a Legal Due Diligence Corpus
TLDR
This work quantifies the trade-off between signature length, time to compute, number of hash collisions, and number of nearest neighbours for a 90,000 document due diligence corpus.
Engineering a Compact Bit-Sliced Signature Index for Approximate Search on Genomic Data
TLDR
The Compact Bit-Sliced Signature Index (COBS), a new index variant with significantly improved space and time requirements, is introduced, constructed to address the challenges of searching large genomic collections.
RAMBO: Repeated And Merged BloOm Filter for Ultra-fast Multiple Set Membership Testing (MSMT) on Large-Scale Data.
TLDR
A data structure called RAMBO (Repeated And Merged BloOm Filter) is proposed that achieves $O(\sqrt{K} \log K)$ query time in expectation with an additional worst-case memory cost factor of $O(\log K)$ beyond the array of Bloom Filters.

References

Skewed partial bitvectors for list intersection
TLDR
This paper examines the space-time performance of in-memory conjunctive list intersection algorithms, as used in search engines, where integers represent document identifiers. It defines semi-bitvectors, a new partial bitvector data structure that stores the front of the list using a bitvector and the remainder using skips and delta compression.
Partitioned Elias-Fano indexes
TLDR
This paper describes a new representation of monotone sequences based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure that offers significantly better compression and improves compression ratio/query time trade-off.
Bit Transposed Files
TLDR
Results from experiments suggest that the bit transposed file is a reasonable alternative file structure for large SSDBs and is also amenable to special parallel hardware.
Decoding billions of integers per second through vectorization
TLDR
A novel vectorized scheme called SIMD‐BP128⋆ is introduced that improves over previously proposed vectorized approaches and is nearly twice as fast as the previously fastest schemes on desktop processors (varint‐G8IU and PFOR).
TOPSIG: topology preserving document signatures
TLDR
TopSig, a new approach to the construction of file signatures that extends recent advances in semantic hashing and dimensionality reduction, is described. The work suggests that file signatures offer a viable alternative to inverted files in suitable settings, and positions the file signatures model in the class of Vector Space retrieval models.
Multikey access methods based on superimposed coding techniques
TLDR
For large data files, it is shown that the two-level implementation is generally more efficient for queries with a small number of matching records and when blocks of records match the query but individual records within these blocks do not.
Faster and smaller inverted indices with treaps
TLDR
This work introduces a new representation of the inverted index that performs faster ranked unions and intersections while using less space, and performs queries up to three times faster than state-of-the-art compact representations.
Inverted files versus signature files for text indexing
TLDR
A detailed comparison of inverted files and signature files in the context of text indexing shows that inverted files are distinctly superior to signature files, and shows that a synthetic text database can provide a realistic indication of the behavior of an actual text database.
Efficient set intersection for inverted indexing
TLDR
This article investigates intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements, and proposes a simple hybrid method that provides both compact storage and faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
Self-indexing inverted files for fast text retrieval
TLDR
This work shows that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list.