Partitioned Elias-Fano indexes

  • Giuseppe Ottaviano, Rossano Venturini
  • Published 3 July 2014
  • Computer Science
  • Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
The Elias-Fano representation of monotone sequences has been recently applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is competitive with some state-of-the-art methods such as gamma-delta-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we… 
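As a rough illustration of the plain Elias-Fano representation the abstract refers to (not the partitioned variant the paper goes on to propose), the following Python sketch splits each value of a non-decreasing sequence into low bits, stored verbatim, and high bits, stored in a unary-coded bit vector. The names `ef_encode`, `ef_access` and `ef_size_bits`, the dictionary layout, and the list-of-ints bit vector are assumptions made for readability. A practical implementation would use packed bit vectors and a constant-time select structure rather than the linear scan shown here.

```python
from math import floor, log2


def ef_encode(values, universe):
    """Encode a non-decreasing list of integers drawn from [0, universe)."""
    n = len(values)
    # Low bits per element: floor(log2(universe / n)), never negative.
    low_bits = max(0, floor(log2(universe / n))) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # High parts in unary: element i sets the bit at (v >> low_bits) + i.
    highs = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        highs[(v >> low_bits) + i] = 1
    return {"n": n, "low_bits": low_bits, "lows": lows, "highs": highs}


def ef_access(enc, i):
    """Return the i-th value (0-based) using a naive select over the high bits."""
    seen = -1
    for pos, bit in enumerate(enc["highs"]):
        if bit:
            seen += 1
            if seen == i:
                # The i-th set bit sits at position high + i.
                high = pos - i
                return (high << enc["low_bits"]) | enc["lows"][i]
    raise IndexError(i)


def ef_size_bits(enc):
    """Bits used by the low-bit array plus the high-bit vector."""
    return enc["n"] * enc["low_bits"] + len(enc["highs"])


# Hypothetical posting list used only to exercise the sketch.
postings = [3, 4, 7, 13, 14, 15, 21, 43]
enc = ef_encode(postings, universe=44)
decoded = [ef_access(enc, i) for i in range(len(postings))]
```

The space used depends only on the number of elements and the size of the universe, not on how the values are distributed within it, which is why the plain representation cannot benefit from runs of close identifiers.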

A Heuristically Optimized Partitioning Strategy on Elias-Fano Index
This paper compares the performance of existing encoders along the space-time trade-off curve, then presents a faster algorithm that heuristically computes optimal partitions for the state-of-the-art Partitioned Elias-Fano index while taking compression time into account.
A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index
This work proposes a hybrid method which uses both the recently published mapping-matrix-style index BitFunnel (BF) for search efficiency and the state-of-the-art Partitioned Elias-Fano (PEF) inverted-index compression method to minimize time while satisfying a fixed space constraint.
On Slicing Sorted Integer Sequences
A solution is proposed and implemented that recursively slices the universe of representation of a sequence to achieve compact storage and fast query execution, thus offering an excellent space/time trade-off for the problem.
A flexible space-time tradeoff on hybrid index with bicriteria optimization
The concept of bicriteria compression is introduced, which aims to formalize the problem of optimally trading the compressed size and query processing time for inverted index and adopts a Lagrangian relaxation algorithm to solve this problem by reducing it to a knapsack-type problem.
Dynamic Elias-Fano Representation
We show that it is possible to store a dynamic ordered set S of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound and preserve, at the same
Clustered Elias-Fano Indexes
A new index representation based on clustering the collection of posting lists and, for each created cluster, building an ad hoc reference list with respect to which all lists in the cluster are encoded with Elias-Fano is proposed.
On Optimizing Partitioning Strategies for Faster Inverted Index Compression
Compression speed is introduced as a criterion for evaluating compression techniques, and a linear-time optimization is proposed to enhance VSEncoding with faster compression and more flexibility in partitioning an index.
Faster BlockMax WAND with Variable-sized Blocks
This work sets up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describes an efficient algorithm to find an approximate solution, with provable approximation guarantees.
Optimal Space-time Tradeoffs for Inverted Indexes
A linear time algorithm is introduced that, given a query distribution and a set of encoders, selects the best encoder for each index block to obtain the lowest expected query processing time respecting a given space constraint.
MILC: Inverted List Compression in Memory
This work proposes a new compression scheme, namely, MILC (memory inverted list compression), which relies on a series of techniques including offset-oriented fixed-bit encoding, dynamic partitioning, in-block compression, cache-aware optimization, and SIMD acceleration and experimentally shows that MILC improves the query performance and reduces the space overhead.


Improving table compression with combinatorial optimization
This work devises the first on-line training algorithms for table compression, which can be applied to individual files, not just continuously operating sources; and also a new off-line training algorithm, based on a link to the asymmetric traveling salesman problem, which improves on prior work by rearranging columns prior to partitioning.
VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming
Experiments show that this class of encoders outperforms all existing methods in the literature by more than 10% (with the exception of Binary Interpolative Coding, with which they roughly tie) while still retaining a very fast decompression algorithm.
Compressing relations and indexes
We propose a new compression algorithm that is tailored to database applications. It can be applied to a collection of records, and is especially effective for records with many low to medium
Quasi-succinct indices
This paper proposes to represent an index using a different architecture based on quasi-succinct representation of monotone sequences and shows that the new index provides expected constant-time operations, space savings, and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.
Sorting Out the Document Identifier Assignment Problem
It is empirically shown that in the case of collections of Web Documents the authors can enhance the performance of compression algorithms by simply assigning identifiers to documents according to the lexicographical ordering of the URLs.
Index compression is good, especially for random access
It is demonstrated that, in some cases, random access into a term's postings list may be realized more efficiently if the list is stored in compressed form instead of uncompressed, regardless of whether the index is stored on disk or in main memory.
On Optimally Partitioning a Text to Improve Its Compression
The first algorithm is provided that computes a partition of T whose compressed output is guaranteed to be at most a factor (1+ε) worse than the optimal one, where ε may be any positive constant fixed in advance. The guarantee holds for any base compressor C whose compression performance can be bounded in terms of the zero-th or the k-th order empirical entropy of the text T.
Inverted index compression and query processing with optimized document ordering
This work performs an extensive study of compression techniques for document IDs and presents new optimizations of existing techniques that achieve significant improvements in both compression and decompression performance.
Super-Scalar RAM-CPU Cache Compression
This work proposes three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs and compares these algorithms with compression techniques used in (commercial) database and information retrieval systems.
Decoding billions of integers per second through vectorization
A novel vectorized scheme called SIMD‐BP128⋆ is introduced that improves over previously proposed vectorized approaches and is nearly twice as fast as the previously fastest schemes on desktop processors (varint‐G8IU and PFOR).