Reordering columns for smaller indexes

@article{Lemire2011ReorderingCF,
  title={Reordering columns for smaller indexes},
  author={D. Lemire and Owen Kaser},
  journal={ArXiv},
  year={2011},
  volume={abs/0909.1346}
}
Column-oriented indexes, such as projection or bitmap indexes, are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prove that the number of runs in table columns is minimized if we sort columns by increasing…
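The effect described in the abstract can be checked on a toy table. The following is a minimal sketch, not the authors' code: the helper names (count_runs, total_runs, cardinality_order) and the example table are hypothetical, and ordering columns by increasing cardinality is used here only as an illustrative heuristic, since the abstract above is truncated at that point.

```python
def count_runs(column):
    """Count maximal runs of identical consecutive values in one column."""
    runs = 0
    previous = object()  # sentinel that compares unequal to every value
    for value in column:
        if value != previous:
            runs += 1
            previous = value
    return runs


def total_runs(rows, column_order):
    """Permute the columns, sort the rows lexicographically, and
    return the number of runs summed over all columns."""
    permuted = sorted(tuple(row[i] for i in column_order) for row in rows)
    if not permuted:
        return 0
    return sum(count_runs(column) for column in zip(*permuted))


def cardinality_order(rows):
    """Illustrative heuristic (an assumption of this sketch): column
    indexes sorted by increasing number of distinct values."""
    n_cols = len(rows[0])
    return sorted(range(n_cols), key=lambda i: len({row[i] for row in rows}))


# Hypothetical toy table: column 0 has 2 distinct values, column 1 has 4.
rows = [(a, b) for a in range(2) for b in range(4)]

print(total_runs(rows, (0, 1)))   # low-cardinality column first: 10 runs
print(total_runs(rows, (1, 0)))   # high-cardinality column first: 12 runs
print(cardinality_order(rows))    # [0, 1]
```

On this small table, sorting with the low-cardinality column first yields 10 runs against 12 for the reverse order. Trying every column order exhaustively would cost a factorial number of sorts, which is why heuristics matter for wider tables.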
Minimizing Index Size by Reordering Rows and Columns
TLDR
This paper develops accurate statistical formulas that compute approximate solutions for reordering rows and columns of a data table and confirms that the heuristic of sorting low-cardinality columns first is indeed effective in reducing index sizes.
Column Partition and Permutation for Run Length Encoding in Columnar Databases
  • Jia Shi
  • Computer Science
  • SIGMOD Conference
  • 2020
TLDR
This paper proposes an incremental heuristic that identifies the set of columns to be compressed and the order of rows that offer a better compression ratio, and improves the compression rate by up to 25% on test data, compared with compressing all columns of a table.
Reordering rows for better compression: Beyond the lexicographic order
TLDR
It is proved that, in a few cases, the new row reordering is optimal at minimizing the runs of identical values within columns, and it is found that run-length encoding can be improved by up to a factor of 3 whereas prefix coding can be improved by up to 80%; these gains come on top of the gains due to lexicographically sorting the table.
Variable Length Compression for Bitmap Indices
TLDR
The empirical study shows that in the best case the approach can out-compress BBC by 30% and WAH by 70%, for real data sets, and an algorithm that efficiently processes queries when encoding lengths share a common integer factor is presented.
A meta-heuristic approach for RLE compression in a column store table
TLDR
This paper presents a comprehensive analysis and comparison of well-known meta-heuristics for columnar run minimization, based on standard implementations and real datasets, and provides implementations of the heuristic RLE compression approaches based on common optimization methods.
Compressed bitmap indexes: beyond unions and intersections
TLDR
This work shows that bitmap indexes are more broadly applicable than is commonly believed and introduces new algorithms that are sometimes three orders of magnitude faster than a naïve approach.
A Genetic Algorithm Approach for Minimizing the Number of Columnar Runs in a Column Store Table
TLDR
This paper presents a genetic algorithm for determining a column sorting order that minimizes the number of columnar runs in a column store table and therefore maximizes the RLE-based table compression.
Threshold and Symmetric Functions over Bitmaps
TLDR
This work considers symmetric Boolean queries, and finds that the best of the bitmap-based algorithms are competitive with the state-of-the-art algorithms for important special cases (e.g., MergeOpt, MergeSkip, DivideSkip, ScanCount).
Performance evaluation of fast integer compression techniques over tables
TLDR
This study aims to quantify the trade-offs of fast integer compression schemes with respect to compression ratio and speed of compression and decompression, and finds that sorting can significantly improve the performance of compression.

References

SHOWING 1-10 OF 99 REFERENCES
Sorting improves word-aligned bitmap indexes
TLDR
This work uses techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression, to accelerate logical operations (AND, OR, XOR) over bitmaps, and investigates row-reordering heuristics.
Compressing table data with column dependency
  • B. Vo, K. Vo
  • Computer Science
  • Theor. Comput. Sci.
  • 2007
TLDR
This paper formalizes the notion of column dependency as a way to capture information redundancy across columns and discusses how to automatically compute and use it to substantially improve table compression.
Dictionary-based order-preserving string compression for main memory column stores
TLDR
This paper proposes new data structures that efficiently support an order-preserving dictionary compression for (variable-length) string attributes with a large domain size that is likely to change over time and introduces a novel indexing approach that provides efficient access paths to such a dictionary while compressing the index data.
Compression of inverted indexes for fast query evaluation
TLDR
This paper proposes several simple optimisations to well-known integer compression schemes, and shows experimentally that these lead to significant reductions in time, and concludes that fast byte-aligned codes should be used to store integers in inverted lists.
Integrating compression and execution in column-oriented database systems
TLDR
This paper shows how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems, evaluates a set of compression schemes, and shows that the best scheme depends not only on the properties of the data but also on the nature of the query workload.
Index compression is good, especially for random access
TLDR
It is demonstrated that, in some cases, random access into a term's postings list may be realized more efficiently if the list is stored in compressed form instead of uncompressed, regardless of whether the index is stored on disk or in main memory.
Binary Interpolative Coding for Effective Index Compression
TLDR
A new method for compressing inverted indexes is introduced that yields excellent compression, fast decoding, and exploits clustering, the tendency for words to appear relatively frequently in some parts of the collection and infrequently in others.
Optimizing bitmap indices with efficient compression
TLDR
This article presents a new compression scheme called Word-Aligned Hybrid (WAH) code that makes compressed bitmap indices efficient even for high-cardinality attributes and proves that the new compressed bitmap index, like the best variants of the B-tree index, is optimal for one-dimensional range queries.
Read-optimized databases, in depth
TLDR
This study examines five tables with various characteristics and different query workloads in order to obtain a greater understanding and quantification of the relative performance of column stores and row stores.
C-Store: A Column-oriented DBMS
TLDR
Preliminary performance data on a subset of TPC-H is presented and it is shown that the system the team is building, C-Store, is substantially faster than popular commercial products.