Compressed Linear Algebra for Large-Scale Machine Learning

@article{Elgohary2016CompressedLA,
  title={Compressed Linear Algebra for Large-Scale Machine Learning},
  author={Ahmed Elgohary and Matthias Boehm and Peter J. Haas and Frederick Reiss and Berthold Reinwald},
  journal={Proc. VLDB Endow.},
  year={2016},
  volume={9},
  pages={960-971}
}
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory. General-purpose, heavy- and lightweight compression techniques struggle to achieve both good compression ratios and fast decompression speed to enable block-wise uncompressed operations. Hence, we initiate work on compressed… 
Compressed linear algebra for large-scale machine learning
TLDR
This work initiates value-based compressed linear algebra (CLA), in which heterogeneous, lightweight database compression techniques are applied to matrices, and linear algebra operations such as matrix-vector multiplication are executed directly on the compressed representation.
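To make the core idea concrete, the following is a minimal Python sketch of a matrix-vector product executed over columns that have been offset-list encoded (one row-offset list per distinct value). The function names and the simplified per-column encoding are assumptions made for illustration; the actual CLA column groups, OLE/RLE/dictionary formats, and column co-coding are considerably richer.

```python
# Minimal sketch of matrix-vector multiplication over value-based,
# offset-list-encoded columns, in the spirit of CLA (illustrative only;
# the real CLA column groups and OLE/RLE formats differ).

from collections import defaultdict

def encode_column(col):
    """Map each distinct value to the list of row indices where it occurs."""
    offsets = defaultdict(list)
    for row, val in enumerate(col):
        offsets[val].append(row)
    return dict(offsets)          # {distinct_value: [row, row, ...]}

def matvec_compressed(encoded_cols, v):
    """Compute q = X v directly on the compressed columns.

    encoded_cols[j] is the offset-list encoding of column j of X;
    v is the dense input vector."""
    n_rows = max(r for enc in encoded_cols
                 for rows in enc.values() for r in rows) + 1
    q = [0.0] * n_rows
    for j, enc in enumerate(encoded_cols):
        for val, rows in enc.items():
            contrib = val * v[j]          # one multiply per distinct value
            for r in rows:                # then only additions per row
                q[r] += contrib
    return q

# Tiny usage example: a 4x2 matrix with few distinct values per column.
X = [[1.0, 7.0],
     [1.0, 7.0],
     [2.0, 7.0],
     [2.0, 9.0]]
cols = [encode_column([X[i][j] for i in range(4)]) for j in range(2)]
print(matvec_compressed(cols, [0.5, 2.0]))   # same result as dense X @ v
```

The point of the sketch is that the expensive multiplications scale with the number of distinct values per column rather than with the number of rows, which is what makes operating directly on the compressed form attractive for skewed, low-cardinality data.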
Scaling Machine Learning via Compressed Linear Algebra
TLDR
Large-scale machine learning algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model, so effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm are needed.
Compressed linear algebra for declarative large-scale machine learning
TLDR
This work introduces Compressed Linear Algebra (CLA) for lossless matrix compression, which encodes matrices with lightweight, value-based compression techniques and executes linear algebra operations directly on the compressed representations.
Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent
TLDR
This work proposes a new lossless compression scheme called tuple-oriented compression (TOC) that is inspired by an unlikely source, the string/text compression scheme Lempel-Ziv-Welch, but tailored to mini-batch stochastic gradient descent in a way that preserves tuple boundaries within mini-batches.
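As a rough illustration of the underlying idea (not the actual TOC layout), the sketch below applies LZW-style dictionary encoding tuple by tuple over a shared dictionary, so that no emitted code spans a tuple boundary. The function names and the dictionary-seeding scheme are assumptions made for the example.

```python
# Sketch of LZW-style encoding applied tuple-by-tuple, so that compressed
# output never spans a tuple boundary (illustrative; not the actual TOC
# layout, which organizes compression and gradient computation differently).

def lzw_encode_tuple(values, dictionary):
    """Encode one tuple (a sequence of hashable feature values)."""
    codes = []
    prefix = ()
    for v in values:
        candidate = prefix + (v,)
        if candidate in dictionary:
            prefix = candidate
        else:
            codes.append(dictionary[prefix])
            dictionary[candidate] = len(dictionary)   # grow the dictionary
            prefix = (v,)
    if prefix:
        codes.append(dictionary[prefix])
    return codes

# Dictionary shared across the mini-batch, seeded with all single values.
batch = [(1, 1, 2, 1, 1, 2), (1, 1, 2, 2, 1, 1)]
dictionary = {}
for t in batch:
    for v in t:
        dictionary.setdefault((v,), len(dictionary))
encoded = [lzw_encode_tuple(t, dictionary) for t in batch]
print(encoded)
```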
FlashR: parallelize and scale R for machine learning using SSDs
TLDR
Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms, and the R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3-20.
Technical Perspective: Scaling Machine Learning via Compressed Linear Algebra
TLDR
The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading between compression speed and compression rate — to build an elegant implementation of compressed linear algebra operations.
Lightweight Data Compression Algorithms: An Experimental Survey (Experiments and Analyses)
TLDR
This work conducted an exhaustive experimental survey by evaluating several state-of-the-art compression algorithms as well as cascades of basic techniques, finding that there is no single-best algorithm.
Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes
TLDR
A novel implementation concept for run-length encoding using conflict-detection operations which have been introduced in Intel’s AVX-512 SIMD extension is presented and different data layouts for vectorization and their impact on wider vector sizes are investigated.
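For reference, a plain scalar run-length encoder/decoder is sketched below; the surveyed work vectorizes this idea with AVX-512 conflict-detection instructions, which a Python sketch does not attempt to reproduce.

```python
# Plain scalar run-length encoding (RLE) baseline; the cited work speeds
# this up with AVX-512 conflict-detection instructions.

def rle_encode(values):
    """Return a list of (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    out = []
    for v, length in runs:
        out.extend([v] * length)
    return out

data = [4, 4, 4, 7, 7, 4, 1, 1, 1, 1]
runs = rle_encode(data)
assert rle_decode(runs) == data
print(runs)   # [(4, 3), (7, 2), (4, 1), (1, 4)]
```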
BlockJoin: Efficient Matrix Partitioning Through Joins
TLDR
BlockJoin is presented, a distributed join algorithm which directly produces block-partitioned results and applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.
Low Level Big Data Compression
TLDR
This work proposes a mechanism for storing and processing categorical information by compressing it at the bit level, together with block-wise compression and decompression, so that processing the compressed information resembles processing the original information.

References

SHOWING 1-10 OF 84 REFERENCES
An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication
TLDR
A compressed storage format, called Compressed Sparse eXtended (CSX), that is able to simultaneously detect and encode multiple commonly encountered substructures inside a sparse matrix, considerably reducing the memory footprint of the matrix and alleviating pressure on the memory subsystem.
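For context, the sketch below shows a baseline sparse matrix-vector multiply over a standard CSR representation; CSX goes further by detecting and encoding recurring substructures (e.g. dense blocks, diagonals) to shrink the index data, which this baseline does not attempt.

```python
# Baseline sparse matrix-vector multiplication over a CSR-encoded matrix.
# CSX extends this style of format by encoding recurring substructures to
# reduce index overhead; that detection logic is not reproduced here.

def csr_matvec(row_ptr, col_idx, vals, v):
    """y = A v for A stored in compressed sparse row (CSR) form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * v[col_idx[k]]
    return y

# A = [[10, 0, 0],
#      [ 0, 0, 3],
#      [ 2, 0, 1]]
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 2]
vals    = [10.0, 3.0, 2.0, 1.0]
print(csr_matvec(row_ptr, col_idx, vals, [1.0, 2.0, 3.0]))   # [10.0, 9.0, 5.0]
```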
SLACID - sparse linear algebra in a column-oriented in-memory database system
TLDR
This paper presents and compares different approaches to storing sparse matrices in an in-memory column-oriented database system, and shows that a system layout derived from the compressed sparse row representation integrates well with a columnar database design and is moreover amenable to a wide range of non-numerical use cases when dictionary encoding is used.
Optimizing sparse matrix-vector multiplication using index and value compression
TLDR
This paper proposes two distinct compression methods targeting index and numerical values respectively and demonstrates that the index compression method can be applied successfully to a wide range of matrices and the value compression method is able to achieve impressive speedups in a more limited yet important class of sparse matrices that contain a small number of distinct values.
On optimizing machine learning workloads via kernel fusion
TLDR
An analytical model is presented that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters and demonstrates the effectiveness of the fused kernel approach in improving end-to-end performance on an entire ML algorithm.
An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs
TLDR
A new blocked row-column (BRC) storage format with a novel two-dimensional blocking mechanism that effectively addresses the challenges: it reduces thread divergence by reordering and grouping rows of the input matrix with nearly equal numbers of non-zero elements onto the same execution units (i.e., warps).
Implementing sparse matrix-vector multiplication on throughput-oriented processors
  • Nathan Bell, M. Garland
  • Computer Science
    Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
  • 2009
TLDR
This work explores SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes, including structured grid and unstructured mesh matrices.
Super-Scalar RAM-CPU Cache Compression
TLDR
This work proposes three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs and compares these algorithms with compression techniques used in (commercial) database and information retrieval systems.
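The sketch below illustrates the frame-of-reference idea behind PFOR in a simplified form: values near a base are stored as small offsets and outliers are kept as exceptions. The bit-packed layout and patching mechanism of the real PFOR/PFOR-DELTA schemes are not reproduced, and the function names are illustrative.

```python
# Simplified frame-of-reference encoding with exceptions, in the spirit of
# PFOR: values close to a base are stored as small offsets, outliers are
# kept verbatim in an exception list. The real PFOR bit packing and
# exception patching are not reproduced here.

def pfor_encode(values, bits=4):
    base = min(values)
    limit = (1 << bits) - 1
    offsets, exceptions = [], {}
    for pos, v in enumerate(values):
        delta = v - base
        if delta <= limit:
            offsets.append(delta)
        else:
            offsets.append(0)                 # placeholder slot
            exceptions[pos] = v               # stored uncompressed
    return base, offsets, exceptions

def pfor_decode(base, offsets, exceptions):
    return [exceptions.get(pos, base + d) for pos, d in enumerate(offsets)]

data = [100, 103, 101, 250, 104, 100]
encoded = pfor_encode(data, bits=4)
assert pfor_decode(*encoded) == data
print(encoded)
```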
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems
TLDR
An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes and simplifies the development process significantly due to its high-level programming abstraction.
Dictionary-based order-preserving string compression for main memory column stores
TLDR
This paper proposes new data structures that efficiently support an order-preserving dictionary compression for (variable-length) string attributes with a large domain size that is likely to change over time, and introduces a novel indexing approach that provides efficient access paths to such a dictionary while compressing the index data.
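A toy order-preserving dictionary encoding is sketched below: distinct strings receive integer codes in sorted order, so comparisons and range predicates can be evaluated on the codes alone. The paper's actual contribution, index structures that preserve this property as the string domain grows, is not modeled, and the example data is invented.

```python
# Toy order-preserving dictionary encoding for a string column: codes are
# assigned in sorted order, so comparisons and range predicates can run on
# the integer codes. The updatable, shared dictionary structures from the
# paper are not modeled here.

def build_dictionary(strings):
    distinct = sorted(set(strings))
    encode = {s: code for code, s in enumerate(distinct)}
    decode = distinct                      # code -> string by list index
    return encode, decode

column = ["Whole Milk - Gallon", "Apple Juice",
          "Whole Milk - Quart", "Apple Juice"]
encode, decode = build_dictionary(column)
codes = [encode[s] for s in column]

# A range predicate evaluated purely on the integer codes:
lo, hi = encode["Whole Milk - Gallon"], encode["Whole Milk - Quart"]
hits = [decode[c] for c in codes if lo <= c <= hi]
print(codes, hits)
```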
Compressed Nonnegative Matrix Factorization Is Fast and Accurate
TLDR
This work proposes to use structured random compression, that is, random projections that exploit the data structure, for two NMF variants: classical and separable, and shows that the resulting compressed techniques are faster than their uncompressed variants, vastly reduce memory demands, and do not incur any significant deterioration in performance.