Compressed linear algebra for large-scale machine learning

@article{Elgohary2017CompressedLA,
  title={Compressed linear algebra for large-scale machine learning},
  author={Ahmed Elgohary and Matthias Boehm and Peter J. Haas and Frederick Reiss and Berthold Reinwald},
  journal={The VLDB Journal},
  year={2017},
  volume={27},
  pages={719-744}
}
Large-scale machine learning algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory and enable fast matrix-vector operations on in-memory data. General-purpose, heavy- and lightweight compression techniques struggle to achieve both good compression ratios and fast decompression speed to enable block-wise uncompressed… 
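As a concrete, generic illustration of this access pattern (not code from the paper), the sketch below runs batch gradient descent for least squares in Python/NumPy: every iteration re-reads the same read-only data matrix for two matrix-vector products, which is why keeping the data in (possibly compressed) memory dominates end-to-end performance.

```python
import numpy as np

def gradient_descent(X, y, steps=100, lr=1e-3):
    """Batch gradient descent for least squares: each iteration re-reads the
    same matrix X for two matrix-vector products (X @ w and X.T @ r)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        r = X @ w - y          # matrix-vector multiply over the full data
        w -= lr * (X.T @ r)    # second pass over the same read-only data
    return w
```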
Compressed linear algebra for declarative large-scale machine learning
TLDR
This work introduces Compressed Linear Algebra (CLA) for lossless matrix compression, which encodes matrices with lightweight, value-based compression techniques and executes linear algebra operations directly on the compressed representations.
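A rough sketch of the idea, not the paper's actual encoding formats (CLA combines several column encodings): each column is stored as a dictionary of distinct values plus per-value offset lists, and the matrix-vector product is executed directly on that compressed form, touching each distinct value once instead of each cell.

```python
import numpy as np

def compress_column(col):
    """Value-based compression: a dictionary of distinct non-zero values plus,
    per value, the row offsets where it occurs (a simple offset-list encoding)."""
    groups = {}
    for row, v in enumerate(col):
        if v != 0.0:
            groups.setdefault(v, []).append(row)
    return [(v, np.array(rows)) for v, rows in groups.items()]

def mv_on_compressed(cols, x, n_rows):
    """y = M @ x executed directly on the compressed columns: each distinct
    value in column j contributes value * x[j] to its offset list."""
    y = np.zeros(n_rows)
    for j, groups in enumerate(cols):
        for value, offsets in groups:
            y[offsets] += value * x[j]
    return y

M = np.array([[1., 0., 7.], [1., 2., 7.], [0., 2., 7.]])
cols = [compress_column(M[:, j]) for j in range(M.shape[1])]
assert np.allclose(mv_on_compressed(cols, np.array([1., 2., 3.]), 3), M @ [1., 2., 3.])
```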
Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices
TLDR
A new lossless compression scheme for real-valued matrices is proposed that achieves efficient performance in terms of compression ratio and time for linear-algebra operations, and is the first to achieve time and space complexities that match the theoretical limit expressed by the k-th order statistical entropy of the input.
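The sketch below only illustrates grammar compression of a value sequence in the Re-Pair style (repeatedly replacing the most frequent adjacent symbol pair by a fresh non-terminal); the paper's entropy-matching matrix-vector algorithm is not reproduced here.

```python
def repair_compress(seq):
    """Re-Pair-style grammar compression: repeatedly replace the most frequent
    adjacent symbol pair with a fresh non-terminal rule."""
    rules, seq = {}, list(seq)
    while True:
        pairs = {}
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + 1
        if not pairs:
            break
        best, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < 2:
            break
        sym = f"N{len(rules)}"
        rules[sym] = best
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    """Lossless decompression of a symbol back to the original terminals."""
    if sym not in rules:
        return [sym]
    a, b = rules[sym]
    return expand(a, rules) + expand(b, rules)

seq, rules = repair_compress([1.5, 0.0, 1.5, 0.0, 1.5, 0.0, 2.0])
assert [t for s in seq for t in expand(s, rules)] == [1.5, 0.0, 1.5, 0.0, 1.5, 0.0, 2.0]
```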
Technical Perspective: Scaling Machine Learning via Compressed Linear Algebra
TLDR
The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading between compression speed and compression rate — to build an elegant implementation of compressed linear algebra operations.
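A toy version of sampling-based estimation, with a hypothetical function name and simplified size formulas: estimate the number of distinct values of a column from a small sample and compare a dictionary-encoding size estimate against the uncompressed size before paying the cost of actually compressing.

```python
import numpy as np

def estimate_dictionary_benefit(col, sample_size=100, seed=0):
    """Sample-based estimate (a toy heuristic, not the paper's estimators):
    guess the distinct-value count of a NumPy column from a small sample and
    return the estimated compressed/uncompressed size ratio."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(col, size=min(sample_size, len(col)), replace=False)
    n = len(col)
    d = len(np.unique(sample))                    # lower bound on true #distinct
    code_bytes = max(1, int(np.ceil(np.log2(max(d, 2)) / 8)))
    uncompressed = 8 * n                          # 8-byte doubles
    dictionary = 8 * d + code_bytes * n           # dictionary + per-cell codes
    return dictionary / uncompressed              # < 1.0 means compression looks worthwhile
```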
FlashR: parallelize and scale R for machine learning using SSDs
TLDR
Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms, and the R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3-20.
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems
TLDR
An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes and simplifies the development process significantly due to its high-level programming abstraction.
BlockJoin: Efficient Matrix Partitioning Through Joins
TLDR
BlockJoin is presented, a distributed join algorithm which directly produces block-partitioned results and applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.
Homomorphic Parameter Compression for Distributed Deep Learning Training
TLDR
Although a concrete method is yet to be discovered, it is demonstrated that homomorphic compression is highly likely to reduce communication overhead, thanks to negligible compression and decompression times, and a theoretical speedup for homomorphic compression is provided.
Automatic Optimization of Matrix Implementations for Distributed Machine Learning and Linear Algebra
TLDR
This paper proposes a framework for automatic optimization of the physical implementation of a complex ML or linear algebra computation in a distributed environment, develops algorithms for solving this problem, and shows that the ideas can radically speed up common ML and LA computations.
Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning
TLDR
MLWeaving is presented, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models over low-precision data; it provides a compact in-memory representation that enables the retrieval of data at any level of precision.
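The "any level of precision" idea can be sketched as bit-sliced storage (a simplification of MLWeaving's actual memory layout and hardware path): quantize each feature to 8-bit fixed point, store the bit planes most-significant first, and reconstruct an approximation from only the first k planes.

```python
import numpy as np

def to_bit_planes(x, bits=8):
    """Quantize values in [0, 1) to `bits`-bit fixed point and store one
    array per bit plane, most significant plane first."""
    q = np.minimum((x * (1 << bits)).astype(np.uint32), (1 << bits) - 1)
    return [((q >> (bits - 1 - b)) & 1).astype(np.uint8) for b in range(bits)]

def from_bit_planes(planes, k):
    """Reconstruct using only the k most significant planes (lower precision)."""
    q = np.zeros(planes[0].shape, dtype=np.uint32)
    for b in range(k):
        q = (q << 1) | planes[b]
    return q.astype(np.float64) / (1 << k)

x = np.array([0.1, 0.5, 0.9])
planes = to_bit_planes(x)
print(from_bit_planes(planes, 2))  # coarse 2-bit view of the data
print(from_bit_planes(planes, 8))  # full 8-bit precision
```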
Scalable Relational Query Processing on Big Matrix Data
TLDR
New efficient and scalable relational query processing techniques on big matrix data are presented for in-memory distributed clusters; they leverage algebraic transformation rules to rewrite query execution plans into ones with lower computation costs.
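One classic rewrite of this kind, shown here as a generic example rather than one of the paper's specific rules: when a matrix-matrix product is immediately multiplied by a vector, reassociating it replaces an O(m·k·n) product with two matrix-vector products.

```python
import numpy as np

m, k, n = 1000, 1000, 1000
A, B, v = np.random.rand(m, k), np.random.rand(k, n), np.random.rand(n)

naive     = (A @ B) @ v    # materializes an m x n intermediate: ~m*k*n flops
rewritten = A @ (B @ v)    # two matrix-vector products: ~(k*n + m*k) flops
assert np.allclose(naive, rewritten)
```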
...

References

SHOWING 1-10 OF 107 REFERENCES
An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication
TLDR
A compressed storage format, called Compressed Sparse eXtended (CSX), is presented that is able to detect and simultaneously encode multiple commonly encountered substructures inside a sparse matrix, considerably reducing its memory footprint and alleviating pressure on the memory subsystem.
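For reference, the baseline that CSX extends is plain CSR; the sketch below shows only CSR construction and SpMV, and does not attempt CSX's substructure detection and encoding.

```python
import numpy as np

def to_csr(dense):
    """Build CSR arrays (row pointers, column indices, values) from a dense matrix."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def csr_spmv(indptr, indices, data, x):
    """y = A @ x over the CSR arrays, one dot product per row slice."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = np.dot(data[lo:hi], x[indices[lo:hi]])
    return y
```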
SLACID - sparse linear algebra in a column-oriented in-memory database system
TLDR
This paper presents and compares different approaches of storing sparse matrices in an in-memory column-oriented database system and shows that a system layout derived from the compressed sparse row representation integrates well with a columnar database design and is moreover amenable to a wide range of non-numerical use cases when dictionary encoding is used.
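A generic triple-table flavor of the same idea (not SLACID's actual CSR-derived layout): keep the non-zeros as three plain columns, dictionary-encode the value column, and scan the columns for SpMV.

```python
import numpy as np

def to_columnar(dense):
    """Store non-zeros as three columns (row id, column id, value) and
    dictionary-encode the value column."""
    rows, cols = np.nonzero(dense)
    values = dense[rows, cols]
    dictionary, codes = np.unique(values, return_inverse=True)
    return rows, cols, dictionary, codes

def spmv_columnar(rows, cols, dictionary, codes, x, n_rows):
    """y = A @ x by scanning the three columns, decoding values on the fly."""
    y = np.zeros(n_rows)
    np.add.at(y, rows, dictionary[codes] * x[cols])
    return y
```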
Optimizing sparse matrix-vector multiplication using index and value compression
TLDR
This paper proposes two distinct compression methods targeting index and numerical values respectively and demonstrates that the index compression method can be applied successfully to a wide range of matrices and the value compression method is able to achieve impressive speedups in a more limited yet important class of sparse matrices that contain a small number of distinct values.
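A minimal, simplified version of the two techniques: delta-encode the column indices within a row and dictionary-encode the values when only a few distinct ones occur; SpMV then decodes deltas on the fly. The uint8 widths are an assumption of this sketch, not the paper's encoding.

```python
import numpy as np

def compress_row(cols, vals, dictionary):
    """Index compression: store deltas between successive column indices.
    Value compression: store small codes into a shared, sorted value dictionary.
    Assumes deltas and the number of distinct values both fit in 8 bits."""
    deltas = np.diff(cols, prepend=0)
    codes = np.searchsorted(dictionary, vals)
    return deltas.astype(np.uint8), codes.astype(np.uint8)

def spmv_row(deltas, codes, dictionary, x):
    """Dot product of one compressed row with x, decoding deltas on the fly."""
    acc, col = 0.0, 0
    for d, c in zip(deltas, codes):
        col += int(d)
        acc += dictionary[c] * x[col]
    return acc
```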
Implementing sparse matrix-vector multiplication on throughput-oriented processors
  • Nathan Bell, M. Garland
  • Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009
TLDR
This work explores SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes, including structured grid and unstructured mesh matrices.
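One GPU-friendly layout discussed in this work is ELLPACK, which pads every row to the same number of non-zeros so the padded arrays can be traversed in lockstep; the sketch below shows only the layout on the CPU side, no GPU code.

```python
import numpy as np

def to_ell(dense):
    """ELL layout: pad each row's (column, value) pairs to the maximum row
    length, giving two rectangular arrays that are easy to traverse in lockstep."""
    nnz_per_row = [np.flatnonzero(r) for r in dense]
    width = max(len(nz) for nz in nnz_per_row)
    cols = np.zeros((len(dense), width), dtype=np.int64)
    vals = np.zeros((len(dense), width))
    for i, nz in enumerate(nnz_per_row):
        cols[i, :len(nz)] = nz
        vals[i, :len(nz)] = dense[i, nz]
    return cols, vals

def ell_spmv(cols, vals, x):
    """y = A @ x: padded entries have value 0 and contribute nothing."""
    return (vals * x[cols]).sum(axis=1)
```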
On optimizing machine learning workloads via kernel fusion
TLDR
An analytical model is presented that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters, and the effectiveness of the fused-kernel approach in improving end-to-end performance is demonstrated on an entire ML algorithm.
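Independent of GPUs and launch parameters, the core benefit of fusion is avoiding materialized intermediates; the toy example below shows a cell-wise-multiply-then-row-sum pattern collapsing into a single fused operator (here, a matrix-vector product).

```python
import numpy as np

X, v = np.random.rand(5000, 200), np.random.rand(200)

# Unfused plan: a full 5000 x 200 temporary, then a second pass for the row sums.
intermediate = X * v
unfused = intermediate.sum(axis=1)

# Fused plan: recognizing the pattern lets a single operator produce the result
# without materializing the temporary.
fused = X @ v
assert np.allclose(unfused, fused)
```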
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems
TLDR
An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes and simplifies the development process significantly due to its high-level programming abstraction.
Super-Scalar RAM-CPU Cache Compression
TLDR
This work proposes three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs and compares these algorithms with compression techniques used in (commercial) database and information retrieval systems.
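A simplified sketch of PFOR-DELTA's structure (the real scheme bit-packs the deltas and patches exceptions in a cache-conscious layout): delta-encode a sorted id list, keep small deltas inline at a fixed bit width, and store outliers as exceptions.

```python
import numpy as np

def pfor_delta_encode(sorted_ids, bits=4):
    """Simplified PFOR-DELTA: delta-encode, keep deltas that fit in `bits`
    bits inline, and store the rest as (position, value) exceptions."""
    deltas = np.diff(sorted_ids, prepend=sorted_ids[:1])
    limit = (1 << bits) - 1
    inline = np.minimum(deltas, limit)
    exceptions = [(i, int(d)) for i, d in enumerate(deltas) if d > limit]
    return inline.astype(np.uint8), exceptions, int(sorted_ids[0])

def pfor_delta_decode(inline, exceptions, first):
    deltas = inline.astype(np.int64)
    for pos, value in exceptions:
        deltas[pos] = value
    return first + np.cumsum(deltas)

ids = np.array([3, 5, 6, 40])
inline, exc, first = pfor_delta_encode(ids)
assert np.array_equal(pfor_delta_decode(inline, exc, first), ids)
```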
Dictionary-based order-preserving string compression for main memory column stores
TLDR
This paper proposes new data structures that efficiently support an order-preserving dictionary compression for (variablelength) string attributes with a large domain size that is likely to change over time and introduces a novel indexing approach that provides efficient access paths to such a dictionary while compressing the index data.
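A static sketch of order-preserving dictionary compression: codes are assigned in sorted order, so range predicates can be evaluated on the integer codes alone. The hard part addressed in the paper, keeping the mapping order-preserving as new strings arrive over time, is not handled here.

```python
import numpy as np

def build_dictionary(strings):
    """Order-preserving dictionary: codes are assigned in sorted order, so
    comparisons on codes agree with comparisons on the underlying strings."""
    dictionary = sorted(set(strings))
    code_of = {s: i for i, s in enumerate(dictionary)}
    return dictionary, np.array([code_of[s] for s in strings])

dictionary, codes = build_dictionary(["karlsruhe", "berlin", "munich", "berlin"])
lo, hi = np.searchsorted(dictionary, "b"), np.searchsorted(dictionary, "c")
mask = (codes >= lo) & (codes < hi)   # range predicate evaluated on codes only
```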
Compressed Nonnegative Matrix Factorization Is Fast and Accurate
TLDR
This work proposes to use structured random compression, that is, random projections that exploit the data structure, for two NMF variants, classical and separable, and shows that the resulting compressed techniques are faster than their uncompressed counterparts, vastly reduce memory demands, and do not incur any significant deterioration in performance.
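The compression step can be sketched with a randomized range finder, one form of structured random projection; the paper's contribution, running the classical and separable NMF updates on the compressed matrices, is omitted here.

```python
import numpy as np

def structured_compress(A, target_rank, oversample=10, seed=0):
    """Randomized range finder: project with a Gaussian test matrix, then
    orthonormalize, so that A ~= Q @ (Q.T @ A) with a much smaller factor."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], target_rank + oversample))
    Q, _ = np.linalg.qr(A @ omega)
    return Q, Q.T @ A   # factorization updates can then work on the small matrix

A = np.random.rand(2000, 500)
Q, B = structured_compress(A, target_rank=20)
print(B.shape, np.linalg.norm(A - Q @ B) / np.linalg.norm(A))
```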
SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning
TLDR
SPOOF is introduced, an architecture to automatically identify algebraic simplification rewrites and generate fused operators in a holistic framework, and a snapshot of the overall system is described, including key techniques of sum-product optimization and code generation.
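A generic example of the kind of sum-product rewrite plus fusion meant here (not taken from the paper): sum(X * (Y @ Z)) is reassociated so that only a smaller intermediate is needed, and the final multiply-and-sum runs as one pass.

```python
import numpy as np

m, k, n = 500, 300, 400
X, Y, Z = np.random.rand(m, n), np.random.rand(m, k), np.random.rand(k, n)

# Naive plan: materializes the m x n intermediate Y @ Z, then multiplies and sums.
naive = np.sum(X * (Y @ Z))

# Sum-product rewrite: sum(X * (Y @ Z)) == sum((Y.T @ X) * Z), which needs only a
# k x n intermediate; the final multiply-and-sum can run as a single fused pass.
rewritten = np.sum((Y.T @ X) * Z)
assert np.allclose(naive, rewritten)
```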
...