Exploiting SIMD for complex numerical predicates

@inproceedings{Song2016ExploitingSF,
  title={Exploiting SIMD for complex numerical predicates},
  author={Dongxiao Song and Shimin Chen},
  booktitle={ICDE Workshops},
  year={2016}
}
We study the use of SIMD instructions to support complex conjunctive numerical predicates. We propose a code framework based on three alternative SIMD algorithms for conjunctive predicates, and we investigate cost models for both single-threaded and multi-threaded evaluation of filtering predicates. Our experimental results on synthetic data show that an optimal SIMD plan can achieve up to a 10.4× speedup over the best no-SIMD plan, and up to a 6.8× speedup over sub-optimal SIMD plans.
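As a concrete illustration of the idea, the following minimal sketch (assuming AVX2 and two int32 columns; the column names, constants, and batch size are illustrative assumptions, not the paper's exact algorithms) evaluates a conjunctive predicate of the form a > 10 AND b < 50 eight rows at a time by combining per-lane comparison masks with a bitwise AND:

/* Sketch: SIMD evaluation of a two-term conjunctive predicate with AVX2.
 * Each comparison produces an all-ones lane where its condition holds;
 * the conjunction is a bitwise AND of the lane masks. */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

#define N 16
#define A_LO 10
#define B_HI 50

int main(void) {
    int32_t a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i * 3; b[i] = 60 - i * 4; }

    const __m256i a_lo = _mm256_set1_epi32(A_LO);
    const __m256i b_hi = _mm256_set1_epi32(B_HI);

    for (int i = 0; i < N; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)&a[i]);
        __m256i vb = _mm256_loadu_si256((const __m256i *)&b[i]);

        __m256i m1 = _mm256_cmpgt_epi32(va, a_lo);   /* a[i] > A_LO */
        __m256i m2 = _mm256_cmpgt_epi32(b_hi, vb);   /* b[i] < B_HI */

        __m256i both = _mm256_and_si256(m1, m2);     /* conjunction  */

        /* Compress to one bit per 32-bit lane: 8 result bits per batch. */
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(both));
        printf("rows %2d..%2d match mask: 0x%02x\n", i, i + 7, mask);
    }
    return 0;
}

The same pattern extends to more predicate terms; the paper's framework goes further by choosing among three alternative SIMD algorithms and using cost models to pick an evaluation plan.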

Citations

Understanding and Optimizing Conjunctive Predicates Under Memory-Efficient Storage Layouts

TLDR
A hybrid empirical/analytical cost model is proposed to unveil the performance characteristics of memory-efficient storage layouts when applied to predicate evaluation, and a simple execution scheme, Hebe, is proposed that is order-oblivious while maintaining high performance.

Hebe: An Order-Oblivious and High-Performance Execution Scheme for Conjunctive Predicates

TLDR
This paper proposes Hebe, a simplified execution scheme that is attractive to the query optimizer because it does not need a sampling process to determine an optimal evaluation order of predicates, while also achieving up to a 153% performance improvement.

Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask

TLDR
This paper experimentally compares the two models of vectorization and data-centric query processing by implementing both within the same test system, and finds that both are efficient but have different strengths and weaknesses.

Research on Vectorization Technology for Irregular Data Access

TLDR
A method for calculating vectorization performance gains is designed, and the experimental results show that the approach can vectorize irregular data accesses effectively and improve program execution efficiency.

CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs

TLDR
This paper first designs high-performance GPU computation kernels that accelerate individual SGD updates by exploiting model parallelism, then designs efficient schemes that parallelize SGD updates by exploiting data parallelism, and scales cuMF_SGD to large data sets that cannot fit into one GPU's memory.

Proposal: Extreme Acceleration and Seamless Integration of Raw Data Analysis

TLDR
This work designs a novel hardware architecture, the Unstructured Data Processor (UDP), a programmable general-purpose data transformation accelerator customized for data analytics, and builds the ACCORDA (Accelerated Operators for Raw Data Analysis) system by extending a state-of-the-art distributed analytical system with the ATO approach.

CuMF_SGD: Fast and Scalable Matrix Factorization

TLDR
This work designs two workload scheduling schemes, batch-Hogwild! and wavefront-update, that fully exploit the massive number of cores, and develops highly optimized kernels for SGD updates, leveraging cache, warp-shuffle instructions, and half-precision floats.

Accurate and Fast Recovery of Network Monitoring Data With GPU-Accelerated Tensor Completion

TLDR
This work proposes a GPU-accelerated parallel tensor completion scheme (GPU-TC) for accurate and fast recovery of missing data, with three novel techniques that exploit the tensor factorization structure and GPU features: grid-based tensor partitioning, independent task assignment based on the Fisher-Yates shuffle, and sphere-facilitated, memory-correlated scheduling.

References


Rethinking SIMD Vectorization for In-Memory Databases

TLDR
This paper presents novel vectorized designs and implementations of database operators based on advanced SIMD operations such as gathers and scatters, and highlights the impact of efficient vectorization on the algorithmic design of in-memory database operators, as well as on the architectural design and power efficiency of hardware.
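For example, a gather instruction loads values from non-contiguous positions through an index vector in a single operation. The sketch below is an illustrative assumption (AVX2, an int base array, arbitrary index values), not code from that paper:

/* Sketch: AVX2 gather fetching eight int32 values at arbitrary offsets. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    int table[16];
    for (int i = 0; i < 16; i++) table[i] = i * 100;

    /* Eight non-contiguous row positions, fetched with one gather. */
    __m256i idx  = _mm256_setr_epi32(3, 0, 7, 12, 5, 9, 1, 15);
    __m256i vals = _mm256_i32gather_epi32(table, idx, 4);  /* scale = 4 bytes */

    int out[8];
    _mm256_storeu_si256((__m256i *)out, vals);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}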

Conjunctive selection conditions in main memory

TLDR
It is demonstrated that branch misprediction has a substantial impact on the performance of algorithms for applying selection conditions; a cost model that takes branch prediction into account is proposed, and a query optimization algorithm that chooses the plan with optimal estimated cost is developed.
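The core trade-off can be sketched as follows (illustrative C code, not the paper's algorithms; the predicates and constants are assumptions): a branching (&&) plan skips later terms when an earlier one fails but suffers branch mispredictions when selectivity is near 50%, while a branch-free (&) plan always evaluates every term but avoids data-dependent branches.

#include <stdint.h>
#include <stddef.h>

/* Branching plan: the second predicate is skipped when the first fails. */
size_t select_branching(const int32_t *a, const int32_t *b,
                        size_t n, uint32_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > 10 && b[i] < 50)
            out[k++] = (uint32_t)i;
    return k;
}

/* Branch-free plan: both predicates are evaluated and combined with &;
 * the output index advances by the 0/1 result, so there is no
 * data-dependent branch. (out must have room for n entries.) */
size_t select_branch_free(const int32_t *a, const int32_t *b,
                          size_t n, uint32_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        out[k] = (uint32_t)i;
        k += (a[i] > 10) & (b[i] < 50);
    }
    return k;
}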

Ameliorating memory contention of OLAP operators on GPU processors

TLDR
This work defines the problem of bank and value conflict optimization for data processing operators on the CUDA platform and uses two database operations, foreign-key join and grouped aggregation, to analyze the impact of these two factors on operator performance.

Database Architecture Optimized for the New Bottleneck: Memory Access

TLDR
A simple scan test is used to show the severe impact of the main-memory access bottleneck, and radix algorithms for partitioned hash-join are introduced, guided by a detailed analytical model that incorporates memory access cost.

In-memory BLU acceleration in IBM's DB2 and dashDB: Optimized for modern workloads and hardware architectures

TLDR
In-memory BLU Acceleration, used in IBM's DB2 for Linux, UNIX, and Windows, and now also in the dashDB cloud offering, is presented; it was designed and implemented from the ground up to exploit main memory, yet it is not limited to data that fits in memory and does not require manual management of what to retain in memory.

ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout

TLDR
ByteSlice is a new main-memory storage layout that supports both highly efficient scans and lookups; it fully leverages SIMD data parallelism and offers significant performance improvement over all state-of-the-art approaches.

The Impact of Columnar In-Memory Databases on Enterprise Systems

TLDR
First analyses of productive applications adopting this concept confirm that system architectures enabled by in-memory column stores are conceptually superior for business transaction processing compared to row-based approaches.

SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units

TLDR
This paper shows that utilizing the embedded Vector Processing Units (VPUs) found in standard superscalar processors can speed up main-memory full table scans by significant factors without changing the hardware architecture and thereby without additional power consumption.

BitWeaving: fast scans for main memory data processing

TLDR
The proposed BitWeaving technique exploits the parallelism available at the bit level in modern processors to produce significant performance benefits over the existing state-of-the-art methods, and in some cases produces over an order of magnitude performance improvement.
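A generic flavor of this bit-level parallelism can be shown in a few lines (a simplified illustration under my own assumptions, not BitWeaving's actual HBP/VBP storage layouts): pack several narrow codes into one machine word with a delimiter bit per code, then evaluate a comparison against a constant for all codes with a single subtraction and mask.

/* Sketch: word-level parallel evaluation of  code <= c  for eight 7-bit
 * codes packed into one 64-bit word, each followed by a delimiter bit. */
#include <stdint.h>
#include <stdio.h>

#define K      7               /* bits per packed code        */
#define SEG    (K + 1)         /* code + 1 delimiter bit      */
#define NCODES (64 / SEG)      /* 8 codes per 64-bit word     */

int main(void) {
    const uint8_t codes[NCODES] = { 5, 90, 33, 64, 7, 127, 64, 12 };
    const uint8_t c = 64;      /* predicate: code <= 64        */

    uint64_t x = 0, cst = 0, delim = 0;
    for (int i = 0; i < NCODES; i++) {
        x     |= (uint64_t)(codes[i] & 0x7F) << (i * SEG);
        cst   |= (uint64_t)(c        & 0x7F) << (i * SEG);
        delim |= (uint64_t)1 << (i * SEG + K);
    }

    /* Setting the delimiter bits of the minuend stops borrows from
     * crossing field boundaries; afterwards each delimiter bit is 1
     * exactly when its code satisfies code <= c. */
    uint64_t result = ((cst | delim) - x) & delim;

    for (int i = 0; i < NCODES; i++)
        printf("code %3u <= %u : %d\n", (unsigned)codes[i], (unsigned)c,
               (int)((result >> (i * SEG + K)) & 1));
    return 0;
}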

Evolving the architecture of SQL Server for modern hardware trends

TLDR
An overview is given of the design of two added features, column store indexes and in-memory tables, and of the performance improvements they provide.