FAST: fast architecture sensitive tree search on modern CPUs and GPUs

@article{Kim2010FASTFA,
  title={FAST: fast architecture sensitive tree search on modern CPUs and GPUs},
  author={Changkyu Kim and Jatin Chhugani and Nadathur Satish and Eric Sedlar and Anthony D. Nguyen and Tim Kaldewey and Victor W. Lee and Scott A. Brandt and Pradeep K. Dubey},
  journal={Proceedings of the 2010 ACM SIGMOD International Conference on Management of data},
  year={2010}
}
  • Changkyu Kim, J. Chhugani, P. Dubey
  • Published 6 June 2010
  • Computer Science
  • Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join and aggregation. However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal. In this paper, we present FAST, an… 
Designing fast architecture-sensitive tree search on modern multicore/many-core processors
TLDR
FAST is an extremely fast architecture-sensitive layout of the index tree logically organized to optimize for architecture features like page size, cache line size, and Single Instruction Multiple Data (SIMD) width of the underlying hardware, achieving a 6X performance improvement over uncompressed index search for large keys on CPUs.
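The TLDR above refers to FAST's hierarchical blocking of the index tree by SIMD width, cache line size, and page size. As a rough illustration of just the SIMD-level blocking, the sketch below (my own simplification, not the authors' code) packs a depth-2 subtree of three 32-bit keys so that a single SSE compare resolves which of its four children to follow; the names and mask arithmetic are assumptions for illustration.

    // Hypothetical sketch of one SIMD block in a FAST-style tree: a depth-2
    // subtree whose 3 keys are compared against the query in one SSE register,
    // yielding one of 4 child subtrees. This is a simplified illustration of
    // the paper's hierarchical blocking idea, not its actual layout.
    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    // One SIMD block holds the keys of a depth-2 subtree.
    struct SimdBlock { int32_t keys[3]; };  // keys[0]=root, keys[1]=left child, keys[2]=right child

    // Returns which of the 4 children (0..3) of this depth-2 subtree to follow.
    static int descend(const SimdBlock& b, int32_t q) {
        __m128i keys  = _mm_set_epi32(0, b.keys[2], b.keys[1], b.keys[0]);
        __m128i query = _mm_set1_epi32(q);
        __m128i gt    = _mm_cmpgt_epi32(query, keys);              // lane i = all-ones if q > keys[i]
        int mask      = _mm_movemask_ps(_mm_castsi128_ps(gt)) & 0x7;
        int right_of_root  = mask & 1;                              // took the right branch at the root?
        int right_of_child = right_of_root ? (mask >> 2) & 1        // then the comparison against keys[2] decides
                                           : (mask >> 1) & 1;       // else the comparison against keys[1] decides
        return (right_of_root << 1) | right_of_child;               // child index 0..3
    }

    int main() {
        SimdBlock b = {{50, 20, 80}};        // root=50, left=20, right=80
        int32_t queries[] = {10, 30, 60, 90};
        for (int32_t q : queries)
            printf("query %d -> child %d\n", q, descend(b, q));     // 0, 1, 2, 3
    }
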
Parallelizing Approximate Search on Adaptive Radix Trees
TLDR
This work uses the edit distance to compare two search keys in the tree and select appropriate values, and proposes several variations of the CPU algorithm, such as fixed vs. dynamic memory layouts and pointer vs. pointer-less data structures.
Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU
  • Mayank Daga, Mark Nutter
  • Computer Science
    2012 SC Companion: High Performance Computing, Networking Storage and Analysis
  • 2012
TLDR
This paper reorganizes the B+ tree in memory and utilizes the novel heterogeneous system architecture, eliminating the need to copy the tree to the GPU and the limitation on the size of the tree that can be accelerated.
A Performance Study of Traversing Spatial Indexing Structures in Parallel on GPU
  • Jinwoong Kim, Sumin Hong, B. Nam
  • Computer Science
    2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
  • 2012
TLDR
This paper proposes assigning an individual sub-tree to each SMP (streaming multi-processor) of the GPU, such that CUDA cores in the same SMP cooperate to navigate tree index nodes, and proposes a new range-query search algorithm, three-phase search, that avoids non-sequential random access to tree nodes and accelerates the search performance of spatial indexing structures on the GPU.
Exploiting Massive Parallelism for Indexing Multi-Dimensional Datasets on the GPU
TLDR
A novel parallel tree traversal algorithm, massively parallel restart scanning (MPRS), is proposed for multi-dimensional range queries; it avoids recursion and irregular memory access, and accesses 7-20 times less global memory than the task-parallel parent-link algorithm by virtue of minimal warp divergence.
Parallel Tree Traversal for Nearest Neighbor Query on the GPU
TLDR
A data-parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), is proposed for kNN query processing on the GPU; it traverses a multi-dimensional tree-structured index while avoiding warp divergence problems.
Exploring Means to Enhance the Efficiency of GPU Bitmap Index Query Processing
TLDR
Three GPU algorithm enhancement strategies are presented for executing queries over bitmap indices compressed with word-aligned hybrid compression: data structure reuse, metadata creation with varied type alignment, and a preallocated memory pool, which greatly reduces the number of costly memory system calls.
Performance of Point and Range Queries for In-memory Databases Using Radix Trees on GPUs
  • Maksudul Alam, Srikanth B. Yoginath, K. Perumalla
  • Computer Science
    2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2016
TLDR
A detailed performance study of the GPU-based adaptive radix tree (GRT) implementation is presented over a variety of key distributions, synthetic benchmarks, and actual keys from music and book data sets; the GRT achieves some of the highest index-search rates reported in the literature.
In-Cache Query Co-Processing on Coupled CPU-GPU Architectures
TLDR
A novel in-cache query co-processing paradigm for main-memory On-Line Analytical Processing (OLAP) databases on coupled CPU-GPU architectures is proposed, and a cost-model-guided adaptation mechanism is developed for distributing the workload of prefetching, decompression, and query execution between the CPU and GPU.

References

SHOWING 1-10 OF 41 REFERENCES
Cache Conscious Indexing for Decision-Support in Main Memory
TLDR
A new indexing technique called "Cache-Sensitive Search Trees" (CSS-trees) is proposed to provide faster lookup times than binary search by paying attention to reference locality and cache behavior, without using substantial extra space.
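For intuition about the pointer-free, cache-conscious lookup this summary describes, here is a minimal sketch of arithmetic child addressing: an implicit binary layout where node i's children sit at 2i+1 and 2i+2, so the index is just an array of keys. This is a deliberately simplified stand-in for illustration, not the paper's multiway, cache-line-sized CSS-tree nodes.

    // Minimal sketch of a pointer-free search layout with arithmetic child
    // addressing (implicit binary layout), in the spirit of CSS-trees but not
    // their multiway node design.
    #include <vector>
    #include <cstdio>

    // Fill the implicit tree in-order from sorted keys so it forms a valid BST.
    static void build(const std::vector<int>& sorted, std::vector<int>& tree,
                      size_t& pos, size_t i = 0) {
        if (i >= tree.size()) return;
        build(sorted, tree, pos, 2 * i + 1);   // left subtree first
        tree[i] = sorted[pos++];
        build(sorted, tree, pos, 2 * i + 2);   // then right subtree
    }

    // Lookup by index arithmetic only: no child pointers to chase.
    static bool lookup(const std::vector<int>& tree, int q) {
        size_t i = 0;
        while (i < tree.size()) {
            if (tree[i] == q) return true;
            i = q < tree[i] ? 2 * i + 1 : 2 * i + 2;
        }
        return false;
    }

    int main() {
        std::vector<int> keys = {3, 7, 11, 19, 23, 31, 42};  // must be sorted
        std::vector<int> tree(keys.size());
        size_t pos = 0;
        build(keys, tree, pos);
        printf("%d %d\n", lookup(tree, 19), lookup(tree, 20));  // 1 0
    }
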
Parallel search on video cards
TLDR
P-ary search, a novel parallel search algorithm for large-scale database index operations, is presented; it scales with the number of processors and outperforms traditional thread-level parallel GPU and CPU implementations.
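The summary is terse, so a hedged serial sketch of the P-ary idea may help: P lanes each probe one pivot of the current range, and the range shrinks to the segment bracketing the query, narrowing by a factor of P per step. The lanes here are a plain loop standing in for GPU threads; this is my illustration, not the paper's code, and it assumes distinct keys.

    // Serial sketch of P-ary search over a sorted array of distinct keys.
    #include <vector>
    #include <cstdio>

    static int pary_search(const std::vector<int>& sorted, int q, int P = 4) {
        size_t lo = 0, hi = sorted.size();          // search range [lo, hi)
        while (hi - lo > 1) {
            size_t step = (hi - lo + P - 1) / P;    // segment length
            size_t next_lo = lo, next_hi = hi;
            for (int lane = 1; lane < P; ++lane) {  // P-1 pivots probed "in parallel"
                size_t pivot = lo + lane * step;
                if (pivot >= hi) break;
                if (sorted[pivot] <= q) next_lo = pivot;   // query lies right of this pivot
                else { next_hi = pivot; break; }           // first pivot greater than q
            }
            lo = next_lo; hi = next_hi;             // keep only the bracketing segment
        }
        return (lo < sorted.size() && sorted[lo] == q) ? (int)lo : -1;
    }

    int main() {
        std::vector<int> keys = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37};
        printf("%d %d\n", pary_search(keys, 17), pary_search(keys, 18));  // 6 -1
    }
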
Efficient implementation of sorting on multi-core SIMD CPU architecture
TLDR
An efficient implementation and detailed analysis of MergeSort on current CPU architectures is presented, along with the performance scalability of the proposed sorting algorithm with respect to salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core count.
SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units
TLDR
This paper shows that utilizing the embedded Vector Processing Units (VPUs) found in standard superscalar processors can speed up main-memory full table scans by significant factors without changing the hardware architecture and thereby without additional power consumption.
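As a small illustration of vectorized scanning (not the paper's format, which operates on packed compressed codes), the sketch below compares four 32-bit column values per SSE instruction against a constant predicate and counts the matches.

    // Vectorized predicate scan over a 32-bit column using SSE intrinsics.
    #include <immintrin.h>
    #include <vector>
    #include <cstdint>
    #include <cstdio>

    // Count values strictly less than `bound`.
    static size_t scan_lt(const std::vector<int32_t>& col, int32_t bound) {
        size_t hits = 0, i = 0;
        __m128i vbound = _mm_set1_epi32(bound);
        for (; i + 4 <= col.size(); i += 4) {
            __m128i v  = _mm_loadu_si128((const __m128i*)&col[i]);
            __m128i lt = _mm_cmplt_epi32(v, vbound);               // lane = all-ones if v < bound
            hits += __builtin_popcount(_mm_movemask_ps(_mm_castsi128_ps(lt)));
        }
        for (; i < col.size(); ++i) hits += (col[i] < bound);      // scalar tail
        return hits;
    }

    int main() {
        std::vector<int32_t> col = {5, 12, 7, 30, 2, 18, 9, 4, 40};
        printf("%zu\n", scan_lt(col, 10));   // 5
    }
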
Making B+-trees cache conscious in main memory
TLDR
A new indexing technique called CSB+-trees is proposed that stores all the child nodes of any given node contiguously and keeps only the address of the first child in each node; two variants of CSB+-trees are introduced that reduce the copying cost when there is a split and preallocate space for the full node group to reduce the split cost.
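A hedged sketch of the node layout this summary describes: children are stored contiguously, so a node carries a single first-child pointer and reaches child i by pointer arithmetic, leaving more room per cache line for keys. Field names and sizes below are illustrative assumptions, not the paper's.

    // Illustrative CSB+-style node: one first-child pointer, contiguous children.
    #include <cstdint>
    #include <cstdio>

    constexpr int KEYS_PER_NODE = 12;     // chosen so one node is about a 64-byte cache line

    struct CsbNode {
        uint16_t nkeys;                    // number of keys in use
        uint16_t is_leaf;
        CsbNode* first_child;              // children occupy first_child[0..nkeys]
        int32_t  keys[KEYS_PER_NODE];
    };

    // Descend from an internal node: find the child whose key range covers q.
    static CsbNode* child_for(const CsbNode* n, int32_t q) {
        int i = 0;
        while (i < n->nkeys && q >= n->keys[i]) ++i;
        return n->first_child + i;         // pointer arithmetic instead of per-child pointers
    }

    int main() {
        // A tiny two-level example: one root over three contiguous leaves.
        CsbNode leaves[3] = {
            {2, 1, nullptr, {1, 5}},
            {2, 1, nullptr, {10, 15}},
            {2, 1, nullptr, {20, 25}},
        };
        CsbNode root = {2, 0, leaves, {10, 20}};
        CsbNode* leaf = child_for(&root, 15);
        printf("query 15 lands in leaf with first key %d\n", leaf->keys[0]);  // 10
    }
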
Real-time parallel hashing on the GPU
TLDR
An efficient data-parallel algorithm is presented for building large hash tables of millions of elements in real time; it considers a classical sparse perfect-hashing approach as well as cuckoo hashing, which packs elements densely by allowing an element to be stored in one of multiple possible locations.
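For readers unfamiliar with cuckoo hashing, here is a serial C++ sketch of the scheme (the paper's data-parallel GPU construction is not reproduced here): each key may live in one of two slots, and inserting into a full slot evicts the occupant into its alternate location. The hash mixes and constants are my own.

    // Serial sketch of cuckoo hashing with two tables.
    #include <vector>
    #include <optional>
    #include <utility>
    #include <cstdint>
    #include <cstdio>

    struct Cuckoo {
        std::vector<std::optional<uint32_t>> t1, t2;
        explicit Cuckoo(size_t n) : t1(n), t2(n) {}

        size_t h1(uint32_t k) const { return (k * 2654435761u) % t1.size(); }
        size_t h2(uint32_t k) const { return (k * 40503u + 0x9e3779b9u) % t2.size(); }

        bool contains(uint32_t k) const { return t1[h1(k)] == k || t2[h2(k)] == k; }

        // Returns false on an eviction cycle (a real implementation would rehash).
        bool insert(uint32_t k) {
            if (contains(k)) return true;
            for (int attempt = 0; attempt < 32; ++attempt) {
                auto& s1 = t1[h1(k)];
                if (!s1) { s1 = k; return true; }
                std::swap(k, *s1);                 // evict the occupant of table 1
                auto& s2 = t2[h2(k)];
                if (!s2) { s2 = k; return true; }
                std::swap(k, *s2);                 // evict from table 2, retry in table 1
            }
            return false;
        }
    };

    int main() {
        Cuckoo h(11);
        uint32_t keys[] = {12, 45, 7, 300, 91};
        for (uint32_t k : keys) h.insert(k);
        printf("%d %d\n", h.contains(300), h.contains(8));   // 1 0
    }
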
Super-Scalar RAM-CPU Cache Compression
TLDR
This work proposes three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs and compares these algorithms with compression techniques used in (commercial) database and information retrieval systems.
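A minimal sketch of the patched frame-of-reference (PFOR) idea named above, with an illustrative in-memory layout rather than the paper's byte format: most values are stored as small offsets from a frame base, and the few that do not fit ("exceptions") are patched back in from a separate list.

    // Illustrative PFOR-style block and its two-phase decode.
    #include <vector>
    #include <cstdint>
    #include <cstdio>

    struct PforBlock {
        uint32_t base;                       // frame of reference
        std::vector<uint8_t>  packed;        // small codes (stored one per byte here for clarity)
        std::vector<uint32_t> exc_values;    // true values of exceptions
        std::vector<uint32_t> exc_positions; // where to patch them
    };

    static std::vector<uint32_t> decode(const PforBlock& blk) {
        std::vector<uint32_t> out(blk.packed.size());
        for (size_t i = 0; i < blk.packed.size(); ++i)      // fast path: tight, branch-free loop
            out[i] = blk.base + blk.packed[i];
        for (size_t e = 0; e < blk.exc_values.size(); ++e)  // patch phase: fix the rare exceptions
            out[blk.exc_positions[e]] = blk.exc_values[e];
        return out;
    }

    int main() {
        // Values {100,103,101,900,102}: 900 does not fit a small code, so it is an exception.
        PforBlock blk{100, {0, 3, 1, 0, 2}, {900}, {3}};
        for (uint32_t v : decode(blk)) printf("%u ", v);    // 100 103 101 900 102
        printf("\n");
    }
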
Main-memory index structures with fixed-size partial keys
TLDR
This paper proposes two index structures, pkT-trees and pkB-trees, which significantly reduce cache misses by storing partial-key information in the index, and shows that a small, fixed amount of key information allows most cache misses to be avoided, permitting a simple node structure and efficient implementation.
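A hedged sketch of the fixed-size partial-key idea: each index entry keeps only a small prefix of its key plus a pointer to the full key, so most comparisons resolve in-cache and the full key is fetched only when prefixes tie. Details are illustrative, not the paper's exact pkT/pkB scheme.

    // Illustrative index entry holding a fixed-size partial key.
    #include <cstring>
    #include <cstdio>

    constexpr int PARTIAL = 4;               // bytes of key kept inside the index entry

    struct Entry {
        char        partial[PARTIAL];        // fixed-size key prefix (not null-terminated)
        const char* full_key;                // pointer to the full key (cache miss if followed)
    };

    static Entry make_entry(const char* key) {
        Entry e{};
        std::strncpy(e.partial, key, PARTIAL);
        e.full_key = key;
        return e;
    }

    // Compare a probe key against an entry, touching the full key only if needed.
    static int cmp(const char* probe, const Entry& e) {
        int c = std::strncmp(probe, e.partial, PARTIAL);
        if (c != 0) return c;                     // resolved from the in-cache partial key
        return std::strcmp(probe, e.full_key);    // rare case: prefixes equal, fetch full key
    }

    int main() {
        Entry e = make_entry("database-systems");
        printf("%d %d %d\n", cmp("aardvark", e) < 0, cmp("zebra", e) > 0,
               cmp("database-systems", e) == 0);  // 1 1 1
    }
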
How to barter bits for chronons: compression and bandwidth trade offs for database scans
TLDR
A study is presented of how to make table scans faster through a scan code generator that produces code tuned to the database schema, the compression dictionaries, the queries being evaluated, and the target CPU architecture.
Buffering Accesses to Memory-Resident Index Structures