FAST: fast architecture sensitive tree search on modern CPUs and GPUs
@article{Kim2010FASTFA,
  title   = {FAST: fast architecture sensitive tree search on modern CPUs and GPUs},
  author  = {Changkyu Kim and Jatin Chhugani and Nadathur Satish and Eric Sedlar and Anthony D. Nguyen and Tim Kaldewey and Victor W. Lee and Scott A. Brandt and Pradeep K. Dubey},
  journal = {Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data},
  year    = {2010}
}
In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join and aggregation. However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal. In this paper, we present FAST, an…
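The key mechanism is to replace the branchy per-node comparison of a conventional binary tree with a data-parallel comparison against several separator keys at once; the paper's full layout further packs such SIMD blocks into cache-line- and page-sized blocks. A minimal sketch of the per-node step, assuming AVX2 and a hypothetical node of eight sorted 32-bit separators (not the paper's exact hierarchical blocking):

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical node: 8 sorted 32-bit separator keys. One vector compare
// replaces a branchy binary search within the node and yields the child
// (0..8) that the probe key falls into.
static inline int child_index(const int32_t separators[8], int32_t probe) {
    __m256i seps = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(separators));
    __m256i q  = _mm256_set1_epi32(probe);
    __m256i gt = _mm256_cmpgt_epi32(seps, q);   // lane i set where separators[i] > probe
    int mask   = _mm256_movemask_ps(_mm256_castsi256_ps(gt));
    // Child index = number of separators <= probe
    //             = position of the first set bit (8 if no separator is larger).
    return __builtin_ctz(mask | 0x100);
}
```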
311 Citations
Designing fast architecture-sensitive tree search on modern multicore/many-core processors
- Computer ScienceTODS
- 2011
FAST is an extremely fast architecture-sensitive layout of the index tree logically organized to optimize for architecture features like page size, cache line size, and Single Instruction Multiple Data (SIMD) width of the underlying hardware, achieving a 6X performance improvement over uncompressed index search for large keys on CPUs.
Parallelizing Approximate Search on Adaptive Radix Trees
- Computer ScienceSEBD
- 2020
This work uses the edit distance to compare search keys in the tree and select appropriate values, and proposes several variations of the CPU algorithm, such as fixed vs. dynamic memory layouts and pointer vs. pointer-less data structures.
Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU
- Computer Science2012 SC Companion: High Performance Computing, Networking Storage and Analysis
- 2012
This paper reorganizes the B+ tree in memory and utilizes the heterogeneous system architecture, which eliminates both the need to copy the tree to the GPU and the limit on the size of the tree that can be accelerated.
A Performance Study of Traversing Spatial Indexing Structures in Parallel on GPU
- Computer Science2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
- 2012
This paper proposes assigning an individual sub-tree to each SMP (streaming multi-processor) in the GPGPU, such that CUDA cores in the same SMP cooperate to navigate tree index nodes, and proposes a new range-query search algorithm, three-phase search, that avoids non-sequential random access to tree nodes and accelerates the search performance of spatial indexing structures on the GPU.
Exploiting Massive Parallelism for Indexing Multi-Dimensional Datasets on the GPU
- Computer ScienceIEEE Transactions on Parallel and Distributed Systems
- 2015
A novel parallel tree traversal algorithm, massively parallel restart scanning (MPRS), for multi-dimensional range queries avoids recursion and irregular memory access, and reads 7-20 times less global memory than the task-parallel parent-link algorithm by virtue of minimal warp divergence.
Parallel Tree Traversal for Nearest Neighbor Query on the GPU
- Computer Science2016 45th International Conference on Parallel Processing (ICPP)
- 2016
A data-parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), is proposed for kNN query processing on the GPU; it traverses a multi-dimensional tree-structured index while avoiding warp divergence problems.
Accelerating in-memory transaction processing using general purpose graphics processing units
- Computer ScienceFuture Gener. Comput. Syst.
- 2019
Exploring Means to Enhance the Efficiency of GPU Bitmap Index Query Processing
- Computer ScienceData Sci. Eng.
- 2021
Three GPU algorithm enhancement strategies for executing queries over bitmap indices compressed with word-aligned hybrid compression are presented: data structure reuse, metadata creation with varied type alignment, and a preallocated memory pool that greatly reduces the number of costly memory system calls.
Performance of Point and Range Queries for In-memory Databases Using Radix Trees on GPUs
- Computer Science2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
- 2016
A detailed performance study of the GPU-based adaptive radix tree (GRT) implementation is presented over a variety of key distributions, synthetic benchmarks, and actual keys from music and book data sets; the implementation achieves some of the highest index search rates reported in the literature.
In-Cache Query Co-Processing on Coupled CPU-GPU Architectures
- Computer ScienceProc. VLDB Endow.
- 2014
A novel in-cache query co-processing paradigm for main-memory On-Line Analytical Processing (OLAP) databases on coupled CPU-GPU architectures is proposed, and a cost-model-guided adaptation mechanism is developed for distributing the workload of prefetching, decompression, and query execution between the CPU and the GPU.
References
Showing 1-10 of 41 references
Cache Conscious Indexing for Decision-Support in Main Memory
- Computer ScienceVLDB
- 1999
A new indexing technique called "Cache-Sensitive Search Trees" (CSS-trees) is proposed to provide faster lookup times than binary search by paying attention to reference locality and cache behavior, without using substantial extra space.
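The central CSS-tree trick is that the directory is a full tree stored level by level in one contiguous array, so a child's position is computed arithmetically and no child pointers are stored; nodes are sized to cache lines. A minimal sketch under simplifying assumptions (full tree, fixed fanout, the mapping of leaves back to the sorted data array omitted):

```cpp
#include <cstdint>
#include <vector>

// CSS-style implicit directory: nodes of M keys stored level by level,
// children located by arithmetic instead of pointers.
constexpr int M = 15;                     // keys per node (~one 64-byte cache line)

struct ImplicitTree {
    std::vector<int32_t> nodes;           // concatenated nodes, breadth-first order
    int levels;

    // Descend from the root; returns the breadth-first index of the leaf node reached.
    size_t find_leaf(int32_t probe) const {
        size_t node = 0;
        for (int lvl = 0; lvl + 1 < levels; ++lvl) {
            const int32_t* keys = &nodes[node * M];
            int branch = 0;                // how many separators are <= probe
            while (branch < M && keys[branch] <= probe) ++branch;
            node = node * (M + 1) + branch + 1;   // arithmetic child address
        }
        return node;
    }
};
```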
Parallel search on video cards
- Computer Science
- 2009
P-ary search, a novel parallel search algorithm for large-scale database index operations, is presented; it scales with the number of processors and outperforms traditional thread-level parallel GPU and CPU implementations.
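The narrowing step of P-ary search is easy to state in scalar code: each round inspects P evenly spaced pivots (one per GPU thread in the paper) and keeps the sub-range bracketed by the pivots surrounding the probe key, so the range shrinks by roughly a factor of P per round instead of 2. A serial sketch of that logic, with the P probes written as a loop and synchronization omitted:

```cpp
#include <cstdint>
#include <cstddef>

// Serial emulation of p-ary search over a sorted array.
static ptrdiff_t pary_search(const int32_t* data, size_t n, int32_t probe) {
    constexpr size_t P = 32;              // "threads" per round (warp-sized)
    size_t lo = 0, hi = n;                // search range [lo, hi)
    while (hi - lo > P) {
        size_t step = (hi - lo) / P;
        size_t new_lo = lo, new_hi = hi;
        for (size_t t = 1; t < P; ++t) {  // the P-1 interior pivots
            size_t pivot = lo + t * step;
            if (data[pivot] <= probe) new_lo = pivot;
            else { new_hi = pivot; break; }
        }
        lo = new_lo; hi = new_hi;
    }
    for (size_t i = lo; i < hi; ++i)      // final linear scan of at most ~P keys
        if (data[i] == probe) return static_cast<ptrdiff_t>(i);
    return -1;
}
```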
Efficient implementation of sorting on multi-core SIMD CPU architecture
- Computer ScienceProc. VLDB Endow.
- 2008
An efficient implementation and detailed analysis of MergeSort on current CPU architectures is presented, along with the performance scalability of the proposed sorting algorithm with respect to salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core count.
SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units
- Computer ScienceProc. VLDB Endow.
- 2009
This paper shows that utilizing the embedded Vector Processing Units (VPUs) found in standard superscalar processors can speed up main-memory full table scans by significant factors without changing the hardware architecture and thereby without additional power consumption.
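As an illustration of what a vectorized scan kernel looks like, the sketch below evaluates a `< c` predicate over a plain 32-bit integer column with AVX2 and emits one result bit per row; note that SIMD-Scan itself operates on bit-packed, dictionary-compressed columns, which is not shown here.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <vector>

// Vectorized selection scan: one result bit per row, set where col[i] < c.
// Assumes n is a multiple of 8 for brevity.
std::vector<uint8_t> scan_less_than(const int32_t* col, size_t n, int32_t c) {
    std::vector<uint8_t> bitmap(n / 8);
    __m256i cv = _mm256_set1_epi32(c);
    for (size_t i = 0; i < n; i += 8) {
        __m256i v  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(col + i));
        __m256i lt = _mm256_cmpgt_epi32(cv, v);            // c > col[i]  <=>  col[i] < c
        bitmap[i / 8] = static_cast<uint8_t>(
            _mm256_movemask_ps(_mm256_castsi256_ps(lt)));  // 8 result bits per iteration
    }
    return bitmap;
}
```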
Making B+-trees cache conscious in main memory
- Computer ScienceSIGMOD '00
- 2000
A new indexing technique called CSB+-trees is proposed that stores all the child nodes of any given node contiguously and keeps only the address of the first child in each node; two variants of CSB+-trees are introduced, one that reduces the copying cost when a node splits and one that preallocates space for the full node group to reduce the split cost.
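The space saving comes from storing all children of a node contiguously as a node group, so one first-child pointer plus an offset replaces per-child pointers and more of each cache line holds keys. A simplified fixed-fanout sketch (the segmented and full variants, and leaf nodes, are omitted):

```cpp
#include <cstdint>

// CSB+-style internal node: children live contiguously in one node group,
// so a single first-child pointer plus an offset locates any child.
struct CSBNode {
    static constexpr int MAX_KEYS = 14;   // sized so a node spans roughly one or two cache lines
    int32_t  keys[MAX_KEYS];
    uint16_t nkeys;
    CSBNode* first_child;                 // start of a node group of nkeys + 1 children

    CSBNode* child_for(int32_t probe) const {
        int i = 0;
        while (i < nkeys && keys[i] <= probe) ++i;  // branch index within the node
        return first_child + i;                     // contiguous children: pointer arithmetic
    }
};
```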
Real-time parallel hashing on the GPU
- Computer ScienceSIGGRAPH 2009
- 2009
An efficient data-parallel algorithm for building large hash tables of millions of elements in real time is presented; it considers a classical sparse perfect hashing approach and cuckoo hashing, which packs elements densely by allowing an element to be stored in one of multiple possible locations.
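For reference, the lookup and insert logic of basic two-table cuckoo hashing is small; the sequential sketch below (hypothetical hash functions, 0xFFFFFFFF reserved as the empty marker) shows why lookups need at most two probes, while the paper's contribution is building such tables in a data-parallel fashion on the GPU.

```cpp
#include <cstdint>
#include <vector>
#include <utility>

// Minimal two-table cuckoo hash for 32-bit keys (sequential textbook version).
class CuckooTable {
    static constexpr uint32_t EMPTY = 0xFFFFFFFFu;   // keys may not use this value
    std::vector<uint32_t> t0, t1;
    size_t h0(uint32_t k) const { return (k * 2654435761u) % t0.size(); }
    size_t h1(uint32_t k) const { return (k * 40503u + 2166136261u) % t1.size(); }
public:
    explicit CuckooTable(size_t slots) : t0(slots, EMPTY), t1(slots, EMPTY) {}

    // Exactly two probes, one per table.
    bool contains(uint32_t k) const {
        return t0[h0(k)] == k || t1[h1(k)] == k;
    }

    // Returns false if the eviction chain is too long (table should be rebuilt).
    bool insert(uint32_t k) {
        for (int tries = 0; tries < 64; ++tries) {
            std::swap(k, t0[h0(k)]);          // place in table 0, possibly evicting
            if (k == EMPTY) return true;
            std::swap(k, t1[h1(k)]);          // relocate the evicted key to table 1
            if (k == EMPTY) return true;
        }
        return false;                          // likely a cycle: rehash needed
    }
};
```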
Super-Scalar RAM-CPU Cache Compression
- Computer Science22nd International Conference on Data Engineering (ICDE'06)
- 2006
This work proposes three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs and compares these algorithms with compression techniques used in (commercial) database and information retrieval systems.
Main-memory index structures with fixed-size partial keys
- Computer ScienceSIGMOD '01
- 2001
This paper proposes two index structures, pkT-trees and pkB-trees, which significantly reduce cache misses by storing partial-key information in the index, and shows that a small, fixed amount of key information avoids most cache misses while permitting a simple node structure and an efficient implementation.
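The idea is to keep a small, fixed-size slice of each key inside the index node so that most comparisons are resolved without chasing the pointer to the full key. A minimal sketch, assuming the slice is simply the first four bytes and ignoring the offset bookkeeping the paper uses:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Entry with a fixed-size partial key: most comparisons are decided from the
// 4-byte slice stored in the node; the full key is fetched only on a tie.
struct PartialKeyEntry {
    uint32_t           slice;     // first 4 key bytes, packed so integer order = byte order
    const std::string* full_key;  // out-of-node full key

    static uint32_t make_slice(const std::string& key) {
        unsigned char b[4] = {0, 0, 0, 0};
        std::memcpy(b, key.data(), key.size() < 4 ? key.size() : 4);
        return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
               (uint32_t(b[2]) << 8)  |  uint32_t(b[3]);
    }

    // <0, 0, >0 as the entry's key compares to the probe.
    int compare(const std::string& probe) const {
        uint32_t p = make_slice(probe);
        if (slice != p) return slice < p ? -1 : 1;  // decided without a pointer chase
        return full_key->compare(probe);            // rare case: slices tie
    }
};
```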
How to barter bits for chronons: compression and bandwidth trade offs for database scans
- Computer ScienceSIGMOD '07
- 2007
A study of how to make table scans faster through a scan code generator that produces code tuned to the database schema, the compression dictionaries, the queries being evaluated, and the target CPU architecture is presented.
Buffering Accesses to Memory-Resident Index Structures
- Computer ScienceVLDB
- 2003