Designing fast architecture-sensitive tree search on modern multicore/many-core processors

  • Authors:
  • Changkyu Kim (Intel Corporation); Jatin Chhugani (Intel Corporation); Nadathur Satish (Intel Corporation); Eric Sedlar (Oracle Corporation); Anthony D. Nguyen (Intel Corporation); Tim Kaldewey (Oracle Corporation); Victor W. Lee (Intel Corporation); Scott A. Brandt (University of California, Santa Cruz); Pradeep Dubey (Intel Corporation)

  • Venue:
  • ACM Transactions on Database Systems (TODS)
  • Year:
  • 2011

Abstract

In-memory tree-structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join, and aggregation. However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal. In this article, we present FAST, an extremely fast architecture-sensitive layout of the index tree. FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and Single Instruction Multiple Data (SIMD) width of the underlying hardware. FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second for large trees of 64M elements, with even better results on smaller trees. These are 5X (CPU) and 1.7X (GPU) faster than the best previously reported performance on the same architectures. We also evaluated FAST on the Intel® Many Integrated Core architecture (Intel® MIC), showing a speedup of 2.4X--3X over CPU and 1.8X--4.4X over GPU. FAST supports efficient bulk updates by rebuilding index trees in less than 0.1 seconds for datasets as large as 64M keys and naturally integrates compression techniques, overcoming the memory bandwidth bottleneck and achieving a 6X performance improvement over uncompressed index search for large keys on CPUs.