Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Authors:
Nadathur Satish; Changkyu Kim; Jatin Chhugani; Anthony D. Nguyen; Victor W. Lee; Daehyun Kim; Pradeep Dubey
Affiliations:
Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures: the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes; such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that for large data sizes of 2^40 (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.
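
The claim that key size shifts the balance between the two algorithms can be illustrated with a rough sketch. The Python below is not the paper's implementation or its analytical model: it is a scalar, single-threaded toy, and the names `lsd_radix_sort` and `bottom_up_merge_sort`, the 8-bit digit size, and the 2-way merge are all assumptions chosen for brevity. It only counts full passes over the data, to show that the pass count of a least-significant-digit radix sort grows with key width while the pass count of a merge sort depends on the number of records.

```python
import random

def lsd_radix_sort(keys, key_bits, digit_bits=8):
    """LSD radix sort on non-negative integers; returns (sorted_list, passes).

    Hypothetical helper: one pass per digit, so passes = ceil(key_bits / digit_bits).
    """
    mask = (1 << digit_bits) - 1
    passes = 0
    for shift in range(0, key_bits, digit_bits):
        # Stable scatter of every key into the bucket of its current digit.
        buckets = [[] for _ in range(1 << digit_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
        passes += 1
    return keys, passes

def bottom_up_merge_sort(keys):
    """2-way bottom-up merge sort; returns (sorted_list, passes).

    Hypothetical helper: one pass per doubling of the sorted run length.
    """
    keys = list(keys)
    n = len(keys)
    width, passes = 1, 0
    while width < n:
        out = []
        for lo in range(0, n, 2 * width):
            left = keys[lo:lo + width]
            right = keys[lo + width:lo + 2 * width]
            i = j = 0
            # Standard merge of two sorted runs.
            while i < len(left) and j < len(right):
                if left[i] <= right[j]:
                    out.append(left[i])
                    i += 1
                else:
                    out.append(right[j])
                    j += 1
            out.extend(left[i:])
            out.extend(right[j:])
        keys = out
        width *= 2
        passes += 1
    return keys, passes

rng = random.Random(0)
data32 = [rng.getrandbits(32) for _ in range(1024)]
data64 = [rng.getrandbits(64) for _ in range(1024)]

_, radix_passes_32 = lsd_radix_sort(data32, key_bits=32)
_, radix_passes_64 = lsd_radix_sort(data64, key_bits=64)
_, merge_passes_32 = bottom_up_merge_sort(data32)
_, merge_passes_64 = bottom_up_merge_sort(data64)
print(radix_passes_32, radix_passes_64, merge_passes_32, merge_passes_64)
```

In this toy setting, doubling the key width doubles the radix sort passes and leaves the merge sort passes unchanged. Pass counts ignore per-comparison cost, SIMD width, core count and cache behavior, which are the factors the abstract says the paper's models account for.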