Efficient implementation of sorting on multi-core SIMD CPU architecture

Authors:
Jatin Chhugani;Anthony D. Nguyen;Victor W. Lee;William Macy;Mostafa Hagog;Yen-Kuang Chen;Akram Baransi;Sanjeev Kumar;Pradeep Dubey
Affiliations:
Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than 0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results. Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than the current SSE width of 128 bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.
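
The multiway merge mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no code, and the paper's version is SIMD and multi-threaded. The function name `multiway_merge_sort` and the `chunk_size` parameter are hypothetical, and a scalar heap-based merge from the Python standard library stands in for the paper's merge. It shows only the general idea: sort small runs independently, then combine all runs in a single merge pass instead of repeated two-way passes over the whole data set.

```python
import heapq


def multiway_merge_sort(values, chunk_size=4):
    """Sort `values` by sorting fixed-size runs, then merging all runs at once.

    Illustrative only: `chunk_size` is a hypothetical stand-in for a run
    that fits in cache; the paper's SIMD kernels are not reproduced here.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Phase 1: sort each run independently.
    runs = [
        sorted(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    # Phase 2: merge every sorted run in one pass. heapq.merge keeps one
    # candidate element per run in a heap and yields the smallest each time.
    return list(heapq.merge(*runs))
```

Merging all runs in one pass means each element is read and written once during the merge phase, which is one general way a multiway merge reduces memory traffic compared with a sequence of two-way merges over the full input.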