Sorting a list of input numbers is one of the most fundamental problems in computer science in general and in high-throughput database applications in particular. Although the literature abounds with sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge and is not constrained by memory bandwidth. Our multi-threaded SIMD implementation sorts 64 million floating-point numbers in less than 0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results. Additionally, the paper demonstrates the performance scalability of the proposed sorting algorithm with respect to salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than the current SSE width of 128 bits, and CMP core count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms the scalability of our proposed algorithm.
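The abstract does not spell out the merge kernel, so the following is only a minimal sketch of one common way a SIMD-friendly merge can be structured: a bitonic merge network built from lane-wise min/max operations, applied block by block to two sorted runs. Plain Python lists stand in for SIMD registers, the block width `k = 4` is an assumption chosen to mirror four 32-bit lanes in a 128-bit register, and the names `lane_min`, `lane_max`, `bitonic_merge`, and `merge_runs` are hypothetical, not taken from the paper.

```python
def lane_min(a, b):
    # Element-wise minimum, standing in for a SIMD min instruction.
    return [min(x, y) for x, y in zip(a, b)]


def lane_max(a, b):
    # Element-wise maximum, standing in for a SIMD max instruction.
    return [max(x, y) for x, y in zip(a, b)]


def bitonic_merge(a, b):
    """Merge two sorted blocks of equal power-of-two length.

    Uses only data-independent min/max steps (no per-element branches).
    """
    n = len(a)
    assert n == len(b) and n > 0 and n & (n - 1) == 0
    # Reversing b makes the concatenation a bitonic sequence.
    v = list(a) + list(reversed(b))
    stride = n
    while stride >= 1:
        out = v[:]
        for start in range(0, 2 * n, 2 * stride):
            lo = v[start:start + stride]
            hi = v[start + stride:start + 2 * stride]
            out[start:start + stride] = lane_min(lo, hi)
            out[start + stride:start + 2 * stride] = lane_max(lo, hi)
        v = out
        stride //= 2
    return v


def merge_runs(x, y, k=4):
    """Merge two sorted runs whose lengths are positive multiples of k.

    Each step merges the carried block with the next block from whichever
    run has the smaller head, emits the lower k values, and carries the
    upper k values forward.
    """
    assert len(x) >= k and len(y) >= k
    assert len(x) % k == 0 and len(y) % k == 0
    merged = bitonic_merge(x[:k], y[:k])
    out, carry = merged[:k], merged[k:]
    i = j = k
    while i < len(x) or j < len(y):
        take_x = j >= len(y) or (i < len(x) and x[i] <= y[j])
        if take_x:
            block, i = x[i:i + k], i + k
        else:
            block, j = y[j:j + k], j + k
        merged = bitonic_merge(carry, block)
        out += merged[:k]
        carry = merged[k:]
    return out + carry
```

For example, `merge_runs([1, 2, 3, 100, 101, 102, 103, 104], [4, 5, 6, 7, 8, 9, 10, 11])` returns the sixteen values in ascending order. The point of the design is that the inner network performs the same comparisons regardless of the data, which is what makes it a natural fit for SIMD lanes; the only data-dependent decision is which run supplies the next block.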