Vector models for data-parallel computing
Vector models for data-parallel computing
Radix sort for vector multiprocessors
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
An improved supercomputer sorting benchmark
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
ACM Computing Surveys (CSUR)
The influence of caches on the performance of sorting
SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
Main-memory index structures with fixed-size partial keys
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
A Fast, Simple Algorithm to Balance a Parallel Multiway Merge
PARLE '93 Proceedings of the 5th International PARLE Conference on Parallel Architectures and Languages Europe
GPUTeraSort: high performance graphics co-processor sorting for large database management
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Scan primitives for GPU computing
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Efficient implementation of sorting on multi-core SIMD CPU architecture
Proceedings of the VLDB Endowment
Sorting networks and their applications
AFIPS '68 (Spring) Proceedings of the April 30--May 2, 1968, spring joint computer conference
Rock: A High-Performance Sparc CMT Processor
IEEE Micro
Dictionary-based order-preserving string compression for main memory column stores
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Designing efficient sorting algorithms for manycore GPUs
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs
Proceedings of the VLDB Endowment
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
Engineering a multi-core radix sort
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
A high-performance sorting algorithm for multicore single-instruction multiple-data processors
Software—Practice & Experience
VAST-Tree: a vector-advanced and compressed structure for massive data tree traversal
Proceedings of the 15th International Conference on Extending Database Technology
Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Proceedings of the 39th Annual International Symposium on Computer Architecture
Database analytics acceleration using FPGAs
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Building a collision for 75-round reduced SHA-1 using GPU clusters
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Parallel suffix array construction for shared memory architectures
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Comparison based sorting for systems with multiple GPUs
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Location-aware cache management for many-core processors with deep cache hierarchy
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Memory footprint matters: efficient equi-join algorithms for main memory data processing
Proceedings of the 4th annual Symposium on Cloud Computing
A novel finite element method assembler for co-processors and accelerators
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Register level sort algorithm on multi-core SIMD processors
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
The Yin and Yang of processing data warehousing queries on GPU devices
Proceedings of the VLDB Endowment
Hardware-oblivious parallelism for in-memory column-stores
Proceedings of the VLDB Endowment
Permuting data on random-access block storage
Proceedings of the VLDB Endowment
Hardware acceleration of database operations
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Time- and space-efficient flow-sensitive points-to analysis
ACM Transactions on Architecture and Code Optimization (TACO)
Streaming similarity search over one billion tweets using parallel locality-sensitive hashing
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that large data sizes of 240 (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.