Stream compaction for deferred shading
Proceedings of the Conference on High Performance Graphics 2009
Real-time parallel hashing on the GPU
ACM SIGGRAPH Asia 2009 papers
Interactive fluid-particle simulation using translating Eulerian grids
Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games
FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects
Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games
Teaching design & analysis of multi-core parallel algorithms using CUDA
Journal of Computing Sciences in Colleges
The Scalable Heterogeneous Computing (SHOC) benchmark suite
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
High-detailed fluid simulations on the GPU
ACM SIGGRAPH 2010 Talks
Understanding throughput-oriented architectures
Communications of the ACM
Revisiting sorting for GPGPU stream architectures
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Fast parallel surface and solid voxelization on GPUs
ACM SIGGRAPH Asia 2010 papers
Fast in-place sorting with CUDA based on bitonic sort
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
A two-level real-time vision machine combining coarse- and fine-grained parallelism
Journal of Real-Time Image Processing
HLBVH: hierarchical LBVH construction for real-time ray tracing of dynamic geometry
Proceedings of the Conference on High Performance Graphics
Particle-in-cell simulations with charge-conserving current deposition on graphic processing units
Journal of Computational Physics
OpenCL and parallel primitives for digital TV applications
IBM Journal of Research and Development
GPU curvature estimation on deformable meshes
I3D '11 Symposium on Interactive 3D Graphics and Games
FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application
Facing the multicore-challenge
Dynamic workload balancing deques for branch and bound algorithms in the message passing interface
International Journal of High Performance Systems Architecture
Implicit and dynamic trees for high performance rendering
Proceedings of Graphics Interface 2011
SAH KD-tree construction on GPU
Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics
Randomized selection on the GPU
Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics
Preliminary work on graphics processing unit based direct simulation Monte Carlo
Proceedings of the 2010 Conference on Grand Challenges in Modeling & Simulation
A GPU-Based Implementation for Range Queries on Spaghettis Data Structure
ICCSA'11 Proceedings of the 2011 international conference on Computational science and its applications - Volume Part I
Analysis of Multi-Sort Algorithm on Multi-Mesh of Trees (MMT) architecture
The Journal of Supercomputing
Proceedings of the 9th International Conference on Computational Methods in Systems Biology
Real-time computation of advanced rules in OLAP databases
ADBIS'11 Proceedings of the 15th international conference on Advances in databases and information systems
Designing fast architecture-sensitive tree search on modern multicore/many-core processors
ACM Transactions on Database Systems (TODS)
Accelerating the smoldyn spatial stochastic biochemical reaction network simulator using GPUs
Proceedings of the 19th High Performance Computing Symposia
Speeding up large-scale geospatial polygon rasterization on GPGPUs
Proceedings of the ACM SIGSPATIAL Second International Workshop on High Performance and Distributed Geographic Information Systems
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I
Parallel quadtree coding of large-scale raster geospatial data on GPGPUs
Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Scalable parallel minimum spanning forest computation
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Implementing p systems parallelism by means of GPUs
WMC'09 Proceedings of the 10th international conference on Membrane Computing
Fast GPU-Based fluid simulations using SPH
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
Improving the speed and stability of the k-nearest neighbors method
Pattern Recognition Letters
Design and implementation of an efficient integer count sort in CUDA GPUs
Concurrency and Computation: Practice & Experience
A high-performance sorting algorithm for multicore single-instruction multiple-data processors
Software—Practice & Experience
Thermal management of a many-core processor under fine-grained parallelism
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
VAST-Tree: a vector-advanced and compressed structure for massive data tree traversal
Proceedings of the 15th International Conference on Extending Database Technology
GPU merge path: a GPU merging algorithm
Proceedings of the 26th ACM international conference on Supercomputing
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
International Journal of High Performance Computing Applications
Discrete range searching primitive for the GPU and its applications
Journal of Experimental Algorithmics (JEA)
GPGPU implementation of growing neural gas: Application to 3D scene reconstruction
Journal of Parallel and Distributed Computing
Interactive global photon mapping
EGSR'09 Proceedings of the Twentieth Eurographics conference on Rendering
Ray tracing dynamic scenes with shadows on GPU
EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Approximate parallel sorting on a spatial computer
Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability
Parallel suffix array construction for shared memory architectures
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
RDFS reasoning on massively parallel hardware
ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I
Parallel Shellsort Algorithm for Many-Core GPUs with CUDA
International Journal of Grid and High Performance Computing
Parallel suffix array and least common prefix for the GPU
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
StreamScan: fast scan algorithms for GPUs without global barrier synchronization
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Comparison based sorting for systems with multiple GPUs
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Designing a database system for modern processing architectures
Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
Optimising purely functional GPU programs
Proceedings of the 18th ACM SIGPLAN international conference on Functional programming
Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study
The Journal of Supercomputing
Evaluating the acceleration of typical scientific problems on the GPU
Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
A micro 64-tree structure for accelerating ray tracing on a GPU
Proceedings of Graphics Interface 2013
Journal of Parallel and Distributed Computing
Interactive smoke simulation and rendering on the GPU
Proceedings of the 12th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry
Register level sort algorithm on multi-core SIMD processors
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
A sound and complete abstraction for reasoning about parallel prefix sums
Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
Hardware-oblivious parallelism for in-memory column-stores
Proceedings of the VLDB Endowment
Red Fox: An Execution Environment for Relational Query Processing on GPUs
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hardware acceleration of database operations
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Theoretical analysis of classic algorithms on highly-threaded many-core GPUs
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Time- and space-efficient flow-sensitive points-to analysis
ACM Transactions on Architecture and Code Optimization (TACO)
A memory access model for highly-threaded many-core architectures
Future Generation Computer Systems
GPU-based parallel construction of compact visual hull meshes
The Visual Computer: International Journal of Computer Graphics
Sorted deferred shading for production path tracing
EGSR '13 Proceedings of the Eurographics Symposium on Rendering
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and more than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and to decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well suited to other manycore processors.
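To illustrate how a parallel scan can drive a radix sort, here is a minimal sequential sketch in Python. It is not the implementation described in the abstract: it runs on the CPU, handles one bit per pass, and assumes non-negative integer keys. The function names (`exclusive_scan`, `split_by_bit`, `radix_sort`) and the `num_bits` parameter are hypothetical, chosen only for this example. Each loop over elements stands in for work that a data-parallel version could assign to independent threads, with the scan supplying every element's output position.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    return list(accumulate(values, initial=0))[:-1]


def split_by_bit(keys, bit):
    """One stable radix pass on a single bit, driven by a scan.

    Keys whose chosen bit is 0 move to the front and keys whose bit is 1
    move to the back, preserving relative order within each group.
    """
    # flags[i] is 1 when key i has a 0 in the chosen bit.
    flags = [1 - ((k >> bit) & 1) for k in keys]
    # zero_pos[i] counts the zero-bit keys that precede position i.
    zero_pos = exclusive_scan(flags)
    total_zeros = sum(flags)

    out = [0] * len(keys)
    for i, k in enumerate(keys):
        if flags[i]:
            dest = zero_pos[i]
        else:
            # One-bit keys before i number i - zero_pos[i].
            dest = total_zeros + (i - zero_pos[i])
        out[dest] = k
    return out


def radix_sort(keys, num_bits=32):
    """Least-significant-bit-first radix sort for non-negative integers
    below 2**num_bits (an assumption of this sketch)."""
    for bit in range(num_bits):
        keys = split_by_bit(keys, bit)
    return keys
```

For example, `radix_sort([5, 3, 8, 1, 3, 0], num_bits=4)` returns the keys in ascending order. Because every destination index depends only on the scan result and the element's own position, the scatter step involves no comparisons between elements, which is one reason scan is a natural building block for this kind of sort.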