Recent advances in computing have led to an explosion in the amount of data being generated. Processing this ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that these kernels contain ample parallelism, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years, many studies have claimed that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an NVIDIA GTX280 processor and an Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contributed to the performance differences between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.