GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory so that requests to the same or contiguous memory locations are combined, and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups of 10.5X to 457X in kernel codes and 1.16X to 431X in total application execution.
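The balance between per-thread resource usage and the number of simultaneously active threads can be made concrete with a small back-of-the-envelope calculator. The sketch below is illustrative only and is not taken from the paper: the class and function names are hypothetical, and the per-multiprocessor limits (8192 registers, 16 KB of on-chip shared memory, 768 resident threads, 8 resident blocks) are assumed values commonly quoted for GeForce 8800-class hardware, not figures stated in the abstract. It also ignores allocation granularity, so real hardware may admit fewer blocks than it reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits.

    The defaults are ASSUMED values for a GeForce 8800-class part;
    they do not come from the abstract and should be replaced with
    the figures for the device actually being targeted.
    """
    registers: int = 8192
    shared_mem_bytes: int = 16 * 1024
    max_threads: int = 768
    max_blocks: int = 8


def active_blocks_per_sm(threads_per_block, regs_per_thread,
                         smem_per_block, limits=SMLimits()):
    """Estimate how many thread blocks can be resident on one multiprocessor.

    Each resource imposes its own ceiling; the tightest one wins.
    Allocation granularity is ignored (a simplifying assumption).
    """
    if threads_per_block <= 0 or regs_per_thread <= 0 or smem_per_block < 0:
        raise ValueError("resource counts must be positive (shared memory may be 0)")

    by_threads = limits.max_threads // threads_per_block
    by_registers = limits.registers // (regs_per_thread * threads_per_block)
    if smem_per_block > 0:
        by_shared_mem = limits.shared_mem_bytes // smem_per_block
    else:
        by_shared_mem = limits.max_blocks  # shared memory is not a constraint

    return min(limits.max_blocks, by_threads, by_registers, by_shared_mem)


def active_threads_per_sm(threads_per_block, regs_per_thread,
                          smem_per_block, limits=SMLimits()):
    """Resident threads per multiprocessor under the same assumptions."""
    blocks = active_blocks_per_sm(threads_per_block, regs_per_thread,
                                  smem_per_block, limits)
    return blocks * threads_per_block


if __name__ == "__main__":
    # One extra register per thread can remove a whole block from residency.
    for regs in (10, 11):
        threads = active_threads_per_sm(256, regs, 0)
        print(f"{regs} registers/thread -> {threads} active threads per multiprocessor")
```

Under these assumed limits, a kernel with 256-thread blocks fits three blocks per multiprocessor at 10 registers per thread but only two at 11, so an optimization that costs one extra register can reduce the number of threads available to hide global memory latency by a third. Whether that trade pays off depends on how much work the optimization saves per thread, which is the balance the abstract describes.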