NVIDIA Tesla: A Unified Graphics and Computing Architecture

Authors:
Erik Lindholm;John Nickolls;Stuart Oberman;John Montrym
Affiliations:
NVIDIA;NVIDIA;NVIDIA;NVIDIA
Venue:
IEEE Micro
Year:
2008

Citing 0
Cited 174

Sparse matrix computations on manycore GPU's

Proceedings of the 45th annual Design Automation Conference
A performance study of general-purpose applications on graphics processors using CUDA

Journal of Parallel and Distributed Computing
GRAMPS: A programming model for graphics pipelines

ACM Transactions on Graphics (TOG)
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

Languages and Compilers for Parallel Computing
A Hardware Task Scheduler for Embedded Video Processing

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Multilevel summation of electrostatic potentials using graphics processing units

Parallel Computing
3D finite difference computation on GPUs using CUDA

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Accelerating PQMRCGSTAB algorithm on GPU

Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
A multi-streaming SIMD architecture for multimedia applications

Proceedings of the 6th ACM conference on Computing frontiers
Towards automatic program partitioning

Proceedings of the 6th ACM conference on Computing frontiers
High-performance SIMT code generation in an active visual effects library

Proceedings of the 6th ACM conference on Computing frontiers
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

ACM Transactions on Architecture and Code Optimization (TACO)
Rigel: an architecture and scalable programming interface for a 1000-core accelerator

Proceedings of the 36th annual international symposium on Computer architecture
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture
Solving Sparse Linear Systems on NVIDIA Tesla GPUs

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Experiences with Mapping Non-linear Memory Access Patterns into GPUs

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Probing biomolecular machines with graphics processors

Communications of the ACM - A View of Parallel Computing
A fast high quality pseudo random number generator for nVidia CUDA

Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers
Solving quadratic assignment problems by genetic algorithms with GPU computation: a case study

Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers
Understanding the efficiency of ray traversal on GPUs

Proceedings of the Conference on High Performance Graphics 2009
Efficient stream compaction on wide SIMD many-core architectures

Proceedings of the Conference on High Performance Graphics 2009
Stream compaction for deferred shading

Proceedings of the Conference on High Performance Graphics 2009
Programmable and Scalable Architecture for Graphics Processing Units

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling

Proceedings of the 31st DAGM Symposium on Pattern Recognition
On GPU's viability as a middleware accelerator

Cluster Computing
Nodal discontinuous Galerkin methods on graphics processors

Journal of Computational Physics
Tracking as Segmentation of Spatial-Temporal Volumes by Anisotropic Weighted TV

EMMCVPR '09 Proceedings of the 7th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition
Efficient Multiplication of Polynomials on Graphics Hardware

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Probing Biomolecular Machines with Graphics Processors

Queue - Bioscience
Technical Section: Shader-based tessellation to save memory bandwidth in a mobile multimedia processor

Computers and Graphics
Implementing sparse matrix-vector multiplication on throughput-oriented processors

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework

Genetic Programming and Evolvable Machines
Multi-core platforms for signal processing: source and channel coding

ICME'09 Proceedings of the 2009 IEEE international conference on Multimedia and Expo
TRaX: a multicore hardware architecture for real-time ray tracing

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Interactive fluid-particle simulation using translating Eulerian grids

Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games
Teaching design & analysis of multi-core parallel algorithms using CUDA

Journal of Computing Sciences in Colleges
Parallel multiclass classification using SVMs on GPUs

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Iterative induced dipoles computation for molecular mechanics on GPUs

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
An asymmetric distributed shared memory model for heterogeneous parallel systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
OptiX: a general purpose ray tracing engine

ACM SIGGRAPH 2010 papers
A compact harmonic code for early vision based on anisotropic frequency channels

Computer Vision and Image Understanding
GPU computing with Kaczmarz's and other iterative algorithms for linear systems

Parallel Computing
Solving path problems on the GPU

Parallel Computing
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
A Network Congestion-Aware Memory Controller

NOCS '10 Proceedings of the 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip
Efficient fault simulation on many-core processors

Proceedings of the 47th Design Automation Conference
Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Transactions on Architecture and Code Optimization (TACO)
Understanding throughput-oriented architectures

Communications of the ACM
High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster

Journal of Computational Physics
Multi-port abstraction layer for FPGA intensive memory exploitation applications

Journal of Systems Architecture: the EUROMICRO Journal
Asynchronous Communication Schemes for Finite Difference Methods on Multiple GPUs

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Cooperative Multitasking for GPU-Accelerated Grid Systems

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
A multi-streaming SIMD multimedia computing engine

Microprocessors & Microsystems
An instruction-systolic programmable shader architecture for multi-threaded 3D graphics processing

Journal of Parallel and Distributed Computing
WAYPOINT: scaling coherence to thousand-core architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
CUDA-MEME: Accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units

Pattern Recognition Letters
Tenor: making coding practical from servers to smartphones

Proceedings of the international conference on Multimedia
Distributed stream processing with DUP

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
The core degree based tag reduction on chip multiprocessor to balance energy saving and performance overhead

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Finite element numerical integration on GPUs

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Parallel variable-length encoding on GPGPUs

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Dynamic detection of uniform and affine vectors in GPGPU computations

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Development of a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI)

Journal of Computational Physics
Compact data structure and scalable algorithms for the sparse grid technique

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
A highly-parallel TSP solver for a GPU computing platform

NMA'10 Proceedings of the 7th international conference on Numerical methods and applications
Parallel implementation of a spatio-temporal visual saliency model

Journal of Real-Time Image Processing
Translation-invariant two-dimensional discrete wavelet transform on graphics processing units

ECS'10/ECCTD'10/ECCOM'10/ECCS'10 Proceedings of the European conference of systems, and European conference of circuits technology and devices, and European conference of communications, and European conference on Computer science
Reducing branch divergence in GPU programs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Floating-point data compression at 75 Gb/s on a GPU

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Massively Parallel Logic Simulation with GPUs

ACM Transactions on Design Automation of Electronic Systems (TODAES)
FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application

Facing the multicore-challenge
Structuring the unstructured middle with chunk computing

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
OUTRIDER: efficient memory latency tolerance with decoupled strands

Proceedings of the 38th annual international symposium on Computer architecture
Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators

Proceedings of the 38th annual international symposium on Computer architecture
Considerations when evaluating microprocessor platforms

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
High-performance software rasterization on GPUs

Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics
Spatial hardware implementation for sparse graph algorithms in GraphStep

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
Razor: An architecture for dynamic multiresolution ray tracing

ACM Transactions on Graphics (TOG)
Mathematical morphology in computer graphics, scientific visualization and visual exploration

ISMM'11 Proceedings of the 10th international conference on Mathematical morphology and its applications to image and signal processing
Optimization of N-queens solvers on graphics processors

APPT'11 Proceedings of the 9th international conference on Advanced parallel processing technologies
Implementation of an SDR platform using GPU and its application to a 2 × 2 MIMO WiMAX system

Analog Integrated Circuits and Signal Processing
CUDA-BLASTP: Accelerating BLASTP on CUDA-Enabled Graphics Hardware

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Dymaxion: optimizing memory access patterns for heterogeneous systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Correlation analysis on GPU systems using NVIDIA's CUDA

Journal of Real-Time Image Processing
Geospatial overlay computation on the GPU

Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Parallel Computing
Bandwidth-aware reconfigurable cache design with hybrid memory technologies

Proceedings of the International Conference on Computer-Aided Design
Modeling the computational efficiency of 2-D and 3-D silicon processors for early-chip planning

Proceedings of the International Conference on Computer-Aided Design
GPU-based parallel collision detection for fast motion planning

International Journal of Robotics Research
Better speedups using simpler parallel programming for graph connectivity and biconnectivity

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Original article: Parallel collision detection of ellipsoids with applications in large scale multibody dynamics

Mathematics and Computers in Simulation
High-performance Monte Carlo radiosity on GPU based on scene partitioning

Microprocessors & Microsystems
Hardware transactional memory for GPU architectures

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
High performance 3-D FFT using multiple CUDA GPUs

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Implementing P systems parallelism by means of GPUs

WMC'09 Proceedings of the 10th international conference on Membrane Computing
Implementing a GPU programming model on a Non-GPU accelerator architecture

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Smoldyn on Graphics Processing Units: Massively Parallel Brownian Dynamics Simulations

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Efficient parallel CKY parsing on GPUs

IWPT '11 Proceedings of the 12th International Conference on Parsing Technologies
Revisiting finite difference and spectral migration methods on diverse parallel architectures

Computers & Geosciences
High-throughput antibody sequence alignment based on GPU computing

Proceedings of the 27th Annual ACM Symposium on Applied Computing
Design patterns for scientific computations on sparse matrices

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Boosted human re-identification using Riemannian manifolds

Image and Vision Computing
Fine-grain parallelism using multi-core, Cell/BE, and GPU Systems

Parallel Computing
Integrating data-intensive cloud computing with multicores and clusters in an HPC course

Proceedings of the 17th ACM annual conference on Innovation and technology in computer science education
Simultaneous branch and warp interweaving for sustained GPU performance

Proceedings of the 39th Annual International Symposium on Computer Architecture
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems

Proceedings of the 39th Annual International Symposium on Computer Architecture
GPU-based parallel algorithms for sparse nonlinear systems

Journal of Parallel and Distributed Computing
GPU accelerated computation of the longest common subsequence

Facing the Multicore-Challenge II
Operating systems should manage accelerators

HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
Using Blue Gene/P and GPUs to accelerate computations in the EULAG model

LSSC'11 Proceedings of the 8th international conference on Large-Scale Scientific Computing
Fast and small nonlinear pseudorandom number generators for computer simulation

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Parallelization of EULAG model on multicore architectures with GPU accelerators

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part II
Performance evaluation of hybrid implementation of support vector machine

IDEAL'12 Proceedings of the 13th international conference on Intelligent Data Engineering and Automated Learning
Power-efficient computing for compute-intensive GPGPU applications

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Fragment-parallel composite and filter

EGSR'10 Proceedings of the 21st Eurographics conference on Rendering
Wavelet-based multiresolution isosurface rendering

VG'10 Proceedings of the 8th IEEE/EG international conference on Volume Graphics
Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE

The Journal of Supercomputing
Direct approaches to exploit many-core architecture in bioinformatics

Future Generation Computer Systems
Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs

Parallel Computing
Parallel perfusion imaging processing using GPGPU

Computer Methods and Programs in Biomedicine
The CRNS framework and its application to programmable and reconfigurable cryptography

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Efficient data management for incoherent ray tracing

Applied Soft Computing
CUDA-Enabled Optimisation of Technical Analysis Parameters

DS-RT '12 Proceedings of the 2012 IEEE/ACM 16th International Symposium on Distributed Simulation and Real Time Applications
Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing

Integration, the VLSI Journal
Practical time bundle adjustment for 3D reconstruction on the GPU

ECCV'10 Proceedings of the 11th European conference on Trends and Topics in Computer Vision - Volume Part II
MORPHEUS: A heterogeneous dynamically reconfigurable platform for designing highly complex embedded systems

ACM Transactions on Embedded Computing Systems (TECS)
GPU ray tracing

Communications of the ACM
GPUDet: a deterministic GPU architecture

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Improving GPGPU concurrency with elastic kernels

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
KFusion: optimizing data flow without compromising modularity

Proceedings of the 12th annual international conference on Aspect-oriented software development
Cache-Conscious Wavefront Scheduling

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Exploring GPU architectures to accelerate semantic comparison for intention-based search

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Warp size impact in GPUs: large or small?

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
COSMIC: middleware for high performance and reliable multiprocessing on Xeon Phi coprocessors

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Neutron sensitivity and software hardening strategies for matrix multiplication and FFT on graphics processing units

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Future of GPGPU micro-architectural parameters

Proceedings of the Conference on Design, Automation and Test in Europe
Microarchitectural mechanisms to exploit value structure in SIMT architectures

Proceedings of the 40th Annual International Symposium on Computer Architecture
Orchestrated scheduling and prefetching for GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
Reducing memory access latency with asymmetric DRAM bank organizations

Proceedings of the 40th Annual International Symposium on Computer Architecture
GPUWattch: enabling energy optimizations in GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
A network congestion-aware memory subsystem for manycore

ACM Transactions on Embedded Computing Systems (TECS) - Special Section on Wireless Health Systems, On-Chip and Off-Chip Network Architectures
An expansion-aided synchronous conservative time management algorithm on GPU

Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation
GPU-CC: a reconfigurable GPU architecture with communicating cores

Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems
Homogeneous stream processors with embedded special function units for high-utilization programmable shaders

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

ACM Transactions on Computer Systems (TOCS)
Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

International Journal of High Performance Computing Applications
Extended Kalman filter-based Elman networks for industrial time series prediction with GPU acceleration

Neurocomputing
GPU-based approaches for real-time sound source localization using the SRP-PHAT algorithm

International Journal of High Performance Computing Applications
Visualizing 3D/4D environmental data using many-core graphics processing units (GPUs) and multi-core central processing units (CPUs)

Computers & Geosciences
APOGEE: adaptive prefetching on GPUs for energy efficiency

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Neither more nor less: optimizing thread-level parallelism for GPGPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Scalability study of molecular dynamics simulation on Godson-T many-core architecture

Journal of Parallel and Distributed Computing
Computing resultants on Graphics Processing Units: Towards GPU-accelerated computer algebra

Journal of Parallel and Distributed Computing
Accelerated implementation of adaptive directional lifting-based discrete wavelet transform on GPU

Image Communication
Efficient sorting design on a novel embedded parallel computing architecture with unique memory access

Computers and Electrical Engineering
Divergence-aware warp scheduling

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
A GPU-based discrete event simulation kernel

Simulation
Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures

ACM Transactions on Architecture and Code Optimization (TACO)
HARP: Harnessing inactive threads in many-core processors

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Optimising space exploration of OpenCL for GPGPUs

International Journal of Computational Science and Engineering
Design patterns for sparse-matrix computations on hybrid CPU/GPU platforms

Scientific Programming
Performance models and workload distribution algorithms for optimizing a hybrid CPU-GPU multifrontal solver

Computers & Mathematics with Applications

Quantified Score

Hi-index	0.05

Abstract

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.