Atomic Vector Operations on Chip Multiprocessors

Authors:
Sanjeev Kumar; Daehyun Kim; Mikhail Smelyanskiy; Yen-Kuang Chen; Jatin Chhugani; Christopher J. Hughes; Changkyu Kim; Victor W. Lee; Anthony D. Nguyen
Venue:
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Year:
2008

Abstract

The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be supported more efficiently than multithreading, but it is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily use the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for such code. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using the previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that GLSC improves performance on a set of important RMS kernels by an average of 54% for 4-wide SIMD.
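
To illustrate why a reduction over sparse indices resists plain gather/scatter vectorization, the following is a minimal software sketch of a histogram update, assuming a 4-lane vector. It is not the paper's hardware design: the function names and the per-lane retry loop (modeled loosely on load-linked/store-conditional) are assumptions made here for illustration only. The naive version silently loses updates when two lanes in the same vector target the same bin; the second version retries conflicting lanes until every update has landed.

```python
# Illustrative software model only. Function names and retry semantics are
# assumptions for this sketch, not the paper's hardware specification.

SIMD_WIDTH = 4  # the abstract reports results for 4-wide SIMD


def naive_vector_histogram(bins, indices):
    """Gather, add 1, scatter. Loses updates when lanes share an index."""
    for start in range(0, len(indices), SIMD_WIDTH):
        lanes = indices[start:start + SIMD_WIDTH]
        gathered = [bins[i] for i in lanes]       # gather
        updated = [v + 1 for v in gathered]       # vector add
        for i, v in zip(lanes, updated):          # scatter: last writer wins
            bins[i] = v
    return bins


def atomic_vector_histogram(bins, indices):
    """Same update, but lanes that conflict within a vector are retried."""
    for start in range(0, len(indices), SIMD_WIDTH):
        pending = list(indices[start:start + SIMD_WIDTH])
        while pending:
            gathered = [bins[i] for i in pending]  # gather with a "link"
            updated = [v + 1 for v in gathered]
            claimed = set()
            failed = []
            for i, v in zip(pending, updated):     # conditional scatter
                if i in claimed:
                    failed.append(i)               # conflicting lane retries
                else:
                    claimed.add(i)
                    bins[i] = v
            pending = failed                       # loop until all lanes succeed
    return bins


if __name__ == "__main__":
    idx = [0, 1, 1, 3, 1, 1, 1, 2]
    print(naive_vector_histogram([0] * 4, idx))   # bin 1 is undercounted
    print(atomic_vector_histogram([0] * 4, idx))  # every update is counted
```

In this sketch the retry loop serializes only the lanes that actually collide, so vectors with distinct indices complete in a single pass. A real multithreaded setting would also have to detect conflicts with other threads, which this single-threaded model does not attempt.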