The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be supported more efficiently than multithreading, but is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily use the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for such code. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that, for 4-wide SIMD, GLSC provides an average performance improvement of 54% on a set of important RMS kernels.
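To make the limitation concrete, the following minimal Python sketch models a 4-wide gather, increment, scatter sequence on a histogram-style reduction. It is a software illustration only: the helper names (`gather`, `scatter`, `naive_vector_increment`, `retry_vector_increment`) are hypothetical, and the retry loop, in which one lane per address commits on each pass, is an assumption made for illustration rather than the GLSC mechanism proposed in the paper. It shows why plain scatter-gather loses updates when two lanes target the same address, and what a correct result looks like.

```python
def gather(mem, idx):
    """Read one element per SIMD lane from arbitrary addresses."""
    return [mem[i] for i in idx]


def scatter(mem, idx, vals):
    """Write one element per lane; a later lane overwrites an earlier one."""
    for i, v in zip(idx, vals):
        mem[i] = v


def naive_vector_increment(mem, idx):
    """Gather, add 1, scatter. Lanes sharing an address lose updates."""
    vals = gather(mem, idx)
    scatter(mem, idx, [v + 1 for v in vals])


def retry_vector_increment(mem, idx):
    """Illustrative fix (assumption, not the paper's hardware design):
    on each pass only the first lane per address commits its update;
    the remaining lanes re-gather and retry until none are pending."""
    pending = list(idx)
    while pending:
        vals = gather(mem, pending)
        winners = {}
        for lane, addr in enumerate(pending):
            winners.setdefault(addr, lane)  # first lane claiming addr wins
        for addr, lane in winners.items():
            mem[addr] = vals[lane] + 1
        committed = set(winners.values())
        pending = [a for lane, a in enumerate(pending) if lane not in committed]


# Three of the four lanes target bin 2.
indices = [2, 2, 5, 2]

naive = [0] * 8
naive_vector_increment(naive, indices)

fixed = [0] * 8
retry_vector_increment(fixed, indices)

print(naive[2], fixed[2])
```

With duplicate indices the naive version records a single increment for bin 2, while the retry version records all three, at the cost of extra passes whenever lanes conflict.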