Many important applications exhibit large amounts of data parallelism, and modern computer systems are designed to take advantage of it. While much of the computation in the multimedia and scientific application domains is data parallel, certain operations require costly serialization that increases the run time. Examples include superposition-type updates in scientific computing and histogram computations in media processing. We introduce scatter-add, which is the data-parallel form of the well-known scalar fetch-and-op, specifically tuned for SIMD/vector/stream style memory systems. The scatter-add mechanism scatters a set of data values to a set of memory addresses and adds each data value to its referenced memory location instead of overwriting it. This novel architecture extension allows us to efficiently support data-parallel atomic update computations found in parallel programming languages such as HPF, and applies to both single-processor and multi-processor SIMD data-parallel systems. We detail the micro-architecture of a scatter-add implementation on a stream architecture, which requires a die-area increase of less than 2% yet shows performance speedups ranging from 1.45 to over 11 on a set of applications that require a scatter-add computation.
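The semantics of scatter-add, as opposed to an ordinary scatter, can be illustrated with a small sequential reference model. The sketch below is only an illustration of the behavior described above, not the hardware mechanism or the authors' implementation; the function names `scatter_store`, `scatter_add`, and `histogram` are hypothetical and chosen for this example.

```python
def scatter_store(memory, addresses, values):
    """Ordinary scatter: a later write to the same address overwrites an earlier one."""
    for addr, val in zip(addresses, values):
        memory[addr] = val
    return memory


def scatter_add(memory, addresses, values):
    """Scatter-add: each value is added to its target location,
    so values sent to the same address accumulate instead of being lost.

    This is a sequential model of the semantics only (an assumption made
    for illustration); it says nothing about how hardware orders updates.
    """
    if len(addresses) != len(values):
        raise ValueError("addresses and values must have the same length")
    for addr, val in zip(addresses, values):
        memory[addr] += val
    return memory


def histogram(samples, num_bins):
    """Histogram expressed as a scatter-add of ones, indexed by sample value."""
    bins = [0] * num_bins
    return scatter_add(bins, samples, [1] * len(samples))


samples = [2, 0, 2, 3, 2, 0]

# With scatter-add, repeated bin indices accumulate their counts.
hist = histogram(samples, 4)

# With a plain scatter, colliding writes overwrite each other and counts are lost.
overwritten = scatter_store([0] * 4, samples, [1] * len(samples))
```

In this model the histogram counts every sample, whereas the plain scatter records at most one hit per bin. That collision behavior is what forces serialization on data-parallel hardware without an atomic update mechanism.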