Cache Refill/Access Decoupling for Vector Machines
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Analysis and Performance Results of a Molecular Modeling Application on Merrimac
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
High-throughput sketch update on a low-power stream processor
Proceedings of the 2006 ACM/IEEE symposium on Architecture for networking and communications systems
Executing irregular scientific applications on stream architectures
Proceedings of the 21st annual international conference on Supercomputing
Tradeoff between data-, instruction-, and thread-level parallelism in stream processors
Proceedings of the 21st annual international conference on Supercomputing
Transform coding on programmable stream processors
The Journal of Supercomputing
Exploiting loop-dependent stream reuse for stream processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
ACM Transactions on Architecture and Code Optimization (TACO)
SARA: StreAm register allocation
CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
The research of an embedded processor element for multimedia domain
MCAM'07 Proceedings of the 2007 international conference on Multimedia content analysis and mining
Xetal-Pro: an ultra-low energy and high throughput SIMD processor
Proceedings of the 47th Design Automation Conference
Exploiting the reuse supplied by loop-dependent stream references for stream processors
ACM Transactions on Architecture and Code Optimization (TACO)
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A stream architecture supporting multiple stream execution models
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Implementation and optimization of sparse matrix-vector multiplication on imagine stream processor
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Hi-index | 0.00 |
Many current programmable architectures designed to exploit data parallelism require computation to be structured to operate on sequentially accessed vectors or streams of data. Applications with less regular data access patterns perform sub-optimally on such architectures. This paper presents a register file for streams (SRF) that allows arbitrary, indexed accesses. Compared to sequential SRF access, indexed access captures more temporal locality, reduces data replication in the SRF, and provides efficient support for certain types of complex access patterns. Our simulations show that indexed SRF access provides speedups of 1.03x to 4.1x and memory bandwidth reductions of up to 95% over sequential SRF access for a set of benchmarks representative of data-parallel applications with irregular accesses. Indexed SRF access also provides greater speedups than caches for a number of application classes despite significantly lower hardware costs. The area overhead of our indexed SRF implementation is 11%-22% over a sequentially accessed SRF, which corresponds to a modest 1.5%-3% increase in the total die area of a typical stream processor.