Journal of Parallel and Distributed Computing - Special issue on scalability of parallel algorithms and architectures
MOM: a matrix SIMD instruction set architecture for multimedia applications
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Implementation and Evaluation of the Complex Streamed Instruction Set
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
Matrix register file and extended subwords: two techniques for embedded media processors
Proceedings of the 2nd conference on Computing frontiers
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Register pointer architecture for efficient embedded processors
Proceedings of the conference on Design, automation and test in Europe
The Burroughs Scientific Processor (BSP)
IEEE Transactions on Computers
Evaluation of memory performance on the cell BE with the SARC programming model
Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Dynamically reconfigurable register file for a softcore VLIW processor
Proceedings of the Conference on Design, Automation and Test in Europe
IEEE Micro
Separable 2d convolution with polymorphic register files
ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
Hi-index | 0.00 |
We evaluate the scalability of a Polymorphic Register File using the Conjugate Gradient method as a case study. We focus on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency. We compare the performance of 256 Polymorphic Register File-augmented workers against a single Cell PowerPC Processor Unit (PPU). In such a scenario, simulation results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained. Moreover, when equal number of workers in the range 1-256 is employed, our design is between 1.7 and 4.2 times faster than a Cell PPU-based system. Furthermore, we study the memory latency and cache bandwidth impact on the sustainable speedups of the system considered. Our tests suggest that a 128 worker configuration requires the caches to deliver 1638.4 GB/sec in order to preserve 80% of its peak speedup.