DATE '00 Proceedings of the conference on Design, automation and test in Europe
Polygon rendering on a stream architecture
HWWS '00 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Imagine: Media Processing with Streams
IEEE Micro
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A PIM-based Multiprocessor System
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Implementations of Real-time Data Intensive Applications on PIM-based Multiprocessor Systems
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Exploiting On-Chip Memory Bandwidth in the VIRAM Compiler
IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Three-dimensional memory vectorization for high bandwidth media memory systems
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Programmable Stream Processors
Computer
Design and Optimization of Large Size and Low Overhead Off-Chip Caches
IEEE Transactions on Computers
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Dynamic warp subdivision for integrated branch and memory divergence tolerance
Proceedings of the 37th annual international symposium on Computer architecture
Design and analysis of adaptive processor
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Exascale workload characterization and architecture implications
Proceedings of the High Performance Computing Symposium
Hi-index | 0.00 |
Next generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. This paper presents the architecture of Vector IRAM (VIRAM), a processor that combines vector processing with embedded DRAM technology. Vector processing achieves high multimedia performance with simple hardware, while embedded DRAM provides high memory bandwidth at low energy consumption. VIRAM provides flexible support for media data types, short vectors, and DSP features. The vector pipeline is enhanced to hide DRAM latency without using caches. The peak performance is 3.2 GFLOPS (single precision) and maximum memory bandwidth is 25.6 GBytes/s. With a target power consumption of 2 Watts for the vector pipeline and the memory system, VIRAM supports 1.6 GFLOPS/Watt. For a set of representative media kernels, VIRAM sustains on average 88% of its peak performance, outperforming conventional SIMD media extensions and DSP processors by factors of 4.5 to 17. Using a clustered implementation approach, the modular design can be scaled without complicating control logic. We demonstrate that scaling the architecture leads to near linear application speedup. We also evaluate the effect of scaling the capacity and parallelism of the on-chip memory system to die area and sustained performance.