Energy efficiency of computation is quickly becoming a key problem at every scale, from the chip to the data center. This paper presents the first quantitative study of the potential energy efficiency of vector accelerators. We propose and study a vector accelerator architecture suitable for implementation in a 70 nm technology. The vector architecture has a high-bandwidth on-chip cache system coupled to 16 independent memory channels. We show that such an accelerator can achieve speedups of 10X or more on loop kernels compared with a quad-issue superscalar uniprocessor, while using less energy. We also introduce run-ahead lanes, a complexity- and energy-efficient means of tolerating variable latency from crossbar contention, cache bank conflicts, cache misses, and the memory system. Run-ahead lanes synchronize only on dependencies or when explicitly directed.
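
The intuition behind run-ahead lanes can be illustrated with a toy timing model. This is only a sketch under strong assumptions and is not the simulator or methodology used in the paper: the lane count, element count, miss rate, and hit and miss latencies are hypothetical values, and the helper names (`lockstep_cycles`, `run_ahead_cycles`, `make_latencies`) are invented for illustration. It assumes each lane processes its elements serially, that the elements are independent, and that the only synchronization point for run-ahead lanes is the end of the vector operation.

```python
import random


def make_latencies(num_lanes, num_elements, miss_rate, hit_cycles, miss_cycles, seed):
    """Build a per-lane list of per-element latencies (hypothetical values)."""
    rng = random.Random(seed)
    return [
        [miss_cycles if rng.random() < miss_rate else hit_cycles
         for _ in range(num_elements)]
        for _ in range(num_lanes)
    ]


def lockstep_cycles(latencies):
    """Lanes advance together: every step costs the slowest lane's latency."""
    return sum(max(step) for step in zip(*latencies))


def run_ahead_cycles(latencies):
    """Lanes advance independently and synchronize once, at the end."""
    return max(sum(lane) for lane in latencies)


# Assumed parameters for illustration only.
latencies = make_latencies(
    num_lanes=8, num_elements=64, miss_rate=0.1,
    hit_cycles=1, miss_cycles=20, seed=0,
)
print("lockstep :", lockstep_cycles(latencies))
print("run-ahead:", run_ahead_cycles(latencies))
```

In this model the run-ahead total can never exceed the lockstep total, because the maximum of the per-lane sums is at most the sum of the per-step maxima. The gap widens as latency becomes more variable across lanes, which is the situation the abstract describes for crossbar contention, bank conflicts, and cache misses.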