Performance of cached DRAM organizations in vector supercomputers
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Traditional supercomputers use a flat multi-bank SRAM memory organization to supply high bandwidth at low latency. Most other computers use a hierarchical organization with a small SRAM cache and slower, cheaper DRAM for main memory. Such systems rely heavily on data locality to achieve optimum performance. This paper evaluates cache-based memory systems for vector supercomputers. We develop a simulation model for a cache-based version of the Cray Research C90 and use the NAS parallel benchmarks to provide a large-scale workload. We show that while caches reduce memory traffic and improve the performance of plain DRAM memory, they still lag behind cacheless SRAM. We identify the performance bottlenecks in DRAM-based memory systems and quantify their contribution to program performance degradation. We find the data fetch strategy to be a significant parameter affecting performance, evaluate the performance of several fetch policies, and show that small fetch sizes improve performance by maximizing the use of available memory bandwidth.
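
The claim about fetch size can be illustrated with a toy calculation. The sketch below is not the simulation model described above. It is a minimal illustration, assuming a single pass over a strided vector with a cold cache and no reuse, of how much fetched data a vector load actually uses. The function name `useful_fraction` and its parameters are hypothetical and chosen only for this example.

```python
def useful_fraction(stride: int, fetch_words: int, n_elements: int = 1024) -> float:
    """Fraction of fetched words that a strided vector load actually uses.

    Assumptions (illustrative only): word-addressed memory, base address 0,
    a cold cache, and each fetch block brought in exactly once.
    """
    # Word addresses touched by the vector access.
    touched = {i * stride for i in range(n_elements)}
    # Every distinct fetch block that contains a touched word is fetched once.
    blocks = {addr // fetch_words for addr in touched}
    fetched_words = len(blocks) * fetch_words
    return len(touched) / fetched_words


if __name__ == "__main__":
    for stride in (1, 2, 8):
        for fetch_words in (1, 4, 16):
            frac = useful_fraction(stride, fetch_words)
            print(f"stride={stride:2d} fetch={fetch_words:2d} words -> useful fraction {frac:.3f}")
```

Under these assumptions, a unit-stride access uses every fetched word regardless of fetch size, while a stride-8 access with a 4-word fetch uses only one word in four. A 1-word fetch wastes no bandwidth at any stride, which is consistent with the direction of the finding above, although the toy model ignores latency, bank conflicts, and temporal reuse.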