An on-chip cache design for vector processors

  • Authors:
  • Akihiro Musa;Yoshiei Sato;Ryusuke Egawa;Hiroyuki Takizawa;Koki Okabe;Hiroaki Kobayashi

  • Affiliations:
  • Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan

  • Venue:
  • MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
  • Year:
  • 2007


Abstract

This paper discusses the potential of an on-chip cache memory for modern vector supercomputers. Vector supercomputers achieve high computational efficiency on compute-intensive scientific applications. The most important factor in their performance is a memory bandwidth high enough to feed the rich set of arithmetic units in time; modern vector supercomputers such as the NEC SX-7 and SX-8 provide a memory bandwidth of 4 bytes per floating-point operation (4B/FLOP). However, the performance gap between memory and processors widens year by year in high performance computing, making it increasingly difficult to sustain a 4B/FLOP memory bandwidth in the design of future vector supercomputers. As a promising way to compensate for the limited memory bandwidth of the vector load/store units of future vector supercomputers, we design an on-chip vector cache for the NEC SX vector processor architecture. This paper evaluates the performance of the on-chip cache on the SX-7 system with a 2B/FLOP or lower memory bandwidth, using two kernel loops and five leading scientific applications. The kernel-loop results demonstrate that a 2B/FLOP memory system with an on-chip cache achieving a 50% hit ratio can match the performance of a 4B/FLOP system without the cache. The results for four of the applications indicate that the on-chip cache improves their sustained performance by 20% to 98%. For the fifth application, loop unrolling conflicts with vector caching, resulting in a poor hit ratio; when loop unrolling is disabled, the hit ratio improves and the sustained performance becomes comparable to that of the 4B/FLOP memory system without loop unrolling.
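The 2B/FLOP-plus-50%-hit-ratio equivalence follows from a simple first-order model: loads served by the on-chip cache consume no memory bandwidth, so the bandwidth seen by the vector units scales as B/(1 − h). The sketch below illustrates that arithmetic only; the function name is illustrative and not from the paper, and the model ignores cache-access latency and write traffic.

```python
def effective_bandwidth(mem_b_per_flop: float, hit_ratio: float) -> float:
    """First-order effective bytes-per-FLOP seen by the vector units
    when a fraction `hit_ratio` of loads hits in the on-chip cache
    (idealized: hits cost no memory bandwidth)."""
    if not 0.0 <= hit_ratio < 1.0:
        raise ValueError("hit ratio must be in [0, 1)")
    return mem_b_per_flop / (1.0 - hit_ratio)

# A 2B/FLOP memory system with a 50% hit ratio matches a 4B/FLOP
# system that has no cache:
print(effective_bandwidth(2.0, 0.5))   # → 4.0
print(effective_bandwidth(4.0, 0.0))   # → 4.0
```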
In addition, selective caching, in which only the data with high locality of reference are cached, is also effective for efficient use of the limited cache capacity.
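The benefit of selective caching can be seen with a toy replacement model (not the SX cache design): when streaming, use-once data shares a small cache with reused data, it can evict the reused lines and destroy the hit ratio; bypassing the low-locality stream protects the reused working set. All names and the LRU policy below are assumptions for illustration.

```python
from collections import OrderedDict

def hit_ratio(accesses, capacity, cacheable=lambda addr: True):
    """Fully associative LRU cache of `capacity` lines. Addresses for
    which cacheable(addr) is False bypass the cache (always served from
    memory). Returns hits over *all* accesses, bypassed ones included."""
    cache = OrderedDict()  # address -> None, ordered oldest-to-newest
    hits = 0
    for addr in accesses:
        if not cacheable(addr):
            continue  # bypassed access goes to memory; counts as a miss
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)
        else:
            cache[addr] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)

# High-locality array A (8 lines, reused every iteration) interleaved
# with a streaming array B (each line touched exactly once):
trace = []
for i in range(1000):
    trace += [("A", j) for j in range(8)]
    trace.append(("B", i))

full      = hit_ratio(trace, capacity=8)
selective = hit_ratio(trace, capacity=8, cacheable=lambda a: a[0] == "A")
# The stream thrashes the unmanaged cache; keeping B out lets the
# reused A lines stay resident.
print(f"cache everything: {full:.2f}, selective: {selective:.2f}")
```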