An on-chip cache design for vector processors

  • Authors:
  • Akihiro Musa;Yoshiei Sato;Ryusuke Egawa;Hiroyuki Takizawa;Koki Okabe;Hiroaki Kobayashi

  • Affiliations:
  • Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan;Tohoku University, Sendai, Japan

  • Venue:
  • MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
  • Year:
  • 2007


Abstract

This paper discusses the potential of an on-chip cache memory for modern vector supercomputers. Vector supercomputers achieve high computational efficiency on compute-intensive scientific applications. The most important factor in their performance is a memory bandwidth high enough to feed the rich set of arithmetic units in time; modern vector supercomputers such as the NEC SX-7 and SX-8 provide a memory bandwidth of 4 bytes per floating-point operation (4B/FLOP). However, the performance gap between memory and processors widens year by year in high performance computing, making it increasingly difficult to sustain a 4B/FLOP memory bandwidth in the design of future vector supercomputers. As a promising way to compensate for the limited memory bandwidth of the vector load/store units of future vector supercomputers, we design an on-chip vector cache for the NEC SX vector processor architecture. This paper evaluates the performance of the on-chip cache on the SX-7 system with a 2B/FLOP or lower memory bandwidth, using two kernel loops and five leading scientific applications. The kernel-loop results demonstrate that a 2B/FLOP memory system with an on-chip cache achieving a 50% hit ratio can match the performance of a 4B/FLOP system without the cache. The results for four of the applications indicate that the on-chip cache improves their sustained performance by 20% to 98%. For the fifth application, loop unrolling conflicts with vector caching, resulting in a poor hit ratio; when loop unrolling is disabled, the hit ratio improves and the sustained performance becomes comparable to that of the 4B/FLOP memory system without loop unrolling.
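The 2B/FLOP-plus-50%-hit-ratio equivalence follows from a simple first-order model: loads served by the on-chip cache consume no memory bandwidth, so the bandwidth seen by the vector units scales as B/(1 − h). The sketch below illustrates that arithmetic only; the function name is illustrative and not from the paper, and the model ignores cache-access latency and write traffic.

```python
def effective_bandwidth(mem_b_per_flop: float, hit_ratio: float) -> float:
    """First-order effective bytes-per-FLOP seen by the vector units
    when a fraction `hit_ratio` of loads hits in the on-chip cache
    (idealized: hits cost no memory bandwidth)."""
    if not 0.0 <= hit_ratio < 1.0:
        raise ValueError("hit ratio must be in [0, 1)")
    return mem_b_per_flop / (1.0 - hit_ratio)

# A 2B/FLOP memory system with a 50% hit ratio matches a 4B/FLOP
# system that has no cache:
print(effective_bandwidth(2.0, 0.5))   # → 4.0
print(effective_bandwidth(4.0, 0.0))   # → 4.0
```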
In addition, selective caching, in which only the data with high locality of reference are cached, is also effective for efficient use of the limited cache capacity.
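The benefit of selective caching can be seen with a toy replacement model (not the SX cache design): when streaming, use-once data shares a small cache with reused data, it can evict the reused lines and destroy the hit ratio; bypassing the low-locality stream protects the reused working set. All names and the LRU policy below are assumptions for illustration.

```python
from collections import OrderedDict

def hit_ratio(accesses, capacity, cacheable=lambda addr: True):
    """Fully associative LRU cache of `capacity` lines. Addresses for
    which cacheable(addr) is False bypass the cache (always served from
    memory). Returns hits over *all* accesses, bypassed ones included."""
    cache = OrderedDict()  # address -> None, ordered oldest-to-newest
    hits = 0
    for addr in accesses:
        if not cacheable(addr):
            continue  # bypassed access goes to memory; counts as a miss
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)
        else:
            cache[addr] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)

# High-locality array A (8 lines, reused every iteration) interleaved
# with a streaming array B (each line touched exactly once):
trace = []
for i in range(1000):
    trace += [("A", j) for j in range(8)]
    trace.append(("B", i))

full      = hit_ratio(trace, capacity=8)
selective = hit_ratio(trace, capacity=8, cacheable=lambda a: a[0] == "A")
# The stream thrashes the unmanaged cache; keeping B out lets the
# reused A lines stay resident.
print(f"cache everything: {full:.2f}, selective: {selective:.2f}")
```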