NoC-aware cache design for multithreaded execution on tiled chip multiprocessors

  • Authors:
  • Ahmed K. Abousamra; Alex K. Jones; Rami G. Melhem

  • Affiliations:
  • University of Pittsburgh, Pittsburgh, PA (all authors)

  • Venue:
  • Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
  • Year:
  • 2011

Abstract

In chip multiprocessors (CMPs), data access latency depends on the memory hierarchy organization, the on-chip interconnect (NoC), and the running workload. Reducing data access latency is vital to achieving performance improvements and scalability of threaded applications. Multithreaded applications generally exhibit sharing of data among the program threads, which generates coherence and data traffic on the NoC. Many NoC designs exploit communication locality to reduce communication latency by configuring special fast paths on which communication is faster than on the rest of the NoC. Communication patterns are directly affected by the cache organization. However, many cache organizations are designed in isolation from the underlying NoC or assume a simple NoC design, thus possibly missing optimization opportunities. In this work, we present a NoC-aware cache design that creates a symbiotic relationship between the NoC and the cache to reduce data access latency, improve utilization of cache capacity, and improve overall system performance. Specifically, considering a NoC designed to exploit communication locality, we design a Unique Private caching scheme that promotes locality in communication patterns. In turn, the NoC exploits this locality to allow fast access to remote data, thus reducing the need for data replication and allowing better utilization of cache capacity. The Unique Private cache stores the data mostly used by a processor core in its locally accessible cache bank, while leveraging dedicated high-speed circuits in the interconnect to provide remote cores with fast access to shared data. Simulations of a suite of scientific and commercial workloads show that our proposed design achieves speedups of 14% and 16% on a 16-core and a 64-core CMP, respectively, over a state-of-the-art NoC-cache co-designed system that also exploits communication locality in multithreaded applications.
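To illustrate the idea behind the Unique Private placement policy described above, here is a minimal, hypothetical sketch: each block is kept as a single copy in the bank of the core that accesses it most, and remote sharers reach it over a configured fast NoC circuit instead of replicating it locally. The class name, counters, and latency values are illustrative assumptions, not the paper's actual mechanism or numbers.

```python
from collections import Counter, defaultdict

# Illustrative latencies (assumed, not from the paper), in cycles:
LOCAL_LATENCY = 3      # hit in the core's own bank
CIRCUIT_LATENCY = 8    # remote hit over a pre-configured fast path
BASELINE_REMOTE = 20   # remote hit over the regular packet-switched NoC

class UniquePrivateCache:
    """Toy model: one copy per block, homed at its heaviest user's bank."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.access_counts = defaultdict(Counter)  # block -> {core: count}
        self.home = {}                             # block -> home bank (core id)

    def access(self, core, block):
        """Record an access and return its latency under this policy."""
        self.access_counts[block][core] += 1
        # Home the block at the core that uses it most ("unique private" bank).
        self.home[block] = self.access_counts[block].most_common(1)[0][0]
        if self.home[block] == core:
            return LOCAL_LATENCY
        # Single copy, no replication: remote sharers use the fast circuit.
        return CIRCUIT_LATENCY

cache = UniquePrivateCache(num_cores=4)
# Core 0 uses block 0x10 heavily; cores 1-3 then read it occasionally.
for _ in range(10):
    cache.access(0, 0x10)
remote_latencies = [cache.access(c, 0x10) for c in (1, 2, 3)]
```

In this sketch the block stays homed at core 0's bank, so core 0 keeps local-hit latency while cores 1-3 pay only the fast-circuit latency rather than either the baseline remote latency or the capacity cost of local replicas, which is the trade-off the abstract argues for.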