A composite and scalable cache coherence protocol for large scale CMPs

  • Authors:
  • Yi Xu;Yu Du;Youtao Zhang;Jun Yang

  • Affiliations:
  • University of Pittsburgh, Pittsburgh, PA, USA;University of Pittsburgh, Pittsburgh, PA, USA;University of Pittsburgh, Pittsburgh, PA, USA;University of Pittsburgh, Pittsburgh, PA, USA

  • Venue:
  • Proceedings of the international conference on Supercomputing
  • Year:
  • 2011

Abstract

The number of on-chip cores in modern chip multiprocessors (CMPs) is growing fast with technology scaling. However, efficiently supporting cache coherence for large-scale CMPs remains a big challenge. Conventional snoopy and directory coherence protocols do not scale smoothly to many-core or thousand-core processors. Snoopy protocols introduce a large power overhead due to the enormous amount of cache tag probing triggered by broadcasts. Directory protocols introduce a performance penalty due to indirection and a large storage overhead due to storing directories. This paper addresses the efficiency problem of supporting cache coherence for large-scale CMPs. By leveraging emerging optical on-chip interconnect (OP-I) technology, which provides high bandwidth density, low propagation delay, and natural support for multicast/broadcast in a hierarchical network organization, we propose a composite cache coherence (C3) protocol that benefits from direct cache-to-cache accesses, as in snoopy protocols, and a small amount of cache probing, as in directory protocols. To complete coherence transactions quickly, C3 organizes accesses in a three-tier hierarchy by combining a mix of designs, including local broadcast prediction, filtering, and a coarse-grained directory. Compared to a directory-based protocol [18], our evaluations on a thousand-core CMP show that C3 improves performance by 21%, reduces the network latency of coherence messages by 41%, and saves network energy consumption by 5.5% on average for PARSEC applications.
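
To make the directory storage concern concrete, the following is a minimal sketch of how sharer-vector size scales with core count, and how a coarse-grained vector (one presence bit per group of cores) shrinks it. It is not the C3 design itself: the function names, the 64-byte cache line, and the group size of 16 are illustrative assumptions, not values taken from the paper.

```python
def directory_bits_per_line(num_cores: int, cores_per_group: int = 1) -> int:
    """Sharer-vector bits tracked for one cache line.

    cores_per_group == 1 models a full-map bit vector (one bit per core).
    Larger values model a coarse-grained vector (one bit per group of cores).
    Both parameters are hypothetical knobs for illustration only.
    """
    if num_cores <= 0 or cores_per_group <= 0:
        raise ValueError("num_cores and cores_per_group must be positive")
    # Ceiling division: a partially filled group still needs its own bit.
    return -(-num_cores // cores_per_group)


def overhead_ratio(num_cores: int, cores_per_group: int = 1,
                   line_bytes: int = 64) -> float:
    """Directory bits divided by data bits for one cache line.

    line_bytes = 64 is an assumed line size, not a figure from the paper.
    """
    return directory_bits_per_line(num_cores, cores_per_group) / (line_bytes * 8)


if __name__ == "__main__":
    cores = 1024  # a "thousand-core" CMP, as in the abstract
    print("full-map bits per line:", directory_bits_per_line(cores))
    print("full-map overhead ratio:", overhead_ratio(cores))
    print("coarse (16 cores/group) bits per line:", directory_bits_per_line(cores, 16))
    print("coarse overhead ratio:", overhead_ratio(cores, 16))
```

Under these assumptions, a full-map vector for 1024 cores needs more bits than the 512-bit data line it tracks, while grouping 16 cores per bit cuts the vector to 64 bits. The trade-off is precision: a coarse bit only says that some core in the group may hold the line, so probes must reach every core in that group.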