TurboTag: lookup filtering to reduce coherence directory power

Authors:
Pejman Lotfi-Kamran;Michael Ferdman;Daniel Crisan;Babak Falsafi
Affiliations:
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;Carnegie Mellon University/École Polytechnique Fédérale de Lausanne, Pittsburgh/Lausanne, USA;École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Venue:
Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design
Year:
2010

Abstract

On-chip coherence directories in today's multi-core systems are not energy efficient: they dissipate a significant fraction of their power on unnecessary lookups when running commercial server and scientific workloads. These workloads have large working sets that exceed the capacity of the on-chip caches of modern processors. Because private caches capture only a small part of the working set, they retain cache blocks only briefly before replacing them with new blocks. Moreover, coherence enforcement is a known performance bottleneck in multi-threaded software, so optimized high-performance software shares little data. Consequently, the majority of accesses to the coherence directory find no sharers because the data are not present in the on-chip private caches, and the power spent on those coherence checks is wasted. To improve energy efficiency for future many-core systems, we propose TurboTag, a filtering mechanism that eliminates needless directory lookups. We analyze full-system traces of server and scientific workloads and find that over 69% of accesses to the directory find no sharers and can be entirely avoided. Taking advantage of this behavior, TurboTag achieves a 58% reduction in the directory's dynamic power consumption.
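
The abstract does not describe how TurboTag is built, so the following is only a minimal sketch of the general idea of lookup filtering, not the authors' design. It assumes a counting Bloom filter as a stand-in filter placed in front of a directory modeled as a Python dictionary; the names `CountingFilter`, `FilteredDirectory`, and all parameters are hypothetical. The filter can answer "definitely no sharers" without touching the directory, which is the case the abstract reports as dominant.

```python
import hashlib


class CountingFilter:
    """Counting Bloom filter: false positives possible, no false negatives."""

    def __init__(self, num_counters=1024, num_hashes=2):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, block_addr):
        # Derive num_hashes counter indexes from one SHA-256 digest.
        digest = hashlib.sha256(block_addr.to_bytes(8, "little")).digest()
        return [
            int.from_bytes(digest[4 * i:4 * i + 4], "little") % len(self.counters)
            for i in range(self.num_hashes)
        ]

    def add(self, block_addr):
        for i in self._indexes(block_addr):
            self.counters[i] += 1

    def remove(self, block_addr):
        for i in self._indexes(block_addr):
            self.counters[i] -= 1

    def may_contain(self, block_addr):
        return all(self.counters[i] > 0 for i in self._indexes(block_addr))


class FilteredDirectory:
    """Hypothetical directory model that skips lookups the filter rules out."""

    def __init__(self):
        self.sharers = {}  # block address -> set of core ids
        self.filter = CountingFilter()
        self.lookups_performed = 0  # full directory lookups (costly)
        self.lookups_filtered = 0   # lookups avoided by the filter

    def add_sharer(self, block_addr, core):
        if block_addr not in self.sharers:
            self.sharers[block_addr] = set()
            self.filter.add(block_addr)  # first sharer: block becomes tracked
        self.sharers[block_addr].add(core)

    def remove_sharer(self, block_addr, core):
        cores = self.sharers.get(block_addr)
        if cores is None or core not in cores:
            return
        cores.discard(core)
        if not cores:
            del self.sharers[block_addr]
            self.filter.remove(block_addr)  # last sharer gone: untrack block

    def lookup(self, block_addr):
        if not self.filter.may_contain(block_addr):
            self.lookups_filtered += 1
            return frozenset()  # no sharers, directory not accessed
        self.lookups_performed += 1
        return frozenset(self.sharers.get(block_addr, ()))
```

In this sketch a filter false positive only costs one unnecessary directory lookup and never returns a wrong sharer set, so correctness does not depend on the filter's accuracy; only the fraction of lookups avoided does.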