Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors

Authors:
Chinnakrishnan S. Ballapuram;Ahmad Sharif;Hsien-Hsin S. Lee
Affiliations:
Intel Corporation, Folsom, CA;Georgia Institute of Technology, Atlanta, GA;School of Electrical and Computer Engineering, Atlanta, GA
Venue:
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Year:
2008

Abstract

Integrating more processor cores on-die has become the unanimous trend in the microprocessor industry. Most current research thrusts use chip multiprocessors (CMPs) as the baseline to analyze problems in various domains. One of the main design issues facing CMP systems is the growing number of snoops required to maintain cache coherency and to support self/cross-modifying code, which leads to power and performance limitations. In this paper, we analyze the internal and external snoop behavior in a CMP system and, to save power, relax the snoopy cache coherence protocol based on program semantics and the properties of shared variables. Based on these observations and analyses, we propose two novel techniques, Selective Snoop Probe (SSP) and Essential Snoop Probe (ESP), to reduce power without compromising performance. Our simulation results show that the SSP technique achieves 5% to 65% data cache energy savings per core across different processor configurations, along with a 1% to 2% performance improvement. We also show that 5% to 82% of data cache energy per core is spent on non-essential snoop probes, which the ESP technique can save.