Using supplier locality in power-aware interconnects and caches in chip multiprocessors

Authors:
Ehsan Atoofian;Amirali Baniasadi
Affiliations:
Department of Electrical and Computer Engineering, University of Victoria, Victoria, B.C., Canada V8W 3P6;Department of Electrical and Computer Engineering, University of Victoria, Victoria, B.C., Canada V8W 3P6
Venue:
Journal of Systems Architecture: the EUROMICRO Journal
Year:
2008

Citing 16
Cited 1

Race-free interconnection networks and multiprocessor consistency

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Using prediction to accelerate coherence protocols

Proceedings of the 25th annual international symposium on Computer architecture
Memory sharing predictor: the key to a speculative coherent DSM

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors

Proceedings of the 2002 international symposium on Low power electronics and design
The sun fireplane system interconnect

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Orion: a power-performance simulator for interconnection networks

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Using hints to reduce the read miss penalty for flat COMA protocols

HICSS '95 Proceedings of the 28th Hawaii International Conference on System Sciences
The Coherence Predictor Cache: A Resource-Efficient and Accurate Coherence Prediction Infrastructure

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Proceedings of the 30th annual international symposium on Computer architecture
JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
A unified theory of shared memory consistency

Journal of the ACM (JACM)
Coarse-Grain Coherence Tracking: RegionScout and Region Coherence Arrays

IEEE Micro

Using partial tag comparison in low-power snoop-based chip multiprocessors

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Conventional snoopy-based chip multiprocessors take an aggressive approach broadcasting snoop requests to all nodes. In addition each node checks all received requests. This approach reduces the latency of cache to cache transfer misses at the expense of increasing power. In this paper we show that a large portion of interconnect/cache transactions are redundant as many snoop requests miss in the remote nodes. We exploit this inefficiency and introduce power optimization techniques for chip multiprocessors. Our optimizations rely on the observation that in a snoopy-based shared memory system the data supplier can be predicted with high accuracy. Our optimizations reduce power by eliminating unnecessary activity at both the requester and the supplier end of snoop requests. We reduce power as we (a) avoid broadcasting snoop requests to all processors and (b) avoid tag lookup for all nodes and for all requests arriving. In particular, we use supplier locality and introduce the following two optimizations. First, and at the requester end, we introduce speculative selective request (SSR) to reduce power dissipation in the binary tree interconnect. In SSR, we send the request only to the node more likely to have the missing data. We reduce power as we limit access only to the interconnect components between the requestor and the supplier node. Second, and at the supplier end, we propose speculative tag lookup (STL) to reduce power dissipation in data caches. We filter those accesses more likely to miss in the L"1 cache. Using shared memory applications, we show that by limiting snoop requests to the speculated nodes we reduce interconnect power by 25% in a four-way multiprocessor. Moreover, we show that speculative tag lookup reduces power in tag arrays by 14.1% in a four-way multiprocessor. Both optimizations come with negligible performance loss and hardware overhead.