Speculative Clustered Caches for Clustered Processors

Authors:
Dana S. Henry; Gabriel H. Loh; Rahul Sami
Venue:
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Year:
2002

Abstract

Clustering is a technique for partitioning a superscalar processor's execution resources to simultaneously allow for more in-flight instructions, wider issue width, and more aggressive clock speeds. As either the size of individual clusters or the total number of clusters increases, the distance to the first-level data cache increases as well. Although clustering may expose more parallelism by allowing a greater number of instructions to be simultaneously analyzed and issued, the gains may be obliterated if the latencies to memory grow too large. We propose to augment each cluster with a small, fast, simple Level Zero (L0) data cache that is accessed in parallel with a traditional L1 data cache. Unlike other proposed caching techniques for clustered processors, our solution does not support versioning or coherence. As a result, a load instruction may occasionally read a stale value from the L0 cache, but the common case is a low-latency hit in the L0 cache. Our simulation studies show that 4KB, 2-way set-associative L0 caches provide a 6.5-12.3% IPC improvement over a wide range of processor configurations.
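
The behavior described in the abstract can be illustrated with a toy model. The sketch below is not the authors' simulator or hardware design. The class name `ToyClusteredMemory`, the latencies of 1 and 4 cycles, the unbounded L0 capacity, and the staleness check by comparison with the L1 value are all assumptions made for illustration. It shows only the core idea: each cluster has a private L0 that is probed alongside the shared L1, a store updates the L1 and the storing cluster's own L0, and because no coherence is maintained, another cluster's L0 can return a stale value.

```python
class ToyClusteredMemory:
    """Toy model: one shared L1 plus a private, non-coherent L0 per cluster.

    All names and latency values here are hypothetical, chosen for illustration.
    """

    def __init__(self, num_clusters=2, l0_latency=1, l1_latency=4):
        self.l1 = {}                                  # shared L1: address -> value
        self.l0 = [{} for _ in range(num_clusters)]   # one private L0 per cluster
        self.l0_latency = l0_latency                  # assumed cycle counts
        self.l1_latency = l1_latency

    def store(self, cluster, addr, value):
        # The store updates the L1 and the storing cluster's own L0.
        # Other clusters' L0 copies are NOT invalidated (no coherence).
        self.l1[addr] = value
        self.l0[cluster][addr] = value

    def load(self, cluster, addr):
        """Return (value, latency, is_stale) for a load issued by `cluster`."""
        l1_value = self.l1.get(addr, 0)   # L1 is probed in parallel with L0
        local = self.l0[cluster]
        if addr in local:
            value = local[addr]
            # Common case: fast L0 hit. The value is speculative and may
            # disagree with the L1 copy if another cluster stored since.
            return value, self.l0_latency, value != l1_value
        # L0 miss: use the L1 value and fill the local L0.
        local[addr] = l1_value
        return l1_value, self.l1_latency, False


mem = ToyClusteredMemory()
mem.store(0, 0x100, 7)         # cluster 0 writes 7
first = mem.load(1, 0x100)     # cluster 1 misses in its L0, reads L1
second = mem.load(1, 0x100)    # cluster 1 now hits in its L0
mem.store(0, 0x100, 9)         # cluster 0 overwrites; cluster 1's L0 is untouched
third = mem.load(1, 0x100)     # cluster 1 hits in L0 but the value is stale
print(first, second, third)
```

In this toy run the third load returns quickly but with an out-of-date value, which is the occasional stale read the abstract mentions. How such a misspeculation is detected and repaired is not stated in the abstract, so the `is_stale` flag here is only a stand-in for whatever verification a real design would use.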