ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs because of their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. In such a microarchitecture, distributing the functional units, register files, and issue queues across multiple clusters reduces the latency of various cycle-time-critical paths, thereby enabling a faster clock. However, instructions per cycle (IPC) suffers if dependent instructions frequently have to communicate values across clusters.

In this paper, we propose several novel extensions that significantly improve the performance of clustered designs. First, we explore a word-interleaved clustered cache in which memory instructions are steered to clusters based on their effective addresses; when the effective address is not yet known, bank prediction directs the memory operation to the appropriate cluster. We then study the scalability of the resulting clustered microarchitecture as the number of clusters is increased, which correspondingly increases inter-cluster communication latency. Our evaluation identifies the key bottlenecks and shows how novel enhancements to the cluster resource allocation mechanisms can significantly improve the scalability of the design. We also show that, for certain programs, communication latency in a highly clustered processor can be reduced by using only a subset of the clusters. Overall, these enhancements achieve a 30% improvement [fill in the correct value] over a baseline design with the clustered cache.
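
To make the address-based steering concrete, here is a minimal Python sketch of a word-interleaved mapping with a fallback bank predictor. It is an illustration under stated assumptions, not the paper's mechanism: the 8-byte word size, the `home_cluster` and `steer` helpers, and the per-PC "last bank" table in `LastBankPredictor` are all hypothetical choices, since the abstract does not specify the word size or the predictor organization.

```python
from typing import Optional

WORD_BYTES = 8  # assumed word size; not stated in the abstract


def home_cluster(address: int, num_clusters: int, word_bytes: int = WORD_BYTES) -> int:
    """Word-interleaved mapping: consecutive words map to consecutive clusters."""
    return (address // word_bytes) % num_clusters


class LastBankPredictor:
    """Hypothetical predictor: remembers the last bank each memory instruction used."""

    def __init__(self, num_clusters: int, table_size: int = 1024) -> None:
        self.num_clusters = num_clusters
        self.table_size = table_size
        self.table = [0] * table_size  # every entry initially predicts cluster 0

    def _index(self, pc: int) -> int:
        # Drop the low PC bits (assumes 4-byte instructions), then wrap into the table.
        return (pc >> 2) % self.table_size

    def predict(self, pc: int) -> int:
        return self.table[self._index(pc)]

    def update(self, pc: int, address: int) -> None:
        # Train with the bank that the resolved address actually maps to.
        self.table[self._index(pc)] = home_cluster(address, self.num_clusters)


def steer(pc: int, address: Optional[int], predictor: LastBankPredictor) -> int:
    """Pick a cluster: use the address if it is known, otherwise the prediction."""
    if address is not None:
        return home_cluster(address, predictor.num_clusters)
    return predictor.predict(pc)
```

In this sketch a caller would invoke `steer` at dispatch and `update` once the effective address resolves. A mispredicted bank would then cost extra inter-cluster communication, which is the penalty the clustered design tries to limit.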