An asymmetric clustered processor based on value content

  • Authors:
  • R. González; A. Cristal; M. Pericas; M. Valero; A. Veidenbaum

  • Affiliations:
  • Universitat Politècnica de Catalunya; Universitat Politècnica de Catalunya; Universitat Politècnica de Catalunya; Universitat Politècnica de Catalunya; University of California, Irvine

  • Venue:
  • Proceedings of the 19th annual international conference on Supercomputing
  • Year:
  • 2005

Abstract

This paper proposes a new organization for clustered processors. Such processors have many advantages, including improved implementability and scalability, reduced power, and, potentially, faster clock speed. Difficulties lie in assigning instructions to clusters (steering) so as to minimize the effect of inter-cluster communication latency. The asymmetric clustered architecture proposed in this paper aims to increase the IPC and reduce power consumption by using two different types of integer clusters and a new steering algorithm. One type is a standard, 64b integer cluster, while the other is a very narrow, 20b cluster. The narrow cluster runs at twice the clock rate of the standard cluster.

A new instruction steering mechanism is proposed to increase the use of the fast, narrow cluster as well as to minimize inter-cluster communication. Steering is performed by a history-based predictor, which is shown to be 98% accurate.

The proposed architecture is shown to have a higher average IPC than its un-clustered equivalent for a four-wide issue processor, something that has never been achieved by previously proposed clustered organizations. Overall, a 3% increase in average IPC over an un-clustered design and an 8% increase over a symmetric clustered design with dependence-based steering are achieved for a 2-cycle inter-cluster communication latency.

Part of the reason for the higher IPC is the ability of the new architecture to execute most of the address computations as narrow, fast operations. The new architecture exploits its early knowledge of partial address values to achieve a 0-cycle address translation for 90% of all address computations, further improving performance.
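
As a rough illustration of width-based steering with a history-based predictor, the sketch below may help. It is not the paper's mechanism: the abstract does not say how narrow values are detected or how the predictor is organized. The signed 20-bit test, the PC-indexed table of 2-bit saturating counters, the table size, and the names `fits_narrow`, `WidthPredictor`, and `steer` are all assumptions made for the example.

```python
NARROW_BITS = 20  # width of the narrow cluster, as stated in the abstract


def fits_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """Return True if a signed integer is representable in `bits` bits.

    Assumption: "narrow" means the value fits in two's complement form.
    """
    limit = 1 << (bits - 1)
    return -limit <= value < limit


class WidthPredictor:
    """Hypothetical PC-indexed table of 2-bit saturating counters.

    Counter values 0 and 1 predict a wide result; 2 and 3 predict narrow.
    """

    def __init__(self, entries: int = 1024) -> None:
        self.entries = entries
        self.table = [0] * entries  # every entry starts at "strongly wide"

    def _index(self, pc: int) -> int:
        # Drop the two low bits (assumes 4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def predict_narrow(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, was_narrow: bool) -> None:
        # Train on the observed result width once the instruction completes.
        i = self._index(pc)
        if was_narrow:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def steer(predictor: WidthPredictor, pc: int) -> str:
    """Pick a cluster for the instruction at `pc` from its width history."""
    return "narrow" if predictor.predict_narrow(pc) else "wide"
```

In this sketch, the instruction's actual result would be checked with `fits_narrow` after execution and fed back through `update`, so an instruction that repeatedly produces small values drifts toward the narrow cluster. A real steering policy would also weigh inter-cluster communication, which this example leaves out.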