An asymmetric clustered processor based on value content

Authors:
R. González;A. Cristal;M. Pericas;M. Valero;A. Veidenbaum
Affiliations:
Universitat Politècnica de Catalunya;Universitat Politècnica de Catalunya;Universitat Politècnica de Catalunya;Universitat Politècnica de Catalunya;University of California, Irvine
Venue:
Proceedings of the 19th annual international conference on Supercomputing
Year:
2005

Citing 28
Cited 4

Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Register renaming and dynamic speculation: an alternative approach

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The energy complexity of register files

ISLPED '98 Proceedings of the 1998 international symposium on Low power electronics and design
A novel renaming scheme to exploit value temporal locality through physical register reuse and unification

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Multiple-banked register file architectures

Proceedings of the 27th annual international symposium on Computer architecture
On pipelining dynamic instruction scheduling logic

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Very low power pipelines using significance compression

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Dynamic zero compression for cache energy reduction

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Frequent value locality and value-centric data cache design

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Focusing processor policies via critical-path prediction

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Energy-effective issue logic

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Energy-efficient hybrid wakeup logic

Proceedings of the 2002 international symposium on Low power electronics and design
The Alpha 21264 Microprocessor

IEEE Micro
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Exploiting data-width locality to increase superscalar execution bandwidth

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Reducing register ports using delayed write-back queues and operand pre-fetch

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Instruction issue logic for pipelined supercomputers

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Energy-Efficient Register Access

SBCCI '00 Proceedings of the 13th symposium on Integrated circuits and systems design
Banked multiported register files for high-frequency superscalar microprocessors

Proceedings of the 30th annual international symposium on Computer architecture
Dynamically managing the communication-parallelism trade-off in future clustered processors

Proceedings of the 30th annual international symposium on Computer architecture
Loose Loops Sink Chips

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Physical Register Inlining

Proceedings of the 31st annual international symposium on Computer architecture

A case for a complexity-effective, width-partitioned microarchitecture

ACM Transactions on Architecture and Code Optimization (TACO)
Design principles for a virtual multiprocessor

Proceedings of the 2007 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
Characterization and exploitation of narrow-width loads: the narrow-width cache approach

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Empowering a helper cluster through data-width aware instruction selection policies

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper proposes a new organization for clustered processors. Such processors have many advantages, including improved implementability and scalability, reduced power, and, potentially, faster clock speed. Difficulties lie in assigning instructions to clusters (steering) so as to minimize the effect of inter-cluster communication latency. The asymmetric clustered architecture proposed in this paper aims to increase the IPC and reduce power consumption by using two different types of integer clusters and a new steering algorithm. One type is a standard, 64b integer cluster, while the other is a very narrow, 20b cluster. The narrow cluster runs at twice the clock rate of the standard cluster.A new instruction steering mechanism is proposed to increase the use of the fast, narrow cluster as well as to minimize inter-cluster communication. Steering is performed by a history-based predictor, which is shown to be 98% accurate.The proposed architecture is shown to have a higher average IPC than its un-clustered equivalent for a four-wide issue processor, something that has never been achieved by previously proposed clustered organizations. Overall, a 3% increase in average IPC over an un-clustered design and a 8% over a symmetric cluster with dependence based steering are achieved for a 2-cycle intercluster communication latency.Part of the reason for higher IPC is the ability of the new architecture to execute most of the address computations as narrow, fast operations. The new architecture exploits its early knowledge of partial address values to achieve a 0-cycle address translation for 90% of all address computations, further improving performance.