The multicluster architecture: reducing cycle time through partitioning

Authors:
Keith I. Farkas; Paul Chow; Norman P. Jouppi; Zvonko Vranesic
Affiliations:
Digital Equipment Corporation, Western Research Lab, 250 University Avenue, Palo Alto, California; Electrical and Computer Engineering, University of Toronto, 10 Kings College Road, Toronto, Ontario, Canada, M5S 3G4; Digital Equipment Corporation, Western Research Lab, 250 University Avenue, Palo Alto, California; Electrical and Computer Engineering, University of Toronto, 10 Kings College Road, Toronto, Ontario, Canada, M5S 3G4
Venue:
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Year:
1997



Abstract

The multicluster architecture that we introduce offers a decentralized, dynamically scheduled architecture in which the register files, dispatch queue, and functional units are distributed across multiple clusters, and each cluster is assigned a subset of the architectural registers. The motivation for the multicluster architecture is to reduce the clock cycle time, relative to a single-cluster architecture with the same number of hardware resources, by reducing the size and complexity of components on critical timing paths. Resource partitioning, however, introduces instruction-execution overhead and may reduce the number of concurrently executing instructions. To counter these two negative by-products of partitioning, we developed a static instruction scheduling algorithm. We describe this algorithm and, using trace-driven simulations of SPEC92 benchmarks, evaluate its effectiveness. This evaluation indicates that, for the configurations considered, the multicluster architecture may have significant performance advantages at feature sizes below 0.35 µm, and warrants further investigation.
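
To make the partitioning idea concrete, the following is a minimal sketch, not the authors' method. It assumes two clusters and a hypothetical even/odd mapping of architectural registers to clusters; neither detail is stated in the abstract. It counts the instructions in a toy trace whose registers span more than one cluster, which is one plausible proxy for the instruction-execution overhead that partitioning introduces. The names `Instr`, `cluster_of`, `clusters_touched`, and `count_cross_cluster` are illustrative only.

```python
from typing import List, NamedTuple, Set, Tuple

NUM_CLUSTERS = 2  # assumption: the abstract does not state a cluster count


class Instr(NamedTuple):
    """A toy instruction: one destination register and its source registers."""
    dest: int
    srcs: Tuple[int, ...]


def cluster_of(reg: int) -> int:
    """Hypothetical mapping: interleave architectural registers by number."""
    return reg % NUM_CLUSTERS


def clusters_touched(instr: Instr) -> Set[int]:
    """Clusters that own at least one register named by the instruction."""
    return {cluster_of(r) for r in (instr.dest, *instr.srcs)}


def count_cross_cluster(trace: List[Instr]) -> int:
    """Count instructions whose registers span more than one cluster."""
    return sum(1 for instr in trace if len(clusters_touched(instr)) > 1)


trace = [
    Instr(dest=2, srcs=(4, 6)),  # all even registers: a single cluster
    Instr(dest=3, srcs=(1, 5)),  # all odd registers: a single cluster
    Instr(dest=2, srcs=(3, 4)),  # mixes even and odd: spans both clusters
]
print(count_cross_cluster(trace))  # prints 1
```

Under these assumptions, a register assignment that lowers this count while keeping work spread across the clusters would address both by-products named in the abstract. The paper's actual scheduling algorithm and cost model are not given in this record.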