Region-based hierarchical operation partitioning for multicluster processors

Authors:
Michael Chu;Kevin Fan;Scott Mahlke
Affiliations:
University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI
Venue:
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Year:
2003

Citing 19
Cited 25

Partitioned register files for VLIWs: a preliminary analysis of tradeoffs

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiflow trace scheduling compiler

The Journal of Supercomputing - Special issue on instruction-level parallelism
Iterative modulo scheduling: an algorithm for software pipelining loops

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Region-based compilation: an introduction and motivation

Proceedings of the 28th annual international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Effective cluster assignment for modulo scheduling

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
High-quality operation binding for clustered VLIW datapaths

Proceedings of the 38th annual Design Automation Conference
Affinity-based cluster assignment for unrolled loops

ICS '02 Proceedings of the 16th international conference on Supercomputing
Slack: maximizing performance under technological constraints

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Graph-partitioning based instruction scheduling for clustered processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors

IEEE Transactions on Parallel and Distributed Systems
Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Convergent scheduling

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Very Long Instruction Word architectures and the ELI-512

ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
A New Heuristic for Scheduling Parallel Programs on Multiprocessor

PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Instruction Scheduling for Clustered VLIW DSPs

PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Global Register Partitioning

PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
CARS: A New Code Generation Framework for Clustered ILP Processors

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture

Increasing the number of effective registers in a low-power processor using a windowed register file

Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized Datapaths

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Cost-Sensitive Partitioning in an Architecture Synthesis System for Multicluster Processors

IEEE Micro
Partitioning Variables across Register Windows to Reduce Spill Code in a Low-Power Processor

IEEE Transactions on Computers
A Distributed Control Path Architecture for VLIW Processors

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Allocating Non-Real-Time and Soft Real-Time Jobs in Multiclusters

IEEE Transactions on Parallel and Distributed Systems
Compiler-directed Data Partitioning for Multicluster Processors

Proceedings of the International Symposium on Code Generation and Optimization
Compiler-assisted leakage energy optimization for clustered VLIW architectures

EMSOFT '06 Proceedings of the 6th ACM & IEEE International conference on Embedded software
Inter-cluster communication in VLIW architectures

ACM Transactions on Architecture and Code Optimization (TACO)
Virtual Cluster Scheduling Through the Scheduling Graph

Proceedings of the International Symposium on Code Generation and Optimization
Code and data partitioning for fine-grain parallelism

Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Interactive presentation: Time-constrained clustering for DSE of clustered VLIW-ASP

Proceedings of the conference on Design, automation and test in Europe
Modulo scheduling for highly customized datapaths to increase hardware reusability

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Compiling for vector-thread architectures

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
REDEFINE: Runtime reconfigurable polymorphic ASIC

ACM Transactions on Embedded Computing Systems (TECS)
Compiler-assisted instruction decoder energy optimization for clustered VLIW architectures

HiPC'07 Proceedings of the 14th international conference on High performance computing
Performance evaluation of scheduling applications with DAG topologies on multiclusters with independent local schedulers

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Compiler-assisted power optimization for clustered VLIW architectures

Parallel Computing
Exploring energy-performance trade-offs for heterogeneous interconnect clustered VLIW processors

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
SIMD defragmenter: efficient ILP realization on data-parallel architectures

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Compiler-assisted energy optimization for clustered VLIW processors

Journal of Parallel and Distributed Computing
Compiler supports for VLIW DSP processors with SIMD intrinsics

Concurrency and Computation: Practice & Experience
Low cost control flow protection using abstract control signatures

Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
A constraint programming approach for integrated spatial and temporal scheduling for clustered architectures

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates for the quality of partitions. We found that our algorithm was able to achieve an average of 20% improvement in DSP kernels and 5% improvement in SPECint2000 for a four-cluster architecture.