Instruction scheduling for a tiled dataflow architecture

Authors:
Martha Mercaldi;Steven Swanson;Andrew Petersen;Andrew Putnam;Andrew Schwerin;Mark Oskin;Susan J. Eggers
Affiliations:
University of Washington;University of Washington;University of Washington;University of Washington;University of Washington;University of Washington;University of Washington
Venue:
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Year:
2006

Abstract

This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule.

Our analysis reveals that at this bottom level, good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions. We develop a parameterizable instruction scheduler that more effectively optimizes this trade-off. We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application. To avoid this application-specific tuning, we also determine the parameters that produce the best performance across all applications. The result is a contention-latency setting that generates instruction schedules for all applications in our workload that come within 17% of the best schedule for each.
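The contention-latency trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's scheduler: the greedy strategy, the `place` function, the `alpha` weight, the Manhattan-distance latency model, and the 2x2 grid are all assumptions chosen only to show how a single parameter might shift a placement between packing instructions together (low operand latency, high contention) and spreading them out (low contention, high latency).

```python
from itertools import product


def place(instructions, producers, grid=(2, 2), alpha=0.5):
    """Greedily assign each instruction to a processing element (PE).

    instructions: instruction names in topological (producer-first) order.
    producers:    dict mapping an instruction to the instructions feeding it.
    grid:         hypothetical PE grid dimensions (rows, columns).
    alpha:        hypothetical knob; 1.0 weighs only operand latency,
                  0.0 weighs only contention for a PE.
    """
    pes = list(product(range(grid[0]), range(grid[1])))
    load = {pe: 0 for pe in pes}  # instructions already assigned to each PE
    placement = {}

    for inst in instructions:
        def cost(pe):
            # Assumed latency model: Manhattan distance to each producer's PE.
            latency = sum(
                abs(pe[0] - placement[p][0]) + abs(pe[1] - placement[p][1])
                for p in producers.get(inst, ())
            )
            # Assumed contention model: how many instructions share this PE.
            return alpha * latency + (1 - alpha) * load[pe]

        best = min(pes, key=cost)  # ties go to the first PE in row-major order
        placement[inst] = best
        load[best] += 1

    return placement


if __name__ == "__main__":
    chain = ["a", "b", "c", "d"]
    deps = {"b": ["a"], "c": ["b"], "d": ["c"]}
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, place(chain, deps, alpha=alpha))
```

For a four-instruction dependence chain, `alpha=1.0` places every instruction on the same PE, while `alpha=0.0` spreads the chain across all four PEs. Sweeping `alpha` per application and then choosing one value for the whole workload mirrors, in miniature, the tuning experiment the abstract describes.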