Processor coupling: integrating compile time and runtime scheduling for parallelism

Authors:
Stephem W. Keckler;William J. Dally
Affiliations:
-;-
Venue:
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Year:
1992

Citing 14
Cited 38

Bulldog: a compiler for VLSI architectures

Bulldog: a compiler for VLSI architectures
Dataflow architectures

Annual review of computer science vol. 1, 1986
A VLIW architecture for a trace Scheduling Compiler

IEEE Transactions on Computers - Special issue on architectural support for programming languages and operating systems
Toward a dataflow/von Neumann hybrid architecture

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
MASA: a multithreaded processor architecture for parallel symbolic computing

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Circuit Simulation on Shared-Memory Multiprocessors

IEEE Transactions on Computers
The horizon supercomputing system: architecture and software

Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Available instruction-level parallelism for superscalar and superpipelined machines

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: preliminary results

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
A variable instruction stream extension to the VLIW architecture

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Architecture and implementation of a VLIW supercomputer

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
The Tera computer system

ICS '90 Proceedings of the 4th international conference on Supercomputing
A Mechanism for Efficient Context Switching

ICCD '91 Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI in Computer & Processors

Exploiting instruction-level parallelism: the multithreaded approach

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The J-machine multicomputer: an architectural evaluation

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Super-threading: architectural and software mechanisms for optimizing parallel computation

ICS '93 Proceedings of the 7th international conference on Supercomputing
Interleaving: a multithreading technique targeting multiprocessors and workstations

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Hardware support for fast capability-based addressing

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Ordered multithreading: a novel technique for exploiting thread-level parallelism

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Increasing superscalar performance through multistreaming

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Single-program speculative multithreading (SPSM) architecture: compiler-assisted fine-grained multithreading

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
The M-Machine multicomputer

Proceedings of the 28th annual international symposium on Microarchitecture
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Limits on the performance benefits of multithreading and prefetching

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Thread scheduling for cache locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Multithreading with Distributed Functional Units

IEEE Transactions on Computers
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

ACM Transactions on Computer Systems (TOCS)
High-Throughput, Low-Memory Applications on the Pica Architecture

IEEE Transactions on Parallel and Distributed Systems
Simultaneous multithreading: maximizing on-chip parallelism

25 years of the international symposia on Computer architecture (selected papers)
Design Alternatives of Multithreaded Architecture

International Journal of Parallel Programming
Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads

CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Simultaneous Multithreading: A Platform for Next-Generation Processors

IEEE Micro
Weld: A Multithreading Technique Towards Latency-Tolerant VLIW Processors

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
A Fine-Grain Threaded Abstract Machine

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Toward a General-Purpose Multi-Stream System

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Processor Mechanisms for Software Shared Memory

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Combined DRAM and logic chip for massively parallel systems

ARVLSI '95 Proceedings of the 16th Conference on Advanced Research in VLSI (ARVLSI'95)
Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
The Named-State Register File: Implementation and Performance

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Design and performance evaluation of a multithreaded architecture

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Fine-grain multi-thread processor architecture for massively parallel processing

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Timed Petri net models of multithreaded multiprocessor architectures

PNPM '97 Proceedings of the 6th International Workshop on Petri Nets and Performance Models
Dynamically managing the communication-parallelism trade-off in future clustered processors

Proceedings of the 30th annual international symposium on Computer architecture
Controlling the data space of tree structured computations

Information and Computation
Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures

Proceedings of the 18th annual international conference on Supercomputing
Extended Split-Issue: Enabling Flexibility in the Hardware Implementation of NUAL VLIW DSPs

Proceedings of the 31st annual international symposium on Computer architecture
High-Performance and Low-Cost Dual-Thread VLIW Processor Using Weld Architecture Paradigm

IEEE Transactions on Parallel and Distributed Systems
EXECUBE-A New Architecture for Scaleable MPPs

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
The future of interconnection technology

IBM Journal of Research and Development
A multithreaded multicore system for embedded media processing

Transactions on high-performance embedded architectures and compilers III

Quantified Score

Hi-index	0.00

Visualization

Abstract

The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-thread parallelism, by using compile time and runtime scheduling. The compiler statically schedules individual threads to discover available intra-thread instruction-level parallelism. The runtime scheduling mechanism interleaves threads, exploiting inter-thread parallelism to maintain high ALU utilization. ALUs are assigned to threads on a cycle by cycle basis, and several threads can be active concurrently. We provide simulation results demonstrating that, on four simple numerical benchmarks, processor coupling achieves better performance than purely statically scheduled or multi-processor machine organizations. We examine how performance is affected by restricted communication between ALUs and by long memory latencies. We also present an implementation and feasibility study of a processor coupled node.