Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor

Authors:
Stephen W. Keckler;William J. Dally;Daniel Maskit;Nicholas P. Carter;Andrew Chang;Whay S. Lee
Affiliations:
Computer Systems Laboratory, Stanford University, Gates CS Building Stanford, CA;Computer Systems Laboratory, Stanford University, Gates CS Building Stanford, CA;Computer Systems Laboratory, Stanford University, Gates CS Building Stanford, CA;Computer Systems Laboratory, Stanford University, Gates CS Building Stanford, CA;Computer Systems Laboratory, Stanford University, Gates CS Building Stanford, CA;Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA
Venue:
Proceedings of the 25th annual international symposium on Computer architecture
Year:
1998

Citing 14
Cited 23

Mul-T: a high-performance parallel Lisp

PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Limits of instruction-level parallelism

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
T: a multithreaded massively parallel architecture

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Experience with fine-grain synchronization in MIMD machines for preconditioned conjugate gradient

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallel programming in Split-C

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
The multiflow trace scheduling compiler

The Journal of Supercomputing - Special issue on instruction-level parallelism
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The M-Machine multicomputer

Proceedings of the 28th annual international symposium on Microarchitecture
Evaluation of design alternatives for a multiprocessor microprocessor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The Tera computer system

ICS '90 Proceedings of the 4th international conference on Supercomputing
The impact of synchronization and granularity on parallel systems

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Application Performance on the MIT Alewife Machine

Computer
Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture

Improving the performance of speculatively parallel applications on the Hydra CMP

ICS '99 Proceedings of the 13th international conference on Supercomputing
Concurrent Event Handling through Multithreading

IEEE Transactions on Computers
The Stanford Hydra CMP

IEEE Micro
Limits of Task-Based Parallelism in Irregular Applications

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
A Register File Architecture and Compilation Scheme for Clustered ILP Processors

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Processor Mechanisms for Software Shared Memory

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Using thread-level speculation to simplify manual parallelization

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
VLSI Architecture: Past, Present, and Future

ARVLSI '99 Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI
ReEnact: using thread-level speculation mechanisms to debug data races in multithreaded codes

Proceedings of the 30th annual international symposium on Computer architecture
Spin Detection Hardware for Improved Management of Multithreaded Systems

IEEE Transactions on Parallel and Distributed Systems
Circulating shared-registers for multiprocessor systems

Journal of Systems Architecture: the EUROMICRO Journal
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Real-time rendering systems in 2010

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
A wire delay-tolerant reconfigurable unit for a clustered programmable-reconfigurable processor

Microprocessors & Microsystems
The WaveScalar architecture

ACM Transactions on Computer Systems (TOCS)
Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

Proceedings of the 34th annual international symposium on Computer architecture
Implementation and Evaluation of a Dynamically Routed Processor Operand Network

NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
Synchronization coherence: A transparent hardware mechanism for cache coherence and fine-grained synchronization

Journal of Parallel and Distributed Computing
Architectural Support for Fair Reader-Writer Locking

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
ReMAP: A Reconfigurable Heterogeneous Multicore Architecture

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Landing stencil code on Godson-T

Journal of Computer Science and Technology
TLSync: support for multiple fast barriers using on-chip transmission lines

Proceedings of the 38th annual international symposium on Computer architecture
Synchronization mechanisms on modern multi-core architectures

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-size of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As this fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications.