Hardware-modulated parallelism in chip multiprocessors

Authors:
Julia Chen;Philo Juang;Kevin Ko;Gilberto Contreras;David Penry;Ram Rangan;Adam Stoler;Li-Shiuan Peh;Margaret Martonosi
Affiliations:
Princeton University;Princeton University;Princeton University;Princeton University;Princeton University;Princeton University;Princeton University;Princeton University;Princeton University
Venue:
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Year:
2005

Citing 31
Cited 9

The Manchester prototype dataflow computer

Communications of the ACM - Special section on computer architecture
The Epsilon dataflow processor

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Executing a Program on the MIT Tagged-Token Dataflow Architecture

IEEE Transactions on Computers
The J-machine multicomputer: an architectural evaluation

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The M-Machine multicomputer

Proceedings of the 28th annual international symposium on Microarchitecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Thread scheduling for multiprogrammed multiprocessors

Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
A bandwidth-efficient architecture for media processing

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
Benchmarking and comparison of the task graph scheduling algorithms

Journal of Parallel and Distributed Computing
Pthreads for dynamic and irregular parallelism

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
An instruction set and microarchitecture for instruction level distributed processing

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Hypertool: A Programming Aid for Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
Virtual-Channel Flow Control

IEEE Transactions on Parallel and Distributed Systems
Overview of the Monsoon Project

ICCD '91 Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI in Computer & Processors
Efficient Interconnects for Clustered Microarchitectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Two Fundamental Limits on Dataflow Multiprocessing

PACT '93 Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Microarchitectural exploration with Liberty

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Proceedings of the 30th annual international symposium on Computer architecture
WaveScalar

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Scheduling Algorithms

Scheduling Algorithms
Principles and Practices of Interconnection Networks

Principles and Practices of Interconnection Networks
Synchroscalar: A Multiple Clock Domain, Power-Aware, Tile-Based Embedded Processor

Proceedings of the 31st annual international symposium on Computer architecture
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Proceedings of the 31st annual international symposium on Computer architecture
Programming with transactional coherence and consistency (TCC)

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Decoupled Software Pipelining with the Synchronization Array

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques

Efficient pattern mining on shared memory systems: implications for chip multiprocessor architectures

Proceedings of the 2006 workshop on Memory system performance and correctness
CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Carbon: architectural support for fine-grained parallelism on chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
Visions for application development on hybrid computing systems

Parallel Computing
Leveraging on-chip networks for data cache migration in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Scalable hardware support for conditional parallelization

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A moving threads processor architecture MTPA

The Journal of Supercomputing
An efficient and flexible task management for many cores

Transactions on High-Performance Embedded Architectures and Compilers IV
Single thread program parallelism with dataflow abstracting thread

ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II

Quantified Score

Hi-index	0.00

Visualization

Abstract

Chip multi-processors (CMPs) already have widespread commercial availability, and technology roadmaps project enough on-chip transistors to replicate tens or hundreds of current processor cores. How will we express parallelism, partition applications, and schedule/place/migrate threads on these highly-parallel CMPs?This paper presents and evaluates a new approach to highly-parallel CMPs, advocating a new hardware-software contract. The software layer is encouraged to expose large amounts of multi-granular, heterogeneous parallelism. The hardware, meanwhile, is designed to offer low-overhead, low-area support for orchestrating and modulating this parallelism on CMPs at runtime. Specifically, our proposed CMP architecture consists of architectural and ISA support targeting thread creation, scheduling and context-switching, designed to facilitate effective hardware run-time mapping of threads to cores at low overheads.Dynamic modulation of parallelism provides the ability to respond to run-time variability that arises from dataset changes, memory system effects and power spikes and lulls, to name a few. It also naturally provides a long-term CMP platform with performance portability and tolerance to frequency and reliability variations across multiple CMP generations. Our simulations of a range of applications possessing do-all, streaming and recursive parallellism show speedups of 4-11.5X and energy-delay-product savings of 3.8X, on average, on a 16-core vs. a 1-core system. This is achieved with modest amounts of hardware support that allows for low overheads in thread creation, scheduling and context-switching. In particular, our simulations motivated the need for hardware support, showing that the large thread management overheads of current run-time software systems can lead to up to 6.5X slowdown. The difficulties faced in static scheduling were shown in our simulations with a static scheduling algorithm, fed with oracle profiled inputs suffering up to 107% slowdown compared to NDP's hardware scheduler, due to its inability to handle memory system variabilities. More broadly, we feel that the ideas presented here show promise for scaling to the systems expected in ten years, where the advantages of high transistor counts may be dampened by difficulties in circuit variations and reliability. These issues will make dynamic scheduling and adaptation mandatory; our proposals represent a first step towards that direction.