Carbon: architectural support for fine-grained parallelism on chip multiprocessors

Authors:
Sanjeev Kumar;Christopher J. Hughes;Anthony Nguyen
Affiliations:
Intel Corp., Santa Clara, CA;Intel Corp., Santa Clara, CA;Intel Corp., Santa Clara, CA
Venue:
Proceedings of the 34th annual international symposium on Computer architecture
Year:
2007

Abstract

Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, to harness the additional compute resources of a CMP, applications must expose their thread-level parallelism to the hardware. One common approach is to decompose a program into parallel "tasks" and let an underlying software layer schedule these tasks onto different threads. Software task scheduling can provide good parallel performance as long as tasks are large compared to the software overheads. We examine a set of applications from an important emerging domain: Recognition, Mining, and Synthesis (RMS). Many RMS applications are compute-intensive and have abundant thread-level parallelism, and are therefore good targets for running on a CMP. However, a significant number have small tasks, for which software task schedulers achieve only limited parallel speedups. We propose Carbon, a hardware technique to accelerate dynamic task scheduling on scalable CMPs. Carbon has relatively simple hardware, most of which can be placed far from the cores. We compare Carbon to some highly tuned software task schedulers for a set of RMS benchmarks with small tasks. Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, it is 68% faster on a set of loop-parallel benchmarks and 109% faster on a set of task-parallel benchmarks.
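
To make the software baseline concrete, the following is a minimal sketch of the kind of software task scheduling the abstract describes: a program is split into tasks, and a software layer hands them to worker threads from a shared queue. It is an illustration only, not the schedulers evaluated in the paper and not Carbon itself. The names `run_tasks` and `make_task`, the worker count, and the chunk size are all assumptions chosen for the example. Each task pays a queue operation, which is the per-task overhead that starts to dominate when tasks are small.

```python
import queue
import threading


def run_tasks(tasks, num_workers=4):
    """Run callables from one shared queue on num_workers threads.

    Hypothetical helper for illustration; returns results in task order.
    """
    work = queue.Queue()
    for index, task in enumerate(tasks):
        work.put((index, task))
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                # Every dequeue is scheduling overhead paid once per task.
                index, task = work.get_nowait()
            except queue.Empty:
                return  # all tasks were enqueued up front, so empty means done
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def make_task(lo, hi):
    """Build one task: sum of squares over the half-open range [lo, hi)."""
    return lambda: sum(i * i for i in range(lo, hi))


# Decompose one loop over 16000 elements into 16 tasks of 1000 iterations each.
chunks = [make_task(lo, lo + 1000) for lo in range(0, 16000, 1000)]
total = sum(run_tasks(chunks))
```

Shrinking the chunk size in this sketch creates more tasks and more queue operations for the same total work, which is the regime where the abstract reports that software schedulers achieve only limited speedups.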