Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, to harness the additional compute resources of a CMP, applications must expose their thread-level parallelism to the hardware. One common approach is to decompose a program into parallel "tasks" and let an underlying software layer schedule these tasks onto different threads. Software task scheduling can provide good parallel performance as long as tasks are large compared to the software overheads. We examine a set of applications from an important emerging domain: Recognition, Mining, and Synthesis (RMS). Many RMS applications are compute-intensive and have abundant thread-level parallelism, and are therefore good targets for running on a CMP. However, a significant number have small tasks, for which software task schedulers achieve only limited parallel speedups. We propose Carbon, a hardware technique to accelerate dynamic task scheduling on scalable CMPs. Carbon has relatively simple hardware, most of which can be placed far from the cores. We compare Carbon to some highly tuned software task schedulers on a set of RMS benchmarks with small tasks. Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, it is 68% faster on a set of loop-parallel benchmarks and 109% faster on a set of task-parallel benchmarks.
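To make the notion of software task scheduling concrete, the following is a minimal Python sketch of a dynamic scheduler in which worker threads pull tasks from one shared queue. It is an illustration under stated assumptions, not Carbon's hardware design and not one of the tuned software schedulers evaluated here; the names `run_tasks`, `make_square_task`, and `num_workers` are hypothetical.

```python
import queue
import threading


def run_tasks(tasks, num_workers=4):
    """Run zero-argument callables on worker threads; return results in task order.

    Hypothetical helper for illustration only.
    """
    pending = queue.Queue()
    for index, task in enumerate(tasks):
        pending.put((index, task))

    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                # Each dequeue synchronizes on the shared queue. This cost is
                # paid once per task, so it matters most when tasks are small.
                index, task = pending.get_nowait()
            except queue.Empty:
                return  # no work left; this sketch never enqueues new tasks
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def make_square_task(x):
    """Build a deliberately tiny task (assumed workload, for illustration)."""
    return lambda: x * x


squares = run_tasks([make_square_task(i) for i in range(8)])
```

The sketch shows only the structure of the approach: a pool of tasks and threads that dequeue them dynamically. CPython's global interpreter lock prevents it from demonstrating real parallel speedup. The point is that every task pays a fixed dequeue and synchronization cost, which is the overhead that becomes significant when tasks are small.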