Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs

Authors:
M. Aater Suleman;Moinuddin K. Qureshi;Yale N. Patt
Affiliations:
The University of Texas at Austin, Austin, TX;T. J. Watson Research Center, Yorktown Heights, NY;The University of Texas at Austin, Austin, TX
Venue:
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Year:
2008

Citing 14
Cited 27

A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors

ACM Transactions on Computer Systems (TOCS)
Using parallel program characteristics in dynamic processor allocation policies

Performance Evaluation
Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator

ACM Transactions on Modeling and Computer Simulation (TOMACS) - Special issue on uniform random number generation
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Compiling Several Classes of Communication Patterns on a Multithreaded Architecture

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Maximizing Speedup through Self-Tuning of Processor Allocation

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Exploring the Design Space of Future CMPs

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance

Proceedings of the 31st annual international symposium on Computer architecture
The OpenMP Source Code Repository

PDP '05 Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing
Niagara: A 32-Way Multithreaded Sparc Processor

IEEE Micro
Performance-Driven Processor Allocation

IEEE Transactions on Parallel and Distributed Systems
Evaluating the potential of multithreaded platforms for irregular scientific computations

Proceedings of the 4th international conference on Computing frontiers
IBM Power5 Chip: A Dual-Core Multithreaded Processor

IEEE Micro

Accelerating critical section execution with asymmetric multi-core architectures

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Multicore diversity: a software developer's nightmare

ACM SIGOPS Operating Systems Review
Dynamic performance tuning for speculative threads

Proceedings of the 36th annual international symposium on Computer architecture
Adapting application execution in CMPs using helper threads

Journal of Parallel and Distributed Computing
Maximizing power efficiency with asymmetric multicore systems

Communications of the ACM - Finding the Fun in Computer Science Education
Performance balancing: software-based on-chip memory management for effective CMP executions

Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
Exposing parallelism and locality in a runtime parallel optimization framework

Proceedings of the 7th ACM international conference on Computing frontiers
An approach to resource-aware co-scheduling for CMPs

Proceedings of the 24th ACM International Conference on Supercomputing
Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications

Proceedings of the 37th annual international symposium on Computer architecture
An integrated GPU power and performance model

Proceedings of the 37th annual international symposium on Computer architecture
Modeling critical sections in Amdahl's law and its implications for multicore design

Proceedings of the 37th annual international symposium on Computer architecture
Adaptive multi-threading for dynamic workloads in embedded multiprocessors

SBCCI '10 Proceedings of the 23rd symposium on Integrated circuits and system design
Feedback-directed pipeline parallelism

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
How many threads to spawn during program multithreading?

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Parallelism orchestration using DoPE: the degree of parallelism executive

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Pervasive parallelism for managed runtimes

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Leveraging Core Specialization via OS Scheduling to Improve Performance on Asymmetric Multicore Systems

ACM Transactions on Computer Systems (TOCS)
Performance driven cooperation between kernel and auto-tuning multi-threaded interval b&b applications

ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part I
Dynamically dispatching speculative threads to improve sequential execution

ACM Transactions on Architecture and Code Optimization (TACO)
Coalition threading: combining traditional and non-traditional parallelism to maximize scalability

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
When less is more (LIMO): controlled parallelism for improved efficiency

Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
Power and Performance Management of GPUs Based Cluster

International Journal of Cloud Applications and Computing
Holistic run-time parallelism management for time and energy efficiency

Proceedings of the 27th international ACM conference on International conference on supercomputing
Adaptive parallelism for web search

Proceedings of the 8th ACM European Conference on Computer Systems
Load-balanced pipeline parallelism

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Threadguide: profiler assisted application adaptation on CMP

Proceedings of the 5th IBM Collaborative Academia Research Exchange Workshop
Efficient multiprogramming for multicores with SCAF

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Abstract

Extracting high performance from emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, some multi-threaded applications will benefit from the increased number of threads, whereas the performance of others will become limited by data-synchronization or off-chip bandwidth. For applications limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application is limited by data-synchronization, by bandwidth, or by neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power. This paper proposes Feedback-Driven Threading (FDT), a framework to dynamically control the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization. Our evaluation shows that SAT can reduce execution time by up to 66% and power by up to 78%. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, average execution time is reduced by 17% and power by 59%. The proposed techniques leverage existing performance counters and require minimal support from the threading library.
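
The abstract does not give the prediction formulas, so the following is only a minimal sketch of how SAT- and BAT-style thread counts could be estimated from sampled measurements. It assumes a simple model that is not stated above: execution time with P threads is `t_parallel / P + P * t_critical` (critical-section time grows with contention), and off-chip bus utilization scales linearly with the thread count. All function and parameter names are hypothetical.

```python
import math


def sat_threads(t_parallel, t_critical, max_threads):
    """Synchronization-aware estimate (assumed model, not taken from the abstract).

    Minimizes T(P) = t_parallel / P + P * t_critical, whose continuous
    optimum is P* = sqrt(t_parallel / t_critical).
    """
    if t_critical <= 0:
        # No measured synchronization cost: use every available core.
        return max_threads
    p_star = math.sqrt(t_parallel / t_critical)
    # Check the integers on both sides of the continuous optimum.
    candidates = {max(1, math.floor(p_star)), max(1, math.ceil(p_star))}
    best = min(candidates, key=lambda p: (t_parallel / p + p * t_critical, p))
    return min(best, max_threads)


def bat_threads(bus_util_one_thread, max_threads):
    """Bandwidth-aware estimate (assumed linear scaling of bus utilization).

    bus_util_one_thread is the fraction (0..1) of off-chip bus bandwidth
    consumed by a single thread; the result is the smallest thread count
    whose combined demand reaches full utilization.
    """
    if bus_util_one_thread <= 0:
        return max_threads
    needed = math.ceil(1.0 / bus_util_one_thread)
    return min(max_threads, max(1, needed))


def combined_threads(t_parallel, t_critical, bus_util_one_thread, max_threads):
    """Combine both limits by taking the more restrictive estimate."""
    return min(
        sat_threads(t_parallel, t_critical, max_threads),
        bat_threads(bus_util_one_thread, max_threads),
    )
```

In a real system the inputs would come from a short sampling phase using hardware performance counters; here they are plain numbers so that the arithmetic is easy to follow.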