Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors

Authors:
Abhishek Bhattacharjee;Margaret Martonosi
Affiliations:
Princeton University, Princeton, NJ, USA;Princeton University, Princeton, NJ, USA
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009

Abstract

With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computing systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical. If one can accurately predict critical threads in a parallel program, then one can respond in a variety of ways. Possibilities include running the critical thread at a faster clock rate, performing load balancing techniques to offload work onto currently non-critical threads, or giving the critical thread more on-chip resources to execute faster. This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications. We show that accurate predictors can be built using counters that are typically already available on-chip. Our predictor, based on memory hierarchy statistics, identifies thread criticality with an average accuracy of 93% across a range of architectures. We also demonstrate two applications of our predictor. First, we show how Intel's Threading Building Blocks (TBB) parallel runtime system can benefit from task stealing techniques that use our criticality predictor to reduce load imbalance. Using criticality prediction to guide TBB's task-stealing decisions improves performance by 13-32% for TBB-based PARSEC benchmarks running on a 32-core CMP. As a second application, criticality prediction guides dynamic energy optimizations in barrier-based applications. By running the predicted critical thread at the full clock rate and frequency-scaling non-critical threads, this approach achieves average energy savings of 15% while negligibly degrading performance for SPLASH-2 and PARSEC benchmarks.
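
The abstract says only that the predictor is built from memory hierarchy statistics held in on-chip counters; it does not give a formula. The sketch below is therefore a hypothetical illustration of the idea, not the authors' design. The counter layout (`CoreCounters`), the miss weights, the frequency levels, and all function names are assumptions introduced here. It scores each thread by weighted cache misses, treats the highest-scoring thread as critical, and then uses that score for the two applications the abstract describes: choosing a task-stealing victim and choosing per-thread clock rates.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class CoreCounters:
    """Cache misses seen by one thread in a sampling interval (assumed layout)."""
    l1_misses: int
    l2_misses: int


# Assumed relative miss penalties; illustrative values, not taken from the paper.
L1_MISS_WEIGHT = 1
L2_MISS_WEIGHT = 10


def criticality_score(c: CoreCounters) -> int:
    """Weighted miss count; a higher score means the thread is predicted slower."""
    return L1_MISS_WEIGHT * c.l1_misses + L2_MISS_WEIGHT * c.l2_misses


def predict_critical(counters: Dict[int, CoreCounters]) -> int:
    """Return the id of the thread with the highest criticality score."""
    return max(counters, key=lambda tid: criticality_score(counters[tid]))


def pick_steal_victim(counters: Dict[int, CoreCounters], thief: int) -> int:
    """An idle thread steals from the most critical of the other threads."""
    others = {tid: c for tid, c in counters.items() if tid != thief}
    return predict_critical(others)


def choose_frequencies(
    counters: Dict[int, CoreCounters],
    levels: Sequence[float] = (0.5, 0.75, 1.0),  # assumed fractions of full clock
) -> Dict[int, float]:
    """Keep the critical thread at full rate and scale the others down.

    Each thread gets the lowest level that is at least its score relative
    to the critical thread's score. `levels` must include 1.0.
    """
    scores = {tid: criticality_score(c) for tid, c in counters.items()}
    top = max(scores.values())
    freqs = {}
    for tid, score in scores.items():
        ratio = score / top if top else 1.0
        freqs[tid] = next(lv for lv in sorted(levels) if lv >= ratio)
    return freqs


if __name__ == "__main__":
    sample = {
        0: CoreCounters(l1_misses=100, l2_misses=5),
        1: CoreCounters(l1_misses=400, l2_misses=40),
        2: CoreCounters(l1_misses=50, l2_misses=2),
    }
    print("critical thread:", predict_critical(sample))
    print("victim for thief 2:", pick_steal_victim(sample, thief=2))
    print("frequencies:", choose_frequencies(sample))
```

A real implementation would read hardware performance counters and would need to tune the weights and the sampling interval per architecture; the proportional frequency mapping is only one plausible policy.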