The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Cilk: an efficient multithreaded runtime system
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Focusing processor policies via critical-path prediction
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Parameter variations and impact on circuits and microarchitecture
Proceedings of the 40th annual Design Automation Conference
Dynamic Prediction of Critical Path Instructions
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Reducing DRAM Latencies with an Integrated Memory Hierarchy Design
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Exploiting Barriers to Optimize Power Consumption of CMPs
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Full-chip analysis of leakage power under process variations, including spatial correlations
Proceedings of the 42nd annual Design Automation Conference
The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Dynamic circular work-stealing deque
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Techniques for Multicore Thermal Management: Classification and New Exploration
Proceedings of the 33rd annual international symposium on Computer Architecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 34th annual international symposium on Computer architecture
Carbon: architectural support for fine-grained parallelism on chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Adaptive set pinning: managing shared caches in chip multiprocessors
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
A probabilistic technique for full-chip leakage estimation
Proceedings of the 13th international symposium on Low power electronics and design
Full-system chip multiprocessor power evaluations using FPGA-based emulation
Proceedings of the 13th international symposium on Low power electronics and design
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Adaptive insertion policies for managing shared caches
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Meeting points: using thread criticality to adapt multicore hardware to parallel regions
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Decomposable and responsive power models for multicore processors using performance counters
Proceedings of the 24th ACM International Conference on Supercomputing
Data marshaling for multi-core architectures
Proceedings of the 37th annual international symposium on Computer architecture
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Thread criticality support in on-chip networks
Proceedings of the Third International Workshop on Network on Chip Architectures
Chaotic attractor prediction for server run-time energy consumption
HotPower'10 Proceedings of the 2010 international conference on Power aware computing and systems
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive timekeeping replacement: Fine-grained capacity management for shared CMP caches
ACM Transactions on Architecture and Code Optimization (TACO)
Parallelization libraries: Characterizing and reducing overheads
ACM Transactions on Architecture and Code Optimization (TACO)
LIME: a framework for debugging load imbalance in multi-threaded execution
Proceedings of the 33rd International Conference on Software Engineering
Page placement in hybrid memory systems
Proceedings of the international conference on Supercomputing
Scalable power control for many-core architectures running multi-threaded applications
Proceedings of the 38th annual international symposium on Computer architecture
Parallel pattern detection for architectural improvements
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Improving energy efficiency of multi-threaded applications using heterogeneous CMOS-TFET multicores
Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Energy accounting for shared virtualized environments under DVFS using PMC-based power models
Future Generation Computer Systems
A fault-tolerant, dynamically scheduled pipeline structure for chip multiprocessors
SAFECOMP'11 Proceedings of the 30th international conference on Computer safety, reliability, and security
Bottleneck identification and scheduling in multithreaded applications
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Parallel application memory scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Instruction-based energy estimation methodology for asymmetric manycore processor simulations
Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques
Proceedings of the 39th Annual International Symposium on Computer Architecture
ACM Transactions on Architecture and Code Optimization (TACO)
When less is more (LIMO):controlled parallelism forimproved efficiency
Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
Critical lock analysis: diagnosing critical section bottlenecks in multithreaded applications
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Per-thread cycle accounting in multicore processors
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Orchestrator: a low-cost solution to reduce voltage emergencies for multi-threaded applications
Proceedings of the Conference on Design, Automation and Test in Europe
Utility-based acceleration of multithreaded applications on asymmetric CMPs
Proceedings of the 40th Annual International Symposium on Computer Architecture
Reducing memory access latency with asymmetric DRAM bank organizations
Proceedings of the 40th Annual International Symposium on Computer Architecture
Criticality stacks: identifying critical threads in parallel programs using synchronization behavior
Proceedings of the 40th Annual International Symposium on Computer Architecture
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Exploring power behaviors and trade-offs of in-situ data analytics
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SMT-centric power-aware thread placement in chip multiprocessors
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Fairness-aware scheduling on single-ISA heterogeneous multi-cores
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
TCPT: thread criticality-driven prefetcher throttling
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Crank it up or dial it down: coordinated multiprocessor frequency and folding control
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
ACM Transactions on Architecture and Code Optimization (TACO)
Thread-criticality aware dynamic cache reconfiguration in multi-core system
Proceedings of the International Conference on Computer-Aided Design
PAIS: Parallelism-aware interconnect scheduling in multicores
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computing systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical. If one can accurately predict critical threads in a parallel program, then one can respond in a variety of ways. Possibilities include running the critical thread at a faster clock rate, performing load balancing techniques to offload work onto currently non-critical threads, or giving the critical thread more on-chip resources to execute faster. This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications. We show that accurate predictors can be built using counters that are typically already available on-chip. Our predictor, based on memory hierarchy statistics, identifies thread criticality with an average accuracy of 93% across a range of architectures. We also demonstrate two applications of our predictor. First, we show how Intel's Threading Building Blocks (TBB) parallel runtime system can benefit from task stealing techniques that use our criticality predictor to reduce load imbalance. Using criticality prediction to guide TBB's task-stealing decisions improves performance by 13-32% for TBB-based PARSEC benchmarks running on a 32-core CMP. As a second application, criticality prediction guides dynamic energy optimizations in barrier-based applications. By running the predicted critical thread at the full clock rate and frequency-scaling non-critical threads, this approach achieves average energy savings of 15% while negligibly degrading performance for SPLASH-2 and PARSEC benchmarks.