Accelerating critical section execution with asymmetric multi-core architectures

Authors:
M. Aater Suleman;Onur Mutlu;Moinuddin K. Qureshi;Yale N. Patt
Affiliations:
The University of Texas at Austin, Austin, TX, USA;Carnegie Mellon University, Seattle, WA, USA;IBM Research, Yorktown Heights, NY, USA;The University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Year:
2009


Abstract

To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads that execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only one thread accesses shared data at any given time. Critical sections can serialize the execution of threads, which significantly reduces performance and scalability. This paper proposes Accelerated Critical Sections (ACS), a technique that leverages the high-performance core(s) of an Asymmetric Chip Multiprocessor (ACMP) to accelerate the execution of critical sections. In ACS, selected critical sections are executed by a high-performance core, which can execute the critical section faster than the other, smaller cores. As a result, ACS reduces serialization: it lowers the likelihood of threads waiting for a critical section to finish. Our evaluation on a set of 12 critical-section-intensive workloads shows that ACS reduces the average execution time by 34% compared to an equal-area 32-core symmetric CMP and by 23% compared to an equal-area ACMP. Moreover, for 7 out of the 12 workloads, ACS improves scalability by increasing the number of threads at which performance saturates.
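The effect described in the abstract can be illustrated with a toy bound. This is a sketch under stated assumptions and not the paper's evaluation model. It assumes a single lock guards all critical-section work, that work outside critical sections divides evenly among threads, and that the high-performance core runs critical sections faster by a fixed factor. The function names, the work units, and the 2x factor are all hypothetical.

```python
import math


def estimated_time(parallel_work, cs_work, threads, cs_speedup=1.0):
    """Toy lower bound on execution time, in arbitrary work units.

    parallel_work: total work outside critical sections (splits evenly).
    cs_work:       total work inside critical sections under ONE lock,
                   so it can never overlap with itself.
    cs_speedup:    hypothetical factor by which the core that runs the
                   critical sections is faster (1.0 = no acceleration).
    """
    if threads < 1 or cs_speedup <= 0:
        raise ValueError("threads must be >= 1 and cs_speedup > 0")
    serialized = cs_work / cs_speedup
    # Each thread does its share of parallel work and waits for its
    # share of the critical-section work to be executed.
    per_thread = parallel_work / threads + serialized / threads
    # Time can never drop below the fully serialized critical sections.
    return max(per_thread, serialized)


def saturation_threads(parallel_work, cs_work, cs_speedup=1.0):
    """Smallest thread count at which the lock becomes the bottleneck."""
    serialized = cs_work / cs_speedup
    if serialized == 0:
        return None  # no critical sections, so this model never saturates
    return math.ceil((parallel_work + serialized) / serialized)


if __name__ == "__main__":
    # Hypothetical workload: 900 units parallel, 100 units in critical sections.
    for speedup in (1.0, 2.0):
        n = saturation_threads(900, 100, speedup)
        t = estimated_time(900, 100, 32, speedup)
        print(f"cs_speedup={speedup}: saturates at {n} threads, "
              f"time at 32 threads = {t}")
```

In this model, speeding up only the critical sections lowers the floor that serialization places on execution time, and it also raises the thread count at which adding threads stops helping. Both match the qualitative claims in the abstract. The specific numbers come from the assumed inputs and say nothing about the reported 34% and 23% results.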