Complementing user-level coarse-grain parallelism with implicit speculative parallelism

  • Authors: Nikolas Ioannou, Marcelo Cintra
  • Affiliations: University of Edinburgh; Intel Labs
  • Venue: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
  • Year: 2011

Abstract

Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Programs using parallel programming primitives like PThreads or OpenMP often exploit coarse-grain parallelism because it offers a good trade-off between programming effort and performance gain. However, some parallel applications show limited or no scaling beyond a certain number of cores. Given the abundance of cores expected in future many-core systems, many cores would then remain idle while performance stagnates. This paper proposes using the cores that do not contribute to performance improvement for running implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocol that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. Implicit speculative parallelism frees the programmer from the additional effort of explicitly partitioning the work into finer, properly synchronized tasks. Our results show that, for a many-core comprising 128 cores that supports implicit speculative parallelism in clusters of 2 or 4 cores, performance improves beyond the highest scalability point by 41% on average for 4-core clusters and by 27% on average for 2-core clusters. These performance improvements come with energy consumption that is close to, and sometimes better than, that of the baseline. This approach often leads to better performance and energy efficiency than existing alternatives such as Core Fusion and Frequency Boosting. We also investigate the trade-offs between explicit and implicit threads as input data-set sizes vary. Finally, we present a dynamic mechanism for choosing the number of explicit and implicit threads, which performs within 6% of a static oracle selection of threads.
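To make the setting concrete, the sketch below is a minimal, hypothetical example (not taken from the paper) of the kind of code the abstract targets: an explicitly parallel OpenMP outer loop that may stop scaling, with an inner loop whose iterations would be candidates for the hardware's implicit speculative threads within a 2- or 4-core cluster. All names, sizes, and comments are illustrative assumptions.

```c
/* Hypothetical sketch: explicit coarse-grain parallelism (OpenMP) whose
 * inner loop could be parallelized implicitly and speculatively by the
 * proposed hardware. Compile with -fopenmp; without it, the pragma is
 * simply ignored and the program still runs sequentially. */
#include <stdio.h>
#include <omp.h>

#define N_BLOCKS   16
#define BLOCK_SIZE 1024

static double data[N_BLOCKS][BLOCK_SIZE];

int main(void)
{
    /* Explicit coarse-grain parallelism: each OpenMP thread processes whole
     * blocks. If scaling saturates well below 128 threads, the remaining
     * cores of a 128-core chip would otherwise sit idle. */
    #pragma omp parallel for
    for (int b = 0; b < N_BLOCKS; b++) {
        /* Inner loop: iterations carry a dependence through data[b][i-1].
         * In the architecture described in the abstract, the idle cores of
         * each cluster would run chunks of these iterations speculatively,
         * squashing and re-executing on a detected dependence violation.
         * The programmer adds no extra partitioning or synchronization. */
        for (int i = 1; i < BLOCK_SIZE; i++)
            data[b][i] = 0.5 * (data[b][i] + data[b][i - 1]);
    }

    printf("done: %f\n", data[0][BLOCK_SIZE - 1]);
    return 0;
}
```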