Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era

Authors:
George Patsilaras; Niket K. Choudhary; James Tuck
Affiliations:
North Carolina State University; North Carolina State University; North Carolina State University
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Year:
2012

Abstract

Extracting high memory-level parallelism (MLP) is essential for speeding up memory-bound single-threaded applications. At the same time, the projected amount of dark silicon (the fraction of the chip that is powered off) is growing. Asymmetric Multicore Processors (AMPs) therefore offer a unique opportunity to integrate many types of cores, each powered at different times, in order to optimize for different regions of execution. In this work, we quantify the potential for exploiting core customization to speed up programs during regions of high MLP. Based on a careful design space exploration, we discover that an AMP that includes a narrow and fast specialized core has the potential to exploit MLP efficiently. Using the results of our analysis, we design an AMP with both an MLP-specialized core and an ILP-specialized core, and we propose a hardware-level application steering mechanism called Symbiotic Core Execution (SCE). SCE detects MLP phases by monitoring the L2 miss rate of the application, and it uses that information to steer the application to the best core. Interestingly, we show that L2 miss rates are important for deciding both when an MLP region begins and when it ends. As a program runs, its execution migrates to a core customized for MLP during regions of high MLP; when the region ends, it is rescheduled on the core that fits the application characteristics. Compared to a monolithic core optimized for both modes of operation, our AMP design provides a harmonic mean performance improvement of 5.3% and 6.6% for SPEC2000 and SPEC2006, respectively, with a maximum speedup of 14.5%. For the same study, it achieves an 18.3% and 21.1% reduction in energy-delay-squared (ED²) for SPEC2000 and SPEC2006, respectively. Our findings yield an important message for designing AMPs with specialized cores: core customization enables efficient exploitation of MLP, and application steering mechanisms for MLP are simple to implement and effective.
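
The steering idea described in the abstract can be illustrated with a minimal sketch. This is not the SCE hardware mechanism itself, whose details the abstract does not give. It only assumes that a miss-rate sample is available once per interval and that two thresholds decide when an MLP region begins and ends. The names `SteeringConfig`, `steer`, `ILP_CORE`, and `MLP_CORE`, the use of misses per thousand instructions, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass

ILP_CORE = "ilp_core"  # hypothetical label for the ILP-specialized core
MLP_CORE = "mlp_core"  # hypothetical label for the MLP-specialized core


@dataclass(frozen=True)
class SteeringConfig:
    # Assumed thresholds in L2 misses per 1000 instructions (MPKI).
    # These values are illustrative and do not come from the paper.
    enter_mlp_mpki: float = 10.0  # at or above this, treat as start of an MLP region
    exit_mlp_mpki: float = 4.0    # at or below this, treat as end of the MLP region


def steer(intervals, config=SteeringConfig()):
    """Return the core selected after observing each sampling interval.

    intervals: iterable of (l2_misses, instructions) pairs, one per interval.
    """
    current = ILP_CORE  # assumption: execution starts on the ILP core
    schedule = []
    for misses, instructions in intervals:
        mpki = 1000.0 * misses / instructions if instructions else 0.0
        if current == ILP_CORE and mpki >= config.enter_mlp_mpki:
            current = MLP_CORE  # miss rate is high: migrate to the MLP core
        elif current == MLP_CORE and mpki <= config.exit_mlp_mpki:
            current = ILP_CORE  # miss rate has dropped: migrate back
        schedule.append(current)
    return schedule


if __name__ == "__main__":
    samples = [(1, 1000), (12, 1000), (6, 1000), (3, 1000), (20, 1000)]
    print(steer(samples))
```

Using separate entry and exit thresholds is a design choice of this sketch. It keeps the program from bouncing between cores when the miss rate hovers near a single cutoff, which matters if migration has a nontrivial cost.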