Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing
Slice-processors: an implementation of operation-based prediction
ICS '01 Proceedings of the 15th international conference on Supercomputing
A large, fast instruction window for tolerating cache misses
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic speculative precomputation
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Enhancing memory level parallelism via recovery-free value prediction
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance
Proceedings of the 31st annual international symposium on Computer architecture
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Checkpointed Early Load Retirement
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Toward kilo-instruction processors
ACM Transactions on Architecture and Code Optimization (TACO)
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dynamic thread assignment on heterogeneous multiprocessor architectures
Proceedings of the 3rd conference on Computing frontiers
CAVA: Using checkpoint-assisted value prediction to hide L2 misses
ACM Transactions on Architecture and Code Optimization (TACO)
Future execution: A prefetching mechanism that uses multiple cores to speed up single threads
ACM Transactions on Architecture and Code Optimization (TACO)
FreePDK: An Open-Source Variation-Aware Design Kit
MSE '07 Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education
A Flexible Heterogeneous Multi-Core Architecture
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Discovering and Exploiting Program Phases
IEEE Micro
Accelerating critical section execution with asymmetric multi-core architectures
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
A performance-correctness explicitly-decoupled architecture
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A comprehensive scheduler for asymmetric multicore systems
Proceedings of the 5th European conference on Computer systems
Proceedings of the 38th annual international symposium on Computer architecture
Dark silicon and the end of multicore scaling
Proceedings of the 38th annual international symposium on Computer architecture
Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)
Proceedings of the 39th Annual International Symposium on Computer Architecture
Understanding fundamental design choices in single-ISA heterogeneous multicore architectures
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Composite Cores: Pushing Heterogeneity Into a Core
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Fairness-aware scheduling on single-ISA heterogeneous multi-cores
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Extracting high memory-level parallelism (MLP) is essential for speeding up single-threaded applications which are memory bound. At the same time, the projected amount of dark silicon (the fraction of the chip powered off) on a chip is growing. Hence, Asymmetric Multicore Processors (AMP) offer a unique opportunity to integrate many types of cores, each powered at different times, in order to optimize for different regions of execution. In this work, we quantify the potential for exploiting core customization to speedup programs during regions of high MLP. Based on a careful design space exploration, we discover that an AMP that includes a narrow and fast specialized core has the potential to efficiently exploit MLP. Using the results of our analysis, we design an AMP with both an MLP and ILP specialized core, and we propose a hardware-level, application steering mechanism called Symbiotic Core Execution (SCE). SCE detects MLP phases by monitoring the L2 miss rate of the application, and it uses that information to steer the application to the best core. Interestingly, we show that L2 miss rates are important for deciding when an MLP region begins and when it ends. As a program runs, its execution migrates to a core customized for MLP during regions of high MLP; when the region ends, it is re-scheduled on the core that fits the application characteristics. Compared to a monolithic core optimized for both modes of operation, our AMP design provides a harmonic mean performance improvement of 5.3% and 6.6% for SPEC2000 and SPEC2006, respectively, with a maximum speedup of 14.5%. For the same study, it achieves a 18.3% and 21.1% energy delay2 reduction for SPEC2000 and SPEC2006, respectively. Our findings yield an important message for designing AMPs with specialized cores: core customization enables efficient exploitation of MLP, and application steering mechanisms for MLP are simple to implement and effective.