A performance-correctness explicitly-decoupled architecture

Authors:
Alok Garg; Michael C. Huang
Affiliations:
Department of Electrical & Computer Engineering, University of Rochester, USA; Department of Electrical & Computer Engineering, University of Rochester, USA
Venue:
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Year:
2008

Abstract

Optimizing the common case has been an adage of processor design for decades. However, as system complexity and the sophistication of optimization techniques have increased substantially, maintaining correctness in all situations, however unlikely, demands extra conservatism in every layer of system design. Mounting concerns about process, voltage, and temperature variation add further conservatism in setting operating parameters. Excessive conservatism, in turn, hurts performance and efficiency in the common case. Yet much of a system’s complexity comes from advanced performance features, and these need not compromise the whole system’s functionality and correctness even if some components are imperfect and introduce occasional errors. We propose to separate performance goals from the correctness goal using an explicitly-decoupled architecture. In this paper, we discuss one such incarnation, in which an independent core serves as an optimistic performance-enhancement engine that helps accelerate the correctness-guaranteeing core by passing it high-quality predictions and performing accurate prefetching. Because the optimistic core is free of correctness concerns, its execution can be optimized more effectively than that of a monolithic core bound by correctness requirements. We show that such a decoupled design allows significant optimization benefits and is much less sensitive to conservatism applied in the correctness domain.
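The division of labor described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' design or simulator: the function names, the even/odd "branch", the hint queue, and the cycle costs are all hypothetical, chosen only to show that wrong hints from an optimistic core cost time in the correctness core but never change its result.

```python
from collections import deque


def reference_run(values):
    """Architecturally correct result: the sum of the even inputs."""
    return sum(v for v in values if v % 2 == 0)


def optimistic_core(values, faulty_indices):
    """Hypothetical optimistic core: emits one branch hint per input.

    Hints at positions listed in faulty_indices are deliberately flipped
    to model occasional errors that the optimistic core is allowed to make.
    """
    hints = deque()
    for i, v in enumerate(values):
        taken = (v % 2 == 0)
        if i in faulty_indices:
            taken = not taken
        hints.append(taken)
    return hints


def correctness_core(values, hints, hit_cost=1, miss_cost=10):
    """Hypothetical correctness core: uses hints as predictions only.

    Every hint is checked against the authoritative outcome, so a wrong
    hint adds cycles (arbitrary illustrative costs) but cannot corrupt
    the architectural result.
    """
    total, cycles, wrong = 0, 0, 0
    for v in values:
        predicted = hints.popleft() if hints else False
        actual = (v % 2 == 0)  # authoritative outcome, always recomputed
        if predicted == actual:
            cycles += hit_cost
        else:
            cycles += miss_cost
            wrong += 1
        if actual:
            total += v
    return total, cycles, wrong


values = list(range(1, 21))
good_total, good_cycles, good_wrong = correctness_core(
    values, optimistic_core(values, set()))
bad_total, bad_cycles, bad_wrong = correctness_core(
    values, optimistic_core(values, {3, 7}))

print(good_total, good_cycles, good_wrong)
print(bad_total, bad_cycles, bad_wrong)
```

In this toy model both runs produce the same total as `reference_run`; only the cycle count differs when some hints are wrong, which is the sense in which performance and correctness are decoupled.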