Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors

Authors:
Guoping Long;Diana Franklin;Susmit Biswas;Pablo Ortiz;Jason Oberg;Dongrui Fan;Frederic T. Chong
Affiliations:
-;-;-;-;-;-;-
Venue:
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2010

Citing 39
Cited 4

The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Low power data processing by elimination of redundant computations

ISLPED '97 Proceedings of the 1997 international symposium on Low power electronics and design
Dynamic instruction reuse

Proceedings of the 24th annual international symposium on Computer architecture
The SimpleScalar tool set, version 2.0

ACM SIGARCH Computer Architecture News
Understanding the differences between value prediction and instruction reuse

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
An empirical analysis of instruction repetition

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Accelerating multi-media processing by implementing memoing in multiplication and division units

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
The block-based trace cache

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Dynamic removal of redundant computations

ICS '99 Proceedings of the 13th international conference on Supercomputing
Compiler-directed dynamic computation reuse: rationale and initial results

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
Adaptive functional programming

POPL '02 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
An architectural alternative to optimizing compilers

ASPLOS I Proceedings of the first international symposium on Architectural support for programming languages and operating systems
Exploiting Basic Block Value Locality with Block Reuse

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Dynamic Elimination of Pointer-Expressions

PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Exploring Sub-Block Value Reuse for Superscalar Processors

PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
The Dynamic Trace Memorization Reuse Technique

PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
ON DIVISION AND RECIPROCAL CACHES

ON DIVISION AND RECIPROCAL CACHES
Profiling Driven Computation Reuse: An Embedded Software Synthesis Technique for Energy and Performance Optimization

VLSID '04 Proceedings of the 17th International Conference on VLSI Design
Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation

Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation
A Compiler Scheme for Reusing Intermediate Computation Results

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
A Complexity-Effective Approach to ALU Bandwidth Enhancement for Instruction-Level Temporal Redundancy

Proceedings of the 31st annual international symposium on Computer architecture
Enhancing Speedup in Network Processing Applications by Exploiting Instruction Reuse with Flow Aggregation

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Value Predictors for Reuse through Speculation on Traces

SBAC-PAD '04 Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors

Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs

Proceedings of the 32nd annual international symposium on Computer Architecture
On Reusing the Results of Pre-Executed Instructions in a Runahead Execution Processor

IEEE Computer Architecture Letters
Cooperative Caching for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Computation spreading: employing hardware migration to specialize CMP cores on-the-fly

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Steps towards cache-resident transaction processing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Thread fusion

Proceedings of the 13th international symposium on Low power electronics and design
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Reexecution and Selective Reuse in Checkpoint Processors

Transactions on High-Performance Embedded Architectures and Compilers II
Multi-execution: multicore caching for data-similar executions

Proceedings of the 36th annual international symposium on Computer architecture
LIBSVM: A library for support vector machines

ACM Transactions on Intelligent Systems and Technology (TIST)

Simultaneous branch and warp interweaving for sustained GPU performance

Proceedings of the 39th Annual International Symposium on Computer Architecture
Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement

Proceedings of the 27th international ACM conference on International conference on supercomputing
Microarchitectural mechanisms to exploit value structure in SIMT architectures

Proceedings of the 40th Annual International Symposium on Computer Architecture
Rhythm: harnessing data parallel hardware for server workloads

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Parallelism is the key to continued performance scaling in modern microprocessors. Yet we observe that this parallelism can often contain a surprising amount of instruction redundancy. We propose to exploit this redundancy to improve performance and decrease energy consumption. We propose a multi-threading micro-architecture, Minimal Multi-Threading (MMT), that leverages register renaming and the instruction window to combine the fetch and execution of identical instructions between threads in SPMD applications. While many techniques exploit intra-thread similarities by detecting when a later instruction may use an earlier result, MMT exploits inter-thread similarities by, whenever possible, fetching instructions from different threads together and only splitting them if the computation is unique. With two threads, our design achieves a speedup of 1.15(geometric mean) over a two-thread traditional SMT with a trace cache. With four threads, our design achieves a speedup of 1.25 (geometric mean) over a traditional SMT processor with four-threads and a trace cache. These correspond to speedups of 1.5 and 1.84 over a traditional out-of-order processor. Moreover, our performance increases inmost applications with no power increase because the increase in overhead is countered with a decrease in cache accesses, leading to a decrease in energy consumption for all applications.