Improving Latency Tolerance of Multithreading through Decoupling

Authors:
Joan-Manuel Parcerisa; Antonio Gonzalez
Affiliations:
Univ. Politècnica de Catalunya, Barcelona, Spain; Univ. Politècnica de Catalunya, Barcelona, Spain
Venue:
IEEE Transactions on Computers
Year:
2001

Abstract

The increasing hardware complexity of dynamically scheduled superscalar processors may limit how well this organization scales to make efficient use of future increases in transistor budget. SMT processors, which are built on a superscalar core, are therefore directly affected by this problem. This work presents and evaluates a novel processor microarchitecture that combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is better suited to future growth in issue width and clock speed. We investigate how the two techniques complement each other. Since decoupling hides memory latency very efficiently, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. Our study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious reduction in hardware complexity, it places lower demands on the memory system. Since one of the problems of multithreading is the degradation of memory system performance, in terms of both miss latency and bandwidth requirements, this improvement becomes critical for high miss latencies, where bandwidth might become a bottleneck. Finally, although it may seem rather surprising, our study reveals that multithreading by itself exhibits little memory latency tolerance. Our results suggest that most of the latency hiding effectiveness of SMT architectures comes from dynamic scheduling. Decoupling, on the other hand, is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.
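
The latency hiding effect of access/execute decoupling described in the abstract can be illustrated with a toy cycle-count model. This is only a sketch under strong assumptions, not the authors' simulator: it models a single thread, a single-issue in-order pipeline, and a loop in which every iteration is one load followed by one dependent operation, with load addresses that never depend on loaded data. The function names and the `queue_size` parameter are hypothetical, and the model is not intended to reproduce the percentages reported in the paper.

```python
def coupled_cycles(n_iters: int, latency: int) -> int:
    """Toy in-order core without decoupling (assumed model).

    Each iteration issues a load, stalls until the data returns,
    issues the dependent operation, and only then starts the next load.
    """
    cycle = 0
    for _ in range(n_iters):
        load_issue = cycle
        use_issue = load_issue + max(1, latency)  # wait for the loaded value
        cycle = use_issue + 1                     # the use occupies one cycle
    return cycle


def decoupled_cycles(n_iters: int, latency: int, queue_size: int) -> int:
    """Toy access/execute decoupled core (assumed model).

    The access unit issues one load per cycle and runs ahead of the
    execute unit, limited only by a load queue of `queue_size` entries.
    The execute unit consumes loaded values in order, one per cycle.
    """
    load_issue = []  # cycle at which load i is issued by the access unit
    use_issue = []   # cycle at which the execute unit consumes load i
    for i in range(n_iters):
        a = load_issue[-1] + 1 if load_issue else 0
        if i >= queue_size:
            # A queue entry frees up only once an older value is consumed.
            a = max(a, use_issue[i - queue_size] + 1)
        e = a + max(1, latency)
        if use_issue:
            e = max(e, use_issue[-1] + 1)  # in-order, one use per cycle
        load_issue.append(a)
        use_issue.append(e)
    return use_issue[-1] + 1 if use_issue else 0


if __name__ == "__main__":
    for latency in (1, 32):
        print(latency,
              coupled_cycles(1000, latency),
              decoupled_cycles(1000, latency, queue_size=64))
```

In this model the coupled core pays the full miss latency on every iteration, while the decoupled core pays it roughly once, provided the queue is deep enough to cover the latency. With a queue of a single entry the decoupled model degenerates to the coupled one, which is why queue depth matters for latency tolerance in this sketch.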