A Simulation Study of Decoupled Architecture Computers
IEEE Transactions on Computers
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
An elementary processor architecture with simultaneous instruction issuing from multiple threads
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Evaluation of the WM architecture
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
MISC: a Multiple Instruction Stream Computer
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The effectiveness of decoupling
ICS '93 Proceedings of the 7th international conference on Supercomputing
ATOM: a system for building customized program analysis tools
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Designing the TFP Microprocessor
IEEE Micro
Compiling and optimizing for decoupled architectures
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Decoupling integer execution in superscalar processors
Proceedings of the 28th annual international symposium on Microarchitecture
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Exploiting idle floating-point resources for integer execution
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Performance modeling and code partitioning for the DS architecture
Proceedings of the 25th annual international symposium on Computer architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
PIPE: a VLSI decoupled architecture
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Implementation of precise interrupts in pipelined processors
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Decoupled access/execute computer architectures
ACM Transactions on Computer Systems (TOCS)
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
Memory Latency Effects in Decoupled Architectures
IEEE Transactions on Computers
A Limitation Study into Access Decoupling
Euro-Par '97 Proceedings of the Third International Euro-Par Conference on Parallel Processing
The PowerPC 620 microprocessor: a high performance superscalar RISC microprocessor
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
A study of branch prediction strategies
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
A Cost-Effective Clustered Architecture
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
The Latency Hiding Effectiveness of Decoupled Access/Execute Processors
EUROMICRO '98 Proceedings of the 24th Conference on EUROMICRO - Volume 1
System-level exploration of run-time clusterization for energy-efficient on-chip communication
Proceedings of the 2nd International Workshop on Network on Chip Architectures
Analysis of execution efficiency in the microthreaded processor UTLEON3
ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
OUTRIDER: efficient memory latency tolerance with decoupled strands
Proceedings of the 38th annual international symposium on Computer architecture
Improving latency tolerance of network processors through simultaneous multithreading
APPT'05 Proceedings of the 6th international conference on Advanced Parallel Processing Technologies
Boosting mobile GPU performance with a decoupled access/execute fragment processor
Proceedings of the 39th Annual International Symposium on Computer Architecture
Hi-index | 14.98 |
The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make an efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. This work presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in-order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. Our study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. Since one of the problems of multithreading is the degradation of the memory system performance, both in terms of miss latency and bandwidth requirements, this improvement becomes critical for high miss latencies, where bandwidth might become a bottleneck. Finally, although it may seem rather surprising, our study reveals that multithreading by itself exhibits little memory latency tolerance. Our results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.