Design and evaluation of a hierarchical decoupled architecture

Authors:
Won W. Ro;Stephen P. Crago;Alvin M. Despain;Jean-Luc Gaudiot
Affiliations:
Department of Electrical and Computer Engineering, California State University, Northridge;Information Sciences Institute-East, University of Southern California;Department of Electrical Engineering, University of Southern California;Department of Electrical Engineering and Computer Science, University of California, Irvine
Venue:
The Journal of Supercomputing
Year:
2006

Citing 33
Cited 3

Dynamic Instruction Scheduling and the Astronautics ZS-1

Computer
Evaluation of the WM architecture

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Exploiting instruction-level parallelism with the conjugate register file scheme

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
MISC: a Multiple Instruction Stream Computer

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The effectiveness of decoupling

ICS '93 Proceedings of the 7th international conference on Supercomputing
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Compiler-based prefetching for recursive data structures

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
A comparision of superscalar and decoupled access/execute architectures

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
A comparison of data prefetching on an access decoupled and superscalar machine

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Performance modeling and code partitioning for the DS architecture

Proceedings of the 25th annual international symposium on Computer architecture
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A Chip-Multiprocessor Architecture with Speculative Multithreading

IEEE Transactions on Computers
PIPE: a VLSI decoupled architecture

ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Understanding the backward slices of performance degrading instructions

Proceedings of the 27th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
Slice-processors: an implementation of operation-based prediction

ICS '01 Proceedings of the 15th international conference on Supercomputing
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Data prefetching by dependence graph precomputation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
SMT Layout Overhead and Scalability

IEEE Transactions on Parallel and Distributed Systems
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A Case for Intelligent RAM

IEEE Micro
Simultaneous Multithreading: A Platform for Next-Generation Processors

IEEE Micro
Memory Latency Effects in Decoupled Architectures

IEEE Transactions on Computers
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers
Decoupled access/execute computer architectures

ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
The Synergy of Multithreading and Access/Execute Decoupling

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
HiDISC: A Decoupled Architecture for Data-Intensive Applications

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture

Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
Mat-core: a decoupled matrix core extension for general-purpose processors

Neural, Parallel & Scientific Computations
A shared matrix unit for a chip multi-core processor

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality.To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors.Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2%.