Multithreading with Distributed Functional Units

Authors:
Bernard K. Gunther
Affiliations:
Univ. of South Australia, Adelaide, Australia
Venue:
IEEE Transactions on Computers
Year:
1997

Abstract

Multithreaded processors multiplex the execution of a number of concurrent threads onto the hardware in order to hide latencies associated with memory access, synchronization, and arithmetic operations. Conventional multithreading aims to maximize throughput in a single instruction pipeline whose execution stages are served by a collection of centralized functional units. This paper examines a multithreaded microarchitecture where the heterogeneous functional unit set is expanded so that units may be distributed and partly shared across several instruction pipelines operating simultaneously, thereby allowing greater exploitation of interthread parallelism in improving utilization factors of costly resources. The multiple pipeline approach is studied specifically in the Concurro processor architecture, a machine supporting multiple thread contexts and capable of context switching asynchronously in response to dynamic data and resource availability. Detailed simulations of Concurro processors indicate that instruction throughputs for programs accessing main memory directly can be scaled, without recompilation, from one to over eight instructions per cycle simply by varying the number of pipelines and functional units. In comparison with an equivalent coherent-cache, single-chip multiprocessor, Concurro offers marginally better performance at less than half of the estimated implementation cost. With suitable prefetching, multiple instruction caches can be avoided, and multithreading is shown to obviate the need for sophisticated instruction dispatch mechanisms on parallel workloads. Distribution of functional units results in a 150% improvement over the centralized approach in utilization factors of arithmetic units, and enables saturation of the most critical processor resources.
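The latency-hiding idea in the first sentence of the abstract can be illustrated with a toy model. The sketch below is not the Concurro simulator and does not reproduce any result from the paper. It assumes a single pipeline that issues at most one instruction per cycle, round-robin thread selection, and a fixed stall of `latency` cycles after every instruction. The function name `simulate_utilization` and all of its parameters are hypothetical, introduced only for illustration.

```python
def simulate_utilization(num_threads, latency, cycles=10_000):
    """Toy model of latency hiding in a single multithreaded pipeline.

    Assumptions (not from the paper): one instruction issues per cycle at
    most, and a thread that issues must wait `latency` further cycles
    (for example, a memory access) before it can issue again.
    Returns the fraction of cycles in which an instruction was issued.
    """
    ready_at = [0] * num_threads  # earliest cycle each thread may issue
    issued = 0
    next_thread = 0  # round-robin starting point for the search

    for cycle in range(cycles):
        # Look for the first ready thread, starting at the round-robin pointer.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + latency  # thread stalls after issuing
                issued += 1
                next_thread = (t + 1) % num_threads
                break
        # If no thread was ready, the pipeline idles for this cycle.

    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads:", simulate_utilization(n, latency=3))
```

Under these assumptions, with a latency of 3 cycles the pipeline is busy 25% of the time with one thread, 50% with two, and 100% with four or more, which matches the simple bound min(1, N / (1 + latency)). Real workloads have variable latencies and shared functional units, so this is only a first-order picture.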