Multithreaded processors multiplex the execution of a number of concurrent threads onto the hardware in order to hide latencies associated with memory access, synchronization, and arithmetic operations. Conventional multithreading aims to maximize throughput in a single instruction pipeline whose execution stages are served by a collection of centralized functional units. This paper examines a multithreaded microarchitecture where the heterogeneous functional unit set is expanded so that units may be distributed and partly shared across several instruction pipelines operating simultaneously, thereby allowing greater exploitation of interthread parallelism in improving utilization factors of costly resources. The multiple pipeline approach is studied specifically in the Concurro processor architecture, a machine supporting multiple thread contexts and capable of context switching asynchronously in response to dynamic data and resource availability.

Detailed simulations of Concurro processors indicate that instruction throughputs for programs accessing main memory directly can be scaled, without recompilation, from one to over eight instructions per cycle simply by varying the number of pipelines and functional units. In comparison with an equivalent coherent-cache, single-chip multiprocessor, Concurro offers marginally better performance at less than half of the estimated implementation cost. With suitable prefetching, multiple instruction caches can be avoided, and multithreading is shown to obviate the need for sophisticated instruction dispatch mechanisms on parallel workloads. Distribution of functional units results in a 150% improvement over the centralized approach in utilization factors of arithmetic units, and enables saturation of the most critical processor resources.
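The latency-hiding idea in the opening sentence can be illustrated with a toy model. The sketch below is not the Concurro simulator and does not model multiple pipelines or distributed functional units; it assumes a single-issue pipeline, round-robin selection among ready threads, and a fixed memory latency. The function name `simulate_ipc` and the parameters `mem_latency` and `mem_every` are hypothetical, chosen only for illustration.

```python
def simulate_ipc(num_threads, mem_latency=8, mem_every=4, cycles=10_000):
    """Toy single-issue pipeline with several hardware thread contexts.

    Assumptions (illustrative only, not taken from the paper):
    - at most one instruction issues per cycle;
    - every `mem_every`-th instruction of a thread is a memory access
      that blocks that thread for `mem_latency` cycles;
    - all other instructions complete in one cycle;
    - ready threads are selected round-robin.
    Returns the average number of instructions issued per cycle.
    """
    ready_at = [0] * num_threads       # first cycle each thread may issue again
    issued_by = [0] * num_threads      # instructions issued so far, per thread
    total_issued = 0
    next_thread = 0                    # where the round-robin search starts

    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued_by[t] += 1
                total_issued += 1
                if issued_by[t] % mem_every == 0:
                    ready_at[t] = cycle + mem_latency   # thread waits on memory
                else:
                    ready_at[t] = cycle + 1             # ready next cycle
                next_thread = (t + 1) % num_threads
                break                  # single issue slot is now used
        # if no thread was ready, the issue slot is wasted this cycle

    return total_issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads ->", round(simulate_ipc(n), 3), "instructions/cycle")
```

Under these assumptions a single thread leaves most issue slots idle while it waits on memory, and adding thread contexts fills those slots until the one pipeline saturates at one instruction per cycle. That ceiling is the limit the abstract attributes to conventional single-pipeline multithreading.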