MASA: a multithreaded processor architecture for parallel symbolic computing
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Can dataflow subsume von Neumann computing?
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Improved multithreading techniques for hiding communication latency in multiprocessors
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
ICS '90 Proceedings of the 4th international conference on Supercomputing
ACM Computing Surveys (CSUR)
Computer Architecture; A Quantitative Approach
Computer Architecture; A Quantitative Approach
Performance Tradeoffs in Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
SPLASH: Stanford parallel applications for shared-memory*
SPLASH: Stanford parallel applications for shared-memory*
Performance Issues of a Superscalar Microprocessor
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
An efficient algorithm for exploiting multiple arithmetic units
IBM Journal of Research and Development
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading
ACM Transactions on Computer Systems (TOCS)
Effects of Multithreading on Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
Improving 3D geometry transformations on a simultaneous multithreaded SIMD processor
ICS '01 Proceedings of the 15th international conference on Supercomputing
Asynchrony in parallel computing: from dataflow to multithreading
Progress in computer research
Asynchrony in parallel computing: from dataflow to multithreading
Progress in computer research
A survey of processors with explicit multithreading
ACM Computing Surveys (CSUR)
A Study of a Simultaneous Multithreaded Processor Implementation
Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
An Architecture based on the Memory Mapped Node Addressing in Reconfigurable Interconnection Network
PAS '97 Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis
The need for adaptive dynamic thread scheduling
High performance scientific and engineering computing
Predictable performance in SMT processors
Proceedings of the 1st conference on Computing frontiers
Dynamically Controlled Resource Allocation in SMT Processors
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Optimizing NANOS OpenMP for the IBM Cyclops Multithreaded Architecture
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Architecture optimization for multimedia application exploiting data and thread-level parallelism
Journal of Systems Architecture: the EUROMICRO Journal
Journal of Parallel and Distributed Computing
Design of adaptive multiprocessor on chip systems
Proceedings of the 20th annual conference on Integrated circuits and systems design
Optimising long-latency-load-aware fetch policies for SMT processors
International Journal of High Performance Computing and Networking
Exploiting multilevel parallelism using OpenMP on a massive multithreaded architecture
Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
Improving SMT performance scheduling processes
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Scheduling optimization in multicore multithreaded microprocessors through dynamic modeling
Proceedings of the ACM International Conference on Computing Frontiers
Hi-index | 0.00 |
This paper describes a technique for improving the performance of a superscalar processor through multithreading. The technique exploits the instruction-level parallelism available both inside each individual stream, and across streams. The former is exploited through out-of-order execution of instructions within a stream, and the latter through execution of instructions from different streams simultaneously. Aspects of multithreaded superscalar design, such as fetch policy, cache performance, instruction scheduling, and functional unit utilization are studied. We analyze performance based on the simulation of a superscalar architecture and show that it is possible to provide support for multiple streams with minimal extra hardware, yet achieving significant performance gain (20 - 55%) across a range of benchmarks.