Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

Authors:
Jack L. Lo;Joel S. Emer;Henry M. Levy;Rebecca L. Stamm;Dean M. Tullsen;S. J. Eggers
Affiliations:
Univ. of Washington, Seattle;Digital Equipment Corporation, Hudson, MA;Univ. of Washington, Seattle;Digital Equipment Corporation, Hudson, MA;Univ. of California, San Diego;-
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1997

Abstract

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. We examine two alternative on-chip parallel architectures for the next generation of processors. We compare SMT and small-scale, on-chip multiprocessors in their ability to exploit both ILP and TLP. First, we identify the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP. Then, we show that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor.
The use of TLP is especially advantageous when per-thread ILP is limited. The ease of adding thread contexts on an SMT (relative to adding processors on an MP) allows simultaneous multithreading to expose more parallelism, further increasing functional unit utilization and attaining a 52% average speedup (versus a four-processor, single-chip multiprocessor with comparable execution resources). This study also addresses an often-cited concern regarding the use of thread-level parallelism or multithreading: interference in the memory system and branch prediction hardware. We find that multiple threads cause interthread interference in the caches and place greater demands on the memory system, thus increasing average memory latencies. By exploiting thread-level parallelism, however, SMT hides these additional latencies, so that they have only a small impact on total program performance. We also find that for parallel applications, the additional threads have minimal effects on branch prediction.
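
The contrast between static partitioning and dynamic sharing described in the abstract can be illustrated with a toy model. The sketch below is not the simulator or methodology used in the article: the function names, the four-cycle trace, and the machine widths (four 2-wide cores versus one 8-wide SMT) are all assumptions chosen for illustration. It only counts how many issue slots are filled when each thread is pinned to its own core and when all threads draw from a single shared pool.

```python
# Toy model only: the per-cycle "ready instruction" counts are invented for
# illustration and do not come from the article.

def mp_issued(ready_per_thread, width_per_core):
    """Instructions issued in one cycle when each thread is pinned to its own core."""
    return sum(min(ready, width_per_core) for ready in ready_per_thread)

def smt_issued(ready_per_thread, total_width):
    """Instructions issued in one cycle when all threads share one pool of issue slots."""
    return min(sum(ready_per_thread), total_width)

# Hypothetical trace: rows are cycles, columns are ready instructions per thread.
TRACE = [
    [5, 1, 0, 0],
    [2, 2, 2, 2],
    [0, 6, 1, 0],
    [3, 3, 3, 3],
]
CORES, WIDTH_PER_CORE = 4, 2          # assumed multiprocessor: four 2-wide cores
TOTAL_WIDTH = CORES * WIDTH_PER_CORE  # assumed SMT: one 8-wide core, same total slots

mp_total = sum(mp_issued(cycle, WIDTH_PER_CORE) for cycle in TRACE)
smt_total = sum(smt_issued(cycle, TOTAL_WIDTH) for cycle in TRACE)
slots = TOTAL_WIDTH * len(TRACE)

print(f"MP  issued {mp_total} of {slots} slots")
print(f"SMT issued {smt_total} of {slots} slots")
```

In this made-up trace the partitioned machine fills 22 of 32 slots and the shared machine fills 29, because a thread with more ready instructions than one core can issue may borrow slots that idle threads leave empty. The model ignores caches, branch prediction, and fetch limits, all of which the article evaluates.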