Improved multithreading techniques for hiding communication latency in multiprocessors

Authors:
Bob Boothe; Abhiram Ranade
Affiliations:
-;-
Venue:
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Year:
1992

Abstract

Shared memory multiprocessors are considered among the easiest parallel computers to program. However, building shared memory machines with thousands of processors has proved difficult because of the inevitably long memory latencies. Much previous research has focused on cache coherency techniques, but it remains unclear whether caches can obtain sufficiently high hit rates. In this paper we present improved multithreading techniques that can easily tolerate latencies of hundreds of cycles, yet require only a small number of threads per processor. High performance is achieved by introducing an explicit context switch instruction that a simple optimizing compiler can use to group together several shared accesses. This grouping of shared accesses dramatically reduces the frequency of context switches compared to simpler multithreading models. The combination of our techniques achieves efficiencies of 80% or higher on a broad set of applications.
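
To illustrate why grouping shared accesses can matter, here is a minimal sketch of a generic, textbook-style utilization model for a multithreaded processor. It is not the model or the simulator used in the paper. The function names, the parameter values, and the assumption that all accesses in a group are issued together so that their latencies overlap are hypothetical and chosen only for illustration.

```python
def efficiency(threads, run_length, latency, switch_cost):
    """Fraction of processor cycles spent on useful work (toy model).

    threads     -- hardware threads per processor
    run_length  -- useful cycles a thread executes between context switches
    latency     -- cycles a thread waits for its shared accesses to return
    switch_cost -- cycles lost on each context switch
    """
    # Saturated case: other threads fully cover the latency, so the only
    # loss is the context switch overhead itself.
    saturated = run_length / (run_length + switch_cost)
    # Unsaturated case: each thread contributes run_length useful cycles
    # out of every (run_length + switch_cost + latency) cycles.
    unsaturated = threads * run_length / (run_length + switch_cost + latency)
    return min(saturated, unsaturated)


def grouped_efficiency(threads, work_per_access, group_size, latency, switch_cost):
    """Efficiency when one context switch covers group_size shared accesses.

    Assumption (hypothetical): the grouped accesses are issued back to back
    and their latencies overlap, so the thread waits roughly one latency
    per group rather than one per access.
    """
    return efficiency(threads, work_per_access * group_size, latency, switch_cost)


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    threads, work, latency, switch_cost = 4, 10, 200, 5
    for group in (1, 2, 4, 8):
        e = grouped_efficiency(threads, work, group, latency, switch_cost)
        print(f"group size {group}: efficiency {e:.2f}")
```

In this toy model, a larger group lengthens the run between switches, which both amortizes the switch cost and lets a small number of threads cover a latency of hundreds of cycles. Real behavior depends on how many independent shared accesses the compiler can actually find and group.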