Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: preliminary results

Authors:
W.-D. Weber; A. Gupta
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA; Computer Systems Laboratory, Stanford University, Stanford, CA
Venue:
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Year:
1989

Abstract

A fundamental problem that any scalable multiprocessor must address is tolerating high-latency memory operations. This paper explores the extent to which multiple hardware contexts per processor can help mitigate the negative effects of high latency. In particular, we evaluate the performance of a directory-based cache-coherent multiprocessor using memory reference traces obtained from three parallel applications. We explore the case where there is a small, fixed number (2 to 4) of hardware contexts per processor and the context switch overhead is low. In contrast to previously proposed approaches, we also use a very simple context switch criterion, namely a cache miss or a write hit to shared data. Our results show that the effectiveness of multiple contexts depends on the nature of the applications, the context switch overhead, and the inherent latency of the machine architecture. Given reasonably low-overhead hardware context switches, we show that two or four contexts can achieve substantial performance gains over a single context. For one application, processor utilization increased by about 46% with two contexts and by about 80% with four contexts.
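The dependence on context count, switch overhead, and memory latency described in the abstract can be illustrated with a back-of-the-envelope model. The sketch below is not the trace-driven simulation the paper uses. It is a simplified deterministic model that assumes every context runs for a fixed number of cycles before hitting a switch event (a cache miss or a write hit to shared data), then waits a fixed latency. The function name `utilization` and the parameters `run_length`, `latency`, and `switch_cost` are hypothetical names chosen for illustration, and the numbers in the usage loop are arbitrary, not values from the paper.

```python
def utilization(contexts: int, run_length: float, latency: float,
                switch_cost: float) -> float:
    """Estimate the fraction of processor cycles spent on useful work.

    Assumptions (illustrative only, not taken from the paper):
      - each context runs exactly `run_length` cycles between switch events
      - each switch event stalls that context for `latency` cycles
      - each context switch costs `switch_cost` cycles of overhead
    """
    if contexts < 1:
        raise ValueError("contexts must be at least 1")
    if run_length <= 0 or latency < 0 or switch_cost < 0:
        raise ValueError("run_length must be positive; latency and "
                         "switch_cost must be non-negative")

    if contexts == 1:
        # A single context never switches; it simply stalls for the latency.
        return run_length / (run_length + latency)

    # Linear regime: too few contexts to cover the latency, so each context
    # contributes run_length useful cycles per (run + switch + wait) period.
    linear = contexts * run_length / (run_length + switch_cost + latency)

    # Saturated regime: latency is fully hidden, and only the switch
    # overhead separates the processor from full utilization.
    saturated = run_length / (run_length + switch_cost)

    return min(linear, saturated)


if __name__ == "__main__":
    # Arbitrary example parameters, chosen only to show both regimes.
    for n in (1, 2, 4):
        print(n, round(utilization(n, run_length=20, latency=40,
                                   switch_cost=4), 3))
```

Under these assumed parameters the model moves from the linear regime at two contexts to the saturated regime at four, where the switch overhead alone caps utilization. Real applications have variable run lengths and cache interference between contexts, which this sketch ignores.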