Evaluation of mechanisms for fine-grained parallel programs in the J-machine and the CM-5

Authors:
Ellen Spertus;Seth Copen Goldstein;Klaus Erik Schauser;Thorsten von Eicken;David E. Culler;William J. Dally
Venue:
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Year:
1993

Abstract

This paper uses an abstract machine approach to compare the mechanisms of two parallel machines: the J-Machine and the CM-5. High-level parallel programs are translated by a single optimizing compiler to a fine-grained abstract parallel machine, TAM. A final compilation step is unique to each machine and optimizes for specifics of the architecture. By determining the cost of the primitives and weighting them by their dynamic frequency in parallel programs, we quantify the effectiveness of the following mechanisms individually and in combination. Efficient processor/network coupling proves valuable. Message dispatch is found to be less valuable without atomic operations that allow the scheduling levels to cooperate. Multiple hardware contexts are of small value when the contexts cooperate and the compiler can partition the register set. Tagged memory provides little gain. Finally, the performance of the overall system is strongly influenced by the performance of the memory system and the frequency of control operations.
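The cost model described in the abstract (per-primitive cost weighted by dynamic frequency) can be illustrated with a minimal sketch. The primitive names, cycle counts, execution counts, and the function name `weighted_cost` below are hypothetical placeholders chosen for illustration; they are not measurements or identifiers from the paper.

```python
from typing import Mapping


def weighted_cost(cycles_per_primitive: Mapping[str, float],
                  dynamic_counts: Mapping[str, int]) -> float:
    """Return the sum over primitives of (cycles per execution) * (executions)."""
    missing = set(dynamic_counts) - set(cycles_per_primitive)
    if missing:
        raise KeyError(f"no cost given for primitives: {sorted(missing)}")
    return sum(cycles_per_primitive[p] * n for p, n in dynamic_counts.items())


# Hypothetical dynamic frequencies of abstract-machine primitives in one program run.
counts = {"send": 1000, "thread_switch": 4000, "heap_access": 2500}

# Hypothetical per-primitive costs (in cycles) on two different target machines.
machine_a = {"send": 10.0, "thread_switch": 5.0, "heap_access": 2.0}
machine_b = {"send": 40.0, "thread_switch": 3.0, "heap_access": 2.0}

if __name__ == "__main__":
    for name, costs in (("machine_a", machine_a), ("machine_b", machine_b)):
        print(name, weighted_cost(costs, counts))
```

Because the same dynamic counts are applied to both cost tables, a difference in the totals can be attributed to the machines' mechanisms rather than to the program or the front-end compiler, which appears to be the point of routing both targets through one abstract machine.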