A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D

Authors:
Vijay Karamcheti;Andrew A. Chien
Affiliations:
Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, IL;Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, IL
Venue:
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Year:
1995


Abstract

Messaging-based programming models continue to be important for parallel machines. Messaging costs are strongly influenced by a machine's network interface architecture. We examine the impact of architectural support for messaging in two machines, the TMC CM-5 and the Cray T3D, by exploring the design and performance of several messaging implementations. The additional features in the T3D support remote operations: memory access, fetch-and-increment, atomic swaps, and prefetch.

Experiments on the CM-5 show that requiring processor involvement for message reception can increase the communication overheads from 60% to 300% for moderate variations in computation grain size at the destination. In contrast, the T3D hardware for remote operations decouples message reception from processor activity, producing high-performance messaging independent of computation grain size or variability.

In addition, hardware support for a shared address space in the T3D can be used to solve the output contention problem (output hot spots), producing messaging implementations that are robust over a wide variety of traffic patterns. Atomic swap hardware can be used to build a distributed message queue, enabling a "pull" messaging scheme in which the destination requests the data transfer when it performs a receive. This scheme uses prefetches to mask receive latency. While it yields performance that is robust under output contention, its base cost is competitive only for small messages (up to 64 bytes) because of the high cost of issuing and resolving prefetches in the T3D. Emulation shows that if the interaction costs can be reduced by a factor of eight (250 ns to 31 ns), perhaps by moving the prefetch queue on chip, and there is a corresponding increase in the prefetch queue size, the pull scheme can give superior performance in all cases.
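The "pull" scheme described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation and does not model T3D hardware: the names `Node`, `Descriptor`, `atomic_swap_tail`, `send`, and `receive` are hypothetical, the atomic swap is emulated by an ordinary Python assignment, and remote reads are plain dictionary lookups. It only shows the control flow: a sender leaves the payload in its own memory and publishes a small descriptor by swapping the destination's queue tail, and the destination fetches the payload when it performs a receive.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    """Small record a sender publishes; the payload itself stays at the sender."""
    src: int                      # id of the sending node
    addr: int                     # location of the payload in the sender's memory
    next: Optional["Descriptor"] = None


class Node:
    """Hypothetical node with remotely readable memory and a message queue."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.memory = {}          # addr -> payload, stands in for shared memory
        self.next_addr = 0
        stub = Descriptor(src=-1, addr=-1)
        self.head = stub          # last descriptor already consumed
        self.tail = stub          # last descriptor published

    def atomic_swap_tail(self, new: Descriptor) -> Descriptor:
        # Assumption: emulates a hardware atomic swap on the tail pointer.
        old, self.tail = self.tail, new
        return old


def send(nodes: List[Node], src: int, dst: int, payload: bytes) -> None:
    """Publish a descriptor at the destination; no payload moves yet."""
    sender = nodes[src]
    addr = sender.next_addr
    sender.next_addr += 1
    sender.memory[addr] = payload
    desc = Descriptor(src=src, addr=addr)
    prev = nodes[dst].atomic_swap_tail(desc)  # claim a slot in the queue
    prev.next = desc                          # link it in (a remote write)


def receive(nodes: List[Node], dst: int) -> Optional[bytes]:
    """Pull the next message: the destination initiates the data transfer."""
    node = nodes[dst]
    desc = node.head.next
    if desc is None:
        return None                           # queue is empty
    node.head = desc
    # A real implementation would issue prefetches here to hide the latency.
    return nodes[desc.src].memory.pop(desc.addr)


if __name__ == "__main__":
    nodes = [Node(i) for i in range(3)]
    send(nodes, 1, 0, b"hello")
    send(nodes, 2, 0, b"world")
    print(receive(nodes, 0), receive(nodes, 0), receive(nodes, 0))
```

Because senders write only a descriptor at the destination, many senders targeting one node contend for a single swap instead of for buffer space, which is the intuition behind the robustness to output contention mentioned in the abstract.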