Messaging-based programming models continue to be important for parallel machines. Messaging costs are strongly influenced by a machine's network interface architecture. We examine the impact of architectural support for messaging in two machines, the TMC CM-5 and the Cray T3D, by exploring the design and performance of several messaging implementations. The additional features in the T3D support remote operations: memory access, fetch-and-increment, atomic swaps, and prefetch.

Experiments on the CM-5 show that requiring processor involvement for message reception can increase the communication overheads from 60% to 300% for moderate variations in computation grain size at the destination. In contrast, the T3D hardware for remote operations decouples message reception from processor activity, producing high-performance messaging independent of computation grain size or variability.

In addition, hardware support for a shared address space in the T3D can be used to solve the output contention problem (output hot spots), producing messaging implementations that are robust over a wide variety of traffic patterns. Atomic swap hardware can be used to build a distributed message queue, enabling a "pull" messaging scheme in which the destination requests the data transfer upon receive. This scheme uses prefetches to mask receive latency. While this yields performance that is robust under output contention, its base cost is competitive only for small messages (up to 64 bytes) because of the high cost of issuing and resolving prefetches in the T3D. Emulation shows that if the interaction costs can be reduced by a factor of eight (250ns to 31ns), perhaps by moving the prefetch queue on chip, and there is a corresponding increase in the prefetch queue size, the pull scheme can give superior performance in all cases.
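The pull scheme can be illustrated with a small single-machine sketch. This is not the paper's implementation: the abstract does not give the queue layout, so the swap-on-tail linked queue below is one plausible design, and every name (`SwapCell`, `Descriptor`, `Receiver`, `post`, `receive`) is hypothetical. The hardware atomic swap is emulated with a lock, and the remote prefetch is emulated by copying the sender's buffer at receive time.

```python
import threading


class SwapCell:
    """One memory word with an atomic swap (emulated here with a lock)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def swap(self, new):
        # Atomically store `new` and return the previous contents.
        with self._lock:
            old, self._value = self._value, new
            return old


class Descriptor:
    """Queue entry; the payload itself stays in the sender's memory."""

    def __init__(self, src, buf):
        self.src = src
        self.buf = buf
        self.next = None


class Receiver:
    """Destination node holding the tail word of the distributed queue."""

    def __init__(self):
        self.head = Descriptor(None, None)  # dummy entry
        self.tail = SwapCell(self.head)

    def post(self, desc):
        # Run by a sender: one atomic swap claims a queue slot,
        # then a plain remote write links the previous entry to it.
        prev = self.tail.swap(desc)
        prev.next = desc

    def receive(self):
        # Run by the destination: only now is the data pulled across.
        nxt = self.head.next
        if nxt is None:
            return None
        self.head = nxt
        # Stands in for prefetching the sender's buffer (assumption).
        return nxt.src, bytes(nxt.buf)
```

Because senders only deposit small descriptors and the destination fetches payloads at its own pace, bulk data never piles up at a busy receiver, which is the intuition behind the scheme's robustness to output hot spots.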