Supporting parallel applications on clusters of workstations: The Virtual Communication Machine-based architecture

Authors:
Marcel-Cătălin Roşu;Karsten Schwan;Richard Fujimoto
Affiliations:
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280, USA;College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280, USA;College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280, USA
Venue:
Cluster Computing
Year:
1998

Abstract

This paper presents a novel networking architecture designed for communication-intensive parallel applications running on clusters of workstations (COWs) connected by high-speed networks. The architecture addresses what is considered one of the most important problems of cluster-based parallel computing: the inherent inability of communication software performance to scale with host CPU performance. The Virtual Communication Machine (VCM), resident on the network coprocessor, presents a scalable software solution by providing configurable communication functionality that is directly accessible at user level. The VCM architecture is configurable in that selected communication-related functionality traditionally implemented in the application and/or the host kernel can be transferred to the VCM. Such transfers are beneficial when a significant reduction of the host CPU’s load translates into only a small increase in the coprocessor’s load. The functionality implemented by the coprocessor is available at the application level as VCM instructions. Host CPU(s) and coprocessor interact through shared memory regions, thereby avoiding expensive CPU context switches. The host kernel is not involved in this interaction; it simply “connects” the application to the VCM during the initialization phase and is called infrequently to handle exceptional conditions. Protection is enforced by the VCM based on information supplied by the kernel. The VCM-based communication architecture admits low-cost, open implementations, as demonstrated by its current ATM-based implementation, which is built from off-the-shelf hardware components and uses standard AAL5 packets. The architecture makes it easy to implement communication software that imposes negligible overhead on the host CPU(s) and offers latencies and bandwidths close to the hardware limits of the underlying network. These characteristics are due to the VCM’s support for zero-copy messaging with gather/scatter capabilities and the VCM’s direct access to any data structure in an application’s address space.

This paper describes two versions of an ATM-based VCM implementation, which differ in the way they use the memory on the network adapter. Their performance under heavy load is compared in the context of a synthetic client/server application. The same application is used to evaluate how the architecture scales to multiple VCM-based network interfaces per host. Parallel implementations of the Traveling Salesman Problem and of Georgia Tech Time Warp, an engine for discrete-event simulation, are used to demonstrate VCM functionality and the high performance of its implementation. The distributed- and shared-memory versions of these two applications exhibit comparable performance, despite the significant cost-performance advantage of the distributed-memory platform.
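
The interaction pattern summarized in the abstract (the host posting instructions into a shared region, the coprocessor polling that region and gathering data in place) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's implementation: the names `SendInstruction`, `CommandRing`, and `coprocessor_step`, the ring layout, and the `(low, high)` form of the protection information are all hypothetical, and a Python list stands in for the shared memory region.

```python
from dataclasses import dataclass


@dataclass
class SendInstruction:
    """Hypothetical VCM instruction: send the bytes named by a gather list."""
    gather: list  # (offset, length) pairs into the application's memory


class CommandRing:
    """Single-producer/single-consumer ring standing in for a shared region."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0  # count of instructions posted by the host
        self.tail = 0  # count of instructions consumed by the coprocessor

    def post(self, instr):
        """Host side: enqueue without a kernel call; False if the ring is full."""
        if self.head - self.tail == len(self.slots):
            return False
        self.slots[self.head % len(self.slots)] = instr
        self.head += 1
        return True

    def poll(self):
        """Coprocessor side: dequeue the next instruction, or None if empty."""
        if self.tail == self.head:
            return None
        instr = self.slots[self.tail % len(self.slots)]
        self.tail += 1
        return instr


def coprocessor_step(ring, app_memory, allowed):
    """Execute one pending instruction against the application's memory.

    `allowed` is an assumed (low, high) byte range supplied at setup time,
    standing in for the protection information the kernel gives the VCM.
    """
    instr = ring.poll()
    if instr is None:
        return None
    view = memoryview(app_memory)  # slices of a memoryview do not copy
    low, high = allowed
    pieces = []
    for offset, length in instr.gather:
        if length < 0 or offset < low or offset + length > high:
            raise PermissionError("gather entry outside the permitted region")
        pieces.append(view[offset:offset + length])
    # The join copies only to emulate handing the pieces to the network.
    return b"".join(pieces)
```

For example, posting `SendInstruction([(0, 6), (10, 7)])` against a buffer holding `b"HEADERxxxxPAYLOADyyyy"` yields `b"HEADERPAYLOAD"` on the next `coprocessor_step`, with the header and payload gathered from non-contiguous locations in application memory.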