ACM Transactions on Programming Languages and Systems (TOPLAS)
High-performance operating system primitives for robotics and real-time control systems
ACM Transactions on Computer Systems (TOCS)
RPC in the x-Kernel: evaluating new design techniques
SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
The interaction of architecture and operating system design
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Limits to low-latency communication on high-speed networks
ACM Transactions on Computer Systems (TOCS)
Experiences with a high-speed network adaptor: a software perspective
SIGCOMM '94 Proceedings of the conference on Communications architectures, protocols and applications
User-space protocols deliver high performance to applications on a low-cost Gb/s LAN
SIGCOMM '94 Proceedings of the conference on Communications architectures, protocols and applications
GTW: a time warp system for shared memory multiprocessors
WSC '94 Proceedings of the 26th conference on Winter simulation
Optimistic active messages: a mechanism for scheduling communication with computation
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Buffer management in shared-memory Time Warp systems
PADS '95 Proceedings of the ninth workshop on Parallel and distributed simulation
U-Net: a user-level network interface for parallel and distributed computing
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Exokernel: an operating system architecture for application-level resource management
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Extensibility safety and performance in the SPIN operating system
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
The impact of architectural trends on operating system performance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
High performance messaging on workstations: Illinois fast messages (FM) for Myrinet
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Missing the memory wall: the case for processor/memory integration
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Early experience with message-passing on the SHRIMP multicomputer
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
An implementation of the Hamlyn sender-managed interface architecture
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Lazy receiver processing (LRP): a network subsystem architecture for server systems
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Effects of communication latency, overhead, and bandwidth in a cluster architecture
Proceedings of the 24th annual international symposium on Computer architecture
CPU reservations and time constraints: efficient, predictable scheduling of independent activities
Proceedings of the sixteenth ACM symposium on Operating systems principles
IEEE Micro
Exploiting the Capabilities of Communications Co-Processors
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Design and Implementation of Virtual Memory-Mapped Communication on Myrinet
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Efficient Communication Mechanisms for Cluster Based Parallel Computing
CANPC '97 Proceedings of the First International Workshop on Communication and Architectural Support for Network-Based Parallel Computing
Distributed Simulation of Large-Scale PCS Networks
MASCOTS '94 Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation On Computer and Telecommunication Systems
CNI: A High-Performance Network Interface for Workstation Clusters
HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
Employing Logic-Enhanced Memory for High-performance ATM Network Interfaces
HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
Processor Controlled Off-Processor I/O
Processor Controlled Off-Processor I/O
Incorporating Memory Management into User-Level Network Interfaces
Incorporating Memory Management into User-Level Network Interfaces
EXECUBE-A New Architecture for Scaleable MPPs
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Efficient wire formats for high performance computing
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Native Data Representation: An Efficient Wire Format for High-Performance Distributed Computing
IEEE Transactions on Parallel and Distributed Systems
On Network CoProcessors for Scalable, Predictable Media Services
IEEE Transactions on Parallel and Distributed Systems
Advanced networking services for distributed multimedia streaming applications
Multimedia Tools and Applications
Hi-index | 0.00 |
This paper presents a novel networking architecture designed for communication intensive parallel applications running on clusters of workstations (COWs) connected by high speed networks. The architecture addresses what is considered one of the most important problems of cluster-based parallel computing: the inherent inability of scaling the performance of communication software along with the host CPU performance. The Virtual Communication Machine (VCM), resident on the network coprocessor, presents a scalable software solution by providing configurable communication functionality directly accessible at user-level. The VCM architecture is configurable in that it enables the transfer to the VCM of selected communication-related functionality that is traditionally part of the application and/or the host kernel. Such transfers are beneficial when a significant reduction of the host CPU’s load translates into a small increase in the coprocessor’s load. The functionality implemented by the coprocessor is available at the application level as VCM instructions. Host CPU(s) and coprocessor interact through shared memory regions, thereby avoiding expensive CPU context switches. The host kernel is not involved in this interaction; it simply “connects” the application to the VCM during the initialization phase and is called infrequently to handle exceptional conditions. Protection is enforced by the VCM based on information supplied by the kernel. The VCM-based communication architecture admits low cost and open implementations, as demonstrated by its current ATM-based implementation based on off-the-shelf hardware components and using standard AAL5 packets. The architecture makes it easy to implement communication software that exhibits negligible overheads on the host CPU(s) and offers latencies and bandwidths close to the hardware limits of the underlying network. These characteristics are due to the VCM’s support for zero-copy messaging with gather/scatter capabilities and the VCM’s direct access to any data structure in an application’s address space. This paper describes two versions of an ATM-based VCM implementation, which differ in the way they use the memory on the network adapter. Their performance under heavy load is compared in the context of a synthetic client/server application. The same application is used to evaluate the scalability of the architecture to multiple VCM-based network interfaces per host. Parallel implementations of the Traveling Salesman Problem and of Georgia Tech Time Warp, an engine for discrete-event simulation, are used to demonstrate VCM functionality and the high performance of its implementation. The distributed- and shared-memory versions of these two applications exhibit comparable performance, despite the significant cost-performance advantage of the distributed-memory platform.