Active messages provide a low-latency communication architecture that, on modern parallel machines, achieves more than an order of magnitude performance improvement over more traditional communication libraries. This paper describes the experience we gained while implementing active messages on the Meiko CS-2 and discusses implementations for similar architectures. During this work we found that architectures which support only efficient remote write operations (or DMA transfers, as in the case of the CS-2) make it difficult to transfer both data and control, as active messages require. Traditional network interfaces avoid this problem because they have a single point of entry which essentially acts as a queue. To support active messages efficiently on modern network communication co-processors, hardware primitives that provide this queue behavior are required. We overcame this problem by producing specialized code which runs on the communications co-processor and supports the active messages protocol. Our implementation of active messages achieves a one-way latency of 12.3 µs and up to 39 MB/s for bulk transfers. Both numbers are close to optimal for the current Meiko hardware and are competitive with the performance of active messages on other hardware platforms.
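The queue behavior described above can be illustrated with a minimal sketch. It is not the Meiko CS-2 implementation: the names `Endpoint`, `deliver`, `poll`, and `store_handler` are hypothetical, and an in-process Python queue stands in for the network. The point it shows is that each queue entry carries both control (which handler to run) and data (the payload), so the receiver learns of a message's arrival and its contents in one step, which a bare remote write does not provide.

```python
from collections import deque


class Endpoint:
    """Hypothetical receiving node: one queue is the single point of entry."""

    def __init__(self):
        self.queue = deque()   # preserves arrival order
        self.handlers = {}     # handler id -> function (assumed registry)
        self.memory = {}       # stand-in for node-local memory

    def register(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def deliver(self, handler_id, payload):
        # Control (handler id) and data (payload) arrive as one entry.
        self.queue.append((handler_id, payload))

    def poll(self):
        # Drain the queue, running each message's handler on its payload.
        handled = 0
        while self.queue:
            handler_id, payload = self.queue.popleft()
            self.handlers[handler_id](self, payload)
            handled += 1
        return handled


def store_handler(endpoint, payload):
    # Example handler: write a value into the receiver's memory.
    address, value = payload
    endpoint.memory[address] = value


node = Endpoint()
node.register(0, store_handler)
node.deliver(0, ("x", 1))
node.deliver(0, ("x", 2))
handled = node.poll()
```

After `poll()` returns, both messages have been handled in arrival order, so the second write to `"x"` is the one that remains in `node.memory`. With only remote writes, a sender could place the payload in memory but would need a separate mechanism to tell the receiver that a handler should run.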