Warp: an integrated solution of high-speed parallel computing
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Optimum Broadcasting and Personalized Communication in Hypercubes
IEEE Transactions on Computers
PVM: a framework for parallel distributed computing
Concurrency: Practice and Experience
An architecture for parallel database computing
Proceedings of the world transputer user group (WOTUG) conference on Transputing '91
Adaptive deadlock- and livelock-free routing with all minimal paths in Torus networks
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
Designing broadcasting algorithms in the postal model for message-passing systems
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
The network architecture of the Connection Machine CM-5 (extended abstract)
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Exploiting task and data parallelism on a multicomputer
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Scalable parallel computing: the IBM 9076 scalable POWERparallel 1
SPAA '93 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures
Supporting sets of arbitrary connections on iWarp through communication context switches
SPAA '93 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures
Generating communication for array statements: design, implementation, and evaluation
Journal of Parallel and Distributed Computing - Special issue on data parallel algorithms and programming
Supporting systolic and memory communication in iWarp
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Design and Implementation of an Interconnection Network for the AP1000
Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing '92, Volume 1 - Volume I
An Architecture for Optimal All-to-All Personalized Communication
An Architecture for Optimal All-to-All Personalized Communication
Universal congestion control for meshes
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Optimizing memory system performance for communication in parallel computers
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
IEEE Transactions on Parallel and Distributed Systems
Modeling parallel bandwidth: local vs. global restrictions
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Edge Congestion of Shortest Path Systems for All-to-All Communication
IEEE Transactions on Parallel and Distributed Systems
Scheduling time-constrained communication in linear networks
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Scalable S-To-P Broadcasting on Message-Passing MPPs
IEEE Transactions on Parallel and Distributed Systems
On scheduling all-to-all personalized connections and cost-effective designs in WDM rings
IEEE/ACM Transactions on Networking (TON)
Configurable Algorithms for Complete Exchange in 2D Meshes
IEEE Transactions on Parallel and Distributed Systems
All-to-All Personalized Communication in Multidimensional Torus and Mesh Networks
IEEE Transactions on Parallel and Distributed Systems
Compiled communication for all-optical TDM networks
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Optimal All-to-All Personalized Exchange in a Class of Optical Multistage Networks
IEEE Transactions on Parallel and Distributed Systems
Multiphase Complete Exchange on Paragon, SP2, and CS-2
IEEE Parallel & Distributed Technology: Systems & Technology
All-To-All Communication with Minimum Start-Up Costs in 2D/3D Tori and Meshes
IEEE Transactions on Parallel and Distributed Systems
Algorithms for All-to-All Personalized Exchange in 2D and 3D Tori
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
CC--MPI: a compiled communication capable MPI prototype for ethernet switched clusters
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Message Scheduling for All-to-All Personalized Communication on Ethernet Switched Clusters
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Performance modeling and optimization of a high energy colliding beam simulation code
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
A Message Scheduling Scheme for All-to-All Personalized Communication on Ethernet Switched Clusters
IEEE Transactions on Parallel and Distributed Systems
One-to-all personalized communication in torus networks
PDCN'07 Proceedings of the 25th conference on Proceedings of the 25th IASTED International Multi-Conference: parallel and distributed computing and networks
rMPI: message passing on multicore processors with on-chip interconnect
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
TELE-INFO'06 Proceedings of the 5th WSEAS international conference on Telecommunications and informatics
HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Hi-index | 0.01 |
In all-to-all personalized communication (AAPC), every node of a parallel system sends a potentially unique packet to every other node. AAPC is an important primitive operation for modern parallel compilers, since it is used to redistribute data structures during parallel computations. As an extremely dense communication pattern, AAPC causes congestion in many types of networks and therefore executes very poorly on general purpose, asynchronous message passsing routers.We present and evaluate a network architecture that executes all-to-all communication optimally on a two-dimensional torus. The router combines optimal partitions of the AAPC step with a self-synchronizing switching mechanism integrated into a conventional wormhole router. Optimality is achieved by routing along shortest paths while fully utilizing all links. A simple hardware addition for synchronized message switching can guarantee optimal AAPC routing in many existing network architectures.The flexible communication agent of the iWarp VLSI component allowed us to implement an efficient prototype for the evaluation of the hardware complexity as well as possible software overheads. The measured performance on an 8 × 8 torus exceeded 2 GigaBytes/sec or 80% of the limit set by the raw speed of the interconnects. We make a quantitative comparison of the AAPC router with a conventional message passing system. The potential gain of such a router for larger parallel programs is illustrated with the example of a two-dimensional Fast Fourier Transform.