An architecture for optimal all-to-all personalized communication

Authors:
Susan Hinrichs;Corey Kosak;David R. O'Hallaron;Thomas M. Stricker;Riichiro Take
Affiliations:
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;Fujitsu Laboratories Ltd., 1015 Kamikodanaka, Nakahara-ku, Kawasaki, 211, Japan and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Venue:
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Year:
1994

Citing 15
Cited 24

Warp: an integrated solution of high-speed parallel computing

Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Optimum Broadcasting and Personalized Communication in Hypercubes

IEEE Transactions on Computers
PVM: a framework for parallel distributed computing

Concurrency: Practice and Experience
An architecture for parallel database computing

Proceedings of the world transputer user group (WOTUG) conference on Transputing '91
Adaptive deadlock- and livelock-free routing with all minimal paths in Torus networks

SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
Designing broadcasting algorithms in the postal model for message-passing systems

SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
The network architecture of the Connection Machine CM-5 (extended abstract)

SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Exploiting task and data parallelism on a multicomputer

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Scalable parallel computing: the IBM 9076 scalable POWERparallel 1

SPAA '93 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures
Supporting sets of arbitrary connections on iWarp through communication context switches

SPAA '93 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures
Generating communication for array statements: design, implementation, and evaluation

Journal of Parallel and Distributed Computing - Special issue on data parallel algorithms and programming
Supporting systolic and memory communication in iWarp

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Design and Implementation of an Interconnection Network for the AP1000

Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing '92, Volume 1
An Architecture for Optimal All-to-All Personalized Communication

Universal congestion control for meshes

Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Optimizing memory system performance for communication in parallel computers

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Bandwidth-Optimal Complete Exchange on Wormhole-Routed 2D/3D Torus Networks: A Diagonal-Propagation Approach

IEEE Transactions on Parallel and Distributed Systems
Modeling parallel bandwidth: local vs. global restrictions

Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Edge Congestion of Shortest Path Systems for All-to-All Communication

IEEE Transactions on Parallel and Distributed Systems
Scheduling time-constrained communication in linear networks

Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Scalable S-To-P Broadcasting on Message-Passing MPPs

IEEE Transactions on Parallel and Distributed Systems
On scheduling all-to-all personalized connections and cost-effective designs in WDM rings

IEEE/ACM Transactions on Networking (TON)
Configurable Algorithms for Complete Exchange in 2D Meshes

IEEE Transactions on Parallel and Distributed Systems
All-to-All Personalized Communication in Multidimensional Torus and Mesh Networks

IEEE Transactions on Parallel and Distributed Systems
Compiled communication for all-optical TDM networks

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Optimal All-to-All Personalized Exchange in a Class of Optical Multistage Networks

IEEE Transactions on Parallel and Distributed Systems
Multiphase Complete Exchange on Paragon, SP2, and CS-2

IEEE Parallel & Distributed Technology: Systems & Technology
Baring It All to Software: Raw Machines

Computer
All-To-All Communication with Minimum Start-Up Costs in 2D/3D Tori and Meshes

IEEE Transactions on Parallel and Distributed Systems
Algorithms for All-to-All Personalized Exchange in 2D and 3D Tori

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
CC--MPI: a compiled communication capable MPI prototype for ethernet switched clusters

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Message Scheduling for All-to-All Personalized Communication on Ethernet Switched Clusters

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Performance modeling and optimization of a high energy colliding beam simulation code

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
A Message Scheduling Scheme for All-to-All Personalized Communication on Ethernet Switched Clusters

IEEE Transactions on Parallel and Distributed Systems
One-to-all personalized communication in torus networks

PDCN'07 Proceedings of the 25th IASTED International Multi-Conference: parallel and distributed computing and networks
rMPI: message passing on multicore processors with on-chip interconnect

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Packet scheduling in WDM ring networks with non-uniform traffic demands and arbitrary transceiver tuning latencies

TELE-INFO'06 Proceedings of the 5th WSEAS international conference on Telecommunications and informatics
Performance analysis of user-level PIM communication in the data intensive architecture (DIVA) system

HiPC'05 Proceedings of the 12th international conference on High Performance Computing

Abstract

In all-to-all personalized communication (AAPC), every node of a parallel system sends a potentially unique packet to every other node. AAPC is an important primitive operation for modern parallel compilers, since it is used to redistribute data structures during parallel computations. As an extremely dense communication pattern, AAPC causes congestion in many types of networks and therefore executes very poorly on general-purpose, asynchronous message-passing routers.

We present and evaluate a network architecture that executes all-to-all communication optimally on a two-dimensional torus. The router combines optimal partitions of the AAPC step with a self-synchronizing switching mechanism integrated into a conventional wormhole router. Optimality is achieved by routing along shortest paths while fully utilizing all links. A simple hardware addition for synchronized message switching can guarantee optimal AAPC routing in many existing network architectures.

The flexible communication agent of the iWarp VLSI component allowed us to implement an efficient prototype for evaluating the hardware complexity as well as possible software overheads. The measured performance on an 8 × 8 torus exceeded 2 GigaBytes/sec, or 80% of the limit set by the raw speed of the interconnects. We make a quantitative comparison of the AAPC router with a conventional message-passing system. The potential gain of such a router for larger parallel programs is illustrated with the example of a two-dimensional Fast Fourier Transform.
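
To make the communication pattern concrete, the following is a minimal Python sketch, not the router described above. It models AAPC as a transpose of per-node send buffers. It then computes a simple link-load lower bound on the number of steps an AAPC needs on an n × n torus, which is the kind of bound that "routing along shortest paths while fully utilizing all links" would meet. The function names are hypothetical. The sketch assumes n ≥ 3, unit-size packets, four outgoing links per node, and one packet per directed link per step; none of these parameters are taken from the paper.

```python
def aapc_exchange(send):
    """Reference semantics of AAPC (illustrative only).

    send[i][j] is the packet node i holds for node j.
    Returns recv with recv[j][i] == send[i][j], i.e. a transpose.
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]


def ring_distance(a, b, n):
    """Shortest hop count between positions a and b on a ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def torus_link_load_bound(n):
    """Lower bound on AAPC steps on an n x n torus (assumes n >= 3).

    Every packet routed on a shortest path occupies one directed link
    per hop, so the total hop count divided by the number of directed
    links bounds the number of steps from below.
    """
    nodes = [(x, y) for x in range(n) for y in range(n)]
    total_hops = sum(
        ring_distance(sx, dx, n) + ring_distance(sy, dy, n)
        for (sx, sy) in nodes
        for (dx, dy) in nodes
    )
    directed_links = 4 * n * n  # assumption: 4 outgoing links per node
    return -(-total_hops // directed_links)  # ceiling division


if __name__ == "__main__":
    # Two nodes: node 0 holds a0, a1 and node 1 holds b0, b1.
    print(aapc_exchange([["a0", "a1"], ["b0", "b1"]]))
    # Link-load bound for the 8 x 8 torus size mentioned in the abstract.
    print(torus_link_load_bound(8))
```

Under these assumptions the bound for an 8 × 8 torus is 64 steps. A schedule reaches it only if every link carries a packet in every step, which is why uncoordinated routing that leaves links idle or sends packets on longer paths falls short.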