Abstract: Complete Exchange requires each of N processors to send a unique message to each of the remaining $N - 1$ processors. For a circuit switched hypercube with $N = 2^d$ processors, the Direct and Standard algorithms for Complete Exchange are time optimal for very large and very small message sizes, respectively. For intermediate sizes, a hybrid Multiphase algorithm is better. This carries out Direct exchanges on a set of subcubes whose dimensions are a partition of the integer d. The best such algorithm for a given message size m could hitherto only be found by enumerating all partitions of d.

The Multiphase algorithm is analyzed assuming a high performance communication network. It is proved that only algorithms corresponding to equipartitions of d (partitions in which the maximum and minimum elements differ by at most one) can possibly be optimal. The run times of these algorithms plotted against m form a hull of optimality. It is proved that, although there is an exponential number of partitions, 1) the number of faces on this hull is $\Theta(\sqrt{d})$, 2) the hull can be found in $\Theta(\sqrt{d})$ time, and 3) once it has been found, the optimal algorithm for any given m can be found in $\Theta(\log d)$ time.

These results provide a very fast technique for minimizing communication overhead in many important applications, such as matrix transpose, fast Fourier transform, and alternating directions implicit (ADI).
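The definition of an equipartition can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the paper: the function names `equipartition`, `partitions`, and `is_equipartition` are hypothetical, and the code covers only the combinatorial definition, not the run-time model or the hull construction. It builds the unique equipartition of d into k parts directly and checks it against a brute-force enumeration of all partitions of d.

```python
def equipartition(d, k):
    """Return the partition of d into k parts that differ by at most one.

    Hypothetical helper for illustration; not from the paper.
    """
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= d")
    q, r = divmod(d, k)
    # r parts of size q + 1, followed by k - r parts of size q
    return [q + 1] * r + [q] * (k - r)


def partitions(d, largest=None):
    """Yield every partition of d as a non-increasing list of positive integers."""
    if largest is None:
        largest = d
    if d == 0:
        yield []
        return
    for first in range(min(d, largest), 0, -1):
        for rest in partitions(d - first, first):
            yield [first] + rest


def is_equipartition(parts):
    """True when the largest and smallest parts differ by at most one."""
    return max(parts) - min(parts) <= 1


if __name__ == "__main__":
    d = 7
    all_parts = list(partitions(d))
    candidates = [p for p in all_parts if is_equipartition(p)]
    print(len(all_parts), "partitions of", d)
    print(len(candidates), "equipartitions:")
    for p in candidates:
        print(p)
```

For each number of parts k from 1 to d there is exactly one equipartition, so restricting attention to equipartitions leaves only d candidates instead of every partition of d. The one-part partition [d] and the all-ones partition are both equipartitions.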