Communication latency is a significant factor in the performance of parallel applications. Techniques such as wormhole routing made the variation in no-load latency insignificant: the no-load latency to a distant processor is not significantly higher than that to a nearby one, and the difference is too small to matter. Network contention is then left as the major factor affecting latency. On networks such as fat-trees or hypercubes, where the number of wires grows as P log P, even contention is not a very significant factor. On the torus and grid networks now used in large machines such as BlueGene/L and the Cray XT3, however, contention becomes an issue. We quantify its effect with benchmarks that vary the number of hops traveled by each communicated byte. We then demonstrate a process mapping strategy that reduces the impact of topology by heuristically minimizing the total number of hop-bytes communicated, that is, the sum over all messages of message size multiplied by the number of hops traveled. This strategy and its variants are implemented in the adaptive runtime system of Charm++ and Adaptive MPI, so they are available to a broad class of applications.
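To make the hop-bytes metric concrete, the following minimal Python sketch computes it for a task placement on a torus and then applies a simple pairwise-swap refinement. The function names (`torus_hops`, `hop_bytes`, `swap_refine`), the ring communication pattern, and the swap heuristic itself are illustrative assumptions; this is not the mapping strategy implemented in Charm++, only one plausible way to lower hop-bytes heuristically.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus (wraparound links)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(comm, placement, dims):
    """Sum over communicating task pairs of bytes sent times hops traveled."""
    return sum(nbytes * torus_hops(placement[s], placement[t], dims)
               for (s, t), nbytes in comm.items())


def swap_refine(comm, placement, dims):
    """Hypothetical heuristic: swap two tasks whenever the swap lowers hop-bytes."""
    placement = dict(placement)          # work on a copy
    tasks = list(placement)
    best = hop_bytes(comm, placement, dims)
    improved = True
    while improved:                      # each accepted swap strictly lowers the cost
        improved = False
        for i, u in enumerate(tasks):
            for v in tasks[i + 1:]:
                placement[u], placement[v] = placement[v], placement[u]
                cost = hop_bytes(comm, placement, dims)
                if cost < best:
                    best, improved = cost, True
                else:                    # undo a swap that did not help
                    placement[u], placement[v] = placement[v], placement[u]
    return placement, best


# Assumed example: 8 tasks exchanging 100 bytes with their ring neighbor,
# placed on a 1-D torus (a ring) of 8 processors.
dims = (8,)
comm = {(i, (i + 1) % 8): 100 for i in range(8)}
identity = {i: (i,) for i in range(8)}             # neighbors are 1 hop apart
scattered = {i: ((3 * i) % 8,) for i in range(8)}  # neighbors are 3 hops apart

refined, cost = swap_refine(comm, scattered, dims)
print(hop_bytes(comm, identity, dims), hop_bytes(comm, scattered, dims), cost)
```

In this assumed example the topology-aware placement costs 800 hop-bytes and the scattered one costs 2400. The swap refinement never raises the cost, but as a local search it is not guaranteed to reach the optimum.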