This paper studies how exploitation of the local memory hierarchy and of the communication network affects message sending, and how this in turn influences the decomposition of regular applications. We consider two parallel computers, a Cray T3E-900 and an SGI Origin 2000. On both systems, the bandwidth reduction caused by non-unit-stride memory access is significant and can be more important than the reduction caused by network contention. These findings affect the choice of optimal decompositions for regular-domain problems. Although traditional 3D decompositions lead to lower inherent communication-to-computation ratios and could exploit the interconnection network more efficiently, lower-dimensional decompositions are found to be more efficient because of the effect of the data decomposition on the spatial locality of the messages to be communicated. The increasing importance of local optimisations is also shown with a well-known communication-computation overlapping technique which, because of poor cache exploitation, increases execution time instead of reducing it as expected.
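The trade-off described above can be illustrated with a small sketch. It is not taken from the paper: it assumes a cubic n x n x n domain stored in row-major order (last index fastest), an interior subdomain with two neighbours along every partitioned axis, and a one-cell-deep boundary exchange. The function names comm_to_comp_ratio and contiguous_run are hypothetical helpers introduced only for this illustration.

```python
def comm_to_comp_ratio(n, procs_per_dim):
    """Boundary cells exchanged per local cell for an interior subdomain.

    Assumption: n is divisible by every entry of procs_per_dim, and each
    partitioned axis contributes two faces (one per neighbour).
    """
    local = [n // p for p in procs_per_dim]
    volume = local[0] * local[1] * local[2]
    surface = 0
    for axis, p in enumerate(procs_per_dim):
        if p > 1:  # only partitioned axes need communication
            surface += 2 * (volume // local[axis])
    return surface / volume


def contiguous_run(local_shape, axis):
    """Longest unit-stride run inside a face normal to `axis`.

    Assumption: row-major storage, so only the axes after `axis`
    vary contiguously when the index along `axis` is held fixed.
    """
    run = 1
    for extent in local_shape[axis + 1:]:
        run *= extent
    return run


if __name__ == "__main__":
    n = 64
    # 1D slabs over 8 processes versus a 2 x 2 x 2 block decomposition.
    print("1D ratio:", comm_to_comp_ratio(n, (8, 1, 1)))
    print("3D ratio:", comm_to_comp_ratio(n, (2, 2, 2)))
    # The 1D faces are one contiguous block; the 3D faces normal to the
    # fastest axis consist of isolated, strided elements.
    print("1D face run:", contiguous_run((8, 64, 64), 0))
    print("3D worst face run:", contiguous_run((32, 32, 32), 2))
```

Under these assumptions the 3D decomposition exchanges fewer cells per local cell, but one pair of its faces has no unit-stride run longer than a single element, whereas the 1D slab faces are fully contiguous. This is the kind of spatial-locality effect that the abstract reports as favouring lower-dimensional decompositions; the sketch only counts cells and strides and makes no claim about measured bandwidth.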