The accurate modeling of the electronic structure of atoms and molecules requires computationally intensive tensor contractions over large multi-dimensional arrays. Computing complex tensor contractions efficiently usually requires generating temporary intermediate arrays. These intermediates can be extremely large, but they can often be generated and consumed in batches through appropriate loop fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of inter-processor communication must be minimized, subject to the memory available on each processor. In this paper, we address the memory-constrained communication minimization problem for this class of computations. Based on a framework that models the relationship between loop fusion and memory usage, we develop an approach that identifies the combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit. The effectiveness of the approach is demonstrated on a computation representative of a component used in quantum chemistry suites.
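
To make the link between loop fusion and intermediate size concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a two-step contraction S[i][l] = sum over j of T[i][j] * C[j][l], with T[i][j] = sum over k of A[i][k] * B[k][j]. The function names `contract_unfused` and `contract_fused` are hypothetical, and the second return value is simply the number of intermediate elements held at one time. Data partitioning and communication cost are not modeled here.

```python
def contract_unfused(A, B, C):
    """Build the full intermediate T, then contract it with C."""
    n_i, n_k, n_j, n_l = len(A), len(B), len(C), len(C[0])
    # Full intermediate: T[i][j] = sum_k A[i][k] * B[k][j]  (n_i * n_j elements)
    T = [[sum(A[i][k] * B[k][j] for k in range(n_k)) for j in range(n_j)]
         for i in range(n_i)]
    # Result: S[i][l] = sum_j T[i][j] * C[j][l]
    S = [[sum(T[i][j] * C[j][l] for j in range(n_j)) for l in range(n_l)]
         for i in range(n_i)]
    return S, n_i * n_j


def contract_fused(A, B, C):
    """Fuse the i loops of producer and consumer: only one row of T is live."""
    n_i, n_k, n_j, n_l = len(A), len(B), len(C), len(C[0])
    S = []
    for i in range(n_i):
        # One batch of the intermediate: row i of T (n_j elements)
        t_row = [sum(A[i][k] * B[k][j] for k in range(n_k)) for j in range(n_j)]
        # Consume the batch immediately, then discard it
        S.append([sum(t_row[j] * C[j][l] for j in range(n_j))
                  for l in range(n_l)])
    return S, n_j


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[1, 0], [0, 1]]
    print(contract_unfused(A, B, C))
    print(contract_fused(A, B, C))
```

Both versions compute the same result; fusing the shared `i` loop shrinks the live intermediate from n_i * n_j elements to n_j. In the parallel setting the abstract describes, which loops are fused also constrains how the arrays can be partitioned across processors, which is why the two choices are optimized together.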