MCAMP: communication optimization on massively parallel machines with hierarchical scratch-pad memory

Authors:
Hiroshige Hayashizaki;Yutaka Sugawara;Mary Inaba;Kei Hiraki
Affiliations:
The University of Tokyo, Tokyo, Japan;The University of Tokyo, Tokyo, Japan;The University of Tokyo, Tokyo, Japan;The University of Tokyo, Tokyo, Japan
Venue:
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Year:
2008

Abstract

Massively parallel machines that integrate a large number of simple processors and small scratch-pad memories (SPMs) on a single chip can achieve high peak performance per watt. On these machines, communication optimization is important because communication bandwidth tends to be the bottleneck. Previously proposed communication optimizations based on copy candidates, which have been shown to be effective, detect frequently reused array regions through compile-time analysis and copy those regions to scratch-pad memories nearer to the processors. However, they were proposed for uniprocessor systems or small parallel machines with one or more layers of scratch-pad memory, and their analysis time increases when they are applied to massively parallel machines. In this paper, we propose Multilayer Copy-candidate Analysis for Massively Parallel machines (MCAMP), a communication optimization method for massively parallel machines. MCAMP re-formalizes the framework used in earlier work and improves the scalability of the analysis by assuming that the target system is homogeneous. We implemented an MCAMP optimizer, which takes an input program consisting of perfectly nested loops that contain array references and computation code, and generates optimized communication code. We measured the performance of the optimizer's output programs by executing them on GRAPE-DR, a real massively parallel machine, using a software tool chain that we also implemented. The results show that MCAMP achieves optimal data transfer patterns and performance comparable to that of hand-optimized code, with a short analysis time.
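
The copy-candidate idea mentioned in the abstract can be illustrated with a small sketch. It is not the MCAMP analysis itself and does not come from the paper: it is a brute-force toy, assuming a single array reference inside a perfectly nested loop, that enumerates one candidate per loop depth and reports the buffer size each would need and the number of elements it would transfer from the farther memory. The function names `copy_candidates` and `best_candidate`, the example reference `A[i + k]`, and the capacity values are all hypothetical.

```python
from itertools import product


def copy_candidates(bounds, index_fn):
    """Enumerate one copy candidate per loop depth (illustrative only).

    bounds   -- trip counts of a perfectly nested loop, outermost first
    index_fn -- maps the loop indices to the array element that is read

    A candidate at depth d is refilled once per iteration of the d outer
    loops and holds every element touched by the remaining inner loops.
    Returns a list of (depth, buffer_size, transfers) tuples.
    """
    results = []
    for depth in range(len(bounds) + 1):
        outer, inner = bounds[:depth], bounds[depth:]
        buffer_size = 0
        transfers = 0
        for o in product(*(range(b) for b in outer)):
            # Region of the array touched by the inner loops for this
            # fixed choice of outer loop indices.
            footprint = {
                index_fn(*o, *i)
                for i in product(*(range(b) for b in inner))
            }
            buffer_size = max(buffer_size, len(footprint))
            transfers += len(footprint)
        results.append((depth, buffer_size, transfers))
    return results


def best_candidate(candidates, capacity):
    """Pick the candidate with the fewest transfers that fits in `capacity`.

    Ties are broken in favour of the smaller buffer.
    """
    fitting = [c for c in candidates if c[1] <= capacity]
    return min(fitting, key=lambda c: (c[2], c[1])) if fitting else None


# Hypothetical loop nest: for i in range(8): for k in range(4): use A[i + k]
cands = copy_candidates((8, 4), lambda i, k: i + k)
for depth, size, moved in cands:
    print(f"depth={depth} buffer={size} transfers={moved}")
```

In this toy case, copying the whole reused region once outside the loops needs the largest buffer but the fewest transfers, while deeper candidates need less space and move more data. `best_candidate(cands, capacity)` then chooses among them for a given scratch-pad capacity. A real analysis would work symbolically at compile time rather than by enumerating iterations, and would handle several memory layers and many processors.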