Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
Theory of linear and integer programming
Theory of linear and integer programming
Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compiler algorithms for synchronization
IEEE Transactions on Computers
Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Principles of runtime support for parallel processors
ICS '88 Proceedings of the 2nd international conference on Supercomputing
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Process decomposition through locality of reference
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
A parallelizing compiler for distributed memory parallel computers
A parallelizing compiler for distributed memory parallel computers
Data optimization: allocation of arrays to reduce communication on SIMD machines
Journal of Parallel and Distributed Computing - Massively parallel computation
Supercompilers for parallel and vector computers
Supercompilers for parallel and vector computers
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Automatic generation of global optimizers
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Compiler techniques for data partitioning of sequentially iterated parallel loops
ICS '90 Proceedings of the 4th international conference on Supercomputing
The parallel execution of DO loops
Communications of the ACM
Optimizing Supercompilers for Supercomputers
Optimizing Supercompilers for Supercomputers
Compiling Communication-Efficient Programs for Massively Parallel Machines
IEEE Transactions on Parallel and Distributed Systems
Limits on Interconnection Network Performance
IEEE Transactions on Parallel and Distributed Systems
Compiling Global Name-Space Parallel Loops for Distributed Execution
IEEE Transactions on Parallel and Distributed Systems
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Compile-Time Techniques for Data Distribution in Distributed Memory Machines
IEEE Transactions on Parallel and Distributed Systems
Access Normalization: Loop Restructuring for NUMA Compilers
Access Normalization: Loop Restructuring for NUMA Compilers
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Compiling for locality of reference
Compiling for locality of reference
Loop transformations for NUMA machines
ACM SIGPLAN Notices - Workshop on languages, compilers and run-time environments for distributed memory multiprocessors
Partitioning the statement per iteration space using non-singular matrices
ICS '93 Proceedings of the 7th international conference on Supercomputing
Compiler optimizations for improving data locality
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
ICS '95 Proceedings of the 9th international conference on Supercomputing
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks
ACM Transactions on Computer Systems (TOCS)
Communication Analysis for Multicomputer Compilers
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Integrated code and data placement in two-dimensional mesh based chip multiprocessors
Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design
On minimizing register usage of linearly scheduled algorithms with uniform dependencies
Computer Languages, Systems and Structures
Embedded Systems Design
In scalable parallel machines, processors can access local memory much faster than remote memory. In addition, when many remote accesses must be made, it is usually more efficient to use block transfers of data than to send many small messages. To run well on such machines, software must exploit these features. We believe it is too onerous for a programmer to do this by hand, so we have been exploring the use of restructuring compiler technology for this purpose. In this paper, we start with a language like FORTRAN-D with user-specified data distribution and develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. We demonstrate the power of our techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library. An important feature of our approach is that we model loop transformations using invertible matrices and integer lattice theory, thereby generalizing Banerjee's framework of unimodular matrices [5].
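The difference between unimodular and merely invertible transformation matrices can be illustrated with a small sketch. This is not the access normalization algorithm from the paper; it only shows, under assumed example matrices, why integer lattices enter the picture once the determinant is not ±1. The names `transform`, `det2`, `image`, `SKEW`, and `SCALE` are hypothetical and chosen for illustration.

```python
from itertools import product


def transform(T, point):
    """Apply an integer matrix T (a list of rows) to an iteration vector."""
    return tuple(sum(t * x for t, x in zip(row, point)) for row in T)


def det2(T):
    """Determinant of a 2x2 matrix."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]


def image(T, n):
    """Image under T of the n-by-n iteration space {0..n-1} x {0..n-1}."""
    return {transform(T, p) for p in product(range(n), repeat=2)}


# Assumed example matrices, not taken from the paper.
SKEW = [[1, 1], [0, 1]]   # loop skewing: determinant 1, unimodular
SCALE = [[2, 0], [0, 1]]  # loop scaling: determinant 2, invertible but not unimodular

skew_image = image(SKEW, 4)
scale_image = image(SCALE, 4)

# Both matrices are invertible, so distinct iterations stay distinct.
print(len(skew_image), len(scale_image))

# Under SCALE, the first coordinate of every transformed iteration is even:
# the image is a proper sublattice of the integer grid, so a loop that scans
# the transformed space must step through that lattice and skip the holes.
print(all(u % 2 == 0 for (u, v) in scale_image))
```

A unimodular matrix maps the integer grid onto itself, so the transformed loop nest can use unit strides. A general invertible integer matrix maps it onto a sublattice, which is why code generation for such transformations needs lattice reasoning to determine loop strides and bounds.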