Auto-parallelizing compilers for embedded applications have been unsuccessful due to the widespread use of pointer arithmetic and the complex memory model of multiple-address-space digital signal processors (DSPs). This paper develops, for the first time, a complete auto-parallelization approach that overcomes these issues. It first combines a pointer conversion technique with a new modulo elimination transformation for program recovery, enabling the later parallelization stages. Next, it integrates a novel data transformation technique that exposes the processor location of partitioned data. Combined with a new address resolution mechanism, this generates efficient programs that run on multiple address spaces without message passing. Furthermore, as DSPs have no data cache, an optimization is presented that transforms the program to exploit both remote data locality and local memory bandwidth. Applying this parallelization approach to the DSPstone and UTDSP benchmark suites gives an average speedup of 3.78 on four Analog Devices TigerSHARC TS-101 processors.
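To make the program-recovery step concrete, here is a minimal sketch in C of the two ideas named above. The kernels and all names (x, y, k, n, d) are hypothetical illustrations of the general shape of pointer conversion and modulo elimination, not the paper's actual algorithms.

```c
/* Before: pointer-based idiom common in hand-written DSP code.
 * The walking pointers defeat array dependence analysis. */
void scale_ptr(float *y, const float *x, float k, int n) {
    while (n-- > 0)
        *y++ = k * *x++;
}

/* After pointer conversion: explicit, affine array subscripts that
 * later parallelization stages can analyze. */
void scale_arr(float *y, const float *x, float k, int n) {
    for (int i = 0; i < n; i++)
        y[i] = k * x[i];
}

/* Modulo elimination: a circular-buffer access x[(i + d) % n]
 * (with 0 <= d < n) is split at the wrap-around point so that
 * each loop body uses a plain affine subscript. */
void rotate(float *y, const float *x, int n, int d) {
    for (int i = 0; i < n - d; i++)
        y[i] = x[i + d];          /* i + d < n: no wrap */
    for (int i = n - d; i < n; i++)
        y[i] = x[i + d - n];      /* wrapped portion */
}
```

Once subscripts are affine, standard dependence analysis and data partitioning can proceed.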
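Similarly, the data-transformation and address-resolution steps might be sketched as follows. The sketch assumes a block distribution and a per-processor table of base addresses through which each processor's local memory is visible to the others, as on the TigerSHARC's global memory map; NPROC, BLOCK, local_base, and the function names are all assumed names for illustration, not the paper's interface. The final function hints at the locality optimization: with no data cache, remote reuse is captured in software by staging a block into local memory once.

```c
#define NPROC 4                    /* number of DSPs (hypothetical) */
#define N     1024                 /* global array length (hypothetical) */
#define BLOCK (N / NPROC)          /* block distribution */

/* Each processor's partition as it appears in the global address map. */
extern float *local_base[NPROC];

/* Translate a global index into the owning processor's address space,
 * so a remote element can be loaded directly rather than via message
 * passing. */
static inline float *resolve(int i) {
    int owner  = i / BLOCK;        /* processor owning element i */
    int offset = i % BLOCK;        /* position within that partition */
    return local_base[owner] + offset;
}

/* With no data cache, remote reuse must be managed explicitly: copy a
 * remote block into a local buffer once, then operate out of fast
 * local memory, saving external bus bandwidth on every reuse. */
void stage_block(float *local_buf, int owner) {
    for (int j = 0; j < BLOCK; j++)
        local_buf[j] = local_base[owner][j];
}
```

The design point here is that owner computation (i / BLOCK) and offset computation (i % BLOCK) are the compile-time-visible consequences of the data transformation, which is what lets the generated code address remote data directly across address spaces.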