Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when it is legal, fusion may introduce loop-carried dependences that prevent parallelism. In addition, cache conflicts in fused loops can cause performance losses. In this paper, we present new techniques to: 1) allow fusion of loop nests in the presence of fusion-preventing dependences, 2) maintain parallelism and allow the parallel execution of fused loops with minimal synchronization, and 3) eliminate cache conflicts in fused loops. We describe algorithms for implementing these techniques in compilers. The techniques are evaluated on a 56-processor KSR2 multiprocessor and on a 16-processor Convex SPP-1000 multiprocessor. The results demonstrate performance improvements for both kernels and complete applications. The results also indicate that the profitability of fusion must be evaluated carefully as more processors are used.
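
To make the notion of a fusion-preventing dependence concrete, the following is a minimal sketch in Python. It is an illustration only, not the algorithm described in the paper: the function names, the three arrays, and the one-iteration shift are all assumptions chosen for the example. The second loop reads a[i + 1], so fusing it directly with the loop that produces a would read a value before it is written. Delaying the consumer statement by one iteration is one well-known way to make the fusion legal.

```python
def unfused(b):
    """Reference version: two separate loops (hypothetical example)."""
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):              # producer loop: writes a[i]
        a[i] = b[i] + 1
    for i in range(1, n - 1):       # consumer loop: reads a[i-1] and a[i+1]
        c[i] = a[i - 1] + a[i + 1]
    return c


def naive_fused(b):
    """Illegal fusion: the consumer reads a[i+1] before it is produced."""
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):
        a[i] = b[i] + 1
        if 1 <= i < n - 1:
            c[i] = a[i - 1] + a[i + 1]   # a[i+1] still holds its initial 0
    return c


def shifted_fused(b):
    """Legal fusion: the consumer statement runs one iteration late."""
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):
        a[i] = b[i] + 1
        j = i - 1                        # consumer iteration, shifted by one
        if 1 <= j < n - 1:
            c[j] = a[j - 1] + a[j + 1]   # reads a[i-2] and a[i], both written
    return c


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(unfused(data))
    print(naive_fused(data))
    print(shifted_fused(data))
```

In this sketch the shifted version matches the unfused result while the naive version does not. Note that the shift also means a consumer iteration depends on a producer iteration executed one step earlier in the same fused loop, which is the kind of cross-iteration constraint that has to be handled when the fused loop is split across processors.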