A Loop Transformation Theory and an Algorithm to Maximize Parallelism

Authors:
M. E. Wolf; M. S. Lam
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1991


Abstract

This paper presents an approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop. Because of this, the dependences extracted from a loop nest must be lexicographically positive. This leads to a simple legality test for compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. The maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and then wavefronting the fully permutable nests. This canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. Efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five.
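
The legality test described in the abstract can be illustrated with a minimal sketch. It assumes that every dependence is a constant distance vector and that a transformation is an integer matrix applied to those vectors; the paper's dependence vectors are more general, and this sketch does not cover them. The function names, the example distance vectors, and the three matrices are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple

Vector = Tuple[int, ...]
Matrix = Tuple[Tuple[int, ...], ...]


def lex_positive(v: Sequence[int]) -> bool:
    """True if the first nonzero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not lexicographically positive


def apply(T: Matrix, d: Vector) -> Vector:
    """Matrix-vector product T * d."""
    return tuple(sum(t * x for t, x in zip(row, d)) for row in T)


def compose(A: Matrix, B: Matrix) -> Matrix:
    """Matrix product A * B (apply B first, then A)."""
    cols = list(zip(*B))
    return tuple(tuple(sum(a * b for a, b in zip(row, col)) for col in cols)
                 for row in A)


def is_legal(T: Matrix, deps: Sequence[Vector]) -> bool:
    """Legal if every transformed dependence stays lexicographically positive."""
    return all(lex_positive(apply(T, d)) for d in deps)


def outer_carries_all(T: Matrix, deps: Sequence[Vector]) -> bool:
    """True if the outermost transformed loop carries every dependence,
    so the remaining inner loops can run in parallel."""
    return all(apply(T, d)[0] > 0 for d in deps)


# Hypothetical distance vectors for a two-deep loop nest (assumption).
DEPS = [(1, -1), (0, 1)]

INTERCHANGE: Matrix = ((0, 1), (1, 0))  # swap the two loops
SKEW: Matrix = ((1, 0), (1, 1))         # skew the inner loop by the outer loop
WAVEFRONT: Matrix = ((1, 1), (1, 0))    # wavefront a fully permutable 2-nest

# Skew first, then wavefront.
COMBINED: Matrix = compose(WAVEFRONT, SKEW)
```

With these assumed inputs, `is_legal(INTERCHANGE, DEPS)` is false because `(1, -1)` becomes `(-1, 1)`. `SKEW` is legal and maps the dependences to `(1, 0)` and `(0, 1)`, which have no negative components, so the skewed nest is fully permutable. `COMBINED` is legal and gives every transformed dependence a positive outermost component, so the inner loop carries no dependence and can run in parallel.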