More iteration space tiling

Authors: M. Wolfe
Affiliations: Oregon Graduate Center
Venue: Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Year: 1989

Abstract

Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages. Tiles are a natural candidate for the unit of work in parallel task scheduling. Synchronization between processors can be done between tiles, reducing synchronization frequency (at some loss of potential parallelism). The shape and size of a tile can be optimized to exploit memory locality across the memory hierarchy. Vectorization and register locality fit naturally into optimization within a tile, while parallelization and cache locality fit into optimization between tiles.
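
As an illustration of the idea in the abstract, the following is a minimal sketch of rectangular tiling applied to a matrix-multiply loop nest. It is not taken from the paper: the function names (`tile_origins`, `tiled_matmul`, `naive_matmul`), the default tile size of 2, and the choice of matrix multiplication as the example are all assumptions made for illustration. The outer three loops step between tiles (the level at which scheduling and synchronization would apply), and the inner three loops run within one tile (the level at which register and vector optimizations would apply).

```python
def tile_origins(n, m, ti, tj):
    """Yield the lower corner of each tile covering an n-by-m iteration space.

    Hypothetical helper: each yielded pair could serve as one unit of work.
    """
    for ii in range(0, n, ti):
        for jj in range(0, m, tj):
            yield ii, jj


def tiled_matmul(a, b, tile=2):
    """Multiply square matrices a and b, visiting the (i, j, k) space tile by tile."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # Loops between tiles: each (ii, jj, kk) names one tile of at most tile**3 iterations.
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Loops within a tile; min() clips partial tiles at the boundary.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


def naive_matmul(a, b):
    """Untiled reference version, used only to check the tiled result."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Both versions perform the same multiply-add operations and differ only in the order in which the iteration space is visited. With integer inputs the results agree exactly; with floating-point inputs the reordered summation could differ in the last bits.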