Tile size selection using cache organization and data layout

Authors:
Stephanie Coleman;Kathryn S. McKinley
Affiliations:
Intermetrics, Inc., 733 Concord Ave., Cambridge, MA;Computer Science, LGRC, University of Massachusetts, Amherst, MA
Venue:
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Year:
1995

Citing 10
Cited 172

Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Supernode partitioning

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
MOB forms: a class of multilevel block algorithms for dense linear algebra operations

ICS '94 Proceedings of the 8th international conference on Supercomputing
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Cache Memories

ACM Computing Surveys (CSUR)

Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Thread scheduling for cache locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
A quantitative analysis of loop nest locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Combining loop transformations considering caches and scheduling

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A compiler algorithm for optimizing locality in loop nests

ICS '97 Proceedings of the 11th international conference on Supercomputing
Cache miss equations: an analytical representation of cache misses

ICS '97 Proceedings of the 11th international conference on Supercomputing
Tuning compiler optimizations for simultaneous multithreading

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Eliminating conflict misses for high performance architectures

ICS '98 Proceedings of the 12th international conference on Supercomputing
A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems

IEEE Transactions on Parallel and Distributed Systems
Improving locality using loop and data transformations in an integrated framework

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Precise miss analysis for program transformations with caches of arbitrary associativity

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Augmenting Loop Tiling with Data Alignment for Improved Cache Performance

IEEE Transactions on Computers - Special issue on cache memory and related problems
Effects of Multithreading on Cache Performance

IEEE Transactions on Computers - Special issue on cache memory and related problems
New tiling techniques to improve cache temporal locality

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
An experimental evaluation of tiling and shackling for memory hierarchy management

ICS '99 Proceedings of the 13th international conference on Supercomputing
A tile selection algorithm for data locality and cache interference

ICS '99 Proceedings of the 13th international conference on Supercomputing
Selecting tile shape for minimal execution time

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Analytical Modeling of Set-Associative Cache Behavior

IEEE Transactions on Computers
Cache miss equations: a compiler framework for analyzing and tuning memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks

ACM Transactions on Computer Systems (TOCS)
Locality optimizations for multi-level caches

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Automated cache optimizations using CME driven diagnosis

Proceedings of the 14th international conference on Supercomputing
Tuning Compiler Optimizations for Simultaneous Multithreading

International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Accurately Selecting Block Size at Runtime in Pipelined Parallel Programs

International Journal of Parallel Programming
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP

IEEE Transactions on Parallel and Distributed Systems
A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations

IEEE Transactions on Parallel and Distributed Systems
A compiler technique for improving whole-program locality

POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Tiling imperfectly-nested loop nests

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tiling optimizations for 3D scientific computations

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Exploiting non-uniform reuse for cache optimization

Proceedings of the 2001 ACM symposium on Applied computing
Loop optimization for a class of memory-constrained computations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Evaluating the impact of memory system performance on software prefetching and locality optimizations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Optimal semi-oblique tiling

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Static and Dynamic Locality Optimizations Using Integer Linear Programming

IEEE Transactions on Parallel and Distributed Systems
Data Relation Vectors: A New Abstraction for Data Optimizations

IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Efficient Representation Scheme for Multidimensional Array Operations

IEEE Transactions on Computers
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Optimal tiling for the RNA base pairing problem

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests

International Journal of Parallel Programming
Register tiling in nonrectangular iteration spaces

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tight bounds on cache use for stencil operations on rectangular grids

Journal of the ACM (JACM)
Increasing temporal locality with skewing and recursive blocking

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
An I/O-Conscious Tiling Strategy for Disk-Resident Data Sets

The Journal of Supercomputing
Compilation of Vector Statements of C[] Language for Architectures with Multilevel Memory Hierarchy

Programming and Computing Software
Combining Loop Transformations Considering Caches and Scheduling

International Journal of Parallel Programming
Quantifying the Multi-Level Nature of Tiling Interactions

International Journal of Parallel Programming
Data-Centric Transformations for Locality Enhancement

International Journal of Parallel Programming
A Cache Visualization Tool

Computer
A Layout-Conscious Iteration Space Transformation Technique

IEEE Transactions on Computers
Data Space Oriented Tiling

ESOP '02 Proceedings of the 11th European Symposium on Programming Languages and Systems
An Efficient Technique for Corner-Turn in SAR Image Reconstruction by Improving Cache Access

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Experimental Evaluation of Energy Behavior of Iteration Space Tiling

LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Iterative Compilation

Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation - SAMOS
Cache Remapping to Improve the Performance of Tiled Algorithms

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Cache Models for Iterative Compilation

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Reducing Cache Conflicts by a Parametrized Memory Mapping

ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Software Controlled Reconfigurable On-Chip Memory for High Performance Computing

IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Improving Cache Effectiveness through Array Data Layout Manipulation in SAC

IFL '00 Selected Papers from the 12th International Workshop on Implementation of Functional Languages
Cache Line Impact on 3D PDE Solvers

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance: matrix-multiply revisited

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Optimal task scheduling at run time to exploit intra-tile parallelism

Parallel Computing
Iterative compilation

Embedded processor design challenges
On the Parallel Execution Time of Tiled Loops

IEEE Transactions on Parallel and Distributed Systems
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
Dynamic compilation for energy adaptation

Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design
Data cache locking for higher program predictability

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A comparison of empirical and model-driven optimization

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Estimating cache misses and locality using stack distances

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Reference Distance as a Metric for Data Locality

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Strategies for Improving Data Locality in Embedded Applications

ASP-DAC '02 Proceedings of the 2002 Asia and South Pacific Design Automation Conference
Address Code and Arithmetic Optimizations for Embedded Systems

ASP-DAC '02 Proceedings of the 2002 Asia and South Pacific Design Automation Conference
SCIMA: A Novel Architecture for High Performance Computing

IWIA '99 Proceedings of the 1999 International Workshop on Innovative Architecture
Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Tiling, Block Data Layout, and Memory Hierarchy Performance

IEEE Transactions on Parallel and Distributed Systems
Array Regrouping and Its Use in Compiling Data-Intensive Embedded Applications

IEEE Transactions on Computers
Transforming Complex Loop Nests for Locality

The Journal of Supercomputing
A Quantitative Analysis of Tile Size Selection Algorithms

The Journal of Supercomputing
Single Assignment C: efficient support for high-level array operations in a functional setting

Journal of Functional Programming
Instruction Scheduling for Low Power

Journal of VLSI Signal Processing Systems
Access Pattern Restructuring for Memory Energy

IEEE Transactions on Parallel and Distributed Systems
A fast and accurate framework to analyze and optimize cache memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior

IEEE Transactions on Computers
Quasidynamic Layout Optimizations for Improving Data Locality

IEEE Transactions on Parallel and Distributed Systems
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy

Proceedings of the international symposium on Code generation and optimization
Optimizing Address Code Generation for Array-Intensive DSP Applications

Proceedings of the international symposium on Code generation and optimization
A Model-Based Framework: An Approach for Profit-Driven Optimization

Proceedings of the international symposium on Code generation and optimization
A Geometric Programming Framework for Optimal Multi-Level Tiling

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A case for a working-set-based memory hierarchy

Proceedings of the 2nd conference on Computing frontiers
Automatic blocking of QR and LU factorizations for locality

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Data space-oriented tiling for enhancing locality

ACM Transactions on Embedded Computing Systems (TECS)
Memory Performance Optimizations For Real-Time Software HDTV Decoding

Journal of VLSI Signal Processing Systems
Cache-oblivious mesh layouts

ACM SIGGRAPH 2005 Papers
Improving whole-program locality using intra-procedural and inter-procedural transformations

Journal of Parallel and Distributed Computing
Performance Enhancement on Microprocessors with Hierarchical Memory Systems for Solving Large Sparse Linear Systems

International Journal of High Performance Computing Applications
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion

International Journal of High Performance Computing Applications
An accurate cost model for guiding data locality transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Think globally, search locally

Proceedings of the 19th annual international conference on Supercomputing
Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction Expressions

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Reduction Transformations for Optimization Parameter Selection

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Compiler-Guided data compression for reducing memory consumption of embedded applications

ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Multi-compilation: capturing interactions among concurrently-executing applications

Proceedings of the 3rd conference on Computing frontiers
Integrating loop and data optimizations for locality within a constraint network based framework

ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Optimizing locality and scalability of embedded Runge--Kutta solvers using block-based pipelining

Journal of Parallel and Distributed Computing
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Empirical optimization for a sparse linear solver: a case study

International Journal of Parallel Programming - Special issue: The next generation software program
An approach toward profit-driven optimization

ACM Transactions on Architecture and Code Optimization (TACO)
Profitable loop fusion and tiling using model-driven empirical search

Proceedings of the 20th annual international conference on Supercomputing
Mesh Layouts for Block-Based Caches

IEEE Transactions on Visualization and Computer Graphics
Locality and parallelism optimization for dynamic programming algorithm in bioinformatics

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
A memory model for scientific algorithms on graphics processors

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Iterative compilation for energy reduction

Journal of Embedded Computing - Cache exploitation in embedded systems
The rise and fall of High Performance Fortran: an historical object lesson

Proceedings of the third ACM SIGPLAN conference on History of programming languages
An experimental comparison of cache-oblivious and cache-conscious programs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
A parallel dynamic programming algorithm on a multi-core architecture

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Effective automatic parallelization of stencil computations

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Parameterized tiled loops for free

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Loop Optimization using Hierarchical Compilation and Kernel Decomposition

Proceedings of the International Symposium on Code Generation and Optimization
Cache-efficient numerical algorithms using graphics hardware

Parallel Computing
Fast indexing for blocked array layouts to reduce cache misses

International Journal of High Performance Computing and Networking
Dynamic tiling for effective use of shared caches on multithreaded processors

International Journal of High Performance Computing and Networking
Massive model visualization techniques: course notes

ACM SIGGRAPH 2008 classes
Positivity, posynomials and tile size selection

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Adaptive Loop Tiling for a Multi-cluster CMP

ICA3PP '08 Proceedings of the 8th international conference on Algorithms and Architectures for Parallel Processing
Using Padding to Optimize Locality in Scientific Applications

ICCS '08 Proceedings of the 8th international conference on Computational Science, Part I
Exploiting SIMD Parallelism with the CGiS Compiler Framework

Languages and Compilers for Parallel Computing
A Framework for Exploring Optimization Properties

CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
Parametric multi-level tiling of imperfectly nested loops

Proceedings of the 23rd international conference on Supercomputing
Simultaneous minimization of capacity and conflict misses

Journal of Computer Science and Technology
Using data compression for increasing memory system utilization

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
ePRO-MP: A tool for profiling and optimizing energy and performance of mobile multiprocessor applications

Scientific Programming - Software Development for Multi-core Computing Systems
Cache line reservation: exploring a scheme for cache-friendly object allocation

CASCON '09 Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research
Iterative compilation with kernel exploration

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Improving data locality by chunking

CC'03 Proceedings of the 12th international conference on Compiler construction
Automatic creation of tile size selection models

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Parameterized tiling revisited

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Exposing tunable parameters in multi-threaded numerical code

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Optimized dense matrix multiplication on a many-core architecture

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Code scheduling for optimizing parallelism and data locality

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
On the interaction of tiling and automatic parallelization

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Data locality and parallelism optimization using a constraint-based approach

Journal of Parallel and Distributed Computing
Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

Journal of Computational Physics
Practical loop transformations for tensor contraction expressions on multi-level memory hierarchies

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Studying inter-core data reuse in multicores

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Studying inter-core data reuse in multicores

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Optimizing matrix multiplication with a classifier learning system

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Analytic models and empirical search: a hybrid approach to code optimization

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Optimizing explicit data transfers for data parallel applications on the cell architecture

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Tuning blocked array layouts to exploit memory hierarchy in SMT architectures

PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Applying loop optimizations to object-oriented abstractions through general classification of array semantics

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A matrix-type for performance–portability

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizaitions

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Parameterized loop tiling

ACM Transactions on Programming Languages and Systems (TOPLAS)
On-chip cache hierarchy-aware tile scheduling for multicore machines

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hierarchical overlapped tiling

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Analytical bounds for optimal tile size selection

CC'12 Proceedings of the 21st international conference on Compiler Construction
Accelerating SURF detector on mobile devices

Proceedings of the 20th ACM international conference on Multimedia
Grex: An efficient MapReduce framework for graphics processing units

Journal of Parallel and Distributed Computing
Tiled-MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling

ACM Transactions on Architecture and Code Optimization (TACO)
Strategies for improving performance and energy efficiency on a many-core

Proceedings of the ACM International Conference on Computing Frontiers
Automatic OpenCL work-group size selection for multicore CPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Adaptive parallel tiled code generation and accelerated auto-tuning

International Journal of High Performance Computing Applications
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Tile size selection revisited

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.02

Visualization

Abstract

When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work.