A data locality optimizing algorithm

Authors:
Michael E. Wolf; Monica S. Lam
Affiliations:
Computer Systems Laboratory, Stanford University, CA; Computer Systems Laboratory, Stanford University, CA
Venue:
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Year:
1991

References: 11
Cited by: 469

Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and Environments for Parallel Programming
Supernode partitioning

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Organizing matrices and matrix operations for paged memory systems

Communications of the ACM
Dependence Analysis for Supercomputing
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
Improving the performance of virtual memory computers.
Software methods for improvement of cache performance on supercomputer applications

Tiling multidimensional iteration spaces for nonshared memory machines

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Detecting redundant accesses to array data

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Cache replacement with dynamic exclusion

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The impact of communication locality on large-scale multiprocessor performance

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Automatic partitioning of a program dependence graph into parallel tasks

IBM Journal of Research and Development
Delinearization: an efficient way to break multiloop dependence equations

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
A dynamic scheduling method for irregular parallel programs

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Optimizing for parallelism and data locality

ICS '92 Proceedings of the 6th international conference on Supercomputing
A transformational approach to compiling Sisal for distributed memory architectures

ICS '92 Proceedings of the 6th international conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Access normalization: loop restructuring for NUMA compilers

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Non-unimodular transformations of nested loops

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Communication optimization and code generation for distributed memory machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Lifetime-sensitive modulo scheduling

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Data locality and load balancing in COOL

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Access normalization: loop restructuring for NUMA computers

ACM Transactions on Computer Systems (TOCS)
Managing pages in shared virtual memory systems: getting the compiler into the game

ICS '93 Proceedings of the 7th international conference on Supercomputing
A static parameter based performance prediction tool for parallel programs

ICS '93 Proceedings of the 7th international conference on Supercomputing
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
RISC microprocessors and scientific computing

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Compiling for shared-memory and message-passing computers

ACM Letters on Programming Languages and Systems (LOPLAS)
Exploiting the parallelism available in loops

Computer
Effective partial redundancy elimination

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Compiler techniques for maximizing fine-grain and coarse-grain parallelism in loops with uniform dependences

ICS '94 Proceedings of the 8th international conference on Supercomputing
Data and program restructuring of irregular applications for cache-coherent multiprocessors

ICS '94 Proceedings of the 8th international conference on Supercomputing
Using virtual lines to enhance locality exploitation

ICS '94 Proceedings of the 8th international conference on Supercomputing
Cache interference phenomena

SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
A performance study of software and hardware data prefetching schemes

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
SUIF: an infrastructure for research on parallelizing and optimizing compilers

ACM SIGPLAN Notices
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
XIL and YIL: the intermediate languages of TOBEY

IR '95 Papers from the 1995 ACM SIGPLAN workshop on Intermediate representations
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing false sharing on shared memory multiprocessors through compile time data transformations

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Influence of cross-interferences on blocked loops: a case study with matrix-vector multiply

ACM Transactions on Programming Languages and Systems (TOPLAS)
Abstract interpretation and low-level code optimization

PEPM '95 Proceedings of the 1995 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation
Skewed associativity enhances performance predictability

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Compiler cache optimizations for banded matrix problems

ICS '95 Proceedings of the 9th international conference on Supercomputing
Data forwarding in scalable shared-memory multiprocessors

ICS '95 Proceedings of the 9th international conference on Supercomputing
Optimal tile size adjustment in compiling general DOACROSS loop nests

ICS '95 Proceedings of the 9th international conference on Supercomputing
Compiler techniques for data prefetching on the PowerPC

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
A limit study of local memory requirements using value reuse profiles

Proceedings of the 28th annual international symposium on Microarchitecture
SPAID: software prefetching in pointer- and call-intensive environments

Proceedings of the 28th annual international symposium on Microarchitecture
An effective programmable prefetch engine for on-chip caches

Proceedings of the 28th annual international symposium on Microarchitecture
A compiler optimization to reduce execution time of loop nest

ACM SIGARCH Computer Architecture News
Informing memory operations: providing memory performance feedback in modern processors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
The influence of caches on the performance of heaps

Journal of Experimental Algorithmics (JEA)
A quantitative analysis of loop nest locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Compiler-based prefetching for recursive data structures

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Compiler-directed page coloring for multiprocessors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data Forwarding in Scalable Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Fusion of Loops for Parallelism and Locality

IEEE Transactions on Parallel and Distributed Systems
Optimal weighted loop fusion for parallel programs

Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Performance debugging shared memory parallel programs using run-time dependence analysis

SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Automatic inline allocation of objects

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Architectural exploration and optimization of local memory in embedded systems

ISSS '97 Proceedings of the 10th international symposium on System synthesis
Efficient Algorithms for Data Distribution on Distributed Memory Parallel Computers

IEEE Transactions on Parallel and Distributed Systems
Designing a Scalable Processor Array for Recurrent Computations

IEEE Transactions on Parallel and Distributed Systems
A compiler algorithm for optimizing locality in loop nests

ICS '97 Proceedings of the 11th international conference on Supercomputing
Cache miss equations: an analytical representation of cache misses

ICS '97 Proceedings of the 11th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Determining the idle time of a tiling

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Data prefetching on the HP PA-8000

Proceedings of the 24th annual international symposium on Computer architecture
Static timing analysis of embedded software

DAC '97 Proceedings of the 34th annual Design Automation Conference
A unified compiler algorithm for optimizing locality, parallelism and communication in out-of-core computations

Proceedings of the fifth workshop on I/O in parallel and distributed systems
Tuning compiler optimizations for simultaneous multithreading

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Cache sensitive modulo scheduling

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Unroll-and-jam using uniformly generated sets

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory data organization for improved cache performance in embedded processor applications

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Automatic selection of high-order transformations in the IBM XL FORTRAN compilers

IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Tolerating latency in multiprocessors through compiler-inserted prefetching

ACM Transactions on Computer Systems (TOCS)
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
The implementation and evaluation of fusion and contraction in array languages

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
A hyperplane based approach for optimizing spatial locality in loop nests

ICS '98 Proceedings of the 12th international conference on Supercomputing
A general algorithm for tiling the register level

ICS '98 Proceedings of the 12th international conference on Supercomputing
Eliminating conflict misses for high performance architectures

ICS '98 Proceedings of the 12th international conference on Supercomputing
Informing memory operations: memory performance feedback mechanisms and their applications

ACM Transactions on Computer Systems (TOCS)
An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing

IEEE Transactions on Computers
Using generational garbage collection to implement cache-conscious data placement

Proceedings of the 1st international symposium on Memory management
Improving locality using loop and data transformations in an integrated framework

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Compiler-controlled memory

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Schedule-independent storage mapping for loops

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Precise miss analysis for program transformations with caches of arbitrary associativity

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Augmenting Loop Tiling with Data Alignment for Improved Cache Performance

IEEE Transactions on Computers - Special issue on cache memory and related problems
Improving Cache Locality by a Combination of Loop and Data Transformations

IEEE Transactions on Computers - Special issue on cache memory and related problems
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
Memory forwarding: enabling aggressive layout optimizations by guaranteeing the safety of data relocation

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Cache-conscious structure layout

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
New tiling techniques to improve cache temporal locality

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Improving cache performance in dynamic applications through data and computation reorganization at run time

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
A General Interprocedural Framework for Placement of Split-Phase Large Latency Operations

IEEE Transactions on Parallel and Distributed Systems
A locality sensitive multi-module cache with explicit management

ICS '99 Proceedings of the 13th international conference on Supercomputing
Improving memory hierarchy performance for irregular applications

ICS '99 Proceedings of the 13th international conference on Supercomputing
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
An experimental evaluation of tiling and shackling for memory hierarchy management

ICS '99 Proceedings of the 13th international conference on Supercomputing
An integer linear programming approach for optimizing cache locality

ICS '99 Proceedings of the 13th international conference on Supercomputing
Selecting tile shape for minimal execution time

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Memory exploration for low power, embedded systems

Proceedings of the 36th annual ACM/IEEE Design Automation Conference
A Tree-Based Alternative to Java Byte-Codes

International Journal of Parallel Programming
An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

International Journal of Parallel Programming
Code transformations to improve memory parallelism

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Analytical Modeling of Set-Associative Cache Behavior

IEEE Transactions on Computers
Cache miss equations: a compiler framework for analyzing and tuning memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Nonsingular Data Transformations: Definition, Validity, and Applications

International Journal of Parallel Programming
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks

ACM Transactions on Computer Systems (TOCS)
Energy-Delay Efficient Data Storage and Transfer Architectures and Methodologies: Current Solutions and Remaining Problems

Journal of VLSI Signal Processing Systems - Special issue on system level design
Locality optimizations for multi-level caches

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Memory characteristics of iterative methods

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
ILP versus TLP on SMT

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests

Proceedings of the 14th international conference on Supercomputing
Optimized unrolling of nested loops

Proceedings of the 14th international conference on Supercomputing
Automated cache optimizations using CME driven diagnosis

Proceedings of the 14th international conference on Supercomputing
ZPL: A Machine Independent Programming Language for Parallel Computers

IEEE Transactions on Software Engineering - Special issue on architecture-independent languages and software tools for parallel processing
Tuning Compiler Optimizations for Simultaneous Multithreading

International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
A Loop Transformation Algorithm for Communication Overlapping

International Journal of Parallel Programming - Special issue on international symposium on high performance computing 1997, part I
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
An automatic object inlining optimization and its evaluation

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP

IEEE Transactions on Parallel and Distributed Systems
A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations

IEEE Transactions on Parallel and Distributed Systems
Automated data-member layout of heap objects to improve memory-hierarchy performance

ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving Memory Traffic by Assembly-Level Exploitation of Reuses for Vector Registers

The Journal of Supercomputing
A compiler technique for improving whole-program locality

POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Access pattern based local memory customization for low power embedded systems

Proceedings of the conference on Design, automation and test in Europe
Tiling imperfectly-nested loop nests

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tiling optimizations for 3D scientific computations

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Improving fine-grained irregular shared-memory benchmarks by data reordering

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Towards effective embedded processors in codesigns: customizable partitioned caches

Proceedings of the ninth international symposium on Hardware/software codesign
Exploiting non-uniform reuse for cache optimization

Proceedings of the 2001 ACM symposium on Applied computing
A dynamic locality optimization algorithm for linear algebra codes

Proceedings of the 2001 ACM symposium on Applied computing
Compiler-based I/O prefetching for out-of-core applications

ACM Transactions on Computer Systems (TOCS)
Fractal symbolic analysis

ICS '01 Proceedings of the 15th international conference on Supercomputing
Loop optimization for a class of memory-constrained computations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Evaluating the impact of memory system performance on software prefetching and locality optimizations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Reducing memory requirements of nested loops for embedded systems

Proceedings of the 38th annual Design Automation Conference
Optimal semi-oblique tiling

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Optimal tiling for minimizing communication in distributed shared-memory multiprocessors

Compiler optimizations for scalable parallel systems
Communication-free partitioning of nested loops

Compiler optimizations for scalable parallel systems
Data cache energy minimizations through programmable tag size matching to the applications

Proceedings of the 14th international symposium on Systems synthesis
An efficient profile-analysis framework for data-layout optimizations

POPL '02 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Static and Dynamic Locality Optimizations Using Integer Linear Programming

IEEE Transactions on Parallel and Distributed Systems
Data Relation Vectors: A New Abstraction for Data Optimizations

IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Efficient Representation Scheme for Multidimensional Array Operations

IEEE Transactions on Computers
On optimal temporal locality of stencil codes

Proceedings of the 2002 ACM symposium on Applied computing
Page replacement using marginal loss functions

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Using locality surfaces to characterize the SPECint 2000 benchmark suite

Workload characterization of emerging computer applications
Compiler-directed cache polymorphism

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Optimal tiling for the RNA base pairing problem

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests

International Journal of Parallel Programming
Optimized Unrolling of Nested Loops

International Journal of Parallel Programming
Register tiling in nonrectangular iteration spaces

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tight bounds on cache use for stencil operations on rectangular grids

Journal of the ACM (JACM)
Memory Design and Exploration for Low Power, Embedded Systems

Journal of VLSI Signal Processing Systems - Special issue on signal processing systems design and implementation
Compile Time Barrier Synchronization Minimization

IEEE Transactions on Parallel and Distributed Systems
Low-power data memory communication for application-specific embedded processors

Proceedings of the 15th international symposium on System Synthesis
Optimizing inter-nest data locality

CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
Integrating loop and data transformations for global optimization

Journal of Parallel and Distributed Computing
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping

The Journal of Supercomputing
An I/O-Conscious Tiling Strategy for Disk-Resident Data Sets

The Journal of Supercomputing
Precise Data Locality Optimization of Nested Loops

The Journal of Supercomputing
Compilation of Vector Statements of C[] Language for Architectures with Multilevel Memory Hierarchy

Programming and Computer Software
Synthesis of Embedded Software from Synchronous Dataflow Specifications

Journal of VLSI Signal Processing Systems
MIST: an algorithm for memory miss traffic management

Proceedings of the 2000 IEEE/ACM international conference on Computer-aided design
Adaptive Optimizing Compilers for the 21st Century

The Journal of Supercomputing
Quantifying the Multi-Level Nature of Tiling Interactions

International Journal of Parallel Programming
Reuse-Driven Tiling for Improving Data Locality

International Journal of Parallel Programming
Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

International Journal of Parallel Programming
Data-Centric Transformations for Locality Enhancement

International Journal of Parallel Programming
Achieving Scalable Locality with Time Skewing

International Journal of Parallel Programming
A Cache Visualization Tool

Computer
Multiprocessors from a Software Perspective

IEEE Micro
Analyzing Data Locality in Numeric Applications

IEEE Micro
False Sharing and Spatial Locality in Multiprocessor Caches

IEEE Transactions on Computers
Skewed Associativity Improves Program Performance and Enhances Predictability

IEEE Transactions on Computers
A Layout-Conscious Iteration Space Transformation Technique

IEEE Transactions on Computers
Loop Restructuring for Data I/O Minimization on Limited On-Chip Memory Embedded Processors

IEEE Transactions on Computers
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs

IEEE Transactions on Parallel and Distributed Systems
Communication-Free Data Allocation Techniques for Parallelizing Compilers on Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
Application-Specific Instruction Memory Customizations for Power-Efficient Embedded Processors

IEEE Design & Test
Probabilistic Miss Equations: Evaluating Memory Hierarchy Performance

IEEE Transactions on Computers
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Cache-Efficient Multigrid Algorithms

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Tight Bounds on Capacity Misses for 3D Stencil Codes

ICCS '02 Proceedings of the International Conference on Computational Science-Part I
False Sharing Elimination by Selection of Runtime Scheduling Parameters

ICPP '97 Proceedings of the International Conference on Parallel Processing
Improving the Performance of Out-of-Core Computations

ICPP '97 Proceedings of the International Conference on Parallel Processing
Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
A Loop Transformation Algorithm Based on Explicit Data Layout Representation for Optimizing Locality

LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
Optimized Execution of Fortran 90 Array Language on Symmetric Shared-Memory Multiprocessors

LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
Inter-array Data Regrouping

LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
A Compiler Framework for Tiling Imperfectly-Nested Loops

LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Experimental Evaluation of Energy Behavior of Iteration Space Tiling

LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Performance Optimization of 3D Multigrid on Hierarchical Memory Architectures

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Cache Conscious Indexing for Decision-Support in Main Memory

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
A Blocked All-Pairs Shortest-Path Algorithm

SWAT '00 Proceedings of the 7th Scandinavian Workshop on Algorithm Theory
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
A Fast and Accurate Approach to Analyze Cache Memory Behavior (Research Note)

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Cache Remapping to Improve the Performance of Tiled Algorithms

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Volume Driven Data Distribution for NUMA-Machines

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Is Morton Layout Competitive for Large Two-Dimensional Arrays?

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
On the Optimality of Feautrier's Scheduling Algorithm

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
A Holistic Approach to System Level Energy Optimization

PATMOS '00 Proceedings of the 10th International Workshop on Integrated Circuit Design, Power and Timing Modeling, Optimization and Simulation
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Reducing Cache Conflicts by a Parametrized Memory Mapping

ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
A Framework for Loop Distribution on Limited On-Chip Memory Processors

CC '00 Proceedings of the 9th International Conference on Compiler Construction
Improving Cache Effectiveness through Array Data Layout Manipulation in SAC

IFL '00 Selected Papers from the 12th International Workshop on Implementation of Functional Languages
Loop Transformations for Hierarchical Parallelism and Locality

LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
A Comparison of Locality Transformations for Irregular Codes

LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Array Unification: A Locality Optimization Technique

CC '01 Proceedings of the 10th International Conference on Compiler Construction
On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance: matrix-multiply revisited

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Better tiling and array contraction for compiling scientific programs

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Performance optimizations and bounds for sparse matrix-vector multiply

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Optimal task scheduling at run time to exploit intra-tile parallelism

Parallel Computing
On the Parallel Execution Time of Tiled Loops

IEEE Transactions on Parallel and Distributed Systems
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
Query processing techniques for arrays

The VLDB Journal — The International Journal on Very Large Data Bases
Locality-conscious process scheduling in embedded systems

Proceedings of the tenth international symposium on Hardware/software codesign
Interprocedural optimizations for improving data cache performance of array-intensive embedded applications

Proceedings of the 40th annual Design Automation Conference
METRIC: tracking down inefficiencies in the memory hierarchy via binary rewriting

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Continuous program optimization: A case study

ACM Transactions on Programming Languages and Systems (TOPLAS)
Predicting the impact of optimizations for embedded systems

Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Data cache locking for higher program predictability

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Optimization of Data Distribution and Processor Allocation Problem Using Simulated Annealing

The Journal of Supercomputing
A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness

CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
A compiler approach for reducing data cache energy

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Estimating cache misses and locality using stack distances

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Compiler optimizations for low power systems

Power aware computing
Optimized software synthesis for synchronous dataflow

ASAP '97 Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors
Software assistance for data caches

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Reference Distance as a Metric for Data Locality

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Using cache optimizing compiler for managing software cache on distributed shared memory system

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
A Performance Debugger for Eliminating Excess Synchronization in Shared-Memory Parallel Programs

MASCOTS '96 Proceedings of the 4th International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Compiler-Directed Array Interleaving for Reducing Energy in Multi-Bank Memories

ASP-DAC '02 Proceedings of the 2002 Asia and South Pacific Design Automation Conference
Memory Organization for Improved Data Cache Performance in Embedded Processors

ISSS '96 Proceedings of the 9th international symposium on System synthesis
Guided region prefetching: a cooperative hardware/software approach

Proceedings of the 30th annual international symposium on Computer architecture
Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Fractal symbolic analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
Exploiting bank locality in multi-bank memories

Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Array Regrouping and Its Use in Compiling Data-Intensive Embedded Applications

IEEE Transactions on Computers
Data Caches in Multitasking Hard Real-Time Systems

RTSS '03 Proceedings of the 24th IEEE International Real-Time Systems Symposium
Static analysis of parameterized loop nests for energy efficient use of data caches

Compilers and operating systems for low power
Transforming Complex Loop Nests for Locality

The Journal of Supercomputing
A Quantitative Analysis of Tile Size Selection Algorithms

The Journal of Supercomputing
Single Assignment C: efficient support for high-level array operations in a functional setting

Journal of Functional Programming
Automatic parallel code generation for tiled nested loops

Proceedings of the 2004 ACM symposium on Applied computing
Impact of Data Transformations on Memory Bank Locality

Proceedings of the conference on Design, automation and test in Europe - Volume 1
Instruction Scheduling for Low Power

Journal of VLSI Signal Processing Systems
A fast and accurate framework to analyze and optimize cache memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior

IEEE Transactions on Computers
Single-Dimension Software Pipelining for Multi-Dimensional Loops

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Automatic loop interchange

ACM SIGPLAN Notices - Best of PLDI 1979-1999
Improving register allocation for subscripted variables

ACM SIGPLAN Notices - Best of PLDI 1979-1999
A data locality optimizing algorithm

ACM SIGPLAN Notices - Best of PLDI 1979-1999
A blocked all-pairs shortest-paths algorithm

Journal of Experimental Algorithmics (JEA)
Array regrouping and structure splitting using whole-program reference affinity

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Array Composition and Decomposition for Optimizing Embedded Applications

Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design
Power Efficiency through Application-Specific Instruction Memory Transformations

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
An Integrated Approach for Improving Cache Behavior

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Generalized Data Transformations for Enhancing Cache Behavior

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Quasidynamic Layout Optimizations for Improving Data Locality

IEEE Transactions on Parallel and Distributed Systems
Line Size Adaptivity Analysis of Parameterized Loop Nests for Direct Mapped Data Cache

IEEE Transactions on Computers
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy

Proceedings of the international symposium on Code generation and optimization
A Model-Based Framework: An Approach for Profit-Driven Optimization

Proceedings of the international symposium on Code generation and optimization
Compiler-Based Approach for Exploiting Scratch-Pad in Presence of Irregular Array Access

Proceedings of the conference on Design, Automation and Test in Europe - Volume 2
The Potential of Computation Regrouping for Improving Locality

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
New Complexity Results on Array Contraction and Related Problems

Journal of VLSI Signal Processing Systems
A case for a working-set-based memory hierarchy

Proceedings of the 2nd conference on Computing frontiers
Locality-conscious workload assignment for array-based computations in MPSOC architectures

Proceedings of the 42nd annual Design Automation Conference
Automatic blocking of QR and LU factorizations for locality

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Reuse-distance-based miss-rate prediction on a per instruction basis

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Compiling for memory emergency

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Data space-oriented tiling for enhancing locality

ACM Transactions on Embedded Computing Systems (TECS)
Dynamic memory interval test vs. interprocedural pointer analysis in multimedia applications

ACM Transactions on Architecture and Code Optimization (TACO)
Generating cache hints for improved program efficiency

Journal of Systems Architecture: the EUROMICRO Journal
Statistical Models for Empirical Search-Based Performance Tuning

International Journal of High Performance Computing Applications
Sparse Tiling for Stationary Iterative Methods

International Journal of High Performance Computing Applications
Cache-Efficient Multigrid Algorithms

International Journal of High Performance Computing Applications
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion

International Journal of High Performance Computing Applications
Optimizing inter-processor data locality on embedded chip multiprocessors

Proceedings of the 5th ACM international conference on Embedded software
Reducing data cache leakage energy using a compiler-based approach

ACM Transactions on Embedded Computing Systems (TECS)
An accurate cost model for guiding data locality transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Instruction Based Memory Distance Analysis and its Application

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Optimizing Compiler for the CELL Processor

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Obtaining Affine Transformations to Improve Locality of Loop Nests

Programming and Computing Software
A hierarchical model of data locality

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Compiler-directed high-level energy estimation and optimization

ACM Transactions on Embedded Computing Systems (TECS)
Analyzing data reuse for cache reconfiguration

ACM Transactions on Embedded Computing Systems (TECS)
Hierarchical memory size estimation for loop fusion and loop shifting in data-dominated applications

ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Automatic benchmark generation for cache optimization of matrix operations

ACM-SE 33 Proceedings of the 33rd annual on Southeast regional conference
Programming for parallelism and locality with hierarchically tiled arrays

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Energy-aware data prefetching for multi-speed disks

Proceedings of the 3rd conference on Computing frontiers
Multi-compilation: capturing interactions among concurrently-executing applications

Proceedings of the 3rd conference on Computing frontiers
Intermediately executed code is the key to find refactorings that improve temporal data locality

Proceedings of the 3rd conference on Computing frontiers
Code restructuring for improving cache performance of MPSoCs

ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
2D data locality: definition, abstraction, and application

ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Integrating loop and data optimizations for locality within a constraint network based framework

ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Optimizing compiler for shared-memory multiple SIMD architecture

Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language, compilers, and tool support for embedded systems
Global memory optimisation for embedded systems allowed by code duplication

SCOPES '05 Proceedings of the 2005 workshop on Software and compilers for embedded systems
Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

IBM Systems Journal
Performance optimization of irregular codes based on the combination of reordering and blocking techniques

Parallel Computing
Reuse analysis of indirectly indexed arrays

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Reducing energy consumption of multiprocessor SoC architectures by exploiting memory bank locality

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Analytical modeling of codes with arbitrary data-dependent conditional structures

Journal of Systems Architecture: the EUROMICRO Journal
Empirical optimization for a sparse linear solver: a case study

International Journal of Parallel Programming - Special issue: The next generation software program
An approach toward profit-driven optimization

ACM Transactions on Architecture and Code Optimization (TACO)
Instruction scheduling for a tiled dataflow architecture

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Profitable loop fusion and tiling using model-driven empirical search

Proceedings of the 20th annual international conference on Supercomputing
Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques

ACM Transactions on Architecture and Code Optimization (TACO)
FFT program generation for shared memory: SMP and multicore

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Single-dimension software pipelining for multidimensional loops

ACM Transactions on Architecture and Code Optimization (TACO)
Message-passing code generation for non-rectangular tiling transformations

Parallel Computing
Improving power efficiency with compiler-assisted cache replacement

Journal of Embedded Computing - Cache exploitation in embedded systems
The rise and fall of High Performance Fortran: an historical object lesson

Proceedings of the third ACM SIGPLAN conference on History of programming languages
Compiler optimization to improve data locality for processor multithreading

Scientific Programming
P³T+: A performance estimator for distributed and parallel programs

Scientific Programming
Memetic algorithms for parallel code optimization

International Journal of Parallel Programming
The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Effective automatic parallelization of stencil computations

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Parameterized tiled loops for free

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Proceedings of the International Symposium on Code Generation and Optimization
Bee+Cl@k: an implementation of lattice-based array contraction in the source-to-source translator ROSE

Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Maximize Parallelism Minimize Overhead for Nested Loops via Loop Striping

Journal of VLSI Signal Processing Systems
MPSoC memory optimization using program transformation

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Locality optimization in wireless applications

CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Lightweight barrier-based parallelization support for non-cache-coherent MPSoC platforms

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Efficient execution of multiple queries on deep memory hierarchy

Journal of Computer Science and Technology
Data cache locking for tight timing calculations

ACM Transactions on Embedded Computing Systems (TECS)
Data locality enhancement for CMPs

Proceedings of the 2007 IEEE/ACM international conference on Computer-aided design
Programming with tiles

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimization of memory system in real-time embedded systems

ICCOMP'07 Proceedings of the 11th WSEAS International Conference on Computers
Fast indexing for blocked array layouts to reduce cache misses

International Journal of High Performance Computing and Networking
Dynamic tiling for effective use of shared caches on multithreaded processors

International Journal of High Performance Computing and Networking
Multi-level tiling: M for the price of one

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Analyzing memory access intensity in parallel programs on multicore

Proceedings of the 22nd annual international conference on Supercomputing
A practical automatic polyhedral parallelizer and locality optimizer

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputers

CHINA HPC '07 Proceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing
A compiler approach to managing storage and memory bandwidth in configurable architectures

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Positivity, posynomials and tile size selection

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Design Issues in Parallel Array Languages for Shared Memory

SAMOS '08 Proceedings of the 8th international workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Storage Estimation and Design Space Exploration Methodologies for the Memory Management of Signal Processing Applications

Journal of Signal Processing Systems
Exploiting loop-dependent stream reuse for stream processors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Profiler and compiler assisted adaptive I/O prefetching for shared storage caches

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
CUDA-Lite: Reducing GPU Programming Complexity

Languages and Compilers for Parallel Computing
Smashing: Folding Space to Tile through Time

Languages and Compilers for Parallel Computing
Trade-offs in loop transformations

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

The Journal of Supercomputing
A compiler-directed data prefetching scheme for chip multiprocessors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Revisiting Cache Block Superloading

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Finding and Applying Loop Transformations for Generating Optimized FPGA Implementations

Transactions on High-Performance Embedded Architectures and Compilers I
An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors

Transactions on High-Performance Embedded Architectures and Compilers I
A Prefetching Algorithm for Multi-speed Disks

Transactions on High-Performance Embedded Architectures and Compilers I
3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multi-core processors

Scientific Programming - High Performance Computing with the Cell Broadband Engine
Reducing memory requirements of resource-constrained applications

ACM Transactions on Embedded Computing Systems (TECS)
Precise Management of Scratchpad Memories for Localising Array Accesses in Scientific Codes

CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
Cache-aware partitioning of multi-dimensional iteration spaces

SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
Program locality analysis using reuse distance

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tile Reduction: The First Step towards Tile Aware Parallelization in OpenMP

IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
Markov Model Based Disk Power Management for Data Intensive Workloads

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Virtual reuse distance analysis of SPECjvm2008 data locality

PPPJ '09 Proceedings of the 7th International Conference on Principles and Practice of Programming in Java
Design and Tool Flow of Multimedia MPSoC Platforms

Journal of Signal Processing Systems
Multiprocessor, Multithreading and Memory Optimization for On-Chip Multimedia Applications

Journal of Signal Processing Systems
Exploring parallelization strategies for NUFFT data translation

EMSOFT '09 Proceedings of the seventh ACM international conference on Embedded software
SARA: StreAm register allocation

CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Mining tree-structured data on multicore systems

Proceedings of the VLDB Endowment
Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding

ACM Transactions on Embedded Computing Systems (TECS)
Into the Loops: Practical Issues in Translation Validation for Optimizing Compilers

Electronic Notes in Theoretical Computer Science (ENTCS)
A hardware/software framework for instruction and data scratchpad memory allocation

ACM Transactions on Architecture and Code Optimization (TACO)
Algorithms for memory hierarchies: advanced lectures

Algorithms for memory hierarchies: advanced lectures
Composition-based Cache simulation for structure reorganization

Journal of Systems Architecture: the EUROMICRO Journal
On minimizing register usage of linearly scheduled algorithms with uniform dependencies

Computer Languages, Systems and Structures
Cache vulnerability equations for protecting data in embedded processor caches from soft errors

Proceedings of the ACM SIGPLAN/SIGBED 2010 conference on Languages, compilers, and tools for embedded systems
Design and use of htalib: a library for hierarchically tiled arrays

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Data pipeline optimization for shared memory multiple-SIMD architecture

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Custom memory allocation for free

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Loop transformations for reducing data space requirements of resource-constrained applications

SAS'03 Proceedings of the 10th international conference on Static analysis
Compiler directed parallelization of loops in scale for shared-memory multiprocessors

ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
Partial data reuse for windowing computations: performance modeling for FPGA implementations

ARC'07 Proceedings of the 3rd international conference on Reconfigurable computing: architectures, tools and applications
Improving data locality by chunking

CC'03 Proceedings of the 12th international conference on Compiler construction
Locality enhancement by array contraction

LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
Static reuse distances for locality-based optimizations in MATLAB

Proceedings of the 24th ACM International Conference on Supercomputing
Model-guided empirical tuning of loop fusion

International Journal of High Performance Systems Architecture
Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Transactions on Architecture and Code Optimization (TACO)
A grid-based programming approach for distributed linear algebra applications

Multiagent and Grid Systems
Reuse-aware modulo scheduling for stream processors

Proceedings of the Conference on Design, Automation and Test in Europe
Exposing tunable parameters in multi-threaded numerical code

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Code scheduling for optimizing parallelism and data locality

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
On the interaction of tiling and automatic parallelization

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Generating structured program instances with a high degree of locality

EURO-PDP'00 Proceedings of the 8th Euromicro conference on Parallel and distributed processing
Hierarchically tiled arrays for parallelism and locality

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Improving cache locality for thread-level speculation

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Data locality and parallelism optimization using a constraint-based approach

Journal of Parallel and Distributed Computing
The HV-tree: a memory hierarchy aware version index

Proceedings of the VLDB Endowment
A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness

CASCON First Decade High Impact Papers
Landing stencil code on Godson-T

Journal of Computer Science and Technology
Parallelization of DNA sequence alignment using OpenMP

Proceedings of the 2011 International Conference on Communication, Computing & Security
Polyhedral Model Based Data Locality Optimization for Embedded Applications

GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
A parallel numerical solver using hierarchically tiled arrays

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Locality optimization of stencil applications using data dependency graphs

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
A programming language interface to describe transformations and code generation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Constructing application-specific memory hierarchies on FPGAs

Transactions on high-performance embedded architectures and compilers III
On the theory and potential of LRU-MRU collaborative cache management

Proceedings of the international symposium on Memory management
Studying inter-core data reuse in multicores

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Memory access optimization in compilation for coarse-grained reconfigurable architectures

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Studying inter-core data reuse in multicores

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Task ordering and memory management problem for degree of parallelism estimation

COCOON'11 Proceedings of the 17th annual international conference on Computing and combinatorics
A cache-conscious profitability model for empirical tuning of loop fusion

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Using platform-specific performance counters for dynamic compilation

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A 0-1 integer linear programming based approach for global locality optimizations

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Loop striping: maximize parallelism for nested loops

EUC'06 Proceedings of the 2006 international conference on Embedded and Ubiquitous Computing
Tuning blocked array layouts to exploit memory hierarchy in SMT architectures

PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
A data transformations based approach for optimizing memory and cache locality on distributed memory multiprocessors

APPT'05 Proceedings of the 6th international conference on Advanced Parallel Processing Technologies
Out-of-Core Computations of High-Resolution Level Sets by Means of Code Transformation

Journal of Scientific Computing
Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

The Journal of Supercomputing
Optimizing data locality using array tiling

Proceedings of the International Conference on Computer-Aided Design
Combined loop transformation and hierarchy allocation for data reuse optimization

Proceedings of the International Conference on Computer-Aided Design
MiniTasking: improving cache performance for multiple query workloads

WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
Optimization of dense matrix multiplication on IBM Cyclops-64: challenges and experiences

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Applying loop optimizations to object-oriented abstractions through general classification of array semantics

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Experiments with auto-parallelizing SPEC2000FP benchmarks

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Extending the applicability of scalar replacement to multiple induction variables

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Low power engineering

Embedded Systems Design
Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizations

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Combining performance aspects of irregular Gauss-Seidel via sparse tiling

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
A hybrid strategy based on data distribution and migration for optimizing memory locality

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Optimizing SDRAM bandwidth for custom FPGA loop accelerators

Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Systematic preprocessing of data dependent constructs for embedded systems

PATMOS'05 Proceedings of the 15th international conference on Integrated Circuit and System Design: Power and Timing Modeling, Optimization and Simulation
Loop transformation recipes for code generation and auto-tuning

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Efficient tiled loop generation: D-tiling

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A data layout optimization framework for NUCA-based multicores

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Memory space conscious loop iteration duplication for reliable execution

SAS'05 Proceedings of the 12th international conference on Static Analysis
Runtime biased pointer reuse analysis and its application to energy efficiency

PACS'03 Proceedings of the Third international conference on Power - Aware Computer Systems
A comparative analysis of performance improvement schemes for cache memories

Computers and Electrical Engineering
Parameterized loop tiling

ACM Transactions on Programming Languages and Systems (TOPLAS)
Path-Based reuse distance analysis

CC'06 Proceedings of the 15th international conference on Compiler Construction
On-chip cache hierarchy-aware tile scheduling for multicore machines

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Optimizing I/O for big array analytics

Proceedings of the VLDB Endowment
Optimizing memory hierarchy allocation with loop transformations for high-level synthesis

Proceedings of the 49th Annual Design Automation Conference
Polyhedra scanning revisited

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
A generalized theory of collaborative caching

Proceedings of the 2012 international symposium on Memory Management
Hierarchical overlapped tiling

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Analytical bounds for optimal tile size selection

CC'12 Proceedings of the 21st international conference on Compiler Construction
Partitioning and scheduling loops on NOWs

Computer Communications
On Urgency of I/O Operations

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Integrating Memory Optimization with Mapping Algorithms for Multi-Processors System-on-Chip

ACM Transactions on Embedded Computing Systems (TECS)
Algorithmic species: A classification of affine loop nests for parallel programming

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Improving last level cache locality by integrating loop and data transformations

Proceedings of the International Conference on Computer-Aided Design
Reshaping cache misses to improve row-buffer locality in multicore systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Automatic OpenCL work-group size selection for multicore CPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Adaptive parallel tiled code generation and accelerated auto-tuning

International Journal of High Performance Computing Applications
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Tile size selection revisited

ACM Transactions on Architecture and Code Optimization (TACO)
Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.03

Abstract