Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Performance evaluation of static and dynamic memory systems on the Cray-2
ICS '88 Proceedings of the 2nd international conference on Supercomputing
Program optimization for instruction caches
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Simple but effective techniques for NUMA memory management
SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
Avoiding conflict misses dynamically in large direct-mapped caches
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Unifying data and control transformations for distributed shared-memory machines
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Compiler-directed page coloring for multiprocessors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Fusion of Loops for Parallelism and Locality
IEEE Transactions on Parallel and Distributed Systems
Eliminating cache conflict misses through XOR-based placement functions
ICS '97 Proceedings of the 11th international conference on Supercomputing
A compiler algorithm for optimizing locality in loop nests
ICS '97 Proceedings of the 11th international conference on Supercomputing
Non-singular data transformations: definition, validity and applications
ICS '97 Proceedings of the 11th international conference on Supercomputing
Cache miss equations: an analytical representation of cache misses
ICS '97 Proceedings of the 11th international conference on Supercomputing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Eliminating conflict misses for high performance architectures
ICS '98 Proceedings of the 12th international conference on Supercomputing
Cache-conscious data placement
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Precise miss analysis for program transformations with caches of arbitrary associativity
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Reducing cache misses using hardware and software page placement
ICS '99 Proceedings of the 13th international conference on Supercomputing
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Locality optimizations for multi-level caches
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Cache-optimal methods for bit-reversals
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Memory characteristics of iterative methods
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Automated cache optimizations using CME driven diagnosis
Proceedings of the 14th international conference on Supercomputing
ZPL: A Machine Independent Programming Language for Parallel Computers
IEEE Transactions on Software Engineering - Special issue on architecture-independent languages and software tools for parallel processing
Automated data-member layout of heap objects to improve memory-hierarchy performance
ACM Transactions on Programming Languages and Systems (TOPLAS)
A compiler technique for improving whole-program locality
POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Tiling optimizations for 3D scientific computations
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Loop optimization for a class of memory-constrained computations
ICS '01 Proceedings of the 15th international conference on Supercomputing
ICS '01 Proceedings of the 15th international conference on Supercomputing
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Improving memory performance of sorting algorithms
Journal of Experimental Algorithmics (JEA)
Combined partitioning and data padding for scheduling multiple loop nests
CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Data reorganization engines for the next generation of system-on-a-chip FPGAs
FPGA '02 Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-programmable gate arrays
Static and Dynamic Locality Optimizations Using Integer Linear Programming
IEEE Transactions on Parallel and Distributed Systems
Data Relation Vectors: A New Abstraction for Data Optimizations
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Hardware and Software Techniques for Controlling DRAM Power Modes
IEEE Transactions on Computers
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Software caching vs. prefetching
Proceedings of the 3rd international symposium on Memory management
Tight bounds on cache use for stencil operations on rectangular grids
Journal of the ACM (JACM)
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping
The Journal of Supercomputing
A Layout-Conscious Iteration Space Transformation Technique
IEEE Transactions on Computers
HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Data Layout Optimizations for Variable Coefficient Multigrid
ICCS '02 Proceedings of the International Conference on Computational Science-Part III
Optimizing Graph Algorithms for Improved Cache Performance
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Influence of Array Allocation Mechanisms on Memory System Energy
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
High Performance Numerical Computing in Java: Language and Compiler Issues
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Compiler and Run-Time Support for Improving Locality in Scientific Codes
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Experimental Evaluation of Energy Behavior of Iteration Space Tiling
LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Reducing Cache Conflicts by a Parametrized Memory Mapping
ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Improving Cache Effectiveness through Array Data Layout Manipulation in SAC
IFL '00 Selected Papers from the 12th International Workshop on Implementation of Functional Languages
A Comparison of Locality Transformations for Irregular Codes
LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Cache Line Impact on 3D PDE Solvers
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Array Unification: A Locality Optimization Technique
CC '01 Proceedings of the 10th International Conference on Compiler Construction
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework
IEEE Transactions on Parallel and Distributed Systems
Predicting the impact of optimizations for embedded systems
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Data cache locking for higher program predictability
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A compiler approach for reducing data cache energy
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
Array Regrouping and Its Use in Compiling Data-Intensive Embedded Applications
IEEE Transactions on Computers
Data Caches in Multitasking Hard Real-Time Systems
RTSS '03 Proceedings of the 24th IEEE International Real-Time Systems Symposium
Transforming Complex Loop Nests for Locality
The Journal of Supercomputing
A Quantitative Analysis of Tile Size Selection Algorithms
The Journal of Supercomputing
A fast and accurate framework to analyze and optimize cache memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior
IEEE Transactions on Computers
Java programming for high-performance numerical computing
IBM Systems Journal
Data Space Oriented Scheduling in Embedded Systems
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Generalized Data Transformations for Enhancing Cache Behavior
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Optimizing Graph Algorithms for Improved Cache Performance
IEEE Transactions on Parallel and Distributed Systems
Quasidynamic Layout Optimizations for Improving Data Locality
IEEE Transactions on Parallel and Distributed Systems
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
Locality-Aware Process Scheduling for Embedded MPSoCs
Proceedings of the conference on Design, Automation and Test in Europe - Volume 2
A Geometric Programming Framework for Optimal Multi-Level Tiling
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A case for a working-set-based memory hierarchy
Proceedings of the 2nd conference on Computing frontiers
Automatic blocking of QR and LU factorizations for locality
MSP '04 Proceedings of the 2004 workshop on Memory system performance
Generating cache hints for improved program efficiency
Journal of Systems Architecture: the EUROMICRO Journal
Improving whole-program locality using intra-procedural and inter-procedural transformations
Journal of Parallel and Distributed Computing
Reducing data cache leakage energy using a compiler-based approach
ACM Transactions on Embedded Computing Systems (TECS)
An accurate cost model for guiding data locality transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Lightweight reference affinity analysis
Proceedings of the 19th annual international conference on Supercomputing
Maximizing data reuse for minimizing memory space requirements and execution cycles
ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Register aware scheduling for distributed cache clustered architecture
ASP-DAC '03 Proceedings of the 2003 Asia and South Pacific Design Automation Conference
Empirical optimization for a sparse linear solver: a case study
International Journal of Parallel Programming - Special issue: The next generation software program
Instruction scheduling for a tiled dataflow architecture
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Detailed cache simulation for detecting bottleneck, miss reason and optimization potentialities
valuetools '06 Proceedings of the 1st international conference on Performance evaluation methodolgies and tools
Improving locality for ODE solvers by program transformations
Scientific Programming
Compiler-managed partitioned data caches for low power
Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Forma: A framework for safe automatic array reshaping
ACM Transactions on Programming Languages and Systems (TOPLAS)
Data cache locking for tight timing calculations
ACM Transactions on Embedded Computing Systems (TECS)
Fast indexing for blocked array layouts to reduce cache misses
International Journal of High Performance Computing and Networking
Using Padding to Optimize Locality in Scientific Applications
ICCS '08 Proceedings of the 8th international conference on Computational Science, Part I
Comprehensive cache performance tuning with a toolset
Future Generation Computer Systems
Cache line reservation: exploring a scheme for cache-friendly object allocation
CASCON '09 Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research
Algorithms for memory hierarchies: advanced lectures
Algorithms for memory hierarchies: advanced lectures
Compiling for reconfigurable computing: A survey
ACM Computing Surveys (CSUR)
Compiler techniques for reducing data cache miss rate on a multithreaded architecture
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Redesigning the string hash table, burst trie, and BST to exploit cache
Journal of Experimental Algorithmics (JEA)
Parallel memory prediction for fused linear algebra kernels
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
A programming language interface to describe transformations and code generation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Data layout for cache performance on a multithreaded architecture
Transactions on high-performance embedded architectures and compilers III
Data layout transformation for stencil computations on short-vector SIMD architectures
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Applying data copy to improve memory performance of general array computations
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Optimizing matrix multiplication with a classifier learning system
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Using platform-specific performance counters for dynamic compilation
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
YACO: a user conducted visualization tool for supporting cache optimization
HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
Optimizing data locality using array tiling
Proceedings of the International Conference on Computer-Aided Design
Near-optimal padding for removing conflict misses
LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Evaluating iterative compilation
LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Optimization-Oriented visualization of cache access behavior
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
Loop transformation recipes for code generation and auto-tuning
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A data layout optimization framework for NUCA-based multicores
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Analysis of the spatial and temporal locality in data accesses
ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part II
Improving last level cache locality by integrating loop and data transformations
Proceedings of the International Conference on Computer-Aided Design
Reshaping cache misses to improve row-buffer locality in multicore systems
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
High performance evaluation of evolutionary-mined association rules on GPUs
The Journal of Supercomputing
Hi-index | 0.01 |
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PADLITE only uses array and column dimension sizes, relying on assumptions about common array reference patterns. PAD analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PADLITE can eliminate conflicts for benchmarks, but PAD is more effective over a range of cache and problem sizes. Padding reduces cache miss rates by 16% on average for a 16K direct-mapped cache. Execution times are reduced by 6% on average, with some SPEC95 programs improving up to 15%.