Restructuring computations for temporal data cache locality

Authors:
Venkata K. Pingali;Sally A. McKee;Wilson C. Hsieh;John B. Carter
Affiliations:
Information Sciences Institute, University of Southern California, Los Angeles, California;Electrical and Computer Engineering, Cornell University, Ithaca, New York;School of Computing, 50 S Central Campus Drive, Room 3190, University of Utah, Salt Lake City, Utah;School of Computing, 50 S Central Campus Drive, Room 3190, University of Utah, Salt Lake City, Utah
Venue:
International Journal of Parallel Programming
Year:
2003

Abstract

Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it is useful to trade increased computation costs for reduced memory costs. The contributions of this paper are three-fold: we provide a detailed analysis of the memory performance of seven memory-intensive benchmarks; we describe Computation Regrouping, a source-level approach to improving the performance of memory-bound applications by increasing temporal locality to eliminate cache and TLB misses; and we demonstrate significant performance improvement by applying Computation Regrouping to our suite of seven benchmarks. Using Computation Regrouping, we observe a geometric mean speedup of 1.90, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory stall time.