Tiling, Block Data Layout, and Memory Hierarchy Performance

Authors:
Neungsoo Park; Bo Hong; Viktor K. Prasanna
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2003

Abstract

Recently, several experimental studies have examined block data layout, used in conjunction with tiling, as a data transformation technique for improving cache performance. In this paper, we analyze the cache and TLB performance of such alternative layouts (including block data layout and Morton layout) when they are combined with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for the optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling with block data layout, using a block size given by our block size selection algorithm, reduces TLB misses by up to 93 percent compared with other techniques (copying, padding, etc.). The total miss cost is also reduced considerably. Experiments on several platforms (UltraSparc II and III, Alpha, and Pentium III) show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout: experimental results show that matrix multiplication using block data layout is up to 15 percent faster than matrix multiplication using Morton layout.
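
The abstract names block data layout, Morton layout, and tiled matrix multiplication without showing how the layouts map matrix indices to memory. The following is a minimal illustrative sketch, not the authors' implementation. It assumes a square n x n matrix whose size is a multiple of the block size b, row-major order inside each block, and the common bit-interleaving (Z-order) definition of Morton indexing, which may differ from the variant analyzed in the paper. All function names are hypothetical.

```python
def block_offset(i, j, n, b):
    """Offset of element (i, j) in an n x n matrix stored in block data layout.

    Assumption: b x b blocks are stored contiguously, blocks are ordered
    row-major, and elements inside a block are also row-major.
    """
    if n % b != 0:
        raise ValueError("this sketch assumes b divides n")
    blocks_per_row = n // b
    bi, ii = divmod(i, b)  # block row, row inside the block
    bj, jj = divmod(j, b)  # block column, column inside the block
    return (bi * blocks_per_row + bj) * b * b + ii * b + jj


def morton_offset(i, j):
    """Z-order offset: interleave the bits of i and j (j in the even bits)."""
    offset = 0
    bit = 0
    while (i >> bit) or (j >> bit):
        offset |= ((j >> bit) & 1) << (2 * bit)
        offset |= ((i >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return offset


def to_block_layout(a, b):
    """Copy a list-of-lists matrix into a flat list in block data layout."""
    n = len(a)
    flat = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[block_offset(i, j, n, b)] = a[i][j]
    return flat


def tiled_matmul_block_layout(x, y, n, b):
    """Compute Z = X * Y where X, Y, and Z are flat lists in block data layout.

    The tile size equals the layout block size, so each tile operation
    touches three contiguous regions of b * b elements.
    """
    nb = n // b
    bb = b * b
    z = [0] * (n * n)
    for bi in range(nb):
        for bj in range(nb):
            zbase = (bi * nb + bj) * bb
            for bk in range(nb):
                xbase = (bi * nb + bk) * bb
                ybase = (bk * nb + bj) * bb
                for ii in range(b):
                    for kk in range(b):
                        xv = x[xbase + ii * b + kk]
                        for jj in range(b):
                            z[zbase + ii * b + jj] += xv * y[ybase + kk * b + jj]
    return z
```

In this sketch, each b x b tile occupies one contiguous run of memory, so a tile that fits within a page stays within a small number of pages. This is the intuition behind the TLB benefit the abstract reports. Choosing b itself is the job of the block size selection algorithm proposed in the paper, which is not reproduced here.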