Nonlinear array layouts for hierarchical memory systems

Authors:
Siddhartha Chatterjee;Vibhor V. Jain;Alvin R. Lebeck;Shyam Mundhra;Mithuna Thottethodi
Affiliations:
Department of Computer Science, The University of North Carolina, Chapel Hill, NC;Department of Computer Science, The University of North Carolina, Chapel Hill, NC;Department of Computer Science, Duke University, Durham, NC;Department of Computer Science, The University of North Carolina, Chapel Hill, NC;Department of Computer Science, Duke University, Durham, NC
Venue:
ICS '99 Proceedings of the 13th international conference on Supercomputing
Year:
1999

References: 50
Cited by: 58

Footprints in the cache

ACM Transactions on Computer Systems (TOCS)
Memory storage patterns in parallel processing

Memory storage patterns in parallel processing
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Data optimization: allocation of arrays to reduce communication on SIMD machines

Journal of Parallel and Distributed Computing - Massively parallel computation
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Linear clustering of objects with multiple attributes

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The high performance Fortran handbook

The high performance Fortran handbook
Automatic data partitioning on distributed memory multicomputers

Automatic data partitioning on distributed memory multicomputers
A parallel hashed Oct-Tree N-body algorithm

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
ATOM: a system for building customized program analysis tools

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Surpassing the TLB performance of superpages with less operating system support

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Optimal evaluation of array expressions on massively parallel machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Influence of cross-interferences on blocked loops: a case study with matrix-vector multiply

ACM Transactions on Programming Languages and Systems (TOPLAS)
Balancing processor loads and exploiting data locality in N-body simulations

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Empirical evaluation of the CRAY-T3D: a compiler perspective

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves

IEEE Transactions on Parallel and Distributed Systems
Can parallel algorithms enhance serial implementation?

Communications of the ACM
The influence of caches on the performance of heaps

Journal of Experimental Algorithmics (JEA)
Analysis of the clustering properties of Hilbert space-filling curve

Analysis of the clustering properties of Hilbert space-filling curve
Active memory: a new abstraction for memory system simulation

ACM Transactions on Modeling and Computer Simulation (TOMACS)
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Cache miss equations: an analytical representation of cache misses

ICS '97 Proceedings of the 11th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
High performance Fortran for highly irregular problems

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Eliminating conflict misses for high performance architectures

ICS '98 Proceedings of the 12th international conference on Supercomputing
Increasing TLB reach using superpages backed by shadow memory

Proceedings of the 25th annual international symposium on Computer architecture
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
Wavelets for computer graphics: theory and applications

Wavelets for computer graphics: theory and applications
Advanced compiler design and implementation

Advanced compiler design and implementation
Cache-conscious data placement

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Precise miss analysis for program transformations with caches of arbitrary associativity

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Automatic data layout for distributed-memory machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Cache performance analysis of traversals and random accesses

Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
LAPACK Users' guide (third ed.)

LAPACK Users' guide (third ed.)
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Loop Transformations for Restructuring Compilers: The Foundations

Loop Transformations for Restructuring Compilers: The Foundations
Computer Solution of Large Sparse Positive Definite Systems

Computer Solution of Large Sparse Positive Definite Systems
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Software methods for improvement of cache performance on supercomputer applications

Software methods for improvement of cache performance on supercomputer applications
Caches and algorithms

Caches and algorithms

Locality optimizations for multi-level caches

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Design and evaluation of a linear algebra package for Java

Proceedings of the ACM 2000 conference on Java Grande
Towards a theory of cache-efficient algorithms

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Tiling optimizations for 3D scientific computations

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
The NINJA project

Communications of the ACM
Efficient Representation Scheme for Multidimensional Array Operations

IEEE Transactions on Computers
Cache oblivious search trees via binary trees of small height

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests

International Journal of Parallel Programming
Precise Data Locality Optimization of Nested Loops

The Journal of Supercomputing
Towards a theory of cache-efficient algorithms

Journal of the ACM (JACM)
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
2-D Wavelet Transform Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMD Parallelism Exploitation

HiPC '02 Proceedings of the 9th International Conference on High Performance Computing
Automatic Generation of Block-Recursive Codes

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Is Morton Layout Competitive for Large Two-Dimensional Arrays?

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Parallel Wavelet Transform for Large Scale Image Processing

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
Data cache locking for higher program predictability

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Tiling, Block Data Layout, and Memory Hierarchy Performance

IEEE Transactions on Parallel and Distributed Systems
A Novel Implementation of Tile-Based Address Mapping

Proceedings of the conference on Design, automation and test in Europe - Volume 1
A fast and accurate framework to analyze and optimize cache memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior

IEEE Transactions on Computers
Mesh Partitioning Approach to Energy Efficient Data Layout

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Optimizing Graph Algorithms for Improved Cache Performance

IEEE Transactions on Parallel and Distributed Systems
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Cache-Conscious Automata for XML Filtering

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
The Opie compiler from row-major source to Morton-ordered matrices

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
A fully portable high performance minimal storage hybrid format Cholesky algorithm

ACM Transactions on Mathematical Software (TOMS)
Generating cache hints for improved program efficiency

Journal of Systems Architecture: the EUROMICRO Journal
Cache-Efficient Multigrid Algorithms

International Journal of High Performance Computing Applications
An accurate cost model for guiding data locality transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Memory Coloring: A Compiler Approach for Scratchpad Memory Management

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
A hierarchical model of data locality

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Cache-Conscious Automata for XML Filtering

IEEE Transactions on Knowledge and Data Engineering
NINJA: Java for high performance numerical computing

Scientific Programming
Fast indexing for blocked array layouts to reduce cache misses

International Journal of High Performance Computing and Networking
Towards many-core implementation of LU decomposition using Peano Curves

Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
Wavelet transform for large scale image processing on modern microprocessors

VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
Using non-canonical array layouts in dense matrix operations

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Evaluating ISA support and hardware support for recursive data layouts

HiPC'07 Proceedings of the 14th international conference on High performance computing
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture

Proceedings of the 7th ACM international conference on Computing frontiers
Cache-Oblivious Algorithms

ACM Transactions on Algorithms (TALG)
Optimizing matrix multiplication with a classifier learning system

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A study on load imbalance in parallel hypermatrix multiplication using OpenMP

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
A cache oblivious algorithm for matrix multiplication based on Peano's space filling curve

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Adapting linear algebra codes to the memory hierarchy using a hypermatrix scheme

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Minimizing associativity conflicts in Morton layout

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Optimizing data locality using array tiling

Proceedings of the International Conference on Computer-Aided Design
A data layout optimization framework for NUCA-based multicores

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Towards data tiling for whole programs in scratchpad memory allocation

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Dual-addressing memory architecture for two-dimensional memory access patterns

Proceedings of the Conference on Design, Automation and Test in Europe
Reshaping cache misses to improve row-buffer locality in multicore systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
