Recursive array layouts and fast parallel matrix multiplication

Authors:
Siddhartha Chatterjee; Alvin R. Lebeck; Praveen K. Patnala; Mithuna Thottethodi
Affiliations:
Department of Computer Science, The University of North Carolina, Chapel Hill, NC; Department of Computer Science, Duke University, Durham, NC; Department of Computer Science, The University of North Carolina, Chapel Hill, NC; Department of Computer Science, Duke University, Durham, NC
Venue:
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Year:
1999

References: 30
Cited by: 42

Memory storage patterns in parallel processing

Memory storage patterns in parallel processing
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Evaluating Associativity in CPU Caches

IEEE Transactions on Computers
Data optimization: allocation of arrays to reduce communication on SIMD machines

Journal of Parallel and Distributed Computing - Massively parallel computation
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Linear clustering of objects with multiple attributes

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The high performance Fortran handbook

The high performance Fortran handbook
Automatic data partitioning on distributed memory multicomputers

Automatic data partitioning on distributed memory multicomputers
A parallel hashed Oct-Tree N-body algorithm

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Optimal evaluation of array expressions on massively parallel machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Balancing processor loads and exploiting data locality in N-body simulations

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves

IEEE Transactions on Parallel and Distributed Systems
Analysis of the clustering properties of Hilbert space-filling curve

Analysis of the clustering properties of Hilbert space-filling curve
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
High performance Fortran for highly irregular problems

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
Automatic data layout for distributed-memory machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Accuracy and Stability of Numerical Algorithms

Accuracy and Stability of Numerical Algorithms
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
Digital Design

Digital Design
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Efficient Procedures for Using Matrix Algorithms

Proceedings of the 2nd Colloquium on Automata, Languages and Programming

Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Design and evaluation of a linear algebra package for Java

Proceedings of the ACM 2000 conference on Java Grande
Towards a theory of cache-efficient algorithms

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Symbolic bounds analysis of pointers, array indices, and accessed memory regions

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
A comparison of three approaches to language, compiler, and library support for multidimensional arrays in Java

Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande
Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Efficient Representation Scheme for Multidimensional Array Operations

IEEE Transactions on Computers
Computation regrouping: restructuring programs for temporal data cache locality

ICS '02 Proceedings of the 16th international conference on Supercomputing
Global static indexing for real-time exploration of very large regular grids

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Towards a theory of cache-efficient algorithms

Journal of the ACM (JACM)
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
Mixed Parallel Implementations of Strassen and Winograd Matrix Multiplication Algorithms

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Recursion Unrolling for Divide and Conquer Programs

LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Design-Driven Compilation

CC '01 Proceedings of the 10th International Conference on Compiler Construction
Parallel Complexity of Matrix Multiplication

The Journal of Supercomputing
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers

IEEE Transactions on Parallel and Distributed Systems
On improving the memory access patterns during the execution of Strassen's matrix multiplication algorithm

ACSC '04 Proceedings of the 27th Australasian conference on Computer science - Volume 26
Restructuring computations for temporal data cache locality

International Journal of Parallel Programming
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
SFCGen: A framework for efficient generation of multi-dimensional space-filling curves by recursion

ACM Transactions on Mathematical Software (TOMS)
Symbolic bounds analysis of pointers, array indices, and accessed memory regions

ACM Transactions on Programming Languages and Systems (TOPLAS)
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
An experimental comparison of cache-oblivious and cache-conscious programs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Representation-transparent matrix algorithms with scalable performance

Proceedings of the 21st annual international conference on Supercomputing
Adaptive Strassen's matrix multiplication

Proceedings of the 21st annual international conference on Supercomputing
Mapping with Space Filling Surfaces

IEEE Transactions on Parallel and Distributed Systems
Compact Hilbert indices: Space-filling curves for domains with unequal side lengths

Information Processing Letters
Fast indexing for blocked array layouts to reduce cache misses

International Journal of High Performance Computing and Networking
On the limits of cache-oblivious rational permutations

Theoretical Computer Science
Using non-canonical array layouts in dense matrix operations

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Evaluating ISA support and hardware support for recursive data layouts

HiPC'07 Proceedings of the 14th international conference on High performance computing
Cache-Oblivious Algorithms

ACM Transactions on Algorithms (TALG)
A study on load imbalance in parallel hypermatrix multiplication using OpenMP

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Minimizing associativity conflicts in morton layout

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
Optimizing data locality using array tiling

Proceedings of the International Conference on Computer-Aided Design
A cache-aware algorithm for PDEs on hierarchical data structures

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Cache-Oblivious algorithms and matrix formats for computations on interval matrices

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2

Abstract