Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

Authors:
Jeff Bilmes;Krste Asanovic;Chee-Whye Chin;Jim Demmel
Affiliations:
CS Division, University of California at Berkeley, Berkeley, CA and International Computer Science Institute, Berkeley, CA;CS Division, University of California at Berkeley, Berkeley, CA and International Computer Science Institute, Berkeley, CA;CS Division, University of California at Berkeley, Berkeley, CA and International Computer Science Institute, Berkeley, CA;CS Division, University of California at Berkeley, Berkeley, CA and International Computer Science Institute, Berkeley, CA
Venue:
ICS '97 Proceedings of the 11th international conference on Supercomputing
Year:
1997

Citing 15
Cited 164

An extended set of FORTRAN basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Using Strassen's algorithm to accelerate the solution of linear systems

The Journal of Supercomputing
LAPACK's user's guide

LAPACK's user's guide
DXML: a high-performance scientific subroutine library

Digital Technical Journal
Basic Linear Algebra Subprograms for Fortran Usage

ACM Transactions on Mathematical Software (TOMS)
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Space-limited procedures: a methodology for portable high-performance

PMMP '95 Proceedings of the conference on Programming Models for Massively Parallel Computers
LAPACK Working Note 95: ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers -- Design Issues and Performance

LAPACK Working Note 95: ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers -- Design Issues and Performance
Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
Automatic benchmark generation for cache optimization of matrix operations

ACM-SE 33 Proceedings of the 33rd annual on Southeast regional conference

Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Memory characteristics of iterative methods

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
AJaPACK: experiments in performance portable parallel Java numerical libraries

Proceedings of the ACM 2000 conference on Java Grande
Finding least common ancestors in directed acyclic graphs

SODA '01 Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms
Optimizing locality for ODE solvers

ICS '01 Proceedings of the 15th international conference on Supercomputing
SPL: a language and compiler for DSP algorithms

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
A recursive formulation of Cholesky factorization of a matrix in packed storage

ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Array form representation of idiom recognition system for numerical programs

Proceedings of the 2001 conference on APL: an arrays odyssey
Stochastic search for signal processing algorithm optimization

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Quantifying the Multi-Level Nature of Tiling Interactions

International Journal of Parallel Programming
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
A Family of High-Performance Matrix Multiplication Algorithms

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Statistical Models for Automatic Performance Tuning

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Rescheduling for Locality in Sparse Matrix Computations

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
A Modal Model of Memory

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky

ICCS '02 Proceedings of the International Conference on Computational Science-Part II
A Performance Optimization Framework for Compilation of Tensor Contraction Expressions into Parallel Programs

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A Recursive Formulation of the Inversion of Symmetric Positive Definite Matrices in Packed Storage Data Format

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW

SAIG '00 Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation
A Characterization of Temporal Locality and Its Portability across Memory Hierarchies

ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming
Iterative Compilation

Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation - SAMOS
Cache Models for Iterative Compilation

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Delayed Evaluation, Self-optimising Software Components as a Programming Model

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Pipelining for Locality Improvement in RK Methods

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
OCEANS - Optimising Compilers for Embedded Applications

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
HPF and Numerical Libraries

ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Blocking Techniques in Numerical Software

ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Knowledge Discovery in Auto-tuning Parallel Numerical Library

Progress in Discovery Science, Final Report of the Japanese Discovery Science Project
Heterogeneous Networks of Workstations and the Parallel Matrix Multiplication

Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance: matrix-multiply revisited

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Better tiling and array contraction for compiling scientific programs

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A high-level approach to synthesis of high-performance codes for quantum chemistry

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Performance optimizations and bounds for sparse matrix-vector multiply

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Iterative compilation

Embedded processor design challenges
On the Parallel Execution Time of Tiled Loops

IEEE Transactions on Parallel and Distributed Systems
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
Formal derivation of algorithms: The triangular sylvester equation

ACM Transactions on Mathematical Software (TOMS)
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Local Discovery of System Architecture - Application Parameter Sensitivity: An Empirical Technique for Adaptive Grid Applications

HPDC '02 Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing
Self-adapting software for numerical linear algebra and LAPACK for clusters

Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Performance optimization of RK methods using block-based pipelining

Performance analysis and grid computing
Effect of auto-tuning with user's knowledge for numerical software

Proceedings of the 1st conference on Computing frontiers
A fast Fourier transform compiler

ACM SIGPLAN Notices - Best of PLDI 1979-1999
Architecture of an automatically tuned linear algebra library

Parallel Computing
Parallel and fully recursive multifrontal sparse Cholesky

Future Generation Computer Systems - Special issue: Selected numerical algorithms
Multilevel hierarchical matrix multiplication on clusters

Proceedings of the 18th annual international conference on Supercomputing
High-performance linear algebra algorithms using new generalized data structures for matrices

IBM Journal of Research and Development
Communication lower bounds for distributed-memory matrix multiplication

Journal of Parallel and Distributed Computing
Optimizing Sorting with Genetic Algorithms

Proceedings of the international symposium on Code generation and optimization
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy

Proceedings of the international symposium on Code generation and optimization
A Geometric Programming Framework for Optimal Multi-Level Tiling

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
The Opie compiler from row-major source to Morton-ordered matrices

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Energy aware lossless data compression

Proceedings of the 1st international conference on Mobile systems, applications and services
Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms

International Journal of High Performance Computing Applications
Statistical Models for Empirical Search-Based Performance Tuning

International Journal of High Performance Computing Applications
Sparsity: Optimization Framework for Sparse Matrix Kernels

International Journal of High Performance Computing Applications
Automatic generation and tuning of MPI collective communication routines

Proceedings of the 19th annual international conference on Supercomputing
Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Adaptive Strassen and ATLAS's DGEMM: A Fast Square-Matrix Multiply for Modern High-Performance Systems

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Reduction Transformations for Optimization Parameter Selection

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Lowest common ancestors in trees and directed acyclic graphs

Journal of Algorithms
Automatic tuning of whole applications using direct search and a performance-based transformation system

The Journal of Supercomputing
Online performance auditing: using hot optimizations without getting burned

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Optimizing locality and scalability of embedded Runge-Kutta solvers using block-based pipelining

Journal of Parallel and Distributed Computing
ABCLib_DRSSED: A parallel eigensolver with an auto-tuning facility

Parallel Computing
ABCLibScript: a directive to support specification of an auto-tuning facility for numerical software

Parallel Computing
Distribution of a class of divide and conquer recurrences arising from the computation of the Walsh-Hadamard transform

Theoretical Computer Science
Self-adapting numerical software (SANS) effort

IBM Journal of Research and Development
Energy-aware lossless data compression

ACM Transactions on Computer Systems (TOCS)
Empirical optimization for a sparse linear solver: a case study

International Journal of Parallel Programming - Special issue: The next generation software program
STAR-MPI: self tuned adaptive routines for MPI collective operations

Proceedings of the 20th annual international conference on Supercomputing
Profitable loop fusion and tiling using model-driven empirical search

Proceedings of the 20th annual international conference on Supercomputing
A comparison of online and offline strategies for program adaptation

ACM-SE 45 Proceedings of the 45th annual southeast regional conference
Improving locality for ODE solvers by program transformations

Scientific Programming
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
A method to derive the cache performance of irregular applications on machines with direct mapped caches

International Journal of Computational Science and Engineering
Multi-level tiling: M for the price of one

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Sketching concurrent data structures

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Combining building blocks for parallel multi-level matrix multiplication

Parallel Computing
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix

ACM Transactions on Mathematical Software (TOMS)
The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Positivity, posynomials and tile size selection

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Performance Model for Parallel Mathematical Libraries Based on Historical Knowledgebase

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
A tuning framework for software-managed memory hierarchies

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Achieving accurate and context-sensitive timing for code optimization

Software—Practice & Experience
How to Write Fast Numerical Code: A Small Introduction

Generative and Transformational Techniques in Software Engineering II
Adaptive Winograd's matrix multiplications

ACM Transactions on Mathematical Software (TOMS)
Quick and Practical Run-Time Evaluation of Multiple Program Optimizations

Transactions on High-Performance Embedded Architectures and Compilers I
PetaBricks: a language and compiler for algorithmic choice

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Model-guided autotuning of high-productivity languages for petascale computing

Proceedings of the 18th ACM international symposium on High performance distributed computing
Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
A Note on Auto-tuning GEMM for GPUs

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software

Cluster Computing
Autotuning multigrid with PetaBricks

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Automating the generation of composed linear algebra kernels

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Algorithms for memory hierarchies: advanced lectures

Algorithms for memory hierarchies: advanced lectures
Self-adapting numerical software and automatic tuning of heuristics

ICCS'03 Proceedings of the 2003 international conference on Computational science
Self-adapting software for numerical linear algebra library routines on clusters

ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
Memory hierarchy optimizations and performance bounds for sparse A^T Ax

ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
A compiler approach to performance prediction using empirical-based modeling

ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
Minimal data copy for dense linear algebra factorization

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Automatic performance tuning for the multi-section with multiple eigenvalues method for symmetric tridiagonal eigenproblems

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
d-spline based incremental parameter estimation in automatic performance tuning

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Using recursion to boost ATLAS's performance

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Speeding up Nek5000 with autotuning and specialization

Proceedings of the 24th ACM International Conference on Supercomputing
SLAMM - Automating Memory Analysis for Numerical Algorithms

Electronic Notes in Theoretical Computer Science (ENTCS)
An input-centric paradigm for program dynamic optimizations

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Measuring execution times of collective communications in an empirical optimization framework

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Towards the design of an automatically tuned linear algebra library

EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Automated empirical tuning of scientific codes for performance and power consumption

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Dynamic selection of implementation variants of sequential iterated Runge-Kutta methods with tile size sampling

Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
The Vocal Joystick Engine v1.0

Computer Speech and Language
Parallel memory prediction for fused linear algebra kernels

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Parallel Low-Storage Runge-Kutta Solvers for ODE Systems with Limited Access Distance

International Journal of High Performance Computing Applications
Smart data structures: an online machine learning approach to multicore data structures

Proceedings of the 8th ACM international conference on Autonomic computing
AARTS: low overhead online adaptive auto-tuning

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Probabilistic auto-tuning for architectures with complex constraints

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
An efficient evolutionary algorithm for solving incrementally structured problems

Proceedings of the 13th annual conference on Genetic and evolutionary computation
Autotuned parallel I/O for highly scalable biosequence analysis

Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery
An efficient time-step-based self-adaptive algorithm for predictor-corrector methods of Runge-Kutta type

Journal of Computational and Applied Mathematics
A step towards transparent integration of input-consciousness into dynamic program optimizations

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Automatic performance programming

Proceedings of the 10th SIGPLAN symposium on New ideas, new paradigms, and reflections on programming and software
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

ACM Transactions on Mathematical Software (TOMS)
Optimizing symmetric dense matrix-vector multiplication on GPUs

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Optimizing matrix multiplication with a classifier learning system

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Analytic models and empirical search: a hybrid approach to code optimization

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A data locality methodology for matrix-matrix multiplication algorithm

The Journal of Supercomputing
Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Journal of Parallel and Distributed Computing
A practical method for quickly evaluating program optimizations

HiPEAC'05 Proceedings of the First international conference on High Performance Embedded Architectures and Compilers
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A code isolator: isolating code fragments from large programs

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A family of high-performance matrix multiplication algorithms

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Automatic tuning technique exploring within the hardware-specific constrained parameters

LSSC'05 Proceedings of the 5th international conference on Large-Scale Scientific Computing
An evaluation towards automatically tuned eigensolvers

LSSC'05 Proceedings of the 5th international conference on Large-Scale Scientific Computing
Evaluating iterative compilation

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Performance modeling and optimal block size selection for the small-bulge multishift QR algorithm

ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Language and compiler support for auto-tuning variable-accuracy algorithms

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Automated programmable control and parameterization of compiler optimizations

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
POET: a scripting language for applying parameterized source-to-source program transformations

Software—Practice & Experience
Analytical bounds for optimal tile size selection

CC'12 Proceedings of the 21st international conference on Compiler Construction
From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming

Parallel Computing
Cache blocking for linear algebra algorithms

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Autotuning of adaptive mesh refinement PDE solvers on shared memory architectures

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Siblingrivalry: online autotuning through local competitions

Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
Locality optimized shared-memory implementations of iterated Runge-Kutta methods

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Layout-oblivious compiler optimization for matrix computations

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Towards a functional run-time for dense NLA domain

Proceedings of the 2nd ACM SIGPLAN workshop on Functional high-performance computing
AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Taming parallel I/O complexity with auto-tuning

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Precimonious: tuning assistant for floating-point precision

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Spiral in scala: towards the systematic construction of generators for performance libraries

Proceedings of the 12th international conference on Generative programming: concepts & experiences
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
A Basic Linear Algebra Compiler

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Tile size selection revisited

ACM Transactions on Architecture and Code Optimization (TACO)
An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations

International Journal of Parallel Programming
