The performance of many scientific programs is limited by data movement. Loop fusion is one optimization used to speed up memory-bound operations. To automate loop fusion for matrix computations, we developed the Build to Order (BTO) compiler. Within BTO, an analytic memory model efficiently and accurately reduces the number of serial loop fusion options considered. In this paper, we extend the model to shared-memory parallel machines. We detail how memory use and runtime prediction differ between the parallel and serial cases, and explain the changes made to include parallel machines in the model. Analysis of the parallel model's predictions shows that, when included in BTO, it will reduce the search space of routines considered.
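As a minimal illustration of why loop fusion reduces data movement, the sketch below computes y = Ax and z = A^T w, first with two separate passes over A and then with a single fused pass. The function names (`unfused`, `fused`) and the `reads` counter are hypothetical and chosen for this example only; they are not taken from BTO, and the counter is a crude stand-in for memory traffic rather than the analytic model described above. Pure Python is used only to show the transformation; any actual speedup would be expected in compiled code.

```python
def unfused(A, x, w):
    """Compute y = A x and z = A^T w with two separate passes over A."""
    m, n = len(A), len(A[0])
    reads = 0  # counts how many matrix elements are read (illustrative only)
    y = [0.0] * m
    for i in range(m):          # first pass over A
        for j in range(n):
            y[i] += A[i][j] * x[j]
            reads += 1
    z = [0.0] * n
    for i in range(m):          # second pass over A
        for j in range(n):
            z[j] += A[i][j] * w[i]
            reads += 1
    return y, z, reads


def fused(A, x, w):
    """Compute the same y and z in one pass, reusing each element of A."""
    m, n = len(A), len(A[0])
    reads = 0
    y = [0.0] * m
    z = [0.0] * n
    for i in range(m):
        for j in range(n):
            a = A[i][j]         # read once, use for both results
            reads += 1
            y[i] += a * x[j]
            z[j] += a * w[i]
    return y, z, reads
```

Under these assumptions, both versions return the same vectors, but the fused version reads each matrix element once instead of twice, which is the kind of saving a memory model would need to weigh when comparing fusion options.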