JuliusC: a practical approach for the analysis of divide-and-conquer algorithms

Authors:
Paolo D'Alberto; Alexandru Nicolau
Affiliations:
School of Information and Computer Science, University of California at Irvine; School of Information and Computer Science, University of California at Irvine
Venue:
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Year:
2004

Abstract

The development of divide-and-conquer (D&C) algorithms for matrix computations has led to the widespread use of high-performance scientific applications and libraries. D&C algorithms can be implemented using either loop nests or recursion. Recursion is appealing because it is an intuitive way to express top-down techniques, which exploit data locality and parallelism naturally. However, recursion has been considered impractical for high-performance codes, mostly because of the inherent overhead of dividing a problem into small subproblems. In this work, we develop techniques that model the behavior of recursive algorithms in a form a compiler can use to estimate and reduce the overhead of the division process. We describe these techniques and JuliusC, a (lite) C compiler that we developed to exploit them. JuliusC partially unfolds the application call graph and extracts the relations among function calls. As its final result, it produces a directed acyclic graph (DAG) that models the function calls concisely. The approach combines compile-time and run-time analysis, both of which have negligible complexity. We illustrate the applicability of our approach by studying six test cases. We present the analysis results and show how our (optimizing) compiler can use them to increase the efficiency of the division process by between 14 and 20 million times for our codes.
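To give a feel for why a DAG can model the function calls of a recursive algorithm concisely, here is a minimal sketch. It is not JuliusC's actual analysis: it assumes a simple binary D&C recursion that splits a problem of size n into halves of size floor(n/2) and n - floor(n/2), and it identifies a call only by its problem size. The function names `call_tree_size` and `call_dag` and the `leaf` parameter are hypothetical, chosen for illustration.

```python
def call_tree_size(n, leaf=1):
    """Count every call made by a binary D&C recursion on a problem of size n.

    Assumption: the recursion stops at problems of size <= leaf and otherwise
    makes exactly two recursive calls, one per half.
    """
    if n <= leaf:
        return 1
    half = n // 2
    return 1 + call_tree_size(half, leaf) + call_tree_size(n - half, leaf)


def call_dag(n, leaf=1):
    """Collapse the call tree into a DAG with one node per distinct problem size.

    Returns a dict mapping each problem size to the tuple of sizes it calls
    directly (an empty tuple for leaf problems).
    """
    dag = {}
    pending = [n]
    while pending:
        size = pending.pop()
        if size in dag:
            continue  # this problem size has already been modeled
        if size <= leaf:
            dag[size] = ()
        else:
            half = size // 2
            dag[size] = (half, size - half)
            pending.extend(dag[size])
    return dag


if __name__ == "__main__":
    n = 1000
    print("calls in the unfolded tree:", call_tree_size(n))
    print("nodes in the collapsed DAG:", len(call_dag(n)))
```

Under these assumptions the fully unfolded call tree grows linearly with n, while the number of distinct problem sizes grows only logarithmically, so work done once per DAG node (for example, computing how a problem is divided) need not be repeated for every call.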