A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
An analysis of dag-consistent distributed shared-memory algorithms
Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Advanced compiler design and implementation
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Journal of the ACM (JACM)
Communications of the ACM
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Effectively sharing a cache among threads
Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Optimizing Graph Algorithms for Improved Cache Performance
IEEE Transactions on Parallel and Distributed Systems
A programming language
Cache-oblivious dynamic programming
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
The cache complexity of multithreaded cache oblivious algorithms
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
An experimental comparison of cache-oblivious and cache-conscious programs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
STXXL: standard template library for XXL data sets
ESA'05 Proceedings of the 13th annual European conference on Algorithms
Provably good multicore cache performance for divide-and-conquer algorithms
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Cache-oblivious databases: Limitations and opportunities
ACM Transactions on Database Systems (TODS)
Cache-efficient dynamic programming algorithms for multicores
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Brief announcement: low depth cache-oblivious sorting
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Low depth cache-oblivious algorithms
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Cache-Oblivious Dynamic Programming for Bioinformatics
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Optimization of triangular matrix functions in BLAS library on Loongson2F
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Geometric algorithms for private-cache chip multiprocessors
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Scheduling irregular parallel computations on hierarchical caches
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
A memory access model for highly-threaded many-core architectures
Future Generation Computer Systems
The Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to represent the triply-nested loop computation that occurs in several important algorithms, including Gaussian elimination without pivoting and the Floyd-Warshall all-pairs shortest paths algorithm. An efficient cache-oblivious algorithm for these instances of GEP was presented in [6]. In this paper we establish several important properties of this cache-oblivious framework, and extend the framework to solve GEP in its full generality within the same time and I/O bounds. We then analyze a parallel implementation of the framework and its caching performance for both shared and distributed caches. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations, and compare them with finely tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive tradeoff between efficiency and portability.
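To make the "triply-nested loop computation" concrete, the following is a minimal sketch of the plain iterative loop nest that GEP abstracts. It is not the cache-oblivious algorithm itself. The abstract does not spell out the loop body, so the update form `c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])`, the predicate `in_update_set`, and all function names here are assumptions chosen for illustration.

```python
import math

def gep(c, f, in_update_set):
    """Naive triply-nested GEP loop over an n x n matrix c, updated in place.

    Assumed form (not given in the abstract): for each (i, j, k) selected by
    in_update_set, replace c[i][j] with f(c[i][j], c[i][k], c[k][j], c[k][k]).
    """
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if in_update_set(i, j, k):
                    c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])
    return c

def floyd_warshall(dist):
    """All-pairs shortest paths: every (i, j, k) triple is updated."""
    return gep(dist,
               lambda x, u, v, w: min(x, u + v),
               lambda i, j, k: True)

def gaussian_elimination_no_pivot(a):
    """Elimination without pivoting: only entries below and right of the pivot.

    Afterwards the upper triangle of a (including the diagonal) holds U.
    """
    return gep(a,
               lambda x, u, v, w: x - (u * v) / w,
               lambda i, j, k: i > k and j > k)

if __name__ == "__main__":
    inf = math.inf
    print(floyd_warshall([[0, 3, inf], [inf, 0, 1], [2, inf, 0]]))
    print(gaussian_elimination_no_pivot([[2.0, 1.0], [4.0, 5.0]]))
```

Both instances share one loop skeleton and differ only in the update function and in which index triples are updated, which is the sense in which a single framework can cover them. This naive ordering sweeps the whole matrix once per value of `k`, which is the access pattern a cache-oblivious formulation is meant to improve.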