Parallelism and error reduction in a high performance environment

Authors:
Richard C. Whaley;Anthony M. Castaldo
Affiliations:
The University of Texas at San Antonio;The University of Texas at San Antonio
Venue:
Parallelism and error reduction in a high performance environment
Year:
2010

Citing 0
Cited 1

Minimizing synchronizations in sparse iterative solvers for distributed supercomputers

Computers & Mathematics with Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

This dissertation details contributions made by the author to the field of computer science while working to improve the performance of a state-of-the-art linear algebra library intended for use on commodity platforms. These platforms and libraries have become a de facto standard for most scientific researchers. There are three significant contributions to the field. The superblock family of summation algorithms generalizes the dot product and its error analysis, and allows floating point error to be controlled with a desired target bound and performance loss and with workspace determined at compile time. It demonstrates empirically that theoretical error bounds are good predictors of actual errors. The opposite had been asserted by leading experts in the field. The Master Last algorithm identifies and corrects a heretofore unknown performance drain in multi-core parallelism, Parallel Management Overhead (PMO). The author reduced PMO by as much as two orders of magnitude, and sped up QR and LU matrix factorizations (heavily used by researchers) as much as 65%, and asymptotically as much as 45%. Finally, in the Parallel Cache Assignment algorithm, the author realized the goal of exploiting the “collective cache” of all the cores in a multi-core system for a tightly coupled algorithm with high data dependencies requiring extreme synchronization. This allowed parallelization of some heavily used (and heavily researched) operations in linear algebra, panel factorizations, without changing the flop counts or arithmetic, a feat that had defied parallel researchers for many years. The result was superlinear speedup (as much as 19 times faster for an eight core system) for the QR panel factorization over the previous state of the art serial implementation.