Optimizing the parallel computation of linear recurrences using compact matrix representations

Authors:
Adrian Nistor; Wei-Ngan Chin; Tiow-Seng Tan; Nicolae Tapus
Affiliations:
Department of Computer Science, Politehnica University of Bucharest, Romania; Department of Computer Science, National University of Singapore, Singapore; Department of Computer Science, National University of Singapore, Singapore; Department of Computer Science, Politehnica University of Bucharest, Romania
Venue:
Journal of Parallel and Distributed Computing
Year:
2009



Abstract

This paper presents a novel method for optimizing the parallel computation of linear recurrences. The method can reduce both the memory and the computation required. Its distinguishing feature is that it first formulates linear recurrences as matrix computations and then exploits the mathematical properties of those matrices to obtain more compact representations. Based on a general notion of closure under matrix multiplication, we present two classes of matrices that have compact representations: permutation matrices and matrices whose elements are linearly related to each other. To validate the method, we solve recurrences whose matrices have compact representations using CUDA on an NVIDIA GeForce 8800 GTX GPU. The technique enables larger recurrences to be computed in parallel and achieves speedups of up to eleven times over the unoptimized parallel computations, with memory usage as much as nine times lower. Our results suggest that this is a promising approach for the adoption of more advanced parallelization techniques.
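The idea of closure under matrix multiplication can be illustrated with a minimal sketch. It is written in Python rather than CUDA for brevity and is not the paper's implementation. All function names are hypothetical. The first example assumes a first-order recurrence x_i = a_i * x_(i-1) + b_i, whose step matrices [[a, b], [0, 1]] keep a fixed bottom row under multiplication, so two numbers stand in for four. Whether this family matches the paper's class of linearly related elements is an assumption. The second example stores an n-by-n permutation matrix as an index array of length n.

```python
def matmul(A, B):
    """Dense product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Illustrative family 1: step matrices [[a, b], [0, 1]].
# x_i = a_i * x_{i-1} + b_i is [x_i, 1]^T = [[a_i, b_i], [0, 1]] [x_{i-1}, 1]^T.
# The product of two such matrices has the same shape, so (a, b) suffices.
def compose_affine(later, earlier):
    """Compact product: the step `earlier` is applied first, then `later`."""
    a2, b2 = later
    a1, b1 = earlier
    return (a2 * a1, a2 * b1 + b2)


def solve_sequential(coeffs, x0):
    """Reference loop: apply each step (a, b) in order."""
    x = x0
    for a, b in coeffs:
        x = a * x + b
    return x


def solve_by_tree_reduction(coeffs, x0):
    """Combine neighbouring steps pairwise, level by level.

    The products within one level are independent of each other, which is
    what a parallel implementation would exploit. Here they run serially.
    """
    level = list(coeffs)
    if not level:
        return x0
    while len(level) > 1:
        nxt = [compose_affine(level[i + 1], level[i])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element carries over unchanged
            nxt.append(level[-1])
        level = nxt
    a, b = level[0]
    return a * x0 + b


# Illustrative family 2: permutation matrices stored as index arrays.
def perm_to_dense(p):
    """Dense matrix P with P[i][p[i]] = 1, so (P v)[i] = v[p[i]]."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]


def compose_perm(p, q):
    """Index array of perm_to_dense(p) times perm_to_dense(q)."""
    return [q[p[i]] for i in range(len(p))]
```

In this sketch each compact product costs a constant number of operations for the affine pairs and n index lookups for the permutations, compared with the cubic cost of the dense `matmul`. That is the kind of saving in memory and computation the abstract describes, although the factors reported there come from the authors' GPU experiments, not from this code.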