Compiling techniques for first-order liner recurrences on a Vector computer
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Supercompilers for parallel and vector computers
Supercompilers for parallel and vector computers
Introduction to parallel algorithms and architectures: array, trees, hypercubes
Introduction to parallel algorithms and architectures: array, trees, hypercubes
Computing Programs Containing Band Linear Recurrences on Vector Supercomputers
IEEE Transactions on Parallel and Distributed Systems
Parallel Solutions of Simple Indexed Recurrence Equations
IEEE Transactions on Parallel and Distributed Systems
Numerical Mathematics and Computing
Numerical Mathematics and Computing
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Parallel Solutions of Indexed Recurrence Equations
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Proceedings of the 8th International Symposium on Parallel Processing
Improving Software Pipelining With Unroll-and-Jam
HICSS '96 Proceedings of the 29th Hawaii International Conference on System Sciences Volume 1: Software Technology and Architecture
DCC '98 Proceedings of the Conference on Data Compression
Optimizing techniques for saturated arithmetic with first-order linear recurrence
Proceedings of the 2009 ACM symposium on Applied Computing
Hi-index | 0.00 |
In this paper, we propose a new algorithm that analyzes the data dependency pattern in the first-order linear recurrence (FOLR) and transforms it into algebraically equivalent expanded form so that it can be processed in parallel using the threads on symmetric multiprocessor (SMP) machines. The transformation aims to eliminate the data dependencies in the naive nested form of the FOLR. However, as this transformation may result in extra multiplication operations, our algorithm examines the immanent overhead of the expanded form of the FOLR and generates a new hybrid form of the FOLR. The hybrid form combines nested and appropriately expanded form in order to make it suitable for parallel processing. The parallel algorithm based on the hybrid form of the FOLR is analytically examined and tested through implementation on SMP machines. The implementation details, such as the workload balancing between processors and the optimization of cache performance, are also discussed. The experimental results show that the parallel algorithm based on the hybrid form of the FOLR considerably improves the performance on SMP machines that have three of more processors.