A fine-grain load-adaptive algorithm of the 2D discrete wavelet transform for multithreaded architectures

Authors:
Parimala Thulasiraman;Ashfaq A. Khokhar;Gerd Heber;Guang R. Gao
Affiliations:
Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3T 2N2, Canada;Department of EECS, University of Illinois at Chicago, Chicago, IL;Cornell Theory Center, Cornell University, 638 Rhodes Hall, Ithaca, NY;Department of ECE, University of Delaware, Newark, DE
Venue:
Journal of Parallel and Distributed Computing
Year:
2004


Abstract

In this paper we develop a load-adaptive multithreaded algorithm to compute the 2D discrete wavelet transform (DWT) and present its implementation on a fine-grain multithreading platform. In a 2D DWT computation, the problem size shrinks at every decomposition level, and the length of the emerging computation paths also varies. The proposed parallel algorithm dynamically scales itself to the varying problem size. During any iteration, the ratio of the number of local threads to the number of remote threads issued by a processor can be kept greater than 1 by controlling the algorithm parameters. This approach provides an opportunity to interleave computation and communication without explicitly introducing idle cycles spent waiting for remote threads to finish. Experimental results are reported for implementations of the proposed algorithm on EARTH-MANNA, a 20-node emulated multithreaded platform specifically designed for fine-grain multithreaded paradigms. We show that multithreaded implementations of the proposed algorithm are at least two times faster than the MPI-based message-passing implementations reported in the literature, assuming the same processor speed. We further show that the proposed algorithm and its implementations scale linearly with respect to problem and machine sizes.
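To illustrate why the problem size shrinks at every decomposition level, the following is a minimal sequential sketch of a multi-level 2D DWT. It assumes the Haar filter and a square image whose side is a power of two; neither assumption comes from the paper, and the function names (`haar_1d`, `dwt2d_level`, `dwt2d`) are hypothetical. It does not model the multithreaded EARTH-MANNA implementation or its local and remote threads; it only shows the shrinking workload that a load-adaptive scheme has to track.

```python
import math


def haar_1d(signal):
    """One level of the 1D Haar transform (assumed filter, not from the paper).

    Returns the approximation coefficients followed by the detail coefficients.
    """
    scale = 1.0 / math.sqrt(2.0)
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) * scale for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) * scale for i in range(half)]
    return approx + detail


def dwt2d_level(image, size):
    """Transform the top-left size x size block in place: rows, then columns."""
    for r in range(size):
        image[r][:size] = haar_1d(image[r][:size])
    for c in range(size):
        column = haar_1d([image[r][c] for r in range(size)])
        for r in range(size):
            image[r][c] = column[r]


def dwt2d(image, levels):
    """Apply up to `levels` decomposition levels in place.

    Each level only touches the low-pass quadrant left by the previous one,
    so the side of the active block halves every time. Returns the list of
    block sides actually processed.
    """
    size = len(image)
    sizes = []
    for _ in range(levels):
        if size < 2:
            break
        dwt2d_level(image, size)
        sizes.append(size)
        size //= 2
    return sizes
```

For a 4×4 input and two levels, `dwt2d` processes blocks of side 4 and then 2, so the second level handles a quarter of the elements of the first. This is the shrinking workload to which the abstract says the parallel algorithm adapts.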