Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

Authors:
Victor Olman; Fenglou Mao; Hongwei Wu; Ying Xu
Affiliations:
University of Georgia, Athens; University of Georgia, Athens; University of Georgia, Athens; University of Georgia, Athens
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2009

Abstract

Clustering large bioinformatics data sets is time-consuming, which motivates a parallel algorithm for identifying dense clusters in a noisy background. Our algorithm works on a graph representation of the data set to be analyzed and identifies clusters as densely intraconnected subgraphs. We employ a minimum spanning tree (MST) representation of the graph and solve the cluster identification problem using this representation. The computational bottleneck of our algorithm is the construction of an MST of the graph, for which a parallel algorithm is employed. Our high-level strategy for the parallel MST construction is to first partition the graph, then construct MSTs for the partitioned subgraphs and for auxiliary bipartite graphs based on the subgraphs, and finally merge these MSTs to derive an MST of the original graph. Computational results indicate that, when running on 150 CPUs, our algorithm can solve a cluster identification problem on a data set with 1,000,000 data points almost 100 times faster than on a single CPU, indicating that the program can handle very large clustering problems efficiently. We have implemented the clustering algorithm as the software CLUMP.
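
The partition, sub-MST, and merge strategy described in the abstract can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the CLUMP implementation: the function names (`kruskal`, `partitioned_mst`, `direct_mst`), the round-robin vertex partition, the Euclidean distance between points, and the use of Kruskal's algorithm at every step are all hypothetical choices made for the example.

```python
import itertools
import math


def kruskal(num_vertices, edges):
    """Return the MST (or spanning forest) edges of a graph.

    `edges` is an iterable of (weight, u, v) tuples over vertices 0..num_vertices-1.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def partitioned_mst(points, num_parts):
    """MST of the complete Euclidean graph on `points`, built piecewise."""
    n = len(points)

    def dist(u, v):
        return math.dist(points[u], points[v])

    # Step 1: partition the vertices (round-robin is an assumption here).
    parts = [list(range(p, n, num_parts)) for p in range(num_parts)]

    candidates = []
    # Step 2a: MST of the complete subgraph induced by each part.
    for part in parts:
        edges = [(dist(u, v), u, v) for u, v in itertools.combinations(part, 2)]
        candidates.extend(kruskal(n, edges))
    # Step 2b: MST of the bipartite graph between each pair of parts.
    for a, b in itertools.combinations(parts, 2):
        edges = [(dist(u, v), u, v) for u in a for v in b]
        candidates.extend(kruskal(n, edges))

    # Step 3: merge by taking an MST of the union of all sub-MST edges.
    return kruskal(n, candidates)


def direct_mst(points):
    """Reference MST computed on the full complete graph in one pass."""
    n = len(points)
    edges = [(math.dist(points[u], points[v]), u, v)
             for u, v in itertools.combinations(range(n), 2)]
    return kruskal(n, edges)
```

Every sub-MST computation in steps 2a and 2b depends only on its own part or pair of parts, so in a parallel setting each one could be handed to a separate worker, with only the much smaller set of candidate edges gathered for the final merge. The merge is valid because an edge discarded from a sub-MST is a heaviest edge on some cycle of that subgraph, so an MST of the original graph exists without it.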