A parallel hybrid web document clustering algorithm and its performance study

Authors:
Shuting Xu;Jun Zhang
Affiliations:
Laboratory for High Performance Scientific Computing and Computer Simulation, Department of Computer Science, University of Kentucky, Lexington, KY;Laboratory for High Performance Scientific Computing and Computer Simulation, Department of Computer Science, University of Kentucky, Lexington, KY
Venue:
The Journal of Supercomputing - Special issue: Parallel and distributed processing and applications
Year:
2004

Abstract

Clustering web documents is an important procedure in many web information retrieval systems. As the size of the Internet grows rapidly and the number of information requests increases exponentially, the use of parallel computing techniques in large-scale web document retrieval is unavoidable. We propose a parallel hybrid web document clustering algorithm, which combines the Principal Direction Divisive Partitioning (PDDP) algorithm with the K-means algorithm. Computational experiments were conducted to test the performance of the hybrid algorithm using three real-life web document datasets, and the results were compared with those of the parallel PDDP algorithm and the parallel K-means algorithm. The experiments show that the quality of the clustering solutions obtained from the hybrid algorithm is better than that obtained from the parallel PDDP or the parallel K-means. The parallel run time of the hybrid algorithm is similar to, and sometimes less than, that of the widely used K-means algorithm.
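
The abstract does not say how the two algorithms are combined, so the following is only a minimal serial sketch of one plausible reading: PDDP supplies the initial clusters, and K-means then refines their centroids. All function names here are hypothetical. The use of Euclidean distance, the rule of splitting the cluster with the largest scatter, and the use of power iteration to approximate the principal direction are assumptions made for illustration, not details taken from the paper. The parallel distribution of the work is not shown.

```python
import math
import random

def mean(vectors):
    """Component-wise mean (centroid) of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def scatter(vectors):
    """Sum of squared distances from each vector to the centroid."""
    c = mean(vectors)
    return sum(sqdist(v, c) for v in vectors)

def principal_direction(vectors, iters=100):
    """Approximate the leading principal direction by power iteration."""
    c = mean(vectors)
    centered = [[x - m for x, m in zip(v, c)] for v in vectors]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    u = [rng.random() + 0.1 for _ in c]
    for _ in range(iters):
        # w = A^T (A u), where the rows of A are the centered documents
        proj = [dot(row, u) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered))
             for j in range(len(c))]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        u = [x / norm for x in w]
    return c, u

def pddp(vectors, k):
    """Divisive step: repeatedly split one cluster along its principal direction."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        splittable = [idx for idx in clusters if len(idx) > 1]
        if not splittable:
            break
        # Assumption: split the cluster with the largest scatter.
        target = max(splittable,
                     key=lambda idx: scatter([vectors[i] for i in idx]))
        c, u = principal_direction([vectors[i] for i in target])
        side = {i: dot([x - m for x, m in zip(vectors[i], c)], u)
                for i in target}
        left = [i for i in target if side[i] <= 0]
        right = [i for i in target if side[i] > 0]
        if not left or not right:
            break  # degenerate cluster, cannot be split further
        clusters.remove(target)
        clusters += [left, right]
    return clusters

def kmeans(vectors, centroids, max_iter=50):
    """Standard K-means refinement starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda j: sqdist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    return labels

def hybrid_cluster(vectors, k):
    """Assumed hybrid: PDDP clusters seed the K-means centroids."""
    initial = pddp(vectors, k)
    centroids = [mean([vectors[i] for i in idx]) for idx in initial]
    return kmeans(vectors, centroids)
```

Calling `hybrid_cluster(docs, k)` on a list of document term vectors returns one cluster label per document. Seeding K-means from a deterministic divisive split, rather than from random centroids, is one common motivation for this kind of combination; whether it matches the authors' design would need to be checked against the full paper.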