Exploiting parallelism to support scalable hierarchical clustering

Authors:
Rebecca J. Cathey;Eric C. Jensen;Steven M. Beitzel;Ophir Frieder;David Grossman
Affiliations:
Information Retrieval Laboratory, Department of Computer Science, Illinois Institute of Technology, 10 W. 31st Street, Chicago, IL 60616 (all authors)
Venue:
Journal of the American Society for Information Science and Technology
Year:
2007

Abstract

A distributed memory parallel version of the group average hierarchical agglomerative clustering algorithm is proposed to enable scaling the document clustering problem to large collections. Using standard message passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard Text REtrieval Conference (TREC) test collection, our parallel hierarchical clustering algorithm is shown to be scalable in terms of processors efficiently used and the collection size. Results show that our algorithm performs close to the expected O(n^2/p) time on p processors rather than the worst-case O(n^3/p) time. Furthermore, the O(n^2/p) memory complexity per node allows larger collections to be clustered as the number of nodes increases. While partitioning algorithms such as k-means are trivially parallelizable, our results confirm those of other studies which showed that hierarchical algorithms produce significantly tighter clusters in the document clustering task. Finally, we show how our parallel hierarchical agglomerative clustering algorithm can be used as the clustering subroutine for a parallel version of the buckshot algorithm to cluster the complete TREC collection at near theoretical runtime expectations. © 2007 Wiley Periodicals, Inc.
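
To make the row-partitioned idea concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes cosine similarity over document vectors, a UPGMA-style weighted-average update for the group average criterion, and contiguous blocks of similarity-matrix rows per worker, so each of the p workers holds roughly n^2/p entries. The function name `group_average_hac`, the parameters `k` and `p`, and the loop that stands in for a message-passing reduction are all illustrative assumptions.

```python
import numpy as np

def group_average_hac(X, k, p=2):
    """Cluster the rows of X into k groups, simulating p workers.

    Assumption: each worker owns a contiguous block of similarity-matrix
    rows. A real distributed version would keep only that block in memory
    and replace the Python loop over workers with a message-passing
    reduction.
    """
    n = X.shape[0]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                          # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)         # a cluster never merges with itself
    size = np.ones(n)                      # number of documents per cluster
    active = np.ones(n, dtype=bool)        # False once a cluster is absorbed
    members = {i: [i] for i in range(n)}
    row_blocks = np.array_split(np.arange(n), p)   # rows owned by each worker

    while active.sum() > k:
        # Step 1: every worker scans only its own rows for its best pair.
        local_best = []
        for rows in row_blocks:
            best = (-np.inf, -1, -1)
            for i in rows:
                if not active[i]:
                    continue
                row = np.where(active, sim[i], -np.inf)
                j = int(np.argmax(row))
                if row[j] > best[0]:
                    best = (float(row[j]), int(i), j)
            local_best.append(best)

        # Step 2: a max-reduction over the local candidates picks the merge.
        _, i, j = max(local_best)

        # Step 3: size-weighted average of the two rows (UPGMA-style update).
        ni, nj = size[i], size[j]
        merged = (ni * sim[i] + nj * sim[j]) / (ni + nj)
        sim[i, :] = merged
        sim[:, i] = merged
        sim[i, i] = -np.inf
        size[i] = ni + nj
        active[j] = False
        members[i].extend(members.pop(j))

    return sorted(sorted(m) for m in members.values())
```

For example, calling `group_average_hac(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]), k=2)` groups the first two and the last two vectors together. Each merge step in this sketch scans every active row, so it follows the worst-case cubic pattern; approaching the O(n^2/p) behavior described in the abstract would require additional bookkeeping, such as caching each row's best neighbor, which is omitted here.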