A distributed-memory parallel version of the group-average hierarchical agglomerative clustering algorithm is proposed to scale document clustering to large collections. Using standard message-passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard Text REtrieval Conference (TREC) test collection, our parallel hierarchical clustering algorithm is shown to scale with both the number of processors it uses efficiently and the collection size. Results show that our algorithm performs close to the expected O(n^2/p) time on p processors rather than the worst-case O(n^3/p) time. Furthermore, the O(n^2/p) memory complexity per node allows larger collections to be clustered as the number of nodes increases. While partitioning algorithms such as k-means are trivially parallelizable, our results confirm those of other studies which showed that hierarchical algorithms produce significantly tighter clusters in the document clustering task. Finally, we show how our parallel hierarchical agglomerative clustering algorithm can be used as the clustering subroutine for a parallel version of the buckshot algorithm to cluster the complete TREC collection at near theoretical runtime expectations. © 2007 Wiley Periodicals, Inc.
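For orientation, the serial building block that the abstract parallelizes can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes cosine similarity over dense term vectors, takes "group average" to mean the mean pairwise similarity between two clusters (the UPGMA update), and uses hypothetical names (`cosine`, `group_average_hac`). It keeps the full n x n similarity matrix in one process, which is the O(n^2) memory that a distributed version would partition across p nodes, and it rescans the matrix at every merge, so this naive serial form takes cubic time.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_average_hac(vectors, k):
    """Merge the most similar pair of clusters until k clusters remain.

    Returns a sorted list of clusters, each a sorted list of document indices.
    Assumption: cluster similarity is the average pairwise similarity between
    members of the two clusters (UPGMA-style group average).
    """
    n = len(vectors)
    # Full n x n similarity matrix: O(n^2) memory in this serial sketch.
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    members = {i: [i] for i in range(n)}  # active cluster id -> document indices

    while len(members) > k:
        active = sorted(members)
        best, pair = -math.inf, None
        # Naive scan over all active pairs for the highest similarity.
        for pos, i in enumerate(active):
            for j in active[pos + 1:]:
                if sim[i][j] > best:
                    best, pair = sim[i][j], (i, j)
        i, j = pair
        size_i, size_j = len(members[i]), len(members[j])
        # Size-weighted update: similarity of the merged cluster to every other cluster.
        for c in active:
            if c != i and c != j:
                merged = (size_i * sim[i][c] + size_j * sim[j][c]) / (size_i + size_j)
                sim[i][c] = sim[c][i] = merged
        members[i].extend(members.pop(j))  # cluster j is absorbed into cluster i

    return sorted(sorted(m) for m in members.values())


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(group_average_hac(docs, 2))
```

In a buckshot-style use, a routine like this would be run on a sample of the collection to obtain initial cluster centers; the sample size and the subsequent assignment passes are not specified here and would need to follow the paper.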