A hybrid unsupervised approach for document clustering

Authors:
Mihai Surdeanu;Jordi Turmo;Alicia Ageno
Affiliations:
Technical University of Catalonia, Barcelona, Spain;Technical University of Catalonia, Barcelona, Spain;Technical University of Catalonia, Barcelona, Spain
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005

Abstract

We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics to automatically select a subset of the clusters generated by the first algorithm as the initial points of the second one. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than either of its individual components.
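
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say which hierarchical algorithm, selection heuristics, or EM model the authors use, so everything below is an assumption made for illustration: centroid-link agglomeration over cosine similarity, a hypothetical `min_size` rule for picking seed clusters, and a mixture of multinomials refined by EM. The names `agglomerate`, `select_seeds`, `em_refine`, `stop_sim`, and `min_size` are invented here, and the "model dimension" is read as the number of clusters.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs, members):
    return [sum(docs[i][t] for i in members) / len(members)
            for t in range(len(docs[0]))]

def agglomerate(docs, stop_sim):
    """Step 1 (assumed variant): merge the two most similar clusters
    until the best available merge falls below stop_sim."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(centroid(docs, clusters[a]),
                           centroid(docs, clusters[b]))
                if s > best:
                    best, pair = s, (a, b)
        if best < stop_sim:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def select_seeds(clusters, min_size):
    """Step 2 (hypothetical heuristic): keep only clusters with enough
    members. The number of survivors is the estimated cluster count."""
    return [c for c in clusters if len(c) >= min_size]

def em_refine(docs, seeds, iterations=10, alpha=1.0):
    """Step 3: EM for a mixture of multinomials, initialized from seeds."""
    k, vocab = len(seeds), len(docs[0])
    # Seed documents start with a hard assignment, all others uniform.
    resp = [[1.0 / k] * k for _ in docs]
    for j, members in enumerate(seeds):
        for i in members:
            resp[i] = [1.0 if c == j else 0.0 for c in range(k)]
    for _ in range(iterations):
        # M-step: smoothed priors and per-cluster term distributions.
        priors = [(sum(r[j] for r in resp) + alpha) / (len(docs) + alpha * k)
                  for j in range(k)]
        word_probs = []
        for j in range(k):
            counts = [alpha + sum(resp[i][j] * docs[i][t]
                                  for i in range(len(docs)))
                      for t in range(vocab)]
            total = sum(counts)
            word_probs.append([c / total for c in counts])
        # E-step: posterior cluster probabilities, computed in log space.
        for i, doc in enumerate(docs):
            logs = [math.log(priors[j]) +
                    sum(n * math.log(word_probs[j][t])
                        for t, n in enumerate(doc))
                    for j in range(k)]
            m = max(logs)
            weights = [math.exp(v - m) for v in logs]
            z = sum(weights)
            resp[i] = [w / z for w in weights]
    return [max(range(k), key=lambda j: r[j]) for r in resp]

# Toy term-count vectors (made-up data): two topics plus one mixed document.
docs = [[3, 2, 0, 0], [2, 3, 0, 0], [4, 1, 0, 0],
        [0, 0, 3, 2], [0, 0, 2, 4], [0, 1, 3, 3],
        [1, 1, 1, 1]]
seeds = select_seeds(agglomerate(docs, stop_sim=0.8), min_size=2)
labels = em_refine(docs, seeds)
```

In this toy run the mixed document stays a singleton after agglomeration, so it is excluded from the seeds and only receives a cluster during EM refinement. The number of seed clusters doubles as the estimate of how many clusters EM should fit, which is the role the abstract assigns to the initialization step.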