We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization (EM). We developed several heuristics that automatically select a subset of the clusters produced by the hierarchical algorithm as the starting points for EM. The initialization step yields not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension (the number of clusters), thus eliminating another important element of human supervision. We evaluated the proposed system on five real-world document collections. The results show that the hybrid approach produces clustering solutions of higher quality than either of its components used alone.
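
The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which hierarchical algorithm, selection heuristics, or EM model are used, so everything concrete here is an assumption made for illustration: centroid-linkage agglomeration, a hypothetical "keep clusters with at least `min_size` documents" heuristic, a multinomial mixture refined by EM, and a toy term-count matrix. The names `agglomerate`, `select_seeds`, and `em_refine` are likewise hypothetical.

```python
import math

def agglomerate(docs, target):
    """Centroid-linkage agglomerative clustering down to `target` clusters (assumed variant)."""
    dims = len(docs[0])
    clusters = [[i] for i in range(len(docs))]

    def centroid(members):
        return [sum(docs[i][j] for i in members) / len(members) for j in range(dims)]

    while len(clusters) > target:
        cents = [centroid(c) for c in clusters]
        # Find the closest pair of clusters by squared Euclidean distance.
        _, a, b = min(
            (sum((x - y) ** 2 for x, y in zip(cents[a], cents[b])), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def select_seeds(clusters, min_size=2):
    """Hypothetical heuristic: keep only clusters with enough members.
    The number of seeds kept doubles as the estimated number of clusters."""
    return [c for c in clusters if len(c) >= min_size]

def em_refine(docs, seeds, iters=20, alpha=1.0):
    """EM for a multinomial mixture, initialized from the seed clusters."""
    n, vocab, k = len(docs), len(docs[0]), len(seeds)
    # Seed members start with a hard assignment; all other documents start uniform.
    resp = [[1.0 / k] * k for _ in docs]
    for c, members in enumerate(seeds):
        for i in members:
            resp[i] = [1.0 if j == c else 0.0 for j in range(k)]
    for _ in range(iters):
        # M-step: smoothed cluster priors and per-cluster word distributions.
        priors = [(sum(r[c] for r in resp) + alpha) / (n + alpha * k) for c in range(k)]
        word_p = []
        for c in range(k):
            counts = [alpha + sum(resp[i][c] * docs[i][w] for i in range(n))
                      for w in range(vocab)]
            total = sum(counts)
            word_p.append([x / total for x in counts])
        # E-step: posterior cluster probabilities, computed in log space.
        for i, d in enumerate(docs):
            logp = [math.log(priors[c])
                    + sum(cnt * math.log(word_p[c][w]) for w, cnt in enumerate(d))
                    for c in range(k)]
            top = max(logp)
            ex = [math.exp(v - top) for v in logp]
            z = sum(ex)
            resp[i] = [e / z for e in ex]
    return [max(range(k), key=lambda c: r[c]) for r in resp]

# Toy term-count vectors (fabricated for illustration): two topics plus one ambiguous document.
docs = [
    [5, 4, 0, 0], [4, 5, 0, 1], [6, 3, 1, 0],
    [0, 1, 5, 4], [1, 0, 4, 5], [0, 0, 6, 3],
    [2, 2, 2, 2],
]
seeds = select_seeds(agglomerate(docs, target=3))
labels = em_refine(docs, seeds)
print(len(seeds), labels)
```

In this sketch the hierarchical stage deliberately over-segments (`target=3`), the heuristic discards the singleton cluster, and EM then assigns every document, including the discarded one, to one of the surviving seed clusters. The actual heuristics in the paper may differ substantially.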