Semi-supervised model-based document clustering: A comparative study

Authors:
Shi Zhong
Affiliations:
Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton 33431
Venue:
Machine Learning
Year:
2006

Citing 28
Cited 14

Probability, random processes, and estimation theory for engineers

OHSUMED: an interactive retrieval evaluation and new large test collection for research

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Combining labeled and unlabeled data with co-training

COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
WebACE: a Web agent for document categorization and exploration

AGENTS '98 Proceedings of the second international conference on Autonomous agents
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
An experimental comparison of model-based clustering methods

Machine Learning
Concept decompositions for large sparse text data using clustering

Machine Learning
Constrained K-means Clustering with Background Knowledge

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Semi-supervised Clustering by Seeding

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Transductive Inference for Text Classification using Support Vector Machines

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Improving Short-Text Classification using Unlabeled Data for Classification Problems

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Multivariate Information Bottleneck

UAI '01 Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence
Learning from Labeled and Unlabeled Data using Graph Mincuts

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Self-Supervised Learning for Visual Tracking and Recognition of Human Hand

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Enhanced word clustering for hierarchical text classification

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Using unlabeled data to improve text classification

Cluster ensembles --- a knowledge reuse framework for combining multiple partitions

The Journal of Machine Learning Research
Information Theoretic Clustering of Sparse Co-Occurrence Data

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
CBC: Clustering Based Text Classification Requiring Minimal Labeled Data

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Generative model-based clustering of directional data

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
A unified framework for model-based clustering

The Journal of Machine Learning Research
A probabilistic framework for semi-supervised clustering

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Locally linear metric adaptation for semi-supervised clustering

ICML '04 Proceedings of the twenty-first international conference on Machine learning
An information theoretic analysis of maximum likelihood mixture estimation for exponential families

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Generative model-based document clustering: a comparative study

Knowledge and Information Systems
Criterion functions for document clustering

Advances in Neural Information Processing Systems 18: Proceedings of the 2005 Conference (Neural Information Processing)

The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter

IEEE Transactions on Information Theory - Part 2

An active learning framework for semi-supervised document clustering with language modeling

Data & Knowledge Engineering
Harmony K-means algorithm for document clustering

Data Mining and Knowledge Discovery
A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions

ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Finding the optimal feature representations for Bayesian network learning

PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Document clustering via dirichlet process mixture model with feature selection

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
Semi-supervised Bayesian ARTMAP

Applied Intelligence
A novel initialization method for semi-supervised clustering

KSEM'10 Proceedings of the 4th international conference on Knowledge science, engineering and management
Semi-supervised k-means clustering by optimizing initial cluster centers

WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Research of immune intrusion detection algorithm based on semi-supervised clustering

AICI'11 Proceedings of the Third international conference on Artificial intelligence and computational intelligence - Volume Part II
Tri-training and data editing based semi-supervised clustering algorithm

MICAI'06 Proceedings of the 5th Mexican international conference on Artificial Intelligence
Fuzzy semi-supervised co-clustering for text documents

Fuzzy Sets and Systems
Clustering documents with labeled and unlabeled documents using fuzzy semi-Kmeans

Fuzzy Sets and Systems
Absolute and relative clustering

Proceedings of the 4th MultiClust Workshop on Multiple Clusterings, Multi-view Data, and Multi-source Knowledge-driven Clustering
Robust predictive model for evaluating breast cancer survivability

Engineering Applications of Artificial Intelligence


Abstract

Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations where the set of labels available in the labeled data is incomplete, i.e., the unlabeled data contain new classes that are not present in the labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for the multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach performs best when the available labels are complete, whereas the feedback-based approach excels when the available labels are incomplete.
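
To make the difference between the seeded and constrained variants concrete, the following is a minimal sketch, not the paper's implementation. It assumes uniform cluster priors, Laplace smoothing, and an illustrative temperature schedule; the function name `da_multinomial`, its parameters, and the toy data are hypothetical. The feedback-based variant is not covered.

```python
import numpy as np

def da_multinomial(X, k, labels=None, constrained=False,
                   temps=(5.0, 2.0, 1.0, 0.5), n_iter=10, alpha=1.0, seed=0):
    """Cluster rows of a document-term count matrix X into k clusters.

    labels: class index in [0, k) for labeled documents, -1 for unlabeled.
    constrained=False: labels only initialize the models (seeded variant).
    constrained=True: labeled documents stay clamped to their labels.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = np.full(n, -1) if labels is None else np.asarray(labels)
    known = labels >= 0
    one_hot = np.eye(k)[labels[known]]

    # Unlabeled documents start with tiny random weights, so the labeled
    # seeds dominate the initial models while unseeded clusters still differ.
    post = 1e-3 * rng.random((n, k))
    post[known] = one_hot

    for T in temps:  # anneal from higher to lower temperature
        for _ in range(n_iter):
            # M-step: Laplace-smoothed multinomial word distribution per cluster.
            counts = post.T @ X + alpha
            log_theta = np.log(counts / counts.sum(axis=1, keepdims=True))
            # E-step: tempered posteriors, P(y|x) proportional to p(x|y)^(1/T).
            scores = (X @ log_theta.T) / T
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            post = np.exp(scores)
            post /= post.sum(axis=1, keepdims=True)
            if constrained:
                post[known] = one_hot  # labeled documents never move
    return post.argmax(axis=1)

# Toy corpus: two topics over six words, ten documents each (made-up counts).
A = np.array([12, 10, 8, 1, 0, 1])
B = np.array([1, 0, 1, 9, 12, 10])
X = np.vstack([A + (i % 3) for i in range(10)] + [B + (i % 3) for i in range(10)])
truth = np.array([0] * 10 + [1] * 10)

labels = np.full(20, -1)
labels[0], labels[10] = 0, 1  # one labeled document per class

seeded = da_multinomial(X, k=2, labels=labels, constrained=False)
constrained = da_multinomial(X, k=2, labels=labels, constrained=True)
```

The only difference between the two variants in this sketch is the clamping step after each E-step: the seeded variant lets labeled documents be reassigned once the models have been initialized, while the constrained variant keeps them fixed throughout. At low temperature the posteriors become nearly hard, so the procedure approaches a k-means-style assignment.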