We propose a text-categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) latent semantic spaces, used to estimate the similarity among documents and words, and (ii) a Gaussian mixture algorithm, which differentiates relevant from nonrelevant category information using statistics from unlabeled examples. The second technique maps the similarity scores to class posterior probabilities, and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks and obtained good performance using only the category names as initial seeds. In particular, the performance of the proposed method proved to be equivalent to that of a purely supervised approach trained on 70 to 160 labeled documents per category.
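To make the second technique more concrete, the following is a minimal sketch of how raw similarity scores for one category might be mapped to posterior probabilities with a two-component one-dimensional Gaussian mixture. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the number of components, the fitting procedure, or the initialization, so the EM routine, the function names (`fit_two_gaussians`, `relevance_posterior`), the iteration count, and the example scores are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(scores, iterations=100, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture to similarity scores with EM.

    Assumption: one component models nonrelevant documents (low scores) and
    the other models relevant documents (high scores).
    Returns (weights, means, variances), ordered so index 1 has the higher mean.
    """
    n = len(scores)
    ordered = sorted(scores)
    half = n // 2
    # Initialization (an assumption): means of the lower and upper halves.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    overall_mean = sum(scores) / n
    overall_var = max(sum((s - overall_mean) ** 2 for s in scores) / n, min_var)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            p = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means, and variances.
        for k in (0, 1):
            mass = sum(r[k] for r in resp)
            if mass <= 0:
                continue
            weights[k] = mass / n
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / mass
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / mass
            variances[k] = max(var, min_var)  # floor avoids a collapsed component

    if means[0] > means[1]:  # keep the "relevant" component at index 1
        weights.reverse()
        means.reverse()
        variances.reverse()
    return weights, means, variances


def relevance_posterior(score, weights, means, variances):
    """Posterior probability that a score was generated by the relevant component."""
    p = [weights[k] * gaussian_pdf(score, means[k], variances[k]) for k in (0, 1)]
    total = p[0] + p[1]
    if total == 0:
        # Both densities underflowed; fall back to the nearer component mean.
        return 1.0 if abs(score - means[1]) < abs(score - means[0]) else 0.0
    return p[1] / total


# Hypothetical similarity scores between unlabeled documents and one category's seed words.
example_scores = [0.05, 0.08, 0.10, 0.12, 0.07, 0.09, 0.11, 0.06, 0.80, 0.85, 0.90, 0.75]
weights, means, variances = fit_two_gaussians(example_scores)
posteriors = [relevance_posterior(s, weights, means, variances) for s in example_scores]
```

Because the mixture is fitted separately to the scores of each category, a fixed raw score can yield different posteriors for different categories, which is one way to read the claim that the mapping reduces sensitivity to keyword-dependent variations in scores.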