Improving text categorization bootstrapping via unsupervised learning

Authors:
Alfio Gliozzo; Carlo Strapparava; Ido Dagan
Affiliations:
STLab-ISTC-CNR, Rome; FBK-IRST, Povo; Bar Ilan University
Venue:
ACM Transactions on Speech and Language Processing (TSLP)
Year:
2009

Abstract

We propose a text categorization bootstrapping algorithm in which each category is described by relevant seed words. The method introduces two unsupervised techniques that improve the initial categorization step of the bootstrapping scheme: (i) latent semantic spaces, used to estimate the similarity among documents and words, and (ii) the Gaussian mixture algorithm, which separates relevant from nonrelevant category information using statistics gathered from unlabeled examples. The second technique maps similarity scores to class posterior probabilities and thereby reduces sensitivity to keyword-dependent variations in the scores. We evaluated the algorithm on two text categorization tasks, where it performed well using only the category names as initial seeds: its performance was equivalent to that of a purely supervised approach trained on 70 to 160 labeled documents per category.
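
The two unsupervised steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the latent semantic space is built or how the mixture is parameterized, so the truncated SVD, the two-component one-dimensional mixture fitted by EM, the function names `lsa_seed_similarity` and `mixture_posterior`, and the toy term-document matrix are all assumptions made for illustration.

```python
import numpy as np


def lsa_seed_similarity(term_doc, seed_rows, k):
    """Cosine similarity between every document and a seed-word
    pseudo-document, measured in a rank-k latent semantic space."""
    # Truncated SVD: the first k left singular vectors span the latent space.
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    Uk = U[:, :k]
    docs = (Uk.T @ term_doc).T            # (n_docs, k) projected documents
    seed_vec = np.zeros(term_doc.shape[0])
    seed_vec[seed_rows] = 1.0             # bag of seed words for one category
    seed = Uk.T @ seed_vec                # (k,) projected seed description
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(seed)
    return (docs @ seed) / (norms + 1e-12)


def mixture_posterior(scores, n_iter=50, min_sigma=1e-3):
    """Fit a two-component 1-D Gaussian mixture to the scores with EM and
    return P(high-mean component | score) for each score."""
    x = np.asarray(scores, dtype=float)[:, None]
    mu = np.array([x.min(), x.max()])     # initial low / high component means
    sigma = np.full(2, x.std() + min_sigma)
    pi = np.array([0.5, 0.5])

    def responsibilities():
        # Log-space densities avoid underflow when a component gets narrow.
        # The constant -0.5*log(2*pi) is omitted because it cancels below.
        log_p = np.log(pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        return p / p.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        r = responsibilities()                       # E-step
        nk = r.sum(axis=0) + 1e-12                   # M-step
        pi = nk / nk.sum()
        mu = (r * x).sum(axis=0) / nk
        sigma = np.sqrt((r * (x - mu) ** 2).sum(axis=0) / nk) + min_sigma
    return responsibilities()[:, np.argmax(mu)]


# Hypothetical toy counts: rows are terms, columns are documents.
terms = ["sport", "ball", "team", "bank", "money", "stock"]
X = np.array([
    [2, 0, 1, 0, 0, 0],
    [0, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 3, 1],
    [0, 0, 0, 1, 1, 3],
], dtype=float)

# The category is seeded with its name only, as in the abstract's setting.
scores = lsa_seed_similarity(X, seed_rows=[terms.index("sport")], k=2)
posteriors = mixture_posterior(scores)
```

In this toy matrix the second document never contains the seed word "sport", yet it shares "ball" and "team" with documents that do, so the latent space still places it close to the seed. The mixture step then turns the raw similarity scores into posteriors that are comparable across categories, which is the role the abstract attributes to it. The resulting posteriors would serve as the initial categorization for the subsequent bootstrapping stage, which is not sketched here.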