Semi-supervised single-label text categorization using centroid-based classifiers

Authors:
Ana Cardoso-Cachopo;Arlindo L. Oliveira
Affiliations:
IST --- DEI, Lisboa --- Portugal;INESC-ID / IST --- DEI, Lisboa --- Portugal
Venue:
Proceedings of the 2007 ACM symposium on Applied computing
Year:
2007

Citing 13
Cited 7

Automatic text processing: the transformation, analysis, and retrieval of information by computer

Automatic text processing: the transformation, analysis, and retrieval of information by computer
Improving text retrieval for the routing problem using latent semantic indexing

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
A comparison of classifiers and document representations for the routing problem

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Context-sensitive learning methods for text categorization

ACM Transactions on Information Systems (TOIS)
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Employing EM and Pool-Based Active Learning for Text Classification

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Effect of term distributions on centroid-based text categorization

Information Sciences—Informatics and Computer Science: An International Journal - Special issue: Informatics and computer science intelligent systems applications
Integrating constraints and metric learning in semi-supervised clustering

ICML '04 Proceedings of the twenty-first international conference on Machine learning
An analysis of the relative hardness of Reuters-21578 subsets: Research Articles

Journal of the American Society for Information Science and Technology
Model-based overlapping clustering

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

A quickly trainable hybrid SOM-based document organization system

Neurocomputing
A triangle area based nearest neighbors approach to intrusion detection

Pattern Recognition
Semi-supervised Text Classification Using RBF Networks

IDA '09 Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII
Document clustering using unsupervised learning method: topology-preserving map

Proceedings of the International Conference and Workshop on Emerging Trends in Technology
Using information from the target language to improve crosslingual text classification

IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
Enhancing text classification by information embedded in the test set

CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
A document is known by the company it keeps: neighborhood consensus for short text categorization

Language Resources and Evaluation

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper we study the effect of using unlabeled data in conjunction with a small portion of labeled data on the accuracy of a centroid-based classifier used to perform single-label text categorization. We chose to use centroid-based methods because they are very fast when compared with other classification methods, but still present an accuracy close to that of the state-of-the-art methods. Efficiency is particularly important for very large domains, like regular news feeds, or the web. We propose the combination of Expectation-Maximization with a centroid-based method to incorporate information about the unlabeled data during the training phase. We also propose an alternative to EM, based on the incremental update of a centroid-based method with the unlabeled documents during the training phase. We show that these approaches can greatly improve accuracy relatively to a simple centroid-based method, in particular when there are very small amounts of labeled data available (as few as one single document per class). Using one synthetic and three real-world datasets, we show that, if the initial model of the data is sufficiently precise, using unlabeled data improves performance. On the other hand, using unlabeled data degrades performance if the initial model is not precise enough.