Using the Web as corpus for self-training text categorization

Authors:
Rafael Guzmán-Cabrera;Manuel Montes-Y-Gómez;Paolo Rosso;Luis Villaseñor-Pineda
Affiliations:
Facultad de Ingeniería Mecánica, Electrica y Electrónica, Universidad de Guanajuato, Guanajuato, Mexico and Natural Language Engineering Lab., Polytechnic University of Valencia, Va ...;Laboratorio de Tecnologías del Lenguaje, Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Mexico;Natural Language Engineering Lab., Polytechnic University of Valencia, Valencia, Spain;Laboratorio de Tecnologías del Lenguaje, Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Mexico
Venue:
Information Retrieval
Year:
2009

Abstract

Most current methods for automatic text categorization rely on supervised learning and therefore require a large number of training instances to build an accurate classifier. To address this problem, this paper proposes a new semi-supervised method for text categorization that automatically extracts unlabeled examples from the Web and applies an enriched self-training approach to construct the classifier. Although language independent, the method is most relevant in scenarios where large sets of labeled resources do not exist, as may be the case for many application domains in non-English languages such as Spanish. The method was evaluated experimentally on three different tasks and in two different languages. The results demonstrate the applicability and usefulness of the proposed method.
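
The abstract does not describe how the Web extraction or the "enriched" self-training work, so the following is only a rough illustration of plain self-training, not the authors' method. It assumes the unlabeled documents have already been collected (for example, from Web search results) and uses a small Naive Bayes classifier as a stand-in. The names `train_nb`, `predict_proba`, `self_train`, `threshold`, and `max_rounds` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model from whitespace-tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, text):
    """Return a dict mapping each class to its posterior probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.75, max_rounds=5):
    """Generic self-training: repeatedly add confidently predicted documents.

    Assumption: a document is accepted when its top posterior reaches
    `threshold`; the paper's actual selection criterion is not given here.
    """
    docs, labs = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    model = train_nb(docs, labs)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for text in pool:
            probs = predict_proba(model, text)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                docs.append(text)
                labs.append(best)
                added += 1
            else:
                remaining.append(text)
        if added == 0:  # nothing confident enough; stop early
            break
        pool = remaining
        model = train_nb(docs, labs)  # retrain on the enlarged training set
    return model, docs, labs
```

A call such as `self_train(["goal match team", "election vote party"], ["sports", "politics"], web_snippets)` would return the final model together with the enlarged training set, where `web_snippets` is an assumed list of downloaded text fragments. The confidence threshold trades coverage against the risk of reinforcing early labeling mistakes, which is the usual weak point of self-training.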