Text classification using web corpora and EM algorithms

Authors:
Chen-Ming Hung; Lee-Feng Chien
Affiliations:
Institute of Information Science, Academia Sinica, Taipei, Taiwan; Institute of Information Science, Academia Sinica, Taipei, Taiwan
Venue:
AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
Year:
2004

Abstract

Insufficient and irrelevant training corpora are a persistent obstacle in text classification. This paper proposes a Web-based approach that trains a text classifier without requiring labeled training data. Under the assumption that each class of concern is associated with several relevant concept classes, the approach first applies a greedy EM algorithm to find a proper number of concept clusters for each class by clustering the documents retrieved when the class name itself is sent to Web search engines. It then retrieves more training data using keywords generated from the clusters and sets the initial parameters of the text classifier. Finally, it refines the initial classifier with an augmented EM algorithm. Experimental results show the potential of the proposed approach for creating text classifiers without labeled training data.
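
The refinement step described in the abstract can be illustrated with a minimal sketch of plain EM over a multinomial naive Bayes model. This is not the authors' augmented EM algorithm, whose details the abstract does not give; it is a generic textbook variant shown only to make the idea concrete. The function names (`train_nb`, `posterior`, `em_refine`), the toy documents, and the initial soft labels (standing in for keyword-based initialization) are all hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, resp, n_classes, vocab, alpha=1.0):
    """M-step: estimate class priors and word probabilities from soft labels."""
    priors, word_probs = [], []
    for c in range(n_classes):
        weight = sum(r[c] for r in resp)
        priors.append((weight + alpha) / (len(docs) + alpha * n_classes))
        counts = Counter()
        for doc, r in zip(docs, resp):
            for w in doc:
                counts[w] += r[c]  # fractional count weighted by P(c | doc)
        total = sum(counts.values()) + alpha * len(vocab)
        word_probs.append({w: (counts[w] + alpha) / total for w in vocab})
    return priors, word_probs


def posterior(doc, priors, word_probs):
    """E-step for one document: P(class | doc) under naive Bayes."""
    logs = [
        math.log(p) + sum(math.log(wp[w]) for w in doc if w in wp)
        for p, wp in zip(priors, word_probs)
    ]
    m = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logs]
    z = sum(exps)
    return [e / z for e in exps]


def em_refine(docs, init_resp, n_classes, n_iter=10):
    """Alternate M- and E-steps, starting from initial soft labels."""
    vocab = sorted({w for d in docs for w in d})
    resp = init_resp
    for _ in range(n_iter):
        priors, word_probs = train_nb(docs, resp, n_classes, vocab)
        resp = [posterior(d, priors, word_probs) for d in docs]
    return priors, word_probs, resp


# Hypothetical toy corpus: class 0 ~ sports, class 1 ~ finance.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "trade"],
    ["market", "bank", "stock"],
    ["match", "goal", "coach"],
    ["bank", "trade", "market"],
]
# Only the first and third documents carry a (hypothetical) keyword-based
# initial guess; the rest start out uniform.
init_resp = [
    [0.9, 0.1], [0.5, 0.5], [0.1, 0.9],
    [0.5, 0.5], [0.5, 0.5], [0.5, 0.5],
]

priors, word_probs, resp = em_refine(docs, init_resp, n_classes=2)
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
print(labels)
```

In this sketch the weak initial guesses for two documents are enough for EM to propagate class information to the documents that started out uniform, which is the general intuition behind bootstrapping a classifier from retrieved rather than hand-labeled data.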