Web-based text classification in the absence of manually labeled training documents

Authors:
Chen-Ming Hung;Lee-Feng Chien
Affiliations:
Institute of Information Science, Academia Sinica, 128 Sec. 2, Academia Road, Nankang, Taipei 115, Taiwan;Institute of Information Science, Academia Sinica, 128 Sec. 2, Academia Road, Nankang, Taipei 115, Taiwan
Venue:
Journal of the American Society for Information Science and Technology
Year:
2007

Abstract

Most text classification techniques assume that manually labeled documents (corpora) are readily available for training text classifiers. In practice, however, labeled training documents are sometimes unavailable, or inadequate even when they exist. This article presents a self-learned approach that extracts high-quality training documents from the Web when the required manually labeled documents are unavailable or of poor quality. Learning a text classifier automatically requires only a set of user-defined categories and some highly related keywords. Extensive experiments evaluate the proposed approach on the test set of the Reuters-21578 news data set. The experiments show that very promising results can be achieved using only documents automatically extracted from the Web. © 2007 Wiley Periodicals, Inc.
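
The abstract does not spell out the extraction or learning procedure, so the following is only a minimal sketch of the general idea it describes: start from user-defined categories and related keywords, collect documents retrieved for those keywords, treat them as labeled training data, and train a classifier on them. Everything concrete here is an assumption made for illustration: the multinomial naive Bayes classifier with add-one smoothing is a stand-in rather than the authors' method, and `TOY_WEB`, `toy_search`, `collect_training_docs`, and the category names are hypothetical. A real system would replace `toy_search` with an actual Web search step and would need to filter noisy results.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def collect_training_docs(category_keywords, search):
    """Label each document returned for a category's keywords with that category.

    `search` is any callable mapping a keyword string to a list of document
    strings (hypothetical; here it stands in for a Web search step).
    """
    labeled = []
    for category, keywords in category_keywords.items():
        for keyword in keywords:
            for doc in search(keyword):
                labeled.append((category, doc))
    return labeled


def train_naive_bayes(labeled_docs):
    """Count documents and words per category (assumed stand-in classifier)."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, doc in labeled_docs:
        tokens = tokenize(doc)
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Return the category with the highest log-posterior under naive Bayes."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    best, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for token in tokens:
            # Add-one (Laplace) smoothing over the shared vocabulary.
            score += math.log(
                (word_counts[category][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical in-memory stand-in for Web search results.
TOY_WEB = {
    "crude oil": [
        "Crude oil prices rose as OPEC cut production",
        "Oil barrels shipped from the refinery",
    ],
    "wheat": [
        "Wheat harvest yields grain exports",
        "Farmers planted wheat and grain this season",
    ],
}


def toy_search(keyword):
    """Return the toy documents stored for a keyword (empty list if none)."""
    return TOY_WEB.get(keyword, [])


category_keywords = {"energy": ["crude oil"], "grain": ["wheat"]}
model = train_naive_bayes(collect_training_docs(category_keywords, toy_search))
print(classify("OPEC discussed oil production", model))
```

No manually labeled document appears anywhere in the sketch: the only human input is the `category_keywords` mapping, which mirrors the abstract's claim that categories and keywords are sufficient to bootstrap a classifier.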