Web-based text classification in the absence of manually labeled training documents

  • Authors:
  • Chen-Ming Hung;Lee-Feng Chien

  • Affiliations:
  • Institute of Information Science, Academia Sinica, 128 Sec. 2, Academia Road, Nankang, Taipei 115, Taiwan;Institute of Information Science, Academia Sinica, 128 Sec. 2, Academia Road, Nankang, Taipei 115, Taiwan

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2007

Abstract

Most text classification techniques assume that manually labeled documents (corpora) can be easily obtained when learning text classifiers. In practice, however, labeled training documents are sometimes unavailable, or are inadequate even when they are available. This article presents a self-learned approach that extracts high-quality training documents from the Web when the required manually labeled documents are unavailable or of poor quality. To learn a text classifier automatically, the approach needs only a set of user-defined categories and some highly related keywords. Extensive experiments evaluate the performance of the proposed approach on the test set from the Reuters-21578 news data set. The experiments show that very promising results can be achieved using only documents automatically extracted from the Web. © 2007 Wiley Periodicals, Inc.
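
The general idea in the abstract (categories plus a few keywords in, a trained classifier out) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the extraction or learning method, so the sketch swaps in a plain multinomial naive Bayes classifier, and the Web is replaced by a hypothetical canned lookup (`CANNED_RESULTS`, `fake_web_search`). The category names, keywords, and snippets are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a Web search engine: maps a keyword query to
# text snippets. A real system would query a search service here.
CANNED_RESULTS = {
    "wheat": ["wheat harvest exports rose", "farmers planted more wheat grain"],
    "grain": ["grain shipments and wheat prices fell"],
    "crude oil": ["crude oil prices climbed per barrel", "opec cut crude oil output"],
    "barrel": ["oil barrel price hits record"],
}

def fake_web_search(query):
    """Return canned snippets for a query (assumption: no real Web access)."""
    return CANNED_RESULTS.get(query, [])

def collect_training_docs(categories, search=fake_web_search):
    """Label each retrieved snippet with the category whose keyword found it."""
    docs = []
    for label, keywords in categories.items():
        for keyword in keywords:
            for text in search(keyword):
                docs.append((text.split(), label))
    return docs

def train_naive_bayes(docs):
    """Count words per label; naive Bayes is an assumed choice of classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(tokens, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # skip words never seen in training
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# User-defined categories with a few related keywords (invented examples).
categories = {"grain": ["wheat", "grain"], "crude": ["crude oil", "barrel"]}
model = train_naive_bayes(collect_training_docs(categories))
print(classify("wheat exports".split(), model))
print(classify("oil price per barrel".split(), model))
```

No manually labeled document appears anywhere in the sketch: every training label comes from the keyword that retrieved the snippet. A realistic system would also need to filter noisy or off-topic search results, which this toy version does not attempt.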