The use of bigrams to enhance text categorization

  • Authors:
  • Chade-Meng Tan; Yuan-Fang Wang; Chan-Do Lee

  • Affiliations:
  • Google Inc., 2400 Bayshore Pkwy, Mountain View, CA; Department of Computer Science, University of California, Santa Barbara, CA; Department of Information and Communication Engineering, Taejon University, Taejon 300-716, South Korea

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2002

Abstract

In this paper, we present an efficient text categorization algorithm that generates bigrams selectively, seeking out those with an especially good chance of being useful. The algorithm uses the information gain metric combined with various frequency thresholds. The selected bigrams, along with unigrams, are then given as features to two different classifiers: Naïve Bayes and maximum entropy. The experimental results suggest that the bigrams can substantially raise the quality of feature sets, showing increases in the break-even points and F1 measures. The McNemar test shows that in most categories the increases are statistically significant. Upon close examination, we concluded that the algorithm is most successful at correctly classifying more positive documents, but may cause more negative documents to be classified incorrectly.
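
The abstract describes selecting bigrams by combining an information gain criterion with frequency thresholds before adding them to the unigram feature set. Below is a minimal sketch of that general idea in Python; the function names, threshold values, and toy data are illustrative assumptions and do not reproduce the paper's actual procedure or parameters.

```python
import math
from collections import Counter

def entropy(pos, neg):
    """Binary entropy of a positive/negative document split."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(docs_with, docs_without):
    """Reduction in class entropy obtained by splitting on feature presence."""
    pos_w, neg_w = docs_with
    pos_wo, neg_wo = docs_without
    total = pos_w + neg_w + pos_wo + neg_wo
    base = entropy(pos_w + pos_wo, neg_w + neg_wo)
    with_frac = (pos_w + neg_w) / total
    return base - (with_frac * entropy(pos_w, neg_w)
                   + (1 - with_frac) * entropy(pos_wo, neg_wo))

def select_bigrams(docs, labels, min_doc_freq=3, min_gain=0.01):
    """Keep only bigrams that pass a frequency threshold and an IG threshold.

    docs: list of token lists; labels: 1 (positive class) or 0 (negative).
    min_doc_freq and min_gain are placeholder thresholds, not the paper's values.
    """
    n_docs = len(docs)
    df = Counter()       # document frequency of each bigram
    df_pos = Counter()   # document frequency within the positive class
    for tokens, label in zip(docs, labels):
        bigrams = {(a, b) for a, b in zip(tokens, tokens[1:])}
        for bg in bigrams:
            df[bg] += 1
            if label == 1:
                df_pos[bg] += 1
    n_pos = sum(labels)
    n_neg = n_docs - n_pos
    selected = []
    for bg, freq in df.items():
        if freq < min_doc_freq:              # frequency threshold
            continue
        pos_w = df_pos[bg]
        neg_w = freq - pos_w
        gain = information_gain((pos_w, neg_w), (n_pos - pos_w, n_neg - neg_w))
        if gain >= min_gain:                 # information-gain threshold
            selected.append(bg)
    return selected

# Toy usage on two tiny classes of documents.
docs = [
    "machine learning improves text categorization".split(),
    "text categorization with machine learning".split(),
    "the weather is sunny today".split(),
    "sunny weather is expected today".split(),
]
labels = [1, 1, 0, 0]
print(select_bigrams(docs, labels, min_doc_freq=2, min_gain=0.1))
```

In this sketch the surviving bigrams would simply be appended to the unigram vocabulary before training a classifier such as Naïve Bayes or maximum entropy, which is the role the paper assigns to them.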