Exploiting term relationship to boost text classification

  • Authors:
  • Dou Shen;Jianmin Wu;Bin Cao;Jian-Tao Sun;Qiang Yang;Zheng Chen;Ying Li

  • Affiliations:
  • Microsoft, Redmond, WA, USA;Microsoft, Beijing, China;The Hong Kong University of Science and Technology, Hong Kong, China;Microsoft Research Asia, Beijing, China;The Hong Kong University of Science and Technology, Hong Kong, China;Microsoft Research Asia, Beijing, China;Microsoft, Redmond, WA, USA

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Abstract

Document classification provides an effective way to handle the explosive growth of online textual data. However, in practical classification settings, we face the so-called feature sparsity problem, caused by a lack of training documents or by the shortness of the text to be classified. In this paper, we address the sparsity problem by exploiting term relationships together with Naive Bayes classifiers, and we propose two methods for estimating these relationships. The first method estimates term relationships from the co-occurrence information of two terms in a certain context. The second method estimates term relationships from the distribution of terms over different hierarchical categories in a publicly available document taxonomy. The estimated term relationships are then used to augment Naive Bayes classifiers. We test our methods on two open-domain data sets to demonstrate their advantages. The experimental results show that our methods can significantly improve classification performance, especially when we do not have enough training data or when the texts are Web search queries.
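
The abstract does not give the exact estimators, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that a term relationship is the document-level conditional co-occurrence frequency, and that augmentation means adding fractional counts for related terms before Laplace smoothing. The function names (`cooccurrence_relations`, `train_augmented_nb`, `classify`) and the parameters `weight` and `alpha` are hypothetical, chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_relations(docs):
    """Estimate rel[t][u] as the fraction of documents containing t that also contain u.

    Assumption: the "context" is a whole document; the paper may use another window.
    """
    pair_counts = defaultdict(Counter)
    doc_freq = Counter()
    for doc in docs:
        terms = set(doc)
        for t in terms:
            doc_freq[t] += 1
            for u in terms:
                if u != t:
                    pair_counts[t][u] += 1
    return {t: {u: n / doc_freq[t] for u, n in nbrs.items()}
            for t, nbrs in pair_counts.items()}


def train_augmented_nb(labeled_docs, relations, weight=0.5, alpha=1.0):
    """Train multinomial Naive Bayes where each observed term also adds
    weight * rel[t][u] pseudo-counts to every related term u (illustrative scheme)."""
    counts = defaultdict(Counter)
    class_docs = Counter()
    for terms, label in labeled_docs:
        class_docs[label] += 1
        for t in terms:
            counts[label][t] += 1.0
            for u, r in relations.get(t, {}).items():
                counts[label][u] += weight * r
    vocab = {t for c in counts.values() for t in c}
    n_docs = sum(class_docs.values())
    model = {}
    for label, n in class_docs.items():
        # Laplace smoothing over the shared vocabulary.
        total = sum(counts[label].values()) + alpha * len(vocab)
        log_lik = {t: math.log((counts[label][t] + alpha) / total) for t in vocab}
        model[label] = (math.log(n / n_docs), log_lik)
    return model


def classify(model, terms):
    """Return the label with the highest log posterior; unknown terms are skipped."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[t] for t in terms if t in log_lik)
    return max(model, key=score)


# Toy data, invented for illustration only.
background = [["laptop", "notebook", "computer"], ["notebook", "computer"],
              ["soccer", "football", "goal"], ["football", "goal"]]
training = [(["laptop"], "tech"), (["soccer"], "sports")]

relations = cooccurrence_relations(background)
model = train_augmented_nb(training, relations)
print(classify(model, ["notebook"]))
```

In this toy run the query term `notebook` never appears in the labeled data, yet it is assigned to `tech` because it co-occurs with `laptop` in the unlabeled background documents. This is the kind of sparsity relief the abstract describes for short texts such as search queries.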