Chinese text categorization based on the binary weighting model with non-binary smoothing

  • Authors:
  • Xue Dejun; Sun Maosong

  • Affiliations:
  • State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China (both authors)

  • Venue:
  • ECIR'03 Proceedings of the 25th European conference on IR research
  • Year:
  • 2003

Abstract

In Text Categorization (TC) based on the vector space model, feature weighting is vital to categorization effectiveness, and various non-binary weighting schemes are widely used for this purpose. To emphasize the category discrimination capability of features, the paper first proposes a new weighting scheme, TF*IDF*IG. Because more refined statistics are more likely to run into the sparse data problem, we then re-evaluate the role of the Binary Weighting Model (BWM) in TC. As a result, a novel approach named the Binary Weighting Model with Non-Binary Smoothing (BWM-NBS) is proposed to overcome the drawback of BWM. A TC system for Chinese texts that uses words as features is implemented. Experiments on a large-scale Chinese document collection of 71,674 texts show that the F1 score of BWM-NBS reaches 94.9% in the best case, which is 26.4% higher than that of TF*IDF, 19.1% higher than that of TF*IDF*IG, and 5.8% higher than that of BWM under the same conditions. Moreover, BWM-NBS exhibits strong stability in categorization performance.
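
The weighting schemes the abstract compares can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the common textbook definitions (raw term frequency, IDF as log(N/df), and information gain computed from the presence or absence of a term across categories). The function names and the toy corpus are hypothetical, and English tokens stand in for segmented Chinese words. The smoothing step of BWM-NBS is not described in the abstract and is deliberately left out.

```python
import math
from collections import Counter

def binary_weights(doc_tokens, vocab):
    """Binary weighting: 1.0 if the term occurs in the document, else 0.0."""
    present = set(doc_tokens)
    return [1.0 if t in present else 0.0 for t in vocab]

def idf(docs, vocab):
    """Assumed IDF definition: log(N / df), with 0.0 for unseen terms."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / df[t]) if df[t] else 0.0 for t in vocab}

def class_entropy(labels):
    """Entropy (in bits) of the category distribution in a list of labels."""
    m = len(labels)
    if m == 0:
        return 0.0
    return sum(-(c / m) * math.log2(c / m) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Assumed IG definition: drop in category entropy after splitting
    the collection on whether the term is present in each document."""
    n = len(docs)
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    return (class_entropy(labels)
            - len(with_t) / n * class_entropy(with_t)
            - len(without_t) / n * class_entropy(without_t))

def tfidf_ig_weights(doc_tokens, vocab, idf_map, ig_map):
    """TF*IDF*IG: term frequency scaled by IDF and by information gain."""
    tf = Counter(doc_tokens)
    return [tf[t] * idf_map[t] * ig_map[t] for t in vocab]

# Hypothetical toy corpus; tokens stand in for segmented Chinese words.
docs = [
    ["stock", "market", "stock"],
    ["stock", "fund"],
    ["match", "team"],
    ["match", "goal", "market"],
]
labels = ["finance", "finance", "sports", "sports"]
vocab = sorted({t for d in docs for t in d})

idf_map = idf(docs, vocab)
ig_map = {t: information_gain(docs, labels, t) for t in vocab}

print(vocab)
print(binary_weights(docs[0], vocab))
print(tfidf_ig_weights(docs[0], vocab, idf_map, ig_map))
```

In this toy corpus "stock" separates the two categories perfectly and keeps a high TF*IDF*IG weight, while "market" occurs equally in both categories and is weighted down to zero by the IG factor. The binary vector discards frequency information entirely, which is the property the abstract says BWM-NBS sets out to compensate for.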