Combining bi-gram of character and word to classify two-class Chinese texts in two steps

  • Authors:
  • Xinghua Fan; Difei Wan; Guoying Wang

  • Affiliations:
  • College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, P.R. China;College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, P.R. China;College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, P.R. China

  • Venue:
  • RSCTC'06 Proceedings of the 5th international conference on Rough Sets and Current Trends in Computing
  • Year:
  • 2006

Abstract

This paper presents a two-step method that combines two types of features for two-class Chinese text categorization. In the first step, character bi-grams are used as candidate features and a Naive Bayesian classifier classifies the texts; the fuzzy area between the two categories is then fixed directly from the outputs of the classifier. In the second step, word bi-grams restricted to words whose part of speech is verb or noun are used as candidate features, and a Naive Bayesian classifier of the same kind as in the first step handles the documents that fall into the fuzzy area, that is, the documents whose first-step classification is considered unreliable. Our experiment validated the soundness of the proposed method, which achieved precision, recall and F1 of 97.65%, 97.00% and 97.31%, respectively, on a test set of 12,600 Chinese texts.
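
The two-step flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract gives no details of the classifier or of how the fuzzy area is fixed, so several things here are assumptions: the class `TwoClassNB` (a multinomial Naive Bayes with Laplace smoothing), the helper names, the use of a symmetric log-odds threshold `band` to define the fuzzy area, and the toy training data. Word segmentation and part-of-speech filtering are assumed to have been done already, so `tokens` holds only verbs and nouns.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Overlapping two-character substrings of the raw text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def word_bigrams(tokens):
    """Adjacent token pairs; tokens are assumed pre-filtered to verbs/nouns."""
    return [tokens[i] + "_" + tokens[i + 1] for i in range(len(tokens) - 1)]


class TwoClassNB:
    """Hypothetical multinomial Naive Bayes for labels 0 and 1 (Laplace smoothing)."""

    def fit(self, feature_lists, labels):
        self.counts = [Counter(), Counter()]
        self.docs = [0, 0]
        for feats, y in zip(feature_lists, labels):
            self.counts[y].update(feats)
            self.docs[y] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.totals = [sum(c.values()) for c in self.counts]
        return self

    def log_odds(self, feats):
        """log P(class 1 | doc) minus log P(class 0 | doc); positive favours class 1."""
        score = math.log(self.docs[1]) - math.log(self.docs[0])
        v = len(self.vocab)
        for f in feats:
            if f not in self.vocab:
                continue  # features unseen in training carry no evidence
            p1 = (self.counts[1][f] + 1) / (self.totals[1] + v)
            p0 = (self.counts[0][f] + 1) / (self.totals[0] + v)
            score += math.log(p1) - math.log(p0)
        return score


def classify_two_step(text, tokens, step1, step2, band):
    """Return (label, step_used). `band` is an assumed fuzzy-area half-width."""
    s1 = step1.log_odds(char_bigrams(text))
    if abs(s1) > band:
        # Outside the fuzzy area: the first-step decision is trusted.
        return (1 if s1 > 0 else 0), "step1"
    # Inside the fuzzy area: re-classify with word bi-gram features.
    s2 = step2.log_odds(word_bigrams(tokens))
    return (1 if s2 > 0 else 0), "step2"  # a tie (s2 == 0) falls to class 0


# Toy data with placeholder symbols, for illustration only.
step1 = TwoClassNB().fit([char_bigrams("bbbb"), char_bigrams("aaaa")], [0, 1])
step2 = TwoClassNB().fit(
    [word_bigrams(["p", "q", "p", "q"]), word_bigrams(["x", "y", "x", "y"])],
    [0, 1],
)

print(classify_two_step("aaa", ["x", "y"], step1, step2, band=1.0))
print(classify_two_step("ab", ["p", "q"], step1, step2, band=1.0))
```

In this toy run the first text is decided in the first step, because its character bi-grams give a log-odds well outside the band. The second text has no known character bi-grams, so its log-odds is zero, it falls into the fuzzy area, and the word bi-gram classifier decides it. In practice the width of the fuzzy area would be chosen from the distribution of first-step outputs on held-out data.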