Accurate Chinese Text Classification via Multiple Strategies

  • Authors:
  • Xiulan Hao; Chenghong Zhang; Xiaopeng Tao; Shuyun Wang; Yunfa Hu

  • Affiliations:
  • Fudan University; Fudan University; Fudan University; Fudan University; Fudan University

  • Venue:
  • FSKD '07 Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 03
  • Year:
  • 2007

Abstract

Text classification is one means of understanding text content. It is widely used in information retrieval, spam filtering, monitoring harmful rumors, and blocking pornographic and malicious messages. kNN is widely used in text categorization, but it suffers when the training data set is biased. While developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we found that when the training data set is biased, almost all test documents of some rare (smaller) categories are classified into common (larger) ones by the traditional kNN classifier. In this case, classification performance cannot satisfy the user's requirements. To alleviate this problem, we adopt two measures to boost the kNN classifier. First, we optimize the feature set by removing some candidate features. Second, we modify the traditional decision rules by integrating the number of training samples of each category into them. Extensive experiments show that the adapted kNN achieves a significant improvement in classification performance on biased corpora.
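
The abstract does not give the modified decision rule, so the following is only a minimal sketch of one plausible reading of it: each category's summed neighbour similarity is divided by that category's training-set size, so that large categories do not win simply by contributing more neighbours. The function names (`cosine`, `knn_scores`, `classify`), the `size_adjusted` flag, the division-by-count weighting, and the toy vectors are all assumptions made for illustration, not the authors' method or data.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_scores(query, train, k, size_adjusted):
    """Score categories using the k training documents most similar to query.

    train is a list of (vector, label) pairs. If size_adjusted is True, each
    category's summed similarity is divided by the number of training
    documents in that category (an assumed weighting, not the paper's formula).
    """
    class_size = Counter(label for _, label in train)
    neighbours = sorted(
        ((cosine(query, vec), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, label in neighbours:
        scores[label] += sim
    if size_adjusted:
        for label in scores:
            scores[label] /= class_size[label]
    return dict(scores)


def classify(query, train, k=5, size_adjusted=True):
    """Return the category with the highest score."""
    scores = knn_scores(query, train, k, size_adjusted)
    return max(scores, key=scores.get)


# Toy biased corpus (hypothetical): 6 "common" documents, 2 "rare" documents.
train = [({"a": 1.0, "b": 0.3}, "common") for _ in range(6)]
train += [({"a": 0.9, "b": 1.0}, "rare") for _ in range(2)]
query = {"a": 1.0, "b": 1.0}  # closest to the two "rare" documents

plain = classify(query, train, k=5, size_adjusted=False)
adjusted = classify(query, train, k=5, size_adjusted=True)
```

On this toy corpus the plain similarity-sum rule picks the larger category because three of the five neighbours come from it, while the size-adjusted rule picks the smaller one. Dividing by the raw count is a blunt correction and may over-favour very small categories; a softer function of category size could be substituted under the same structure.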