A logistic regression-based smoothing method for Chinese text categorization

  • Authors:
  • Show-Jane Yen;Yue-Shi Lee;Jia-Ching Ying;Yu-Chieh Wu

  • Affiliations:
  • Department of Computer Science and Information Engineering, Ming Chuan University 5, De-Ming Rd, Gweishan District, Taoyuan 333, Taiwan;Department of Computer Science and Information Engineering, Ming Chuan University 5, De-Ming Rd, Gweishan District, Taoyuan 333, Taiwan;Department of Computer Science and Information Engineering, National Cheng-Kung University 1, University Road, Tainan City 701, Taiwan;Department of Electronic Commerce, Kai-Nan University 1, Kainan Road, Luzhu Shiang, Taoyuan 33857, Taiwan

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Abstract

Automatic Chinese text classification is an important and well-known technology in the field of machine learning. The first step in solving a Chinese text categorization problem is to tokenize the Chinese words from a sequence of non-segmented sentences. Previous studies often employ a Chinese word tokenizer that was trained on a different source and then apply conventional text classification approaches. However, these tokenizers are not perfect and often provide incorrect word boundary information. In this paper, we propose an N-gram-based language model that takes word relations into account and performs Chinese text categorization without a Chinese word tokenizer. To handle the out-of-vocabulary problem, we also propose a novel smoothing approach based on logistic regression to improve accuracy. Experimental results show that our approach outperforms traditional methods by at least 11% in micro-averaged F-measure.
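
The abstract gives no model details, so the following is only a minimal sketch of the general idea of tokenizer-free categorization with an N-gram language model: one character-bigram model per class, with the class whose model assigns the highest probability winning. The class name `CharBigramClassifier`, the toy documents, and the labels are all hypothetical. The sketch uses plain add-k smoothing as a stand-in for unseen N-grams; it is not the logistic-regression-based smoothing the paper proposes.

```python
import math
from collections import Counter, defaultdict


class CharBigramClassifier:
    """Per-class character-bigram language model (illustrative sketch).

    Uses add-k smoothing as a placeholder; the paper's logistic-regression
    smoothing is not described in the abstract and is not reproduced here.
    """

    def __init__(self, k=1.0):
        self.k = k
        self.bigrams = defaultdict(Counter)   # label -> counts of (prev, cur)
        self.contexts = defaultdict(Counter)  # label -> counts of prev
        self.docs = Counter()                 # label -> number of training docs
        self.vocab = set()                    # all characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            # Work directly on characters, so no word tokenizer is needed.
            chars = ["<s>"] + list(text)
            self.docs[label] += 1
            self.vocab.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[label][(prev, cur)] += 1
                self.contexts[label][prev] += 1
        return self

    def log_score(self, text, label):
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        chars = ["<s>"] + list(text)
        # Class prior plus the sum of smoothed bigram log-probabilities.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for prev, cur in zip(chars, chars[1:]):
            num = self.bigrams[label][(prev, cur)] + self.k
            den = self.contexts[label][prev] + self.k * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.docs, key=lambda label: self.log_score(text, label))


# Hypothetical toy corpus, for illustration only.
train_texts = ["股市上涨", "股票下跌", "球队获胜", "球员进球"]
train_labels = ["finance", "finance", "sports", "sports"]
clf = CharBigramClassifier().fit(train_texts, train_labels)
print(clf.predict("股市下跌"))
print(clf.predict("球队进球"))
```

Because the model reads raw characters, an incorrect word boundary cannot propagate into the classifier. The cost is that unseen bigrams are common, which is why the choice of smoothing matters.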