A logistic regression-based smoothing method for Chinese text categorization

Authors:
Show-Jane Yen;Yue-Shi Lee;Jia-Ching Ying;Yu-Chieh Wu
Affiliations:
Department of Computer Science and Information Engineering, Ming Chuan University 5, De-Ming Rd, Gweishan District, Taoyuan 333, Taiwan;Department of Computer Science and Information Engineering, Ming Chuan University 5, De-Ming Rd, Gweishan District, Taoyuan 333, Taiwan;Department of Computer Science and Information Engineering, National Cheng-Kung University 1, University Road, Tainan City 701, Taiwan;Department of Electronic Commerce, Kai-Nan University 1, Kainan Road, Luzhu Shiang, Taoyuan 33857, Taiwan
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

Automatic Chinese text classification is an important and well-known technology in the field of machine learning. The first step in solving Chinese text categorization problems is to tokenize Chinese words from a sequence of non-segmented sentences. Previous studies often employ a Chinese word tokenizer trained on different sources and then apply conventional text classification approaches. However, these tokenizers are not perfect and often provide incorrect word boundary information. In this paper, we propose an N-gram-based language model that takes word relations into account for Chinese text categorization without a Chinese word tokenizer. To handle the out-of-vocabulary problem, we also propose a novel smoothing approach based on logistic regression to improve accuracy. The experimental results show that our approach outperforms traditional methods by at least 11% in micro-average F-measure.
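
The general idea of classifying unsegmented Chinese text with a class-conditional N-gram language model can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify how the logistic regression-based smoothing is computed, so the sketch substitutes plain additive (add-k) smoothing as a stand-in for handling unseen N-grams. The class name `NgramClassifier`, the helper `char_ngrams`, the parameters `n` and `k`, and the toy documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    padded = "^" * (n - 1) + text  # "^" marks the start of the document
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramClassifier:
    """Class-conditional character n-gram language model (illustrative only).

    Assumption: add-k smoothing stands in for the paper's logistic
    regression-based smoothing, which the abstract does not describe.
    """

    def __init__(self, n=2, k=1.0):
        self.n, self.k = n, k
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> context counts
        self.doc_counts = Counter()                 # class -> number of documents
        self.vocab = set()                          # characters seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[label][gram] + self.k
            den = self.context_counts[label][gram[:-1]] + self.k * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda c: self.log_score(text, c))


# Toy, made-up training data for demonstration purposes.
docs = ["股市上漲", "股票下跌", "球隊獲勝", "球員受傷"]
labels = ["finance", "finance", "sports", "sports"]
clf = NgramClassifier(n=2, k=1.0).fit(docs, labels)
print(clf.predict("股市下跌"))
```

Because the model works on character N-grams, errors from an external word tokenizer never enter the pipeline; the smoothing term is the only place where unseen N-grams are handled, which is the component the paper replaces with its logistic regression-based approach.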