Text classification in Asian languages without word segmentation

Authors:
Fuchun Peng;Xiangji Huang;Dale Schuurmans;Shaojun Wang
Affiliations:
University of Waterloo, Ontario, Canada;University of Waterloo, Ontario, Canada;University of Waterloo, Ontario, Canada;University of Waterloo, Ontario, Canada
Venue:
AsianIR '03 Proceedings of the sixth international workshop on Information retrieval with Asian languages - Volume 11
Year:
2003

Abstract

We present a simple approach to Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. By using character n-gram models, our approach avoids word segmentation. However, unlike traditional ad hoc n-gram models, the statistical language modeling approach has a strong information-theoretic basis and avoids an explicit feature selection procedure, which can lose a significant amount of useful information. We systematically study the key factors in language modeling and their influence on classification. Experiments on Chinese TREC and Japanese NTCIR topic detection show that this simple approach achieves better performance than traditional approaches while avoiding word segmentation, which demonstrates its superiority in Asian language text classification.
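
To make the idea concrete, here is a minimal sketch of classification with per-class character n-gram language models: one model is trained per class on raw characters, and a text is assigned to the class whose model gives it the highest log-probability. This is an illustration under stated assumptions, not the authors' implementation. The class name `CharNgramClassifier`, the bigram order, the additive (add-alpha) smoothing, and the toy training sentences are all hypothetical choices made for this example; the abstract does not say which n-gram order or smoothing method the paper uses.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Hypothetical sketch: one character n-gram language model per class.

    A text is assigned to the class whose model gives it the highest
    log-probability (plus the log class prior). No word segmentation is
    needed because the modeling unit is a single character.
    """

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram order (assumed; not taken from the paper)
        self.alpha = alpha  # additive smoothing constant (assumed)
        self.ngrams = defaultdict(Counter)    # class -> n-gram counts
        self.contexts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.docs = Counter()                 # class -> number of training texts
        self.vocab = set()                    # all characters seen in training

    def _pad(self, text):
        # Start/end markers so the first and last characters have context.
        return "\x02" * (self.n - 1) + text + "\x03"

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.docs[label] += 1
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[label][context + padded[i]] += 1
                self.contexts[label][context] += 1
        return self

    def log_score(self, text, label):
        padded = self._pad(text)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngrams[label][context + padded[i]] + self.alpha
            den = self.contexts[label][context] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.docs, key=lambda label: self.log_score(text, label))


# Toy data invented for illustration only.
train_texts = ["股市今日上涨", "银行利率下调", "球队赢得比赛", "球员进球得分"]
train_labels = ["finance", "finance", "sports", "sports"]
clf = CharNgramClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("球队进球"), clf.predict("银行股市"))
```

Because every character n-gram contributes to the score, nothing is discarded by a separate feature selection step; the smoothing constant alone decides how much probability unseen n-grams receive.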