A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization

Authors:
Jingyang Li; Maosong Sun; Xian Zhang
Affiliations:
Tsinghua University, Beijing, China; Tsinghua University, Beijing, China; Tsinghua University, Beijing, China
Venue:
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Year:
2006



Abstract

Words and character-bigrams are both used as features in Chinese text processing tasks, but no systematic comparison or analysis of their value as features for Chinese text categorization has been reported so far. We carry out a full performance comparison between them through experiments on various document collections (including a manually word-segmented corpus as a gold standard), together with a semi-quantitative analysis to elucidate the characteristics of their behavior. We also try to provide some preliminary clues for feature term choice (in most cases, character-bigrams are better than words) and dimensionality setting in text categorization systems.
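
To make the two feature types concrete, the following is a minimal sketch of how character-bigram and word features might be counted for one short Chinese sentence. It is an illustration only and not the authors' pipeline: the function names `char_bigrams` and `bag_of_features`, the example sentence, and its hand-made word segmentation are all assumptions introduced here.

```python
from collections import Counter
from typing import List


def char_bigrams(text: str) -> List[str]:
    """Return overlapping character bigrams of `text`.

    Assumption: whitespace is dropped before pairing; punctuation is kept.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(pair) for pair in zip(chars, chars[1:])]


def bag_of_features(terms: List[str]) -> Counter:
    """Count term occurrences, giving a simple term-frequency vector."""
    return Counter(terms)


# Hypothetical example sentence: "Tsinghua University is located in Beijing".
sentence = "清华大学位于北京"

# Word features need a segmentation step; this one is written by hand.
words = ["清华大学", "位于", "北京"]

# Bigram features need no segmenter, only a sliding window over characters.
bigrams = char_bigrams(sentence)

word_features = bag_of_features(words)
bigram_features = bag_of_features(bigrams)
```

An 8-character sentence yields 7 bigrams, against 3 words under this segmentation. Some bigrams, such as 学位, straddle a word boundary and do not correspond to any word in the sentence, which is one reason the two feature sets behave differently.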