Words and character bigrams are both used as features in Chinese text processing tasks, but no systematic comparison or analysis of their value as features for Chinese text categorization has been reported so far. We present a full performance comparison between the two, based on experiments on several document collections (including a manually word-segmented corpus as a gold standard), together with a semi-quantitative analysis that elucidates the characteristics of their behavior. From these results we offer preliminary guidance on the choice of feature terms (in most cases, character bigrams are better than words) and on dimensionality settings in text categorization systems.
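
To make the notion of a character-bigram feature concrete, the following is a minimal sketch in Python of how overlapping character bigrams might be extracted from a Chinese string without any word segmentation. The function names `char_bigrams` and `bigram_counts` and the sample string are illustrative assumptions and are not taken from the paper. The sketch only skips whitespace; a real system would probably also handle punctuation and mixed-script text.

```python
from collections import Counter


def char_bigrams(text):
    """Return the overlapping character bigrams of text, ignoring whitespace.

    Hypothetical helper for illustration; no word segmenter is required.
    """
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def bigram_counts(text):
    """Count how often each character bigram occurs in text."""
    return Counter(char_bigrams(text))


if __name__ == "__main__":
    sample = "中文文本分类"  # illustrative input: "Chinese text categorization"
    print(char_bigrams(sample))
    # prints ['中文', '文文', '文本', '本分', '分类']
```

Word features, by contrast, depend on a segmentation step, which is why the comparison includes a manually segmented corpus as a reference point.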