A comparative study on representing units in Chinese text clustering

Authors:
Wang Hongjun;Yu Shiwen;Lv Xueqiang;Shi Shuicai;Xiao Shibin
Affiliations:
Institute Of Computing Linguistics Peking University, Beijing;Institute Of Computing Linguistics Peking University, Beijing;Chinese Information Processing Center Beijing Information Technology Institute, Beijing;Chinese Information Processing Center Beijing Information Technology Institute, Beijing;Chinese Information Processing Center Beijing Information Technology Institute, Beijing
Venue:
KSEM'06 Proceedings of the First international conference on Knowledge Science, Engineering and Management
Year:
2006

Abstract

Words and n-grams are commonly used representing units for Chinese text and have proved to be good features for Chinese Text Categorization and Information Retrieval. However, their effectiveness for Chinese Text Clustering remains unexplored. This paper is a comparative study of representing units in Chinese Text Clustering. Using the K-means algorithm, we evaluated several representing units, including Chinese character n-gram features, word features, and their combinations. We found Chinese word features, Chinese character unigram features, and character bigram features to be the most effective in our experiments. Combining features did not improve the results. Detailed experimental results on several public Chinese Text Categorization datasets are provided in the paper.
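The kind of pipeline the abstract describes (character n-gram features fed to K-means) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the L2 normalisation, the cosine-similarity variant of K-means, and the index-based choice of initial centroids are all assumptions made for illustration; the abstract does not specify feature weighting, distance measure, or initialisation. The four toy documents are invented examples, not data from the paper.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def featurize(text, orders=(1, 2)):
    """Count character n-grams of the given orders and L2-normalise the counts."""
    counts = Counter()
    for n in orders:
        counts.update(char_ngrams(text, n))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {feat: v / norm for feat, v in counts.items()}


def dot(a, b):
    """Sparse dot product; equals cosine similarity for unit-length vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(f, 0.0) for f, v in a.items())


def centroid(vectors):
    """Sum sparse vectors and rescale the result to unit length."""
    total = Counter()
    for vec in vectors:
        for f, v in vec.items():
            total[f] += v
    norm = math.sqrt(sum(v * v for v in total.values())) or 1.0
    return {f: v / norm for f, v in total.items()}


def kmeans(vectors, init_indices, iterations=20):
    """Cosine-similarity K-means; init_indices (hypothetical parameter)
    selects which documents serve as the initial centroids."""
    centroids = [vectors[i] for i in init_indices]
    k = len(centroids)
    labels = None
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        new_labels = [max(range(k), key=lambda j: dot(vec, centroids[j]))
                      for vec in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Recompute each centroid from its current members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels


# Toy documents (invented): two about football, two about the stock market.
docs = ["足球比赛进球", "足球比赛球员", "股票市场上涨", "股票市场下跌"]
vectors = [featurize(d, orders=(1, 2)) for d in docs]
labels = kmeans(vectors, init_indices=(0, 2))
print(labels)  # [0, 0, 1, 1]
```

Changing `orders` to `(1,)` or `(2,)` switches between unigram-only and bigram-only features, which is the sort of comparison the study reports. Word features would additionally require a Chinese word segmenter, which this sketch leaves out.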