Text categorization using distributional clustering and concept extraction

Authors:
Yifan He;Minghu Jiang
Affiliations:
Lab of Computational Linguistics, School of Humanities and Social Sciences, Tsinghua University, Beijing, China;Lab of Computational Linguistics, School of Humanities and Social Sciences, Tsinghua University, Beijing, China
Venue:
ICIC'07 Proceedings of the intelligent computing 3rd international conference on Advanced intelligent computing theories and applications
Year:
2007

Abstract

Text categorization (TC) has become one of the most researched fields in NLP. In this paper, we approach TC through a two-step feature selection method. First, we cluster the words that appear in the texts according to their distribution over categories. Then we extract concepts, which are DEF terms in HowNet, from these word clusters rather than from single words. This method retains the generalization ability of concept-extraction-based TC while also making use of occurrences of new words that are not found in the concept thesaurus. We evaluate our feature selection method on the Sogou corpus for TC with an SVM classifier. Experimental results show that our method improves TC performance in all categories.
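
The two-step feature selection described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the distance measure, or how a concept is chosen for a cluster. The greedy threshold clustering, the Jensen-Shannon divergence, the majority vote, all function names, the toy documents, and the `concept_dict` mapping (a hypothetical stand-in for HowNet DEF terms) are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def category_distributions(docs):
    """Estimate P(category | word) for each word from (tokens, label) pairs."""
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for tok in tokens:
            counts[tok][label] += 1
    labels = sorted({label for _, label in docs})
    dists = {}
    for word, c in counts.items():
        total = sum(c.values())
        dists[word] = [c[lab] / total for lab in labels]
    return dists


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_words(dists, threshold=0.1):
    """Step 1 (assumed variant): greedy clustering of words whose category
    distributions lie within `threshold` of a cluster's seed word."""
    clusters = []  # list of (seed_distribution, member_words)
    for word in sorted(dists):
        for seed, members in clusters:
            if js_divergence(dists[word], seed) <= threshold:
                members.append(word)
                break
        else:
            clusters.append((dists[word], [word]))
    return [members for _, members in clusters]


def cluster_concepts(clusters, concept_dict):
    """Step 2 (assumed variant): give every word in a cluster the concept
    that is most frequent among the cluster's words found in the thesaurus."""
    word_to_concept = {}
    for i, members in enumerate(clusters):
        votes = Counter(concept_dict[w] for w in members if w in concept_dict)
        concept = votes.most_common(1)[0][0] if votes else f"cluster_{i}"
        for w in members:
            word_to_concept[w] = concept
    return word_to_concept


# Toy data, invented for this example.
docs = [
    (["goal", "striker", "match"], "sports"),
    (["match", "goal", "league"], "sports"),
    (["stock", "bank", "market"], "finance"),
    (["market", "fintech", "bank"], "finance"),
]
# Hypothetical thesaurus; "fintech", "striker", "league", "market" are "new words".
concept_dict = {"goal": "sport", "match": "sport",
                "bank": "finance", "stock": "finance"}

clusters = cluster_words(category_distributions(docs))
word_to_concept = cluster_concepts(clusters, concept_dict)
```

In this toy run, words missing from the thesaurus (such as `fintech` or `striker`) still receive a concept because they share a cluster with known words, which is the effect the abstract attributes to extracting concepts from clusters instead of single words. The resulting concept labels would then serve as features for a classifier such as an SVM.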