Short documents clustering in very large text databases

Authors:
Yongheng Wang;Yan Jia;Shuqiang Yang
Affiliations:
Computer School, National University of Defense Technology, Changsha, China;Computer School, National University of Defense Technology, Changsha, China;Computer School, National University of Defense Technology, Changsha, China
Venue:
WISE'06 Proceedings of the 7th international conference on Web Information Systems
Year:
2006

Abstract

With the rapid development of the Internet and communication technology, huge volumes of data have accumulated. Short texts such as paper abstracts and emails are common in such data. Clustering these short documents is useful for revealing the structure of the data or for supporting other data mining applications. However, almost all current clustering algorithms become very inefficient or even unusable when handling very large (hundreds of GB) and high-dimensional text data. It is also difficult to achieve acceptable clustering accuracy because keywords appear only a few times in short documents. In this paper, we propose a frequent-term-based parallel clustering algorithm that can be used to cluster short documents in very large text databases. A novel semantic classification method is also used to improve clustering accuracy. Our experimental study shows that our algorithm is more accurate and efficient than other clustering algorithms when clustering large-scale short documents. Furthermore, our algorithm has good scalability and can be used to process very large data sets.
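The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general frequent-term idea, not the authors' method. It assumes single-word terms, whitespace tokenization, and a greedy selection rule; it is sequential and omits the semantic classification step entirely. The function name `frequent_term_clusters` and the parameter `min_support` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def frequent_term_clusters(docs, min_support=2):
    """Group documents by frequent terms (simplified, single-term illustration)."""
    # Map each term to the set of document indices that contain it.
    cover = defaultdict(set)
    for i, text in enumerate(docs):
        for term in set(text.lower().split()):
            cover[term].add(i)

    # Keep only terms whose document frequency reaches min_support.
    candidates = {t: ids for t, ids in cover.items() if len(ids) >= min_support}

    clusters = {}
    unassigned = set(range(len(docs)))
    # Greedily pick the term that covers the most still-unassigned documents.
    # Ties are broken alphabetically because max() returns the first maximum.
    while unassigned and candidates:
        term = max(sorted(candidates), key=lambda t: len(candidates[t] & unassigned))
        members = candidates.pop(term) & unassigned
        if not members:
            break  # no remaining frequent term covers an unassigned document
        clusters[term] = sorted(members)
        unassigned -= members
    return clusters, sorted(unassigned)


if __name__ == "__main__":
    sample = [
        "parallel text clustering",
        "text clustering of email",
        "email spam filter",
        "graph database",
    ]
    print(frequent_term_clusters(sample, min_support=2))
```

Each cluster is labeled by the frequent term that defines it, and documents containing no frequent term are returned separately. Because the term-to-document map can be built independently for each partition of the data and then merged, this step is a natural candidate for parallel execution, although how the paper actually distributes the work is not stated here.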