With the rapid development of the internet and communication technology, huge volumes of data are being accumulated. Short texts such as paper abstracts and emails are common in such data. Clustering these short documents is useful for revealing the structure of the data and for building other data mining applications. However, almost all current clustering algorithms become very inefficient or even unusable when handling very large (hundreds of GB), high-dimensional text data. It is also difficult to achieve acceptable clustering accuracy because key words appear only a few times in short documents. In this paper, we propose a frequent-term-based parallel clustering algorithm that can be used to cluster short documents in very large text databases. A novel semantic classification method is also used to improve clustering accuracy. Our experimental study shows that our algorithm is more accurate and more efficient than other clustering algorithms when clustering large-scale short documents. Furthermore, our algorithm scales well and can be used to process very large datasets.
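To make the idea of frequent-term-based clustering concrete, the following is a minimal sketch in Python. It is not the algorithm proposed in the paper, whose details the abstract does not give. The function names (`count_partition`, `frequent_terms`, `cluster`), the whitespace tokenization, the `min_support` threshold, and the rule of assigning each document to its most widely shared frequent term are all assumptions made for illustration. The semantic classification step is not modeled.

```python
from collections import Counter, defaultdict

def count_partition(docs):
    """Document frequency of each term within one partition of the corpus."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per document
    return df

def frequent_terms(partitions, min_support):
    """Merge per-partition counts; keep terms found in at least min_support documents."""
    total = Counter()
    for part in partitions:
        # Each partition is counted independently, so in a parallel setting
        # these calls could run on separate workers before the merge.
        total += count_partition(part)
    return {t: n for t, n in total.items() if n >= min_support}

def cluster(docs, freq):
    """Group document indices under the most widely shared frequent term each contains."""
    clusters = defaultdict(list)
    for i, doc in enumerate(docs):
        hits = [t for t in set(doc.lower().split()) if t in freq]
        # Highest document frequency wins; ties break alphabetically.
        # Documents with no frequent term fall under the label None.
        label = min(hits, key=lambda t: (-freq[t], t)) if hits else None
        clusters[label].append(i)
    return dict(clusters)

# Hypothetical toy corpus of short documents, split into two partitions.
docs = [
    "parallel clustering of text",
    "text clustering with frequent terms",
    "email spam filtering",
    "spam email detection",
]
freq = frequent_terms([docs[:2], docs[2:]], min_support=2)
result = cluster(docs, freq)
print(result)
```

In this sketch the only step that touches the whole corpus is the counting pass, and its per-partition results merge by simple addition, which is one reason frequent-term approaches lend themselves to parallel execution.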