An efficient clustering approach for large document collections

Authors:
Bo Han;Lishan Kang;Huazhu Song
Affiliations:
School of Computer Science, Wuhan University, Wuhan, Hubei, P.R.China;School of Computer Science, Wuhan University, Wuhan, Hubei, P.R.China;School of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei, P.R.China
Venue:
ADMA'05 Proceedings of the First international conference on Advanced Data Mining and Applications
Year:
2005

Abstract

Vast amounts of unstructured text, such as scientific publications, commercial reports, and web pages, need to be categorized quickly into semantic groups to facilitate online information queries. However, state-of-the-art clustering methods struggle with large document collections and high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections that works in three stages: 1) a permutation test identifies informative topic words, reducing the feature dimension; 2) a small number of the most typical documents are selected for an initial clustering; 3) the clustering is refined on all documents. The algorithm was tested on the 20 Newsgroups data set. Compared with methods that cluster the corpus directly on all documents and the full feature set, this approach significantly reduced the time cost, by an order of magnitude, while slightly improving clustering quality.
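
The three stages can be illustrated with a small Python sketch. The abstract does not state the test statistic, the criterion for a "typical" document, or the base clustering method, so every such choice here is an assumption made only for illustration: the statistic is the sum of squared per-document counts of a word, typical documents are those richest in the selected words, and clustering is plain k-means. The names `permutation_p_value`, `three_stage_cluster`, `alpha`, and `n_typical` are hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def permutation_p_value(docs, word, n_perm=200, seed=0):
    """Permutation p-value for "word is concentrated in a few documents".

    Assumed statistic: sum of squared per-document counts of `word`.
    Null model: tokens are shuffled across documents, lengths kept fixed.
    """
    rng = random.Random(seed)
    lengths = [len(d) for d in docs]
    flags = [1 if t == word else 0 for d in docs for t in d]

    def statistic(seq):
        total, start = 0, 0
        for n in lengths:
            total += sum(seq[start:start + n]) ** 2
            start += n
        return total

    observed = statistic(flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)
        hits += statistic(flags) >= observed
    return (hits + 1) / (n_perm + 1)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors]


def kmeans(vectors, centroids, n_iter=10):
    """Plain Lloyd iterations from the given starting centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(n_iter):
        labels = assign(vectors, centroids)
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def farthest_first(vectors, k):
    """Pick k spread-out starting points (deterministic, starts at index 0)."""
    chosen = [vectors[0]]
    while len(chosen) < k:
        chosen.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in chosen)))
    return chosen


def three_stage_cluster(docs, k, alpha=0.05, n_typical=4, seed=0):
    # Stage 1 (assumed form): keep words whose p-value is at most alpha.
    all_words = sorted({t for d in docs for t in d})
    vocab = [w for w in all_words
             if permutation_p_value(docs, w, seed=seed) <= alpha]
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[w] for w in vocab])
    # Stage 2 (assumed form): cluster the documents richest in selected words.
    order = sorted(range(len(docs)), key=lambda i: -sum(vectors[i]))
    sample = [vectors[i] for i in order[:n_typical]]
    centroids = kmeans(sample, farthest_first(sample, k))
    # Stage 3: refine the centroids on all documents, then label every document.
    centroids = kmeans(vectors, centroids, n_iter=3)
    return assign(vectors, centroids), vocab
```

Calling `three_stage_cluster(docs, k=2)` on a list of tokenized documents returns one cluster label per document together with the words that survived stage 1. The saving suggested by the abstract comes from running the costly initial clustering only on the `n_typical` documents and only over the reduced vocabulary; the pass over the full collection starts from centroids that are already reasonable.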