A clustering scheme for large high-dimensional document datasets

Authors:
Jung-Yi Jiang;Jing-Wen Chen;Shie-Jue Lee
Affiliations:
Dept. of Electrical Engineering, National Sun Yat-Sen University, Taiwan;Dept. of Electrical Engineering, National Sun Yat-Sen University, Taiwan;Dept. of Electrical Engineering, National Sun Yat-Sen University, Taiwan
Venue:
ISICA'07 Proceedings of the 2nd international conference on Advances in computation and intelligence
Year:
2007

Abstract

Scalability and high dimensionality are two common problems in document clustering. We present a novel scheme to deal with both. Given a set of documents, we partition the set into several parts. We take one part and cluster its documents into groups. Using the obtained groups, we reduce the number of features by a certain ratio. We then add another part, cluster the documents into groups based on the reduced features, and further reduce the number of remaining features. This process is repeated until all parts have been used. Experimental results show that the proposed scheme is effective for clustering large high-dimensional document datasets.
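
The iterative procedure described in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation. The abstract does not name the clustering algorithm, the criterion used to rank features, or whether each round clusters only the new part or all parts added so far. The sketch assumes plain k-means, scores each feature by the variance of its per-cluster means, and clusters the accumulated documents. All function and parameter names (`kmeans`, `feature_scores`, `incremental_cluster`, `keep_ratio`) are hypothetical.

```python
import random


def kmeans(rows, k, iters=10, seed=0):
    """Plain k-means on dense vectors (assumed clusterer); returns one label per row."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        for i, r in enumerate(rows):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])),
            )
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def feature_scores(rows, labels, k):
    """Assumed criterion: variance of each feature's mean across non-empty clusters."""
    means = []
    for c in range(k):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        if members:
            means.append([sum(col) / len(members) for col in zip(*members)])
    scores = []
    for j in range(len(rows[0])):
        vals = [m[j] for m in means]
        mu = sum(vals) / len(vals)
        scores.append(sum((v - mu) ** 2 for v in vals) / len(vals))
    return scores


def incremental_cluster(docs, n_parts, k, keep_ratio):
    """Add one part at a time, cluster on the current features, then shrink the feature set."""
    parts = [docs[i::n_parts] for i in range(n_parts)]  # simple interleaved partition
    features = list(range(len(docs[0])))                # indices of surviving features
    used, labels = [], []
    for part in parts:
        used.extend(part)
        rows = [[d[f] for f in features] for d in used]
        labels = kmeans(rows, k)
        scores = feature_scores(rows, labels, k)
        n_keep = max(1, int(len(features) * keep_ratio))
        top = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)[:n_keep]
        features = [features[j] for j in sorted(top)]
    return used, labels, features
```

Here `docs` is a list of equal-length term-weight vectors and `keep_ratio` is the fraction of features retained after each round, so the feature count shrinks geometrically as parts are added. Each part must contain at least `k` documents for the initial centroid sampling to succeed.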