Document Representation and Dimension Reduction for Text Clustering

Authors:
Mahdi Shafiei;Singer Wang;Roger Zhang;Evangelos Milios;Bin Tang;Jane Tougas;Ray Spiteri
Affiliations:
Faculty of Computer Science, Dalhousie University, Halifax, Canada (all authors)
Venue:
ICDEW '07 Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop
Year:
2007

Citing 0
Cited 1

Shape pattern matching: A tool to cluster unstructured text documents

Journal of Computational Methods in Sciences and Engineering - Special Supplement Issue in Section A and B: Selected Papers from the ISCA International Conference on Software Engineering and Data Engineering, 2009

Abstract

Increasingly large text datasets and the high dimensionality associated with natural language pose a major challenge in text mining. This research presents a systematic study of three document representation methods combined with three dimension reduction techniques (DRTs) in the context of the text clustering problem, evaluated on several standard benchmark datasets. The three document representation methods are all based on the vector space model: word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on document frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For the word and N-gram representations, ICA generally gives better results than LSI. Experiments also show that the word representation gives better clustering results than the term and N-gram representations. Finally, for the N-gram representation, a profile length (before dimensionality reduction) of 2000 is shown to be sufficient to capture the information and, in most cases, a 4-gram representation performs better than a 3-gram representation.
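
To make two of the ingredients named in the abstract concrete, the following is a minimal Python sketch of a character N-gram representation with a fixed profile length, followed by DF-based feature selection. It is not the authors' implementation. Several details are assumptions made for illustration: the function names are hypothetical, the profile is taken to be the most frequent N-grams across the whole collection, raw counts are used as vector entries, and DF selection is taken to keep the features that occur in the most documents. ICA, LSI, and k-means are omitted here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(docs, n, profile_length):
    """Build the vocabulary: the most frequent n-grams in the collection.

    Assumption: 'profile length' means the number of n-grams kept,
    ranked by total frequency over all documents.
    """
    totals = Counter()
    for doc in docs:
        totals.update(char_ngrams(doc, n))
    # Sort by descending count, then alphabetically, for a stable order.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:profile_length]]


def vectorize(docs, vocab, n):
    """Map each document to a raw-count vector over the vocabulary."""
    index = {gram: j for j, gram in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for gram, count in Counter(char_ngrams(doc, n)).items():
            j = index.get(gram)
            if j is not None:
                row[j] = count
        rows.append(row)
    return rows


def select_by_document_frequency(rows, vocab, keep):
    """Keep the `keep` features that appear in the most documents.

    Assumption: features with the highest document frequency are retained.
    """
    df = [sum(1 for row in rows if row[j] > 0) for j in range(len(vocab))]
    top = sorted(range(len(vocab)), key=lambda j: (-df[j], vocab[j]))[:keep]
    top.sort()  # preserve the original column order
    reduced = [[row[j] for j in top] for row in rows]
    return reduced, [vocab[j] for j in top]


# Toy collection (hypothetical data, not one of the benchmark datasets).
docs = ["text clustering", "text mining", "data mining"]
vocab = ngram_profile(docs, n=4, profile_length=2000)
X = vectorize(docs, vocab, n=4)
X_df, kept = select_by_document_frequency(X, vocab, keep=5)
```

In this sketch the reduced matrix `X_df` would be what is passed to a clustering algorithm such as k-means. An LSI or ICA variant would replace `select_by_document_frequency` with a projection of `X` onto a small number of derived components instead of a subset of the original columns.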