A Novel Document Clustering Model Based on Latent Semantic Analysis

Authors:
Wei Song;Soon Cheol Park
Venue:
SKG '07 Proceedings of the Third International Conference on Semantics, Knowledge and Grid
Year:
2007

Citing 0
Cited 5

An application of latent semantic analysis to word sense discrimination for words with related and unrelated meanings
EdAppsNLP '09 Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications

Text document clustering based on neighbors
Data & Knowledge Engineering

Usage patterns and latent semantic analyses for task goal inference of multimodal user interactions
Proceedings of the 15th international conference on Intelligent user interfaces

A topological embedding of the lexicon for semantic distance computation
Natural Language Engineering

A unified framework for web video topic discovery and visualization
Pattern Recognition Letters

Quantified Score

Hi-index	0.00

Abstract

In this paper we propose a document representation model based on latent semantic analysis (LSA) for text clustering. Most classic clustering systems represent documents by a set of index terms, an approach known as the vector space model (VSM). In such a model, documents are encoded as vectors in an N-dimensional space, where N is the number of unique terms. However, this representation scales poorly and is computationally expensive. Latent semantic analysis is a promising approach that attempts to construct a latent semantic structure in textual data and can find relevant documents even when they share no common words. Moreover, it reduces the large term-by-document matrix to a smaller one and provides a robust space for clustering. Two clustering algorithms, K-means and a genetic algorithm (GA), are constructed in the LSA space to demonstrate the effectiveness and validity of our text representation model. We use the SSTRESS criterion to analyze the dissimilarity between the original corpus matrix and its approximations of different ranks, and relate it to the performance of the two clustering algorithms. The clustering results demonstrate the superiority of GA and K-means applied in the LSA model over conventional GA and K-means in the VSM.
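
The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a hypothetical five-term, five-document toy corpus, uses a truncated SVD (NumPy's `numpy.linalg.svd`) as the LSA step, and runs a plain K-means with a simple farthest-point initialization chosen here for determinism. The reported reconstruction error is a relative Frobenius-norm error, used only as a stand-in; it is not the SSTRESS criterion from the paper, and the genetic algorithm is not covered.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (rows: terms, columns: documents).
# d0 = {car, engine}, d1 = {automobile, engine}, d2 = {automobile},
# d3 = {flower, petal}, d4 = {petal}. Note that d0 and d2 share no terms.
A = np.array([
    [1, 0, 0, 0, 0],  # car
    [0, 1, 1, 0, 0],  # automobile
    [1, 1, 0, 0, 0],  # engine
    [0, 0, 0, 1, 0],  # flower
    [0, 0, 0, 1, 1],  # petal
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def lsa(A, k):
    """Return rank-k document coordinates and the rank-k approximation of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    docs_k = Vt[:k].T * s[:k]           # one row per document, k columns
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation of A
    return docs_k, A_k


def kmeans(X, k, n_iter=20):
    """Plain K-means; farthest-point initialization is an assumption of this sketch."""
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels


docs_k, A_k = lsa(A, k=2)

vsm_sim = cosine(A[:, 0], A[:, 2])      # similarity of d0 and d2 in the VSM
lsa_sim = cosine(docs_k[0], docs_k[2])  # similarity of d0 and d2 in LSA space
labels = kmeans(docs_k, k=2)            # cluster documents in LSA space

# Relative Frobenius error of the rank-2 approximation (stand-in, not SSTRESS).
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)

print("VSM cosine(d0, d2):", vsm_sim)
print("LSA cosine(d0, d2):", lsa_sim)
print("K-means labels in LSA space:", labels.tolist())
print("Relative reconstruction error:", rel_err)
```

In this toy corpus, d0 and d2 have no term in common, so their VSM vectors are orthogonal, yet both are linked through d1 and end up on the same latent dimension after the rank reduction. Varying `k` in `lsa` shows how the approximation error trades off against the dimensionality of the clustering space.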