Comparing LDA with pLSI as a dimensionality reduction method in document clustering

Authors:
Tomonari Masada; Senya Kiyasu; Sueharu Miyahara
Affiliations:
Nagasaki University, Nagasaki, Japan; Nagasaki University, Nagasaki, Japan; Nagasaki University, Nagasaki, Japan
Venue:
LKR'08 Proceedings of the 3rd international conference on Large-scale knowledge resources: construction and application
Year:
2008

Abstract

In this paper, we compare latent Dirichlet allocation (LDA) with probabilistic latent semantic indexing (pLSI) as dimensionality reduction methods and investigate their effectiveness in document clustering on real-world document sets. To cluster documents, we use a method based on a multinomial mixture model, which is known as an efficient framework for text mining. Clustering results are evaluated by the F-measure, i.e., the harmonic mean of precision and recall. We use Japanese and Korean Web articles for evaluation and regard the category assigned to each Web article as the ground truth. Our experiment shows that dimensionality reduction via LDA or pLSI yields document clusters of almost the same quality as those obtained from the original feature vectors. Therefore, the vector dimension can be reduced without degrading cluster quality. Further, both LDA and pLSI are more effective than random projection, the baseline method in our experiment. However, our experiment shows no meaningful difference between LDA and pLSI. This result suggests that LDA does not replace pLSI, at least for dimensionality reduction in document clustering.
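
The evaluation measure named in the abstract can be illustrated with a minimal sketch. The abstract states only that the F-measure is the harmonic mean of precision and recall; it does not say how precision and recall are aggregated over clusters. The example below therefore scores a single cluster against a single ground-truth category, and the function names and document ids are hypothetical, chosen for illustration only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def cluster_f_measure(cluster: set, category: set) -> float:
    """F-measure of one cluster scored against one ground-truth category.

    Assumption: per-cluster scoring; the abstract does not specify
    how scores are combined across clusters.
    """
    if not cluster or not category:
        return 0.0
    overlap = len(cluster & category)
    precision = overlap / len(cluster)   # share of the cluster inside the category
    recall = overlap / len(category)     # share of the category the cluster recovered
    return f_measure(precision, recall)


# Hypothetical document ids, for illustration only.
cluster = {1, 2, 3, 4}
category = {3, 4, 5, 6, 7, 8}
score = cluster_f_measure(cluster, category)
print(round(score, 3))
```

Here two of the four clustered documents belong to the category, so precision is 0.5 and recall is 2/6; the harmonic mean penalizes the lower of the two more strongly than an arithmetic mean would.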