Efficient Probabilistic Latent Semantic Analysis through Parallelization

Authors:
Raymond Wan;Vo Ngoc Anh;Hiroshi Mamitsuka
Affiliations:
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Japan 611-0011 and Computational Biology Research Center, AIST, Tokyo, Japan 135-0064;Department of Computer Science and Software Engineering, University of Melbourne, Victoria, Australia 3010;Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Japan 611-0011
Venue:
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Year:
2009

References:
Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Unsupervised learning by probabilistic latent semantic analysis. Machine Learning
Parallel Programming in C with MPI and OpenMP
Parallelization and Characterization of Probabilistic Latent Semantic Analysis. ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Efficient storage and retrieval of probabilistic latent semantic information for information retrieval. The VLDB Journal — The International Journal on Very Large Data Bases
An empirical study on dimensionality optimization in text mining for linguistic knowledge acquisition. PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining

Cited by:
P2LSA and P2LSA+: two paralleled probabilistic latent semantic analysis algorithms based on the mapreduce model. IDEAL'11 Proceedings of the 12th international conference on Intelligent data engineering and automated learning

Abstract

Probabilistic latent semantic analysis (PLSA) is considered an effective technique for information retrieval, but it has one notable drawback: its dramatic consumption of computing resources, in terms of both execution time and internal memory. This drawback limits practical application of the technique to document collections of modest size. In this paper, we look into the practice of implementing PLSA with the aim of improving its efficiency without changing its output. Recently, Hong et al. [2008] showed how the execution time of PLSA can be improved by employing OpenMP for shared memory parallelization. We extend their work by also studying the effects of using OpenMP in combination with the Message Passing Interface (MPI) for distributed memory parallelization. By applying our method to several text collections commonly used in the literature, we show how a more careful implementation of PLSA reduces execution time and memory costs.
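
To make the idea of parallelizing PLSA "without changing its output" concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes the standard asymmetric PLSA model, P(w|d) = sum over z of P(w|z) P(z|d), fitted with EM, and it assumes every document contains at least one word. Documents are split into chunks that could each be handled by a separate worker; here the chunks are processed one after another, and the summation of partial results stands in for a sum reduction across threads or processes. All names (`init_params`, `process_chunk`, `em_iteration`, `fit`) are hypothetical and chosen for illustration only.

```python
import random


def init_params(n_docs, n_words, n_topics, seed=0):
    """Random, strictly positive, normalized starting values (an assumption)."""
    rng = random.Random(seed)

    def normalized(n):
        row = [rng.random() + 0.1 for _ in range(n)]
        total = sum(row)
        return [x / total for x in row]

    p_w_z = [normalized(n_words) for _ in range(n_topics)]  # P(w|z)
    p_z_d = [normalized(n_topics) for _ in range(n_docs)]   # P(z|d)
    return p_w_z, p_z_d


def process_chunk(counts, doc_ids, p_w_z, p_z_d):
    """E-step and partial M-step sums for one chunk of documents.

    counts[d] maps a word id to its count n(d, w) in document d.
    Only read access to the shared parameters is needed, so chunks are
    independent of one another within an iteration.
    """
    n_topics, n_words = len(p_w_z), len(p_w_z[0])
    word_sums = [[0.0] * n_words for _ in range(n_topics)]
    new_p_z_d = {}
    for d in doc_ids:
        topic_sums = [0.0] * n_topics
        for w, n_dw in counts[d].items():
            # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d)
            joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
            norm = sum(joint)
            for z in range(n_topics):
                r = n_dw * joint[z] / norm
                word_sums[z][w] += r   # contributes to the new P(w|z)
                topic_sums[z] += r     # contributes to the new P(z|d)
        total = sum(topic_sums)
        new_p_z_d[d] = [t / total for t in topic_sums]
    return word_sums, new_p_z_d


def em_iteration(counts, p_w_z, p_z_d, n_chunks):
    """One EM iteration with documents split into n_chunks chunks."""
    n_docs, n_topics, n_words = len(counts), len(p_w_z), len(p_w_z[0])
    chunks = [range(i, n_docs, n_chunks) for i in range(n_chunks)]
    # Sequential stand-in for parallel workers.
    results = [process_chunk(counts, c, p_w_z, p_z_d) for c in chunks]

    # Reduce step: add up the partial word sums from all chunks.
    word_sums = [[0.0] * n_words for _ in range(n_topics)]
    new_p_z_d = [None] * n_docs
    for partial, local in results:
        for z in range(n_topics):
            for w in range(n_words):
                word_sums[z][w] += partial[z][w]
        for d, row in local.items():
            new_p_z_d[d] = row  # P(z|d) is local to its document

    new_p_w_z = []
    for row in word_sums:
        total = sum(row)
        new_p_w_z.append([v / total for v in row])
    return new_p_w_z, new_p_z_d


def fit(counts, n_words, n_topics, n_chunks, n_iters=5, seed=0):
    """Run a fixed number of EM iterations and return (P(w|z), P(z|d))."""
    p_w_z, p_z_d = init_params(len(counts), n_words, n_topics, seed)
    for _ in range(n_iters):
        p_w_z, p_z_d = em_iteration(counts, p_w_z, p_z_d, n_chunks)
    return p_w_z, p_z_d
```

Calling `fit` on the same data with `n_chunks=1` and with a larger `n_chunks` should give the same parameters up to floating-point rounding, because only the order of summation changes. In this sketch the per-document update of P(z|d) needs no communication, while the update of P(w|z) requires combining partial sums from every chunk.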