P2LSA and P2LSA+: two paralleled probabilistic latent semantic analysis algorithms based on the MapReduce model

Authors:
Yan Jin;Yang Gao;Yinghuan Shi;Lin Shang;Ruili Wang;Yubin Yang
Affiliations:
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China;State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China;State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China;State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China;School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand;State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Venue:
IDEAL'11 Proceedings of the 12th international conference on Intelligent data engineering and automated learning
Year:
2011

Abstract

Two novel parallel Probabilistic Latent Semantic Analysis (PLSA) algorithms based on the MapReduce model, P2LSA and P2LSA+, are proposed. On large-scale data sets, both algorithms use the Hadoop platform to speed up computation. Traditional PLSA typically uses the Expectation-Maximization (EM) algorithm to estimate two hidden parameter vectors, and parallel PLSA amounts to running EM in parallel. EM consists of two steps, the E-step and the M-step. In P2LSA, the Map function performs the E-step and the Reduce function performs the M-step. However, all intermediate results computed in the E-step must be sent to the M-step, and transferring this large amount of data increases both the network load and the overall running time. In P2LSA+, by contrast, the Map function performs the E-step and the M-step together, which reduces the data transferred between the two steps and improves performance. Experiments evaluate both algorithms on a data set of 20000 users and 10927 goods. The speedup curves show that the overall running time decreases as the number of computing nodes increases, and the measured running times show that P2LSA+ is about 3 times faster than P2LSA.
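
The difference between the two designs can be illustrated with a small single-process sketch in Python. It is not the authors' Hadoop implementation. It assumes the asymmetric PLSA form P(g|u) = sum over z of P(g|z) P(z|u) for users u, goods g and topics z, and it assumes that "performing the M-step in the Map function" means accumulating the M-step sums locally inside each mapper before anything is emitted; the abstract does not spell out either detail. All names (map_emit_all, map_local_sums, reduce_and_normalize, em_iteration) and the toy data are hypothetical.

```python
from collections import defaultdict

K = 2  # number of latent topics (hypothetical choice for the toy data)

def e_step(u, g, p_z_u, p_g_z):
    """Posterior P(z | u, g), proportional to P(z | u) * P(g | z)."""
    weights = [p_z_u[u][z] * p_g_z[z][g] for z in range(K)]
    total = sum(weights)
    return [w / total for w in weights]

def map_emit_all(partition, p_z_u, p_g_z):
    """P2LSA-style mapper (assumed): E-step only, one record per term of each M-step sum."""
    records = []
    for u, g, n in partition:  # n = observed count for the (user, good) pair
        post = e_step(u, g, p_z_u, p_g_z)
        for z in range(K):
            records.append((("z|u", u, z), n * post[z]))
            records.append((("g|z", z, g), n * post[z]))
    return records

def map_local_sums(partition, p_z_u, p_g_z):
    """P2LSA+-style mapper (assumed): E-step plus local M-step sums, one record per key."""
    partial = defaultdict(float)
    for key, value in map_emit_all(partition, p_z_u, p_g_z):
        partial[key] += value
    return list(partial.items())

def reduce_and_normalize(records, users, goods):
    """Reducer: add up the sums per key, then normalize into new parameters."""
    totals = defaultdict(float)
    for key, value in records:
        totals[key] += value
    new_p_z_u = {}
    for u in users:
        row = [totals[("z|u", u, z)] for z in range(K)]
        norm = sum(row)
        new_p_z_u[u] = [x / norm for x in row]
    new_p_g_z = []
    for z in range(K):
        row = {g: totals[("g|z", z, g)] for g in goods}
        norm = sum(row.values())
        new_p_g_z.append({g: x / norm for g, x in row.items()})
    return new_p_z_u, new_p_g_z

def em_iteration(mapper, partitions, p_z_u, p_g_z, users, goods):
    """One EM iteration; also returns how many records crossed the 'shuffle'."""
    shuffled = [rec for part in partitions for rec in mapper(part, p_z_u, p_g_z)]
    return reduce_and_normalize(shuffled, users, goods), len(shuffled)

# Toy data: two input splits of (user, good, count) triples.
USERS, GOODS = ["u1", "u2", "u3"], ["a", "b", "c"]
PARTITIONS = [
    [("u1", "a", 2), ("u1", "b", 1), ("u2", "a", 1)],
    [("u2", "c", 3), ("u3", "b", 2), ("u3", "c", 1)],
]
P_Z_U = {"u1": [0.6, 0.4], "u2": [0.5, 0.5], "u3": [0.3, 0.7]}
P_G_Z = [{"a": 0.5, "b": 0.3, "c": 0.2}, {"a": 0.2, "b": 0.3, "c": 0.5}]

if __name__ == "__main__":
    for mapper in (map_emit_all, map_local_sums):
        (_, _), n_records = em_iteration(mapper, PARTITIONS, P_Z_U, P_G_Z, USERS, GOODS)
        print(mapper.__name__, "emitted", n_records, "records")
```

On this toy input both mappers lead to the same updated parameters, but the locally aggregating mapper emits 16 records where the emit-everything mapper emits 24. The gap grows with the number of observations per split, which is consistent with the reduced transfer the abstract describes, although the real speedup depends on the cluster and the data.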