Computing paper similarity based on latent Dirichlet allocation

Authors:
Duck-Ho Bae;Seok-Ho Yoon;Tae-Hwan Eom;Jiwoon Ha;Young-Sup Hwang;Sang-Wook Kim
Affiliations:
Hanyang University, Seoul, Korea;Hanyang University, Seoul, Korea;Hanyang University, Seoul, Korea;Hanyang University, Seoul, Korea;Sunmoon University, Asan-si, Korea;Hanyang University, Seoul, Korea
Venue:
Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
Year:
2014

Abstract

This paper discusses methods for accurately computing paper similarity using Latent Dirichlet Allocation (LDA). Two problems arise when computing paper similarity based on LDA. First, similarity is hard to compute accurately for papers in a paper database because the database holds too little text for each paper, a consequence of copyright restrictions and the technical limits of crawling and parsing. Second, it is hard to provide the inputs that LDA-based similarity requires: a user must specify the number of topics and choose as many seed papers as there are topics. This paper proposes a method for each problem. To address the lack of text, we apply the keyword extension method to LDA-based similarity computation. The keyword extension method uses, as text information, the text of the papers cited by the compared paper or the text of the papers that cite it. To select appropriate seed papers, we propose a method that uses the reference information of the paper being compared. Finally, we demonstrate the superiority of the proposed method through experiments on real papers.
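
The keyword extension idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy corpus `texts`, the citation map `cites`, and the helper names `extend_keywords` and `cosine` are hypothetical. To stay self-contained, the sketch compares papers with a plain bag-of-words cosine as a stand-in for LDA topic distributions; in an LDA-based pipeline the extended token lists would instead be the input to topic inference.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: paper id -> the few tokens available for it
# (for example, only a title survived crawling and parsing).
texts = {
    "A": ["topic", "model", "similarity"],
    "B": ["citation", "graph", "similarity"],
    "C": ["topic", "model", "inference"],
    "D": ["citation", "graph", "ranking"],
}
# Hypothetical citation links: paper id -> ids of the papers it cites.
cites = {"A": ["C"], "B": ["D"], "C": [], "D": []}


def extend_keywords(paper, texts, cites):
    """Return the paper's own tokens plus the tokens of the papers it
    cites and of the papers that cite it."""
    cited = cites.get(paper, [])
    citing = [p for p, refs in cites.items() if paper in refs]
    extended = list(texts[paper])
    for neighbor in cited + citing:
        extended.extend(texts[neighbor])
    return extended


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (stand-in for
    comparing LDA topic distributions)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


extended = {p: extend_keywords(p, texts, cites) for p in texts}
plain_sim = cosine(texts["A"], texts["C"])
extended_sim = cosine(extended["A"], extended["C"])
print(f"A vs C, own text only: {plain_sim:.3f}")
print(f"A vs C, extended text: {extended_sim:.3f}")
```

In this toy setup, papers A and C are linked by a citation, so extension gives them more shared vocabulary, while A and B, which share a single word and no citation link, drift apart. Whether the same effect holds on real data depends on the corpus and on how the extended text is weighted, which this sketch does not address.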