Computing paper similarity based on Latent Dirichlet Allocation

  • Authors:
  • Duck-Ho Bae;Seok-Ho Yoon;Tae-Hwan Eom;Jiwoon Ha;Young-Sup Hwang;Sang-Wook Kim

  • Affiliations:
  • Hanyang University, Seoul, Korea;Hanyang University, Seoul, Korea;Hanyang University, Seoul, Korea;Hanyang University, Seoul, Korea;Sunmoon University, Asan-si, Korea;Hanyang University, Seoul, Korea

  • Venue:
  • Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
  • Year:
  • 2014

Abstract

This paper discusses methods for accurately computing paper similarity using Latent Dirichlet Allocation (LDA). Two problems arise when computing paper similarity based on LDA. First, similarity is hard to compute accurately in a paper database because the papers stored there contain too little text, a consequence of copyright restrictions and the technical limits of crawling and parsing. Second, it is hard to provide the inputs that LDA-based similarity requires: a user must specify the number of topics and select as many seed papers as there are topics. This paper proposes a method for each problem. To address the lack of text, we apply the keyword extension method to LDA-based similarity computation. Keyword extension supplements a paper's text with the text of the papers it cites and the text of the papers that cite it. To select appropriate seed papers, we propose a method that uses the reference information of the compared paper. Finally, we demonstrate the superiority of the proposed methods through experiments on real papers.
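
As an illustration of the keyword extension idea described in the abstract, the following is a minimal sketch only, not the authors' implementation. The function name `extend_text`, the dictionary-based data layout, and the toy papers are all assumptions made for this example. The abstract does not say how the added text is weighted or combined, so the sketch simply concatenates it.

```python
from collections import defaultdict


def extend_text(paper_id, texts, references):
    """Return a paper's own text plus the text of papers it cites
    and of papers that cite it (hypothetical helper, plain concatenation)."""
    # Invert the reference map so we can look up the papers citing each paper.
    cited_by = defaultdict(list)
    for source, targets in references.items():
        for target in targets:
            cited_by[target].append(source)

    # Neighbours: first the papers this one cites, then the papers citing it.
    neighbours = list(references.get(paper_id, [])) + cited_by[paper_id]

    parts = [texts.get(paper_id, "")]
    parts += [texts[n] for n in neighbours if n in texts]
    return " ".join(p for p in parts if p)


# Toy data (invented for illustration): A cites B, and C cites A.
texts = {
    "A": "topic models for documents",
    "B": "latent dirichlet allocation",
    "C": "citation graph analysis",
}
references = {"A": ["B"], "C": ["A"]}

extended_a = extend_text("A", texts, references)
extended_b = extend_text("B", texts, references)
```

Under these assumptions, the extended text rather than the original sparse text would be passed to an LDA implementation, so that a paper with little available text still yields a usable topic distribution.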