This paper discusses methods for accurately computing paper similarity using Latent Dirichlet Allocation (LDA). Computing LDA-based paper similarity raises two problems. First, similarity is hard to compute accurately in a paper database because the papers contain too little text, a consequence of copyright restrictions and the technical limits of crawling and parsing. Second, it is hard to supply the inputs that LDA-based similarity requires: the user must specify the number of topics and choose as many seed papers as there are topics. This paper proposes a method for each problem. To compensate for the lack of text, we apply a keyword extension method to the computation of LDA-based similarity. This method uses the text of papers cited by the compared paper, or of papers that cite it, as additional text information. To select appropriate seed papers, we propose a method that uses the reference information of the papers being compared. Finally, we demonstrate the superiority of the proposed methods through experiments on real papers.
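The keyword extension idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `extend_keywords` and `cosine`, the toy `texts` and `cites` data, whitespace tokenization, and the choice of cosine similarity are all assumptions made for illustration. In an LDA pipeline the extended bag of words would be passed to a topic model, and a similarity measure would then be applied to the resulting topic distributions; that step is omitted here.

```python
from collections import Counter
from math import sqrt


def extend_keywords(paper_id, texts, cites):
    """Bag of words for a paper, extended with its citation neighbours.

    texts: paper id -> available text (hypothetical toy data below)
    cites: paper id -> list of paper ids that it cites
    """
    # Papers cited by the compared paper.
    neighbours = set(cites.get(paper_id, ()))
    # Papers that cite the compared paper.
    neighbours |= {p for p, refs in cites.items() if paper_id in refs}
    neighbours.discard(paper_id)

    # Assumption: plain lowercase whitespace tokenization.
    tokens = texts.get(paper_id, "").lower().split()
    for other in sorted(neighbours):
        tokens.extend(texts.get(other, "").lower().split())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy example: paper A cites B, and paper C cites A.
texts = {
    "A": "topic model",
    "B": "latent dirichlet allocation topic",
    "C": "graph similarity",
}
cites = {"A": ["B"], "C": ["A"]}

extended = extend_keywords("A", texts, cites)
print(extended)
```

Here the two-word text of paper A is extended with the words of the paper it cites (B) and the paper that cites it (C), so a text-poor paper still yields a usable word distribution.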