Using semi-structured data for assessing research paper similarity

  • Authors:
  • Germán Hurtado Martín; Steven Schockaert; Chris Cornelis; Helga Naessens

  • Affiliations:
  • Dept. of Industrial Engineering, University College Ghent, Belgium and Dept. of Applied Mathematics and Computer Science, Ghent University, Belgium; School of Computer Science & Informatics, Cardiff University, UK; Dept. of Applied Mathematics and Computer Science, Ghent University, Belgium and Dept. of Computer Science and Artificial Intelligence, University of Granada, Spain; Dept. of Industrial Engineering, University College Ghent, Belgium

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2013

Quantified Score

Hi-index 0.07

Abstract

The task of assessing the similarity of research papers is of interest in a variety of application contexts. It is a challenging task, however, as the full text of the papers is often not available, and similarity needs to be determined based on the papers' abstracts and some additional features such as their authors, keywords, and the journals in which they were published. Our work explores several methods to exploit this information, first by using methods based on the vector space model and then by adapting language modeling techniques to this end. In the first case, in addition to a number of standard approaches, we experiment with the use of a form of explicit semantic analysis. In the second case, the basic strategy we pursue is to augment the information contained in the abstract by interpolating the corresponding language model with language models for the authors, keywords and journal of the paper. This strategy is then extended by revealing the latent topic structure of the collection using an adaptation of Latent Dirichlet Allocation, in which the keywords that were provided by the authors are used to guide the process. Experimental analysis shows that a well-considered use of these techniques significantly improves the results of the standard vector space model approach.
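
The interpolation strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes maximum-likelihood unigram models, fixed interpolation weights chosen arbitrarily, and KL divergence as the comparison measure, none of which are specified in the abstract. The function names (`unigram_model`, `interpolate`, `kl_divergence`) and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: maps each word to its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(models, weights):
    """Linearly interpolate unigram models; the weights are assumed to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    mixed = {}
    for model, weight in zip(models, weights):
        for word, prob in model.items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


def kl_divergence(p, q, floor=1e-9):
    """KL(p || q). Words missing from q get a small floor probability,
    an illustrative stand-in for proper smoothing."""
    return sum(pw * math.log(pw / q.get(w, floor)) for w, pw in p.items() if pw > 0)


# Toy data (invented for illustration): the abstract of a paper plus text
# associated with its authors, keywords and journal.
abstract_lm = unigram_model("fuzzy rough sets fuzzy".split())
author_lm = unigram_model("rough sets approximation".split())
keyword_lm = unigram_model("fuzzy sets".split())
journal_lm = unigram_model("information sciences fuzzy".split())

# Assumed weights: the abstract dominates, the other features refine it.
paper_lm = interpolate(
    [abstract_lm, author_lm, keyword_lm, journal_lm],
    [0.6, 0.2, 0.1, 0.1],
)

other_lm = unigram_model("rough sets approximation operators".split())
print(kl_divergence(paper_lm, other_lm))  # lower values indicate more similar papers
```

In this sketch the interpolated model assigns non-zero probability to words such as "approximation" that never occur in the abstract itself, which is the sense in which the additional features augment the abstract. The weights would normally be tuned on held-out data rather than fixed by hand.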