Using semi-structured data for assessing research paper similarity

Authors:
Germán Hurtado Martín;Steven Schockaert;Chris Cornelis;Helga Naessens
Affiliations:
Dept. of Industrial Engineering, University College Ghent, Belgium and Dept. of Applied Mathematics and Computer Science, Ghent University, Belgium;School of Computer Science & Informatics, Cardiff University, UK;Dept. of Applied Mathematics and Computer Science, Ghent University, Belgium and Dept. of Computer Science and Artificial Intelligence, University of Granada, Spain;Dept. of Industrial Engineering, University College Ghent, Belgium
Venue:
Information Sciences: an International Journal
Year:
2013


Abstract

Assessing the similarity of research papers is of interest in a variety of application contexts. The task is challenging, however, because the full text of the papers is often unavailable, so similarity must be determined from each paper's abstract and some additional features such as its authors, keywords, and the journal in which it was published. Our work explores several methods to exploit this information, first using methods based on the vector space model and then adapting language modeling techniques to this end. In the first case, in addition to a number of standard approaches, we experiment with a form of explicit semantic analysis. In the second case, the basic strategy is to augment the information contained in the abstract by interpolating its language model with language models for the authors, keywords, and journal of the paper. This strategy is then extended by revealing the latent topic structure of the collection using an adaptation of Latent Dirichlet Allocation in which the keywords provided by the authors guide the process. Experimental analysis shows that a well-considered use of these techniques significantly improves on the results of the standard vector space model approach.
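The interpolation strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the unigram models, the interpolation weights (0.6, 0.3, 0.1), the use of KL divergence as the comparison measure, and the toy token lists are all assumptions chosen for illustration. Author and journal models are omitted for brevity; under the same assumptions they would enter as further components of the mixture.

```python
from collections import Counter
import math


def unigram_model(tokens):
    """Maximum-likelihood unigram model: term -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def interpolate(models, weights):
    """Linear interpolation of unigram models; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mixed = Counter()
    for model, weight in zip(models, weights):
        for term, prob in model.items():
            mixed[term] += weight * prob
    return dict(mixed)


def kl_divergence(p, q, epsilon=1e-12):
    """KL(p || q); lower means more similar. epsilon guards against zeros in q."""
    return sum(
        pv * math.log(pv / max(q.get(term, 0.0), epsilon))
        for term, pv in p.items()
        if pv > 0
    )


# Hypothetical toy data: tokenised abstracts and author-provided keywords.
papers = {
    "a": (["language", "model", "retrieval", "smoothing"], ["language", "model"]),
    "b": (["topic", "model", "retrieval", "language"], ["topic", "model"]),
    "c": (["protein", "folding", "structure", "prediction"], ["protein", "structure"]),
}

# Background (collection) model built from every token; it smooths each paper model
# so that no term in the vocabulary has zero probability.
all_tokens = [t for abstract, keywords in papers.values() for t in abstract + keywords]
background = unigram_model(all_tokens)

# Assumed weights: 0.6 abstract, 0.3 keywords, 0.1 background.
weights = [0.6, 0.3, 0.1]
paper_models = {
    name: interpolate(
        [unigram_model(abstract), unigram_model(keywords), background], weights
    )
    for name, (abstract, keywords) in papers.items()
}

kl_ab = kl_divergence(paper_models["a"], paper_models["b"])
kl_ac = kl_divergence(paper_models["a"], paper_models["c"])
print(f"KL(a || b) = {kl_ab:.3f}, KL(a || c) = {kl_ac:.3f}")
```

In this toy setup paper "a" shares vocabulary with paper "b" but not with paper "c", so its divergence from "b" is the smaller of the two. In practice the weights would need to be tuned on held-out data rather than fixed by hand.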