On generating large-scale ground truth datasets for the deduplication of bibliographic records

  • Authors:
  • James A. Hammerton; Michael Granitzer; Dan Harvey; Maya Hristakeva; Kris Jack

  • Affiliations:
  • Mendeley Ltd., London, UK; University of Passau; State Ltd., London, UK; -; -

  • Venue:
  • Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
  • Year:
  • 2012

Abstract

Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as the ability to search for papers, finding papers related to the one currently being viewed, and personalised recommendations. In order to generate this catalogue, it is necessary to deduplicate the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. This task has been achieved at Mendeley via an automated system; however, the quality of the deduplication needs to be improved. "Ground truth" datasets are thus needed for evaluating the system's performance, but existing datasets are very small. In this paper, the problem of generating large-scale datasets from Mendeley's database is tackled. An approach based purely on random sampling produced very easy datasets, so approaches that focus on more difficult examples were explored. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. Additionally, we established that a Solr-based deduplication system can achieve a similar deduplication quality to the fingerprint-based system currently employed. Finally, we introduce a large-scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.
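
The abstract does not spell out how similar-title pairs are selected, so the sketch below is only one plausible illustration of the idea, not the authors' implementation. The record fields (`id`, `catalogue_id`, `title`), the `difflib`-based similarity measure, and the 0.8 threshold are assumptions made for the example: pairs whose titles look alike are kept, pairs that share a catalogue id are labelled as duplicates, and the remaining similar-title pairs serve as hard non-duplicates.

```python
# Illustrative sketch (not the paper's implementation): build candidate pairs
# for a deduplication ground truth set from records with similar titles.
from difflib import SequenceMatcher
from itertools import combinations

def normalise(title: str) -> str:
    """Lower-case and collapse whitespace so superficial differences are ignored."""
    return " ".join(title.lower().split())

def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching subsequences of the normalised titles."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def candidate_pairs(records, threshold=0.8):
    """Yield (id1, id2, similarity, label) for pairs whose titles look alike.
    Pairs sharing a catalogue id count as duplicates; the rest are hard non-duplicates."""
    for r1, r2 in combinations(records, 2):
        sim = title_similarity(r1["title"], r2["title"])
        if sim >= threshold:
            label = "duplicate" if r1["catalogue_id"] == r2["catalogue_id"] else "non-duplicate"
            yield r1["id"], r2["id"], round(sim, 2), label

# Toy records standing in for user-uploaded library entries (hypothetical data).
records = [
    {"id": 1, "catalogue_id": "A", "title": "On Generating Large-Scale Ground Truth Datasets"},
    {"id": 2, "catalogue_id": "A", "title": "On generating large scale ground truth data sets"},
    {"id": 3, "catalogue_id": "B", "title": "Generating large-scale ground truth for deduplication"},
    {"id": 4, "catalogue_id": "C", "title": "Personalised recommendation of research papers"},
]

for pair in candidate_pairs(records):
    print(pair)
```

Comparing every pair of records scales quadratically, so at catalogue scale the candidate pairs would in practice come from some blocking or indexing step (for example a title query against a search index such as Solr) rather than an exhaustive scan; the example above only shows the labelling idea.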