Mendeley's crowd-sourced catalogue of research papers underpins features such as paper search, finding papers related to the one currently being viewed, and personalised recommendations. Generating this catalogue requires deduplicating the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. At Mendeley this task is performed by an automated system, but the quality of its deduplication needs to be improved. "Ground truth" datasets are therefore needed to evaluate the system's performance, yet existing datasets are very small. In this paper we tackle the problem of generating large-scale ground truth datasets from Mendeley's database. An approach based purely on random sampling produced datasets that were too easy, so we explored approaches that focus on more difficult examples. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. We also established that a Solr-based deduplication system can achieve deduplication quality similar to that of the fingerprint-based system currently employed. Finally, we introduce a large-scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.
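The idea of selecting candidate pairs from documents with similar titles can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the record structure, the token-based Jaccard measure, and the `threshold` value are all assumptions made for the example; a production system would use an index (e.g. Solr) rather than comparing all pairs.

```python
import itertools

def title_tokens(title):
    # Lowercase and split on whitespace for a simple bag-of-words view.
    return set(title.lower().split())

def jaccard(a, b):
    # Jaccard similarity of two token sets; 0.0 when both are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hard_pairs(records, threshold=0.5):
    """Select candidate pairs whose titles are similar.

    Such pairs are the hard cases for deduplication: they still have to
    be labelled as duplicate or non-duplicate by a human or a trusted
    source, but purely random sampling would almost never surface them.
    NOTE: the O(n^2) pairwise loop is for illustration only.
    """
    pairs = []
    for (id_a, title_a), (id_b, title_b) in itertools.combinations(records, 2):
        if jaccard(title_tokens(title_a), title_tokens(title_b)) >= threshold:
            pairs.append((id_a, id_b))
    return pairs

# Hypothetical records: (document id, title).
records = [
    ("d1", "Duplicate record detection: a survey"),
    ("d2", "Duplicate Record Detection: A Survey"),
    ("d3", "Detecting near-duplicates for web crawling"),
    ("d4", "Efficient parallel set-similarity joins using MapReduce"),
]
print(hard_pairs(records))  # → [('d1', 'd2')]
```

Only the near-identical titles survive the threshold; whether such a pair is a true duplicate or two distinct papers with similar titles is exactly what the ground truth labels must decide.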