Efficient duplicate detection on cloud using a new signature scheme

  • Authors:
  • Chuitian Rong;Wei Lu;Xiaoyong Du;Xiao Zhang

  • Affiliations:
  • Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China and Shanghai Key Laboratory of Intelligent Information Processing

  • Venue:
  • WAIM'11 Proceedings of the 12th international conference on Web-age information management
  • Year:
  • 2011

Abstract

Duplicate detection is well recognized as a crucial task for improving data quality. Related work on this problem mainly proposes efficient approaches that run on a single machine. However, as data volumes increase, the performance of duplicate identification on a single machine remains far from satisfactory. Hence, we address duplicate detection over MapReduce, a shared-nothing paradigm. We argue that the performance of MapReduce-based duplicate detection depends mainly on the number of candidate record pairs. In this paper, we propose a new signature scheme with a new pruning strategy over MapReduce to minimize the number of candidate record pairs. Our experimental results on both real and synthetic datasets demonstrate that the proposed signature-based method is efficient and scalable.
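
The abstract does not describe the signature scheme itself, so the following is only a rough sketch of the general idea of signature-based candidate generation in a map/shuffle/reduce style. It swaps in standard prefix filtering for Jaccard similarity as the signature scheme, which is an assumption and not the authors' method, and it simulates the MapReduce phases in a single Python process. All function names, the whitespace tokenizer, and the threshold value are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    # Assumption: records are compared as sets of lowercase whitespace tokens.
    return set(text.lower().split())


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def global_order(records):
    # Order tokens by ascending frequency so rare tokens become signatures.
    freq = defaultdict(int)
    for tokens in records.values():
        for token in tokens:
            freq[token] += 1
    return lambda token: (freq[token], token)


def map_phase(records, threshold, order_key):
    # Emit (signature, record_id) for each token in the record's prefix.
    # Prefix filtering: two sets with Jaccard >= threshold must share
    # at least one token within their first |x| - ceil(t * |x|) + 1 tokens.
    for rid, tokens in records.items():
        ordered = sorted(tokens, key=order_key)
        prefix_len = len(ordered) - math.ceil(threshold * len(ordered)) + 1
        for signature in ordered[:prefix_len]:
            yield signature, rid


def shuffle(pairs):
    # Stand-in for the MapReduce shuffle: group record ids by signature.
    groups = defaultdict(list)
    for signature, rid in pairs:
        groups[signature].append(rid)
    return groups


def reduce_phase(groups):
    # Records sharing a signature become candidate pairs (deduplicated).
    candidates = set()
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            candidates.add((a, b))
    return candidates


def detect_duplicates(raw_records, threshold=0.5):
    records = {rid: tokenize(text) for rid, text in raw_records.items()}
    order_key = global_order(records)
    candidates = reduce_phase(shuffle(map_phase(records, threshold, order_key)))
    # Verification step: keep only candidates that truly meet the threshold.
    duplicates = {
        (a, b) for a, b in candidates
        if jaccard(records[a], records[b]) >= threshold
    }
    return candidates, duplicates


sample = {
    1: "efficient duplicate detection on cloud",
    2: "efficient duplicate detection over cloud",
    3: "scalable graph processing with mapreduce",
}
candidates, duplicates = detect_duplicates(sample, threshold=0.5)
```

In this toy run only records sharing a prefix token are ever compared, so the verification step sees fewer pairs than the three possible ones. This is the sense in which the number of candidate record pairs drives the cost.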