Efficient duplicate detection on cloud using a new signature scheme

Authors:
Chuitian Rong;Wei Lu;Xiaoyong Du;Xiao Zhang
Affiliations:
Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, MOE and School of Information, Renmin University of China, China and Shanghai Key Laboratory of Intelligent Information Processing
Venue:
WAIM'11 Proceedings of the 12th international conference on Web-age information management
Year:
2011


Abstract

Duplicate detection is widely recognized as a crucial task for improving data quality. Prior work on this problem mainly proposes efficient approaches for a single machine. However, as data volumes grow, the performance of these approaches remains far from satisfactory. We therefore address duplicate detection over MapReduce, a shared-nothing paradigm. We argue that the performance of MapReduce-based duplicate detection depends mainly on the number of candidate record pairs. In this paper, we propose a new signature scheme with a new pruning strategy over MapReduce to minimize the number of candidate record pairs. Experimental results on both real and synthetic datasets demonstrate that the proposed signature-based method is efficient and scalable.
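The abstract does not describe the signature scheme itself, so the following is only a minimal sketch of the general idea of signature-based candidate generation, not the authors' method. It assumes whitespace-tokenized records, Jaccard similarity, and the well-known prefix-filtering signature (the rarest tokens of each record), and it simulates the map and reduce phases in a single Python process. The function names `jaccard` and `find_duplicates` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def find_duplicates(records, threshold):
    """Return (duplicate_pairs, number_of_candidate_pairs).

    records: dict mapping a record id to its text (assumed input format).
    threshold: Jaccard similarity threshold in (0, 1].
    """
    token_sets = {rid: set(text.lower().split()) for rid, text in records.items()}
    # Global token order: rare tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in token_sets.values() for tok in toks)

    # "Map" phase: emit (signature, record id) for each prefix token.
    groups = defaultdict(list)
    for rid, toks in token_sets.items():
        ordered = sorted(toks, key=lambda t: (freq[t], t))
        # Prefix filtering: two records with Jaccard >= threshold
        # must share at least one token within these prefixes.
        prefix_len = len(ordered) - math.ceil(threshold * len(ordered)) + 1
        for sig in ordered[:prefix_len]:
            groups[sig].append(rid)

    # "Reduce" phase: records sharing a signature become candidate pairs.
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verification: keep only candidates that meet the threshold.
    duplicates = sorted(
        pair for pair in candidates
        if jaccard(token_sets[pair[0]], token_sets[pair[1]]) >= threshold
    )
    return duplicates, len(candidates)
```

In this sketch only records that share a signature are ever compared, which is why the number of candidate pairs, rather than the number of records, drives the cost of the verification step.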