A fast approach for parallel deduplication on multicore processors

Authors:
Guilherme Dal Bianco;Renata Galante;Carlos A. Heuser
Affiliations:
Universidade Federal do Rio, Porto Alegre, RS, Brazil;Universidade Federal do Rio, Porto Alegre, RS, Brazil;Universidade Federal do Rio, Porto Alegre, RS, Brazil
Venue:
Proceedings of the 2011 ACM Symposium on Applied Computing
Year:
2011

Citing 11
Cited 2

Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
A Fast Linkage Detection Scheme for Multi-Source Information Integration

WIRI '05 Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution

ICDCS '07 Proceedings of the 27th International Conference on Distributed Computing Systems
Evaluating MapReduce for Multi-core and Multiprocessor Systems

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Parallel linkage

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Robust record linkage blocking using suffix arrays

Proceedings of the 18th ACM conference on Information and knowledge management
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Probabilistic data generation for deduplication and data linkage

IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning

GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
WOO: a scalable and multi-tenant platform for continuous knowledge base synthesis

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we propose a fast approach that parallelizes the deduplication process on multicore processors. Our approach, named MD-Approach, combines an efficient blocking method with a robust data parallel programming model. The blocking phase is composed of two steps. The first step generates large blocks by grouping records with low degree of similarity. The second step segments large blocks, that may result in unbalanced load, in more precise sub-blocks. A parallel data programming model is used to implement our approach in a sequence of both map and reduce operations. An empirical evaluation has shown that our deduplication approach is almost twice faster than BTO-BK, that is a scalable parallel deduplication solution in distributed environment. To the best of our knowledge, MD-Approach is the first to focus on multicore processors for parallel dedu-plication.