MapDupReducer: detecting near duplicates over massive datasets

Authors:
Chaokun Wang;Jianmin Wang;Xuemin Lin;Wei Wang;Haixun Wang;Hongsong Li;Wanpeng Tian;Jun Xu;Rui Li
Affiliations:
Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;University of New South Wales and NICTA, Sydney, Australia;University of New South Wales and NICTA, Sydney, Australia;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

Near-duplicate detection benefits many applications, for example online news selection over the Web by keyword search. This demo presents the design and implementation of MapDupReducer, a MapReduce-based system that efficiently detects near duplicates over massive datasets.
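
The abstract does not describe MapDupReducer's internal algorithm, so the following is only a minimal single-process sketch of how near-duplicate detection can be phrased as map and reduce steps in general. It is not the authors' method. The function names, the whitespace tokenization, the Jaccard similarity measure, and the 0.8 threshold are all assumptions chosen for illustration. The map step emits (token, document id) pairs, the shuffle groups ids by token, the reduce step proposes candidate pairs that share a token, and a final step verifies each candidate.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(doc_id, text):
    """Emit one (token, doc_id) pair per distinct token in the document."""
    for token in set(text.lower().split()):
        yield token, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(doc_ids):
    """Emit every pair of documents sharing this token as a candidate pair."""
    for a, b in combinations(sorted(set(doc_ids)), 2):
        yield a, b


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(docs, threshold=0.8):
    """Return sorted pairs of document ids whose similarity meets the threshold."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    candidates = set()
    for doc_ids in shuffle(pairs).values():
        candidates.update(reduce_phase(doc_ids))
    # Verification step: compute the exact similarity for each candidate pair.
    token_sets = {d: set(t.lower().split()) for d, t in docs.items()}
    return sorted(
        (a, b)
        for a, b in candidates
        if jaccard(token_sets[a], token_sets[b]) >= threshold
    )
```

Calling `near_duplicates` on a dictionary that maps document ids to text returns the id pairs judged to be near duplicates. This naive version generates a candidate for every pair of documents that share any token, which would not scale to massive datasets; a practical system would need to prune candidates far more aggressively.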