Efficient parallel set-similarity joins using MapReduce

Authors:
Rares Vernica;Michael J. Carey;Chen Li
Affiliations:
University of California, Irvine, Irvine, CA, USA;University of California, Irvine, Irvine, CA, USA;University of California, Irvine, Irvine, CA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
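To make the notion of a set-similarity self-join concrete, here is a minimal single-process sketch that imitates a map phase and a reduce phase in memory. The abstract does not spell out the three stages, so the split used here (token ordering, candidate-pair generation, joining pairs back to records) is an assumption, as are the choice of Jaccard similarity, the use of prefix filtering to route records to groups, and every name in the code (`self_join`, `jaccard`, `records`). It is an illustration of the problem, not the paper's algorithm.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections (treated as sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def self_join(records, threshold):
    """Return sorted (rid1, rid2, similarity) triples with similarity >= threshold.

    records: dict mapping a record id to an iterable of tokens (hypothetical layout).
    """
    # Assumed stage 1: global token order, rarest tokens first.
    freq = Counter(t for toks in records.values() for t in set(toks))

    def order(token):
        return (freq[token], token)

    # Assumed stage 2, "map": route each record id to one group per prefix token.
    groups = defaultdict(list)
    for rid, toks in records.items():
        toks = sorted(set(toks), key=order)
        # Prefix-filter length for Jaccard; the epsilon guards against
        # floating-point error making the prefix too short.
        prefix_len = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        for t in toks[:prefix_len]:
            groups[t].append(rid)

    # Assumed stage 2, "reduce": verify candidate pairs that share a group.
    pairs = {}
    for rids in groups.values():
        for r1, r2 in combinations(sorted(rids), 2):
            if (r1, r2) not in pairs:
                sim = jaccard(records[r1], records[r2])
                if sim >= threshold:
                    pairs[(r1, r2)] = sim

    # Assumed stage 3: emit the joined result (here just ids and similarity).
    return [(r1, r2, sim) for (r1, r2), sim in sorted(pairs.items())]


if __name__ == "__main__":
    sample = {
        1: "a b c d".split(),
        2: "a b c e".split(),
        3: "x y z".split(),
        4: "a b c d".split(),
    }
    for row in self_join(sample, 0.6):
        print(row)
```

Grouping by prefix tokens means only records that share a rare token are ever compared, which is what keeps the per-group work small; in a real MapReduce job each group would be handled by a separate reduce call.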