V-SMART-Join: a scalable MapReduce framework for all-pair similarity joins of multisets and vectors

Authors:
Ahmed Metwally;Christos Faloutsos
Affiliations:
Google, Inc., Mountain View, CA;SCS, Carnegie Mellon University, Pittsburgh, PA
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic. It is a family of two-stage algorithms: the first stage computes and joins the partial results, and the second stage computes the exact similarity for all candidate pairs. The V-SMART-Join algorithms are efficient and scale with both the number of entities and their cardinalities. They were up to 30 times faster than the state-of-the-art algorithm, VCL, when compared on a small real dataset. We also established the scalability of the proposed algorithms by running them on a dataset of realistic size, on which VCL never finished. Experiments were run on real datasets of IPs and cookies, where each IP is represented as a multiset of cookies and the goal is to discover similar IPs in order to identify Internet proxies.
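To make the two-stage idea concrete, here is a minimal single-machine sketch in Python. It is not the authors' MapReduce implementation. The abstract does not name a similarity measure, so the sketch assumes Ruzicka similarity (a multiset generalization of Jaccard), and the function name `similar_pairs`, the threshold parameter, and the sample IP and cookie names are all hypothetical. The first stage gathers per-entity partial results (multiset sizes and an inverted index); the second stage computes the exact similarity for every pair that shares at least one element.

```python
from collections import Counter, defaultdict
from itertools import combinations


def similar_pairs(multisets, threshold):
    """Return {(a, b): similarity} for all pairs at or above the threshold.

    multisets maps an entity id to a Counter of element -> multiplicity.
    Similarity is assumed to be Ruzicka: sum(min) / sum(max).
    """
    # Stage 1a: per-entity partial result (total multiplicity of the multiset).
    size = {entity: sum(counts.values()) for entity, counts in multisets.items()}

    # Stage 1b: inverted index, element -> list of (entity, multiplicity).
    index = defaultdict(list)
    for entity, counts in multisets.items():
        for element, multiplicity in counts.items():
            index[element].append((entity, multiplicity))

    # Stage 2: for every candidate pair (pairs sharing an element),
    # accumulate the sum of minimum multiplicities over shared elements.
    overlap = defaultdict(int)
    for postings in index.values():
        for (a, ma), (b, mb) in combinations(sorted(postings), 2):
            overlap[(a, b)] += min(ma, mb)

    # For multisets, sum(max) = |A| + |B| - sum(min), so the exact
    # similarity follows from the partial results alone.
    result = {}
    for (a, b), inter in overlap.items():
        union = size[a] + size[b] - inter
        similarity = inter / union
        if similarity >= threshold:
            result[(a, b)] = similarity
    return result


# Hypothetical data: each IP is a multiset of cookies.
ips = {
    "ip1": Counter({"c1": 2, "c2": 1}),
    "ip2": Counter({"c1": 1, "c2": 1}),
    "ip3": Counter({"c3": 5}),
}
pairs = similar_pairs(ips, threshold=0.5)  # only ("ip1", "ip2") qualifies, at 2/3
```

Note that an element shared by many entities produces a quadratic number of candidate pairs in the second stage. This sketch does nothing to mitigate that; handling such skew at scale is the concern the abstract cites as motivation.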