Scalable all-pairs similarity search in metric spaces

Authors:
Ye Wang;Ahmed Metwally;Srinivasan Parthasarathy
Affiliations:
The Ohio State University, Columbus, Ohio, USA;Google Inc., Mountain View, California, USA;The Ohio State University, Columbus, Ohio, USA
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Citing 27
Cited 0

Bipartite dimensions and bipartite degrees of graphs

Discrete Mathematics
The Earth Mover's Distance as a Metric for Image Retrieval

International Journal of Computer Vision
The webgraph framework I: compression techniques

Proceedings of the 13th international conference on World Wide Web
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Evaluating similarity measures: a large-scale study in the orkut social network

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Discovering large dense subgraphs in massive graphs

VLDB '05 Proceedings of the 31st international conference on Very large data bases
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Approximation Techniques for Indexing the Earth Mover's Distance in Multimedia Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Detectives: detecting coalition hit inflation attacks in advertising networks streams

Proceedings of the 16th international conference on World Wide Web
k-means++: the advantages of careful seeding

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Clustering Using a Similarity Measure Based on Shared Near Neighbors

IEEE Transactions on Computers
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Metric space similarity joins

ACM Transactions on Database Systems (TODS)
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Pairwise document similarity in large collections with MapReduce

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Collaborative filtering with temporal dynamics

Communications of the ACM
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Document Similarity Self-Join with MapReduce

ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Processing theta-joins using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Clustering very large multi-dimensional datasets with MapReduce

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Bayesian locality sensitive hashing for fast similarity search

Proceedings of the VLDB Endowment
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
Exploiting MapReduce-based similarity joins

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Searching and mining trillions of time series subsequences under dynamic time warping

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Optimizing parallel algorithms for all pairs similarity search

Proceedings of the sixth ACM international conference on Web search and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Given a set of entities, the all-pairs similarity search aims at identifying all pairs of entities that have similarity greater than (or distance smaller than) some user-defined threshold. In this article, we propose a parallel framework for solving this problem in metric spaces. Novel elements of our solution include: i) flexible support for multiple metrics of interest; ii) an autonomic approach to partition the input dataset with minimal redundancy to achieve good load-balance in the presence of limited computing resources; iii) an on-the- fly lossless compression strategy to reduce both the running time and the final output size. We validate the utility, scalability and the effectiveness of the approach on hundreds of machines using real and synthetic datasets.