Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
A Metric for Distributions with Applications to Image Databases
ICCV '98 Proceedings of the Sixth International Conference on Computer Vision
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A Primitive Operator for Similarity Joins in Data Cleaning
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Interpreting the data: Parallel analysis with Sawzall
Scientific Programming - Dynamic Grids and Worldwide Computing
Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web
Detectives: detecting coalition hit inflation attacks in advertising networks streams
Proceedings of the 16th international conference on World Wide Web
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Blog Community Discovery and Evolution Based on Mutual Awareness Expansion
WI '07 Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence
Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
SLEUTH: Single-pubLisher attack dEtection Using correlaTion Hunting
Proceedings of the VLDB Endowment
Efficient Merging and Filtering Algorithms for Approximate String Searches
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Scalable graph clustering using stochastic flows: applications to community discovery
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Pairwise document similarity in large collections with MapReduce
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Probabilistic community discovery using hierarchical latent Gaussian mixture model
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 1
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
Efficient parallel set-similarity joins using MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Data-Intensive Text Processing with MapReduce
Data-Intensive Text Processing with MapReduce
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Document Similarity Self-Join with MapReduce
ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Estimating the number of users behind ip addresses for combating abusive traffic
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Enumerative Combinatorics: Volume 1
Enumerative Combinatorics: Volume 1
Efficient processing of k nearest neighbor joins using MapReduce
Proceedings of the VLDB Endowment
MapReduce algorithms for big data analysis
Proceedings of the VLDB Endowment
Designing good algorithms for MapReduce and beyond
Proceedings of the Third ACM Symposium on Cloud Computing
A MapReduce-based filtering algorithm for vector similarity join
Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
Proceedings of the Joint EDBT/ICDT 2013 Workshops
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Scalable all-pairs similarity search in metric spaces
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Distributed data management using MapReduce
ACM Computing Surveys (CSUR)
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
Scalable column concept determination for web tables using large knowledge bases
Proceedings of the VLDB Endowment
PLASMA-HD: probing the lattice structure and makeup of high-dimensional data
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scalable in the number of entities, as well as their cardinalities. They were up to 30 times faster than the state of the art algorithm, VCL, when compared on a real dataset of a small size. We also established the scalability of the proposed algorithms by running them on a dataset of a realistic size, on which VCL never succeeded to finish. Experiments were run using real datasets of IPs and cookies, where each IP is represented as a multiset of cookies, and the goal is to discover similar IPs to identify Internet proxies.