Scaling up all pairs similarity search

Authors:
Roberto J. Bayardo;Yiming Ma;Ramakrishnan Srikant
Affiliations:
Google Inc., Mountain View, CA;University of California, Irvine, Irvine, CA;Google Inc., Mountain View, CA
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Citing 21
Cited 117

Document filtering for fast ranking

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Query evaluation: strategies and optimizations

Information Processing and Management: an International Journal
Self-indexing inverted files for fast text retrieval

ACM Transactions on Information Systems (TOIS)
Optimization of inverted vector searches

SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Agglomerative clustering of a search engine query log

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
High performance clustering based on the similarity join

Proceedings of the ninth international conference on Information and knowledge management
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Database Management Systems

Database Management Systems
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Efficient similarity search and classification via rank aggregation

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Semantic similarity between search engine queries using temporal correlation

WWW '05 Proceedings of the 14th international conference on World Wide Web
Optimization strategies for complex queries

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating similarity measures: a large-scale study in the orkut social network

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Taxonomy generation for text segments: A practical web-based approach

ACM Transactions on Information Systems (TOIS)
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
A web-based kernel function for measuring the similarity of short text snippets

Proceedings of the 15th international conference on World Wide Web
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Detectives: detecting coalition hit inflation attacks in advertising networks streams

Proceedings of the 16th international conference on World Wide Web

User-assisted similarity estimation for searching related web pages

Proceedings of the eighteenth conference on Hypertext and hypermedia
VGRAM: improving performance of approximate queries on string collections using variable-length grams

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Efficient Similarity Search for Tree-Structured Data

SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Scalable mining of large video databases using copy detection

MM '08 Proceedings of the 16th ACM international conference on Multimedia
NewsStand: a new view on news

Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems
Fast Content-Based Mining of Web2.0 Videos

PCM '08 Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing
Efficient overlap and content reuse detection in blogs and online news articles

Proceedings of the 18th international conference on World wide web
Fast error-tolerant search on very large texts

Proceedings of the 2009 ACM symposium on Applied Computing
Efficient top-k algorithms for fuzzy search in string collections

Proceedings of the First International Workshop on Keyword Search on Structured Data
Pairwise document similarity in large collections with MapReduce

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Efficient Set Similarity Joins Using Min-prefixes

ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
A cluster-based approach to XML similarity joins

IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Semi-automatic entity set refinement

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Fast Matching for All Pairs Similarity Search

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Creating probabilistic databases from duplicated data

The VLDB Journal — The International Journal on Very Large Data Bases
A framework for semantic link discovery over relational data

Proceedings of the 18th ACM conference on Information and knowledge management
Similarity-aware indexing for real-time entity resolution

Proceedings of the 18th ACM conference on Information and knowledge management
Power-law based estimation of set similarity join size

Proceedings of the VLDB Endowment
Framework for evaluating clustering algorithms in duplicate detection

Proceedings of the VLDB Endowment
Web-scale distributional similarity and entity set expansion

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
An incremental clustering scheme for data de-duplication

Data Mining and Knowledge Discovery
On compressing the textual web

Proceedings of the third ACM international conference on Web search and data mining
Incremental all pairs similarity search for varying similarity thresholds

Proceedings of the 3rd Workshop on Social Network Mining and Analysis
HARRA: fast iterative hashed record linkage for large-scale data collections

Proceedings of the 13th International Conference on Extending Database Technology
Learning When Concepts Abound

The Journal of Machine Learning Research
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
From frequency to meaning: vector space models of semantics

Journal of Artificial Intelligence Research
Similarity joins as stronger metric operations

SIGSPATIAL Special
Generalizing prefix filtering to improve set similarity joins

Information Systems
CasJoin: a cascade chain for text similarity joins

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
An indexing scheme for fast and accurate chemical fingerprint database searching

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Duplicate identification in deep web data integration

WAIM'10 Proceedings of the 11th international conference on Web-age information management
CAMEO: continuous analytics for massively multiplayer online games on cloud resources

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
An efficient similarity join algorithm with cosine similarity predicate

DEXA'10 Proceedings of the 21st international conference on Database and expert systems applications: Part II
Scaling up top-K cosine similarity search

Data & Knowledge Engineering
Set similarity join on probabilistic data

Proceedings of the VLDB Endowment
Trie-join: efficient trie-based string similarity joins with edit-distance constraints

Proceedings of the VLDB Endowment
The social bookmark and publication management system bibsonomy

The VLDB Journal — The International Journal on Very Large Data Bases
Towards active detection of identity clone attacks on online social networks

Proceedings of the first ACM conference on Data and application security and privacy
CAMEO: enabling social networks for massively multiplayer online games through continuous analytics and cloud computing

Proceedings of the 9th Annual Workshop on Network and Systems Support for Games
Context-sensitive document ranking

Journal of Computer Science and Technology
Symmetrizations for clustering directed graphs

Proceedings of the 14th International Conference on Extending Database Technology
Efficient k-nearest neighbor graph construction for generic similarity measures

Proceedings of the 20th international conference on World wide web
Approximate String Processing

Foundations and Trends in Databases
Similarity join size estimation using locality sensitive hashing

Proceedings of the VLDB Endowment
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient exact edit similarity query processing with the asymmetric signature scheme

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Batch text similarity search with MapReduce

APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
Mavuno: a scalable and effective Hadoop-based paraphrase acquisition system

Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
A supervised machine learning approach for duplicate detection over gazetteer records

GeoS'11 Proceedings of the 4th international conference on GeoSpatial semantics
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Fast locality-sensitive hashing

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient fuzzy full-text type-ahead search

The VLDB Journal — The International Journal on Very Large Data Bases
Cosine interesting pattern discovery

Information Sciences: an International Journal
Automatically generating data linkages using a domain-independent candidate selection approach

ISWC'11 Proceedings of the 10th international conference on The semantic web - Volume Part I
Semi-supervised learning to rank with preference regularization

Proceedings of the 20th ACM international conference on Information and knowledge management
Filtering and clustering relations for unsupervised information extraction in open domain

Proceedings of the 20th ACM international conference on Information and knowledge management
Pass-join: a partition-based method for similarity joins

Proceedings of the VLDB Endowment
Cross-language high similarity search: why no sub-linear time bound can be expected

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Bayesian locality sensitive hashing for fast similarity search

Proceedings of the VLDB Endowment
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
Can we beat the prefix filtering?: an adaptive framework for similarity join and search

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Measuring semantic similarity between words by removing noise and redundancy in web snippets

Concurrency and Computation: Practice & Experience
CRSI: a compact randomized similarity index for set-valued features

Proceedings of the 15th International Conference on Extending Database Technology
Mining temporal patterns in popularity of web items

Information Sciences: an International Journal
Seal: spatio-textual similarity search

Proceedings of the VLDB Endowment
Distributed KNN-graph approximation via hashing

Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
Maximum inner-product search using cone trees

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Learning hash codes for efficient content reuse detection

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
CrowdER: crowdsourcing entity resolution

Proceedings of the VLDB Endowment
Efficient range queries over uncertain strings

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
An optimized in-network aggregation scheme for data collection in periodic sensor networks

ADHOC-NOW'12 Proceedings of the 11th international conference on Ad-hoc, Mobile, and Wireless Networks
Scaling pair-wise similarity-based algorithms in tagging spaces

ICWE'12 Proceedings of the 12th international conference on Web Engineering
Scalable similarity-based neighborhood methods with MapReduce

Proceedings of the sixth ACM conference on Recommender systems
Measuring website similarity using an entity-aware click graph

Proceedings of the 21st ACM international conference on Information and knowledge management
Star-Join: spatio-textual similarity join

Proceedings of the 21st ACM international conference on Information and knowledge management
Landmark-join: hash-join based string similarity joins with edit distance constraints

DaWaK'12 Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery
Detecting near-duplicate documents using sentence-level features and supervised learning

Expert Systems with Applications: An International Journal
Link discovery with guaranteed reduction ratio in affine spaces with minkowski measures

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I
Scalable and domain-independent entity coreference: establishing high quality data linkages across heterogeneous data sources

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II
Spatio-textual similarity joins

Proceedings of the VLDB Endowment
Optimizing parallel algorithms for all pairs similarity search

Proceedings of the sixth ACM international conference on Web search and data mining
Domain-Independent Entity Coreference for Linking Ontology Instances

Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
Automatic thesaurus construction for cross generation corpus

Journal on Computing and Cultural Heritage (JOCCH)
Towards scalable real-time entity resolution using a similarity-aware inverted index approach

AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87
Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Trie-based similarity search and join

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Cache-aware parallel approximate matching and join algorithms using BWT

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Efficient fuzzy search in large text collections

ACM Transactions on Information Systems (TOIS)
Accuracy vs. Speed: Scalable Entity Coreference on the Semantic Web with On-the-Fly Pruning

WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
An efficient approach to suggesting topically related web queries using hidden topic model

World Wide Web
PartSS: an efficient partition-based filtering for edit distance constraints

ADC '11 Proceedings of the Twenty-Second Australasian Database Conference - Volume 115
String similarity measures and joins with synonyms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Cache-conscious performance optimization for similarity search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Tuning large scale deduplication with reduced effort

Proceedings of the 25th International Conference on Scientific and Statistical Database Management
A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems (TODS)
Scalable all-pairs similarity search in metric spaces

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Scalable k-nearest neighbor graph construction based on greedy filtering

Proceedings of the 22nd international conference on World Wide Web companion
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
QUBiC: An adaptive approach to query-based recommendation

Journal of Intelligent Information Systems
A two-phase algorithm for mining sequential patterns with differential privacy

Proceedings of the 22nd ACM international conference on Information & knowledge management
Asymmetric signature schemes for efficient exact edit similarity query processing

ACM Transactions on Database Systems (TODS)
Entity resolution on uncertain relations

WAIM'13 Proceedings of the 14th international conference on Web-Age Information Management
Extending string similarity join to tolerant fuzzy token matching

ACM Transactions on Database Systems (TODS)
PLASMA-HD: probing the lattice structure and makeup of high-dimensional data

Proceedings of the VLDB Endowment
Scalable K-Means by ranked retrieval

Proceedings of the 7th ACM international conference on Web search and data mining
Online image search result grouping with MapReduce-based image clustering and graph construction for large-scale photos

Journal of Visual Communication and Image Representation
Using Non-Zero Dimensions for the Cosine and Tanimoto Similarity Search Among Real Valued Vectors

Fundamenta Informaticae - To Andrzej Skowron on His 70th Birthday

Quantified Score

Hi-index	0.02

Abstract

Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.
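
The problem stated in the abstract can be illustrated with a minimal sketch. The Python code below is not the authors' optimized algorithm; it is a plain inverted-index baseline without any of the paper's indexing or pruning strategies. It assumes sparse vectors are given as dicts mapping a dimension to a weight, uses cosine similarity as the score, and treats a pair as a match when its score is at least the threshold. The names `normalize` and `all_pairs` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: dimension -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {dim: w / norm for dim, w in vec.items()}


def all_pairs(vectors, threshold):
    """Return {(i, j): similarity} for all i < j with cosine >= threshold.

    Baseline only: every vector is fully indexed and every shared
    dimension is scanned, so no candidates are pruned early.
    """
    index = defaultdict(list)  # dimension -> [(vector id, weight), ...]
    results = {}
    for vid, raw in enumerate(vectors):
        vec = normalize(raw)
        # Accumulate dot products against previously indexed vectors.
        scores = defaultdict(float)
        for dim, w in vec.items():
            for other, other_w in index.get(dim, ()):
                scores[other] += w * other_w
        for other, score in scores.items():
            if score >= threshold:
                results[(other, vid)] = score
        # Index the current vector only after matching, so each pair
        # is reported once with the smaller id first.
        for dim, w in vec.items():
            index[dim].append((vid, w))
    return results


if __name__ == "__main__":
    docs = [{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"c": 1}, {"a": 1}]
    for pair, score in sorted(all_pairs(docs, 0.7).items()):
        print(pair, round(score, 3))
```

Because only vectors that share at least one non-zero dimension ever meet in the inverted index, pairs with zero similarity are never scored. This is the starting point that index-based approaches to the problem build on, and it is exact rather than approximate.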