In this paper we present an efficient, scalable, and general algorithm for performing set joins on predicates involving various similarity measures such as intersection size, Jaccard coefficient, cosine similarity, and edit distance. This expands the existing suite of algorithms for set joins on simpler predicates such as set containment, equality, and non-zero overlap. We start with a basic inverted-index-based probing method and add a sequence of optimizations that result in a one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
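As a rough illustration of the starting point named in the abstract, the following is a minimal sketch of an inverted-index probing self-join on an intersection-size predicate. It is not the paper's algorithm: it includes none of the optimizations, partitioning, or index compression the abstract mentions, and the function name `overlap_self_join`, its parameters, and the sample records are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def overlap_self_join(
    records: Sequence[Set[Hashable]], threshold: int
) -> List[Tuple[int, int, int]]:
    """Return (i, j, overlap) for all i < j with |records[i] & records[j]| >= threshold.

    Hypothetical baseline: probe an inverted index, count shared tokens,
    then add the probing record to the index.
    """
    # token -> ids of the records already indexed that contain it
    index: Dict[Hashable, List[int]] = defaultdict(list)
    results: List[Tuple[int, int, int]] = []

    for rid, tokens in enumerate(records):
        counts: Dict[int, int] = defaultdict(int)
        # Probe: each posting-list hit means one token shared with an earlier record.
        for tok in tokens:
            for other in index.get(tok, ()):
                counts[other] += 1
        # Keep only candidates whose shared-token count meets the threshold.
        for other, overlap in sorted(counts.items()):
            if overlap >= threshold:
                results.append((other, rid, overlap))
        # Index the record after probing so each pair is reported exactly once.
        for tok in tokens:
            index[tok].append(rid)

    return results


if __name__ == "__main__":
    sample = [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}, {"a", "b", "c", "d"}]
    for pair in overlap_self_join(sample, threshold=2):
        print(pair)
```

Counting posting-list hits yields the exact intersection size for every pair that shares at least one token, so pairs with no overlap are never examined. The cost is dominated by long posting lists of frequent tokens, which is the kind of overhead that optimizations on top of such a baseline would target.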