Example-driven design of efficient record matching queries

Authors:
Surajit Chaudhuri;Bee-Chung Chen;Venkatesh Ganti;Raghav Kaushik
Affiliations:
Microsoft Research;UW-Madison;Microsoft Research;Microsoft Research
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Abstract

Record matching is the task of identifying records that refer to the same real-world entity. It is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distance) and on external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario. In particular, the number of options, both in the string matching operations and in the choice of external sources, can be daunting. In this paper, we exploit the availability of positive and negative examples to search through this space and suggest an initial record matching query. Such queries can subsequently be modified by the programmer as needed. We ensure that the record matching queries our approach produces are (1) efficient: they can be run on large datasets by leveraging operations that are well supported by RDBMSs, and (2) explainable: they are easy to understand, so the programmer can modify them with relative ease. We demonstrate the effectiveness of our approach on several real-world datasets.
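
To make the idea of searching a space of matching queries with labeled examples more concrete, here is a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a rule is a conjunction of per-attribute similarity thresholds, that similarity is normalized edit distance, and that the search is a brute-force enumeration that keeps the rule covering the most positive pairs while rejecting every negative pair. The names `suggest_rule` and `matches`, the threshold grid, and the sample records are all hypothetical.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def matches(rule: dict, r: dict, s: dict) -> bool:
    """A rule maps attribute -> minimum similarity; all predicates must hold."""
    return all(similarity(r[attr], s[attr]) >= t for attr, t in rule.items())


def suggest_rule(positives, negatives, attributes, thresholds=(0.6, 0.8, 1.0)):
    """Enumerate conjunctions (hypothetical search strategy, assumed for illustration).

    Returns the first rule that rejects all negative pairs and covers the
    largest number of positive pairs, together with that coverage count.
    """
    best_rule, best_cov = None, -1
    # None means "this attribute is not used in the rule".
    for combo in product((None,) + tuple(thresholds), repeat=len(attributes)):
        rule = {a: t for a, t in zip(attributes, combo) if t is not None}
        if not rule:
            continue
        if any(matches(rule, r, s) for r, s in negatives):
            continue
        cov = sum(matches(rule, r, s) for r, s in positives)
        if cov > best_cov:
            best_rule, best_cov = rule, cov
    return best_rule, best_cov


# Hypothetical labeled pairs: positives are the same entity, negatives are not.
positives = [
    ({"name": "Jon Smith", "city": "Seattle"}, {"name": "John Smith", "city": "Seattle"}),
    ({"name": "Acme Corp", "city": "Madison"}, {"name": "Acme Corp.", "city": "Madison"}),
]
negatives = [
    ({"name": "John Smith", "city": "Seattle"}, {"name": "John Smith", "city": "Boston"}),
    ({"name": "Ann Lee", "city": "Madison"}, {"name": "Bob Ray", "city": "Madison"}),
]

rule, covered = suggest_rule(positives, negatives, ["name", "city"])
print(rule, covered)
```

In this toy data neither attribute alone separates the examples, so the suggested rule has to combine both. A rule of this shape is also easy for a programmer to read and adjust by hand, which is the kind of explainability the abstract asks for. Running such a rule efficiently over large tables would need the RDBMS-supported operators the abstract mentions, which this sketch does not attempt.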