Incorporating string transformations in record matching

Authors:
Arvind Arasu;Surajit Chaudhuri;Kris Ganjam;Raghav Kaushik
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Abstract

Today's record matching infrastructure offers no flexible way to account for synonyms such as "Robert" and "Bob", which refer to the same name, or for more general forms of string transformations such as abbreviations. We expand the problem of record matching to take such user-defined string transformations as input. These transformations, coupled with an underlying similarity function, are used to define the similarity between two strings. We demonstrate the effectiveness of this approach via a fuzzy match operation that looks up an input record against a table of records, with an additional table of transformations as input. We demonstrate improved record matching quality, as well as efficient retrieval based on an index structure that is cognizant of transformations.
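
As an illustration of how user-defined transformations might combine with an underlying similarity function, the following is a minimal sketch and not the authors' implementation. It assumes token-level rules of the form (lhs, rhs), uses token-set Jaccard as a stand-in for the underlying similarity function, and takes the similarity of two strings to be the best score over all rewritten variants of each. The names `jaccard`, `variants`, `transformed_similarity`, and `fuzzy_lookup` are hypothetical, and the lookup is a plain linear scan rather than the transformation-aware index the abstract refers to.

```python
from itertools import product


def jaccard(a, b):
    """Token-set Jaccard similarity (assumed stand-in for the underlying similarity function)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def variants(s, rules):
    """Return every string obtained by optionally rewriting each token once.

    `rules` is a list of (lhs, rhs) single-token pairs; multi-token rules
    are not handled in this sketch.
    """
    options = []
    for tok in s.lower().split():
        alts = {tok}
        for lhs, rhs in rules:
            if tok == lhs:
                alts.add(rhs)
        options.append(sorted(alts))
    return {" ".join(choice) for choice in product(*options)}


def transformed_similarity(a, b, rules, base=jaccard):
    """Best base similarity over all rewritten variants of both strings."""
    return max(base(x, y) for x in variants(a, rules) for y in variants(b, rules))


def fuzzy_lookup(query, table, rules, threshold=0.8):
    """Linear-scan lookup: records whose transformed similarity meets the threshold."""
    scored = [(transformed_similarity(query, rec, rules), rec) for rec in table]
    return sorted([(s, rec) for s, rec in scored if s >= threshold], reverse=True)


# Hypothetical transformation table and reference table.
rules = [("bob", "robert"), ("wa", "washington")]
table = ["Robert Smith", "Alice Jones"]
matches = fuzzy_lookup("Bob Smith", table, rules)
```

Under these assumptions, "Bob Smith" and "Robert Smith" share only one of three distinct tokens under plain Jaccard, but become identical once the rule bob → robert is applied, so the lookup returns the "Robert Smith" record. Enumerating all variants grows exponentially with the number of rewritable tokens, which is one reason a practical system would need an index that is aware of the transformations.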