Learning string transformations from examples

Authors:
Arvind Arasu;Surajit Chaudhuri;Raghav Kaushik
Affiliations:
Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
Proceedings of the VLDB Endowment
Year:
2009


Abstract

"Robert" and "Bob" refer to the same first name but are textually far apart. Traditional string similarity functions do not allow a flexible way to account for such synonyms, abbreviations and aliases. Recently, string transformations have been proposed as a mechanism to make matching robust to such variations. However, in many domains, identifying an appropriate set of transformations is challenging as the space of possible transformations is large. In this paper, we investigate the problem of leveraging examples of matching strings to learn string transformations. We formulate an optimization problem where we are required to learn a concise set of transformations that explain most of the differences. We propose a greedy approximation algorithm for this NP-hard problem. Our experiments over real-life data illustrate the benefits of our approach.
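The abstract does not spell out the greedy algorithm, so the following is only a rough illustration of the general idea, not the authors' method. It treats rule learning as a maximum-coverage problem: each candidate rule is a token rewrite taken from the tokens that differ between two matching strings, and the greedy loop repeatedly picks the rule that explains the most still-unexplained examples. The function names (`candidate_rules`, `greedy_select`), the whitespace tokenization, the budget `k`, and the notion that one applicable rule "explains" an example are all simplifying assumptions made for this sketch.

```python
from collections import defaultdict

def candidate_rules(pairs):
    """Map each candidate rule (lhs, rhs) to the indices of examples it could explain.

    Assumption: a candidate rule pairs any token found only in the first
    string with any token found only in the second string.
    """
    support = defaultdict(set)
    for i, (a, b) in enumerate(pairs):
        ta, tb = a.lower().split(), b.lower().split()
        left = [t for t in ta if t not in tb]
        right = [t for t in tb if t not in ta]
        for lhs in left:
            for rhs in right:
                support[(lhs, rhs)].add(i)
    return support

def greedy_select(pairs, k):
    """Greedily pick at most k rules, each covering the most unexplained examples."""
    support = candidate_rules(pairs)
    uncovered = set(range(len(pairs)))
    chosen = []
    while uncovered and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (lexicographically first rule).
        best = max(sorted(support),
                   key=lambda r: len(support[r] & uncovered),
                   default=None)
        if best is None or not (support[best] & uncovered):
            break  # no remaining rule explains anything new
        chosen.append(best)
        uncovered -= support[best]
    return chosen

# Hypothetical matching pairs, in the spirit of the "Robert"/"Bob" example.
examples = [
    ("Robert Smith", "Bob Smith"),
    ("Robert Jones", "Bob Jones"),
    ("William Brown", "Bill Brown"),
    ("Robert Lee", "Bob Lee"),
]

print(greedy_select(examples, 2))
# [('robert', 'bob'), ('william', 'bill')]
```

The budget `k` stands in for the "concise set" requirement: a smaller `k` forces the selection toward rules that recur across many examples, such as the rewrite from "robert" to "bob" here, and away from rules that appear only once.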