Canonicalization of database records using adaptive similarity measures

Authors:
Aron Culotta;Michael Wick;Robert Hall;Matthew Marzilli;Andrew McCallum
Affiliations:
University of Massachusetts;University of Massachusetts;University of Massachusetts;University of Massachusetts;University of Massachusetts
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Citing 9

On the limited memory BFGS method for large scale optimization
Mathematical Programming: Series A and B

Learning object identification rules for information integration
Information Systems - Data extraction, cleaning and reconciliation

Parallel Optimization: Theory, Algorithms and Applications
Parallel Optimization: Theory, Algorithms and Applications

Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms
EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10

Creating probabilistic databases from information extraction models
VLDB '06 Proceedings of the 32nd international conference on Very large data bases

Multi-field information extraction and cross-document fusion
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics

Online Passive-Aggressive Algorithms
The Journal of Machine Learning Research

Learning field compatibilities to extract database records from unstructured text
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

BLOG: probabilistic models with unknown objects
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence

Cited 7

Canonicalization of graph database records using similarity measures
Proceedings of the 2nd international conference on Ubiquitous information management and communication

A unified approach for schema matching, coreference and canonicalization
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Learning decision trees with taxonomy of propositionalized attributes
Pattern Recognition

An unsupervised approach for product record normalization across different web sites
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2

Propositionalized attribute taxonomies from data for data-driven construction of concise classifiers
Expert Systems with Applications: An International Journal

Name phylogeny: a generative model of string variation
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

WOO: a scalable and multi-tenant platform for continuous knowledge base synthesis
Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.01

Abstract

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. These approaches can incorporate arbitrary textual evidence to select a canonical record. We evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.
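The idea of a "central" record under adjustable edit costs can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`edit_distance`, `canonical_record`), the example strings, and the cost values are all assumptions chosen for illustration, and the costs are set by hand here rather than learned from annotated data. The sketch uses a plain weighted Levenshtein distance and picks the record with the smallest total cost of editing every other record into it.

```python
from typing import Sequence


def edit_distance(source: str, target: str,
                  ins_cost: float = 1.0,
                  del_cost: float = 1.0,
                  sub_cost: float = 1.0) -> float:
    """Weighted Levenshtein cost of editing `source` into `target`."""
    # prev[j] = cost of turning the empty prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        curr = [i * del_cost]  # delete all of source[:i] to reach ""
        for j, t_char in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,       # delete s_char
                curr[j - 1] + ins_cost,   # insert t_char
                prev[j - 1] + (0.0 if s_char == t_char else sub_cost),
            ))
        prev = curr
    return prev[-1]


def canonical_record(records: Sequence[str], **costs: float) -> str:
    """Return the record that is cheapest to reach from all records."""
    return min(
        records,
        key=lambda cand: sum(edit_distance(r, cand, **costs) for r in records),
    )


# Hypothetical mentions of the same venue (illustrative data only).
mentions = [
    "KDD",
    "Conference on Knowledge Discovery and Data Mining",
    "Conference on Knowledge Discovery and Data Mining",
    "Conf. on Knowledge Discovery and Data Mining",
]

expanded = canonical_record(mentions)                   # uniform costs
abbreviated = canonical_record(mentions, del_cost=0.1)  # cheap deletions
```

With uniform costs the expanded form is selected, because it is closest on average to the other mentions. Lowering the deletion cost makes it cheap to shrink long mentions into "KDD", so the abbreviated form becomes the central record, which mirrors the KDD example in the abstract.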