Aligning database columns using mutual information

Authors:
Patrick Pantel;Andrew Philpot;Eduard Hovy
Affiliations:
University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA
Venue:
dg.o '05 Proceedings of the 2005 national conference on Digital government research
Year:
2005

Abstract

As with many large organizations, the Government's data is split in many different ways and is collected at different times by different people. The resulting massive data heterogeneity means government staff cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. A case in point is the California Air Resources Board (CARB), which faces the challenge of integrating the emissions inventory databases belonging to California's 35 air quality management districts to create a state inventory. This inventory must be submitted annually to the US EPA, which in turn must perform quality assurance tests on these inventories and integrate them into a national emissions inventory for use in tracking the effects of national air quality policies. The premise of our research is that it is possible to significantly reduce the amount of manual labor required in database wrapping and integration by automatically learning mappings in the data. In this research, we applied statistical algorithms to discover correspondences across comparable datasets. We have seen particular success with an information-theoretic model, called SIfT (Significance Information for Translation), that performs data-driven column alignments. We have applied SIfT to mapping the Santa Barbara County Air Pollution Control District's 2001 emissions inventory database to the California Air Resources Board statewide inventory database. A fully customizable interface to the SIfT toolkit is available at http://sift.isi.edu/, allowing users to create new alignments, navigate the information-theoretic model, and inspect alignment decisions. More broadly, this work makes strides toward solving a central problem in data management: integrating legacy data.
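
To make the idea of a data-driven column alignment concrete, here is a minimal sketch in Python. It is not the SIfT implementation, which the abstract does not describe. It assumes one plausible reading of the title: each column is represented by the pointwise mutual information (PMI) between the column and the values it contains, and columns from two tables are paired by the cosine similarity of those vectors. The function names (`pmi_vectors`, `cosine`, `align_columns`) and the toy tables are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def pmi_vectors(table):
    """Map each column to {value: PMI(column, value)}.

    `table` is a dict of column name -> list of cell values.
    PMI(c, v) = log( P(c, v) / (P(c) * P(v)) ), estimated from cell counts
    within this one table (an assumption of this sketch).
    """
    counts = {col: Counter(values) for col, values in table.items()}
    total = sum(sum(c.values()) for c in counts.values())
    value_totals = Counter()
    for c in counts.values():
        value_totals.update(c)

    vectors = {}
    for col, c in counts.items():
        col_total = sum(c.values())
        vectors[col] = {
            v: math.log((n * total) / (col_total * value_totals[v]))
            for v, n in c.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_columns(source, target):
    """Pair each source column with its most similar target column."""
    src, tgt = pmi_vectors(source), pmi_vectors(target)
    return {s: max(tgt, key=lambda t: cosine(src[s], tgt[t])) for s in src}


# Hypothetical toy tables: column names differ, but the values overlap.
district = {"facility": ["A1", "A2", "A3"], "pollutant": ["NOX", "CO", "NOX"]}
statewide = {"poll_code": ["CO", "NOX", "SOX"], "fac_id": ["A2", "A3", "A9"]}

if __name__ == "__main__":
    print(align_columns(district, statewide))
```

The sketch relies only on overlapping cell values, so it can pair columns whose names share nothing. A real alignment tool would also need to handle near-matching values, numeric columns, and columns with no counterpart, none of which is attempted here.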