Matching and integration across heterogeneous data sources

Authors:
Patrick Pantel;Andrew Philpot;Eduard Hovy
Affiliations:
University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA
Venue:
dg.o '06 Proceedings of the 2006 international conference on Digital government research
Year:
2006

Abstract

A sea of undifferentiated information is forming from the data collected by people and organizations across government, for different purposes, at different times, and using different methodologies. The resulting massive data heterogeneity calls for automatic methods for aligning, matching, and merging data. In this poster, we describe two systems, Guspin™ and Sift™, for automatically identifying equivalence classes and for aligning data across databases. Our technology, based on principles of information theory, measures the relative importance of individual data items and uses those measurements to quantify the similarity between entities. These systems have been applied to solve real problems faced by the Environmental Protection Agency and its counterparts at the state and local government levels.
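
The idea of weighting data by its information content and using those weights to measure similarity can be illustrated with a small sketch. This is not the Guspin™ or Sift™ implementation, which the abstract does not specify. It assumes one simple scheme: each value is weighted by its self-information (rarer values count more), and columns are compared by cosine similarity. All function names, column names, and sample values below are hypothetical.

```python
import math
from collections import Counter


def information_weights(columns):
    """Assumed weighting: self-information, -log2(fraction of columns containing the value)."""
    n = len(columns)
    column_freq = Counter()
    for values in columns.values():
        column_freq.update(set(values))  # count each value once per column
    return {v: -math.log2(cf / n) for v, cf in column_freq.items()}


def weighted_vector(values, weights):
    """Represent a column as value -> (occurrence count * information weight)."""
    return {v: c * weights[v] for v, c in Counter(values).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_alignment(source, target):
    """Map each source column to the most similar target column."""
    pooled = {("source", k): v for k, v in source.items()}
    pooled.update({("target", k): v for k, v in target.items()})
    weights = information_weights(pooled)
    target_vecs = {k: weighted_vector(v, weights) for k, v in target.items()}
    result = {}
    for name, values in source.items():
        vec = weighted_vector(values, weights)
        result[name] = max(target_vecs, key=lambda t: cosine(vec, target_vecs[t]))
    return result


# Hypothetical sample data: two databases describing the same kind of records.
source = {
    "county": ["Los Angeles", "Orange", "Kern"],
    "pollutant": ["NOx", "CO", "NOx"],
}
target = {
    "region": ["Orange", "Kern", "Ventura"],
    "substance": ["CO", "NOx", "SO2"],
}

alignment = best_alignment(source, target)
print(alignment)
```

On this toy input, the sketch pairs `county` with `region` and `pollutant` with `substance`, because those columns share values while the other pairings share none. A value that appears in every column receives zero weight under the assumed scheme, so it contributes nothing to any similarity score.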