A Framework for Reconciling Attribute Values from Multiple Data Sources

Authors:
Zhengrui Jiang;Sumit Sarkar;Prabuddha De;Debabrata Dey
Affiliations:
College of Business, University of North Alabama, Florence, Alabama 35632;School of Management, University of Texas at Dallas, Richardson, Texas 75083;Krannert School of Management, Purdue University, West Lafayette, Indiana 47907;Michael G. Foster School of Business, University of Washington, Seattle, Washington 98195
Venue:
Management Science
Year:
2007

Cited by:
A Bayesian Approach for Estimating and Replacing Missing Categorical Data. Journal of Data and Information Quality (JDIQ)
Assessing data currency - a probabilistic approach. Journal of Information Science
Data Quality of Query Results with Generalized Selection Conditions. Operations Research
A social network-based inference model for validating customer profile data. MIS Quarterly
Assessment of data quality in accounting data with association rules. Expert Systems with Applications: An International Journal
Identity matching and information acquisition: Estimation of optimal threshold parameters. Decision Support Systems

Abstract

Because of the heterogeneous nature of different data sources, data integration is often one of the most challenging tasks in managing modern information systems. While the existing literature has focused on problems such as schema integration and entity identification, it has largely overlooked a basic question: When an attribute value for a real-world entity is recorded differently in different databases, how should the “best” value be chosen from the set of possible values? This paper provides an answer to this question. We first show how a probability distribution over a set of possible values can be derived. We then demonstrate how these probabilities can be used to solve a given decision problem by minimizing the total cost of type I, type II, and misrepresentation errors. Finally, we propose a framework for integrating multiple data sources when a single “best” value has to be chosen and stored for every attribute of an entity.
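
The two steps named in the abstract, deriving a probability distribution over the candidate values and then choosing the value with the lowest expected cost, can be illustrated with a small sketch. This is not the authors' model: the abstract does not give the formulas, so everything below is an assumption made for illustration. In particular, the sketch assumes a uniform prior over candidates, independent sources, a known per-source reliability, and that a source in error reports one of the other candidates uniformly at random. The names `posterior_over_values`, `best_value`, and the example cost function are hypothetical, and the generic cost function stands in for the paper's type I, type II, and misrepresentation costs without reproducing them.

```python
from typing import Callable, Dict, Hashable, List

Value = Hashable


def posterior_over_values(
    reports: Dict[str, Value],
    reliability: Dict[str, float],
    candidates: List[Value],
) -> Dict[Value, float]:
    """Posterior probability that each candidate is the true value.

    Illustrative assumptions (not taken from the paper): uniform prior,
    independent sources, and a source that errs picks uniformly among
    the remaining candidates.
    """
    k = len(candidates)
    if k < 2:
        raise ValueError("need at least two candidate values")
    weights: Dict[Value, float] = {}
    for v in candidates:
        w = 1.0
        for source, reported in reports.items():
            r = reliability[source]
            # Likelihood of this report if v were the true value.
            w *= r if reported == v else (1.0 - r) / (k - 1)
        weights[v] = w
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def best_value(
    posterior: Dict[Value, float],
    cost: Callable[[Value, Value], float],
) -> Value:
    """Return the candidate whose expected cost of being stored is lowest.

    cost(stored, true) is the cost of storing `stored` when the real
    value is `true`; it is a generic stand-in for the paper's error costs.
    """
    def expected_cost(stored: Value) -> float:
        return sum(p * cost(stored, true) for true, p in posterior.items())

    return min(posterior, key=expected_cost)


if __name__ == "__main__":
    # Two hypothetical databases disagree about a customer's city.
    reports = {"db_a": "Dallas", "db_b": "Seattle"}
    reliability = {"db_a": 0.9, "db_b": 0.6}
    post = posterior_over_values(reports, reliability, ["Dallas", "Seattle"])

    # With a symmetric 0/1 cost the most probable value is chosen.
    zero_one = lambda stored, true: 0.0 if stored == true else 1.0
    print(post, best_value(post, zero_one))
```

Under a symmetric cost the choice reduces to picking the most probable value. With asymmetric costs the stored value can differ from the most probable one, which is why the probabilities and the decision problem are treated as separate steps.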