Entity matching in heterogeneous databases: A logistic regression approach

Authors:
Debabrata Dey
Affiliations:
Michael G. Foster School of Business, University of Washington, Seattle, WA 98195-3200, United States
Venue:
Decision Support Systems
Year:
2008

Abstract

This paper examines a widely encountered data heterogeneity problem, often faced in real-world decision support situations, called the entity heterogeneity problem. It arises when the same real-world entity type is represented using different identifiers in different applications. Supporting real-world decisions often requires identifying which entity in one application corresponds to which entity in a second application. Previous research has proposed decision models to resolve this problem. However, implementing those models requires either estimating probability parameters by manually matching a large sample of the existing data or estimating a distance measure based on user-specified weights. This paper proposes an alternative technique, based on logistic regression, for estimating the matching probabilities to be used in the matching decision model. This approach has been implemented and tested on real-world data. Comparison of the results with those from earlier approaches indicates that the proposed approach performs quite well and is a viable approach in practical situations.
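To make the general idea concrete, the following is a minimal sketch of how a logistic regression could turn attribute-level agreement between two records into a matching probability. It is not the paper's implementation. The attribute names, the toy training pairs, the 0.5 decision threshold, and the functions `fit_logistic`, `match_probability`, and `is_match` are all hypothetical, and plain gradient descent stands in for whatever estimation procedure the author used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def match_probability(w, x):
    """Estimated probability that a record pair refers to the same entity."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def is_match(w, x, threshold=0.5):
    """Hypothetical decision rule: declare a match above the threshold."""
    return match_probability(w, x) >= threshold

# Hypothetical training sample: each row records whether two records agree
# (1) or disagree (0) on name, address, and phone; y marks true matches.
X = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0),
     (0, 1, 0), (0, 0, 1), (0, 0, 0), (1, 1, 0), (1, 0, 0)]
y = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

weights = fit_logistic(X, y)
p_all = match_probability(weights, (1, 1, 1))   # agree on every attribute
p_one = match_probability(weights, (1, 0, 0))   # agree on name only
p_none = match_probability(weights, (0, 0, 0))  # agree on nothing
print(p_all, p_one, p_none)
```

In this sketch the fitted coefficients play the role that user-specified weights play in a distance-based approach, and the labeled pairs are used to fit a small number of coefficients rather than to estimate each probability parameter directly.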