This paper examines a widely encountered data heterogeneity problem, often faced in real-world decision support situations, called the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications. Supporting real-world decisions often requires identifying which entity in one application corresponds to which entity in a second application. Previous research has proposed decision models to resolve this problem. However, implementing those models requires either estimating probability parameters by manually matching a large sample of the existing data or estimating a distance measure based on user-specified weights. This paper proposes an alternative technique, based on logistic regression, for estimating the matching probabilities used in the matching decision model. The approach has been implemented and tested on real-world data. Comparison of the results with those from earlier approaches indicates that the proposed approach performs well and is viable in practical situations.
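
To make the idea of estimating matching probabilities with logistic regression concrete, the following is a minimal sketch and not the paper's implementation. It assumes each candidate record pair has already been reduced to a small feature vector of attribute comparisons. The three features (name similarity, address agreement, phone agreement), the toy labeled pairs, and the names `fit_logistic` and `match_probability` are all hypothetical and chosen only for illustration. The fitting procedure (batch gradient descent with a small L2 penalty) is likewise an assumption.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Fit weights (bias first) by batch gradient descent on the log loss.

    The L2 penalty is applied to the feature weights only, not the bias.
    It keeps the weights finite when the toy data is linearly separable.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [w[0] - lr * grad[0] / n] + [
            wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w[1:], grad[1:])
        ]
    return w


def match_probability(w, x):
    """Estimated probability that a record pair with features x is a match."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Hypothetical labeled sample of record pairs (illustrative values only).
# Features: [name similarity in [0, 1], address agrees (0/1), phone agrees (0/1)]
X = [
    [0.95, 1, 1], [0.90, 1, 0], [0.85, 0, 1], [0.80, 1, 1], [0.75, 1, 0],
    [0.30, 0, 0], [0.20, 1, 0], [0.40, 0, 0], [0.10, 0, 1], [0.60, 0, 0],
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = same real-world entity, 0 = different

w = fit_logistic(X, y)
print(match_probability(w, [0.92, 1, 1]))  # pair that agrees on most attributes
print(match_probability(w, [0.15, 0, 0]))  # pair that agrees on almost nothing
```

The estimated probabilities would then feed a matching decision rule, for example declaring a match when the probability exceeds a chosen threshold. The threshold and any cost structure behind it are left open here because the abstract does not specify them.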