Combining schema and instance information for integrating heterogeneous data sources

Authors:
Huimin Zhao;Sudha Ram
Affiliations:
Sheldon B. Lubar School of Business, University of Wisconsin-Milwaukee, P.O. Box 742, Milwaukee, WI 53201, USA;Department of Management Information Systems, Eller College of Management, University of Arizona, Tucson, AZ, USA
Venue:
Data & Knowledge Engineering
Year:
2007


Abstract

Determining the correspondences among heterogeneous data sources, which is critical to integrating them, is a complex and resource-consuming task that demands automated support. We propose an iterative procedure for detecting both schema-level and instance-level correspondences across heterogeneous data sources. Cluster analysis techniques are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques are then used to identify matching tuples. Statistical analysis techniques are next applied to the resulting preliminary integrated data set to evaluate the relationships among schema elements more accurately. Any improvement in the schema-level correspondences triggers another iteration of the procedure. We have performed an empirical evaluation using real-world heterogeneous data sources and report promising results (i.e., incremental improvement in the identified correspondences) that demonstrate the utility of the proposed procedure.
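
The shape of the iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: as stated assumptions, string similarity on attribute names stands in for cluster analysis, a thresholded best-match rule stands in for the classification step, and mean value agreement over matched tuples stands in for the statistical analysis. All function names, thresholds, and sample records below are hypothetical.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def match_attributes(attrs_a, attrs_b, score, threshold):
    """Greedy one-to-one pairing of attributes, best-scoring pairs first."""
    ranked = sorted(((score(a, b), a, b) for a in attrs_a for b in attrs_b),
                    reverse=True)
    used_a, used_b, mapping = set(), set(), {}
    for s, a, b in ranked:
        if s >= threshold and a not in used_a and b not in used_b:
            mapping[a] = b
            used_a.add(a)
            used_b.add(b)
    return mapping


def match_tuples(rows_a, rows_b, attr_map, threshold=0.8):
    """Pair each row of A with its most similar row of B (stand-in classifier)."""
    matches = []
    if not attr_map:
        return matches
    for i, ra in enumerate(rows_a):
        best_j, best_s = None, 0.0
        for j, rb in enumerate(rows_b):
            s = sum(sim(ra[a], rb[b]) for a, b in attr_map.items()) / len(attr_map)
            if s > best_s:
                best_j, best_s = j, s
        if best_j is not None and best_s >= threshold:
            matches.append((i, best_j))
    return matches


def integrate(rows_a, rows_b, name_threshold=0.5, value_threshold=0.6, max_iter=5):
    """Alternate schema-level and instance-level matching until stable."""
    attrs_a, attrs_b = list(rows_a[0]), list(rows_b[0])
    # Step 1: schema-level correspondences from attribute names only.
    attr_map = match_attributes(attrs_a, attrs_b, sim, name_threshold)
    tuple_matches = []
    for _ in range(max_iter):
        # Step 2: instance-level correspondences based on the current schema map.
        tuple_matches = match_tuples(rows_a, rows_b, attr_map)
        if not tuple_matches:
            break

        # Step 3: re-score attribute pairs by value agreement on matched tuples.
        def value_score(a, b):
            total = sum(sim(rows_a[i][a], rows_b[j][b]) for i, j in tuple_matches)
            return total / len(tuple_matches)

        refined = match_attributes(attrs_a, attrs_b, value_score, value_threshold)
        if refined == attr_map:
            break  # no change in schema-level correspondences, so stop
        attr_map = refined  # a changed map triggers another iteration
    return attr_map, tuple_matches


# Hypothetical sample records from two sources with different attribute names.
rows_a = [{"name": "Ann Lee", "phone": "555-0101"},
          {"name": "Bob Ray", "phone": "555-0202"}]
rows_b = [{"full_name": "Ann Lee", "tel": "555-0101"},
          {"full_name": "Bob Ray", "tel": "555-0202"}]

attr_map, tuple_matches = integrate(rows_a, rows_b)
print(attr_map)
print(tuple_matches)
```

In this toy data the attribute names alone link only `name` to `full_name`; the `phone`/`tel` pair is recovered in the next pass from the values of the matched tuples, which is the kind of incremental improvement the abstract refers to. Real data would call for proper clustering, a trained classifier, and statistical tests in place of these stand-ins.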