Approximate matching of textual domain attributes for information source integration

  • Authors:
  • Andreas Koeller; Vinay Keelara

  • Affiliations:
  • Montclair State University, Montclair, NJ; Montclair State University, Montclair, NJ

  • Venue:
  • Proceedings of the 2nd international workshop on Information quality in information systems
  • Year:
  • 2005

Abstract

A key problem in the integration of information sources is the identification of related attributes or objects across independent sources. Inferring such meta-information from source data (rather than from a priori available meta-data, such as attribute names) is sometimes possible. For example, existing algorithms attempt to integrate information sources by finding patterns such as Inclusion Dependencies (INDs) across them. However, INDs are based on exact set inclusion and are thus very strict patterns that rarely hold across independent real-world databases.

We propose two error-tolerant measures, termed Similarity Score and Distribution Score, that help identify related attributes across two independent databases, based on similarities in their data. These measures specifically address the problem of identifying semantic relationships between textual attributes of databases that have few or no equal values.

We also present implementations of these measures and some experimental results.
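To illustrate why exact INDs are so strict for textual attributes, the following minimal sketch checks an inclusion dependency between two small columns and contrasts it with a generic approximate comparison. The abstract does not define the Similarity Score or the Distribution Score, so this is not an implementation of either: the character-trigram Jaccard overlap is a stand-in chosen only for illustration, and the function names and sample values are hypothetical.

```python
def ind_holds(a_values, b_values):
    """Exact inclusion dependency A ⊆ B: every value of A appears verbatim in B."""
    return set(a_values) <= set(b_values)


def ind_coverage(a_values, b_values):
    """Fraction of distinct A values that appear verbatim in B."""
    a, b = set(a_values), set(b_values)
    return len(a & b) / len(a) if a else 1.0


def trigrams(s):
    """Set of lowercase character trigrams of a string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def best_trigram_overlap(value, candidates):
    """Highest Jaccard overlap of trigram sets between value and any candidate.

    Illustrative stand-in only; NOT the paper's Similarity Score.
    """
    g = trigrams(value)
    best = 0.0
    for c in candidates:
        h = trigrams(c)
        union = g | h
        if union:
            best = max(best, len(g & h) / len(union))
    return best


# Hypothetical attribute values from two independent sources.
source_a = ["Montclair State University", "Rutgers University"]
source_b = ["Montclair State Univ.", "Rutgers Univ., New Brunswick"]

exact = ind_holds(source_a, source_b)        # False: no value matches verbatim
coverage = ind_coverage(source_a, source_b)  # 0.0: no equal values at all
approx = [best_trigram_overlap(v, source_b) for v in source_a]

print(exact, coverage, approx)
```

Here the two columns share no equal values, so the exact test rejects them outright, while the approximate comparison still reports a nonzero overlap for each value. This is the situation the abstract describes, where an error-tolerant measure is needed to detect that the attributes are related.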