Applications of corpus-based semantic similarity and word segmentation to database schema matching

  • Authors:
  • Aminul Islam;Diana Inkpen;Iluju Kiringa

  • Affiliations:
  • School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2008

Abstract

In this paper, we present a method for database schema matching: the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration/warehousing, and in semantic web applications. We first present two corpus-based methods: one for determining the semantic similarity of two target words and the other for automatic word segmentation. Then we present a name-based element-level database schema matching method that exploits both the semantic similarity and the word segmentation methods. Our word similarity method uses pointwise mutual information (PMI) to sort lists of important neighbor words of two target words; the words common to both lists are selected and their PMI values are aggregated to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. If any non-matching portions of the text remain, it also applies a modified forward-backward matching technique using maximum length frequency and entropy rate. Finally, we exploit both the semantic similarity and the word segmentation methods in our proposed name-based element-level schema matching method. This method uses a single property (i.e., element name) for schema matching and nevertheless achieves a measure score that is comparable to the methods that use multiple properties (e.g., element name, text description, data instance, context description). Our schema matching method also uses normalized and modified versions of the longest common subsequence string matching algorithm with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
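
Two of the building blocks named in the abstract, PMI and a normalized longest common subsequence, can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: PMI uses the standard textbook definition computed from raw counts, and the normalization (squared LCS length divided by the product of the two string lengths) is one plausible choice rather than necessarily the one used in the paper. The function names `pmi`, `lcs_length`, and `normalized_lcs` are hypothetical.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Standard pointwise mutual information, in bits, from raw counts.

    pair_count: how often x and y co-occur; count_x, count_y: individual
    frequencies; total: corpus size. All values are assumed to be positive.
    """
    if min(pair_count, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((pair_count * total) / (count_x * count_y))


def lcs_length(a, b):
    """Length of the longest common subsequence (classic dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """Assumed normalization: LCS length squared over the product of lengths."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) ** 2 / (len(a) * len(b))
```

Under this assumed normalization, two element names such as `custname` and `customername` share a common subsequence of length 8, so `normalized_lcs("custname", "customername")` evaluates to 64/96, roughly 0.667, while identical names score 1.0 and names with no characters in common score 0.0.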