Semantic Schema Matching without Shared Instances

Authors:
Jeffrey Partyka;Latifur Khan;Bhavani Thuraisingham
Venue:
ICSC '09 Proceedings of the 2009 IEEE International Conference on Semantic Computing
Year:
2009

Abstract

Semantic heterogeneity across data sources remains a widespread and relevant problem requiring innovative solutions. Our approach to resolving semantic disparities among distinct data sources aligns their constituent tables by first choosing attributes for comparison. We then examine the instances of those attributes and calculate a similarity value between them known as entropy-based distribution (EBD). One method of calculating EBD applies a state-of-the-art instance matching strategy based on N-grams in the data. However, this method often fails because it relies on shared instance data to determine similarity. The result is an overestimation of semantic similarity between unrelated attributes and an underestimation of semantic similarity between related attributes. Our method resolves this by clustering the instances using a measure known as Normalized Google Distance. The EBD is then calculated among all clusters by treating each cluster as a type. We demonstrate the effectiveness of our approach over the traditional N-gram approach on multi-jurisdictional datasets.
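
The two measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The Normalized Google Distance function follows the standard published formula, computed from hit counts that the caller supplies (the counts and the index size n are hypothetical inputs here). The abstract does not define EBD, so the ebd function below assumes it is the ratio of the conditional entropy of the attribute given the cluster type to the entropy of the attribute, H(C|T) / H(C); the function names and the input layout are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: number of pages containing term x, term y
    fxy:    number of pages containing both terms
    n:      total number of indexed pages (hypothetical normalizer)
    """
    if fxy == 0:
        return float("inf")  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def ebd(pairs):
    """ASSUMED definition: EBD = H(C|T) / H(C).

    pairs: list of (attribute, cluster_type) tuples, one per instance,
    where cluster_type is the cluster the instance was assigned to.
    """
    attributes = [c for c, _ in pairs]
    h_c = entropy(attributes)
    if h_c == 0:
        return 0.0  # only one attribute present; ratio is undefined
    by_type = {}
    for c, t in pairs:
        by_type.setdefault(t, []).append(c)
    total = len(pairs)
    # Conditional entropy: per-type entropy weighted by type frequency.
    h_c_given_t = sum(len(v) / total * entropy(v) for v in by_type.values())
    return h_c_given_t / h_c


# Instances of attributes A and B spread evenly over both clusters:
mixed = [("A", "t1"), ("B", "t1"), ("A", "t2"), ("B", "t2")]
# Instances of A and B falling into disjoint clusters:
disjoint = [("A", "t1"), ("A", "t1"), ("B", "t2"), ("B", "t2")]
print(ebd(mixed), ebd(disjoint))
```

Under the assumed definition, a value near 1 means the clusters contain instances of both attributes in similar proportions, which suggests the attributes are semantically related, while a value near 0 means each cluster is dominated by one attribute. Because the clusters are formed from term distances rather than exact string overlap, two attributes can score as related without sharing any instance values.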