A unifying semantic distance model for determining the similarity of attribute values

  • Authors:
  • John F. Roddick;Kathleen Hornsby;Denise de Vries

  • Affiliations:
  • School of Informatics and Engineering, Flinders University of South Australia, PO Box 2100, Adelaide 5001, South Australia;National Centre for Geographic Information and Analysis, University of Maine, Orono, Maine;School of Informatics and Engineering, Flinders University of South Australia, PO Box 2100, Adelaide 5001, South Australia

  • Venue:
  • ACSC '03 Proceedings of the 26th Australasian computer science conference - Volume 16
  • Year:
  • 2003

Abstract

The relative difference between two data values is of interest in a number of application domains including temporal and spatial applications, schema versioning, data warehousing (particularly data preparation), internet searching, validation and error correction, and data mining. Moreover, consistency across systems in determining such distances and the robustness of such calculations is essential in some domains and useful in many. Despite this, there is no generally adopted approach to determining such distances and no accommodation of distance within SQL or any commercially available DBMS.

For non-numeric data values, calculating the difference between values often requires application-specific support, but even for numeric values the practical distance between two values may not simply be their numeric difference or Euclidean distance.

In this paper, a model of semantic distance is developed in which a graph-based approach is used to quantify the distance between two data values. The approach facilitates a notion of distance, both as a simple traversal distance and as weighted arcs. Transition costs, as an additional expense of passing through a node, are also accommodated. Furthermore, multiple distance measures can be incorporated and a method of 'localisation' is discussed which allows relevant information to take precedence over less relevant information. Some results from our investigations, including our SQL-based implementation, are presented.
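The graph-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model or their SQL implementation, neither of which is specified here. It assumes that attribute values are nodes, that arcs carry non-negative weights, and that a transition cost is charged only when a path passes through an intermediate node. The function name `semantic_distance`, the colour values, and all weights are hypothetical, and Dijkstra's algorithm is used as one plausible way of finding the cheapest path.

```python
import heapq


def semantic_distance(arcs, transition_cost, source, target):
    """Cheapest path cost between two values in a weighted graph.

    arcs: dict mapping node -> {neighbour: arc weight} (assumed non-negative)
    transition_cost: dict mapping node -> cost of passing through that node
                     (assumption: charged for intermediate nodes only)
    Returns float("inf") if target cannot be reached from source.
    """
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        # Leaving an intermediate node incurs its transition cost.
        through = 0 if node == source else transition_cost.get(node, 0)
        for neighbour, weight in arcs.get(node, {}).items():
            new_cost = cost + through + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return float("inf")


# Hypothetical value graph: arc weights are illustrative only.
colours = {
    "red": {"orange": 1, "yellow": 3},
    "orange": {"red": 1, "yellow": 1},
    "yellow": {"red": 3, "orange": 1},
}

# Weighted arcs only: red -> orange -> yellow costs 1 + 1.
print(semantic_distance(colours, {}, "red", "yellow"))  # 2

# A transition cost of 2 at "orange" makes the direct arc cheaper.
print(semantic_distance(colours, {"orange": 2}, "red", "yellow"))  # 3
```

Setting every arc weight to 1 and every transition cost to 0 reduces this to a simple traversal distance, that is, the number of arcs on the shortest path.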