The Normalized Compression Distance as a Distance Measure in Entity Identification

  • Authors:
  • Sebastian Klenk; Dennis Thom; Gunther Heidemann

  • Affiliations:
  • Intelligent Systems Group, Stuttgart University, Stuttgart, Germany 70569, Email: ais@vis.uni-stuttgart.de (all three authors)

  • Venue:
  • ICDM '09 Proceedings of the 9th Industrial Conference on Advances in Data Mining. Applications and Theoretical Aspects
  • Year:
  • 2009

Abstract

Identifying identical entities across heterogeneous data sources still involves a large amount of manual processing, mainly because different sources use different data representations in varying semantic contexts. Up to now, entity identification has required either the (often manual) unification of the different representations or the effort of programming tools with specialized interfaces for each representation type. For large and sparse databases, which are common for medical data, for example, the manual approach becomes infeasible. We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
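
To illustrate the measure named in the title, the following is a minimal sketch of the normalized compression distance in its standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the length of the compressed input. The abstract does not say which compressor or record encoding the authors used, so the choice of zlib, the helper names `compressed_size`, `ncd`, and `best_match`, and the sample records are all assumptions made for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share little. With a real compressor
    the result can slightly exceed 1 for very short inputs.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def best_match(query: bytes, candidates: list[bytes]) -> bytes:
    """Return the candidate record with the smallest NCD to the query."""
    return min(candidates, key=lambda record: ncd(query, record))


# Hypothetical records: the same person in two layouts, plus an unrelated one.
record_a = b"Mueller, Hans; 1956-03-12; Stuttgart; Diabetes mellitus Typ 2"
record_b = b"Hans Mueller | Stuttgart | 12.03.1956 | diabetes mellitus type 2"
record_c = b"Schmidt, Anna; 1987-11-30; Hamburg; Fraktur des Unterarms"

closest = best_match(record_a, [record_b, record_c])
```

Because the distance works on raw bytes, the two layouts of the same record need no prior schema alignment, which is the property the abstract emphasizes. Very short records compress poorly, so in practice a threshold on the distance would have to be tuned per data set.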