Canonicalization of graph database records using similarity measures

  • Authors:
  • Na Li; Qing Li; Liping Wang

  • Affiliations:
  • City University of Hong Kong; City University of Hong Kong; The University of Queensland

  • Venue:
  • Proceedings of the 2nd international conference on Ubiquitous information management and communication
  • Year:
  • 2008

Abstract

Information extraction and crawling from the Web have become increasingly common, yet the raw data are often noisy and redundant because they come from heterogeneous sources. Although much work has focused on duplicate record detection, there has been little investigation into providing users with a uniform, standard result derived from the duplicates. We refer to this result as a canonical result and to the process of producing it as record canonicalization. In this paper, we focus on imperfect and duplicate documents on the Web and propose graph canonicalization as a preprocessing method. We first formalize the problem of graph record canonicalization and then propose three possible solutions in turn. Building on this framework, we implement graph selection canonicalization, which constructs a canonical graph by selecting the central graph among the duplicate records. Experimental results demonstrate its effectiveness in representing real-world entities.
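
The idea of selecting a central graph among duplicate records can be illustrated with a minimal sketch. The abstract does not state which similarity measure or graph representation the paper uses, so everything below is an assumption made for illustration: each record is modeled as a set of labeled edges, similarity is Jaccard overlap of edge sets, and the "central" graph is taken to be the record with the highest total similarity to the others (a medoid). The names jaccard and select_central_graph and the sample records are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a graph record is a set of labeled edges
# (source node, edge label, target node). The paper may model records differently.
Edge = Tuple[str, str, str]
Graph = FrozenSet[Edge]


def jaccard(a: Graph, b: Graph) -> float:
    """Jaccard similarity of two edge sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(a | b)


def select_central_graph(records: List[Graph]) -> Graph:
    """Return the record with the highest summed similarity to all others.

    Ties are broken in favor of the earliest record in the list.
    """
    if not records:
        raise ValueError("at least one record is required")

    def total_similarity(i: int) -> float:
        return sum(
            jaccard(records[i], other)
            for j, other in enumerate(records)
            if j != i
        )

    best_index = max(range(len(records)), key=total_similarity)
    return records[best_index]


# Hypothetical duplicate records describing the same entity "p".
r1: Graph = frozenset({
    ("p", "title", "Sample Title"),
    ("p", "year", "2008"),
    ("p", "author", "A. Author"),
})
r2: Graph = frozenset({
    ("p", "title", "Sample Title"),
    ("p", "year", "2008"),
})
r3: Graph = frozenset({
    ("p", "title", "Sample Title"),
    ("p", "author", "A. Author"),
    ("p", "venue", "Sample Venue"),
})

canonical = select_central_graph([r1, r2, r3])
```

In this toy input, r1 shares the most edges with the other two records, so it is the one selected. A selection approach of this kind always returns one of the existing records unchanged; it does not merge information across duplicates.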