Information extraction and crawling from the Web have become increasingly common, yet the raw data are often noisy and redundant because they come from heterogeneous sources. Although much work has focused on duplicate record detection, little attention has been paid to providing users with a single, standard result derived from the duplicates, which we call a canonical result; the process of producing it is referred to as record canonicalization. In this paper, we address the situation of imperfect and duplicate documents on the Web and propose a preprocessing method based on graph canonicalization. We first formalize the problem of canonicalizing graph records and then propose three candidate solutions. On top of this framework, we implement graph selection canonicalization, which constructs a canonical graph by selecting the central graph among the duplicate records. Experimental results demonstrate its performance in representing real-world entities.
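The idea of selecting a central graph can be sketched as medoid selection: among the duplicate records, pick the one whose total distance to all the others is smallest. The sketch below is a hypothetical illustration, not the paper's actual algorithm; it models each record as a set of edges and uses Jaccard distance as a stand-in for whatever graph similarity measure is actually employed.

```python
# Hypothetical sketch of graph selection canonicalization.
# Each duplicate record is an edge set; the canonical record is the
# medoid, i.e. the graph with the smallest total distance to all others.
# Jaccard distance here is only a placeholder similarity measure.

def jaccard_distance(a, b):
    """Distance between two edge sets: 1 - |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def select_canonical(graphs):
    """Return the index of the medoid graph among duplicate records."""
    totals = [sum(jaccard_distance(g, h) for h in graphs) for g in graphs]
    return min(range(len(graphs)), key=totals.__getitem__)

# Three noisy duplicates of the same entity, as (subject, attr, value) edges.
records = [
    {("recipe", "title", "pancakes"), ("recipe", "serves", "4")},
    {("recipe", "title", "pancakes"), ("recipe", "serves", "4"),
     ("recipe", "time", "20min")},
    {("recipe", "title", "pancake"), ("recipe", "serves", "4")},
]
print(select_canonical(records))  # -> 0: the first record is most central
```

In this toy example the first record wins because it shares two edges with each of the others, making its summed distance the smallest; a real system would substitute a graph-structural similarity in place of the set-based distance.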