Information extraction and crawling from the Web have become increasingly common, yet the raw data are often noisy and redundant because they come from heterogeneous sources. Although much work has focused on duplicate record detection, little attention has been paid to providing users with a single, standard result derived from the duplicates, which we call a canonical result; the process of producing it is referred to as record canonicalization. In this paper, we address the situation of imperfect and duplicate documents on the Web and propose a preprocessing method based on graph canonicalization. We first formalize the problem of canonicalizing graph records and then propose three candidate solutions. On top of this framework, we implement graph selection canonicalization, which constructs a canonical graph by selecting the central graph among the duplicate records. Experimental results demonstrate its performance in representing real-world entities.
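The idea of selecting a central graph can be sketched as medoid selection: among the duplicate records, pick the one whose total distance to all the others is smallest. The sketch below is a hypothetical illustration, not the paper's actual algorithm; it models each record as a set of edges and uses Jaccard distance as a stand-in for whatever graph similarity measure is actually employed.

```python
# Hypothetical sketch of graph selection canonicalization.
# Each duplicate record is an edge set; the canonical record is the
# medoid, i.e. the graph with the smallest total distance to all others.
# Jaccard distance here is only a placeholder similarity measure.

def jaccard_distance(a, b):
    """Distance between two edge sets: 1 - |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def select_canonical(graphs):
    """Return the index of the medoid graph among duplicate records."""
    totals = [sum(jaccard_distance(g, h) for h in graphs) for g in graphs]
    return min(range(len(graphs)), key=totals.__getitem__)

# Three noisy duplicates of the same entity, as (subject, attr, value) edges.
records = [
    {("recipe", "title", "pancakes"), ("recipe", "serves", "4")},
    {("recipe", "title", "pancakes"), ("recipe", "serves", "4"),
     ("recipe", "time", "20min")},
    {("recipe", "title", "pancake"), ("recipe", "serves", "4")},
]
print(select_canonical(records))  # -> 0: the first record is most central
```

In this toy example the first record wins because it shares two edges with each of the others, making its summed distance the smallest; a real system would substitute a graph-structural similarity in place of the set-based distance.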