A constrained clustering approach to duplicate detection among relational data

Authors:
Chao Wang; Jie Lu; Guangquan Zhang
Affiliations:
Faculty of Information Technology, University of Technology, Sydney, Broadway, NSW, Australia (all three authors)
Venue:
PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2007

Abstract

This paper proposes an approach to detecting duplicates among relational data. Traditional methods for record linkage or duplicate detection work on a set of records that have no explicit relations with each other. These records can be formatted into a single database table for processing. However, there are situations in which records from different sources cannot be flattened into one table, and records within one source have certain (semantic) relations between them. Duplicate detection for such relational data records/instances can be handled by formatting them into several tables and applying traditional methods to each table. However, because the relations among the original data records are ignored, this approach generates poor or inconsistent results. This paper analyzes the characteristics of relational data and proposes a clustering approach to duplicate detection. The approach incorporates constraint rules derived from the characteristics of relational data and, in our experiments, yields better and more consistent results.
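
The abstract does not spell out the constraint rules or the clustering algorithm, so the following is only an illustrative sketch of the general idea of clustering under constraints, not the authors' method. It assumes one hypothetical rule (two records that are related within the same source must not be merged, expressed as a "cannot-link" pair), a generic string similarity from Python's standard library, and a greedy single-link merge. The names `similarity`, `constrained_clusters`, the record ids, and the threshold are all invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; a stand-in for any record-level measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def constrained_clusters(records, cannot_link, threshold=0.8):
    """Greedy single-link clustering that never merges across a cannot-link pair.

    records:     dict mapping record id -> text
    cannot_link: set of frozenset({id_a, id_b}) pairs that must stay apart
                 (a hypothetical constraint rule, e.g. two related records
                 inside one source)
    """
    # Every record starts in its own cluster.
    cluster_of = {rid: {rid} for rid in records}

    # Score all candidate pairs and visit them from most to least similar.
    pairs = sorted(
        ((similarity(records[a], records[b]), a, b)
         for a, b in combinations(records, 2)),
        reverse=True,
    )

    for score, a, b in pairs:
        if score < threshold:
            break  # remaining pairs are even less similar
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue  # already in the same cluster
        # Reject the merge if it would put a forbidden pair together.
        if any(frozenset((x, y)) in cannot_link for x in ca for y in cb):
            continue
        merged = ca | cb
        for rid in merged:
            cluster_of[rid] = merged

    # Collapse shared cluster objects and return them in a stable order.
    unique = {id(c): c for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique.values())


# Hypothetical data: ids are prefixed with the source they come from.
records = {
    "s1:p1": "Database Systems",
    "s1:p2": "Database System",
    "s2:p1": "database systems",
}
# Assumed rule: the two records from source s1 are related, so they are distinct.
cannot_link = {frozenset(("s1:p1", "s1:p2"))}

print(constrained_clusters(records, cannot_link))
print(constrained_clusters(records, set()))
```

With the constraint in place, the near-identical records from source `s1` stay in separate clusters even though their text is highly similar; with an empty constraint set, plain similarity merges all three. This mirrors the abstract's point that ignoring relations among records can give inconsistent results.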