Joint deduplication of multiple record types in relational data

Authors:
Aron Culotta; Andrew McCallum
Affiliations:
University of Massachusetts, Amherst, MA; University of Massachusetts, Amherst, MA
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005

Citing 2

Fast Approximate Energy Minimization via Graph Cuts
IEEE Transactions on Pattern Analysis and Machine Intelligence

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Cited 18

Link mining: a survey
ACM SIGKDD Explorations Newsletter

Adaptive graphical approach to entity resolution
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Unsupervised deduplication using cross-field dependencies
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms
ILP '08 Proceedings of the 18th international conference on Inductive Logic Programming

A Graph Partitioning Approach to Entity Disambiguation Using Uncertain Information
GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing

Exploiting context analysis for combining multiple entity resolution systems
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Online collective entity resolution
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2

Frameworks for entity matching: A comparison
Data & Knowledge Engineering

A constrained clustering approach to duplicate detection among relational data
PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining

Scaling record linkage to non-uniform distributed class sizes
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining

Detecting duplicate biological entities using Shortest Path Edit Distance
International Journal of Data Mining and Bioinformatics

EIF: a framework of effective entity identification
WAIM'10 Proceedings of the 11th international conference on Web-age information management

Modeling relations and their mentions without labeled text
ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part III

Evaluation of entity resolution approaches on real-world match problems
Proceedings of the VLDB Endowment

Identity matching using personal and social identity features
Information Systems Frontiers

ReDD-Observatory: Using the Web of Data for Evaluating the Research-Disease Disparity
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01

Location-based reasoning about complex multi-agent behavior
Journal of Artificial Intelligence Research

Joint entity resolution on multiple datasets
The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. For two citation matching datasets, we show that collectively deduplicating paper and venue records results in up to a 30% error reduction in venue deduplication, and up to a 20% error reduction in paper deduplication.