Record linkage with uniqueness constraints and erroneous values

Authors:
Songtao Guo;Xin Luna Dong;Divesh Srivastava;Remi Zajac
Affiliations:
AT&T Interactive Research;AT&T Labs-Research;AT&T Labs-Research;AT&T Interactive Research
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

Many data-management applications require integrating data from a variety of sources, where different sources may refer to the same real-world entity in different ways and some may even provide erroneous data. An important task in this process is to recognize and merge the various references that refer to the same entity. In practice, some attributes satisfy a uniqueness constraint: each real-world entity (or most entities) has a unique value for the attribute (e.g., business contact phone, address, and email). Traditional techniques tackle this case by first linking records that are likely to refer to the same real-world entity, and then fusing the linked records and resolving any conflicts. Such methods can fall short for three reasons: first, erroneous values from sources may prevent correct linking; second, the real world may contain exceptions to the uniqueness constraints, and always enforcing uniqueness can miss correct values; third, locally resolving conflicts for linked records may overlook important global evidence. This paper proposes a novel technique to solve this problem. The key component of our solution is to reduce the problem to a k-partite graph clustering problem and to consider in clustering both the similarity of attribute values and the sources that associate a pair of values in the same record. Thus, we perform global linkage and fusion simultaneously, and can identify incorrect values and differentiate them from alternative representations of the correct value from the beginning. In addition, we extend our algorithm to tolerate a few violations of the uniqueness constraints. Experimental results show the accuracy and scalability of our technique.
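
The graph construction mentioned in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the toy records, the attribute names, and the function `build_kpartite_graph` are all hypothetical. Each attribute forms one partition of nodes (its distinct values), and an edge between two values of different attributes records the set of sources that place both values in the same record. The clustering step itself is not shown.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: (source id, record) pairs. Not data from the paper.
RECORDS = [
    ("s1", {"name": "Acme Diner", "phone": "555-0100", "address": "1 Main St"}),
    ("s2", {"name": "Acme Diner", "phone": "555-0100", "address": "1 Main Street"}),
    ("s3", {"name": "Acme Dinner", "phone": "555-0199", "address": "1 Main St"}),
]


def build_kpartite_graph(records):
    """Build a k-partite graph from multi-source records (illustrative only).

    Returns (nodes, edges) where
      nodes: attribute -> set of distinct values (one partition per attribute)
      edges: ((attr_a, value_a), (attr_b, value_b)) -> set of supporting sources
    Edge keys are ordered by attribute name so each pair has one canonical key.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for source, record in records:
        for attr, value in record.items():
            nodes[attr].add(value)
        # A record has one value per attribute, so every pair of its items
        # connects two different partitions.
        for left, right in combinations(sorted(record.items()), 2):
            edges[(left, right)].add(source)
    return dict(nodes), dict(edges)


nodes, edges = build_kpartite_graph(RECORDS)
for (left, right), sources in sorted(edges.items()):
    print(left, "--", right, "supported by", sorted(sources))
```

In this toy input, two sources agree on the name and phone pair while a third reports a different phone for a similar name. That is the kind of evidence a clustering step could weigh, together with value similarity, when deciding whether a value is an error or an alternative representation.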