Many data-management applications require integrating data from a variety of sources, where different sources may refer to the same real-world entity in different ways and some may even provide erroneous data. An important task in this process is to recognize and merge the various references to the same entity. In practice, some attributes satisfy a uniqueness constraint: each real-world entity (or most entities) has a unique value for the attribute (e.g., business contact phone, address, and email). Traditional techniques tackle this case by first linking records that are likely to refer to the same real-world entity, and then fusing the linked records and resolving any conflicts. Such methods can fall short for three reasons: first, erroneous values from sources may prevent correct linking; second, the real world may contain exceptions to the uniqueness constraints, so always enforcing uniqueness can miss correct values; third, locally resolving conflicts for linked records may overlook important global evidence. This paper proposes a novel technique to solve this problem. The key component of our solution is to reduce the problem to a k-partite graph clustering problem and to consider, during clustering, both the similarity of attribute values and the sources that associate a pair of values in the same record. Thus, we perform global linkage and fusion simultaneously, and can identify incorrect values and differentiate them from alternative representations of the correct value from the beginning. In addition, we extend our algorithm to tolerate a few violations of the uniqueness constraints. Experimental results show that our technique is accurate and scalable.
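
To make the graph reduction more concrete, here is a minimal sketch of how source records might be turned into a k-partite graph, assuming each record is a mapping from attribute names to values. The function name `build_kpartite_graph`, the record layout, and the sample data are hypothetical and are not taken from the paper; the clustering step itself and any value-similarity measure are not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_kpartite_graph(records):
    """Build a k-partite graph from (source, record) pairs.

    Each node is an (attribute, value) pair, so the nodes fall into one
    partition per attribute. An edge links two values that co-occur in a
    record and is labelled with the set of sources providing that record.
    This layout is an illustrative assumption, not the paper's definition.
    """
    edges = defaultdict(set)
    for source, record in records:
        # Sort so that each undirected edge has one canonical key.
        nodes = sorted(record.items())
        for left, right in combinations(nodes, 2):
            # Attribute names are unique within a record, so an edge never
            # connects two nodes of the same partition.
            edges[(left, right)].add(source)
    return dict(edges)


# Hypothetical input: three sources describing what may be one business.
records = [
    ("s1", {"name": "Acme Corp", "phone": "555-0100"}),
    ("s2", {"name": "Acme Corp", "phone": "555-0100"}),
    ("s3", {"name": "Acme Corp.", "phone": "555-0199"}),
]
graph = build_kpartite_graph(records)
```

In this sketch, the edge between "Acme Corp" and "555-0100" is supported by two sources, while the edge between "Acme Corp." and "555-0199" is supported by one. A clustering step could weigh that kind of source support together with the similarity between the two name strings when deciding which values belong to the same entity.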