Entity matching is the problem of deciding whether two given mentions in the data, such as "Helen Hunt" and "H. M. Hunt", refer to the same real-world entity. Numerous solutions have been developed, but they have not considered in depth the problem of exploiting integrity constraints that frequently exist in the domains. Examples of such constraints include "a mention with age two cannot match a mention with salary 200K" and "if two paper citations match, then their authors are likely to match in the same order". In this paper, we describe a probabilistic solution to entity matching that exploits such constraints to improve matching accuracy. At the heart of the solution is a generative model that takes the constraints into account during the generation process and provides well-defined interpretations of the constraints. We describe a novel combination of EM and relaxation labeling algorithms that efficiently learns the model, thereby matching mentions in an unsupervised way, without the need for annotated training data. Experiments on several real-world domains show that our solution can exploit constraints to significantly improve matching accuracy, by 3-12% in F1, and that the solution scales up to large data sets.
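To make the role of a constraint concrete, the sketch below shows how a hard constraint such as the age/salary example might veto a match that string similarity alone would allow. It is a toy illustration only, not the generative model or the EM and relaxation labeling procedure described above. The `Mention` class, the `violates_age_salary` and `match_score` functions, and the numeric thresholds are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Mention:
    """A hypothetical mention record; unknown attributes are None."""
    name: str
    age: Optional[int] = None
    salary: Optional[int] = None


def violates_age_salary(a: Mention, b: Mention,
                        max_child_age: int = 10,
                        min_high_salary: int = 100_000) -> bool:
    """Assumed hard constraint: a young child cannot match a high earner.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    def clash(x: Mention, y: Mention) -> bool:
        return (x.age is not None and x.age <= max_child_age
                and y.salary is not None and y.salary >= min_high_salary)

    # The constraint is symmetric, so check both orderings.
    return clash(a, b) or clash(b, a)


def match_score(a: Mention, b: Mention) -> float:
    """Name similarity in [0, 1], forced to 0.0 when the constraint is violated."""
    if violates_age_salary(a, b):
        return 0.0
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()


if __name__ == "__main__":
    toddler = Mention("Helen Hunt", age=2)
    earner = Mention("H. M. Hunt", salary=200_000)
    print(match_score(toddler, earner))                           # vetoed by the constraint
    print(match_score(Mention("Helen Hunt"), Mention("H. M. Hunt")))  # similarity only
```

A soft constraint, such as the one about author order in matching citations, would adjust the score rather than set it to zero. The paper handles both kinds probabilistically inside the model rather than as a post-hoc filter like this one.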