Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used to evaluate ER results. However, ER measures tend to be chosen in an ad hoc fashion, without careful thought as to what defines a good result for the specific application at hand. This paper makes two contributions. First, we analyze existing ER measures and show that they often conflict with each other, ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER, called "generalized merge distance" (GMD), which is inspired by the edit distance of strings and uses cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured, which makes the characteristics of any particular GMD measure easy to understand. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable GMD measure, and the widely used pairwise F1 measure can be computed directly using GMD. We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.
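
To make the idea of a split/merge distance with configurable costs concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes that both clusterings cover the same records, that a result cluster is first split along the ground-truth boundaries and the pieces are then merged, and that costs depend only on the sizes of the clusters involved. The names `pairwise_f1`, `merge_distance`, `split_cost`, and `merge_cost` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_f1(result, truth):
    """Pairwise F1 of `result` against `truth` (both: lists of sets of record ids)."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

    result_pairs, truth_pairs = pairs(result), pairs(truth)
    common = len(result_pairs & truth_pairs)
    precision = common / len(result_pairs) if result_pairs else 1.0
    recall = common / len(truth_pairs) if truth_pairs else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def merge_distance(result, truth, split_cost, merge_cost):
    """Cost of turning `result` into `truth` by splits followed by merges.

    Assumption: split_cost(x, y) and merge_cost(x, y) take the sizes of the
    two clusters being separated or joined. One pass over the records.
    """
    owner = {rec: i for i, cluster in enumerate(truth) for rec in cluster}
    merged_size = defaultdict(int)  # records gathered so far per truth cluster
    cost = 0
    for cluster in result:
        # Count how many records of this cluster fall in each truth cluster.
        parts = defaultdict(int)
        for rec in cluster:
            parts[owner[rec]] += 1
        remaining = len(cluster)
        for truth_id, size in parts.items():
            if size < remaining:  # peel this piece off the rest
                cost += split_cost(size, remaining - size)
            remaining -= size
            if merged_size[truth_id] > 0:  # join with earlier pieces
                cost += merge_cost(size, merged_size[truth_id])
            merged_size[truth_id] += size
    return cost


result = [{"a", "b"}, {"c", "d"}]
truth = [{"a", "b", "c"}, {"d"}]
unit = lambda x, y: 1
product = lambda x, y: x * y
zero = lambda x, y: 0

operations = merge_distance(result, truth, unit, unit)       # one split, one merge
wrong_pairs = merge_distance(result, truth, product, zero)   # pairs only in result
missed_pairs = merge_distance(result, truth, zero, product)  # pairs only in truth
```

In this sketch, product costs make the split-only distance count the record pairs that the result groups incorrectly, and the merge-only distance count the pairs it misses. Dividing by the number of pairs in the result and in the ground truth, respectively, recovers pairwise precision and recall for this example, which illustrates how a pairwise measure can be derived from a split/merge distance.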