Entity resolution (ER) is a computationally hard problem in data integration scenarios, where database records have to be grouped according to the real-world entities they belong to. In practice, an entity may consist of only a few records from different data sources, some of which contain typos or historical data. In other cases an entity may contain significantly more records, especially when we search for entities at a higher level of a concept hierarchy than individual records. In this paper we give a theoretical foundation for a variety of practically important match functions. We show that, under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record matching, enabling flexible definition of match functions. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.
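
To make the general idea concrete, the following is a minimal single-process sketch of MapReduce-style ER with a similarity-based match function. It is not the algorithm of the paper. The sample records, the blocking key (first token of the address), the 0.7 similarity threshold, and the names `map_phase`, `reduce_phase`, `similar`, and `resolve` are all assumptions made for illustration. A real deployment would run the map and reduce steps on a distributed framework rather than in one process.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical client records: (record_id, surname, address).
RECORDS = [
    (1, "Smith", "12 Oak Street"),
    (2, "Smith", "12 Oak St."),
    (3, "Smyth", "12 Oak Street"),
    (4, "Jones", "7 Elm Road"),
    (5, "Jones", "7 Elm Rd"),
    (6, "Brown", "99 Pine Avenue"),
]


def map_phase(record):
    """Emit (blocking_key, record). The key is an assumed heuristic:
    the first token of the address (the house number)."""
    _, _, address = record
    yield address.split()[0], record


def similar(a, b, threshold=0.7):
    """Similarity-based match on the address field.
    The threshold value is an assumption, not taken from the paper."""
    ratio = SequenceMatcher(None, a[2].lower(), b[2].lower()).ratio()
    return ratio >= threshold


def reduce_phase(key, records):
    """Compare all record pairs inside one block; emit matching id pairs."""
    for a, b in combinations(records, 2):
        if similar(a, b):
            yield a[0], b[0]


def resolve(records):
    """Simulate map, shuffle, and reduce, then merge matches into entities."""
    # Shuffle step: group mapper output by blocking key.
    blocks = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            blocks[key].append(value)

    # Union-find over record ids, so that matches are closed transitively.
    parent = {r[0]: r[0] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for key, block in blocks.items():
        for a, b in reduce_phase(key, block):
            parent[find(a)] = find(b)

    # Collect the resulting groups (here: assumed "households").
    groups = defaultdict(set)
    for rid in parent:
        groups[find(rid)].add(rid)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print(resolve(RECORDS))
```

On the sample data, records 1 to 3 share one address up to abbreviation and end up in a single group, records 4 and 5 form a second group, and record 6 stays alone. Blocking keeps the pairwise comparisons local to each reducer, but a single very large block still costs quadratic work in that reducer, which is one reason large entities need more careful treatment than this sketch provides.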