Scaling multiple-source entity resolution using statistically efficient transfer learning

Authors:
Sahand N. Negahban; Benjamin I.P. Rubinstein; Jim Gemmell
Affiliations:
MIT, Cambridge, MA, USA; Microsoft Research, Mountain View, CA, USA; Microsoft Research, Mountain View, CA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

We consider a serious, previously unexplored challenge facing almost all approaches to scaling up entity resolution (ER) to multiple data sources: the prohibitive cost of labeling training data for supervised learning of similarity scores for each pair of sources. While a rich literature describes almost all aspects of pairwise ER, this challenge is arising now because of the unprecedented ability to acquire and store data from online sources, interest in ER-driven features such as enriched search verticals, and the distinct noise and missing-data characteristics of each source. We show on real-world and synthetic data that, for state-of-the-art techniques, heterogeneous sources mean that the amount of labeled training data must scale quadratically in the number of sources just to maintain constant precision/recall. We address this challenge with a new transfer learning algorithm that requires far less training data (or, equivalently, achieves superior accuracy with the same data) and is trained using fast convex optimization. The intuition behind our approach is to adaptively share structure learned about one scoring problem with all other scoring problems that have a data source in common. We demonstrate that our theoretically motivated approach improves upon existing techniques for multi-source ER.
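The quadratic scaling claim follows from a counting argument: n sources yield n(n-1)/2 source pairs, each of which is a separate scoring problem. The sketch below illustrates only that count, and which scoring problems share a source with a given one; it is not the authors' algorithm. The source names, the per-pair label budget of 100, and all function names are hypothetical, chosen for illustration.

```python
from itertools import combinations


def pairwise_problems(sources):
    """Return every unordered pair of distinct sources (one scoring problem each)."""
    return list(combinations(sources, 2))


def labels_needed(num_sources, labels_per_pair):
    """Total labels if every source pair needs its own labeled set (assumed budget)."""
    return labels_per_pair * num_sources * (num_sources - 1) // 2


def problems_sharing_a_source(pair, all_pairs):
    """Other scoring problems that have at least one source in common with `pair`."""
    return [p for p in all_pairs if p != pair and set(p) & set(pair)]


# Hypothetical example with four sources and 100 labels per pair.
sources = ["A", "B", "C", "D"]
pairs = pairwise_problems(sources)
print(len(pairs))                # 6 scoring problems
print(labels_needed(4, 100))     # 600 labels
print(labels_needed(8, 100))     # 2800 labels: doubling sources more than quadruples the cost
print(problems_sharing_a_source(("A", "B"), pairs))
```

With four sources, the problem for the pair (A, B) shares a source with four other problems, namely (A, C), (A, D), (B, C) and (B, D). These are the problems with which, per the abstract's intuition, learned structure could be shared.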