Pair-wise entity resolution: overview and challenges

  • Authors: Hector Garcia-Molina
  • Affiliations: Stanford University
  • Venue: CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
  • Year: 2006

Abstract

Information integration is one of the oldest and most important computer science problems: information from diverse sources must be combined so that users can access and manipulate it in a unified way. One of the central problems in information integration is Entity Resolution (ER), sometimes referred to as deduplication. ER is the process of identifying and merging incoming records judged to represent the same real-world entity.

For example, consider a company that has different customer databases (e.g., one for each subsidiary) and would like to integrate them. Identifying matching records is challenging because there are no unique identifiers across the different sources or databases. A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match. Deciding whether records match is often computationally expensive; e.g., it may involve finding maximal common subsequences in two strings. How to combine matching records is often also application dependent. For example, say different phone numbers appear in two records to be merged. In some cases we may wish to keep both of them, while in others we may want to pick just one as the "consolidated" number.

Another source of complexity is that newly merged records may match with other records. For instance, when we combine records r1 and r2 we may obtain a record r12 that now matches r3. The original records, r1 and r2, may not match with r3, but because r12 contains more information about the same real-world entity that r1 and r2 represent, the "connection" to r3 may now be apparent. Such "chained" matches imply that new merged records must be recursively compared to all records.

There are many ways to perform ER, but in this talk I will explore only one general approach, where the decision of which records represent the same real-world entity is made in a pair-wise fashion. Furthermore, we assume that the matching is done by a "black-box" function, which makes our approach generic and applicable to many domains. Thus, given two records, r1 and r2, the match function M(r1, r2) returns true if there is enough evidence in the two records that they both refer to the same real-world entity. We also assume a black-box merge function that combines a pair of matching records.

In this talk I will discuss the advantages and disadvantages of such a generic, pair-wise approach to ER. Even though the approach is relatively simple, there are still many interesting challenges. For instance, how can one minimize the number of invocations of the match and merge black boxes? Are there any properties of the functions that can significantly reduce the number of calls? If multiple processors are available, how can one distribute the computational load? If records have confidences associated with them, how does the problem complexity change, and how can we efficiently find the confidence of the resolved records? In the talk I will address these challenges and report on some preliminary work we have done at Stanford. (This Stanford work is joint with Omar Benjelloun, Tyson Condie, Johnson (Heng) Gong, Jeff Jonas, Hideki Kawai, Tait E. Larson, David Menestrina, Nicolas Pombourcq, Qi Su, Steven Whang, Jennifer Widom.)

For additional information on ER and our Stanford SERF Project, please visit http://www-db.stanford.edu/serf/.
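
To make the pair-wise, black-box setting and the "chained" match concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm presented in the talk: the record model (a set of attribute/value pairs), the `match` rule (at least two shared pairs), the `merge` rule (set union), and the function name `resolve` are all hypothetical choices made for this example. The loop also assumes that repeated merging eventually stops producing new records, which holds for the union merge used here but not for arbitrary black boxes.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical record model: a record is a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def resolve(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Naive worklist loop: merge matching pairs until no pair matches."""
    resolved: List[Record] = []   # records known not to match each other
    pending = list(records)       # records still to be compared
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record goes back on the worklist so it is
            # re-compared against everything ("chained" matches).
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


def match(a: Record, b: Record) -> bool:
    # Illustrative black box: "enough evidence" means two shared pairs.
    return len(a & b) >= 2


def merge(a: Record, b: Record) -> Record:
    # Illustrative black box: keep every value from both records.
    return a | b


r1 = frozenset({("name", "J. Smith"), ("phone", "555-1234"), ("city", "Palo Alto")})
r2 = frozenset({("name", "J. Smith"), ("phone", "555-1234"), ("email", "js@example.com")})
r3 = frozenset({("name", "John Smith"), ("city", "Palo Alto"), ("email", "js@example.com")})

result = resolve([r1, r2, r3], match, merge)
```

In this made-up data, r3 shares only one pair with r1 and one with r2, so it matches neither on its own. Once r1 and r2 are merged, the combined record shares both the city and the email with r3, and all three collapse into a single record. The union merge corresponds to the "keep both phone numbers" policy; a merge that picks one "consolidated" value would be a different black box.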