Industry-scale duplicate detection

  • Authors and affiliations:
  • Melanie Weis (Hasso-Plattner-Institut, Potsdam, Germany)
  • Felix Naumann (Hasso-Plattner-Institut, Potsdam, Germany)
  • Ulrich Jehle (SCHUFA Holding AG, Wiesbaden, Germany)
  • Jens Lufter (SCHUFA Holding AG, Wiesbaden, Germany)
  • Holger Schuster (SCHUFA Holding AG, Wiesbaden, Germany)

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2008

Abstract

Duplicate detection is the process of identifying multiple representations of the same real-world object in a data source. It is a problem of critical importance in many applications, including customer relationship management, personal information management, and data mining. In this paper, we describe how DogmatiX, a research prototype designed to detect duplicates in hierarchical XML data, was successfully extended and applied to a large-scale industrial relational database in cooperation with Schufa Holding AG. Schufa's main business is to store and retrieve the credit histories of over 60 million individuals. Correctly identifying duplicates is critical both for individuals and for companies: on the one hand, an incorrectly identified duplicate can attach a false negative credit history to an individual, who may then no longer be granted credit. On the other hand, it is essential for companies that Schufa detect duplicates of a person who deliberately tries to create a new identity in the database in order to obtain a clean credit history. Besides the quality of duplicate detection, i.e., its effectiveness, scalability cannot be neglected because of the considerable size of the database. We describe our solution to both problems and present a comprehensive evaluation based on large volumes of real-world data.
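To make the general idea concrete (this is a minimal sketch of generic similarity-based duplicate detection, not the DogmatiX algorithm or Schufa's production system; the field names, weights, and threshold are hypothetical), records can be compared pairwise with a weighted field-similarity score:

```python
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib's Ratcliff/Obershelp ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_sim(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities; weights are illustrative."""
    total = sum(weights.values())
    return sum(w * field_sim(r1[f], r2[f]) for f, w in weights.items()) / total

def find_duplicates(records, weights, threshold=0.85):
    """Naive O(n^2) pairwise comparison. At industrial scale this is infeasible;
    real systems first partition the data (blocking) so that only records within
    a partition are compared -- that is the scalability concern the paper raises."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_sim(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs

# Toy person records: the first two plausibly describe the same individual.
records = [
    {"name": "Anna Schmidt", "city": "Wiesbaden", "dob": "1970-04-12"},
    {"name": "Ana Schmit",   "city": "Wiesbaden", "dob": "1970-04-12"},
    {"name": "Peter Maier",  "city": "Potsdam",   "dob": "1985-11-03"},
]
weights = {"name": 0.5, "city": 0.2, "dob": 0.3}
print(find_duplicates(records, weights))
```

The threshold directly trades false positives (wrongly merging two people, harming an individual's credit history) against false negatives (missing a deliberately created second identity), which is exactly the effectiveness tension the abstract describes.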