Development and user experiences of an open source data cleaning, deduplication and record linkage system

  • Authors: Peter Christen
  • Affiliations: The Australian National University, Canberra, Australia
  • Venue: ACM SIGKDD Explorations Newsletter
  • Year: 2009

Abstract

Record linkage, also known as database matching or entity resolution, is now recognised as a core step in the KDD process. Data mining projects increasingly require that information from several sources be combined before the actual mining can be conducted. Also of increasing interest is the deduplication of a single database. The objectives of record linkage and deduplication are to identify, match and merge all records that relate to the same real-world entities. Because real-world data is commonly 'dirty', data cleaning is an important first step in many deduplication, record linkage, and data mining projects. In this paper, an overview of the Febrl (Freely Extensible Biomedical Record Linkage) system is provided, and the results of a recent survey of Febrl users are discussed. Febrl includes a variety of functionalities required for data cleaning, deduplication and record linkage, and it provides a graphical user interface that facilitates its application for users who do not have programming experience.