A branch-and-cut algorithm for the continuous error localization problem in data cleaning

  • Authors:
  • Jorge Riera-Ledesma;Juan-José Salazar-González

  • Affiliations:
  • DEIOC, Universidad de La Laguna, 38271 La Laguna, Spain;DEIOC, Universidad de La Laguna, 38271 La Laguna, Spain

  • Venue:
  • Computers and Operations Research
  • Year:
  • 2007

Quantified Score

Hi-index 0.01

Visualization

Abstract

Data collected by statistical agencies may contain mistakes made during the acquisition, transcription and coding process. Thus, before using all this information to infer statistical properties, the statistical agencies must check the consistency of their information. To this end, each record has to be tested on a set of consistency rules. Therefore, if one of these records does not meet all the consistency rules, it must be established which fields have to be modified in order to make the new record valid with respect to that set of consistency rules. Among all the possible solutions, statistical agencies are interested in finding one in which the number of fields that should be modified is minimum, thus leading to a combinatorial optimization problem known as the Error Localization Problem. This article approaches the optimization problem of finding the smallest set of fields whose values must be changed in order to satisfy a given set of consistency rules. With this purpose in mind an Integer Linear Programming model is proposed for the particular case in which the fields are continuous values and the consistency rules are given by linear inequalities. This model is solved through a branch-and-cut approach based on a Benders' decomposition. The new proposal is compared to others previously published in the literature and tested on benchmark instances. The overall performance of these new algorithms succeeded in solving to optimality difficult instances with up to 100 fields in a record in about 1min of a personal computer.