A branch-and-cut algorithm for the continuous error localization problem in data cleaning

Authors:
Jorge Riera-Ledesma;Juan-José Salazar-González
Affiliations:
DEIOC, Universidad de La Laguna, 38271 La Laguna, Spain;DEIOC, Universidad de La Laguna, 38271 La Laguna, Spain
Venue:
Computers and Operations Research
Year:
2007

Abstract

Data collected by statistical agencies may contain mistakes made during acquisition, transcription and coding. Before using this information to infer statistical properties, the agencies must therefore check its consistency. To this end, each record is tested against a set of consistency rules. If a record does not meet all the rules, it must be established which fields have to be modified to make the record valid with respect to that set of rules. Among all possible solutions, statistical agencies are interested in one in which the number of modified fields is minimum, which leads to a combinatorial optimization problem known as the Error Localization Problem. This article addresses the optimization problem of finding the smallest set of fields whose values must be changed in order to satisfy a given set of consistency rules. To this end, an Integer Linear Programming model is proposed for the particular case in which the fields take continuous values and the consistency rules are given by linear inequalities. This model is solved through a branch-and-cut approach based on Benders' decomposition. The new proposal is compared with others previously published in the literature and tested on benchmark instances. Overall, the new algorithms solved to optimality difficult instances with up to 100 fields per record in about 1 min on a personal computer.
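
To make the problem statement concrete, the following minimal sketch solves a toy instance of the continuous Error Localization Problem by brute force. It is not the branch-and-cut algorithm of the article: it enumerates candidate field sets by increasing size and checks, with Fourier-Motzkin elimination, whether freeing those fields lets the record satisfy every edit. The function names (`feasible`, `localize`), the three-field record and its edits are illustrative assumptions, and the approach is only practical for a handful of fields.

```python
from fractions import Fraction
from itertools import combinations


def feasible(rows):
    """Fourier-Motzkin test: does some x satisfy c.x <= r for every (c, r)?"""
    rows = [([Fraction(v) for v in c], Fraction(r)) for c, r in rows]
    n = len(rows[0][0]) if rows else 0
    for j in range(n):
        pos = [(c, r) for c, r in rows if c[j] > 0]   # upper bounds on x_j
        neg = [(c, r) for c, r in rows if c[j] < 0]   # lower bounds on x_j
        rest = [(c, r) for c, r in rows if c[j] == 0]
        # Pair every upper bound with every lower bound to eliminate x_j.
        for cp, rp in pos:
            for cn, rn in neg:
                c = [cp[k] / cp[j] - cn[k] / cn[j] for k in range(n)]
                rest.append((c, rp / cp[j] - rn / cn[j]))
        rows = rest
    # No variables remain, so each row now reads 0 <= r.
    return all(r >= 0 for _, r in rows)


def localize(A, b, record):
    """Return a smallest set of field indices to change so A x <= b can hold."""
    n = len(record)
    for size in range(n + 1):
        for free in combinations(range(n), size):
            rows = []
            for coeffs, rhs in zip(A, b):
                # Fields outside `free` keep their recorded values.
                fixed = sum(Fraction(coeffs[j]) * Fraction(record[j])
                            for j in range(n) if j not in free)
                rows.append(([coeffs[j] for j in free], Fraction(rhs) - fixed))
            if feasible(rows):
                return set(free)
    return None  # the edits are inconsistent even with every field free


# Hypothetical edits on (turnover, costs, profit), written as A x <= b:
A = [
    [1, -1, -1],   # turnover - costs - profit <= 0
    [-1, 1, 1],    # costs + profit - turnover <= 0 (together: an equality)
    [-1, 0, 2],    # profit <= turnover / 2
    [-1, 0, 0],    # turnover >= 0
    [0, -1, 0],    # costs >= 0
    [0, 0, -1],    # profit >= 0
]
b = [0, 0, 0, 0, 0, 0]
record = [100, 60, 400]          # profit was probably mistyped

print(localize(A, b, record))    # {2}
```

In this instance only the profit field can be changed on its own to restore consistency (to the value 40), so the minimum number of modified fields is one. The enumeration is exponential in the number of fields, which is the difficulty the integer programming model and the branch-and-cut approach described in the abstract are meant to address.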