Imputation of Missing Data in Industrial Databases

  • Authors:
  • Kamakshi Lakshminarayan; Steven A. Harp; Tariq Samad

  • Affiliations:
  • Honeywell Technology Center, 3660 Technology Drive, Minneapolis, MN 55418. laksh004@tc.umn.edu; Honeywell Technology Center, 3660 Technology Drive, Minneapolis, MN 55418. sharp@htc.honeywell.com; Honeywell Technology Center, 3660 Technology Drive, Minneapolis, MN 55418. samad@htc.honeywell.com

  • Venue:
  • Applied Intelligence
  • Year:
  • 1999

Abstract

A limiting factor for the application of intelligent data analysis (IDA) methods in many domains is the incompleteness of data repositories. Many records have fields that are not filled in, especially when data entry is manual. In addition, a significant fraction of the entries can be erroneous and there may be no alternative but to discard these records. But every cell in a database is not an independent datum. Statistical relationships will constrain, and often determine, missing values. Data imputation, the filling in of missing values for partially missing data, can thus be an invaluable first step in many IDA projects. New imputation methods that can handle the large-scale problems and large-scale sparsity of industrial databases are needed. To illustrate the incomplete database problem, we analyze one database with instrumentation maintenance and test records for an industrial process. Despite regulatory requirements for process data collection, this database is less than 50% complete. Next, we discuss possible solutions to the missing data problem. Several approaches to imputation are noted and classified into two categories: data-driven and model-based. We then describe two machine-learning-based approaches that we have worked with. These build upon well-known algorithms: AutoClass and C4.5. Several experiments are designed, all using the maintenance database as a common test-bed but with various data splits and algorithmic variations. Results are generally positive, with imputation accuracies of up to 80%. We conclude the paper by outlining some considerations in selecting imputation methods, and by discussing applications of data imputation for intelligent data analysis.
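
As an illustration of the idea that statistical relationships between fields can constrain missing values, the following minimal Python sketch measures how complete a small table is and fills a missing categorical field with the most common value among records that share another field's value. This is a simple data-driven stand-in, not the AutoClass- or C4.5-based methods of the paper; the field names (`instrument`, `result`), the sample records, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def completeness(records, fields):
    """Return the fraction of cells that are filled in (not None)."""
    total = len(records) * len(fields)
    filled = sum(r.get(f) is not None for r in records for f in fields)
    return filled / total if total else 0.0


def impute_by_conditional_mode(records, target, predictor):
    """Fill missing `target` values with the most common target value
    among records that share the same `predictor` value.

    Falls back to the overall most common target value when the
    predictor is missing or was never seen with a known target.
    """
    by_predictor = defaultdict(Counter)
    overall = Counter()
    for r in records:
        if r.get(target) is not None:
            overall[r[target]] += 1
            if r.get(predictor) is not None:
                by_predictor[r[predictor]][r[target]] += 1

    imputed = []
    for r in records:
        r = dict(r)  # copy so the input records are left unchanged
        if r.get(target) is None:
            counts = by_predictor.get(r.get(predictor)) or overall
            if counts:
                r[target] = counts.most_common(1)[0][0]
        imputed.append(r)
    return imputed


# Hypothetical maintenance-style records; None marks a missing cell.
records = [
    {"instrument": "valve", "result": "pass"},
    {"instrument": "valve", "result": "pass"},
    {"instrument": "valve", "result": None},
    {"instrument": "sensor", "result": "fail"},
    {"instrument": "sensor", "result": None},
    {"instrument": None, "result": None},
]

before = completeness(records, ["instrument", "result"])
filled = impute_by_conditional_mode(records, "result", "instrument")
after = completeness(filled, ["instrument", "result"])
```

In this sketch only the `result` field is imputed, so the record with a missing `instrument` keeps that gap. A model-based method of the kind the abstract describes would instead learn from many fields at once rather than from a single predictor.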