The problem of missing data in geoscience databases

  • Authors:
  • Stephen Henley

  • Affiliations:
  • Resources Computing International Ltd., 185 Starkholmes Road, Matlock, Derbyshire DE4 5JA, UK

  • Venue:
  • Computers & Geosciences
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

SQL is the (more or less) standardised language that is used by the majority of commercial database management systems. However, it is seriously flawed, as has been documented in detail by Date, Darwen, Pascal, and others. One of the most serious problems with SQL is the way it handles missing data. It uses a special value 'NULL' to represent data items whose value is not known. This can have a variety of meanings in different circumstances (such as 'inapplicable' or 'unknown'). The SQL language also allows an 'unknown' truth value in logical expressions. The resulting incomplete three-valued logic leads to inconsistencies in data handling within relational database management systems. Relational database theorists advocate that a strict two-valued logic (true/false) be used instead, with prohibition of the use of NULL, and justify this stance by assertion that it is a true representation of the 'real world'. Nevertheless, in real geoscience data there is a complete gradation between exact values and missing data: for example, geochemical analyses are inexact (and the uncertainty should be recorded); the precision of numeric or textual data may also be expressed qualitatively by terms such as 'approximately' or 'possibly'. Furthermore, some data are by their nature incomplete: for example, where samples could not be collected or measurements could not be taken because of inaccessibility. It is proposed in this paper that the best way to handle such data sets is to replace the closed-world assumption and its concomitant strict two-valued logic, upon which the present relational database model is based, by the open-world assumption which allows for other logical values in addition to the extremes of 'true' and 'false'. Possible frameworks for such a system are explored, and could use Codd's 'marks', Darwen's approach (recording the status of information known about each data item), or other approaches such as fuzzy logic.