Statistics

  • Author: David J. Hand

  • Affiliation: Professor of Statistics, Imperial College, London, United Kingdom

  • Venue: Handbook of data mining and knowledge discovery
  • Year: 2002

Abstract

Statistics and knowledge discovery in databases (KDD) have intersecting aims, namely the discovery of structures and patterns in data sets, but they also have differences. KDD, for example, is almost always concerned solely with the analysis of data, while statistics is also concerned with optimal strategies for collecting data. On the other hand, statisticians are rarely concerned with efficient strategies for searching large databases, while this is often an important concern in KDD exercises. The data sets examined by KDD researchers tend to be larger than those examined by statisticians, and this has implications for the nature of the analytic tools. Sometimes analysis can be based on subsamples of the data, and statistical inferential procedures can then be adopted. Often, however, there is no alternative to analyzing the entire data set, and the relevance of inferential procedures is less obvious. Likewise, in the past, statisticians have mainly been concerned with static data sets, while data miners are often concerned with data sets that are constantly evolving. This also has implications for the nature of analysis: a method that may be optimal from a statistical perspective may not be optimal from a KDD perspective. In modern statistics, models play a central role. In KDD, however, algorithms are more often regarded as central. Indeed, in much knowledge discovery work, patterns rather than models are the structure being sought. There is then a very real danger that the detected structures will either be artifacts of data contamination or simply be due to chance.
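
The danger that detected patterns are simply due to chance can be illustrated with a small simulation. The sketch below is not taken from the chapter; it is a minimal example under stated assumptions: the data are pure noise (binary features generated independently of a binary outcome), each feature is screened with a two-proportion z-test under the normal approximation, and the function names, sample sizes, and the 0.05 threshold are all hypothetical choices made for illustration. A Bonferroni-adjusted threshold is included only as one familiar correction, not as the method the author recommends.

```python
import math
import random


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    if n_a == 0 or n_b == 0:
        return 1.0
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (hits_a / n_a - hits_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def count_chance_patterns(n_rows=500, n_features=200, alpha=0.05, seed=0):
    """Count noise features that pass a significance screen purely by chance."""
    rng = random.Random(seed)
    # Assumption: the outcome is a fair coin flip, unrelated to any feature.
    outcome = [rng.random() < 0.5 for _ in range(n_rows)]
    flagged = 0
    for _ in range(n_features):
        # Each feature is also a fair coin flip, so no real pattern exists.
        feature = [rng.random() < 0.5 for _ in range(n_rows)]
        n_a = sum(feature)
        n_b = n_rows - n_a
        hits_a = sum(1 for f, y in zip(feature, outcome) if f and y)
        hits_b = sum(1 for f, y in zip(feature, outcome) if not f and y)
        if two_proportion_p_value(hits_a, n_a, hits_b, n_b) < alpha:
            flagged += 1
    return flagged


if __name__ == "__main__":
    naive = count_chance_patterns()
    adjusted = count_chance_patterns(alpha=0.05 / 200)  # Bonferroni-style threshold
    print(f"{naive} of 200 noise features pass at alpha = 0.05")
    print(f"{adjusted} of 200 noise features pass at alpha = 0.05 / 200")
```

With an unadjusted threshold of 0.05, roughly one in twenty noise features is expected to pass the screen, so an exhaustive search over many candidate patterns will report some "discoveries" even when no structure exists. Tightening the threshold in proportion to the number of patterns examined reduces such false alarms, at the cost of reduced sensitivity to genuine but weak structure.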