Case studies: Public domain, single mining tasks systems: AutoClass (clustering)

  • Authors: John Stutz
  • Affiliations: Computational Sciences Division, NASA Ames Research Center, Moffett Field, California
  • Venue: Handbook of Data Mining and Knowledge Discovery
  • Year: 2002

Abstract

AutoClass seeks intrinsic clusters, or classes, in a database of instance vectors. It applies a user-specified probabilistic class model and searches for a maximum posterior probability parameterization of a set of such classes; the number of classes is itself one of these parameters. The resulting clustering is locally optimal with respect to the data and the class model. The class model is a product of mutually independent probability distribution or density functions that relate regions of the data space to the individual classes. The fully parameterized classes thus define the relative probability of class membership as a function of location in data space, so the class membership of each instance is a probability mass distribution over the classes. The use of maximum posterior probability parameter estimation, based on minimum-information prior parameter probabilities, precludes the overfitting problems that bedevil maximum likelihood methods. The approach is in principle applicable to any kind of data for which data space clusters can be defined in terms of parameterized probability distributions. In practice, the public domain AutoClass-C is limited to combinations of discrete and number-valued data.
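
The membership computation described in the abstract can be illustrated with a minimal sketch. This is not AutoClass-C code. It assumes, for illustration only, that discrete attributes are modelled by per-class probability tables and number-valued attributes by per-class normal densities, and that the class parameters have already been estimated. The names `class_log_joint` and `membership`, the attribute names, and all parameter values are hypothetical.

```python
import math

def log_gaussian(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

def class_log_joint(instance, cls):
    """Log of class weight times the product of independent attribute terms."""
    total = math.log(cls["weight"])
    # Discrete attributes: look up the probability of the observed value.
    for name, table in cls["discrete"].items():
        total += math.log(table[instance[name]])
    # Number-valued attributes: evaluate the assumed normal density.
    for name, (mean, sd) in cls["real"].items():
        total += log_gaussian(instance[name], mean, sd)
    return total

def membership(instance, classes):
    """Probability mass distribution over classes for one instance."""
    logs = [class_log_joint(instance, c) for c in classes]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

# Two hypothetical, already-parameterized classes over one discrete
# attribute ("color") and one number-valued attribute ("size").
classes = [
    {"weight": 0.6,
     "discrete": {"color": {"red": 0.8, "blue": 0.2}},
     "real": {"size": (1.0, 0.5)}},
    {"weight": 0.4,
     "discrete": {"color": {"red": 0.1, "blue": 0.9}},
     "real": {"size": (3.0, 0.5)}},
]

probs = membership({"color": "red", "size": 1.2}, classes)
print(probs)
```

Each instance receives a probability for every class instead of a single hard label, which matches the abstract's description of membership as a probability mass distribution. The search over parameterizations and over the number of classes is outside the scope of this sketch.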