Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data

Authors:
David M. Rocke; Jian Dai
Affiliations:
Center for Image Processing and Integrated Computing, University of California, Davis, CA 95616, USA (both authors)
Venue:
Data Mining and Knowledge Discovery
Year:
2003

Abstract

This paper describes a clustering method for the unsupervised classification of objects in large data sets. The methodology combines the mixture likelihood approach with a sampling and subsampling strategy so that large data sets can be clustered efficiently. The same sampling strategy can be applied to a wide variety of data mining methods, allowing them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested on a sample from the Digitized Palomar Observatory Sky Survey (DPOSS). The method is quick and reliable, and it produces classifications comparable to those of previous work on these data that used supervised clustering.
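
The general idea of pairing a mixture likelihood fit with sampling can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the details of their sampling and subsampling scheme, so the code below only fits a two-component, one-dimensional Gaussian mixture by EM on a random sample and then assigns every object in the full data set to its most probable component. All names (`fit_mixture_em`, `cluster_by_sampling`, `make_demo_data`), the synthetic data, the sample size, and the initialization are assumptions chosen for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture_em(xs, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture to xs by EM (illustrative only)."""
    n = len(xs)
    ordered = sorted(xs)
    # Crude initialization (an assumption): means at the quartiles, pooled variance.
    means = [ordered[n // 4], ordered[(3 * n) // 4]]
    grand_mean = sum(xs) / n
    pooled_var = sum((x - grand_mean) ** 2 for x in xs) / n
    variances = [pooled_var, pooled_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300  # guard against underflow to zero
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # keep the variance strictly positive
    return weights, means, variances


def classify(x, params):
    """Return the index of the component with the larger weighted density at x."""
    weights, means, variances = params
    scores = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    return 0 if scores[0] >= scores[1] else 1


def cluster_by_sampling(data, sample_size, seed=0):
    """Fit the mixture on a random sample, then label every object in data."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    params = fit_mixture_em(sample)
    labels = [classify(x, params) for x in data]
    return params, labels


def make_demo_data(n_per_class=10000, seed=0):
    """Hypothetical stand-in data: two well-separated 1-D Gaussian classes."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    data += [rng.gauss(5.0, 1.0) for _ in range(n_per_class)]
    truth = [0] * n_per_class + [1] * n_per_class
    return data, truth


if __name__ == "__main__":
    data, truth = make_demo_data()
    params, labels = cluster_by_sampling(data, sample_size=500)
    print("estimated means:", sorted(params[1]))
```

In this sketch the expensive EM iterations touch only the 500 sampled points, while the full data set is visited once for the final assignment. That is the kind of saving a sampling strategy aims for, though the paper's own procedure may differ substantially.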