Sampling for information and structure preservation when mining large data bases

Authors:
Angel Kuri-Morales; Alexis Lozano
Affiliations:
Departamento de Computación, Instituto Tecnológico Autónomo de México, Mexico City, Mexico; Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas, Universidad Nacional Autónoma de México, Mexico City, Mexico
Venue:
IBERAMIA'10 Proceedings of the 12th Ibero-American conference on Advances in artificial intelligence
Year:
2010


Abstract

Identifying data clusters in large databases, an unsupervised learning task in common use nowadays, is computationally very costly. A large volume of data cannot be held in the computer's main memory for analysis. In this paper we propose a methodology (henceforth referred to as "FDM", for fast data mining) to determine the optimal sample from a database according to the relevant information in the data, based on concepts drawn from the statistical theory of communication and L∞ approximation theory. The methodology achieves significant data reduction on real databases and yields cluster models equivalent to those resulting from the original database. Data reduction is accomplished by determining the number of instances required to preserve the information present in the population. Special effort is then put into validating the obtained sample distribution through classical nonparametric statistical tests and other tests based on minimizing the approximation error of polynomial models.
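
The general idea of growing a sample until it preserves the information in the population, and then validating it with a nonparametric test, might be sketched as follows. This is not the FDM procedure itself, whose details the abstract does not give. The equal-width binning, the stopping rule (`tol`, `patience`, `step`), the use of Shannon entropy as the information measure, and the choice of the two-sample Kolmogorov-Smirnov statistic as the validation test are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random
from collections import Counter


def shannon_entropy(values, bins, lo, hi):
    """Entropy in bits of values discretized into equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins or 1.0  # guard against a constant attribute
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_stable_sample(population, step=200, bins=16, tol=0.01,
                          patience=3, seed=0):
    """Grow a random sample in increments of `step` until its entropy changes
    by less than `tol` bits for `patience` consecutive increments.
    Assumed stopping rule, not the one used in the paper."""
    rng = random.Random(seed)
    order = list(population)
    rng.shuffle(order)
    lo, hi = min(order), max(order)  # one full pass to fix the bin range
    prev, calm = None, 0
    for size in range(step, len(order) + 1, step):
        h = shannon_entropy(order[:size], bins, lo, hi)
        calm = calm + 1 if prev is not None and abs(h - prev) < tol else 0
        if calm >= patience:
            return order[:size]
        prev = h
    return order  # entropy never stabilized: fall back to all the data


def ks_statistic(sample, population):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical cumulative distribution functions."""
    a, b = sorted(sample), sorted(population)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Synthetic one-dimensional data with two clusters (illustrative only).
rng = random.Random(42)
population = ([rng.gauss(0, 1) for _ in range(10_000)]
              + [rng.gauss(6, 1) for _ in range(10_000)])
sample = entropy_stable_sample(population)
print(len(sample), ks_statistic(sample, population))
```

A small KS statistic suggests that the sample's distribution stays close to the population's. A real database would need this repeated per attribute, or a multivariate measure, which the sketch does not attempt.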