Identifying data clusters in large databases through unsupervised learning, now in common use, is computationally very costly, and the volume of data involved makes it impossible to process the data in the computer's main memory. In this paper we propose a methodology (henceforth "FDM", for fast data mining) that determines the optimal sample from a database according to the relevant information in the data, based on concepts drawn from the statistical theory of communication and L∞ approximation theory. The methodology achieves significant data reduction on real databases and yields cluster models equivalent to those obtained from the original database. Data reduction is accomplished by determining the number of instances required to preserve the information present in the population. Special effort is then put into validating the distribution of the resulting sample, by applying classical nonparametric statistical tests and other tests based on minimizing the approximation error of polynomial models.
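The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the FDM procedure itself. It assumes a single numeric attribute, uses the Shannon entropy of a fixed-width histogram as the measure of information, grows a random sample until that entropy stops changing, and then checks the sample against the population with a two-sample Kolmogorov-Smirnov statistic. The function names, the stopping rule, and every parameter value (`bins`, `tol`, `patience`, `step`) are hypothetical choices made for this example.

```python
import math
import random
from bisect import bisect_right


def shannon_entropy(values, lo, hi, bins=20):
    """Shannon entropy (in bits) of a fixed-width histogram of `values` on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant data
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def entropy_stable_sample(population, start=100, step=100, tol=0.01,
                          patience=3, seed=0):
    """Grow a random sample until its histogram entropy changes by less than
    `tol` for `patience` consecutive steps (an assumed stopping rule)."""
    rng = random.Random(seed)
    lo, hi = min(population), max(population)
    shuffled = population[:]
    rng.shuffle(shuffled)
    size = min(start, len(shuffled))
    prev = shannon_entropy(shuffled[:size], lo, hi)
    stable = 0
    while size < len(shuffled):
        size = min(size + step, len(shuffled))
        cur = shannon_entropy(shuffled[:size], lo, hi)
        stable = stable + 1 if abs(cur - prev) < tol else 0
        prev = cur
        if stable >= patience:
            break
    return shuffled[:size]


def ks_statistic(sample, population):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical cumulative distribution functions."""
    a, b = sorted(sample), sorted(population)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


if __name__ == "__main__":
    rng = random.Random(1)
    population = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # synthetic data
    sample = entropy_stable_sample(population)
    print("sample size:", len(sample))
    print("KS statistic:", round(ks_statistic(sample, population), 4))
```

A small KS statistic suggests that the reduced sample follows the same distribution as the full data. A real application would need to handle multiple attributes and would compare the statistic against a critical value for the chosen significance level.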