Accuracy estimation with clustered dataset

Authors:
Ricco Rakotomalala;Jean-Hughes Chauchat;Francois Pellegrino
Affiliations:
University of Lyon 2, Lyon - France;University of Lyon 2, Lyon - France;University of Lyon 2, Lyon - France
Venue:
AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61
Year:
2006

Abstract

If the dataset available to machine learning results from cluster sampling (e.g. patients from a sample of hospital wards), the usual cross-validation error rate estimate can lead to biased and misleading results. An adapted cross-validation is described for this case. Using a simulation, the sampling distributions of the generalization error rate estimate under the cluster sampling and simple random sampling hypotheses are compared to the true value. The results highlight the impact of the sampling design on inference: clustering has a significant impact, and the split between learning set and test set should result from a random partition of the clusters, not from a random partition of the examples. With cluster sampling, standard cross-validation underestimates the generalization error rate and is deficient for model selection. These results are illustrated with a real application to the automatic identification of spoken language.
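
The abstract's central recommendation, partitioning clusters rather than examples, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name cluster_kfold, its parameters, and the toy ward labels are all hypothetical, and the sketch assumes only that each example carries a sortable cluster label (for instance, the hospital ward a patient was sampled from).

```python
import random
from collections import defaultdict


def cluster_kfold(cluster_ids, k, seed=0):
    """Yield (train_idx, test_idx) pairs in which whole clusters go to one side.

    Hypothetical helper for illustration; cluster_ids[i] is the cluster
    label of example i, and labels are assumed to be sortable.
    """
    # Group example indices by their cluster label.
    members = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        members[c].append(i)

    clusters = sorted(members)
    if k > len(clusters):
        raise ValueError("k cannot exceed the number of clusters")

    # Shuffle the clusters, not the individual examples.
    rng = random.Random(seed)
    rng.shuffle(clusters)

    for f in range(k):
        # Every k-th cluster, starting at offset f, forms the test fold.
        test_clusters = set(clusters[f::k])
        test = [i for c in clusters if c in test_clusters for i in members[c]]
        train = [i for c in clusters if c not in test_clusters for i in members[c]]
        yield sorted(train), sorted(test)


# Toy data (assumed): 6 wards with 3 patients each.
wards = [w for w in "ABCDEF" for _ in range(3)]
for train, test in cluster_kfold(wards, k=3):
    train_wards = {wards[i] for i in train}
    test_wards = {wards[i] for i in test}
    # No ward contributes patients to both the learning set and the test set.
    assert train_wards.isdisjoint(test_wards)
```

Because every example of a given cluster lands on the same side of the split, the test folds contain only clusters unseen during training. Per the abstract, that is the condition under which the error rate estimate is not biased downward by within-cluster similarity.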