Operative assessment of predicted generalization errors on non-stationary distributions in data-intensive applications

Authors:
Sergio Decherchi; Paolo Gastaldo; Fabio Sangiacomo; Alessio Leoncini; Rodolfo Zunino
Affiliations:
All authors: Department of Biophysical and Electronic Engineering (DIBE), University of Genova, Genoa, Italy
Corresponding author: Sergio Decherchi (e-mail: sergio.decherchi@unige.it)
Venue:
Intelligent Data Analysis
Year:
2011

Abstract

Data-intensive applications use empirical methods to extract consistent information from huge samples. When applied to classification tasks, they aim to optimize accuracy on unseen data; hence, a reliable prediction of the generalization error is of paramount importance. Theoretical models, such as Statistical Learning Theory, and empirical estimates, such as cross-validation, can both fit data-mining classification domains very well, provided that some crucial assumptions are verified in advance. In particular, stationarity of the distribution of the observed data is critical, although it is sometimes overlooked in practice. The paper formulates an operative criterion to verify the stationarity assumption; the method applies to both theoretical and practical predictions of generalization errors. The analysis addresses the specific case of clustering-based classifiers: the K-Winner Machine (KWM) model is used as a reference because of its known theoretical bounds, and cross-validation provides an empirical counterpart for practical comparison. The criterion, based on efficient, unsupervised, clustering-based estimation of probability distributions, is tested experimentally on a set of different data-intensive applications, including intrusion detection for computer-network security, optical character recognition, text mining, and pedestrian detection. Experimental results confirm that the proposed approach is effective at detecting non-stationarity efficiently.
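
The abstract does not spell out the criterion itself, so the following is only a minimal sketch of the general idea of clustering-based distribution comparison, not the paper's method. It assumes plain k-means as the unsupervised quantizer, smoothed cell-occupancy histograms as the distribution estimate, and a symmetric Kullback-Leibler divergence as the comparison measure; all function names and parameters (`drift_score`, `n_clusters`, `smoothing`) are hypothetical.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def kmeans(data, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed quantizer); returns the centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, n_clusters)]
    for _ in range(n_iter):
        groups = [[] for _ in range(n_clusters)]
        for point in data:
            groups[nearest(point, centroids)].append(point)
        for k, members in enumerate(groups):
            if members:  # an empty cell keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def occupancy(data, centroids, smoothing=1.0):
    """Smoothed fraction of points that fall into each centroid's cell."""
    counts = [smoothing] * len(centroids)
    for point in data:
        counts[nearest(point, centroids)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) for two strictly positive histograms."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_score(reference, new, n_clusters=8, seed=0):
    """Quantize `reference`, then compare cell occupancies of both samples."""
    centroids = kmeans(reference, n_clusters, seed=seed)
    return symmetric_kl(occupancy(reference, centroids), occupancy(new, centroids))


def gaussian_sample(n, mean, rng):
    """Synthetic 2-D Gaussian data, used only for this illustration."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    train = gaussian_sample(500, 0.0, rng)
    same = gaussian_sample(500, 0.0, rng)      # drawn from the same distribution
    shifted = gaussian_sample(500, 3.0, rng)   # drawn from a shifted distribution
    print("same distribution:   ", drift_score(train, same))
    print("shifted distribution:", drift_score(train, shifted))
```

In this sketch, a score near zero suggests that the new sample occupies the clusters in roughly the same proportions as the reference sample, while a large score hints at non-stationarity. Where to place the decision threshold is left open here, since the abstract gives no value for it.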