The use of different discretization techniques can be expected to affect the classification bias and variance of naive-Bayes classifiers. We call such an effect discretization bias and variance. Proportional k-interval discretization (PKID) tunes discretization bias and variance by adjusting discretized interval size and number in proportion to the number of training instances. Theoretical analysis suggests that this is desirable for naive-Bayes classifiers. However, PKID is sub-optimal when learning from training data of small size. We argue that this is because PKID weights bias reduction and variance reduction equally, whereas for small data, variance reduction contributes more to lowering learning error and thus should be given greater weight than bias reduction. Accordingly, we propose weighted proportional k-interval discretization (WPKID), which establishes a more suitable bias and variance trade-off for small data while still allowing additional training data to be used to reduce both bias and variance. Our experiments demonstrate that, for naive-Bayes classifiers, WPKID improves upon PKID for smaller datasets with significant frequency, and delivers lower classification error significantly more often than not in comparison to three other leading discretization techniques studied.
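The abstract does not spell out the exact discretization formulas. The sketch below is a minimal, hypothetical illustration of the core idea: equal-frequency binning in which both the interval frequency and the interval number grow with the training-set size. It assumes that a PKID-style rule sets both quantities to roughly the square root of the number of training instances, and that a WPKID-style variant additionally enforces a minimum interval frequency (the threshold of 30 used here is purely illustrative). It is not the authors' exact algorithm, only an aid to reading the abstract.

```python
import math
import numpy as np

def proportional_cut_points(values, min_freq=None):
    """Equal-frequency discretization in the spirit of PKID / WPKID.

    PKID (as summarised in the abstract) lets both the interval frequency s
    and the interval number t grow with the number of training instances n,
    roughly s = t ~ sqrt(n).  The WPKID-style variant sketched here also
    enforces a minimum interval frequency, so that for small n the bins stay
    large (favouring variance reduction) while both s and t still grow as
    more data arrives.  The exact trade-off used by the authors may differ;
    this is an illustrative assumption.
    """
    n = len(values)
    if min_freq is None:
        s = max(1, int(math.sqrt(n)))      # PKID-style: s = t ~ sqrt(n)
    else:
        s = max(min_freq, int(math.sqrt(n)))  # WPKID-style minimum frequency
    t = max(1, n // s)                     # resulting number of intervals
    # Equal-frequency cut points taken from the sorted attribute values.
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    return [sorted_vals[(i * n) // t] for i in range(1, t)]

# Example: bin a numeric attribute before feeding it to a naive-Bayes learner.
rng = np.random.default_rng(0)
attribute = rng.normal(size=500)
print(proportional_cut_points(attribute))               # PKID-style cuts
print(proportional_cut_points(attribute, min_freq=30))  # WPKID-style cuts
```

With only a few hundred instances, the minimum-frequency variant produces fewer, better-populated intervals than the plain square-root rule, which mirrors the abstract's argument that small datasets benefit from giving variance reduction greater weight.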