Technical Note: Bias in Information-Based Measures in Decision Tree Induction

Authors:
Allan P. White;Wei Zhong Liu
Affiliations:
Computer Centre, University of Birmingham, P.O. Box 363, Birmingham B15 2TT, United Kingdom. A.P.WHITE@BHAM.AC.UK;School of Mathematics and Statistics, University of Birmingham, P.O. Box 363, Birmingham B15 2TT, United Kingdom. W.Z.LIU@BHAM.AC.UK
Venue:
Machine Learning
Year:
1994


Abstract

A fresh look is taken at the problem of bias in the information-based attribute selection measures used in the induction of decision trees. The approach uses statistical simulation techniques to demonstrate that the usual measures, such as information gain, gain ratio, and a new measure recently proposed by Lopez de Mantaras (1991), are all biased in favour of attributes with large numbers of values. It is concluded that approaches which utilise the chi-square distribution are preferable because they compensate automatically for differences between attributes in the number of levels they take.
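
The kind of bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experimental design: the sample size, number of classes, numbers of attribute values, trial count, and all function names are assumptions chosen for illustration. It generates attributes that are statistically independent of the class, so any information gain observed is pure sampling noise, and compares the average gain across attributes with different numbers of values.

```python
import math
import random
from collections import Counter


def entropy(counts, total):
    """Shannon entropy in bits for an iterable of category counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(attr, labels):
    """Information gain (bits) of the class labels given the attribute."""
    n = len(labels)
    h_class = entropy(Counter(labels).values(), n)
    # Group class labels by attribute value.
    by_value = {}
    for a, y in zip(attr, labels):
        by_value.setdefault(a, []).append(y)
    # Weighted average entropy of the class within each attribute value.
    h_cond = sum(
        (len(ys) / n) * entropy(Counter(ys).values(), len(ys))
        for ys in by_value.values()
    )
    return h_class - h_cond


def mean_null_gain(n_values, n_cases=100, n_classes=2, n_trials=500, seed=0):
    """Average gain of a random attribute that is independent of the class.

    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        labels = [rng.randrange(n_classes) for _ in range(n_cases)]
        attr = [rng.randrange(n_values) for _ in range(n_cases)]
        total += information_gain(attr, labels)
    return total / n_trials


# Attributes with more values earn a higher average gain despite carrying
# no information about the class.
results = {k: mean_null_gain(k) for k in (2, 4, 8, 16)}
for k, gain in results.items():
    print(f"{k:2d} values: mean information gain = {gain:.4f} bits")
```

Because every attribute here is uninformative by construction, an unbiased measure would score them all about equally. A chi-square based test instead refers the statistic to a distribution whose degrees of freedom grow with the number of attribute values, which is the compensation the abstract refers to.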