Modelling Classification Performance for Large Data Sets

  • Authors:
  • Baohua Gu, Feifang Hu, Huan Liu

  • Venue:
  • WAIM '01: Proceedings of the Second International Conference on Advances in Web-Age Information Management
  • Year:
  • 2001

Abstract

For many learning algorithms, learning accuracy increases with the size of the training data, forming the well-known learning curve. A learning curve can usually be fitted by interpolating or extrapolating some points on it with a specified model. The fitted curve can then be used to predict the maximum achievable learning accuracy, or to estimate the amount of data needed to reach a target accuracy; both uses are especially meaningful for data mining on large data sets. Although several models have been proposed for learning curves, few have been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting the learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. Using all available data for learning, we fit a full-length learning curve; using a small portion of the data, we fit a part-length learning curve. The models are then compared on two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve predicts learning accuracy at the full length. Experimental results show that the power law (y = a - b * x^(-c)) is the best of the six models on both criteria, for both algorithms and all data sets. These results support the applicability of learning curves to data mining.
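As an illustration of the curve-fitting procedure the abstract describes, the sketch below fits the power-law model y = a - b * x^(-c) to a few (training size, accuracy) points and then extrapolates it. This is a minimal sketch, not the authors' code: the data points, the full-data size, and the use of scipy's curve_fit are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Power-law learning-curve model from the abstract: y = a - b * x^(-c),
# where x is the training-set size and y is the classification accuracy.
def power_law(x, a, b, c):
    return a - b * np.power(x, -c)

# Hypothetical (training size, accuracy) points standing in for measurements
# taken while growing the training set -- not data from the paper.
sizes = np.array([100, 500, 1000, 5000, 10000, 50000], dtype=float)
accs = np.array([0.71, 0.80, 0.83, 0.87, 0.88, 0.895])

# Fit a "part-length" curve from these points; p0 seeds the optimizer
# with plausible starting values for (a, b, c).
params, _ = curve_fit(power_law, sizes, accs, p0=(0.9, 1.0, 0.5), maxfev=10000)
a, b, c = params

# Extrapolate: predicted accuracy at a (hypothetical) full data size of 500k
# examples; the asymptote a estimates the maximum achievable accuracy.
print(f"fit: y = {a:.3f} - {b:.3f} * x^(-{c:.3f})")
print(f"predicted accuracy at 500k examples: {power_law(5e5, a, b, c):.3f}")
print(f"estimated maximum achievable accuracy: {a:.3f}")
```

Comparing the prediction at full length against the accuracy actually measured there mirrors the paper's second evaluation criterion.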