Learning algorithms may perform worse with increasing training set size: Algorithm-data incompatibility

  • Authors: Waleed A. Yousef; Subrata Kundu


  • Venue: Computational Statistics & Data Analysis
  • Year: 2014


Abstract

In machine learning problems, a learning algorithm tries to learn the input-output dependency (relationship) of a system from a training dataset. This input-output relationship is usually corrupted by random noise. From experience, simulations, and special-case theories, most practitioners believe that increasing the size of the training set improves the performance of the learning algorithm. It is shown that this belief does not hold in general for every pair of learning algorithm and data distribution. In particular, it is proven that for certain distributions and learning algorithms, increasing the training set size may result in worse performance, and letting the training set size grow without bound may result in the worst performance, even when there is no model misspecification in the input-output relationship. Simulation results and analysis of real datasets are provided to support the mathematical argument.
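The paper's own constructions are not reproduced here, but the non-monotone behavior the abstract describes can be observed in a well-documented related setting: minimum-norm least squares, where test error can grow as the training set size n approaches the number of features p, then fall again for n > p. The sketch below illustrates only that phenomenon, not the authors' construction; the dimension p = 50, noise level, and sample sizes are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50                                  # number of features (fixed)
beta = rng.normal(size=p)
beta /= np.linalg.norm(beta)            # true coefficients, unit norm (no misspecification)
sigma = 0.5                             # noise standard deviation
n_test = 1000
X_test = rng.normal(size=(n_test, p))
y_test = X_test @ beta + sigma * rng.normal(size=n_test)

def test_mse(n, trials=200):
    """Average test MSE of the minimum-norm least-squares fit on n samples."""
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, p))
        y = X @ beta + sigma * rng.normal(size=n)
        # pinv gives the minimum-norm solution for both n < p and n >= p
        beta_hat = np.linalg.pinv(X) @ y
        errs.append(np.mean((X_test @ beta_hat - y_test) ** 2))
    return np.mean(errs)

for n in (10, 25, 40, 48, 50, 60, 100, 200):
    print(f"n = {n:4d}  test MSE = {test_mse(n):.3f}")
```

With these settings, the measured test error typically rises as n grows toward p = 50, where the variance of the fit explodes near the interpolation threshold, and only then decreases, mirroring the abstract's claim that more training data can degrade performance for certain algorithm-distribution pairs.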