Is random model better? On its accuracy and efficiency

  • Authors:
  • Wei Fan, Haixun Wang, Philip S. Yu, Sheng Ma


  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003


Abstract

Inductive learning searches for an optimal hypothesis that minimizes a given loss function. It is usually assumed that the simplest hypothesis that fits the data is the best approximation to an optimal hypothesis. Since finding the simplest hypothesis is NP-hard for most representations, we generally employ various heuristics to search for its closest match. Computing these heuristics incurs significant cost, making learning inefficient and unscalable for large datasets. At the same time, it is still questionable whether the simplest hypothesis is indeed the closest approximation to the optimal model. The recent success of combining multiple models, such as bagging, boosting and meta-learning, has greatly improved accuracy over the single simplest hypothesis, providing a strong argument against the optimality of the simplest hypothesis. However, computing these combined hypotheses incurs significantly higher cost. In this paper, we first argue that as long as the error of a hypothesis on each example is within a range dictated by a given loss function, it can still be optimal. Contrary to common belief, we propose a completely random decision tree algorithm that achieves much higher accuracy than the single best hypothesis and is comparable to boosted or bagged multiple best hypotheses. The advantage of multiple random trees is their training efficiency as well as their minimal memory requirement.
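
To make the core idea concrete, below is a minimal sketch (not the authors' exact procedure) of a "completely random" decision-tree ensemble: splits are chosen without computing any heuristic such as information gain, leaves record class distributions, and posterior estimates are averaged over trees. The class names, the uniform-threshold split rule, the depth limit, and the averaging rule are illustrative assumptions for numeric features.

```python
# Sketch of a completely random decision-tree ensemble (assumptions noted above).
import numpy as np

class RandomTree:
    def __init__(self, max_depth=10, rng=None):
        self.max_depth = max_depth
        self.rng = rng or np.random.default_rng()

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.root_ = self._build(X, y, depth=0)
        return self

    def _build(self, X, y, depth):
        # Class distribution of the training rows reaching this node.
        counts = np.array([(y == c).sum() for c in self.classes_], dtype=float)
        if depth >= self.max_depth or len(np.unique(y)) <= 1 or len(y) < 2:
            return {"dist": counts / counts.sum()}
        # Completely random split: random feature, random threshold drawn
        # uniformly between that feature's observed min and max.
        f = int(self.rng.integers(X.shape[1]))
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo == hi:
            return {"dist": counts / counts.sum()}
        t = self.rng.uniform(lo, hi)
        mask = X[:, f] <= t
        return {
            "feature": f, "threshold": t,
            "left": self._build(X[mask], y[mask], depth + 1),
            "right": self._build(X[~mask], y[~mask], depth + 1),
        }

    def predict_proba(self, X):
        return np.array([self._proba(self.root_, x) for x in X])

    def _proba(self, node, x):
        if "dist" in node:
            return node["dist"]
        child = "left" if x[node["feature"]] <= node["threshold"] else "right"
        return self._proba(node[child], x)

class RandomTreeEnsemble:
    """Average posterior estimates over several completely random trees."""
    def __init__(self, n_trees=30, max_depth=10, seed=0):
        rng = np.random.default_rng(seed)
        self.trees = [RandomTree(max_depth, rng) for _ in range(n_trees)]

    def fit(self, X, y):
        for tree in self.trees:
            tree.fit(X, y)
        self.classes_ = self.trees[0].classes_
        return self

    def predict(self, X):
        probs = np.mean([t.predict_proba(X) for t in self.trees], axis=0)
        return self.classes_[probs.argmax(axis=1)]
```

Because no impurity measure is evaluated, each split costs only a single pass to partition the data, which reflects the training-efficiency and memory advantages the abstract claims for multiple random trees.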