Using random forests to uncover bivariate interactions in high dimensional small data sets

  • Authors:
  • Jorge M. Arevalillo;Hilario Navarro

  • Affiliations:
  • UNED, Madrid, Spain;UNED, Madrid, Spain

  • Venue:
  • Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics
  • Year:
  • 2009

Abstract

Random Forests (RF) is an ensemble method that has become widely accepted within the machine learning and bioinformatics communities in the last few years. Its predictive strength, along with the information-rich elements of its output, has made RF an efficient data mining tool for discovering patterns in data. In this paper we review the learning mechanism of RF in the classification setting and apply it to uncover bivariate interactions that carry useful information about an outcome in high-dimensional, low-sample-size data. We propose a divide-and-conquer search strategy in the variable space that uses the RF ranking of variable importances at a first stage and the out-of-bag (OOB) error rate of the ensemble at a second stage. The procedure combines both elements in order to capture patterns that are difficult to uncover in this type of data. We show the performance of our procedure in some synthetic scenarios and illustrate how it works with a real application to a microarray data set.
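
The abstract does not spell out the search procedure, so the following is only a minimal sketch of how its two ingredients could fit together, not the authors' algorithm. It assumes scikit-learn's `RandomForestClassifier`. The function name `screen_pairs` and the parameters `top_k`, `n_trees`, and `seed` are hypothetical, introduced for illustration. Stage 1 ranks variables by the importance scores of a forest grown on all variables; stage 2 grows a small forest on each pair of top-ranked variables and scores the pair by its OOB error. The divide-and-conquer partitioning of the variable space mentioned in the abstract is not reproduced here.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def screen_pairs(X, y, top_k=10, n_trees=200, seed=0):
    """Hypothetical two-stage screen: importance ranking, then pairwise OOB error.

    Returns a list of (oob_error, i, j) tuples sorted by increasing OOB error,
    where i < j are column indices of X.
    """
    # Stage 1 (assumed form): grow one forest on all variables and keep
    # the top_k variables by impurity-based importance.
    full = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    full.fit(X, y)
    top = np.argsort(full.feature_importances_)[::-1][:top_k]

    # Stage 2 (assumed form): grow a forest on each pair of retained
    # variables and record its out-of-bag error, 1 - OOB accuracy.
    results = []
    for i, j in combinations(sorted(int(v) for v in top), 2):
        rf = RandomForestClassifier(
            n_estimators=n_trees, oob_score=True, random_state=seed
        )
        rf.fit(X[:, [i, j]], y)
        results.append((float(1.0 - rf.oob_score_), i, j))

    # Pairs with the lowest OOB error are the candidate interactions.
    return sorted(results)
```

Calling `screen_pairs(X, y)` on a NumPy feature matrix `X` and a label vector `y` returns the candidate pairs ordered from lowest to highest OOB error. Note that a purely interactive pair (for example an XOR-type pattern) may have weak marginal importance and be dropped at stage 1 when the number of variables is large, which is presumably the kind of difficulty the divide-and-conquer strategy is meant to address.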