Uncovering Bivariate Interactions in High Dimensional Data Using Random Forests with Data Augmentation

  • Authors:
  • Jorge M. Arevalillo; Hilario Navarro

  • Affiliations:
  • (Correspd.) Department of Statistics and Operational Research, UNED University, Paseo Senda del Rey 9, 28040 Madrid, Spain. jmartin@ccia.uned.es / hnavarro@ccia.uned.es

  • Venue:
  • Fundamenta Informaticae - Machine Learning in Bioinformatics
  • Year:
  • 2011

Abstract

Random Forests (RF) is an ensemble technique for classification and regression that has become widely accepted in the bioinformatics community in recent years. Its predictive strength, together with the information-rich utilities its output provides, has made RF an efficient data mining tool for discovering patterns in high-dimensional data. In this paper we propose a search strategy that uses RF as the search engine to explore a subset of the input space exhaustively. The procedure starts from the variables previously rejected by a sequential search procedure and uses the out-of-bag error rate of the ensemble, obtained by training on an augmented data set, as the criterion for capturing difficult-to-uncover bivariate patterns associated with an outcome variable. We show the performance of the procedure in several synthetic scenarios and apply it to a real microarray experiment to illustrate how it works for gene expression data.
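
The abstract does not spell out how the augmented data set is built, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that "augmenting" means appending each candidate pair of rejected variables to the columns already kept by an earlier search; the helper names `oob_error` and `rank_pairs`, the synthetic data, and the choice of scikit-learn are all illustrative assumptions. Every pair of rejected variables is scored by the out-of-bag (OOB) error of a forest trained on the augmented columns, and the pairs are ranked from lowest to highest error.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_error(X, y, columns, n_trees=200, seed=0):
    """Out-of-bag error rate of a forest trained on the given columns only."""
    forest = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=seed
    )
    forest.fit(X[:, columns], y)
    return 1.0 - forest.oob_score_


def rank_pairs(X, y, kept, rejected, **kwargs):
    """Score every pair of rejected variables (hypothetical helper).

    ASSUMPTION: the augmented data set is the kept columns plus the
    candidate pair. The paper may define augmentation differently.
    """
    scores = {}
    for i, j in combinations(rejected, 2):
        scores[(i, j)] = oob_error(X, y, list(kept) + [i, j], **kwargs)
    # Lowest OOB error first: the most promising bivariate pattern.
    return sorted(scores.items(), key=lambda item: item[1])


# Synthetic scenario: column 0 has a weak marginal effect, while columns
# 1 and 2 act only jointly (an XOR-like pattern with no marginal signal).
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 6))
interaction = np.sign(X[:, 1] * X[:, 2])
noise = 0.3 * rng.standard_normal(400)
y = (X[:, 0] + 1.5 * interaction + noise > 0).astype(int)

ranking = rank_pairs(X, y, kept=[0], rejected=[1, 2, 3, 4, 5])
best_pair, best_error = ranking[0]
print(best_pair, round(best_error, 3))
```

In this toy setting neither column 1 nor column 2 is informative on its own, which is why a one-variable-at-a-time search would tend to reject both; scoring them jointly is what lets the OOB error expose the pattern. The exhaustive loop grows quadratically with the number of rejected variables, so restricting it to a subset of the input space, as the abstract describes, keeps the cost manageable.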