Supervising random forest using attribute interaction networks

  • Authors:
  • Qinxin Pan;Ting Hu;James D. Malley;Angeline S. Andrew;Margaret R. Karagas;Jason H. Moore

  • Affiliations:
  • Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH;Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH;Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD;Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH and Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH;Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH and Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH;Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH and Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH and Ins ...

  • Venue:
  • EvoBIO'13 Proceedings of the 11th European conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics
  • Year:
  • 2013

Abstract

Genome-wide association studies (GWAS) have become a powerful and affordable tool for studying the genetic variations associated with common human diseases. However, only a few of the loci found are associated with a moderate or large increase in disease risk, so using GWAS findings to study the underlying biological mechanisms remains a challenge. One possible cause of this "missing heritability" is gene-gene interaction, or epistasis. Several methods have been developed to study it, and Random Forest (RF) is a popular one among them. RF has been applied successfully in many studies, but it is also known to rely on marginal main effects. Meanwhile, networks have become a popular approach for systematically characterizing the space of pairwise interactions, which can be informative for classification problems. In this study, we compared the findings of a Mutual Information Network (MIN) with those of RF and observed that the variables identified by the two methods overlap but are not identical. To integrate the advantages of MIN into RF, we propose a hybrid algorithm, MIN-guided RF (MINGRF), which overlays the neighborhood structure of MIN onto the growth of trees. After comparing MINGRF with standard RF on a bladder cancer dataset, we conclude that MINGRF produces trees with better accuracy at a smaller computational cost.
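
The abstract gives no algorithmic detail, so the following Python is only an illustrative sketch of the general idea of letting a pairwise-interaction network guide tree growth. It rests on several assumptions that are not taken from the paper: edges are scored by the information a pair of attributes carries about the class beyond their individual main effects, an edge is kept when that score exceeds a hypothetical `threshold`, and the attributes considered at a non-root node are restricted to network neighbors of the parent's split attribute. All function names (`interaction_gain`, `build_network`, `candidate_attributes`) are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(a, b, label):
    """Assumed edge score: I(A,B;C) - I(A;C) - I(B;C).

    Positive values mean the pair tells us more about the class
    than the two attributes do separately.
    """
    joint = list(zip(a, b))
    return (mutual_information(joint, label)
            - mutual_information(a, label)
            - mutual_information(b, label))


def build_network(columns, label, threshold):
    """Assumed network: link attribute pairs whose score exceeds threshold."""
    neighbors = {name: set() for name in columns}
    for u, v in combinations(columns, 2):
        if interaction_gain(columns[u], columns[v], label) > threshold:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors


def candidate_attributes(neighbors, parent_attribute, all_attributes):
    """Assumed guidance rule for choosing split candidates.

    At the root every attribute is eligible; below the root only the
    network neighbors of the parent's split attribute are eligible.
    """
    if parent_attribute is None:
        return sorted(all_attributes)
    return sorted(neighbors[parent_attribute])


# Toy data: the class is the XOR of snp_a and snp_b, so neither attribute
# has a main effect, while snp_c is constant and uninformative.
columns = {
    "snp_a": [0, 0, 1, 1],
    "snp_b": [0, 1, 0, 1],
    "snp_c": [0, 0, 0, 0],
}
label = [0, 1, 1, 0]
network = build_network(columns, label, threshold=0.5)
```

In this toy case a split on `snp_a` would be followed only by candidates drawn from its neighbors in `network`, which is one plausible way a smaller candidate set could reduce the cost of growing each tree. How MINGRF actually samples attributes is not specified in the abstract.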