Learning a complex metabolomic dataset using random forests and support vector machines

  • Authors:
  • Young Truong;Xiaodong Lin;Chris Beecher

  • Affiliations:
  • University of North Carolina, Chapel Hill, NC;Duke University, Research Triangle Park, NC;Metabolon, Research Triangle Park, NC

  • Venue:
  • Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Metabolomics is the "omics" science of biochemistry. The associated data include the quantitative measurements of all small molecule metabolites in a biological sample. These datasets provide a window into dynamic biochemical networks and conjointly with other "omic" data, genes and proteins, have great potential to unravel complex human diseases. The dataset used in this study has 63 individuals, normal and diseased, and the diseased are drug treated or not, so there are three classes. The goal is to classify these individuals using the observed metabolite levels for 317 measured metabolites. There are a number of statistical challenges: non-normal data, the number of samples is less than the number of metabolites; there are missing data and the fact that data are missing is informative (assay values below detection limits can point to a specific class); also, there are high correlations among the metabolites. We investigate support vector machines (SVM), and random forest (RF), for outlier detection, variable selection and classification. We use the variables selected with RF in SVM and visa versa. The benefit of this study is insight into interplay of variable selection and classification methods. We link our selected predictors to the biochemistry of the disease.