Improved Feature Selection by Incorporating Gene Similarity into the LASSO

  • Authors:
  • Christopher E. Gillies; Xiaoli Gao; Nilesh V. Patel; Mohammad-Reza Siadat; George D. Wilson

  • Affiliations:
  • Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA; Department of Mathematics and Statistics, Oakland University, Rochester, MI, USA; Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA; Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA; Radiation Oncology Department and BioBank Department, Beaumont Health System, Royal Oak, MI, USA

  • Venue:
  • International Journal of Knowledge Discovery in Bioinformatics
  • Year:
  • 2012

Abstract

Personalized medicine customizes treatments to a patient's genetic profile and has the potential to revolutionize medical practice. Gene expression profiling is an important process in personalized medicine. Analyzing gene expression profiles is difficult because there are usually few patients and thousands of genes, which leads to the curse of dimensionality. To combat this problem, researchers suggest using prior knowledge to enhance feature selection for supervised learning algorithms. The authors propose an enhancement to the LASSO, a shrinkage and selection technique that induces parameter sparsity by adding an L1 penalty on the coefficients to a model's objective function. Their enhancement gives preference to the selection of genes that are involved in similar biological processes. The authors' modified LASSO selects similar genes by penalizing interaction terms between genes. They devise a coordinate descent algorithm to minimize the corresponding objective function. To evaluate their method, the authors generated simulated data and compared their model with the standard LASSO model and an interaction LASSO model. The authors' model outperformed both the standard and interaction LASSO models in detecting important genes and gene interactions for a reasonable number of training samples. They also demonstrated the performance of their method on a real gene expression data set from lung cancer cell lines.
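
For orientation, the following is a minimal sketch of coordinate descent for the standard LASSO, the baseline that the abstract compares against. It is not the authors' modified method: the abstract does not state the similarity-weighted interaction penalty, so none is attempted here. The sketch assumes a squared-error loss scaled as (1/(2n))·||y − Xβ||² + λ·||β||₁, and the names `soft_threshold`, `lasso_coordinate_descent`, `lam`, and `n_iters` are illustrative choices, not taken from the paper.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Standard LASSO via cyclic coordinate descent (illustrative sketch).

    Minimizes (1 / (2n)) * ||y - X @ beta||^2 + lam * ||beta||_1.
    X is an (n samples, p genes) matrix; y is a length-n response vector.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * x_j' x_j for each column
    residual = y - X @ beta                # kept in sync with beta below
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                   # an all-zero column carries no signal
            # Correlation of column j with the partial residual that
            # excludes column j's current contribution.
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            # Update the residual incrementally instead of recomputing X @ beta.
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta
```

The soft-thresholding step is what produces exact zeros, which is how the LASSO performs feature selection when genes far outnumber patients. A penalty on interaction terms, as the abstract describes, would change the objective and therefore the per-coordinate update; that extension is left out because its form is not given here.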