Induction of comprehensible models for gene expression datasets by subgroup discovery methodology

  • Authors:
  • Dragan Gamberger;Nada Lavrač;Filip Železný;Jakub Tolar

  • Affiliations:
  • Laboratory for Information Systems, Rudjer Bošković Institute, Bijenička 54, 10000 Zagreb, Croatia;Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia and Nova Gorica Polytechnic Vipavska 13, 5000 Nova Gorica, Slovenia;Department of Cybernetics, Czech Institute of Technology (CVUT FEL), Technická 2, 16627 Prague, Czech Republic and Department of Biostatistics, University of Wisconsin Medical School, 1300 Un ...;Institute of Human Genetics, University of Minnesota Medical School, 420 Delaware Street, 55455 Minneapolis

  • Venue:
  • Journal of Biomedical Informatics - Special issue: Biomedical machine learning
  • Year:
  • 2004

Abstract

Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-the-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insights they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, the paper shows the feasibility of this approach, employing a recently developed subgroup discovery methodology. Some of the discovered classifiers allow for novel biological interpretations.
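
As a rough illustration of what a simple logic-based classifier of this kind could look like, the sketch below searches for a short conjunction of binary expression features that covers many samples of the target class and few of the others. It is not the authors' implementation: the quality score TP / (FP + g), the generalization parameter `g`, the feature names (`geneA_high` and so on), and the toy data are all assumptions made for illustration.

```python
from itertools import combinations

def rule_quality(rule, samples, labels, g=1.0):
    """Score a conjunction of features by TP / (FP + g) (assumed measure)."""
    tp = fp = 0
    for features, label in zip(samples, labels):
        # The rule covers a sample only if every feature in it is true.
        if all(features[name] for name in rule):
            if label:
                tp += 1
            else:
                fp += 1
    return tp / (fp + g)

def best_rule(samples, labels, max_terms=2, g=1.0):
    """Exhaustively search conjunctions of at most max_terms features."""
    names = sorted(samples[0])
    candidates = [
        combo
        for size in range(1, max_terms + 1)
        for combo in combinations(names, size)
    ]
    return max(candidates, key=lambda rule: rule_quality(rule, samples, labels, g))

# Hypothetical toy data: each sample maps a thresholded expression
# feature to True/False; labels mark membership in the target class.
samples = [
    {"geneA_high": True,  "geneB_high": True,  "geneC_high": False},
    {"geneA_high": True,  "geneB_high": True,  "geneC_high": True},
    {"geneA_high": True,  "geneB_high": False, "geneC_high": True},
    {"geneA_high": False, "geneB_high": True,  "geneC_high": False},
    {"geneA_high": False, "geneB_high": False, "geneC_high": True},
    {"geneA_high": True,  "geneB_high": True,  "geneC_high": False},
]
labels = [True, True, False, False, False, True]

rule = best_rule(samples, labels)
print(" AND ".join(rule), rule_quality(rule, samples, labels))
```

Limiting the search to very short conjunctions keeps each resulting rule readable by a domain expert, which is the property the abstract emphasizes. A larger `g` in this sketch would favor more general rules that tolerate some false positives.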