Simple Bayesian binary framework for discovering significant genes and classifying cancer diagnosis

  • Authors: Tae Young Yang
  • Affiliations: Department of Mathematics, Myongji University, Kyonggi, 449-728, Republic of Korea
  • Venue: Computational Statistics & Data Analysis
  • Year: 2009

Abstract

Given a microarray dataset consisting of two classes, type I and type II, the proposed coherent binary framework sequentially combines a gene-rank algorithm and a classifier. Genes that are expressed at a consistently high level in one type and at a consistently low level in the other are of particular interest: the wider the gap between the expression levels, the more significant the gene is as a discriminator. A new distance metric, obtained using Bayesian nonparametric approaches involving Dirichlet process priors, is used to measure this gap. Significant genes are ranked separately for each of two patterns: genes that are over-expressed in type I and under-expressed in type II, and genes that are under-expressed in type I and over-expressed in type II. An out-of-sample cross-validation approach is suggested for deciding how many significant genes the classifier needs. When a test sample is presented, the classifier uses each selected top-ranked gene to calculate a classification score, and the sample is then classified as having the type with the larger score. Empirical studies using two public datasets show that the top-ranked genes in each pattern clearly display that pattern, and that the classifier needs only a few significant genes to classify the test samples correctly. In terms of accuracy and robustness, the framework is a simple, easy alternative to more complex models.
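
The ranking and scoring steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method. The gap measure here (the minimum of the high group minus the maximum of the low group) is an assumed stand-in for the Dirichlet-process-based distance. The per-gene vote is an assumed stand-in for the paper's classification score. The names `rank_genes` and `classify` and the toy data are hypothetical, and the cross-validation step that chooses the number of genes is omitted.

```python
from statistics import mean

def rank_genes(type1, type2, k):
    """Return (up, down): indices of the top-k genes for each pattern.

    type1, type2: lists of samples; each sample is a list of expression values.
    The gap used here is a simple stand-in, NOT the paper's distance metric.
    """
    n_genes = len(type1[0])
    up_gap, down_gap = [], []
    for g in range(n_genes):
        a = [s[g] for s in type1]
        b = [s[g] for s in type2]
        up_gap.append(min(a) - max(b))    # over-expressed in type I
        down_gap.append(min(b) - max(a))  # under-expressed in type I
    up = sorted(range(n_genes), key=lambda g: up_gap[g], reverse=True)[:k]
    down = sorted(range(n_genes), key=lambda g: down_gap[g], reverse=True)[:k]
    return up, down

def classify(sample, type1, type2, genes):
    """Each selected gene adds one point to the type whose mean is nearer.

    The sample is assigned the type with the larger score (ties go to "II",
    an arbitrary choice made for this sketch).
    """
    score1 = score2 = 0
    for g in genes:
        m1 = mean(s[g] for s in type1)
        m2 = mean(s[g] for s in type2)
        if abs(sample[g] - m1) < abs(sample[g] - m2):
            score1 += 1
        else:
            score2 += 1
    return "I" if score1 > score2 else "II"

# Toy data: 3 samples per type, 3 genes (made-up values for illustration).
type1 = [[9.0, 1.0, 5.0], [8.5, 1.2, 4.0], [9.2, 0.8, 6.0]]
type2 = [[2.0, 7.0, 5.5], [2.5, 7.5, 4.5], [1.8, 6.8, 5.0]]

up, down = rank_genes(type1, type2, k=1)
label = classify([8.8, 1.1, 5.2], type1, type2, up + down)
```

In this toy data, gene 0 is consistently high in type I and gene 1 is consistently high in type II, so each tops one of the two rankings. Gene 2 overlaps between the types and is never selected.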