Classification and feature gene selection using the normalized maximum likelihood model for discrete regression

  • Authors:
  • Ioan Tabus, Jorma Rissanen, Jaakko Astola

  • Affiliations:
  • Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland (all authors)

  • Venue:
  • Signal Processing - Special issue: Genomic signal processing
  • Year:
  • 2003

Abstract

This paper studies the problem of class discrimination based on the normalized maximum likelihood (NML) model for a nonlinear regression, where the nonlinearly transformed class labels, each taking M possible values, are assumed to be drawn from a multinomial trial process. The strength of MDL methods in statistical inference lies in finding the model structure, which in this particular classification problem amounts to finding the best set of feature genes. We first show that minimizing the codelength of the NML model over different sets of feature genes is a tractable problem. This description length of the class labels, obtained with various sets of feature genes in the nonlinear regression model, allows intuitive comparisons of nested feature sets. We then extend the model for selecting the feature genes to a completely defined classifier and check its classification error in a cross-validation experiment. The quantization process itself, needed to obtain the discrete entries required by the model, can also be evaluated with the NML description length. The new classification method is applied to leukemia class discrimination based on gene expression microarray data. We find classification errors as low as 0.03% with a quadruplet of binary quantized genes, which was top ranked by the NML description length.
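
To make the feature-selection criterion concrete, the sketch below is a minimal illustration (not the authors' implementation) for the two-class, binary-quantized case: each combination of feature-gene values defines a context, the class labels in a context are treated as Bernoulli trials, and the NML code length is the maximized log-likelihood plus the per-context Shtarkov normalizer. The function names (binary_nml_normalizer, nml_codelength, rank_feature_subsets) and the synthetic data are assumptions for illustration only; the paper works with M-valued labels and the leukemia microarray data.

    import numpy as np
    from itertools import combinations
    from math import comb, log2

    def binary_nml_normalizer(n):
        # Shtarkov sum C(n) for a Bernoulli source observed n times:
        # C(n) = sum_k binom(n, k) (k/n)^k ((n-k)/n)^(n-k), with 0^0 = 1.
        if n == 0:
            return 1.0
        total = 0.0
        for k in range(n + 1):
            p1 = (k / n) ** k if k else 1.0
            p0 = ((n - k) / n) ** (n - k) if n - k else 1.0
            total += comb(n, k) * p1 * p0
        return total

    def nml_codelength(labels, features):
        # NML code length (bits) of binary class labels given binary quantized
        # feature genes. Each distinct feature pattern is one regression
        # context; labels within a context are modelled as Bernoulli trials.
        contexts = {}
        for pattern, y in zip(map(tuple, features), labels):
            contexts.setdefault(pattern, []).append(int(y))
        bits = 0.0
        for labs in contexts.values():
            n, k = len(labs), sum(labs)
            for c in (k, n - k):                 # maximized log-likelihood term
                if c:
                    bits -= c * log2(c / n)
            bits += log2(binary_nml_normalizer(n))  # parametric complexity term
        return bits

    def rank_feature_subsets(labels, genes, subset_size):
        # Exhaustively score all gene subsets; shorter code length ranks higher.
        scored = []
        for subset in combinations(range(genes.shape[1]), subset_size):
            scored.append((nml_codelength(labels, genes[:, list(subset)]), subset))
        return sorted(scored)

    # Toy usage on synthetic binary data (not the leukemia microarray data).
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=40)
    X = rng.integers(0, 2, size=(40, 8))
    X[:, 3] = y ^ (rng.random(40) < 0.1).astype(int)  # make gene 3 informative
    print(rank_feature_subsets(y, X, 2)[:3])          # three shortest-code subsets

In this toy run, subsets containing the informative gene receive the shortest code lengths, mirroring how the paper ranks feature-gene sets by NML description length before building the final classifier.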