Classification and feature gene selection using the normalized maximum likelihood model for discrete regression

Authors:
Ioan Tabus;Jorma Rissanen;Jaakko Astola
Affiliations:
Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland;Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland;Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Venue:
Signal Processing - Special issue: Genomic signal processing
Year:
2003

Citing 7
Cited 2

Vector quantization and signal compression
Wrappers for feature subset selection

Artificial Intelligence - Special issue on relevance
On the use of MDL principle in gene expression prediction

EURASIP Journal on Applied Signal Processing - Nonlinear signal and image processing - part I
Fisher information and stochastic complexity

IEEE Transactions on Information Theory
The minimum description length principle in coding and modeling

IEEE Transactions on Information Theory
MDL denoising

IEEE Transactions on Information Theory
Strong optimality of the normalized ML models as universal codes and information in data

IEEE Transactions on Information Theory

Cancer classification and prediction using logistic regression with Bayesian gene selection

Journal of Biomedical Informatics - Special issue: Biomedical machine learning
An efficient normalized maximum likelihood algorithm for DNA sequence compression

ACM Transactions on Information Systems (TOIS)

Abstract

This paper studies class discrimination based on the normalized maximum likelihood (NML) model for a nonlinear regression, in which the nonlinearly transformed class labels, each taking M possible values, are assumed to be drawn from a multinomial trial process. The strength of minimum description length (MDL) methods in statistical inference lies in finding the model structure, which in this classification problem amounts to finding the best set of feature genes. We first show that minimizing the codelength of the NML model over different sets of feature genes is a tractable problem. We then extend the model used for selecting the feature genes to a completely defined classifier and check its classification error in a cross-validation experiment. The quantization process involved in obtaining the required entries of the model can also be evaluated with the NML description length. The new classification method is applied to leukemia class discrimination based on gene expression microarray data. We find classification errors as low as 0.03% with a quadruplet of binary quantized genes, which was top ranked by the NML description length. The description length of the class labels, obtained with various sets of feature genes in the nonlinear regression model, allows intuitive comparisons of nested feature sets.
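
The idea of ranking feature gene sets by the NML codelength of the class labels can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the paper's exact model: each configuration of the quantized feature genes is assumed to have its own multinomial distribution over the M labels, so the codelength is the negative log of the maximized likelihood plus the log of the multinomial NML normalizer, summed over configurations. The function names (`multinomial_normalizer`, `nml_codelength`, `best_subset`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def multinomial_normalizer(n, m):
    """NML normalizer C_m(n) for n multinomial trials with m outcomes."""
    if n == 0 or m == 1:
        return 1.0
    c_prev = 1.0  # C_1(n)
    # C_2(n) by direct summation; Python evaluates 0.0 ** 0 as 1.0.
    c_curr = sum(
        math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
        for k in range(n + 1)
    )
    # Recursion C_k(n) = C_{k-1}(n) + n / (k - 2) * C_{k-2}(n) for k >= 3.
    for k in range(3, m + 1):
        c_prev, c_curr = c_curr, c_curr + n / (k - 2) * c_prev
    return c_curr


def nml_codelength(features, labels, subset, m):
    """Bits to encode the labels given the quantized genes in `subset`."""
    cells = defaultdict(Counter)
    for x, y in zip(features, labels):
        cells[tuple(x[j] for j in subset)][y] += 1
    bits = 0.0
    for counts in cells.values():
        n = sum(counts.values())
        # Negative log2 of the maximized likelihood within this configuration.
        bits -= sum(c * math.log2(c / n) for c in counts.values())
        # Model cost: log2 of the normalizer for this configuration.
        bits += math.log2(multinomial_normalizer(n, m))
    return bits


def best_subset(features, labels, size, m):
    """Exhaustive search for the gene subset with the shortest codelength."""
    n_genes = len(features[0])
    return min(
        combinations(range(n_genes), size),
        key=lambda s: nml_codelength(features, labels, s, m),
    )


# Synthetic toy data (an assumption): three binary-quantized "genes";
# the label is the XOR of genes 0 and 1, and gene 2 is irrelevant.
features = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0),
            (0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0)]
labels = [x[0] ^ x[1] for x in features]

for subset in combinations(range(3), 2):
    print(subset, round(nml_codelength(features, labels, subset, m=2), 2))
print("selected:", best_subset(features, labels, size=2, m=2))
```

Because the normalizer grows with the number of feature configurations, a larger gene set is preferred only when it shortens the likelihood part of the code by more than the added model cost, which is what makes codelengths of nested feature sets directly comparable. The exhaustive search and the direct summation for the two-outcome normalizer are suitable only for small sample sizes and small subsets.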