Class discovery in gene expression data
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
Let a set of training samples and a set of test samples be given, where the training samples are partitioned into a number of classes and the classification of the test samples is unknown. The classification problem consists of determining the classes of the test samples using the information provided by the training set. Usually, not all features of the data set are informative for discovering the classification, so a subset of relevant features must be found. This task is called feature selection. We approach it from the viewpoint of mathematical programming: we consider several unsupervised clustering principles and use them as constraints, while expressing the desirable properties of feature selection as the objective function. In particular, we consider k-means local optimality constraints, pairwise threshold constraints, and biclustering consistency constraints. The objectives either maximize the separation of classes or minimize the information loss. The developed optimization-based approach has shown good performance on well-known DNA microarray data sets.
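To make the idea of "clustering principle as constraint, feature-selection quality as objective" concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes that k-means local optimality means every training sample is at least as close to its own class centroid as to any other centroid, that class separation is measured by the smallest squared distance between two centroids, and it replaces the mathematical programming solver with a brute-force search over feature subsets. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def centroids(samples, labels, features):
    """Mean of each class, restricted to the selected feature indices."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += x[f]
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def sq_dist(x, centroid, features):
    """Squared Euclidean distance from sample x to a centroid on the selected features."""
    return sum((x[f] - centroid[j]) ** 2 for j, f in enumerate(features))


def kmeans_locally_optimal(samples, labels, features):
    """Assumed constraint: no sample is strictly closer to a foreign centroid."""
    cents = centroids(samples, labels, features)
    for x, y in zip(samples, labels):
        own = sq_dist(x, cents[y], features)
        if any(sq_dist(x, c, features) < own for k, c in cents.items() if k != y):
            return False
    return True


def separation(samples, labels, features):
    """Assumed objective: smallest squared distance between two class centroids."""
    cents = centroids(samples, labels, features)
    return min(
        sum((a - b) ** 2 for a, b in zip(cents[p], cents[q]))
        for p, q in combinations(cents, 2)
    )


def select_features(samples, labels, n_select):
    """Brute-force stand-in for the optimization: best feasible subset of size n_select."""
    best, best_score = None, float("-inf")
    for subset in combinations(range(len(samples[0])), n_select):
        if not kmeans_locally_optimal(samples, labels, subset):
            continue  # subset violates the clustering constraint
        score = separation(samples, labels, subset)
        if score > best_score:
            best, best_score = subset, score
    return best


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noisy,
# feature 2 is constant and therefore uninformative.
samples = [[0.0, 5.0, 1.0], [0.2, 1.0, 1.0], [4.0, 4.0, 1.0], [4.2, 1.5, 1.0]]
labels = ["A", "A", "B", "B"]
print(select_features(samples, labels, 1))
```

On this toy data the noisy feature 1 is rejected by the constraint, the constant feature 2 is feasible but yields zero separation, and feature 0 is selected. Exhaustive enumeration is only workable for a handful of features; microarray data with thousands of genes is exactly where an optimization formulation becomes necessary.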