Analysis of complexity indices for classification problems: Cancer gene expression data

Authors:
Ana C. Lorena;Ivan G. Costa;Newton Spolaôr;Marcilio C. P. de Souto
Affiliations:
Centro de Matemática, Computação e Cognição, Universidade Federal do ABC, Brazil;Centro de Informática, Universidade Federal de Pernambuco, Brazil;Centro de Matemática, Computação e Cognição, Universidade Federal do ABC, Brazil;Centro de Informática, Universidade Federal de Pernambuco, Brazil
Venue:
Neurocomputing
Year:
2012

Abstract

Cancer diagnosis at the molecular level has become possible through the analysis of gene expression data. Typically, machine learning (ML) techniques are used to build automatic diagnosis models (classifiers) from such data. Cancer gene expression data often have characteristics that can harm the generalization ability of the resulting classifiers, among them data sparsity and an unbalanced class distribution. We investigate a set of indices that extract information about the intrinsic complexity of the data. Such measures can be used to analyze, among other things, which characteristics of cancer gene expression data most affect the predictive ability of support vector machine classifiers. We also show that applying a suitable feature selection procedure to the data reduces the influence of those characteristics on the error rates of the induced classifiers.
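
To make the idea of a complexity index concrete, the following is a minimal sketch in plain Python. The abstract does not name the indices used in the paper, so the three quantities here are assumptions chosen for illustration: a samples-per-feature ratio as a proxy for sparsity, a class imbalance ratio, and the maximum per-feature Fisher discriminant ratio for a two-class problem. The function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def samples_per_feature(X):
    """Sparsity proxy: number of samples divided by number of features."""
    return len(X) / len(X[0])


def imbalance_ratio(y):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(y).values()
    return max(counts) / min(counts)


def max_fisher_ratio(X, y):
    """Largest per-feature Fisher discriminant ratio (two classes only)."""
    a, b = sorted(set(y))
    best = 0.0
    for j in range(len(X[0])):
        col_a = [row[j] for row, label in zip(X, y) if label == a]
        col_b = [row[j] for row, label in zip(X, y) if label == b]
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0:
            # Feature is constant within both classes; skipped in this sketch.
            continue
        gap = (mean(col_a) - mean(col_b)) ** 2
        best = max(best, gap / spread)
    return best


# Toy data, made up for illustration: 6 samples, 3 features, labels 0/1.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.1],
    [0.8, 6.0, 0.3],
    [3.0, 5.5, 0.2],
    [3.2, 4.5, 0.1],
    [2.8, 5.0, 0.3],
]
y = [0, 0, 0, 1, 1, 1]

print("samples per feature:", samples_per_feature(X))
print("imbalance ratio:", imbalance_ratio(y))
print("max Fisher ratio:", max_fisher_ratio(X, y))
```

In this sketch, a low samples-per-feature value and a high imbalance ratio would flag the two properties the abstract singles out, while a high Fisher ratio suggests that at least one feature separates the classes well on its own. Real gene expression data sets typically have thousands of features and tens of samples, so the first quantity would be far below 1.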