Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors

Authors:
Oleg Okun; Helen Priisalu
Affiliations:
University of Oulu, Department of Electrical and Information Engineering, P.O. Box 4500, Oulu 90014, Finland; Tallinn University of Technology, Institute of Cybernetics, Akadeemia Tee 21, Tallinn 12618, Estonia
Venue:
Artificial Intelligence in Medicine
Year:
2009


Abstract

Objective: We explore the link between dataset complexity, which determines how difficult a dataset is to classify, and classification performance, measured by the low-variance, low-bias bolstered resubstitution error of k-nearest neighbor classifiers.

Methods and material: Gene expression based cancer classification serves as the task in this study. Six gene expression datasets covering different types of cancer constitute the test data.

Results: Through extensive simulation coupled with the copula method for analyzing association in bivariate data, we show that dataset complexity and bolstered resubstitution error are statistically dependent. Based on this result, we propose a new scheme for generating ensembles of classifiers that selects low-complexity feature subsets for the ensemble members; according to the dependence relation found, such subsets yield accurate members.

Conclusion: Experiments with six gene expression datasets demonstrate that our ensemble generation scheme, based on the dependence between dataset complexity and classification error, is superior both to the single best classifier in the ensemble and to the traditional ensemble construction scheme, which ignores dataset complexity.
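
The ensemble scheme summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the complexity measure, so Fisher's discriminant ratio is assumed here as a stand-in (a higher ratio is taken to mean lower complexity). Candidate feature subsets are drawn at random, and members are combined by majority vote; both choices are assumptions as well. All function names and parameters are hypothetical, and the code assumes two classes labeled 0 and 1.

```python
import numpy as np


def fisher_ratio(X, y):
    """Largest per-feature Fisher discriminant ratio for a two-class problem.

    Assumed stand-in for a complexity measure: a higher ratio means the
    classes are better separated, i.e. the data are less complex.
    """
    a, b = X[y == 0], X[y == 1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0) + 1e-12  # guard against division by zero
    return float(np.max(num / den))


def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-nearest-neighbor vote with squared Euclidean distance (k odd)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)


def select_low_complexity_subsets(X, y, n_candidates=50, n_members=5,
                                  subset_size=5, seed=0):
    """Draw random feature subsets and keep the least complex ones."""
    rng = np.random.default_rng(seed)
    candidates = [rng.choice(X.shape[1], size=subset_size, replace=False)
                  for _ in range(n_candidates)]
    # Sort by decreasing Fisher ratio, i.e. by increasing assumed complexity.
    candidates.sort(key=lambda f: -fisher_ratio(X[:, f], y))
    return candidates[:n_members]


def ensemble_predict(subsets, X_train, y_train, X_test, k=3):
    """Majority vote over k-NN members, one per selected feature subset."""
    votes = np.stack([knn_predict(X_train[:, f], y_train, X_test[:, f], k)
                      for f in subsets])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

With an odd number of members and an odd k, neither voting step can tie. A traditional scheme of the kind the abstract compares against would correspond to keeping the first `n_members` random candidates without sorting them by complexity.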