Brief communication: Classification for high-throughput data with an optimal subset of principal components

Authors:
Joon Jin Song;Yuan Ren;Fenglan Yan
Affiliations:
Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA;Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA;Department of Poultry Science, University of Arkansas, Fayetteville, AR 72701, USA
Venue:
Computational Biology and Chemistry
Year:
2009

Abstract

High-throughput data are widely used in biological and medical studies to discover gene and protein functions. Because of their high dimensionality, principal component analysis (PCA) is often used for dimension reduction. When a few principal components (PCs) are selected for dimension reduction or considered for dimension determination, they are typically ranked by their variances, that is, by their eigenvalues. This ranking is not always effective in subsequent multivariate analysis, particularly classification. To maximize the information retained by a subset of the components, we apply a different ranking criterion, the canonical variate criterion, which considers within- and between-group variance rather than the total variance used in the classical criterion. Four prevalent classification methods are considered and compared using leave-one-out cross-validation. The methods are illustrated with three real high-throughput data sets: two microarray data sets and a nuclear magnetic resonance spectra data set.
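The contrast between the two rankings can be illustrated with a minimal sketch. The abstract does not give the exact form of the canonical variate criterion, so the code below assumes a simple per-component ratio of between-group to within-group sum of squares; the function name `rank_pcs_by_group_separation` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def rank_pcs_by_group_separation(X, y):
    """Compute PC scores and rank the PCs by a between/within ratio.

    Assumption: the ratio of between-group to within-group sum of
    squares per PC stands in for the canonical variate criterion.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Xc = X - X.mean(axis=0)  # center each variable
    # SVD returns components already sorted by decreasing variance.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = s > 1e-10 * s.max()  # drop numerically null components
    scores = U[:, keep] * s[keep]  # PC scores, one column per PC
    grand_mean = scores.mean(axis=0)
    between = np.zeros(scores.shape[1])
    within = np.zeros(scores.shape[1])
    for label in np.unique(y):
        group = scores[y == label]
        group_mean = group.mean(axis=0)
        between += len(group) * (group_mean - grand_mean) ** 2
        within += ((group - group_mean) ** 2).sum(axis=0)
    ratio = between / np.maximum(within, 1e-12)  # guard against division by zero
    order = np.argsort(ratio)[::-1]  # most discriminative PC first
    return scores, ratio, order

# Toy data (hypothetical): a high-variance variable unrelated to the class
# and a low-variance variable that separates the two classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
noise = rng.normal(0.0, 10.0, size=40)
signal = 2.0 * y + rng.normal(0.0, 0.3, size=40)
X = np.column_stack([noise, signal])

scores, ratio, order = rank_pcs_by_group_separation(X, y)
print("variance order:        ", list(range(scores.shape[1])))
print("between/within order:  ", order.tolist())
```

In this toy setting the first PC carries most of the total variance but little class information, so the two rankings disagree. In a cross-validated comparison of classifiers, the ranking would need to be recomputed inside each training fold to avoid selection bias.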