The impact of sample reduction on PCA-based feature extraction for supervised learning

  • Authors:
  • Mykola Pechenizkiy;Seppo Puuronen;Alexey Tsymbal

  • Affiliations:
  • Univ. of Jyväskylä, Finland;Univ. of Jyväskylä, Finland;Trinity College Dublin, Ireland

  • Venue:
  • Proceedings of the 2006 ACM symposium on Applied computing
  • Year:
  • 2006

Abstract

"The curse of dimensionality" is pertinent to many learning algorithms, and it denotes the drastic raise of computational complexity and classification error in high dimensions. In this paper, different feature extraction (FE) techniques are analyzed as means of dimensionality reduction, and constructive induction with respect to the performance of Naïve Bayes classifier. When a data set contains a large number of instances, some sampling approach is applied to address the computational complexity of FE and classification processes. The main goal of this paper is to show the impact of sample reduction on the process of FE for supervised learning. In our study we analyzed the conventional PCA and two eigenvector-based approaches that take into account class information. The first class-conditional approach is parametric and optimizes the ratio of between-class variance to the within-class variance of the transformed data. The second approach is a nonparametric modification of the first one based on the local calculation of the between-class covariance matrix. The experiments are conducted on ten UCI data sets, using four different strategies to select samples: (1) random sampling, (2) stratified random sampling, (3) kd-tree based selective sampling, and (4) stratified sampling with kd-tree based selection. Our experiments show that if the sample size for FE model construction is small then it is important to take into account both class information and data distribution. Further, for supervised learning the nonparametric FE approach needs much less instances to produce a new representation space that result in the same or higher classification accuracy than the other FE approaches.