Determination of the optimal number of features for quadratic discriminant analysis via the normal approximation to the discriminant distribution

  • Authors:
  • Jianping Hua; Zixiang Xiong; Edward R. Dougherty

  • Affiliations:
  • Department of Electrical Engineering, Texas A&M University, College Station, TX 77843, USA (all authors); Department of Pathology, University of Texas M.D. Anderson Cancer Center, Houston, TX, USA (E. R. Dougherty)

  • Venue:
  • Pattern Recognition
  • Year:
  • 2005

Abstract

Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically, for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The problem is especially acute when sample sizes are very small and the potential number of features is very large. To obtain a general understanding of the kinds of feature-set sizes that provide good performance for a particular classification rule, performance must be evaluated based on accurate error estimation, and hence a model-based setting for optimizing the number of features is needed. This paper treats quadratic discriminant analysis (QDA) in the case of unequal covariance matrices. For two normal class-conditional distributions, the QDA classifier is determined according to a discriminant. The standard plug-in rule estimates this discriminant from a feature-label sample by replacing the means and covariance matrices with their respective sample means and sample covariance matrices. The unbiasedness of these estimators assures good estimation for large samples, but not for small samples. Our goal is to find an essentially analytic method to produce an error curve as a function of the number of features so that the curve can be minimized to determine an optimal number of features. We use a normal approximation to the distribution of the estimated discriminant. Since the mean and variance of the estimated discriminant are computed exactly for this approximation, they provide insight into how the covariance matrices affect the optimal number of features. We derive the mean and variance of the estimated discriminant and compare feature-size optimization using the normal approximation with optimization obtained by simulating the true distribution of the estimated discriminant. Optimization via the normal approximation provides huge computational savings in comparison to optimization via simulation of the true distribution, and it is very accurate when the covariance matrices differ modestly. The optimal number of features based on the normal approximation will exceed the actual optimal number when there is large disagreement between the covariance matrices; however, this difference is not important, because the true misclassification errors for the feature-set size obtained from the normal approximation and for the size obtained from the true distribution differ only slightly, even for significantly different covariance matrices.
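
The following is a minimal sketch of the setting described in the abstract, assuming two Gaussian classes with unequal covariances: a plug-in QDA discriminant is designed from a small sample for each candidate feature-set size, its true error is estimated by Monte Carlo, and the error curve is minimized over the number of features. The model parameters (means, covariances, per-class sample size, feature ordering) are illustrative assumptions, not values from the paper, and the simulation here stands in for the paper's analytic normal-approximation method.

```python
# Illustrative sketch (not the authors' code): plug-in QDA design and a
# Monte Carlo error-versus-feature-size curve for two Gaussian classes with
# unequal covariance matrices. All model parameters below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def plug_in_qda(X0, X1):
    """Design the QDA discriminant by plugging sample means and sample
    covariance matrices into the normal-theory discriminant."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.atleast_2d(np.cov(X0, rowvar=False))
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)
    log_det_ratio = np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1]

    def W(X):
        # W(x) = 0.5[(x-m0)' S0^{-1} (x-m0) - (x-m1)' S1^{-1} (x-m1)]
        #        - 0.5 log(|S1|/|S0|); classify to class 1 when W(x) > 0.
        d0, d1 = X - m0, X - m1
        q0 = np.einsum('...i,ij,...j->...', d0, S0i, d0)
        q1 = np.einsum('...i,ij,...j->...', d1, S1i, d1)
        return 0.5 * (q0 - q1) - 0.5 * log_det_ratio

    return W

def true_error(W, mu0, mu1, C0, C1, n_test=20000):
    """Monte Carlo estimate of the designed classifier's true error
    (equal prior probabilities assumed)."""
    T0 = rng.multivariate_normal(mu0, C0, n_test)
    T1 = rng.multivariate_normal(mu1, C1, n_test)
    return 0.5 * (np.mean(W(T0) > 0) + np.mean(W(T1) <= 0))

# Hypothetical model: D candidate features ordered by diminishing separation,
# modestly different covariance matrices, and a small per-class sample size n.
D, n = 20, 30
mu0 = np.zeros(D)
mu1 = 0.6 / np.sqrt(1.0 + np.arange(D))
C0 = np.eye(D)
C1 = 1.5 * np.eye(D)

errors = []
for d in range(1, D + 1):                      # design with the first d features
    X0 = rng.multivariate_normal(mu0[:d], C0[:d, :d], n)
    X1 = rng.multivariate_normal(mu1[:d], C1[:d, :d], n)
    errors.append(true_error(plug_in_qda(X0, X1),
                             mu0[:d], mu1[:d], C0[:d, :d], C1[:d, :d]))

best_d = int(np.argmin(errors)) + 1
print(f"optimal feature-set size for this run: {best_d} "
      f"(estimated error {errors[best_d - 1]:.3f})")
```

The paper's contribution is, in effect, to replace the Monte Carlo error estimation in such a loop with a normal approximation to the distribution of the estimated discriminant, using the exact mean and variance derived analytically, so that the error curve and its minimizing feature-set size can be obtained without simulation.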