Nonparametric variable selection and classification: The CATCH algorithm

  • Authors:
  • Shijie Tang; Lisha Chen; Kam-Wah Tsui; Kjell Doksum

  • Venue:
  • Computational Statistics & Data Analysis
  • Year:
  • 2014

Abstract

The problem of classifying a categorical response Y is considered in a nonparametric framework. The distribution of Y depends on a vector of predictors X, where the coordinates X_j of X may be continuous, discrete, or categorical. An algorithm is constructed to select the variables to be used for classification. For each variable X_j, an importance score s_j is computed to measure the strength of association of X_j with Y. The algorithm deletes X_j if s_j falls below a certain threshold. Monte Carlo simulations show that the algorithm has a high probability of selecting only variables associated with Y. Moreover, when this variable selection rule is used for dimension reduction prior to applying classification procedures, it improves the performance of these procedures. The approach for computing importance scores is based on root chi-square type statistics computed for randomly selected regions (tubes) of the sample space. The size and shape of the regions are adjusted iteratively and adaptively using the data to enhance the ability of the importance score to detect local relationships between the response and the predictors. These local scores are then averaged over the tubes to form a global importance score s_j for variable X_j. When confounding and spurious associations are issues, the nonparametric importance score for variable X_j is computed conditionally by using tubes to restrict the other variables. This variable selection procedure is called CATCH (Categorical Adaptive Tube Covariate Hunting). Asymptotic properties, including consistency, are established.
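
The scoring idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' CATCH implementation: it handles only continuous predictors, uses tubes of a fixed size (no iterative or adaptive adjustment), and takes the local score to be the square root of a Pearson chi-square statistic on a median split of X_j. The function names (`chi2_stat`, `importance_score`, `select_variables`) and the parameters `n_tubes`, `frac`, and `threshold` are assumptions introduced for illustration only.

```python
import numpy as np


def chi2_stat(table):
    """Pearson chi-square statistic for a 2-D contingency table."""
    table = np.asarray(table, dtype=float)
    total = table.sum()
    if total == 0:
        return 0.0
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
    mask = expected > 0  # skip cells belonging to empty rows or columns
    return float(((table[mask] - expected[mask]) ** 2 / expected[mask]).sum())


def importance_score(X, y, j, n_tubes=50, frac=0.5, seed=0):
    """Average root chi-square score of column j over random tubes.

    Assumption: a tube is the set of the frac * n points closest to a
    randomly chosen center, measured in the coordinates other than j
    (max-norm), so the other variables are restricted inside the tube.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    classes = np.unique(y)
    others = [k for k in range(p) if k != j]
    tube_size = max(int(frac * n), 10)
    local_scores = []
    for _ in range(n_tubes):
        center = rng.integers(n)
        if others:
            dist = np.abs(X[:, others] - X[center, others]).max(axis=1)
            idx = np.argsort(dist)[:tube_size]
        else:
            idx = np.arange(n)  # no other variables: the tube is everything
        xj, yt = X[idx, j], y[idx]
        upper = xj > np.median(xj)  # median split of X_j inside the tube
        table = [[np.sum((upper == half) & (yt == c)) for c in classes]
                 for half in (False, True)]
        local_scores.append(np.sqrt(chi2_stat(table)))
    return float(np.mean(local_scores))


def select_variables(X, y, threshold, **kwargs):
    """Keep the indices of variables whose score reaches the threshold."""
    scores = [importance_score(X, y, j, **kwargs) for j in range(X.shape[1])]
    return [j for j, s in enumerate(scores) if s >= threshold], scores
```

On simulated data in which Y depends on a single coordinate, the score of that coordinate should dominate the scores of the pure-noise coordinates, so a moderate threshold would retain only the relevant variable. The threshold in this sketch is a free parameter and is not the rule used in the paper.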