The coefficient of intrinsic dependence (feature selection using el CID)

  • Authors:
  • Tailen Hsing; Li-Yu Liu; Marcel Brun; Edward R. Dougherty

  • Affiliations:
  • Department of Statistics, Texas A&M University, College Station, TX, USA; Department of Statistics, Texas A&M University, College Station, TX, USA; Department of Biochemistry and Molecular Biology, University of Louisville, KY, USA; Department of Electrical Engineering, Texas A&M University, 3128 TAMU, College Station, TX 77843-3128, USA and Department of Pathology, University of Texas M. D. Anderson Cancer Center, Houston, T ...

  • Venue:
  • Pattern Recognition
  • Year:
  • 2005

Abstract

Measuring the strength of dependence between two sets of random variables lies at the heart of many statistical problems, in particular feature selection for pattern recognition. We believe that there are some basic desirable criteria for a measure of dependence that are not satisfied by many commonly employed measures, such as the correlation coefficient. Briefly stated, a measure of dependence should: (1) be model-free and invariant under monotone transformations of the marginals; (2) fully differentiate different levels of dependence; (3) be applicable to both continuous and categorical distributions; (4) not require the dependence of X on Y to be the same as the dependence of Y on X; (5) be readily estimated from data; and (6) extend straightforwardly to multivariate distributions. The new measure of dependence introduced in this paper, called the coefficient of intrinsic dependence (CID), satisfies these criteria. The main motivating idea is that Y is strongly (weakly, resp.) dependent on X if and only if the conditional distribution of Y given X is significantly (mildly, resp.) different from the marginal distribution of Y. We measure the difference by the normalized integrated square difference distance, so that the full range of dependence is adequately reflected in the interval [0, 1]. The paper treats estimation of the CID, provides simulations and comparisons, and applies the CID to gene prediction and cancer classification based on gene-expression measurements from microarrays.
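
The motivating idea can be illustrated with a small numerical sketch. The abstract does not state the formula or the estimator, so everything below is an assumption made for illustration: the conditional distribution of Y given X is approximated by grouping the samples into equal-frequency bins of X, the squared difference between each binned conditional CDF and the marginal CDF of Y is averaged over the observed values of Y, and the result is divided by the average of F(y)(1 - F(y)) to normalize it into [0, 1]. The function name `cid_estimate` and the parameter `n_bins` are hypothetical, and this is not presented as the paper's estimator.

```python
import numpy as np


def cid_estimate(x, y, n_bins=10):
    """Binned plug-in estimate of a CID-style dependence of y on x.

    Illustrative sketch only: the binning scheme and the normalization
    are assumptions, not the estimator from the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Equal-frequency bins built from the ranks of x, so the result does
    # not change under strictly increasing transformations of x.
    ranks = np.argsort(np.argsort(x))
    bins = (ranks * n_bins) // n  # integer labels 0 .. n_bins - 1

    # ind[i, j] = 1 if y[i] <= y[j]; column means give empirical CDFs
    # evaluated at every observed value y[j].
    ind = (y[:, None] <= y[None, :]).astype(float)
    f_marginal = ind.mean(axis=0)

    # Weighted squared difference between conditional and marginal CDFs.
    numerator = np.zeros(n)
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        f_conditional = ind[mask].mean(axis=0)
        numerator += mask.mean() * (f_conditional - f_marginal) ** 2

    # Variance of the indicator 1{Y <= y}; bounds the numerator pointwise.
    denominator = f_marginal * (1.0 - f_marginal)
    total = denominator.sum()
    return float(numerator.sum() / total) if total > 0 else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, 1000)
    print("independent:", cid_estimate(x, rng.uniform(-1.0, 1.0, 1000)))
    print("y = x^2 given x:", cid_estimate(x, x ** 2))
    print("x given y = x^2:", cid_estimate(x ** 2, x))
```

Under these assumptions the sketch reflects several of the listed criteria: it uses only ranks and indicator comparisons, so it is unchanged by strictly increasing transformations of either variable; it is not symmetric in its two arguments; and it lies in [0, 1]. The number of bins limits how close to 1 the estimate can get for a deterministic relationship, which is a property of this binning shortcut rather than of the CID itself.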