Efficient selection of feature sets possessing high coefficients of determination based on incremental determinations

Authors:
Ronaldo F. Hashimoto;Edward R. Dougherty;Marcel Brun;Zheng-Zheng Zhou;Michael L. Bittner;Jeffrey M. Trent
Affiliations:
Department of Electrical Engineering, Texas A&M University, 3128 TAMU, College Station, TX and Departamento de Ciência de Computação, Instituto de Matemática e Estatístic ...;Department of Electrical Engineering, Texas A&M University, 3128 TAMU, College Station, TX and Department of Pathology, University of Texas M.D. Anderson Cancer Center, Houston, TX;Department of Electrical Engineering, Texas A&M University, 3128 TAMU, College Station, TX and Departamento de Ciência de Computação, Instituto de Matemática e Estatístic ...;NuTec Sciences, Inc.;National Human Genome Research Institute of the National Institutes of Health, Bethesda, MD;National Human Genome Research Institute of the National Institutes of Health, Bethesda, MD
Venue:
Signal Processing - Special issue: Genomic signal processing
Year:
2003

Abstract

Feature selection is problematic when the number of potential features is very large. Absent distribution knowledge, selecting a best feature set of a given size requires examining all feature sets of that size. This paper considers the question in the context of variable selection for prediction based on the coefficient of determination (CoD). The CoD varies between 0 and 1 and measures the degree to which prediction is improved by using the features relative to prediction in their absence. The paper examines the following heuristic: if we wish to find feature sets of size m with CoD exceeding δ, what is the effect of considering a feature set only if it contains a subset with CoD exceeding λ < δ? The analysis rests on conditional probabilities of the form P(θ > δ | max{θ1, θ2, ..., θv} < λ), where θ is the CoD of the feature set and θ1, θ2, ..., θv are the CoDs of the subsets. Such probabilities allow a rigorous analysis of the following decision procedure: the feature set is examined if max{θ1, θ2, ..., θv} ≥ λ. Computational saving increases as λ increases, but the probability of missing desirable feature sets increases as the increment δ − λ decreases; conversely, computational saving goes down as λ decreases, but the probability of missing desirable feature sets decreases as δ − λ increases. The paper considers various loss measures pertaining to omitting feature sets based on the criteria. After specializing the matter to binary features, it considers a simulation model and then applies the theory in the context of microarray-based genomic CoD analysis. The paper also provides optimal computational algorithms.
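
The decision procedure described in the abstract can be illustrated with a minimal Python sketch for binary features. It is not the paper's algorithm. It makes several assumptions that the abstract does not state: the CoD is estimated by resubstitution from a sample as (ε0 − ε)/ε0, where ε0 is the error of the best constant predictor and ε is the error of the best predictor on the observed feature patterns; the subsets are all subsets of size m − 1; and the CoD is taken to be 0 when ε0 = 0. The names `cod` and `select` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cod(X, y, features):
    """Resubstitution estimate of the CoD of `features` for a binary target y."""
    n = len(y)
    ones = sum(y)
    eps0 = min(ones, n - ones) / n  # error of the best constant predictor
    if eps0 == 0:
        return 0.0  # assumption: nothing to improve, so report 0
    counts = Counter()
    for row, target in zip(X, y):
        counts[(tuple(row[f] for f in features), target)] += 1
    patterns = {pattern for pattern, _ in counts}
    # For each observed pattern the best predictor errs on the minority class.
    eps = sum(min(counts[(p, 0)], counts[(p, 1)]) for p in patterns) / n
    return (eps0 - eps) / eps0


def select(X, y, m, delta, lam):
    """Return (feature sets of size m with CoD > delta, number of sets examined).

    A set is examined only if some subset of size m - 1 has CoD >= lam
    (the subset size is an assumption of this sketch).
    """
    cache = {}

    def subset_cod(fs):
        if fs not in cache:
            cache[fs] = cod(X, y, fs)
        return cache[fs]

    selected, examined = [], 0
    for fs in combinations(range(len(X[0])), m):
        if max(subset_cod(sub) for sub in combinations(fs, m - 1)) < lam:
            continue  # pruned: no subset reaches the threshold lam
        examined += 1
        theta = cod(X, y, fs)
        if theta > delta:
            selected.append((fs, theta))
    return selected, examined


# Toy data: full truth table of three binary features, target y = x0 XOR x1.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, _ in X]

print(select(X, y, m=2, delta=0.5, lam=0.0))  # → ([((0, 1), 1.0)], 3)
print(select(X, y, m=2, delta=0.5, lam=0.1))  # → ([], 0)
```

The XOR target is chosen deliberately: each single feature has CoD 0 while the pair (0, 1) has CoD 1, so any λ > 0 prunes the one desirable set. This is the kind of miss that the conditional probabilities in the abstract are meant to quantify against the computational saving.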