GFO: A data driven approach for optimizing the Gaussian function based similarity metric in computational biology

Authors:
Jian-Bo Lei;Jiang-Bo Yin;Hong-Bin Shen
Affiliations:
Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 800 Dongchuan Road, Shanghai 200240, China;Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 800 Dongchuan Road, Shanghai 200240, China;Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 800 Dongchuan Road, Shanghai 200240, China
Venue:
Neurocomputing
Year:
2013

Abstract

Algorithms based on the Gaussian function or kernel, $\exp(-\|x_i - x_j\|^2/\beta)$, are widely applied in computational biology research. The kernel is well known for its outstanding capability of measuring the remote similarity between any two samples in a mapped space, and it can be used in both unsupervised and supervised settings. Despite its success in bioinformatics applications, the scalar parameter $\beta$ has been shown to have a significant influence on the final results. No good method for determining the optimal value of $\beta$ exists so far, because the optimum varies between applications; it is usually identified by trial and error through a global grid search over a pre-defined candidate range. This global grid search is heavily limited by the difficulty of setting proper start and end points for the range and a proper grid scale, as well as by the high computational cost of the search for large datasets and for complicated learning algorithms. To deal with these problems, we present a systematic protocol consisting of two data-driven approaches to derive optimal choices for the Gaussian kernel parameter in bioinformatics studies, one for unsupervised cases and the other for supervised applications. The advantage of the two methods is that they depend only on the original dataset. Experiments on 6 datasets demonstrate the robustness and efficacy of the proposed approaches. An online calculator is available at http://www.csbio.sjtu.edu.cn/bioinf/GFO/ for free academic use.
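To make the role of the scalar parameter concrete, the following is a minimal sketch of the Gaussian similarity exp(-||x_i - x_j||^2 / beta) evaluated for a few candidate values of beta. It is not the GFO procedure, which the abstract does not specify. The function name gaussian_kernel_matrix, the toy data, and the candidate values are assumptions chosen only for illustration.

```python
import numpy as np


def gaussian_kernel_matrix(X, beta):
    """Return K with K[i, j] = exp(-||x_i - x_j||^2 / beta) for the rows of X.

    Hypothetical helper for illustration; beta is the scalar kernel parameter.
    """
    X = np.asarray(X, dtype=float)
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can produce tiny negative values; clip them to zero.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / beta)


# Toy data (assumed): three points in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Candidate values as one might place them on a search grid (assumed).
for beta in (0.1, 1.0, 10.0):
    K = gaussian_kernel_matrix(X, beta)
    print(f"beta = {beta}")
    print(np.round(K, 3))
```

With a small beta the off-diagonal similarities collapse toward 0, and with a large beta they all approach 1, so a badly chosen range or grid scale can leave every candidate in a regime where samples are indistinguishable. This is the sensitivity that motivates choosing beta from the data rather than by exhaustive search.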