GFO: A data driven approach for optimizing the Gaussian function based similarity metric in computational biology

  • Authors:
  • Jian-Bo Lei;Jiang-Bo Yin;Hong-Bin Shen

  • Affiliations:
  • Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 800 Dongchuan Road, Shanghai 200240, China (all authors)

  • Venue:
  • Neurocomputing
  • Year:
  • 2013

Abstract

Algorithms based on the Gaussian function or kernel, $\exp(-\|x_i - x_j\|^2/\beta)$, are widely applied in computational biology research. The kernel is well known for its outstanding capability of measuring the remote similarity between any two samples in a mapped space, and it can be used in both unsupervised and supervised settings. Despite the success of the Gaussian kernel in bioinformatics applications, the scalar parameter $\beta$ has been shown to have a significant influence on the final results. Until now there has been no good method for determining the optimal value of $\beta$, since it varies across applications; it is usually identified by trial and error through a global grid search over a pre-defined candidate range. This global grid search is heavily limited by the difficulty of setting proper start and end points of the range and a proper grid scale, as well as by the high computational cost of the search when the dataset is large or the learning algorithm is complicated. To deal with these problems, we present a systematic protocol consisting of two data-driven approaches for deriving optimal choices of the Gaussian kernel parameter in bioinformatics studies, one for unsupervised cases and the other for supervised applications. The advantage of the two methods is that they depend only on the original dataset. Experiments on 6 datasets demonstrate the robustness and efficacy of the proposed approaches. An online calculator is available at http://www.csbio.sjtu.edu.cn/bioinf/GFO/ for free academic use.
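To make the role of the scalar parameter concrete, the following minimal sketch evaluates the Gaussian similarity $\exp(-\|x_i - x_j\|^2/\beta)$ on a toy dataset for several candidate values of $\beta$ taken from a log-spaced grid. It illustrates only the kernel and the kind of pre-defined grid the abstract describes as the conventional baseline; it is not the GFO method, which the abstract does not specify. The function names, the toy samples, and the grid bounds are assumptions chosen for illustration.

```python
import math


def gaussian_similarity(x_i, x_j, beta):
    """Return exp(-||x_i - x_j||^2 / beta) for two equal-length vectors."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / beta)


def similarity_matrix(samples, beta):
    """Pairwise Gaussian similarities for a list of samples."""
    return [[gaussian_similarity(a, b, beta) for b in samples] for a in samples]


def log_spaced_grid(start, end, num):
    """Candidate beta values spaced evenly on a log scale.

    The bounds and the number of points are exactly the choices that a
    conventional grid search forces the user to make by hand.
    """
    if num < 2 or start <= 0 or end <= start:
        raise ValueError("need num >= 2 and 0 < start < end")
    step = (math.log(end) - math.log(start)) / (num - 1)
    return [math.exp(math.log(start) + k * step) for k in range(num)]


if __name__ == "__main__":
    # Hypothetical toy data: squared distances from sample 0 are 1 and 25.
    samples = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
    for beta in log_spaced_grid(0.1, 100.0, 4):
        sim = similarity_matrix(samples, beta)
        print(f"beta={beta:8.3f}  s(0,1)={sim[0][1]:.4f}  s(0,2)={sim[0][2]:.4f}")
```

For small $\beta$ every off-diagonal similarity collapses toward 0, and for large $\beta$ every similarity approaches 1, so a grid that misses the informative middle range yields uninformative similarity values. This is the sensitivity that motivates choosing $\beta$ from the data itself.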