Approximation of graph kernel similarities for chemical graphs by kernel principal component analysis

  • Authors:
  • Georg Hinselmann;Andreas Jahn;Nikolas Fechner;Lars Rosenbaum;Andreas Zell

  • Affiliations:
  • Wilhelm-Schickard-Institute for Computer Science, Dept. Cognitive Science, University of Tübingen, Tübingen, Germany;Wilhelm-Schickard-Institute for Computer Science, Dept. Cognitive Science, University of Tübingen, Tübingen, Germany;Eli Lilly U.K., Erl Wood Manor, Windlesham, Surrey, U.K;Wilhelm-Schickard-Institute for Computer Science, Dept. Cognitive Science, University of Tübingen, Tübingen, Germany;Wilhelm-Schickard-Institute for Computer Science, Dept. Cognitive Science, University of Tübingen, Tübingen, Germany

  • Venue:
  • EvoBIO'11 Proceedings of the 9th European conference on Evolutionary computation, machine learning and data mining in bioinformatics
  • Year:
  • 2011

Abstract

Graph kernels have been applied successfully to chemical graphs in small- to medium-sized machine learning problems. However, graph kernels often require a graph transformation before the kernel can be computed. Furthermore, the kernel calculation can have polynomial complexity of degree three or higher. Graph kernels therefore cannot be applied to large instance-based machine learning problems. Using kernel principal component analysis, we mapped the compounds onto the principal components, obtaining q-dimensional real-valued vectors. The goal of this study is to investigate the correlation between the graph kernel similarities and the similarities between these vectors. In the experiments, we compared the similarities on various data sets covering a wide range of typical chemical data mining problems. The similarity matrix of the vectorial projections was computed with the Jaccard and cosine similarity coefficients and correlated with the similarity matrix of the original graph kernel. The main result is a strong correlation, in terms of both rank correlation and linear correlation, between the similarities of the vectors and those of the original graph kernel. The method appears to be robust and independent of the choice of the reference subset, with observed standard deviations below 5%. Important applications of the approach are instance-based data mining and machine learning tasks in which computing the original graph kernel would be prohibitive.
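
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the graph kernel, so a Tanimoto kernel on random binary fingerprints stands in for it, and the function names (`tanimoto_kernel`, `kpca_fit`, `kpca_project`), the sizes `n`, `m`, `q`, and the way the reference subset is drawn are all assumptions made for illustration. The sketch fits kernel PCA on a reference subset, projects all compounds to q-dimensional vectors, and correlates the cosine and Jaccard (Tanimoto) similarities of those vectors with the original kernel values.

```python
import numpy as np

def tanimoto_kernel(X, Y):
    # Stand-in for a graph kernel (assumption): Tanimoto on binary fingerprints.
    inter = X @ Y.T
    union = X.sum(1)[:, None] + Y.sum(1)[None, :] - inter
    return inter / union

def kpca_fit(K_ref, q):
    # Center the reference kernel matrix in feature space.
    m = K_ref.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K_ref - one @ K_ref - K_ref @ one + one @ K_ref @ one
    w, v = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:q]          # keep the q largest
    w = np.maximum(w[idx], 1e-12)          # guard against tiny/negative values
    return v[:, idx] / np.sqrt(w)          # scaled eigenvectors (m x q)

def kpca_project(K_new, K_ref, alphas):
    # K_new holds kernel values between new compounds (rows) and the references.
    n, m = K_new.shape
    one_n = np.full((n, m), 1.0 / m)
    one_m = np.full((m, m), 1.0 / m)
    Kc = K_new - one_n @ K_ref - K_new @ one_m + one_n @ K_ref @ one_m
    return Kc @ alphas                     # q-dimensional real-valued vectors

rng = np.random.default_rng(0)
n, m, q = 60, 30, 5                        # hypothetical sizes
X = (rng.random((n, 64)) < 0.3).astype(float)   # toy fingerprints
ref = rng.choice(n, size=m, replace=False)      # reference subset

K_ref = tanimoto_kernel(X[ref], X[ref])
alphas = kpca_fit(K_ref, q)
Z = kpca_project(tanimoto_kernel(X, X[ref]), K_ref, alphas)

# Similarities between the projected vectors.
G = Z @ Z.T
norms = np.sqrt(np.diag(G))
S_cos = G / np.outer(norms, norms)
sq = np.diag(G)
S_jac = G / (sq[:, None] + sq[None, :] - G)

# Linear (Pearson) correlation with the original kernel, off-diagonal pairs only.
K_full = tanimoto_kernel(X, X)             # computed here only for comparison
iu = np.triu_indices(n, k=1)
r_cos = np.corrcoef(K_full[iu], S_cos[iu])[0, 1]
r_jac = np.corrcoef(K_full[iu], S_jac[iu])[0, 1]
print(f"Pearson r (cosine): {r_cos:.3f}, (Jaccard): {r_jac:.3f}")
```

Only the `n × m` kernel block against the reference subset is needed to obtain the vectors; the full `n × n` kernel is computed here solely to measure the correlation. A rank correlation could be obtained in the same way, for example with `scipy.stats.spearmanr` on the two flattened similarity arrays. The values printed for this toy data say nothing about the results reported in the paper.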