Mining protein interactions from text using convolution kernels

  • Authors:
  • Ramanathan Narayanan; Sanchit Misra; Simon Lin; Alok Choudhary

  • Affiliations:
  • Department of Electrical Engineering and Computer Science, Northwestern University; Department of Electrical Engineering and Computer Science, Northwestern University; Feinberg School of Medicine, Northwestern University; Department of Electrical Engineering and Computer Science, Northwestern University

  • Venue:
  • PAKDD '09: Proceedings of the 13th Pacific-Asia International Conference on Knowledge Discovery and Data Mining: New Frontiers in Applied Data Mining
  • Year:
  • 2009

Abstract

As biomedical literature databases grow, there is an urgent need for intelligent systems that automatically discover protein-protein interactions from text. Despite resource-intensive efforts to create manually curated interaction databases, the sheer volume of biological literature makes it impossible to achieve significant coverage. In this paper, we describe a scalable, hierarchical Support Vector Machine (SVM)-based framework that efficiently mines protein interactions with high precision. In addition, we describe a convolution tree-vector kernel, based on the syntactic similarity of natural language text, that further enhances the mining process. By using the inherent syntactic similarity of interaction phrases as a kernel, we are able to significantly improve classification quality. Our hierarchical framework reduces the search space dramatically at each stage while sustaining a high level of accuracy. We test our framework on a corpus of over 10,000 manually annotated phrases gathered from various sources. The convolution kernel technique identifies sentences describing interactions with a precision of 95% and a recall of 92%, yielding significant improvements over previous machine learning techniques.
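
The abstract does not give the form of the tree-vector kernel, so the following is only a minimal sketch of one common convolution tree kernel: it counts the tree fragments that two parse trees share, in the style of subset-tree kernels. The nested-tuple tree encoding, the function names, the `decay` parameter, and the example phrases are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import product

# Assumed encoding: a parse tree is a nested tuple (label, child, child, ...),
# and a leaf word is a plain string, e.g. ("NP", ("D", "the"), ("N", "protein")).

def nodes(tree):
    """Yield every internal node (tuple) of the tree, root first."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)

def production(node):
    """Grammar production at a node: its label plus its children's labels.
    Words are tagged so they can never be confused with node labels."""
    return (node[0],) + tuple(
        c[0] if isinstance(c, tuple) else ("word", c) for c in node[1:]
    )

def common_fragments(n1, n2, decay):
    """Weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Productions match, so children align position by position.
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):
            score *= 1.0 + common_fragments(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    """Sum the shared-fragment counts over all pairs of nodes."""
    return sum(
        common_fragments(a, b, decay) for a, b in product(nodes(t1), nodes(t2))
    )

# Hypothetical toy phrases (not taken from the paper's corpus).
the_protein = ("NP", ("D", "the"), ("N", "protein"))
a_protein = ("NP", ("D", "a"), ("N", "protein"))

same = tree_kernel(the_protein, the_protein)   # 6.0: all six fragments shared
close = tree_kernel(the_protein, a_protein)    # 3.0: fragments without "the"/"a"
```

A kernel of this kind can be evaluated for every pair of training phrases and passed to an SVM as a precomputed Gram matrix. A `decay` below 1 down-weights larger fragments so that big shared subtrees do not dominate the similarity.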