Multiple kernel learning in protein-protein interaction extraction from biomedical literature

  • Authors:
  • Zhihao Yang;Nan Tang;Xiao Zhang;Hongfei Lin;Yanpeng Li;Zhiwei Yang

  • Affiliations:
  • Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;Department of Ultrasound, Oil Field Hospital of Daqing, Heilongjiang 163001, China

  • Venue:
  • Artificial Intelligence in Medicine
  • Year:
  • 2011

Abstract

Objective: Knowledge about protein-protein interactions (PPIs) unveils the molecular mechanisms of biological processes. The volume and content of published biomedical literature on protein interactions are expanding rapidly, making it increasingly difficult for interaction database administrators, who are responsible for content input and maintenance, to detect and manually update protein interaction information. The objective of this work is to develop an effective approach to automatic extraction of PPI information from biomedical literature. Methods and materials: We present a weighted multiple kernel learning-based approach for automatic PPI extraction from biomedical literature. The approach combines the following kernels: feature-based, tree, graph and part-of-speech (POS) path. In particular, we extend the shortest path-enclosed tree (SPT) and dependency path tree to capture richer contextual information. Results: Our experimental results show that the combination of SPT and dependency path tree extensions improves performance by almost 0.7 percentage points in F-score and 2 percentage points in area under the receiver operating characteristics curve (AUC). Combining two or more appropriately weighted individual kernels further improves the performance. In both individual-corpus and cross-corpus evaluations, our combined kernel achieves state-of-the-art performance with respect to comparable evaluations, with 64.41% F-score and 88.46% AUC on the AImed corpus. Conclusions: Because different kernels calculate the similarity between two sentences from different aspects, our combined kernel can reduce the risk of missing important features. More specifically, we use a weighted linear combination of individual kernels instead of assigning the same weight to each individual kernel, thus allowing each added kernel to contribute incrementally to the performance improvement. In addition, the SPT and dependency path tree extensions improve performance by including richer context information.
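
The weighted linear combination of kernels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy vectors, the weights 0.7 and 0.3, the cosine normalization step, and the function names (normalize_kernel, combine_kernels, gram) are all assumptions made for illustration; the abstract does not state how the weights were chosen or whether the kernels were normalized.

```python
import math


def gram(vectors):
    """Build a linear-kernel Gram matrix from a list of feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def normalize_kernel(K):
    """Cosine-normalize a Gram matrix so every diagonal entry equals 1.

    Assumption: normalization is added here so kernels on different
    scales are comparable; the abstract does not mention it.
    """
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]


def combine_kernels(kernels, weights):
    """Return sum_m w_m * K_m with non-negative weights rescaled to sum to 1."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for K, w in zip(kernels, weights):
        Kn = normalize_kernel(K)
        for i in range(n):
            for j in range(n):
                combined[i][j] += (w / total) * Kn[i][j]
    return combined


# Hypothetical toy data: two views of the same three sentences.
feature_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
tree_vectors = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]]

K_feature = gram(feature_vectors)  # stands in for a feature-based kernel
K_tree = gram(tree_vectors)        # stands in for a tree kernel

# Illustrative weights only; equal weights would reduce to a plain average.
K_combined = combine_kernels([K_feature, K_tree], [0.7, 0.3])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so a matrix such as K_combined could be passed to any classifier that accepts a precomputed kernel.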