PubMed-scale event extraction for post-translational modifications, epigenetics and protein structural relations

  • Authors:
  • Jari Björne; Sofie Van Landeghem; Sampo Pyysalo; Tomoko Ohta; Filip Ginter; Yves Van de Peer; Sophia Ananiadou; Tapio Salakoski

  • Affiliations:
  • Turku Centre for Computer Science (TUCS), Joukahaisenkatu, Turku, Finland and University of Turku, Finland; VIB, Technologiepark, Gent, Belgium and Ghent University, Gent, Belgium; National Centre for Text Mining and University of Manchester, Manchester Interdisciplinary Biocentre, Manchester, UK; National Centre for Text Mining and University of Manchester, Manchester, UK; University of Turku, Finland; VIB, Technologiepark, Gent, Belgium and Ghent University, Gent, Belgium; National Centre for Text Mining and University of Manchester, Manchester, UK; Turku Centre for Computer Science (TUCS), Joukahaisenkatu, Turku, Finland and University of Turku, Finland

  • Venue:
  • BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
  • Year:
  • 2012

Abstract

Recent efforts in biomolecular event extraction have mainly focused on core event types involving genes and proteins, such as gene expression, protein-protein interactions, and protein catabolism. The BioNLP'11 Shared Task extended the event extraction approach to sub-protein events and relations in the Epigenetics and Post-translational Modifications (EPI) and Protein Relations (REL) tasks. In this study, we apply the Turku Event Extraction System, the best-performing system for these tasks, to all PubMed abstracts and all available PMC full-text articles, extracting 1.4M EPI events and 2.2M REL relations from 21M abstracts and 372K articles. We introduce several entity normalization algorithms for genes, proteins, protein complexes and protein components, aiming to uniquely identify these biological entities. This normalization effort allows the extracted events and relations to be mapped directly to post-translational modifications from UniProt, epigenetics data from PubMeth, functional domains from InterPro and macromolecular structures from PDB. The extraction of such detailed protein information provides a unique text mining dataset that can deepen the information provided by existing PubMed-scale event extraction efforts. The methods and data introduced in this study are freely available from bionlp.utu.fi.