Building a coreference-annotated corpus from the domain of biochemistry

  • Authors:
  • Riza Theresa Batista-Navarro;Sophia Ananiadou

  • Affiliations:
  • University of Manchester, United Kingdom and University of the Philippines Diliman, Philippines;University of Manchester, United Kingdom

  • Venue:
  • BioNLP '11 Proceedings of BioNLP 2011 Workshop
  • Year:
  • 2011

Abstract

Coreference resolution remains a challenging information extraction task, especially in the biomedical domain, in part because annotated corpora for training are scarce. To address this issue, we developed the HANAPIN corpus. It consists of 20 full-text, open-access articles from the biochemistry literature and covers entities of several semantic types: chemical compounds, drug targets (e.g., proteins, enzymes, cell lines, pathogens), diseases, organisms and drug effects. All co-referring expressions pertaining to these semantic types were annotated according to an annotation scheme that we developed. We observed four general types of coreference in the corpus: sortal, pronominal, abbreviation and numerical. Inter-annotator agreement, measured as Krippendorff's alpha with MASI as the distance metric, was 84%. The corpus is intended as a resource that other researchers can use in developing their own coreference resolution methods.
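To make the agreement measure concrete, the following is a minimal sketch of Krippendorff's alpha computed with the MASI distance over set-valued labels. It is not the authors' implementation. The abstract does not say how coreference annotations were mapped to sets, so the sketch assumes that each mention is labelled by each annotator with the set of mention identifiers in its coreference chain. The function names, the `units` data layout and the sample identifiers are hypothetical.

```python
from itertools import combinations


def masi_distance(a, b):
    """MASI distance between two sets: 1 - Jaccard * monotonicity."""
    a, b = frozenset(a), frozenset(b)
    if not a and not b:
        return 0.0  # two empty labels are treated as identical
    jaccard = len(a & b) / len(a | b)
    if a == b:
        monotonicity = 1.0        # identical sets
    elif a <= b or b <= a:
        monotonicity = 2 / 3      # one set contains the other
    elif a & b:
        monotonicity = 1 / 3      # overlap, but neither contains the other
    else:
        monotonicity = 0.0        # disjoint sets
    return 1.0 - jaccard * monotonicity


def krippendorff_alpha(units, distance):
    """Krippendorff's alpha = 1 - D_o / D_e for a given distance function.

    `units` is a list of items; each item is the list of labels that the
    annotators assigned to it (hypothetical layout for this sketch).
    """
    units = [u for u in units if len(u) >= 2]  # only pairable items count
    n = sum(len(u) for u in units)             # total number of labels

    # Observed disagreement: distances between labels of the same item.
    d_o = sum(
        2 * sum(distance(x, y) for x, y in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: distances between all pairs of pooled labels.
    pooled = [label for u in units for label in u]
    d_e = 2 * sum(distance(x, y) for x, y in combinations(pooled, 2)) / (n * (n - 1))

    if d_e == 0:
        return 1.0  # no variation at all; convention chosen for this sketch
    return 1.0 - d_o / d_e


# Hypothetical example: two annotators, three mentions, chain sets as labels.
units = [
    [{"m1", "m2", "m3"}, {"m1", "m2", "m3"}],  # full agreement
    [{"m4", "m5"}, {"m4"}],                    # partial agreement (subset)
    [{"m6"}, {"m7"}],                          # disagreement (disjoint)
]
print(round(krippendorff_alpha(units, masi_distance), 3))
```

MASI gives partial credit when two annotators' sets overlap, which suits coreference chains better than an all-or-nothing comparison of labels.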