Building a coreference-annotated corpus from the domain of biochemistry

Authors:
Riza Theresa Batista-Navarro;Sophia Ananiadou
Affiliations:
University of Manchester, United Kingdom and University of the Philippines Diliman, Philippines;University of Manchester, United Kingdom
Venue:
BioNLP '11 Proceedings of BioNLP 2011 Workshop
Year:
2011

Abstract

Coreference resolution remains a challenging information extraction task, especially in the biomedical domain, partly because training data in the form of annotated corpora are scarce. To address this issue, we developed the HANAPIN corpus. It consists of 20 full-text, open-access articles from the biochemistry literature and covers entities of several semantic types: chemical compounds, drug targets (e.g., proteins, enzymes, cell lines, pathogens), diseases, organisms and drug effects. All co-referring expressions pertaining to these semantic types were annotated according to an annotation scheme that we developed. We observed four general types of coreference in the corpus: sortal, pronominal, abbreviation and numerical. Using the MASI distance metric, we obtained an inter-annotator agreement of 84% in terms of Krippendorff's alpha. The corpus is intended as a resource that other researchers can use in developing their own coreference resolution methodologies.
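For readers unfamiliar with the agreement measure named in the abstract, the following is a minimal sketch of how Krippendorff's alpha can be computed with MASI as the distance between set-valued annotations. It is not the authors' implementation. The function names, the data layout (one list of label sets per annotated item) and the toy labels are assumptions made for illustration only. The MASI weights (1, 2/3, 1/3, 0) follow the usual definition of the measure: the Jaccard coefficient scaled by a monotonicity term.

```python
from itertools import permutations


def masi_distance(a, b):
    """MASI distance between two label sets: 1 - Jaccard * monotonicity."""
    a, b = frozenset(a), frozenset(b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    inter, union = a & b, a | b
    jaccard = len(inter) / len(union)
    if a == b:
        m = 1.0          # identical sets
    elif a <= b or b <= a:
        m = 2 / 3        # one set contains the other
    elif inter:
        m = 1 / 3        # overlap, but neither contains the other
    else:
        m = 0.0          # disjoint sets
    return 1.0 - jaccard * m


def krippendorff_alpha(units, distance):
    """Krippendorff's alpha = 1 - D_o / D_e for an arbitrary distance.

    `units` is a list of items; each item is the list of labels that
    the annotators assigned to it (hypothetical layout for this sketch).
    """
    pairable = [u for u in units if len(u) >= 2]  # need at least 2 labels
    n = sum(len(u) for u in pairable)
    if n < 2:
        raise ValueError("need at least one item labelled by two annotators")
    # Observed disagreement: average within-item distance.
    d_o = sum(
        sum(distance(x, y) for x, y in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: average distance over all pairs of labels.
    values = [v for u in pairable for v in u]
    d_e = sum(distance(x, y) for x, y in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # no variation at all in the labels
    return 1.0 - d_o / d_e


# Toy data (invented): two annotators, three items, labels are sets of
# mention identifiers grouped into the same coreference chain.
toy_units = [
    [{"m1", "m2"}, {"m1", "m2"}],
    [{"m3", "m4"}, {"m3"}],
    [{"m5"}, {"m5"}],
]
alpha = krippendorff_alpha(toy_units, masi_distance)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement at chance level. Using MASI rather than a plain equality test gives partial credit when two annotators mark overlapping but not identical sets.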