Encoding biomedical resources in TEI: the case of the GENIA corpus

Authors:
Tomaž Erjavec;Jin-Dong Kim;Tomoko Ohta;Yuka Tateisi;Jun-ichi Tsujii
Affiliations:
Jožef Stefan Institute, Ljubljana;University of Tokyo;Japan Science and Technology Corporation;Japan Science and Technology Corporation;University of Tokyo
Venue:
BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
Year:
2003

Abstract

Standardising the annotation of language resources significantly raises their potential, as it enables re-use and spurs the development of common technologies. Although increasingly complex linguistic information is being added to biomedical texts, no standard solutions have so far been proposed for their encoding. This paper describes a standardised XML tagset (DTD) for annotated biomedical corpora and other resources, based on the Text Encoding Initiative (TEI) Guidelines P4, a general and parameterisable standard for encoding language resources. We ground the discussion in the encoding of the GENIA corpus, which currently contains 2,000 abstracts taken from the MEDLINE database and almost 100,000 hand-annotated terms, each marked with a semantic class from the accompanying ontology. The paper introduces GENIA and TEI, and implements a TEI parametrisation and conversion for the GENIA corpus. It also discusses a number of aspects of biomedical language, such as complex tokenisation, the prevalence of contractions and complex terms, and the linkage and encoding of ontologies.
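To make the idea of terms marked for semantic class in XML more concrete, the following is a minimal sketch in Python. It is not the tagset defined in the paper: the element names (`s`, `w`, `term`), the attributes (`id`, `type`), the class label `protein`, and the sample sentence are all assumptions invented for illustration. It only shows how terms annotated inline could be read back out with the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment. Element names, attribute names, the class
# label "protein", and the sentence itself are illustrative assumptions, not
# the encoding or data described in the paper.
SAMPLE = """
<text>
  <body>
    <p>
      <s>
        <term id="t1" type="protein">IL-2</term>
        <w>expression</w>
        <w>requires</w>
        <term id="t2" type="protein">NF-kappa B</term>
      </s>
    </p>
  </body>
</text>
"""


def extract_terms(xml_text):
    """Return (id, semantic class, surface text) for every <term> element."""
    root = ET.fromstring(xml_text)
    return [
        (term.get("id"), term.get("type"), "".join(term.itertext()).strip())
        for term in root.iter("term")
    ]


terms = extract_terms(SAMPLE)
for term_id, sem_class, surface in terms:
    print(term_id, sem_class, surface)
```

In a real corpus the `type` value would presumably point into the accompanying ontology rather than being a free-text label; how that link is encoded is exactly the kind of decision the paper's TEI parametrisation addresses.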