Encoding biomedical resources in TEI: the case of the GENIA corpus

  • Authors:
  • Tomaž Erjavec;Jin-Dong Kim;Tomoko Ohta;Yuka Tateisi;Jun-ichi Tsujii

  • Affiliations:
  • Jožef Stefan Institute, Ljubljana;University of Tokyo;Japan Science and Technology Corporation;Japan Science and Technology Corporation;University of Tokyo

  • Venue:
  • BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
  • Year:
  • 2003

Abstract

It is well known that standardising the annotation of language resources significantly raises their potential, as it enables re-use and spurs the development of common technologies. Despite the fact that increasingly complex linguistic information is being added to biomedical texts, no standard solutions have so far been proposed for their encoding. This paper describes a standardised XML tagset (DTD) for annotated biomedical corpora and other resources, which is based on the Text Encoding Initiative Guidelines P4, a general and parameterisable standard for encoding language resources. We ground the discussion in the encoding of the GENIA corpus, which currently contains 2,000 abstracts taken from the MEDLINE database, and has almost 100,000 hand-annotated terms marked for semantic class from the accompanying ontology. The paper introduces GENIA and TEI and implements a TEI parametrisation and conversion for the GENIA corpus. A number of aspects of biomedical language are discussed, such as complex tokenisation, prevalence of contractions and complex terms, and the linkage and encoding of ontologies.