Guidelines for Electronic Text Encoding and Interchange: Volumes 1 and 2: P4
XML-based NLP tools for analysing and annotating medical language
NLPXML '02 Proceedings of the 2nd workshop on NLP and XML - Volume 17
The GENIA corpus: an annotated research abstract corpus in molecular biology domain
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Towards morphologically annotated corpus of hospital discharge reports in Polish
BioNLP '11 Proceedings of BioNLP 2011 Workshop
Standardising the annotation of language resources significantly raises their potential, as it enables re-use and spurs the development of common technologies. Although increasingly complex linguistic information is being added to biomedical texts, no standard solutions have so far been proposed for their encoding. This paper describes a standardised XML tagset (DTD) for annotated biomedical corpora and other resources, based on the Text Encoding Initiative (TEI) Guidelines P4, a general and parameterisable standard for encoding language resources. We ground the discussion in the encoding of the GENIA corpus, which currently contains 2,000 abstracts taken from the MEDLINE database and almost 100,000 hand-annotated terms, each marked for semantic class from the accompanying ontology. The paper introduces GENIA and TEI and implements a TEI parametrisation and conversion for the GENIA corpus. It also discusses a number of aspects of biomedical language, such as complex tokenisation, the prevalence of contractions and complex terms, and the linkage and encoding of ontologies.
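
As a rough illustration of what such a conversion involves, the sketch below rewrites one term-annotated sentence into TEI-style markup. It is not the paper's converter or DTD: the input element names (`sentence`, `cons` with a `sem` attribute), the output mapping to `s` and `term` with a `type` attribute, and the function name `genia_to_tei` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one sentence with a single semantically tagged term.
GENIA_SNIPPET = (
    '<sentence>Activation of the '
    '<cons sem="G#protein_molecule">IL-2 receptor</cons>'
    ' was observed.</sentence>'
)


def genia_to_tei(sentence_xml: str) -> ET.Element:
    """Map a term-annotated sentence to a TEI-style <s> element.

    Assumed mapping (illustrative only): <sentence> -> <s>,
    <cons sem="G#X"> -> <term type="X">. Nested terms are flattened
    to their text, which a real conversion would need to preserve.
    """
    src = ET.fromstring(sentence_xml)
    s = ET.Element("s")
    s.text = src.text  # text before the first child element
    for child in src:
        if child.tag == "cons":
            out = ET.SubElement(s, "term")
            sem = child.get("sem", "")
            if sem:
                # Drop the assumed "G#" ontology prefix, if present.
                out.set("type", sem[2:] if sem.startswith("G#") else sem)
        else:
            # Any other inline element is kept as a generic segment.
            out = ET.SubElement(s, "seg")
        out.text = "".join(child.itertext())
        out.tail = child.tail  # text between this child and the next
    return s


if __name__ == "__main__":
    print(ET.tostring(genia_to_tei(GENIA_SNIPPET), encoding="unicode"))
```

Keeping the semantic class in an attribute rather than in the element name is one way to leave the tagset fixed while the ontology changes; whether the paper's DTD does this is not stated in the abstract.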