Common data model for natural language processing based on two existing standard information models: CDA+GrAF

Authors:
Stéphane M. Meystre;Sanghoon Lee;Chai Young Jung;Raphaël D. Chevrier
Affiliations:
Department of Biomedical Informatics, University of Utah, School of Medicine, Salt Lake City, UT, United States and VA Salt Lake City Health Care System, Salt Lake City, UT, United States;Department of Biomedical Informatics, University of Utah, School of Medicine, Salt Lake City, UT, United States;Department of Biomedical Informatics, University of Utah, School of Medicine, Salt Lake City, UT, United States;University of Geneva, School of Medicine, Geneva, Switzerland
Venue:
Journal of Biomedical Informatics
Year:
2012

Abstract

A growing need for collaboration and resource sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated and extracted from clinical text. We combined two existing standards, the HL7 Clinical Document Architecture (CDA) and the ISO Graph Annotation Format (GrAF; in development), to develop such a data model, called "CDA+GrAF". We experimented with several methods for combining these standards and eventually selected one that wraps separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. Two use cases, clinical document sections and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), were used to create examples of such standoff annotation documents, which were successfully validated with the XML schemata provided with both standards. We developed a tool to automatically translate annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF and used it to generate 50 annotation documents, all of which validated successfully. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and we plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may make it easier to compare NLP tools and applications directly, combine their output, transform and "translate" annotations between different NLP applications, and eventually "plug and play" different modules in NLP applications.
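The idea of standoff annotation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' translation tool and does not produce schema-valid GrAF or CDA. It assumes the 2010 i2b2/VA concept line layout (`c="..." line:token line:token||t="..."`, with 1-based line numbers and 0-based whitespace-token indices) and converts such a line to character offsets kept apart from the text. The function names and the `annotations`/`annotation` XML elements are hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout of one i2b2 2010 concept line, e.g.:
#   c="chest pain" 1:3 1:4||t="problem"
CONCEPT_RE = re.compile(
    r'^c="(?P<text>.*)" (?P<sl>\d+):(?P<st>\d+) (?P<el>\d+):(?P<et>\d+)'
    r'\|\|t="(?P<type>[^"]+)"$'
)


def token_spans(text):
    """For each line, list the (start, end) character offsets of its tokens."""
    spans = []
    pos = 0
    for line in text.split("\n"):
        spans.append(
            [(pos + m.start(), pos + m.end()) for m in re.finditer(r"\S+", line)]
        )
        pos += len(line) + 1  # +1 accounts for the newline character
    return spans


def concept_to_standoff(concept_line, text):
    """Convert one concept line to a character-offset (standoff) annotation."""
    m = CONCEPT_RE.match(concept_line.strip())
    if m is None:
        raise ValueError(f"unrecognized concept line: {concept_line!r}")
    spans = token_spans(text)
    start = spans[int(m["sl"]) - 1][int(m["st"])][0]
    end = spans[int(m["el"]) - 1][int(m["et"])][1]
    return {"start": start, "end": end, "type": m["type"]}


def to_xml(annotations):
    """Serialize annotations to a hypothetical standoff XML layout (not GrAF)."""
    root = ET.Element("annotations")
    for i, a in enumerate(annotations, 1):
        ET.SubElement(
            root, "annotation",
            id=f"a{i}", start=str(a["start"]), end=str(a["end"]), label=a["type"],
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    note = "Patient denies any chest pain .\nStarted aspirin daily ."
    ann = concept_to_standoff('c="chest pain" 1:3 1:4||t="problem"', note)
    print(note[ann["start"]:ann["end"]], ann["type"])
    print(to_xml([ann]))
```

Because the annotation stores only offsets and a label, the clinical text itself is never modified, which is the property that lets separate CDA and GrAF parts refer to the same source document.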