Special Report: NCBI disease corpus: A resource for disease name recognition and concept normalization

Authors:
Rezarta Islamaj Doğan;Robert Leaman;Zhiyong Lu
Affiliations:
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA;National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and Department of Computer Science and Engineering, Arizona Stat ...;National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Venue:
Journal of Biomedical Informatics
Year:
2014

Abstract

Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information; however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed pre-annotations to be used as a preliminary step before manual annotation. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. To help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets.
To demonstrate its utility, we conducted a benchmarking experiment comparing three knowledge-based disease normalization methods, the best of which achieved an F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard, thus enabling the development of machine-learning-based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.
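
For readers unfamiliar with how an F-measure for concept normalization might be computed, the following is a minimal sketch. It assumes one common setup: each abstract is reduced to the set of concept identifiers found in it, and precision, recall and F-measure are micro-averaged over all abstracts. The abstract above does not state the exact evaluation protocol, so this is an illustration only. The function name `micro_prf`, the document keys and the concept identifiers are hypothetical placeholders, not entries from the corpus.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]],
    pred: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over per-abstract concept sets.

    Assumption: a predicted concept counts as correct if the same identifier
    appears in the gold set for the same abstract.
    """
    tp = fp = fn = 0
    # Visit every abstract that appears in either the gold or predicted data.
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # concepts found in both
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: identifiers are placeholders, not real MeSH/OMIM IDs.
gold_example = {"doc1": {"C1", "C2"}, "doc2": {"C3"}}
pred_example = {"doc1": {"C1"}, "doc2": {"C3", "C4"}}

p, r, f = micro_prf(gold_example, pred_example)
print(f"precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```

In this toy case two of three predicted concepts are correct and two of three gold concepts are recovered, so precision, recall and F-measure all equal 2/3. Mentions annotated as a combination of concepts would need an additional convention (for example, treating the combination as a single composite identifier), which this sketch does not attempt.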