Semi-automatic semantic annotation of PubMed queries: A study on quality, efficiency, satisfaction

  • Authors:
  • Aurélie Névéol; Rezarta Islamaj Doğan; Zhiyong Lu

  • Affiliations:
  • National Center for Biotechnology Information, US National Library of Medicine, Bethesda, MD 20894, USA (all authors)

  • Venue:
  • Journal of Biomedical Informatics
  • Year:
  • 2011


Abstract

Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool for assisting the semantic annotation of a large set of biomedical queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, and the quality and number of the resulting annotations. Analysis of the annotation results showed that 28.9% fewer manual annotations were required when pre-annotated results from automatic tools were used. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that most annotators found the automatic pre-annotations helpful. Our experience suggests that using an automatic tool to assist large-scale manual annotation projects speeds up annotation and improves consistency while maintaining high quality in the final annotations.
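
The abstract reports that inter-annotator agreement was significantly higher with pre-annotations, but does not specify the agreement metric. Below is a minimal, illustrative sketch of one common choice, Cohen's kappa, computed over two hypothetical annotators' category labels for the same queries; the category names and data are invented for illustration and are not taken from the study's corpus.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each annotator's
    marginal label distribution.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # Degenerate case: both annotators used one label.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two annotators assigning semantic categories
# (e.g., Disease, Gene, Author, Chemical) to five query terms.
ann_1 = ["Disease", "Gene", "Disease", "Author", "Chemical"]
ann_2 = ["Disease", "Gene", "Chemical", "Author", "Chemical"]
print(f"kappa = {cohens_kappa(ann_1, ann_2):.3f}")  # kappa = 0.737
```

In a pre-annotation study like this one, such a measure would typically be computed separately for the from-scratch and pre-annotated subsets to compare agreement between the two conditions.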