Corpus design for biomedical natural language processing

Authors:
K. Bretonnel Cohen; Philip V. Ogren; Lynne Fox; Lawrence Hunter
Affiliations:
U. of Colorado School of Medicine, Aurora, Colorado; U. of Colorado School of Medicine, Aurora, Colorado; U. of Colorado Health Sciences Center, Denver, Colorado; U. of Colorado Health Sciences Center, Denver, Colorado
Venue:
ISMB '05 Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics
Year:
2005

Abstract

This paper classifies six publicly available biomedical corpora according to various corpus design features and characteristics. We then present usage data for the six corpora. We show that corpora that are carefully annotated with respect to structural and linguistic characteristics and that are distributed in standard formats are more widely used than corpora that are not. These findings have implications for the design of the next generation of biomedical corpora.