Building an annotated corpus in the molecular-biology domain

Authors:
Yuka Tateisi;Tomoko Ohta;Nigel Collier;Chikashi Nobata;Jun-ichi Tsujii
Affiliations:
University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan;University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan;University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan;University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan;University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan
Venue:
Proceedings of the COLING-2000 Workshop on Semantic Annotation and Intelligent Content
Year:
2000

Abstract

Corpus annotation is now a key topic in all areas of natural language processing (NLP) and information extraction (IE) that employ supervised learning. With the explosion of results in molecular biology, there is an increased need for IE to extract knowledge to support database building and to search intelligently for information in online journal collections. To support this, we are building a corpus of annotated abstracts taken from the National Library of Medicine's MEDLINE database. In this paper we report on this new corpus, its ontological basis, and our experience in designing the annotation scheme. We present experimental results for inter-annotator agreement and comment on methodological considerations.
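
The abstract does not say which measure was used for inter-annotator agreement. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic for two annotators. The function name `cohen_kappa`, the label names, and the sample annotations are all hypothetical and are not taken from the paper or its corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: kappa is used here only as a common example of an
    agreement measure; the paper may report a different statistic.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("no items to compare")

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two
    # annotators' marginal frequencies, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels from two annotators ("O" = no entity).
annotator_1 = ["protein", "O", "DNA", "O", "protein", "O"]
annotator_2 = ["protein", "O", "O", "O", "protein", "DNA"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the annotators agree on four of six items, but part of that agreement is expected by chance given how often each label is used, so kappa comes out lower than the raw agreement rate.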