Use of morphological analysis in protein name recognition

Authors:
Kaoru Yamamoto;Taku Kudo;Akihiko Konagaya;Yuji Matsumoto
Affiliations:
CREST, Japan Science and Technology Agency, Japan;NTT Communication Science Laboratories, Japan;Bioinformatics Group, Genomic Sciences Center, The Institute for Physical and Chemical Research (RIKEN), Japan;Graduate School of Information Science, Nara Institute of Science and Technology, Japan
Venue:
Journal of Biomedical Informatics - Special issue: Named entity recognition in biomedicine
Year:
2004

Abstract

Protein name recognition aims to detect every protein name appearing in a PubMed abstract. The task is not simple, because the graphic word boundary (the space separator) assumed in conventional preprocessing does not necessarily coincide with the protein name boundary. Such boundary disagreement, caused by tokenization ambiguity, has usually been ignored in conventional preprocessing of general English. In this paper, we argue that boundary disagreement poses serious limitations in biomedical English text processing in general, and in protein name recognition in particular. Our key idea for dealing with boundary disagreement is to apply techniques from Japanese morphological analysis, which operates on text that has no explicit word boundaries. Evaluating the proposed method on the GENIA corpus 3.02, we obtain an F-measure of 69.01 under a strict criterion and 79.32 under a relaxed criterion. The result is comparable to other published work in protein name recognition, without resorting to manually prepared ad hoc feature engineering. Furthermore, compared with conventional preprocessing, using morphological analysis as preprocessing improves the performance of protein name recognition and reduces execution time.
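
To make the boundary disagreement concrete, the following is a minimal illustrative sketch of lexicon-driven, minimum-cost segmentation in the general style of Japanese morphological analyzers. It is not the system described in the paper: the lexicon entries, the costs, the `UNKNOWN_COST` penalty, and the `segment` function are all hypothetical, chosen only to show how a segmenter that ignores the space separator can recover a protein name that ends in the middle of a whitespace-delimited token.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> cost (lower is preferred).
# A real analyzer would learn costs from annotated data.
LEXICON: Dict[str, float] = {
    "NF-kappaB": 1.0,
    "NF": 2.0,
    "kappaB": 2.0,
    "dependent": 1.0,
    "activation": 1.0,
    "-": 1.0,
    " ": 0.5,
}
UNKNOWN_COST = 5.0  # assumed penalty for one out-of-lexicon character


def segment(text: str, lexicon: Dict[str, float] = LEXICON) -> List[str]:
    """Return the minimum-cost segmentation of `text` over the character lattice."""
    n = len(text)
    max_len = max(len(word) for word in lexicon)
    # best[i] = (lowest cost of segmenting text[:i], start of the last token)
    best: List[Tuple[float, int]] = [(float("inf"), -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in lexicon:
                cost = lexicon[piece]
            elif end - start == 1:
                cost = UNKNOWN_COST  # fallback keeps every position reachable
            else:
                continue
            total = best[start][0] + cost
            if total < best[end][0]:
                best[end] = (total, start)
    # Follow back-pointers from the end of the text to recover the tokens.
    tokens: List[str] = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]


if __name__ == "__main__":
    sentence = "NF-kappaB-dependent activation"
    print(sentence.split())   # whitespace tokens hide the name boundary
    print(segment(sentence))  # lattice tokens expose "NF-kappaB" on its own
```

With whitespace tokenization, "NF-kappaB-dependent" is a single token, so no sequence of whole tokens matches the name "NF-kappaB" exactly. In the lattice segmentation the name becomes a token of its own, which is the kind of boundary agreement the abstract argues for.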