A Probabilistic Model for Identifying Protein Names and their Name Boundaries

Authors:
Kazuhiro Seki;Javed Mostafa
Affiliations:
-;-
Venue:
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Year:
2003

Citing 8

A Multi-Level Text Mining Method to Extract Biological Relationships
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics

A Literature Based Method for Identifying Gene-Disease Connections
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics

Protein association discovery in biomedical literature
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Extracting the names of genes and gene products with a hidden Markov model
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1

Message Understanding Conference-6: a brief history
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1

Notions of correctness when evaluating protein name taggers
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1

Tuning support vector machines for biomedical named entity recognition
BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3

Tagging gene and protein names in full text articles
BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3

Cited 1

Two learning approaches for protein name extraction
Journal of Biomedical Informatics

Quantified Score

Hi-index	0.00

Abstract

This paper proposes a method for identifying protein names in biomedical texts with an emphasis on detecting protein name boundaries. We use a probabilistic model which exploits several surface clues characterizing protein names and incorporates word classes for generalization. In contrast to previously proposed methods, our approach does not rely on natural language processing tools such as part-of-speech taggers and syntactic parsers, so as to reduce processing overhead and the potential number of probabilistic parameters to be estimated. A notion of certainty is also proposed to improve precision for identification. We implemented a protein name identification system based on our proposed method, and evaluated the system on real-world biomedical texts in conjunction with the previous work. The results showed that overall our system performs comparably to the state-of-the-art protein name identification system and that higher performance is achieved for compound names. In addition, it is demonstrated that our system can further improve precision by restricting the system output to those names with high certainties.
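
The abstract does not spell out the word classes or how certainty is computed, so the following is only a rough sketch of the two ideas it names: mapping tokens to coarse surface-based word classes, and restricting output to names above a certainty threshold to raise precision. The class names, the function names (word_class, filter_by_certainty, precision), the threshold, and the toy scores are all assumptions made for illustration and are not taken from the paper.

```python
import re


def word_class(token: str) -> str:
    """Map a token to a coarse surface class (hypothetical class inventory)."""
    if re.fullmatch(r"\d+", token):
        return "NUM"
    if re.fullmatch(r"[A-Z]+", token):
        return "ALLCAPS"
    if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
        return "ALNUM"
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "INITCAP"
    if re.fullmatch(r"[a-z]+", token):
        return "LOWER"
    return "OTHER"


def filter_by_certainty(candidates, threshold):
    """Keep only (name, certainty) pairs whose certainty meets the threshold."""
    return [(name, c) for name, c in candidates if c >= threshold]


def precision(predicted_names, gold_names):
    """Fraction of predicted names that appear in the gold set."""
    if not predicted_names:
        return 0.0
    hits = sum(1 for name in predicted_names if name in gold_names)
    return hits / len(predicted_names)


# Toy data, invented purely for illustration (not results from the paper).
candidates = [
    ("p53", 0.95),
    ("kinase", 0.40),
    ("IL2 receptor", 0.80),
    ("binding", 0.30),
]
gold = {"p53", "IL2 receptor"}

all_names = [name for name, _ in candidates]
confident = filter_by_certainty(candidates, threshold=0.7)
confident_names = [name for name, _ in confident]

precision_all = precision(all_names, gold)             # 0.5 on this toy data
precision_confident = precision(confident_names, gold)  # 1.0 on this toy data

print(word_class("IL2"), word_class("MAPK"), word_class("kinase"))
print(precision_all, precision_confident)
```

On this toy input, raising the threshold drops the two low-certainty candidates, which is the precision-for-coverage trade the abstract alludes to; how much recall is lost depends on the real certainty scores.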