A hybrid approach to protein name identification in biomedical texts

Authors:
Kazuhiro Seki; Javed Mostafa
Affiliations:
Laboratory for Applied Informatics Research, Indiana University, 1320 East Tenth Street, LI 011, Bloomington, Indiana (both authors)
Venue:
Information Processing and Management: an International Journal
Year:
2005

Abstract

This paper presents a hybrid approach to identifying protein names in biomedical texts, a task regarded as a crucial step in text mining. Our approach uses a set of simple heuristics for the initial detection of protein names and a probabilistic model for locating the boundaries of complete protein names. A protein name dictionary is also consulted as a complementary resource. In contrast to previously proposed methods, our method avoids natural language processing tools such as part-of-speech taggers and syntactic parsers and relies solely on surface clues, which reduces processing overhead. Moreover, we propose a framework for automatically creating a large-scale corpus annotated with protein names, which can then be used to train our probabilistic model. We implemented a protein name identification system, named PROTEX, based on the proposed method and evaluated it by comparing it with a system developed by other researchers on a common test set. The experiments showed that the automatically constructed corpus is as useful for training as manually annotated corpora and that PROTEX identifies compound protein names effectively.
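The three stages named in the abstract (heuristic detection, expansion to the complete name, dictionary lookup) can be illustrated with a minimal Python sketch. The abstract does not specify the actual heuristics, the form of the probabilistic model, or the dictionary, so everything below is assumed for illustration only and is not the PROTEX implementation: the seed rule `is_seed`, the toy probability table `NAME_WORD_PROB`, the `THRESHOLD` value, and the function names are all hypothetical. In particular, the per-word probability lookup is a simple stand-in for the paper's probabilistic model.

```python
import re

# Hypothetical toy probabilities that a word belongs to a protein name.
# A stand-in for a trained probabilistic model; values are made up.
NAME_WORD_PROB = {"receptor": 0.9, "kinase": 0.9, "protein": 0.8, "alpha": 0.7}
THRESHOLD = 0.5  # assumed cut-off for extending a name boundary


def is_seed(token):
    """Assumed surface clue: the token mixes letters and digits (e.g. 'p53')."""
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    return has_letter and has_digit


def expand(tokens, i):
    """Grow the span around seed i while neighbouring words look name-like."""
    start = end = i
    while start > 0 and NAME_WORD_PROB.get(tokens[start - 1].lower(), 0.0) >= THRESHOLD:
        start -= 1
    while end + 1 < len(tokens) and NAME_WORD_PROB.get(tokens[end + 1].lower(), 0.0) >= THRESHOLD:
        end += 1
    return start, end


def identify(text, dictionary=frozenset()):
    """Return candidate protein names found by seeds, expansion, and lookup."""
    # Surface-only tokenisation: no part-of-speech tagging or parsing.
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    spans = set()
    for i, tok in enumerate(tokens):
        if is_seed(tok):
            spans.add(expand(tokens, i))
        elif tok.lower() in dictionary:
            # Complementary dictionary match for names the heuristic misses.
            spans.add((i, i))
    return [" ".join(tokens[s:e + 1]) for s, e in sorted(spans)]
```

With these assumed rules, `identify("The IL-2 receptor binds insulin and p53 protein.", {"insulin"})` returns `["IL-2 receptor", "insulin", "p53 protein"]`: the two compound names come from seed detection plus expansion, and "insulin" comes only from the dictionary.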