Some advances in transformation-based part of speech tagging
AAAI '94 Proceedings of the twelfth national conference on Artificial intelligence (vol. 1)
Proceedings of the 2001 IEEE International Conference on Data Mining
ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
A Multi-Level Text Mining Method to Extract Biological Relationships
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
A Literature Based Method for Identifying Gene-Disease Connections
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Protein association discovery in biomedical literature
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Extracting the names of genes and gene products with a hidden Markov model
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
Message Understanding Conference-6: a brief history
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Notions of correctness when evaluating protein name taggers
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Tuning support vector machines for biomedical named entity recognition
BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
The GENIA corpus: an annotated research abstract corpus in molecular biology domain
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Two learning approaches for protein name extraction
Journal of Biomedical Informatics
AI'07 Proceedings of the 20th Australian joint conference on Advances in artificial intelligence
Burning up: finding fever expressions in triage notes
Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem - Volume 47
IEA/AIE'10 Proceedings of the 23rd international conference on Industrial engineering and other applications of applied intelligent systems - Volume Part I
Hi-index | 0.00 |
This paper presents a hybrid approach to identifying protein names in biomedical texts, which is regarded as a crucial step for text mining. Our approach employs a set of simple heuristics for initial detection of protein names and uses a probabilistic model for locating complete protein names. In addition, a protein name dictionary is complementarily consulted. In contrast to previously proposed methods, our proposed method avoids the use of natural language processing tools such as part-of-speech taggers and syntactic parsers and solely relies on surface clues, so as to reduce the processing overhead. Moreover, we propose a framework to automatically create a large-scale corpus annotated with protein names, which can be then used for training our probabilistic model. We implemented a protein name identification system, named PROTEX, based on our proposed method and evaluated it by comparing with a system developed by other researchers on a common test set. The experiments showed that the automatically constructed corpus is equally useful in training as compared with manually annotated corpora and that effective performance can be achieved in identifying compound protein names with PROTEX.