Various features with integrated strategies for protein name classification

Authors:
Budi Taruna Ongkowijaya;Shilin Ding;Xiaoyan Zhu
Affiliations:
State Key Laboratory of Intelligent Technology and Systems (LITS), Department of Computer Science and Technology, Tsinghua University, Beijing, China;State Key Laboratory of Intelligent Technology and Systems (LITS), Department of Computer Science and Technology, Tsinghua University, Beijing, China;State Key Laboratory of Intelligent Technology and Systems (LITS), Department of Computer Science and Technology, Tsinghua University, Beijing, China
Venue:
ISPA'05 Proceedings of the 2005 international conference on Parallel and Distributed Processing and Applications
Year:
2005

Abstract

The classification task is an integral part of a named entity recognition system: it assigns each recognized named entity to its corresponding class. This task has received little attention in the biomedical domain, because previous studies did not clearly differentiate feature sources and strategies. In this research, we analyze different sources and strategies for protein name classification and develop integrated strategies that incorporate the advantages of rule-based, dictionary-based, and statistical methods. The rule-based method uses terms and knowledge of protein nomenclature that provide strong cues for protein names. The dictionary-based method uses a set of rules for curating a protein name dictionary. These terms and dictionaries are combined with our own features in a statistical classifier. Our features comprise word shape features and unigram and bigram features. With these information sources and integrated strategies, the system achieves state-of-the-art performance in classifying names as protein or non-protein.
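
To make the feature description above concrete, the following is a minimal sketch of how word shape, unigram, bigram, cue-term, and dictionary features might be assembled for one candidate name. It is an illustration under stated assumptions, not the authors' implementation: the cue-term list, the shape alphabet (A, a, 0), the feature names, and the functions `word_shape` and `extract_features` are all hypothetical, since the abstract does not specify them.

```python
# Hypothetical cue terms; the abstract does not list the terms actually used.
CUE_TERMS = {"protein", "kinase", "receptor", "factor"}


def word_shape(token):
    """Map each character to a class and collapse repeated classes.

    Uppercase -> 'A', lowercase -> 'a', digit -> '0'; any other
    character (hyphen, slash, ...) is kept as its own class.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "0"
        else:
            cls = ch
        # Collapse runs so "IL" and "ILK" share the same shape.
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def extract_features(name, dictionary=frozenset()):
    """Build a sparse binary feature dict for a candidate entity name.

    `dictionary` is assumed to hold lowercased protein names from a
    curated dictionary; it is empty by default.
    """
    tokens = name.split()
    lowered = [t.lower() for t in tokens]
    features = {}
    for tok in tokens:
        features["shape=" + word_shape(tok)] = 1
    for tok in lowered:
        features["uni=" + tok] = 1
    for first, second in zip(lowered, lowered[1:]):
        features["bi=" + first + "_" + second] = 1
    # Rule-based cue: any token matches a nomenclature term.
    if any(tok in CUE_TERMS for tok in lowered):
        features["has_cue_term"] = 1
    # Dictionary-based cue: the whole name appears in the dictionary.
    if name.lower() in dictionary:
        features["in_dictionary"] = 1
    return features
```

A feature dict of this kind could be passed to any standard statistical classifier; the abstract does not say which classifier the authors used, so none is assumed here.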