A hybrid approach to biomedical named entity recognition and semantic role labeling

Authors:
Richard Tzong-Han Tsai
Affiliations:
National Taiwan University, Nankang, Taipei, Taiwan
Venue:
NAACL-DocConsortium '06 Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume: doctoral consortium
Year:
2006



Abstract

In this paper, we describe our hybrid approach to two key NLP technologies: biomedical named entity recognition (Bio-NER) and biomedical semantic role labeling (Bio-SRL). In Bio-NER, our system integrates linguistic features into the conditional random field (CRF) framework. In addition, we employ web lexicons and template-based post-processing to further boost its performance. Owing to these broad linguistic features and the nature of CRFs, our system outperforms state-of-the-art machine-learning-based systems, especially in the recognition of protein names (F=78.5%). In Bio-SRL, first, we construct a proposition bank on top of the popular biomedical GENIA treebank, following the PropBank annotation scheme. We annotate only the predicate-argument structures (PASs) of thirty frequently used biomedical verbs (predicates) and their corresponding arguments. Second, we use our proposition bank to train a biomedical SRL system, which uses a maximum entropy (ME) machine-learning model. Third, we automatically generate argument-type templates, which can be used to improve the classification of biomedical argument roles. Our experimental results show that an SRL system that achieves an F-score of 86.29% in the newswire English domain maintains an F-score of only 64.64% when ported to the biomedical domain. By using our annotated biomedical corpus, we can increase that F-score by 22.9%. Adding automatically generated template features further increases the overall F-score by 0.47% and the adjunct (AM) F-score by 1.57%.
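
The results above are all reported as F-scores. As a reading aid, the following is a minimal sketch of how an F-score is conventionally computed as the harmonic mean of precision and recall over exact-match labeled spans. It is not the paper's evaluation script: the function name `f_score`, the span format, and the example spans are hypothetical and chosen only for illustration.

```python
def f_score(gold, predicted):
    """Return (precision, recall, F1) over exact-match labeled spans.

    Each span is a (start_token, end_token, label) tuple; this format is
    an assumption for illustration, not the paper's data format.
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)  # spans with matching boundaries and label
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical gold and predicted entity spans (not taken from the paper).
gold = [(0, 2, "protein"), (5, 6, "DNA"), (9, 11, "protein")]
pred = [(0, 2, "protein"), (5, 6, "protein"), (9, 11, "protein"), (13, 14, "DNA")]

p, r, f = f_score(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this toy case two of the four predicted spans match the gold standard exactly, so a span with the right boundaries but the wrong label counts as an error in both precision and recall.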