High-recall protein entity recognition using a dictionary

Authors:
Zhenzhen Kou;William W. Cohen;Robert F. Murphy
Affiliations:
Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA 15213, USA;Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA 15213, USA;Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Venue:
Bioinformatics
Year:
2005

Abstract

Summary: Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently proposed extension to conditional random fields (CRFs) that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches with that of Maximum Entropy (MaxEnt) and normal CRFs on three datasets. All four methods improved on the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely used F-measure, while the dictionary HMMs performed best at finding entities that actually appear in the dictionary, which is the measure of most interest in our intended application.

Availability: Dictionary HMMs were implemented in Java. Algorithms are available through the information extraction package MINORTHIRD at http://minorthird.sourceforge.net

Contact: zkou@andrew.cmu.edu
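The two evaluation views mentioned in the summary (overall F-measure versus success on entities that appear in the dictionary) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the exact-match criterion, the case-insensitive dictionary lookup, and the toy spans are all assumptions made for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F-measure over sets of entity spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def dictionary_recall(predicted, gold, dictionary):
    """Recall restricted to gold entities whose text is in the dictionary.

    Assumption: membership is tested on the lowercased surface string.
    """
    predicted = set(predicted)
    in_dict = {span for span in gold if span[2].lower() in dictionary}
    if not in_dict:
        return 0.0
    return len(in_dict & predicted) / len(in_dict)


# Hypothetical toy data: each span is (start_token, end_token, text).
gold = [(0, 1, "p53"), (5, 7, "protein kinase C"), (10, 11, "XYZ1")]
predicted = [(0, 1, "p53"), (5, 7, "protein kinase C"), (12, 13, "cell")]
dictionary = {"p53", "protein kinase c"}

p, r, f = precision_recall_f1(predicted, gold)
dict_r = dictionary_recall(predicted, gold, dictionary)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f} dictionary recall={dict_r:.3f}")
```

In this toy case the tagger misses one gold entity that is absent from the dictionary and produces one spurious span, so its F-measure is below 1 even though it finds every dictionary entity. That gap is why a method can rank differently under the two measures.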