Summary: Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently proposed extension to conditional random fields (CRFs) that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared our new approaches with Maximum Entropy (MaxEnt) and standard CRFs on three datasets; all four methods improved on the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely used F-measure, while the dictionary HMMs performed best at finding entities that actually appear in the dictionary, which is the measure of most interest in our intended application.

Availability: Dictionary HMMs were implemented in Java. The algorithms are available through the information extraction package MINORTHIRD at http://minorthird.sourceforge.net

Contact: zkou@andrew.cmu.edu
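
The two evaluation views mentioned in the summary can be illustrated with a minimal sketch. It assumes entities are scored by exact span match, and that "finding entities that actually appear in the dictionary" means recall restricted to gold entities whose text is a dictionary entry; the abstract does not give the exact definition, so that reading is an assumption. The function names, the toy sentence, and the toy dictionary are hypothetical and are not part of MINORTHIRD.

```python
from typing import List, Set, Tuple

# A span is (start_token, end_token_exclusive); exact-match scoring is assumed.
Span = Tuple[int, int]


def f_measure(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def dictionary_recall(tokens: List[str], gold: Set[Span],
                      predicted: Set[Span], dictionary: Set[str]) -> float:
    """Recall over only those gold entities whose text is a dictionary entry.

    This is an assumed reading of the dictionary-focused measure,
    not a definition taken from the source.
    """
    in_dictionary = {
        span for span in gold
        if " ".join(tokens[span[0]:span[1]]).lower() in dictionary
    }
    if not in_dictionary:
        return 0.0
    return len(in_dictionary & predicted) / len(in_dictionary)


# Toy example (hypothetical data).
tokens = ["The", "p53", "protein", "binds", "MDM2", "and", "BRCA1"]
gold = {(1, 2), (4, 5), (6, 7)}        # p53, MDM2, BRCA1
predicted = {(1, 2), (4, 5), (3, 4)}   # p53, MDM2, and a false positive "binds"
dictionary = {"p53", "mdm2"}           # lower-cased dictionary entries

precision, recall, f1 = f_measure(gold, predicted)
dict_recall = dictionary_recall(tokens, gold, predicted, dictionary)
```

In this toy case the tagger misses BRCA1, which is not in the dictionary, so the overall recall is lower than the dictionary-restricted recall. This is how a method can rank differently under the two measures.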