We introduce a new approach to named entity classification, which we term a Priority Model. We also describe the construction of SemCat, a semantic database consisting of a large number of semantically categorized names relevant to biomedicine. We used SemCat as training data to investigate name classification techniques. We generated a statistical language model and probabilistic context-free grammars for gene and protein name classification and compared their results with those of the new model. For all three methods, we used a variable-order Markov model to predict the nature of strings not represented in the training data. The Priority Model achieves an F-measure of 0.958 to 0.960, consistently higher than the statistical language model and the probabilistic context-free grammar.