Identification of related gene/protein names based on an HMM of name variations

Authors:
L. Yeganova;L. Smith;W. J. Wilbur
Affiliations:
Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA (all authors)
Venue:
Computational Biology and Chemistry
Year:
2004

Abstract

Gene and protein names follow few, if any, true naming conventions, and the same name can vary greatly from one occurrence to the next. This gives rise to two important problems in natural language processing: first, can one locate the names of genes or proteins in free text, and second, can one determine when two names denote the same gene or protein? The first is a special case of named entity recognition, while the second is a special case of automatic term recognition (ATR). We study the second problem, that of gene or protein name variation. We describe a system that, given a query gene or protein name, identifies related gene or protein names in a large list. The system is based on a dynamic programming algorithm for sequence alignment in which the mutation matrix is allowed to vary under the control of a fully trainable hidden Markov model.
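
To make the alignment idea concrete, the following is a minimal sketch of ranking candidate names against a query by a dynamic programming alignment score. It is not the system described in the abstract: the substitution and gap scores here are fixed, hand-set values chosen purely for illustration, whereas the paper lets the mutation matrix vary under a trained hidden Markov model. The function names (`substitution_score`, `gap_score`, `align_score`, `rank_related`) and all numeric values are assumptions.

```python
from typing import List, Tuple


def substitution_score(a: str, b: str) -> float:
    """Hypothetical hand-set scores (assumption); the paper trains these via an HMM."""
    if a == b:
        return 2.0
    if a.lower() == b.lower():
        return 1.5  # case-only difference, treated as a near match
    if a.isdigit() or b.isdigit():
        return -3.0  # a changed digit often signals a different gene
    return -1.0


def gap_score(c: str) -> float:
    """Cheap gaps for separators, costlier gaps for all other characters (assumption)."""
    return -0.2 if c in "-_ /" else -1.0


def align_score(x: str, y: str) -> float:
    """Global alignment score of two names by dynamic programming."""
    n, m = len(x), len(y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # First column and row: one name aligned entirely against gaps.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap_score(x[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap_score(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + substitution_score(x[i - 1], y[j - 1]),
                dp[i - 1][j] + gap_score(x[i - 1]),  # x[i-1] aligned to a gap
                dp[i][j - 1] + gap_score(y[j - 1]),  # y[j-1] aligned to a gap
            )
    return dp[n][m]


def rank_related(query: str, names: List[str]) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort best-first."""
    scored = [(name, align_score(query, name)) for name in names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_related("IL-2", ["IL2", "IL-3", "il-2", "TNF"]):
        print(f"{name}\t{score:.1f}")
```

With these illustrative scores, a case or hyphen variant of the query ranks above a name that differs in a digit. A trainable model would replace the two hand-written scoring functions with parameters estimated from data.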