Automatically identifying gene/protein terms in MEDLINE abstracts

Authors:
Hong Yu;Vasileios Hatzivassiloglou;Andrey Rzhetsky;W. John Wilbur
Affiliations:
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY;Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY;Department of Medical Informatics, Columbia Genome Center, Columbia University, 622 W. 168th St., VC-5, New York, NY;National Center for Biotechnology Information, National Library of Medicine, NIH, Building 38A, Room 5S506, 8600 Rockville Pike, Bethesda, MD
Venue:
Journal of Biomedical Informatics
Year:
2002


Abstract

Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, identifying terms that correspond to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation.

Results. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.

Availability. A random sample of gene/protein symbols and full names and a sample set of marked up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/.

Contact. hy52@columbia.edu. Voice: 212-939-7028; fax: 212-666-0140.
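
The abstract does not describe how GPmarkup pairs symbols with full names, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a symbol appears in parentheses right after its full name and that each letter of the symbol matches the first letter of a preceding word, with a few function words allowed to be skipped. The names `SYMBOL_IN_PARENS`, `SKIPPABLE`, `match_full_name`, and `pair_symbols_with_full_names` are hypothetical.

```python
import re

# Assumption: a symbol is a short parenthesized token starting with a capital letter.
SYMBOL_IN_PARENS = re.compile(r"\(([A-Z][A-Za-z0-9-]{1,9})\)")

# Assumption: function words that may appear in a full name without
# contributing a letter to the symbol.
SKIPPABLE = {"of", "the", "and", "for", "in"}


def match_full_name(symbol, preceding_words):
    """Match symbol letters, right to left, against word-initial letters."""
    letters = [c.lower() for c in symbol if c.isalpha()]
    i = len(preceding_words) - 1
    start = None
    for letter in reversed(letters):
        # Skip a function word only when it does not supply the needed letter.
        while (
            i >= 0
            and preceding_words[i].lower() in SKIPPABLE
            and preceding_words[i][0].lower() != letter
        ):
            i -= 1
        if i < 0 or preceding_words[i][0].lower() != letter:
            return None  # no initial-letter alignment found
        start = i
        i -= 1
    return " ".join(preceding_words[start:])


def pair_symbols_with_full_names(text):
    """Return (symbol, full_name) pairs found by the naive heuristic above."""
    pairs = []
    for match in SYMBOL_IN_PARENS.finditer(text):
        symbol = match.group(1)
        preceding = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        full_name = match_full_name(symbol, preceding)
        if full_name is not None:
            pairs.append((symbol, full_name))
    return pairs


if __name__ == "__main__":
    sentence = "We cloned lymphocyte associated receptor of death (LARD) from T cells."
    print(pair_symbols_with_full_names(sentence))
    # [('LARD', 'lymphocyte associated receptor of death')]
```

A heuristic this simple would miss symbols that are not initialisms and would accept some spurious matches, so it should be read as a picture of what a symbol/full-name pair is rather than as a way to reproduce the reported recall and precision.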