Automatically identifying gene/protein terms in MEDLINE abstracts

  • Authors:
  • Hong Yu;Vasileios Hatzivassiloglou;Andrey Rzhetsky;W. John Wilbur

  • Affiliations:
  • Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY;Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY;Department of Medical Informatics, Columbia Genome Center, Columbia University, 622 W. 168th St., VC-5, New York, NY;National Center for Biotechnology Information, National Library of Medicine, NIH, Building 38A, Room 5S506, 8600 Rockville Pike, Bethesda, MD

  • Venue:
  • Journal of Biomedical Informatics
  • Year:
  • 2002

Abstract

Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, identifying terms that correspond to biological substances (e.g., genes and proteins) is a necessary step before applying other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation.

Results. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.

Availability. A random sample of gene/protein symbols and full names and a sample set of marked up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/.

Contact. hy52@columbia.edu. Voice: 212-939-7028; fax: 212-666-0140.
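
The abstract does not describe how GPmarkup pairs symbols with full names, so the following is only an illustrative sketch of one common heuristic, not the authors' method. It assumes that a symbol appears in parentheses directly after its full name and that each letter of the symbol is the initial of a preceding word, with a few short function words allowed in between. The function names `find_symbol_pairs` and `match_initials` and the `STOPWORDS` list are hypothetical, introduced here for illustration.

```python
import re

# Assumption: short function words may appear inside a full name
# without contributing a letter to the symbol.
STOPWORDS = {"of", "the", "and", "for", "in"}


def match_initials(symbol, words):
    """Walk backward through `words`, matching symbol letters to word initials.

    Returns the matched full name, or None if the letters cannot all be matched.
    """
    letters = list(symbol.lower())
    chosen = []
    i = len(words) - 1
    while letters and i >= 0:
        word = words[i]
        if word.lower()[0] == letters[-1]:
            letters.pop()          # this word supplies the current letter
            chosen.append(word)
        elif word.lower() in STOPWORDS and chosen:
            chosen.append(word)    # function word inside the name
        else:
            return None            # an unrelated word breaks the match
        i -= 1
    if letters:
        return None                # ran out of words before letters
    return " ".join(reversed(chosen))


def find_symbol_pairs(text):
    """Return (symbol, full name) pairs for parenthesized symbols in `text`."""
    pairs = []
    for match in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        symbol = match.group(1)
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        full_name = match_initials(symbol, words)
        if full_name:
            pairs.append((symbol, full_name))
    return pairs


if __name__ == "__main__":
    sentence = "We cloned lymphocyte associated receptor of death (LARD) from T cells."
    print(find_symbol_pairs(sentence))
    # [('LARD', 'lymphocyte associated receptor of death')]
```

A heuristic like this misses symbols whose letters are not word initials, and it drops hyphens from the recovered name, so a real system would need additional rules or a learned model.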