Selecting text features for gene name classification: from documents to terms

  • Authors:
  • Goran Nenadić;Simon Rice;Irena Spasić;Sophia Ananiadou;Benjamin Stapley

  • Affiliations:
  • UMIST, Manchester;UMIST, Manchester;University of Salford, Salford;University of Salford, Salford;UMIST, Manchester

  • Venue:
  • BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
  • Year:
  • 2003

Abstract

This paper examines how the choice of feature type affects the performance of a text-based classification approach. The task is the automatic classification of gene names from the molecular biology literature using a support vector machine. The classification features range from words, lemmas and stems to automatically extracted terms; simple co-occurrences of genes within documents are also considered. Preliminary experiments on a set of 3,000 S. cerevisiae gene names and 53,000 Medline abstracts show that domain-specific terms can improve performance over the standard bag-of-words approach, in particular for genes classified with higher confidence and for under-represented classes.
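
To make the contrast between word features and term features concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hand-written term list (`TERMS`), whereas the paper extracts terms automatically, and it uses three invented toy sentences in place of Medline abstracts. The helper names (`tokenize`, `word_features`, `term_features`, `gene_profile`) are hypothetical. The sketch only builds per-gene feature counts; training a support vector machine on them is left out.

```python
import re
from collections import Counter

# Toy stand-ins for Medline abstracts (invented for illustration only).
ABSTRACTS = [
    "CDC28 regulates cell cycle progression in budding yeast.",
    "Loss of CDC28 delays cell cycle entry and DNA replication.",
    "HXT1 encodes a low affinity glucose transporter.",
]

# Assumption: a fixed, hand-written term list. The paper extracts terms automatically.
TERMS = ["cell cycle", "dna replication", "glucose transporter"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_features(text):
    """Bag-of-words features: count every token."""
    return Counter(tokenize(text))


def term_features(text, terms=TERMS):
    """Term features: count each known multi-word term as a single feature."""
    tokens = tokenize(text)
    counts = Counter()
    for term in terms:
        parts = term.split()
        n = len(parts)
        hits = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == parts
        )
        if hits:
            counts[term] = hits
    return counts


def gene_profile(gene, abstracts, extractor):
    """Sum the features of every abstract that mentions the gene."""
    name = gene.lower()
    profile = Counter()
    for text in abstracts:
        if name in tokenize(text):
            profile.update(extractor(text))
    profile.pop(name, None)  # the gene name itself is not used as a feature
    return profile


if __name__ == "__main__":
    print(gene_profile("CDC28", ABSTRACTS, word_features))
    print(gene_profile("CDC28", ABSTRACTS, term_features))
```

In this sketch, the word profile treats "cell" and "cycle" as unrelated features, while the term profile keeps "cell cycle" as one feature. Either profile could then be turned into a vector and passed to a classifier of choice.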