Selecting text features for gene name classification: from documents to terms

Authors:
Goran Nenadić;Simon Rice;Irena Spasić;Sophia Ananiadou;Benjamin Stapley
Affiliations:
UMIST, Manchester;UMIST, Manchester;University of Salford, Salford;University of Salford, Salford;UMIST, Manchester
Venue:
BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
Year:
2003

Abstract

In this paper we examine the performance of a text-based classification approach by comparing different types of features. We consider the automatic classification of gene names from the molecular biology literature using support vector machines. Classification features range from words, lemmas and stems to automatically extracted terms; simple co-occurrences of genes within documents are also considered. Preliminary experiments on a set of 3,000 S. cerevisiae gene names and 53,000 Medline abstracts show that using domain-specific terms can improve performance compared to the standard bag-of-words approach, in particular for genes classified with higher confidence and for under-represented classes.
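
The feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gene names (GENA1, GENB2), the three toy documents and the term list are all hypothetical, terms are matched against a fixed list rather than extracted automatically, and the classifier itself is left out. The sketch only shows how a gene might be represented by word counts, term counts and gene co-occurrence counts gathered from the documents that mention it.

```python
from collections import Counter

# Toy corpus: hypothetical documents and gene names, not the paper's data.
DOCS = {
    "d1": "GENA1 is required for rna polymerase III transcription",
    "d2": "GENA1 and GENB2 interact during cell cycle arrest",
    "d3": "GENB2 regulates cell cycle progression",
}
GENES = {"GENA1", "GENB2"}
# Hypothetical term list; the paper extracts terms automatically instead.
TERMS = ["rna polymerase iii", "cell cycle arrest", "cell cycle"]


def tokens(text):
    """Lowercase whitespace tokenization (an assumption of this sketch)."""
    return text.lower().split()


def docs_for(gene):
    """Return the texts of all documents that mention the gene."""
    return [text for text in DOCS.values() if gene.lower() in tokens(text)]


def word_features(gene):
    """Bag-of-words: count non-gene words in documents mentioning the gene."""
    gene_tokens = {g.lower() for g in GENES}
    counts = Counter()
    for text in docs_for(gene):
        counts.update(t for t in tokens(text) if t not in gene_tokens)
    return counts


def term_features(gene):
    """Count occurrences of listed multi-word terms (nested terms count too)."""
    counts = Counter()
    for text in docs_for(gene):
        toks = tokens(text)
        for term in TERMS:
            parts = term.split()
            n = len(parts)
            hits = sum(toks[i:i + n] == parts for i in range(len(toks) - n + 1))
            if hits:
                counts[term] += hits
    return counts


def cooccurrence_features(gene):
    """Count documents in which another gene appears alongside this one."""
    counts = Counter()
    for text in docs_for(gene):
        toks = set(tokens(text))
        for other in GENES:
            if other != gene and other.lower() in toks:
                counts[other] += 1
    return counts
```

Each function returns a sparse count vector for one gene. In a setup like the one described, such vectors would be passed to a support vector machine; the weighting and kernel choices are not specified here and would need to follow the paper.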