In this paper we discuss the performance of a text-based classification approach by comparing different types of features. We consider the automatic classification of gene names from the molecular biology literature using a support vector machine (SVM). Classification features range from words, lemmas and stems to automatically extracted terms. We also consider simple co-occurrences of genes within documents. Preliminary experiments on a set of 3,000 S. cerevisiae gene names and 53,000 Medline abstracts show that domain-specific terms can improve performance over the standard bag-of-words approach, in particular for genes classified with higher confidence and for under-represented classes.
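To make the feature types concrete, the following is a minimal sketch of how per-gene feature counts might be assembled from abstracts. It is not the authors' pipeline: the function names (`tokenize`, `naive_stem`, `gene_features`), the crude suffix-stripping stemmer, and the substring matching of a supplied term list are all assumptions made for illustration, and lemmas and the SVM training step are omitted. It also assumes gene names are single alphanumeric tokens.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens (hypothetical tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(word):
    # Crude suffix stripping, a stand-in for a real stemmer (assumption).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def gene_features(abstracts, gene_names, terms=()):
    """Collect word, stem, term and gene co-occurrence counts per gene.

    A gene's features are pooled over every abstract that mentions it.
    `terms` is a pre-extracted list of domain terms (assumed given here).
    """
    genes = {g.lower() for g in gene_names}
    feats = {
        g: {"words": Counter(), "stems": Counter(),
            "terms": Counter(), "cooc": Counter()}
        for g in genes
    }
    for text in abstracts:
        tokens = tokenize(text)
        present = genes.intersection(tokens)
        if not present:
            continue  # abstract mentions none of the genes of interest
        # Context words exclude the gene names themselves.
        context = [t for t in tokens if t not in genes]
        lowered = text.lower()
        found_terms = [t for t in terms if t.lower() in lowered]
        for g in present:
            feats[g]["words"].update(context)
            feats[g]["stems"].update(naive_stem(t) for t in context)
            feats[g]["terms"].update(found_terms)
            # Other genes mentioned in the same abstract.
            feats[g]["cooc"].update(present - {g})
    return feats
```

Each of the four counters could then be turned into a separate feature vector per gene and passed to an SVM implementation of choice, which allows the representations to be compared on the same set of genes.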