Gene ontology annotation as text categorization: An empirical study

Authors:
Kazuhiro Seki;Javed Mostafa
Affiliations:
Organization of Advanced Science and Technology, Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501, Japan;Laboratory of Applied Informatics Research, University of North Carolina at Chapel Hill, 216 Lenoir Drive, CB#3360, 100 Manning Hall, Chapel Hill, NC 27599-3360, USA
Venue:
Information Processing and Management: an International Journal
Year:
2008

Abstract

Gene ontology (GO) consists of three structured controlled vocabularies, called GO domains, developed for describing attributes of gene products, and GO annotation is crucial for providing a common gateway to different model organism databases. This paper explores an effective application of text categorization methods to this highly practical problem in biology. As a first step, we tackle the automatic GO annotation task posed in the Text Retrieval Conference (TREC) 2004 Genomics Track. Given a pair of genes and an article reference where the genes appear, the task simulates assigning GO domain codes. We approach the problem with careful consideration of the specialized terminology and pay special attention to various forms of gene synonyms, so as to exhaustively locate the occurrences of the target gene. We extract the words around the spotted gene occurrences and use them to represent the gene for GO domain code annotation. We regard the task as a text categorization problem and adopt a variant of kNN with supervised term weighting schemes, which places our method among the top-performing systems in the TREC official evaluation. Furthermore, we investigate different feature selection policies in conjunction with the treatment of terms associated with negative instances. Our experiments reveal that round-robin feature space allocation combined with the elimination of negative terms substantially improves performance as GO terms become more specific.
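
As a rough illustration of two steps named in the abstract, the sketch below collects the words around gene mentions and then allocates a feature budget across classes in round-robin order, optionally dropping terms associated with negative instances. It is not the authors' implementation. The function names, the fixed token window, the restriction to single-token synonyms, and the signed score (the fraction of positive documents containing a term minus the fraction of negative documents containing it) are all assumptions made for this example; the abstract does not specify the scoring function or window used in the paper.

```python
import re
from collections import Counter


def context_words(text, synonyms, window=5):
    """Count words within `window` tokens of any gene synonym occurrence.

    Assumption: synonyms are single tokens and matching is case-insensitive.
    """
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    names = {s.lower() for s in synonyms}
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok in names:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Keep the neighbours, skip the gene mention itself.
            bag.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return bag


def signed_scores(docs, labels, cls):
    """Hypothetical signed score: P(term | cls) - P(term | not cls).

    `docs` is a list of term sets. A negative score marks a "negative term".
    """
    pos = [d for d, l in zip(docs, labels) if l == cls]
    neg = [d for d, l in zip(docs, labels) if l != cls]
    vocab = set().union(*docs)
    return {
        t: sum(t in d for d in pos) / max(len(pos), 1)
        - sum(t in d for d in neg) / max(len(neg), 1)
        for t in vocab
    }


def round_robin_select(docs, labels, k, drop_negative=True):
    """Pick up to k features, letting each class take turns choosing its best."""
    classes = sorted(set(labels))
    ranked = {}
    for c in classes:
        items = [
            (t, s)
            for t, s in signed_scores(docs, labels, c).items()
            if s > 0 or not drop_negative
        ]
        # Strongest terms first; alphabetical tie-break keeps output deterministic.
        ranked[c] = [t for t, _ in sorted(items, key=lambda x: (-abs(x[1]), x[0]))]
    selected, seen = [], set()
    cursor = {c: 0 for c in classes}
    while len(selected) < k and any(cursor[c] < len(ranked[c]) for c in classes):
        for c in classes:
            # Skip terms another class has already contributed.
            while cursor[c] < len(ranked[c]) and ranked[c][cursor[c]] in seen:
                cursor[c] += 1
            if cursor[c] < len(ranked[c]) and len(selected) < k:
                term = ranked[c][cursor[c]]
                selected.append(term)
                seen.add(term)
                cursor[c] += 1
    return selected


if __name__ == "__main__":
    docs = [
        {"kinase", "binding"},
        {"kinase", "activity"},
        {"nucleus", "membrane"},
        {"nucleus", "binding"},
    ]
    labels = ["MF", "MF", "CC", "CC"]  # toy GO domain codes
    print(round_robin_select(docs, labels, k=3))
```

The selected terms would then serve as the feature space for a classifier such as the kNN variant mentioned in the abstract. Giving each class its own turn keeps a class with weaker individual scores from being crowded out of the feature budget.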