Comparison of Two Schemes for Automatic Keyword Extraction from MEDLINE for Functional Gene Clustering

Authors:
Ying Liu;Brian J. Ciliax;Karin Borges;Venu Dasigi;Ashwin Ram;Shamkant B. Navathe;Ray Dingledine
Affiliations:
Georgia Institute of Technology;Emory University;Emory University;Southern Polytechnic State University;Georgia Institute of Technology;Georgia Institute of Technology;Emory University
Venue:
CSB '04 Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
Year:
2004

Abstract

One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from the biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting keyword weights as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list, and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
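The two weighting schemes and two of the cluster-quality measures named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the definitions below are assumptions rather than the authors' implementation: TFIDF is taken in its textbook form (term count times log of inverse document frequency, with each gene's pooled abstracts treated as one document), the z-score compares a term's relative frequency for a gene against its mean and standard deviation across a set of background collections, and purity and entropy use their common clustering definitions. All function names and the toy input layout are hypothetical.

```python
import math
import statistics
from collections import Counter


def tfidf_weights(gene_docs):
    """Textbook TFIDF (assumed form, not taken from the paper).

    gene_docs: dict mapping gene -> list of tokens pooled from its abstracts.
    Returns dict gene -> {term: weight}.
    """
    n_genes = len(gene_docs)
    term_counts = {g: Counter(tokens) for g, tokens in gene_docs.items()}
    # Document frequency: number of genes whose pooled text contains the term.
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    return {
        g: {t: c * math.log(n_genes / doc_freq[t]) for t, c in counts.items()}
        for g, counts in term_counts.items()
    }


def relative_freq(tokens):
    """Relative frequency of each term in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


def zscore_weights(gene_tokens, background_sets):
    """Assumed z-score: (f_gene - mean_background) / std_background.

    background_sets: list of token lists, one per background collection.
    Terms whose background standard deviation is zero are skipped.
    """
    gene_freq = relative_freq(gene_tokens)
    bg_freqs = [relative_freq(tokens) for tokens in background_sets]
    weights = {}
    for term, f in gene_freq.items():
        values = [bf.get(term, 0.0) for bf in bg_freqs]
        std = statistics.pstdev(values)
        if std > 0:
            weights[term] = (f - statistics.mean(values)) / std
    return weights


def purity_and_entropy(clusters, labels):
    """Common definitions of cluster purity and size-weighted entropy.

    clusters: list of lists of gene ids; labels: dict gene -> known class.
    Higher purity and lower entropy indicate better agreement with labels.
    """
    total = sum(len(c) for c in clusters)
    purity = 0.0
    entropy = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        counts = Counter(labels[g] for g in cluster)
        size = len(cluster)
        purity += max(counts.values()) / total
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / total) * h
    return purity, entropy
```

In this sketch, the per-gene weight dictionaries would serve as the feature vectors passed to a hierarchical clustering routine, and `purity_and_entropy` would then score the resulting clusters against the known gene groups. Stop-list filtering and stemming, which the abstract says were tuned, are omitted here.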