IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting keyword weights as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
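
As a rough illustration of the quantities named above, the following Python sketch computes TFIDF keyword weights per gene and scores a clustering by purity and entropy. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the input layout (a gene-to-keyword-list mapping), the natural-log IDF variant, and the size-weighted entropy definition are all choices made here for illustration, and the abstract does not specify which variants were used.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return gene -> {term: weight} using raw term frequency times ln(N / df).

    `docs` maps each gene to the list of keyword tokens drawn from its
    abstracts (hypothetical input layout, assumed for this sketch).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per gene
    weights = {}
    for gene, tokens in docs.items():
        tf = Counter(tokens)
        weights[gene] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return weights


def purity(clusters, labels):
    """Fraction of genes that belong to the majority known group of their cluster.

    Assumes every cluster is non-empty.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(labels[g] for g in c).values()) for c in clusters)
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of the per-cluster label entropy, in bits."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(labels[g] for g in c)
        h = -sum((k / len(c)) * math.log2(k / len(c)) for k in counts.values())
        result += (len(c) / total) * h
    return result
```

Under these definitions, a clustering that matches the known gene groups exactly has purity 1.0 and entropy 0.0, so "higher purity, lower entropy" means the clusters agree more closely with the reference groups.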