Whole-genome prokaryotic clustering based on gene lengths

Authors:
A. Bolshoy; Z. Volkovich
Affiliations:
Genome Diversity Center, University of Haifa, Haifa 39105, Israel; Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
Venue:
Discrete Applied Mathematics
Year:
2009

Abstract

The fast-growing number of complete genome sequences prompts the development of new phylogenetic approaches. Until recently, the phylogeny of prokaryotes was inferred mainly by comparing highly conserved genes, and several whole-genome methods have been proposed in the last few years. Here, we present a novel method of taxonomic analysis based on the gene content and the lengths of orthologous genes in 66 completely sequenced genomes of unicellular organisms, using Clusters of Orthologous Groups (COGs). Our input data consist of the average protein lengths for ~5000 COGs across the 66 genomes. We clustered these data using an application of the information bottleneck method for unsupervised clustering. Unlike other recently published whole-genome-based clustering techniques, this approach is not a regular distance-based method. Although our comprehensive genome clustering is independent of phylogenies based on the level of homology of individual genes, it correlates well with the standard "tree of life" based on sequence similarity of 16S rRNA.
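
To make the clustering idea in the abstract concrete, here is a minimal sketch of how a table of average protein lengths per COG could be clustered with an information bottleneck procedure. It rests on several assumptions: the abstract does not say which information bottleneck variant the authors applied, so the sketch uses the greedy agglomerative form as a stand-in; normalizing the length table into a joint distribution p(genome, COG) is likewise an assumption; and the genome names, COG labels, and lengths in `toy` are hypothetical, not data from the paper.

```python
import math
from itertools import combinations

def joint_distribution(lengths):
    """Normalize genome -> {COG: mean protein length} into p(genome, COG).
    Assumption: a COG absent from a genome simply carries zero mass."""
    total = sum(v for row in lengths.values() for v in row.values())
    return {g: {c: v / total for c, v in row.items()} for g, row in lengths.items()}

def kl(p, q):
    """Kullback-Leibler divergence in bits; q must cover the support of p."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def mix(wa, pa, wb, pb):
    """Weighted mixture of two conditional distributions p(COG | cluster)."""
    w = wa + wb
    keys = set(pa) | set(pb)
    return {k: (wa * pa.get(k, 0.0) + wb * pb.get(k, 0.0)) / w for k in keys}

def merge_cost(wa, pa, wb, pb):
    """Information lost by merging two clusters:
    (w_a + w_b) times their weighted Jensen-Shannon divergence."""
    w = wa + wb
    m = mix(wa, pa, wb, pb)
    return w * ((wa / w) * kl(pa, m) + (wb / w) * kl(pb, m))

def agglomerative_ib(lengths, n_clusters):
    """Greedy agglomerative information bottleneck (stand-in variant).
    Assumes every genome has at least one COG with a positive length."""
    joint = joint_distribution(lengths)
    clusters = {}
    for genome, row in joint.items():
        weight = sum(row.values())  # p(genome)
        clusters[(genome,)] = (weight, {c: v / weight for c, v in row.items()})
    while len(clusters) > n_clusters:
        # Merge the pair of clusters whose union loses the least information.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: merge_cost(*clusters[pair[0]], *clusters[pair[1]]),
        )
        wa, pa = clusters.pop(a)
        wb, pb = clusters.pop(b)
        clusters[tuple(sorted(a + b))] = (wa + wb, mix(wa, pa, wb, pb))
    return sorted(clusters)

# Hypothetical toy table: 4 genomes, 5 COGs, mean protein lengths in residues.
toy = {
    "bacterium_A": {"cog1": 430.0, "cog2": 310.0, "cog3": 120.0},
    "bacterium_B": {"cog1": 425.0, "cog2": 305.0, "cog3": 125.0},
    "archaeon_C": {"cog1": 200.0, "cog4": 510.0, "cog5": 260.0},
    "archaeon_D": {"cog1": 205.0, "cog4": 500.0, "cog5": 255.0},
}

print(agglomerative_ib(toy, 2))
```

Nothing in this sketch computes a pairwise distance matrix between genomes in the usual phylogenetic sense. Merges are chosen by how much information about the COG length profile is lost, which is the sense in which an information bottleneck approach differs from a regular distance-based method. Both gene content (which COGs are present) and gene lengths (how much mass each COG carries) shape the profiles being compared.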