On the predictive power of sequence similarity in yeast

  • Authors:
  • Yonatan Bilu;Michal Linial

  • Affiliations:
  • Institute of Computer Sciences, The Hebrew University, Jerusalem 91904, Israel;Department of Biological Chemistry, The Hebrew University, Jerusalem 91904, Israel

  • Venue:
  • RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
  • Year:
  • 2001

Abstract

Perhaps the most direct way to infer functional linkage of proteins is through structural similarity. However, structure determination lags behind DNA sequencing. Here we show that sequence similarity between ORFs in yeast, based on nucleotide sequences alone, is indicative of the corresponding genes belonging to the same functional group, having a similar gene expression pattern, or being involved in a protein-protein interaction. In particular, we compare the nucleotide sequences corresponding to the 6280 yeast ORFs using BLAST, and then cluster them using a simple neighbor-joining algorithm. This, in effect, gives us a hierarchical clustering of 53 levels, where higher levels have bigger clusters. We compare the clustering to large databases that are not based a priori on sequence information, to get a notion of how well our clustering correlates with these data. For functional annotation we use the SGD database, which gives one of 540 annotations for about half the yeast genes. For all pairs that appear within a cluster, we test the hypothesis that almost all genes within the same cluster have the same function. We get very high rates of correct annotation at the lower levels of the hierarchy, which decrease gradually at higher ones. From the results of large-scale gene expression experiments we generate a list of pairs of genes whose expression is highly correlated (≥ 0.9). We then go over this list of pairs and count how many of them are contained within a cluster, as opposed to the number expected under a random model. We estimate the significance of our results using simulation, and get a p-value << 0.001 for all levels. The third type of data is obtained from a protein-protein interaction database, and is given as a list of pairs of proteins involved in an interaction. As before, we count how many pairs are contained within a cluster, and get much better results than expected under a random model, with p-values < 0.001 for almost all levels.
In summary, we show that successful functional predictions and functional annotations can be applied at a genomic scale. This can be achieved by combining a naive hierarchical clustering method that creates sets of clusters at different levels of granularity with statistical validation tools.
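
The pair-counting test described in the abstract can be illustrated with a minimal sketch. It counts how many listed gene pairs fall inside a common cluster at one level of the hierarchy, then estimates a p-value by simulation. The abstract does not specify the random model, so the one used here (shuffling the assignment of genes to clusters while keeping cluster sizes fixed) is an assumption, and the names `count_pairs_in_clusters`, `simulated_p_value`, and `cluster_of` are hypothetical.

```python
import random


def count_pairs_in_clusters(pairs, cluster_of):
    """Count pairs whose two genes are assigned to the same cluster."""
    return sum(
        1
        for a, b in pairs
        if a in cluster_of and b in cluster_of and cluster_of[a] == cluster_of[b]
    )


def simulated_p_value(pairs, cluster_of, n_trials=1000, seed=0):
    """Estimate how often a random clustering does at least as well.

    Assumed null model (not taken from the paper): cluster sizes are kept,
    but the assignment of genes to clusters is shuffled at random.
    """
    rng = random.Random(seed)
    genes = list(cluster_of)
    labels = [cluster_of[g] for g in genes]
    observed = count_pairs_in_clusters(pairs, cluster_of)
    at_least = 0
    for _ in range(n_trials):
        rng.shuffle(labels)
        shuffled = dict(zip(genes, labels))
        if count_pairs_in_clusters(pairs, shuffled) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_trials + 1)
```

Running the function once per level of the hierarchy would give one p-value per level. With 1000 trials the smallest reportable estimate is about 0.001, so resolving smaller p-values would require more trials.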