The statistical significance of max-gap clusters

Authors:
Rose Hoberman; David Sankoff; Dannie Durand
Affiliations:
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA; Department of Mathematics and Statistics, University of Ottawa, Ontario, Canada; Departments of Biological Sciences and Computer Science, Carnegie Mellon University, Pittsburgh, PA
Venue:
RCG'04 Proceedings of the 2004 RECOMB international conference on Comparative Genomics
Year:
2004

Abstract

Identifying gene clusters, genomic regions that share local similarities in gene organization, is a prerequisite for many different types of genomic analyses, including operon prediction, reconstruction of chromosomal rearrangements, and detection of whole-genome duplications. A number of formal definitions of gene clusters have been proposed, as well as methods for finding such clusters and/or statistical tests for determining their significance. Unfortunately, there is very little overlap between previously published rigorous analytical statistical tests and the definitions used in practice. In this paper, we consider the max-gap cluster: a contiguous region containing a maximal set of homologs, where the number of non-homologous genes between pairs of adjacent homologs is never greater than a predefined, fixed parameter, g. Although this is one of the models most widely used in practice, currently the statistical significance of max-gap clusters can only be evaluated using Monte Carlo simulations because no analytical statistical tests have been developed for it. We give exact expressions for the probability of observing such a cluster by chance, assuming a simple reference-region scenario and random gene order, as well as more efficient methods for approximating this probability. We use these methods to identify which regions of the parameter space yield clusters that are statistically significant. Finally, we discuss some of the challenges in extending this model to whole-genome comparison.
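
To make the max-gap definition and the Monte Carlo baseline mentioned above concrete, here is a minimal sketch in Python. It is not the authors' method or their exact expressions. It assumes a simplified setting: a genome of n genes in random order, of which m are homologs of the genes of interest, and it asks how often all m homologs fall into a single max-gap cluster with gap parameter g. The function names and the parameters n, m, trials, and seed are hypothetical, introduced only for illustration.

```python
import random


def is_max_gap_cluster(positions, g):
    """Return True if the homologs at `positions` form one max-gap cluster.

    Two adjacent homologs at positions a < b have b - a - 1 genes between
    them; the cluster condition requires that count to be at most g.
    """
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def estimate_cluster_probability(n, m, g, trials=10_000, seed=0):
    """Monte Carlo estimate under random gene order (illustrative assumption).

    Each trial places m homologs uniformly at random among n gene positions
    and checks whether all of them form a single max-gap cluster.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        positions = rng.sample(range(n), m)
        if is_max_gap_cluster(positions, g):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical parameter values chosen only to exercise the sketch.
    print(estimate_cluster_probability(n=1000, m=5, g=20))
```

A simulation like this needs many trials to resolve small probabilities, which is one reason closed-form expressions and fast approximations such as those described in the abstract are attractive.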