Identifying statistical dependence in genomic sequences via mutual information estimates

Authors:
Hasan Metin Aktulga;Ioannis Kontoyiannis;L. Alex Lyznik;Lukasz Szpankowski;Ananth Y. Grama;Wojciech Szpankowski
Affiliations:
Department of Computer Science, Purdue University, West Lafayette, IN;Department of Informatics, Athens University of Economics & Business, Patission, Athens, Greece;Pioneer Hi-Bred International, Johnston, IA;Bioinformatics Program, University of California, San Diego, CA;Department of Computer Science, Purdue University, West Lafayette, IN;Department of Computer Science, Purdue University, West Lafayette, IN
Venue:
EURASIP Journal on Bioinformatics and Systems Biology
Year:
2007

Abstract

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5′ untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
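
As a rough illustration of the quantity the abstract relies on, the sketch below computes a plug-in (empirical-frequency) estimate of the mutual information I(X;Y) = Σ p(a,b) log2[p(a,b) / (p(a) p(b))] between two segments. It is not the authors' procedure: it assumes the two segments have equal length and are compared position by position, and the names `empirical_mutual_information` and `exceeds_threshold` are hypothetical. The abstract does not give the form of the threshold function, so the threshold here is simply a value supplied by the caller.

```python
from collections import Counter
from math import log2


def empirical_mutual_information(x: str, y: str) -> float:
    """Plug-in estimate of I(X;Y) in bits from position-aligned symbol pairs.

    Assumption (not from the paper): x and y are equal-length segments,
    and symbol x[i] is paired with symbol y[i].
    """
    if not x or len(x) != len(y):
        raise ValueError("segments must be non-empty and of equal length")

    n = len(x)
    joint = Counter(zip(x, y))  # counts of aligned symbol pairs (a, b)
    count_x = Counter(x)        # marginal counts for the first segment
    count_y = Counter(y)        # marginal counts for the second segment

    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


def exceeds_threshold(x: str, y: str, threshold: float) -> bool:
    """Flag a pair of segments whose estimated mutual information is large.

    The threshold is a caller-supplied assumption, not the paper's function.
    """
    return empirical_mutual_information(x, y) > threshold


if __name__ == "__main__":
    segment = "ACGT" * 25
    print(empirical_mutual_information(segment, segment))    # identical segments
    print(empirical_mutual_information(segment, "A" * 100))  # constant second segment
```

A plug-in estimate of this kind is biased upward for short segments, so any threshold would in practice have to account for segment length and alphabet size; the sketch leaves that choice to the caller.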