Rich probabilistic models for genomic data

  • Authors:
  • Daphne Koller; Eran Segal

  • Venue:
  • Rich probabilistic models for genomic data
  • Year:
  • 2004

Abstract

Genomic datasets, spanning many organisms and data types, are rapidly being produced, creating new opportunities for understanding the molecular mechanisms underlying human disease, and for studying complex biological processes on a global scale. Transforming these immense amounts of data into biological information is a challenging task. In this thesis, we address this challenge by presenting a statistical modeling language, based on Bayesian networks, for representing heterogeneous biological entities and modeling the mechanism by which they interact. We use statistical learning approaches to learn the details of these models (structure and parameters) automatically from raw genomic data. The biological insights are then derived directly from the learned model. We describe three applications of this framework to the study of gene regulation: (1) Understanding the process by which DNA patterns (motifs) in the control regions of genes play a role in controlling their activity. Using only DNA sequence and gene expression data as input, these models recovered many of the known motifs in yeast and several known motif combinations in human. (2) Finding regulatory modules and their actual regulator genes directly from gene expression data. Some of the predictions from this analysis were tested successfully in the wet lab, suggesting regulatory roles for three previously uncharacterized proteins. (3) Combining gene expression profiles from several organisms for a more robust prediction of gene function and regulatory pathways, and for studying the degree to which regulatory relationships have been conserved across evolution.