Statistical methods for joint data mining of gene expression and DNA sequence database

  • Authors:
  • Marla D. Curran;Hong Liu;Fan Long;Nanxiang Ge

  • Affiliations:
  • Aventis Pharmaceuticals Biometrics & Data Mgmt, Bridgewater, NJ;Aventis Pharmaceuticals Molecular Immunology, Bridgewater, NJ;Aventis Pharmaceuticals Molecular Immunology, Bridgewater, NJ;Aventis Pharmaceuticals Biometrics & Data Mgmt, Bridgewater, NJ

  • Venue:
  • ACM SIGKDD Explorations Newsletter
  • Year:
  • 2003

Abstract

One of the purposes of microarray gene expression experiments is to identify genes regulated under specific cellular conditions. With the availability of putative transcription factor binding motifs, it is now possible to relate gene expression patterns to the pattern of transcription factor binding sites (TFBS), and to study how TFBS interact with each other to control gene expression. The objective of this study is to develop a systematic approach for combining data from microarray gene expression experiments with the corresponding regulatory motif patterns in order to delineate gene regulation mechanisms. A secondary goal is to develop a predictive model for finding similarly regulated genes. Three consecutive procedures are proposed for such data mining activities. First, a linear mixed-effect model is fit to data from microarray gene expression experiments, and potentially regulated (positive) genes are identified based on a specific biological hypothesis. Putative TFBS are then retrieved for the identified positive genes and for randomly selected controls. Second, a cluster analysis is conducted to reduce collinearity among the binding sites. Third, logistic regression is applied to choose the best model for predicting gene type (positive or control) from the numerous TFBS predictors. This approach was applied to an internal example, and a model was developed to predict up-regulated genes in activated T-helper (Th) cells. Under a leave-one-out cross-validation scheme, the model has an 18.9% false positive rate and a 41.7% false negative rate.
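The evaluation in the third step can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes binary TFBS presence indicators as predictors, a plain gradient-descent logistic regression with a small L2 penalty, and a 0.5 classification threshold. None of these choices are stated in the abstract, and the toy data and every function name are hypothetical. The model selection and clustering steps are not reproduced. Each gene is held out once, the model is refit on the remaining genes, and the held-out prediction is tallied into false positive and false negative rates.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit logistic regression by batch gradient descent.

    The learning rate, epoch count, and L2 penalty are assumptions made
    for this sketch. Returns (weights, bias).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (gw[j] / n + l2 * wj) for j, wj in enumerate(w)]
        b -= lr * gb / n
    return w, b


def predict(model, x, threshold=0.5):
    # Classify as positive (1) when the fitted probability reaches the threshold.
    w, b = model
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= threshold else 0


def loocv_error_rates(X, y):
    """Leave-one-out cross-validation.

    Returns (false positive rate, false negative rate), where the false
    positive rate is taken over control genes (label 0) and the false
    negative rate over positive genes (label 1).
    """
    fp = fn = 0
    for i in range(len(X)):
        # Refit on every gene except the i-th, then predict the i-th.
        model = fit_logistic(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = predict(model, X[i])
        if pred == 1 and y[i] == 0:
            fp += 1
        elif pred == 0 and y[i] == 1:
            fn += 1
    return fp / y.count(0), fn / y.count(1)


# Hypothetical toy data: rows are genes, columns are presence (1) or
# absence (0) of two made-up TFBS clusters; labels are 1 = positive gene,
# 0 = control gene.
toy_X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [1, 1], [0, 0]]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

toy_fpr, toy_fnr = loocv_error_rates(toy_X, toy_y)
print(f"false positive rate: {toy_fpr:.3f}")
print(f"false negative rate: {toy_fnr:.3f}")
```

Because each held-out gene is excluded from fitting, the resulting rates estimate out-of-sample error rather than training error, which is the sense in which the 18.9% and 41.7% figures above are reported.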