Dependencies between Transcription Factor Binding Sites: Comparison between ICA, NMF, PLSA and Frequent Sets

  • Authors:
  • Heli Hiisila; Ella Bingham

  • Affiliations:
  • Helsinki University of Technology, Finland; Helsinki University of Technology, Finland

  • Venue:
  • ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
  • Year:
  • 2004

Abstract

Gene expression in eukaryotes is regulated through transcription factors, which are molecules that can attach to binding sites in the DNA sequence. These binding sites are short stretches of DNA usually found upstream of the gene they regulate. Because the binding sites play an important role in gene expression, it is of interest to characterize them. In this paper we look for dependencies and independencies between these binding sites using independent component analysis (ICA), non-negative matrix factorization (NMF), probabilistic latent semantic analysis (PLSA) and the method of frequent sets. The data used are human gene upstream regions and possible binding sites listed in a biological database. Results on the upstream regions of baker's yeast (S. cerevisiae) are also briefly discussed for comparison. ICA, NMF and PLSA are latent variable methods that decompose the observed data into smaller components. Of these, ICA and NMF were originally designed for continuous data. We show that these methods can be successfully used on discrete DNA data as well. PLSA and the method of frequent sets were created for discrete data sets. The above methods reveal partially overlapping sets of possible binding sites such that the binding sites within a set are dependent on each other. The methods of frequent sets and NMF give a good overview of the most common data structures, whereas using ICA and PLSA we find large sets that are surprisingly frequent. That is, sets of very frequently occurring possible binding sites can be found near hundreds or thousands of genes; in addition, interesting but less frequent sets co-occur surprisingly often.
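
To make the method of frequent sets mentioned in the abstract concrete, the following is a minimal sketch of a level-wise (Apriori-style) search, assuming each upstream region is represented as the set of possible binding sites found in it. The function name `frequent_sets`, the site labels, and the toy data are hypothetical illustrations and are not taken from the paper, which may use a different implementation and support threshold.

```python
from itertools import combinations


def frequent_sets(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at least
    min_support transactions, using a level-wise (Apriori-style) search."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Build size-k candidates from pairs of frequent (k-1)-sets and keep
        # only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count the support of each candidate against the data.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        result.update(current)
        k += 1
    return result


# Toy data (hypothetical): each set lists the possible binding sites
# observed in one gene's upstream region.
upstream_regions = [
    {"siteA", "siteB", "siteC"},
    {"siteA", "siteB"},
    {"siteA", "siteC"},
    {"siteB", "siteC"},
    {"siteA", "siteB", "siteC"},
]

found = frequent_sets(upstream_regions, min_support=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy example every pair of sites co-occurs in three regions and is reported as frequent, while the full triple occurs in only two regions and is pruned. On real data the support threshold would be chosen relative to the number of genes.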