Finding Additive Biclusters with Random Background

  • Authors:
  • Jing Xiao;Lusheng Wang;Xiaowen Liu;Tao Jiang

  • Affiliations:
  • Department of Computer Science and Technology, Tsinghua University,;Department of Computer Science, City University of Hong Kong, Hong Kong;Department of Computer Science, University of Western Ontario, London, Canada N6A 5B7;Department of Computer Science and Engineering, University of California, Riverside

  • Venue:
  • CPM '08 Proceedings of the 19th annual symposium on Combinatorial Pattern Matching
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

The biclustering problem has been extensively studied in many areas including e-commerce, data mining, machine learning, pattern recognition, statistics, and more recently in computational biology. Given an n×mmatrix A(n茂戮驴 m), the main goal of biclustering is to identify a subset of rows (called objects) and a subset of columns (called properties) such that some objective function that specifies the quality of the found bicluster (formed by the subsets of rows and of columns of A) is optimized. The problem has been proved or conjectured to be NP-hard under various mathematical models. In this paper, we study a probabilistic model of the implanted additive bicluster problem, where each element in the n×mbackground matrix is a random number from [0, L茂戮驴 1], and a k×kimplanted additive bicluster is obtained from an error-free additive bicluster by randomly changing each element to a number in [0, L茂戮驴 1] with probability 茂戮驴. We propose an O(n2m) time voting algorithm to solve the problem. We show that for any constant 茂戮驴such that $(1-\delta)(1-\theta)^2 -\frac 1 L 0$, when $k \ge \max \left\{\frac 8 \alpha \sqrt{n\log n},~ \frac {8 \log n} c + \log (2L)\right\}$, where cis a constant number, the voting algorithm can correctly find the implanted bicluster with probability at least $1 - \frac{9}{n^{2}}$. We also implement our algorithm as a software tool for finding novel biclusters in microarray gene expression data, called VOTE. The implementation incorporates several nontrivial ideas for estimating the size of an implanted bicluster, adjusting the threshold in voting, dealing with small biclusters, and dealing with multiple (and overlapping) implanted biclusters. Our experimental results on both simulated and real datasets show that VOTE can find biclusters with a high accuracy and speed.