Co-clustering numerical data under user-defined constraints

  • Authors: Ruggero G. Pensa; Jean-Francois Boulicaut; Francesca Cordero; Maurizio Atzori

  • Affiliations: Department of Computer Science, University of Torino, I-10149 Torino, Italy and Pisa KDD Laboratory, ISTI-CNR, I-56124 Pisa, Italy; INSA-Lyon, LIRIS CNRS UMR5205, F-69621 Villeurbanne, France; Department of Computer Science, University of Torino, I-10149 Torino, Italy and Department of Clinical and Biological Sciences, University of Torino, I-10043 Orbassano, Italy; Pisa KDD Laboratory, ISTI-CNR, I-56124 Pisa, Italy

  • Venue: Statistical Analysis and Data Mining
  • Year: 2010

Abstract

In the generic setting of objects × attributes matrix data analysis, co-clustering is a useful unsupervised data mining method. A co-clustering task produces a bi-partition made of co-clusters: each co-cluster is a group of objects associated with a group of attributes, and these associations can support expert interpretation. Many constrained clustering algorithms have been proposed to exploit domain knowledge and improve partition relevancy in the mono-dimensional clustering case (e.g., using must-link and cannot-link constraints on one of the two dimensions). Here, we consider constrained co-clustering not only for extended must-link and cannot-link constraints (i.e., constraints that can involve both objects and attributes), but also for interval constraints that enforce properties of co-clusters over ordered domains. We describe an iterative co-clustering algorithm that exploits user-defined constraints while minimizing a given objective function. Because the setting is generic, different objective functions can be used. The added value of our approach is demonstrated on both synthetic and real data. In particular, several experiments illustrate the practical impact of this co-clustering setting in the context of gene expression data analysis and in an original application to a protein motif discovery problem.

Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 38-55, 2010
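To make the constraint types named in the abstract concrete, here is a minimal Python sketch that checks whether a given co-clustering satisfies them. It is not the authors' algorithm. All function names are hypothetical, and two points are assumptions: that an interval constraint means each cluster must cover one contiguous run of an ordered dimension, and that the sum of squared deviations from co-cluster means stands in for the unspecified objective function.

```python
from itertools import groupby


def satisfies_must_link(labels, pairs):
    """True if every constrained pair of indices shares a cluster label."""
    return all(labels[i] == labels[j] for i, j in pairs)


def satisfies_cannot_link(labels, pairs):
    """True if every constrained pair of indices has different labels."""
    return all(labels[i] != labels[j] for i, j in pairs)


def satisfies_interval(labels):
    """Assumed reading of an interval constraint: along an ordered
    dimension, each cluster occupies exactly one contiguous run."""
    runs = [label for label, _ in groupby(labels)]
    return len(runs) == len(set(runs))


def sum_squared_residue(matrix, row_labels, col_labels):
    """Illustrative objective (an assumption): squared deviation of each
    cell from the mean of the co-cluster it falls in, summed over cells."""
    sums, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            total += (value - sums[key] / counts[key]) ** 2
    return total


# Toy objects x attributes matrix with an obvious block structure.
matrix = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [9, 9, 2, 2],
]
row_labels = [0, 0, 1]      # cluster label of each object (row)
col_labels = [0, 0, 1, 1]   # cluster label of each attribute (column)

rows_ok = (satisfies_must_link(row_labels, [(0, 1)])
           and satisfies_cannot_link(row_labels, [(0, 2)]))
cols_ok = satisfies_interval(col_labels)
objective = sum_squared_residue(matrix, row_labels, col_labels)
```

The same two pair checks work on either dimension, which mirrors the abstract's point that must-link and cannot-link constraints can involve objects as well as attributes. An iterative algorithm of the kind the abstract describes would reassign labels to lower the objective while keeping such checks satisfied; that loop is left out here.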