Graph Clustering Via a Discrete Uncoupling Process

  • Authors:
  • Stijn Van Dongen

  • Affiliations:
  • svd@sanger.ac.uk

  • Venue:
  • SIAM Journal on Matrix Analysis and Applications
  • Year:
  • 2008

Abstract

A discrete uncoupling process for finite spaces is introduced, called the Markov Cluster Process or the MCL process. The process is the engine for the graph clustering algorithm called the MCL algorithm. The MCL process takes a stochastic matrix as input, and then alternates expansion and inflation, each step defining a stochastic matrix in terms of the previous one. Expansion corresponds to taking the $k$th power of a stochastic matrix, where $k\in\mathbb{N}$. Inflation corresponds to a parametrized operator $\Gamma_r$, $r\geq 0$, that maps the set of (column) stochastic matrices onto itself. The image $\Gamma_r M$ is obtained by raising each entry in $M$ to the $r$th power and rescaling each column to have sum 1 again. In practice the process converges very fast towards a limit that is invariant under both matrix multiplication and inflation, with quadratic convergence around the limit points. The heuristic behind the process is its expected behavior for (Markov) graphs possessing cluster structure. The process is typically applied to the matrix of random walks on a given graph $G$, and the connected components of (the graph associated with) the process limit generically allow a clustering interpretation of $G$. The limit is in general extremely sparse and iterands are sparse in a weighted sense, implying that the MCL algorithm is very fast and highly scalable. Several mathematical properties of the MCL process are established. Most notably, the process (and algorithm) iterands possess structural properties generalizing the mapping from process limits onto clusterings. The inflation operator $\Gamma_r$ maps the class of matrices that are diagonally similar to a symmetric matrix onto itself. The phrase diagonally positive semi-definite (dpsd) is used for matrices that are diagonally similar to a positive semi-definite matrix. For $r\in\mathbb{N}$ and for $M$ a stochastic dpsd matrix, the image $\Gamma_r M$ is again dpsd.
Determinantal inequalities satisfied by a dpsd matrix $M$ imply a natural ordering among the diagonal elements of $M$, generalizing the mapping of process limits onto clusterings. The spectrum of $\Gamma_{\infty} M$ is of the form $\{0^{n-k}, 1^k\}$, where $k$ is the number of endclasses of the ordering associated with $M$, and $n$ is the dimension of $M$. This attests to the uncoupling effect of the inflation operator.
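
The alternation of expansion and inflation described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the author's implementation: the function names (`expand`, `inflate`, `mcl`, `clusters`), the added self-loops, the stopping tolerance `tol`, and the threshold `eps` are all assumptions made for illustration, and dense lists are used where a practical implementation would rely on sparse matrices and pruning.

```python
def normalize_columns(m):
    """Rescale each column of a square matrix so that it sums to 1."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / s
    return out


def expand(m, k=2):
    """Expansion: the k-th matrix power of m (k a positive integer)."""
    n = len(m)
    result = m
    for _ in range(k - 1):
        result = [[sum(result[i][t] * m[t][j] for t in range(n))
                   for j in range(n)] for i in range(n)]
    return result


def inflate(m, r=2.0):
    """Inflation Gamma_r: raise each entry to the r-th power, then rescale columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, k=2, r=2.0, max_iter=100, tol=1e-9):
    """Alternate expansion and inflation until the iterand stops changing."""
    n = len(adjacency)
    # Assumption: add a self-loop to every node before building the
    # column-stochastic random-walk matrix.
    m = normalize_columns([[adjacency[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m, k), r)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    return m


def clusters(limit, eps=1e-6):
    """Connected components of the graph associated with the limit matrix."""
    n = len(limit)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if limit[i][j] > eps:  # treat entries above eps as edges
                parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical example graph: two triangles joined by the single edge 2-3.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(clusters(mcl(two_triangles)))
```

In this sketch a larger `r` strengthens the uncoupling effect of inflation and tends to give finer clusterings, while `k` controls how far flow spreads in each expansion step.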