Loads the necessary packages.

library(greed)
library(future) # allows parralel processing in greed()
library(aricode) # for ari computation
library(Matrix)
library(igraph)
library(mclust)

# data wrangling
library(dplyr) 
library(tidyr)
library(purrr)
library(tidygraph)

# advanced ploting 
library(ggpubr) 
library(ggplot2)
library(ggraph)
library(ggwordcloud)
library(kableExtra)

set.seed(2134)

future::plan("multisession", workers=1) # increase workers for parallel processing

Package overview

Mathematical context

Now that we introduced the methodology we can proceed to a demonstration of the greed package on real datasets. As explained in the presentation, the methodology in quite versatile and applies to several data-types as long as we have a model for which an exact ICL may be derived \[ \text{ICL}_{ex} = \log p(X \mid Z, \beta) + \log p(Z \mid \alpha). \]

Data-types currently handled by the greed package

Currently, the package handles the following tasks:

  • Continuous data clustering with Gaussian Mixture Models. A GMM tutorial is available. See also the documentation for the Gmm and DiagGmm S4 classes.
  • Graph data clustering with the Stochastic Block Model or its degree corrected variants. A SBM tutorial is available . See also the documentation for the Sbm and dcSbm S4 classes.
  • Categorical data clustering with the Latent Class Analysis. An LCA tutorial is available. See also the documentation for the Lca S4 class.
  • Count data clustering with the Mixture of Multinomials model. A tutorial will soon be available. For now, we refer to the documentation for the Mom S4 class.
  • Mixed-typed data clustering, e.g. categorical and numerical but the package handles virtually any type of data combination by stacking models on top of each data types. For example graph data with continuous or categorical data attached to the nodes are handled. A CombinedModels tutorial is available. See also the documentation for the CombinedModels S4 class.
  • Mixture of regression for simultaneous clustering and fitting a regression model in each cluster. A MoR tutorial is available. See also the documentation for the MoR S4 class.
  • Co-clustering of binary and count-data via the Latent Block Model and its degree-corrected variant. A tutorial will soon be available. For now, we refer to the documentation for the DcLbm S4 class.

For this practical session, we will focus on continuous, categorical and graph data clustering. Do not hesitate to try on your own, real dataset !

Basic usage: the greed function

The package’s main entry point is the greed function with a typical call as follow

sol = greed(
   X                  , # dataset to cluster (mandatory)
   K = 20             , # number of initial clusters (optional)
   model = Gmm()      , # model to use and its prior parameters (optional)
   alg = Hybrid()     , # algorithm to use and its parameters (optional)
   verbose= FALSE    )  # verbosity level (optional)

Let us decompose this call.

  • First, X contains the dataset do cluster. Depending on the task it can be an continuous or discrete \(n \times p\) matrix of multivariate observations, or an \(n \times n\) adjacency matrix.
  • Then, the model argument defines the observational model, i.e the expression of \(\log p (X \mid Z, \beta)\) that shall be used. It is optional as the package can infer the model from X data-type (continous, factor, graph). The hyper-parameters \(\beta\) can be specified.
  • alg specifies which algorithm to use among hybrid GA, standard GA, greedy with clever initialization or multiple random initializations.

S4 class abstraction

The greed package uses S4 class abstraction to specify the model, alg arguments as well as the results of the greed() function. This allows to use generic method to explore the results such as

  • plot(sol, ...)
  • clustering(sol)
  • cut(sol, )
  • coef(sol)

Introductory example on simulated data: graph clustering with binary SBM

We will begin with a graph clustering example with a binary SBM model.

The model

The stochastic block model (SBM) is a random graph model of the adjacency matrix \(\mathbf{X}\) widely used for graph clustering. In this model, the probability of an edge \((i,j)\) is driven by the cluster membership of node \(i\) and \(j\), hence the block terminology. The Bayesian formulation of a binary SBM is as follow

\[\begin{align*} \forall (k,l) \in [K]^2, \quad \theta_{kl} & \sim \textrm{Beta}(a_0, b_0) \\ \forall (i,j) \in [n]^2, \quad x_{ij} \mid \{ z_{ik}z_{jl}=1 \}, \theta & \sim \mathcal{B}(\theta_{kl}) \end{align*}\]

Beta-Bernoulli conjugacy allows to derive an exact expression of \(\log p(X\mid Z, \beta) = \log \int_\theta p(X \mid Z, \theta) p(\theta \mid \beta) \, d\theta\). Here, the hyper-parameters \(\beta = (a_0, b_0)\) may be set to default “non-informative” values \(a_0=b_0=1\) (Beta(1,1) is uniform on \([0,1]\)).

Reproducing the slides example: nested SBM simulation

We construct a simulation scheme with \(K=6\) clusters with a nested structure: 3 big clusters are each composed of 2 smaller clusters. We choose a community structure where nodes inside the same blocks tend to connect more.

Simulate the data

N=1600
K=15
prop=rep(1/K,K)
lambda  = 0.1
lambda_o = 0.025
Ks=5
theta = bdiag(lapply(1:(K/Ks), function(k){matrix(lambda_o,Ks,Ks)+diag(rep(lambda,Ks))}))+0.001
sbm = rsbm(N,prop,theta)
dim(sbm$x)
#> [1] 1600 1600

We have a graph with 1600 nodes. The true connectivity matrix \(\theta\) can be visualized and any (reordered) adjacency matrices from this SBM model would have the same profile.


theta.long=tibble(p=as.numeric(theta),k=rep(1:K,K),l=rep(1:K,each=K))

ggplot(theta.long)+
  geom_tile(aes(x=l,y=k,fill=p))+
  scale_x_discrete()+
  scale_y_discrete()+
  ggplot2::scale_fill_distiller("Link density", palette = "Greys", direction = 1, limits = c(0, max(theta)))+theme_minimal()+coord_equal()

Prepare the data and call greed

sol_sbmsim <- greed(X = sbm$x, model = Sbm(), alg = Hybrid(pop_size = 40))
#> 
#> ── Fitting a guess SBM model ──
#> 
#> ℹ Initializing a population of 40 solutions.
#> ℹ Generation 1 : best solution with an ICL of -171800 and 15 clusters.
#> ℹ Generation 2 : best solution with an ICL of -168669 and 12 clusters.
#> ℹ Generation 3 : best solution with an ICL of -167104 and 15 clusters.
#> ℹ Generation 4 : best solution with an ICL of -167104 and 15 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a SBM model 15 clusters and an ICL of -167104
K(sol_sbmsim)
#> [1] 15

We see that greed correctly recovers the \(K=15\) clusters, in a reasonable amount of time. What about the recovered partition ? We can explore the results thanks to dedicated function in the package.

Exploring the results

First, we investigate the recovered partition \(Z^{(6)}\) with \(K=6\) groups via a confusion matrix and the Adjusted Rand Index, both indicating the true partition has been recovered.

cl <- clustering(sol_sbmsim)
table(truth=sbm$cl, est=cl) %>% kable(row.names = T) %>% kable_styling(full_width = F)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0 0 0 0 0 0 0 0 0 0 0 94 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 111 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 129
4 0 0 0 0 0 0 0 0 0 0 87 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 103 0 0
6 0 0 0 109 0 0 0 0 0 0 0 0 0 0 0
7 0 108 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0
9 122 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 105 0 0 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 89 0 0 1 0 0 0 0 0 0
12 0 0 0 0 0 0 0 109 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 114 0 0 0 0 0
14 0 0 0 0 0 0 0 0 117 0 0 0 0 0 0
15 0 0 0 0 0 0 102 0 0 0 0 0 0 0 0
aricode::ARI(sbm$cl, cl)
#> [1] 0.9987085

Moreover, greed also computes the hierarchical clustering starting from \(Z^{(6)}\) to \(Z^{(1)}\). The dendrogram can be visualized with the plot function, specifying its type argument to "tree".

plot(sol_sbmsim, type="tree")
plot(sol_sbmsim, type="path")

The fusion level corresponds to the amount of regularization \(\log(\alpha)\) needed to unlock the fusion. It is consistent with the hierarchical structure of the dataset, and the leaf ordering algorithm provides a “natural” ordering of the cluster that can be used in other plotting functions.

For example, we can visualize the block adjacency matrix corresponding to \(Z^{(6)}\) with clusters ordered via the dendrogram. The visualisation is greatly enhanced, highlighting the hierarchical structure of the data.

plot(sol_sbmsim, type="blocks")

An alternative visualization of the aggregated graph is the nodelink diagram

plot(sol_sbmsim, type="nodelink")