Loads the necessary packages.

```
library(greed)
library(future) # allows parralel processing in greed()
library(aricode) # for ari computation
library(Matrix)
library(igraph)
library(mclust)
# data wrangling
library(dplyr)
library(tidyr)
library(purrr)
library(tidygraph)
# advanced ploting
library(ggpubr)
library(ggplot2)
library(ggraph)
library(ggwordcloud)
library(kableExtra)
set.seed(2134)
future::plan("multisession", workers=1) # increase workers for parallel processing
```

Now that we introduced the methodology we can proceed to a
demonstration of the *greed* package on real datasets. As
explained in the presentation, the methodology in quite versatile and
applies to several data-types as long as we have a
**model** for which an exact ICL may be derived \[
\text{ICL}_{ex} = \log p(X \mid Z, \beta) + \log p(Z \mid \alpha).
\]

Currently, the package handles the following tasks:

**Continuous data clustering**with Gaussian Mixture Models. A GMM tutorial is available. See also the documentation for the`Gmm`

and`DiagGmm`

S4 classes.**Graph data clustering**with the Stochastic Block Model or its degree corrected variants. A SBM tutorial is available . See also the documentation for the`Sbm`

and`dcSbm`

S4 classes.**Categorical data clustering**with the Latent Class Analysis. An LCA tutorial is available. See also the documentation for the`Lca`

S4 class.**Count data clustering**with the Mixture of Multinomials model. A tutorial will soon be available. For now, we refer to the documentation for the`Mom`

S4 class.**Mixed-typed**data clustering,*e.g.*categorical and numerical but the package handles virtually any type of data combination by stacking models on top of each data types. For example graph data with continuous or categorical data attached to the nodes are handled. A CombinedModels tutorial is available. See also the documentation for the`CombinedModels`

S4 class.**Mixture of regression**for simultaneous clustering and fitting a regression model in each cluster. A MoR tutorial is available. See also the documentation for the`MoR`

S4 class.**Co-clustering**of binary and count-data via the Latent Block Model and its degree-corrected variant. A tutorial will soon be available. For now, we refer to the documentation for the`DcLbm`

S4 class.

For this practical session, we will focus on continuous, categorical and graph data clustering. Do not hesitate to try on your own, real dataset !

`greed`

functionThe package’s main entry point is the `greed`

function
with a typical call as follow

```
sol = greed(
X , # dataset to cluster (mandatory)
K = 20 , # number of initial clusters (optional)
model = Gmm() , # model to use and its prior parameters (optional)
alg = Hybrid() , # algorithm to use and its parameters (optional)
verbose= FALSE ) # verbosity level (optional)
```

Let us decompose this call.

- First,
`X`

contains the dataset do cluster. Depending on the task it can be an continuous or discrete \(n \times p\) matrix of multivariate observations, or an \(n \times n\) adjacency matrix. - Then, the
`model`

argument defines the observational model,*i.e*the expression of \(\log p (X \mid Z, \beta)\) that shall be used. It is optional as the package can infer the model from`X`

data-type (continous, factor, graph). The hyper-parameters \(\beta\) can be specified. `alg`

specifies which algorithm to use among hybrid GA, standard GA, greedy with clever initialization or multiple random initializations.

The **greed** package uses S4 class abstraction to
specify the `model`

, `alg`

arguments as well as
the results of the `greed()`

function. This allows to use
*generic* method to explore the results such as

`plot(sol, ...)`

`clustering(sol)`

`cut(sol, )`

`coef(sol)`

We will begin with a graph clustering example with a binary SBM model.

The stochastic block model (SBM) is a random graph model of the
adjacency matrix \(\mathbf{X}\) widely
used for graph clustering. In this model, the probability of an edge
\((i,j)\) is driven by the cluster
membership of node \(i\) and \(j\), hence the **block**
terminology. The Bayesian formulation of a binary SBM is as follow

Beta-Bernoulli conjugacy allows to derive an exact expression of \(\log p(X\mid Z, \beta) = \log \int_\theta p(X \mid Z, \theta) p(\theta \mid \beta) \, d\theta\). Here, the hyper-parameters \(\beta = (a_0, b_0)\) may be set to default “non-informative” values \(a_0=b_0=1\) (Beta(1,1) is uniform on \([0,1]\)).

We construct a simulation scheme with \(K=6\) clusters with a nested structure: 3 big clusters are each composed of 2 smaller clusters. We choose a community structure where nodes inside the same blocks tend to connect more.

```
N=1600
K=15
prop=rep(1/K,K)
lambda = 0.1
lambda_o = 0.025
Ks=5
theta = bdiag(lapply(1:(K/Ks), function(k){matrix(lambda_o,Ks,Ks)+diag(rep(lambda,Ks))}))+0.001
sbm = rsbm(N,prop,theta)
dim(sbm$x)
#> [1] 1600 1600
```

We have a graph with 1600 nodes. The true connectivity matrix \(\theta\) can be visualized and any (reordered) adjacency matrices from this SBM model would have the same profile.

```
theta.long=tibble(p=as.numeric(theta),k=rep(1:K,K),l=rep(1:K,each=K))
ggplot(theta.long)+
geom_tile(aes(x=l,y=k,fill=p))+
scale_x_discrete()+
scale_y_discrete()+
ggplot2::scale_fill_distiller("Link density", palette = "Greys", direction = 1, limits = c(0, max(theta)))+theme_minimal()+coord_equal()
```

`greed`

```
sol_sbmsim <- greed(X = sbm$x, model = Sbm(), alg = Hybrid(pop_size = 40))
#>
#> ── Fitting a guess SBM model ──
#>
#> ℹ Initializing a population of 40 solutions.
#> ℹ Generation 1 : best solution with an ICL of -171800 and 15 clusters.
#> ℹ Generation 2 : best solution with an ICL of -168669 and 12 clusters.
#> ℹ Generation 3 : best solution with an ICL of -167104 and 15 clusters.
#> ℹ Generation 4 : best solution with an ICL of -167104 and 15 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a SBM model 15 clusters and an ICL of -167104
K(sol_sbmsim)
#> [1] 15
```

We see that `greed`

correctly recovers the \(K=15\) clusters, in a reasonable amount of
time. What about the recovered partition ? We can explore the results
thanks to dedicated function in the package.

First, we investigate the recovered partition \(Z^{(6)}\) with \(K=6\) groups via a confusion matrix and the Adjusted Rand Index, both indicating the true partition has been recovered.

```
cl <- clustering(sol_sbmsim)
table(truth=sbm$cl, est=cl) %>% kable(row.names = T) %>% kable_styling(full_width = F)
```

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 94 | 0 | 0 | 0 |

2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 111 | 0 |

3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 129 |

4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 87 | 0 | 0 | 0 | 0 |

5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 103 | 0 | 0 |

6 | 0 | 0 | 0 | 109 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

7 | 0 | 108 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

8 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

9 | 122 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

10 | 0 | 0 | 105 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

11 | 0 | 0 | 0 | 0 | 0 | 89 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 109 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 114 | 0 | 0 | 0 | 0 | 0 |

14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 117 | 0 | 0 | 0 | 0 | 0 | 0 |

15 | 0 | 0 | 0 | 0 | 0 | 0 | 102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

```
aricode::ARI(sbm$cl, cl)
#> [1] 0.9987085
```

Moreover, `greed`

also computes the hierarchical
clustering starting from \(Z^{(6)}\) to
\(Z^{(1)}\). The dendrogram can be
visualized with the plot function, specifying its `type`

argument to `"tree"`

.

```
plot(sol_sbmsim, type="tree")
plot(sol_sbmsim, type="path")
```

The fusion level corresponds to the amount of regularization \(\log(\alpha)\) needed to unlock the fusion. It is consistent with the hierarchical structure of the dataset, and the leaf ordering algorithm provides a “natural” ordering of the cluster that can be used in other plotting functions.

For example, we can visualize the block adjacency matrix corresponding to \(Z^{(6)}\) with clusters ordered via the dendrogram. The visualisation is greatly enhanced, highlighting the hierarchical structure of the data.

`plot(sol_sbmsim, type="blocks")`

An alternative visualization of the aggregated graph is the
`nodelink`

diagram

`plot(sol_sbmsim, type="nodelink")`