Loads the necessary packages.
library(greed)
library(future) # allows parralel processing in greed()
library(aricode) # for ari computation
library(Matrix)
library(igraph)
library(mclust)
# data wrangling
library(dplyr)
library(tidyr)
library(purrr)
library(tidygraph)
# advanced ploting
library(ggpubr)
library(ggplot2)
library(ggraph)
library(ggwordcloud)
library(kableExtra)
set.seed(2134)
future::plan("multisession", workers=1) # increase workers for parallel processing
Now that we introduced the methodology we can proceed to a demonstration of the greed package on real datasets. As explained in the presentation, the methodology in quite versatile and applies to several data-types as long as we have a model for which an exact ICL may be derived \[ \text{ICL}_{ex} = \log p(X \mid Z, \beta) + \log p(Z \mid \alpha). \]
Currently, the package handles the following tasks:
Gmm
and DiagGmm
S4 classes.Sbm
and dcSbm
S4 classes.Lca
S4 class.Mom
S4 class.CombinedModels
S4 class.MoR
S4 class.DcLbm
S4 class.For this practical session, we will focus on continuous, categorical and graph data clustering. Do not hesitate to try on your own, real dataset !
greed
functionThe package’s main entry point is the greed
function
with a typical call as follow
sol = greed(
X , # dataset to cluster (mandatory)
K = 20 , # number of initial clusters (optional)
model = Gmm() , # model to use and its prior parameters (optional)
alg = Hybrid() , # algorithm to use and its parameters (optional)
verbose= FALSE ) # verbosity level (optional)
Let us decompose this call.
X
contains the dataset do cluster. Depending on
the task it can be an continuous or discrete \(n \times p\) matrix of multivariate
observations, or an \(n \times n\)
adjacency matrix.model
argument defines the observational
model, i.e the expression of \(\log p
(X \mid Z, \beta)\) that shall be used. It is optional as the
package can infer the model from X
data-type (continous,
factor, graph). The hyper-parameters \(\beta\) can be specified.alg
specifies which algorithm to use among hybrid GA,
standard GA, greedy with clever initialization or multiple random
initializations.The greed package uses S4 class abstraction to
specify the model
, alg
arguments as well as
the results of the greed()
function. This allows to use
generic method to explore the results such as
plot(sol, ...)
clustering(sol)
cut(sol, )
coef(sol)
We will begin with a graph clustering example with a binary SBM model.
The stochastic block model (SBM) is a random graph model of the adjacency matrix \(\mathbf{X}\) widely used for graph clustering. In this model, the probability of an edge \((i,j)\) is driven by the cluster membership of node \(i\) and \(j\), hence the block terminology. The Bayesian formulation of a binary SBM is as follow
\[\begin{align*} \forall (k,l) \in [K]^2, \quad \theta_{kl} & \sim \textrm{Beta}(a_0, b_0) \\ \forall (i,j) \in [n]^2, \quad x_{ij} \mid \{ z_{ik}z_{jl}=1 \}, \theta & \sim \mathcal{B}(\theta_{kl}) \end{align*}\]Beta-Bernoulli conjugacy allows to derive an exact expression of \(\log p(X\mid Z, \beta) = \log \int_\theta p(X \mid Z, \theta) p(\theta \mid \beta) \, d\theta\). Here, the hyper-parameters \(\beta = (a_0, b_0)\) may be set to default “non-informative” values \(a_0=b_0=1\) (Beta(1,1) is uniform on \([0,1]\)).
We construct a simulation scheme with \(K=6\) clusters with a nested structure: 3 big clusters are each composed of 2 smaller clusters. We choose a community structure where nodes inside the same blocks tend to connect more.
N=1600
K=15
prop=rep(1/K,K)
lambda = 0.1
lambda_o = 0.025
Ks=5
theta = bdiag(lapply(1:(K/Ks), function(k){matrix(lambda_o,Ks,Ks)+diag(rep(lambda,Ks))}))+0.001
sbm = rsbm(N,prop,theta)
dim(sbm$x)
#> [1] 1600 1600
We have a graph with 1600 nodes. The true connectivity matrix \(\theta\) can be visualized and any (reordered) adjacency matrices from this SBM model would have the same profile.
theta.long=tibble(p=as.numeric(theta),k=rep(1:K,K),l=rep(1:K,each=K))
ggplot(theta.long)+
geom_tile(aes(x=l,y=k,fill=p))+
scale_x_discrete()+
scale_y_discrete()+
ggplot2::scale_fill_distiller("Link density", palette = "Greys", direction = 1, limits = c(0, max(theta)))+theme_minimal()+coord_equal()
greed
sol_sbmsim <- greed(X = sbm$x, model = Sbm(), alg = Hybrid(pop_size = 40))
#>
#> ── Fitting a guess SBM model ──
#>
#> ℹ Initializing a population of 40 solutions.
#> ℹ Generation 1 : best solution with an ICL of -171800 and 15 clusters.
#> ℹ Generation 2 : best solution with an ICL of -168669 and 12 clusters.
#> ℹ Generation 3 : best solution with an ICL of -167104 and 15 clusters.
#> ℹ Generation 4 : best solution with an ICL of -167104 and 15 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a SBM model 15 clusters and an ICL of -167104
K(sol_sbmsim)
#> [1] 15
We see that greed
correctly recovers the \(K=15\) clusters, in a reasonable amount of
time. What about the recovered partition ? We can explore the results
thanks to dedicated function in the package.
First, we investigate the recovered partition \(Z^{(6)}\) with \(K=6\) groups via a confusion matrix and the Adjusted Rand Index, both indicating the true partition has been recovered.
cl <- clustering(sol_sbmsim)
table(truth=sbm$cl, est=cl) %>% kable(row.names = T) %>% kable_styling(full_width = F)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 94 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 111 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 129 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 87 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 103 | 0 | 0 |
6 | 0 | 0 | 0 | 109 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 108 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 122 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 105 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 89 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 109 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 114 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 117 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
aricode::ARI(sbm$cl, cl)
#> [1] 0.9987085
Moreover, greed
also computes the hierarchical
clustering starting from \(Z^{(6)}\) to
\(Z^{(1)}\). The dendrogram can be
visualized with the plot function, specifying its type
argument to "tree"
.
plot(sol_sbmsim, type="tree")
plot(sol_sbmsim, type="path")
The fusion level corresponds to the amount of regularization \(\log(\alpha)\) needed to unlock the fusion. It is consistent with the hierarchical structure of the dataset, and the leaf ordering algorithm provides a “natural” ordering of the cluster that can be used in other plotting functions.
For example, we can visualize the block adjacency matrix corresponding to \(Z^{(6)}\) with clusters ordered via the dendrogram. The visualisation is greatly enhanced, highlighting the hierarchical structure of the data.
plot(sol_sbmsim, type="blocks")
An alternative visualization of the aggregated graph is the
nodelink
diagram
plot(sol_sbmsim, type="nodelink")