Load the necessary packages:
library(greed)
library(future) # allows parallel processing in greed()
library(aricode) # for ari computation
library(Matrix)
library(igraph)
library(mclust)
# data wrangling
library(dplyr)
library(tidyr)
library(purrr)
library(tidygraph)
# advanced plotting
library(ggpubr)
library(ggplot2)
library(ggraph)
library(ggwordcloud)
library(kableExtra)
set.seed(2134)
future::plan("multisession", workers=1) # increase workers for parallel processing
Now that we have introduced the methodology, we can proceed to a demonstration of the greed package on real datasets. As explained in the presentation, the methodology is quite versatile and applies to several data types, as long as we have a model for which an exact ICL may be derived \[ \text{ICL}_{ex} = \log p(X \mid Z, \beta) + \log p(Z \mid \alpha). \]
Currently, the package handles the following tasks:

- Continuous data clustering with Gaussian mixture models (Gmm and DiagGmm S4 classes).
- Graph data clustering with the stochastic block model or its degree-corrected variant (Sbm and dcSbm S4 classes).
- Categorical data clustering with latent class analysis (Lca S4 class).
- Count data clustering with the mixture of multinomials model (Mom S4 class).
- Mixed-type data clustering, combining several of the above models (CombinedModels S4 class).
- Mixture of regressions (MoR S4 class).
- Co-clustering of binary or count data via the (degree-corrected) latent block model (DcLbm S4 class).

For this practical session, we will focus on continuous, categorical and graph data clustering. Do not hesitate to try the package on your own real datasets!
### The greed function

The package’s main entry point is the greed function, with a typical call as follows:
sol = greed(
X , # dataset to cluster (mandatory)
K = 20 , # number of initial clusters (optional)
model = Gmm() , # model to use and its prior parameters (optional)
alg = Hybrid() , # algorithm to use and its parameters (optional)
verbose= FALSE ) # verbosity level (optional)
Let us decompose this call.
- X contains the dataset to cluster. Depending on the task, it can be a continuous or discrete \(n \times p\) matrix of multivariate observations, or an \(n \times n\) adjacency matrix.
- The model argument defines the observational model, i.e. the expression of \(\log p(X \mid Z, \beta)\) that shall be used. It is optional, as the package can infer the model from the data type of X (continuous, factor, graph). The hyper-parameters \(\beta\) may also be specified.
- The alg argument specifies which algorithm to use, among the hybrid GA, the standard GA, a greedy algorithm with clever initialization, or multiple random initializations.

The greed package uses S4 class abstraction to specify the model and alg arguments, as well as the results of the greed() function. This allows the use of generic methods to explore the results, such as:

- plot(sol, ...)
- clustering(sol)
- cut(sol, K)
- coef(sol)
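Before moving to the examples, here is a minimal sketch of this typical post-processing workflow; it assumes sol is an object returned by greed(), as in the examples below.

cl <- clustering(sol)       # hard partition at the selected number of clusters
K(sol)                      # number of clusters of the selected partition
sol_coarse <- cut(sol, 3)   # coarser partition extracted from the hierarchy
plot(sol, type = "tree")    # dendrogram of the cluster hierarchy
params <- coef(sol)         # conditional MAP estimates of the model parameters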
We will begin with a graph clustering example using a binary SBM model.
The stochastic block model (SBM) is a random graph model of the adjacency matrix \(\mathbf{X}\) widely used for graph clustering. In this model, the probability of an edge \((i,j)\) is driven by the cluster memberships of nodes \(i\) and \(j\), hence the block terminology. The Bayesian formulation of a binary SBM is as follows:
\[\begin{align*} \forall (k,l) \in [K]^2, \quad \theta_{kl} & \sim \textrm{Beta}(a_0, b_0) \\ \forall (i,j) \in [n]^2, \quad x_{ij} \mid \{ z_{ik}z_{jl}=1 \}, \theta & \sim \mathcal{B}(\theta_{kl}) \end{align*}\]
Beta-Bernoulli conjugacy allows deriving an exact expression of \(\log p(X\mid Z, \beta) = \log \int_\theta p(X \mid Z, \theta) p(\theta \mid \beta) \, d\theta\). Here, the hyper-parameters \(\beta = (a_0, b_0)\) may be set to the default “non-informative” values \(a_0=b_0=1\) (Beta(1,1) is uniform on \([0,1]\)).
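For reference, this integration has the following closed form (a standard Beta-Bernoulli computation, written here up to the handling of self-loops and edge direction), where \(x_{kl}\) counts the edges between clusters \(k\) and \(l\) and \(n_{kl}\) counts the possible dyads between them: \[ \log p(X \mid Z, \beta) = \sum_{k,l} \log \frac{B(a_0 + x_{kl},\, b_0 + n_{kl} - x_{kl})}{B(a_0, b_0)}, \quad x_{kl} = \sum_{i \neq j} z_{ik} z_{jl} x_{ij}, \quad n_{kl} = \sum_{i \neq j} z_{ik} z_{jl}. \]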
We construct a simulation scheme with \(K=15\) clusters with a nested structure: 3 big clusters, each composed of 5 smaller clusters. We choose a community structure where nodes inside the same block tend to connect more.
N=1600
K=15
prop=rep(1/K,K)
lambda = 0.1
lambda_o = 0.025
Ks=5
theta = bdiag(lapply(1:(K/Ks), function(k){matrix(lambda_o,Ks,Ks)+diag(rep(lambda,Ks))}))+0.001
sbm = rsbm(N,prop,theta)
dim(sbm$x)
#> [1] 1600 1600
We have a graph with 1600 nodes. The true connectivity matrix \(\theta\) can be visualized and any (reordered) adjacency matrices from this SBM model would have the same profile.
theta.long=tibble(p=as.numeric(theta),k=rep(1:K,K),l=rep(1:K,each=K))
ggplot(theta.long)+
geom_tile(aes(x=l,y=k,fill=p))+
scale_x_discrete()+
scale_y_discrete()+
ggplot2::scale_fill_distiller("Link density", palette = "Greys", direction = 1, limits = c(0, max(theta)))+theme_minimal()+coord_equal()
### Clustering with greed
sol_sbmsim <- greed(X = sbm$x, model = Sbm(), alg = Hybrid(pop_size = 40))
#>
#> ── Fitting a guess SBM model ──
#>
#> ℹ Initializing a population of 40 solutions.
#> ℹ Generation 1 : best solution with an ICL of -171800 and 15 clusters.
#> ℹ Generation 2 : best solution with an ICL of -168669 and 12 clusters.
#> ℹ Generation 3 : best solution with an ICL of -167104 and 15 clusters.
#> ℹ Generation 4 : best solution with an ICL of -167104 and 15 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a SBM model 15 clusters and an ICL of -167104
K(sol_sbmsim)
#> [1] 15
We see that greed correctly recovers the \(K=15\) clusters in a reasonable amount of time. What about the recovered partition? We can explore the results thanks to dedicated functions in the package.
First, we investigate the recovered partition \(Z^{(15)}\) with \(K=15\) groups via a confusion matrix and the Adjusted Rand Index, both indicating that the true partition has been recovered.
cl <- clustering(sol_sbmsim)
table(truth=sbm$cl, est=cl) %>% kable(row.names = T) %>% kable_styling(full_width = F)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 94 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 111 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 129 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 87 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 103 | 0 | 0 |
6 | 0 | 0 | 0 | 109 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 108 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 122 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 105 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 89 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 109 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 114 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 117 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
aricode::ARI(sbm$cl, cl)
#> [1] 0.9987085
Moreover, greed also computes the hierarchical clustering starting from \(Z^{(15)}\) down to \(Z^{(1)}\). The dendrogram can be visualized with the plot function by setting its type argument to "tree".
plot(sol_sbmsim, type="tree")
plot(sol_sbmsim, type="path")
The fusion level corresponds to the amount of regularization \(\log(\alpha)\) needed to unlock the fusion. It is consistent with the hierarchical structure of the dataset, and the leaf-ordering algorithm provides a “natural” ordering of the clusters that can be used in other plotting functions.
For example, we can visualize the block adjacency matrix corresponding to \(Z^{(15)}\) with clusters ordered via the dendrogram. The visualisation is greatly enhanced, highlighting the hierarchical structure of the data.
plot(sol_sbmsim, type="blocks")
An alternative visualization of the aggregated graph is the node-link diagram:
plot(sol_sbmsim, type="nodelink")
### cut()
Finally, one can explore coarser scales of the hierarchy and still use the same type of functions.
sol_sbm_K3 = cut(sol_sbmsim, 3)
table(clustering(sol_sbm_K3), sbm$cl) %>% kable(row.names=T)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 109 | 108 | 100 | 122 | 105 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90 | 109 | 114 | 117 | 102 |
3 | 94 | 111 | 129 | 87 | 103 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
plot(sol_sbm_K3, type="blocks")
### coef()
At any stage \(k \leq \hat{K}\) of the hierarchy, one can get a conditional MAP estimate of the observational model parameters \(\hat{\theta} \mid \hat{Z}^{(k)}\) which maximizes \(\log p(X \mid \hat{Z}^{(k)}, \theta)\).
hat_theta= coef(sol_sbmsim)$theta
mean(abs(hat_theta-theta)) # mean absolute error, no permutation needed here.
#> [1] 0.0006745534
The hierarchical ordering of the clusters (and the specificity of this simulation) allows us to avoid permuting \(\hat{\theta}\) before computing the \(\ell_1\) error between \(\hat{\theta}\) and \(\theta\).
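In general, when the estimated clusters are arbitrarily permuted with respect to the true ones, an alignment step is required first. A minimal sketch, assuming the clue package (not loaded above) is available for optimal matching:

library(clue)                                         # assumption: provides solve_LSAP()
conf <- table(sbm$cl, cl)                             # truth x estimate confusion matrix
perm <- as.integer(solve_LSAP(conf, maximum = TRUE))  # estimated cluster matched to each true one
mean(abs(hat_theta[perm, perm] - theta))              # mean absolute error after alignment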
The Books dataset was gathered by Valdis Krebs and is attached to the greed package. It consists of a co-purchasing network of \(N=105\) books on US politics. Two books have an edge between them if they have been frequently co-purchased together. We have access to the label of each book according to its political inclination: conservative (“c”), liberal (“l”) or neutral (“n”).
data(Books)
sol_dcsbm <- greed(Books$X,model = DcSbm())
#>
#> ── Fitting a guess DCSBM model ──
#>
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -1349 and 4 clusters.
#> ℹ Generation 2 : best solution with an ICL of -1346 and 4 clusters.
#> ℹ Generation 3 : best solution with an ICL of -1346 and 4 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a DCSBM model 3 clusters and an ICL of -1345
sol_sbm <- greed(Books$X,model = Sbm())
#>
#> ── Fitting a guess SBM model ──
#>
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -1279 and 8 clusters.
#> ℹ Generation 2 : best solution with an ICL of -1259 and 6 clusters.
#> ℹ Generation 3 : best solution with an ICL of -1255 and 6 clusters.
#> ℹ Generation 4 : best solution with an ICL of -1255 and 6 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a SBM model 5 clusters and an ICL of -1254
The network has been automatically recognized as an undirected graph, as we can see in the fitted models’ priors:
prior(sol_dcsbm)@type
#> [1] "undirected"
prior(sol_sbm)@type
#> [1] "undirected"
For this dataset, the regular SBM model seems to reach a better ICL solution than its degree-corrected variant. Still, we can visualize both aggregated adjacency matrices and the dendrogram.
bl_sbm = plot(sol_sbm,type='blocks')
bl_dcsbm = plot(sol_dcsbm,type='blocks')
ggarrange(bl_sbm,bl_dcsbm)
It is also possible to use external R packages to plot the graph layout, with node color indicating the cluster and node size indicating book popularity (computed using degree centrality). Here, we represent the result for the SBM solution with 5 clusters. One can see a hierarchical clustering structure appearing, with a central cluster of neutral books in-between two densely connected sets. In each of these two dense sets, there is a clear distinction between popular books (heavily purchased) and more peripheral ones, indicated by node size.
graph <- igraph::graph_from_adjacency_matrix(Books$X) %>%
as_tbl_graph() %>%
mutate(Popularity = centrality_degree()) %>%
activate(nodes) %>%
mutate(cluster=factor(clustering(sol_sbm),1:K(sol_sbm)))
# plot using ggraph
ggraph(graph, layout = 'kk') +
geom_edge_link() +
geom_node_point(aes(size = Popularity,color=cluster))
Finally, we can look at both models’ solutions for \(K=3\) and their confusion matrices. We see that both partitions make sense according to the available political labels.
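The two tables below can be obtained along the following lines (a sketch; the name of the label vector, Books$label, is an assumption about the dataset structure):

table(clustering(cut(sol_dcsbm, 3)), Books$label) %>% kable(row.names = TRUE) %>% kable_styling(full_width = FALSE)
table(clustering(cut(sol_sbm, 3)), Books$label) %>% kable(row.names = TRUE) %>% kable_styling(full_width = FALSE)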
c | l | n | |
---|---|---|---|
1 | 1 | 38 | 2 |
2 | 8 | 5 | 9 |
3 | 40 | 0 | 2 |
c | l | n | |
---|---|---|---|
1 | 41 | 0 | 3 |
2 | 7 | 5 | 8 |
3 | 1 | 38 | 2 |
table(clustering(cut(sol_sbm, 3)), clustering(sol_dcsbm))
#>
#> 1 2 3
#> 1 0 0 41
#> 2 2 20 0
#> 3 42 0 0
aricode::ARI(clustering(cut(sol_sbm, 3)), clustering(sol_dcsbm))
#> [1] 0.9503619
Without any constraints, the Bayesian formulation of GMMs leading to a tractable exact ICL expression uses a Normal and inverse-Wishart conjugate prior on the mean and covariances \(\mathbf{\theta} = (\mathbf{\mu}_k, \mathbf{\Sigma}_k)_k\). This prior is defined with hyperparameters \(\mathbf{\beta} = (\mathbf{\mu}, \tau, n_0, \mathbf{\varepsilon})\) and the hierarchical model is written as follows:
\[\begin{equation} \label{eq:gmm} \begin{aligned} \pi&\sim \textrm{Dirichlet}_K(\alpha)\\ Z_i&\sim \mathcal{M}(1,\pi)\\ \mathbf{\Sigma}_k^{-1} & \sim \textrm{Wishart}(\mathbf{\varepsilon}^{-1},n_0)\\ \mathbf{\mu}_k&\sim \mathcal{N}(\mathbf{\mu},\frac{1}{\tau} \mathbf{\Sigma}_k)\\ X_{i}|Z_{ik}=1 &\sim \mathcal{N}(\mathbf{\mu}_k, \mathbf{\Sigma}_{k})\\ \end{aligned} \end{equation}\]
Contrary to other models, these priors are informative and may therefore have a noticeable impact on the obtained results.
The default values of the prior parameters were chosen to accommodate a variety of situations but may be modified easily; for more details, we refer to the class documentation ?Gmm. Finally, it is possible to specify constraints on the covariance matrices, as illustrated below.
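For instance, the diagonal-covariance variant is exposed through the DiagGmm class and is requested in the same way (a sketch, not run here; X denotes the matrix or data.frame to cluster and the hyper-parameters keep their default values):

sol_diag <- greed(X, model = DiagGmm())   # same call, diagonal covariance model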
For illustration, we will use the diabetes dataset from the mclust package. The data describes \(p=3\) biological variables for \(n=145\) patients with \(K=3\) different types of diabetes: normal, overt and chemical. We are interested in clustering this dataset to see if the three biological variables can discriminate between the three diabetes types. We first need to define the data to cluster:
data(diabetes,package="mclust")
head(diabetes %>% sample_n(100))
#> class glucose insulin sspg
#> 82 Normal 93 393 490
#> 97 Chemical 102 472 297
#> 14 Normal 90 371 200
#> 81 Normal 99 398 76
#> 31 Normal 70 360 134
#> 123 Overt 195 920 160
X = diabetes[,-1]
X = data.frame(scale(X)) # centering and scaling
Again, we use the greed() function, this time with a model=Gmm() argument:
sol = greed(X,model=Gmm())
#>
#> ── Fitting a GMM model ──
#>
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -278 and 6 clusters.
#> ℹ Generation 2 : best solution with an ICL of -271 and 4 clusters.
#> ℹ Generation 3 : best solution with an ICL of -262 and 3 clusters.
#> ℹ Generation 4 : best solution with an ICL of -262 and 3 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a GMM model 3 clusters and an ICL of -262
It found a partition with \(\hat{K} = 3\) clusters, consistent with the diabetes types, with a few misclassifications that are expected when fitting a GMM on this dataset.
table(diabetes$class, clustering(sol)) %>% kable(row.names = T) %>% kable_styling(full_width = F)
1 | 2 | 3 | |
---|---|---|---|
Chemical | 1 | 24 | 11 |
Normal | 0 | 3 | 73 |
Overt | 27 | 6 | 0 |
As a comparison, one can use a standard EM procedure as implemented in mclust:
sol_mclust = mclust::Mclust(data = X, G = 2:5)
aricode::ARI(sol_mclust$classification, diabetes$class)
#> [1] 0.6640181
aricode::ARI(clustering(sol), diabetes$class)
#> [1] 0.6531253
aricode::ARI(clustering(sol), sol_mclust$classification)
#> [1] 0.9324887
As in the SBM case, we can use the tree visualization of the hierarchy, which is not of great interest here since \(\hat{K}=3\).
plot(sol, type="tree")
Moreover, specific visualization functions are implemented for this model. For example, we may look at the pairs plot with associated cluster colors via the gmmpairs() function:
gmmpairs(sol, X)
One can also investigate per-cluster marginal distributions with a violin plot via plot() with its type argument set to "violins":
plot(sol, type="violins")
Furthermore, users may use the generic coef() function to get the mixture component parameter estimates. The result is a simple list with one field per parameter (means: muk, covariance matrices: Sigmak, proportions: pi).
params = coef(sol)
names(params)
#> [1] "pi" "muk" "Sigmak"
params$Sigmak[[2]]
#> glucose insulin sspg
#> glucose 0.04732456 0.06098111 -0.1020909
#> insulin 0.06098111 0.13714310 -0.1359453
#> sspg -0.10209092 -0.13594534 1.7013069
As explained above, the hyper-parameters \(\beta\) have no “non-informative” default values, and thus impact the clustering results. One may want to express an a priori belief that the Gaussian ellipse volume is small inside clusters, which amounts to diminishing the \(0.1\) coefficient in front of \(\hat{\mathbf{\Sigma}}_{\mathbf{X}}\). In this case, it makes sense to decrease \(\tau\) in the same proportions to keep a flat prior on the cluster means. For the diabetes data, such a choice leads to an interesting solution, where the strong prior constraint leads to one cluster being created to fit one outlier in the “chemical” diabetes group.
sol_dense = greed(X,model=Gmm(epsilon=0.01*diag(diag(cov(X))), tau =0.001), alg=Hybrid(pop_size = 100))
#>
#> ── Fitting a GMM model ──
#>
#> ℹ Initializing a population of 100 solutions.
#> ℹ Generation 1 : best solution with an ICL of -361 and 10 clusters.
#> ℹ Generation 2 : best solution with an ICL of -316 and 6 clusters.
#> ℹ Generation 3 : best solution with an ICL of -293 and 4 clusters.
#> ℹ Generation 4 : best solution with an ICL of -290 and 4 clusters.
#> ℹ Generation 5 : best solution with an ICL of -290 and 4 clusters.
#> ℹ Generation 6 : best solution with an ICL of -290 and 4 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a GMM model 4 clusters and an ICL of -290
gmmpairs(sol_dense,X)
Cutting the hierarchy after the first merge recovers the partition previously obtained with the default prior parameters:
aricode::ARI(clustering(cut(sol_dense, 3)), clustering(sol))
#> [1] 0.9772411
We are interested in the clustering of categorical datasets, which are typically found in survey data or item response theory (IRT). In this context, we observe \(n\) individuals described by \(p\) variables, each taking one among \(d_j\) modalities for each variable \(j\).

### The model
Such datasets are typically represented using a one-hot-encoding of each factor in a design matrix \(\mathbf{X} \in \{0,1\}^{n \times d}\) where \(d = \sum_{j=1}^p d_j\). Latent class analysis (LCA) is a generative model for categorical data clustering which posits conditional independence of the factor variables conditionally on the (unknown) partition. Below is a description of its Bayesian formulation with the use of proper conjugate priors: \[ \begin{equation} \begin{aligned} \pi&\sim \textrm{Dirichlet}(\alpha),\\ \forall k, \forall j, \quad \theta_{kj} &\sim \textrm{Dirichlet}_{d_j}(\beta), \\ Z_i&\sim \mathcal{M}_K(1,\pi),\\ \forall j=1, \ldots, p, \quad X_{ij}|Z_{ik}=1 &\sim \mathcal{M}_{d_j}(1, \theta_{kj}),\\ \end{aligned} \end{equation} \] For each cluster \(k\) and variable \(j\), the vector \(\theta_{kj}\) represents the probability of each of the \(d_j\) modalities. With the above choice of priors, the LCA model admits an exact ICL expression similar to the mixture of multinomials derived in the supplementary materials of Côme et al. (2021, Section 3).
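For reference, Dirichlet-multinomial conjugacy yields the following closed form (assuming a symmetric Dirichlet prior with scalar \(\beta\) as above), where \(n_{kjh}\) is the number of individuals in cluster \(k\) taking modality \(h\) for variable \(j\), and \(n_k\) is the size of cluster \(k\): \[ \log p(X \mid Z, \beta) = \sum_{k=1}^{K} \sum_{j=1}^{p} \left[ \log \frac{\Gamma(d_j \beta)}{\Gamma(n_k + d_j \beta)} + \sum_{h=1}^{d_j} \log \frac{\Gamma(n_{kjh} + \beta)}{\Gamma(\beta)} \right]. \]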
The greed package implements this model via the Lca model class. The default values of the hyper-parameters may be changed, and we refer to the ?Lca documentation for further details.
The mushroom data from the UCI Machine Learning Repository describes 8124 mushrooms with 22 phenotype variables. Each mushroom is classified as “edible” or “poisonous”, and the goal is to recover the mushroom class from its phenotype. The dataset comes attached with the greed package.
data("mushroom")
We begin by forming the necessary vectors for analysis. The data has \(n=8124\) rows and \(p=23\) columns. The first column contains the poisonous status of each mushroom, with two possible values, “p” for “poisonous” and “e” for “edible”; it will serve as the partition we seek to recover.
mushroom[,1:10] %>% head() %>% knitr::kable()
edibility | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 |
---|---|---|---|---|---|---|---|---|---|
p | x | s | n | t | p | f | c | n | k |
e | x | s | y | t | a | f | c | b | k |
e | b | s | w | t | l | f | c | b | n |
p | x | y | w | t | p | f | c | n | n |
e | x | s | g | f | n | f | w | b | k |
e | x | y | y | t | a | f | c | b | n |
The remaining variables are used for clustering. Note that we only use a subset of the data here, for illustration purposes.
X = mushroom[,-1]
subset = sample(1:nrow(X), size = 1000)
label = mushroom$edibility[subset]
The clustering is done via the main function greed(), with the model argument set to Lca() and the hybrid genetic algorithm for ICL maximization.
Note: the Lca model may only be used with datasets stored in a data.frame containing only factors. When such data are provided to the greed() function, an Lca model is picked by default. To perform the clustering, it is therefore sufficient to call greed() with the prepared data.frame.
sol_mush<-greed(X[subset,], model=Lca())
#>
#> ── Fitting a LCA model ──
#>
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -13018 and 11 clusters.
#> ℹ Generation 2 : best solution with an ICL of -12817 and 16 clusters.
#> ℹ Generation 3 : best solution with an ICL of -12817 and 16 clusters.
#> ── Final clustering ──
#>
#> ── Clustering with a LCA model 15 clusters and an ICL of -12785
The hybrid genetic algorithm found a solution with \(K=15\) clusters, which is quite over-segmented. Still, it displays a good separation between edible and poisonous mushrooms:
table(label, clustering(sol_mush)) %>% knitr::kable()
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
e | 0 | 0 | 0 | 17 | 19 | 1 | 41 | 22 | 92 | 0 | 15 | 9 | 59 | 26 | 213 |
p | 213 | 163 | 4 | 4 | 0 | 13 | 0 | 0 | 0 | 28 | 27 | 34 | 0 | 0 | 0 |
The partition’s NMI is 0.27, which is explained by the over-segmentation of the solution compared to the \(2\)-class problem.
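For reference, this value can be recomputed with aricode (a sketch using the label vector defined above):

aricode::NMI(label, clustering(sol_mush))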
Exploring the dendrogram provided by the hierarchical algorithm is quite useful in this case.
plot(sol_mush, type='tree')
We clearly see a hierarchical structure appearing with \(K=2\) or \(K=4\) main clusters. Thus, we can cut the tree at this height and look at the solution.
sol2 = cut(sol_mush, 2)
table(label, clustering(sol2)) %>% knitr::kable(row.names = T) %>% kable_styling(full_width = F)
1 | 2 | |
---|---|---|
e | 0 | 514 |
p | 376 | 110 |
Here, we clearly see that the order of the merges is consistent with the labels, and the final ARI is 0.58. Some poisonous mushrooms have been categorized as edible, but this is a consequence of the way the labels were set: mushrooms whose edibility status was unknown were classified as poisonous by default, a reasonable choice from a strict health perspective. The unsupervised problem is therefore hard, and the obtained clustering is already satisfactory. Moreover, this illustrates the interest of the hierarchical algorithm for accessing coarser partitions.
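Again, the quoted ARI can be recomputed directly (a sketch):

aricode::ARI(label, clustering(sol2))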
plot(cut(sol_mush,2), type="marginals")
You can look at the vignettes for all the available models on the package’s website!