First, we load the necessary packages.

library(greed)
library(future) # allows parallel processing in greed()
library(aricode) # for ari computation
library(Matrix)
library(igraph)
library(mclust)

# data wrangling
library(dplyr) 
library(tidyr)
library(purrr)
library(tidygraph)

# advanced plotting
library(ggpubr) 
library(ggplot2)
library(ggraph)
library(ggwordcloud)
library(kableExtra)

set.seed(2134)

future::plan("multisession", workers=1) # increase workers for parallel processing

Package overview

Mathematical context

Now that we have introduced the methodology, we can proceed to a demonstration of the greed package on real datasets. As explained in the presentation, the methodology is quite versatile and applies to several data types, as long as we have a model for which an exact ICL may be derived \[ \text{ICL}_{ex} = \log p(X \mid Z, \beta) + \log p(Z \mid \alpha). \]

Data-types currently handled by the greed package

Currently, the package handles the following tasks:

  • Continuous data clustering with Gaussian Mixture Models. A GMM tutorial is available. See also the documentation for the Gmm and DiagGmm S4 classes.
  • Graph data clustering with the Stochastic Block Model (SBM) or its degree-corrected variant. An SBM tutorial is available. See also the documentation for the Sbm and DcSbm S4 classes.
  • Categorical data clustering with the Latent Class Analysis. An LCA tutorial is available. See also the documentation for the Lca S4 class.
  • Count data clustering with the Mixture of Multinomials model. A tutorial will soon be available. For now, we refer to the documentation for the Mom S4 class.
  • Mixed-type data clustering, e.g. categorical together with numerical variables; the package handles virtually any combination of data types by stacking models on top of each data type. For example, graph data with continuous or categorical attributes attached to the nodes are handled. A CombinedModels tutorial is available. See also the documentation for the CombinedModels S4 class.
  • Mixture of regression for simultaneous clustering and fitting a regression model in each cluster. A MoR tutorial is available. See also the documentation for the MoR S4 class.
  • Co-clustering of binary and count-data via the Latent Block Model and its degree-corrected variant. A tutorial will soon be available. For now, we refer to the documentation for the DcLbm S4 class.

For this practical session, we will focus on continuous, categorical and graph data clustering. Do not hesitate to try it on your own real datasets!

Basic usage: the greed function

The package’s main entry point is the greed function; a typical call looks as follows:

sol = greed(
   X                  , # dataset to cluster (mandatory)
   K = 20             , # number of initial clusters (optional)
   model = Gmm()      , # model to use and its prior parameters (optional)
   alg = Hybrid()     , # algorithm to use and its parameters (optional)
   verbose= FALSE    )  # verbosity level (optional)

Let us decompose this call.

  • First, X contains the dataset to cluster. Depending on the task, it can be a continuous or discrete \(n \times p\) matrix of multivariate observations, or an \(n \times n\) adjacency matrix.
  • Then, the model argument defines the observational model, i.e. the expression of \(\log p (X \mid Z, \beta)\) that shall be used. It is optional, as the package can infer the model from the data type of X (continuous, factor, graph). The hyper-parameters \(\beta\) can also be specified here.
  • Finally, alg specifies which algorithm to use, among the hybrid GA, standard GA, greedy with clever initialization, or multiple random initializations.
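For instance, here is a minimal sketch contrasting a fully default call with an explicit one (X stands for any dataset to cluster; Gmm() and Hybrid() are the same constructors used later in this tutorial):

sol_default = greed(X) # model inferred from the data type of X, default hybrid algorithm

sol_custom  = greed(X, K = 30, model = Gmm(), alg = Hybrid(pop_size = 50), verbose = TRUE)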

S4 class abstraction

The greed package uses S4 class abstraction to specify the model and alg arguments, as well as the results of the greed() function. This allows the use of generic methods to explore the results, such as:

  • plot(sol, ...)
  • clustering(sol)
  • cut(sol, K)
  • coef(sol)
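All of these are demonstrated on real examples below; as a quick sketch of how they chain together (sol denotes any object returned by greed()):

plot(sol, type = "tree")  # dendrogram of the cluster hierarchy
cl     = clustering(sol)  # hard partition at the selected number of clusters
sol_K3 = cut(sol, 3)      # coarser partition with 3 clusters
params = coef(sol)        # conditional MAP estimates of the model parameters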

Introductory example on simulated data: graph clustering with binary SBM

We will begin with a graph clustering example with a binary SBM model.

The model

The stochastic block model (SBM) is a random graph model of the adjacency matrix \(\mathbf{X}\) widely used for graph clustering. In this model, the probability of an edge \((i,j)\) is driven by the cluster memberships of nodes \(i\) and \(j\), hence the block terminology. The Bayesian formulation of a binary SBM is as follows:

\[\begin{align*} \forall (k,l) \in [K]^2, \quad \theta_{kl} & \sim \textrm{Beta}(a_0, b_0) \\ \forall (i,j) \in [n]^2, \quad x_{ij} \mid \{ z_{ik}z_{jl}=1 \}, \theta & \sim \mathcal{B}(\theta_{kl}) \end{align*}\]

Beta-Bernoulli conjugacy allows us to derive an exact expression of \(\log p(X\mid Z, \beta) = \log \int_\theta p(X \mid Z, \theta) p(\theta \mid \beta) \, d\theta\). Here, the hyper-parameters \(\beta = (a_0, b_0)\) may be set to the default “non-informative” values \(a_0=b_0=1\) (the Beta(1,1) distribution is uniform on \([0,1]\)).
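For reference, here is a sketch of the resulting closed form, in our notation (not the package’s): writing \(x_{kl}\) for the number of edges between clusters \(k\) and \(l\) and \(n_{kl}\) for the number of possible dyads between them, and with the sum running over \(k \leq l\) for an undirected graph, \[ \log p(X \mid Z, \beta) = \sum_{k,l} \log \frac{B(a_0 + x_{kl},\, b_0 + n_{kl} - x_{kl})}{B(a_0, b_0)}, \] where \(B\) denotes the Beta function.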

Reproducing the slides example: nested SBM simulation

We construct a simulation scheme with \(K=15\) clusters and a nested structure: 3 big clusters, each composed of 5 smaller clusters. We choose a community structure where nodes inside the same block tend to connect more.

Simulate the data

N = 1600                 # number of nodes
K = 15                   # total number of clusters
prop = rep(1/K, K)       # balanced cluster proportions
lambda   = 0.1           # extra connection probability inside each small cluster
lambda_o = 0.025         # connection probability inside each big cluster
Ks = 5                   # number of small clusters per big cluster
# Block-diagonal connectivity matrix: 3 big blocks of 5 small clusters each,
# plus a 0.001 background connection probability between big blocks.
theta = bdiag(lapply(1:(K/Ks), function(k){matrix(lambda_o,Ks,Ks)+diag(rep(lambda,Ks))}))+0.001
sbm = rsbm(N, prop, theta) # simulate an SBM graph with greed's rsbm()
dim(sbm$x)
#> [1] 1600 1600

We have a graph with 1600 nodes. The true connectivity matrix \(\theta\) can be visualized, and any (reordered) adjacency matrix drawn from this SBM model would display the same profile.


theta.long = tibble(p = as.numeric(theta), k = rep(1:K, K), l = rep(1:K, each = K))

ggplot(theta.long) +
  geom_tile(aes(x = l, y = k, fill = p)) +
  scale_x_discrete() +
  scale_y_discrete() +
  ggplot2::scale_fill_distiller("Link density", palette = "Greys", direction = 1, limits = c(0, max(theta))) +
  theme_minimal() +
  coord_equal()

Prepare the data and call greed

sol_sbmsim <- greed(X = sbm$x, model = Sbm(), alg = Hybrid(pop_size = 40))
#> 
#> ── Fitting a guess SBM model ──
#> 
#> ℹ Initializing a population of 40 solutions.
#> ℹ Generation 1 : best solution with an ICL of -171800 and 15 clusters.
#> ℹ Generation 2 : best solution with an ICL of -168669 and 12 clusters.
#> ℹ Generation 3 : best solution with an ICL of -167104 and 15 clusters.
#> ℹ Generation 4 : best solution with an ICL of -167104 and 15 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a SBM model 15 clusters and an ICL of -167104
K(sol_sbmsim)
#> [1] 15

We see that greed correctly recovers \(K=15\) clusters in a reasonable amount of time. What about the recovered partition? We can explore the results thanks to dedicated functions in the package.

Exploring the results

First, we investigate the recovered partition \(Z^{(15)}\) with \(K=15\) groups via a confusion matrix and the Adjusted Rand Index, both indicating that the true partition has been recovered.

cl <- clustering(sol_sbmsim)
table(truth=sbm$cl, est=cl) %>% kable(row.names = T) %>% kable_styling(full_width = F)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0 0 0 0 0 0 0 0 0 0 0 94 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 111 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 129
4 0 0 0 0 0 0 0 0 0 0 87 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 103 0 0
6 0 0 0 109 0 0 0 0 0 0 0 0 0 0 0
7 0 108 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0
9 122 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 105 0 0 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 89 0 0 1 0 0 0 0 0 0
12 0 0 0 0 0 0 0 109 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 114 0 0 0 0 0
14 0 0 0 0 0 0 0 0 117 0 0 0 0 0 0
15 0 0 0 0 0 0 102 0 0 0 0 0 0 0 0
aricode::ARI(sbm$cl, cl)
#> [1] 0.9987085

Moreover, greed also computes the hierarchical clustering from \(Z^{(15)}\) down to \(Z^{(1)}\). The dendrogram can be visualized with the plot function, setting its type argument to "tree".

plot(sol_sbmsim, type="tree")
plot(sol_sbmsim, type="path")

The fusion level corresponds to the amount of regularization \(\log(\alpha)\) needed to unlock the fusion. It is consistent with the hierarchical structure of the dataset, and the leaf ordering algorithm provides a “natural” ordering of the clusters that can be used in other plotting functions.

For example, we can visualize the block adjacency matrix corresponding to \(Z^{(15)}\) with clusters ordered via the dendrogram. The visualization is greatly enhanced, highlighting the hierarchical structure of the data.

plot(sol_sbmsim, type="blocks")

An alternative visualization of the aggregated graph is the nodelink diagram:

plot(sol_sbmsim, type="nodelink")

Exploring coarser partitions with cut()

One can also explore coarser scales of the hierarchy and still use the same type of functions.

sol_sbm_K3 = cut(sol_sbmsim, 3)
table(clustering(sol_sbm_K3), sbm$cl) %>% kable(row.names=T)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0 0 0 0 0 109 108 100 122 105 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 90 109 114 117 102
3 94 111 129 87 103 0 0 0 0 0 0 0 0 0 0
plot(sol_sbm_K3, type="blocks")

Getting a conditional parameter estimate with coef()

At any stage \(k \leq \hat{K}\) of the hierarchy, one can get a conditional MAP estimate of the observational model parameters \(\hat{\theta} \mid \hat{Z}^{(k)}\) which maximizes \(\log p(X \mid \hat{Z}^{(k)}, \theta)\).

hat_theta= coef(sol_sbmsim)$theta
mean(abs(hat_theta-theta)) # mean absolute error; no permutation needed here
#> [1] 0.0006745534

The hierarchical ordering of the clusters (and the specificity of this simulation) allows us to avoid permuting \(\hat{\theta}\) before computing \(\hat{\theta} - \theta\).
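In general, when the estimated clusters do not come out in an order matching the true ones, one would first align them. Here is a hedged sketch using the Hungarian algorithm from the clue package (an extra dependency, not used elsewhere in this tutorial):

library(clue)                                          # assumed to provide solve_LSAP()
conf <- table(sbm$cl, clustering(sol_sbmsim))          # truth x estimate contingency table
perm <- as.integer(solve_LSAP(conf, maximum = TRUE))   # estimated cluster matched to each true cluster
mean(abs(coef(sol_sbmsim)$theta[perm, perm] - theta))  # error after re-ordering the estimate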

A real data example: the Books dataset

The Books dataset was gathered by Valdis Krebs and is attached to the greed package. It consists of a co-purchasing network of \(N=105\) books on US politics. Two books are linked by an edge if they have been frequently co-purchased. We also have access to the label of each book according to its political inclination: conservative (“c”), liberal (“l”), or neutral (“n”).

data(Books)

sol_dcsbm <- greed(Books$X,model = DcSbm())
#> 
#> ── Fitting a guess DCSBM model ──
#> 
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -1349 and 4 clusters.
#> ℹ Generation 2 : best solution with an ICL of -1346 and 4 clusters.
#> ℹ Generation 3 : best solution with an ICL of -1346 and 4 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a DCSBM model 3 clusters and an ICL of -1345

sol_sbm   <- greed(Books$X,model = Sbm())
#> 
#> ── Fitting a guess SBM model ──
#> 
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -1279 and 8 clusters.
#> ℹ Generation 2 : best solution with an ICL of -1259 and 6 clusters.
#> ℹ Generation 3 : best solution with an ICL of -1255 and 6 clusters.
#> ℹ Generation 4 : best solution with an ICL of -1255 and 6 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a SBM model 5 clusters and an ICL of -1254

The network has been automatically recognized as an undirected graph, as we can see in the fitted models’ priors:

prior(sol_dcsbm)@type
#> [1] "undirected"
prior(sol_sbm)@type
#> [1] "undirected"

For this dataset, the regular SBM model seems to reach a better ICL value than its degree-corrected variant. Still, we can visualize both aggregated adjacency matrices and the dendrogram.

bl_sbm = plot(sol_sbm,type='blocks')
bl_dcsbm = plot(sol_dcsbm,type='blocks')
ggarrange(bl_sbm,bl_dcsbm)

Block matrix representation of the DcSbm and Sbm solution found with greed on the Book network.

It is also possible to use external R packages to plot the graph layout, with node colors given by the clusters and node sizes by book popularity (computed using degree centrality). Here, we represent the result for the SBM solution with 5 clusters. One can see a hierarchical clustering structure appearing, with a central cluster of neutral books in between two densely connected sets. In each of these two dense sets, there is a clear distinction between popular books (heavily purchased) and more peripheral ones, as indicated by node size.


graph <- igraph::graph_from_adjacency_matrix(Books$X) %>% 
  as_tbl_graph() %>% 
  mutate(Popularity = centrality_degree())  %>% 
  activate(nodes) %>%
  mutate(cluster=factor(clustering(sol_sbm),1:K(sol_sbm)))

# plot using ggraph
ggraph(graph, layout = 'kk') + 
    geom_edge_link() + 
    geom_node_point(aes(size = Popularity,color=cluster))

Ggraph plot of the book network with node colors given by the clustering found with greed and an Sbm model.

Finally, we can look at both models’ solutions for \(K=3\) and their confusion matrices. We see that both partitions make sense according to the available political labels.
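The two tables below can be obtained by cutting each solution at \(K=3\) and crossing it with the political labels; here is a sketch, assuming the labels are stored in Books$label:

table(clustering(cut(sol_sbm, 3)), Books$label) %>% kable()   # SBM
table(clustering(cut(sol_dcsbm, 3)), Books$label) %>% kable() # DC-SBM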

SBM
c l n
1 1 38 2
2 8 5 9
3 40 0 2
DC-SBM
c l n
1 41 0 3
2 7 5 8
3 1 38 2
table(clustering(cut(sol_sbm, 3)), clustering(sol_dcsbm))
#>    
#>      1  2  3
#>   1  0  0 41
#>   2  2 20  0
#>   3 42  0  0
aricode::ARI(clustering(cut(sol_sbm, 3)), clustering(sol_dcsbm))
#> [1] 0.9503619

Continuous data clustering with GMM

The model

Without any constraint, the Bayesian formulation of the GMM that leads to a tractable exact ICL expression uses a conjugate Normal-inverse-Wishart prior on the means and covariances \(\mathbf{\theta} = (\mathbf{\mu}_k, \mathbf{\Sigma}_k)_k\). This prior is defined with hyperparameters \(\mathbf{\beta} = (\mathbf{\mu}, \tau, n_0, \mathbf{\varepsilon})\) and the hierarchical model is written as follows:

\[\begin{equation} \label{eq:gmm} \begin{aligned} \pi&\sim \textrm{Dirichlet}_K(\alpha)\\ Z_i&\sim \mathcal{M}(1,\pi)\\ \mathbf{\Sigma}_k^{-1} & \sim \textrm{Wishart}(\mathbf{\varepsilon}^{-1},n_0)\\ \mathbf{\mu}_k&\sim \mathcal{N}(\mathbf{\mu},\frac{1}{\tau} \mathbf{\Sigma}_k)\\ X_{i}|Z_{ik}=1 &\sim \mathcal{N}(\mathbf{\mu}_k, \mathbf{\Sigma}_{k})\\ \end{aligned} \end{equation}\]

Contrary to the other models, these priors are informative and may therefore have a noticeable impact on the obtained results. By default, the prior parameters are set as follows:

  • \(\alpha=1\)
  • \(\mu=\bar{X}\)
  • \(\tau=0.01\)
  • \(n_0=d\)
  • \(\mathbf{\varepsilon}=0.1\,\textrm{diag}(\mathbf{\hat{\Sigma}}_{\mathbf{X}})\)

These default values were chosen to accommodate a variety of situations but may be modified easily. For more details, we refer to the class documentation ?Gmm. Finally, it is possible to specify constraints on the covariance matrices, as discussed below.
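For illustration, here is a hedged sketch of both options, using the tau and epsilon arguments that appear later in this tutorial and the DiagGmm diagonal-covariance variant listed above (dat stands for any numeric data.frame or matrix to cluster):

# Tighter prior on the within-cluster covariance volume (see the sensitivity
# analysis at the end of this section for an actual run on the diabetes data).
sol_custom = greed(dat, model = Gmm(tau = 0.001, epsilon = 0.05 * diag(diag(cov(dat)))))

# Diagonal-covariance variant of the Gaussian mixture.
sol_diag = greed(dat, model = DiagGmm())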

Diabetes data set

For illustration, we will use the diabetes dataset from the mclust package. The data describes \(p=3\) biological variables for \(n=145\) patients with \(K=3\) different types of diabetes: normal, overt and chemical. We are interested in the clustering of this dataset to see if the three biological variables can discriminate between the three diabetes types. We first need to define the data to cluster:

data(diabetes,package="mclust")
head(diabetes %>% sample_n(100))
#>        class glucose insulin sspg
#> 82    Normal      93     393  490
#> 97  Chemical     102     472  297
#> 14    Normal      90     371  200
#> 81    Normal      99     398   76
#> 31    Normal      70     360  134
#> 123    Overt     195     920  160
X = diabetes[,-1]
X = data.frame(scale(X)) # centering and scaling

Again, we use the greed() function, this time with a model=Gmm() argument:

sol = greed(X,model=Gmm())
#> 
#> ── Fitting a GMM model ──
#> 
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -278 and 6 clusters.
#> ℹ Generation 2 : best solution with an ICL of -271 and 4 clusters.
#> ℹ Generation 3 : best solution with an ICL of -262 and 3 clusters.
#> ℹ Generation 4 : best solution with an ICL of -262 and 3 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a GMM model 3 clusters and an ICL of -262

It found a partition with \(\hat{K} = 3\) clusters, consistent with the diabetes types, with a few misclassifications that are expected when fitting a GMM on this dataset.

table(diabetes$class, clustering(sol)) %>% kable(row.names = T) %>% kable_styling(full_width = F)
1 2 3
Chemical 1 24 11
Normal 0 3 73
Overt 27 6 0

As a comparison, one can use a standard EM procedure as implemented in mclust:

sol_mclust = mclust::Mclust(data = X, G = 2:5)
aricode::ARI(sol_mclust$classification, diabetes$class)
#> [1] 0.6640181
aricode::ARI(clustering(sol), diabetes$class)
#> [1] 0.6531253
aricode::ARI(clustering(sol), sol_mclust$classification)
#> [1] 0.9324887

Visualizing the clustering results

As in the SBM case, we can use the tree visualization of the hierarchy, which is not of great interest here since \(\hat{K}=3\).

plot(sol, type="tree")

Moreover, specific visualization functions are implemented for this model. For example, we may look at the pairs plot with associated cluster colors via the gmmpairs() function:

gmmpairs(sol, X)

One can also investigate per-cluster marginal distributions with violin plots via plot() and its "violins" type:

plot(sol, type="violins")

Extract a conditional estimate of \(\theta\)

Furthermore, users may use the generic coef() function to get the estimates of the mixture component parameters. The result is a simple list with a field for each parameter (means: muk, covariance matrices: Sigmak, proportions: pi).

params = coef(sol)
names(params)
#> [1] "pi"     "muk"    "Sigmak"
params$Sigmak[[2]]
#>             glucose     insulin       sspg
#> glucose  0.04732456  0.06098111 -0.1020909
#> insulin  0.06098111  0.13714310 -0.1359453
#> sspg    -0.10209092 -0.13594534  1.7013069

Sensitivity analysis: impact of the hyper-parameters \(\beta\)

As explained above, the hyperparameters \(\beta\) have no “non-informative” default values and thus impact the clustering results. One may want to encode the a priori belief that the Gaussian ellipse volume is small inside clusters, which amounts to diminishing the \(0.1\) coefficient in front of \(\hat{\mathbf{\Sigma}}_{\mathbf{X}}\). In this case, it makes sense to decrease \(\tau\) in the same proportion to keep a flat prior on the cluster means. For the diabetes data, such a choice leads to an interesting solution, where the strong prior constraint results in one cluster being created to fit a single outlier among the “chemical” diabetes patients.

sol_dense = greed(X,model=Gmm(epsilon=0.01*diag(diag(cov(X))), tau =0.001), alg=Hybrid(pop_size = 100))
#> 
#> ── Fitting a GMM model ──
#> 
#> ℹ Initializing a population of 100 solutions.
#> ℹ Generation 1 : best solution with an ICL of -361 and 10 clusters.
#> ℹ Generation 2 : best solution with an ICL of -316 and 6 clusters.
#> ℹ Generation 3 : best solution with an ICL of -293 and 4 clusters.
#> ℹ Generation 4 : best solution with an ICL of -290 and 4 clusters.
#> ℹ Generation 5 : best solution with an ICL of -290 and 4 clusters.
#> ℹ Generation 6 : best solution with an ICL of -290 and 4 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a GMM model 4 clusters and an ICL of -290
gmmpairs(sol_dense,X)

Matrix pairs plots of the clustering with user specified hyperparameters on the diabetes data.

The first merge of the hierarchy recovers the partition previously obtained with the default parameters:

aricode::ARI(clustering(cut(sol_dense, 3)), clustering(sol))
#> [1] 0.9772411

Categorical data clustering with the latent class analysis

We are interested in the clustering of categorical datasets, which are typically found in survey data or item response theory (IRT). In this context, we observe \(n\) individuals described by \(p\) variables, each variable \(j\) taking one among \(d_j\) modalities.

The model

Such datasets are typically represented using a one-hot encoding of each factor in a design matrix \(\mathbf{X} \in \{0,1\}^{n \times d}\) where \(d = \sum_{j=1}^p d_j\). Latent class analysis (LCA) is a generative model for categorical data clustering which assumes the factor variables to be independent conditionally on the (unknown) partition. Below is a description of its Bayesian formulation with the use of proper conjugate priors: \[ \begin{equation} \begin{aligned} \pi&\sim \textrm{Dirichlet}(\alpha),\\ \forall k, \forall j, \quad \theta_{kj} &\sim \textrm{Dirichlet}_{d_j}(\beta), \\ Z_i&\sim \mathcal{M}_K(1,\pi),\\ \forall j=1, \ldots, p, \quad X_{ij}|Z_{ik}=1 &\sim \mathcal{M}_{d_j}(1, \theta_{kj}),\\ \end{aligned} \end{equation} \] For each cluster \(k\) and variable \(j\), the vector \(\theta_{kj}\) represents the probability of each of the \(d_j\) modalities. With the above choice of priors, the LCA model admits an exact ICL expression, similar to the mixture of multinomials derived in the supplementary materials of Côme et al. (2021, Section 3).
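For reference, here is a sketch of the corresponding marginal likelihood under a symmetric Dirichlet prior with scalar parameter \(\beta\) (our notation: \(n_k\) is the size of cluster \(k\) and \(n_{kjh}\) the number of its members taking modality \(h\) of variable \(j\)): \[ \log p(\mathbf{X} \mid Z, \beta) = \sum_{k=1}^{K} \sum_{j=1}^{p} \left[ \log \frac{\Gamma(d_j \beta)}{\Gamma(n_k + d_j \beta)} + \sum_{h=1}^{d_j} \log \frac{\Gamma(n_{kjh} + \beta)}{\Gamma(\beta)} \right]. \]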

The greed package implements this model via the Lca model-class.

The default values of the hyper-parameters may be changed; we refer to the ?Lca documentation for further details.

A real-data analysis: the mushroom dataset

The mushroom data from the UCI Machine Learning Repository describe 8124 mushrooms with 22 phenotype variables. Each mushroom is classified as “edible” or “poisonous”, and the goal is to recover the mushroom class from its phenotype. The dataset comes attached with the greed package.

data("mushroom")

Subsetting the data

We begin by forming the necessary vectors for the analysis. The data has \(n=8124\) rows and \(p=23\) columns. The first column contains the edibility status of each mushroom, with two possible values, “p” for “poisonous” and “e” for “edible”; it will serve as the classification we seek to recover.

mushroom[,1:10] %>% head() %>% knitr::kable()
edibility V2 V3 V4 V5 V6 V7 V8 V9 V10
p x s n t p f c n k
e x s y t a f c b k
e b s w t l f c b n
p x y w t p f c n n
e x s g f n f w b k
e x y y t a f c b n

The remaining variables are used for clustering. Note that we only use a subset of the data for illustration purposes here.

X = mushroom[,-1]                       # phenotype variables only
subset = sample(1:nrow(X), size = 1000) # random subsample of 1000 mushrooms
label = mushroom$edibility[subset]      # true edibility labels of the subsample

Clustering

The clustering is done via the main function greed(), with the model argument set to Lca() and the hybrid genetic algorithm for ICL maximization.

Note: The Lca model may only be used with datasets stored in a data.frame object containing factors only. When such data are provided to the greed() function, an Lca model is picked by default. To perform the clustering, it is therefore sufficient to call greed with the prepared data.frame.
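For instance, a minimal call relying on this default (a sketch equivalent to the explicit call below):

sol_auto = greed(X[subset, ]) # an Lca model is inferred from the factor columns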

sol_mush<-greed(X[subset,], model=Lca())
#> 
#> ── Fitting a LCA model ──
#> 
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -13018 and 11 clusters.
#> ℹ Generation 2 : best solution with an ICL of -12817 and 16 clusters.
#> ℹ Generation 3 : best solution with an ICL of -12817 and 16 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a LCA model 15 clusters and an ICL of -12785

The hybrid genetic algorithm found a solution with \(K=15\) clusters, which is quite over-segmented. However, it displays a good separation between edible and poisonous mushrooms:

table(label, clustering(sol_mush)) %>% knitr::kable()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
e 0 0 0 17 19 1 41 22 92 0 15 9 59 26 213
p 213 163 4 4 0 13 0 0 0 28 27 34 0 0 0

The partition’s NMI is 0.27, which is explained by the over-segmentation of the solution compared to the \(2\)-class problem.
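This value can be computed with the aricode package:

aricode::NMI(label, clustering(sol_mush))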

Analysis of the hierarchy

Exploring the dendrogram provided by the hierarchical algorithm is quite useful in this case.

plot(sol_mush, type='tree')

Dendrogram obtained with greed with the Lca model on the Mushroom dataset.

We clearly see a hierarchical structure appearing, with \(K=2\) or \(K=4\) main clusters. We can therefore cut the tree at \(K=2\) and look at the corresponding solution.

sol2 = cut(sol_mush, 2)
table(label, clustering(sol2)) %>% knitr::kable(row.names = T) %>% kable_styling(full_width = F)
1 2
e 0 514
p 376 110

Here, we clearly see that the order of the merges is consistent with the labels, and the final ARI is 0.58. Some poisonous mushrooms have been categorized as edible, but this is a consequence of the way the labels were set: mushrooms whose edibility status was unknown were classified as poisonous by default, a reasonable choice from a strict health perspective. The unsupervised problem is therefore hard, and the obtained clustering is already satisfactory. Moreover, this illustrates the interest of having the hierarchical algorithm in order to access coarser partitions.
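Again, the ARI can be checked directly with the aricode package:

aricode::ARI(label, clustering(sol2))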

Marginal plots

plot(cut(sol_mush,2), type="marginals")

Going further

You can have a look at the vignettes for all the available models on the package’s website!