Cluster count

Introduction

Cluster Count determines the number of clusters (states) for a discrete latent variable (cluster / mixture) in a Bayesian network.

The process uses cross validation, and evaluates the log-likelihood for a series of different cluster counts.

Cluster count

The Cluster counts and scores can be exported to a tool, such as Excel, in order to generate a plot such as the following:

Cluster count

Pre-requisites

Unlike the Clustering Structural learning algorithm, the cluster count tool requires a network with at least one existing discrete latent variable and links should already have been added.

NOTE

Bayes Server supports networks with multiple latent variables.

Algorithm

In order to determine a suitable number of clusters, cross validation is used. The data is split randomly into a configurable number of partitions. For each partition p, a models is learned on (data - p), and the log-likelihood is evaluated on the unseen data p. The log-likelihood is then summed over each partition, resulting in an overall score.

This score is calculated for each configurable cluster count, and the scores plotted.

NOTE

A higher score is generally preferred, but so are fewer clusters. Typically the best cluster count occurs before the score no longer increases significantly to justify the additional parameters.

Once a suitable number of clusters has been determined, close the Cluster count window and update the number of states in the cluster variable.

NOTE

A cluster count of 1 is included by default to test the hypothesis that the cluster variable is not required at all.