Cluster count

Cluster Count determines the number of clusters (states) for a discrete latent variable (cluster / mixture) in a Bayesian network.

The process uses cross validation to evaluate the log-likelihood for a series of candidate cluster counts.

Opening

With a Bayesian network or Dynamic Bayesian network open that contains one or more discrete latent variables, click the Cluster Count button on the main window toolbar tab entitled Data.

Cluster Count

To determine a suitable number of clusters, cross validation is used. The data is split randomly into a configurable number of partitions. For each partition p, a model is learned on the remaining data (data - p), and the log-likelihood is evaluated on the unseen data p. The log-likelihoods are then summed over all partitions, resulting in an overall score.
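
The scoring rule described above can be written compactly. The notation is introduced here for illustration only and is an assumption, not taken from the product: D_1, ..., D_P are the P partitions of the data D, K is the cluster count being tested, and \hat{\theta}_{K,-p} denotes the parameters learned with K clusters on all data except partition p.

```latex
\mathrm{score}(K) = \sum_{p=1}^{P} \log P\left(D_p \mid \hat{\theta}_{K,-p}\right)
```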

This score is calculated for each configured cluster count, and the scores are plotted.
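
The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: a one-dimensional Gaussian mixture fitted by EM stands in for the Bayesian network with a latent cluster variable, and the names fit_mixture, log_likelihood and cluster_count_scores are hypothetical helpers written for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(xs, k, iters=50):
    """Fit a 1-D Gaussian mixture with k clusters by EM (stand-in model, an assumption)."""
    n = len(xs)
    ordered = sorted(xs)
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n + 1e-6
    # Spread the initial means over evenly spaced quantiles of the training data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each data point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(p), 1e-300)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances from the posteriors.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            denom = max(nj, 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / denom
            # A small floor keeps a cluster from collapsing onto a single point.
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / denom + 1e-3
    return weights, means, variances


def log_likelihood(xs, model):
    """Sum of log densities of xs under a fitted mixture."""
    weights, means, variances = model
    total = 0.0
    for x in xs:
        p = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(max(p, 1e-300))
    return total


def cluster_count_scores(xs, counts, partitions=3, seed=0):
    """Cross-validated log-likelihood score for each candidate cluster count."""
    data = list(xs)
    random.Random(seed).shuffle(data)  # random split into partitions
    folds = [data[i::partitions] for i in range(partitions)]
    scores = {}
    for k in counts:
        score = 0.0
        for i, held_out in enumerate(folds):
            # Learn on (data - p), then evaluate on the unseen partition p.
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            score += log_likelihood(held_out, fit_mixture(train, k))
        scores[k] = score
    return scores


# Synthetic data with two well-separated groups (illustrative only).
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(8.0, 1.0) for _ in range(150)]
scores = cluster_count_scores(data, counts=[1, 2, 3, 4])
for k, s in scores.items():
    print(k, round(s, 1))
```

Because each score is measured on data the model has not seen, adding clusters beyond what the data supports does not automatically raise it, which is what makes the scores comparable across cluster counts.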

NOTE

A higher score is preferred, especially if it is part of a smooth curve. Any areas that exhibit volatility should usually be ignored.

Once a suitable number of clusters has been determined, close the Cluster Count window and update the number of states in the cluster variable.

A cluster count of 1 is included by default to test the hypothesis that the cluster variable is not required at all.