Skip to main content

Data sampling

Introduction

Data Sampling is the process of generating data from a Bayesian network or Dynamic Bayesian network. Some uses for Data Sampling are:

  • Generating data to understand and visualize a network.
  • Generating test data.
  • Understanding how the log-likelihood varies.

Data sampling

Log Likelihood

When the Log Likelihood option is true, an additional column is added to the sampled data, which reports the Log likelihood of the generated sample.

[!IMPORTANT] When evidence is fixed, each case has an associated weight that can be output if necessary. The log-likelihood output however is unweighted. i.e. it reflects the log-likelihood of the case as if it had a weight of 1.

Weight / Log(weight)

Bayes Server supports sampling data when certain evidence is fixed. When evidence is fixed, each sample has an associated weight, which is the likelihood of the fixed evidence for the current sample. This indicates how likely it is that the fixed evidence could have occurred with this sample.

To output the weight, ensure the Weight options is true or the Log Weight options is true.

info

Log weight is useful when the likelihood of the fixed evidence can be very small, such as when sampling from time series and sequences. The log weight is simply log(weight) however is calculated in such a way that it does not suffer from underflow problems.

Current Evidence

When the Current Evidence option is true, any evidence currently entered in the current Bayesian network or Dynamic Bayesian network will be used in the data sampling process.

Missing Data

By specifying a value between 0 and 1 (inclusive), in the missing data Probability text box, a proportion of values will be randomly set to missing (null/unobserved). Optionally an additional minimum probability can be specified in the Probability (Min) text box. When set, the missing data probability for each case varies randomly between the two specified probabilities.

info

The missing data mechanism used is Missing Completely At Random (MCAR).

Export

Data can be easily exported.

For example, this is useful to understand how the log-likelihood varies.

DBN Sampling

When sampling from networks with DBN variables (Dynamic Bayesian networks), the sample count does not equal the number of sequence rows generated, but rather the number of cases generated, each of which will have its own sequence.