Tutorial 5 - Classification

In this tutorial we will perform classification, which is the prediction of one or more discrete variables given what we know about other variables.

The following concepts will be covered:

  • Creating a Data Connection
  • Performing classification using batch queries
  • Queries with missing data
  • Confusion matrix
  • Charting predictions
  • Predicting the log likelihood

Bayes Server must be installed before starting this tutorial. An evaluation version can be downloaded from the Downloads page.

Open the model

We will use the Bayesian network built in Tutorial 1 - A simple network, shown below.

Identification network

  • Launch Bayes Server, and on the Start page click the network entitled 'Tutorial 1 - A simple network' in the Sample networks pane.

If the Start page is not set to display on start up, or has been closed, click the Start page button on the View tab, General group.

Batch queries

We have 100 test cases defined in the data section. We will use the columns Hair Length and Height to predict Gender. Because Gender is a discrete variable, this task is known as classification. The data set also includes the actual Gender, so that we can measure how well the model performs; this column is not used to make the predictions.

Note that the data has some missing values. This is not a problem for a Bayesian network. It can still perform the prediction, using whatever information is available.
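To see why missing values are harmless, consider the following minimal Python sketch of the underlying idea. It is not the Bayes Server API. It assumes a network in which Gender is the parent of both Hair Length and Height, and every probability, mean and standard deviation in it is invented for illustration; none of them are the parameters of the Tutorial 1 network. A missing input simply contributes no term to the calculation, which is equivalent to summing it out.

```python
import math

# Illustrative parameters only (assumptions, not taken from the Tutorial 1 network).
PRIOR = {"Female": 0.5, "Male": 0.5}
HAIR = {
    "Female": {"Short": 0.2, "Medium": 0.4, "Long": 0.4},
    "Male": {"Short": 0.7, "Medium": 0.2, "Long": 0.1},
}
HEIGHT = {"Female": (163.0, 7.0), "Male": (177.0, 7.0)}  # (mean, std dev) in cm


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_gender(hair=None, height=None):
    """Return P(Gender | evidence); None means the value is missing."""
    joint = {}
    for g, p in PRIOR.items():
        if hair is not None:
            p *= HAIR[g][hair]
        if height is not None:
            p *= normal_pdf(height, *HEIGHT[g])
        joint[g] = p
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}


print(posterior_gender(hair="Long", height=156.3))  # both inputs known
print(posterior_gender(hair="Long"))                # Height missing
print(posterior_gender())                           # nothing known: returns the prior
```

With fewer inputs the posterior is usually less decisive, but a prediction is always available.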

For convenience, we will use Microsoft Excel as the data source; however, a database can be substituted.

Although Microsoft Excel is a convenient way of storing data, in practice we recommend using a database as the data source.

Adding a data connection

Note: You can skip this step, and instead use the pre-installed Tutorial data connection (Walkthrough Data in earlier versions).

  • Select the data (including the header) in the data section and copy it to the clipboard (Ctrl+C).
  • Open Microsoft Excel and paste the data into a new Microsoft Excel spreadsheet (Ctrl+V).
  • Save the new spreadsheet.
  • In Bayes Server, click the Data Connections button on the Data tab, Data Sources group. This will launch the Data connection manager.
  • Click the New button on the toolbar. This will launch the Data connection editor.
  • In the list of data providers, select the appropriate Excel Driver for the version of Microsoft Excel you are using.
  • Next to the File Name text box, click the Ellipsis (...) button, and select the Microsoft Excel spreadsheet created in an earlier step.
  • Click the Test Connection button to ensure the new data connection is working.
  • Click OK to add the new Data Connection.

Batch query

  • Click the Batch query button on the Data tab. This will launch the Data tables window.
  • In the Data Connection drop down, select the new Data Connection created in an earlier step, or the Tutorial data connection if you skipped that step. This should enable the Data drop down.
  • In the Data drop down, select the worksheet that contains the data. (If the data is on the first worksheet, select Sheet1$.) If you are using the pre-installed Tutorial data connection, select Tutorial 5 - Classification.
  • Click the OK button. This will launch the Data map window.
  • In the Data map window, ensure that the variable Hair Length has automatically been mapped to the column Hair Length, and the variable Height has automatically been mapped to the column Height.

Because we are predicting Gender, we do not want the Gender variable to be mapped.

  • Click the Un-map column button at the end of the Gender row.

In order to test how well our model can predict Gender, we want to have access to the Gender data column, but we do not want to map it to the variable we are predicting.

  • Click on the Information tab, and click the check box next to Gender.

    The window tabs should look like this:

    Classification data map variables

    Classification data map information

    Another way of performing the same prediction is to leave the default mappings (including Gender) and use the Retract evidence feature, which treats the variable you are predicting as missing, even if it is mapped to non-missing data.

  • Click the OK button. This will launch the Batch query window.

  • In the query pane on the left hand side, ensure the following queries and information columns are checked:

    • LogLikelihood
    • Predict(Gender)
    • PredictProbability(Gender)
    • Gender
  • Click the Start button on the Batch Query tab, Batch Query group. This outputs the predictions to the window.

Instead of outputting to the window, you can also output the predictions to a database. This is useful if you are working with large datasets.

The window should look like this:

Classification batch query window
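The three query columns can be illustrated with a small Python sketch. It is not the Bayes Server API: it assumes a network in which Gender is the parent of Hair Length and Height, all parameters are made up, and the helper name `batch_row` is hypothetical. For each case, the log-likelihood is the log of the probability (density) of whatever evidence is present, Predict(Gender) is the most probable state, and PredictProbability(Gender) is the posterior probability of that state.

```python
import math

# Made-up parameters for illustration; not the Tutorial 1 network's values.
PRIOR = {"Female": 0.5, "Male": 0.5}
HAIR = {
    "Female": {"Short": 0.2, "Medium": 0.4, "Long": 0.4},
    "Male": {"Short": 0.7, "Medium": 0.2, "Long": 0.1},
}
HEIGHT = {"Female": (163.0, 7.0), "Male": (177.0, 7.0)}  # (mean, std dev) in cm


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def batch_row(hair, height):
    """Hypothetical helper: return (log-likelihood, predicted state, its probability)."""
    joint = {}
    for g, p in PRIOR.items():
        if hair is not None:
            p *= HAIR[g][hair]
        if height is not None:
            p *= normal_pdf(height, *HEIGHT[g])
        joint[g] = p
    evidence = sum(joint.values())         # p(evidence), with Gender summed out
    predicted = max(joint, key=joint.get)  # most probable Gender
    return math.log(evidence), predicted, joint[predicted] / evidence


# Inputs of three cases from the data section; None marks a missing value.
cases = [("Medium", 159.64532), ("Short", 178.50209), ("Long", None)]
for hair, height in cases:
    log_lik, predicted, prob = batch_row(hair, height)
    print(f"{hair!s:<7}{height!s:<11}{log_lik:9.3f}  {predicted:<7}{prob:.3f}")
```

An unusually low log-likelihood flags a case that the model considers improbable. Log-likelihoods are only directly comparable between cases that have the same variables present.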

Confusion matrix

In order to determine how well our model performed, we can use a confusion matrix.

  • Switch to the Statistics tab on the Batch query window, and click the Confusion Matrix button in the Classification group. This will launch the Confusion matrix options window.

  • Ensure that Gender is selected in the Actual drop down, and Predict(Gender) is selected in the Predicted drop down.

    The window should look like this.

    Classification confusion matrix options

  • Click the OK button. This will calculate and display the confusion matrix. The Confusion matrix window should look like this:

    Classification confusion matrix

    Diagonal elements of the confusion matrix count the cases in which Gender was classified correctly. Off-diagonal elements count incorrect classifications.
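For readers who want to see the calculation itself, a confusion matrix can be sketched in a few lines of Python. The labels below are made up for illustration and are not output from the tutorial's model. Skipping cases whose actual value is missing is an assumption of this sketch, not a statement about how Bayes Server treats them.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs, skipping cases whose actual value is missing."""
    return Counter((a, p) for a, p in zip(actual, predicted) if a is not None)


def accuracy(matrix):
    """Share of counted cases that lie on the diagonal (actual == predicted)."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else float("nan")


# Made-up labels for illustration only.
actual = ["Female", "Male", "Female", "Male", None, "Female"]
predicted = ["Female", "Male", "Male", "Male", "Female", "Female"]

matrix = confusion_matrix(actual, predicted)
states = ["Female", "Male"]
print("actual \\ predicted", *states)
for a in states:
    print(a, *(matrix[(a, p)] for p in states))
print("accuracy:", accuracy(matrix))
```

Each row of the printed table is an actual state and each column a predicted state, so the diagonal holds the correct classifications.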

Data

Gender Hair Length Height
Female Medium 159.64532
Male Short 178.50209
Female Short 170.2725
Female Medium 160.31395
Female Long 156.32858
Female Long 165.43799
Male Short 177.59889
Female Medium 161.11003
Male Short 166.09811
Female Long 173.34889
Male Short 169.16522
Male Medium 179.45741
Female Long
Female Medium 158.67832
Female Long 171.75507
Female Short 165.4013
Male Short 188.6639
Male Short
Female Long 165.88785
Female Medium 168.43815
Male Short 178.84286
Female Short 164.10128
Female Medium 173.39975
Female Medium 160.2925
Female Medium 166.0434
Female Long 159.51891
Female Medium 167.27399
Female Medium 162.01801
Male Short 159.67172
Female Long 149.85316
Male Short 178.85521
Female Medium 159.10519
Male Short 176.89731
Male Medium 160.80553
Male Short 176.67044
Female Medium 151.4692
Female Medium 159.47791
Medium 178.30403
Male Long 177.37518
Male Short 175.68627
Male Medium 182.13118
Female Long 168.80542
Male Short 173.47985
Male 174.67784
Female Long 167.92433
Female Long 170.78801
Short 173.21558
Male Short 185.71675
Male Medium 192.61151
Female Long 165.47273
Male Short 179.94032
Male 185.23601
Male Short 180.676
Female Long 167.14232
Male Short 166.71996
Female Long 147.9807
Female Long
Male Short 178.66922
Male Short 179.55905
Male Short 189.99837
Male Short 172.49842
Male Short 186.58113
Female Short 169.12165
Long 165.95135
Female Long 168.34383
Long 174.84138
Male Short 173.94395
Female Short 155.70222
Female Long 177.06825
Male Short 173.52714
Female Short 170.73774
Female Medium 158.87229
Female Long 147.5172
Male Medium 170.96061
Short 191.28145
Male Medium 170.87405
Male Short 179.53121
Long 160.09839
Female Long 153.82008
Female Long 167.66346
Male Medium
Male Short 176.23203
Female Medium 160.16516
Female Medium 153.82284
Male Medium 169.74507
Male Short 179.47557
Female Long 162.2582
Female Long 154.11746
Male Short 168.06671
Male Short 191.50926
Male Medium 185.57492
Female Long 161.82199
Female Medium 158.64344
Female Short 175.84038
Female Medium 162.36804
Male Short 169.27324
Female Medium 169.56408
Male Short 174.71516
Male Short 181.95237
Male Short 187.56014
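
In the listing above a missing cell is simply absent, so a row may contain fewer than three values. If you load the listing programmatically rather than through the clipboard, the column each value belongs to can be recovered from the value itself. The following Python sketch shows one way to do this; the helper name `parse_row` is hypothetical, and it assumes the state names appear exactly as in the listing and that any other token is a Height.

```python
GENDER_STATES = {"Female", "Male"}
HAIR_STATES = {"Short", "Medium", "Long"}


def parse_row(line):
    """Hypothetical helper: split one listing row into (gender, hair, height)."""
    gender = hair = height = None  # None marks a missing value
    for token in line.split():
        if token in GENDER_STATES:
            gender = token
        elif token in HAIR_STATES:
            hair = token
        else:
            height = float(token)  # raises ValueError on unexpected text
    return gender, hair, height


rows = [
    "Female Medium 159.64532",
    "Female Long",       # Height missing
    "Medium 178.30403",  # Gender missing
    "Male 174.67784",    # Hair Length missing
]
for row in rows:
    print(parse_row(row))
```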