In this article we introduce some basic concepts in probability.

A **variable** refers to a value that can change between measurements/examples such as quantity or state. This could be a **discrete** variable such as *Gender* with possible values {*Female*, *Male*} or a **continuous** variable such as *Age* (e.g. 22 or 37 or 65).

The symbols X, Y & Z are typically used when we are concerned with general rules (theorems) involving variables.

Discrete variables can take on one of a set of distinct states. For example, the variable *Gender* might have possible states {Female, Male} meaning that each measurement/example would either be Female or Male.

The set of states in a variable must be exhaustive, simply meaning that the states cover all possible scenarios. They are also mutually exclusive, meaning that you cannot be in more that one state at the same time.

It is important to point out that although the actual (known) value of a discrete variable can only be one of the mutually exclusive states, if we do not know the exact state of a variable, we can still assign a certainty of being in each state. For example {Female = 30%, Male = 70%}.

If the states are not mutually exclusive, e.g. a person plays both football and tennis, then we need more than one variable, e.g. Football {True, False} and Tennis {True, False}.

A continuous variable can be assigned a value which is a real number. For example, the value of a variable *Speed* could be 0 kph, 25.5 kph, 100 kph etc...

Again it is important to point out that if we do not know the exact value of a continuous variable we can still assign a certainty to it lying within a range of values. E.g. we are 30% sure that the car was traveling at at least 70 kph.

It is sometimes useful to convert a continuous variable to a discrete variable. For example, some algorithms do not support continuous variables.

This can be done by creating bins which represent a range of values. Each bin then becomes a discrete state of a derived discrete variable.

There are a number of different algorithms commonly used to determine the bins (ranges). For example, equal ranges (each range has the same size), equal frequencies (each bin has the same number of records) or clustering (a machine learning algorithm is employed).

Usually a variable corresponds 1-to-1 with a column of data in a spreadsheet or database.
For example, the variables in the **Microsoft Excel** spread-sheet shown below are as follows:

Variable | Value Type |
---|---|

Sepal length | Continuous |

Sepal width | Continuous |

Petal length | Continuous |

Petal width | Continuous |

Class | Discrete |

A probability is quantified as a number between 0 and 1. It tells us how likely a particular situation is, such as a discrete variable being in a particular state, or a continuous variable being within a range of values. A value of 0 indicates no chance (impossible) and a value of 1 is a sure thing (certainty).

For example, the probability of Gender=Male might equal 0.7, meaning that we think that there is a 70% chance a person is Male. In mathematical notations this is written $P(Gender=Male)\; =\; 0.7$, or where the context of the state 'Male' is clear you might see it written as $P(Male)\; =\; 0.7$.

A probability may be based on more than one variable. For example, the probability of Rain {True, False} and Wind {True, False} both being True might be 0.1 (10%). In math notation this would be written as $P(Raining=True,\; Windy=True)$. This type of probability also has a special name called a **Joint probability**.

A probability value can be conditioned on other variables. For example we may require the probability of *Raining* being True given that we know *Windy* is True. In mathematical notation this would be written $P(Raining=True\; |\; Windy=True)$.

You can also have a probability which is based on multiple states or ranges (over one or more variables). For example if the variable **Hair Length** has states {Short, Medium, Long} we might be interested in the probability that someone does not have short hair.

We will cover probability distributions later in the article, but it is useful to explain what the difference is at this point.

A probability refers to the number between 0 and 1 which tells us how likely a particular situation is, for example $P(Gender=Female)=0.4$.

A probability distribution on the other hand, provides us with a way of getting the probability value for each possible outcome.
The probability distribution for a variable *Gender* therefore provides us with both the probability for male $P(Gender=Male)$ and the probability for female $P(Gender=Female)$.

In mathematical notation the probability distribution for *Gender* is written $P(Gender)$. The only difference between this notation and that for each of the probability values returned from the probability distribution is that the state/outcome (Female or Male) is omitted.

In mathematics a probability distribution is a type of function. i.e. a function which returns a probability for each outcome. A discrete probability distribution is often called a

Probability Mass Functionand a continuous distribution aProbability Density Function.

The following famous quote from Voltaire reminds us why probability is such a useful concept.

Doubt is not a pleasant condition, but certainty is absurd.

-- Voltaire

A probability distribution tells us how likely each possible state/outcome (or range) is.

For discrete variables, a probability distribution provides us with a probability value for each state/outcome. To be a valid distribution, the sum of these values must equal 1 (since states are mutually exclusive and exhaustive).

For example, if a variable **Hair Length** had possible states {Short, Medium, Long}, a discrete distribution might look like [0.2, 0.3, 0.5]. This means that the probability of Hair Length = Short is 0.2 (20%), Hair Length = Medium is 0.3 (30%) and Hair Length = Long is 0.5 (50%).

A discrete probability distribution is known as a

Probability Mass Function.

When a discrete variable has only two possible states (outcomes) such as *True* or *False*, we can use two probabilities (e.g. True=0.7, False=0.3) to define a probability distribution.

In fact since the probabilities must sum to 1, we actually only need one of these values to fully specify this distribution.

This distribution has a special name called the **Bernoulli** distribution.

The

Bernoullidistribution, which helps us model a particular outcome, is not to be confused with theBinomialdistribution which helps us model a sequence of outcomes. In fact when aBinomialdistribution has only one outcome it is aBernoullidistribution.

An extension of the **Bernoulli** distribution can be used when we have more than two outcomes.
For example, we might have a variable *Hair Length* with possible outcomes {Short, Medium, Long} and use the distribution [0.2, 0.3, 0.5].

This distribution is often called a **Multinoulli** distribution. Again this is not to be confused with the **Multinomial** distribution which is used to model a sequence of outcomes.

Non-mathematicians tend to talk about probabilities in terms of percentages rather than a number between 0 and 1.

A discrete probability can be multiplied by 100 to give its percentage equivalent. For example a probability of 0.25 is equivalent to a 25% chance.

Note that this is consistent with

Microsoft Excel. In Excel, 0.2 is equivalent to 20%. If you multiply either by 100 you get 20. Also if you enter 0.2 in a cell and format the cell with the percentage button it will display 20%.

For continuous variables a probability distribution tells us what the chance of being within a range of values is.

Whereas a discrete probability distribution must sum to one, the analogous for a continuous probability distribution is that the integral must sum to 1. (E.g. for a single variable, the area under the curve must sum to 1).

In many cases you can think of summing as being equivalent to integrating, when dealing with discrete or continuous distributions respectively.

Note that for continuous variables, the chance of having an individual value tends to zero, as there are infinitely many possible real number values the variable can take.

A very common example of a continuous distribution is the Gaussian (Normal) probability distribution.

This is a distribution from the exponential family of distributions (it has an **e** term in the equation).

For continuous variables the cumulative density function tells us what the chance of being less than or equal to a particular value is.

As you might expect, the cdf evaluated at x, is simply the the area under the probability density function from minus infinity to x

Earlier we introduced the **Joint probability**, for example $P(Raining=True,\; Windy=True)$. A **joint probability distribution** gives us the joint probability value for each possible combination of states (outcomes) from both variables.

For example, the joint distribution over *Raining* and *Windy* would be able to return the probability value for each possible combination:

Raining | Windy | Probability |
---|---|---|

True | True | 0.10 |

True | False | 0.15 |

False | True | 0.25 |

False | False | 0.5 |

Note that the sum (or integral) of a joint probability distribution is always 1.

The mathematical notation for this joint probability distribution is $P(Raining,\; Windy)$.

In set notation you will often see this written as $P(Raining\; \xe2\u02c6\copyright \; Windy)$, however in this article we will not use this form.

Earlier we introduced the **Conditional probability**, e.g. $P(Raining=True\; |\; Windy=True)$.

A conditional probability distribution gives us the conditional probability value for each possible combination of states (outcomes) from both variables.

For example, the conditional distribution over *Raining* and *Windy* would be able to return the conditional probability value for each possible combination. For example:

Raining | Windy | Probability |
---|---|---|

True | True | 0.286 |

True | False | 0.231 |

False | True | 0.714 |

False | False | 0.769 |

The mathematical notation for this conditional probability distribution is $P(Raining\; |\; Windy)$.

In the expression P(

X|Y), the variable(s)Xare calledHeadvariables andYare calledTailvariables.

The sum of a conditional probability distribution is always 1 for each **Tail** combination. The only **Tail** variable in the example above
is *Windy* therefore the sum of probabiliy values given that Windy=True will always equal 1, and the sum given Windy=False will be 1 also.

A number of simple rules exist called the **Axioms of probability**. We will describe them in simple terms, but the interested reader may wish to consult a probability text.

Given a variable S with states (outcomes) $s$_{1}, s_{2}, s_{3},..., s_{n}:

- $P(s$
_{i}) >= 0 and P(s_{i}) <= 1, this just means that the probability of being in each state (outcome) must have a value between 0 and 1 - $P(S)\; =\; 1$ or $sum(P(s$
_{1}) + P(s_{2}) + ... + P(s_{n})) = 1, this means that the sum of all probability values must equal 1. i.e. the states are exhaustive (cover all outcomes). - $P(s$
_{i}, s_{j}) = P(s_{i}) + P(s_{j}), this means that the joint probability of states from the same variable is equal to their sum. I.e. the states are mutually exclusive. E.g. if a variable*Raining*can be True or False, when considering an individual record, it cannot be both raining and not raining at the same time. Note that if we do not know the actual outcome (perhaps we are in a room with no windows), we can still have a probability of being in each state, e.g. [0.3, 0.7] meaning that we are 30% sure it is raining.

There are a number of consequences that follow directly from the axioms of probability.

One that is especially important is:

$P(A,B)\; =\; P(A|B)P(B)$

As we will see later, Bayes' theorem and its applications follow from this equation.

If two variables are independent (i.e. unrelated) then P(A,B) = P(A)P(B). Note that A and B could also be groups of variables.

Groups of variables are often denoted with a bold character, such as

XorY.

Independence between X and Y simply means that varying X will have no impact on Y and vice versa.

Variables A and C are independent given B, if P(A|B) = P(A|B,C).

What this means is that certain variables will become conditionally independent of each other depending on what is known about other variables (evidence), i.e. they will vary independently given certain evidence.

For example, the probability of having short hair might be independent of someone's height, given that we know someone is male.

A **marginal probability distribution** is formed by calculating the subset of a larger probability distribution. For example calculating P(X) from P(X, Y, Z) or P(X,Y) from P(X,Y,Z).

Many use the term to refer to distributions that contain only a single variable, however any subset is a valid marginal.

Consider the joint probability over the variables *Raining* and *Windy* shown below:

P(Raining, Windy)

Raining | Windy = False | Windy = True |
---|---|---|

False | 0.64 | 0.16 |

True | 0.1 | 0.1 |

If someone asks us what is the probability of it raining, we need P(Raining), not P(Raining, Windy).

In order to calculate P(Raining), we can simply sum up all the values for Raining = False, and Raining = True, as shown below.

P(Raining, Windy)

Raining Windy = False | Windy = True | Sum |
---|---|---|

False | 0.64 | 0.16 |

True | 0.1 | 0.1 |

This is known as marginalization, and the result we were after is shown below.

P(Raining)

Raining = False | Raining = True |
---|---|

0.8 | 0.2 |

For discrete variables we sum, whereas for continuous variables we integrate.

Although it is not known for sure, many believe the term

marginalizationcame about through the use of ledgers for basic accounting. The totals (sums) would be recorded in the margins, hence the term marginalization.

$P(A|B)\; =\; P(B|A)P(A)\; /\; P(B)$

The equation specified in Bayes' theorem can be simply derived from standard probability, as follows:

From the consequences that follow from the axioms of probability we get that $P(A,B)\; =\; P(A|B)P(B)$ and equivalently $P(A,B)=P(B|A)P(A)$.

Therefore $P(A|B)P(B)\; =\; P(B|A)P(A)$ which when rearranged is

$P(A|B)\; =\; P(B|A)P(A)\; /\; P(B)$, which is Bayes' theorem.

Sometimes Bayes' theorem is written without the numerator as $P(A|B)\; \xe2\u02c6\x9d\; P(B|A)P(A)$ where the symbol âˆ in math means

is proportional to. P(B) is a normalization term, to ensure that the probability sums/integrates to 1. This form is quite common and used when the denominator has no bearing on the end computation.

Bayes' theorem can be used when we need to invert the variables we are conditioning on, such as $P(A|B)$ to P(B|A). For example, we may have probabilities of the form P(Hair Length | Gender) and P(Gender), and we wish to calculate the P(Gender | Hair Length). To do this we would use Bayes' Theorem.

In Bayesian statistics, Bayes' theorem can be used used to incrementally update a model *M* from data *D*.

This can be done using Bayes Theorem as follows:

$P(M|D)\; =\; P(D|M)P(M)\; /\; P(D)$

Each term in this equation has a name. P(M|D) is called the **Posterior**, P(D|M) the **likelihood**, P(M) the **Prior**, and P(D) the **Marginal likelihood** or **Probability of Evidence**.

The term

likelihoodcan easily be confusing, as in some fields the termlikelihoodis used to refer to the marginal likelihood. To avoid confusion you could choose to usemarginal likelihoodfor P(D) andconditional likelihoodfor P(M|D).

At some points in history, Bayesian statistics has been considered controversial. This was due to the subjective nature of the choice of prior P(M). However Bayesian statistics is now mainstream, and if anything the balance may have swung the other way, with many advocating its use while questioning frequentist techniques such as p-values. Whatever your stance, Bayes' Theorem is useful in both frequentist and Bayesian statistics.

Priors are used to give an initial starting point from which we can learn. For example, if a cricketer scores 200 runs in their first test match, would we predict that they would score 200 in their second? A Bayesian approach would start with an initial model (prior), perhaps based on batting averages given the characteristics of the player, and the prediction would be based on both the prior and the first score.

As more data is acquired the frequentist and Bayesian approaches converge to the same result.

Conceptually priors can be though of as virtual data. A bit like inserting a few records at the beginning of our data. These virtual records could be based on similar historical results or if we do not have historical data we may use our judgement.

There are a number of common priors used in practice. They are typically

conjugate, simply meaning that when plugged into Bayes' theorem, the resulting distribution/model is of the same form as the original. Examples of conjugate priors are theBetadistribution (conjugate of the Bernoulli), its extension theDirichletdistribution (conjugate of the Multinoulli) and thenormal inverse Wishart(conjugate of the multi-variate Gaussian/Normal).