Math Summaries

Basic Notions of Probability

In this post, we review the basic notions of probability that will be needed to study Stochastic Calculus.

Probability Space.

Probability theory provides a framework to study random experiments, that is, experiments whose exact outcome cannot be predicted accurately. Such random experiments involve a source of randomness. Elementary random experiments include the roll of a die and consecutive coin tosses. In applications, the source of randomness could be, for example, all the transactions that take place at the New York Stock Exchange in a given time period, the times of decay of radioactive material, or pseudorandom numbers generated by a computer.

The theoretical framework of a random experiment is as follows. The set of all possible outcomes of the random experiment is called the sample space and is denoted by $\Omega$. An outcome of the random experiment is then represented by an element $\omega$ of $\Omega$. An event is a subset of $\Omega$. An event can be described either by enumerating all the elements contained in the subset or by giving the properties that characterize the elements of the subset. Note that the empty set $\emptyset$ and $\Omega$ itself are subsets of $\Omega$ and are thus events.

Example 1. For the random experiment consisting of generating an ordered pair of $0$s and $1$s, the sample space is:

$$\Omega = \{(0,0),(0,1),(1,0),(1,1)\}$$

An example of an event is $A = \{(0,0),(0,1)\}$, that is, the set of outcomes whose first number is $0$. Note that $\Omega$ can also be written as $\{0,1\} \times \{0,1\}$, where $\times$ stands for the Cartesian product of the two sets (as in $\mathbb{R} \times \mathbb{R}$).

To quantify the randomness in a random experiment, we need the notion of probability on the sample space. In example 1, we can define a probability on $\Omega$ by

$$ \mathbf{P}(\{\omega\}) = 1/4 \quad \text{ for all } \omega \in \Omega $$

Since there are four outcomes in the sample space, this is the equiprobability on $\Omega$.

More generally, a probability on a general sample space $\Omega$ is defined as follows.


Definition. A probability $\mathbf{P}$ is a function on events of $\Omega$ with the following properties:

(1) For any event $A\subset \Omega$, the probability of the event, denoted $\mathbf{P}(A)$, is a number in $[0,1]$:

$$\mathbf{P}(A)\in [0,1]$$

(2) $\mathbf{P}(\emptyset) = 0$ and $\mathbf{P}(\Omega) = 1$.

(3) Additivity: If $A_1,A_2,A_3,\ldots$ is a countably infinite sequence of events in $\Omega$ that are mutually exclusive (disjoint), that is, $A_i \cap A_j = \emptyset$ whenever $i\neq j$, then

$$ \mathbf{P}(A_1 \cup A_2 \cup A_3 \ldots ) = \mathbf{P}(A_1) + \mathbf{P}(A_2) + \mathbf{P}(A_3) + \ldots$$


It is not hard to check that if $\Omega$ has a finite number of outcomes, then the equiprobability is defined by:

$$\mathbf{P}(A) = \frac{n(A)}{n(\Omega)}$$

where $n(A)$ stands for the cardinality of the set $A$. It is very important to keep in mind that many different probability functions can be defined on a given sample space $\Omega$.
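As a concrete illustration, here is a minimal Python sketch (the variable names and the choice of event are illustrative, not part of the text) that computes the equiprobability of the event $A$ from Example 1 by counting outcomes:

```python
from fractions import Fraction

# Sample space of Example 1: ordered pairs of 0s and 1s.
omega = {(0, 0), (0, 1), (1, 0), (1, 1)}

# Event A: outcomes whose first number is 0.
A = {w for w in omega if w[0] == 0}

def equiprob(event, sample_space):
    """Equiprobability P(A) = n(A) / n(Omega) on a finite sample space."""
    return Fraction(len(event), len(sample_space))

print(equiprob(A, omega))  # 1/2
```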

The defining properties of a probability $\mathbf{P}$ have some simple consequences.


Proposition. The defining properties of a probability $\mathbf{P}$ on $\Omega$ imply the following:

(1) Finite additivity: If two events $A,B$ are disjoint, then $\mathbf{P}(A \cup B) = \mathbf{P}(A) + \mathbf{P}(B)$.

(2) For any event $A$, $\mathbf{P}(A^C) = 1 - \mathbf{P}(A)$.

(3) For any events $A,B$, $\mathbf{P}(A \cup B) = \mathbf{P}(A) + \mathbf{P}(B) - \mathbf{P}(A \cap B)$.

(4) Monotonicity: If $A \subseteq B$, $\mathbf{P}(A) \leq \mathbf{P}(B)$.
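To see these rules in action, the sketch below (continuing the same toy sample space; the events $A$ and $B$ are illustrative choices) checks the complement rule, inclusion-exclusion, and monotonicity for the equiprobability:

```python
from fractions import Fraction

omega = {(0, 0), (0, 1), (1, 0), (1, 1)}

def P(event):
    # Equiprobability on the finite sample space omega.
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 0}  # first number is 0
B = {w for w in omega if w[1] == 0}  # second number is 0

# (2) Complement rule: P(A^c) = 1 - P(A).
assert P(omega - A) == 1 - P(A)

# (3) Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
assert P(A | B) == P(A) + P(B) - P(A & B)

# (4) Monotonicity: A & B is a subset of A, so P(A & B) <= P(A).
assert P(A & B) <= P(A)
```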


A less elementary albeit very useful property is the continuity of a probability.


Lemma. (Continuity of Probability) Let $\mathbf{P}$ be a probability on $\Omega$. If $A_1$, $A_2$, $A_3$, $\ldots$ is an infinite sequence of increasing events, that is,

$$A_1 \subseteq A_2 \subseteq A_3 \subseteq \ldots$$

then

$$\mathbf{P}(A_1 \cup A_2 \cup A_3 \cup \ldots ) = \lim_{n \to \infty} \mathbf{P}(A_n)$$

Similarly, if $A_1,A_2,A_3,\ldots$ is an infinite sequence of decreasing events, that is,

$$A_1 \supseteq A_2 \supseteq A_3 \supseteq \ldots $$

then

$$\mathbf{P}(A_1 \cap A_2 \cap A_3 \cap \ldots) = \lim_{n \to \infty}\mathbf{P}(A_n)$$


Remark. Note that the limit of $\mathbf{P}(A_n)$ as $n \to \infty$ exists a priori: by the monotonicity property, the sequence of numbers $\mathbf{P}(A_n)$ is increasing when the events are increasing (and decreasing when the events are decreasing), and it is bounded below by $0$ and above by $1$. So, by the Monotone Convergence Theorem (MCT) for sequences, $(\mathbf{P}(A_n))$ converges.

Proof.

The continuity for decreasing events follows from the claim for increasing events by taking the complement. For increasing events, this is a consequence of the additivity of a probability. We construct a sequence of mutually exclusive events from the $A_n$’s. Let’s take the set $A_0 = \emptyset$. Take $B_1 = A_1 = A_1 \setminus A_0$. Then consider $B_2 = A_2 \setminus A_1$ so that $B_1$ and $B_2$ do not intersect. More generally, it suffices to take $B_n = A_n \setminus A_{n-1}$ for $n \geq 1$. By construction, the $B_n$’s are disjoint. Moreover, we have $\cup_{n \geq 1} B_n = \cup_{n \geq 1}A_n$. Therefore, we have:

$$\mathbf{P}(A_1 \cup A_2 \cup A_3 \ldots)= \mathbf{P}(B_1 \cup B_2 \cup B_3 \ldots )= \mathbf{P}(B_1) + \mathbf{P}(B_2) + \mathbf{P}(B_3) + \ldots$$

We want to show that the above infinite series $\sum_{n=1}^{\infty}\mathbf{P}(B_n)$ converges to $\lim_{n \to \infty}\mathbf{P}(A_n)$.

For each $n \geq 1$, since $A_{n-1} \subseteq A_n$, we have $\mathbf{P}(B_n) = \mathbf{P}(A_n \setminus A_{n-1}) = \mathbf{P}(A_n) - \mathbf{P}(A_{n-1})$. Therefore, the partial sum can be written as a telescoping sum:

\begin{align*}S_N &= \sum_{n=1}^{N}\mathbf{P}(B_n) \\ &= \mathbf{P}(A_1) - \mathbf{P}(A_0)\\ &+ \mathbf{P}(A_2) - \mathbf{P}(A_1) \\ &+ \ldots\\ &+ \mathbf{P}(A_N) - \mathbf{P}(A_{N-1}) \\ &= \mathbf{P}(A_N) \end{align*}

Passing to the limit on both sides,

$$\sum_{n=1}^{\infty}\mathbf{P}(B_n) = \lim_{N \to \infty} S_N = \lim_{N \to \infty}\mathbf{P}(A_N)$$

which, combined with the previous display, completes the proof.

We have thus shown that the probability function $\mathbf{P}$ is a continuous set function.
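As a numerical illustration of the lemma, here is a minimal sketch assuming the uniform probability on $[0,1]$ (where the probability of an interval is its length, as in the remark below); the increasing events $A_n = (0, 1 - 1/n]$ have union $(0,1)$, and the computed values $\mathbf{P}(A_n)$ indeed approach $1$:

```python
# Continuity of probability, illustrated under the uniform probability
# on [0,1], where P((a,b]) = b - a (an assumed setup for illustration).
# Take the increasing events A_n = (0, 1 - 1/n]; their union is (0, 1),
# which has probability 1.

def P_interval(a, b):
    """Uniform probability of the interval (a, b] inside [0, 1]."""
    return b - a

probs = [P_interval(0.0, 1.0 - 1.0 / n) for n in range(1, 10_001)]

# P(A_n) = 1 - 1/n increases towards 1 = P(A_1 union A_2 union ...).
print(probs[0], probs[99], probs[-1])  # 0.0 0.99 0.9999
```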

Remark. If $\Omega$ is finite or countably infinite (that is, its elements can be enumerated, as with $\mathbb{N}$, $\mathbb{Z}$ or $\mathbb{Q}$), then a probability can always be defined on every subset of $\Omega$. If $\Omega$ is uncountable, however (such as $\mathbb{R}$, $[0,1]$, $2^{\mathbb{N}}$), there might be subsets on which the probability cannot be defined. For example, let $\Omega = [0,1]$ and consider the uniform probability $\mathbf{P}((a,b]) = b - a$ for $(a,b] \subset [0,1]$. The probability is then the length of the interval. It turns out that there exist subsets of $[0,1]$, such as the non-measurable sets constructed using the axiom of choice (Vitali sets), for which this probability does not make sense! In other words, there are subsets of $[0,1]$ for which the concept of length does not have meaning.

For the reasons laid out in the remark above, to develop a consistent probability theory, it is necessary to restrict the probability to good subsets of the sample space on which the probability is well-defined. In probability terminology, these subsets are said to be measurable. In this blog, measurable subsets will simply be called events. Let's denote by $\mathcal{F}$ the collection of events of $\Omega$ on which $\mathbf{P}$ is defined. It is good to think of $\mathcal{F}$ as the domain of the probability $\mathbf{P}$. Of course, we want the probability to be defined when we take basic operations on events, such as unions and complements. Because of this, it is reasonable to demand that the collection of events $\mathcal{F}$ has the following properties:

(1) The sample space $\Omega$ is in $\mathcal{F}$.

(2) If $A$ is in $\mathcal{F}$, then its complement $A^C$ is in $\mathcal{F}$.

(3) If $A_1, A_2, A_3, \ldots$ is a countable sequence of events in $\mathcal{F}$, then the union $A_1 \cup A_2 \cup A_3 \cup \ldots$ is in $\mathcal{F}$.

A collection of subsets of $\Omega$ with the above properties is called a sigma-field of $\Omega$. We will go back to this notion at length later, when we study martingales.

Example. (The power set of $\Omega$). What is an appropriate sigma-field for the sample space $\Omega = \{0,1\} \times \{0,1\} = \{(0,0),(0,1),(1,0),(1,1)\}$? In this simple case, there is a finite number of subsets of $\Omega$. In fact, there are $2^4 = 16$. The set of all subsets of a sample space $\Omega$ is called the power set and is usually denoted by $\mathcal{P}(\Omega)$. There is no problem in defining a probability on each of the subsets of $\Omega$ here by simply adding up the probabilities of each outcome in a particular event, since there can only be a finite number of outcomes in an event. More generally, if $\Omega$ is countable, finite or infinite, a probability can always be defined on all subsets of $\Omega$ by adding the probabilities of the outcomes. Note that the power set $\mathcal{P}(\Omega)$ is a sigma-field, as it satisfies all the properties above.
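Here is a short sketch (illustrative names only) that enumerates the $2^4 = 16$ subsets of $\Omega$ and defines a probability on every event by summing the probabilities of its outcomes, exactly as described above:

```python
from fractions import Fraction
from itertools import chain, combinations

omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
p_outcome = {w: Fraction(1, 4) for w in omega}  # equiprobable outcomes

def power_set(xs):
    """All subsets of xs, from the empty set up to xs itself."""
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

events = list(power_set(omega))
print(len(events))  # 16

# Define P on every event by adding up the probabilities of its outcomes.
P = {event: sum(p_outcome[w] for w in event) for event in events}
print(P[()])            # 0  (empty set)
print(P[tuple(omega)])  # 1  (whole sample space)
```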


Definition. A probability space $(\Omega,\mathcal{F},\mathbf{P})$ is a triplet consisting of a sample space $\Omega$, a sigma-field $\mathcal{F}$ of events of $\Omega$, and a probability function $\mathbf{P}$ that is well-defined on the events in $\mathcal{F}$.


Random Variables and their Distributions.

A random variable $X$ is a function from a sample space $\Omega$ taking values in $\mathbb{R}$. (To be precise, this definition needs to be slightly refined). This is written in mathematical notation as

\begin{align*} X : \Omega &\to \mathbb{R}\\ \omega &\mapsto X(\omega) \end{align*}

Example. Consider Example 1. An example of a random variable is the function $X$ that gives the number of $0$s in the outcome $\omega$. For example, if $\omega = (0,1)$, then $X(\omega)$ is equal to $1$, and if $\omega = (0,0)$, then $X(\omega) = 2$.

Remark. Why do we call $X$ a random variable when it is actually a function on the sample space? It goes back to the terminology of dependent variable for a function whose input is an independent variable. (Think of $y = f(x)$ in calculus.) Here, since the input of $X$ is random, we call the function $X$ a random variable.

How can we construct a random variable on $\Omega$? If we have a good description of the sample space, then we can build the function for each $\omega$. An important example of random variables that we can easily construct is an indicator random variable.

Example. (Indicator functions as random variables). Let $(\Omega,\mathcal{F},\mathbf{P})$ be a probability space. Let $A$ be some event in $\mathcal{F}$. We define the random variable called the indicator function of the event $A$ as follows:

\begin{align*} \mathbb{1}_A(\omega) = \begin{cases} 1 \quad & \text{ if } \omega \in A, \\ 0 \quad & \text{ if } \omega \notin A. \end{cases} \end{align*}

In words, $\mathbb{1}_A(\omega) = 1$ if the event $A$ occurs, that is, the outcome $\omega \in A$, and $\mathbb{1}_A(\omega) = 0$ if the event $A$ does not occur, that is, the outcome $\omega \notin A$.
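A minimal sketch of an indicator random variable in Python, reusing the ordered-pair sample space from Example 1 (the particular event is an illustrative choice):

```python
omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
A = {(0, 0), (0, 1)}  # event: the first number is 0

def indicator(event):
    """Return the indicator random variable 1_A as a function of the outcome."""
    return lambda w: 1 if w in event else 0

one_A = indicator(A)
print([one_A(w) for w in omega])  # [1, 1, 0, 0]
```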

More generally, if $X$ is a random variable on the probability space $(\Omega, \mathcal{F},\mathbf{P})$, then any reasonable function $g(X)$ is also a random variable. For example, if $g:\mathbb{R} \to \mathbb{R}$ is a continuous function (like $g(x) = x^2$ for example), then the composition $g(X)$ is also a random variable. Clearly, there are many different random variables that can be defined on a given sample space.

Consider a probability space $(\Omega,\mathcal{F},\mathbf{P})$ modelling some random experiment. It is often difficult to have a precise knowledge of all outcomes of $\Omega$ and of the specific function $X$. If the experiment is elementary, such as a die roll or coin tosses, then there is no problem in enumerating the elements of $\Omega$ and constructing random variables explicitly as we did before. However, in more complex modelling situations such as models of mathematical finance, the sample space might be very large and hard to describe. Furthermore, the detailed relations between the source of randomness and the observed output (for example, the Dow Jones index at closing on a given day) might be too complex to write down. This is one of the reasons why it is often more convenient to study the distribution of a random variable rather than to study the random variable as a function on the source of randomness.

To illustrate the notion of distribution, consider the probability space $(\Omega, \mathcal{F},\mathbf{P})$ with $\Omega = \{0,1\} \times \{0,1\}$, where all outcomes are equally likely and $\mathcal{F}$ is the collection of all subsets of $\Omega$. Take $X$ to be the random variable equal to the number of $0$s in the outcome. This random variable takes three possible values: $0,1,2$. The exact value of $X(\omega)$ is random since it depends on the input $\omega$. In fact, it is not hard to check that:

$$\mathbf{P}(\{\omega:X(\omega) = 0\}) = 1/4, \quad \mathbf{P}(\{\omega:X(\omega) = 1\}) = 1/2, \quad \mathbf{P}(\{\omega:X(\omega) = 2\}) = 1/4$$

Now, if one is only interested in the values of $X$ and not in the particular outcome, then $X$ can serve as a source of randomness itself. This source of randomness is in fact a probability on the possible values $\{0,1,2\}$. However, the outcomes here are not equiprobable! Therefore, the right way to think of the distribution of $X$ is as a probability function on $\mathbb{R}$.
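The following sketch re-derives the probabilities displayed above by pushing the equiprobability on $\Omega$ forward through $X$, the number of $0$s (the code names are illustrative):

```python
from collections import Counter
from fractions import Fraction

omega = [(0, 0), (0, 1), (1, 0), (1, 1)]

def X(w):
    """Number of 0s in the outcome w."""
    return w.count(0)

# The distribution of X assigns to each value the probability of the
# set of outcomes mapped to that value.
counts = Counter(X(w) for w in omega)
rho_X = {value: Fraction(n, len(omega)) for value, n in counts.items()}
print(rho_X)  # X=0 and X=2 each have probability 1/4, X=1 has probability 1/2
```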


Definition. Consider a probability space $(\Omega,\mathcal{F},\mathbf{P})$ and a random variable $X$ on it. The distribution of $X$ is a probability on $\mathbb{R}$ denoted by $\rho_X$ such that for any interval $(a,b]$ in $\mathbb{R}$, $\rho_X((a,b])$ is given by the probability that $X$ takes value in $(a,b]$. In other words, we have:

$$\rho_X((a,b]) = \mathbf{P}(\{\omega \in \Omega:X(\omega) \in (a,b]\})$$


To lighten the notation, we write the events involving random variables by dropping the $\omega$’s. For example, the probability above is written:

$$\rho_X((a,b]) = \mathbf{P}(\{\omega \in \Omega : X(\omega) \in (a,b]\}) = \mathbf{P}(X \in (a,b])$$

But always keep in mind that a probability is evaluated on a subset of the elementary outcomes.

We stress that the distribution $\rho_X$ is a probability on subsets of $\mathbb{R}$.

Remark. (A refined definition of a random variable). In the above definition of distribution, how can we be sure that the event $\{\omega \in \Omega : X(\omega) \in (a,b]\}$ is in $\mathcal{F}$, so that the probability is well-defined? In general, we are not! This is why it is necessary to be more precise when building rigorous probability theory. With this in mind, the correct definition of a random variable $X$ is a function $X:\Omega \to \mathbb{R}$ such that for any interval $(a,b] \subset \mathbb{R}$, the pre-image $\{\omega \in \Omega:X(\omega) \in (a,b]\}$ is an event in $\mathcal{F}$. If we consider a function of $X$, say $g(X)$, then we must ensure that events of the form $\{\omega \in \Omega:g(X(\omega)) \in (a,b]\}= \{\omega \in \Omega:X(\omega) \in g^{-1}((a,b])\}$ are in $\mathcal{F}$. This is the case in particular if $g$ is continuous. More generally, a function whose pre-images of intervals $(a,b]$ are reasonable subsets of $\mathbb{R}$ is called Borel measurable. In this blog, whenever we write $g(X)$, we will always assume that $g$ is Borel measurable.

Note that by the properties of a probability, we have for any interval $(a,b]$ :

\begin{align*} \mathbf{P}(X \in (a,b]) &= \mathbf{P}(X \in (-\infty,b] \setminus (-\infty,a]) \\ &=\mathbf{P}(X \in (-\infty,b]) - \mathbf{P}(X \in (-\infty,a]) \quad \text{(since } (-\infty,a] \subseteq (-\infty,b])\\ &= \mathbf{P}(X \leq b) - \mathbf{P}(X \leq a) \end{align*}

This motivates the following definition.


Definition. The cumulative distribution function (CDF) of a random variable $X$ on a probability space $(\Omega,\mathcal{F},\mathbf{P})$ is a function $F_X : \mathbb{R} \to [0,1]$ defined by:

$$ F_X(x) = \mathbf{P}(X \leq x) $$


Note that we also have, by definition, that $F_X(x) = \rho_X((-\infty,x])$. Clearly, if we know $F_X$, we know the distribution of $X$ for any interval $(a,b]$. It turns out that the CDF determines the distribution of $X$ for any (Borel measurable) subset of $\mathbb{R}$. These distinctions will not be important at this stage; only the fact that the CDF characterizes the distribution of a random variable will be important to us.
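Continuing the running example, here is a short sketch of the CDF $F_X(x) = \mathbf{P}(X \leq x)$ for the number-of-$0$s random variable (illustrative code, not from the text):

```python
from fractions import Fraction

omega = [(0, 0), (0, 1), (1, 0), (1, 1)]

def X(w):
    return w.count(0)  # number of 0s in the outcome

def F_X(x):
    """CDF of X under the equiprobability: F_X(x) = P({w : X(w) <= x})."""
    favourable = [w for w in omega if X(w) <= x]
    return Fraction(len(favourable), len(omega))

print(F_X(-1), F_X(0), F_X(1), F_X(2))  # 0 1/4 3/4 1
```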


Proposition. Let $(\Omega,\mathcal{F},\mathbf{P})$ be a probability space. If two random variables $X$ and $Y$ have the same CDF, then they have the same distribution. In other words,

$$\rho_X(B) = \rho_Y(B)$$

for any (Borel measurable) subset $B$ of $\mathbb{R}$.


We will not prove this result here, but we will use it often. Here are some important examples of distributions and their CDF.

Example.

(i) Bernoulli Distribution. A random variable $X$ has Bernoulli distribution with parameter $0 \leq p \leq 1$ if the CDF is

\begin{align*} F_X(x) = \begin{cases} 0 \quad & \text{ if } x < 0, \\ 1 - p \quad & \text{ if } 0 \leq x < 1,\\ 1 \quad & \text{ if } x \geq 1. \end{cases} \end{align*}

A Bernoulli random variable takes the value $0$ with probability $1 - p$, and it takes the value $1$ with probability $p$.

(ii) Binomial distribution. A random variable $X$ is said to have the binomial distribution with parameters $0 \leq p \leq 1$ and $n \in \mathbb{N}$ if

$$\mathbf{P}(X = k) = {n \choose k} p^k (1 - p)^{n-k}, \quad k=0,1,\ldots,n$$

In this case, the CDF is

\begin{align*} F_X(x) = \begin{cases} 0 & \text{ if } x < 0, \\ \sum_{j=0}^k {n \choose j} p^j (1-p)^{n-j} & \text{ if } k \leq x < k+1 \text{ for } k = 0,1,\ldots,n-1,\\ 1 & \text{ if } x \geq n. \end{cases} \end{align*}
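As a sanity check on the formula above, here is a minimal sketch that evaluates the binomial CDF by summing the probability mass function up to $\lfloor x \rfloor$ (the parameter values $n = 4$, $p = 0.5$ are arbitrary illustrative choices; the Bernoulli case is recovered with $n = 1$):

```python
from math import comb, floor

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(x, n, p):
    """F_X(x) = P(X <= x): sum the PMF over k = 0, ..., floor(x)."""
    if x < 0:
        return 0.0
    k = min(floor(x), n)
    return sum(binomial_pmf(j, n, p) for j in range(k + 1))

print(binomial_cdf(2.5, 4, 0.5))  # 0.6875 = (1 + 4 + 6) / 16
print(binomial_cdf(10, 4, 0.5))   # 1.0
```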

(iii) Poisson distribution. A random variable $X$ has Poisson distribution with