Markov Chain Properties

In the previous post, I gave the definition of a Markov chain. As the page I introduced there shows, a Markov chain can be represented by a directed graph: each node represents a state, and each edge represents the transition probability from one state to another.

f:id:nakaly:20170218030945p:plain:w200

There are many possible directed graphs, and therefore many possible Markov chains. To understand Markov chains more deeply, it helps to categorise them by their statistical characteristics.

One of the most important statistical properties of a Markov chain is ergodicity. To understand ergodicity, there are some other properties you need to understand first.

Reducibility

A Markov chain is said to be irreducible if it is possible to get from any state to any other state.

f:id:nakaly:20170218032246p:plain:w700
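As a quick illustration, here is a minimal sketch in Python (the helper is_irreducible and the example matrix are my own, not from the original post) that checks irreducibility by testing whether every state can reach every other state through the positive entries of the transition matrix:

import numpy as np

def is_irreducible(P):
    """Check that every state can reach every other state."""
    P = np.asarray(P)
    n = P.shape[0]
    reach = (P > 0) | np.eye(n, dtype=bool)   # reachable in 0 or 1 steps
    for _ in range(n):                         # repeated squaring covers all path lengths
        reach = (reach.astype(int) @ reach.astype(int)) > 0
    return bool(reach.all())

# Every state can eventually reach every other state, so this chain is irreducible
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
print(is_irreducible(P))   # True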

Transience and recurrence

Given a state of a Markov chain, if there is a non-zero probability that you will never return to that state, the state is said to be transient.

If a state is not transient, it’s said to be recurrent.

f:id:nakaly:20170218040341p:plain:w200

The state A is transient even though it has a self-loop. There is some probability that you will return to A when you start from A, but you will never return to A once you move to the state B. B, C, and D are recurrent.

If the expected return time of a state is finite, the state is said to be positive recurrent.

To understand the expected return time, we first need to define the first return time of a state  i .

 \displaystyle T_i =  \inf \left\{ n\ge1: X_n = i \mid X_0 = i \right\}

We then define the probability that the first return happens at exactly  n steps as follows:

 \displaystyle f_{ii}^{(n)} = P(T_i = n)

Then we can define the expected return time  M_i as follows:

 \displaystyle M_i = \sum_{n=1}^{\infty} n\cdot f_{ii}^{(n)}
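To make this concrete, here is a minimal simulation sketch (the two-state chain and the function first_return_time are my own toy example) that estimates  M_i by averaging simulated first return times:

import random

def first_return_time(P, i):
    """Simulate the chain from state i until it first returns to i."""
    state, steps = i, 0
    while True:
        state = random.choices(range(len(P)), weights=P[state])[0]
        steps += 1
        if state == i:
            return steps

# Toy chain: from either state, move to each state with probability 1/2
P = [[0.5, 0.5],
     [0.5, 0.5]]

# Estimate M_0 as the average of many simulated first return times
samples = [first_return_time(P, 0) for _ in range(10000)]
print(sum(samples) / len(samples))   # close to 2 for this chain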

Periodicity

Given a state of a Markov chain, if every return to the state can only happen in a multiple of  k steps, k is said to be the period of the state. Formally, the period is the greatest common divisor of all possible return times.

f:id:nakaly:20170218042209p:plain:w500

If a state has period 1, the state is said to be aperiodic.
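As a small sketch (the function period and the alternating chain are my own example), the period of a state  i can be computed as the greatest common divisor of the step counts  n for which  (P^n)_{ii} \gt 0 :

import math
import numpy as np

def period(P, i, max_steps=50):
    """gcd of all n <= max_steps with (P^n)[i][i] > 0."""
    P = np.asarray(P)
    Pn = np.eye(len(P))
    g = 0
    for n in range(1, max_steps + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            g = math.gcd(g, n)
    return g

# A chain that alternates between two states deterministically
P = [[0.0, 1.0],
     [1.0, 0.0]]
print(period(P, 0))   # 2: every return to state 0 takes an even number of steps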

Ergodicity

If a state is aperiodic and positive recurrent, the state is said to be ergodic. If all states in an irreducible Markov chain are ergodic, the Markov chain is said to be ergodic.

Stationarity

One of the remarkable characteristics of an ergodic Markov chain is that it has a stationary distribution.

It means that regardless of the initial state, the proportion of time the chain spends in each state converges to a fixed value.

Let’s give a formal definition of the stationary distribution of a Markov chain.

Let  p_{ij} be the probability of a transition from a state  i to another state  j .

A Markov chain can be represented by a transition matrix.

  P = (p_{ij}) =

\begin{equation} \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{bmatrix} \end{equation}

If a Markov chain is ergodic, there is a vector  \pi such that:

 \pi = \pi P

 \pi is said to be the stationary distribution of the Markov chain.

 \pi_i , which is the proportion of time the chain spends in state  i , is related to the expected return time  M_i :

 \pi_i = \frac{C}{M_i} ( C is the normalizing constant.)
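As a minimal sketch (the matrix is a toy example of mine), the stationary distribution of an ergodic chain can be approximated by power iteration, repeatedly applying  \pi = \pi P until the vector stops changing:

import numpy as np

def stationary_distribution(P, iterations=1000):
    """Approximate pi satisfying pi = pi P by power iteration."""
    P = np.asarray(P)
    pi = np.full(len(P), 1.0 / len(P))   # start from the uniform distribution
    for _ in range(iterations):
        pi = pi @ P
    return pi

P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
print(pi)                   # approximately [0.8333, 0.1667]
print(pi @ np.asarray(P))   # the same vector: pi = pi P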

Bayesian Inference

Bayesian inference is a method for estimating an unknown (true) probability distribution from samples drawn from it.

When I read the Wikipedia article, I didn’t get the point.

But I found a better article that gave me a more intuitive understanding.

My understanding of Bayesian Inference consists of the following steps:

  1. Come up with a hypothesis about the target distribution, which is called the prior probability distribution. Use a uniform distribution as the prior if you have no hypothesis.
  2. Update the hypothesis based on samples (called evidence) from the target distribution. The updated hypothesis is called the posterior probability.

When updating the hypothesis, we use Bayes' theorem.

\displaystyle P(H\mid E) = \frac{P(E\mid H) \cdot P(H)}{P(E)} = \frac{P(E\mid H)}{P(E)} \cdot P(H)

 \displaystyle P(H) : the prior probability

 \displaystyle P(H\mid E) : the posterior probability

We can get the posterior probability  \displaystyle P(H\mid E) by multiplying the prior probability  \displaystyle P(H) by the value  \displaystyle \frac{P(E\mid H)}{P(E)} .

The difficult part of Bayesian Inference is calculating  \displaystyle P(E).

It can be calculated as follows:

 \displaystyle P(E) = \sum_{H}P(E|H) P(H) when the target distribution is discrete.

 \displaystyle P(E) = \int_{H} P(E \mid H)\, p(H)\, dH when the target distribution is continuous.

MCMC (Markov Chain Monte Carlo) is sometimes used to approximate it when the calculation is intractable.
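Here is a minimal sketch of the discrete case (the coin-flip setup and the grid of hypotheses are my own example, not from the original article): each hypothesis  H is a candidate coin bias, and we apply Bayes' theorem after each observed flip:

# Hypotheses: candidate biases of a coin (probability of heads)
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [1.0 / len(hypotheses)] * len(hypotheses)   # uniform prior

def update(prior, heads):
    """One Bayesian update for a single coin flip (True = heads)."""
    likelihood = [h if heads else 1.0 - h for h in hypotheses]
    # P(E) = sum over all hypotheses of P(E|H) * P(H)
    p_e = sum(l * p for l, p in zip(likelihood, prior))
    return [l * p / p_e for l, p in zip(likelihood, prior)]

posterior = prior
for flip in [True, True, False, True]:   # the observed evidence
    posterior = update(posterior, flip)
print(posterior)   # the probability mass shifts toward a bias around 0.7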

Implementing Box-Muller Transform

After writing the previous post, I think I understand the Box-Muller transform.

In this post, I implement it.

The original form of the Box-Muller transform is as follows:

 U_1, U_2 : random variables drawn from the uniform distribution on (0, 1)
 y_1 = \sqrt{-2 \ln U_1} \cos(2\pi U_2)
 y_2 = \sqrt{-2 \ln U_1} \sin(2\pi U_2)
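For reference, here is a direct sketch of this basic form (my own snippet, written in the same style as the polar-form implementation below):

import random
import math

u1 = 1.0 - random.random()   # in (0, 1], which keeps log(u1) defined
u2 = random.random()
r = math.sqrt(-2.0 * math.log(u1))
y1 = r * math.cos(2.0 * math.pi * u2)
y2 = r * math.sin(2.0 * math.pi * u2)
print(y1, y2)   # two independent samples from N(0, 1)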

However, it involves a square root, a logarithm, and trigonometric functions (sin/cos), which are costly to compute and can introduce numerical error. J. Bell proposed a more robust and faster way of sampling, called the polar form.

Unlike the original form, the polar form uses rejection sampling:

  1. Draw samples  x_1, x_2 from the uniform distribution on (-1, 1)
  2. Calculate  s = x_1^2 + x_2^2
  3. If  0 \lt s \lt 1 , go to 4; otherwise go back to 1
  4. Get samples  y_1, y_2 by calculating  y_1 = x_1 \sqrt{\frac{-2 \ln s}{s}} and  y_2 = x_2 \sqrt{\frac{-2 \ln s}{s}}

Python implementation

import random as rd
import math as ma
import seaborn as sns
import matplotlib.pyplot as plt

data = []
for _ in range(500):
    # Rejection step: keep only points strictly inside the unit circle
    while True:
        x1 = rd.uniform(-1, 1)
        x2 = rd.uniform(-1, 1)
        s = x1 ** 2 + x2 ** 2
        if 0.0 < s < 1.0:   # exclude s == 0 so that log(s) is defined
            break
    w = ma.sqrt((-2.0 * ma.log(s)) / s)
    y1 = x1 * w   # each accepted point yields two independent N(0, 1) samples
    y2 = x2 * w
    data.append(y1)
    data.append(y2)

sns.distplot(data)
plt.show()

The result:

f:id:nakaly:20170213050259p:plain:w500

Box-Muller Transformation

When I read the post, I got an intuitive understanding of the Box-Muller transformation because it has a lot of images.

However, I didn’t feel I fully understood the Box-Muller transformation, and I didn’t understand why we need to draw two samples for it.

I read the proof written by Box and Muller. It was difficult for me to understand due to my lack of statistics knowledge. For example:

If  U has a rectangular density (uniform distribution) on (0, 1), then  -2 \ln U has a chi-squared distribution with two degrees of freedom.

It’s probably trivial for statisticians, but I didn’t know the chi-squared distribution.

Then I googled the Box-Muller transformation again and found a proof that is easier to understand.

The proof uses basic algebraic manipulation and explains the important trick in the first paragraph: the reason why we need to draw two samples, in other words, why we need to consider the joint probability of two normal distributions.

Draw a sample from a normal distribution with mean 0 and variance 1

As mentioned in the previous post, if we can draw a sample  x from a normal distribution with mean 0 and variance 1, it is easy to transform it into a sample from a normal distribution with mean  \mu and variance  \sigma^{2} by calculating  \sigma x + \mu .
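In code, the rescaling is a one-liner (a minimal sketch; the helper standard_normal is my own name for a Box-Muller draw like the implementation above):

import random
import math

def standard_normal():
    """One Box-Muller draw from N(0, 1)."""
    u1, u2 = 1.0 - random.random(), random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

mu, sigma = 10.0, 2.0
y = sigma * standard_normal() + mu   # a sample from N(10, 4)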

Then I started googling how to draw a sample from a normal distribution with mean 0 and variance 1.

The Box-Muller transformation seems to be one of the most popular ways to do it.

According to the post I found, it doesn’t seem to be difficult. However, we need to know how to draw a sample from an exponential distribution first. Inverse transform sampling seems to be a way to do that.

According to the Inverse transform sampling page on wikipedia, the following method can be used.

The problem that the inverse transform sampling method solves is as follows:

  • Let X be a random variable whose distribution can be described by the cumulative distribution function F.
  • We want to generate values of X which are distributed according to this distribution.

The inverse transform sampling method works as follows:

  1. Generate a random number u from the standard uniform distribution in the interval [0,1].
  2. Compute the value x such that F(x) = u.
  3. Take x to be the random number drawn from the distribution described by F.

The most important thing is to know how to calculate the value  x such that  F(x) = u . You need to know the inverse function  F^{-1} of the cumulative distribution function  F , which is called the quantile function.

Actually, if we knew the quantile function of the normal distribution, we wouldn’t need the Box-Muller transformation.

However, the quantile function of the normal distribution has no closed-form representation in terms of basic algebraic functions.

On the other hand, the cumulative distribution function of an exponential distribution is  F(x)=1-e^{-\lambda x} , and its quantile function  x = F^{-1}(y) = -\frac{1}{\lambda}\ln(1-y) can easily be obtained by solving  y=F(x) .
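As a closing sketch (my own snippet), drawing from an exponential distribution with inverse transform sampling uses exactly this quantile function:

import random
import math

def sample_exponential(lam):
    """Inverse transform sampling for an exponential distribution with rate lam."""
    u = random.random()                  # u from the standard uniform distribution
    return -math.log(1.0 - u) / lam      # x = F^{-1}(u)

samples = [sample_exponential(0.5) for _ in range(10000)]
print(sum(samples) / len(samples))       # close to the true mean 1 / 0.5 = 2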