Markov Chain Monte Carlo

One of the goals of my current study of statistics is understanding Markov chain Monte Carlo (MCMC).

I would like to write about one of the most popular MCMC algorithms, the Metropolis–Hastings algorithm. Like rejection sampling, it works by accepting or rejecting proposed samples.

Let  P(x) be the probability density of the target distribution  P that you would like to draw samples from.

Let  g(x_{n+1} | x_n) be the probability density of a proposal distribution  G that you already know how to draw samples from.

  1. Draw a sample  x_{n+1} from the proposal distribution  G, i.e. from  g(x_{n+1} | x_n) (the previous sample is  x_n)

  2. Accept the sample with probability  \min\left(1,\frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)}\right). Otherwise, reject it and keep  x_n as the next state.

    This means that you always accept the sample when  \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} \geq 1, and you accept it with probability  \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} otherwise.

  3. Repeat steps 1 and 2 until the distribution of the samples is close enough to the target.
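The steps above can be sketched in Python. This is a minimal sketch, not from the original post: it assumes a standard normal target (known only up to a normalizing constant) and a symmetric Gaussian random-walk proposal, for which the  g terms in the acceptance ratio cancel.

```python
import math
import random

def target_density(x):
    # Unnormalized standard normal density P(x); the normalizing
    # constant cancels in the acceptance ratio.
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, step=1.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Step 1: draw a proposal x_{n+1} from g(. | x_n),
        # here a Gaussian centered at the current state.
        proposal = x + rng.gauss(0.0, step)
        # Step 2: accept with probability min(1, P(x_{n+1}) / P(x_n));
        # the proposal is symmetric, so the g terms cancel.
        ratio = target_density(proposal) / target_density(x)
        if rng.random() < min(1.0, ratio):
            x = proposal  # accept
        # otherwise reject: keep x_n as the next state
        samples.append(x)
    return samples

# Step 3: run the chain long enough, then check the sample moments.
samples = metropolis_hastings(50_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With enough iterations the sample mean and variance should be close to 0 and 1, the moments of the standard normal target.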

Let’s think about why this algorithm works.

The reversible Markov chain that I explained in the previous post plays an important role in this algorithm.

Let’s assume that samples from the target distribution are generated by a reversible Markov chain whose stationary distribution is  \pi(x) = P(x).

Since the Markov chain is reversible, each transition ( x_n \to x_{n+1}) of the Markov chain fulfils the following detailed balance condition:

 P(x_n | x_{n+1})P(x_{n+1}) = P(x_{n+1} | x_n)P(x_n)

When  \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} \geq 1 is true, we always accept a sample.

 \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} \geq 1 \Leftrightarrow P(x_{n+1})g(x_n | x_{n+1}) \geq P(x_n)g(x_{n+1} | x_n)

If the proposal matched the target's transition exactly, we would have  P(x_{n+1})g(x_n | x_{n+1}) = P(x_n)g(x_{n+1} | x_n), but here the left-hand side is larger. It means the proposal distribution  g(x_{n+1} | x_n) causes the chain to move from  x_n to  x_{n+1} too rarely and from  x_{n+1} to  x_n too often. To compensate, whenever  G does propose  x_{n+1} in this case, we always accept it.

On the other hand, when  \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} < 1, we accept the sample with the probability  \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)}.

 \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} < 1 \Leftrightarrow P(x_{n+1})g(x_n | x_{n+1}) <  P(x_n)g(x_{n+1} | x_n)

It means the proposal distribution  g(x_{n+1} | x_n) causes the chain to move from  x_n to  x_{n+1} too often and from  x_{n+1} to  x_n too rarely.

We need to adjust the transition by using an acceptance probability  A(x_{n+1} | x_n) such that:

 P(x_{n+1})g(x_n | x_{n+1}) = A(x_{n+1} | x_n) P(x_n)g(x_{n+1} | x_n)

 A(x_{n+1} | x_n) = \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)}
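We can check this adjustment numerically. The density values below are arbitrary made-up numbers (my assumption, not from the derivation) chosen so that the chain would move from  x_n to  x_{n+1} too often; multiplying by  A restores detailed balance exactly.

```python
# Arbitrary example values (assumptions for illustration) where
# P(x_{n+1}) g(x_n | x_{n+1}) < P(x_n) g(x_{n+1} | x_n).
p_next, p_curr = 0.2, 0.5   # P(x_{n+1}), P(x_n)
g_rev, g_fwd = 0.3, 0.4     # g(x_n | x_{n+1}), g(x_{n+1} | x_n)

# Acceptance probability from the equation above.
A = (p_next / p_curr) * (g_rev / g_fwd)

# Detailed balance after the adjustment:
lhs = p_next * g_rev        # P(x_{n+1}) g(x_n | x_{n+1})
rhs = A * p_curr * g_fwd    # A(x_{n+1} | x_n) P(x_n) g(x_{n+1} | x_n)
assert abs(lhs - rhs) < 1e-12
assert A < 1                # a valid probability in this case
```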

This is my intuitive understanding of the Metropolis–Hastings algorithm.

A more formal explanation, following Wikipedia, is as follows:

 P(x_n | x_{n+1})P(x_{n+1}) = P(x_{n+1} | x_n)P(x_n) can be re-written as

 \frac{P(x_{n+1} | x_n)}{P(x_n | x_{n+1})} = \frac{P(x_{n+1})}{P(x_n)}

We draw a sample from the proposal distribution G and accept it with a certain probability  A(x_{n+1} | x_n) to simulate the target distribution P.

 P(x_{n+1} | x_n) = A(x_{n+1} | x_n) g(x_{n+1} | x_n)

Let’s insert this into the previous equation.

 \frac{A(x_{n+1} | x_n) g(x_{n+1} | x_n)}{A(x_n | x_{n+1}) g(x_n | x_{n+1})} = \frac{P(x_{n+1})}{P(x_n)}

  \Leftrightarrow  \frac{A(x_{n+1} | x_n) }{A(x_n | x_{n+1}) } = \frac{P(x_{n+1})g(x_n | x_{n+1})}{P(x_n)g(x_{n+1} | x_n)}

Then we need to choose an acceptance probability  A(x_{n+1} | x_n) which fulfils this condition. The choice used in the Metropolis–Hastings algorithm is as follows:

 A(x_{n+1} | x_n) =  \min\left(1,\frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)}\right)

This fulfils the previous condition because

When   \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} \geq 1, then  A(x_{n+1} | x_n) = 1 and  A(x_n | x_{n+1}) = \frac{P(x_n)}{P(x_{n+1})}\frac{g(x_{n+1} | x_n)}{g(x_n | x_{n+1})}.

When   \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)} < 1  \Leftrightarrow \frac{P(x_n)}{P(x_{n+1})}\frac{g(x_{n+1} | x_n)}{g(x_n | x_{n+1})} > 1, then  A(x_n | x_{n+1}) = 1 and  A(x_{n+1} | x_n) = \frac{P(x_{n+1})}{P(x_n)}\frac{g(x_n | x_{n+1})}{g(x_{n+1} | x_n)}.

In both cases, the ratio  \frac{A(x_{n+1} | x_n)}{A(x_n | x_{n+1})} = \frac{P(x_{n+1})g(x_n | x_{n+1})}{P(x_n)g(x_{n+1} | x_n)}, which is exactly the condition above.
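We can also verify that this choice of  A satisfies the ratio condition numerically. The density values below are arbitrary numbers made up for illustration.

```python
def acceptance(p_to, p_from, g_back, g_fwd):
    # Metropolis–Hastings acceptance probability
    # A(x_{n+1} | x_n) = min(1, P(x_{n+1}) g(x_n | x_{n+1})
    #                          / (P(x_n) g(x_{n+1} | x_n)))
    return min(1.0, (p_to * g_back) / (p_from * g_fwd))

# Arbitrary densities for the two states and the proposal (assumed values).
p_curr, p_next = 0.5, 0.2
g_fwd, g_rev = 0.4, 0.3   # g(x_{n+1} | x_n), g(x_n | x_{n+1})

a_fwd = acceptance(p_next, p_curr, g_rev, g_fwd)  # A(x_{n+1} | x_n)
a_rev = acceptance(p_curr, p_next, g_fwd, g_rev)  # A(x_n | x_{n+1})

# The ratio condition from the derivation: exactly one of the two
# acceptance probabilities is 1, and their ratio equals the target ratio.
assert abs(a_fwd / a_rev - (p_next * g_rev) / (p_curr * g_fwd)) < 1e-12
```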