Generating random samples pt. 3 – Metropolis-Hastings algorithm

So last time I talked about rejection sampling for generating random samples of a specific distribution, starting from another distribution. We talked about what made it suck, and really, if you are going to do rejection sampling, you might as well head straight to Metropolis-Hastings; it’s basically an improved version with no real downsides compared to rejection sampling.

So the Metropolis-Hastings algorithm is another Monte Carlo method of generating samples, and really, it’s basically rejection sampling with a minor change. So I’ll go over the gist of it, and then point out the change, and why Metropolis-Hastings even works.
The algorithm is as follows.
Given a target PDF f(x) and a SYMMETRIC proposal density g(x|x_{-1}) (symmetric meaning g(x|x_{-1}) = g(x_{-1}|x)) that we are drawing samples from, where g is conditioned on the previous sample we had accepted,
1. Get a sample x from g(x|x_{-1}), and a sample from a uniform random variable between 0 and 1, i.e. Y \sim U(0,1), which we MIGHT be using to decide whether to keep or reject the new sample.
2. Check if we keep the sample; if \frac{f(x)}{f(x_{-1})} \ge 1, such that the PDF of the new sample is larger than the previous one, we keep our new sample straight up. Otherwise, we need to do a check against our uniformly distributed value; if our uniformly distributed random sample Y \le \frac{f(x)}{f(x_{-1})}, then we keep the new sample x. Otherwise, we reject x.
3. If we kept the new sample, x_{-1} is replaced with the x we just got. Go back to step 1 and keep doing it until we have enough values.

A common example of g(x|x_{-1}) is a Gaussian distribution with a mean of x_{-1}. Then the resulting process will be something of a random walk, which very nicely lines up with the Markov chain proof of correctness for this algorithm. I’m not going to prove it is right, but there certainly is a proof; this process is pretty old.
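Just to make the steps concrete, here’s a minimal sketch of a random-walk Metropolis sampler in Python with numpy. The target density, the proposal width, and the burn-in length are arbitrary choices for illustration, not part of the algorithm itself; also note that a single uniform draw handles both cases of step 2, since a ratio of at least 1 always passes the check.

import numpy as np

def metropolis_sample(f, x0, num_samples, prop_std=1.0, burn_in=1000):
    # f is the target density (only needed up to a scaling constant),
    # x0 is an arbitrary starting point, prop_std is the Gaussian proposal width
    samples = np.zeros(num_samples)
    x_prev = x0
    for ind in range(num_samples + burn_in):
        # propose from a symmetric Gaussian centered on the last accepted sample
        x_new = x_prev + prop_std*np.random.randn()
        # accept if the density ratio beats a uniform draw
        if np.random.rand() <= f(x_new)/f(x_prev):
            x_prev = x_new
        # after burn-in, record the current sample (repeats count too)
        if ind >= burn_in:
            samples[ind - burn_in] = x_prev
    return samples

# example: sample from an (unnormalized) standard normal
samps = metropolis_sample(lambda x: np.exp(-0.5*x*x), 0.0, 5000)
print(np.mean(samps))   # should be near 0
print(np.var(samps))    # should be near 1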

So what’s the big deal? We changed our criterion for keeping samples. On inspection, you’ll note that we don’t need a dumb scaling constant anymore, and assuming we chose nice source and target distributions to work with, our acceptance rates will be much nicer than before.

But you might think, hey, if we accept more stuff like this, how can we guarantee Metropolis-Hastings even gives us samples pertaining to our target distribution?

In truth, the Metropolis-Hastings algorithm is loosely constructing a Markov chain system like the ones we’ve suggested before. What we are doing is moving from state to state as we accept samples, and our Markov chain is constructed in such a way that the stationary distribution of the Markov chain will be the target distribution! Of course, for a target PDF, since it’s a continuous function, this is not intuitive to see. But working with a discrete PMF, say, something trivial like simulating a fair 6-sided die by choosing a conditional g(x|x_{-1}) such that each roll is biased (evenly, like the PMF is simply being shifted over a single value at a time from 1 to 6) more towards the number last rolled, it’s very possible to construct the transition matrix and see that the end stationary distribution of the system will be that of a fair die. An example would be a transition matrix like this…

\begin{bmatrix} .5 & .1 & .1 & .1 & .1 & .1 \\ .1 & .5 & .1 & .1 & .1 & .1 \\ .1 & .1 & .5 & .1 & .1 & .1 \\ .1 & .1 & .1 & .5 & .1 & .1 \\ .1 & .1 & .1 & .1 & .5 & .1 \\ .1 & .1 & .1 & .1 & .1 & .5 \end{bmatrix}

Even though the die is a biased turd, if we generated dice rolls over time with this transition matrix, the stationary distribution in the end would be a nice uniform fair die.
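If you want to convince yourself, here’s a quick sanity check you could run in numpy (a sketch; the lopsided starting vector and the iteration count are arbitrary):

import numpy as np

# the biased-die transition matrix from above: .5 on the diagonal, .1 elsewhere
A = np.ones((6, 6))*0.1 + np.eye(6)*0.4
# start from a deliberately lopsided distribution over the six faces
x = np.array([1., 0., 0., 0., 0., 0.])
for step in range(100):
    x = np.dot(x, A)
print(x)   # approaches [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]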

So what are the downsides to this? Well, samples tend to be correlated with their neighboring samples, so if you want (roughly) uncorrelated samples, you sort of have to thin them out (only keep every Nth sample) or give the samples a good shuffle. Additionally, the first bunch of samples generated clearly won’t be distributed as we’d want because the Markov chain hasn’t had time to hit a stationary distribution, so expect to have a burn-in period of running the algorithm, throwing things away, and then getting samples to use.

Hell, I’m lazy, you can fix my previous function to work with Metropolis-Hastings if you really want, the change is trivial and my previous function wasn’t great anyways. I wrote it in maybe a couple minutes.

Next time, if I get the chance, I will discuss Gibbs sampling and wrap up what I want to say on generating samples of a target distribution.

On stationary distributions of discrete Markov Chains

I feel like I need to publish something quick, so I’ll just discuss stationary distributions for Markov Chains real quick before my orientation stuff at Cisco in California. I don’t necessarily know if I have time to make a post while I’m there.

We’ve discussed Markov Chains previously. Given a set of states, say n states such that S = \{s_{1}, s_{2}, \hdots, s_{n}\} where S is the set of all states we care about, we have a transition matrix that is n x n elements, say a matrix A like the following:
A = \begin{bmatrix} a_{1,1} & a_{1,2} & \hdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \hdots & a_{2,n} \\ \vdots & \vdots & & \vdots \\ a_{n,1} & a_{n,2} & \hdots & a_{n,n} \end{bmatrix}
where the transition matrix entry a_{i,j} is the probability of being AT state i, and moving to state j. REMEMBER THIS, because some books like to refer to it the other way around, like i is the target state and j is the previous state, and that changes our notation up entirely here.

For simplicity, you can assume our Markov Chain is time-invariant, meaning the probabilities in the transition matrix don’t change over time. I believe what we are discussing here will hold without that assumption, but definitely not in an obvious and conveniently calculable way; there’s going to be symbols and stuff, and you’d probably have to prove that the Markov chain remains positive recurrent over time, weird shit, let’s not get into that. Let’s assume we have a nice matrix full of constants that don’t change with each time step.

So say we have a starting distribution of what state we might be in, a row vector X_{0} = [x_{1}, x_{2}, \hdots, x_{n}]. The subscript of X indicates which time step we’re talking about, and 0 is obviously the start. We can find the probability of being in whatever state at time step t by doing the following
X_{t} = X_{0}A^{t}, so we have a row vector multiplied by our A transition matrix to get us another row vector with the changed probabilities of being at each state at whatever time step. So that’s cool, right? Very handy in terms of calculation. Assuming your transition matrix is legit, like the rows add up to 1 properly, then your X elements should also sum up to 1 all the time. You can prove this but it takes some work, just accept it for now.

So a stationary distribution is when we have some row vector Y such that Y = YA, so when you multiply the row vector Y with the transition matrix, there is no change to the elements of Y; the probabilities remain the same. This stationary distribution typically comes up as a limiting distribution where Y = \lim_{t \to \infty}X_{0}A^{t}. So yeah, if you want to brute force your way to a stationary distribution, you can calculate it on a computer by doing matrix multiplies over and over until hitting some convergence point.

But linear algebra people should recognize that Y here is actually a LEFT eigenvector! You can solve for this simply by solving for the (right) eigenvectors of A^{T}, the transpose of the transition matrix, and finding the eigenvector associated with eigenvalue 1, and normalizing it such that it is a legit discrete probability vector. So there’s a clear relation of linear algebra to Markov chains.
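As a sketch of both routes, using a made-up 3-state transition matrix purely for illustration:

import numpy as np

# an arbitrary 3-state transition matrix (rows sum to 1)
A = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.7, 0.2],
              [0.2, 0.3, 0.5]])

# brute force: keep multiplying a starting row vector by A
x = np.array([1.0, 0.0, 0.0])
for t in range(200):
    x = np.dot(x, A)

# linear algebra: right eigenvectors of A^T are left eigenvectors of A
vals, vecs = np.linalg.eig(A.T)
y = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
y = y / np.sum(y)   # normalize into a legit probability vector

print(x)
print(y)   # the two should agree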

Obviously, there has to be a catch to this bit, right? So there are three things that could happen with stationary Markov chains.
1. unique stationary distribution
2. multiple unique final distributions
3. oscillating/alternating final distributions

A unique stationary distribution that the chain actually converges to is only guaranteed if the MC is irreducible, aperiodic, and positive recurrent (for a finite chain, irreducible and aperiodic is enough). What does that even mean? The simplest way to check it is by making sure each state is reachable from every other state, i.e. there are no transient states and everything communicates, and that the possible return times of each state back to itself have a greatest common divisor of 1 (like if a state could get back to itself in as few as 2 steps, but could also do it in 3 steps, then the GCD is 1 and the state is aperiodic). In linear algebra terms, this effectively guarantees that the geometric multiplicity of eigenvalue 1 is 1 (there’s only 1 eigenvector associated with eigenvalue 1), and that there’s no weird thing involving eigenvalue -1 or other eigenvalues sitting on the unit circle.

How do 2 and 3 even occur then? For case 2, if you effectively have transient states feeding into separate closed groups of states, or just completely separable Markov chains you happened to put into the same matrix (like with 5 states, if states 1-3 communicated amongst themselves and 4 and 5 communicated only between themselves), the final distribution you end up with depends on where you started, so there is more than 1 possible answer at the end. In case 3, if there’s some periodicity going on, then clearly it’s possible that the Markov chain will keep alternating with each time step and not actually be stationary by the end, though it will have converged to a repeating set of probability vectors. Generally, you may end up with an eigenvector for eigenvalue 1, and then an eigenvector for eigenvalue -1 that will represent the change in the distribution with every time step.
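For a tiny example of case 3, consider a toy 2-state chain that swaps state at every step; its eigenvalues are 1 and -1, and the distribution never settles down:

import numpy as np

# a 2-state chain that flips state every time step (period 2)
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(np.linalg.eigvals(A.T))   # eigenvalues 1 and -1

x = np.array([1.0, 0.0])
for t in range(5):
    x = np.dot(x, A)
    print(x)   # alternates between [0, 1] and [1, 0], never stationary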

Baum-Welch algorithm: Finding parameters for our HMM

Up to this point, I’ve discussed hidden Markov models, the Viterbi algorithm, and the forward-backward algorithm. This is all fun and great, but we’ve also made the assumption that we know or assume a lot of information about the HMM. In a lot of cases, we would like a way to estimate the transition matrix A and the observation matrix O, but how would we go about doing that?

The expectation-maximization algorithm can be discussed in detail later, but the Baum-Welch algorithm is considered to be an application of the EM algorithm for use with HMMs. The idea here is that we can start with some sort of prior A and O matrix, possibly a trivial one with completely uniform probabilities, and we have a set of observations from the HMM. We use the forward-backward algorithm to calculate the probabilities of being in each state at each time step, and then we use this estimate of probabilities to make a better estimate of the transition and observation matrices. Using these improved (hopefully) A and O matrices, we do the forward-backward algorithm again, and the cycle continues until the A and O matrices converge and we can call it a day.

As for why the expectation-maximization idea even works, the proof isn’t trivial (you can check Wikipedia), but essentially the EM algorithm tends to converge to some local solution that is feasible for the observations and prior distributions given, but may not be the actual parameters to the problem.

So let’s break down Baum-Welch. We split it up into expectation and maximization steps just to make it a point that we have two different parts.

Given things: n states, m observations, A transition matrix with n rows by n columns, O observation matrix with n rows by m columns, and k observations in a vector Y with elements Y_{j} for each time step 1 to k. A and O can be filled with appropriate uniform probability values if we really have no idea what should go in there.

Expectation step

1. Do forward-backward algorithm to get forward probabilities F_{i,j} and backward probabilities B_{i,j}, where the subscripts mean state i, time step j, where F and B are both 2D arrays with n rows and k+1 columns.

2. Finish forward-backward algorithm to get the final 2D array of probabilities P with n rows and k+1 columns, where elements P_{i,j} is the probability of being at state i at time j.

Maximization step

3. We need the probabilities of being at state \alpha at time j, and \beta at time j+1, which we can find using the following calculation.

\theta_{\alpha,\beta,j} = \frac{ F_{\alpha,j} A_{\alpha, \beta} B_{\beta, j+1} O_{\beta, Y_{j+1} } } {\sum_{c=1}^{n} \sum_{d=1}^{n} F_{c,j} A_{c,d} B_{d,j+1} O_{d,Y_{j+1} } }

You can see that the formula is the probability of being at state \alpha at time j, times the transition probability from \alpha to \beta, the probability of being at state \beta at time j+1, and the probability of the observation happening given we’re now at state \beta. You can store this as a 3D array or something.

4. Find the new transition matrix A by calculating each element

A_{\alpha, \beta} = \frac{\sum_{t=0}^{k-1} \theta_{\alpha, \beta, t} }{\sum_{t=0}^{k-1} P_{\alpha,t} }

This can be summarized as the sum of probabilities of going from state \alpha to \beta at each time step over the sum of probabilities of just being in state \alpha (and going anywhere for the next state).

5. Find the new observation matrix O by calculating each element O_{\alpha, x} = \frac{ \sum_{t: Y_{t}=x} P_{\alpha,t}}{\sum_{t=0}^{k} P_{\alpha,t} }. This is calculating the probabilities of being in a state at the times that the observation x happened divided by the probabilities that we are in that state at any time.

6. Go back to the expectation step, and repeat until convergence.

Below, I have python code for doing the Baum-Welch algorithm. Note that I use observations similar to before, so we would expect our observation and probability matrices to not be the same as before because of my artificially manufactured observations that are stuck on a single state. Note also that above, I mentioned the solution will hit a local best solution, not necessarily what you wanted to see.


import numpy as np
# functions and classes go here
def fb_alg(A_mat, O_mat, observ):
    # set up
    k = observ.size
    # observations need to be integers so we can use them as column indices
    observ = observ.astype(int)
    (n,m) = O_mat.shape
    prob_mat = np.zeros( (n,k) )
    fw = np.zeros( (n,k+1) )
    bw = np.zeros( (n,k+1) )
    # forward part
    fw[:, 0] = 1.0/n
    for obs_ind in xrange(k):
        f_row_vec = np.matrix(fw[:,obs_ind])
        fw[:, obs_ind+1] = f_row_vec * \
                           np.matrix(A_mat) * \
                           np.matrix(np.diag(O_mat[:,observ[obs_ind]]))
        fw[:,obs_ind+1] = fw[:,obs_ind+1]/np.sum(fw[:,obs_ind+1])
    # backward part
    bw[:,-1] = 1.0
    for obs_ind in xrange(k, 0, -1):
        b_col_vec = np.matrix(bw[:,obs_ind]).transpose()
        bw[:, obs_ind-1] = (np.matrix(A_mat) * \
                            np.matrix(np.diag(O_mat[:,observ[obs_ind-1]])) * \
                            b_col_vec).transpose()
        bw[:,obs_ind-1] = bw[:,obs_ind-1]/np.sum(bw[:,obs_ind-1])
    # combine it
    prob_mat = np.array(fw)*np.array(bw)
    prob_mat = prob_mat/np.sum(prob_mat, 0)
    # get out
    return prob_mat, fw, bw

def baum_welch( num_states, num_obs, observ ):
    # allocate
    A_mat = np.ones( (num_states, num_states) )
    A_mat = A_mat / np.sum(A_mat, 1, keepdims=True)
    O_mat = np.ones( (num_states, num_obs) )
    O_mat = O_mat / np.sum(O_mat, 1, keepdims=True)
    # observations need to be integers so we can use them as column indices
    observ = observ.astype(int)
    theta = np.zeros( (num_states, num_states, observ.size) )
    while True:
        old_A = A_mat
        old_O = O_mat
        A_mat = np.ones( (num_states, num_states) )
        O_mat = np.ones( (num_states, num_obs) )
        # expectation step, forward and backward probs
        P,F,B = fb_alg( old_A, old_O, observ)
        # need to get transitional probabilities at each time step too
        for a_ind in xrange(num_states):
            for b_ind in xrange(num_states):
                for t_ind in xrange(observ.size):
                    theta[a_ind,b_ind,t_ind] = \
                    F[a_ind,t_ind] * \
                    B[b_ind,t_ind+1] * \
                    old_A[a_ind,b_ind] * \
                    old_O[b_ind, observ[t_ind]]
        # form A_mat and O_mat
        for a_ind in xrange(num_states):
            for b_ind in xrange(num_states):
                A_mat[a_ind, b_ind] = np.sum( theta[a_ind, b_ind, :] )/ \
                                      np.sum(P[a_ind,:])
        # normalize each row so it sums to 1
        A_mat = A_mat / np.sum(A_mat, 1, keepdims=True)
        for a_ind in xrange(num_states):
            for o_ind in xrange(num_obs):
                right_obs_ind = np.array(np.where(observ == o_ind))+1
                O_mat[a_ind, o_ind] = np.sum(P[a_ind,right_obs_ind])/ \
                                      np.sum( P[a_ind,1:])
        # normalize each row so it sums to 1
        O_mat = O_mat / np.sum(O_mat, 1, keepdims=True)
        # compare
        if np.linalg.norm(old_A-A_mat) < .00001 and np.linalg.norm(old_O-O_mat) < .00001:
            break
    # get out
    return A_mat, O_mat

num_obs = 25
observations1 = np.random.randn( num_obs )
observations1[observations1>0] = 1
observations1[observations1<=0] = 0
A_mat, O_mat = baum_welch(2,2,observations1)
print observations1
print A_mat
print O_mat
observations2 = np.random.random(num_obs)
observations2[observations2>.15] = 1
observations2[observations2<=.85] = 0
A_mat, O_mat = baum_welch(2,2,observations2)
print observations2
print A_mat
print O_mat
A_mat, O_mat = baum_welch(2,2,np.hstack( (observations1, observations2) ) )
print A_mat
print O_mat

If you run the code as I have shown up here, you’ll notice some strange behavior involving our model. Because our model is so simplistic and we have absolutely no prior data on the coin bias and the tendency to switch, we set everything to uniform probability. For all three cases, the algorithm then makes the observation probabilities just the empirical ratio of what the observations were, which makes sense, and the transition matrix for the states remains uniform and untouched. Obviously, the Baum-Welch algorithm here basically ignores the fact that we have a hidden Markov model and goes for the maximum likelihood estimate of the coin bias, and reports that this is the observation probability for both states. This isn’t wrong, but it really may not be what you wanted to see either. The only real way to solve this is by giving the system a prior that helps solve whatever we are trying to do.

This would actually be less of a problem had we modeled our hidden Markov model with multivariate Gaussian mixtures for the observation probabilities instead of discrete values, say if we were observing noisy measurements sampled from a population or something. But here, it’s clear that the Baum-Welch algorithm gives us a solution; the usefulness in this gambling bit is just pretty limited.

Forward-backward Algorithm: Finding probability of states at each time step in an HMM

I’ve talked about Markov chains, hidden Markov models, and the Viterbi algorithm for finding the most probable path of states in a hidden Markov model. Moving on, let’s say we want to know the actual probability of each state at each time step of our process, given the observations. The algorithm to do this with is called the forward-backward algorithm.

So we have our hidden Markov model from before, transition matrix A, observation matrix O, with n states, m observations such that A is n rows by n columns, and O is n rows by m columns. Say our sequence of states that happened is X, with each time step denoted as x_{1}, x_{2}, \hdots, x_{k}, where k is the number of time steps that we are dealing with. We have a set of observations Y, also a vector k long, that we saw happen, and we want to know the probabilities of each state at each time step. By the magic of Bayes’ theorem, we can express our probability of being in state i at time j, given all of our k observations, to be

P( x_{j} = i | Y ) = P( x_{j} = i | Y_{1:j}, Y_{j+1:k} ) \propto P( Y_{j+1:k} | x_{j} = i ) P( x_{j} = i | Y_{1:j} )

The proportionality is there because there is some scaling denominator in Bayes’ theorem that should be there, but we can figure that out after the fact in our algorithm. We can see that the probability of being in state i at time j, given the observations we saw, can be broken up into two parts, one of which is the probability of a state given the observations up until that point times the probability of the observations after that point in time, given the state. We divide these up into a forward propagation step, and a backward propagation step, described using the following,

f_{j+1} =c_{f,j+1}^{-1} f_{j} A \hat{O}_{j+1}

b_{j} = c_{b,j}^{-1} A \hat{O}_{j+1} b_{j+1}

where f is a row vector of n elements containing our forward probabilities at whatever time step, and b is a column vector of n elements containing our backward probabilities at whatever time step, A is our transition matrix like before, and this new \hat{O} is a diagonal matrix that contains the probabilities of seeing the observation that we did for each of the states on the corresponding diagonal element. As mentioned before, we left out some scaling factor in our Bayes’ theorem, so we need to make sure at each time step, the probabilities of being in each state sum to 1. We can just sum the vector we get at each time step and divide the vector by that sum.

Why would we do this? Aside from it being part of the Baum-Welch algorithm, it should be pretty clear why we want to see the probability of each state at each time step. There may be cases where the best path as determined by the Viterbi algorithm or something is deemed incorrect, and it may be worth going back through the actual probability values and attempting to find where some sort of ambiguity or anomaly tripped up our dynamic programming solution.

Under the assumption we’re using MATLAB or Python, something with proper matrix and array support, the forward-backward algorithm is pretty simple. The algorithm is described in full below.

Forward propagation

1. Initialize some starting row vector f at time 0 (this is where you’d want to put prior information if you have it; otherwise, just make the probability of each state \frac{1}{n}).

2. For each time step, propagate forward f_{j+1} = f_{j} A \hat{O}_{j+1}

3. Normalize the resulting f vector at time j+1. Save the result somewhere, we’re gonna need it at the end.

4. Return to step 2 and keep propagating until you are out of observations.

Backward propagation

5. Initialize an n-element column vector of 1s as the probability of being in each state at the end of the process (there are no observations after the final time step to condition on, so every possible ending state gets a 1, and the normalization sorts out the scaling).

6. For each time step, propagate backwards b_{j} = A \hat{O}_{j+1} b_{j+1}

7. Normalize the resulting b vector at time j. Save this result somewhere too, we’ll need it.

8. Return to step 6 and keep propagating backwards until you are out of observations.

Putting it all together

9. Multiply all of your forwards and backwards probabilities together, matching values by states and time-steps. If you had organized your probabilities for forwards and backwards propagation into n by k matrices, you can just do an element multiply and end up with the wanted result.

10. Normalize each time step’s probabilities again. Darnit, Bayes.

So below, I have a Python implementation of the forward-backward algorithm using the fair coin/biased coin example from before.

# imports go here
import numpy as np
# functions and classes go here
def fb_alg(A_mat, O_mat, observ):
    # set up
    k = observ.size
    # observations need to be integers so we can use them as column indices
    observ = observ.astype(int)
    (n,m) = O_mat.shape
    prob_mat = np.zeros( (n,k) )
    fw = np.zeros( (n,k+1) )
    bw = np.zeros( (n,k+1) )
    # forward part
    fw[:, 0] = 1.0/n
    for obs_ind in xrange(k):
        f_row_vec = np.matrix(fw[:,obs_ind])
        fw[:, obs_ind+1] = f_row_vec * \
                           np.matrix(A_mat) * \
                           np.matrix(np.diag(O_mat[:,observ[obs_ind]]))
        fw[:,obs_ind+1] = fw[:,obs_ind+1]/np.sum(fw[:,obs_ind+1])
    # backward part
    bw[:,-1] = 1.0
    for obs_ind in xrange(k, 0, -1):
        b_col_vec = np.matrix(bw[:,obs_ind]).transpose()
        bw[:, obs_ind-1] = (np.matrix(A_mat) * \
                            np.matrix(np.diag(O_mat[:,observ[obs_ind-1]])) * \
                            b_col_vec).transpose()
        bw[:,obs_ind-1] = bw[:,obs_ind-1]/np.sum(bw[:,obs_ind-1])
    # combine it
    prob_mat = np.array(fw)*np.array(bw)
    prob_mat = prob_mat/np.sum(prob_mat, 0)
    # get out
    return prob_mat, fw, bw

# main script stuff goes here
if __name__ == '__main__':
    # the transition matrix
    A_mat = np.array([[.6, .4], [.2, .8]])
    # the observation matrix
    O_mat = np.array([[.5, .5], [.15, .85]])
    # sample heads or tails, 0 is heads, 1 is tails
    num_obs = 15
    observations1 = np.random.randn( num_obs )
    observations1[observations1>0] = 1
    observations1[observations1<=0] = 0
    p,f,b = fb_alg(A_mat, O_mat, observations1)
    print p
    # change observations to reflect messed up ratio
    observations2 = np.random.random(num_obs)
    observations2[observations2>.85] = 0
    observations2[observations2<=.85] = 1
    # majority of the time its tails, now what?
    p,f,b = fb_alg(A_mat, O_mat, observations2)
    print p
    p,f,b = fb_alg(A_mat, O_mat, np.hstack( (observations1, observations2) ))
    print p

If you run the code, using the examples from the previous post, we can see that in general, for observation 1 the probabilities tend to be on the side of fair coin state, though it can sway easily. Observation 2 is definitely more on the biased coin side, and for the extended observation 3, you can see initially it is more on the fair coin state with a lot of swaying, but the latter half is almost dead on the biased coin side.

But realistically, we may not know the transition or observation matrices to our model beforehand. What’s a good way to try and estimate them? The Baum-Welch algorithm is a version of the expectation-maximization algorithm used for hidden Markov models, and it uses the forward backward algorithm for part of the process. We will talk about that next.

Viterbi algorithm: finding most likely sequence in HMM

In the last post, I discussed Markov chains and hidden Markov models on a basic level. Of course, there’s a lot of sweet stuff I can discuss about Markov chains, but I want to dive straight into relevant algorithms with real applications. Let’s start with the Viterbi algorithm.

The Viterbi algorithm is an efficient way to find the most likely sequence of states for a Hidden Markov model. Let’s approach the problem in the dumbest way possible to show why this is computationally good, because really, the reasoning behind it just makes perfect sense.

Let’s say we have our buddy at the casino with the possibly biased coin flipping game. Let’s say after we’ve done our losing, other gamblers who are now stuck in debt from this game give you some insight that leads to you knowing the observation matrix O and the state matrix A. The observation matrix is something like before,

O = \begin{bmatrix} .5 & .5 \\ .15 & .85 \end{bmatrix}

where when the coin is biased, we have a strong chance of obtaining a tails flip, and the fair coin has a fifty-fifty shot of seeing either. The state matrix somehow shows up like this,

A = \begin{bmatrix} .6 & .4 \\ .2 & .8 \end{bmatrix}

Where we can see in general, the dealer probably won’t change coins, but tends to stick with the biased coin so the house can make money to pay his checks. Given our knowledge of A and O matrices, and our observations from playing the game and probably losing, we can find what the most likely set of states was, i.e. when did the dealer use what coin?

The brute force obvious way to do it is to calculate the probability of each combination of states happening given the observations and information matrices. But god, that’s some horrible exponentially growing problem (with n states and k time steps, there are n^{k} possible state sequences); we don’t want to do that, and we probably don’t want to wait for the computer to calculate it either if the problem space isn’t small to begin with.

It is also clear from the definition of an HMM that the probability of states and observations happening can simply be calculated from something like

P_{n,m,l,\hdots}^{i} = P_{m,l,...}^{i-1} T_{m,n} O_{n}^{i}

Where P is the probability of being at state n at time i, having come from m, l, and whatever states before that, T is the transition probability from m to n, and O is the probability of that specific observation at time i from state n. Those superscripts are not meant to be powers. Using the Markov property, we know the current state only depends on the last state, so if we think about the process from the beginning, we can calculate probabilities of sequences happening simply by multiplying together the corresponding transition values and observation values with the previous step’s probability.

What the Viterbi algorithm then expands on is the idea that at each time step we calculate, we only need to store the sequence path that has the best probability going into each state. If our Markov model has 2 states, we only need to store at most 2 paths, updated at every time step, because all that matters for the next time step is where we were at in the previous time step! That saves calculation and storage by the buttload, and makes the Viterbi algorithm a classic dynamic programming example.

I have a python example below using the casino idea. First, I try the Viterbi deal using only fair coin flips. Next, I try using only biased coin flips. Finally, I use a sequence that starts with a fair coin and switches to a biased coin.

# import stuff here
import numpy as np
# classes/functions go here
def viterbi_alg(A_mat, O_mat, observations):
    # get number of states
    num_obs = observations.size
    num_states = A_mat.shape[0]
    # initialize path costs going into each state, start with 0
    log_probs = np.zeros(num_states)
    # initialize arrays to store best paths, 1 row for each ending state
    paths = np.zeros( (num_states, num_obs+1 ))
    paths[:, 0] = np.arange(num_states)
    # start looping
    for obs_ind, obs_val in enumerate(observations):
        # freeze the previous step's scores and paths so the updates below
        # don't overwrite values we still need for the other states
        prev_log_probs = log_probs.copy()
        prev_paths = paths.copy()
        # for each obs, need to check for best path into each state
        for state_ind in xrange(num_states):
            # given observation, check prob of each path
            temp_probs = prev_log_probs + \
                         np.log(O_mat[state_ind, int(obs_val)]) + \
                         np.log(A_mat[:, state_ind])
            # check for largest score
            best_temp_ind = np.argmax(temp_probs)
            # save the path with a higher prob and score
            paths[state_ind,:] = prev_paths[best_temp_ind,:]
            paths[state_ind,(obs_ind+1)] = state_ind
            log_probs[state_ind] = temp_probs[best_temp_ind]
    # we now have a best stuff going into each path, find the best score
    best_path_ind = np.argmax(log_probs)
    # done, get out.
    return (best_path_ind, paths, log_probs)

# main script stuff goes here
if __name__ == '__main__':
    # the transition matrix
    A_mat = np.array([[.6, .4], [.2, .8]])
    # the observation matrix
    O_mat = np.array([[.5, .5], [.15, .85]])
    # sample heads or tails, 0 is heads, 1 is tails
    num_obs = 20
    observations1 = np.random.randn( num_obs )
    observations1[observations1>0] = 1
    observations1[observations1<=0] = 0
    # we have what we need, do viterbi
    best_path_ind, paths, log_probs = viterbi_alg(A_mat, O_mat, observations1)
    print "obs1 is " + str(observations1)
    print "obs1, best path is" + str(paths[best_path_ind,:])
    # change observations to reflect messed up ratio
    observations2 = np.random.random(num_obs)
    observations2[observations2>.85] = 0
    observations2[observations2<=.85] = 1
    # majority of the time its tails, now what?
    best_path_ind, paths, log_probs = viterbi_alg(A_mat, O_mat, observations2)
    print "obs2 is " + str(observations1)
    print "obs2, best path is" + str(paths[best_path_ind,:])
    #have it switch partway?
    best_path_ind, paths, log_probs = viterbi_alg(A_mat, \
                                          O_mat, \
                                          np.hstack( (observations1, observations2) ) )
    print "obs12 is " + str(np.hstack( (observations1, observations2) ) )
    print "obs12, best path is" + str(paths[best_path_ind,:])

If you run the code, you can see that generally, the states guessed for observation set 1 will be mostly fair state, and observation set 2 will be mostly biased coin state. The concatenated observations will start fair and switch over to being biased after the results start kicking in.

You may also see that in some runs, the guesses at the start of a sequence suggest the wrong state. It’s entirely possible that the coin flips at the beginning of each sequence seem to belong to the other state, but because the model’s transition probabilities don’t change over time, it will generally correct itself. This is particularly common for observation set 1, where sometimes the first several states guess the biased coin, because our gambling problem was set up so that the dealer tends to sit in the biased-coin state, even though I’m fixing the states myself when generating the flips.

So why is this useful? I’ll give two situations.

The Viterbi decoder is used in decoding convolutional codes and maximum-likelihood sequence detection in communication systems. By setting the transmitted constellation symbols as states, and the received baseband signal as an observation with some noise distribution behind it like Gaussian white noise, we can use the Viterbi decoder to find the most likely sequence of transmitted symbols, given our observed values at the receiver. Do note that the Viterbi decoder is still probably one of the most costly things to put in a transmission device, so it’s not advantageous to require multiple decoders for different codes simply on a cost/space basis.

The second example I have is that HMMs are used for DNA/RNA sequence analysis; the process of attaching one of 4 possible bases to the end of a strand as it is naturally built (like from growing) can be modeled using an HMM, so the Viterbi algorithm can help reveal the most likely sequence of hidden states.

But even more useful for DNA/RNA sequencing is the forward-backward algorithm, which I will discuss next time? Later? And then that leads directly to the Baum-Welch algorithm, an HMM variant on the expectation-maximization algorithm.

Markov Chains and Hidden Markov Models

I want to discuss algorithms and things involving Markov Chains and Hidden Markov Models, so maybe I should cover some sort of background before I do that. This will be about DISCRETE Markov chains, if you want things involving continuous Markov chains, you might want to look elsewhere. Or get that Markov Chains book by Bremaud, that book was pretty good from what I remember.

A Markov chain is a probabilistic model where we claim that we have some n number of states, and at each state we have some probability of transitioning to 1 or more other states (or the state it is already in) at each time step. The probability of where it goes is only affected by the state it was in, and none of the time steps before that.

Formally speaking, let’s say the Markov chains has states denoted S = \{s_{1}, s_{2}, \hdots, s_{n}\}. We have n states, and we represent the probabilities of moving from state i to state j such that i,j \in \{1, \hdots, n\} using a matrix A, size n by n. Element of the matrix a_{i,j} in row i and column j is the probability of transitioning from state i to state j. For simplicity, let’s assume the matrix A does not change over time steps.

The simplest example of a Markov chain I can think of is the following: imagine that there are 5 lily pads in a row, numbered one to five from left to right, that a frog is hopping back and forth on. Let’s say he refuses to stay on the same lily pad and has to move at each time step, and he can only jump as far as one lily pad. Then the states here, s_{x}, means the frog is on lily pad number x. The transition matrix would look something like this:

A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ .5 & 0 & .5 & 0 & 0 \\ 0 & .5 & 0 & .5 & 0 \\ 0 & 0 & .5 & 0 & .5 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}

We can see here that the frog has equal probability of moving left or right when on lily pads 2, 3, or 4, but on 1 and 5, he has to go back to 2 or 4 respectively. This chain is sort of a symmetric bounded random walk, so it’s special in a way, but that’s not really important for what I’m trying to illustrate. Clearly, the probability of where he goes next is only dependent on where he is at that time.
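If it helps to see the frog in action, here’s a small simulation sketch (the starting pad and the number of hops are arbitrary choices):

import numpy as np

# the frog transition matrix from above, lily pads 1 through 5
A = np.array([[0., 1., 0., 0., 0.],
              [.5, 0., .5, 0., 0.],
              [0., .5, 0., .5, 0.],
              [0., 0., .5, 0., .5],
              [0., 0., 0., 1., 0.]])

state = 2   # start on lily pad 3 (0-indexed internally)
hops = [state + 1]
for step in range(10):
    # the next pad only depends on the current pad (the Markov property)
    state = np.random.choice(5, p=A[state, :])
    hops.append(state + 1)
print(hops)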

Of course, we could redesign the Markov chain so that the states encompass the last 2 spots the frog was on, so Markov chains can inherently be designed to deal with some form of memory, but that blows up the size and complexity of the transition matrix obviously (in the case of our frog lily pad, it’s basically n^{2} states), and may make models more complicated than you really need them to be. I will probably talk about neat features of Markov chains later, but the key here is that it’s a nice simple model with a sweet Markov property that makes things simpler in terms of past events.

Hidden Markov Models expand on the idea of Markov chains. Previously, we had assumed we could just simply see the state that the frog was in, and that was that. It’s much more likely in real life that even if we say something moves in a Markov fashion, we can’t view the states; we can only see an observation that may clue us into what state the Markov process was in at that time.

Formally, we can build on the previous Markov chain definition and define an observation matrix, with m possible observations, so that our observation matrix is n by m, n states by m possible observations. So at each state, there are probabilities associated with possible observations that may come out as a result of being at that state.

Let’s go with a new example for the HMM. Let’s say there’s a casino that plays a simple coin flip game. The “dealer” switches back and forth randomly, without the player observing, between a fair coin and a biased coin, so the house has some sort of advantage in this coin flipping business. The player then only observes heads or tails from the coin flips.

Since we’re god, we can represent this HMM with the following transition and observation matrices, denoted matrix A and B. The A matrix describes probability of switching or staying with whatever coin is in the dealer’s hand, and the observation matrix is probability of heads or tails given what coin is being used:

A = \begin{bmatrix} .6 & .4 \\ .2 & .8 \end{bmatrix}

B = \begin{bmatrix} .5 & .5 \\ .15 & .85 \end{bmatrix}

we can see that it’s not that likely for the dealer to switch coins in either case, but he’s generally inclined to stick with the biased coin. We can see from the observation matrix that the fair coin is fair like it should be, but the unfair coin flips tails much more often than heads. NOTE that observations actually do not have to be discrete like this! It is actually common to have an HMM such that the observations are Gaussian distributed, so the observations are somewhat “noisy,” and are likely distributed slightly differently depending on what state the model is currently sitting in.
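As a sketch of how such a process spits out data (the starting coin and the number of flips are my own assumptions here), drawing hidden coin choices and observed flips from these two matrices might look like this:

import numpy as np

A = np.array([[.6, .4], [.2, .8]])    # state 0 = fair coin, state 1 = biased coin
B = np.array([[.5, .5], [.15, .85]])  # observation 0 = heads, 1 = tails

num_flips = 20
states = np.zeros(num_flips, dtype=int)
flips = np.zeros(num_flips, dtype=int)
state = 0   # assume the dealer starts with the fair coin
for t in range(num_flips):
    # flip whatever coin is currently in hand
    flips[t] = np.random.choice(2, p=B[state, :])
    states[t] = state
    # then maybe switch coins before the next flip
    state = np.random.choice(2, p=A[state, :])
print(states)   # the part the player never sees
print(flips)    # what the player actually observes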

So why is the HMM model important? The HMM can be used to model many things, such as DNA sequencing or digital communication systems, and many algorithms exist for dealing with HMMs.

The Viterbi algorithm finds the most likely series of states, given some observations and assumed parameters. This is very useful for decoding convolutional codes and doing maximum likelihood sequence detection in digital communications.

The forward-backward algorithm finds the probabilities of being at any of the states at any time, given some observations and the parameters of the HMM. It’s pretty clear why this is useful; there are definitely cases where you would like to know the probabilities of states and transitions from the entire process.

The Baum-Welch algorithm is a variant of the expectation maximization algorithm that uses observations and some initial guesses at the parameters to make an intelligent guess at the actual parameters of the system. It’s clear that in many cases, the actual parameters of the system will not be known, so figuring them out one way or another would be ideal. The Baum-Welch algorithm does the expectation step by finding probabilities of being in each state using the forward-backward algorithm, and then re-calculates a transition and observation matrix based on the results of the F-B algorithm, and this is repeated until convergence.