The Viterbi algorithm: finding the most likely state sequence in an HMM

In the last post, I discussed Markov chains and hidden Markov models at a basic level. Of course, there’s a lot of sweet stuff I could still say about Markov chains, but I want to dive straight into relevant algorithms with real applications. Let’s start with the Viterbi algorithm.

The Viterbi algorithm is an efficient way to find the most likely sequence of hidden states for a hidden Markov model. Let’s first approach the problem in the dumbest way possible to show why the algorithm is computationally worthwhile, because really, the reasoning behind it just makes perfect sense.

Let’s say we have our buddy at the casino with the possibly biased coin flipping game. Say that after we’ve done our losing, other gamblers who are now stuck in debt from this game give us some insight, so that we end up knowing the observation matrix O and the state transition matrix A. The observation matrix is something like before,

O = \begin{bmatrix} .5 & .5 \\ .15 & .85 \end{bmatrix}

where when the coin is biased, we have a strong chance of seeing tails, and the fair coin has a fifty-fifty shot of either. The state transition matrix somehow shows up like this,

A = \begin{bmatrix} .6 & .4 \\ .2 & .8 \end{bmatrix}

where we can see that, in general, the dealer probably won’t change coins, and tends to stick with the biased coin so the house can make enough money to pay his checks. Given our knowledge of the A and O matrices, plus our observations from playing the game (and probably losing), we can find the most likely sequence of states, i.e. when did the dealer use which coin?

The obvious brute-force way to do it is to calculate the probability of every possible combination of states given the observations and the two matrices. But god, that’s an exponentially growing problem: with 2 states and T flips, there are 2^T sequences to score. We don’t want to do that, and we probably don’t want to wait for the computer to calculate it either unless the problem space is tiny to begin with.
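To make the blow-up concrete, here’s a toy sketch of that brute-force search (my own illustration, not part of the algorithm; the uniform start distribution `pi` is an assumption, since we never specified one):

```python
import itertools
import numpy as np

# casino matrices from above; pi is an assumed uniform start distribution
A = np.array([[.6, .4], [.2, .8]])
O = np.array([[.5, .5], [.15, .85]])
pi = np.array([.5, .5])

def brute_force_best_path(obs):
    # score all 2**len(obs) possible state sequences and keep the winner
    best_prob, best_path = -1.0, None
    for path in itertools.product(range(2), repeat=len(obs)):
        prob = pi[path[0]] * O[path[0], obs[0]]
        for t in range(1, len(obs)):
            prob *= A[path[t - 1], path[t]] * O[path[t], obs[t]]
        if prob > best_prob:
            best_prob, best_path = prob, path
    return best_path, best_prob

# 10 flips is already 1024 sequences; 60 flips would be around 1e18
obs = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
best_path, best_prob = brute_force_best_path(obs)
```

Doubling the number of flips squares the number of sequences, which is exactly the growth we’re trying to avoid.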

It is also clear from the definition of an HMM that the probability of a sequence of states and observations happening can simply be calculated from something like

P_{n,m,l,\dots}^{i} = P_{m,l,\dots}^{i-1} A_{m,n} O_{n}^{i}

where P_{n,m,l,\dots}^{i} is the probability of being at state n at time i, having come from m, l, and whatever states before that, A_{m,n} is the transition probability from state m to state n, and O_{n}^{i} is the probability of the observation seen at time i given state n. Those superscripts are time indices, not powers. By the Markov property, the current state only depends on the previous state, so if we think about the process from the beginning, we can calculate the probabilities of sequences happening simply by multiplying the previous step’s probability by the corresponding transition and observation values.
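As a quick sanity check of that recursion, here’s one hand-worked step with the casino numbers, under the assumption that we start in the fair state:

```python
import numpy as np

A = np.array([[.6, .4], [.2, .8]])    # transition matrix from above
O = np.array([[.5, .5], [.15, .85]])  # observation matrix from above

# start in the fair state (state 0) and observe tails (1)
p_prev = O[0, 1]                       # 0.5
# extend the sequence by one more tails flip, into each possible state
p_fair = p_prev * A[0, 0] * O[0, 1]    # stay fair:  0.5 * 0.6 * 0.5  = 0.15
p_biased = p_prev * A[0, 1] * O[1, 1]  # go biased:  0.5 * 0.4 * 0.85 = 0.17
```

After just two tails in a row, the sequence that switches to the biased coin already edges out the one that stays fair.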

What the Viterbi algorithm then builds on is the idea that at each time step, we only need to store the path with the best probability going into each state. If our Markov model has 2 states, we only need to store at most 2 paths, updated at every time step, because all that matters for the next time step is where we were at the previous time step! That saves computation and storage by the buttload, and makes the Viterbi algorithm a classic dynamic programming example.

I have a Python example below using the casino idea. First, I run the Viterbi deal on only fair coin flips. Next, I try only biased coin flips. Finally, I use a sequence that starts with the fair coin and switches to the biased coin.

# import stuff here
import numpy as np

# classes/functions go here
def viterbi_alg(A_mat, O_mat, observations):
    # get the number of observations and states
    observations = np.asarray(observations, dtype=int)
    num_obs = observations.size
    num_states = A_mat.shape[0]
    # initialize log path scores going into each state, start with 0
    log_probs = np.zeros(num_states)
    # initialize array to store the best path ending in each state
    paths = np.zeros((num_states, num_obs + 1))
    paths[:, 0] = np.arange(num_states)
    # start looping
    for obs_ind, obs_val in enumerate(observations):
        # work on copies, so updating one state's entry doesn't leak into
        # the calculations for the other states at the same time step
        prev_log_probs = log_probs.copy()
        prev_paths = paths.copy()
        # for each obs, need to check for the best path into each state
        for state_ind in range(num_states):
            # given the observation, check the log prob of each path
            temp_probs = prev_log_probs + \
                         np.log(O_mat[state_ind, obs_val]) + \
                         np.log(A_mat[:, state_ind])
            # check for the largest score
            best_temp_ind = np.argmax(temp_probs)
            # save the path with the higher prob and its score
            paths[state_ind, :] = prev_paths[best_temp_ind, :]
            paths[state_ind, obs_ind + 1] = state_ind
            log_probs[state_ind] = temp_probs[best_temp_ind]
    # we now have the best path going into each state, find the best score
    best_path_ind = np.argmax(log_probs)
    # done, get out.
    return (best_path_ind, paths, log_probs)

# main script stuff goes here
if __name__ == '__main__':
    # the transition matrix
    A_mat = np.array([[.6, .4], [.2, .8]])
    # the observation matrix
    O_mat = np.array([[.5, .5], [.15, .85]])
    # sample heads or tails, 0 is heads, 1 is tails
    num_obs = 20
    observations1 = np.random.randn(num_obs)
    observations1[observations1 > 0] = 1
    observations1[observations1 <= 0] = 0
    # we have what we need, do viterbi
    best_path_ind, paths, log_probs = viterbi_alg(A_mat, O_mat, observations1)
    print("obs1 is " + str(observations1))
    print("obs1, best path is " + str(paths[best_path_ind, :]))
    # change observations to reflect the biased coin's ratio; use a mask so
    # the first assignment doesn't feed into the second comparison
    observations2 = np.random.random(num_obs)
    tails = observations2 <= .85
    observations2[tails] = 1
    observations2[~tails] = 0
    # majority of the time it's tails, now what?
    best_path_ind, paths, log_probs = viterbi_alg(A_mat, O_mat, observations2)
    print("obs2 is " + str(observations2))
    print("obs2, best path is " + str(paths[best_path_ind, :]))
    # have it switch partway?
    best_path_ind, paths, log_probs = viterbi_alg(A_mat, O_mat,
                                                  np.hstack((observations1, observations2)))
    print("obs12 is " + str(np.hstack((observations1, observations2))))
    print("obs12, best path is " + str(paths[best_path_ind, :]))

If you run the code, you can see that generally, the states guessed for observation set 1 will be mostly the fair state, and for observation set 2 mostly the biased coin state. The concatenated observations will start fair and switch over to the biased state once the tails-heavy results start kicking in.

You may also see that in some runs, the first few guessed states are wrong. It’s entirely possible for the coin flips at the beginning of a sequence to look like they belong to the other state, but because the model parameters are stationary over time steps, the guesses generally correct themselves. This is particularly common for observation set 1, where the first several states sometimes guess the biased coin, because our gambling problem was set up so that transitioning to the biased coin is supposedly common, even though I’m fixing the states myself when generating the data.

So why is this useful? I’ll give two situations.

The Viterbi decoder is used for decoding convolutional codes and for maximum-likelihood sequence detection in communication systems. By treating the transmitted constellation symbols as states, and the received baseband signal as observations corrupted by some noise distribution, like additive white Gaussian noise, we can use the Viterbi decoder to find the most likely sequence of transmitted symbols given the values observed at the receiver. Do note that the Viterbi decoder is still probably one of the most costly blocks to put in a transmission device, so requiring multiple decoders for different codes is unattractive simply on a cost/space basis.
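As a hedged sketch of that idea (my own toy code, not a real receiver; the BPSK symbols, `mlse_viterbi` helper, and noise variance are all assumptions for illustration), the candidate symbols become the states and the emission term becomes a Gaussian log-likelihood of each received sample:

```python
import numpy as np

# candidate BPSK symbols (+1/-1) play the role of the hidden states
symbols = np.array([1.0, -1.0])
# memoryless source: either symbol equally likely at each step
A_log = np.log(np.full((2, 2), 0.5))

def mlse_viterbi(received, noise_var=0.25):
    T = len(received)
    back = np.zeros((2, T), dtype=int)
    scores = np.zeros(2)
    for t, r in enumerate(received):
        # Gaussian log-likelihood of the sample r given each symbol
        emit = -(r - symbols) ** 2 / (2 * noise_var)
        if t == 0:
            scores = emit
            continue
        # score of each (previous state, current state) pair
        cand = scores[:, None] + A_log + emit[None, :]
        back[:, t] = np.argmax(cand, axis=0)
        scores = np.max(cand, axis=0)
    # backtrack the best symbol sequence from the winning end state
    path = [int(np.argmax(scores))]
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1], t])
    return symbols[path[::-1]]

detected = mlse_viterbi(np.array([0.9, -1.2, 1.1, 0.8, -0.7]))
```

With uniform transitions this degenerates to per-symbol nearest-neighbor detection; in a real convolutional code or ISI channel, the interesting structure lives in the transition scores, and that’s where the trellis search earns its keep.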

The second example I have is that HMMs are used for DNA/RNA sequence analysis; it’s easy to argue that attaching one of 4 possible bases to the end of a strand when naturally building it (like during growth) can be modeled with an HMM, so the Viterbi decoder can help reveal the most likely sequence of hidden states behind an observed strand.
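A minimal sketch along those lines (the probabilities here are made up for illustration; real emission and transition values would be estimated from data): two hidden states, an AT-rich background and a GC-rich region, each emitting the four bases with different frequencies, with a log-space Viterbi pass labeling each base:

```python
import numpy as np

# illustrative numbers only: state 0 = AT-rich background, state 1 = GC-rich
A_log = np.log(np.array([[0.95, 0.05], [0.10, 0.90]]))
# rows give P(A, C, G, T | state)
O_log = np.log(np.array([[0.35, 0.15, 0.15, 0.35],
                         [0.15, 0.35, 0.35, 0.15]]))

def viterbi_labels(obs):
    scores = O_log[:, obs[0]]  # uniform start folded out for simplicity
    back = np.zeros((2, len(obs)), dtype=int)
    for t in range(1, len(obs)):
        cand = scores[:, None] + A_log + O_log[:, obs[t]][None, :]
        back[:, t] = np.argmax(cand, axis=0)
        scores = np.max(cand, axis=0)
    # backtrack the best state labeling
    states = [int(np.argmax(scores))]
    for t in range(len(obs) - 1, 0, -1):
        states.append(back[states[-1], t])
    return states[::-1]

base_index = {"A": 0, "C": 1, "G": 2, "T": 3}
seq = [base_index[b] for b in "ATTAGCGCGCGCATTA"]
states = viterbi_labels(seq)
```

The GC run in the middle of the strand should come out labeled as the GC-rich state, while the AT-heavy flanks stay in the background state, even though no single base is decisive on its own.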

But even more useful for DNA/RNA sequence analysis is the forward-backward algorithm, which I will discuss next time. Later? And then that leads directly to the Baum-Welch algorithm, an HMM-flavored variant of the expectation-maximization algorithm.
