Estimating Hidden Markov Models (HMM)

The Hidden Markov Model (HMM) assumes that the system being modelled is a Markov process with unobserved (i.e. hidden) states. A central problem is to learn the hidden Markov transition matrix and the conditional observation probabilities on the Markov states from a sequence of observations. We explain the Baum-Welch algorithm, which finds a maximum likelihood estimate of these parameters of a hidden Markov model given a sequence of observed data.

The HMM learning problem

In the HMM learning problem, one learns the hidden transition matrix and the conditional observation probabilities given the states from a sequence of observations.

Consider a hidden discrete-time Markov chain with random variables \(X_t\) that take values in a set \(S_X = \{1,\cdots,N\}\) of \(N \in \mathbb{N}\) elements. We assume the time-homogeneous Markov property, i.e. \(\mathbb{P}(X_t|X_{t-1})\) is independent of \(t\), which leads to the definition of the time-independent transition matrix \begin{equation} A = \{a_{ij}\}, \quad a_{ij} = \mathbb{P}(X_t=j|X_{t-1}=i). \end{equation} The initial state distribution is given by \begin{equation} \pi = \{\pi_j\}, \quad \pi_j = \mathbb{P}(X_1 = j), \quad \sum_{j=1}^N \pi_j = 1. \end{equation} The value of \(X_t\) is not observed directly. Rather, we observe \(Y_t\), which takes values in a set \(S_Y = \{1,\cdots,M\}\). The probability of observing \(k\) at time \(t\) in state \(j\) is denoted by \begin{equation} b_j(k) = \mathbb{P} (Y_t = k | X_t = j). \end{equation} Taking into account all possible values of \(Y_t\) and \(X_t\), we obtain the \(N \times M\) observation matrix \(B = \{b_j(k)\}\).

An observation sequence is given by \(Y = \{Y_1 = y_1, Y_2 = y_2, \cdots, Y_T = y_T\}\). The HMM learning problem is then to estimate the HMM parameters \(A\) and \(B\), and hence the HMM model \(\lambda = (\pi, A, B)\), from the observation sequence \(Y\).
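
For concreteness, the model can be held in a few numpy arrays. The two-state, three-symbol values below are purely illustrative (our own choice, not part of the text), and the code indexes states and symbols from 0 whereas the text counts from 1:

```python
import numpy as np

# A hypothetical HMM with N = 2 hidden states and M = 3 observation symbols.
# Each row of A and B is a probability distribution, so rows sum to 1.
N, M = 2, 3
pi = np.array([0.6, 0.4])        # pi_j = P(X_1 = j)
A = np.array([[0.7, 0.3],        # a_ij = P(X_t = j | X_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # b_j(k) = P(Y_t = k | X_t = j)
              [0.1, 0.3, 0.6]])
Y = np.array([0, 2, 1, 0, 2])    # an observed sequence y_1, ..., y_T
```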

Baum-Welch algorithm

We explain the Baum-Welch algorithm that finds the maximum likelihood estimate of the HMM transition matrix \(A\) and observation matrix \(B\).

Compute maximum likelihood estimate

The maximum likelihood estimate (MLE) of the probability \(a_{ij}\) of a particular transition from state \(i\) to \(j\) is

\begin{equation} \hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to } j}{\text{expected number of transitions from state } i}, \label{eq:MLE_a} \end{equation}

and the MLE of the probability \(b_j(k)\) of a given label \(k\) from the observation \(Y\), given a state \(j\), is

\begin{equation} \hat{b}_j(k) = \frac{\text{expected number of times in state } j \text{ and observing } k}{\text{expected number of times in state } j}. \label{eq:MLE_b} \end{equation}

Therefore, the primary idea for computing \(\hat{a}_{ij}\) and \(\hat{b}_j(k)\) is to evaluate the “expected” counts in equations (\ref{eq:MLE_a}) and (\ref{eq:MLE_b}). The Baum-Welch algorithm estimates these counts iteratively, starting from an initial guess of the transition and observation probabilities. The estimated probabilities are then used to derive successively better estimates in the following iterations.

Compute \(\hat{a}_{ij}\) and \(\hat{b}_j(k)\)

Let’s proceed with computing \(\hat{a}_{ij}\). Define the probability \(\xi_t(i,j)\) as the probability of being in state \(i\) at time \(t\) and state \(j\) at time \(t+1\), given the observation sequence and the model \(\lambda\),

\begin{equation} \xi_t(i,j) = \mathbb{P} (X_t = i, X_{t+1} = j | Y, \lambda). \end{equation}

The expected number of transitions from state \(i\) to \(j\) is then the sum of \(\xi_t(i,j)\) over all \(t\). The total expected number of transitions out of state \(i\) is obtained by additionally summing over all destination states. Hence,

\begin{equation} \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t (i,j)}{\sum_{k=1}^N \sum_{t=1}^{T-1} \xi_t (i,k)}. \label{eq:xi_def} \end{equation}

We also need a formula for computing the observation probability \(\hat{b}_j(k)\). Define the probability \(\gamma_t(j)\) as the probability of being in state \(j\) at time \(t\), given the observation sequence and the model \(\lambda\),

\begin{equation} \gamma_t(j) = \mathbb{P} (X_t = j | Y, \lambda). \end{equation}

This enables us to compute \(\hat{b}_j(k)\). For the numerator, we sum \(\gamma_t(j)\) for all time steps \(t\) in which the observation \(y_t = k\). For the denominator, we sum \(\gamma_t(j)\) over all time steps \(t\). Namely,

\begin{equation} \hat{b}_j(k) = \frac{\sum_{y_t = k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}. \label{eq:gamma_def} \end{equation}
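
Given arrays holding \(\xi_t(i,j)\) and \(\gamma_t(j)\) (their computation is derived next), the two ratios translate into a short numpy sketch; the function name and array layout are our own conventions, and \(Y\) is assumed to be a numpy integer array of 0-indexed symbols:

```python
import numpy as np

def reestimate(xi, gamma, Y, M):
    """Apply (MLE_a) and (MLE_b): xi has shape (T-1, N, N), gamma has shape (T, N)."""
    # Expected transitions i -> j, divided by expected transitions out of i.
    A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    # Expected time in state j while observing k, divided by expected time in j.
    N = gamma.shape[1]
    B_hat = np.zeros((N, M))
    for k in range(M):
        B_hat[:, k] = gamma[Y == k].sum(axis=0)
    B_hat /= gamma.sum(axis=0)[:, None]
    return A_hat, B_hat
```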

Compute \(\xi_t(i,j)\) and \(\gamma_t(j)\)

To compute \(\xi\) and \(\gamma\), we need to define two more probabilities, \(\alpha\) and \(\beta\). We define the forward path probability \(\alpha_t(j)\) as the probability of being in state \(j\) after seeing the first \(t\) observations,

\begin{equation} \alpha_t(j) = \mathbb{P} (y_1, y_2, \cdots, y_t, X_t = j | \lambda), \end{equation}

and we define the backward path probability \(\beta_t (j)\) as the probability of seeing the observations from time \(t+1\) to the end, given that we are in state \(j\) at time \(t\),

\begin{equation} \beta_t(j) = \mathbb{P} (y_{t+1}, y_{t+2}, \cdots, y_T | X_t = j, \lambda). \end{equation}

By the definition of conditional probability, we can rewrite \(\xi\) and \(\gamma\) as

\begin{equation} \xi_t(i,j) = \frac{\mathbb{P} (X_t=i,X_{t+1}=j,Y | \lambda)}{\mathbb{P} ( Y | \lambda) }, \quad \gamma_t (j) = \frac{\mathbb{P} (X_t = j, Y | \lambda)}{\mathbb{P} (Y | \lambda)}. \label{eq:xi_gamma_comp} \end{equation}

Note that the denominator is the probability of the observation sequence, which can be computed in several ways,

\begin{equation} \mathbb{P} (Y | \lambda) = \sum_{j=1}^N \alpha_T (j) = \sum_{j=1}^N \pi_j b_j(y_1) \beta_1 (j). \label{eq:ProbObservations} \end{equation}

The numerators in (\ref{eq:xi_gamma_comp}) can be expressed in terms of \(\alpha\) and \(\beta\):

\begin{equation} \mathbb{P} (X_t=i,X_{t+1}=j,Y | \lambda) = \alpha_t(i) a_{ij} b_j(y_{t+1}) \beta_{t+1} (j), \end{equation} \begin{equation} \mathbb{P} (X_t = j, Y | \lambda) = \alpha_t(j) \beta_t(j). \end{equation}
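
Combining (\ref{eq:xi_gamma_comp}) with these two numerators, \(\xi\) and \(\gamma\) follow directly from \(\alpha\) and \(\beta\). A vectorised numpy sketch, assuming \(\alpha\) and \(\beta\) are stored as \((T, N)\) arrays (names are our own):

```python
import numpy as np

def posteriors(alpha, beta, A, B, Y):
    """Compute xi (shape (T-1, N, N)) and gamma (shape (T, N))."""
    prob_Y = alpha[-1].sum()                 # P(Y | lambda) = sum_j alpha_T(j)
    # xi_t(i,j) = alpha_t(i) a_ij b_j(y_{t+1}) beta_{t+1}(j) / P(Y | lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, Y[1:]].T[:, None, :] * beta[1:, None, :]) / prob_Y
    # gamma_t(j) = alpha_t(j) beta_t(j) / P(Y | lambda)
    gamma = alpha * beta / prob_Y
    return xi, gamma
```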

Compute \(\alpha_t(j)\) and \(\beta_t(j)\)

It remains to calculate \(\alpha\) and \(\beta\). From their definitions, we can derive a recursion for each, which we can then solve for all time steps by assigning a starting value. To be specific,

\[\alpha_1 (j) = \pi_j b_j(y_1), \text{ } 1 \leq j \leq N,\]

\begin{equation} \alpha_t (j) = \sum_{i=1}^N \alpha_{t-1} (i) a_{ij} b_j(y_t), \text{ } 1 \leq j \leq N, 2 \leq t \leq T, \label{eq:ForwardAlgorithm} \end{equation}

and,

\[\beta_T (j) = 1, \text{ } 1 \leq j \leq N,\]

\begin{equation} \beta_t (j) = \sum_{i=1}^N a_{ji} b_i(y_{t+1}) \beta_{t+1}(i), \text{ } 1 \leq j \leq N, 1 \leq t \leq T-1. \label{eq:BackwardAlgorithm} \end{equation}

In fact, equation (\ref{eq:ForwardAlgorithm}) is known as the forward algorithm, and (\ref{eq:BackwardAlgorithm}) as the backward algorithm.
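
Both recursions translate directly into numpy. This is an unscaled sketch for exposition; as discussed below, it will underflow on long sequences:

```python
import numpy as np

def forward(pi, A, B, Y):
    """Forward algorithm (ForwardAlgorithm): returns alpha with shape (T, N)."""
    T, N = len(Y), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, Y[0]]                        # alpha_1(j) = pi_j b_j(y_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]    # sum_i alpha_{t-1}(i) a_ij b_j(y_t)
    return alpha

def backward(A, B, Y):
    """Backward algorithm (BackwardAlgorithm): returns beta with shape (T, N)."""
    T, N = len(Y), A.shape[0]
    beta = np.ones((T, N))                            # beta_T(j) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])  # sum_i a_ji b_i(y_{t+1}) beta_{t+1}(i)
    return beta
```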

Summary of the Baum-Welch Algorithm

Hence, we summarise the algorithm as follows.

function Baum-Welch (observations \(Y = \{ y_1,\cdots,y_T \}\), number of hidden states \(N\)):

     initialise \(A\) and \(B\)

     iterate until convergence:

          compute \(\alpha_t(j)\) using the forward algorithm (\ref{eq:ForwardAlgorithm});

          compute \(\beta_t(j)\) using the backward algorithm (\ref{eq:BackwardAlgorithm});

          compute \(\xi_t(i,j) = \frac{\alpha_t(i) a_{ij} b_j(y_{t+1})\beta_{t+1}(j)}{\sum_{k=1}^N \alpha_T (k)}\), \(\forall t\), \(i\) and \(j\);

          compute \(\gamma_t(j) = \frac{\alpha_t(j) \beta_{t}(j)}{\sum_{k=1}^N \alpha_T (k)}\), \(\forall t\) and \(j\);

          compute \(\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t (i,j)}{\sum_{k=1}^N \sum_{t=1}^{T-1} \xi_t (i,k)}\), \(\forall i\) and \(j\);

          compute \(\hat{b}_j(k) = \frac{\sum_{y_t = k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}\), \(\forall j\) and \(k\);

     return \(A\), \(B\)
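
Tying the helpers above together gives a minimal, unscaled sketch of the whole loop. The fixed iteration count is a simplification (a practical implementation stops when the log-likelihood stops improving), and we keep \(\pi\) fixed as the pseudocode does, although the full algorithm may also re-estimate it as \(\gamma_1(j)\):

```python
def baum_welch(Y, A, B, pi, n_iter=100):
    """Baum-Welch iteration using forward, backward, posteriors and reestimate."""
    for _ in range(n_iter):
        alpha = forward(pi, A, B, Y)                  # (ForwardAlgorithm)
        beta = backward(A, B, Y)                      # (BackwardAlgorithm)
        xi, gamma = posteriors(alpha, beta, A, B, Y)
        A, B = reestimate(xi, gamma, Y, B.shape[1])
    return A, B
```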

Improvements in implementation

We have observed some deficiencies of the stated Baum-Welch algorithm during practical implementation. In particular,

  1. Both the forward and backward algorithms require computations involving products of probabilities. It is easy to see, for example, that \(\alpha_t(j)\) tends to 0 exponentially as \(t\) increases. Therefore, any attempt to implement the formulae as given above will inevitably result in underflow.

  2. The converged estimates of \(A\) and \(B\) are typically a local rather than the global optimum, and are therefore sensitive to the initial guess.

Scaling

To solve the problem stated in (1), we scale the probabilities calculated in each time step. Instead of calculating \(\alpha_t(j)\) following algorithm (\ref{eq:ForwardAlgorithm}), we compute \(\hat{\alpha}_t(j)\), in order,

\[\tilde{\alpha}_1 (j) = \pi_j b_j(y_1), \text{ } 1 \leq j \leq N,\] \[c_1 = \frac{1}{\sum_{j=1}^N \tilde{\alpha}_1(j)},\] \[\hat{\alpha}_1 (j) = c_1 \tilde{\alpha}_1 (j), \text{ } 1 \leq j \leq N,\] \[\tilde{\alpha}_t (j) = \sum_{i=1}^N \hat{\alpha}_{t-1} (i) a_{ij} b_j(y_t), \text{ } 1 \leq j \leq N, 2 \leq t \leq T,\] \[c_t = \frac{1}{\sum_{j=1}^N \tilde{\alpha}_t(j)},\]

\begin{equation} \hat{\alpha}_t (j) = c_t \tilde{\alpha}_t (j), \text{ } 1 \leq j \leq N. \label{eq:ForwardAlgorithmScale} \end{equation}

To make sure the above scaling of \(\alpha\) does not distort the probabilities computed subsequently, such as \(\xi\) and \(\gamma\), the same scaling factors are applied to \(\beta\). Therefore, instead of computing \(\beta_t(j)\) using algorithm (\ref{eq:BackwardAlgorithm}), we compute \(\hat{\beta}_t (j)\), in order,

\[\hat{\beta}_T (j) = c_T, \text{ } 1 \leq j \leq N,\] \[\tilde{\beta}_t (j) = \sum_{i=1}^N a_{ji} b_i(y_{t+1}) \hat{\beta}_{t+1}(i), \text{ } 1 \leq j \leq N, 1 \leq t \leq T-1,\]

\begin{equation} \hat{\beta}_t (j) = c_t \tilde{\beta}_t (j), \text{ } 1 \leq j \leq N. \label{eq:BackwardAlgorithmScale} \end{equation}
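
In code, both scaled recursions share the same factors \(c_t\); a sketch in the notation used above:

```python
import numpy as np

def forward_scaled(pi, A, B, Y):
    """Scaled forward algorithm (ForwardAlgorithmScale): returns alpha_hat and c."""
    T, N = len(Y), len(pi)
    alpha_hat, c = np.zeros((T, N)), np.zeros(T)
    alpha_tilde = pi * B[:, Y[0]]                     # alpha_tilde_1(j)
    c[0] = 1.0 / alpha_tilde.sum()
    alpha_hat[0] = c[0] * alpha_tilde
    for t in range(1, T):
        alpha_tilde = (alpha_hat[t - 1] @ A) * B[:, Y[t]]
        c[t] = 1.0 / alpha_tilde.sum()
        alpha_hat[t] = c[t] * alpha_tilde
    return alpha_hat, c

def backward_scaled(A, B, Y, c):
    """Scaled backward algorithm (BackwardAlgorithmScale), reusing the factors c_t."""
    T, N = len(Y), A.shape[0]
    beta_hat = np.zeros((T, N))
    beta_hat[-1] = c[-1]                              # beta_hat_T(j) = c_T
    for t in range(T - 2, -1, -1):
        beta_hat[t] = c[t] * (A @ (B[:, Y[t + 1]] * beta_hat[t + 1]))
    return beta_hat
```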

Optimal initialisation

The converged estimates of \(A\) and \(B\) are usually a local optimum, depending on the initial guess of the model \(\lambda^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})\). To attempt to approach the global optimum, we randomly select a number of initial guesses and pick the one that results in the largest probability \(\mathbb{P} (Y | \lambda)\), the probability of observing the given sequence.

Due to the scaling in the revised forward algorithm (\ref{eq:ForwardAlgorithmScale}), we use formula (\ref{eq:ProbObservations}) and obtain

\begin{equation} \mathbb{P} (Y | \lambda) = \frac{1}{\prod_{t=1}^T c_t}. \end{equation}

To avoid underflow, we instead compute

\begin{equation} \log[ \mathbb{P} (Y | \lambda) ] = - \sum_{t=1}^T \log c_t. \end{equation}
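
For instance, using the scaling factors returned by `forward_scaled` above:

```python
alpha_hat, c = forward_scaled(pi, A, B, Y)
log_prob_Y = -np.log(c).sum()    # log P(Y | lambda) = -sum_t log(c_t)
```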

We do not allow the model to be completely random, which would greatly increase the computational effort required for the maximisation. We fix

\begin{equation} \pi^{(0)}_j = \frac{1}{N}, \quad b^{(0)}_j(k) = \frac{1}{M}, \quad \text{for all } j \text{ and } k, \end{equation}

and only allow \(a^{(0)}_{ij}\) to be drawn uniformly at random subject to \(\sum_j a^{(0)}_{ij} = 1\) for all \(i\).
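
A sketch of this restart strategy, building on the earlier helpers. The helper name, the number of restarts, and the Dirichlet draw (which makes each row of \(A^{(0)}\) uniform on the probability simplex) are our own choices; for brevity it reuses the unscaled `baum_welch` sketch, whereas a practical implementation would substitute the scaled recursions throughout:

```python
def best_model(Y, N, M, n_restarts=20, n_iter=50):
    """Run Baum-Welch from several random A^(0) and keep the most likely model."""
    rng = np.random.default_rng()
    pi0 = np.full(N, 1.0 / N)                  # pi^(0)_j = 1/N, fixed
    B0 = np.full((N, M), 1.0 / M)              # b^(0)_j(k) = 1/M, fixed
    best, best_ll = None, -np.inf
    for _ in range(n_restarts):
        A0 = rng.dirichlet(np.ones(N), size=N) # each row satisfies sum_j a_ij = 1
        A_est, B_est = baum_welch(Y, A0, B0, pi0, n_iter=n_iter)
        _, c = forward_scaled(pi0, A_est, B_est, Y)
        ll = -np.log(c).sum()                  # log P(Y | lambda)
        if ll > best_ll:
            best, best_ll = (A_est, B_est), ll
    return best
```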