Here, we present worked examples of Continuous Density Hidden Markov Models (CDHMM). The term “continuous density” refers to the emission probabilities attached to the hidden states. In the first example, the emission probabilities are assumed to be normally (Gaussian) distributed with known means and standard deviations. This implies that we are not going to infer these parameters; instead, we want to know which emitter is active at which time point. Note that we presuppose that exactly one of the emitters is radiating at any point in time. In a second example, we will use the Baum-Welch algorithm to optimize the parameters of the model, including means, standard deviations, and transition probabilities. Finally, in the last example, we choose a different (non-Gaussian) emission profile.
Much attention has been paid to explaining in some detail how the log-likelihood is calculated. That section can be skipped if it is not of interest, without losing comprehensibility of the other parts of this document.
Theory, notation, and labeling follow these documents:
- The EM algorithm for Hidden Markov Models
- The Hidden Markov Model: transition and emission probabilities
Next, we simulate some observations. We let the emitters radiate normally distributed signals using the means and standard deviations defined above. We let the emitters come into play in the order “EmitterA”, “EmitterB”, “EmitterC”, “EmitterB”, “EmitterA”, that is 1-2-3-2-1, determined by the vector “order”. The duration of each individual emission is defined by the vector “duration”. Of course, these parameters are not known to the algorithm that is supposed to uncover them; they are only used to simulate the observations.
order = c(1,2,3,2,1) # emitters are active in this order
duration = c(40, 50, 60, 50, 30) # emitters are active for this period of time
signal = data.frame(State = emitters[order, "State"], Duration = duration)
signal
## State Duration
## 1 EmitterA 40
## 2 EmitterB 50
## 3 EmitterC 60
## 4 EmitterB 50
## 5 EmitterA 30
Now, we generate the Gaussian signals using the above defined means, standard deviations, and durations, in order to get the signal strength over time:
set.seed(777)
obs = list()
for (i in 1:length(order)) obs[[i]] = rnorm(duration[i], mean = means[order[i]], sd = sds[order[i]])
observation = unlist(obs)
length(observation) # 230 time points
## [1] 230
Here comes a plot of the simulated signal:
colors = c("red", "blue", "darkgreen")
colvecpart = list()
for (i in 1:length(order)) colvecpart[[i]] = rep(colors[order[i]], duration[i])
colvec = unlist(colvecpart)
plot(1:length(observation), observation, col = colvec, type = "p", xlab = "time",
ylab = "signal", pch = 16, cex = 0.85)
title("Emitted signal over time", font.main = 1)
txt = c(emitters[1, "State"], emitters[2, "State"], emitters[3, "State"])
legend("topleft", txt, col = colors, lwd = 3, cex = 0.8)
mtext("Normally distributed signal", side = 3, col = "blue")
We store the true state path, just to compare the calculated results with these true values later:
true_state_path = list()
for (i in 1:length(order)) true_state_path[[i]] = rep(order[i], duration[i])
true_state_path = unlist(true_state_path)
length(true_state_path)
## [1] 230
true_state_path
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1
In the vector “true_state_path”, “1” means “EmitterA”, “2” means “EmitterB”, and “3” means “EmitterC”.
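If you prefer emitter names over numeric codes, the path can be translated using the “emitters” data frame defined above (a small convenience snippet added here for illustration):
head(emitters$State[true_state_path]) # first few time points, shown as emitter names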
Some theory regarding the Viterbi algorithm was presented in Lecture 3 above.
In order to use the package HiddenMarkov, we have to define a list containing the parameters of the emission function. Because we use a normal distribution, the names in this list must match the argument names of the functions dnorm and rnorm, namely mean and sd; see the HiddenMarkov manual.
Here, we store the parameters in the list “eparams”:
eparams = list(mean = means, sd = sds)
eparams
## $mean
## [1] 0 3 8
##
## $sd
## [1] 0.02 0.06 0.10
Moreover, we have to use the function dthmm in order to create a discrete time hidden Markov model object with class “dthmm”:
model = dthmm(x = observation, Pi = alpha, delta = startProbs, distn = "norm", pm = eparams)
class(model)
## [1] "dthmm"
The object includes all necessary data. Have a look at the HiddenMarkov manual for details concerning the dthmm object, or run the command “str(model)” in R in order to see the components of the object.
Now, we are able to decode the observation by applying the Viterbi algorithm. This is done using the function “Viterbi” included in the package. The only argument needed by this function is the dthmm object created above:
states_predicted <- Viterbi(model)
states_predicted
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1
The vector “states_predicted” contains the hidden states predicted by the Viterbi algorithm.
Let us compare the prediction with the true states stored above. In order to do so, we can plot the true states together with the prediction:
plot(1:length(true_state_path), true_state_path, xlab = "time", ylab = "states", col = "red", pch = 18)
title("True states and predicted states", font.main = 1)
mtext("Prediction by Viterby", col = "blue")
points(1:length(true_state_path), states_predicted + 0.03, col = "blue", pch = 16, cex = 0.8)
Here, the predicted values have been artificially shifted 0.03 units upwards so that the true values are not completely covered. We see that there is perfect agreement between the true values and the prediction.
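Instead of judging the agreement by eye, we can also count the mismatches directly (a quick sanity check; given the perfect agreement visible in the plot, the count should be zero):
sum(states_predicted != true_state_path) # number of wrong predictions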
As already mentioned, this was an easy-to-solve problem, because the means of the emitters were far apart relative to their standard deviations. Let’s make the challenge a little bit harder for the Viterbi algorithm by repeating the whole procedure with different data. You can also consider this as a summary of the essential commands:
states = c("EmitterA", "EmitterB", "EmitterC")
means = c(2, 3, 4)
sds = c(0.2, 0.3, 0.3)
emitters = data.frame(State = states, Mean = means, StDev = sds)
emitters
## State Mean StDev
## 1 EmitterA 2 0.2
## 2 EmitterB 3 0.3
## 3 EmitterC 4 0.3
startProbs = c(1/3,1/3,1/3)
alpha = matrix(c(0.6,0.2,0.2,0.8,0.1,0.1,0.4,0.3,0.3), nrow = 3, byrow = TRUE)
rownames(alpha) = states
colnames(alpha) = states
alpha
## EmitterA EmitterB EmitterC
## EmitterA 0.6 0.2 0.2
## EmitterB 0.8 0.1 0.1
## EmitterC 0.4 0.3 0.3
order = c(1,2,3,2,1) # emitters are active in this order
duration = c(40, 50, 60, 50, 30) # emitters are active for this period of time
signal = data.frame(State = emitters[order, "State"], Duration = duration)
signal
## State Duration
## 1 EmitterA 40
## 2 EmitterB 50
## 3 EmitterC 60
## 4 EmitterB 50
## 5 EmitterA 30
set.seed(777)
obs = list()
for (i in 1:length(order)) obs[[i]] = rnorm(duration[i], mean = means[order[i]], sd = sds[order[i]])
observation = unlist(obs)
length(observation) # 230 time points
## [1] 230
Keep in mind that you get different values in repeated runs of “rnorm” if you don’t set a seed.
Plot the simulated observation:
colors = c("red", "blue", "darkgreen")
colvecpart = list()
for (i in 1:length(order)) colvecpart[[i]] = rep(colors[order[i]], duration[i])
colvec = unlist(colvecpart)
plot(1:length(observation), observation, col = colvec, type = "p", xlab = "time", ylab = "signal",
pch = 16, cex = 0.85)
title("Emitted signal over time", font.main = 1)
txt = c(emitters[1, "State"], emitters[2, "State"], emitters[3, "State"])
legend("topleft", txt, col = colors, lwd = 3, cex = 0.8)
mtext("Normally distributed signal", side = 3, col = "blue")
The simulated signal shown in the above plot is clearly noisier this time. Let’s see what the Viterbi algorithm can do:
eparams = list(mean = means, sd = sds)
eparams
## $mean
## [1] 2 3 4
##
## $sd
## [1] 0.2 0.3 0.3
model = dthmm(x = observation, Pi = alpha, delta = startProbs, distn = "norm", pm = eparams)
class(model)
## [1] "dthmm"
states_predicted <- Viterbi(model)
states_predicted
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 2 2 2 2 2 2 2 2 2 3 3 2 3 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 1 2 2 2 1 2 2 3 2 2 2 1 3 2 3 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1
plot(1:length(true_state_path), true_state_path, xlab = "time", ylab = "states", col = "red", pch = 18)
title("True states and predicted states", font.main = 1)
mtext("Prediction by Viterby", col = "blue")
points(1:length(true_state_path), states_predicted + 0.03, col = "blue", pch = 16, cex = 0.8)
Note that we have not repeated the definition of the vector containing the true state path; you can find the code further up. There are only about a dozen wrong predictions despite the rather irregular signal. Nice!
Another decoding approach is posterior decoding. Here, we calculate the probability \(P \left( \pi_i = k \mid \text{observation} \right)\), i.e. the probability that the \(i\)-th state in our state path is “EmitterA”, “EmitterB”, or “EmitterC”, given the observed sequence. This can be calculated for all state path positions \(i\). Note that this is different from the Viterbi approach. Posterior decoding was explained in a little more detail in Lecture 4 above. In the “HiddenMarkov” manual, posterior decoding is also called “local decoding”, in contrast to the expression “global decoding”, which is used for the Viterbi algorithm. The Viterbi algorithm looks primarily at the sequence as a whole, while posterior decoding focuses on single positions in the state path.
The “HiddenMarkov” package provides the function “Estep” which can be employed to do this task. As a signal, we use the “irregular” observation lastly defined, stored in the vector “observation”. Here it is again:
colors = c("red", "blue", "darkgreen")
colvecpart = list()
for (i in 1:length(order)) colvecpart[[i]] = rep(colors[order[i]], duration[i])
colvec = unlist(colvecpart)
plot(1:length(observation), observation, col = colvec, type = "p", xlab = "time", ylab = "signal", pch = 18)
title("Emitted signal over time", font.main = 1)
txt = c(emitters[1, "State"], emitters[2, "State"], emitters[3, "State"])
legend("topleft", txt, col = colors, lwd = 3, cex = 0.8)
mtext("Normally distributed signal", side = 3, col = "blue")
The “Estep” function takes as arguments the observations, the transition probabilities, the start probabilities, a string indicating the distribution of the emission probabilities, and the parameters belonging to this distribution function, in this case means and standard deviations characterizing the normal distribution:
Estep_out = Estep(x = observation, Pi = alpha, delta = startProbs, distn = "norm", pm = eparams)
str(Estep_out)
## List of 3
## $ u : num [1:230, 1:3] 0.989 1 0.996 1 0.916 ...
## $ v : num [1:229, 1:3, 1:3] 0.989 0.996 0.996 0.915 0.911 ...
## $ LL: num -310
The posterior probabilities for each state and for each time point are included in the u-component of the Estep-output:
postprob = Estep_out$u # posterior probabilities
colnames(postprob) = states
Let’s have a look at the first six time points:
head(postprob)
## EmitterA EmitterB EmitterC
## [1,] 0.9892133 0.0107866981 9.262114e-10
## [2,] 0.9995110 0.0004890130 5.868423e-12
## [3,] 0.9961845 0.0038154633 3.432922e-10
## [4,] 0.9995338 0.0004662075 6.475745e-12
## [5,] 0.9159056 0.0840942957 9.355932e-08
## [6,] 0.9952242 0.0047758426 5.505910e-10
We see that the probability is much bigger for “EmitterA” for these time points, so that “EmitterA” can be regarded as the posterior prediction for these time points. In order to get posterior predictions for the whole sequence, we have to identify the column with the biggest probability in each row. Here we go:
maxrow = apply(Estep_out$u, 1, which.max)
maxrow
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 2 2 2 2 2 2 2 2 2 3 3 2 3 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 2 2 2 2 2 2 2 2 2 3 2 2 2 1 2 2 2 1 2 2 3 2 2 2 1 3 2 3 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1
The associated posterior probabilities can simply be fetched with the “max” command:
maxprob = apply(Estep_out$u, 1, max)
head(maxprob)
## [1] 0.9892133 0.9995110 0.9961845 0.9995338 0.9159056 0.9952242
We have shown the first six probabilities only (compare with the vector “postprob” above!).
We can combine these findings in a data frame:
posterior = data.frame(State = maxrow, Prob = maxprob)
head(posterior)
## State Prob
## 1 1 0.9892133
## 2 1 0.9995110
## 3 1 0.9961845
## 4 1 0.9995338
## 5 1 0.9159056
## 6 1 0.9952242
Here comes a comparison of the predictions based on posterior probabilities with the true path:
plot(1:length(true_state_path), true_state_path, xlab = "time", ylab = "states", col = "red", pch = 18)
title("True states and predicted states", font.main = 1)
mtext("Posterior probabilities", col = "blue")
points(1:length(true_state_path), posterior$State + 0.03, col = "blue", pch = 16, cex = 0.8)
Again, there are only a few discrepancies.
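As with the Viterbi prediction, the discrepancies can also be counted directly, and we can check at how many positions local (posterior) and global (Viterbi) decoding disagree (a short added check using the objects defined above):
sum(posterior$State != true_state_path) # posterior decoding errors
sum(posterior$State != states_predicted) # positions where posterior and Viterbi decoding differ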
(This section can be skipped when not of interest, it contains quite a lot of statistical details …)
Another component of the “Estep” output is the log-likelihood:
Estep_out$LL
## [1] -309.6714
Here is what Wikipedia says about the likelihood when continuous probability distributions are involved:
“Let \(X\) be a random variable … with density function \(f(x)\) which depends on a parameter \(\theta\). Then the function \[ \mathcal{L}\left( \theta \mid x \right) = f(x \mid \theta) \] considered as a function of \(\theta\) is the likelihood function …”.
On the other hand, if we have a mixture of continuous distributions, the likelihood becomes a weighted sum of the density functions. If we have a mixture of three distributions, this reads:
\[ \mathcal{L}\left( \theta \mid x \right) = \lambda_1 \cdot f(x \mid \theta_1) + \lambda_2 \cdot f(x \mid \theta_2) + \lambda_3 \cdot f(x \mid \theta_3) \] where the weights add to one: \(\lambda_1 + \lambda_2 +\lambda_3 = 1\). If we consider 3 normal distributions, we get:
\[ \mathcal{L}\left( \theta \mid x \right) = \lambda_1 \cdot \mathcal{N}(x \mid \mu_1 , \sigma_1) + \lambda_2 \cdot \mathcal{N}(x \mid \mu_2 , \sigma_2) + \lambda_3 \cdot \mathcal{N}(x \mid \mu_3 , \sigma_3) \] This is the formula we’ll need very soon (because we have 3 states with Gaussian emission profiles).
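In R, such a three-component Gaussian mixture likelihood for a single observation is just a weighted sum of dnorm values. A minimal sketch (the observation value and the equal weights are chosen for illustration only), using the means and standard deviations of our three emitters:
x = 2.5 # an arbitrary observation value, for illustration
lambda = c(1/3, 1/3, 1/3) # illustrative mixture weights, summing to one
sum(lambda * dnorm(x, mean = means, sd = sds)) # mixture likelihood of x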
Note that the likelihood is not a probability (it is a function of a parameter \(\theta\), and its integral over \(\theta\) is not one!), but it can be used to compare the plausibility of different parameter settings in statistical models. The bigger the likelihood, the more plausible it is that the parameters of the model are the ones that actually generated the observed data. Read more about likelihood in this interesting article by Ajay Halthor.
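As a tiny added illustration of this idea: for a single observed value, we can compare the likelihood under two candidate parameter settings; the larger value marks the more plausible setting.
x = 2.1 # a hypothetical observed value
dnorm(x, mean = 2, sd = 0.2) # likelihood under the parameters of EmitterA
dnorm(x, mean = 3, sd = 0.3) # likelihood under the parameters of EmitterB (much smaller here)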
Let us have a closer look at the log-likelihood of the above defined Hidden Markov Model by trying to recap how it was calculated. In order to do so, we have to calculate the likelihood of all observations throughout the whole timeline. While the likelihood value for the whole chain is obtained by multiplying the likelihood values referring to the individual time points, the log-likelihood values have to be added.
It is convenient to have the emission density values of all emitters at all time points at hand. These follow from the observations, as well as from the means and standard deviations of emitters A, B, and C:
epsilon = matrix(as.double(0), nrow = length(observation), ncol = ncol(alpha))
for (k in 1:length(means)) epsilon[, k] = dnorm(observation, mean = means[k], sd = sds[k])
colnames(epsilon) = states
rownames(epsilon) = paste0("x", 1:length(observation))
head(epsilon) # emission densities
## EmitterA EmitterB EmitterC
## x1 1.7692471 0.014473781 2.483320e-09
## x2 1.8424234 0.002046825 4.877615e-11
## x3 1.7507119 0.015096095 2.714131e-09
## x4 1.8422246 0.002045496 4.871517e-11
## x5 0.5209263 0.107988746 2.380297e-07
## x6 1.6446189 0.018766422 4.312518e-09
Now, it is quite easy to calculate the likelihood of the observation at time point one. According to the definition of the likelihood given above, it is the weighted sum of the density distribution functions of the emitters A, B, and C at this time point, where the weights are the probabilities referring to these states. The latter are the initial probabilities stored in “startProbs”, i.e. the vector \(\left[1/3 , 1/3, 1/3\right]\).
Lhood = sum(startProbs * epsilon[1, ])
Lhood
## [1] 0.5945736
This is the likelihood at time point 1. The log-likelihood is therefore
logL = log(Lhood)
logL
## [1] -0.5199107
As we will see very soon, we also need to calculate the posterior probabilities that the system was in state A, B, or C at time point 1, given the observation \(x_1\), i.e. we need the conditional probabilities \(P\left(\pi_1 = A \mid x_1\right)\), \(P\left(\pi_1 = B \mid x_1\right)\), and \(P\left(\pi_1 = C \mid x_1\right)\). They are different from “startProbs”, because “startProbs” is a pure guess, while the conditional probabilities take the observed data \(x_1\) into account.
The posterior probability that the first emission originated from state A, \(P\left(\pi_1 = A \mid x_1\right)\), can be estimated using the expression
\[ \begin{align*} P\left( \pi_1 = A \mid x_1 \right) &= \frac{P\left( \pi_1 = A \right) \cdot \mathcal{N}\left( x_1 \mid \mu_A , \sigma_A \right) }{P\left( \pi_1 = A \right) \cdot \mathcal{N}\left( x_1 \mid \mu_A , \sigma_A \right) + P\left( \pi_1 = B \right) \cdot \mathcal{N}\left( x_1 \mid \mu_B , \sigma_B \right) + P\left( \pi_1 = C \right) \cdot \mathcal{N}\left( x_1 \mid \mu_C , \sigma_C \right) } \\\\ &= \frac{1/3 \cdot \mathcal{N}\left( x_1 \mid \mu_A , \sigma_A \right) }{1/3 \cdot \mathcal{N}\left( x_1 \mid \mu_A , \sigma_A \right) + 1/3 \cdot \mathcal{N}\left( x_1 \mid \mu_B , \sigma_B \right) + 1/3 \cdot \mathcal{N}\left( x_1 \mid \mu_C , \sigma_C \right) } \end{align*} \] Calculation of \(P\left(\pi_1 = B \mid x_1\right)\) and \(P\left(\pi_1 = C \mid x_1\right)\) is done in a similar way. The rationale behind the above formula might become clearer with the help of this plot (showing a mixture of only two Gaussians):
We see that the posterior probability that an individual Gaussian generated the observation \(x_1\) is estimated as the proportion with which this Gaussian contributes to the total mixture density at \(x_1\). A more detailed explanation of these so-called “membership probabilities” can be found in this document or in the E-Step paragraph of this document.
In R, we calculate the posterior probabilities for state A, B, and C in the following way:
pi = startProbs * epsilon[1, ]/sum(startProbs * epsilon[1, ])
print(format(pi, digits = 3), quote = FALSE)
## EmitterA EmitterB EmitterC
## 9.92e-01 8.11e-03 1.39e-09
We see that it is most likely that the first emission came from emitter A. The vector \(\pi\) contains probabilities and must therefore add up to 1:
sum(pi)
## [1] 1
In order to explain the observation \(x_2\) at time point 2, we have to consider two events: a transition to the next state variable A, B, or C, and an emission from this state leading to the observation \(x_2\).
The change of the state probabilities \(\pi\) caused by the transition to the next time point can be calculated by multiplying the current probability vector \(\pi = \left[ P\left( \pi_1 = A \right) , P\left(\pi_1 = B \right) , P\left(\pi_1 = C \right) \right]\) (conditioning on \(x_1\) suppressed) with the transition matrix:
pi = pi %*% alpha
print(format(pi, digits = 3), quote = FALSE)
## EmitterA EmitterB EmitterC
## [1,] 0.602 0.199 0.199
How can we motivate this? Let’s have a look at the transition matrix and what its elements mean in statistical terms:
\[ \alpha = \begin{bmatrix} P \left( \pi_2 = A \mid \pi_1 = A \right) & P\left( \pi_2 = B \mid \pi_1 = A \right) & P\left( \pi_2 = C \mid \pi_1 = A\right) \\ P \left( \pi_2 = A \mid \pi_1 = B \right) & P\left( \pi_2 = B \mid \pi_1 = B \right) & P\left( \pi_2 = C \mid \pi_1 = B \right) \\ P \left( \pi_2 = A \mid \pi_1 = C \right) & P\left( \pi_2 = B \mid \pi_1 = C \right) & P\left( \pi_2 = C \mid \pi_1 = C \right) \end{bmatrix} \]
Therefore, the above R code multiplies
\[ \left[ P\left( \pi_1 = A \right) \;\;\; P\left(\pi_1 = B \right) \;\;\; P\left(\pi_1 = C \right) \right] \cdot \begin{bmatrix} P \left( \pi_2 = A \mid \pi_1 = A \right) & P\left( \pi_2 = B \mid \pi_1 = A \right) & P\left( \pi_2 = C \mid \pi_1 = A\right) \\ P \left( \pi_2 = A \mid \pi_1 = B \right) & P\left( \pi_2 = B \mid \pi_1 = B \right) & P\left( \pi_2 = C \mid \pi_1 = B \right) \\ P \left( \pi_2 = A \mid \pi_1 = C \right) & P\left( \pi_2 = B \mid \pi_1 = C \right) & P\left( \pi_2 = C \mid \pi_1 = C \right) \end{bmatrix} \]
Let us only develop the first component \(p_1\) of the result:
\[ \begin{align*} p_1 &= P\left( \pi_1 = A \right) \cdot P \left( \pi_2 = A \mid \pi_1 = A \right) + P\left(\pi_1 = B \right) \cdot P \left( \pi_2 = A \mid \pi_1 = B \right) + P\left(\pi_1 = C \right) \cdot P \left( \pi_2 = A \mid \pi_1 = C \right) \\ &= P \left( \pi_1 = A , \pi_2 = A \right) + P \left( \pi_1 = B , \pi_2 = A \right) + P \left( \pi_1 = C , \pi_2 = A \right) \\ &= P \left(\pi_2 = A \right) \end{align*} \]
The last line follows from the marginal rule. Likewise, we can show that the \(2\)-nd and \(3\)-rd components are \(P\left(\pi_2 = B \right)\) and \(P\left(\pi_2 = C \right)\), respectively. To summarize, the above shown R code calculates the vector \(\left[ P\left(\pi_2 = A\right) , P\left(\pi_2 = B \right) , P \left(\pi_2 = C \right)\right]\) , i.e. the state probabilities at time point 2.
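This marginalization can be verified numerically with any probability vector (an added check; the vector “p” below is arbitrary and only serves as an example):
p = c(0.5, 0.3, 0.2) # an arbitrary probability vector over the three states
(p %*% alpha)[1] # first component of the propagated vector ...
sum(p * alpha[, 1]) # ... equals the explicit sum over the first column of alpha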
We can check again if the state probabilities sum up to one:
sum(pi)
## [1] 1
Now, as for observation \(x_1\), we calculate the likelihood of the observation \(x_2\) as a weighted sum of the density distribution functions of the emitters A, B, and C at this time point. Again, the weights are the actual state probabilities, stored in the just calculated vector \(\left[ P\left(\pi_2 = A\right) , P\left(\pi_2 = B \right) , P \left(\pi_2 = C \right)\right]\).
In R code, this is
Lhood = sum(pi * epsilon[2, ])
The joint log-likelihood of observations one and two is therefore
logL = logL + log(Lhood)
logL
## [1] -0.4165857
(In order to get the joint likelihood of the chain, the likelihoods of all observations have to be multiplied, but the log-likelihoods must be added.)
Next, as for time point 1, we estimate the posterior state probabilities by calculating the membership degree of every emitter for observation \(x_2\):
pi = pi * epsilon[2, ]/sum(pi * epsilon[2, ])
print(format(pi, digits = 3), quote = FALSE)
## EmitterA EmitterB EmitterC
## [1,] 1.00e+00 3.68e-04 8.76e-12
Again, it might improve the trust in our algorithm if we can confirm that the calculated probabilities sum up to 1:
sum(pi)
## [1] 1
Now, we repeat the steps carried out for time point 2 for all subsequent time points, until the end of the sequence is reached.
To summarize, what we have done to get the log-likelihood is:
1. Compute the emission densities of all emitters at all time points (the matrix “epsilon”).
2. Compute the likelihood of the first observation as the weighted sum of its emission densities, using the start probabilities as weights.
3. Initialize the log-likelihood with the logarithm of this value and compute the posterior state probabilities for time point 1.
4. Propagate the state probabilities to the next time point by multiplying them with the transition matrix.
5. Compute the likelihood of the next observation as the weighted sum of its emission densities, using the propagated state probabilities as weights.
6. Add the logarithm of this likelihood to the log-likelihood.
7. Update the posterior state probabilities using the new observation.
Steps 4 to 7 are repeated until the end of the string of observations.
Based on the algorithm introduced above, the log-likelihood can be calculated using the following function:
logLikelihood <- function(observation, alpha, means, sds, startProbs) {
  epsilon = matrix(as.double(0), nrow = length(observation), ncol = ncol(alpha))
  for (k in 1:length(means)) epsilon[, k] = dnorm(observation, mean = means[k], sd = sds[k])
  Lhood = sum(startProbs * epsilon[1, ]) # x_1
  logL = log(Lhood)
  pi = startProbs * epsilon[1, ]/sum(startProbs * epsilon[1, ])
  for (i in 2:length(observation)) { # x_2 until x_end
    pi = pi %*% alpha # new state probabilities resulting from transition
    Lhood = sum(pi * epsilon[i, ]) # likelihood for next observation
    logL = logL + log(Lhood) # increment log-likelihood
    pi = pi * epsilon[i, ]/sum(pi * epsilon[i, ]) # new posterior state probabilities
  }
  out = list()
  out[["epsilon"]] = epsilon
  out[["logL"]] = logL
  invisible(out)
}
The code presented is written for comprehensibility, not for efficiency. For instance, it is not efficient to evaluate the expression “sum(pi * epsilon[i, ])” twice inside the loop; a faster approach would be to buffer the sum in a temporary variable. Moreover, one should consider declaring all variables as double precision, because numerical errors may accumulate if the chain is long.
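For illustration, the inner loop could be rewritten with the sum buffered in a temporary variable like this (a sketch only; the behaviour is identical to the loop in the function above):
for (i in 2:length(observation)) {
  pi = pi %*% alpha # propagate state probabilities by one time step
  weighted = pi * epsilon[i, ] # weighted emission densities, computed only once
  S = sum(weighted) # buffered sum = likelihood of observation i
  logL = logL + log(S) # increment log-likelihood
  pi = weighted/S # new posterior state probabilities
}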
Finally, let us see if the function yields the same result as the output of the “Estep” function obtained above:
result = logLikelihood(observation, alpha, means, sds, startProbs)
result$logL
## [1] -309.6714
The “Estep” result was:
Estep_out$LL
## [1] -309.6714
The log-likelihood can also be calculated using a function coming with the “HiddenMarkov” package. The only argument for this function is the “dthmm” object declared above:
logLik(model)
## [1] -309.6714
Above, the parameters of the model, e.g. the means and the standard deviations of the Gaussian emission distribution, as well as the transition probabilities, were given. However, this might not always be the case; in fact, it will rarely be the case. How can we know the model parameters underlying some observation? In order to unravel unknown parameters of an HMM, a variant of the Expectation-Maximization (EM) algorithm, the Baum-Welch algorithm, can be used.
Some details regarding EM and Baum-Welch were already provided in the article EM algorithm for Hidden Markov Models. An example for a mixture of three (2D-) Gaussians is presented here. You can also have a look at the “dishonest-casino” example.
The Baum-Welch algorithm finds maximum likelihood (ML) estimates of the HMM model parameters, given the observed emissions. It should be repeated here that the method finds local maxima of the likelihood, which might not be the global optimum (the initial values of the parameters are critical).
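Because only a local maximum is found, it can be worthwhile to rerun the estimation from different starting values and to keep the fit with the highest log-likelihood. A sketch of such a restart, using the BaumWelch call introduced below (the alternative starting parameters are made up for illustration):
alt_eparams = list(mean = c(1, 3, 5), sd = c(0.5, 0.5, 0.5)) # deliberately different initial guess
alt_model = dthmm(x = observation, Pi = alpha, delta = startProbs, distn = "norm", pm = alt_eparams)
alt_fit = BaumWelch(alt_model, control = bwcontrol(maxiter = 500, tol = 1e-05, prt = FALSE))
alt_fit$LL # compare with the log-likelihood of the original fit; keep the larger one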
Applying the Baum-Welch algorithm with the help of the HiddenMarkov package is rather simple. All we need is the model object defined above, which must be of class dthmm:
class(model)
## [1] "dthmm"
The behavior of the function can be tweaked via its control argument, which is set with the function bwcontrol; for details, see the HiddenMarkov manual. Having the model object at hand, we are ready to run the Baum-Welch algorithm:
bwout = BaumWelch(model, control = bwcontrol(maxiter = 500, tol = 1e-05, prt = FALSE))
str(bwout)
## List of 13
## $ x : num [1:230] 2.1 1.92 2.1 1.92 2.33 ...
## $ Pi : num [1:3, 1:3] 9.85e-01 1.01e-02 1.81e-24 1.48e-02 9.80e-01 ...
## $ delta : num [1:3] 1.00 6.26e-22 2.51e-86
## $ distn : chr "norm"
## $ pm :List of 2
## ..$ mean: num [1:3] 2.03 2.96 4.04
## ..$ sd : num [1:3] 0.207 0.303 0.275
## $ pn : NULL
## $ discrete: logi FALSE
## $ nonstat : logi TRUE
## $ u : num [1:230, 1:3] 1 1 1 1 1 ...
## $ v : num [1:229, 1:3, 1:3] 1 1 1 1 1 ...
## $ LL : num -40.4
## $ iter : int 7
## $ diff : num 5.5e-06
## - attr(*, "class")= chr "dthmm"
The output of the BaumWelch function contains updated values of the HMM parameters. Let us compare the new parameters with the ones we had initially chosen:
The new transition probabilities are stored in the Pi list element:
print(format(bwout$Pi, digits = 3), quote = FALSE)
## [,1] [,2] [,3]
## [1,] 9.85e-01 1.48e-02 1.27e-16
## [2,] 1.01e-02 9.80e-01 1.01e-02
## [3,] 1.81e-24 1.70e-02 9.83e-01
We had stored the original transition probabilities in the \(\alpha\) matrix:
alpha
## EmitterA EmitterB EmitterC
## EmitterA 0.6 0.2 0.2
## EmitterB 0.8 0.1 0.1
## EmitterC 0.4 0.3 0.3
We see that the numerical values of the parameters changed considerably.
Here are the new means and standard deviations of the Gaussian emission distribution, compared to the values we had chosen before:
bwout$pm$mean
## [1] 2.027118 2.962835 4.044831
means
## [1] 2 3 4
bwout$pm$sd
## [1] 0.2075000 0.3032232 0.2754277
sds
## [1] 0.2 0.3 0.3
Here, the differences are not too big.
Let us also have a look at the posterior probabilities (before and after Baum-Welch):
head(postprob)
## EmitterA EmitterB EmitterC
## [1,] 0.9892133 0.0107866981 9.262114e-10
## [2,] 0.9995110 0.0004890130 5.868423e-12
## [3,] 0.9961845 0.0038154633 3.432922e-10
## [4,] 0.9995338 0.0004662075 6.475745e-12
## [5,] 0.9159056 0.0840942957 9.355932e-08
## [6,] 0.9952242 0.0047758426 5.505910e-10
head(bwout$u)
## [,1] [,2] [,3]
## [1,] 1.0000000 8.015430e-26 1.082473e-103
## [2,] 0.9999997 3.311328e-07 3.104907e-35
## [3,] 0.9999980 2.017395e-06 1.889249e-24
## [4,] 0.9999996 4.039083e-07 8.131014e-24
## [5,] 0.9999657 3.428319e-05 7.673936e-21
## [6,] 0.9999968 3.154977e-06 9.792904e-22
As above, we can find the new posterior predictions for the whole sequence by identifying the column with the biggest probability in each row:
maxrow = apply(bwout$u, 1, which.max)
maxprob = apply(bwout$u, 1, max)
new_posterior = data.frame(State = maxrow, Prob = maxprob)
head(new_posterior)
## State Prob
## 1 1 1.0000000
## 2 1 0.9999997
## 3 1 0.9999980
## 4 1 0.9999996
## 5 1 0.9999657
## 6 1 0.9999968
Now, we can compare the posterior probabilities before and after application of the Baum-Welch algorithm:
plot(1:nrow(posterior), posterior$Prob, xlab = "time", ylab = "Posterior probability", col = "red", pch = 18)
title("Comparison of posterior probabilities", font.main = 1)
mtext("Before and after Baum-Welch", col = "blue")
points(1:nrow(posterior), new_posterior$Prob, col = "blue", pch = 16, cex = 0.8)
legend("left", c("before", "after"), col = c("red","blue"), cex = 0.8, pch = c(18, 16))
Let us also compare the results of Viterbi decoding obtained before and after application of the Baum-Welch algorithm. In order to do that, we first have to carry out a Viterbi prediction as above, but considering the new parameters estimated by Baum-Welch. Therefore, we
- declare the list with the new emission parameters,
- create the “dthmm” model object, and
- run the Viterbi:
new_eparams= list(mean = bwout$pm$mean, sd = bwout$pm$sd)
new_eparams
## $mean
## [1] 2.027118 2.962835 4.044831
##
## $sd
## [1] 0.2075000 0.3032232 0.2754277
new_model = dthmm(x = observation, Pi = bwout$Pi, delta = bwout$delta, distn = "norm", pm = new_eparams)
class(new_model)
## [1] "dthmm"
str(new_model)
## List of 8
## $ x : num [1:230] 2.1 1.92 2.1 1.92 2.33 ...
## $ Pi : num [1:3, 1:3] 9.85e-01 1.01e-02 1.81e-24 1.48e-02 9.80e-01 ...
## $ delta : num [1:3] 1.00 6.26e-22 2.51e-86
## $ distn : chr "norm"
## $ pm :List of 2
## ..$ mean: num [1:3] 2.03 2.96 4.04
## ..$ sd : num [1:3] 0.207 0.303 0.275
## $ pn : NULL
## $ discrete: logi FALSE
## $ nonstat : logi TRUE
## - attr(*, "class")= chr "dthmm"
new_states_predicted <- Viterbi(new_model)
new_states_predicted
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1
We still have the true state path in memory (“true_state_path”), so that we can compare the old and new predictions with the correct results. Before involving Baum-Welch, we had 14 wrong predictions:
sum(states_predicted != true_state_path)
## [1] 14
But there is only one false prediction now:
sum(new_states_predicted != true_state_path)
## [1] 1
which(new_states_predicted != true_state_path)
## [1] 91
new_states_predicted[91]
## [1] 2
true_state_path[91]
## [1] 3
The only wrong prediction is at time point 91. The Viterbi algorithm predicts state 2, but the true state is 3. Let’s have a look at the data around this position:
colvec = c(rep("blue",6), rep("darkgreen",6))
colvec[7] = "red"
plot(85:96, observation[85:96], col = colvec, type = "p", xlab = "time",
ylab = "signal", pch = 16, cex = 0.85)
title("Emitted signal over time (zoomed)", font.main = 1)
mtext("Normally distributed signal", side = 3, col = "blue")
Time point 91 is shown in red. We see that the signal strength is relatively low considering that the true state is “EmitterC”:
observation[91]
## [1] 3.478679
The value is closer to 3 (the mean of Emitter B) than to 4, which is the mean of Emitter C. Moreover, the estimated transition probability from B to C is only about 1 percent:
bwout$Pi[2,3] # B-C transition probability (after BW)
## [1] 0.01005611
As a consequence, the algorithm “thinks” that time point 91 still belongs to emitter B.
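We can make this reasoning explicit by comparing the two competing explanations for time point 91, staying in B versus moving to C (an added, simplified calculation using the Baum-Welch estimates; the Viterbi algorithm of course also takes the rest of the path into account):
x91 = observation[91]
bwout$Pi[2, 2] * dnorm(x91, mean = bwout$pm$mean[2], sd = bwout$pm$sd[2]) # stay in B and emit x91 from B
bwout$Pi[2, 3] * dnorm(x91, mean = bwout$pm$mean[3], sd = bwout$pm$sd[3]) # move to C and emit x91 from C
The first product is clearly larger, so staying in emitter B is the more plausible explanation for this single time point.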
Finally, we can have a look at the likelihood (before and after Baum-Welch):
exp(Estep_out$LL)
## [1] 3.246657e-135
exp(bwout$LL)
## [1] 2.845517e-18
The likelihood has grown considerably, which means that the model now fits the data much better. This is simply because the model parameters are better now!