EM algorithm for a mixture of 2-D Gaussians

1. Generating observations from a mixture of 3 two-dimensional Gaussians

we specify the means, standard deviations, correlation coefficients, and weights for all 3 clusters:

mu1 = c(5, 5)   # x and y component of mean value vector for cluster 1
mu2 = c(9, 9)   # x and y component of mean value vector for cluster 2
mu3 = c(8, 5)   # x and y component of mean value vector for cluster 3

sigma1 = c(1.0, 1.1)    # x and y component of standard deviation vector for cluster 1 
sigma2 = c(1.0, 1.0)    # x and y component of standard deviation vector for cluster 1
sigma3 = c(1.0, 1.1)    # x and y component of standard deviation vector for cluster 1         

rhoxy1 = 0.8    # correlation between x and y, cluster 1
rhoxy2 = 0.9    # correlation between x and y, cluster 2
rhoxy3 = 0.6    # correlation between x and y, cluster 3

alpha = c(0.3, 0.3, 0.4)    # cluster weights
sum(alpha) == 1             # must be TRUE

## [1] TRUE

the covariance matrix for cluster k is then constructed using the corresponding standard deviations and correlation coefficients:

\[ \pmb{\Sigma_k} = \begin{pmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix} \]

(the cluster index k was suppressed in the matrix for better visibility)
the corresponding R code is:

mat = c(sigma1[1]^2, rhoxy1*sigma1[1]*sigma1[2], rhoxy1*sigma1[1]*sigma1[2], sigma1[2]^2)
Sigma1 = matrix(mat, nrow = 2, byrow = TRUE)    # Covariance matrix for cluster 1
mat = c(sigma2[1]^2, rhoxy2*sigma2[1]*sigma2[2], rhoxy2*sigma2[1]*sigma2[2], sigma2[2]^2)
Sigma2 = matrix(mat, nrow = 2, byrow = TRUE)    # Covariance matrix for cluster 2
mat = c(sigma3[1]^2, rhoxy3*sigma3[1]*sigma3[2], rhoxy3*sigma3[1]*sigma3[2], sigma3[2]^2)
Sigma3 = matrix(mat, nrow = 2, byrow = TRUE)    # Covariance matrix for cluster 3

the \(\Sigma_k\) are symmetric, positive (semi-) definite matrices
once mean vector \(\mu_k\) and covariance matrix \(\Sigma_k\) are specified, a sample point from a multivariate Gaussian can be generated by combining samples of standard-normal variables:
- to construct a d-dimensional Gaussian, we have to determine a (d x d) - matrix \(A\) so that \(A \cdot A^T = \Sigma\)
- the matrix \(A\) can be calculated from \(\Sigma\) using Cholesky decomposition
- then, a d-dimensional vector \(z\) of \(N(0,1)\) variables is generated using the built-in R-function rnorm()
- finally, the sample point \(w\) of the multivariate Gaussian is calculated as \(w = A \cdot z + \mu\)
- the function rmultinorm() combines these steps:

rmultinorm <- function(mu, Sigma){
  d = length(mu)       # d = dimension
  A = t(chol(Sigma))   # determine a matrix A so that A %*% t(A) = Sigma, "Cholesky decomposition" 
  Z = rnorm(d)         # generate N(0,1) sample point with dimension d  
  w = A %*% Z + mu     # sample point of the d-dimensional Gaussian
  return(w[,1])        # returns sample point of a multivariate Gaussian with mean mu and covariance matrix Sigma
}

a better description of the latter algorithm is provided by Matthew Stephens at https://stephens999.github.io/fiveMinuteStats/mvnorm.html
for tidiness, we collect the mean vectors and covariance matrices in lists:

mu = list()
mu[[1]] = mu1
mu[[2]] = mu2
mu[[3]] = mu3

Sigma = list()
Sigma[[1]] = Sigma1
Sigma[[2]] = Sigma2
Sigma[[3]] = Sigma3

using the rmultinorm() function, we conduct two-step experiment to generate sample points from a 2-D Gaussian mixture (with 3 clusters)
- first we randomly choose one of the 3 clusters
- subsequently, we draw a sample point from the chosen cluster
this is realized in the following loop:
(note that all code snippets are designed for comprehensibility, not for efficiency!)

N = 1000   # number of observations to generate
x = matrix(NA, ncol = 2, nrow = N)     # matrix for 1000 2-dim observations, one row = one observation
cols = rep(NA, N)                      # color recalls which Gaussian was chosen
color_palette = c("red", "darkgreen", "blue")   # different color for each cluster

for (i in 1:N) {
  component = sample(c(1,2,3), size = 1, prob = alpha)      # draw a cluster with probability alpha
  x[i,] = rmultinorm(mu[[component]], Sigma[[component]])   # generate a 2-dim sample from the chosen Gaussian
  cols[i] = color_palette[component]                        # assign a color to the sample point
}
dim(x)  # the random sample points are stored in the array x

## [1] 1000    2

plot the generated sample points
color indicates the cluster a sample point originated from
- Important: in practice, we do not know from which cluster a sample point originated from (the origins are “hidden variables”)

plot(x[,1], x[,2], col = cols, main = "Observations from Gaussian mixture", font.main = 1)

2. Estimation of the parameters from the sample points using EM`

2.1. Initialization

in practice, the origins of the sample points are not known
that means we have to guess the parameters from a plot of the sample points without knowledge of the hidden variables (without “color” as above):

plot(x[,1], x[,2], col = "grey", main = "Observations from Gaussian mixture", font.main = 1)

we have to know (or to guess) the number of clusters K the points are produced from
also, we define how many iterations we want to carry out at most

K = 3           # we have to know (or to guess) the number of Gaussians
max_iter = 50   # maximum number of iterations

we label the first guess with the additional characters “_t"
of course, the first guess will only roughly agree with the “true” parameters (here, the true parameters are the ones we used above for sample generation)

mu1_t = c(6, 6)     # x and y component of mean value vector for cluster 1
mu2_t = c(8, 8)     # x and y component of mean value vector for cluster 2
mu3_t = c(7, 6)     # x and y component of mean value vector for cluster 3

sigma1_t = c(1.2, 1.0)    # x and y component of standard deviation vector for cluster 1 
sigma2_t = c(1.2, 1.2)    # x and y component of standard deviation vector for cluster 1
sigma3_t = c(1.4, 1.5)    # x and y component of standard deviation vector for cluster 1         

rhoxy1_t = 0.7    # correlation between x and y, cluster 1
rhoxy2_t = 0.8    # correlation between x and y, cluster 2
rhoxy3_t = 0.7    # correlation between x and y, cluster 3

alpha_t = c(0.3, 0.3, 0.4)    # cluster weights
sum(alpha_t) == 1             # must be TRUE

## [1] TRUE

with that, we have to construct the first guess of the covariance matrices \(\Sigma_k\):

mat = c(sigma1_t[1]^2, rhoxy1_t*sigma1_t[1]*sigma1_t[2], rhoxy1_t*sigma1_t[1]*sigma1_t[2], sigma1_t[2]^2)
Sigma1_t = matrix(mat, nrow = 2, byrow = TRUE)    # Covariance matrix for cluster 1
mat = c(sigma2_t[1]^2, rhoxy2_t*sigma2_t[1]*sigma2_t[2], rhoxy2_t*sigma2_t[1]*sigma2_t[2], sigma2_t[2]^2)
Sigma2_t = matrix(mat, nrow = 2, byrow = TRUE)    # Covariance matrix for cluster 2
mat = c(sigma3_t[1]^2, rhoxy3_t*sigma3_t[1]*sigma3_t[2], rhoxy3_t*sigma3_t[1]*sigma3_t[2], sigma3_t[2]^2)
Sigma3_t = matrix(mat, nrow = 2, byrow = TRUE)    # Covariance matrix for cluster 3

collect the mean vectors and covariance matrices in lists:

mu_t = list()
mu_t[[1]] = mu1_t
mu_t[[2]] = mu2_t
mu_t[[3]] = mu3_t

Sigma_t = list()
Sigma_t[[1]] = Sigma1_t
Sigma_t[[2]] = Sigma2_t
Sigma_t[[3]] = Sigma3_t

it can be interesting to see a contour plot of the probability distribution that follows from our first guess together with the “observed” sample points
this might give us a clue if the guess if fairly well … or completely aside
- we have to define the grid for the contour plot first:

xmin = 0 ; xmax = 12   # define the plot range by looking at the sample point plot
ymin = 0 ; ymax = 12
x1 = seq(xmin, xmax, length = 200)
x2 = seq(ymin, ymax, length = 200)

we need also a function that gives us a sample point from a mixture of Gaussian distributions
the function fx_Gauss() can do that:

fx_Gauss <- function(x, y, mu, Sigma, alpha) {   # Density function for a mixture of 2-D Gaussians
  if(length(x) != length(y)) stop("\n 'x' and 'y' must have equal length \n\n")
  N = nrow(x)
  if(!is.list(mu)) stop("\n 'mu' must be a list with K components.\n\n")
  if(!is.list(Sigma)) stop("\n 'Sigma' must be a list with K components.\n\n")
  K = length(mu)   # number of clusters
  density = 0
  for (k in 1:K) density = density + alpha[k]*dmvnorm(cbind(x, y), mean = mu[[k]], sigma = Sigma[[k]])
  return(density)
}

fx_Gauss() uses the function dmvnorm() from the R-package mvtnorm, load the library:

library(mvtnorm)

using outer, we define a 2-dimensional grid for the contour plot
we use fx_Gauss() to determine the function value for each grid point
the function values are based on the first guess (variables mu_t, Sigma_t, alpha_t):

fx_t = outer(x1, x2, FUN = function(x, y) fx_Gauss(x, y, mu = mu_t, Sigma = Sigma_t, alpha = alpha_t))
plot(x[,1], x[,2], col = "grey", main = "Observations from Gaussian mixture", font.main = 1)  # sample points
contour(x1, x2, fx_t, col = "red", lty = 2, add = TRUE)   # contour-plot representing the 1st guess

we see that the first guess is not perfect but might work
because we produced the sample points ourselves, and thereby know the true parameters, we can cheat a little bit and add the contour plots based on these true parameter values:

plot(x[,1], x[,2], col = "grey", main = "Observations from Gaussian mixture", font.main = 1)  # sample points
contour(x1, x2, fx_t, col = "red", lty = 2, add = TRUE)         # contour-plot representing the 1st guess
fx_sol = outer(x1, x2, FUN = function(x, y) fx_Gauss(x, y, mu = mu, Sigma = Sigma, alpha = alpha))
contour(x1, x2, fx_sol, col = "green", lty = 2, add = TRUE)     # contour-plot representing the true prob. distribution

red contour lines stand for the first guess, green contour lines show the true distribution
the 1^st guess is not brilliant but we’ll give it a chance!
we can calculate the log likelihood of the samples based on the 1^st guess:

logL = 0   
for (k in 1:K) logL = logL + alpha_t[k] * dmvnorm(x, mu_t[[k]], Sigma_t[[k]])   # k-sum
logL = log(logL)
logL = sum(logL)    # i-sum
cat(paste(" Log likelihood after 1st guess:", signif(logL,4), "\n"))

##  Log likelihood after 1st guess: -4255

Lvector = logL      # store the first value, append logL-values during the EM loop below

define data structures for the updated parameters:

mu = list()
mu[[1]] = rep(NA, 2)  # components x and y
mu[[2]] = rep(NA, 2)
mu[[3]] = rep(NA, 2)
Sigma = list()
Sigma[[1]] = matrix(NA, nrow = 2, ncol = 2)
Sigma[[2]] = matrix(NA, nrow = 2, ncol = 2)
Sigma[[3]] = matrix(NA, nrow = 2, ncol = 2)
alpha = rep(NA, K)          # K = number of clusters

we must not forget to define maximum relative errors for the parameters:

e_mu_max = 0.001       # means
e_Sigma_max = 0.001    # covariance matrices
e_alpha_max = 0.001    # mixture weights

2.2. Looping through E- and M-steps

now we can alternate through E- and M-step
(note that all code snippets are designed for comprehensibility, not for efficiency!)
during the E-step, we calculate the “membership probabilities” for the sample points, \(\omega_{ik}\):

\[ \omega_{ik} = \; \frac{ \alpha_k^t \cdot f_k \left( \pmb{x_i} \mid \pmb{\mu_k^t}, \pmb{\Sigma_k^t} \right) }{\sum_{k} \alpha_k^t \cdot f_k \left( \pmb{x_i} \mid \pmb{\mu_k^t}, \pmb{\Sigma_k^t} \right)} \]

during the M-step, we update the parameter values (to maximize \(Q_1(\theta, \theta_t\)):

\[ \pmb{\mu_m} = \frac{\sum_{i=1}^N \omega_{im} \cdot \pmb{x_i}}{\sum_{i=1}^N \omega_{im}} \]

\[ \pmb{\Sigma_m} = \frac{ \sum_{i=1}^N \omega_{im} \cdot \left(\pmb{x_i} - \pmb{\mu_m} \right) \cdot \left(\pmb{x_i} - \pmb{\mu_m} \right)^T }{ \sum_{i=1}^N \omega_{im}} \]

\[ \alpha_m = \frac{1}{N} \cdot \sum_{i=1}^N \omega_{im} \]

for (iter in 1:max_iter) {

  # --- E-step
  omega = matrix(NA, nrow = N, ncol = K)
  for (k in 1:K) omega[,k] = alpha_t[k] * dmvnorm(x, mu_t[[k]], Sigma_t[[k]])       # numerator for each k
  denominator  = 0; for (k in 1:K) denominator = denominator + omega[,k]            # denominator
  for (k in 1:K) omega[,k] = omega[,k] / denominator 
  if(!all(round(rowSums(omega),4) == 1)) stop("\n  All rowsums of omega must be 1! \n\n")    
  if(round(sum(rowSums(omega)), 3) != N) 
      stop("\n\n  Sum over all elements of omega must be equal to the number of observations! \n\n") 
  if(any(omega < 0)) stop("\n  No element of omega can be negative! \n\n")

  # --- M-step
  
  # update means:
  for (k in 1:K) mu[[k]] = as.vector((omega[,k] %*% x) / sum(omega[,k]))   # a vector with 2 components
  
  # update covariance matrices:
  for (k in 1:K) {
    Sigma[[k]] = matrix(rep(0,4), nrow = 2)     # init
    for (i in 1:N) Sigma[[k]] = Sigma[[k]] + omega[i,k] * ((x[i,] - mu_t[[k]]) %*% t(x[i,] - mu_t[[k]]))
    Sigma[[k]] = Sigma[[k]] / sum(omega[,k])    # weigthed covariance matrix
  }

  # update cluster membership 
  for (k in 1:K) alpha[k] = sum(omega[,k]) / N

  # new value for log-likelihood:
  logL = rep(0 , N) 
  for (k in 1:K) logL = logL + alpha[k] * dmvnorm(x, mu[[k]], Sigma[[k]])  # k-sum
  logL = log(logL)
  logL = sum(logL) # i-sum
  cat(paste("Log likelihood after iteration", iter, ":", signif(logL,4), "\n"))
  Lvector = c(Lvector, logL)    # append the new value

  # calculate relative errors:
  e_mu = max(abs((unlist(mu) - unlist(mu_t))/(unlist(mu) + 1e-50)))
  e_Sigma = max(abs((unlist(Sigma) - unlist(Sigma_t))/(unlist(Sigma) + 1e-50)))     
  e_alpha  = max(abs((alpha - alpha_t) / (alpha_t + 1e-50)))
  cat(paste("Loop", iter, ":  e_mu =", signif(e_mu,4), "e_Sigma =", signif(e_Sigma,4), "e_alpha =", signif(e_alpha,5), "\n"))

  # need another loop?:
  cond1 = e_mu <= e_mu_max          # condition 1, for mu
  cond2 = e_Sigma <= e_Sigma_max    # condition 2, for sigma
  cond3 = e_alpha <= e_alpha_max    # condition 3, for alpha
  if(cond1 & cond2 & cond3) break   # converged!

  # otherwise, continue to next loop:
  mu_t = mu 
  Sigma_t = Sigma             
  alpha_t = alpha

} # EM-loop

## Log likelihood after iteration 1 : -3633 
## Loop 1 :  e_mu = 0.1521 e_Sigma = 0.4903 e_alpha = 0.375 
## Log likelihood after iteration 2 : -3509 
## Loop 2 :  e_mu = 0.06336 e_Sigma = 0.6493 e_alpha = 0.058626 
## Log likelihood after iteration 3 : -3461 
## Loop 3 :  e_mu = 0.0401 e_Sigma = 0.4534 e_alpha = 0.081443 
## Log likelihood after iteration 4 : -3438 
## Loop 4 :  e_mu = 0.01227 e_Sigma = 0.2887 e_alpha = 0.074074 
## Log likelihood after iteration 5 : -3427 
## Loop 5 :  e_mu = 0.009132 e_Sigma = 0.1364 e_alpha = 0.053024 
## Log likelihood after iteration 6 : -3420 
## Loop 6 :  e_mu = 0.006481 e_Sigma = 0.08775 e_alpha = 0.038344 
## Log likelihood after iteration 7 : -3416 
## Loop 7 :  e_mu = 0.004695 e_Sigma = 0.05418 e_alpha = 0.028112 
## Log likelihood after iteration 8 : -3414 
## Loop 8 :  e_mu = 0.004269 e_Sigma = 0.03442 e_alpha = 0.020268 
## Log likelihood after iteration 9 : -3413 
## Loop 9 :  e_mu = 0.003457 e_Sigma = 0.02189 e_alpha = 0.014391 
## Log likelihood after iteration 10 : -3412 
## Loop 10 :  e_mu = 0.002669 e_Sigma = 0.01588 e_alpha = 0.010161 
## Log likelihood after iteration 11 : -3412 
## Loop 11 :  e_mu = 0.002013 e_Sigma = 0.01293 e_alpha = 0.007187 
## Log likelihood after iteration 12 : -3412 
## Loop 12 :  e_mu = 0.001502 e_Sigma = 0.01015 e_alpha = 0.0051119 
## Log likelihood after iteration 13 : -3412 
## Loop 13 :  e_mu = 0.001114 e_Sigma = 0.007804 e_alpha = 0.0036597 
## Log likelihood after iteration 14 : -3412 
## Loop 14 :  e_mu = 0.0008249 e_Sigma = 0.00593 e_alpha = 0.0026365 
## Log likelihood after iteration 15 : -3412 
## Loop 15 :  e_mu = 0.0006105 e_Sigma = 0.004475 e_alpha = 0.00191 
## Log likelihood after iteration 16 : -3412 
## Loop 16 :  e_mu = 0.0004521 e_Sigma = 0.003365 e_alpha = 0.0013903 
## Log likelihood after iteration 17 : -3412 
## Loop 17 :  e_mu = 0.0003352 e_Sigma = 0.002525 e_alpha = 0.0010161 
## Log likelihood after iteration 18 : -3412 
## Loop 18 :  e_mu = 0.0002488 e_Sigma = 0.001893 e_alpha = 0.00074533 
## Log likelihood after iteration 19 : -3412 
## Loop 19 :  e_mu = 0.0001849 e_Sigma = 0.001419 e_alpha = 0.00054837 
## Log likelihood after iteration 20 : -3412 
## Loop 20 :  e_mu = 0.0001376 e_Sigma = 0.001063 e_alpha = 0.00040454 
## Log likelihood after iteration 21 : -3412 
## Loop 21 :  e_mu = 0.0001025 e_Sigma = 0.0007965 e_alpha = 0.00029913

3. Looking at the results

output the calculated parameter values (which should be fairly close to the true parameters)
scatterplot of the log-likelihood vs. iteration (likelihood cannot decrease according to the theory)
a plot of the sample points together with the obtained solution
a heat plot of the obtained solution
a 3D-plot of the obtained solution

if(iter == max_iter) {
  warning("Not converged!\n\n")
} else {
  cat("Converged after", iter, "iterations.\n\n")
}

## Converged after 21 iterations.

## List results:
cat(paste("Mean values of the", K, "clusters:\n"))

## Mean values of the 3 clusters:

for (k in 1:K) {
  mux = signif(mu[[k]][1],4)
  muy = signif(mu[[k]][2],4)
  cat(paste("  Cluster", k, "  mux =", mux, "  muy =", muy, "\n"))
}

##   Cluster 1   mux = 4.975   muy = 4.933 
##   Cluster 2   mux = 8.947   muy = 8.937 
##   Cluster 3   mux = 7.967   muy = 5.002

cat(paste("Standard deviations and correlation of the", K, "clusters:\n"))

## Standard deviations and correlation of the 3 clusters:

for (k in 1:K) {
  sigmax = signif(sqrt(Sigma[[k]][1,1]),4)
  sigmay = signif(sqrt(Sigma[[k]][2,2]),4)
  rhoxy  = signif(Sigma[[k]][1,2]/sigmax/sigmay, 4)   
  cat(paste("  Cluster", k, "  sdx =", sigmax, "  sdy =", sigmay, "  rhoxy =", rhoxy, "\n"))
}

##   Cluster 1   sdx = 0.9608   sdy = 1.067   rhoxy = 0.8153 
##   Cluster 2   sdx = 1.056   sdy = 1.053   rhoxy = 0.9087 
##   Cluster 3   sdx = 1.018   sdy = 1.117   rhoxy = 0.6334

cat(paste("Cluster weigths of the", K, "clusters:\n"))

## Cluster weigths of the 3 clusters:

for (k in 1:K) cat(paste("  Cluster", k, "  alpha =", signif(alpha[k], 4), "\n"))

##   Cluster 1   alpha = 0.2658 
##   Cluster 2   alpha = 0.3149 
##   Cluster 3   alpha = 0.4193

a plot of the log-likelihood vs. iteration number
- according to the theory, the log-likelihood can never decrease during iterations

  plot(Lvector, ylab = "log(L)", main = "Log-likelihood", font.main = 1, xlab = "iter", col = "blue", lwd = 1.5 )
  mtext(side = 3, "First value based on first guess", col = "blue", cex = 0.9)

a plot of the sample points plus contour lines representing the solution that has been found:

  plot(x[,1], x[,2], xlab = "x1", ylab = "x2", col = "grey", main = "Sample points and EM solution", font.main = 1)
  
  # define grid for the contour plot (same as above):
  xmin = 0 ; xmax = 12
  ymin = 0 ; ymax = 12
  x1 = seq(xmin, xmax, length = 200)
  x2 = seq(ymin, ymax, length = 200)
  fx_sol = outer(x1, x2, FUN = function(x, y) fx_Gauss(x, y, mu = mu, Sigma = Sigma, alpha = alpha))
  contour(x1, x2, fx_sol, col = "green", lty = 2, add = TRUE)   # contour-plot showing the calculated solution

a heat plot showing the calculated distribution (color indicates height):

  image(x1, x2, fx_sol, col = heat.colors(12), xlab = "x", ylab = "y", main = "Solution by EM", font.main = 1)

a 3D plot showing the solution (use coarse grid here):

  x1 = seq(xmin, xmax, length = 100)   
  x2 = seq(ymin, ymax, length = 100)
  fx_sol = outer(x1, x2, FUN = function(x, y) fx_Gauss(x, y, mu = mu, Sigma = Sigma, alpha = alpha))
  persp(x1, x2, fx_sol, col='green', theta = 30, phi = 30, xlab = "x1", ylab = "x2", zlab = "f(x)")