Marginal means

Published

November 28, 2025

In the previous chapter, we introduced how linear models are fitted and how coefficients are estimated. However, these coefficients depend on how categorical factors are encoded in the design matrix (for example, using the first level as a reference). As a result, they are not directly expressed in the units of the response variable. In this section, we show in more detail how to recover estimated marginal means (EMMs)—also known as least-squares means—from the model coefficients, along with their variances.
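To see concretely what "using the first level as a reference" means, here is a minimal sketch with a hypothetical three-level factor showing R's default treatment contrasts, the encoding used throughout this section:

```r
# Default treatment contrasts: the first level is the reference (all zeros),
# so its effect is absorbed into the intercept.
f <- factor(c("g1", "g2", "g3"))
contrasts(f)
#    g2 g3
# g1  0  0
# g2  1  0
# g3  0  1
```

This is why the design matrix below has no column for block1 or geng1.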

1) Estimating coefficients

library(dplyr)
library(emmeans)

data <- read.csv("../data/example_1.csv") |>
  mutate(gen = as.factor(gen), block = as.factor(block))
n <- 12
n_b <- 3
n_g <- 4
X <- model.matrix(yield ~ 1 + block + gen, data = data)
y <- matrix(data[, "yield"])
print(X)
   (Intercept) block2 block3 geng2 geng3 geng4
1            1      0      0     0     0     0
2            1      1      0     0     0     0
3            1      0      1     0     0     0
4            1      0      0     1     0     0
5            1      1      0     1     0     0
6            1      0      1     1     0     0
7            1      0      0     0     1     0
8            1      1      0     0     1     0
9            1      0      1     0     1     0
10           1      0      0     0     0     1
11           1      1      0     0     0     1
12           1      0      1     0     0     1
attr(,"assign")
[1] 0 1 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$block
[1] "contr.treatment"

attr(,"contrasts")$gen
[1] "contr.treatment"
Xty <- t(X) %*% y
XtX <- t(X) %*% X
XtX_inv <- solve(XtX)
beta <- XtX_inv %*% Xty
beta
             [,1]
(Intercept)  7.75
block2      -1.65
block3      -2.10
geng2        1.10
geng3        0.10
geng4        1.80

2) Marginal means (scalar form)

Let’s start by using a scalar form to derive the estimated marginal means:

EMM_g1 <- beta[1] + sum(beta[2:3]) / 3 # intercept + average block effect
EMM_b1 <- beta[1] + sum(beta[4:6]) / 4 # intercept + average genotype effect

These correspond mathematically to the following operations:

\[EMM(g_1) = \hat\mu + \frac{1}{3}\hat\beta_{block_1} + \frac{1}{3}\hat\beta_{block_2} + \frac{1}{3}\hat\beta_{block_3} = 7.75 + 0 - \frac{1.65}{3} - \frac{2.10}{3} = 6.5\]

\[EMM(b_1) = \hat\mu + \frac{1}{4}\hat\beta_{g_1} + \frac{1}{4}\hat\beta_{g_2} + \frac{1}{4}\hat\beta_{g_3} + \frac{1}{4}\hat\beta_{g_4} = 7.75 + 0 + \frac{1.10}{4} + \frac{0.10}{4} + \frac{1.80}{4} = 8.5\]

Then, for the remaining levels:

EMM_b2 <- beta[1] + sum(beta[4:6]) / 4 + beta[2]
EMM_b3 <- beta[1] + sum(beta[4:6]) / 4 + beta[3]

EMM_g2 <- beta[1] + sum(beta[2:3]) / 3 + beta[4]
EMM_g3 <- beta[1] + sum(beta[2:3]) / 3 + beta[5]
EMM_g4 <- beta[1] + sum(beta[2:3]) / 3 + beta[6]

EMM <- matrix(c(EMM_b1, EMM_b2, EMM_b3, EMM_g1, EMM_g2, EMM_g3, EMM_g4), ncol = 1)
rownames(EMM) <- c("b1", "b2", "b3", "g1", "g2", "g3", "g4")
EMM
   [,1]
b1 8.50
b2 6.85
b3 6.40
g1 6.50
g2 7.60
g3 6.60
g4 8.30
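To build intuition for these averages, here is a small self-contained sketch (with simulated data, not the example dataset) showing that in a balanced design the scalar-form EMM of a genotype equals the mean of its fitted values across blocks:

```r
# Simulated balanced design (hypothetical data, not example_1.csv):
# 3 blocks x 4 genotypes, mirroring the worked example above.
set.seed(42)
d <- expand.grid(block = factor(1:3), gen = factor(paste0("g", 1:4)))
d$yield <- rnorm(nrow(d), mean = 8)
fit <- lm(yield ~ 1 + block + gen, data = d)
b <- coef(fit)

# Scalar form of the g1 EMM: intercept plus the average block effect
emm_g1 <- b["(Intercept)"] + sum(b[c("block2", "block3")]) / 3

# It matches the mean of the fitted values for g1 across the three blocks
all.equal(unname(emm_g1), mean(fitted(fit)[d$gen == "g1"])) # TRUE
```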

3) Marginal means (matrix form)

Now let’s explore the matrix form. Specifically, we will make use of the \(\boldsymbol{L}\) matrix to compute all desired values simultaneously, so that:

\[EMM(\boldsymbol{g}) = \boldsymbol{L}\hat{\boldsymbol{\beta}}\]

Each row of the \(\boldsymbol{L}\) matrix defines one level we want to estimate. For example, to extract the EMMs of the genotypes, each row includes the intercept, the fraction of each block coefficient that we average over, and an indicator for the genotype level being estimated. Notice that \(\boldsymbol{L}\) has as many rows as there are levels in the factor being estimated.

L <- cbind(
  matrix(1, nrow = n_g, ncol = 1), # Intercept
  matrix(1 / n_b, nrow = n_g, ncol = n_b - 1), # Average block effect
  matrix(rbind(0, diag(nrow = n_g - 1)), nrow = n_g, ncol = n_g - 1) # Genotype indicators
)
L
     [,1]      [,2]      [,3] [,4] [,5] [,6]
[1,]    1 0.3333333 0.3333333    0    0    0
[2,]    1 0.3333333 0.3333333    1    0    0
[3,]    1 0.3333333 0.3333333    0    1    0
[4,]    1 0.3333333 0.3333333    0    0    1

Now we simply multiply this matrix by the vector of estimated coefficients:

EMM_g <- L %*% beta
EMM_g
     [,1]
[1,]  6.5
[2,]  7.6
[3,]  6.6
[4,]  8.3

This calculation reproduces the scalar operations shown above:

\[ \small EMM(\boldsymbol{g}) = \boldsymbol{L}\,\boldsymbol{\hat\beta} = \begin{bmatrix} \overset{\text{Intercept}}{1} & \overset{\text{block2}}{0.33} & \overset{\text{block3}}{0.33} & \overset{\text{geng2}}{0} & \overset{\text{geng3}}{0} & \overset{\text{geng4}}{0} \\ 1 & 0.33 & 0.33 & 1 & 0 & 0 \\ 1 & 0.33 & 0.33 & 0 & 1 & 0 \\ 1 & 0.33 & 0.33 & 0 & 0 & 1 \\ \end{bmatrix} \, \overset{\hat\beta}{ \begin{bmatrix} 7.75 \\ -1.65 \\ -2.10 \\ 1.10 \\ 0.10 \\ 1.80 \end{bmatrix}} = \begin{bmatrix} 6.5 \\ 7.6 \\ 6.6 \\ 8.3 \end{bmatrix} \]

So that

\[ EMM(\boldsymbol{g})_{[1]} = 1 \times 7.75 + \frac{1}{3} \times (-1.65) + \frac{1}{3} \times (-2.10) = 6.5 \]

Similarly, to obtain the EMM of each block we would do:

L2 <- cbind(
  matrix(1, nrow = n_b, ncol = 1), # Intercept
  matrix(rbind(0, diag(nrow = n_b - 1)), nrow = n_b, ncol = n_b - 1), # Block indicators
  matrix(1 / n_g, nrow = n_b, ncol = n_g - 1) # Average genotype effect
)
L2
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    0    0 0.25 0.25 0.25
[2,]    1    1    0 0.25 0.25 0.25
[3,]    1    0    1 0.25 0.25 0.25
EMM_b <- L2 %*% beta
EMM_b
     [,1]
[1,] 8.50
[2,] 6.85
[3,] 6.40

You can check that the results are the same as the ones computed before:

cbind(EMM, rbind(EMM_b, EMM_g))
   [,1] [,2]
b1 8.50 8.50
b2 6.85 6.85
b3 6.40 6.40
g1 6.50 6.50
g2 7.60 7.60
g3 6.60 6.60
g4 8.30 8.30

4) Variance-covariance (SE)

Now let’s compute the variance and the SE of the EMMs. We know that \(EMMs = \boldsymbol{L\hat\beta}\), thus

\[Var(\boldsymbol{L\hat\beta}) = \boldsymbol{L}Var(\boldsymbol{\hat\beta}) \boldsymbol{L^T} = \boldsymbol{LC_{11} L^T}\]

We will discuss the coefficient matrix in more detail once we start digging into Linear Mixed Models. For now, it is enough to say that \(\boldsymbol{C_{11}}\) corresponds to \(Var(\boldsymbol{\hat\beta})\), so

# Compute the residuals to estimate sigma^2
y_hat <- X %*% beta
errors <- y - y_hat
SSE <- sum(errors^2)
sigma_2 <- SSE / (n - 6) # residual df: n minus the 6 estimated coefficients
C_11 <- solve(t(X) %*% X) * sigma_2
C_11
            (Intercept) block2 block3      geng2      geng3      geng4
(Intercept)   0.2000000   -0.1   -0.1 -0.1333333 -0.1333333 -0.1333333
block2       -0.1000000    0.2    0.1  0.0000000  0.0000000  0.0000000
block3       -0.1000000    0.1    0.2  0.0000000  0.0000000  0.0000000
geng2        -0.1333333    0.0    0.0  0.2666667  0.1333333  0.1333333
geng3        -0.1333333    0.0    0.0  0.1333333  0.2666667  0.1333333
geng4        -0.1333333    0.0    0.0  0.1333333  0.1333333  0.2666667

Then the variance and the SE of the EMMs are:

var_EMMs <- L %*% C_11 %*% t(L)
var_EMMs |> round(4)
       [,1]   [,2]   [,3]   [,4]
[1,] 0.1333 0.0000 0.0000 0.0000
[2,] 0.0000 0.1333 0.0000 0.0000
[3,] 0.0000 0.0000 0.1333 0.0000
[4,] 0.0000 0.0000 0.0000 0.1333
# SE of the EMMs
se_EMMs <- sqrt(diag(var_EMMs))
se_EMMs
[1] 0.3651484 0.3651484 0.3651484 0.3651484
data.frame(
  gen = levels(data$gen),
  EMMs_g = EMM_g,
  var_EMMs = diag(var_EMMs),
  se_EMMs = se_EMMs
)
  gen EMMs_g  var_EMMs   se_EMMs
1  g1    6.5 0.1333333 0.3651484
2  g2    7.6 0.1333333 0.3651484
3  g3    6.6 0.1333333 0.3651484
4  g4    8.3 0.1333333 0.3651484
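As a side note (not derived in the steps above), in this balanced design the variance of each genotype EMM reduces to \(\sigma^2 / n_b\), the residual variance divided by the number of blocks. A quick sketch with the values obtained above:

```r
# In a balanced design each genotype EMM averages over n_b = 3 blocks,
# so Var(EMM) = sigma^2 / n_b.
sigma_2 <- 0.4 # residual variance obtained above (SSE / (n - 6))
n_b <- 3
sigma_2 / n_b        # 0.1333333, matching diag(var_EMMs)
sqrt(sigma_2 / n_b)  # 0.3651484, matching se_EMMs
```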

5) Using lm and emmeans

We can also use lm to fit the model and extract the coefficients, and emmeans to retrieve the estimated marginal means. The emmeans function not only does the job automatically, it also returns the \(\boldsymbol{L}\) and \(\boldsymbol{C_{11}}\) matrices it used, so we can explore them.

mod <- lm(formula = yield ~ 1 + block + gen, data = data)
beta_lm <- coef(mod)
beta_lm
(Intercept)      block2      block3       geng2       geng3       geng4 
       7.75       -1.65       -2.10        1.10        0.10        1.80 
# Marginal means
mm <- emmeans(mod, ~gen)
mm
 gen emmean    SE df lower.CL upper.CL
 g1     6.5 0.365  6     5.61     7.39
 g2     7.6 0.365  6     6.71     8.49
 g3     6.6 0.365  6     5.71     7.49
 g4     8.3 0.365  6     7.41     9.19

Results are averaged over the levels of: block 
Confidence level used: 0.95 
L_emm <- mm@linfct # L matrix
C_11_emm <- mm@V # Variance-covariance of coefficients
BLUE_mod <- L_emm %*% beta_lm # L beta
var_BLUEs_emm <- L_emm %*% C_11_emm %*% t(L_emm) # L C L'
sqrt(diag(var_BLUEs_emm))
[1] 0.3651484 0.3651484 0.3651484 0.3651484

6) Variance of the differences

Besides that, emmeans allows us to compute the estimates and the variances of the differences between levels, such as genotype pairs.

diff_mm <- emmeans(mod, pairwise ~ gen)
diff_mm$contrasts
 contrast estimate    SE df t.ratio p.value
 g1 - g2      -1.1 0.516  6  -2.130  0.2447
 g1 - g3      -0.1 0.516  6  -0.194  0.9971
 g1 - g4      -1.8 0.516  6  -3.486  0.0486
 g2 - g3       1.0 0.516  6   1.936  0.3066
 g2 - g4      -0.7 0.516  6  -1.356  0.5656
 g3 - g4      -1.7 0.516  6  -3.292  0.0609

Results are averaged over the levels of: block 
P value adjustment: tukey method for comparing a family of 4 estimates 
L_diff <- diff_mm$contrasts@linfct
L_diff # See how L now links more than 1 genotype per row
     (Intercept) block2 block3 geng2 geng3 geng4
[1,]           0      0      0    -1     0     0
[2,]           0      0      0     0    -1     0
[3,]           0      0      0     0     0    -1
[4,]           0      0      0     1    -1     0
[5,]           0      0      0     1     0    -1
[6,]           0      0      0     0     1    -1

Note that the columns for the intercept and the blocks don’t contribute to the variance of the differences (all zeros).
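We can check this by hand. Using only the genotype block of \(\boldsymbol{C_{11}}\) printed above (values rounded as displayed, so results are approximate), the variance of a difference follows from the usual quadratic form:

```r
# Genotype block of C_11 (values as printed above, hence approximate)
C_g <- matrix(c(0.2666667, 0.1333333, 0.1333333,
                0.1333333, 0.2666667, 0.1333333,
                0.1333333, 0.1333333, 0.2666667), nrow = 3,
              dimnames = list(c("geng2", "geng3", "geng4"),
                              c("geng2", "geng3", "geng4")))
l <- c(1, -1, 0) # g2 - g3 contrast over (geng2, geng3, geng4)
v <- drop(t(l) %*% C_g %*% l) # var(a-b) = var(a) + var(b) - 2cov(a,b)
v        # ~0.2667
sqrt(v)  # ~0.516, the SE reported by emmeans
```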

BLUEs_diff <- L_diff %*% beta_lm
rownames(BLUEs_diff) <- diff_mm$contrasts@levels$contrast
rownames(L_diff) <- rownames(BLUEs_diff)
var_diff <- L_diff %*% C_11_emm %*% t(L_diff)
var_diff |> round(4)
        g1 - g2 g1 - g3 g1 - g4 g2 - g3 g2 - g4 g3 - g4
g1 - g2  0.2667  0.1333  0.1333 -0.1333 -0.1333  0.0000
g1 - g3  0.1333  0.2667  0.1333  0.1333  0.0000 -0.1333
g1 - g4  0.1333  0.1333  0.2667  0.0000  0.1333  0.1333
g2 - g3 -0.1333  0.1333  0.0000  0.2667  0.1333 -0.1333
g2 - g4 -0.1333  0.0000  0.1333  0.1333  0.2667  0.1333
g3 - g4  0.0000 -0.1333  0.1333 -0.1333  0.1333  0.2667
# Average standard error of the differences between two genotype means.
SE <- sqrt(diag(var_diff))
avg_diff <- mean(SE)
avg_diff
[1] 0.5163978

Importantly, since \(\boldsymbol{L}_{diff}\) only contains \(0\), \(1\), and \(-1\), and, as mentioned before, the intercept and the blocks don’t contribute to the estimation, we can calculate the variance of the differences by simply using the subset of \(\boldsymbol{C_{11}}\) that involves the genotypes (or the factor of interest).

C_11g <- C_11[4:6, 4:6]                # Subset of C_11 for the genotypes
d <- diag(C_11g)                       # Var of each coefficient = var of the difference with baseline g1
D <- outer(d, d, "+") - 2 * C_11g      # var(a-b) = var(a) + var(b) - 2cov(a,b)
vd <- matrix(NA, length(d) + 1, length(d) + 1)
vd[-1, -1] <- D
vd[-1, 1] <- vd[1, -1] <- d            # First row/column: differences with g1
diag(vd) <- NA                         # A level minus itself has no variance
vd
          [,1]      [,2]      [,3]      [,4]
[1,]        NA 0.2666667 0.2666667 0.2666667
[2,] 0.2666667        NA 0.2666667 0.2666667
[3,] 0.2666667 0.2666667        NA 0.2666667
[4,] 0.2666667 0.2666667 0.2666667        NA