Understanding a Linear Model — Marginal Means

Published: September 28, 2025

In the previous chapter, we introduced how linear models are fitted and how coefficients are estimated. However, these coefficients depend on how categorical factors are encoded in the design matrix (for example, using the first level as a reference). As a result, they represent differences relative to a reference level rather than the expected response at each level. In this chapter, we show how to recover estimated marginal means (EMMs), also known as least-squares means, from the model coefficients, along with their variances.
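As a quick reminder of that encoding, R's default contr.treatment scheme drops the first level of each factor (it becomes the reference, absorbed into the intercept) and creates an indicator column for every other level. You can inspect the coding directly:

contr.treatment(4) # default dummy coding for a 4-level factor
#   2 3 4
# 1 0 0 0
# 2 1 0 0
# 3 0 1 0
# 4 0 0 1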

1) Estimating coefficients

library(dplyr) # for mutate()

data <- read.csv("../data/example_1.csv") |>
  mutate(gen = as.factor(gen), block = as.factor(block))
n <- 12   # number of observations
n_b <- 3  # number of blocks
n_g <- 4  # number of genotypes
X <- model.matrix(yield ~ 1 + block + gen, data = data)
y <- matrix(data[, "yield"]) # response as a column vector
print(X)
   (Intercept) block2 block3 geng2 geng3 geng4
1            1      0      0     0     0     0
2            1      1      0     0     0     0
3            1      0      1     0     0     0
4            1      0      0     1     0     0
5            1      1      0     1     0     0
6            1      0      1     1     0     0
7            1      0      0     0     1     0
8            1      1      0     0     1     0
9            1      0      1     0     1     0
10           1      0      0     0     0     1
11           1      1      0     0     0     1
12           1      0      1     0     0     1
attr(,"assign")
[1] 0 1 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$block
[1] "contr.treatment"

attr(,"contrasts")$gen
[1] "contr.treatment"
Xty <- t(X) %*% y       # X'y
XtX <- t(X) %*% X       # X'X
XtX_inv <- solve(XtX)   # (X'X)^-1
beta <- XtX_inv %*% Xty # OLS estimates
beta
             [,1]
(Intercept)  7.75
block2      -1.65
block3      -2.10
geng2        1.10
geng3        0.10
geng4        1.80
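In matrix form, this is simply the ordinary least-squares solution to the normal equations:

\[\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^Ty\]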

2) Marginal means (scalar form)

Let’s start by using a scalar form to derive the estimated marginal means:

EMM_g1 <- beta[1] + sum(beta[2:3])/3 # intercept + average block effect
EMM_b1 <- beta[1] + sum(beta[4:6])/4 # intercept + average genotype effect

These correspond to the following operations (recall that the reference-level coefficients \(\hat\beta_{block_1}\) and \(\hat\beta_{g_1}\) are fixed at 0):

\[EMM(g_1) = \hat\mu + \frac{1}{3}\hat\beta_{block_1} + \frac{1}{3}\hat\beta_{block_2} + \frac{1}{3}\hat\beta_{block_3} = 7.75 + 0 - \frac{1.65}{3} - \frac{2.10}{3} = 6.5\]

\[EMM(b_1) = \hat\mu + \frac{1}{4}\hat\beta_{g_1} + \frac{1}{4}\hat\beta_{g_2} + \frac{1}{4}\hat\beta_{g_3} + \frac{1}{4}\hat\beta_{g_4} = 7.75 + 0 + \frac{1.10}{4} + \frac{0.10}{4} + \frac{1.80}{4} = 8.5\]

Then, for the remaining levels:

EMM_b2 <- beta[1] + sum(beta[4:6])/4 + beta[2]
EMM_b3 <- beta[1] + sum(beta[4:6])/4 + beta[3]

EMM_g2 <- beta[1] + sum(beta[2:3])/3  + beta[4]
EMM_g3 <- beta[1] + sum(beta[2:3])/3  + beta[5]
EMM_g4 <- beta[1] + sum(beta[2:3])/3  + beta[6]

EMM <- matrix(c(EMM_b1, EMM_b2, EMM_b3, EMM_g1, EMM_g2, EMM_g3, EMM_g4), ncol = 1)
rownames(EMM) <- c('b1', 'b2', 'b3', 'g1', 'g2', 'g3', 'g4')
EMM
   [,1]
b1 8.50
b2 6.85
b3 6.40
g1 6.50
g2 7.60
g3 6.60
g4 8.30
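Because this design is balanced (every genotype appears exactly once in every block), the EMMs coincide with the simple arithmetic means of the raw data, which offers a quick sanity check:

# Raw means per genotype and per block; in this balanced design they
# should reproduce the EMMs above (6.5, 7.6, 6.6, 8.3 for g1-g4)
aggregate(yield ~ gen, data = data, FUN = mean)
aggregate(yield ~ block, data = data, FUN = mean)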

3) Marginal means (matrix form)

Now let’s explore the matrix form. Specifically, we will make use of the \(L\) matrix to compute all desired values simultaneously, so that:

\[EMM(\boldsymbol{g}) = L\hat{\boldsymbol{\beta}}\]

The \(L\) matrix encodes the levels we want to compute. For example, to extract the EMMs of the genotypes, we include the intercept, the fraction of each block effect to average over, and an indicator for each genotype level. Notice that \(L\) has as many rows as there are levels in the factor of interest.

L <- cbind(
  matrix(1, nrow = n_g, ncol = 1), # Intercept
  matrix(1 / n_b, nrow = n_g, ncol = n_b - 1), # Average over blocks
  matrix(rbind(0, diag(nrow = n_g - 1)), nrow = n_g, ncol = n_g - 1) # Genotype indicators (g1 = reference)
)
L
     [,1]      [,2]      [,3] [,4] [,5] [,6]
[1,]    1 0.3333333 0.3333333    0    0    0
[2,]    1 0.3333333 0.3333333    1    0    0
[3,]    1 0.3333333 0.3333333    0    1    0
[4,]    1 0.3333333 0.3333333    0    0    1
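A useful way to see where these fractions come from: in this balanced design, each row of \(L\) is simply the average of the design-matrix rows belonging to that genotype, i.e. the "average observation" for which we predict. A quick check:

rowsum(X, data$gen) / n_b # average X row per genotype; matches L entrywise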

Now we can simply multiply this matrix by the estimated coefficients:

EMM_g <- L %*% beta
EMM_g
     [,1]
[1,]  6.5
[2,]  7.6
[3,]  6.6
[4,]  8.3

This calculation is equivalent to the scalar operations shown above:

\[ \small EMM(\boldsymbol{g}) = \boldsymbol{L}\,\boldsymbol{\hat\beta} = \begin{bmatrix} \overset{\text{Intercept}}{1} & \overset{\text{block2}}{0.33} & \overset{\text{block3}}{0.33} & \overset{\text{geng2}}{0} & \overset{\text{geng3}}{0} & \overset{\text{geng4}}{0} \\ 1 & 0.33 & 0.33 & 1 & 0 & 0 \\ 1 & 0.33 & 0.33 & 0 & 1 & 0 \\ 1 & 0.33 & 0.33 & 0 & 0 & 1 \\ \end{bmatrix} \, \overset{\hat\beta}{ \begin{bmatrix} 7.75 \\ -1.65 \\ -2.10 \\ 1.10 \\ 0.10 \\ 1.80 \end{bmatrix}} = \begin{bmatrix} 6.5 \\ 7.6 \\ 6.6 \\ 8.3 \end{bmatrix} \]

So that

\[ EMM(\boldsymbol{g})_{[1]} = 1 \times 7.75 + \frac{1}{3} \times (-1.65) + \frac{1}{3} \times (-2.10) = 6.5 \]

Similarly, to obtain the EMM of each block we would do:

L2 <- cbind(
  matrix(1, nrow = n_b, ncol = 1), # Intercept
  matrix(rbind(0, diag(nrow = n_b - 1)), nrow = n_b, ncol = n_b - 1), # Block indicators (b1 = reference)
  matrix(1 / n_g, nrow = n_b, ncol = n_g - 1) # Average over genotypes
)
L2
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    0    0 0.25 0.25 0.25
[2,]    1    1    0 0.25 0.25 0.25
[3,]    1    0    1 0.25 0.25 0.25
EMM_b <- L2 %*% beta
EMM_b
     [,1]
[1,] 8.50
[2,] 6.85
[3,] 6.40

You can check that the results are the same as the ones computed before:

cbind(EMM, rbind(EMM_b, EMM_g))
   [,1] [,2]
b1 8.50 8.50
b2 6.85 6.85
b3 6.40 6.40
g1 6.50 6.50
g2 7.60 7.60
g3 6.60 6.60
g4 8.30 8.30

4) Variance-covariance (SE)

Now let’s compute the variance and the SE of the EMMs. We know that \(EMMs = L\hat\beta\), thus

\[Var(L\hat\beta) = LVar(\hat\beta) L^T = LC_{11} L^T\]

We will discuss the coefficient matrix in more detail once we start digging into linear mixed models. For now, it suffices to say that \(C_{11}\) corresponds to \(Var(\hat\beta) = \sigma^2 (X^TX)^{-1}\), so

# Residuals, needed to estimate sigma^2
y_hat <- X %*% beta
errors <- y - y_hat
SSE <- sum(errors^2)
sigma_2 <- SSE / (n - 6) # residual df = n - p, with p = 6 coefficients
C_11 <- solve(t(X) %*% X) * sigma_2
C_11
            (Intercept) block2 block3      geng2      geng3      geng4
(Intercept)   0.2000000   -0.1   -0.1 -0.1333333 -0.1333333 -0.1333333
block2       -0.1000000    0.2    0.1  0.0000000  0.0000000  0.0000000
block3       -0.1000000    0.1    0.2  0.0000000  0.0000000  0.0000000
geng2        -0.1333333    0.0    0.0  0.2666667  0.1333333  0.1333333
geng3        -0.1333333    0.0    0.0  0.1333333  0.2666667  0.1333333
geng4        -0.1333333    0.0    0.0  0.1333333  0.1333333  0.2666667
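As a side note, the square roots of the diagonal of \(C_{11}\) are the standard errors of the individual coefficients, i.e. the values that summary() of the equivalent lm fit would report in its Std. Error column (see section 5):

sqrt(diag(C_11)) # coefficient SEs: 0.4472 for the intercept and
                 # block effects, 0.5164 for the genotype effects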

Then the variance and the SE of the EMMs are:

var_EMMs <- L %*% C_11 %*% t(L)
var_EMMs |> round(4)
       [,1]   [,2]   [,3]   [,4]
[1,] 0.1333 0.0000 0.0000 0.0000
[2,] 0.0000 0.1333 0.0000 0.0000
[3,] 0.0000 0.0000 0.1333 0.0000
[4,] 0.0000 0.0000 0.0000 0.1333
# SE of the EMMs (BLUEs)
se_EMMs <- sqrt(diag(var_EMMs))
se_EMMs
[1] 0.3651484 0.3651484 0.3651484 0.3651484
data.frame(
  gen = levels(data$gen),
  EMMs_g = EMM_g,
  var_EMMs = diag(var_EMMs),
  se_EMMs = se_EMMs
)
  gen EMMs_g  var_EMMs   se_EMMs
1  g1    6.5 0.1333333 0.3651484
2  g2    7.6 0.1333333 0.3651484
3  g3    6.6 0.1333333 0.3651484
4  g4    8.3 0.1333333 0.3651484

5) Using lm and emmeans

We can also use lm to fit the model and extract the coefficients, and emmeans to retrieve the estimated marginal means. The emmeans function does the job automatically, but it also returns the \(L\) and \(C_{11}\) matrices it used, so we can explore them.

mod <- lm(formula = yield ~ 1 + block + gen, data = data)
beta_mod <- coef(mod)
mod

Call:
lm(formula = yield ~ 1 + block + gen, data = data)

Coefficients:
(Intercept)       block2       block3        geng2        geng3        geng4  
       7.75        -1.65        -2.10         1.10         0.10         1.80  
library(emmeans)

mm <- emmeans(mod, ~gen)
mm
 gen emmean    SE df lower.CL upper.CL
 g1     6.5 0.365  6     5.61     7.39
 g2     7.6 0.365  6     6.71     8.49
 g3     6.6 0.365  6     5.71     7.49
 g4     8.3 0.365  6     7.41     9.19

Results are averaged over the levels of: block 
Confidence level used: 0.95 
L_emm <- mm@linfct
C_11_emm <- mm@V

BLUE_mod <- L_emm %*% beta_mod
var_BLUEs_emm <- L_emm %*% C_11_emm %*% t(L_emm)
sqrt(diag(var_BLUEs_emm))
[1] 0.3651484 0.3651484 0.3651484 0.3651484
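The SEs match. We can likewise reproduce the confidence limits reported by emmeans, using the t quantile at the residual degrees of freedom (here \(n - p = 6\)):

t_crit <- qt(0.975, df = df.residual(mod)) # 2.447 with 6 df
cbind(lower = BLUE_mod - t_crit * se_EMMs,
      upper = BLUE_mod + t_crit * se_EMMs)
# reproduces the lower.CL and upper.CL columns shown above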

6) Variance of the differences

Besides that, emmeans allows us to compute the estimates and variances of the differences between factor levels, such as pairwise genotype comparisons.

diff_mm <- emmeans(mod, pairwise ~ gen)
diff_mm$contrasts
 contrast estimate    SE df t.ratio p.value
 g1 - g2      -1.1 0.516  6  -2.130  0.2447
 g1 - g3      -0.1 0.516  6  -0.194  0.9971
 g1 - g4      -1.8 0.516  6  -3.486  0.0486
 g2 - g3       1.0 0.516  6   1.936  0.3066
 g2 - g4      -0.7 0.516  6  -1.356  0.5656
 g3 - g4      -1.7 0.516  6  -3.292  0.0609

Results are averaged over the levels of: block 
P value adjustment: tukey method for comparing a family of 4 estimates 
L_diff <- diff_mm$contrasts@linfct
L_diff # See how L now links more than 1 genotype per row
     (Intercept) block2 block3 geng2 geng3 geng4
[1,]           0      0      0    -1     0     0
[2,]           0      0      0     0    -1     0
[3,]           0      0      0     0     0    -1
[4,]           0      0      0     1    -1     0
[5,]           0      0      0     1     0    -1
[6,]           0      0      0     0     1    -1
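Each row of L_diff is simply the difference between the corresponding rows of the genotype \(L\) matrix from section 3. For example, for g1 - g2:

L[1, ] - L[2, ] # gives 0 0 0 -1 0 0, the first row of L_diff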
BLUEs_diff <- L_diff %*% beta_mod
rownames(BLUEs_diff) <- diff_mm$contrasts@levels$contrast

rownames(L_diff) <- rownames(BLUEs_diff)
var_diff <- L_diff %*% C_11_emm %*% t(L_diff)
var_diff |> round(4) # Notice the nonzero covariances between contrasts that share a genotype
        g1 - g2 g1 - g3 g1 - g4 g2 - g3 g2 - g4 g3 - g4
g1 - g2  0.2667  0.1333  0.1333 -0.1333 -0.1333  0.0000
g1 - g3  0.1333  0.2667  0.1333  0.1333  0.0000 -0.1333
g1 - g4  0.1333  0.1333  0.2667  0.0000  0.1333  0.1333
g2 - g3 -0.1333  0.1333  0.0000  0.2667  0.1333 -0.1333
g2 - g4 -0.1333  0.0000  0.1333  0.1333  0.2667  0.1333
g3 - g4  0.0000 -0.1333  0.1333 -0.1333  0.1333  0.2667
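The structure of this matrix is simple in a balanced design: every pairwise difference has variance \(2\hat\sigma^2/r\), and two contrasts that share a genotype have covariance \(\pm\hat\sigma^2/r\) (positive when the shared genotype enters both contrasts with the same sign), where \(r = 3\) is the number of blocks and, from the code above, \(\hat\sigma^2 = 0.4\):

\[Var(\hat{g}_i - \hat{g}_j) = \frac{2\hat\sigma^2}{r} = \frac{2 \times 0.4}{3} \approx 0.2667, \qquad |Cov| = \frac{\hat\sigma^2}{r} \approx 0.1333\]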
SE <- sqrt(diag(var_diff))

avg_diff <- mean(SE) # average SE of the differences (SED)
avg_diff
[1] 0.5163978
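Since every comparison shares the same SE in this balanced design, this "average" is simply the standard error of a difference (SED), which we can also obtain directly from \(\hat\sigma^2\):

sqrt(2 * sigma_2 / n_b) # 0.5163978, identical to avg_diff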