Understanding a linear model

Published

November 28, 2025

In this article, we analyze a randomized complete block design (RCBD) to show how a traditional full fixed-effects model and its ANOVA work.

1) Setup & data

Clewer and Scarisbrick (2001) present a yield trial (t/ha) conducted using a randomized complete block design. The design included 3 blocks and 4 cultivars, resulting in 12 experimental plots.

library(tidyverse)
library(emmeans)
library(gt)
library(kableExtra)

# Read and coerce factors
data <- read.csv("../data/example_1.csv") |>
  mutate(gen = as.factor(gen), block = as.factor(block))
head(data)
  block gen yield
1     1  g1   7.4
2     2  g1   6.5
3     3  g1   5.6
4     1  g2   9.8
5     2  g2   6.8
6     3  g2   6.2
str(data)
'data.frame':   12 obs. of  3 variables:
 $ block: Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
 $ gen  : Factor w/ 4 levels "g1","g2","g3",..: 1 1 1 2 2 2 3 3 3 4 ...
 $ yield: num  7.4 6.5 5.6 9.8 6.8 6.2 7.3 6.1 6.4 9.5 ...
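The chunks below assume this data frame. If the CSV under `../data/` is not at hand, the same data frame can be rebuilt directly from the printed values (a sketch equivalent to the import above):

```r
# Rebuild the dataset inline from the printed values (equivalent to
# reading ../data/example_1.csv and coercing block and gen to factors)
data <- data.frame(
  block = factor(rep(1:3, times = 4)),
  gen   = factor(rep(paste0("g", 1:4), each = 3)),
  yield = c(7.4, 6.5, 5.6, 9.8, 6.8, 6.2,
            7.3, 6.1, 6.4, 9.5, 8.0, 7.4)
)
str(data)
```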

Quick visualization

data |>
  ggplot(aes(x = gen, y = yield, color = block)) +
  geom_point(size = 3) +
  theme_classic(base_size = 15)

Yield by genotype colored by block.

2) Linear model building blocks

We will progressively build the RCBD model, using model.frame() and model.matrix() to make the design matrices explicit, and then solve the normal equations. We use:

  • \(\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon}\): linear model
  • \(\boldsymbol{y}\): response vector (yield)
  • \(\boldsymbol{X}\): model matrix
  • \(\boldsymbol{\varepsilon} \sim MVN(\boldsymbol{0}, \boldsymbol{I}\sigma^2)\)
  • \(E(\boldsymbol{y}) = \boldsymbol{X\beta}; \; V(\boldsymbol{y}) = \boldsymbol{I}\sigma^2\)
  • \(\boldsymbol{\hat\beta} = (\boldsymbol{X^\top X})^{-1} \boldsymbol{X^\top} \boldsymbol{y}\)
  • \(V(\boldsymbol{\hat{\beta}}) = (\boldsymbol{X^\top X})^{-1} \sigma^2\)
  • Fitted values \(\boldsymbol{\hat y} = \boldsymbol{X\hat\beta}\)
  • Errors \(\boldsymbol{\varepsilon} = \boldsymbol{y} - \boldsymbol{\hat y}\)
  • Sum of squared errors \(\text{SSE} = \boldsymbol{\varepsilon^\top \varepsilon}\)
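These formulas can be collected in a small helper that we reuse conceptually in each subsection below (a sketch; the name `ols_fit` is ours, not from any package):

```r
# Solve the normal equations for a model matrix X and response y:
# beta = (X'X)^{-1} X'y, plus fitted values, residuals, and SSE
ols_fit <- function(X, y) {
  XtX_inv <- solve(crossprod(X))           # (X'X)^{-1}
  beta    <- XtX_inv %*% crossprod(X, y)   # (X'X)^{-1} X'y
  y_hat   <- X %*% beta                    # fitted values
  e       <- y - y_hat                     # residuals
  list(beta = beta, y_hat = y_hat, SSE = sum(e^2), XtX_inv = XtX_inv)
}

# Intercept-only check: beta reduces to the overall mean
y <- c(7.4, 6.5, 5.6, 9.8, 6.8, 6.2, 7.3, 6.1, 6.4, 9.5, 8.0, 7.4)
ols_fit(matrix(1, nrow = 12), y)$beta  # 7.25
```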

Let n = 12 observations, n_blks = 3 blocks, n_gens = 4 genotypes.

n <- 12
n_blks <- 3
n_gens <- 4

2.1 Intercept-only model (overall mean)

  • Model:

\[ \small \boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon} => \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \\ \end{bmatrix} = \overset{\text{mean}}{ \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}} \mu + \boldsymbol{\varepsilon} \]

  • BLUE(s):

\[ \small \boldsymbol{\beta} = \underbrace{\left(\begin{bmatrix} 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ \end{bmatrix}\right)^{-1}}_{(\boldsymbol{X'X})^{-1}} \underbrace{\begin{bmatrix} 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ 1 \ \end{bmatrix} \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \\ \end{bmatrix}}_{\boldsymbol{X'y}} = \underbrace{\frac{1}{12}}_{(\boldsymbol{X'X})^{-1}}\cdot \underbrace{87}_{\boldsymbol{X'y}} = 7.25 \]

Note

Notice that the first term, \(\left(\boldsymbol{X'X}\right)^{-1}\), is the inverse of the number of observations, while the second term, \(\boldsymbol{X'y}\), is the sum of the phenotypic values.

In this example, there is only a single coefficient, \(\mu\). Therefore, the entire expression simplifies to the sum of the phenotypic values divided by the number of observations, which is simply the overall mean, as shown below:

data.frame(mean = "mean", beta = mean(data$yield)) |>
  gt()
mean beta
mean 7.25
ff <- yield ~ 1
m <- model.frame(ff, data)
X <- model.matrix(ff, m)
y <- matrix(data$yield)

# Normal equations components
Xty <- t(X) %*% y
XtX <- t(X) %*% X
rank_X <- qr(XtX)$rank
XtX_inv <- solve(XtX)

beta_mu <- XtX_inv %*% Xty # overall mean (mu)
y_hat <- X %*% beta_mu
errors <- y - y_hat
SSE_mu <- t(errors) %*% errors
SSE_mu <- as.numeric(SSE_mu)

list(rank = rank_X, beta_mu = drop(beta_mu), SSE_mu = SSE_mu)
$rank
[1] 1

$beta_mu
(Intercept) 
       7.25 

$SSE_mu
[1] 18.81

2.2 Add blocks

We will now illustrate what happens if we only include the block factor.

  • Model: \[ \small \boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon} => \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \\ \end{bmatrix} = \begin{bmatrix} \overset{\text{Block 1}}{1} & \overset{\text{Block 2}}{0} & \overset{\text{Block 3}}{0} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \end{bmatrix} + \boldsymbol{\varepsilon} \]

  • BLUE(s): \[ \small \boldsymbol{\beta} = \underbrace{\left(\begin{bmatrix} 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \\ \end{bmatrix} \begin{bmatrix} 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \\ 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \\ 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \\ 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \\ \end{bmatrix}\right)^{-1}}_{(\boldsymbol{X'X})^{-1}} \underbrace{\begin{bmatrix} 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \ 0 \ 0 \ 1 \\ \end{bmatrix} \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \\ \end{bmatrix}}_{\boldsymbol{X'y}} = \underbrace{\begin{bmatrix} \frac{1}{4} \ 0 \ 0 \\ 0 \ \frac{1}{4} \ 0 \\ 0 \ 0 \ \frac{1}{4} \\ \end{bmatrix}}_{(\boldsymbol{X'X})^{-1}} \underbrace{\begin{bmatrix} 34 \\ 27.4 \\ 25.6 \\ \end{bmatrix}}_{\boldsymbol{X'y}} = \begin{bmatrix} 8.50 \\ 6.85 \\ 6.40 \\ \end{bmatrix} \]

In this model, there are 3 levels (\(Block1\), \(Block2\) and \(Block3\)). The expression therefore simplifies to each block's sum of phenotypic values divided by its number of observations (4), which is simply the mean per block.

data |>
  group_by(block) |>
  summarise(beta = mean(yield, na.rm = TRUE)) |>
  gt()
block beta
1 8.50
2 6.85
3 6.40
ff <- yield ~ -1 + block
m <- model.frame(ff, data)
X <- model.matrix(ff, m)
y <- matrix(data$yield)

Xty <- t(X) %*% y
XtX <- t(X) %*% X
rank_X <- qr(XtX)$rank
XtX_inv <- solve(XtX)

beta_blk <- XtX_inv %*% Xty
SSE_blk <- t(y - X %*% beta_blk) %*% (y - X %*% beta_blk)
SSE_blk <- as.numeric(SSE_blk)

list(rank = rank_X, beta_blk = drop(beta_blk), SSE_blk = SSE_blk)
$rank
[1] 3

$beta_blk
block1 block2 block3 
  8.50   6.85   6.40 

$SSE_blk
[1] 9.03

2.3 Add genotypes

  • Model:

\[ \small \boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon} => \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \\ \end{bmatrix} = \begin{bmatrix} \overset{gen1}{1} & \overset{gen2}{0} & \overset{gen3}{0} & \overset{gen4}{0} \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} g_1 \\ g_2 \\ g_3 \\ g_4 \\ \end{bmatrix} + \boldsymbol{\varepsilon} \]

  • BLUE(s):

\[ \small \boldsymbol{\beta} = \underbrace{ \left(\begin{bmatrix} 1 \ 1 \ 1 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \\ 0 \ 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \\ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 0 \ 0 \\ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 1 \ 1 \ 1 \\ \end{bmatrix} \begin{bmatrix} 1 \ 0 \ 0 \ 0 \\ 1 \ 0 \ 0 \ 0 \\ 1 \ 0 \ 0 \ 0 \\ 0 \ 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \ 0 \\ 0 \ 1 \ 0 \ 0 \\ 0 \ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \ 0 \\ 0 \ 0 \ 1 \ 0 \\ 0 \ 0 \ 0 \ 1 \\ 0 \ 0 \ 0 \ 1 \\ 0 \ 0 \ 0 \ 1 \\ \end{bmatrix}\right)^{-1}}_{(\boldsymbol{X'X})^{-1}} \underbrace{ \begin{bmatrix} 1 \ 1 \ 1 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \\ 0 \ 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \\ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 0 \ 0 \\ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 1 \ 1 \ 1 \\ \end{bmatrix} \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \\ \end{bmatrix}}_{\boldsymbol{X'y}} = \underbrace{\begin{bmatrix} \frac{1}{3}\ 0 \ 0 \ 0 \\ 0 \ \frac{1}{3} \ 0 \ 0 \\ 0 \ 0 \ \frac{1}{3} \ 0 \\ 0 \ 0 \ 0 \ \frac{1}{3} \\ \end{bmatrix}}_{(\boldsymbol{X'X})^{-1}} \underbrace{ \begin{bmatrix} 19.5 \\ 22.8 \\ 19.8 \\ 24.9 \\ \end{bmatrix} }_{\boldsymbol{X'y}} = \begin{bmatrix} 6.5 \\ 7.6 \\ 6.6 \\ 8.3 \\ \end{bmatrix} \]

In this model, there are 4 levels (\(g1\), \(g2\), \(g3\) and \(g4\)). The expression therefore simplifies to each genotype's sum of phenotypic values divided by its number of observations (3), which is simply the mean per genotype.

data |>
  group_by(gen) |>
  summarise(beta = mean(yield, na.rm = TRUE)) |>
  gt()
gen beta
g1 6.5
g2 7.6
g3 6.6
g4 8.3
ff <- yield ~ -1 + gen
m <- model.frame(ff, data)
X <- model.matrix(ff, m)
y <- matrix(data$yield)

Xty <- t(X) %*% y
XtX <- t(X) %*% X
rank_X <- qr(XtX)$rank
XtX_inv <- solve(XtX)

beta_gen <- XtX_inv %*% Xty
SSE_gen <- t(y - X %*% beta_gen) %*% (y - X %*% beta_gen)
SSE_gen <- as.numeric(SSE_gen)

m2 <- list(rank = rank_X, beta_gen = drop(beta_gen), SSE_gen = SSE_gen)

2.4 Full model: intercept + blocks + genotypes

  • Model:

\[ \small \boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon} \;\;\;\Rightarrow\;\;\; \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \\ \end{bmatrix} = \begin{bmatrix} \overset{\text{intercept}}{1} & \overset{\text{block2}}{0} & \overset{\text{block3}}{0} & \overset{\text{gen2}}{0} & \overset{\text{gen3}}{0} & \overset{\text{gen4}}{0} \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} \mu \\ b_2 \\ b_3 \\ g_2 \\ g_3 \\ g_4 \\ \end{bmatrix} + \boldsymbol{\varepsilon} \]

  • BLUE(s):

\[ \small \begin{align*} \boldsymbol{\beta} &= \underbrace{ \left( \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix} \right)^{-1} }_{(\boldsymbol{X'X})^{-1}} \underbrace{ \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 7.4 \\ 6.5 \\ 5.6 \\ 9.8 \\ 6.8 \\ 6.2 \\ 7.3 \\ 6.1 \\ 6.4 \\ 9.5 \\ 8.0 \\ 7.4 \end{bmatrix} }_{\boldsymbol{X'y}} \\ &= \underbrace{ \begin{bmatrix} \frac{1}{2} & \frac{-1}{4} & \frac{-1}{4} & \frac{-1}{3} & \frac{-1}{3} & \frac{-1}{3} \\ \frac{-1}{4} & \frac{1}{2} & \frac{1}{4} & 0 & 0 & 0 \\ \frac{-1}{4} & \frac{1}{4} & \frac{1}{2} & 0 & 0 & 0 \\ \frac{-1}{3} & 0 & 0 & \frac{2}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{-1}{3} & 0 & 0 & \frac{1}{3} & \frac{2}{3} & \frac{1}{3} \\ \frac{-1}{3} & 0 & 0 & \frac{1}{3} & \frac{1}{3} & \frac{2}{3} \end{bmatrix} }_{(\boldsymbol{X'X})^{-1}} \underbrace{ \begin{bmatrix} 87.0 \\ 27.4 \\ 25.6 \\ 22.8 \\ 19.8 \\ 24.9 \end{bmatrix} }_{\boldsymbol{X'y}} = \begin{bmatrix} 7.75 \\ -1.65 \\ -2.10 \\ 1.10 \\ 0.10 \\ 1.80 \end{bmatrix} \end{align*} \]

ff <- yield ~ 1 + block + gen
m <- model.frame(ff, data)
X <- model.matrix(ff, m)
y <- matrix(data$yield)

Xty <- t(X) %*% y
XtX <- t(X) %*% X
rank_X <- qr(XtX)$rank
XtX_inv <- solve(XtX)

beta <- XtX_inv %*% Xty
SSE <- t(y - X %*% beta) %*% (y - X %*% beta)
SSE <- as.numeric(SSE)

list(rank = rank_X, betas = drop(beta), SSE = SSE)
$rank
[1] 6

$betas
(Intercept)      block2      block3       geng2       geng3       geng4 
       7.75       -1.65       -2.10        1.10        0.10        1.80 

$SSE
[1] 2.4

Notice that the first level of each factor (the baseline) is dropped so that the \(\boldsymbol{X}\) matrix has full column rank.
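To see why the baselines must be dropped, we can build the overparameterized matrix that keeps the intercept plus all dummy columns and check its rank (a sketch; the factors are reconstructed inline so the chunk runs standalone):

```r
# Intercept plus ALL dummy columns for both factors: the block dummies
# sum to the intercept column, and so do the genotype dummies, so X
# has 8 columns but only rank 6
block <- factor(rep(1:3, times = 4))
gen   <- factor(rep(paste0("g", 1:4), each = 3))
X_over <- cbind(1,
                model.matrix(~ -1 + block),
                model.matrix(~ -1 + gen))
c(columns = ncol(X_over), rank = qr(X_over)$rank)
```

With such a matrix, \(\boldsymbol{X'X}\) is singular and solve() fails, which is why R drops one level per factor instead.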

2.5 Coefficients and variance-covariances

\[ \boldsymbol{\hat\beta} = (\boldsymbol{X^\top X})^{-1} \boldsymbol{X^\top} \boldsymbol{y} \] \[V(\boldsymbol{\hat{\beta}}) = (\boldsymbol{X^\top X})^{-1} \sigma^2\]

sigma_2 <- SSE / (n - length(beta))
vcov_betas <- XtX_inv * sigma_2
vcov_betas
            (Intercept) block2 block3      geng2      geng3      geng4
(Intercept)   0.2000000   -0.1   -0.1 -0.1333333 -0.1333333 -0.1333333
block2       -0.1000000    0.2    0.1  0.0000000  0.0000000  0.0000000
block3       -0.1000000    0.1    0.2  0.0000000  0.0000000  0.0000000
geng2        -0.1333333    0.0    0.0  0.2666667  0.1333333  0.1333333
geng3        -0.1333333    0.0    0.0  0.1333333  0.2666667  0.1333333
geng4        -0.1333333    0.0    0.0  0.1333333  0.1333333  0.2666667
data.frame(
  coefficient = rownames(vcov_betas),
  solution = beta,
  variance = diag(vcov_betas),
  std.error = sqrt(diag(vcov_betas)),
  row.names = NULL
) |> gt()
coefficient solution variance std.error
(Intercept) 7.75 0.2000000 0.4472136
block2 -1.65 0.2000000 0.4472136
block3 -2.10 0.2000000 0.4472136
geng2 1.10 0.2666667 0.5163978
geng3 0.10 0.2666667 0.5163978
geng4 1.80 0.2666667 0.5163978
Note

Be careful when interpreting the coefficients and the variances of these factor coefficients. When a linear model includes a categorical factor such as genotype, it needs a baseline; by default, R chooses the first level, g1, as the reference. Therefore:

  • The coefficient for g2 is the estimated difference between g2 and g1 (7.6 - 6.5 = 1.1).
  • The variance of the g2 coefficient is the uncertainty around that difference (g2 - g1), not around the g2 mean itself.
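Both points can be checked numerically (data rebuilt inline so the chunk runs standalone):

```r
# geng2 is the g2 - g1 mean difference; its variance is 2*sigma^2/r,
# the variance of a difference of two genotype means with r = 3 reps
data <- data.frame(
  block = factor(rep(1:3, times = 4)),
  gen   = factor(rep(paste0("g", 1:4), each = 3)),
  yield = c(7.4, 6.5, 5.6, 9.8, 6.8, 6.2, 7.3, 6.1, 6.4, 9.5, 8.0, 7.4)
)
mod <- lm(yield ~ block + gen, data = data)
coef(mod)["geng2"]     # 1.1 = 7.6 - 6.5
2 * sigma(mod)^2 / 3   # 0.2667, the geng2 variance in the table above
```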

3) Reduction-in-SS and ANOVA table

Partial Sum of Squares (SS) measures the unique contribution of a predictor variable to a model, accounting for all other variables. It is calculated by comparing the error sum of squares (SSE) of the model without the variable to the SSE of the model with it. A larger reduction in SSE indicates a greater unique contribution of that variable.
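The same reductions can be obtained from nested `lm()` fits, since `deviance()` returns a linear model's SSE (data rebuilt inline so the chunk runs standalone):

```r
# SS for each factor = SSE(intercept-only) - SSE(intercept + factor)
data <- data.frame(
  block = factor(rep(1:3, times = 4)),
  gen   = factor(rep(paste0("g", 1:4), each = 3)),
  yield = c(7.4, 6.5, 5.6, 9.8, 6.8, 6.2, 7.3, 6.1, 6.4, 9.5, 8.0, 7.4)
)
fit_mu  <- lm(yield ~ 1, data = data)
fit_blk <- lm(yield ~ 1 + block, data = data)
fit_gen <- lm(yield ~ 1 + gen, data = data)
c(block = deviance(fit_mu) - deviance(fit_blk),   # 9.78
  gen   = deviance(fit_mu) - deviance(fit_gen))   # 6.63
```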

sse_df <- data.frame(
  Model = factor(
    c("Intercept", "Add Blocks", "Add Genotypes", "Full"),
    levels = c("Intercept", "Add Blocks", "Add Genotypes", "Full")
  ),
  SS = c(SSE_mu, SSE_mu - SSE_blk, SSE_mu - SSE_gen, SSE)
)

ggplot(sse_df, mapping = aes(x = Model, y = SS, group = 0)) +
  geom_line() +
  geom_point(size = 3, color = "#00BFC4") +
  theme_minimal(base_size = 14)

We now assemble an ANOVA table. The residual df is n - p = 12 - 6 = 6, where p is the number of coefficients in the full model.

anova_dt <- data.frame(
  Source = c("block", "gen", "residuals"), # factors included in the model
  Df     = c(n_blks - 1, n_gens - 1, n - length(beta)), # `n - p`, where `p` is the number of coefficients
  SSq    = c(SSE_mu - SSE_blk, SSE_mu - SSE_gen, SSE) # differences between baseline model and nested models
) |>
  mutate(
    MSq = SSq / Df,
    F.value = MSq / MSq[3],
    `Pr(>F)` = pf(q = F.value, df1 = Df, df2 = Df[3], lower.tail = FALSE),
    F.value = ifelse(Source == "residuals", NA, F.value),
    `Pr(>F)` = ifelse(Source == "residuals", NA, `Pr(>F)`)
  )

anova_dt |>
  gt()
Source Df SSq MSq F.value Pr(>F)
block 2 9.78 4.89 12.225 0.007650536
gen 3 6.63 2.21 5.525 0.036730328
residuals 6 2.40 0.40 NA NA
Note

Why does this work? In fixed-effects ANOVA, the Type-I SS for adding a factor equals the reduction in SSE between nested models. Here we use the intercept-only model as the baseline for both factors; because the design is balanced, block and gen are orthogonal, so each factor's reduction from the intercept-only model equals its sequential (Type-I) SS.

4) Verify with lm()

mod <- lm(yield ~ 1 + block + gen, data = data)
anova(mod)
Analysis of Variance Table

Response: yield
          Df Sum Sq Mean Sq F value   Pr(>F)   
block      2   9.78    4.89  12.225 0.007651 **
gen        3   6.63    2.21   5.525 0.036730 * 
Residuals  6   2.40    0.40                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(mod)
(Intercept)      block2      block3       geng2       geng3       geng4 
       7.75       -1.65       -2.10        1.10        0.10        1.80 
round(vcov(mod), 5)
            (Intercept) block2 block3    geng2    geng3    geng4
(Intercept)     0.20000   -0.1   -0.1 -0.13333 -0.13333 -0.13333
block2         -0.10000    0.2    0.1  0.00000  0.00000  0.00000
block3         -0.10000    0.1    0.2  0.00000  0.00000  0.00000
geng2          -0.13333    0.0    0.0  0.26667  0.13333  0.13333
geng3          -0.13333    0.0    0.0  0.13333  0.26667  0.13333
geng4          -0.13333    0.0    0.0  0.13333  0.13333  0.26667
# Check that (X'X)^{-1} * sigma^2 matches vcov
round(vcov_betas, 5)
            (Intercept) block2 block3    geng2    geng3    geng4
(Intercept)     0.20000   -0.1   -0.1 -0.13333 -0.13333 -0.13333
block2         -0.10000    0.2    0.1  0.00000  0.00000  0.00000
block3         -0.10000    0.1    0.2  0.00000  0.00000  0.00000
geng2          -0.13333    0.0    0.0  0.26667  0.13333  0.13333
geng3          -0.13333    0.0    0.0  0.13333  0.26667  0.13333
geng4          -0.13333    0.0    0.0  0.13333  0.13333  0.26667

5) Reconstructing means

With block and gen as factors and an intercept present, R uses treatment (baseline) coding by default. The printed beta therefore contains:

  • beta[1]: the intercept (no longer the overall mean)
  • beta[2:3]: effects for non-reference blocks
  • beta[4:6]: effects for non-reference genotypes

You can reconstruct overall mean, genotype means, and block means as follows.

# Number of coefficients in full model
n_coef <- length(beta)

# Overall mean reconstructed from coefficients
mu_recon <- beta[1] + sum(beta[2:3]) / n_blks + sum(beta[4:6]) / n_gens

# Compare with the intercept-only estimate
mu_recon - beta_mu[1]
[1] 7.993606e-15

5.1 Genotype means including the missing (baseline) level

# beta currently has: (Intercept), block2, block3, gen2, gen3, gen4
# Create a named vector for gen effects including the reference level set to 0
print(beta)
             [,1]
(Intercept)  7.75
block2      -1.65
block3      -2.10
geng2        1.10
geng3        0.10
geng4        1.80
gens <- c("geng1" = 0, beta[4:6, ])
# Add back the intercept and average block effect
gens <- beta[1] + sum(beta[2:3]) / n_blks + gens
gens
geng1 geng2 geng3 geng4 
  6.5   7.6   6.6   8.3 

5.2 Block means including the missing (baseline) level

print(beta)
             [,1]
(Intercept)  7.75
block2      -1.65
block3      -2.10
geng2        1.10
geng3        0.10
geng4        1.80
blks <- c("block1" = 0, beta[2:3, ])
blks <- beta[1] + sum(beta[4:6]) / n_gens + blks
blks
block1 block2 block3 
  8.50   6.85   6.40 

6) Estimated marginal means and pairwise comparisons

emmeans provides adjusted means (marginal over the other factors) and convenient contrasts.

# Genotype adjusted means
emm_gen <- emmeans(mod, ~gen)
emm_gen
 gen emmean    SE df lower.CL upper.CL
 g1     6.5 0.365  6     5.61     7.39
 g2     7.6 0.365  6     6.71     8.49
 g3     6.6 0.365  6     5.71     7.49
 g4     8.3 0.365  6     7.41     9.19

Results are averaged over the levels of: block 
Confidence level used: 0.95 
# Pairwise genotype comparisons
pairs(emm_gen)
 contrast estimate    SE df t.ratio p.value
 g1 - g2      -1.1 0.516  6  -2.130  0.2447
 g1 - g3      -0.1 0.516  6  -0.194  0.9971
 g1 - g4      -1.8 0.516  6  -3.486  0.0486
 g2 - g3       1.0 0.516  6   1.936  0.3066
 g2 - g4      -0.7 0.516  6  -1.356  0.5656
 g3 - g4      -1.7 0.516  6  -3.292  0.0609

Results are averaged over the levels of: block 
P value adjustment: tukey method for comparing a family of 4 estimates 
# For a quick hand-check of a simple pairwise SE when balanced:
sqrt(sigma_2 / 3 + sigma_2 / 3)
[1] 0.5163978
Important

Notice that the coefficient standard errors obtained via \((\boldsymbol{X'X})^{-1}\sigma^2\) differ from the ones returned by emmeans, as expected: the former are standard errors of differences from the baseline level, while emmeans reports standard errors of adjusted means.
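The two are not in conflict: an adjusted mean is a linear combination of coefficients, so its standard error is \(\sqrt{\boldsymbol{L'VL}}\) for the averaging contrast \(\boldsymbol{L}\), not a diagonal element of \(\boldsymbol{V}\). A sketch for the g1 adjusted mean (data rebuilt inline so the chunk runs standalone):

```r
# The g1 adjusted mean is mu + (0 + b2 + b3)/3 (averaging over blocks),
# so its SE is sqrt(L' V L) with L picking out that combination
data <- data.frame(
  block = factor(rep(1:3, times = 4)),
  gen   = factor(rep(paste0("g", 1:4), each = 3)),
  yield = c(7.4, 6.5, 5.6, 9.8, 6.8, 6.2, 7.3, 6.1, 6.4, 9.5, 8.0, 7.4)
)
mod <- lm(yield ~ block + gen, data = data)
L <- c(1, 1/3, 1/3, 0, 0, 0)  # intercept + mean block effect, gen = g1
sqrt(drop(t(L) %*% vcov(mod) %*% L))  # 0.365, the emmeans SE for g1
```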