In this article we analyze a randomized complete block design (RCBD) to show how a traditional full fixed-effects model and its ANOVA work.
1) Setup & data
Clewer and Scarisbrick (2001) present a yield trial (t/ha) conducted using a randomized complete block design. The design included 3 blocks and 4 cultivars, resulting in 12 experimental plots.
library(tidyverse)
library(emmeans)
library(gt)
library(kableExtra)

# Read and coerce factors
data <- read.csv("../data/example_1.csv") |>
  mutate(gen = as.factor(gen), block = as.factor(block))
head(data)
data |>
  ggplot(aes(x = gen, y = yield, color = block)) +
  geom_point(size = 3) +
  theme_classic(base_size = 15)
Yield by genotype colored by block.
2) Linear model building blocks
We will progressively build the RCBD model, using model.frame() and model.matrix() to see the design matrices explicitly, and then solve the normal equations. We use:
\(\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon}\): linear model
\(\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X'X}\right)^{-1}\boldsymbol{X'y}\): ordinary least squares solution of the normal equations
We start with the simplest case, the intercept-only model, where \(\boldsymbol{X}\) is a single column of ones. Notice that the first term, \(\left(\boldsymbol{X'X}\right)^{-1}\), is then the inverse of the number of observations, while the second term, \(\boldsymbol{X'y}\), is the sum of the phenotypic values.
Since this model has only a single coefficient, \(\mu\), the whole expression reduces to the sum of the phenotypic values divided by the number of observations, which is simply the overall mean, as shown below:
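A minimal sketch of this calculation, building the one-column design matrix with model.matrix() and solving the normal equations (the name beta_mu is chosen here to match how the estimate is used later):

# Intercept-only model: the design matrix is a single column of ones
y  <- data$yield
X0 <- model.matrix(~ 1, data = data)

# (X'X)^{-1} X'y reduces to sum(y) / n
beta_mu <- solve(t(X0) %*% X0) %*% (t(X0) %*% y)
beta_mu
mean(y)   # same value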
In the block-only model there are 3 levels (\(Block1\), \(Block2\) and \(Block3\)). The expression therefore simplifies, for each block, to the sum of that block's phenotypic values divided by its number of observations, which is simply the mean per block.
data |>
  group_by(block) |>
  summarise(beta = mean(yield, na.rm = TRUE)) |>
  gt()
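The same block means fall out of the normal equations when the design matrix has one indicator column per block; a sketch, using the no-intercept formula ~ 0 + block so the coefficients are the block means directly:

# Block-only model: one indicator (0/1) column per block
Xb <- model.matrix(~ 0 + block, data = data)
solve(t(Xb) %*% Xb) %*% (t(Xb) %*% data$yield)   # block means, matching the table above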
In the genotype-only model there are 4 levels (\(g1\), \(g2\), \(g3\) and \(g4\)). The expression therefore simplifies, for each genotype, to the sum of that genotype's phenotypic values divided by its number of observations, which is simply the mean per genotype.
data |>
  group_by(gen) |>
  summarise(beta = mean(yield, na.rm = TRUE)) |>
  gt()
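And likewise for the genotype-only model (a sketch):

# Genotype-only model: one indicator (0/1) column per cultivar
Xg <- model.matrix(~ 0 + gen, data = data)
solve(t(Xg) %*% Xg) %*% (t(Xg) %*% data$yield)   # genotype means, matching the table above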
Partial Sum of Squares (SS) measures the unique contribution of a predictor variable to a model, accounting for all other variables. It is calculated by comparing the error sum of squares (SSE) of the model without the variable to the SSE of the model with it. A larger reduction in SSE indicates a greater unique contribution of that variable.
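The ANOVA table below is assembled from the error sums of squares of these nested models and from the full-model coefficients. One way to obtain the required objects, named to match how they are used below (n, n_blks, n_gens, beta, SSE_mu, SSE_blk, SSE_gen, SSE), with a small helper sse() chosen here for convenience:

y      <- data$yield
n      <- length(y)
n_blks <- nlevels(data$block)
n_gens <- nlevels(data$gen)

# Error sum of squares for a given design matrix, via the normal equations
sse <- function(X, y) {
  b <- solve(t(X) %*% X) %*% (t(X) %*% y)
  sum((y - X %*% b)^2)
}

SSE_mu  <- sse(model.matrix(~ 1,         data = data), y)  # intercept only
SSE_blk <- sse(model.matrix(~ 1 + block, data = data), y)  # intercept + block
SSE_gen <- sse(model.matrix(~ 1 + gen,   data = data), y)  # intercept + gen

# Full model: intercept + block + gen (treatment coding)
X    <- model.matrix(~ 1 + block + gen, data = data)
beta <- solve(t(X) %*% X) %*% (t(X) %*% y)
SSE  <- sum((y - X %*% beta)^2)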
anova_dt <- data.frame(
  Source = c("block", "gen", "residuals"),               # factors included in the model
  Df     = c(n_blks - 1, n_gens - 1, n - length(beta)),  # `n - p`, where `p` is the number of coefficients
  SSq    = c(SSE_mu - SSE_blk, SSE_mu - SSE_gen, SSE)    # differences between baseline model and nested models
) |>
  mutate(
    MSq      = SSq / Df,
    F.value  = MSq / MSq[3],
    `Pr(>F)` = pf(q = F.value, df1 = Df, df2 = Df[3], lower.tail = FALSE),
    F.value  = ifelse(Source == "residuals", NA, F.value),
    `Pr(>F)` = ifelse(Source == "residuals", NA, `Pr(>F)`)
  )

anova_dt |> gt()
Source      Df   SSq   MSq   F.value   Pr(>F)
block        2  9.78  4.89    12.225   0.007650536
gen          3  6.63  2.21     5.525   0.036730328
residuals    6  2.40  0.40        NA            NA
Note
Why does this work? In a fixed-effects ANOVA, the Type-I (sequential) SS for adding a factor equals the reduction in SSE between nested models. Here the intercept-only model is the baseline: adding block or gen reduces the SSE by that factor's SS. Because the RCBD is balanced, block and gen are orthogonal, so these reductions coincide with both the sequential and the partial SS.
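A quick way to see this with lm() directly, since deviance() on an lm fit returns its residual sum of squares (a sketch):

m0 <- lm(yield ~ 1, data = data)                          # intercept-only baseline
deviance(m0) - deviance(lm(yield ~ block, data = data))   # SS for block (9.78)
deviance(m0) - deviance(lm(yield ~ gen,   data = data))   # SS for gen (6.63)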
4) Verify with lm()
mod <- lm(yield ~ 1 + block + gen, data = data)
summary(mod)
Call:
lm(formula = yield ~ 1 + block + gen, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.5500 -0.3625 -0.0500 0.1750 0.9500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7500 0.4472 17.330 2.37e-06 ***
block2 -1.6500 0.4472 -3.690 0.01021 *
block3 -2.1000 0.4472 -4.696 0.00334 **
geng2 1.1000 0.5164 2.130 0.07719 .
geng3 0.1000 0.5164 0.194 0.85284
geng4 1.8000 0.5164 3.486 0.01305 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6325 on 6 degrees of freedom
Multiple R-squared: 0.8724, Adjusted R-squared: 0.7661
F-statistic: 8.205 on 5 and 6 DF, p-value: 0.01173
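The ANOVA table from the fitted model should also reproduce the degrees of freedom, sums of squares, and F tests built by hand above:

anova(mod)   # sequential ANOVA table; matches the hand-built table for this balanced design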
5) Interpreting coefficients under treatment coding
With block and gen as factors and an intercept present, R uses treatment (reference-cell) coding by default. The printed beta therefore contains:
beta[1]: the intercept (no longer the overall mean, but the expected value of the reference cell: block 1, genotype g1)
beta[2:3]: effects for non-reference blocks
beta[4:6]: effects for non-reference genotypes
You can reconstruct overall mean, genotype cell means, and block means as follows.
# Number of coefficients in full model
n_coef <- length(beta)

# Overall mean reconstructed from coefficients
mu_recon <- beta[1] + sum(beta[2:3]) / n_blks + sum(beta[4:6]) / n_gens

# Compare with the intercept-only estimate
mu_recon - beta_mu[1]
[1] 7.993606e-15
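Block means can be reconstructed in the same way, adding to the intercept each block's effect plus the average genotype effect, with the reference levels counted as 0 (a sketch):

# Average genotype effect, with the reference genotype (g1) set to 0
gen_avg <- mean(c(0, beta[4:6]))

# Block means: intercept + block effect + average genotype effect
beta[1] + c(block1 = 0, block2 = beta[2], block3 = beta[3]) + gen_avg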
5.1 Genotype means including the missing (reference) level
# beta currently has: (Intercept), block2, block3, gen2, gen3, gen4
# Create a named vector for gen effects including the reference level set to 0
print(beta)
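A sketch of the reconstruction described in the comment above, followed by an emmeans call consistent with the output printed below (the object emm_gen is reused further down for the pairwise comparisons; gen_eff is a name chosen here):

# Genotype effects with the reference level (g1) set to 0
gen_eff <- c(g1 = 0, g2 = beta[4], g3 = beta[5], g4 = beta[6])

# Genotype cell means averaged over blocks: intercept + average block effect + genotype effect
beta[1] + mean(c(0, beta[2:3])) + gen_eff

# Estimated marginal means for genotype, averaged over blocks
emm_gen <- emmeans(mod, ~ gen)
emm_gen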
gen emmean SE df lower.CL upper.CL
g1 6.5 0.365 6 5.61 7.39
g2 7.6 0.365 6 6.71 8.49
g3 6.6 0.365 6 5.71 7.49
g4 8.3 0.365 6 7.41 9.19
Results are averaged over the levels of: block
Confidence level used: 0.95
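The next chunks use the residual variance sigma_2 and the coefficient covariance matrix vcov_betas; a sketch of how they follow from the hand calculations above (equivalently, vcov(mod) from the lm fit):

# Residual variance: SSE divided by the residual degrees of freedom
sigma_2 <- SSE / (n - length(beta))

# Covariance matrix of the coefficients: (X'X)^{-1} * sigma^2
vcov_betas <- solve(t(X) %*% X) * sigma_2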
# Standard errors of genotype via (X'X)^{-1} * sigma2
sqrt(diag(vcov_betas)[4:6])
geng2 geng3 geng4
0.5163978 0.5163978 0.5163978
# Pairwise genotype comparisons
pairs(emm_gen)
contrast estimate SE df t.ratio p.value
g1 - g2 -1.1 0.516 6 -2.130 0.2447
g1 - g3 -0.1 0.516 6 -0.194 0.9971
g1 - g4 -1.8 0.516 6 -3.486 0.0486
g2 - g3 1.0 0.516 6 1.936 0.3066
g2 - g4 -0.7 0.516 6 -1.356 0.5656
g3 - g4 -1.7 0.516 6 -3.292 0.0609
Results are averaged over the levels of: block
P value adjustment: tukey method for comparing a family of 4 estimates
# For a quick hand-check of a simple pairwise SE when balanced:
sqrt(sigma_2 / 3 + sigma_2 / 3)
[1] 0.5163978
Important
Notice that the standard errors obtained via \((\boldsymbol{X' X})^{-1}\sigma^2\) differ from the ones returned by emmeans. Interestingly, they match the standard errors of the pairwise genotype comparisons exactly. Why? We will see in the next article.