In this article we analyze a randomized complete block design (RCBD) to show how a traditional full fixed-effects model and its ANOVA work.
1) Setup & data
Clewer and Scarisbrick (2001) present a yield trial (t/ha) conducted using a randomized complete block design. The design included 3 blocks and 4 cultivars, resulting in 12 experimental plots.
library(tidyverse)
library(emmeans)
library(gt)
library(kableExtra)

# Read and coerce factors
data <- read.csv("../data/example_1.csv") |>
  mutate(gen = as.factor(gen), block = as.factor(block))
head(data)
data |>
  ggplot(aes(x = gen, y = yield, color = block)) +
  geom_point(size = 3) +
  theme_classic(base_size = 15)
Yield by genotype colored by block.
2) Linear model building blocks
We will progressively build the RCBD model, using model.frame() and model.matrix() to see the design matrices explicitly, and then solve the normal equations. We use:
\(\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon}\): linear model
\(\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X'X}\right)^{-1}\boldsymbol{X'y}\): ordinary least squares solution of the normal equations
We start with the simplest case, the intercept-only model, where \(\boldsymbol{X}\) is a single column of ones. Notice that the first term, \(\left(\boldsymbol{X'X}\right)^{-1}\), is then the inverse of the number of observations, while the second term, \(\boldsymbol{X'y}\), is the sum of the phenotypic values.
Since this model has only a single coefficient, \(\mu\), the whole expression reduces to the sum of the phenotypic values divided by the number of observations, which is simply the overall mean, as shown below:
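A minimal sketch of this calculation, building the one-column design matrix with model.matrix() and solving the normal equations (the name beta_mu is chosen here to match how the estimate is used later):

# Intercept-only model: the design matrix is a single column of ones
y  <- data$yield
X0 <- model.matrix(~ 1, data = data)

# (X'X)^{-1} X'y reduces to sum(y) / n
beta_mu <- solve(t(X0) %*% X0) %*% (t(X0) %*% y)
beta_mu
mean(y)   # same value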
In the block-only model there are 3 levels (\(Block1\), \(Block2\) and \(Block3\)). The expression therefore simplifies, for each block, to the sum of that block's phenotypic values divided by its number of observations, which is simply the mean per block.
data |>
  group_by(block) |>
  summarise(beta = mean(yield, na.rm = TRUE)) |>
  gt()
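The same block means fall out of the normal equations when the design matrix has one indicator column per block; a sketch, using the no-intercept formula ~ 0 + block so the coefficients are the block means directly:

# Block-only model: one indicator (0/1) column per block
Xb <- model.matrix(~ 0 + block, data = data)
solve(t(Xb) %*% Xb) %*% (t(Xb) %*% data$yield)   # block means, matching the table above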
In the genotype-only model there are 4 levels (\(g1\), \(g2\), \(g3\) and \(g4\)). The expression therefore simplifies, for each genotype, to the sum of that genotype's phenotypic values divided by its number of observations, which is simply the mean per genotype.
data |>
  group_by(gen) |>
  summarise(beta = mean(yield, na.rm = TRUE)) |>
  gt()
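And likewise for the genotype-only model (a sketch):

# Genotype-only model: one indicator (0/1) column per cultivar
Xg <- model.matrix(~ 0 + gen, data = data)
solve(t(Xg) %*% Xg) %*% (t(Xg) %*% data$yield)   # genotype means, matching the table above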
Partial Sum of Squares (SS) measures the unique contribution of a predictor variable to a model, accounting for all other variables. It is calculated by comparing the error sum of squares (SSE) of the model without the variable to the SSE of the model with it. A larger reduction in SSE indicates a greater unique contribution of that variable.
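The ANOVA table below is assembled from the error sums of squares of these nested models and from the full-model coefficients. One way to obtain the required objects, named to match how they are used below (n, n_blks, n_gens, beta, SSE_mu, SSE_blk, SSE_gen, SSE), with a small helper sse() chosen here for convenience:

y      <- data$yield
n      <- length(y)
n_blks <- nlevels(data$block)
n_gens <- nlevels(data$gen)

# Error sum of squares for a given design matrix, via the normal equations
sse <- function(X, y) {
  b <- solve(t(X) %*% X) %*% (t(X) %*% y)
  sum((y - X %*% b)^2)
}

SSE_mu  <- sse(model.matrix(~ 1,         data = data), y)  # intercept only
SSE_blk <- sse(model.matrix(~ 1 + block, data = data), y)  # intercept + block
SSE_gen <- sse(model.matrix(~ 1 + gen,   data = data), y)  # intercept + gen

# Full model: intercept + block + gen (treatment coding)
X    <- model.matrix(~ 1 + block + gen, data = data)
beta <- solve(t(X) %*% X) %*% (t(X) %*% y)
SSE  <- sum((y - X %*% beta)^2)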
anova_dt <- data.frame(
  Source = c("block", "gen", "residuals"),               # factors included in the model
  Df     = c(n_blks - 1, n_gens - 1, n - length(beta)),  # `n - p`, where `p` is the number of coefficients
  SSq    = c(SSE_mu - SSE_blk, SSE_mu - SSE_gen, SSE)    # differences between baseline model and nested models
) |>
  mutate(
    MSq      = SSq / Df,
    F.value  = MSq / MSq[3],
    `Pr(>F)` = pf(q = F.value, df1 = Df, df2 = Df[3], lower.tail = FALSE),
    F.value  = ifelse(Source == "residuals", NA, F.value),
    `Pr(>F)` = ifelse(Source == "residuals", NA, `Pr(>F)`)
  )

anova_dt |> gt()
Source      Df   SSq   MSq   F.value   Pr(>F)
block        2  9.78  4.89    12.225   0.007650536
gen          3  6.63  2.21     5.525   0.036730328
residuals    6  2.40  0.40        NA            NA
Note
Why does this work? In a fixed-effects ANOVA, the Type-I (sequential) SS for adding a factor equals the reduction in SSE between nested models. Here the intercept-only model is the baseline: adding block or gen reduces the SSE by that factor's SS. Because the RCBD is balanced, block and gen are orthogonal, so these reductions coincide with both the sequential and the partial SS.
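A quick way to see this with lm() directly, since deviance() on an lm fit returns its residual sum of squares (a sketch):

m0 <- lm(yield ~ 1, data = data)                          # intercept-only baseline
deviance(m0) - deviance(lm(yield ~ block, data = data))   # SS for block (9.78)
deviance(m0) - deviance(lm(yield ~ gen,   data = data))   # SS for gen (6.63)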
4) Verify with lm()
mod <- lm(yield ~ 1 + block + gen, data = data)
summary(mod)
Call:
lm(formula = yield ~ 1 + block + gen, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.5500 -0.3625 -0.0500 0.1750 0.9500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7500 0.4472 17.330 2.37e-06 ***
block2 -1.6500 0.4472 -3.690 0.01021 *
block3 -2.1000 0.4472 -4.696 0.00334 **
geng2 1.1000 0.5164 2.130 0.07719 .
geng3 0.1000 0.5164 0.194 0.85284
geng4 1.8000 0.5164 3.486 0.01305 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6325 on 6 degrees of freedom
Multiple R-squared: 0.8724, Adjusted R-squared: 0.7661
F-statistic: 8.205 on 5 and 6 DF, p-value: 0.01173
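The ANOVA table from the fitted model should also reproduce the degrees of freedom, sums of squares, and F tests built by hand above:

anova(mod)   # sequential ANOVA table; matches the hand-built table for this balanced design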
5) Interpreting coefficients under treatment coding
With block and gen as factors and an intercept present, R uses treatment (reference-cell) coding by default. The printed beta therefore contains:
beta[1]: the intercept (no longer the overall mean, but the expected value of the reference cell: block 1, genotype g1)
beta[2:3]: effects for non-reference blocks
beta[4:6]: effects for non-reference genotypes
You can reconstruct overall mean, genotype cell means, and block means as follows.
# Number of coefficients in full model
n_coef <- length(beta)

# Overall mean reconstructed from coefficients
mu_recon <- beta[1] + sum(beta[2:3]) / n_blks + sum(beta[4:6]) / n_gens

# Compare with the intercept-only estimate
mu_recon - beta_mu[1]
[1] 7.993606e-15
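Block means can be reconstructed in the same way, adding to the intercept each block's effect plus the average genotype effect, with the reference levels counted as 0 (a sketch):

# Average genotype effect, with the reference genotype (g1) set to 0
gen_avg <- mean(c(0, beta[4:6]))

# Block means: intercept + block effect + average genotype effect
beta[1] + c(block1 = 0, block2 = beta[2], block3 = beta[3]) + gen_avg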
5.1 Genotype means including the missing (reference) level
# beta currently has: (Intercept), block2, block3, gen2, gen3, gen4
# Create a named vector for gen effects including the reference level set to 0
print(beta)
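A sketch of the reconstruction described in the comment above, followed by an emmeans call consistent with the output printed below (the object emm_gen is reused further down for the pairwise comparisons; gen_eff is a name chosen here):

# Genotype effects with the reference level (g1) set to 0
gen_eff <- c(g1 = 0, g2 = beta[4], g3 = beta[5], g4 = beta[6])

# Genotype cell means averaged over blocks: intercept + average block effect + genotype effect
beta[1] + mean(c(0, beta[2:3])) + gen_eff

# Estimated marginal means for genotype, averaged over blocks
emm_gen <- emmeans(mod, ~ gen)
emm_gen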
gen emmean SE df lower.CL upper.CL
g1 6.5 0.365 6 5.61 7.39
g2 7.6 0.365 6 6.71 8.49
g3 6.6 0.365 6 5.71 7.49
g4 8.3 0.365 6 7.41 9.19
Results are averaged over the levels of: block
Confidence level used: 0.95
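The next chunks use the residual variance sigma_2 and the coefficient covariance matrix vcov_betas; a sketch of how they follow from the hand calculations above (equivalently, vcov(mod) from the lm fit):

# Residual variance: SSE divided by the residual degrees of freedom
sigma_2 <- SSE / (n - length(beta))

# Covariance matrix of the coefficients: (X'X)^{-1} * sigma^2
vcov_betas <- solve(t(X) %*% X) * sigma_2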
# Standard errors of genotype via (X'X)^{-1} * sigma2
sqrt(diag(vcov_betas)[4:6])
geng2 geng3 geng4
0.5163978 0.5163978 0.5163978
# Pairwise genotype comparisons
pairs(emm_gen)
contrast estimate SE df t.ratio p.value
g1 - g2 -1.1 0.516 6 -2.130 0.2447
g1 - g3 -0.1 0.516 6 -0.194 0.9971
g1 - g4 -1.8 0.516 6 -3.486 0.0486
g2 - g3 1.0 0.516 6 1.936 0.3066
g2 - g4 -0.7 0.516 6 -1.356 0.5656
g3 - g4 -1.7 0.516 6 -3.292 0.0609
Results are averaged over the levels of: block
P value adjustment: tukey method for comparing a family of 4 estimates
# For a quick hand-check of a simple pairwise SE when balanced:
sqrt(sigma_2 / 3 + sigma_2 / 3)
[1] 0.5163978
Important
Notice that the standard errors obtained via \((\boldsymbol{X' X})^{-1}\sigma^2\) differ from the ones returned by emmeans. Interestingly, they match the standard errors of the pairwise genotype comparisons exactly. Why? We will see in the next article.