Multiple Regression - Qualitative Variables

Paul Bastide

2025-04-16

Student t tests

Penguins

Penguins from the Palmer Archipelago in Antarctica; see the palmerpenguins package.

library(palmerpenguins)
kable(head(penguins))
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |

Subset

We keep only the penguins of species “Adelie”, and remove missing data.

adelie <- subset(penguins, species == "Adelie")
adelie <- na.omit(adelie)
kable(head(adelie))
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |

Group Comparison

Question: Are male Adelie penguins heavier than female Adelie penguins?

par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(adelie$body_mass_g, col = adelie$sex)      ## scatter plot
boxplot(body_mass_g ~ sex, data = adelie)       ## box plot

Simple \(t\) test

mass_females <- subset(adelie, sex == "female")$body_mass_g
mass_males <- subset(adelie, sex == "male")$body_mass_g
t.test(mass_males, mass_females, var.equal = TRUE)

    Two Sample t-test

data:  mass_males and mass_females
t = 13.126, df = 144, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 573.0666 776.2484
sample estimates:
mean of x mean of y 
 4043.493  3368.836 

Simple \(t\) test

Data:

  • \(n_f\) female Adelie penguins, \(n_m\) male Adelie penguins
  • \(y^f_{i}\): mass (g) of a female Adelie penguin, \(1 \leq i \leq n_f\)
  • \(y^m_{j}\): mass (g) of a male Adelie penguin, \(1 \leq j \leq n_m\)

Assumptions:

  • \(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
  • \(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid

Test:

\(\mathcal{H}_0\): \(\mu_f = \mu_m\) vs \(\mathcal{H}_1\): \(\mu_f \neq \mu_m\)

Simple \(t\) test

Test:

\(\mathcal{H}_0\): \(\mu_f = \mu_m\) vs \(\mathcal{H}_1\): \(\mu_f \neq \mu_m\)

Test Statistic:

\[ T = \frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2} \] with the “pooled” standard deviation: \[ s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}} \]
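The same statistic can be computed directly in R, as a sanity check (a minimal sketch, reusing the mass_females and mass_males vectors defined above; the names n_f, n_m, s_p, t_stat are only illustrative):

n_f <- length(mass_females); n_m <- length(mass_males)
s_p <- sqrt(((n_f - 1) * var(mass_females) + (n_m - 1) * var(mass_males)) / (n_f + n_m - 2))
t_stat <- (mean(mass_males) - mean(mass_females)) / (s_p * sqrt(1 / n_f + 1 / n_m))
c(t = t_stat, df = n_f + n_m - 2,
  p.value = 2 * pt(abs(t_stat), df = n_f + n_m - 2, lower.tail = FALSE))
## should match the output of t.test(mass_males, mass_females, var.equal = TRUE)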

Simple \(t\) test

t.test(mass_males, mass_females, var.equal = TRUE)

    Two Sample t-test

data:  mass_males and mass_females
t = 13.126, df = 144, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 573.0666 776.2484
sample estimates:
mean of x mean of y 
 4043.493  3368.836 

Reject the null that \(\mu_f = \mu_m\): males and females have different body masses.

Linear Regression

Model

Assumptions:

  • \(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
  • \(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid

in other words:

  • \(y^f_{i} = \mu_f + \epsilon_i^f\), with \(\epsilon^f_i \sim \mathcal{N}(0, \sigma^2)\) iid
  • \(y^m_{j} = \mu_m + \epsilon_j^m\), with \(\epsilon^m_j \sim \mathcal{N}(0, \sigma^2)\) iid

Can we write this as a linear model: “\(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)”?

Model

Write: \[ \mathbf{y} = \begin{pmatrix} \mathbf{y}^f \\ \mathbf{y}^m \\ \end{pmatrix} \quad \text{and} \quad \boldsymbol{\epsilon} = \begin{pmatrix} \boldsymbol{\epsilon}^f \\ \boldsymbol{\epsilon}^m \\ \end{pmatrix} \] \(\mathbf{y}\) and \(\boldsymbol{\epsilon}\) vectors of size \(n = n_f + n_m\).

Then, for \(1 \leq i \leq n\): \[ y_i = \mu_k + \epsilon_i \quad \text{with} \quad \begin{cases} \mu_k = \mu_f & \text{if } 1 \leq i \leq n_f\\ \mu_k = \mu_m & \text{if } n_f+1 \leq i \leq n_f + n_m \end{cases} \]

Regression matrix \(\mathbf{X}\)?

Linear Model

\[ \text{Write:} \qquad \mathbf{X} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots\\ 0 & 1 \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \qquad \text{and} \qquad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m \\ \end{pmatrix} \]

\[ \text{Then:} \qquad \mathbf{X}\boldsymbol{\beta} = \begin{pmatrix} \mu_f + 0\\ \vdots \\ \mu_f + 0\\ 0 + \mu_m\\ \vdots\\ 0 + \mu_m \\ \end{pmatrix} = \begin{pmatrix} \mu_f \\ \vdots \\ \mu_f \\ \mu_m \\ \vdots\\ \mu_m \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \]

Linear Model

Hence, the model can be written as: \[ \begin{pmatrix} y^f_1 \\ \vdots\\ y^f_{n_f} \\ y^m_{1} \\ \vdots\\ y^m_{n_m} \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots\\ 0 & 1 \\ \end{pmatrix} \begin{pmatrix} \mu_f \\ \mu_m \\ \end{pmatrix} + \begin{pmatrix} \epsilon^f_1 \\ \vdots\\ \epsilon^f_{n_f} \\ \epsilon^m_{1} \\ \vdots\\ \epsilon^m_{n_m} \\ \end{pmatrix} \]

i.e: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

\(\to\) classical linear model with Gaussian assumption!
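This “one column per group” parametrization can be fitted directly in R by removing the intercept from the formula (a minimal sketch; the object name fit_means is illustrative):

fit_means <- lm(body_mass_g ~ 0 + sex, data = adelie)  ## no intercept: one column per sex
coef(fit_means)                ## estimates of mu_f and mu_m (the two group means)
head(model.matrix(fit_means))  ## indicator columns 1(female), 1(male)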

Linear Model - With Intercept

\[ \text{Write:} \qquad \mathbf{X} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \qquad \text{and} \qquad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} \]

\[ \text{Then:} \qquad \mathbf{X}\boldsymbol{\beta} = \begin{pmatrix} \mu_f + 0\\ \vdots \\ \mu_f + 0\\ \mu_f + \mu_m - \mu_f\\ \vdots\\ \mu_f + \mu_m - \mu_f \\ \end{pmatrix} = \begin{pmatrix} \mu_f \\ \vdots \\ \mu_f \\ \mu_m \\ \vdots\\ \mu_m \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \]

Linear Model - With Intercept

The model can also be written as: \[ \begin{pmatrix} y^f_1 \\ \vdots\\ y^f_{n_f} \\ y^m_{1} \\ \vdots\\ y^m_{n_m} \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} + \begin{pmatrix} \epsilon^f_1 \\ \vdots\\ \epsilon^f_{n_f} \\ \epsilon^m_{1} \\ \vdots\\ \epsilon^m_{n_m} \\ \end{pmatrix} \]

i.e: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

\(\to\) parametrization used in R

Linear Model - \(t\) test

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

As \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix}, \]

the test \[ \mathcal{H}_0: \mu_f = \mu_m \quad \text{vs} \quad \mathcal{H}_1: \mu_f \neq \mu_m \]

becomes: \[ \mathcal{H}_0: \beta_2 = 0 \quad \text{vs} \quad \mathcal{H}_1: \beta_2 \neq 0 \]

\(\to\) \(t\)-test on the \(\beta_2\) coefficient!

Linear Model

fit <- lm(body_mass_g ~ sex, data = adelie)
model.matrix(fit)[1:10, ]
   (Intercept) sexmale
1            1       1
2            1       0
3            1       0
4            1       0
5            1       1
6            1       0
7            1       1
8            1       0
9            1       1
10           1       1

Linear Model

summary(fit)

Call:
lm(formula = body_mass_g ~ sex, data = adelie)

Residuals:
    Min      1Q  Median      3Q     Max 
-718.49 -218.84  -18.84  225.00  731.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3368.84      36.34   92.69   <2e-16 ***
sexmale       674.66      51.40   13.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 310.5 on 144 degrees of freedom
Multiple R-squared:  0.5447,    Adjusted R-squared:  0.5416 
F-statistic: 172.3 on 1 and 144 DF,  p-value: < 2.2e-16

Simple \(t\) test

t.test(mass_males, mass_females, var.equal = TRUE)

    Two Sample t-test

data:  mass_males and mass_females
t = 13.126, df = 144, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 573.0666 776.2484
sample estimates:
mean of x mean of y 
 4043.493  3368.836 

Test statistics

Simple \(t\) test: \[ T = \frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2} \quad \text{with} \quad s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}} \]

Linear Regression: \[ \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \sim \mathcal{T}_{n-p} \quad \text{with} \quad p = 2 ~,~ n = n_f + n_m ~,~ \beta_2 = 0 \text{ under } \mathcal{H}_0 \]

i.e. \[ \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2} \]

Test statistics

Let’s show that: \[ \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}} \quad \text{with} \quad \mathbf{X} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \qquad \text{and} \qquad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} \]

is equal, up to the sign convention (the order in which the two groups are taken), to \[ \frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \quad \text{with} \quad s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}} \]

Test statistics - Proof - 1/5

\[ \mathbf{X}^T\mathbf{X} = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \\ \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} = \begin{pmatrix} n & n_m \\ n_m & n_m \end{pmatrix} \]

Hence: \[ (\mathbf{X}^T\mathbf{X})^{-1} = \begin{pmatrix} n & n_m \\ n_m & n_m \end{pmatrix}^{-1} = \frac{1}{nn_m - n_m^2} \begin{pmatrix} n_m & -n_m \\ -n_m & n \end{pmatrix} \]

\[ [(\mathbf{X}^T\mathbf{X})^{-1}]_{22} = \frac{n}{nn_m - n_m^2} = \frac{n_f + n_m}{(n - n_m)n_m} = \frac{n_f + n_m}{n_f n_m} = \frac{1}{n_f} + \frac{1}{n_m} \]
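This identity is easy to check numerically on the Adelie data (a minimal sketch; X and XtX_inv are illustrative names):

X <- model.matrix(~ sex, data = adelie)      ## columns: intercept, 1(male)
XtX_inv <- solve(t(X) %*% X)
n_f <- sum(adelie$sex == "female"); n_m <- sum(adelie$sex == "male")
all.equal(XtX_inv[2, 2], 1 / n_f + 1 / n_m)  ## should be TRUE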

Test statistics - Proof - 2/5

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \]

But: \[ \mathbf{X}^{T}\mathbf{y} = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \\ \end{pmatrix} \mathbf{y} = \begin{pmatrix} \sum_{i = 1}^n y_i\\ \sum_{j = 1}^{n_m} y^m_j \end{pmatrix} \]

Hence: \[ \begin{aligned} \hat{\boldsymbol{\beta}} & = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} = \frac{1}{n_fn_m} \begin{pmatrix} n_m & -n_m \\ -n_m & n \end{pmatrix} \begin{pmatrix} \sum_{i = 1}^n y_i\\ \sum_{j = 1}^{n_m} y^m_j \end{pmatrix} \\ &= \frac{1}{n_fn_m} \begin{pmatrix} n_m\sum_{i = 1}^n y_i - n_m\sum_{j = 1}^{n_m} y^m_j\\ - n_m\sum_{i = 1}^{n} y_i + n_m\sum_{j = 1}^{n_m} y^m_j + n_f\sum_{j = 1}^{n_m} y^m_j\\ \end{pmatrix} \\ &= \frac{1}{n_fn_m} \begin{pmatrix} n_m \sum_{i = 1}^{n_f} y^f_i\\ - n_m\sum_{i = 1}^{n_f} y^f_i + n_f\sum_{j = 1}^{n_m} y^m_j\\ \end{pmatrix} \end{aligned} \]

Test statistics - Proof - 3/5

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \]

And: \[ \hat{\boldsymbol{\beta}} = \frac{1}{n_fn_m} \begin{pmatrix} n_m \sum_{i = 1}^{n_f} y^f_i\\ - n_m\sum_{i = 1}^{n_f} y^f_i + n_f\sum_{j = 1}^{n_m} y^m_j\\ \end{pmatrix} = \begin{pmatrix} \bar{\mathbf{y}}^f\\ -\bar{\mathbf{y}}^f + \bar{\mathbf{y}}^m\\ \end{pmatrix} \]

Note: \[ \text{since } \quad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} \quad \text{this result is coherent.} \]

Note: \[ \text{As} \quad \mathbf{x}^{i} = \begin{cases} (1~0) & \text{ if $i$ female} \\ (1~1) & \text{ if $i$ male} \\ \end{cases} \quad \text{we get:} \quad \hat{y}_i = \mathbf{x}^{i}\hat{\boldsymbol{\beta}} = \begin{cases} \bar{\mathbf{y}}^f & \text{ if $i$ female} \\ \bar{\mathbf{y}}^m & \text{ if $i$ male} \\ \end{cases} \]

Test statistics - Proof - 4/5

\[ \hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i = 1}^n (y_i - \hat{y}_i)^2 \]

Hence: \[ \hat{\sigma}^2 = \frac{1}{n_f + n_m - 2} \left[ \sum_{i = 1}^{n_f} (y^f_i - \bar{\mathbf{y}}^f)^2 + \sum_{i = 1}^{n_m} (y^m_i - \bar{\mathbf{y}}^m)^2 \right] \]

Recall: \[ s^2_p = \frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2} \]

hence: \[ \hat{\sigma}^2 = s^2_p. \]
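This identity can also be checked numerically (a minimal sketch, reusing the vectors defined earlier; sigma() returns the residual standard error of a fitted lm):

fit <- lm(body_mass_g ~ sex, data = adelie)
n_f <- length(mass_females); n_m <- length(mass_males)
s_p <- sqrt(((n_f - 1) * var(mass_females) + (n_m - 1) * var(mass_males)) / (n_f + n_m - 2))
all.equal(sigma(fit), s_p)  ## should be TRUE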

Test statistics - Proof - 5/5

\[ T = \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}} \]

but: \[ \hat{\beta}_2 = \bar{\mathbf{y}}^m -\bar{\mathbf{y}}^f \\ \hat{\sigma}^2 = s^2_p \\ [(\mathbf{X}^T\mathbf{X})^{-1}]_{22} = \frac{1}{n_f} + \frac{1}{n_m} \]

hence: \[ T = \frac{\bar{\mathbf{y}}^m - \bar{\mathbf{y}}^f}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \]

which is the simple two-sample statistic, up to the sign convention (order of the two groups). The simple \(t\)-test is therefore the same as the \(t\)-test on the coefficient of the linear regression.

Summary

Group comparison:

  • \(n_f\) female Adelie penguins, \(n_m\) male Adelie penguins
  • \(y^f_{i}\): mass (g) of a female Adelie penguin, \(1 \leq i \leq n_f\), \(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
  • \(y^m_{j}\): mass (g) of a male Adelie penguin, \(1 \leq j \leq n_m\), \(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid
  • \(\mathcal{H}_0\): \(\mu_f = \mu_m\) vs \(\mathcal{H}_1\): \(\mu_f \neq \mu_m\)

is the same as a Linear model with:

  • \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)
  • \(\mathbf{X}_1 = \mathbb{1}_n\) and \(\mathbf{X}_2 = \mathbb{1}(male)\)
  • \(\beta_1 = \mu_f\) and \(\beta_2 = \mu_m - \mu_f\).
  • \(\mathcal{H}_0: \beta_2 = 0\) vs \(\mathcal{H}_1: \beta_2 \neq 0\)

\(\to\) what happens when there are more than two groups?

One-way ANOVA

Species

We keep only the female penguins of all species, and remove missing data.

females <- subset(penguins, sex == "female")
females <- na.omit(females)
kable(females[c(1, 2, 101, 102, 151, 152), ])
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Gentoo | Biscoe | 43.8 | 13.9 | 208 | 4300 | female | 2008 |
| Gentoo | Biscoe | 43.2 | 14.5 | 208 | 4450 | female | 2008 |
| Chinstrap | Dream | 46.9 | 16.6 | 192 | 2700 | female | 2008 |
| Chinstrap | Dream | 46.2 | 17.5 | 187 | 3650 | female | 2008 |

Species

Question: do the females of the different species have different body masses?

par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(females$body_mass_g, col = females$species)     ## scatter plot
boxplot(body_mass_g ~ species, data = females)       ## box plot

\(t\) tests

Pairwise \(t\) tests:

pairwise.t.test(females$body_mass_g, females$species,
                p.adjust.method = "none")

    Pairwise comparisons using t tests with pooled SD 

data:  females$body_mass_g and females$species 

          Adelie Chinstrap
Chinstrap 0.0066 -        
Gentoo    <2e-16 <2e-16   

P value adjustment method: none 

\(\to\) \({K \choose 2}\) tests: multiple testing problems.
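One standard remedy is to adjust the p-values for multiplicity; a minimal sketch (pairwise.t.test uses Holm's correction by default, Bonferroni is shown here for illustration):

pairwise.t.test(females$body_mass_g, females$species,
                p.adjust.method = "bonferroni")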

\(t\) tests - Model

Model:

  • \(n_s\) penguins, for species \(s \in S = \{adelie, chinstrap, gentoo\}\)
  • \(y^s_{i}\): mass (g) of a female penguin of species \(s\), \(1 \leq i \leq n_s\)
  • \(y^s_{i} \sim \mathcal{N}(\mu_s, \sigma^2)\) iid for any species \(s\).

Pairwise Tests:

\[ \mathcal{H}^{s_1,s_2}_0:~ \mu_{s_1} = \mu_{s_2} \quad \text{vs} \quad \mathcal{H}^{s_1,s_2}_1:~\mu_{s_1} \neq \mu_{s_2} \text{ for any pair } (s_1,s_2) \text{ of species}. \]

Global Test: can we test directly

\[ \mathcal{H}_0:~\mu_a = \mu_c = \mu_g \quad \text{vs} \quad \mathcal{H}_1:~\exists s_1, s_2 \in S ~|~ \mu_{s_1} \neq \mu_{s_2} ? \]

One-way ANOVA

  • \(S\) a set of \(|S| = K\) groups;
  • \(\mathbf{y}\) the vector of all data, length \(n = \sum_{k=1}^K n_k\);
  • \(\mathbf{X}\) the matrix of size \(n \times K\), such that:
    • \(\mathbf{X}_1 = \mathbb{1}_{n}\) is the intercept (column vector)
    • \(\mathbf{X}_k = \mathbb{1}(S_k)\) the indicator of group \(k\), \(2 \leq k \leq K\).
      • \(x_{ik} = 1\) if individual \(i\) is in group \(k\), and \(0\) otherwise.
  • \(\boldsymbol{\beta}\) the vector of mean differences, of length \(K\):
    • \(\beta_1 = \mu_1\)
    • \(\beta_k = \mu_k - \mu_1\) for \(2 \leq k \leq K\)
  • the first group is the reference

One-way ANOVA

\[ \begin{pmatrix} y^1_1 \\ \vdots\\ y^1_{n_1} \\ y^2_{1} \\ \vdots\\ y^2_{n_2} \\ \vdots\\ y^K_{1} \\ \vdots\\ y^K_{n_K} \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 0 & 0 & \cdots & 0\\ 1 & 1 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 1 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 0 & 0 & \cdots & 1\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 0 & 0 & \cdots & 1\\ \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 - \mu_1 \\ \vdots\\ \mu_K - \mu_1 \\ \end{pmatrix} + \begin{pmatrix} \epsilon^1_1 \\ \vdots\\ \epsilon^1_{n_1} \\ \epsilon^2_{1} \\ \vdots\\ \epsilon^2_{n_2} \\ \vdots\\ \epsilon^K_{1} \\ \vdots\\ \epsilon^K_{n_K} \\ \end{pmatrix} \]

One-way ANOVA

Model:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

Note: Assuming that the groups are disjoint and non-empty, \(\text{rank}(\mathbf{X}) = K\).

This is equivalent to the following model: \[ \begin{aligned} y_i &= \beta_1 + \epsilon_i = \mu_1 + \epsilon_i & \text{ if } &i \text{ is in group } 1\\ y_i &= \beta_1 + \beta_k + \epsilon_i = \mu_k + \epsilon_i & \text{ if } &i \text{ is in group } k,\ k>1 \\ \end{aligned} \]

i.e. each group has a mean \(\mu_k\), and the errors are Gaussian iid.

Species

fit <- lm(body_mass_g ~ species, data = females)
model.matrix(fit)[c(1, 2, 101, 102, 151, 152), ]
    (Intercept) speciesChinstrap speciesGentoo
1             1                0             0
2             1                0             0
101           1                0             1
102           1                0             1
151           1                1             0
152           1                1             0

Species

summary(fit)

Call:
lm(formula = body_mass_g ~ species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -193.84   20.26  181.16  622.79 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3368.84      32.42 103.908  < 2e-16 ***
speciesChinstrap   158.37      57.52   2.754  0.00657 ** 
speciesGentoo     1310.91      48.72  26.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared:  0.8292,    Adjusted R-squared:  0.8271 
F-statistic: 393.2 on 2 and 162 DF,  p-value: < 2.2e-16

One-way ANOVA - Coefficients

It is easy to show that:

\[ \hat{\beta}_1 = \bar{\mathbf{y}}^{s_1}\\ \hat{\beta}_k = \bar{\mathbf{y}}^{s_k} - \bar{\mathbf{y}}^{s_1} \text{ for } 2 \leq k \leq K\\ \] Hence:

  • \(\hat{\beta}_1\) is the estimator of the mean of the first group;
  • \(\hat{\beta}_k\) is the estimator of the difference between the mean of group \(k\) and that of the first group (\(k > 1\)).
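A quick check on the penguin data (a minimal sketch):

fit <- lm(body_mass_g ~ species, data = females)
c(coef(fit)[1], coef(fit)[1] + coef(fit)[-1])       ## group means reconstructed from the coefficients
tapply(females$body_mass_g, females$species, mean)  ## empirical group means, for comparison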

One-way ANOVA - Coefficients

We have: \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_{s_1}\\ \mu_{s_2} - \mu_{s_1}\\ \vdots\\ \mu_{s_K} - \mu_{s_1}\\ \end{pmatrix} \quad \text{and} \quad \hat{\boldsymbol{\beta}} = \begin{pmatrix} \bar{\mathbf{y}}^{s_1}\\ \bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\ \vdots\\ \bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\ \end{pmatrix} \]

  • The \(t\) test on the coefficient \(\beta_k\):

\[ \mathcal{H}^{k}_0: \beta_{k} = 0 \quad \text{vs} \quad \mathcal{H}^{k}_1: \beta_{k} \neq 0 \]

is the same as the test of the difference between groups \(1\) & \(k\):

\[ \mathcal{H}^{1,k}_0: \mu_{1} = \mu_{k} \quad \text{vs} \quad \mathcal{H}^{1,k}_1: \mu_{1} \neq \mu_{k} \]

One-way ANOVA - Global Test

We have: \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_{s_1}\\ \mu_{s_2} - \mu_{s_1}\\ \vdots\\ \mu_{s_K} - \mu_{s_1}\\ \end{pmatrix} \quad \text{and} \quad \hat{\boldsymbol{\beta}} = \begin{pmatrix} \bar{\mathbf{y}}^{s_1}\\ \bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\ \vdots\\ \bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\ \end{pmatrix} \]

  • The global \(F\) test on the coefficients \(\beta_2, \dotsc, \beta_K\):

\[ \mathcal{H}_0: \beta_{2} = \beta_3 = \cdots = \beta_{K} = 0 \quad \text{vs} \quad \mathcal{H}_1: \exists k | \beta_{k} \neq 0 \]

is the same as the global test of mean equality of all groups:

\[ \mathcal{H}_0: \mu_{1} = \mu_2 = \cdots = \mu_{K} \quad \text{vs} \quad \mathcal{H}_1: \exists k, k' | \mu_{k} \neq \mu_{k'} \]

Species

fit <- lm(body_mass_g ~ species, data = females)
summary(fit)

Call:
lm(formula = body_mass_g ~ species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -193.84   20.26  181.16  622.79 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3368.84      32.42 103.908  < 2e-16 ***
speciesChinstrap   158.37      57.52   2.754  0.00657 ** 
speciesGentoo     1310.91      48.72  26.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared:  0.8292,    Adjusted R-squared:  0.8271 
F-statistic: 393.2 on 2 and 162 DF,  p-value: < 2.2e-16

One-way ANOVA - Parametrization

General Model

In general, we might write for observation \(i\) of group \(k\): \[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

  • \(\mu\) would be the “grand mean” (intercept)
  • \(\beta_k\) the specific effect of group \(k\) (\(1 \leq k \leq K\)).

General Model Over-Parametrization

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

The model has \(K+1\) parameters.

The associated regression matrix would be: \[ \mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1), \dotsc, \mathbb{1}(S_K)) \]

\(\mathbf{X}\) is not of full rank: \(\mathbb{1} = \sum_{k=1}^K \mathbb{1}(S_k)\).

Example with two groups

\[ \mathbf{X} = \begin{pmatrix} 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots\\ 1 & 0 & 1 \\ \end{pmatrix} \]

We need some constraint on the parameters.

Constraint: Group 1 as reference

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

In the previous slides, we assumed \[\beta_1 = 0\]

  • group 1 has the “reference” mean \(\mu\)
  • group \(k>1\) has mean \(\mu + \beta_k\)
  • \(\beta_k = \mu_k - \mu_1\) is the difference between the means of groups \(k\) and 1.
  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_2), \dotsc, \mathbb{1}(S_K))\) is full rank.

\(\to\) This is the constraint used by default in R.
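This corresponds to the “treatment” contrasts, which R uses by default for unordered factors; a minimal sketch:

contrasts(females$species)  ## contr.treatment: first level (Adelie) is the reference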

Constraint: No intercept

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

We can assume \[\mu = 0\]

  • \(\beta_k = \mu_k\) models the mean of group \(k\) directly.
  • \(\mathbf{X} = (\mathbb{1}(S_1), \dotsc, \mathbb{1}(S_K))\) is full rank.
  • estimators of \(\beta_k\) are changed, but not the estimator of \(\sigma^2\), nor the fitted values.
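In R, this parametrization is obtained by removing the intercept from the formula (a minimal sketch; fit_ref and fit_noint are illustrative names):

fit_ref   <- lm(body_mass_g ~ species,     data = females)
fit_noint <- lm(body_mass_g ~ 0 + species, data = females)
coef(fit_noint)                                ## directly the three group means
all.equal(fitted(fit_ref), fitted(fit_noint))  ## fitted values are unchanged
all.equal(sigma(fit_ref), sigma(fit_noint))    ## and so is the estimate of sigma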

Constraint: Null Sum

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

We can assume \[ \sum_{k = 1}^K \beta_k= 0 \qquad \text{i.e.} \qquad \beta_K = - \sum_{k = 1}^{K-1} \beta_k \]

  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1) - \mathbb{1}(S_K), \dotsc, \mathbb{1}(S_{K-1}) - \mathbb{1}(S_K))\) is full rank.
  • estimators of \(\mu\), \(\beta_k\) are changed, but not the estimator of \(\sigma^2\), nor the fitted values.
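This null-sum constraint corresponds to the “sum” contrasts in R; a minimal sketch (fit_sum is an illustrative name):

fit_sum <- lm(body_mass_g ~ species, data = females,
              contrasts = list(species = contr.sum))
coef(fit_sum)                                                  ## different coefficients ...
all.equal(fitted(fit_sum),
          fitted(lm(body_mass_g ~ species, data = females)))  ## ... but the same fitted values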

Constraint: Weighted Null Sum

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

We can assume \[ \sum_{k = 1}^K n_k \beta_k= 0 \qquad \text{i.e.} \qquad \beta_K = -\frac{1}{n_K} \sum_{k = 1}^{K-1} n_k \beta_k \]

  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1) - \frac{n_1}{n_K}\mathbb{1}(S_K), \dotsc, \mathbb{1}(S_{K-1}) - \frac{n_{K-1}}{n_K}\mathbb{1}(S_K))\) is full rank.
  • estimators of \(\mu\), \(\beta_k\) are changed, but not the estimator of \(\sigma^2\), nor the fitted values.

One-way ANOVA table

One-way ANOVA - \(F\) test

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

ANOVA Test: Does the grouping have any effect at all? I.e. \[ \begin{aligned} \mathcal{H}_0&: \beta_{1} = \cdots = \beta_K = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{1, \dotsc, K\} ~|~ \beta_k \neq 0 \end{aligned} \]

No matter which set of constraints we choose, the \(F\) statistic associated with the one-factor ANOVA is:

\[ F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (K - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - K) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{K - 1}_{n - K}. \]

\(\to\) Global \(F\) test on the coefficients of the linear regression.

ANOVA = “ANalysis Of VAriance”

One-way ANOVA - Global \(F\) Test

With R parametrization: \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_{s_1}\\ \mu_{s_2} - \mu_{s_1}\\ \vdots\\ \mu_{s_K} - \mu_{s_1}\\ \end{pmatrix} \quad \text{and} \quad \hat{\boldsymbol{\beta}} = \begin{pmatrix} \bar{\mathbf{y}}^{s_1}\\ \bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\ \vdots\\ \bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\ \end{pmatrix} \]

  • The global \(F\) test on the coefficients \(\beta_2, \dotsc, \beta_K\) is:

\[ \mathcal{H}_0: \beta_{2} = \beta_3 = \cdots = \beta_{K} = 0 \quad \text{vs} \quad \mathcal{H}_1: \exists k | \beta_{k} \neq 0 \]

and is the same as the global test of mean equality of all groups:

\[ \mathcal{H}_0: \mu_{1} = \mu_2 = \cdots = \mu_{K} \quad \text{vs} \quad \mathcal{H}_1: \exists k, k' | \mu_{k} \neq \mu_{k'} \]

Species

fit <- lm(body_mass_g ~ species, data = females)
summary(fit)

Call:
lm(formula = body_mass_g ~ species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -193.84   20.26  181.16  622.79 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3368.84      32.42 103.908  < 2e-16 ***
speciesChinstrap   158.37      57.52   2.754  0.00657 ** 
speciesGentoo     1310.91      48.72  26.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared:  0.8292,    Adjusted R-squared:  0.8271 
F-statistic: 393.2 on 2 and 162 DF,  p-value: < 2.2e-16

One-way ANOVA - Table

\[ F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (K - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - K) } = \frac{ ESS / (K - 1) }{ RSS / (n - K) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{K - 1}_{n - K}. \]

“Anova table”:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor | \(K - 1\) | \(ESS\) | \(ESS / (K - 1)\) | \(\frac{ESS / (K - 1)}{RSS / (n - K)}\) | \(p\) |
| residuals | \(n - K\) | \(RSS\) | \(RSS / (n - K)\) | | |
| total | \(n - 1\) | \(TSS\) | | | |
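The table can be reconstructed “by hand” from the fitted model (a minimal sketch; ESS, RSS, F_stat are illustrative names):

fit <- lm(body_mass_g ~ species, data = females)
n <- nrow(females); K <- nlevels(females$species)
ESS <- sum((fitted(fit) - mean(females$body_mass_g))^2)  ## explained sum of squares
RSS <- sum(residuals(fit)^2)                             ## residual sum of squares
F_stat <- (ESS / (K - 1)) / (RSS / (n - K))
c(F = F_stat, p.value = pf(F_stat, K - 1, n - K, lower.tail = FALSE))
## should match the `species` line of anova(fit) below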

Species

fit <- lm(body_mass_g ~ species, data = females)
anova(fit)
Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value    Pr(>F)    
species     2 60350016 30175008  393.25 < 2.2e-16 ***
Residuals 162 12430757    76733                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA

Species

We keep the penguins of all species and both sexes, and remove missing data.

penguins_nona <- na.omit(penguins)
kable(penguins_nona[c(1, 2, 151, 152, 271, 272), ])
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male | 2007 |
| Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female | 2007 |
| Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
| Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |

Species

Question: what is the effect of both species and sex on body mass?

par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(penguins_nona$body_mass_g,                             ## scatter plot
     pch = as.numeric(penguins_nona$species),               ## point type = species
     col = penguins_nona$sex)                               ## color = sex
boxplot(body_mass_g ~ sex + species, data = penguins_nona,  ## box plot
        col = c("grey", "red")) 

Model

Observation \(i\) of group \(k\) in factor 1 and group \(l\) in factor 2 is written as: \[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \]

  • \(\mu\) the “grand mean” (intercept)
  • \(\alpha_k\) the specific effect of group \(k\) in factor 1 (\(1 \leq k \leq K\)).
  • \(\beta_l\) the specific effect of group \(l\) in factor 2 (\(1 \leq l \leq L\)).
  • \(\gamma_{kl}\) the interaction effect of groups \(k\) and \(l\).

The full model has regression matrix: \[ {\small \mathbf{X} = (\mathbb{1}, \mathbb{1}_{S_1}, \dotsc, \mathbb{1}_{S_K}, \mathbb{1}_{T_1}, \dotsc, \mathbb{1}_{T_L}, \mathbb{1}_{S_1, T_1}, \dotsc, \mathbb{1}_{S_K, T_L} ) } \]

\(\to\) \(1 + K + L + KL\) predictors \(\to\) not identifiable!

Constraint: Group 1 of each factor as reference

\[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \]

Assume, for all \(1\leq k \leq K\) and \(1\leq l \leq L\): \[ \alpha_1 = 0, \qquad \beta_1 = 0, \qquad \gamma_{1,l} = 0, \qquad \gamma_{k,1} = 0. \]

  • \(y_{i,1,1}\) has mean \(\mu\) (reference)
  • \(y_{i,1,l}\) has mean \(\mu + \beta_l\)
  • \(y_{i,k,1}\) has mean \(\mu + \alpha_k\)
  • \(y_{i,k,l}\) has mean \(\mu + \alpha_k + \beta_l + \gamma_{k,l}\) (\(k,l>1\))
  • \(1 + K + L\) constraints \(\to\) \(KL\) free parameters
  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}_{S_2}, \dotsc, \mathbb{1}_{S_K}, \mathbb{1}_{T_2}, \dotsc, \mathbb{1}_{T_L}, \mathbb{1}_{S_2, T_2}, \dotsc, \mathbb{1}_{S_K, T_L})\) is of full rank.

\(\to\) This is the constraint used by default in R.

Species and Sex

fit <- lm(body_mass_g ~  species * sex, data = penguins_nona) 
summary(fit)

Call:
lm(formula = body_mass_g ~ species * sex, data = penguins_nona)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -213.97   11.03  206.51  861.03 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3368.84      36.21  93.030  < 2e-16 ***
speciesChinstrap           158.37      64.24   2.465  0.01420 *  
speciesGentoo             1310.91      54.42  24.088  < 2e-16 ***
sexmale                    674.66      51.21  13.174  < 2e-16 ***
speciesChinstrap:sexmale  -262.89      90.85  -2.894  0.00406 ** 
speciesGentoo:sexmale      130.44      76.44   1.706  0.08886 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309.4 on 327 degrees of freedom
Multiple R-squared:  0.8546,    Adjusted R-squared:  0.8524 
F-statistic: 384.3 on 5 and 327 DF,  p-value: < 2.2e-16

Other constraints

\[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \]

Interactions only: for all \(1\leq k \leq K\) and \(1\leq l \leq L\): \[ \mu = 0,\qquad \alpha_k = 0, \qquad \beta_l = 0. \]

Null sum: \[ \sum_{k=1}^K\alpha_k = 0, \qquad \sum_{l=1}^L\beta_l = 0, \qquad \sum_{k=1}^K\gamma_{kl} = 0, \qquad \sum_{l=1}^L\gamma_{kl} = 0. \]

\(\to\) changes the estimates of the coefficients, but not the estimate of the variance, nor the fitted values.
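For instance, with sum-to-zero constraints on both factors (a minimal sketch; fit_ref and fit_sum are illustrative names):

fit_ref <- lm(body_mass_g ~ species * sex, data = penguins_nona)
fit_sum <- lm(body_mass_g ~ species * sex, data = penguins_nona,
              contrasts = list(species = contr.sum, sex = contr.sum))
all.equal(fitted(fit_ref), fitted(fit_sum))  ## same fitted values
all.equal(sigma(fit_ref), sigma(fit_sum))    ## same estimate of sigma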

Nested F Test - Reminder

Recall that (from CM5), in general, to test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

We can use the statistic: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]

Writing \(\mathbf{X}_{p_0} = (\mathbf{X}_1, \dotsc, \mathbf{X}_{p_0})\), this is the same as testing: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{p_0}}\mathbf{y} = \mathbf{P}^{\mathbf{X}}\mathbf{y} &\text{vs}\\ \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{p_0}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}}\mathbf{y} \end{aligned} \]
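In R, such a nested \(F\) test can be carried out by fitting both models and comparing them with anova() (a minimal sketch; fit0 and fit1 are illustrative names):

fit0 <- lm(body_mass_g ~ species, data = penguins_nona)        ## reduced model
fit1 <- lm(body_mass_g ~ species * sex, data = penguins_nona)  ## full model
anova(fit0, fit1)  ## F test of H0: the extra terms bring no information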

Nested F Test - General

Take any two nested subsets of indexes \(\eta_1 \subset \eta_2\), and write: \[ \begin{aligned} \mathbf{X}_{\eta_k} &= (\mathbf{X}_i)_{i\in\eta_k} & \mathbf{P}^{\mathbf{X}_{\eta_k}}\mathbf{y} &= \hat{\mathbf{y}}_{\eta_k} & \mathbf{P}^{\mathbf{X}}\mathbf{y} &= \hat{\mathbf{y}} \end{aligned} \]

To test: \[ \begin{aligned} \mathcal{H}_0&: \text{"Larger model 2 does not bring any information compared to model 1"}\\ \mathcal{H}_1&: \text{"Larger model 2 does bring some information compared to model 1"} \end{aligned} \]

i.e. \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\eta_1}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\eta_2}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\eta_1}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\eta_2}}\mathbf{y} \end{aligned} \]

We can use: \[ F = \frac{ \|\hat{\mathbf{y}}_{\eta_2} - \hat{\mathbf{y}}_{\eta_1}\|^2 / (|\eta_2| - |\eta_1|) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{ 1 }{ \hat{\sigma}^2 } (RSS_{\eta_1} - RSS_{\eta_2}) / (|\eta_2| - |\eta_1|) \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{|\eta_2| - |\eta_1|}_{n - p}. \]

Anova model

Model: \[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \] with, for all \(1\leq k \leq K\) and \(1\leq l \leq L\): \[ \alpha_1 = 0, \qquad \beta_1 = 0, \qquad \gamma_{1,l} = 0, \qquad \gamma_{k,1} = 0. \] (\(\to KL\) free parameters in total.)

Tests:

  • Is factor 1 useful?
  • Is factor 2 useful?
  • Is the interaction between factors 1 and 2 useful?

Nested F Test - Anova - Factor 1

To test: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} \end{aligned} \] i.e. \(\mathcal{H}_0\): “factor 1 does not bring any information compared to intercept alone”.

We can use: \[ \begin{aligned} F &= \frac{n - KL}{K - 1}\frac{RSS_{\mu} - RSS_{\mu, \alpha}}{RSS}\\ &= \frac{R(\alpha | \mu) / (K - 1)}{\hat{\sigma}^2}\\ &\underset{\mathcal{H}_0}{\sim} \mathcal{F}^{K - 1}_{n - KL}. \end{aligned} \]

Nested F Test - Anova - Factor 2

To test: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha, \beta}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} \end{aligned} \] i.e. \(\mathcal{H}_0\): “factor 2 does not bring any information compared to factor 1”.

We can use: \[ \begin{aligned} F &= \frac{n - KL}{L - 1}\frac{RSS_{\mu, \alpha} - RSS_{\mu, \alpha, \beta}}{RSS}\\ &= \frac{R(\beta | \mu, \alpha) / (L - 1)}{\hat{\sigma}^2}\\ &\underset{\mathcal{H}_0}{\sim} \mathcal{F}^{L - 1}_{n - KL}. \end{aligned} \]

Nested F Test - Anova - Interaction

To test: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha, \beta,\gamma}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta,\gamma}}\mathbf{y} \end{aligned} \] i.e. \(\mathcal{H}_0\): “Interactions do not bring any information compared to factors 1 and 2”.

We can use: \[ \begin{aligned} F &= \frac{n - KL}{(L - 1)(K - 1)}\frac{RSS_{\mu, \alpha, \beta} - RSS_{\mu, \alpha, \beta, \gamma}}{RSS}\\ &= \frac{R(\gamma | \mu, \alpha, \beta) / ((K - 1)(L - 1))}{\hat{\sigma}^2}\\ &\underset{\mathcal{H}_0}{\sim} \mathcal{F}^{(K - 1)(L - 1)}_{n - KL}. \end{aligned} \]

Two-way ANOVA - Type I Table

“Type I” Anova table:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor 1 | \(K - 1\) | \(R(\alpha \mid \mu)\) | \(R(\alpha \mid \mu) / df\) | \(\hat{\sigma}^{-2} R(\alpha \mid \mu) / df\) | \(p\) |
| factor 2 | \(L - 1\) | \(R(\beta \mid \mu, \alpha)\) | \(R(\beta \mid \mu, \alpha) / df\) | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\) | \(p\) |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\) |
| residuals | \(n - KL\) | \(RSS\) | \(\hat{\sigma}^2 = RSS / df\) | | |

Species and Sex - Type I

fit <- lm(body_mass_g ~  species * sex, data = penguins_nona) 
anova(fit)
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
species       2 145190219 72595110 758.358 < 2.2e-16 ***
sex           1  37090262 37090262 387.460 < 2.2e-16 ***
species:sex   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sex and Species - Type I

fit <- lm(body_mass_g ~  sex * species, data = penguins_nona) 
anova(fit)
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
sex           1  38878897 38878897 406.145 < 2.2e-16 ***
species       2 143401584 71700792 749.016 < 2.2e-16 ***
sex:species   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA - Type I Table

“Type I” Anova table:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor 1 | \(K - 1\) | \(R(\alpha \mid \mu)\) | \(R(\alpha \mid \mu) / df\) | \(\hat{\sigma}^{-2} R(\alpha \mid \mu) / df\) | \(p\) |
| factor 2 | \(L - 1\) | \(R(\beta \mid \mu, \alpha)\) | \(R(\beta \mid \mu, \alpha) / df\) | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\) | \(p\) |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\) |
| residuals | \(n - KL\) | \(RSS\) | \(\hat{\sigma}^2 = RSS / df\) | | |

Attention: The order of the factors matters!

Two-way ANOVA - Type II Table

“Type II” Anova table:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor 1 | \(K - 1\) | \(R(\alpha \mid \mu, \beta)\) | \(R(\alpha \mid \mu, \beta) / df\) | \(\hat{\sigma}^{-2} R(\alpha \mid \mu, \beta) / df\) | \(p\) |
| factor 2 | \(L - 1\) | \(R(\beta \mid \mu, \alpha)\) | \(R(\beta \mid \mu, \alpha) / df\) | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\) | \(p\) |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\) |
| residuals | \(n - KL\) | \(RSS\) | \(\hat{\sigma}^2 = RSS / df\) | | |

Symmetric in factors 1 and 2.

Sex and Species - Type II

fit <- lm(body_mass_g ~  sex * species, data = penguins_nona) 
car::Anova(fit)
Anova Table (Type II tests)

Response: body_mass_g
               Sum Sq  Df F value    Pr(>F)    
sex          37090262   1 387.460 < 2.2e-16 ***
species     143401584   2 749.016 < 2.2e-16 ***
sex:species   1676557   2   8.757 0.0001973 ***
Residuals    31302628 327                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bill Length - Type II

fit <- lm(bill_length_mm ~  sex * species, data = penguins_nona) 
car::Anova(fit)
Anova Table (Type II tests)

Response: bill_length_mm
            Sum Sq  Df  F value Pr(>F)    
sex         1135.7   1 211.8066 <2e-16 ***
species     6975.6   2 650.4786 <2e-16 ***
sex:species   24.5   2   2.2841 0.1035    
Residuals   1753.3 327                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bill Length - Type II - No interaction

fit <- lm(bill_length_mm ~  sex + species, data = penguins_nona) 
car::Anova(fit)
Anova Table (Type II tests)

Response: bill_length_mm
          Sum Sq  Df F value    Pr(>F)    
sex       1135.7   1  210.17 < 2.2e-16 ***
species   6975.6   2  645.44 < 2.2e-16 ***
Residuals 1777.8 329                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANCOVA

Penguins

Question: what is the relationship between bill length and depth, controlling for the species?

One line fits all

fit_all <- lm(bill_length_mm ~ bill_depth_mm, data = females)
summary(fit_all)

Call:
lm(formula = bill_length_mm ~ bill_depth_mm, data = females)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0745  -3.0689  -0.7609   2.6776  17.5034 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    61.2214     3.1966  19.152  < 2e-16 ***
bill_depth_mm  -1.1643     0.1935  -6.018 1.13e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.449 on 163 degrees of freedom
Multiple R-squared:  0.1818,    Adjusted R-squared:  0.1768 
F-statistic: 36.22 on 1 and 163 DF,  p-value: 1.128e-08

One line fits all

\[ y_{i} = \mu + \beta x_{i} + \epsilon_i \]

Take species into account

fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
summary(fit_species)

Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9886 -1.4503 -0.0603  1.4476 11.2797 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       25.0447     3.9377   6.360 1.99e-09 ***
bill_depth_mm      0.6930     0.2230   3.108  0.00222 ** 
speciesChinstrap   9.3393     0.4648  20.092  < 2e-16 ***
speciesGentoo     10.6515     0.8510  12.516  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.238 on 161 degrees of freedom
Multiple R-squared:  0.7954,    Adjusted R-squared:  0.7916 
F-statistic: 208.7 on 3 and 161 DF,  p-value: < 2.2e-16

Take species into account

\[ y_{i,k} = \mu + \alpha_k + \beta x_{i} + \epsilon_i \]

Take species into account

fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
car::Anova(fit_species)
Anova Table (Type II tests)

Response: bill_length_mm
               Sum Sq  Df  F value    Pr(>F)    
bill_depth_mm   48.41   1   9.6624  0.002224 ** 
species       2419.64   2 241.4533 < 2.2e-16 ***
Residuals      806.70 161                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions

fit_int <- lm(bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species,
              data = females)
fit_int

Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species, 
    data = females)

Coefficients:
                   (Intercept)                   bill_depth_mm  
                       31.1671                          0.3456  
              speciesChinstrap                   speciesGentoo  
                       -2.5348                         -8.8729  
bill_depth_mm:speciesChinstrap     bill_depth_mm:speciesGentoo  
                        0.6745                          1.2887  

Interactions

\[ y_{i,k} = \mu + \alpha_k + \beta_k x_{i} + \epsilon_i \]

Interactions

fit_int <- lm(bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species,
              data = females)
car::Anova(fit_int)
Anova Table (Type II tests)

Response: bill_length_mm
                       Sum Sq  Df  F value    Pr(>F)    
bill_depth_mm           48.41   1   9.8428  0.002032 ** 
species               2419.64   2 245.9610 < 2.2e-16 ***
bill_depth_mm:species   24.62   2   2.5029  0.085070 .  
Residuals              782.08 159                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ancova - One Regressor - Model

For any observation \(1 \leq i \leq n\) in group \(1 \leq k \leq K\), we write: \[ y_{i,k} = \mu + \alpha_k + (\beta + \gamma_k) \times x_{i} + \epsilon_i \qquad \epsilon_i\sim\mathcal{N}(0, \sigma^2)~iid \]

Identifiability: \(\alpha_1 = \gamma_1 = 0\).

\[ \text{With $K=2$:}\qquad \mathbf{X} = \begin{pmatrix} 1 & 0 & x_1 & 0\\ \vdots & \vdots & \vdots & \vdots\\ 1 & 0 & x_{n_1} & 0\\ 1 & 1 & x_{n_1 + 1} & x_{n_1 + 1} \\ \vdots & \vdots & \vdots & \vdots\\ 1 & 1 & x_{n_1 + n_2} & x_{n_1 + n_2}\\ \end{pmatrix} \]
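The interaction fit of the previous slides uses exactly this design (with the reference constraint); a minimal sketch to inspect it (the formula bill_depth_mm * species is equivalent to the one written out with + and : above):

fit_int <- lm(bill_length_mm ~ bill_depth_mm * species, data = females)
head(model.matrix(fit_int))  ## intercept, slope, group indicators, and group-specific slopes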

Ancova

F tests of effects and interactions can be done as before.

More than one predictor? \(\to\) not a problem.

More than one factor? \(\to\) not a problem.

Attention: as models become more complex, they become more difficult to interpret and test.

Attention: diagnostics on the residuals are still needed.

Bill length and depth

fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
par(mfrow = c(2,2))
plot(fit_species)