Multiple Regression - Qualitative Variables

Paul Bastide

2025-04-16

Student t tests

Penguins

Penguins from the Palmer Archipelago in Antarctica; see the palmerpenguins package.

library(palmerpenguins)
kable(head(penguins))
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |

Subset

We keep only the penguins of species “Adelie”, and remove missing data.

adelie <- subset(penguins, species == "Adelie")
adelie <- na.omit(adelie)
kable(head(adelie))
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |

Group Comparison

Question: Are male Adelie penguins heavier than female Adelie penguins?

par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(adelie$body_mass_g, col = adelie$sex)      ## scatter plot
boxplot(body_mass_g ~ sex, data = adelie)       ## box plot

Simple \(t\) test

mass_females <- subset(adelie, sex == "female")$body_mass_g
mass_males <- subset(adelie, sex == "male")$body_mass_g
t.test(mass_males, mass_females, var.equal = TRUE)

    Two Sample t-test

data:  mass_males and mass_females
t = 13.126, df = 144, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 573.0666 776.2484
sample estimates:
mean of x mean of y 
 4043.493  3368.836 

Simple \(t\) test

Data:

  • \(n_f\) female Adelie penguins, \(n_m\) male Adelie penguins
  • \(y^f_{i}\): mass (g) of a female Adelie penguin, \(1 \leq i \leq n_f\)
  • \(y^m_{j}\): mass (g) of a male Adelie penguin, \(1 \leq j \leq n_m\)

Assumptions:

  • \(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
  • \(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid

Test:

\(\mathcal{H}_0\): \(\mu_f = \mu_m\) vs \(\mathcal{H}_1\): \(\mu_f \neq \mu_m\)

Simple \(t\) test

Test:

\(\mathcal{H}_0\): \(\mu_f = \mu_m\) vs \(\mathcal{H}_1\): \(\mu_f \neq \mu_m\)

Test Statistic:

\[ T = \frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2} \] with the “pooled” standard deviation: \[ s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}} \]
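The same statistic can be computed directly in R, as a sanity check (a minimal sketch, reusing the mass_females and mass_males vectors defined above; the names n_f, n_m, s_p, t_stat are only illustrative):

n_f <- length(mass_females); n_m <- length(mass_males)
s_p <- sqrt(((n_f - 1) * var(mass_females) + (n_m - 1) * var(mass_males)) / (n_f + n_m - 2))
t_stat <- (mean(mass_males) - mean(mass_females)) / (s_p * sqrt(1 / n_f + 1 / n_m))
c(t = t_stat, df = n_f + n_m - 2,
  p.value = 2 * pt(abs(t_stat), df = n_f + n_m - 2, lower.tail = FALSE))
## should match the output of t.test(mass_males, mass_females, var.equal = TRUE)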

Simple \(t\) test

t.test(mass_males, mass_females, var.equal = TRUE)

    Two Sample t-test

data:  mass_males and mass_females
t = 13.126, df = 144, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 573.0666 776.2484
sample estimates:
mean of x mean of y 
 4043.493  3368.836 

Reject the null that \(\mu_f = \mu_m\): males and females have different body masses.

Linear Regression

Model

Assumptions:

  • \(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
  • \(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid

in other words:

  • \(y^f_{i} = \mu_f + \epsilon_i^f\), with \(\epsilon^f_i \sim \mathcal{N}(0, \sigma^2)\) iid
  • \(y^m_{j} = \mu_m + \epsilon_j^m\), with \(\epsilon^m_j \sim \mathcal{N}(0, \sigma^2)\) iid

Can we write this as a linear model: “\(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)”?

Model

Write: \[ \mathbf{y} = \begin{pmatrix} \mathbf{y}^f \\ \mathbf{y}^m \\ \end{pmatrix} \quad \text{and} \quad \boldsymbol{\epsilon} = \begin{pmatrix} \boldsymbol{\epsilon}^f \\ \boldsymbol{\epsilon}^m \\ \end{pmatrix} \] \(\mathbf{y}\) and \(\boldsymbol{\epsilon}\) vectors of size \(n = n_f + n_m\).

Then, for \(1 \leq i \leq n\): \[ y_i = \mu_k + \epsilon_i \quad \text{with} \quad \begin{cases} \mu_k = \mu_f & \text{if } 1 \leq i \leq n_f\\ \mu_k = \mu_m & \text{if } n_f+1 \leq i \leq n_f + n_m \end{cases} \]

Regression matrix \(\mathbf{X}\)?

Linear Model

\[ \text{Write:} \qquad \mathbf{X} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots\\ 0 & 1 \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \qquad \text{and} \qquad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m \\ \end{pmatrix} \]

\[ \text{Then:} \qquad \mathbf{X}\boldsymbol{\beta} = \begin{pmatrix} \mu_f + 0\\ \vdots \\ \mu_f + 0\\ 0 + \mu_m\\ \vdots\\ 0 + \mu_m \\ \end{pmatrix} = \begin{pmatrix} \mu_f \\ \vdots \\ \mu_f \\ \mu_m \\ \vdots\\ \mu_m \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \]

Linear Model

Hence, the model can be written as: \[ \begin{pmatrix} y^f_1 \\ \vdots\\ y^f_{n_f} \\ y^m_{1} \\ \vdots\\ y^m_{n_m} \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots\\ 0 & 1 \\ \end{pmatrix} \begin{pmatrix} \mu_f \\ \mu_m \\ \end{pmatrix} + \begin{pmatrix} \epsilon^f_1 \\ \vdots\\ \epsilon^f_{n_f} \\ \epsilon^m_{1} \\ \vdots\\ \epsilon^m_{n_m} \\ \end{pmatrix} \]

i.e: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

\(\to\) classical linear model with Gaussian assumption!
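This “one column per group” parametrization can be fitted directly in R by removing the intercept from the formula (a minimal sketch; the object name fit_means is illustrative):

fit_means <- lm(body_mass_g ~ 0 + sex, data = adelie)  ## no intercept: one column per sex
coef(fit_means)                ## estimates of mu_f and mu_m (the two group means)
head(model.matrix(fit_means))  ## indicator columns 1(female), 1(male)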

Linear Model - With Intercept

\[ \text{Write:} \qquad \mathbf{X} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \qquad \text{and} \qquad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} \]

\[ \text{Then:} \qquad \mathbf{X}\boldsymbol{\beta} = \begin{pmatrix} \mu_f + 0\\ \vdots \\ \mu_f + 0\\ \mu_f + \mu_m - \mu_f\\ \vdots\\ \mu_f + \mu_m - \mu_f \\ \end{pmatrix} = \begin{pmatrix} \mu_f \\ \vdots \\ \mu_f \\ \mu_m \\ \vdots\\ \mu_m \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \]

Linear Model - With Intercept

The model can also be written as: \[ \begin{pmatrix} y^f_1 \\ \vdots\\ y^f_{n_f} \\ y^m_{1} \\ \vdots\\ y^m_{n_m} \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} + \begin{pmatrix} \epsilon^f_1 \\ \vdots\\ \epsilon^f_{n_f} \\ \epsilon^m_{1} \\ \vdots\\ \epsilon^m_{n_m} \\ \end{pmatrix} \]

i.e: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

\(\to\) parametrization used in R

Linear Model - \(t\) test

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

As \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix}, \]

the test \[ \mathcal{H}_0: \mu_f = \mu_m \quad \text{vs} \quad \mathcal{H}_1: \mu_f \neq \mu_m \]

becomes: \[ \mathcal{H}_0: \beta_2 = 0 \quad \text{vs} \quad \mathcal{H}_1: \beta_2 \neq 0 \]

\(\to\) \(t\)-test on the \(\beta_2\) coefficient!

Linear Model

fit <- lm(body_mass_g ~ sex, data = adelie)
model.matrix(fit)[1:10, ]
   (Intercept) sexmale
1            1       1
2            1       0
3            1       0
4            1       0
5            1       1
6            1       0
7            1       1
8            1       0
9            1       1
10           1       1

Linear Model

summary(fit)

Call:
lm(formula = body_mass_g ~ sex, data = adelie)

Residuals:
    Min      1Q  Median      3Q     Max 
-718.49 -218.84  -18.84  225.00  731.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3368.84      36.34   92.69   <2e-16 ***
sexmale       674.66      51.40   13.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 310.5 on 144 degrees of freedom
Multiple R-squared:  0.5447,    Adjusted R-squared:  0.5416 
F-statistic: 172.3 on 1 and 144 DF,  p-value: < 2.2e-16

Simple \(t\) test

t.test(mass_males, mass_females, var.equal = TRUE)

    Two Sample t-test

data:  mass_males and mass_females
t = 13.126, df = 144, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 573.0666 776.2484
sample estimates:
mean of x mean of y 
 4043.493  3368.836 

Test statistics

Simple \(t\) test: \[ T = \frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2} \quad \text{with} \quad s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}} \]

Linear Regression: \[ \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \sim \mathcal{T}_{n-p} \quad \text{with} \quad p = 2 ~,~ n = n_f + n_m ~,~ \beta_2 = 0 \text{ under } \mathcal{H}_0 \]

i.e. \[ \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2} \]

Test statistics

Let’s show that: \[ \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}} \quad \text{with} \quad \mathbf{X} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} \begin{matrix} \\ n_f\\ \\ \\ n_m\\ \\ \end{matrix} \qquad \text{and} \qquad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} \]

is equal, up to the sign convention (the order in which the two groups are taken), to \[ \frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \quad \text{with} \quad s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}} \]

Test statistics - Proof - 1/5

\[ \mathbf{X}^T\mathbf{X} = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \\ \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \\ \end{pmatrix} = \begin{pmatrix} n & n_m \\ n_m & n_m \end{pmatrix} \]

Hence: \[ (\mathbf{X}^T\mathbf{X})^{-1} = \begin{pmatrix} n & n_m \\ n_m & n_m \end{pmatrix}^{-1} = \frac{1}{nn_m - n_m^2} \begin{pmatrix} n_m & -n_m \\ -n_m & n \end{pmatrix} \]

\[ [(\mathbf{X}^T\mathbf{X})^{-1}]_{22} = \frac{n}{nn_m - n_m^2} = \frac{n_f + n_m}{(n - n_m)n_m} = \frac{n_f + n_m}{n_f n_m} = \frac{1}{n_f} + \frac{1}{n_m} \]
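This identity is easy to check numerically on the Adelie data (a minimal sketch; X and XtX_inv are illustrative names):

X <- model.matrix(~ sex, data = adelie)      ## columns: intercept, 1(male)
XtX_inv <- solve(t(X) %*% X)
n_f <- sum(adelie$sex == "female"); n_m <- sum(adelie$sex == "male")
all.equal(XtX_inv[2, 2], 1 / n_f + 1 / n_m)  ## should be TRUE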

Test statistics - Proof - 2/5

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \]

But: \[ \mathbf{X}^{T}\mathbf{y} = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \\ \end{pmatrix} \mathbf{y} = \begin{pmatrix} \sum_{i = 1}^n y_i\\ \sum_{j = 1}^{n_m} y^m_j \end{pmatrix} \]

Hence: \[ \begin{aligned} \hat{\boldsymbol{\beta}} & = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} = \frac{1}{n_fn_m} \begin{pmatrix} n_m & -n_m \\ -n_m & n \end{pmatrix} \begin{pmatrix} \sum_{i = 1}^n y_i\\ \sum_{j = 1}^{n_m} y^m_j \end{pmatrix} \\ &= \frac{1}{n_fn_m} \begin{pmatrix} n_m\sum_{i = 1}^n y_i - n_m\sum_{j = 1}^{n_m} y^m_j\\ - n_m\sum_{i = 1}^{n} y_i + n_m\sum_{j = 1}^{n_m} y^m_j + n_f\sum_{j = 1}^{n_m} y^m_j\\ \end{pmatrix} \\ &= \frac{1}{n_fn_m} \begin{pmatrix} n_m \sum_{i = 1}^{n_f} y^f_i\\ - n_m\sum_{i = 1}^{n_f} y^f_i + n_f\sum_{j = 1}^{n_m} y^m_j\\ \end{pmatrix} \end{aligned} \]

Test statistics - Proof - 3/5

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \]

And: \[ \hat{\boldsymbol{\beta}} = \frac{1}{n_fn_m} \begin{pmatrix} n_m \sum_{i = 1}^{n_f} y^f_i\\ - n_m\sum_{i = 1}^{n_f} y^f_i + n_f\sum_{j = 1}^{n_m} y^m_j\\ \end{pmatrix} = \begin{pmatrix} \bar{\mathbf{y}}^f\\ -\bar{\mathbf{y}}^f + \bar{\mathbf{y}}^m\\ \end{pmatrix} \]

Note: \[ \text{since } \quad \boldsymbol{\beta} = \begin{pmatrix} \mu_f \\ \mu_m - \mu_f \\ \end{pmatrix} \quad \text{this result is coherent.} \]

Note: \[ \text{As} \quad \mathbf{x}^{i} = \begin{cases} (1~0) & \text{ if $i$ female} \\ (1~1) & \text{ if $i$ male} \\ \end{cases} \quad \text{we get:} \quad \hat{y}_i = \mathbf{x}^{i}\hat{\boldsymbol{\beta}} = \begin{cases} \bar{\mathbf{y}}^f & \text{ if $i$ female} \\ \bar{\mathbf{y}}^m & \text{ if $i$ male} \\ \end{cases} \]

Test statistics - Proof - 4/5

\[ \hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i = 1}^n (y_i - \hat{y}_i)^2 \]

Hence: \[ \hat{\sigma}^2 = \frac{1}{n_f + n_m - 2} \left[ \sum_{i = 1}^{n_f} (y^f_i - \bar{\mathbf{y}}^f)^2 + \sum_{i = 1}^{n_m} (y^m_i - \bar{\mathbf{y}}^m)^2 \right] \]

Recall: \[ s^2_p = \frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2} \]

hence: \[ \hat{\sigma}^2 = s^2_p. \]
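This identity can also be checked numerically (a minimal sketch, reusing the vectors defined earlier; sigma() returns the residual standard error of a fitted lm):

fit <- lm(body_mass_g ~ sex, data = adelie)
n_f <- length(mass_females); n_m <- length(mass_males)
s_p <- sqrt(((n_f - 1) * var(mass_females) + (n_m - 1) * var(mass_males)) / (n_f + n_m - 2))
all.equal(sigma(fit), s_p)  ## should be TRUE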

Test statistics - Proof - 5/5

\[ T = \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}} \]

but: \[ \hat{\beta}_2 = \bar{\mathbf{y}}^m -\bar{\mathbf{y}}^f \\ \hat{\sigma}^2 = s^2_p \\ [(\mathbf{X}^T\mathbf{X})^{-1}]_{22} = \frac{1}{n_f} + \frac{1}{n_m} \]

hence: \[ T = \frac{\bar{\mathbf{y}}^m - \bar{\mathbf{y}}^f}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}} \]

which is the simple two-sample statistic, up to the sign convention (order of the two groups). The simple \(t\)-test is therefore the same as the \(t\)-test on the coefficient of the linear regression.

Summary

Group comparison:

  • \(n_f\) female Adelie penguins, \(n_m\) male Adelie penguins
  • \(y^f_{i}\): mass (g) of a female Adelie penguin, \(1 \leq i \leq n_f\), \(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
  • \(y^m_{j}\): mass (g) of a male Adelie penguin, \(1 \leq j \leq n_m\), \(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid
  • \(\mathcal{H}_0\): \(\mu_f = \mu_m\) vs \(\mathcal{H}_1\): \(\mu_f \neq \mu_m\)

is the same as a Linear model with:

  • \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)
  • \(\mathbf{X}_1 = \mathbb{1}_n\) and \(\mathbf{X}_2 = \mathbb{1}(male)\)
  • \(\beta_1 = \mu_f\) and \(\beta_2 = \mu_m - \mu_f\).
  • \(\mathcal{H}_0: \beta_2 = 0\) vs \(\mathcal{H}_1: \beta_2 \neq 0\)

\(\to\) what happens when there are more than two groups?

One-way ANOVA

Species

We keep only the female penguins of all species, and remove missing data.

females <- subset(penguins, sex == "female")
females <- na.omit(females)
kable(females[c(1, 2, 101, 102, 151, 152), ])
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Gentoo | Biscoe | 43.8 | 13.9 | 208 | 4300 | female | 2008 |
| Gentoo | Biscoe | 43.2 | 14.5 | 208 | 4450 | female | 2008 |
| Chinstrap | Dream | 46.9 | 16.6 | 192 | 2700 | female | 2008 |
| Chinstrap | Dream | 46.2 | 17.5 | 187 | 3650 | female | 2008 |

Species

Question: do the females of the different species have different body masses?

par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(females$body_mass_g, col = females$species)     ## scatter plot
boxplot(body_mass_g ~ species, data = females)       ## box plot

\(t\) tests

Pairwise \(t\) tests:

pairwise.t.test(females$body_mass_g, females$species,
                p.adjust.method = "none")

    Pairwise comparisons using t tests with pooled SD 

data:  females$body_mass_g and females$species 

          Adelie Chinstrap
Chinstrap 0.0066 -        
Gentoo    <2e-16 <2e-16   

P value adjustment method: none 

\(\to\) \({K \choose 2}\) tests: multiple testing problems.
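One standard remedy is to adjust the p-values for multiplicity; a minimal sketch (pairwise.t.test uses Holm's correction by default, Bonferroni is shown here for illustration):

pairwise.t.test(females$body_mass_g, females$species,
                p.adjust.method = "bonferroni")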

\(t\) tests - Model

Model:

  • \(n_s\) penguins, for species \(s \in S = \{adelie, chinstrap, gentoo\}\)
  • \(y^s_{i}\): mass (g) of a female penguin of species \(s\), \(1 \leq i \leq n_s\)
  • \(y^s_{i} \sim \mathcal{N}(\mu_s, \sigma^2)\) iid for any species \(s\).

Pairwise Tests:

\[ \mathcal{H}^{s_1,s_2}_0:~ \mu_{s_1} = \mu_{s_2} \quad \text{vs} \quad \mathcal{H}^{s_1,s_2}_1:~\mu_{s_1} \neq \mu_{s_2} \text{ for any pair } (s_1,s_2) \text{ of species}. \]

Global Test: can we test directly

\[ \mathcal{H}_0:~\mu_a = \mu_c = \mu_g \quad \text{vs} \quad \mathcal{H}_1:~\exists s_1, s_2 \in S ~|~ \mu_{s_1} \neq \mu_{s_2} ? \]

One-way ANOVA

  • \(S\) a set of \(|S| = K\) groups;
  • \(\mathbf{y}\) the vector of all data, length \(n = \sum_{k=1}^K n_k\);
  • \(\mathbf{X}\) the matrix of size \(n \times K\), such that:
    • \(\mathbf{X}_1 = \mathbb{1}_{n}\) is the intercept (column vector)
    • \(\mathbf{X}_k = \mathbb{1}(S_k)\) the indicator of group \(k\), \(2 \leq k \leq K\).
      • \(x_{ik} = 1\) if individual \(i\) is in group \(k\), and \(0\) otherwise.
  • \(\boldsymbol{\beta}\) the vector of mean differences, of length \(K\):
    • \(\beta_1 = \mu_1\)
    • \(\beta_k = \mu_k - \mu_1\) for \(2 \leq k \leq K\)
  • the first group is the reference

One-way ANOVA

\[ \begin{pmatrix} y^1_1 \\ \vdots\\ y^1_{n_1} \\ y^2_{1} \\ \vdots\\ y^2_{n_2} \\ \vdots\\ y^K_{1} \\ \vdots\\ y^K_{n_K} \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 0 & 0 & \cdots & 0\\ 1 & 1 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 1 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 0 & 0 & \cdots & 1\\ \vdots & \vdots & \vdots & \cdots & \vdots\\ 1 & 0 & 0 & \cdots & 1\\ \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 - \mu_1 \\ \vdots\\ \mu_K - \mu_1 \\ \end{pmatrix} + \begin{pmatrix} \epsilon^1_1 \\ \vdots\\ \epsilon^1_{n_1} \\ \epsilon^2_{1} \\ \vdots\\ \epsilon^2_{n_2} \\ \vdots\\ \epsilon^K_{1} \\ \vdots\\ \epsilon^K_{n_K} \\ \end{pmatrix} \]

One-way ANOVA

Model:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) \]

Note: Assuming that the groups are disjoint and non-empty, \(\text{rank}(\mathbf{X}) = K\).

This is equivalent to the following model: \[ \begin{aligned} y_i &= \beta_1 + \epsilon_i = \mu_1 + \epsilon_i & \text{ if } &i \text{ is in group } 1\\ y_i &= \beta_1 + \beta_k + \epsilon_i = \mu_k + \epsilon_i & \text{ if } &i \text{ is in group } k,\ k>1 \\ \end{aligned} \]

i.e. each group has a mean \(\mu_k\), and the errors are Gaussian iid.

Species

fit <- lm(body_mass_g ~ species, data = females)
model.matrix(fit)[c(1, 2, 101, 102, 151, 152), ]
    (Intercept) speciesChinstrap speciesGentoo
1             1                0             0
2             1                0             0
101           1                0             1
102           1                0             1
151           1                1             0
152           1                1             0

Species

summary(fit)

Call:
lm(formula = body_mass_g ~ species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -193.84   20.26  181.16  622.79 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3368.84      32.42 103.908  < 2e-16 ***
speciesChinstrap   158.37      57.52   2.754  0.00657 ** 
speciesGentoo     1310.91      48.72  26.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared:  0.8292,    Adjusted R-squared:  0.8271 
F-statistic: 393.2 on 2 and 162 DF,  p-value: < 2.2e-16

One-way ANOVA - Coefficients

It is easy to show that:

\[ \hat{\beta}_1 = \bar{\mathbf{y}}^{s_1}\\ \hat{\beta}_k = \bar{\mathbf{y}}^{s_k} - \bar{\mathbf{y}}^{s_1} \text{ for } 2 \leq k \leq K\\ \] Hence:

  • \(\hat{\beta}_1\) is the estimator of the mean of the first group;
  • \(\hat{\beta}_k\) is the estimator of the difference between the mean of group \(k\) and that of the first group (\(k > 1\)).
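A quick check on the penguin data (a minimal sketch):

fit <- lm(body_mass_g ~ species, data = females)
c(coef(fit)[1], coef(fit)[1] + coef(fit)[-1])       ## group means reconstructed from the coefficients
tapply(females$body_mass_g, females$species, mean)  ## empirical group means, for comparison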

One-way ANOVA - Coefficients

We have: \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_{s_1}\\ \mu_{s_2} - \mu_{s_1}\\ \vdots\\ \mu_{s_K} - \mu_{s_1}\\ \end{pmatrix} \quad \text{and} \quad \hat{\boldsymbol{\beta}} = \begin{pmatrix} \bar{\mathbf{y}}^{s_1}\\ \bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\ \vdots\\ \bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\ \end{pmatrix} \]

  • The \(t\) test on the coefficient \(\beta_k\):

\[ \mathcal{H}^{k}_0: \beta_{k} = 0 \quad \text{vs} \quad \mathcal{H}^{k}_1: \beta_{k} \neq 0 \]

is the same as the test of the difference between groups \(1\) & \(k\):

\[ \mathcal{H}^{1,k}_0: \mu_{1} = \mu_{k} \quad \text{vs} \quad \mathcal{H}^{1,k}_1: \mu_{1} \neq \mu_{k} \]

One-way ANOVA - Global Test

We have: \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_{s_1}\\ \mu_{s_2} - \mu_{s_1}\\ \vdots\\ \mu_{s_K} - \mu_{s_1}\\ \end{pmatrix} \quad \text{and} \quad \hat{\boldsymbol{\beta}} = \begin{pmatrix} \bar{\mathbf{y}}^{s_1}\\ \bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\ \vdots\\ \bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\ \end{pmatrix} \]

  • The global \(F\) test on the coefficients \(\beta_2, \dotsc, \beta_K\):

\[ \mathcal{H}_0: \beta_{2} = \beta_3 = \cdots = \beta_{K} = 0 \quad \text{vs} \quad \mathcal{H}_1: \exists k | \beta_{k} \neq 0 \]

is the same as the global test of mean equality of all groups:

\[ \mathcal{H}_0: \mu_{1} = \mu_2 = \cdots = \mu_{K} \quad \text{vs} \quad \mathcal{H}_1: \exists k, k' | \mu_{k} \neq \mu_{k'} \]

Species

fit <- lm(body_mass_g ~ species, data = females)
summary(fit)

Call:
lm(formula = body_mass_g ~ species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -193.84   20.26  181.16  622.79 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3368.84      32.42 103.908  < 2e-16 ***
speciesChinstrap   158.37      57.52   2.754  0.00657 ** 
speciesGentoo     1310.91      48.72  26.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared:  0.8292,    Adjusted R-squared:  0.8271 
F-statistic: 393.2 on 2 and 162 DF,  p-value: < 2.2e-16

One-way ANOVA - Parametrization

General Model

In general, we might write for observation \(i\) of group \(k\): \[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

  • \(\mu\) would be the “grand mean” (intercept)
  • \(\beta_k\) the specific effect of group \(k\) (\(1 \leq k \leq K\)).

General Model Over-Parametrization

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

The model has \(K+1\) parameters.

The associated regression matrix would be: \[ \mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1), \dotsc, \mathbb{1}(S_K)) \]

\(\mathbf{X}\) is not of full rank: \(\mathbb{1} = \sum_{k=1}^K \mathbb{1}(S_k)\).

Example with two groups

\[ \mathbf{X} = \begin{pmatrix} 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots\\ 1 & 0 & 1 \\ \end{pmatrix} \]

We need some constraint on the parameters.

Constraint: Group 1 as reference

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

In the previous slides, we assumed \[\beta_1 = 0\]

  • group 1 has the “reference” mean \(\mu\)
  • group \(k>1\) has mean \(\mu + \beta_k\)
  • \(\beta_k = \mu_k - \mu_1\) is the difference between the means of groups \(k\) and 1.
  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_2), \dotsc, \mathbb{1}(S_K))\) is full rank.

\(\to\) This is the constraint used by default in R.
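This corresponds to the “treatment” contrasts, which R uses by default for unordered factors; a minimal sketch:

contrasts(females$species)  ## contr.treatment: first level (Adelie) is the reference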

Constraint: No intercept

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

We can assume \[\mu = 0\]

  • \(\beta_k = \mu_k\) models the mean of group \(k\) directly.
  • \(\mathbf{X} = (\mathbb{1}(S_1), \dotsc, \mathbb{1}(S_K))\) is full rank.
  • estimators of \(\beta_k\) are changed, but not the estimator of \(\sigma^2\), nor the fitted values.
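In R, this parametrization is obtained by removing the intercept from the formula (a minimal sketch; fit_ref and fit_noint are illustrative names):

fit_ref   <- lm(body_mass_g ~ species,     data = females)
fit_noint <- lm(body_mass_g ~ 0 + species, data = females)
coef(fit_noint)                                ## directly the three group means
all.equal(fitted(fit_ref), fitted(fit_noint))  ## fitted values are unchanged
all.equal(sigma(fit_ref), sigma(fit_noint))    ## and so is the estimate of sigma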

Constraint: Null Sum

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

We can assume \[ \sum_{k = 1}^K \beta_k= 0 \qquad \text{i.e.} \qquad \beta_K = - \sum_{k = 1}^{K-1} \beta_k \]

  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1) - \mathbb{1}(S_K), \dotsc, \mathbb{1}(S_{K-1}) - \mathbb{1}(S_K))\) is full rank.
  • estimators of \(\mu\), \(\beta_k\) are changed, but not the estimator of \(\sigma^2\), nor the fitted values.
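This null-sum constraint corresponds to the “sum” contrasts in R; a minimal sketch (fit_sum is an illustrative name):

fit_sum <- lm(body_mass_g ~ species, data = females,
              contrasts = list(species = contr.sum))
coef(fit_sum)                                                  ## different coefficients ...
all.equal(fitted(fit_sum),
          fitted(lm(body_mass_g ~ species, data = females)))  ## ... but the same fitted values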

Constraint: Weighted Null Sum

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

We can assume \[ \sum_{k = 1}^K n_k \beta_k= 0 \qquad \text{i.e.} \qquad \beta_K = -\frac{1}{n_K} \sum_{k = 1}^{K-1} n_k \beta_k \]

  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1) - \frac{n_1}{n_K}\mathbb{1}(S_K), \dotsc, \mathbb{1}(S_{K-1}) - \frac{n_{K-1}}{n_K}\mathbb{1}(S_K))\) is full rank.
  • estimators of \(\mu\), \(\beta_k\) are changed, but not the estimator of \(\sigma^2\), nor the fitted values.

One-way ANOVA table

One-way ANOVA - \(F\) test

\[ y_{i,k} = \mu + \beta_k + \epsilon_i \]

ANOVA Test: Does the grouping have any effect at all? I.e. \[ \begin{aligned} \mathcal{H}_0&: \beta_{1} = \cdots = \beta_K = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{1, \dotsc, K\} ~|~ \beta_k \neq 0 \end{aligned} \]

No matter which set of constraints we choose, the \(F\) statistic associated with the one-factor ANOVA is:

\[ F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (K - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - K) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{K - 1}_{n - K}. \]

\(\to\) Global \(F\) test on the coefficients of the linear regression.

ANOVA = “ANalysis Of VAriance”

One-way ANOVA - Global \(F\) Test

With R parametrization: \[ \boldsymbol{\beta} = \begin{pmatrix} \mu_{s_1}\\ \mu_{s_2} - \mu_{s_1}\\ \vdots\\ \mu_{s_K} - \mu_{s_1}\\ \end{pmatrix} \quad \text{and} \quad \hat{\boldsymbol{\beta}} = \begin{pmatrix} \bar{\mathbf{y}}^{s_1}\\ \bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\ \vdots\\ \bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\ \end{pmatrix} \]

  • The global \(F\) test on the coefficients \(\beta_2, \dotsc, \beta_K\) is:

\[ \mathcal{H}_0: \beta_{2} = \beta_3 = \cdots = \beta_{K} = 0 \quad \text{vs} \quad \mathcal{H}_1: \exists k | \beta_{k} \neq 0 \]

and is the same as the global test of mean equality of all groups:

\[ \mathcal{H}_0: \mu_{1} = \mu_2 = \cdots = \mu_{K} \quad \text{vs} \quad \mathcal{H}_1: \exists k, k' | \mu_{k} \neq \mu_{k'} \]

Species

fit <- lm(body_mass_g ~ species, data = females)
summary(fit)

Call:
lm(formula = body_mass_g ~ species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -193.84   20.26  181.16  622.79 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3368.84      32.42 103.908  < 2e-16 ***
speciesChinstrap   158.37      57.52   2.754  0.00657 ** 
speciesGentoo     1310.91      48.72  26.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared:  0.8292,    Adjusted R-squared:  0.8271 
F-statistic: 393.2 on 2 and 162 DF,  p-value: < 2.2e-16

One-way ANOVA - Table

\[ F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (K - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - K) } = \frac{ ESS / (K - 1) }{ RSS / (n - K) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{K - 1}_{n - K}. \]

“Anova table”:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor | \(K - 1\) | \(ESS\) | \(ESS / (K - 1)\) | \(\frac{ESS / (K - 1)}{RSS / (n - K)}\) | \(p\) |
| residuals | \(n - K\) | \(RSS\) | \(RSS / (n - K)\) | | |
| total | \(n - 1\) | \(TSS\) | | | |
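The table can be reconstructed “by hand” from the fitted model (a minimal sketch; ESS, RSS, F_stat are illustrative names):

fit <- lm(body_mass_g ~ species, data = females)
n <- nrow(females); K <- nlevels(females$species)
ESS <- sum((fitted(fit) - mean(females$body_mass_g))^2)  ## explained sum of squares
RSS <- sum(residuals(fit)^2)                             ## residual sum of squares
F_stat <- (ESS / (K - 1)) / (RSS / (n - K))
c(F = F_stat, p.value = pf(F_stat, K - 1, n - K, lower.tail = FALSE))
## should match the `species` line of anova(fit) below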

Species

fit <- lm(body_mass_g ~ species, data = females)
anova(fit)
Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value    Pr(>F)    
species     2 60350016 30175008  393.25 < 2.2e-16 ***
Residuals 162 12430757    76733                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA

Species

We keep the penguins of all species and both sexes, and remove missing data.

penguins_nona <- na.omit(penguins)
kable(penguins_nona[c(1, 2, 151, 152, 271, 272), ])
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Gentoo | Biscoe | 47.6 | 14.5 | 215 | 5400 | male | 2007 |
| Gentoo | Biscoe | 46.5 | 13.5 | 210 | 4550 | female | 2007 |
| Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
| Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |

Species

Question: what is the effect of both species and sex on body mass?

par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(penguins_nona$body_mass_g,                             ## scatter plot
     pch = as.numeric(penguins_nona$species),               ## point type = species
     col = penguins_nona$sex)                               ## color = sex
boxplot(body_mass_g ~ sex + species, data = penguins_nona,  ## box plot
        col = c("grey", "red")) 

Model

Observation \(i\) of group \(k\) in factor 1 and group \(l\) in factor 2 is written as: \[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \]

  • \(\mu\) the “grand mean” (intercept)
  • \(\alpha_k\) the specific effect of group \(k\) in factor 1 (\(1 \leq k \leq K\)).
  • \(\beta_l\) the specific effect of group \(l\) in factor 2 (\(1 \leq l \leq L\)).
  • \(\gamma_{kl}\) the interaction effect of groups \(k\) and \(l\).

The full model has regression matrix: \[ {\small \mathbf{X} = (\mathbb{1}, \mathbb{1}_{S_1}, \dotsc, \mathbb{1}_{S_K}, \mathbb{1}_{T_1}, \dotsc, \mathbb{1}_{T_L}, \mathbb{1}_{S_1, T_1}, \dotsc, \mathbb{1}_{S_K, T_L} ) } \]

\(\to\) \(1 + K + L + KL\) predictors \(\to\) not identifiable!

Constraint: Group 1 of each factor as reference

\[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \]

Assume, for all \(1\leq k \leq K\) and \(1\leq l \leq L\): \[ \alpha_1 = 0, \qquad \beta_1 = 0, \qquad \gamma_{1,l} = 0, \qquad \gamma_{k,1} = 0. \]

  • \(y_{i,1,1}\) has mean \(\mu\) (reference)
  • \(y_{i,1,l}\) has mean \(\mu + \beta_l\)
  • \(y_{i,k,1}\) has mean \(\mu + \alpha_k\)
  • \(y_{i,k,l}\) has mean \(\mu + \alpha_k + \beta_l + \gamma_{k,l}\) (\(k,l>1\))
  • \(1 + K + L\) constraints \(\to\) \(KL\) free parameters
  • \(\mathbf{X} = (\mathbb{1}, \mathbb{1}_{S_2}, \dotsc, \mathbb{1}_{S_K}, \mathbb{1}_{T_2}, \dotsc, \mathbb{1}_{T_L}, \mathbb{1}_{S_2, T_2}, \dotsc, \mathbb{1}_{S_K, T_L})\) is of full rank.

\(\to\) This is the constraint used by default in R.

Species and Sex

fit <- lm(body_mass_g ~  species * sex, data = penguins_nona) 
summary(fit)

Call:
lm(formula = body_mass_g ~ species * sex, data = penguins_nona)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -213.97   11.03  206.51  861.03 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3368.84      36.21  93.030  < 2e-16 ***
speciesChinstrap           158.37      64.24   2.465  0.01420 *  
speciesGentoo             1310.91      54.42  24.088  < 2e-16 ***
sexmale                    674.66      51.21  13.174  < 2e-16 ***
speciesChinstrap:sexmale  -262.89      90.85  -2.894  0.00406 ** 
speciesGentoo:sexmale      130.44      76.44   1.706  0.08886 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309.4 on 327 degrees of freedom
Multiple R-squared:  0.8546,    Adjusted R-squared:  0.8524 
F-statistic: 384.3 on 5 and 327 DF,  p-value: < 2.2e-16

Other constraints

\[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \]

Interactions only: for all \(1\leq k \leq K\) and \(1\leq l \leq L\): \[ \mu = 0,\qquad \alpha_k = 0, \qquad \beta_l = 0. \]

Null sum: \[ \sum_{k=1}^K\alpha_k = 0, \qquad \sum_{l=1}^L\beta_l = 0, \qquad \sum_{k=1}^K\gamma_{kl} = 0, \qquad \sum_{l=1}^L\gamma_{kl} = 0. \]

\(\to\) changes the estimates of the coefficients, but not the estimate of the variance, nor the fitted values.
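For instance, with sum-to-zero constraints on both factors (a minimal sketch; fit_ref and fit_sum are illustrative names):

fit_ref <- lm(body_mass_g ~ species * sex, data = penguins_nona)
fit_sum <- lm(body_mass_g ~ species * sex, data = penguins_nona,
              contrasts = list(species = contr.sum, sex = contr.sum))
all.equal(fitted(fit_ref), fitted(fit_sum))  ## same fitted values
all.equal(sigma(fit_ref), sigma(fit_sum))    ## same estimate of sigma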

Nested F Test - Reminder

Recall that (from CM5), in general, to test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

We can use the statistic: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]

Writing \(\mathbf{X}_{p_0} = (\mathbf{X}_1, \dotsc, \mathbf{X}_{p_0})\), this is the same as testing: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{p_0}}\mathbf{y} = \mathbf{P}^{\mathbf{X}}\mathbf{y} &\text{vs}\\ \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{p_0}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}}\mathbf{y} \end{aligned} \]
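In R, such a nested \(F\) test can be carried out by fitting both models and comparing them with anova() (a minimal sketch; fit0 and fit1 are illustrative names):

fit0 <- lm(body_mass_g ~ species, data = penguins_nona)        ## reduced model
fit1 <- lm(body_mass_g ~ species * sex, data = penguins_nona)  ## full model
anova(fit0, fit1)  ## F test of H0: the extra terms bring no information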

Nested F Test - General

Take any two nested subsets of indexes \(\eta_1 \subset \eta_2\), and write: \[ \begin{aligned} \mathbf{X}_{\eta_k} &= (\mathbf{X}_i)_{i\in\eta_k} & \mathbf{P}^{\mathbf{X}_{\eta_k}}\mathbf{y} &= \hat{\mathbf{y}}_{\eta_k} & \mathbf{P}^{\mathbf{X}}\mathbf{y} &= \hat{\mathbf{y}} \end{aligned} \]

To test: \[ \begin{aligned} \mathcal{H}_0&: \text{"Larger model 2 does not bring any information compared to model 1"}\\ \mathcal{H}_1&: \text{"Larger model 2 does bring some information compared to model 1"} \end{aligned} \]

i.e. \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\eta_1}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\eta_2}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\eta_1}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\eta_2}}\mathbf{y} \end{aligned} \]

We can use: \[ F = \frac{ \|\hat{\mathbf{y}}_{\eta_2} - \hat{\mathbf{y}}_{\eta_1}\|^2 / (|\eta_2| - |\eta_1|) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{ 1 }{ \hat{\sigma}^2 } (RSS_{\eta_1} - RSS_{\eta_2}) / (|\eta_2| - |\eta_1|) \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{|\eta_2| - |\eta_1|}_{n - p}. \]

Anova model

Model: \[ y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i \] with, for all \(1\leq k \leq K\) and \(1\leq l \leq L\): \[ \alpha_1 = 0, \qquad \beta_1 = 0, \qquad \gamma_{1,l} = 0, \qquad \gamma_{k,1} = 0. \] (\(\to KL\) free parameters in total.)

Tests:

  • Is factor 1 useful?
  • Is factor 2 useful?
  • Is the interaction between factors 1 and 2 useful?

Nested F Test - Anova - Factor 1

To test: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} \end{aligned} \] i.e. \(\mathcal{H}_0\): “factor 1 does not bring any information compared to intercept alone”.

We can use: \[ \begin{aligned} F &= \frac{n - KL}{K - 1}\frac{RSS_{\mu} - RSS_{\mu, \alpha}}{RSS}\\ &= \frac{R(\alpha | \mu) / (K - 1)}{\hat{\sigma}^2}\\ &\underset{\mathcal{H}_0}{\sim} \mathcal{F}^{K - 1}_{n - KL}. \end{aligned} \]

Nested F Test - Anova - Factor 2

To test: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha, \beta}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} \end{aligned} \] i.e. \(\mathcal{H}_0\): “factor 2 does not bring any information compared to factor 1”.

We can use: \[ \begin{aligned} F &= \frac{n - KL}{L - 1}\frac{RSS_{\mu, \alpha} - RSS_{\mu, \alpha, \beta}}{RSS}\\ &= \frac{R(\beta | \mu, \alpha) / (L - 1)}{\hat{\sigma}^2}\\ &\underset{\mathcal{H}_0}{\sim} \mathcal{F}^{L - 1}_{n - KL}. \end{aligned} \]

Nested F Test - Anova - Interaction

To test: \[ \begin{aligned} \mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha, \beta,\gamma}}\mathbf{y} &\text{v.s.}&& \mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta,\gamma}}\mathbf{y} \end{aligned} \] i.e. \(\mathcal{H}_0\): “Interactions do not bring any information compared to factors 1 and 2”.

We can use: \[ \begin{aligned} F &= \frac{n - KL}{(L - 1)(K - 1)}\frac{RSS_{\mu, \alpha, \beta} - RSS_{\mu, \alpha, \beta, \gamma}}{RSS}\\ &= \frac{R(\gamma | \mu, \alpha, \beta) / ((K - 1)(L - 1))}{\hat{\sigma}^2}\\ &\underset{\mathcal{H}_0}{\sim} \mathcal{F}^{(K - 1)(L - 1)}_{n - KL}. \end{aligned} \]

Two-way ANOVA - Type I Table

“Type I” Anova table:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor 1 | \(K - 1\) | \(R(\alpha \mid \mu)\) | \(R(\alpha \mid \mu) / df\) | \(\hat{\sigma}^{-2} R(\alpha \mid \mu) / df\) | \(p\) |
| factor 2 | \(L - 1\) | \(R(\beta \mid \mu, \alpha)\) | \(R(\beta \mid \mu, \alpha) / df\) | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\) | \(p\) |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\) |
| residuals | \(n - KL\) | \(RSS\) | \(\hat{\sigma}^2 = RSS / df\) | | |

Species and Sex - Type I

fit <- lm(body_mass_g ~  species * sex, data = penguins_nona) 
anova(fit)
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
species       2 145190219 72595110 758.358 < 2.2e-16 ***
sex           1  37090262 37090262 387.460 < 2.2e-16 ***
species:sex   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sex and Species - Type I

fit <- lm(body_mass_g ~  sex * species, data = penguins_nona) 
anova(fit)
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
sex           1  38878897 38878897 406.145 < 2.2e-16 ***
species       2 143401584 71700792 749.016 < 2.2e-16 ***
sex:species   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA - Type I Table

“Type I” Anova table:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor 1 | \(K - 1\) | \(R(\alpha \mid \mu)\) | \(R(\alpha \mid \mu) / df\) | \(\hat{\sigma}^{-2} R(\alpha \mid \mu) / df\) | \(p\) |
| factor 2 | \(L - 1\) | \(R(\beta \mid \mu, \alpha)\) | \(R(\beta \mid \mu, \alpha) / df\) | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\) | \(p\) |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\) |
| residuals | \(n - KL\) | \(RSS\) | \(\hat{\sigma}^2 = RSS / df\) | | |

Attention: The order of the factors matters!

Two-way ANOVA - Type II Table

“Type II” Anova table:

| | Df | SS | Mean SS | F | \(p\) value |
|---|---|---|---|---|---|
| factor 1 | \(K - 1\) | \(R(\alpha \mid \mu, \beta)\) | \(R(\alpha \mid \mu, \beta) / df\) | \(\hat{\sigma}^{-2} R(\alpha \mid \mu, \beta) / df\) | \(p\) |
| factor 2 | \(L - 1\) | \(R(\beta \mid \mu, \alpha)\) | \(R(\beta \mid \mu, \alpha) / df\) | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\) | \(p\) |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\) |
| residuals | \(n - KL\) | \(RSS\) | \(\hat{\sigma}^2 = RSS / df\) | | |

Symmetric in factors 1 and 2.

Sex and Species - Type II

fit <- lm(body_mass_g ~  sex * species, data = penguins_nona) 
car::Anova(fit)
Anova Table (Type II tests)

Response: body_mass_g
               Sum Sq  Df F value    Pr(>F)    
sex          37090262   1 387.460 < 2.2e-16 ***
species     143401584   2 749.016 < 2.2e-16 ***
sex:species   1676557   2   8.757 0.0001973 ***
Residuals    31302628 327                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bill Length - Type II

fit <- lm(bill_length_mm ~  sex * species, data = penguins_nona) 
car::Anova(fit)
Anova Table (Type II tests)

Response: bill_length_mm
            Sum Sq  Df  F value Pr(>F)    
sex         1135.7   1 211.8066 <2e-16 ***
species     6975.6   2 650.4786 <2e-16 ***
sex:species   24.5   2   2.2841 0.1035    
Residuals   1753.3 327                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bill Length - Type II - No interaction

fit <- lm(bill_length_mm ~  sex + species, data = penguins_nona) 
car::Anova(fit)
Anova Table (Type II tests)

Response: bill_length_mm
          Sum Sq  Df F value    Pr(>F)    
sex       1135.7   1  210.17 < 2.2e-16 ***
species   6975.6   2  645.44 < 2.2e-16 ***
Residuals 1777.8 329                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANCOVA

Penguins

Question: what is the relationship between bill length and depth, controlling for the species?

One line fits all

fit_all <- lm(bill_length_mm ~ bill_depth_mm, data = females)
summary(fit_all)

Call:
lm(formula = bill_length_mm ~ bill_depth_mm, data = females)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0745  -3.0689  -0.7609   2.6776  17.5034 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    61.2214     3.1966  19.152  < 2e-16 ***
bill_depth_mm  -1.1643     0.1935  -6.018 1.13e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.449 on 163 degrees of freedom
Multiple R-squared:  0.1818,    Adjusted R-squared:  0.1768 
F-statistic: 36.22 on 1 and 163 DF,  p-value: 1.128e-08

One line fits all

\[ y_{i} = \mu + \beta x_{i} + \epsilon_i \]

Take species into account

fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
summary(fit_species)

Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species, data = females)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9886 -1.4503 -0.0603  1.4476 11.2797 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       25.0447     3.9377   6.360 1.99e-09 ***
bill_depth_mm      0.6930     0.2230   3.108  0.00222 ** 
speciesChinstrap   9.3393     0.4648  20.092  < 2e-16 ***
speciesGentoo     10.6515     0.8510  12.516  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.238 on 161 degrees of freedom
Multiple R-squared:  0.7954,    Adjusted R-squared:  0.7916 
F-statistic: 208.7 on 3 and 161 DF,  p-value: < 2.2e-16

Take species into account

\[ y_{i,k} = \mu + \alpha_k + \beta x_{i} + \epsilon_i \]

Take species into account

fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
car::Anova(fit_species)
Anova Table (Type II tests)

Response: bill_length_mm
               Sum Sq  Df  F value    Pr(>F)    
bill_depth_mm   48.41   1   9.6624  0.002224 ** 
species       2419.64   2 241.4533 < 2.2e-16 ***
Residuals      806.70 161                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions

fit_int <- lm(bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species,
              data = females)
fit_int

Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species, 
    data = females)

Coefficients:
                   (Intercept)                   bill_depth_mm  
                       31.1671                          0.3456  
              speciesChinstrap                   speciesGentoo  
                       -2.5348                         -8.8729  
bill_depth_mm:speciesChinstrap     bill_depth_mm:speciesGentoo  
                        0.6745                          1.2887  

Interactions

\[ y_{i,k} = \mu + \alpha_k + \beta_k x_{i} + \epsilon_i \]

Interactions

fit_int <- lm(bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species,
              data = females)
car::Anova(fit_int)
Anova Table (Type II tests)

Response: bill_length_mm
                       Sum Sq  Df  F value    Pr(>F)    
bill_depth_mm           48.41   1   9.8428  0.002032 ** 
species               2419.64   2 245.9610 < 2.2e-16 ***
bill_depth_mm:species   24.62   2   2.5029  0.085070 .  
Residuals              782.08 159                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ancova - One Regressor - Model

For any observation \(1 \leq i \leq n\) in group \(1 \leq k \leq K\), we write: \[ y_{i,k} = \mu + \alpha_k + (\beta + \gamma_k) \times x_{i} + \epsilon_i \qquad \epsilon_i\sim\mathcal{N}(0, \sigma^2)~iid \]

Identifiability: \(\alpha_1 = \gamma_1 = 0\).

\[ \text{With $K=2$:}\qquad \mathbf{X} = \begin{pmatrix} 1 & 0 & x_1 & 0\\ \vdots & \vdots & \vdots & \vdots\\ 1 & 0 & x_{n_1} & 0\\ 1 & 1 & x_{n_1 + 1} & x_{n_1 + 1} \\ \vdots & \vdots & \vdots & \vdots\\ 1 & 1 & x_{n_1 + n_2} & x_{n_1 + n_2}\\ \end{pmatrix} \]
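The interaction fit of the previous slides uses exactly this design (with the reference constraint); a minimal sketch to inspect it (the formula bill_depth_mm * species is equivalent to the one written out with + and : above):

fit_int <- lm(bill_length_mm ~ bill_depth_mm * species, data = females)
head(model.matrix(fit_int))  ## intercept, slope, group indicators, and group-specific slopes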

Ancova

F tests of effects and interactions can be done as before.

More than one predictor? \(\to\) not a problem.

More than one factor? \(\to\) not a problem.

Attention: as models become more complex, they become more difficult to interpret and test.

Attention: diagnostics on the residuals are still needed.

Bill length and depth

fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
par(mfrow = c(2,2))
plot(fit_species)