Model
Assumptions :
\(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
\(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid
in other words:
\(y^f_{i} = \mu_f + \epsilon_i^f\) , with \(\epsilon^f_i \sim \mathcal{N}(0, \sigma^2)\) iid
\(y^m_{j} = \mu_m + \epsilon_j^m\) , with \(\epsilon^m_j \sim \mathcal{N}(0, \sigma^2)\) iid
Can we write this as a linear model: “\(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\) ” ?
Model
Write: \[
\mathbf{y} =
\begin{pmatrix}
\mathbf{y}^f \\
\mathbf{y}^m \\
\end{pmatrix}
\quad
\text{and}
\quad
\boldsymbol{\epsilon} =
\begin{pmatrix}
\boldsymbol{\epsilon}^f \\
\boldsymbol{\epsilon}^m \\
\end{pmatrix}
\] \(\mathbf{y}\) and \(\boldsymbol{\epsilon}\) vectors of size \(n = n_f + n_m\) .
Then, for \(1 \leq i \leq n\) : \[
y_i = \mu_k + \epsilon_i
\quad
\text{with}
\quad
\begin{cases}
\mu_k = \mu_f & \text{if } 1 \leq i \leq n_f\\
\mu_k = \mu_m & \text{if } n_f+1 \leq i \leq n_f + n_m
\end{cases}
\]
Regression matrix \(\mathbf{X}\) ?
Linear Model
\[
\text{Write:}
\qquad
\mathbf{X} =
\begin{pmatrix}
1 & 0 \\
\vdots & \vdots \\
1 & 0 \\
0 & 1 \\
\vdots & \vdots\\
0 & 1 \\
\end{pmatrix}
\begin{matrix}
\\
n_f\\
\\
\\
n_m\\
\\
\end{matrix}
\qquad
\text{and}
\qquad
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_f \\
\mu_m \\
\end{pmatrix}
\]
\[
\text{Then:}
\qquad
\mathbf{X}\boldsymbol{\beta}
=
\begin{pmatrix}
\mu_f + 0\\
\vdots \\
\mu_f + 0\\
0 + \mu_m\\
\vdots\\
0 + \mu_m \\
\end{pmatrix}
=
\begin{pmatrix}
\mu_f \\
\vdots \\
\mu_f \\
\mu_m \\
\vdots\\
\mu_m \\
\end{pmatrix}
\begin{matrix}
\\
n_f\\
\\
\\
n_m\\
\\
\end{matrix}
\]
Linear Model
Hence, the model can be written as: \[
\begin{pmatrix}
y^f_1 \\
\vdots\\
y^f_{n_f} \\
y^m_{1} \\
\vdots\\
y^m_{n_m} \\
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 \\
\vdots & \vdots \\
1 & 0 \\
0 & 1 \\
\vdots & \vdots\\
0 & 1 \\
\end{pmatrix}
\begin{pmatrix}
\mu_f \\
\mu_m \\
\end{pmatrix}
+
\begin{pmatrix}
\epsilon^f_1 \\
\vdots\\
\epsilon^f_{n_f} \\
\epsilon^m_{1} \\
\vdots\\
\epsilon^m_{n_m} \\
\end{pmatrix}
\]
i.e: \[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
\quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)
\]
\(\to\) classical linear model with Gaussian assumption !
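A minimal R sketch of this design matrix (with hypothetical group sizes and means, not the penguin data):
n_f <- 3; n_m <- 2                        # hypothetical group sizes
X <- cbind(c(rep(1, n_f), rep(0, n_m)),   # column 1: indicator of the female group
           c(rep(0, n_f), rep(1, n_m)))   # column 2: indicator of the male group
beta <- c(3400, 4000)                     # hypothetical (mu_f, mu_m)
drop(X %*% beta)                          # mu_f repeated n_f times, then mu_m repeated n_m times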
Linear Model - With Intercept
\[
\text{Write:}
\qquad
\mathbf{X} =
\begin{pmatrix}
1 & 0 \\
\vdots & \vdots \\
1 & 0 \\
1 & 1 \\
\vdots & \vdots\\
1 & 1 \\
\end{pmatrix}
\begin{matrix}
\\
n_f\\
\\
\\
n_m\\
\\
\end{matrix}
\qquad
\text{and}
\qquad
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_f \\
\mu_m - \mu_f \\
\end{pmatrix}
\]
\[
\text{Then:}
\qquad
\mathbf{X}\boldsymbol{\beta}
=
\begin{pmatrix}
\mu_f + 0\\
\vdots \\
\mu_f + 0\\
\mu_f + \mu_m - \mu_f\\
\vdots\\
\mu_f + \mu_m - \mu_f \\
\end{pmatrix}
=
\begin{pmatrix}
\mu_f \\
\vdots \\
\mu_f \\
\mu_m \\
\vdots\\
\mu_m \\
\end{pmatrix}
\begin{matrix}
\\
n_f\\
\\
\\
n_m\\
\\
\end{matrix}
\]
Linear Model - With Intercept
The model can also be written as: \[
\begin{pmatrix}
y^f_1 \\
\vdots\\
y^f_{n_f} \\
y^m_{1} \\
\vdots\\
y^m_{n_m} \\
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 \\
\vdots & \vdots \\
1 & 0 \\
1 & 1 \\
\vdots & \vdots\\
1 & 1 \\
\end{pmatrix}
\begin{pmatrix}
\mu_f \\
\mu_m - \mu_f \\
\end{pmatrix}
+
\begin{pmatrix}
\epsilon^f_1 \\
\vdots\\
\epsilon^f_{n_f} \\
\epsilon^m_{1} \\
\vdots\\
\epsilon^m_{n_m} \\
\end{pmatrix}
\]
i.e: \[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
\quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)
\]
\(\to\) parametrization used in R
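The same sketch with the intercept parametrization (hypothetical sizes and means): the second column now indicates the male group, and the second coefficient is the difference \(\mu_m - \mu_f\).
n_f <- 3; n_m <- 2                          # hypothetical group sizes
X <- cbind(1, c(rep(0, n_f), rep(1, n_m)))  # (intercept, indicator of males)
beta <- c(3400, 600)                        # hypothetical (mu_f, mu_m - mu_f)
drop(X %*% beta)                            # mu_f for females, mu_f + (mu_m - mu_f) = mu_m for males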
Linear Model - \(t\) test
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
\quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)
\]
As \[
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_f \\
\mu_m - \mu_f \\
\end{pmatrix},
\]
the test \[
\mathcal{H}_0: \mu_f = \mu_m \quad \text{vs} \quad \mathcal{H}_1: \mu_f \neq \mu_m
\]
becomes: \[
\mathcal{H}_0: \beta_2 = 0 \quad \text{vs} \quad \mathcal{H}_1: \beta_2 \neq 0
\]
\(\to\) \(t\) -test on the \(\beta_2\) coefficient !
Linear Model
fit <- lm(body_mass_g ~ sex, data = adelie)
model.matrix(fit)[1:10, ]
(Intercept) sexmale
1 1 1
2 1 0
3 1 0
4 1 0
5 1 1
6 1 0
7 1 1
8 1 0
9 1 1
10 1 1
Linear Model
Call:
lm(formula = body_mass_g ~ sex, data = adelie)
Residuals:
Min 1Q Median 3Q Max
-718.49 -218.84 -18.84 225.00 731.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.34 92.69 <2e-16 ***
sexmale 674.66 51.40 13.13 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 310.5 on 144 degrees of freedom
Multiple R-squared: 0.5447, Adjusted R-squared: 0.5416
F-statistic: 172.3 on 1 and 144 DF, p-value: < 2.2e-16
Simple \(t\) test
t.test(mass_males, mass_females, var.equal = TRUE)
Two Sample t-test
data: mass_males and mass_females
t = 13.126, df = 144, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
573.0666 776.2484
sample estimates:
mean of x mean of y
4043.493 3368.836
Test statistics
Simple \(t\) test : \[
T = \frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}}
\underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2}
\quad
\text{with}
\quad
s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}}
\]
Linear Regression : \[
\frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}}
\sim
\mathcal{T}_{n-p}
\quad
\text{with}
\quad
p = 2
~,~
n = n_f + n_m
~,~
\beta_2 = 0 \text{ under } \mathcal{H}_0
\]
i.e. \[
\frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}}
\underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n_f + n_m - 2}
\]
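A quick numerical check of this equivalence on simulated data (a sketch, not the penguin data): the regression \(t\) value and the two-sample \(t\) statistic coincide.
set.seed(1)
y_f <- rnorm(50, mean = 3400, sd = 300)      # simulated "female" masses
y_m <- rnorm(60, mean = 4000, sd = 300)      # simulated "male" masses
dat <- data.frame(mass = c(y_f, y_m),
                  sex  = factor(rep(c("female", "male"), c(50, 60))))
summary(lm(mass ~ sex, data = dat))$coefficients["sexmale", "t value"]
t.test(y_m, y_f, var.equal = TRUE)$statistic # same value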
Test statistics
Let’s show that: \[
\frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}}
\quad
\text{with}
\quad
\mathbf{X} =
\begin{pmatrix}
1 & 0 \\
\vdots & \vdots \\
1 & 0 \\
1 & 1 \\
\vdots & \vdots\\
1 & 1 \\
\end{pmatrix}
\begin{matrix}
\\
n_f\\
\\
\\
n_m\\
\\
\end{matrix}
\qquad
\text{and}
\qquad
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_f \\
\mu_m - \mu_f \\
\end{pmatrix}
\]
is equal (up to the sign, which only depends on which group is taken as reference) to \[
\frac{\bar{\mathbf{y}}^f - \bar{\mathbf{y}}^m}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}}
\quad
\text{with}
\quad
s_p = \sqrt{\frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}}
\]
Test statistics - Proof - 1/5
\[
\mathbf{X}^T\mathbf{X} =
\begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 \\
0 & \cdots & 0 & 1 & \cdots & 1 \\
\end{pmatrix}
\begin{pmatrix}
1 & 0 \\
\vdots & \vdots \\
1 & 0 \\
1 & 1 \\
\vdots & \vdots\\
1 & 1 \\
\end{pmatrix}
=
\begin{pmatrix}
n & n_m \\
n_m & n_m
\end{pmatrix}
\]
Hence: \[
(\mathbf{X}^T\mathbf{X})^{-1}
=
\begin{pmatrix}
n & n_m \\
n_m & n_m
\end{pmatrix}^{-1}
=
\frac{1}{nn_m - n_m^2}
\begin{pmatrix}
n_m & -n_m \\
-n_m & n
\end{pmatrix}
\]
\[
[(\mathbf{X}^T\mathbf{X})^{-1}]_{22}
= \frac{n}{nn_m - n_m^2}
= \frac{n_f + n_m}{(n - n_m)n_m}
= \frac{n_f + n_m}{n_f n_m}
= \frac{1}{n_f} + \frac{1}{n_m}
\]
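This identity can be checked numerically (a sketch with hypothetical group sizes):
n_f <- 73; n_m <- 73                        # hypothetical group sizes
X <- cbind(1, c(rep(0, n_f), rep(1, n_m)))  # intercept parametrization
solve(crossprod(X))[2, 2]                   # [(X^T X)^{-1}]_{22}
1 / n_f + 1 / n_m                           # same value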
Test statistics - Proof - 2/5
\[
\hat{\boldsymbol{\beta}}
= (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}
\]
But: \[
\mathbf{X}^{T}\mathbf{y}
=
\begin{pmatrix}
1 & \cdots & 1 & 1 & \cdots & 1 \\
0 & \cdots & 0 & 1 & \cdots & 1 \\
\end{pmatrix}
\mathbf{y}
=
\begin{pmatrix}
\sum_{i = 1}^n y_i\\
\sum_{j = 1}^{n_m} y^m_j
\end{pmatrix}
\]
Hence: \[
\begin{aligned}
\hat{\boldsymbol{\beta}}
& = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}
=
\frac{1}{n_fn_m}
\begin{pmatrix}
n_m & -n_m \\
-n_m & n
\end{pmatrix}
\begin{pmatrix}
\sum_{i = 1}^n y_i\\
\sum_{j = 1}^{n_m} y^m_j
\end{pmatrix} \\
&=
%\frac{1}{n_fn_m}
%\begin{pmatrix}
%n_m\sum_{i = 1}^n y_i - n_m\sum_{j = 1}^{n_m} y^m_j\\
%- n_m\sum_{i = 1}^{n} y_i + n\sum_{j = 1}^{n_m} y^m_j\\
%\end{pmatrix}
%=
\frac{1}{n_fn_m}
\begin{pmatrix}
n_m\sum_{i = 1}^n y_i - n_m\sum_{j = 1}^{n_m} y^m_j\\
- n_m\sum_{i = 1}^{n} y_i + n_m\sum_{j = 1}^{n_m} y^m_j + n_f\sum_{j = 1}^{n_m} y^m_j\\
\end{pmatrix} \\
&=
\frac{1}{n_fn_m}
\begin{pmatrix}
n_m \sum_{i = 1}^{n_f} y^f_i\\
- n_m\sum_{i = 1}^{n_f} y^f_i + n_f\sum_{j = 1}^{n_m} y^m_j\\
\end{pmatrix}
\end{aligned}
\]
Test statistics - Proof - 3/5
\[
\hat{\boldsymbol{\beta}}
= (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}
\]
And: \[
\hat{\boldsymbol{\beta}}
=
\frac{1}{n_fn_m}
\begin{pmatrix}
n_m \sum_{i = 1}^{n_f} y^f_i\\
- n_m\sum_{i = 1}^{n_f} y^f_i + n_f\sum_{j = 1}^{n_m} y^m_j\\
\end{pmatrix}
=
\begin{pmatrix}
\bar{\mathbf{y}}^f\\
-\bar{\mathbf{y}}^f + \bar{\mathbf{y}}^m\\
\end{pmatrix}
\]
Note : \[
\text{since }
\quad
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_f \\
\mu_m - \mu_f \\
\end{pmatrix}
\quad
\text{this result is consistent.}
\]
Note : \[
\text{As}
\quad
\mathbf{x}^{i} =
\begin{cases}
(1~0) & \text{ if $i$ female} \\
(1~1) & \text{ if $i$ male} \\
\end{cases}
\quad
\text{we get:}
\quad
\hat{y}_i = \mathbf{x}^{i}\hat{\boldsymbol{\beta}} =
\begin{cases}
\bar{\mathbf{y}}^f & \text{ if $i$ female} \\
\bar{\mathbf{y}}^m & \text{ if $i$ male} \\
\end{cases}
\]
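A sketch checking this on the adelie data frame used in the earlier slides:
fit <- lm(body_mass_g ~ sex, data = adelie)
unique(round(fitted(fit), 2))                            # only two distinct fitted values
with(model.frame(fit), tapply(body_mass_g, sex, mean))   # the female and male sample means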
Test statistics - Proof - 4/5
\[
\hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i = 1}^n (y_i - \hat{y}_i)^2
\]
Hence: \[
\hat{\sigma}^2 = \frac{1}{n_f + n_m - 2}
\left[
\sum_{i = 1}^{n_f} (y^f_i - \bar{\mathbf{y}}^f)^2 +
\sum_{i = 1}^{n_m} (y^m_i - \bar{\mathbf{y}}^m)^2
\right]
\]
Recall: \[
s^2_p = \frac{(n_f-1)s^2_{\mathbf{y}^f} + (n_m-1)s^2_{\mathbf{y}^m}}{n_f + n_m - 2}
\]
hence: \[
\hat{\sigma}^2 = s^2_p.
\]
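A sketch checking this equality on the adelie data, assuming the mass_females and mass_males vectors used in the t.test call above:
fit <- lm(body_mass_g ~ sex, data = adelie)
n_f <- length(mass_females); n_m <- length(mass_males)
s2_p <- ((n_f - 1) * var(mass_females) + (n_m - 1) * var(mass_males)) /
  (n_f + n_m - 2)
c(sigma(fit)^2, s2_p)                       # identical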
Test statistics - Proof - 5/5
\[
T = \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}}
\]
but: \[
\hat{\beta}_2 = \bar{\mathbf{y}}^m -\bar{\mathbf{y}}^f \\
\hat{\sigma}^2 = s^2_p \\
[(\mathbf{X}^T\mathbf{X})^{-1}]_{22} = \frac{1}{n_f} + \frac{1}{n_m}
\]
hence: \[
T = \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}}}
= \frac{\bar{\mathbf{y}}^m - \bar{\mathbf{y}}^f}{s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_m}}}
\]
This is the simple two-sample \(t\) statistic (up to its sign, which only depends on which group is taken as reference), so the simple \(t\) test is the same as the \(t\) test on the coefficient of the linear regression.
Summary
Group comparison :
\(n_f\) female Adelie penguins, \(n_m\) male Adelie penguins
\(y^f_{i}\) : mass (g) of a female Adelie penguin, \(1 \leq i \leq n_f\) , \(y^f_{i} \sim \mathcal{N}(\mu_f, \sigma^2)\) iid
\(y^m_{j}\) : mass (g) of a male Adelie penguin, \(1 \leq j \leq n_m\) , \(y^m_{j} \sim \mathcal{N}(\mu_m, \sigma^2)\) iid
\(\mathcal{H}_0\) : \(\mu_f = \mu_m\) vs \(\mathcal{H}_1\) : \(\mu_f \neq \mu_m\)
is the same as a Linear model with:
\(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)
\(\mathbf{X}_1 = \mathbb{1}_n\) and \(\mathbf{X}_2 = \mathbb{1}(male)\)
\(\beta_1 = \mu_f\) and \(\beta_2 = \mu_m - \mu_f\) .
\(\mathcal{H}_0: \beta_2 = 0\) vs \(\mathcal{H}_1: \beta_2 \neq 0\)
\(\to\) what happens when there are more than two groups ?
Species
We keep only the female penguins of all three species, and remove missing data.
females <- subset(penguins, sex == "female")
females <- na.omit(females)
kable(females[c(1, 2, 101, 102, 151, 152), ])
| species   | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    | year |
|-----------|-----------|----------------|---------------|-------------------|-------------|--------|------|
| Adelie    | Torgersen | 39.5           | 17.4          | 186               | 3800        | female | 2007 |
| Adelie    | Torgersen | 40.3           | 18.0          | 195               | 3250        | female | 2007 |
| Gentoo    | Biscoe    | 43.8           | 13.9          | 208               | 4300        | female | 2008 |
| Gentoo    | Biscoe    | 43.2           | 14.5          | 208               | 4450        | female | 2008 |
| Chinstrap | Dream     | 46.9           | 16.6          | 192               | 2700        | female | 2008 |
| Chinstrap | Dream     | 46.2           | 17.5          | 187               | 3650        | female | 2008 |
Species
Question : do the females of the three species have different body masses ?
par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(females$body_mass_g, col = females$species)  ## scatter plot
boxplot(body_mass_g ~ species, data = females)    ## box plot
\(t\) tests
Pairwise \(t\) tests:
pairwise.t.test(females$body_mass_g, females$species,
                p.adjust.method = "none")
Pairwise comparisons using t tests with pooled SD
data: females$body_mass_g and females$species
Adelie Chinstrap
Chinstrap 0.0066 -
Gentoo <2e-16 <2e-16
P value adjustment method: none
\(\to\) \({K \choose 2}\) tests : Multiple testing problems.
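In practice one would correct the p-values for multiple testing, e.g. with Holm's method (a sketch):
pairwise.t.test(females$body_mass_g, females$species,
                p.adjust.method = "holm")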
\(t\) tests - Model
Model :
\(n_s\) penguins, for species \(s \in S = \{adelie, chinstrap, gentoo\}\)
\(y^s_{i}\) : mass (g) of a female penguin of species \(s\) , \(1 \leq i \leq n_s\)
\(y^s_{i} \sim \mathcal{N}(\mu_s, \sigma^2)\) iid for any species \(s\) .
Pairwise Tests :
\[
\mathcal{H}^{s_1,s_2}_0:~ \mu_{s_1} = \mu_{s_2} \quad \text{vs} \quad \mathcal{H}^{s_1,s_2}_1:~\mu_{s_1} \neq \mu_{s_2}
\text{ for any pair } (s_1,s_2) \text{ of species}.
\]
Global Test : can we instead test
\[
\mathcal{H}_0:~\mu_a = \mu_c = \mu_g \quad \text{vs} \quad \mathcal{H}_1:~\exists s_1, s_2 \in S ~|~ \mu_{s_1} \neq \mu_{s_2} ?
\]
One-way ANOVA
\(S\) a set of \(|S| = K\) groups;
\(\mathbf{y}\) the vector of all data, length \(n = \sum_{k=1}^K n_k\) ;
\(\mathbf{X}\) the matrix of size \(n \times K\) , such that:
\(\mathbf{X}_1 = \mathbb{1}_{n}\) is the intercept (vector column)
\(\mathbf{X}_k = \mathbb{1}(S_k)\) the indicator of group \(k\) , \(2 \leq k \leq K\) .
\(x_{ik} = 1\) if individual \(i\) is in group \(k\) , and \(0\) otherwise.
\(\boldsymbol{\beta}\) the vector of mean differences, of length \(K\) :
\(\beta_1 = \mu_1\)
\(\beta_k = \mu_k - \mu_1\)
the first group is the reference
One-way ANOVA
\[
\begin{pmatrix}
y^1_1 \\
\vdots\\
y^1_{n_1} \\
y^2_{1} \\
\vdots\\
y^2_{n_2} \\
\vdots\\
y^K_{1} \\
\vdots\\
y^K_{n_K} \\
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \cdots & \vdots\\
1 & 0 & 0 & \cdots & 0\\
1 & 1 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \cdots & \vdots\\
1 & 1 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \cdots & \vdots\\
1 & 0 & 0 & \cdots & 1\\
\vdots & \vdots & \vdots & \cdots & \vdots\\
1 & 0 & 0 & \cdots & 1\\
\end{pmatrix}
\begin{pmatrix}
\mu_1 \\
\mu_2 - \mu_1 \\
\vdots\\
\mu_K - \mu_1 \\
\end{pmatrix}
+
\begin{pmatrix}
\epsilon^1_1 \\
\vdots\\
\epsilon^1_{n_1} \\
\epsilon^2_{1} \\
\vdots\\
\epsilon^2_{n_2} \\
\vdots\\
\epsilon^K_{1} \\
\vdots\\
\epsilon^K_{n_K} \\
\end{pmatrix}
\]
One-way ANOVA
Model :
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
\quad \text{with} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)
\]
Note : Assuming that the groups are disjoint and non-empty, \(\text{rank}(\mathbf{X}) = K\) .
This is equivalent to the following model: \[
\begin{aligned}
y_i &= \beta_1 + \epsilon_i = \mu_1 + \epsilon_i
& \text{ if } &i \text{ is in group } 1\\
y_i &= \beta_1 + \beta_k + \epsilon_i
= \mu_k + \epsilon_i
& \text{ if } &i \text{ is in group } k,\ k>1 \\
\end{aligned}
\]
i.e. each group has a mean \(\mu_k\) , and the errors are Gaussian iid.
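R builds exactly this kind of design matrix from a factor; a toy sketch with three hypothetical groups:
g <- factor(rep(c("g1", "g2", "g3"), times = c(2, 3, 2)))   # hypothetical grouping
model.matrix(~ g)                                           # intercept, then indicators of g2 and g3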
Species
fit <- lm(body_mass_g ~ species, data = females)
model.matrix(fit)[c(1, 2, 101, 102, 151, 152), ]
(Intercept) speciesChinstrap speciesGentoo
1 1 0 0
2 1 0 0
101 1 0 1
102 1 0 1
151 1 1 0
152 1 1 0
Species
Call:
lm(formula = body_mass_g ~ species, data = females)
Residuals:
Min 1Q Median 3Q Max
-827.21 -193.84 20.26 181.16 622.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 32.42 103.908 < 2e-16 ***
speciesChinstrap 158.37 57.52 2.754 0.00657 **
speciesGentoo 1310.91 48.72 26.904 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared: 0.8292, Adjusted R-squared: 0.8271
F-statistic: 393.2 on 2 and 162 DF, p-value: < 2.2e-16
One-way ANOVA - Coefficients
It is easy to show that:
\[
\hat{\beta}_1 = \bar{\mathbf{y}}^{s_1}\\
\hat{\beta}_k = \bar{\mathbf{y}}^{s_k} - \bar{\mathbf{y}}^{s_1} \text{ for } 2 \leq k \leq K\\
\] Hence:
\(\hat{\beta}_1\) is the estimator of the mean of the first group;
\(\hat{\beta}_k\) is the estimator of the difference between the mean of group \(k\) and the first group (\(k > 1\) ).
One-way ANOVA - Coefficients
We have: \[
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_{s_1}\\
\mu_{s_2} - \mu_{s_1}\\
\vdots\\
\mu_{s_K} - \mu_{s_1}\\
\end{pmatrix}
\quad
\text{and}
\quad
\hat{\boldsymbol{\beta}} =
\begin{pmatrix}
\bar{\mathbf{y}}^{s_1}\\
\bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\
\vdots\\
\bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\
\end{pmatrix}
\]
The \(t\) test on the coefficient \(\beta_k\) :
\[
\mathcal{H}^{k}_0: \beta_{k} = 0 \quad{ vs } \quad \mathcal{H}^{k}_1: \beta_{k} \neq 0
\]
is the same as the test of the difference between groups \(1\) & \(k\) :
\[
\mathcal{H}^{1,k}_0: \mu_{1} = \mu_{k} \quad{ vs } \quad \mathcal{H}^{1,k}_1: \mu_{1} \neq \mu_{k}
\]
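To compare another pair of groups directly with such a \(t\) test, one can change the reference level of the factor (a sketch, working on a copy of the data):
females2 <- females
females2$species <- relevel(females2$species, ref = "Chinstrap")  # Chinstrap becomes the reference
summary(lm(body_mass_g ~ species, data = females2))               # coefficients now contrast with Chinstrap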
One-way ANOVA - Global Test
We have: \[
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_{s_1}\\
\mu_{s_2} - \mu_{s_1}\\
\vdots\\
\mu_{s_K} - \mu_{s_1}\\
\end{pmatrix}
\quad
\text{and}
\quad
\hat{\boldsymbol{\beta}} =
\begin{pmatrix}
\bar{\mathbf{y}}^{s_1}\\
\bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\
\vdots\\
\bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\
\end{pmatrix}
\]
The global \(F\) test on the coefficients \(\beta_2, \dotsc, \beta_K\) :
\[
\mathcal{H}_0: \beta_{2} = \beta_3 = \cdots = \beta_{K} = 0 \quad{ vs } \quad \mathcal{H}_1: \exists k | \beta_{k} \neq 0
\]
is the same as the global test of mean equality of all groups :
\[
\mathcal{H}_0: \mu_{1} = \mu_2 = \cdots = \mu_{K} \quad{ vs } \quad \mathcal{H}_1: \exists k, k' | \mu_{k} \neq \mu_{k'}
\]
Species
fit <- lm(body_mass_g ~ species, data = females)
summary(fit)
Call:
lm(formula = body_mass_g ~ species, data = females)
Residuals:
Min 1Q Median 3Q Max
-827.21 -193.84 20.26 181.16 622.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 32.42 103.908 < 2e-16 ***
speciesChinstrap 158.37 57.52 2.754 0.00657 **
speciesGentoo 1310.91 48.72 26.904 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared: 0.8292, Adjusted R-squared: 0.8271
F-statistic: 393.2 on 2 and 162 DF, p-value: < 2.2e-16
One-way ANOVA - Parametrization
General Model
In general, we might write for observation \(i\) of group \(k\) : \[
y_{i,k} = \mu + \beta_k + \epsilon_i
\]
\(\mu\) would be the “grand mean” (intercept)
\(\beta_k\) the specific effect of group \(k\) (\(1 \leq k \leq K\) ).
General Model Over-Parametrization
\[
y_{i,k} = \mu + \beta_k + \epsilon_i
\]
The model has \(K+1\) parameters.
The associated regression matrix would be: \[
\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1), \dotsc, \mathbb{1}(S_K))
\]
\(\mathbf{X}\) is not of full rank : \(\mathbb{1} = \sum_{k=1}^K \mathbb{1}(S_k)\) .
Example with two groups
\[
\mathbf{X} =
\begin{pmatrix}
1 & 1 & 0 \\
\vdots & \vdots & \vdots \\
1 & 1 & 0 \\
1 & 0 & 1 \\
\vdots & \vdots & \vdots\\
1 & 0 & 1 \\
\end{pmatrix}
\]
We need some constraint on the parameters .
Constraint : Group 1 as reference
\[
y_{i,k} = \mu + \beta_k + \epsilon_i
\]
In the previous studies, we assumed \[\beta_1 = 0\]
group 1 has the “reference” mean \(\mu\)
group \(k>1\) has mean \(\mu + \beta_k\)
\(\beta_k = \mu_k - \mu_1\) is the difference between group 1 and \(k\) .
\(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_2), \dotsc, \mathbb{1}(S_K))\) is full rank.
\(\to\) This is the constraint used by default in R.
Constraint : No intercept
\[
y_{i,k} = \mu + \beta_k + \epsilon_i
\]
We can assume \[\mu = 0\]
\(\beta_k = \mu_k\) models the mean of group \(k\) directly.
\(\mathbf{X} = (\mathbb{1}(S_1), \dotsc, \mathbb{1}(S_K))\) is full rank.
estimators of \(\beta_k\) are changed, but not the estimator of \(\sigma^2\) , nor the fitted values.
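In R, this parametrization can be obtained by removing the intercept from the formula (a sketch):
lm(body_mass_g ~ 0 + species, data = females)   # the coefficients are the three group means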
Constraint : Null Sum
\[
y_{i,k} = \mu + \beta_k + \epsilon_i
\]
We can assume \[
\sum_{k = 1}^K \beta_k= 0
\qquad \text{i.e.} \qquad
\beta_K = - \sum_{k = 1}^{K-1} \beta_k
\]
\(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1) - \mathbb{1}(S_K), \dotsc, \mathbb{1}(S_{K-1}) - \mathbb{1}(S_K))\) is full rank.
estimators of \(\mu\) , \(\beta_k\) are changed, but not the estimator of \(\sigma^2\) , nor the fitted values.
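In R, this constraint corresponds to sum-to-zero contrasts (a sketch):
lm(body_mass_g ~ species, data = females,
   contrasts = list(species = contr.sum))   # intercept = mean of the group means, then deviations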
Constraint : Weighted Null Sum
\[
y_{i,k} = \mu + \beta_k + \epsilon_i
\]
We can assume \[
\sum_{k = 1}^K n_k \beta_k= 0
\qquad \text{i.e.} \qquad
\beta_K = -\frac{1}{n_K} \sum_{k = 1}^{K-1} n_k \beta_k
\]
\(\mathbf{X} = (\mathbb{1}, \mathbb{1}(S_1) - \frac{n_1}{n_K}\mathbb{1}(S_K), \dotsc, \mathbb{1}(S_{K-1}) - \frac{n_{K-1}}{n_K}\mathbb{1}(S_K))\) is full rank.
estimators of \(\mu\) , \(\beta_k\) are changed, but not the estimator of \(\sigma^2\) , nor the fitted values.
One-way ANOVA - \(F\) test
\[
y_{i,k} = \mu + \beta_k + \epsilon_i
\]
ANOVA Test : Has the grouping any effect at all ? I.e. \[
\begin{aligned}
\mathcal{H}_0&: \beta_{1} = \cdots = \beta_K = 0 &\text{vs}\\
\mathcal{H}_1&: \exists~k\in \{1, \dotsc, K\} ~|~ \beta_k \neq 0
\end{aligned}
\]
No matter which set of constraints we choose, the \(F\) statistic associated with the one-factor ANOVA is:
\[
F = \frac{
\|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (K - 1)
}{
\|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - K)
}
\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^{K - 1}_{n - K}.
\]
\(\to\) Global \(F\) test on the coefficients of the linear regression.
ANOVA = “ANalysis Of VAriance”
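A sketch computing this \(F\) statistic by hand on the females data and comparing with the value reported by R:
fit <- lm(body_mass_g ~ species, data = females)
K <- nlevels(females$species); n <- nrow(females)
ESS <- sum((fitted(fit) - mean(females$body_mass_g))^2)   # explained sum of squares
RSS <- sum(residuals(fit)^2)                              # residual sum of squares
(ESS / (K - 1)) / (RSS / (n - K))                         # 393.2, as in summary(fit)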
One-way ANOVA - Global \(F\) Test
With the R parametrization: \[
\boldsymbol{\beta} =
\begin{pmatrix}
\mu_{s_1}\\
\mu_{s_2} - \mu_{s_1}\\
\vdots\\
\mu_{s_K} - \mu_{s_1}\\
\end{pmatrix}
\quad
\text{and}
\quad
\hat{\boldsymbol{\beta}} =
\begin{pmatrix}
\bar{\mathbf{y}}^{s_1}\\
\bar{\mathbf{y}}^{s_2} - \bar{\mathbf{y}}^{s_1}\\
\vdots\\
\bar{\mathbf{y}}^{s_K} - \bar{\mathbf{y}}^{s_1}\\
\end{pmatrix}
\]
The global \(F\) test on the coefficients \(\beta_2, \dotsc, \beta_K\) is:
\[
\mathcal{H}_0: \beta_{2} = \beta_3 = \cdots = \beta_{K} = 0 \quad{ vs } \quad \mathcal{H}_1: \exists k | \beta_{k} \neq 0
\]
and is the same as the global test of mean equality of all groups:
\[
\mathcal{H}_0: \mu_{1} = \mu_2 = \cdots = \mu_{K} \quad{ vs } \quad \mathcal{H}_1: \exists k, k' | \mu_{k} \neq \mu_{k'}
\]
Species
fit <- lm(body_mass_g ~ species, data = females)
summary(fit)
Call:
lm(formula = body_mass_g ~ species, data = females)
Residuals:
Min 1Q Median 3Q Max
-827.21 -193.84 20.26 181.16 622.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 32.42 103.908 < 2e-16 ***
speciesChinstrap 158.37 57.52 2.754 0.00657 **
speciesGentoo 1310.91 48.72 26.904 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 277 on 162 degrees of freedom
Multiple R-squared: 0.8292, Adjusted R-squared: 0.8271
F-statistic: 393.2 on 2 and 162 DF, p-value: < 2.2e-16
One-way ANOVA - Table
\[
F = \frac{
\|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (K - 1)
}{
\|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - K)
}
=
\frac{
ESS / (K - 1)
}{
RSS / (n - K)
}
\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^{K - 1}_{n - K}.
\]
“Anova table” :
|           | Df        | Sum Sq  | Mean Sq           | F value                                 | p value |
|-----------|-----------|---------|-------------------|-----------------------------------------|---------|
| factor    | \(K - 1\) | \(ESS\) | \(ESS / (K - 1)\) | \(\frac{ESS / (K - 1)}{RSS / (n - K)}\) | \(p\)   |
| residuals | \(n - K\) | \(RSS\) | \(RSS / (n - K)\) |                                         |         |
| total     | \(n - 1\) | \(TSS\) |                   |                                         |         |
Species
fit <- lm(body_mass_g ~ species, data = females)
anova(fit)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 60350016 30175008 393.25 < 2.2e-16 ***
Residuals 162 12430757 76733
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Species
We keep all the penguins of any species and any sex, and remove missing data.
penguins_nona <- na.omit(penguins)
kable(penguins_nona[c(1, 2, 151, 152, 271, 272), ])
| species   | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    | year |
|-----------|-----------|----------------|---------------|-------------------|-------------|--------|------|
| Adelie    | Torgersen | 39.1           | 18.7          | 181               | 3750        | male   | 2007 |
| Adelie    | Torgersen | 39.5           | 17.4          | 186               | 3800        | female | 2007 |
| Gentoo    | Biscoe    | 47.6           | 14.5          | 215               | 5400        | male   | 2007 |
| Gentoo    | Biscoe    | 46.5           | 13.5          | 210               | 4550        | female | 2007 |
| Chinstrap | Dream     | 45.2           | 17.8          | 198               | 3950        | female | 2007 |
| Chinstrap | Dream     | 46.1           | 18.2          | 178               | 3250        | female | 2007 |
Species
Question : what is the effect of both species and sex on the body mass ?
par(mfrow = c(1, 2), mar = c(5, 4, 2, 2) + 0.1)
plot(penguins_nona$body_mass_g,               ## scatter plot
     pch = as.numeric(penguins_nona$species), ## point type = species
     col = penguins_nona$sex)                 ## color = sex
boxplot(body_mass_g ~ sex + species, data = penguins_nona,  ## box plot
        col = c("grey", "red"))
Model
Observation \(i\) of group \(k\) in factor 1 and group \(l\) in factor 2 is written as: \[
y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i
\]
\(\mu\) the “grand mean” (intercept)
\(\alpha_k\) the specific effect of group \(k\) in factor 1 (\(1 \leq k \leq K\) ).
\(\beta_l\) the specific effect of group \(l\) in factor 2 (\(1 \leq l \leq L\) ).
\(\gamma_{kl}\) the interaction effect of groups \(k\) and \(l\) .
The full model has regression matrix: \[
{\small
\mathbf{X} =
(\mathbb{1},
\mathbb{1}_{S_1}, \dotsc, \mathbb{1}_{S_K},
\mathbb{1}_{T_1}, \dotsc, \mathbb{1}_{T_L},
\mathbb{1}_{S_1, T_1}, \dotsc, \mathbb{1}_{S_K, T_L}
)
}
\]
\(\to\) \(1 + K + L + KL\) predictors. \(\to\) not identifiable !
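A sketch checking this rank deficiency on the penguins data (\(K = 3\) species, \(L = 2\) sexes):
cell <- interaction(penguins_nona$species, penguins_nona$sex)      # the K x L cells
X_full <- cbind(1,
                model.matrix(~ 0 + species, data = penguins_nona), # K columns
                model.matrix(~ 0 + sex,     data = penguins_nona), # L columns
                model.matrix(~ 0 + cell))                          # K x L columns
c(ncol(X_full), qr(X_full)$rank)                                   # 1 + K + L + KL = 12 columns, but rank KL = 6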
Constraint : Groups 1 as reference
\[
y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i
\]
Assume, for all \(1\leq k \leq K\) and \(1\leq l \leq L\) : \[
\alpha_1 = 0,
\qquad \beta_1 = 0,
\qquad \gamma_{1,l} = 0,
\qquad \gamma_{k,1} = 0.
\]
\(y_{i,1,1}\) has mean \(\mu\) (reference)
\(y_{i,1,l}\) has mean \(\mu + \beta_l\)
\(y_{i,k,1}\) has mean \(\mu + \alpha_k\)
\(y_{i,k,l}\) has mean \(\mu + \alpha_k + \beta_l + \gamma_{k,l}\) (\(k,l>1\) )
\(1 + K + L\) constraints \(\to\) \(KL\) free parameters
\(\mathbf{X} = (\mathbb{1}, \mathbb{1}_{S_2}, \dotsc, \mathbb{1}_{S_K}, \mathbb{1}_{T_2}, \dotsc, \mathbb{1}_{T_L}, \mathbb{1}_{S_2, T_2}, \dotsc, \mathbb{1}_{S_K, T_L})\) full
\(\to\) This is the constraint used by default in R.
Species and Sex
fit <- lm(body_mass_g ~ species * sex, data = penguins_nona)
summary(fit)
Call:
lm(formula = body_mass_g ~ species * sex, data = penguins_nona)
Residuals:
Min 1Q Median 3Q Max
-827.21 -213.97 11.03 206.51 861.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.21 93.030 < 2e-16 ***
speciesChinstrap 158.37 64.24 2.465 0.01420 *
speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
sexmale 674.66 51.21 13.174 < 2e-16 ***
speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 309.4 on 327 degrees of freedom
Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
Other constraint
\[
y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i
\]
Interactions only : for all \(1\leq k \leq K\) and \(1\leq l \leq L\) : \[
\mu = 0,\qquad
\alpha_k = 0,
\qquad \beta_l = 0.
\]
Null sum : \[
\sum_{k=1}^K\alpha_k = 0,
\qquad \sum_{l=1}^L\beta_l = 0,
\qquad \sum_{k=1}^K\gamma_{kl} = 0,
\qquad \sum_{l=1}^L\gamma_{kl} = 0.
\]
\(\to\) changes the estimation of the coefficients, but not of the variance, nor the predicted values.
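A sketch comparing the default (reference group) constraint with sum-to-zero contrasts: the coefficients differ, but the fitted values and residual variance do not.
fit_ref <- lm(body_mass_g ~ species * sex, data = penguins_nona)
fit_sum <- lm(body_mass_g ~ species * sex, data = penguins_nona,
              contrasts = list(species = contr.sum, sex = contr.sum))
coef(fit_ref); coef(fit_sum)                  # different coefficients
all.equal(fitted(fit_ref), fitted(fit_sum))   # TRUE: same fitted values
c(sigma(fit_ref), sigma(fit_sum))             # same residual standard error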
Nested F Test - Reminder
Recall that (from CM5), in general, to test: \[
\begin{aligned}
\mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\
\mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0
\end{aligned}
\]
We can use the statistic: \[
F = \frac{
\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0)
}{
\|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p)
}
= \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS}
\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^{p - p_0}_{n - p}.
\]
Writing \(\mathbf{X}_{p_0} = (\mathbf{X}_1, \dotsc, \mathbf{X}_{p_0})\) , this is the same as testing: \[
\begin{aligned}
\mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{p_0}}\mathbf{y} = \mathbf{P}^{\mathbf{X}}\mathbf{y} &\text{vs}\\
\mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{p_0}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}}\mathbf{y}
\end{aligned}
\]
Nested F Test - General
Take any two nested subsets of indexes \(\eta_1 \subset \eta_2\) , and write: \[
\begin{aligned}
\mathbf{X}_{\eta_k} &= (\mathbf{X}_i)_{i\in\eta_k}
& \mathbf{P}^{\mathbf{X}_{\eta_k}}\mathbf{y} &= \hat{\mathbf{y}}_{\eta_k}
& \mathbf{P}^{\mathbf{X}}\mathbf{y} &= \hat{\mathbf{y}}
\end{aligned}
\]
To test: \[
\begin{aligned}
\mathcal{H}_0&: \text{"Larger model 2 does not bring any information compared to model 1"}\\
\mathcal{H}_1&: \text{"Larger model 2 does bring some information compared to model 1"}
\end{aligned}
\]
i.e. \[
\begin{aligned}
\mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\eta_1}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\eta_2}}\mathbf{y} &\text{v.s.}&&
\mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\eta_1}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\eta_2}}\mathbf{y}
\end{aligned}
\]
We can use: \[
F
= \frac{
\|\hat{\mathbf{y}}_{\eta_2} - \hat{\mathbf{y}}_{\eta_1}\|^2 / (|\eta_2| - |\eta_1|)
}{
\|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p)
}
= \frac{
1
}{
\hat{\sigma}^2
}
(RSS_{\eta_1} - RSS_{\eta_2}) / (|\eta_2| - |\eta_1|)
\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^{|\eta_2| - |\eta_1|}_{n - p}.
\]
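In R, such a nested \(F\) test is obtained by passing the two nested fits to anova() (a sketch on the penguins data):
fit1 <- lm(body_mass_g ~ species + sex, data = penguins_nona)   # smaller model
fit2 <- lm(body_mass_g ~ species * sex, data = penguins_nona)   # larger model with interaction
anova(fit1, fit2)                                               # F test of the interaction terms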
Anova model
Model : \[
y_{i,k,l} = \mu + \alpha_k + \beta_l + \gamma_{kl} + \epsilon_i
\] with, for all \(1\leq k \leq K\) and \(1\leq l \leq L\) : \[
\alpha_1 = 0,
\qquad \beta_1 = 0,
\qquad \gamma_{1,l} = 0,
\qquad \gamma_{k,1} = 0.
\] (\(\to KL\) free parameters in total.)
Tests :
Is factor 1 useful ?
Is factor 2 useful ?
Is the interaction between factor 1 and 2 useful ?
Nested F Test - Anova - Factor 1
To test: \[
\begin{aligned}
\mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} &\text{v.s.}&&
\mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y}
\end{aligned}
\] i.e. \(\mathcal{H}_0\) : “factor 1 does not bring any information compared to intercept alone”.
We can use: \[
\begin{aligned}
F
&= \frac{n - KL}{K - 1}\frac{RSS_{\mu} - RSS_{\mu, \alpha}}{RSS}\\
&= \frac{R(\alpha | \mu) / (K - 1)}{\hat{\sigma}^2}\\
&\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^{K - 1}_{n - KL}.
\end{aligned}
\]
Nested F Test - Anova - Factor 2
To test: \[
\begin{aligned}
\mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha, \beta}}\mathbf{y} &\text{v.s.}&&
\mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y}
\end{aligned}
\] i.e. \(\mathcal{H}_0\) : “factor 2 does not bring any information compared to factor 1”.
We can use: \[
\begin{aligned}
F
&= \frac{n - KL}{L - 1}\frac{RSS_{\mu, \alpha} - RSS_{\mu, \alpha, \beta}}{RSS}\\
&= \frac{R(\beta | \mu, \alpha) / (L - 1)}{\hat{\sigma}^2}\\
&\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^{L - 1}_{n - KL}.
\end{aligned}
\]
Nested F Test - Anova - Interaction
To test: \[
\begin{aligned}
\mathcal{H}_0&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} = \mathbf{P}^{\mathbf{X}_{\mu,\alpha, \beta,\gamma}}\mathbf{y} &\text{v.s.}&&
\mathcal{H}_1&: \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta}}\mathbf{y} \neq \mathbf{P}^{\mathbf{X}_{\mu,\alpha,\beta,\gamma}}\mathbf{y}
\end{aligned}
\] i.e. \(\mathcal{H}_0\) : “Interactions do not bring any information compared to factors 1 and 2”.
We can use: \[
\begin{aligned}
F
&= \frac{n - KL}{(L - 1)(K - 1)}\frac{RSS_{\mu, \alpha, \beta} - RSS_{\mu, \alpha, \beta, \gamma}}{RSS}\\
&= \frac{R(\gamma | \mu, \alpha, \beta) / ((K - 1)(L - 1))}{\hat{\sigma}^2}\\
&\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^{(K - 1)(L - 1)}_{n - KL}.
\end{aligned}
\]
Two-way ANOVA - Type I Table
“Type I” Anova table:
|              | Df                 | Sum Sq                                | Mean Sq                                    | F value                                                      | p value |
|--------------|--------------------|---------------------------------------|--------------------------------------------|--------------------------------------------------------------|---------|
| factor 1     | \(K - 1\)          | \(R(\alpha \mid \mu)\)                | \(R(\alpha \mid \mu) / df\)                | \(\hat{\sigma}^{-2} R(\alpha \mid \mu) / df\)                | \(p\)   |
| factor 2     | \(L - 1\)          | \(R(\beta \mid \mu, \alpha)\)         | \(R(\beta \mid \mu, \alpha) / df\)         | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\)         | \(p\)   |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\)   |
| residuals    | \(n - KL\)         | \(RSS\)                               | \(\hat{\sigma}^2 = RSS / df\)              |                                                              |         |
Species and Sex - Type I
fit <- lm(body_mass_g ~ species * sex, data = penguins_nona)
anova(fit)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 758.358 < 2.2e-16 ***
sex 1 37090262 37090262 387.460 < 2.2e-16 ***
species:sex 2 1676557 838278 8.757 0.0001973 ***
Residuals 327 31302628 95727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sex and Species - Type I
fit <- lm(body_mass_g ~ sex * species, data = penguins_nona)
anova(fit)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 38878897 38878897 406.145 < 2.2e-16 ***
species 2 143401584 71700792 749.016 < 2.2e-16 ***
sex:species 2 1676557 838278 8.757 0.0001973 ***
Residuals 327 31302628 95727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Two-way ANOVA - Type I Table
“Type I” Anova table:
|              | Df                 | Sum Sq                                | Mean Sq                                    | F value                                                      | p value |
|--------------|--------------------|---------------------------------------|--------------------------------------------|--------------------------------------------------------------|---------|
| factor 1     | \(K - 1\)          | \(R(\alpha \mid \mu)\)                | \(R(\alpha \mid \mu) / df\)                | \(\hat{\sigma}^{-2} R(\alpha \mid \mu) / df\)                | \(p\)   |
| factor 2     | \(L - 1\)          | \(R(\beta \mid \mu, \alpha)\)         | \(R(\beta \mid \mu, \alpha) / df\)         | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\)         | \(p\)   |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\)   |
| residuals    | \(n - KL\)         | \(RSS\)                               | \(\hat{\sigma}^2 = RSS / df\)              |                                                              |         |
Attention : The order of factors matters !
Two-way ANOVA - Type II Table
“Type II” Anova table:
|              | Df                 | Sum Sq                                | Mean Sq                                    | F value                                                      | p value |
|--------------|--------------------|---------------------------------------|--------------------------------------------|--------------------------------------------------------------|---------|
| factor 1     | \(K - 1\)          | \(R(\alpha \mid \mu, \beta)\)         | \(R(\alpha \mid \mu, \beta) / df\)         | \(\hat{\sigma}^{-2} R(\alpha \mid \mu, \beta) / df\)         | \(p\)   |
| factor 2     | \(L - 1\)          | \(R(\beta \mid \mu, \alpha)\)         | \(R(\beta \mid \mu, \alpha) / df\)         | \(\hat{\sigma}^{-2} R(\beta \mid \mu, \alpha) / df\)         | \(p\)   |
| interactions | \((K - 1)(L - 1)\) | \(R(\gamma \mid \mu, \alpha, \beta)\) | \(R(\gamma \mid \mu, \alpha, \beta) / df\) | \(\hat{\sigma}^{-2} R(\gamma \mid \mu, \alpha, \beta) / df\) | \(p\)   |
| residuals    | \(n - KL\)         | \(RSS\)                               | \(\hat{\sigma}^2 = RSS / df\)              |                                                              |         |
Symmetric in factor 1 and 2.
Sex and Species - Type II
fit <- lm(body_mass_g ~ sex * species, data = penguins_nona)
car::Anova(fit)
Anova Table (Type II tests)
Response: body_mass_g
Sum Sq Df F value Pr(>F)
sex 37090262 1 387.460 < 2.2e-16 ***
species 143401584 2 749.016 < 2.2e-16 ***
sex:species 1676557 2 8.757 0.0001973 ***
Residuals 31302628 327
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Bill Length - Type II
fit <- lm(bill_length_mm ~ sex * species, data = penguins_nona)
car::Anova(fit)
Anova Table (Type II tests)
Response: bill_length_mm
Sum Sq Df F value Pr(>F)
sex 1135.7 1 211.8066 <2e-16 ***
species 6975.6 2 650.4786 <2e-16 ***
sex:species 24.5 2 2.2841 0.1035
Residuals 1753.3 327
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Bill Length - Type II - No interaction
fit <- lm(bill_length_mm ~ sex + species, data = penguins_nona)
car::Anova(fit)
Anova Table (Type II tests)
Response: bill_length_mm
Sum Sq Df F value Pr(>F)
sex 1135.7 1 210.17 < 2.2e-16 ***
species 6975.6 2 645.44 < 2.2e-16 ***
Residuals 1777.8 329
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Penguins
Question : what is the relationship between bill length and depth, controlling for the species ?
One line fits all
fit_all <- lm(bill_length_mm ~ bill_depth_mm, data = females)
summary(fit_all)
Call:
lm(formula = bill_length_mm ~ bill_depth_mm, data = females)
Residuals:
Min 1Q Median 3Q Max
-11.0745 -3.0689 -0.7609 2.6776 17.5034
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 61.2214 3.1966 19.152 < 2e-16 ***
bill_depth_mm -1.1643 0.1935 -6.018 1.13e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.449 on 163 degrees of freedom
Multiple R-squared: 0.1818, Adjusted R-squared: 0.1768
F-statistic: 36.22 on 1 and 163 DF, p-value: 1.128e-08
One line fits all
\[
y_{i} = \mu + \beta x_{i} + \epsilon_i
\]
Take species into account
fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
summary(fit_species)
Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species, data = females)
Residuals:
Min 1Q Median 3Q Max
-4.9886 -1.4503 -0.0603 1.4476 11.2797
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.0447 3.9377 6.360 1.99e-09 ***
bill_depth_mm 0.6930 0.2230 3.108 0.00222 **
speciesChinstrap 9.3393 0.4648 20.092 < 2e-16 ***
speciesGentoo 10.6515 0.8510 12.516 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.238 on 161 degrees of freedom
Multiple R-squared: 0.7954, Adjusted R-squared: 0.7916
F-statistic: 208.7 on 3 and 161 DF, p-value: < 2.2e-16
Take species into account
\[
y_{i,k} = \mu + \alpha_k + \beta x_{i} + \epsilon_i
\]
Take species into account
fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
car::Anova(fit_species)
Anova Table (Type II tests)
Response: bill_length_mm
Sum Sq Df F value Pr(>F)
bill_depth_mm 48.41 1 9.6624 0.002224 **
species 2419.64 2 241.4533 < 2.2e-16 ***
Residuals 806.70 161
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interactions
fit_int <- lm(bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species,
              data = females)
fit_int
Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species,
data = females)
Coefficients:
(Intercept) bill_depth_mm
31.1671 0.3456
speciesChinstrap speciesGentoo
-2.5348 -8.8729
bill_depth_mm:speciesChinstrap bill_depth_mm:speciesGentoo
0.6745 1.2887
Interactions
\[
y_{i,k} = \mu + \alpha_k + \beta_k x_{i} + \epsilon_i
\]
Interactions
fit_int <- lm(bill_length_mm ~ bill_depth_mm + species + bill_depth_mm:species,
              data = females)
car::Anova(fit_int)
Anova Table (Type II tests)
Response: bill_length_mm
Sum Sq Df F value Pr(>F)
bill_depth_mm 48.41 1 9.8428 0.002032 **
species 2419.64 2 245.9610 < 2.2e-16 ***
bill_depth_mm:species 24.62 2 2.5029 0.085070 .
Residuals 782.08 159
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ancova - One Regressor - Model
For any observation \(1 \leq i \leq n\) in group \(1 \leq k \leq K\) , we write: \[
y_{i,k} = \mu + \alpha_k + (\beta + \gamma_k) \times x_{i} + \epsilon_i
\qquad
\epsilon_i\sim\mathcal{N}(0, \sigma^2)~iid
\]
Identifiability : \(\alpha_1 = \gamma_1 = 0\) .
\[
\text{With $K=2$:}\qquad
\mathbf{X} =
\begin{pmatrix}
1 & 0 & x_1 & 0\\
\vdots & \vdots & \vdots & \vdots\\
1 & 0 & x_{n_1} & 0 &\\
1 & 1 & x_{n_1 + 1} & x_{n_1 + 1} \\
\vdots & \vdots & \vdots & \vdots\\
1 & 1 & x_{n_1 + n_2} & x_{n_1 + n_2}\\
\end{pmatrix}
\]
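R builds an equivalent design matrix (same column space, columns in a different order) directly from a formula with an interaction between the numeric predictor and the factor; a sketch:
fit_int <- lm(bill_length_mm ~ bill_depth_mm * species, data = females)
head(model.matrix(fit_int))   # intercept, bill_depth_mm, species indicators, and their products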
Ancova
F tests of effects and interactions can be done as before
More than one predictor ? \(\to\) not a problem
More than one factor ? \(\to\) not a problem
Attention : as models become more complex, they are more difficult to interpret and test.
Attention : diagnostics on the residuals are still needed.
Bill length and depth
fit_species <- lm(bill_length_mm ~ bill_depth_mm + species, data = females)
par(mfrow = c(2, 2))
plot(fit_species)