Multiple Regression - Model Selection

Paul Bastide

2025-03-05

What we know

Gaussian Model

Model:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

  • \(\mathbf{y}\) random vector of \(n\) responses
  • \(\mathbf{X}\) non-random \(n\times p\) matrix of predictors
  • \(\boldsymbol{\epsilon}\) random vector of \(n\) errors
  • \(\boldsymbol{\beta}\) non-random, unknown vector of \(p\) coefficients

Assumptions:

  • (H1) \(\operatorname{rank}(\mathbf{X}) = p\)
  • (H2) \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)

Distribution of the estimators

Distribution of \(\hat{\boldsymbol{\beta}}\): \[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \quad \text{and} \quad \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \sim \mathcal{T}_{n-p}. \]

Distribution of \(\hat{\sigma}^2\): \[ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}. \]
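As a sanity check, here is a minimal Monte Carlo sketch of the Student distribution of a studentized coefficient (illustrative dimensions and a hypothetical design matrix, not from the original slides):

set.seed(42)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), ncol = p - 1))
beta <- c(-1, 3, -1); sigma <- 2
## Studentized coefficient, replicated over many datasets
t_stats <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  fit <- lm(y ~ X - 1)
  (coef(fit)[2] - beta[2]) / summary(fit)$coefficients[2, "Std. Error"]
})
## Empirical quantiles should be close to those of T_{n-p}
quantile(t_stats, c(0.05, 0.5, 0.95))
qt(c(0.05, 0.5, 0.95), df = n - p)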

Student t tests

\[ \text{Hypothesis:} \qquad\qquad \mathcal{H}_0: \beta_k = 0 \quad \text{vs} \quad \mathcal{H}_1: \beta_k \neq 0 \]

\[ \text{Test Statistic:} \qquad T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2_k}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n-p} \quad \text{with} \quad \hat{\sigma}^2_k = \hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk} \]

\[ \text{Reject Region:} \qquad \mathbb{P}[\text{Reject} | \mathcal{H}_0 \text{ true}] \leq \alpha \iff \mathcal{R}_\alpha = \left\{ t \in \mathbb{R} ~|~ |t| \geq t_{n-p}(1 - \alpha/2) \right\} \]

\[ \text{p value:} \qquad p = \mathbb{P}_{\mathcal{H}_0}[|T_k| > |T_k^{obs}|] \]

Problems with the \(t\)-test

Simulated Dataset

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

set.seed(1289)

## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Noise
eps <- rnorm(n, mean = 0, sd = 2)

## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

## Useless predictor
x_junk_1 <- runif(n, min = -2, max = 2)

Simulated Dataset - Fit

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

fit <- lm(y_sim ~ x_1 + x_2 + x_junk_1)
summary(fit)

Call:
lm(formula = y_sim ~ x_1 + x_2 + x_junk_1)

Residuals:
   Min     1Q Median     3Q    Max 
-5.275 -1.453 -0.028  1.371  4.893 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.82789    0.08814  -9.393   <2e-16 ***
x_1          3.01467    0.07546  39.952   <2e-16 ***
x_2         -0.95917    0.07469 -12.842   <2e-16 ***
x_junk_1     0.01534    0.07791   0.197    0.844    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.957 on 496 degrees of freedom
Multiple R-squared:  0.7905,    Adjusted R-squared:  0.7893 
F-statistic:   624 on 3 and 496 DF,  p-value: < 2.2e-16
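The quantities in this table can be recovered by hand. A small sketch, using the fact that summary(fit)$cov.unscaled stores \((\mathbf{X}^T\mathbf{X})^{-1}\):

XtXinv <- summary(fit)$cov.unscaled  # (X^T X)^{-1}
sigma2_hat <- summary(fit)$sigma^2   # hat(sigma)^2
t_stats <- coef(fit) / sqrt(sigma2_hat * diag(XtXinv))
p_vals <- 2 * pt(abs(t_stats), df = fit$df.residual,
                 lower.tail = FALSE)
## Should match summary(fit)$coefficients[, c("t value", "Pr(>|t|)")]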

Simulated Dataset

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

## Other Useless Predictors
p_junk <- 100
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Data frame
data_junk <- data.frame(y_sim = y_sim,
                        x_junk = x_junk)

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
summary(fit)

Call:
lm(formula = y_sim ~ -1 + ., data = data_junk)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.8683  -3.0605  -0.2674   2.3554  11.0821 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
x_junk.1    0.2129516  0.1828010   1.165   0.2447  
x_junk.2   -0.1075770  0.1919617  -0.560   0.5755  
x_junk.3    0.4322452  0.1945466   2.222   0.0269 *
x_junk.4    0.1471909  0.1886285   0.780   0.4357  
x_junk.5   -0.1110445  0.1848774  -0.601   0.5484  
x_junk.6    0.1276583  0.1829333   0.698   0.4857  
x_junk.7    0.0079412  0.1944593   0.041   0.9674  
x_junk.8   -0.1126224  0.1854571  -0.607   0.5440  
x_junk.9   -0.0560611  0.1918247  -0.292   0.7702  
x_junk.10  -0.3215171  0.1883253  -1.707   0.0886 .
x_junk.11  -0.0875118  0.1891359  -0.463   0.6438  
x_junk.12  -0.0822697  0.1893373  -0.435   0.6641  
x_junk.13   0.0862599  0.1874289   0.460   0.6456  
x_junk.14  -0.1322800  0.1895130  -0.698   0.4856  
x_junk.15   0.1748496  0.1751535   0.998   0.3188  
x_junk.16  -0.2152065  0.1795766  -1.198   0.2315  
x_junk.17   0.0987989  0.1868782   0.529   0.5973  
x_junk.18  -0.0589122  0.1964012  -0.300   0.7644  
x_junk.19   0.0407544  0.1896260   0.215   0.8299  
x_junk.20   0.2788872  0.1817698   1.534   0.1257  
x_junk.21   0.0780682  0.1901043   0.411   0.6815  
x_junk.22   0.0001428  0.1936429   0.001   0.9994  
x_junk.23  -0.2146891  0.1860330  -1.154   0.2492  
x_junk.24   0.1135748  0.1794805   0.633   0.5272  
x_junk.25   0.1797087  0.1849661   0.972   0.3318  
x_junk.26  -0.0095168  0.1885643  -0.050   0.9598  
x_junk.27  -0.0697160  0.1782409  -0.391   0.6959  
x_junk.28  -0.1294379  0.1832707  -0.706   0.4804  
x_junk.29  -0.0138295  0.1830424  -0.076   0.9398  
x_junk.30   0.2922539  0.1918362   1.523   0.1284  
x_junk.31   0.1576348  0.1867874   0.844   0.3992  
x_junk.32   0.0598461  0.1892184   0.316   0.7520  
x_junk.33   0.2018882  0.1812726   1.114   0.2661  
x_junk.34  -0.0116431  0.1840672  -0.063   0.9496  
x_junk.35   0.0713938  0.1833419   0.389   0.6972  
x_junk.36   0.0190741  0.1798446   0.106   0.9156  
x_junk.37  -0.0238114  0.1885668  -0.126   0.8996  
x_junk.38  -0.1867089  0.1838147  -1.016   0.3104  
x_junk.39  -0.0777722  0.1862288  -0.418   0.6765  
x_junk.40  -0.0434927  0.1859361  -0.234   0.8152  
x_junk.41   0.0814161  0.1904542   0.427   0.6693  
x_junk.42   0.1292109  0.1870557   0.691   0.4901  
x_junk.43  -0.3022911  0.1835507  -1.647   0.1004  
x_junk.44   0.0831695  0.1823006   0.456   0.6485  
x_junk.45   0.1332532  0.1869381   0.713   0.4764  
x_junk.46  -0.1348223  0.1838432  -0.733   0.4638  
x_junk.47  -0.0218524  0.1972836  -0.111   0.9119  
x_junk.48   0.2795708  0.1854204   1.508   0.1324  
x_junk.49  -0.1531037  0.1798952  -0.851   0.3952  
x_junk.50   0.0520440  0.1852843   0.281   0.7789  
x_junk.51   0.1682396  0.1835292   0.917   0.3599  
x_junk.52  -0.1190719  0.1853241  -0.643   0.5209  
x_junk.53  -0.0989078  0.1857934  -0.532   0.5948  
x_junk.54  -0.1325051  0.1840157  -0.720   0.4719  
x_junk.55   0.1006848  0.1940173   0.519   0.6041  
x_junk.56   0.2034148  0.1764952   1.153   0.2498  
x_junk.57   0.1153209  0.1766600   0.653   0.5143  
x_junk.58  -0.0463597  0.1772648  -0.262   0.7938  
x_junk.59   0.0240049  0.1889276   0.127   0.8990  
x_junk.60   0.3726537  0.1909670   1.951   0.0517 .
x_junk.61   0.0250128  0.1864171   0.134   0.8933  
x_junk.62  -0.2615601  0.1901999  -1.375   0.1698  
x_junk.63   0.0774898  0.1871022   0.414   0.6790  
x_junk.64   0.0545841  0.1966734   0.278   0.7815  
x_junk.65   0.0725361  0.1882312   0.385   0.7002  
x_junk.66  -0.1206593  0.1897174  -0.636   0.5251  
x_junk.67  -0.3336932  0.1818668  -1.835   0.0673 .
x_junk.68   0.0981128  0.1912111   0.513   0.6082  
x_junk.69  -0.0153031  0.1858223  -0.082   0.9344  
x_junk.70   0.1886025  0.1830855   1.030   0.3036  
x_junk.71  -0.0695791  0.1906992  -0.365   0.7154  
x_junk.72  -0.1030034  0.1840569  -0.560   0.5760  
x_junk.73   0.0084075  0.1823678   0.046   0.9633  
x_junk.74   0.2999893  0.1783620   1.682   0.0934 .
x_junk.75   0.3355257  0.1843357   1.820   0.0695 .
x_junk.76  -0.1894886  0.1956830  -0.968   0.3335  
x_junk.77  -0.1183169  0.1843228  -0.642   0.5213  
x_junk.78  -0.1651666  0.1832443  -0.901   0.3679  
x_junk.79   0.0947620  0.1867396   0.507   0.6121  
x_junk.80   0.2538003  0.1910030   1.329   0.1847  
x_junk.81   0.0711922  0.1917948   0.371   0.7107  
x_junk.82  -0.0098761  0.2003307  -0.049   0.9607  
x_junk.83   0.0102922  0.1798480   0.057   0.9544  
x_junk.84   0.2163918  0.1807194   1.197   0.2319  
x_junk.85  -0.2514098  0.1881750  -1.336   0.1823  
x_junk.86   0.0181126  0.1832811   0.099   0.9213  
x_junk.87  -0.0907617  0.1809516  -0.502   0.6162  
x_junk.88   0.0354845  0.1869592   0.190   0.8496  
x_junk.89  -0.3283331  0.1867203  -1.758   0.0794 .
x_junk.90  -0.2284999  0.1895045  -1.206   0.2286  
x_junk.91  -0.0575054  0.1800322  -0.319   0.7496  
x_junk.92   0.0612578  0.1799226   0.340   0.7337  
x_junk.93  -0.0192369  0.1866956  -0.103   0.9180  
x_junk.94  -0.0615478  0.1828148  -0.337   0.7365  
x_junk.95   0.3675294  0.1778881   2.066   0.0395 *
x_junk.96  -0.0015247  0.1899413  -0.008   0.9936  
x_junk.97  -0.1364003  0.1779889  -0.766   0.4439  
x_junk.98   0.4422968  0.1888431   2.342   0.0197 *
x_junk.99   0.2366755  0.1815971   1.303   0.1932  
x_junk.100  0.0010360  0.1905522   0.005   0.9957  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.311 on 400 degrees of freedom
Multiple R-squared:  0.1889,    Adjusted R-squared:  -0.01383 
F-statistic: 0.9318 on 100 and 400 DF,  p-value: 0.6596

Simulated Dataset - Fit - Junk

p_values <- summary(fit)$coefficients[, 4]
hist(p_values, col = "lightblue", breaks = 20)
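Under \(\mathcal{H}_0\), each p value is uniform on \([0, 1]\), so the histogram should look roughly flat, with about 5% of the junk predictors falling below the 0.05 threshold.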

Simulated Dataset - Fit - Junk

sum(p_values <= 0.05)
[1] 3
summary(fit)$coefficients[p_values <= 0.05, ]
           Estimate Std. Error  t value   Pr(>|t|)
x_junk.3  0.4322452  0.1945466 2.221807 0.02685525
x_junk.95 0.3675294  0.1778881 2.066071 0.03946512
x_junk.98 0.4422968  0.1888431 2.342138 0.01966334


Reject Region: \(\mathbb{P}[\text{Reject} | \mathcal{H}_0 \text{ true}] \leq \alpha\)

Just by chance, about 5% of the tests on true null coefficients come out significant: these are false discoveries.
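A standard remedy is a multiple-testing correction. A minimal sketch with p.adjust (Bonferroni here; other methods exist):

## Bonferroni: multiply each p value by the number of tests
## (capped at 1) before thresholding at alpha
p_adj <- p.adjust(p_values, method = "bonferroni")
sum(p_adj <= 0.05)
## 0 here: the smallest raw p value (0.0197) times 100 is ~1.97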

Joint Distribution of the Coefficients - \(\sigma^2\) unknown

Joint distribution of \(\hat{\boldsymbol{\beta}}\)

When \(\sigma^2\) is known:

\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right). \]

When \(\sigma^2\) is unknown: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]

with \(\mathcal{F}^p_{n-p}\) the Fisher distribution.

Reminder: Fisher Distribution

Let \(U_1 \sim \chi^2_{p_1}\) and \(U_2 \sim \chi^2_{p_2}\), \(U_1\) and \(U_2\) independent. Then \[ F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2} \]
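This characterization is easy to check empirically. A small simulation sketch (arbitrary degrees of freedom):

## Ratio of independent scaled chi-squares vs F quantiles
set.seed(42)
p_1 <- 5; p_2 <- 30
f_sim <- (rchisq(1e4, df = p_1) / p_1) / (rchisq(1e4, df = p_2) / p_2)
quantile(f_sim, c(0.5, 0.95))  # should be close to ...
qf(c(0.5, 0.95), p_1, p_2)     # ... the F(5, 30) quantiles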

Joint distribution of \(\hat{\boldsymbol{\beta}}\) - Proof - Hints

Let’s show: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]

Hints:
\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \quad \text{and} \quad \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p} \]

\[ \text{If } \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}) \text{ in } \mathbb{R}^p \text{ with } \mathbf{\Sigma} \text{ invertible, then} \quad (\mathbf{X} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2_p \]

\[ \text{If } U_1 \sim \chi^2_{p_1} \text{ and } U_2 \sim \chi^2_{p_2} \text{ independent, then } \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2} \]

Joint distribution of \(\hat{\boldsymbol{\beta}}\) - Proof

\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \implies \frac{1}{\sigma^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \chi^2_{p} \]

\[ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p} \quad \text{ independent from } \hat{\boldsymbol{\beta}} \]

Hence: \[ \frac{ \frac{1}{\sigma^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) / p }{ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} / (n-p) } = \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p}. \]
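A Monte Carlo sketch of this result, using \(\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) = \|\mathbf{X}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})\|^2\) (illustrative dimensions and a hypothetical design, as before):

set.seed(42)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), ncol = p - 1))
beta <- c(-1, 3, -1)
f_sim <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = 2)
  fit <- lm(y ~ X - 1)
  sum((X %*% (coef(fit) - beta))^2) / (p * summary(fit)$sigma^2)
})
mean(f_sim > qf(0.95, p, n - p))  # should be close to 0.05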

Global Fisher F-test

Global Fisher F-test

Hypothesis: \[ \mathcal{H}_0: \beta_1 = \cdots = \beta_p = 0 \quad \text{vs} \quad \mathcal{H}_1: \exists~k ~|~ \beta_k \neq 0 \]

Test Statistic: \[ F = \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^p_{n-p} \]

Reject Region: \[ \mathbb{P}[\text{Reject} | \mathcal{H}_0 \text{ true}] \leq \alpha \iff \mathcal{R}_\alpha = \left\{ f \in \mathbb{R}_+ ~|~ f \geq f^p_{n-p}(1 - \alpha) \right\} \]

p value: \(p = \mathbb{P}_{\mathcal{H}_0}[F > F^{obs}]\)

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
f_obs <- summary(fit)$fstatistic
f_obs
     value      numdf      dendf 
  0.931791 100.000000 400.000000 
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
    value 
0.6596268 

We do not reject the null hypothesis that all the (junk) coefficients are zero (phew).

Fisher F-Test - Geometry

\[ \begin{aligned} F &= \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} = \frac{(\mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{X}\hat{\boldsymbol{\beta}}) / p}{\|\hat{\boldsymbol{\epsilon}}\|^2 / (n - p)} \\ &= \frac{\|\hat{\mathbf{y}}\|^2 / p}{\|\hat{\boldsymbol{\epsilon}}\|^2 / (n - p)} = \frac{\|\hat{\mathbf{y}}\|^2 / p}{\|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p)} \\ \end{aligned} \]

  • \(\mathcal{H}_0: \beta_1 = \cdots = \beta_p = 0\) i.e. the model is useless.

  • If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is small, then the error is small,
    and \(F\) tends to be large, i.e. we tend to reject \(\mathcal{H}_0\):
    the model is useful.

  • If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is large, then the error is large,
    and \(F\) tends to be small, i.e. we tend to not reject \(\mathcal{H}_0\):
    the model is rather useless.

Geometrical interpretation

Good model

  • \(\mathbf{y}\) not too far from \(\mathcal{M}(\mathbf{X})\).

  • \(\|\hat{\boldsymbol{\epsilon}}\|^2\) not too large compared to \(\|\hat{\mathbf{y}}\|^2\).

  • \(F\) tends to be large.

Bad model

  • \(\mathbf{y}\) almost orth. to \(\mathcal{M}(\mathbf{X})\).

  • \(\|\hat{\boldsymbol{\epsilon}}\|^2\) rather large compared to \(\|\hat{\mathbf{y}}\|^2\).

  • \(F\) tends to be small.

Simulated Dataset - True and Junk

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
[1] "(Intercept)" "x_1"         "x_2"         "x_junk.27"   "x_junk.76"  
[6] "x_junk.80"  
## f test
f_obs <- summary(fit)$fstatistic
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
value 
    0 

Nested Models Fisher F-test

Nested Models

Can we test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

i.e. decide between the full model: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

and the nested model: \[ \mathbf{y} = \mathbf{X}_0\boldsymbol{\beta}_0 + \boldsymbol{\epsilon} \]

with \[ \begin{aligned} \mathbf{X} &= (\mathbf{X}_0 ~ \mathbf{X}_1) & \operatorname{rank}(\mathbf{X}) &= p & \operatorname{rank}(\mathbf{X}_0) &= p_0 \end{aligned} \]

Geometrical interpretation

Nested Models

Idea: Compare

  • \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\), the distance between the full fit \(\hat{\mathbf{y}}\) and the nested fit \(\hat{\mathbf{y}}_0\), to

  • \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), the residual error.

Then:

  • If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) small compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then “\(\hat{\mathbf{y}} \approx \hat{\mathbf{y}}_0\)” and the null model is enough.

  • If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) large compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then the full model is useful.

Geometrical interpretation

Good model

  • \(\hat{\mathbf{y}}\) quite different from \(\hat{\mathbf{y}}_0\).

  • Full model does add information.

Bad model

  • \(\hat{\mathbf{y}}\) similar to \(\hat{\mathbf{y}}_0\).

  • Full model does not add much information.

Nested Models - F test

\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \]

  • When \(F\) is large, full model is useful.

  • When \(F\) is small, null model is enough.

  • Distribution of \(F\)?

Distribution of \(F\) - 1/2

\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \]

Let \(\mathbf{P}\) be the orthogonal projection onto \(\mathcal{M}(\mathbf{X})\) and \(\mathbf{P}_0\) the orthogonal projection onto \(\mathcal{M}(\mathbf{X}_0)\). Since \(\mathcal{M}(\mathbf{X}_0) \subset \mathcal{M}(\mathbf{X})\), we have \(\mathbf{P}_0\mathbf{P} = \mathbf{P}_0\), hence:

\[ \begin{aligned} \hat{\mathbf{y}} - \hat{\mathbf{y}}_0 &= \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{y} = \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{P}\mathbf{y}\\ &= (\mathbf{I}_n - \mathbf{P}_0) \mathbf{P}\mathbf{y} = \mathbf{P}_0^\bot \mathbf{P}\mathbf{y}\\ &\in \mathcal{M}(\mathbf{X}_0)^\bot \cap \mathcal{M}(\mathbf{X}) \end{aligned} \]

\[ \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{P}) \mathbf{y} = \mathbf{P}^\bot \mathbf{y} \in \mathcal{M}(\mathbf{X})^\bot \]

\(\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\) and \(\mathbf{y} - \hat{\mathbf{y}}\) are orthogonal, i.e. their covariance is zero, i.e. (under the Gaussian assumption) they are independent.
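A quick numerical check of this orthogonality on the simulated data from above, using e.g. the junk predictor x_junk_1 as the extra column (the model names fit_0 and fit_1 are hypothetical):

## hat(y) - hat(y)_0 and y - hat(y) should be orthogonal:
## their cross product is zero up to floating-point error
fit_0 <- lm(y_sim ~ x_1 + x_2)             # nested model
fit_1 <- lm(y_sim ~ x_1 + x_2 + x_junk_1)  # full model
crossprod(fitted(fit_1) - fitted(fit_0), residuals(fit_1))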

Distribution of \(F\) - 2/2

From Cochran’s theorem: \[ \frac{1}{\sigma^2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}^\bot\mathbf{y}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}^\bot\boldsymbol{\epsilon}\|^2 \sim \chi^2_{n-p} \]

\[ \frac{1}{\sigma^2}\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0 - \mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}_0^\bot \mathbf{P}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\|^2 \sim \chi^2_{p-p_0} \] (\(\mathbf{P}_0^\bot \mathbf{P}\) is a projection on a space of dimension \(p - p_0\))

If \(\mathcal{H}_0\) is true, then \(\mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta} = 0\), hence:

\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]

Nested F test

From Pythagoras' theorem: \[ \begin{aligned} \|\mathbf{y} - \hat{\mathbf{y}}_0\|^2 &= \|\mathbf{y} - \mathbf{P}\mathbf{y} + \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{y}\|^2 = \|\mathbf{P}^\bot\mathbf{y} + \mathbf{P}_0^\bot\mathbf{P}\mathbf{y}\|^2\\ &= \|\mathbf{P}^\bot\mathbf{y}\|^2 + \|\mathbf{P}_0^\bot\mathbf{P}\mathbf{y}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 \end{aligned} \]

i.e. \[ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}_0\|^2 - \|\mathbf{y} - \hat{\mathbf{y}}\|^2 = RSS_0 - RSS \]

Hence: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \]

Nested F test

To test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

we use the \(F\) statistic: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]
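A sketch of this statistic computed by hand from the two residual sums of squares, reusing fit_0 and fit_1 from the orthogonality check above:

rss_0 <- sum(residuals(fit_0)^2)  # RSS_0, nested model
rss_1 <- sum(residuals(fit_1)^2)  # RSS, full model
df_0 <- fit_0$df.residual; df_1 <- fit_1$df.residual
f_stat <- ((rss_0 - rss_1) / (df_0 - df_1)) / (rss_1 / df_1)
pf(f_stat, df_0 - df_1, df_1, lower.tail = FALSE)
## Should match anova(fit_0, fit_1)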

Simulated Dataset - True vs Junk

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit true and junk
fit_all <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3, data = data_all)
## Comparison
anova(fit_small, fit_all)
Analysis of Variance Table

Model 1: y_sim ~ x_1 + x_2
Model 2: y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    497 1900.5                           
2    494 1883.2  3    17.297 1.5124 0.2104

Simulated Dataset - True vs Junk

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit all
fit_all <- lm(y_sim ~ ., data = data_all)
## Comparison
anova(fit_small, fit_all)$`Pr(>F)`[2]
[1] 0.3548787

We do not reject the null hypothesis that the first two variables are enough to explain the data.

Full \(F\) test

Assuming that we have an intercept, we often test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{2} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{2, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

The associated \(F\) statistic is (\(p_0 = 1\)): \[ F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (p - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - 1}_{n - p}. \]

That is the default test returned by R.

Full \(F\) test

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)

Call:
lm(formula = y_sim ~ x_1 + x_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2523 -1.4517 -0.0302  1.3494  4.8711 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.82877    0.08794  -9.424   <2e-16 ***
x_1          3.01415    0.07534  40.008   <2e-16 ***
x_2         -0.95922    0.07462 -12.855   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.955 on 497 degrees of freedom
Multiple R-squared:  0.7905,    Adjusted R-squared:  0.7897 
F-statistic: 937.8 on 2 and 497 DF,  p-value: < 2.2e-16
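The same statistic can be recovered from \(R^2\): since \(R^2 = 1 - RSS / \|\mathbf{y} - \bar{y}\mathbb{1}\|^2\), we get \(F = \frac{R^2 / (p - 1)}{(1 - R^2) / (n - p)}\). A quick check on the fit above:

r2 <- summary(fit)$r.squared
p <- 3; n <- 500
(r2 / (p - 1)) / ((1 - r2) / (n - p))
## Close to the "937.8 on 2 and 497 DF" printed above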

One coefficient \(F\) test

If \(p_0 = p-1\), we test: \[ \begin{aligned} \mathcal{H}_0&: \beta_p = 0 &\text{vs} \qquad \mathcal{H}_1&: \beta_p \neq 0 \end{aligned} \]

The associated \(F\) statistic is (\(p_0 = p-1\)): \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_{-p}\|^2 }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{1}_{n - p}. \]

It can be shown that: \[ F = T^2 \quad \text{with} \quad T = \frac{\hat{\beta}_p}{\hat{\sigma}_p} \quad \text{and} \quad \hat{\sigma}^2_p = \hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{pp}, \] so that it is equivalent to the \(t\)-test.
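A numerical check of this equivalence, dropping e.g. x_2 from the small model:

fit_full <- lm(y_sim ~ x_1 + x_2)
fit_red  <- lm(y_sim ~ x_1)
anova(fit_red, fit_full)$F[2]                       # F statistic
summary(fit_full)$coefficients["x_2", "t value"]^2  # T^2, identical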

Summary

Exact tests

  • t-test: is one individual coefficient \(k\) zero? \[ \begin{aligned} \mathcal{H}_0&: \beta_k = 0 \\ \mathcal{H}_1&: \beta_k \neq 0 \end{aligned} \quad;\quad T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n-p} \]

Exact tests

  • Global F-test: are all coefficients zero?
    • with intercept \[ \begin{aligned} \mathcal{H}_0&: \beta_{2} = \cdots = \beta_p = 0 \\ \mathcal{H}_1&: \exists~k\in \{2, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \quad;\quad F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (p - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - 1}_{n - p} \]

    • no intercept \[ \begin{aligned} \mathcal{H}_0&: \beta_{1} = \cdots = \beta_p = 0 \\ \mathcal{H}_1&: \exists~k\in \{1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \quad;\quad F = \frac{ \|\hat{\mathbf{y}}\|^2 / p }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p}_{n - p} \]

  • Nested F-test: are the extra coefficients of the big model zero? \[ {\small \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} ; F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p} } \]

Exact tests

  • Can only test nested models.

  • Multiple testing issue.

  • How to choose relevant predictors from a large pool of predictors?

    • Marketing
    • Genetics
  • Variable Selection