When \(\sigma^2\) is unknown: \[
\frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)
\sim
\mathcal{F}^p_{n-p}
\]
with \(\mathcal{F}^p_{n-p}\) the Fisher (F) distribution with \(p\) and \(n - p\) degrees of freedom.
Reminder: Fisher Distribution
Let \(U_1 \sim \chi^2_{p_1}\) and \(U_2 \sim \chi^2_{p_2}\), \(U_1\) and \(U_2\) independent. Then \[
F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}
\]
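A quick simulation illustrates this construction (a minimal sketch; the degrees of freedom and sample size are arbitrary choices):
## Simulate the ratio of two independent scaled chi-squares
p_1 <- 3
p_2 <- 30
U_1 <- rchisq(10000, df = p_1)
U_2 <- rchisq(10000, df = p_2)
F_sim <- (U_1 / p_1) / (U_2 / p_2)
## Empirical quantile vs theoretical Fisher quantile
quantile(F_sim, 0.95)
qf(0.95, df1 = p_1, df2 = p_2)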
Joint distribution of \(\hat{\boldsymbol{\beta}}\) - Proof - Hints
Test Statistic: \[
F
= \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}}
\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^p_{n-p}
\]
\(\mathcal{H}_0: \beta_1 = \cdots = \beta_p = 0\) i.e. the model is useless.
If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is small, then the error is small,
and \(F\) tends to be large, i.e. we tend to reject \(\mathcal{H}_0\):
the model is useful.
If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is large, then the error is large,
and \(F\) tends to be small, i.e. we tend to not reject \(\mathcal{H}_0\):
the model is rather useless.
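In R, a close relative of this statistic is reported by summary.lm (a minimal sketch, assuming a fitted lm object `fit` such as the one built below; note that R's version tests the non-intercept coefficients only):
## Global F test as reported by summary.lm (sketch; `fit` is an lm object)
fstat <- summary(fit)$fstatistic        # statistic, df1, df2
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)   # p-value of the global test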
Geometrical interpretation
Good model
\(\mathbf{y}\) not too far from \(\mathcal{M}(\mathbf{X})\).
\(\|\hat{\boldsymbol{\epsilon}}\|^2\) not too large compared to \(\|\hat{\mathbf{y}}\|^2\).
\(F\) tends to be large.
Bad model
\(\mathbf{y}\) almost orthogonal to \(\mathcal{M}(\mathbf{X})\).
\(\|\hat{\boldsymbol{\epsilon}}\|^2\) rather large compared to \(\|\hat{\mathbf{y}}\|^2\).
\(F\) tends to be small.
Simulated Dataset - True and Junk
## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) is small compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then “\(\hat{\mathbf{y}} \approx \hat{\mathbf{y}}_0\)” and the null model is enough.
If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) is large compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then the full model is useful.
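This comparison of nested models is what anova() computes for two lm fits (a minimal sketch, assuming the `data_all` data frame built above; `fit0` denotes the intercept-only null model):
## Nested-model F test (sketch; assumes data_all as defined above)
fit0 <- lm(y_sim ~ 1, data = data_all)   # null model: intercept only
fit  <- lm(y_sim ~ ., data = data_all)   # full model
anova(fit0, fit)                         # F test of the extra predictors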
Geometrical interpretation
Good model
\(\hat{\mathbf{y}}\) quite different from \(\hat{\mathbf{y}}_0\).
Full model does add information.
Bad model
\(\hat{\mathbf{y}}\) similar to \(\hat{\mathbf{y}}_0\).
\(\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\) and \(\mathbf{y} - \hat{\mathbf{y}}\) are orthogonal
i.e. their covariance is zero
i.e. (Gaussian assumption) they are independent.
\[
\frac{1}{\sigma^2}\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0 - \mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta}\|^2
= \frac{1}{\sigma^2}\|\mathbf{P}_0^\bot \mathbf{P}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\|^2
\sim \chi^2_{p-p_0}
\] (\(\mathbf{P}_0^\bot \mathbf{P}\) is a projection on a space of dimension \(p - p_0\))
If \(\mathcal{H}_0\) is true, then \(\mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta} = 0\), hence: \[
\frac{1}{\sigma^2}\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 \sim \chi^2_{p - p_0}.
\]
Problem:
Can we choose the “best” subset of non-zero \(\beta_k\)?
I.e. can we find the predictors that really have an impact on \(\mathbf{y}\)?
Idea:
Find a “score” to assess the quality of a given fit.
Variable selection - \(R^2\)
\[
R^2 = 1 - \frac{RSS}{TSS}
\]
with \(RSS = \|\mathbf{y} - \hat{\mathbf{y}}\|^2\) the residual sum of squares and \(TSS = \|\mathbf{y} - \bar{y}\mathbf{1}\|^2\) the total sum of squares.
## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}
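The formula can also be checked by hand against summary (a minimal sketch; `fit` is any lm object, e.g. the one fitted above):
## Manual check of R^2 = 1 - RSS/TSS (sketch; `fit` is an lm object)
y   <- model.response(model.frame(fit))
RSS <- deviance(fit)                 # residual sum of squares
TSS <- sum((y - mean(y))^2)          # total sum of squares
1 - RSS / TSS                        # should match summary(fit)$r.squared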
\[
y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i
\]
Variable selection - \(R^2\)
True fit
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
R2
1 0.7905255
Bigger fit
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
R2
1 0.79133
Wrong fit
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
R2
1 0.7252314
Variable selection - \(R^2\)
## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
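get_all_stats is not defined in these slides; a minimal sketch, assuming it fits every non-empty subset of predictors and stacks the get_stats_fit results, could be:
## Hypothetical implementation of get_all_stats (sketch)
get_all_stats <- function(data) {
  preds <- setdiff(names(data), "y_sim")
  res <- NULL
  for (k in seq_along(preds)) {
    for (vars in combn(preds, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = "y_sim"), data = data)
      res <- rbind(res,
                   cbind(model = paste(vars, collapse = " + "),
                         get_stats_fit(fit)))
    }
  }
  res
}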
Attention: \(p\) includes the intercept (\(p\) is the rank of the matrix \(\mathbf{X}\)).
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}
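Adjusted \(R^2\) corrects for the model dimension; its usual formula can be verified directly (a minimal sketch; recall that \(p\) includes the intercept):
## Manual check of adjusted R^2 (sketch; `fit` is an lm object)
n  <- nobs(fit)
p  <- n - df.residual(fit)           # rank of X, intercept included
R2 <- summary(fit)$r.squared
1 - (1 - R2) * (n - 1) / (n - p)     # should match summary(fit)$adj.r.squared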
## Mallow's Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)
  RSS <- deviance(fit)
  return((RSS + p * RSS / (n - p)) / n)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}
Variable selection
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
Goal: find \(f\) that compensates for the “overfit”, i.e. the small, noise-driven increase of the likelihood when we add unrelated predictors.
Simulated Dataset
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}
Variable selection
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
When we add a “junk” variable, \(- \log \hat{L}_p\) decreases, but not by much.
Question: can we find a penalty that compensates for this noise-driven increase of the likelihood?
AIC and BIC
Akaike Information Criterion - AIC
\[
AIC = - 2\log \hat{L}_p + 2 p
\]
\(\hat{L}_p\) is the maximized likelihood of the model.
\(p\) is the dimension of the model.
The smaller the better.
Asymptotic theoretical guarantees.
Easy to use and widespread.
Bayesian Information Criterion - BIC
\[
BIC = - 2\log \hat{L}_p + p \log(n)
\]
\(\hat{L}_p\) is the maximized likelihood of the model.
\(p\) is the dimension of the model.
The smaller the better.
Based on an approximation of the marginal likelihood.
Easy to use and widespread.
See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]
AIC vs BIC
\[
AIC = - 2\log \hat{L}_p + 2 p
\]
\[
BIC = - 2\log \hat{L}_p + p \log(n)
\]
Both have asymptotic theoretical justifications.
BIC penalizes more than AIC \(\to\) chooses smaller models.
Both widely used.
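Both definitions can be checked against R's built-ins (a minimal sketch; note that for an lm fit, R counts \(\sigma^2\) among the estimated parameters, so the df attribute of logLik equals \(p + 1\)):
## Manual check of AIC and BIC (sketch; `fit` is an lm object)
ll <- logLik(fit)
k  <- attr(ll, "df")                            # coefficients + sigma^2
n  <- nobs(fit)
c(manual = -2 * as.numeric(ll) + 2 * k, builtin = AIC(fit))
c(manual = -2 * as.numeric(ll) + k * log(n), builtin = BIC(fit))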
There are some (better) non-asymptotic criteria, see
Giraud C. 2014. Introduction to high-dimensional statistics.
Simulated Dataset
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}
Variable selection
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats