2025-03-12
t-test: is an individual coefficient \(\beta_k\) zero? \[ \begin{aligned} \mathcal{H}_0&: \beta_k = 0 \\ \mathcal{H}_1&: \beta_k \neq 0 \end{aligned} \quad;\quad T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n-p} \]
Nested F-test: are the extra coefficients of the bigger model all zero? \[ {\small \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} ; F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p} } \]
Can only test nested models.
Multiple testing issue.
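Both tests are readily available in base R. A minimal sketch, using the variable names of the simulated example introduced below (data_sub, x_1, x_2, x_junk are assumptions at this point):
## t-tests on individual coefficients and a nested F-test
fit_small <- lm(y_sim ~ x_1, data = data_sub)
fit_big   <- lm(y_sim ~ x_1 + x_2 + x_junk, data = data_sub)
summary(fit_big)          ## t statistic and p-value for each coefficient
anova(fit_small, fit_big) ## F-test of H0: the extra coefficients are all zero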
How to choose relevant predictors from a large pool of predictors?
Variable Selection
Model: \[ y_i = \beta_1 x_{i1} + \dotsb + \beta_p x_{ip} + \epsilon_i \]
Problem:
Can we choose the “best” subset of non-zero \(\beta_k\)?
I.e. can we find the predictors that actually have an impact on \(\mathbf{y}\)?
Idea:
Find a “score” to assess the quality of a given fit.
\[ R^2 = 1 - \frac{RSS}{TSS} \]
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
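The simulation code itself is not shown in this excerpt; the following is a minimal sketch consistent with the model above (the seed, the sample size and the noise level are assumptions):
set.seed(42)                              ## assumed seed
n <- 500                                  ## assumed sample size (other examples below seem to use a smaller n)
sigma <- 2                                ## assumed noise standard deviation
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1   ## coefficients of the true model
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)
x_junk_1 <- runif(n, min = -2, max = 2)   ## unrelated to the response
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + rnorm(n, sd = sigma)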
## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
R2 preds
1 0.7208723311 x_1
2 0.1158793338 x_2
3 0.0005972444 x_junk
4 0.7905254867 x_1+x_2
5 0.7208974123 x_1+x_junk
6 0.1164841596 x_2+x_junk
7 0.7905418666 x_1+x_2+x_junk
\(R^2\) is not enough.
\[ R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} \]
If \(\mathbf{X}' = (\mathbf{X} ~ \mathbf{x}_{p+1})\) has one more column, then: \[ \begin{aligned} \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 & = \underset{\boldsymbol{\beta}' \in \mathbb{R}^{p+1}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}'\boldsymbol{\beta}' \|^2\\ & = \underset{\substack{\boldsymbol{\beta} \in \mathbb{R}^p\\ \beta_{p+1} \in \mathbb{R}}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{x}_{p+1}\beta_{p+1} \|^2\\ & \leq \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \end{aligned} \]
\[ \text{Hence:}\quad \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 \leq \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 \]
\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]
Penalize for the number of predictors \(p\).
Careful: \(p\) includes the intercept (\(p\) is the rank of the matrix \(\mathbf{X}\)).
R2 R2a preds
1 0.7208723311 0.720311834 x_1
2 0.1158793338 0.114103991 x_2
3 0.0005972444 -0.001409588 x_junk
4 0.7905254867 0.789682531 x_1+x_2
5 0.7208974123 0.719774263 x_1+x_junk
6 0.1164841596 0.112928764 x_2+x_junk
7 0.7905418666 0.789274983 x_1+x_2+x_junk
\(R_a^2\): adjust for the number of predictors.
\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]
The larger the better.
When \(p\) increases, the fit must improve substantially for \(R_a^2\) to increase.
Intuitive, but with little theoretical justification.
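Both quantities are reported by summary.lm; a quick sketch on the simulated data:
## R^2 and adjusted R^2 of a given fit
fit <- lm(y_sim ~ x_1 + x_2 + x_junk, data = data_sub)
summary(fit)$r.squared       ## R^2
summary(fit)$adj.r.squared   ## adjusted R^2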
## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)
## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)
## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)
mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
The bad (bigger) model is closer to the observed data.
[1] 6.07942e-27
[1] 206.9159
Overfitting: the RSS is smaller for the bad, bigger model.
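The two values above are the residual sums of squares of the two fits, presumably computed along these lines (a sketch):
## Residual sum of squares of each fit
sum((y_sim - pred_bad)^2)    ## bad (bigger) model: essentially zero
sum((y_sim - pred_good)^2)   ## good (smaller) model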
Generate new data according to the same generative model:
Question:
Are the predictions from the bad (bigger) model better or worse than those from the good (smaller) model, when compared to the new data?
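A sketch of that new draw, reusing the same design and the coefficients and noise level assumed in the simulation sketch above:
## New observations from the same generative model (same x's, fresh noise)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + rnorm(n, sd = sigma)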
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)
Compute the risk between predictions and new data:
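A sketch of this computation, reusing pred_bad, pred_good and the y_new drawn above:
## Squared distance between each model's predictions and the new data
sum((y_new - pred_bad)^2)    ## bad (bigger) model
sum((y_new - pred_good)^2)   ## good (smaller) model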
[1] 535.296
[1] 205.6555
The good (smaller) model behaves better on new data.
RSS: \(\|\mathbf{y} - \hat{\mathbf{y}}_p\|^2\) with \(\hat{\mathbf{y}}_p\) learnt on \(\mathbf{y}\)
\(\to\) gets smaller when model gets bigger (overfit).
Predictive Risk: \(\| \mathbf{y}' - \hat{\mathbf{y}}_p\|^2\) with \(\mathbf{y}'\) same distribution as \(\mathbf{y}\).
\(\to\) becomes larger if \(\hat{\mathbf{y}}_p\) is overfit.
Idea:
We want to select a model that has a good predictive risk.
i.e. that represents the underlying model best.
Problem:
In general, we have no way to generate \(\mathbf{y}'\).
\(\to\) We use an estimate of the (theoretical) predictive risk.
Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
\(\boldsymbol{\mu}\) the “true” expectation of the observations \(\mathbf{y}\).
Full linear regression model: \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \quad rg(\mathbf{X}) = p, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X} \boldsymbol{\beta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X})\).
Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
Sub-models:
keep only the predictors \(\eta \subset \{1, \dotsc, p\}\): \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
with \(\mathbf{X}_{\eta}\) the sub-matrix of \(\mathbf{X}\): \[ \mathbf{X}_{\eta} = (\mathbf{x}_{\eta_1} ~ \mathbf{x}_{\eta_2} \cdots \mathbf{x}_{\eta_{|\eta|}}) \]
Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X}_{\eta})\).
“Truth”: \(\mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\mu} \in \mathbb{R}^n\) and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)\).
Models: \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
Estimators: \[ \begin{aligned} \hat{\boldsymbol{\beta}}_\eta & = \underset{\boldsymbol{\beta} \in \mathbb{R}^{|\eta|}}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta} \|^2 = (\mathbf{X}_\eta^{T}\mathbf{X}_\eta)^{-1}\mathbf{X}_\eta^{T}\mathbf{y}\\ \hat{\mathbf{y}}_\eta & = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\mathbf{y} \end{aligned} \]
Projected true expectation: \[ \mathbf{y}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\boldsymbol{\mu} \]
Risk of an estimator: \[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] \]
Notes:
\(\boldsymbol{\mu}\) is unknown \(\to\) cannot be computed in practice.
Expectation of the difference between the true mean and the estimated projection.
Not subject to overfitting.
\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]
No overfitting:
Assume:
\(\boldsymbol{\mu} \in \mathcal{M}(\mathbf{X}_{\eta})\) so that \(\mathbf{y}_{\eta} = \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} = \boldsymbol{\mu}\).
\(\eta \subset \eta'\) so that \(\mathbf{y}_{\eta'} = \mathbf{P}^{\mathbf{X}_{\eta'}}\boldsymbol{\mu} = \boldsymbol{\mu}\).
Then: \[ \begin{aligned} R(\hat{\mathbf{y}}_{\eta'}) & = \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta'| \sigma^2 \\ & \geq \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta| \sigma^2 = \|\boldsymbol{\mu} - \mathbf{y}_{\eta} \|^2 + |\eta| \sigma^2 \\ & \geq R(\hat{\mathbf{y}}_\eta) \end{aligned} \]
\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{y}_\eta + \mathbf{y}_\eta - \hat{\mathbf{y}}_\eta \|^2 \right] \]
\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \|^2 \right] \\ &= \mathbb{E}\left[ \|(\mathbf{I}_n - \mathbf{P}^{\mathbf{X}_{\eta}})\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right]\\ &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]
\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]
\[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \]
From Cochran’s Theorem: \[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \sim \chi^2_{|\eta|} \]
Hence: \[ \mathbb{E}\left[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] = |\eta| \]
And: \[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta|\sigma^2. \]
We would like: \[ \eta_0 = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} R(\hat{\mathbf{y}}_\eta) \]
Minimize the risk over all possible models.
Problem: \(R(\hat{\mathbf{y}}_\eta)\) is theoretical \(\to\) cannot be computed.
Idea: Minimize an estimator \(\hat{R}(\hat{\mathbf{y}}_\eta)\) of \(R(\hat{\mathbf{y}}_\eta)\): \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]
For any model \(\eta\), the expectation of the RSS is: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]
\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu} + \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2] \\ &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu}\|^2 + \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 + 2 \langle \mathbf{y} - \boldsymbol{\mu}; \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \rangle] \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\boldsymbol{\epsilon}\|^2] + R(\hat{\mathbf{y}}_\eta) + 2 \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] \qquad \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} + \boldsymbol{\epsilon}) \rangle] \\ &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} \rangle] - \mathbb{E}[\langle \boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= 0 - \mathbb{E}[\langle \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \qquad\qquad~~ \\ &= - \mathbb{E}[\| \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}\|^2] \\ &= - |\eta| \sigma^2 \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= n \sigma^2 + R(\hat{\mathbf{y}}_\eta) - 2 |\eta| \sigma^2 \qquad\qquad\qquad\qquad \end{aligned} \]
We have: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]
Mallows’ \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]
We have: \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \] And: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]
Hence: \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} C_p(\hat{\mathbf{y}}_\eta) = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]
\(\to\) Minimizing Mallows’ \(C_p\) amounts to minimizing an unbiased estimate of the risk (up to the additive constant \(\sigma^2\), which does not change the argmin).
\[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]
Penalize for the number of parameters \(|\eta|\).
The smaller the better.
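A sketch of the computation for a single submodel; the function name is mine, \(\hat{\sigma}^2\) is estimated from the full model (as discussed at the end of these notes), and \(|\eta|\) is taken as the number of fitted mean parameters, intercept included:
## Mallows' Cp for one submodel, with sigma^2 estimated from the full model
cp_stat <- function(fit_sub, fit_full) {
  n <- nobs(fit_sub)
  rss <- sum(residuals(fit_sub)^2)
  sigma2_hat <- summary(fit_full)$sigma^2   ## residual variance of the full model
  eta <- length(coef(fit_sub))              ## number of mean parameters, intercept included
  (rss + 2 * eta * sigma2_hat) / n
}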
Cp preds
1 5.085236 x_1
2 16.107190 x_2
3 18.207435 x_junk
4 3.823952 x_1+x_2
5 5.095010 x_1+x_junk
6 16.128558 x_2+x_junk
7 3.831361 x_1+x_2+x_junk
Mallows’ \(C_p\) is smallest for the true model.
Recall: \[ \begin{aligned} (\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML}) &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} \log L(\boldsymbol{\beta}, \sigma^2 | \mathbf{y})\\ &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta} \|^2 \end{aligned} \]
And: \[ \hat{\sigma}^2_{\eta,ML} = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n} \quad;\quad \hat{\mathbf{y}}_\eta = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta \]
Thus: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML} | \mathbf{y}) = \cdots \]
The maximized log-likelihood is equal to: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML} | \mathbf{y}) = - \frac{n}{2} \log\left(2\pi\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{1}{2\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}}\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \]
i.e. \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \]
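A quick numerical check of this closed form against R's logLik(), reusing the fit of the running simulated example:
## Maximized log-likelihood of an lm fit, by hand and via logLik()
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
rss <- sum(residuals(fit)^2)
n <- nobs(fit)
c(by_hand = -n / 2 * (log(rss / n) + log(2 * pi) + 1),
  logLik  = as.numeric(logLik(fit)))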
Remember, for any \(\eta \subset \eta'\): \[ \|\mathbf{y} - \hat{\mathbf{y}}_{\eta'}\|^2 \leq \|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2 \]
Hence \(R^2_\eta = 1 - \frac{\|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2}\) always increases when we add a predictor.
Similarly: \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \] always increases when we add some predictors.
Problem: \(\log \hat{L}_\eta\) always increases when we add predictors.
\(\to\) it cannot be used on its own for model selection (same problem as \(R^2\)).
Idea: add a penalty term that depends on the size of the model
\(\to\) minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]
Minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]
For \(\eta \subset \eta'\):
\(- \log \hat{L}_\eta\) decreases
\(f(|\eta|)\) increases
Goal: find \(f\) that compensates for the “overfit”, i.e. the noisy and insignificant increase of the likelihood when we add some unrelated predictors.
minusLogLik preds
1 1115.053 x_1
2 1403.284 x_2
3 1433.925 x_junk
4 1043.286 x_1+x_2
5 1115.030 x_1+x_junk
6 1403.113 x_2+x_junk
7 1043.266 x_1+x_2+x_junk
When we add a “junk” variable, \(- \log \hat{L}_\eta\) decreases, but only slightly.
Question: can we find a penalty that compensates for this spurious improvement?
\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]
\(\hat{L}_\eta\) is the maximized likelihood of the model.
\(|\eta|\) is the dimension of the model.
The smaller the better.
Asymptotic theoretical guarantees.
Easy to use and widespread.
\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]
\(~\)
\(\hat{L}_\eta\) is the maximized likelihood of the model.
\(|\eta|\) is the dimension of the model.
The smaller the better.
Based on an approximation of the marginal likelihood.
Easy to use and widespread.
See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]
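In R, both criteria are available directly from a fitted lm object; a quick sketch checking them against the formulas (here \(|\eta|\) counts all estimated parameters, including \(\sigma^2\)):
## AIC and BIC from a fitted model, and the same values by hand
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
ll  <- logLik(fit)
df  <- attr(ll, "df")   ## number of estimated parameters (2 slopes + intercept + sigma^2)
c(AIC(fit), -2 * as.numeric(ll) + 2 * df)
c(BIC(fit), -2 * as.numeric(ll) + df * log(nobs(fit)))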
\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]
\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]
Both have asymptotic theoretical justifications.
BIC penalizes more than AIC \(\to\) chooses smaller models.
Both widely used.
There are some (better) non-asymptotic criteria, see
Giraud C. 2014. Introduction to high-dimensional statistics.
minusLogLik AIC BIC preds
1 1115.053 2236.105 2248.749 x_1
2 1403.284 2812.567 2825.211 x_2
3 1433.925 2873.850 2886.493 x_junk
4 1043.286 2094.572 2111.430 x_1+x_2
5 1115.030 2238.060 2254.919 x_1+x_junk
6 1403.113 2814.225 2831.084 x_2+x_junk
7 1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
minusLogLik AIC BIC
"x_1+x_2+x_junk" "x_1+x_2" "x_1+x_2"
## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
minusLogLik AIC BIC preds
1 519.0457 1044.0913 1053.9863 TV
2 573.3369 1152.6738 1162.5687 radio
3 608.3357 1222.6714 1232.5663 newspaper
4 386.1970 780.3941 793.5874 TV+radio
5 509.8891 1027.7782 1040.9714 TV+newspaper
6 573.2361 1154.4723 1167.6655 radio+newspaper
7 386.1811 782.3622 798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
minusLogLik AIC BIC
"TV+radio+newspaper" "TV+radio" "TV+radio"
\[ p \text{ predictors } \to 2^p \text{ possible models.} \]
Cannot test all possible models.
“Manual” solution: forward or backward searches.
Start with the null model (no predictors).
Try to add one predictor (\(p\) models to fit).
Choose the best-fitting of the \(p\) models.
Repeat until there are no more predictors to add.
Choose the best fit with \(C_p\), AIC or BIC (a sketch of the first step follows below).
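The call producing the first table below is not shown in this excerpt; presumably it looks like the following sketch, using addterm from the MASS package and starting from the intercept-only model (full_formula and a data_all with six junk predictors are assumptions that match the outputs):
library(MASS)
## Scope of the search: all candidate predictors
full_formula <- y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 +
  x_junk.4 + x_junk.5 + x_junk.6
## Start from the null model and list all single-term additions
fit0 <- lm(y_sim ~ 1, data = data_all)
addterm(fit0, full_formula)
The whole forward (or backward) search can also be automated with stats::step or MASS::stepAIC.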
Single term additions
Model:
y_sim ~ 1
Df Sum of Sq RSS AIC
<none> 731.66 136.165
x_1 1 466.43 265.23 87.429
x_2 1 56.39 675.27 134.154
x_junk.1 1 50.05 681.61 134.622
x_junk.2 1 32.10 699.56 135.921
x_junk.3 1 0.41 731.26 138.137
x_junk.4 1 2.49 729.18 137.995
x_junk.5 1 54.28 677.38 134.311
x_junk.6 1 11.59 720.08 137.367
Single term additions
Model:
y_sim ~ 1 + x_1
Df Sum of Sq RSS AIC
<none> 265.23 87.429
x_2 1 58.314 206.92 77.014
x_junk.1 1 6.038 259.19 88.277
x_junk.2 1 0.159 265.07 89.399
x_junk.3 1 0.376 264.85 89.358
x_junk.4 1 0.403 264.83 89.353
x_junk.5 1 2.587 262.64 88.939
x_junk.6 1 6.791 258.44 88.132
## Add x_2 (smallest AIC in the table above)
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
Single term additions
Model:
y_sim ~ 1 + x_1 + x_2
Df Sum of Sq RSS AIC
<none> 206.92 77.014
x_junk.1 1 3.0199 203.90 78.279
x_junk.2 1 2.1852 204.73 78.484
x_junk.3 1 0.8228 206.09 78.815
x_junk.4 1 1.6416 205.27 78.616
x_junk.5 1 4.9363 201.98 77.807
x_junk.6 1 3.7814 203.13 78.092
Start with the full model (all predictors).
Try to remove one predictor (\(p\) models to fit).
Choose the best-fitting of the \(p\) models.
Repeat until there are no more predictors to remove.
Choose the best fit with \(C_p\), AIC or BIC (a sketch of the first step follows below).
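Again, the call producing the first deletion table is not shown; presumably something like this sketch, using dropterm from the MASS package on the full model:
library(MASS)
## Fit the full model and list all single-term deletions
fit_full <- lm(y_sim ~ ., data = data_all)
dropterm(fit_full)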
Single term deletions
Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 +
x_junk.5 + x_junk.6
Df Sum of Sq RSS AIC
<none> 191.34 85.100
x_1 1 336.07 527.41 133.797
x_2 1 57.75 249.09 96.289
x_junk.1 1 2.92 194.25 83.857
x_junk.2 1 2.26 193.60 83.688
x_junk.3 1 1.46 192.80 83.480
x_junk.4 1 1.48 192.81 83.485
x_junk.5 1 2.08 193.42 83.641
x_junk.6 1 4.38 195.72 84.233
## drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 +
             x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
Single term deletions
Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.4 + x_junk.5 +
    x_junk.6
Df Sum of Sq RSS AIC
<none> 192.80 83.480
x_1 1 335.92 528.71 131.921
x_2 1 57.16 249.96 94.463
x_junk.1 1 2.30 195.09 82.073
x_junk.2 1 2.18 194.98 82.043
x_junk.4 1 1.15 193.95 81.779
x_junk.5 1 2.27 195.06 82.065
x_junk.6 1 4.73 197.52 82.691
## drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 +
             x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit2)
Single term deletions
Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.5 + x_junk.6
Df Sum of Sq RSS AIC
<none> 193.95 81.779
x_1 1 335.82 529.77 130.021
x_2 1 56.44 250.39 92.550
x_junk.1 1 1.95 195.90 80.278
x_junk.2 1 2.27 196.21 80.359
x_junk.5 1 2.96 196.91 80.537
x_junk.6 1 4.74 198.69 80.986
## drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 + x_junk.2 +
             x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit3)
Single term deletions
Model:
y_sim ~ x_1 + x_2 + x_junk.2 + x_junk.5 + x_junk.6
Df Sum of Sq RSS AIC
<none> 195.90 80.278
x_1 1 368.05 563.94 131.147
x_2 1 59.57 255.47 91.554
x_junk.2 1 2.46 198.36 78.903
x_junk.5 1 3.71 199.61 79.217
x_junk.6 1 4.71 200.60 79.466
## drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
Single term deletions
Model:
y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
Df Sum of Sq RSS AIC
<none> 198.36 78.903
x_1 1 422.20 620.56 133.930
x_2 1 57.59 255.95 89.648
x_junk.5 1 4.78 203.13 78.092
x_junk.6 1 3.62 201.98 77.807
Both are heuristics.
There is no guarantee that they converge to the same solution.
No theoretical guarantees on the selected model.
Model selection is still an active area of research.
Mallows’ \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_{\text{full}} \right) \]
FPE: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_\eta \quad \text{with} \quad \hat{\sigma}^2_\eta = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n - |\eta|} \]
Hence: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \]
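A sketch of this criterion in R (the function name is mine, and \(|\eta|\) is taken as the number of fitted mean parameters, intercept included):
## FPE of a fitted submodel
fpe_stat <- function(fit) {
  rss <- sum(residuals(fit)^2)
  n <- nobs(fit)
  eta <- length(coef(fit))
  rss * (1 + 2 * eta / (n - eta))
}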
\[ \begin{aligned} \log \frac1n FPE(\hat{\mathbf{y}}_\eta) &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \right)\\ &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \right) + \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \end{aligned} \] But: \[ n \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) = - 2\log \hat{L}_\eta - n \log\left(2\pi\right) - n \]
Hence: \[ n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n \qquad \qquad \qquad \qquad\\ \qquad \qquad \qquad \qquad = - 2\log \hat{L}_\eta + n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right)\\ \]
When \(n\) is large: \[ n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \approx n \frac{2 |\eta|} {n - |\eta|} \approx 2 |\eta| \]
Hence: \[ \begin{aligned} n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n &\approx - 2\log \hat{L}_\eta + 2 |\eta| \\ &\approx AIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]
The log-FPE and AIC criteria select the same models (asymptotically).
In a Bayesian setting,
we have priors on the models \(\eta\) and parameters \(\boldsymbol{\theta}_\eta = (\boldsymbol{\beta}_\eta, \sigma^2_\eta)\): \[
\left\{
\begin{aligned}
\eta &\sim \pi_{\eta} \\
\boldsymbol{\theta}_\eta & \sim p(\boldsymbol{\theta} | \eta)
\end{aligned}
\right.
\]
The likelihood given some parameters and a model is: \(p(\mathbf{y} | \boldsymbol{\theta}, \eta)\)
the marginal likelihood given a model is: \[ p(\mathbf{y} | \eta) = \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \]
and the posterior probabilities of the models are: \[ p(\eta | \mathbf{y}). \]
Goal: find the model with the highest posterior probability: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) \]
Assume that all models have the same prior probability: \[ \pi_{\eta} \equiv \pi. \]
Then, from Bayes’ rule: \[ p(\eta | \mathbf{y}) = p(\mathbf{y} | \eta)\pi_{\eta} / p(\mathbf{y}) = p(\mathbf{y} | \eta) \times \pi / p(\mathbf{y}) \]
And: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta). \]
Goal: find the model with the highest posterior probability: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & = \underset{\eta}{\operatorname{argmax}} \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \end{aligned} \]
Idea:
Assume that \[p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta)\] is concentrated around its maximum \[p(\mathbf{y} | \boldsymbol{\theta}_\eta^*, \eta) p(\boldsymbol{\theta}_\eta^* | \eta)\] and apply Laplace Approximation.
Let \(L: \mathbb{R}^q \to \mathbb{R}\) be a three times differentiable function, with a unique maximum \(\mathbf{u}^*\). Then: \[ \int_{\mathbb{R}^q} e^{n L(\mathbf{u})} d \mathbf{u} = \left(\frac{2\pi}{n}\right)^{q/2} |-L''(\mathbf{u}^*)|^{-1/2}e^{nL(\mathbf{u}^*)} + o(n^{-1}) \]
Applied to: \[ L(\boldsymbol{\theta}) = \frac1n\log p(\mathbf{y} | \boldsymbol{\theta}, \eta) + \frac1n\log p(\boldsymbol{\theta} | \eta) \] and assuming \(\log p(\mathbf{y} | \boldsymbol{\theta}^*, \eta) \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta)\), we get: \[ \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n). \]
Hence: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & \approx \underset{\eta}{\operatorname{argmax}} \left\{\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n)\right\}\\ & \approx \underset{\eta}{\operatorname{argmin}} \left\{-2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) + |\eta| \log(n) \right\}\\ &\approx \underset{\eta}{\operatorname{argmin}} \left\{- 2\log \hat{L}_\eta + |\eta| \log(n)\right\} = \underset{\eta}{\operatorname{argmin}} BIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]
BIC is an approximation of the marginal likelihood.