Multiple Regression - Variable Selection

Paul Bastide

2025-03-12

What we know

Exact tests

  • t-test: is one individual coefficient \(k\) zero ? \[ \begin{aligned} \mathcal{H}_0&: \beta_k = 0 \\ \mathcal{H}_1&: \beta_k \neq 0 \end{aligned} \quad;\quad T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n-p} \]

  • Nested F-test: are extra coefficients from the big model zero ? \[ {\small \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} ; F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p} } \]
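
In R, these two tests correspond to summary() and anova() on nested fits; a minimal sketch (dat, y, x1, x2, x3 are hypothetical names, not objects from these slides):

## Hypothetical example: t-tests and nested F-test in R
## (dat is a data frame with response y and predictors x1, x2, x3)
fit_small <- lm(y ~ x1, data = dat)
fit_big   <- lm(y ~ x1 + x2 + x3, data = dat)
summary(fit_big)          # t-test for each individual coefficient
anova(fit_small, fit_big) # F-test of H0: coefficients of x2 and x3 are zero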

Exact tests

  • Can only test nested models.

  • Multiple testing issue.

  • How to choose relevant predictors from a large pool of predictors ?

    • Marketing
    • Genetics
  • Variable Selection

How to choose the “best” model ?

Model: \[ y_i = \beta_1 x_{i1} + \dotsb + \beta_p x_{ip} + \epsilon_i \]

Problem:
Can we choose the “best” subset of non-zero \(\beta_k\) ?

I.e. Can we find the predictors that really have an impact on \(\mathbf{y}\) ?

Idea:
Find a “score” to assess the quality of a given fit.

Variable selection - \(R^2\)

\[ R^2 = 1 - \frac{RSS}{TSS} \]

## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

Variable selection - \(R^2\)

  • True fit
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
         R2
1 0.7905255
  • Bigger fit
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
       R2
1 0.79133
  • Wrong fit
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
         R2
1 0.7252314
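
The helper get_all_stats used throughout these slides is not shown; a minimal reconstruction (a sketch, assuming the response is the first column of the supplied data frame and that get_stats_fit is applied to every non-empty subset of predictors) could be:

## Hypothetical reconstruction of get_all_stats (not shown in the slides)
## Fits every non-empty subset of predictors; response = first column.
get_all_stats <- function(data) {
  response <- colnames(data)[1]
  predictors <- colnames(data)[-1]
  ## All non-empty subsets of predictors, by increasing size
  subsets <- unlist(lapply(seq_along(predictors),
                           function(k) combn(predictors, k, simplify = FALSE)),
                    recursive = FALSE)
  do.call(rbind, lapply(subsets, function(preds) {
    fit <- lm(reformulate(preds, response = response), data = data)
    cbind(get_stats_fit(fit), preds = paste(preds, collapse = "+"))
  }))
}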

Variable selection - \(R^2\)

## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk[, 1])
## Fit all possible models
get_all_stats(data_sub)
            R2          preds
1 0.7208723311            x_1
2 0.1158793338            x_2
3 0.0005972444         x_junk
4 0.7905254867        x_1+x_2
5 0.7208974123     x_1+x_junk
6 0.1164841596     x_2+x_junk
7 0.7905418666 x_1+x_2+x_junk

\(R^2\) is not enough.

\(R^2\) is always increasing

\[ R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} \]

If \(\mathbf{X}' = (\mathbf{X} ~ \mathbf{x}_{p+1})\) has one more column, then: \[ \begin{aligned} \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 & = \underset{\boldsymbol{\beta}' \in \mathbb{R}^{p+1}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}'\boldsymbol{\beta}' \|^2\\ & = \underset{\substack{\boldsymbol{\beta} \in \mathbb{R}^p\\ \beta_{p+1} \in \mathbb{R}}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{x}_{p+1}\beta_{p+1} \|^2\\ & \leq \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \end{aligned} \]

\[ \text{Hence:}\quad \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 \leq \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 \]

Variable selection - \(R_a^2\)

\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]

  • Penalize for the number of predictors \(p\).

  • Careful: \(p\) includes the intercept (\(p\) is the rank of the matrix \(\mathbf{X}\)).

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}

Variable selection - \(R_a^2\)

get_all_stats(data_sub)
            R2          R2a          preds
1 0.7208723311  0.720311834            x_1
2 0.1158793338  0.114103991            x_2
3 0.0005972444 -0.001409588         x_junk
4 0.7905254867  0.789682531        x_1+x_2
5 0.7208974123  0.719774263     x_1+x_junk
6 0.1164841596  0.112928764     x_2+x_junk
7 0.7905418666  0.789274983 x_1+x_2+x_junk

\(R_a^2\): adjust for the number of predictors.

Adjusted \(R_a^2\)

\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]

  • The larger the better.

  • When \(p\) gets bigger, the fit must improve substantially for \(R_a^2\) to rise.

  • Intuitive, but with little theoretical justification.
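
As a sanity check, \(R_a^2\) can be recomputed by hand from the RSS and TSS (a short sketch, reusing the simulated data_all from these slides):

## Sanity check: adjusted R^2 recomputed by hand (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit)
p <- n - df.residual(fit)                  # rank of X, intercept included
RSS <- deviance(fit)
TSS <- sum((data_all$y_sim - mean(data_all$y_sim))^2)
1 - (RSS / (n - p)) / (TSS / (n - 1))      # matches summary(fit)$adj.r.squared
summary(fit)$adj.r.squared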

Predictive Risk

Simulation

set.seed(1289)

## Predictors
n <- 50
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1

## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

Simulation - Fit

## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)

## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)

## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)

RSS: “fitted risk”

The bad (bigger) model is closer to the observed data.

## Distance between bad prediction and data (RSS)
sum((y_sim - pred_bad)^2)
[1] 6.07942e-27
## Distance between good prediction and data (RSS)
sum((y_sim - pred_good)^2)
[1] 206.9159

Overfitting: the RSS is smaller for the bad, bigger model.

Predictive Risk

Generate new data according to the same generative model:

## New dataset from the same model
eps_new <- rnorm(n, mean = 0, sd = 2)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps_new

Question:
Are the predictions from the bad (bigger) model
better or worse
than those from the good (smaller) model,
when compared to the new data ?

Simulation - Fit

plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)

Predictive Risk

Compute the risk between predictions and new data:

## Distance between bad prediction and new data
sum((y_new - pred_bad)^2)
[1] 535.296
## Distance between good prediction and new data
sum((y_new - pred_good)^2)
[1] 205.6555

The good (smaller) model behaves better on new data.
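
A single new draw is noisy; since this is a simulation, we can average the comparison over many new datasets (a sketch, reusing mu_y, pred_good and pred_bad from above):

## Monte-Carlo estimate of the predictive risk of both fits (sketch)
n_rep <- 1000
risk_good <- risk_bad <- numeric(n_rep)
for (r in seq_len(n_rep)) {
  y_prime <- mu_y + rnorm(n, mean = 0, sd = 2)  # new data, same generative model
  risk_good[r] <- sum((y_prime - pred_good)^2)
  risk_bad[r] <- sum((y_prime - pred_bad)^2)
}
c(good = mean(risk_good), bad = mean(risk_bad))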

Predictive Risk

RSS: \(\|\mathbf{y} - \hat{\mathbf{y}}_p\|^2\) with \(\hat{\mathbf{y}}_p\) learnt on \(\mathbf{y}\)
\(\to\) gets smaller when model gets bigger (overfit).

Predictive Risk: \(\| \mathbf{y}' - \hat{\mathbf{y}}_p\|^2\) with \(\mathbf{y}'\) same distribution as \(\mathbf{y}\).
\(\to\) becomes larger if \(\hat{\mathbf{y}}_p\) is overfit.

Idea:
We want to select a model that has a good predictive risk.
i.e. that represents the underlying model best.

Problem:
In general, we have no way to generate \(\mathbf{y}'\).
\(\to\) We use an estimate of the (theoretical) predictive risk.

Theoretical Risk

Models

Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

\(\boldsymbol{\mu}\) the “true” expectation of the observations \(\mathbf{y}\).

Full linear regression model: \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \quad rg(\mathbf{X}) = p, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X} \boldsymbol{\beta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X})\).

Models

Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Sub-models:
keep only predictors \(\eta \subset \{1, \dotsc, p\}\) \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

with \(\mathbf{X}_{\eta}\) the sub-matrix of \(\mathbf{X}\): \[ \mathbf{X}_{\eta} = (\mathbf{x}_{\eta_1} ~ \mathbf{x}_{\eta_2} \cdots \mathbf{x}_{\eta_{|\eta|}}) \]

Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X}_{\eta})\).

Models

“Truth”: \(\mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\mu} \in \mathbb{R}^n\) and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)\).

Models: \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Estimators: \[ \begin{aligned} \hat{\boldsymbol{\beta}}_\eta & = \underset{\boldsymbol{\beta} \in \mathbb{R}^{|\eta|}}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta}_\eta \|^2 = (\mathbf{X}_\eta^{T}\mathbf{X}_\eta)^{-1}\mathbf{X}_\eta^{T}\mathbf{y}\\ \hat{\mathbf{y}}_\eta & = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\mathbf{y} \end{aligned} \]

Expectation of the estimator under the true model (projection of the true mean): \[ \begin{aligned} \mathbf{y}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\boldsymbol{\mu} \end{aligned} \]


Risk

Risk of an estimator (theoretical)

Risk of an estimator: \[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] \]

Notes:

  • \(\boldsymbol{\mu}\) is unknown \(\to\) cannot be computed in practice.

  • Expectation of the squared distance between the true mean and the fitted projection.

  • Not subject to overfitting.

Bias-Variance Decomposition

Bias-Variance Decomposition

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]

  • Bias: \(\|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2\)
    • Difference between true and projection on the model
    • Best that we could do if we knew the truth
    • Decreases when model is more complex, i.e. \(\eta\) gets larger.
  • Variance: \(|\eta| \sigma^2\)
    • Variance of the estimation
    • Increases when model is more complex, i.e. \(\eta\) gets larger.
  • Risk is a compromise between bias and variance.

Bias-Variance - No overfitting

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]

No overfitting:

Assume:

  • \(\boldsymbol{\mu} \in \mathcal{M}(\mathbf{X}_{\eta})\) so that \(\mathbf{y}_{\eta} = \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} = \boldsymbol{\mu}\).

  • \(\eta \subset \eta'\) so that \(\mathbf{y}_{\eta'} = \mathbf{P}^{\mathbf{X}_{\eta'}}\boldsymbol{\mu} = \boldsymbol{\mu}\).

Then: \[ \begin{aligned} R(\hat{\mathbf{y}}_{\eta'}) & = \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta'| \sigma^2 \\ & \geq \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta| \sigma^2 = \|\boldsymbol{\mu} - \mathbf{y}_{\eta} \|^2 + |\eta| \sigma^2 \\ & \geq R(\hat{\mathbf{y}}_\eta) \end{aligned} \]

Bias-Variance - Proof - 1/2

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{y}_\eta + \mathbf{y}_\eta - \hat{\mathbf{y}}_\eta \|^2 \right] \]

\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \|^2 \right] \\ &= \mathbb{E}\left[ \|(\mathbf{I}_n - \mathbf{P}^{\mathbf{X}_{\eta}})\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right]\\ &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]

\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]

Bias-Variance - Proof - 2/2

\[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \]

From Cochran’s Theorem: \[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \sim \chi^2_{|\eta|} \]

Hence: \[ \mathbb{E}\left[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] = |\eta| \]

And: \[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta|\sigma^2. \]

Penalized Least Squares

Oracle Estimator

We would like: \[ \eta_0 = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} R(\hat{\mathbf{y}}_\eta) \]

  • Minimize the risk over all possible models.

  • Problem: \(R(\hat{\mathbf{y}}_\eta)\) is theoretical \(\to\) cannot be computed.

  • Idea: Minimize an estimator \(\hat{R}(\hat{\mathbf{y}}_\eta)\) of \(R(\hat{\mathbf{y}}_\eta)\): \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]

Expectation of the RSS

For any model \(\eta\), the expectation of the RSS is: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]

Expectation of the RSS - Proof

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu} + \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2] \\ &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu}\|^2 + \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 + 2 \langle \mathbf{y} - \boldsymbol{\mu}; \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \rangle] \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\boldsymbol{\epsilon}\|^2] + R(\hat{\mathbf{y}}_\eta) + 2 \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] \qquad \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} + \boldsymbol{\epsilon}) \rangle] \\ &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} \rangle] - \mathbb{E}[\langle \boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= 0 - \mathbb{E}[\langle \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \qquad\qquad~~ \\ &= - \mathbb{E}[\| \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}\|^2] \\ &= - |\eta| \sigma^2 \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= n \sigma^2 + R(\hat{\mathbf{y}}_\eta) - 2 |\eta| \sigma^2 \qquad\qquad\qquad\qquad \end{aligned} \]

Mallows’ \(C_p\)

We have: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]

Mallow’s \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]

  • If \(\hat{\sigma}^2\) is unbiased, then: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]

Mallows’ \(C_p\)

We have: \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \] And: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]

Hence: \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} C_p(\hat{\mathbf{y}}_\eta) = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]

\(\to\) Minimizing Mallows’ \(C_p\) is the same as minimizing an unbiased estimate of the risk.

Mallows’ \(C_p\)

\[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]

  • Penalize for the number of parameters \(|\eta|\).

  • The smaller the better.
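
One possible direct implementation of this definition (a sketch; \(\hat{\sigma}^2\) is taken from the residuals of the largest available model, the \(\hat{\sigma}^2_{\text{full}}\) convention recalled later in these slides):

## Sketch: Mallows' Cp as defined above, with sigma^2 estimated
## from the full model (one common convention)
Cp_def <- function(fit, sigma2_hat) {
  n <- nobs(fit)
  p <- n - df.residual(fit)          # number of parameters in the sub-model
  (deviance(fit) + 2 * p * sigma2_hat) / n
}
fit_full <- lm(y_sim ~ ., data = data_sub)
sigma2_full <- deviance(fit_full) / df.residual(fit_full)
Cp_def(lm(y_sim ~ x_1 + x_2, data = data_sub), sigma2_full)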

Simulated Dataset

## Mallows' Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)
  RSS <- deviance(fit)
  return( (RSS + p * RSS / (n - p)) / n)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
         Cp          preds
1  5.085236            x_1
2 16.107190            x_2
3 18.207435         x_junk
4  3.823952        x_1+x_2
5  5.095010     x_1+x_junk
6 16.128558     x_2+x_junk
7  3.831361 x_1+x_2+x_junk

Mallows’ \(C_p\) is smallest for the true model.

Penalized Log Likelihood

Maximized Log Likelihood

Recall: \[ \begin{aligned} (\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML}) &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} \log L(\boldsymbol{\beta}, \sigma^2 | \mathbf{y})\\ &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta} \|^2 \end{aligned} \]

And: \[ \hat{\sigma}^2_{\eta,ML} = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n} \quad;\quad \hat{\mathbf{y}}_\eta = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta \]

Thus: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML} | \mathbf{y}) = \cdots \]

Maximized Log Likelihood

The maximized log-likelihood is equal to: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML} | \mathbf{y}) = - \frac{n}{2} \log\left(2\pi\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{1}{2\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}}\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \]

i.e. \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \]
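
This closed form can be checked numerically against logLik() (a short sketch on the simulated data):

## Check the closed form of the maximized log-likelihood (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit)
RSS <- deviance(fit)
c(by_hand = -n / 2 * log(RSS / n) - n / 2 * log(2 * pi) - n / 2,
  logLik = as.numeric(logLik(fit)))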

Maximized Log Likelihood is increasing

Remember, for any \(\eta \subset \eta'\): \[ \|\mathbf{y} - \hat{\mathbf{y}}_{\eta'}\|^2 \leq \|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2 \]

Hence \(R^2_\eta = 1 - \frac{\|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2}\) always increases when we add a predictor.

Similarly: \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \] always increases when we add predictors.

Penalized Maximized Log Likelihood

Problem: \(\log \hat{L}_\eta\) always increases when we add predictors.

\(\to\) it cannot be used for model selection (same problem as \(R^2\)).

Idea: add a penalty term that depends on the size of the model:

\(\to\) minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]

Penalized Maximized Log Likelihood

Minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]

For \(\eta \subset \eta'\):

  • \(- \log \hat{L}_\eta\) decreases

  • \(f(|\eta|)\) increases

Goal: find \(f\) that compensates for the “overfit”, i.e. the small, noise-driven increase of the likelihood when we add unrelated predictors.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
  minusLogLik          preds
1    1115.053            x_1
2    1403.284            x_2
3    1433.925         x_junk
4    1043.286        x_1+x_2
5    1115.030     x_1+x_junk
6    1403.113     x_2+x_junk
7    1043.266 x_1+x_2+x_junk

When we add a “junk” variable, \(- \log \hat{L}_\eta\) decreases, but only slightly.

Question: can we find a penalty that compensates for this small, noise-driven improvement ?

AIC and BIC

Akaike Information Criterion - AIC

\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]

  • \(\hat{L}_\eta\) is the maximized likelihood of the model.

  • \(|\eta|\) is the dimension of the model.

  • The smaller the better.

  • Asymptotic theoretical guarantees.

  • Easy to use and widespread

Bayesian Information Criterion - BIC

\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]

\(~\)

  • \(\hat{L}_\eta\) is the maximized likelihood of the model.

  • \(|\eta|\) is the dimension of the model.

  • The smaller the better.

  • Based on an approximation of the marginal likelihood.

  • Easy to use and widespread

See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]

AIC vs BIC

\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]

\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]

  • Both have asymptotic theoretical justifications.

  • BIC penalizes more than AIC \(\to\) chooses smaller models.

  • Both widely used.

  • There are some (better) non-asymptotic criteria, see
    Giraud C. 2014. Introduction to high-dimensional statistics.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}
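
Note that AIC() and BIC() in R count the estimated variance \(\hat{\sigma}^2\) as one extra parameter, so the values below are shifted by a constant (\(+2\) for AIC, \(+\log(n)\) for BIC) relative to \(- 2\log \hat{L}_\eta + 2 |\eta|\) and \(- 2\log \hat{L}_\eta + |\eta| \log(n)\); the shift is the same for every model, so the rankings are unchanged. A quick check by hand (sketch):

## Recompute AIC/BIC from the log-likelihood (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
ll <- logLik(fit)
k <- attr(ll, "df")                   # number of parameters, sigma^2 included
c(by_hand = -2 * as.numeric(ll) + 2 * k, AIC = AIC(fit))
c(by_hand = -2 * as.numeric(ll) + log(nobs(fit)) * k, BIC = BIC(fit))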

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
  minusLogLik      AIC      BIC          preds
1    1115.053 2236.105 2248.749            x_1
2    1403.284 2812.567 2825.211            x_2
3    1433.925 2873.850 2886.493         x_junk
4    1043.286 2094.572 2111.430        x_1+x_2
5    1115.030 2238.060 2254.919     x_1+x_junk
6    1403.113 2814.225 2831.084     x_2+x_junk
7    1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
     minusLogLik              AIC              BIC 
"x_1+x_2+x_junk"        "x_1+x_2"        "x_1+x_2" 

Variable selection - Advertising data

## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")

## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
  minusLogLik       AIC       BIC              preds
1    519.0457 1044.0913 1053.9863                 TV
2    573.3369 1152.6738 1162.5687              radio
3    608.3357 1222.6714 1232.5663          newspaper
4    386.1970  780.3941  793.5874           TV+radio
5    509.8891 1027.7782 1040.9714       TV+newspaper
6    573.2361 1154.4723 1167.6655    radio+newspaper
7    386.1811  782.3622  798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
         minusLogLik                  AIC                  BIC 
"TV+radio+newspaper"           "TV+radio"           "TV+radio" 

Combinatorial problem

\[ p \text{ predictors } \to 2^p \text{ possible models.} \]

  • For large \(p\), we cannot fit and compare all possible models.

  • “Manual” solution: forward or backward searches.

Forward search - Init

## Full formula
full_formula <- formula(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6)

## Complete search
2^8
[1] 256
## No predictors
fit0 <- lm(y_sim ~ 1, data = data_all)

Forward search - Add term

## Add one
library(MASS)
addterm(fit0, full_formula)
Single term additions

Model:
y_sim ~ 1
         Df Sum of Sq    RSS     AIC
<none>                731.66 136.165
x_1       1    466.43 265.23  87.429
x_2       1     56.39 675.27 134.154
x_junk.1  1     50.05 681.61 134.622
x_junk.2  1     32.10 699.56 135.921
x_junk.3  1      0.41 731.26 138.137
x_junk.4  1      2.49 729.18 137.995
x_junk.5  1     54.28 677.38 134.311
x_junk.6  1     11.59 720.08 137.367

Forward search - Add term

## Add x_1
fit1 <- lm(y_sim ~ 1 + x_1, data = data_all)
## Add another
addterm(fit1, full_formula)
Single term additions

Model:
y_sim ~ 1 + x_1
         Df Sum of Sq    RSS    AIC
<none>                265.23 87.429
x_2       1    58.314 206.92 77.014
x_junk.1  1     6.038 259.19 88.277
x_junk.2  1     0.159 265.07 89.399
x_junk.3  1     0.376 264.85 89.358
x_junk.4  1     0.403 264.83 89.353
x_junk.5  1     2.587 262.64 88.939
x_junk.6  1     6.791 258.44 88.132

Forward search - Add term

## Add x_1
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
Single term additions

Model:
y_sim ~ 1 + x_1 + x_2
         Df Sum of Sq    RSS    AIC
<none>                206.92 77.014
x_junk.1  1    3.0199 203.90 78.279
x_junk.2  1    2.1852 204.73 78.484
x_junk.3  1    0.8228 206.09 78.815
x_junk.4  1    1.6416 205.27 78.616
x_junk.5  1    4.9363 201.98 77.807
x_junk.6  1    3.7814 203.13 78.092

Backward search - Init

## Full formula
fit0 <- lm(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6,
           data = data_all)
## drop one
dropterm(fit0)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 + 
    x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                191.34  85.100
x_1       1    336.07 527.41 133.797
x_2       1     57.75 249.09  96.289
x_junk.1  1      2.92 194.25  83.857
x_junk.2  1      2.26 193.60  83.688
x_junk.3  1      1.46 192.80  83.480
x_junk.4  1      1.48 192.81  83.485
x_junk.5  1      2.08 193.42  83.641
x_junk.6  1      4.38 195.72  84.233

Backward search - Drop term

## drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2 +
           + x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + +x_junk.4 + x_junk.5 + 
    x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                192.80  83.480
x_1       1    335.92 528.71 131.921
x_2       1     57.16 249.96  94.463
x_junk.1  1      2.30 195.09  82.073
x_junk.2  1      2.18 194.98  82.043
x_junk.4  1      1.15 193.95  81.779
x_junk.5  1      2.27 195.06  82.065
x_junk.6  1      4.73 197.52  82.691

Backward search - Drop term

## drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2 +
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit2)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + +x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                193.95  81.779
x_1       1    335.82 529.77 130.021
x_2       1     56.44 250.39  92.550
x_junk.1  1      1.95 195.90  80.278
x_junk.2  1      2.27 196.21  80.359
x_junk.5  1      2.96 196.91  80.537
x_junk.6  1      4.74 198.69  80.986

Backward search - Drop term

## drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.2 +
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit3)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.2 + +x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                195.90  80.278
x_1       1    368.05 563.94 131.147
x_2       1     59.57 255.47  91.554
x_junk.2  1      2.46 198.36  78.903
x_junk.5  1      3.71 199.61  79.217
x_junk.6  1      4.71 200.60  79.466

Backward search - Drop term

## drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                198.36  78.903
x_1       1    422.20 620.56 133.930
x_2       1     57.59 255.95  89.648
x_junk.5  1      4.78 203.13  78.092
x_junk.6  1      3.62 201.98  77.807

Backward search - Drop term

## drop x_junk.6
fit5 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5, data = data_all)
## Drop another
dropterm(fit5)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.5
         Df Sum of Sq    RSS     AIC
<none>                201.98  77.807
x_1       1    426.25 628.23 132.544
x_2       1     60.66 262.64  88.939
x_junk.5  1      4.94 206.92  77.014

Backward search - Drop term

## drop x_junk.5
fit6 <- lm(y_sim ~ x_1 + x_2,
           data = data_all)
## Drop another
dropterm(fit6)
Single term deletions

Model:
y_sim ~ x_1 + x_2
       Df Sum of Sq    RSS     AIC
<none>              206.92  77.014
x_1     1    468.35 675.27 134.154
x_2     1     58.31 265.23  87.429
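
In practice, these manual addterm()/dropterm() loops are usually automated; a minimal sketch with stats::step (MASS::stepAIC behaves similarly), reusing data_all and full_formula from above:

## Automated forward and backward searches on AIC (sketch)
fit_null <- lm(y_sim ~ 1, data = data_all)
fit_forward <- step(fit_null, scope = full_formula,
                    direction = "forward", trace = 0)
fit_full <- lm(full_formula, data = data_all)
fit_backward <- step(fit_full, direction = "backward", trace = 0)
formula(fit_forward)
formula(fit_backward)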

Forward and backward searches

  • Both are heuristics.

  • There is no guarantee that the two searches converge to the same solution.

  • No theoretical guarantees on the selected model.

  • Model selection is still an active area of research.

    • See e.g. the LASSO and other penalized criteria (M2)

AIC vs log FPE

Final Prediction Error - FPE

Mallows’ \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_{\text{full}} \right) \]

FPE: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_\eta \quad \text{with} \quad \hat{\sigma}^2_\eta = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n - |\eta|} \]

Hence: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \]

AIC vs Log FPE

\[ \begin{aligned} \log \frac1n FPE(\hat{\mathbf{y}}_\eta) &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \right)\\ &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \right) + \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \end{aligned} \] But: \[ n \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) = - 2\log \hat{L}_\eta - n \log\left(2\pi\right) - n \]

AIC vs Log FPE

Hence: \[ n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n \qquad \qquad \qquad \qquad\\ \qquad \qquad \qquad \qquad = - 2\log \hat{L}_\eta + n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right)\\ \]

When \(n\) is large: \[ n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \approx n \frac{2 |\eta|} {n - |\eta|} \approx 2 |\eta| \]

AIC vs Log FPE

Hence: \[ \begin{aligned} n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n &\approx - 2\log \hat{L}_\eta + 2 |\eta| \\ &\approx AIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]

Asymptotically, the log-FPE and AIC criteria select the same models.
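
A quick numerical illustration of this approximation on one simulated fit (a sketch; the two quantities are close but not exactly equal for finite \(n\)):

## Compare n*log(FPE/n) + n*log(2*pi) + n with -2*logLik + 2*|eta| (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit)
p <- n - df.residual(fit)
RSS <- deviance(fit)
FPE <- RSS * (1 + 2 * p / (n - p))
c(log_FPE_scale = n * log(FPE / n) + n * log(2 * pi) + n,
  AIC_formula = -2 * as.numeric(logLik(fit)) + 2 * p)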

BIC and Laplace Approximation

Bayesian Model Selection

In a Bayesian setting,
we have priors on the models \(\eta\) and parameters \(\boldsymbol{\theta}_\eta = (\boldsymbol{\beta}_\eta, \sigma^2_\eta)\): \[ \left\{ \begin{aligned} \eta &\sim \pi_{\eta} \\ \boldsymbol{\theta}_\eta & \sim p(\boldsymbol{\theta} | \eta) \end{aligned} \right. \]

The likelihood given some parameters and a model is: \(p(\mathbf{y} | \boldsymbol{\theta}, \eta)\)

the marginal likelihood given a model is: \[ p(\mathbf{y} | \eta) = \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \]

and the posterior probability of each model is: \[ p(\eta | \mathbf{y}). \]

Bayesian Model Selection

Goal: find the model with the highest posterior probability: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) \]

Assume that all models have the same prior probability: \[ \pi_{\eta} \equiv \pi. \]

Then, from Bayes rule: \[ p(\eta | \mathbf{y}) = p(\mathbf{y} | \eta)\pi_{\eta} / p(\mathbf{y}) = p(\mathbf{y} | \eta) \times \pi / p(\mathbf{y}) \]

And: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta). \]

Bayesian Model Selection

Goal: find the model with the highest posterior probability: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & = \underset{\eta}{\operatorname{argmax}} \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \end{aligned} \]

Idea:
Assume that \[p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta)\] is concentrated around its maximum \[p(\mathbf{y} | \boldsymbol{\theta}_\eta^*, \eta) p(\boldsymbol{\theta}_\eta^* | \eta)\] and apply Laplace Approximation.

Laplace Approximation

Let \(L: \mathbb{R}^q \to \mathbb{R}\) be a three times differentiable function, with a unique maximum \(\mathbf{u}^*\). Then: \[ \int_{\mathbb{R}^q} e^{n L(\mathbf{u})} d \mathbf{u} = \left(\frac{2\pi}{n}\right)^{q/2} |-L''(\mathbf{u}^*)|^{-1/2}e^{nL(\mathbf{u}^*)} + o(n^{-1}) \]

Applied to: \[ L(\boldsymbol{\theta}) = \frac1n\log p(\mathbf{y} | \boldsymbol{\theta}, \eta) + \frac1n\log p(\boldsymbol{\theta} | \eta) \] and assuming \(\log p(\mathbf{y} | \boldsymbol{\theta}^*, \eta) \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta)\), we get: \[ \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n). \]
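
Spelling out the intermediate step (a sketch; terms that stay bounded as \(n\) grows are gathered into \(O(1)\) and then dropped in front of \(\log(n)\)):

\[ \begin{aligned} \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} &\approx n L(\boldsymbol{\theta}_\eta^*) + \frac{|\eta|}{2} \log\left(\frac{2\pi}{n}\right) - \frac{1}{2} \log|-L''(\boldsymbol{\theta}_\eta^*)| \\ &= \log p(\mathbf{y} | \boldsymbol{\theta}_\eta^*, \eta) + \log p(\boldsymbol{\theta}_\eta^* | \eta) - \frac{|\eta|}{2} \log(n) + O(1) \\ &\approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n). \end{aligned} \]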

BIC

Hence: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & \approx \underset{\eta}{\operatorname{argmax}} \left\{\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n)\right\}\\ & \approx \underset{\eta}{\operatorname{argmin}} \left\{-2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) + |\eta| \log(n) \right\}\\ &\approx \underset{\eta}{\operatorname{argmin}} \left\{- 2\log \hat{L}_\eta + |\eta| \log(n)\right\} = \underset{\eta}{\operatorname{argmin}} BIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]

Minimizing the BIC thus amounts to maximizing a (Laplace) approximation of the marginal likelihood.