Multiple Regression - Variable Selection

Paul Bastide

2025-03-12

What we know

Exact tests

  • t-test: is one individual coefficient \(k\) zero ? \[ \begin{aligned} \mathcal{H}_0&: \beta_k = 0 \\ \mathcal{H}_1&: \beta_k \neq 0 \end{aligned} \quad;\quad T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n-p} \]

  • Nested F-test: are extra coefficients from the big model zero ? \[ {\small \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} ; F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p} } \]
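
In R, these two tests correspond to summary() and anova() on nested fits; a minimal sketch (dat, y, x1, x2, x3 are hypothetical names, not objects from these slides):

## Hypothetical example: t-tests and nested F-test in R
## (dat is a data frame with response y and predictors x1, x2, x3)
fit_small <- lm(y ~ x1, data = dat)
fit_big   <- lm(y ~ x1 + x2 + x3, data = dat)
summary(fit_big)          # t-test for each individual coefficient
anova(fit_small, fit_big) # F-test of H0: coefficients of x2 and x3 are zero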

Exact tests

  • Can only test nested models.

  • Multiple testing issue.

  • How to choose relevant predictors from a large pool of predictors ?

    • Marketing
    • Genetics
  • Variable Selection

How to choose the “best” model ?

Model: \[ y_i = \beta_1 x_{i1} + \dotsb + \beta_p x_{ip} + \epsilon_i \]

Problem:
Can we choose the “best” subset of non-zero \(\beta_k\) ?

I.e. Can we find the predictors that really have an impact on \(\mathbf{y}\) ?

Idea:
Find a “score” to assess the quality of a given fit.

Variable selection - \(R^2\)

\[ R^2 = 1 - \frac{RSS}{TSS} \]

## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

Variable selection - \(R^2\)

  • True fit
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
         R2
1 0.7905255
  • Bigger fit
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
       R2
1 0.79133
  • Wrong fit
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
         R2
1 0.7252314
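
The helper get_all_stats used throughout these slides is not shown; a minimal reconstruction (a sketch, assuming the response is the first column of the supplied data frame and that get_stats_fit is applied to every non-empty subset of predictors) could be:

## Hypothetical reconstruction of get_all_stats (not shown in the slides)
## Fits every non-empty subset of predictors; response = first column.
get_all_stats <- function(data) {
  response <- colnames(data)[1]
  predictors <- colnames(data)[-1]
  ## All non-empty subsets of predictors, by increasing size
  subsets <- unlist(lapply(seq_along(predictors),
                           function(k) combn(predictors, k, simplify = FALSE)),
                    recursive = FALSE)
  do.call(rbind, lapply(subsets, function(preds) {
    fit <- lm(reformulate(preds, response = response), data = data)
    cbind(get_stats_fit(fit), preds = paste(preds, collapse = "+"))
  }))
}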

Variable selection - \(R^2\)

## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk[, 1])
## Fit all possible models
get_all_stats(data_sub)
            R2          preds
1 0.7208723311            x_1
2 0.1158793338            x_2
3 0.0005972444         x_junk
4 0.7905254867        x_1+x_2
5 0.7208974123     x_1+x_junk
6 0.1164841596     x_2+x_junk
7 0.7905418666 x_1+x_2+x_junk

\(R^2\) is not enough.

\(R^2\) is always increasing

\[ R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} \]

If \(\mathbf{X}' = (\mathbf{X} ~ \mathbf{x}_{p+1})\) has one more column, then: \[ \begin{aligned} \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 & = \underset{\boldsymbol{\beta}' \in \mathbb{R}^{p+1}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}'\boldsymbol{\beta}' \|^2\\ & = \underset{\substack{\boldsymbol{\beta} \in \mathbb{R}^p\\ \beta_{p+1} \in \mathbb{R}}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{x}_{p+1}\beta_{p+1} \|^2\\ & \leq \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \end{aligned} \]

\[ \text{Hence:}\quad \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 \leq \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 \]

Variable selection - \(R_a^2\)

\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]

  • Penalize for the number of predictors \(p\).

  • Careful: \(p\) includes the intercept (\(p\) is the rank of the matrix \(\mathbf{X}\)).

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}

Variable selection - \(R_a^2\)

get_all_stats(data_sub)
            R2          R2a          preds
1 0.7208723311  0.720311834            x_1
2 0.1158793338  0.114103991            x_2
3 0.0005972444 -0.001409588         x_junk
4 0.7905254867  0.789682531        x_1+x_2
5 0.7208974123  0.719774263     x_1+x_junk
6 0.1164841596  0.112928764     x_2+x_junk
7 0.7905418666  0.789274983 x_1+x_2+x_junk

\(R_a^2\): adjust for the number of predictors.

Adjusted \(R_a^2\)

\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]

  • The larger the better.

  • When \(p\) gets bigger, the fit must improve substantially for \(R_a^2\) to rise.

  • Intuitive, but with little theoretical justification.
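
As a sanity check, \(R_a^2\) can be recomputed by hand from the RSS and TSS (a short sketch, reusing the simulated data_all from these slides):

## Sanity check: adjusted R^2 recomputed by hand (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit)
p <- n - df.residual(fit)                  # rank of X, intercept included
RSS <- deviance(fit)
TSS <- sum((data_all$y_sim - mean(data_all$y_sim))^2)
1 - (RSS / (n - p)) / (TSS / (n - 1))      # matches summary(fit)$adj.r.squared
summary(fit)$adj.r.squared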

Predictive Risk

Simulation

set.seed(1289)

## Predictors
n <- 50
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1

## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

Simulation - Fit

## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)

## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)

## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)

RSS: “fitted risk”

The bad (bigger) model is closer to the observed data.

## Distance between bad prediction and data (RSS)
sum((y_sim - pred_bad)^2)
[1] 6.07942e-27
## Distance between good prediction and data (RSS)
sum((y_sim - pred_good)^2)
[1] 206.9159

Overfitting: the RSS is smaller for the bad, bigger model.

Predictive Risk

Generate new data according to the same generative model:

## New dataset from the same model
eps_new <- rnorm(n, mean = 0, sd = 2)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps_new

Question:
Are the predictions from the bad (bigger) model
better or worse
than those from the good (smaller) model,
when compared to the new data ?

Simulation - Fit

plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)

Predictive Risk

Compute the risk between predictions and new data:

## Distance between bad prediction and new data
sum((y_new - pred_bad)^2)
[1] 535.296
## Distance between good prediction and new data
sum((y_new - pred_good)^2)
[1] 205.6555

The good (smaller) model behaves better on new data.
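
A single new draw is noisy; since this is a simulation, we can average the comparison over many new datasets (a sketch, reusing mu_y, pred_good and pred_bad from above):

## Monte-Carlo estimate of the predictive risk of both fits (sketch)
n_rep <- 1000
risk_good <- risk_bad <- numeric(n_rep)
for (r in seq_len(n_rep)) {
  y_prime <- mu_y + rnorm(n, mean = 0, sd = 2)  # new data, same generative model
  risk_good[r] <- sum((y_prime - pred_good)^2)
  risk_bad[r] <- sum((y_prime - pred_bad)^2)
}
c(good = mean(risk_good), bad = mean(risk_bad))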

Predictive Risk

RSS: \(\|\mathbf{y} - \hat{\mathbf{y}}_p\|^2\) with \(\hat{\mathbf{y}}_p\) learnt on \(\mathbf{y}\)
\(\to\) gets smaller when model gets bigger (overfit).

Predictive Risk: \(\| \mathbf{y}' - \hat{\mathbf{y}}_p\|^2\) with \(\mathbf{y}'\) same distribution as \(\mathbf{y}\).
\(\to\) becomes larger if \(\hat{\mathbf{y}}_p\) is overfit.

Idea:
We want to select a model that has a good predictive risk.
i.e. that represents the underlying model best.

Problem:
In general, we have no way to generate \(\mathbf{y}'\).
\(\to\) We use an estimate of the (theoretical) predictive risk.

Theoretical Risk

Models

Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

\(\boldsymbol{\mu}\) the “true” expectation of the observations \(\mathbf{y}\).

Full linear regression model: \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \quad rg(\mathbf{X}) = p, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X} \boldsymbol{\beta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X})\).

Models

Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Sub-models:
keep only predictors \(\eta \subset \{1, \dotsc, p\}\) \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

with \(\mathbf{X}_{\eta}\) the sub-matrix of \(\mathbf{X}\): \[ \mathbf{X}_{\eta} = (\mathbf{x}_{\eta_1} ~ \mathbf{x}_{\eta_2} \cdots \mathbf{x}_{\eta_{|\eta|}}) \]

Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X}_{\eta})\).

Models

“Truth”: \(\mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\mu} \in \mathbb{R}^n\) and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)\).

Models: \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Estimators: \[ \begin{aligned} \hat{\boldsymbol{\beta}}_\eta & = \underset{\boldsymbol{\beta} \in \mathbb{R}^{|\eta|}}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta}_\eta \|^2 = (\mathbf{X}_\eta^{T}\mathbf{X}_\eta)^{-1}\mathbf{X}_\eta^{T}\mathbf{y}\\ \hat{\mathbf{y}}_\eta & = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\mathbf{y} \end{aligned} \]

Expectation of the estimator under the true model (projection of the true mean): \[ \begin{aligned} \mathbf{y}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\boldsymbol{\mu} \end{aligned} \]


Risk

Risk of an estimator (theoretical)

Risk of an estimator: \[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] \]

Notes:

  • \(\boldsymbol{\mu}\) is unknown \(\to\) cannot be computed in practice.

  • Expectation of the squared distance between the true mean and the fitted projection.

  • Not subject to overfitting.

Bias-Variance Decomposition

Bias-Variance Decomposition

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]

  • Bias: \(\|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2\)
    • Difference between true and projection on the model
    • Best that we could do if we knew the truth
    • Decreases when model is more complex, i.e. \(\eta\) gets larger.
  • Variance: \(|\eta| \sigma^2\)
    • Variance of the estimation
    • Increases when model is more complex, i.e. \(\eta\) gets larger.
  • Risk is a compromise between bias and variance.

Bias-Variance - No overfitting

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]

No overfitting:

Assume:

  • \(\boldsymbol{\mu} \in \mathcal{M}(\mathbf{X}_{\eta})\) so that \(\mathbf{y}_{\eta} = \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} = \boldsymbol{\mu}\).

  • \(\eta \subset \eta'\) so that \(\mathbf{y}_{\eta'} = \mathbf{P}^{\mathbf{X}_{\eta'}}\boldsymbol{\mu} = \boldsymbol{\mu}\).

Then: \[ \begin{aligned} R(\hat{\mathbf{y}}_{\eta'}) & = \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta'| \sigma^2 \\ & \geq \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta| \sigma^2 = \|\boldsymbol{\mu} - \mathbf{y}_{\eta} \|^2 + |\eta| \sigma^2 \\ & \geq R(\hat{\mathbf{y}}_\eta) \end{aligned} \]

Bias-Variance - Proof - 1/2

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{y}_\eta + \mathbf{y}_\eta - \hat{\mathbf{y}}_\eta \|^2 \right] \]

\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \|^2 \right] \\ &= \mathbb{E}\left[ \|(\mathbf{I}_n - \mathbf{P}^{\mathbf{X}_{\eta}})\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right]\\ &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]

\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]

Bias-Variance - Proof - 2/2

\[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \]

From Cochran’s Theorem: \[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \sim \chi^2_{|\eta|} \]

Hence: \[ \mathbb{E}\left[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] = |\eta| \]

And: \[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta|\sigma^2. \]

Penalized Least Squares

Oracle Estimator

We would like: \[ \eta_0 = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} R(\hat{\mathbf{y}}_\eta) \]

  • Minimize the risk over all possible models.

  • Problem: \(R(\hat{\mathbf{y}}_\eta)\) is theoretical \(\to\) cannot be computed.

  • Idea: Minimize an estimator \(\hat{R}(\hat{\mathbf{y}}_\eta)\) of \(R(\hat{\mathbf{y}}_\eta)\): \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]

Expectation of the RSS

For any model \(\eta\), the expectation of the RSS is: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]

Expectation of the RSS - Proof

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu} + \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2] \\ &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu}\|^2 + \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 + 2 \langle \mathbf{y} - \boldsymbol{\mu}; \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \rangle] \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\boldsymbol{\epsilon}\|^2] + R(\hat{\mathbf{y}}_\eta) + 2 \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] \qquad \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} + \boldsymbol{\epsilon}) \rangle] \\ &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} \rangle] - \mathbb{E}[\langle \boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= 0 - \mathbb{E}[\langle \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \qquad\qquad~~ \\ &= - \mathbb{E}[\| \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}\|^2] \\ &= - |\eta| \sigma^2 \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= n \sigma^2 + R(\hat{\mathbf{y}}_\eta) - 2 |\eta| \sigma^2 \qquad\qquad\qquad\qquad \end{aligned} \]

Mallows’ \(C_p\)

We have: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]

Mallow’s \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]

  • If \(\hat{\sigma}^2\) is unbiased, then: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]

Mallows’ \(C_p\)

We have: \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \] And: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]

Hence: \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} C_p(\hat{\mathbf{y}}_\eta) = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]

\(\to\) Minimizing Mallows’ \(C_p\) is the same as minimizing an unbiased estimate of the risk.

Mallows’ \(C_p\)

\[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]

  • Penalize for the number of parameters \(|\eta|\).

  • The smaller the better.
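
One possible direct implementation of this definition (a sketch; \(\hat{\sigma}^2\) is taken from the residuals of the largest available model, the \(\hat{\sigma}^2_{\text{full}}\) convention recalled later in these slides):

## Sketch: Mallows' Cp as defined above, with sigma^2 estimated
## from the full model (one common convention)
Cp_def <- function(fit, sigma2_hat) {
  n <- nobs(fit)
  p <- n - df.residual(fit)          # number of parameters in the sub-model
  (deviance(fit) + 2 * p * sigma2_hat) / n
}
fit_full <- lm(y_sim ~ ., data = data_sub)
sigma2_full <- deviance(fit_full) / df.residual(fit_full)
Cp_def(lm(y_sim ~ x_1 + x_2, data = data_sub), sigma2_full)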

Simulated Dataset

## Mallows' Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)
  RSS <- deviance(fit)
  return( (RSS + p * RSS / (n - p)) / n)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
         Cp          preds
1  5.085236            x_1
2 16.107190            x_2
3 18.207435         x_junk
4  3.823952        x_1+x_2
5  5.095010     x_1+x_junk
6 16.128558     x_2+x_junk
7  3.831361 x_1+x_2+x_junk

Mallows’ \(C_p\) is smallest for the true model.

Penalized Log Likelihood

Maximized Log Likelihood

Recall: \[ \begin{aligned} (\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML}) &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} \log L(\boldsymbol{\beta}, \sigma^2 | \mathbf{y})\\ &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta} \|^2 \end{aligned} \]

And: \[ \hat{\sigma}^2_{\eta,ML} = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n} \quad;\quad \hat{\mathbf{y}}_\eta = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta \]

Thus: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML} | \mathbf{y}) = \cdots \]

Maximized Log Likelihood

The maximized log-likelihood is equal to: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML} | \mathbf{y}) = - \frac{n}{2} \log\left(2\pi\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{1}{2\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}}\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \]

i.e. \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \]
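
This closed form can be checked numerically against logLik() (a short sketch on the simulated data):

## Check the closed form of the maximized log-likelihood (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit)
RSS <- deviance(fit)
c(by_hand = -n / 2 * log(RSS / n) - n / 2 * log(2 * pi) - n / 2,
  logLik = as.numeric(logLik(fit)))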

Maximized Log Likelihood is increasing

Remember, for any \(\eta \subset \eta'\): \[ \|\mathbf{y} - \hat{\mathbf{y}}_{\eta'}\|^2 \leq \|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2 \]

Hence \(R^2_\eta = 1 - \frac{\|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2}\) always increases when we add a predictor.

Similarly: \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \] always increases when we add predictors.

Penalized Maximized Log Likelihood

Problem: \(\log \hat{L}_\eta\) always increases when we add predictors.

\(\to\) it cannot be used for model selection (same problem as \(R^2\)).

Idea: add a penalty term that depends on the size of the model:

\(\to\) minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]

Penalized Maximized Log Likelihood

Minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]

For \(\eta \subset \eta'\):

  • \(- \log \hat{L}_\eta\) decreases

  • \(f(|\eta|)\) increases

Goal: find \(f\) that compensates for the “overfit”, i.e. the small, noise-driven increase of the likelihood when we add unrelated predictors.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
  minusLogLik          preds
1    1115.053            x_1
2    1403.284            x_2
3    1433.925         x_junk
4    1043.286        x_1+x_2
5    1115.030     x_1+x_junk
6    1403.113     x_2+x_junk
7    1043.266 x_1+x_2+x_junk

When we add a “junk” variable, \(- \log \hat{L}_\eta\) decreases, but only slightly.

Question: can we find a penalty that compensates for this small, noise-driven improvement ?

AIC and BIC

Akaike Information Criterion - AIC

\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]

  • \(\hat{L}_\eta\) is the maximized likelihood of the model.

  • \(|\eta|\) is the dimension of the model.

  • The smaller the better.

  • Asymptotic theoretical guarantees.

  • Easy to use and widespread

Bayesian Information Criterion - BIC

\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]

\(~\)

  • \(\hat{L}_\eta\) is the maximized likelihood of the model.

  • \(|\eta|\) is the dimension of the model.

  • The smaller the better.

  • Based on an approximation of the marginal likelihood.

  • Easy to use and widespread

See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]

AIC vs BIC

\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]

\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]

  • Both have asymptotic theoretical justifications.

  • BIC penalizes more than AIC \(\to\) chooses smaller models.

  • Both widely used.

  • There are some (better) non-asymptotic criteria, see
    Giraud C. 2014. Introduction to high-dimensional statistics.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}
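
Note that AIC() and BIC() in R count the estimated variance \(\hat{\sigma}^2\) as one extra parameter, so the values below are shifted by a constant (\(+2\) for AIC, \(+\log(n)\) for BIC) relative to \(- 2\log \hat{L}_\eta + 2 |\eta|\) and \(- 2\log \hat{L}_\eta + |\eta| \log(n)\); the shift is the same for every model, so the rankings are unchanged. A quick check by hand (sketch):

## Recompute AIC/BIC from the log-likelihood (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
ll <- logLik(fit)
k <- attr(ll, "df")                   # number of parameters, sigma^2 included
c(by_hand = -2 * as.numeric(ll) + 2 * k, AIC = AIC(fit))
c(by_hand = -2 * as.numeric(ll) + log(nobs(fit)) * k, BIC = BIC(fit))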

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
  minusLogLik      AIC      BIC          preds
1    1115.053 2236.105 2248.749            x_1
2    1403.284 2812.567 2825.211            x_2
3    1433.925 2873.850 2886.493         x_junk
4    1043.286 2094.572 2111.430        x_1+x_2
5    1115.030 2238.060 2254.919     x_1+x_junk
6    1403.113 2814.225 2831.084     x_2+x_junk
7    1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
     minusLogLik              AIC              BIC 
"x_1+x_2+x_junk"        "x_1+x_2"        "x_1+x_2" 

Variable selection - Advertising data

## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")

## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
  minusLogLik       AIC       BIC              preds
1    519.0457 1044.0913 1053.9863                 TV
2    573.3369 1152.6738 1162.5687              radio
3    608.3357 1222.6714 1232.5663          newspaper
4    386.1970  780.3941  793.5874           TV+radio
5    509.8891 1027.7782 1040.9714       TV+newspaper
6    573.2361 1154.4723 1167.6655    radio+newspaper
7    386.1811  782.3622  798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
         minusLogLik                  AIC                  BIC 
"TV+radio+newspaper"           "TV+radio"           "TV+radio" 

Combinatorial problem

\[ p \text{ predictors } \to 2^p \text{ possible models.} \]

  • For large \(p\), we cannot fit and compare all possible models.

  • “Manual” solution: forward or backward searches.

Forward search - Init

## Full formula
full_formula <- formula(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6)

## Complete search
2^8
[1] 256
## No predictors
fit0 <- lm(y_sim ~ 1, data = data_all)

Forward search - Add term

## Add one
library(MASS)
addterm(fit0, full_formula)
Single term additions

Model:
y_sim ~ 1
         Df Sum of Sq    RSS     AIC
<none>                731.66 136.165
x_1       1    466.43 265.23  87.429
x_2       1     56.39 675.27 134.154
x_junk.1  1     50.05 681.61 134.622
x_junk.2  1     32.10 699.56 135.921
x_junk.3  1      0.41 731.26 138.137
x_junk.4  1      2.49 729.18 137.995
x_junk.5  1     54.28 677.38 134.311
x_junk.6  1     11.59 720.08 137.367

Forward search - Add term

## Add x_1
fit1 <- lm(y_sim ~ 1 + x_1, data = data_all)
## Add another
addterm(fit1, full_formula)
Single term additions

Model:
y_sim ~ 1 + x_1
         Df Sum of Sq    RSS    AIC
<none>                265.23 87.429
x_2       1    58.314 206.92 77.014
x_junk.1  1     6.038 259.19 88.277
x_junk.2  1     0.159 265.07 89.399
x_junk.3  1     0.376 264.85 89.358
x_junk.4  1     0.403 264.83 89.353
x_junk.5  1     2.587 262.64 88.939
x_junk.6  1     6.791 258.44 88.132

Forward search - Add term

## Add x_1
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
Single term additions

Model:
y_sim ~ 1 + x_1 + x_2
         Df Sum of Sq    RSS    AIC
<none>                206.92 77.014
x_junk.1  1    3.0199 203.90 78.279
x_junk.2  1    2.1852 204.73 78.484
x_junk.3  1    0.8228 206.09 78.815
x_junk.4  1    1.6416 205.27 78.616
x_junk.5  1    4.9363 201.98 77.807
x_junk.6  1    3.7814 203.13 78.092

Backward search - Init

## Full formula
fit0 <- lm(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6,
           data = data_all)
## drop one
dropterm(fit0)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 + 
    x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                191.34  85.100
x_1       1    336.07 527.41 133.797
x_2       1     57.75 249.09  96.289
x_junk.1  1      2.92 194.25  83.857
x_junk.2  1      2.26 193.60  83.688
x_junk.3  1      1.46 192.80  83.480
x_junk.4  1      1.48 192.81  83.485
x_junk.5  1      2.08 193.42  83.641
x_junk.6  1      4.38 195.72  84.233

Backward search - Drop term

## drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2 +
           + x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + +x_junk.4 + x_junk.5 + 
    x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                192.80  83.480
x_1       1    335.92 528.71 131.921
x_2       1     57.16 249.96  94.463
x_junk.1  1      2.30 195.09  82.073
x_junk.2  1      2.18 194.98  82.043
x_junk.4  1      1.15 193.95  81.779
x_junk.5  1      2.27 195.06  82.065
x_junk.6  1      4.73 197.52  82.691

Backward search - Drop term

## drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2 +
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit2)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + +x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                193.95  81.779
x_1       1    335.82 529.77 130.021
x_2       1     56.44 250.39  92.550
x_junk.1  1      1.95 195.90  80.278
x_junk.2  1      2.27 196.21  80.359
x_junk.5  1      2.96 196.91  80.537
x_junk.6  1      4.74 198.69  80.986

Backward search - Drop term

## drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.2 +
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit3)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.2 + +x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                195.90  80.278
x_1       1    368.05 563.94 131.147
x_2       1     59.57 255.47  91.554
x_junk.2  1      2.46 198.36  78.903
x_junk.5  1      3.71 199.61  79.217
x_junk.6  1      4.71 200.60  79.466

Backward search - Drop term

## drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
         Df Sum of Sq    RSS     AIC
<none>                198.36  78.903
x_1       1    422.20 620.56 133.930
x_2       1     57.59 255.95  89.648
x_junk.5  1      4.78 203.13  78.092
x_junk.6  1      3.62 201.98  77.807

Backward search - Drop term

## drop x_junk.6
fit5 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5, data = data_all)
## Drop another
dropterm(fit5)
Single term deletions

Model:
y_sim ~ x_1 + x_2 + x_junk.5
         Df Sum of Sq    RSS     AIC
<none>                201.98  77.807
x_1       1    426.25 628.23 132.544
x_2       1     60.66 262.64  88.939
x_junk.5  1      4.94 206.92  77.014

Backward search - Drop term

## drop x_junk.5
fit6 <- lm(y_sim ~ x_1 + x_2,
           data = data_all)
## Drop another
dropterm(fit6)
Single term deletions

Model:
y_sim ~ x_1 + x_2
       Df Sum of Sq    RSS     AIC
<none>              206.92  77.014
x_1     1    468.35 675.27 134.154
x_2     1     58.31 265.23  87.429
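
In practice, these manual addterm()/dropterm() loops are usually automated; a minimal sketch with stats::step (MASS::stepAIC behaves similarly), reusing data_all and full_formula from above:

## Automated forward and backward searches on AIC (sketch)
fit_null <- lm(y_sim ~ 1, data = data_all)
fit_forward <- step(fit_null, scope = full_formula,
                    direction = "forward", trace = 0)
fit_full <- lm(full_formula, data = data_all)
fit_backward <- step(fit_full, direction = "backward", trace = 0)
formula(fit_forward)
formula(fit_backward)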

Forward and backward searches

  • Both are heuristics.

  • There is no guarantee that the two searches converge to the same solution.

  • No theoretical guarantees on the selected model.

  • Model selection is still an active area of research.

    • See e.g. the LASSO and other penalized criteria (M2)

AIC vs log FPE

Final Prediction Error - FPE

Mallows’ \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_{\text{full}} \right) \]

FPE: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_\eta \quad \text{with} \quad \hat{\sigma}^2_\eta = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n - |\eta|} \]

Hence: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \]

AIC vs Log FPE

\[ \begin{aligned} \log \frac1n FPE(\hat{\mathbf{y}}_\eta) &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \right)\\ &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \right) + \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \end{aligned} \] But: \[ n \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) = - 2\log \hat{L}_\eta - n \log\left(2\pi\right) - n \]

AIC vs Log FPE

Hence: \[ n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n \qquad \qquad \qquad \qquad\\ \qquad \qquad \qquad \qquad = - 2\log \hat{L}_\eta + n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right)\\ \]

When \(n\) is large: \[ n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \approx n \frac{2 |\eta|} {n - |\eta|} \approx 2 |\eta| \]

AIC vs Log FPE

Hence: \[ \begin{aligned} n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n &\approx - 2\log \hat{L}_\eta + 2 |\eta| \\ &\approx AIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]

Asymptotically, the log-FPE and AIC criteria select the same models.
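
A quick numerical illustration of this approximation on one simulated fit (a sketch; the two quantities are close but not exactly equal for finite \(n\)):

## Compare n*log(FPE/n) + n*log(2*pi) + n with -2*logLik + 2*|eta| (sketch)
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit)
p <- n - df.residual(fit)
RSS <- deviance(fit)
FPE <- RSS * (1 + 2 * p / (n - p))
c(log_FPE_scale = n * log(FPE / n) + n * log(2 * pi) + n,
  AIC_formula = -2 * as.numeric(logLik(fit)) + 2 * p)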

BIC and Laplace Approximation

Bayesian Model Selection

In a Bayesian setting,
we have priors on the models \(\eta\) and parameters \(\boldsymbol{\theta}_\eta = (\boldsymbol{\beta}_\eta, \sigma^2_\eta)\): \[ \left\{ \begin{aligned} \eta &\sim \pi_{\eta} \\ \boldsymbol{\theta}_\eta & \sim p(\boldsymbol{\theta} | \eta) \end{aligned} \right. \]

The likelihood given some parameters and a model is: \(p(\mathbf{y} | \boldsymbol{\theta}, \eta)\)

the marginal likelihood given a model is: \[ p(\mathbf{y} | \eta) = \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \]

and the posterior probability of each model is: \[ p(\eta | \mathbf{y}). \]

Bayesian Model Selection

Goal: find the model with the highest posterior probability: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) \]

Assume that all models have the same prior probability: \[ \pi_{\eta} \equiv \pi. \]

Then, from Bayes rule: \[ p(\eta | \mathbf{y}) = p(\mathbf{y} | \eta)\pi_{\eta} / p(\mathbf{y}) = p(\mathbf{y} | \eta) \times \pi / p(\mathbf{y}) \]

And: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta). \]

Bayesian Model Selection

Goal: find the model with the highest posterior probability: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & = \underset{\eta}{\operatorname{argmax}} \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \end{aligned} \]

Idea:
Assume that \[p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta)\] is concentrated around its maximum \[p(\mathbf{y} | \boldsymbol{\theta}_\eta^*, \eta) p(\boldsymbol{\theta}_\eta^* | \eta)\] and apply Laplace Approximation.

Laplace Approximation

Let \(L: \mathbb{R}^q \to \mathbb{R}\) be a three times differentiable function, with a unique maximum \(\mathbf{u}^*\). Then: \[ \int_{\mathbb{R}^q} e^{n L(\mathbf{u})} d \mathbf{u} = \left(\frac{2\pi}{n}\right)^{q/2} |-L''(\mathbf{u}^*)|^{-1/2}e^{nL(\mathbf{u}^*)} + o(n^{-1}) \]

Applied to: \[ L(\boldsymbol{\theta}) = \frac1n\log p(\mathbf{y} | \boldsymbol{\theta}, \eta) + \frac1n\log p(\boldsymbol{\theta} | \eta) \] and assuming \(\log p(\mathbf{y} | \boldsymbol{\theta}^*, \eta) \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta)\), we get: \[ \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n). \]
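
Spelling out the intermediate step (a sketch; terms that stay bounded as \(n\) grows are gathered into \(O(1)\) and then dropped in front of \(\log(n)\)):

\[ \begin{aligned} \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} &\approx n L(\boldsymbol{\theta}_\eta^*) + \frac{|\eta|}{2} \log\left(\frac{2\pi}{n}\right) - \frac{1}{2} \log|-L''(\boldsymbol{\theta}_\eta^*)| \\ &= \log p(\mathbf{y} | \boldsymbol{\theta}_\eta^*, \eta) + \log p(\boldsymbol{\theta}_\eta^* | \eta) - \frac{|\eta|}{2} \log(n) + O(1) \\ &\approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n). \end{aligned} \]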

BIC

Hence: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & \approx \underset{\eta}{\operatorname{argmax}} \left\{\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n)\right\}\\ & \approx \underset{\eta}{\operatorname{argmin}} \left\{-2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) + |\eta| \log(n) \right\}\\ &\approx \underset{\eta}{\operatorname{argmin}} \left\{- 2\log \hat{L}_\eta + |\eta| \log(n)\right\} = \underset{\eta}{\operatorname{argmin}} BIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]

Minimizing the BIC thus amounts to maximizing a (Laplace) approximation of the marginal likelihood.