When \(\sigma^2\) is unknown: \[
\frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)
\sim
\mathcal{F}^p_{n-p}
\]
with \(\mathcal{F}^p_{n-p}\) the Fisher (F) distribution with \(p\) and \(n - p\) degrees of freedom.
Reminder: Fisher Distribution
Let \(U_1 \sim \chi^2_{p_1}\) and \(U_2 \sim \chi^2_{p_2}\), \(U_1\) and \(U_2\) independent. Then \[
F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}
\]
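A quick simulation illustrates this construction (a minimal sketch; the degrees of freedom and sample size are arbitrary choices):
## Simulate the ratio of two independent scaled chi-squares
p_1 <- 3
p_2 <- 30
U_1 <- rchisq(10000, df = p_1)
U_2 <- rchisq(10000, df = p_2)
F_sim <- (U_1 / p_1) / (U_2 / p_2)
## Empirical quantile vs theoretical Fisher quantile
quantile(F_sim, 0.95)
qf(0.95, df1 = p_1, df2 = p_2)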
Joint distribution of \(\hat{\boldsymbol{\beta}}\) - Proof - Hints
Test Statistic: \[
F
= \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}}
\underset{\mathcal{H}_0}{\sim}
\mathcal{F}^p_{n-p}
\]
\(\mathcal{H}_0: \beta_1 = \cdots = \beta_p = 0\) i.e. the model is useless.
If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is small, then the error is small,
and \(F\) tends to be large, i.e. we tend to reject \(\mathcal{H}_0\):
the model is useful.
If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is large, then the error is large,
and \(F\) tends to be small, i.e. we tend to not reject \(\mathcal{H}_0\):
the model is rather useless.
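In R, a close relative of this statistic is reported by summary.lm (a minimal sketch, assuming a fitted lm object `fit` such as the one built below; note that R's version tests the non-intercept coefficients only):
## Global F test as reported by summary.lm (sketch; `fit` is an lm object)
fstat <- summary(fit)$fstatistic        # statistic, df1, df2
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)   # p-value of the global test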
Geometrical interpretation
Good model
\(\mathbf{y}\) not too far from \(\mathcal{M}(\mathbf{X})\).
\(\|\hat{\boldsymbol{\epsilon}}\|^2\) not too large compared to \(\|\hat{\mathbf{y}}\|^2\).
\(F\) tends to be large.
Bad model
\(\mathbf{y}\) almost orthogonal to \(\mathcal{M}(\mathbf{X})\).
\(\|\hat{\boldsymbol{\epsilon}}\|^2\) rather large compared to \(\|\hat{\mathbf{y}}\|^2\).
\(F\) tends to be small.
Simulated Dataset - True and Junk
## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) is small compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then “\(\hat{\mathbf{y}} \approx \hat{\mathbf{y}}_0\)” and the null model is enough.
If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) is large compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then the full model is useful.
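This comparison of nested models is what anova() computes for two lm fits (a minimal sketch, assuming the `data_all` data frame built above; `fit0` denotes the intercept-only null model):
## Nested-model F test (sketch; assumes data_all as defined above)
fit0 <- lm(y_sim ~ 1, data = data_all)   # null model: intercept only
fit  <- lm(y_sim ~ ., data = data_all)   # full model
anova(fit0, fit)                         # F test of the extra predictors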
Geometrical interpretation
Good model
\(\hat{\mathbf{y}}\) quite different from \(\hat{\mathbf{y}}_0\).
Full model does add information.
Bad model
\(\hat{\mathbf{y}}\) similar to \(\hat{\mathbf{y}}_0\).
\(\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\) and \(\mathbf{y} - \hat{\mathbf{y}}\) are orthogonal
i.e. their covariance is zero
i.e. (Gaussian assumption) they are independent.
\[
\frac{1}{\sigma^2}\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0 - \mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta}\|^2
= \frac{1}{\sigma^2}\|\mathbf{P}_0^\bot \mathbf{P}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\|^2
\sim \chi^2_{p-p_0}
\] (\(\mathbf{P}_0^\bot \mathbf{P}\) is a projection on a space of dimension \(p - p_0\))
If \(\mathcal{H}_0\) is true, then \(\mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta} = 0\), hence: \[
\frac{1}{\sigma^2}\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 \sim \chi^2_{p - p_0}.
\]
Problem:
Can we choose the “best” subset of non-zero \(\beta_k\)?
I.e. can we find the predictors that really have an impact on \(\mathbf{y}\)?
Idea:
Find a “score” to assess the quality of a given fit.
Variable selection - \(R^2\)
\[
R^2 = 1 - \frac{RSS}{TSS}
\]
with \(RSS = \|\mathbf{y} - \hat{\mathbf{y}}\|^2\) the residual sum of squares and \(TSS = \|\mathbf{y} - \bar{y}\mathbf{1}\|^2\) the total sum of squares.
## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}
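The formula can also be checked by hand against summary (a minimal sketch; `fit` is any lm object, e.g. the one fitted above):
## Manual check of R^2 = 1 - RSS/TSS (sketch; `fit` is an lm object)
y   <- model.response(model.frame(fit))
RSS <- deviance(fit)                 # residual sum of squares
TSS <- sum((y - mean(y))^2)          # total sum of squares
1 - RSS / TSS                        # should match summary(fit)$r.squared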
\[
y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i
\]
Variable selection - \(R^2\)
True fit
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
R2
1 0.7905255
Bigger fit
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
R2
1 0.79133
Wrong fit
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
R2
1 0.7252314
Variable selection - \(R^2\)
## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
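get_all_stats is not defined in these slides; a minimal sketch, assuming it fits every non-empty subset of predictors and stacks the get_stats_fit results, could be:
## Hypothetical implementation of get_all_stats (sketch)
get_all_stats <- function(data) {
  preds <- setdiff(names(data), "y_sim")
  res <- NULL
  for (k in seq_along(preds)) {
    for (vars in combn(preds, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = "y_sim"), data = data)
      res <- rbind(res,
                   cbind(model = paste(vars, collapse = " + "),
                         get_stats_fit(fit)))
    }
  }
  res
}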
Attention: \(p\) includes the intercept (\(p\) is the rank of the matrix \(\mathbf{X}\)).
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}
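Adjusted \(R^2\) corrects for the model dimension; its usual formula can be verified directly (a minimal sketch; recall that \(p\) includes the intercept):
## Manual check of adjusted R^2 (sketch; `fit` is an lm object)
n  <- nobs(fit)
p  <- n - df.residual(fit)           # rank of X, intercept included
R2 <- summary(fit)$r.squared
1 - (1 - R2) * (n - 1) / (n - p)     # should match summary(fit)$adj.r.squared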
## Mallow's Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)
  RSS <- deviance(fit)
  return((RSS + p * RSS / (n - p)) / n)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}
Variable selection
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
Goal: find \(f\) that compensates for the “overfit”, i.e. the small, noise-driven increase of the likelihood when we add unrelated predictors.
Simulated Dataset
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}
Variable selection
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
When we add a “junk” variable, \(- \log \hat{L}_p\) decreases, but not by much.
Question: can we find a penalty that compensates for this noise-driven increase of the likelihood?
AIC and BIC
Akaike Information Criterion - AIC
\[
AIC = - 2\log \hat{L}_p + 2 p
\]
\(\hat{L}_p\) is the maximized likelihood of the model.
\(p\) is the dimension of the model.
The smaller the better.
Asymptotic theoretical guarantees.
Easy to use and widespread.
Bayesian Information Criterion - BIC
\[
BIC = - 2\log \hat{L}_p + p \log(n)
\]
\(\hat{L}_p\) is the maximized likelihood of the model.
\(p\) is the dimension of the model.
The smaller the better.
Based on an approximation of the marginal likelihood.
Easy to use and widespread.
See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]
AIC vs BIC
\[
AIC = - 2\log \hat{L}_p + 2 p
\]
\[
BIC = - 2\log \hat{L}_p + p \log(n)
\]
Both have asymptotic theoretical justifications.
BIC penalizes more than AIC \(\to\) chooses smaller models.
Both widely used.
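Both definitions can be checked against R's built-ins (a minimal sketch; note that for an lm fit, R counts \(\sigma^2\) among the estimated parameters, so the df attribute of logLik equals \(p + 1\)):
## Manual check of AIC and BIC (sketch; `fit` is an lm object)
ll <- logLik(fit)
k  <- attr(ll, "df")                            # coefficients + sigma^2
n  <- nobs(fit)
c(manual = -2 * as.numeric(ll) + 2 * k, builtin = AIC(fit))
c(manual = -2 * as.numeric(ll) + k * log(n), builtin = BIC(fit))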
There are some (better) non-asymptotic criteria, see
Giraud C. 2014. Introduction to high-dimensional statistics.
Simulated Dataset
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}
Variable selection
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats