Topics in Uncertainty Quantification for Model Selection

Nicholas Kissel

Department of Statistics & Data Science
Carnegie Mellon University

October 28, 2024

Motivation

Some diseases have a genetic basis.

Figuring out which genetic variants are most responsible for a disease is difficult.

Why?

  • There are many genetic variants to study.
  • There is extremely high correlation among genetic variants.
  • In practice, there are relatively few samples.

What does DNA look like and what are genetic variants?

Here is a snippet of DNA that we may look at.

Since nucleotides always come in the same pairs, we only need to look at one side.

Individuals may have different entries at particular locations.
Genetic fine-mapping: we want to find which of these substitutions are responsible for a particular outcome.

Genetic fine-mapping is a variable selection task.

Suppose there is outcome \(Y_i \in \{0, 1 \}\) with relationship \[ Y_i \sim \text{Bern}(p_i), \qquad \log\left(\frac{p_i}{1 - p_i}\right) = X_i \beta \] where \(X_i \in \{0,1,2\}^d\) for \(i \in \{1, \dots, n \}\) and \(\beta\) is sparse.


Goal: We want to find the non-zero entries of \(\beta\).

Suppose there is outcome \(Y_i \in \mathbb R\) with relationship \[ \mathbb E\left(Y_i|X_i\right) = X_i \beta \] where \(X_i \in \{0,1,2\}^d\) for \(i \in \{1, \dots, n \}\) and \(\beta\) is sparse.


Goal: We want to find the non-zero entries of \(\beta\).
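As a concrete illustration, here is a minimal R sketch (not from the thesis) simulating the linear version of this model; independent SNPs and the minor-allele frequencies are simplifying assumptions, since real variants are highly correlated.

# Minimal simulation sketch: genotype-style features in {0,1,2}^d with a
# sparse coefficient vector (assumption: independent SNPs, for simplicity)
set.seed(1)
n <- 200; d <- 50
maf <- runif(d, 0.1, 0.5)                       # hypothetical minor-allele frequencies
X <- sapply(maf, function(p) rbinom(n, size = 2, prob = p))
beta <- numeric(d)
beta[c(1, 4, 7)] <- c(3, 2, 1)                  # sparse signal
y <- drop(X %*% beta) + rnorm(n)                # linear-model version of the outcome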

   y  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8
1  0     0     1     1     0     1     0     1     1
2  0     1     2     2     1     0     1     0     2
3  0     0     1     1     0     1     0     2     2
4  0     1     2     2     0     1     1     1     2
5  0     0     2     2     1     1     0     1     2
6  0     0     1     1     0     1     0     1     1
7  1     0     2     2     0     1     0     1     2
8  0     1     2     2     0     1     1     1     2
9  0     1     2     2     0     1     1     1     2

However, there exists extremely high correlation among the features.

With these structures, consistent model selection is impossible.

Consider a much simpler dataset.

Suppose we have data with 4 features in a 2-block design
and the following relationship

\[ \mathbb E(Y_i|X_i) = X_{i,1} + X_{i,3} \] but \(\textrm{Cor} \left( X_{i,1}, X_{i,2} \right) = \textrm{Cor} \left( X_{i,3}, X_{i,4} \right) = \rho\) is very high.

We would like to make statements like…

choose an element from each of \(\{X_{\bullet,1}, X_{\bullet,2} \}\) and \(\{X_{\bullet,3}, X_{\bullet,4} \}\) (Wang et al. 2020).

Consider a slightly more complicated dataset.

Suppose we have data with 9 features in a 3-block design and
the following relationship

\[ \mathbb E(Y_i|X_i) = 3 X_{i,1} + 2 X_{i,4} + X_{i,7} \] but again there is high within-block correlation.
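For intuition, here is one way to simulate such a 3-block design in R. This is a sketch with my own simplifying assumptions: Gaussian features and \(\rho = 0.9\).

# Sketch: 9 Gaussian features in 3 blocks with within-block correlation rho
set.seed(1)
n <- 500; rho <- 0.9
make_block <- function(n, rho) {
  z <- rnorm(n)                                  # shared block component
  sapply(1:3, function(j) sqrt(rho) * z + sqrt(1 - rho) * rnorm(n))
}
X <- cbind(make_block(n, rho), make_block(n, rho), make_block(n, rho))
y <- 3 * X[, 1] + 2 * X[, 4] + X[, 7] + rnorm(n) # one signal feature per block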

We want to uncover ordered sets of variables.

  1. First, select any variable from \(\{X_{\bullet, 1}, X_{\bullet, 2}, X_{\bullet, 3} \}\).
  2. Then, select any variable from \(\{X_{\bullet, 4}, X_{\bullet, 5}, X_{\bullet, 6} \}\).
  3. Then, select any variable from \(\{X_{\bullet, 7}, X_{\bullet, 8}, X_{\bullet, 9} \}\).

We therefore devised a method that captures all of these 3-variable models and finds the variables in the correct order.

Let each index correspond to the columns of \(\mathbf X\) where \[ \mathbf X = \begin{pmatrix} X_{1,1} & X_{1,2} & \dots & X_{1, d} \\ X_{2,1} & \ddots & & \vdots \\ \vdots & & \ddots & X_{n-1, d}\\ X_{n, 1} & \dots & X_{n, d-1} & X_{n,d} \end{pmatrix} \]

Forward stability and model path selection

Let’s consider generating the first variable set
(or univariate model set).

  • We may have access to a single dataset \(\mathcal D = \left( X_i, Y_i \right)_{i=1}^n\).
  • We have a list of \(p\) competing univariate models that comprise our candidate model set \(\mathcal{M} = \{1, \dots, p\}\).
  • For all \(m \in \mathcal{M}\) we fit regressors \(\hat f_m(\mathcal{D})\).
  • We evaluate these different regressors with a loss function \(\ell\).

Idea

In the 3-block design, as \(\rho \rightarrow 1\) and \(n \rightarrow \infty\), each of the first 3 features minimizes the error with probability \(\frac{1}{3}\).

Alternative idea

Suppose that the first 3 features have equal signal strength. We expect that as \(n \rightarrow \infty\), the probability each of them minimizes the in-sample error tends toward \(\frac{1}{3}\).

If we knew these probabilities, we could simply select all the univariate models with similarly high values.

To construct model sets, we first acknowledge that each model has a probability of minimizing the in-sample error.

  • Consider dataset \(\mathcal D = \left( X_i, Y_i \right)_{i=1}^n\).

  • Suppose we have \(p\) models that comprise the model set \(\mathcal{M} = \{1, \dots, p\}\).

  • For each \(m \in \mathcal{M}\), \(\hat f_m(\mathcal{D})\) is a model fitted using dataset \(\mathcal{D}\).

  • Let \(\ell\) be a loss function.

For each \(r \in \mathcal{M}\), there exists some probability \(\gamma_r\) that \(r\) minimizes the loss. \[ \def\argmin{\mathop{\mathrm{arg\,min}}} \def\argmax{\mathop{\mathrm{arg\,max}}} \gamma_r = \mathbb P \left( r = \argmin_{ m\leq p} \sum_{i=1}^{n} \ell \left( X_i; \hat f_m(\mathcal D) \right) \right) \] If multiple models have the same high probability of selection, we should select all of them.

Let’s approximate the model selection probability with resampling.

  • Consider dataset \(\mathcal D = \left( X_i, Y_i \right)_{i=1}^n\).

  • Suppose we have \(p\) models that comprise the model set \(\mathcal{M} = \{1, \dots, p\}\).

  • \(\hat f_m(\mathcal{D})\) is a model fitted using dataset \(\mathcal{D}\).

  • Let \(\ell\) be a loss function.

For each \(r \in \mathcal{M}\), there exists some probability \(\gamma_r\) that \(r\) minimizes the loss.

To estimate it, draw \(B\) subsamples \(\mathcal D^{1*}, \dots, \mathcal D^{B*}\), each of size \(\tilde n\) with \(\frac{\tilde{n}}{n} \rightarrow 0\). Then, calculate for all \(r \in \mathcal M\)

\[ \hat \gamma_{r} = \frac{1}{B} \sum_{b=1}^B \mathbb{I} \left( r = \argmin_{ m \leq p} \sum_{i=1}^{\tilde n} \ell \left( X_i^{b*}; \hat f_m \left( \mathcal{D}^{b*} \right) \right) \right) \]
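In R, this estimator might look like the following sketch. The helper choices are mine: squared-error loss, univariate linear models, and \(\tilde n = \sqrt n\) as one common subsample-size choice.

# Sketch: estimate the selection frequency of each univariate model by
# subsampling; on each subsample, refit every model and record the winner
estimate_gamma <- function(X, y, B = 1000, ntilde = floor(sqrt(nrow(X)))) {
  p <- ncol(X)
  wins <- numeric(p)
  for (b in 1:B) {
    idx <- sample(nrow(X), ntilde)              # subsample of size ntilde << n
    losses <- sapply(1:p, function(m) {
      fit <- lm(y[idx] ~ X[idx, m])             # univariate model m
      mean(resid(fit)^2)                        # in-sample squared-error loss
    })
    wins[which.min(losses)] <- wins[which.min(losses)] + 1
  }
  wins / B                                      # estimated selection frequencies
}
gamma_hat <- estimate_gamma(X, y)               # e.g., on the 3-block simulation above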

This produces a set of frequencies, but we need a method capable
of selecting a set of them that’s likely to contain the true best(s).
This is where the ranking and selection literature comes into play.

Using methods from the multinomial ranking and selection literature, we can find a threshold \(\theta\) for which \(\mathbb P( \argmax_q \gamma_q \in \{r : \hat \gamma_r \geq \max_s \hat\gamma_s - \theta \} ) \geq 1 - \alpha\) (Gupta and Nagel 1967; Panchapakesan 1971).

We can now select univariate model sets.

Recall the previous example \[ \mathbb E(Y_i|X_i) = 3 X_{i,1} + 2 X_{i,4} + X_{i,7} \] where each of the above features has 2 highly correlated noise features.

We can apply the resampling procedure to all univariate models, which may return counts like \[ \{\color{#E69F00}{160, 178, 176}, \color{#56B4E9}{104, 92, 119}, \color{#009E73}{50, 61, 60} \}. \]

Then the multinomial procedure returns a set of indices likely to contain the true most probable selection(s), such as

\[ \{1, 2, 3\}. \]
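To make the selection rule concrete, here is a small R illustration using the counts above (with \(B = 1000\)). The threshold \(\theta\) would come from the Gupta–Nagel procedure; the value below is only a placeholder.

counts <- c(160, 178, 176, 104, 92, 119, 50, 61, 60)
gamma_hat <- counts / sum(counts)               # estimated selection frequencies
theta <- 0.05                                   # placeholder; Gupta-Nagel supplies theta
which(gamma_hat >= max(gamma_hat) - theta)      # returns 1 2 3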

How can we construct model sets for \(k\)-variable models?

Using these tools, we can construct a single set of univariate models. In order to obtain larger models, we need only embed this within forward selection.

  • For each selected univariate model, we reapply this procedure to all 2-variable models that nest it.
  • Repeat this on all 3-variable models that nest selected 2-variable models. \[ \vdots \]
  • Repeat this on all \(k\)-variable models that nest selected \((k-1)\)-variable models.

We call this method model path selection (Chapter 2).

d <- nstep                      # number of forward-selection steps (assumed defined)
current_models <- '1'           # start from the intercept-only model
for (i in 1:d) {
  new_models <- c()
  for (j in seq_along(current_models)) {
    # Apply the resampling + multinomial selection procedure to all
    # one-variable extensions of current_models[j]; find_variable_set()
    # returns the vector of selected variable names.
    variable_set <- find_variable_set(current_models[j])
    new_models <- c(new_models, paste0(current_models[j], ' + ', variable_set))
  }
  current_models <- unique(new_models)
}

Example graphical output

🚨There is no stopping rule!🚨

Cross-validation

We can generate a stopping rule with cross-validation.

Theorems 1 and 2 in Chapter 3

Let \(v_i\) be the fold containing observation \(i\), and define the cross-validation risk estimate \[ \hat R_{\textrm{cv}, r} = \frac{1}{n}\sum_{i\leq n} \ell\left(X_i; \hat f_r(\mathcal{D}_{-v_i})\right) \] with center \(\mu_r\) given by either \[ \frac{1}{V} \sum_{v \leq V} \mathbb{E} \left[ \ell(X_0; \hat f_r(\mathcal{D}_{-v})) \mid \mathcal{D}_{-v}\right] \quad \text{or} \quad \frac{1}{V} \sum_{v \leq V} \mathbb{E} \left[ \ell(X_0; \hat f_r(\mathcal{D}_{-v})) \right]. \] Then \[ \sup_{z\in\mathbb R} \left|\mathbb P\left(\max_{1\le r\le p}\sqrt{n}(\hat R_{{\rm cv},r}-\mu_r)\le z\right)-\mathbb P\left(\max_{1\le r\le p} Y_r \le z\right)\right|\rightarrow 0, \] where \(\mathbf Y = (Y_1, \dots, Y_p)\) is a centered Gaussian vector with matching covariance.

Let \(\mathcal D^i\) equal \(\mathcal D\) but with element \(i\) replaced by an iid copy of \((X_i, Y_i)\). For all \(\hat f\), the assumptions require the following first- and second-order stability of the loss:

\[\begin{align*} \nabla_i \ell &:= \ell(X_0; \hat f (\mathcal D)) - \ell(X_0; \hat f (\mathcal D^i)) \ll n^{-1/2} \\ \nabla_j \nabla_i \ell &\ll n^{-3/2}. \end{align*}\]

We then can generate model confidence sets.

Consider the ordered variable set example with a low signal-to-noise ratio.

We apply this theory to risk differences for better power.

Select \(r\) if you fail to reject \[ H_{0, r} : \mu_{r} - \mu_{ s} \leq 0 \] for all \(s\neq r\) (Lei 2020).
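A simplified R sketch of this selection rule follows. It is mine, not the exact calibration in Lei (2020): it uses paired one-sided t-tests with a Bonferroni correction in place of the Gaussian-multiplier bootstrap.

# Sketch: keep model r in the confidence set unless some competitor s is
# significantly better, i.e. unless we reject H0: mu_r - mu_s <= 0
cv_mcs <- function(loss, alpha = 0.05) {
  # loss: n x p matrix; loss[i, m] is the CV loss of model m at held-out point i
  p <- ncol(loss)
  keep <- logical(p)
  for (r in 1:p) {
    pvals <- sapply(setdiff(1:p, r), function(s) {
      d <- loss[, r] - loss[, s]                  # per-observation risk difference
      t.test(d, alternative = "greater")$p.value  # evidence that model r is worse
    })
    keep[r] <- all(pvals >= alpha / (p - 1))      # keep r if no test rejects
  }
  which(keep)                                     # the model confidence set
}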


Note

Both of these methods yield asymptotically valid confidence sets.

Using this theory, we can generate model confidence sets to compare models of different complexity.

Suppose we have data of the form \[ \mathbb{E}(Y_i|X_i) = X_{i,1} + X_{i,2} \] and \(\mathcal{M} = \{ f_1, f_{1,2} \}\) are linear models that use the first feature and both features, respectively.

  • If the model confidence set \(\mathcal M^*\) only contains \(f_{1,2}\), it is indeed better than \(f_1\).
  • If the model confidence set \(\mathcal M^*\) contains both models, \(f_{1,2}\) is not meaningfully better than \(f_{1}\).

Therefore, we will use this in MPS to check whether subsequently added features create significantly better models.
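Hypothetical usage of the cv_mcs sketch above for this two-model comparison, with 5-fold CV losses:

set.seed(1)
n <- 400
x1 <- rnorm(n); x2 <- rnorm(n)
dat <- data.frame(y = x1 + x2 + rnorm(n), x1 = x1, x2 = x2)
folds <- sample(rep(1:5, length.out = n))
loss <- matrix(NA, n, 2)                          # columns: f_1 and f_{1,2}
for (v in 1:5) {
  tr <- dat[folds != v, ]
  te <- dat[folds == v, ]
  f1  <- lm(y ~ x1, data = tr)
  f12 <- lm(y ~ x1 + x2, data = tr)
  loss[folds == v, 1] <- (te$y - predict(f1, te))^2
  loss[folds == v, 2] <- (te$y - predict(f12, te))^2
}
cv_mcs(loss)                                      # with this signal, typically returns only model 2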

There are 2 ways to apply the cross-validation model confidence set to MPS.

Node-wise

\(\mathcal{M}^{\textrm{node}}_1 = \{ \{1\}, \{1, 3\}, \{1, 4\} \}\), \(\mathcal{M}^{\textrm{node}}_2 = \{ \{5\}, \{5, 1\} \}\)

Level-wise

\(\mathcal{M}^{\textrm{level}}_1 = \{ \{1\}, \{5\}, \{1, 3\}, \{1, 4\}, \{5, 1 \} \}\)

Node-wise

Suppose the CV MCS, applied node-wise, selects \(\mathcal{M}^{\textrm{node}*}_1 = \{\{1, 3\}, \{1, 4\} \}\) and \(\mathcal{M}^{\textrm{node}*}_2 = \{ \{5\}, \{ 5, 1\} \}\).

Level-wise

Suppose the CV MCS selects \(\mathcal{M}^{\textrm{level}*}_1 = \{\{1, 3\}, \{1, 4\}, \{5\}, \{5, 1\} \}\)

References

Gupta, S. S., and Nagel, K. (1967), “On selection and ranking procedures and order statistics from the multinomial distribution,” Sankhyā: The Indian Journal of Statistics, Series B (1960-2002), Springer, 29, 1–34.
Lei, J. (2020), “Cross-validation with confidence,” Journal of the American Statistical Association, Taylor & Francis, 115, 1978–1997.
Panchapakesan, S. (1971), “On a subset selection procedure for the most probable event in a multinomial distribution,” in Statistical decision theory and related topics, eds. S. S. Gupta and J. Yackel, New York: Academic Press, pp. 275–298.
Wang, G., Sarkar, A., Carbonetto, P., and Stephens, M. (2020), “A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 82, 1273–1300. https://doi.org/10.1111/rssb.12388.