Types of VIMs

Brian D. Williamson

2021-01-08

library("vimp")
library("SuperLearner")

Introduction

In the main vignette, I discussed variable importance defined using R-squared. I also mentioned that all of those analyses were carried out using a conditional variable importance measure. In this document, I will discuss all three types of variable importance that may be computed using vimp.

In general, I define variable importance as a function of the true population distribution (denoted by \(P_0\)) and a predictiveness measure \(V\) – large values of \(V\) are assumed to be better. Currently, the measures \(V\) implemented in vimp are \(R^2\), classification accuracy, area under the receiver operating characteristic curve (AUC), and deviance. For a fixed function \(f\), the predictiveness is given by \(V(f, P)\), where large values imply that \(f\) is a good predictor of the outcome. The best possible prediction function, \(f_0\), is the oracle model – i.e., the prediction function that I would use if I had access to the distribution \(P_0\). Often, \(f_0\) is the true conditional mean (e.g., for \(R^2\)). Then the total oracle predictiveness can be defined as \(V(f_0, P_0)\). This is the best possible value of predictiveness.
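As a concrete illustration (this formula is a sketch consistent with the \(R^2\) measure described above, not taken verbatim from the package documentation), the \(R^2\) predictiveness of a fixed function \(f\) can be written as \[V(f, P) = 1 - \frac{E_P\{Y - f(X)\}^2}{\mathrm{Var}_P(Y)},\] so that \(V(f_0, P_0)\) with \(f_0(x) = E_{P_0}(Y \mid X = x)\) is the population \(R^2\) of the oracle conditional mean: the proportion of outcome variance explained by the best possible prediction function.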

I define variable importance measures (VIMs) as contrasts in oracle predictiveness. The oracle models that I plug in determine what type of variable importance is being considered, as I outline below. For the remainder of this document, suppose that I have \(p\) variables and an index set \(s\) of interest (containing some subset of the \(p\) variables). Throughout this document, I will use the South African heart disease study data (Hastie, Tibshirani, and Friedman 2009), freely available from the Elements of Statistical Learning website, to illustrate how each VIM may be estimated. Throughout, I will also use a simple library of learners for the Super Learner (this is for illustration only; in practice, I suggest using a large library of learners, as outlined in the main vignette).

## read in the data from the Elements website
library("RCurl")
heart_data <- read.csv(text = getURL("http://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.data"), header = TRUE, stringsAsFactors = FALSE)
## minor data cleaning
heart <- heart_data[, -1] # drop the row-index column
heart$famhist <- ifelse(heart$famhist == "Present", 1, 0)
x <- heart[, -ncol(heart)]
learners.2 <- c("SL.ranger")
set.seed(12345)

Conditional VIMs

The reduced oracle predictiveness is defined as \(V(f_{0,-s}, P_0)\), where \(f_{0,-s}\) is the best possible prediction function that does not use the covariates with index in \(s\). Then the conditional VIM is defined as \[V(f_0, P_0) - V(f_{0,-s}, P_0).\] This is the measure of importance that I estimated in the main vignette. To estimate the conditional VIM for family history of heart disease, I can use the following code:

# note the use of a small V and a small number of SL folds, for illustration only
V <- 2
sl_cvcontrol <- list(V = 2)
fam_vim_cond <- vimp_rsquared(Y = heart$chd, X = x, indx = 5, SL.library = learners.2, na.rm = TRUE, V = V, cvControl = sl_cvcontrol)
#> Warning in cv_vim(Y = Y, X = X, f1 = f1, f2 = f2, indx = indx, V = V, type =
#> "r_squared", : Original estimate < 0; returning zero.

Marginal VIMs

The marginal oracle predictiveness is defined as \(V(f_{0,s}, P_0)\), where \(f_{0,s}\) is the best possible prediction function that only uses the covariates with index in \(s\). The null oracle predictiveness is defined as \(V(f_{0, \emptyset}, P_0)\), where \(f_{0,\emptyset}\) is the best possible prediction function that uses no covariates (i.e., is fitting the mean). Then the marginal VIM is defined as \[V(f_{0,s}, P_0) - V(f_{0,\emptyset}, P_0).\] To estimate the marginal VIM for family history of heart disease, I can use the following code:

# note the use of a small V and a small number of SL folds, for illustration only
fam_vim_marg <- vimp_rsquared(Y = heart$chd, X = x[, 5, drop = FALSE], indx = 1, SL.library = learners.2, na.rm = TRUE, V = V, cvControl = sl_cvcontrol)

Shapley VIMs

The Shapley population VIM (SPVIM) generalizes the marginal and conditional VIMs by averaging over all possible subsets. More specifically, the SPVIM for feature \(j\) is given by \[\sum_{s \subseteq \{1,\ldots,p\} \setminus \{j\}} \binom{p-1}{\lvert s \rvert}^{-1}\{V(f_{0, s \cup \{j\}}, P_0) - V(f_{0,s}, P_0)\};\] this is the average gain in predictiveness from adding feature \(j\) to each possible grouping of the other features. To estimate the SPVIM for family history of heart disease, I can use the following code (note that sp_vim returns VIM estimates for all features):

all_vim_spvim <- sp_vim(Y = heart$chd, X = x, type = "r_squared", SL.library = learners.2, na.rm = TRUE, V = V, cvControl = sl_cvcontrol, env = environment())
#> Warning in sp_vim(Y = heart$chd, X = x, type = "r_squared", SL.library =
#> learners.2, : One or more original estimates < 0; returning zero for these
#> indices.

Adjusting for confounders

In some cases, there may be confounding factors that you want to adjust for in all analyses. For example, in HIV vaccine studies, we often adjust for baseline demographic variables, including age and behavioral factors. If this is the case, then the null predictiveness above can be modified to be \(V(f_{0,c}, P_0)\), where \(c\) is the index set of all confounders.
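As a sketch of this idea in the heart disease data (the choice of tobacco and age as confounders is purely illustrative, and the column names follow the SAheart data used above), the contrast \(V(f_{0, c \cup s}, P_0) - V(f_{0,c}, P_0)\) can be estimated by restricting X to the confounders plus the feature of interest, so that the reduced model uses the confounders alone:

```r
# hypothetical example: importance of family history beyond the
# confounders tobacco and age (columns of the SAheart covariates)
confounders <- c("tobacco", "age")
x_adj <- x[, c(confounders, "famhist")]
# indx points at famhist within the restricted covariate matrix,
# so the comparison model is fit using only the confounders
fam_vim_adj <- vimp_rsquared(
  Y = heart$chd, X = x_adj,
  indx = which(names(x_adj) == "famhist"),
  SL.library = learners.2, na.rm = TRUE,
  V = V, cvControl = sl_cvcontrol
)
```
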

Conclusion

The three VIMs defined here may be different for a given feature of interest. Indeed, we can see this for family history of heart disease in the South African heart disease study data:

fam_vim_cond
#> Variable importance estimates:
#>       Estimate SE         95% CI         VIMP > 0 p-value  
#> s = 5 0        0.06716226 [0, 0.1316356] FALSE    0.8431478
fam_vim_marg
#> Variable importance estimates:
#>       Estimate   SE         95% CI         VIMP > 0 p-value   
#> s = 1 0.08166248 0.04701503 [0, 0.1738103] TRUE     0.04208914
# note: need to look at row for s = 5
all_vim_spvim
#> Variable importance estimates:
#>       Estimate   SE         95% CI                   VIMP > 0 p-value     
#> s = 1 0.00000000 0.03979799 [0.00000000, 0.07800263] FALSE    3.725468e-01
#> s = 2 0.04045153 0.03747541 [0.00000000, 0.11390199] FALSE    7.852825e-02
#> s = 3 0.03793171 0.04292007 [0.00000000, 0.12205350] FALSE    1.190519e-01
#> s = 4 0.01213748 0.03699908 [0.00000000, 0.08465435] FALSE    2.500157e-01
#> s = 5 0.12856253 0.02591122 [0.07777747, 0.17934760]  TRUE    3.987260e-08
#> s = 6 0.00870792 0.02847462 [0.00000000, 0.06451715] FALSE    2.258975e-01
#> s = 7 0.00000000 0.04400392 [0.00000000, 0.08624610] FALSE    3.842072e-01
#> s = 8 0.00000000 0.03148677 [0.00000000, 0.06171293] FALSE    3.412430e-01
#> s = 9 0.12350424 0.03513627 [0.05463841, 0.19237006]  TRUE    5.952758e-05

This is simply a consequence of the fact that the VIMs are different population parameters. All three can provide useful information in practice: the conditional VIM measures the gain from adding the feature to the remaining covariates, the marginal VIM measures the gain over using no covariates at all, and the SPVIM averages this gain over all possible subsets of the other features.

To choose a VIM, identify which of these three (there may be more than one) best addresses your scientific question.

References

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer.