absorber package

Mary E. Savino

Introduction

The package \textsf{absorber} provides a tool for selecting variables in a nonlinear multivariate model. More precisely, it performs variable selection from $n$ observations satisfying the following nonparametric regression model: \begin{equation} \label{eq:model} Y_i = f(x_i) + \varepsilon_i, \quad x_i = \left(x_i^{(1)}, \ldots, x_i^{(p)}\right), \quad 1\leq i \leq n, \end{equation} where $f$ is an unknown real-valued function and the $\varepsilon_i$'s are i.i.d. centered random variables with variance $\sigma^2$. The $x_i$'s are observation points belonging to a compact set $S$ of $\mathbb{R}^p$. We also assume that $f$ actually depends on only $d$ of the $p$ variables, with $d<p$, which means that there exists a real-valued function $\widetilde{f}$ such that $f(x)=\widetilde{f}(\widetilde{x})$, where $x\in\mathbb{R}^p$ and $\widetilde{x}\in\mathbb{R}^d$. Variable selection consists in identifying the components of $\widetilde{x}$. This approach is described in [1], to which we refer the reader for further details and references.
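To make the model concrete, the following minimal sketch simulates data of this form in base R. The function `f_toy`, the seed, and the uniform design on $[0,1]^p$ are illustrative assumptions; in particular, `f_toy` is not the function $f_1$ of [1].

```r
# Illustrative simulation of Y_i = f(x_i) + eps_i (all choices below are assumptions).
set.seed(1)
n <- 700
p <- 5
sigma <- 0.25

# Observation points drawn uniformly on the compact set [0, 1]^p (assumed design).
x_sim <- matrix(runif(n * p), nrow = n, ncol = p)

# Hypothetical regression function depending only on variables 3 and 5 (d = 2).
f_toy <- function(x) sin(2 * pi * x[, 3]) + (x[, 5] - 0.5)^2

# Centered Gaussian noise with standard deviation sigma.
y_sim <- f_toy(x_sim) + rnorm(n, mean = 0, sd = sigma)

# Resampling an irrelevant column (here variable 1) leaves f unchanged.
x_alt <- x_sim
x_alt[, 1] <- runif(n)
unchanged <- isTRUE(all.equal(f_toy(x_sim), f_toy(x_alt)))
unchanged
```

The last check illustrates what "irrelevant" means here: the value of the regression function does not depend on the discarded coordinates.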

Installing

You can install the released version of \textsf{absorber} from CRAN with:

install.packages("absorber")

Variable selection

We first propose to apply our method to $n=700$ observations satisfying Model \eqref{eq:model} with $f=f_1$, defined in [1], where $p=5$. These observations are obtained with a Gaussian noise of standard deviation $\sigma = 0.25$. In the following, the $d=2$ relevant variables to select are $\{3,5\}$ and the irrelevant ones to discard are $\{1,2,4\}$:

true.dimensions = c(3,5) ; false.dimensions = c(1,2,4)

Description of the dataset

The observation set is loaded from files which are provided within the package, as follows:

library(absorber)
# --- Loading the observation points --- #
data('x_obs')
head(x_obs)
##           [,1]       [,2]      [,3]      [,4]      [,5]
## [1,] 0.3687684 0.16895845 0.7114856 0.1493075 0.2300115
## [2,] 0.7162858 0.47407370 0.2271114 0.8187909 0.3845692
## [3,] 0.5543277 0.63473174 0.9341467 0.4209710 0.1551578
## [4,] 0.2551628 0.55242762 0.8940447 0.8587429 0.6602330
## [5,] 0.1468073 0.21261063 0.8249912 0.7159358 0.6177809
## [6,] 0.3917696 0.01350068 0.6862343 0.8377919 0.6143807
# --- Loading the corresponding noisy values of the response variable --- #
data('y_obs')
head(y_obs)
## [1] -0.09049367 -1.56817050  0.02365417  0.32580069  1.07158399  1.21354888

Application of \texttt{absorber} to select the relevant variables

The \texttt{absorber} function of the \texttt{absorber} package is applied with the following arguments:

res = absorber(x = x_obs, y = y_obs, M = 3)

Additional arguments can also be provided to this function; see its help page (\texttt{?absorber}) for details.

The resulting outputs are the following:

First, we can print the sequence of penalization parameters $\lambda$ used in our method:

head(res$lambdas)
## [1] 0.01563831 0.01492752 0.01424904 0.01360140 0.01298320 0.01239309

We can then print the corresponding sequences of selected variables for each penalization parameter:

head(res$selec.var)
## [[1]]
## NULL
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 3
## 
## [[5]]
## [1] 3
## 
## [[6]]
## [1] 3

and finally the variables selected with AIC:

res$aic.var
## [1] 3 5
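
As a quick check, the AIC selection can be compared with the known relevant variables. The sketch below hard-codes the values printed above rather than calling the package, so that it runs on its own; the names `aic.selected`, `recovered`, `false.positives` and `missed` are illustrative.

```r
# Values copied from the text above (not recomputed here).
true.dimensions <- c(3, 5)
false.dimensions <- c(1, 2, 4)
aic.selected <- c(3, 5)  # printed value of res$aic.var

# Exact recovery: same set of indices, regardless of order.
recovered <- setequal(aic.selected, true.dimensions)

# Irrelevant variables wrongly kept, and relevant ones missed.
false.positives <- intersect(aic.selected, false.dimensions)
missed <- setdiff(true.dimensions, aic.selected)

recovered
length(false.positives)
length(missed)
```

In practice, `aic.selected` would be replaced by `res$aic.var`.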

Visualization of the percentage of selection for each variable with \texttt{plot\_selection}

The \texttt{plot\_selection} function of the \texttt{absorber} package produces a histogram of the variable selection percentage for each variable on which $f$ depends. It also displays in red the results obtained with the AIC.

plot_selection(res)

[Figure: percentage of selection for each variable, with the AIC selection shown in red]

We can compare this visualization to one that highlights the relevant and the irrelevant variables in green and red, respectively, as in Figure 6 of [1]. To do so, we gather the results into a data.frame as follows:
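nlam = length(res$lambdas)
# Count how often each variable is selected along the sequence of lambdas
occurrence = data.frame(table(unlist(res$selec.var)))
colnames(occurrence) = c("Covariable", "Percentage")
# Convert counts into percentages of the number of penalization parameters
occurrence$Percentage = occurrence$Percentage * 100 / nlam
# Sort by decreasing percentage and keep that order in the factor levels
occurrence = occurrence[order(-occurrence$Percentage), , drop = FALSE]
occurrence$Covariable = factor(occurrence$Covariable,
                               levels = unique(occurrence$Covariable))

# Label each variable as relevant ('real') or irrelevant ('fake')
occurrence$Category = as.factor(ifelse(occurrence$Covariable %in% true.dimensions,
                                       'real features', 'fake features'))
str(occurrence)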
## 'data.frame':	5 obs. of  3 variables:
##  $ Covariable: Factor w/ 5 levels "3","5","4","2",..: 1 2 3 4 5
##  $ Percentage: num  99 65 45 37 36
##  $ Category  : Factor w/ 2 levels "fake features",..: 2 2 1 1 1
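
Note that `table(unlist(...))` only lists variables that are selected at least once. A minimal sketch on a hypothetical toy list (not package output) shows how passing explicit factor levels keeps never-selected variables, with a percentage of zero:

```r
# Hypothetical selections for 4 penalization parameters and p = 5 variables.
toy.selec <- list(NULL, 3, c(3, 5), c(3, 5, 4))
p <- 5
nlam <- length(toy.selec)

# factor(..., levels = 1:p) forces one count per variable, including zeros.
counts <- table(factor(unlist(toy.selec), levels = seq_len(p)))
percentage <- as.numeric(counts) * 100 / nlam
names(percentage) <- names(counts)
percentage
```

Here variable 3 is selected for 3 of the 4 parameters (75%), variable 5 for 2 (50%), variable 4 for 1 (25%), and variables 1 and 2 never (0%).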

We can then plot the results as a histogram of variable selection percentage:

library(ggplot2)

# Red for irrelevant ('fake') variables, green for relevant ('real') ones;
# only the colours of the categories actually present are kept
color.order = c('firebrick', 'forestgreen')[which(c('fake features', 'real features')
                                                   %in% levels(occurrence$Category))]

# Note: 'linewidth' in element_line() requires ggplot2 >= 3.4.0 (formerly 'size')
plt_occ = ggplot(data = occurrence, aes(x = Covariable, y = Percentage, fill = Category)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = color.order) +
  ylab('Percentage of selection') +
  theme_bw() +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(size = 16, face = 'bold'),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15),
        legend.text = element_text(size = 14),
        legend.position = 'bottom',
        legend.key.size = unit(1, "cm"),
        panel.grid.major = element_line(linewidth = 0.6, linetype = 'solid',
                                        colour = "darkgrey"),
        panel.grid.minor = element_line(linewidth = 0.2, linetype = 'solid',
                                        colour = "darkgrey"))

print(plt_occ)

[Figure: percentage of selection for each variable, with relevant variables in green and irrelevant ones in red]

The results obtained with the AIC allow us to retrieve the correct relevant variables: the AIC selects $\{3,5\}$ while discarding the irrelevant ones.

References

[1] Savino, M. E. and Lévy-Leduc, C. (2024) A novel variable selection method in nonlinear multivariate models using B-splines with an application to geoscience. ⟨hal-04434820⟩.