Imputation Method based on xgboost

This vignette showcases the function xgboostImpute(), which imputes missing values using a gradient-boosted tree model fitted with xgboost::xgboost().

Data

The following example demonstrates the functionality of xgboostImpute() using a subset of the sleep dataset. The columns have been selected deliberately to include some interactions between the missing values.

library(VIM)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")] # dataset with missings
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)

str(dataset)
#> 'data.frame':    62 obs. of  4 variables:
#>  $ Dream  : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
#>  $ NonD   : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
#>  $ BodyWgt: num  8.803 0 1.2194 -0.0834 7.8427 ...
#>  $ Span   : num  3.65 1.5 2.64 NA 4.23 ...

Imputation

To invoke the imputation method, a formula is used to specify which variables are to be estimated and which variables should be used as regressors. First, Dream is imputed based on BodyWgt.

imp_xgboost <- xgboostImpute(formula = Dream ~ BodyWgt, data = dataset)
aggr(imp_xgboost, delimiter = "_imp")

The plot shows that all missing values of the variable Dream were imputed by the xgboostImpute() function.
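
The indicator column that aggr() reads through delimiter = "_imp" can also be inspected directly. The sketch below assumes the default behaviour of xgboostImpute(), in which a logical column named Dream_imp is appended to the returned data; that column name is inferred from the delimiter used above and from the column selection later in this vignette.

```r
library(VIM)

# Rebuild the example data so the block runs on its own
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)

imp_xgboost <- xgboostImpute(Dream ~ BodyWgt, data = dataset)

# No missing values should remain in the imputed variable
sum(is.na(imp_xgboost$Dream))

# Assumed: Dream_imp is TRUE for rows whose Dream value was imputed
table(imp_xgboost$Dream_imp)

# Imputed values next to the regressor they were predicted from
head(imp_xgboost[imp_xgboost$Dream_imp, c("Dream", "BodyWgt")])
```

Comparing the number of TRUE flags with sum(is.na(dataset$Dream)) is a quick way to confirm that exactly the originally missing cells were filled.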

Diagnosing the result

As the next plot shows, the imputed values follow the relationship between Dream and BodyWgt that is visible in the observed data.

imp_xgboost[, c("Dream", "BodyWgt", "Dream_imp")] |> 
  marginplot(delimiter = "_imp")
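
A numeric check can complement the visual one. The following is a minimal sketch, not part of the original analysis, that compares the Pearson correlation of Dream and BodyWgt among the observed rows with the correlation in the completed data. Pearson correlation only captures linear association, so it is a rough summary at best; a large gap between the two numbers would hint that the imputation distorts the relationship.

```r
library(VIM)

# Rebuild the example data so the block runs on its own
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)

imp_xgboost <- xgboostImpute(Dream ~ BodyWgt, data = dataset)

# Correlation among the rows where Dream was actually observed
cor_observed <- cor(dataset$Dream, dataset$BodyWgt, use = "complete.obs")

# Correlation after the missing Dream values have been filled in
cor_completed <- cor(imp_xgboost$Dream, imp_xgboost$BodyWgt)

c(observed = cor_observed, completed = cor_completed)
```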

Imputing multiple variables

To impute several variables at once, the formula can be specified with more than one column name on the left-hand side, separated by +.

imp_xgboost <- xgboostImpute(Dream + NonD + Span ~ BodyWgt, data = dataset)
aggr(imp_xgboost, delimiter = "_imp")
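
When the list of target variables is long or only known at run time, the same kind of formula can be assembled from a character vector. The sketch below uses the hypothetical helper names targets, f and imp_multi, which are introduced here for illustration only. It builds the formula with base R and then counts the missing values per target before and after imputation.

```r
library(VIM)

# Rebuild the example data so the block runs on its own
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)

# Variables to impute (left-hand side of the formula)
targets <- c("Dream", "NonD", "Span")

# Equivalent to writing Dream + NonD + Span ~ BodyWgt by hand
f <- as.formula(paste(paste(targets, collapse = " + "), "~ BodyWgt"))

imp_multi <- xgboostImpute(f, data = dataset)

# Missing values per target variable, before and after imputation
before <- colSums(is.na(dataset[targets]))
after <- sapply(targets, function(v) sum(is.na(imp_multi[[v]])))
rbind(before, after)
```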

Performance of method

To validate the performance of xgboostImpute(), the iris dataset is used. First, some values are randomly set to NA.

library(reactable)

data(iris)
df <- iris
colnames(df) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris) - 1, size = nbr_missing, replace = TRUE))
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA

aggr(df)

sapply(df, function(x) sum(is.na(x)))
#> S.Length  S.Width P.Length  P.Width  Species 
#>       12       10       13       12        0
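
The assignment df[as.matrix(y)] <- NA relies on indexing a data frame with a two-column numeric matrix, where each matrix row names one (row, column) cell. Because the positions are sampled with replacement, the same cell can be drawn twice, which is why duplicates are dropped and the counts above add up to 47 rather than 50. The toy example below, with made-up data and the hypothetical names toy and idx, illustrates the mechanism on a small scale.

```r
# Toy data frame with 3 rows and 4 columns
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Each row of idx is one (row, col) position; the first pair occurs twice
idx <- data.frame(row = c(1, 3, 1), col = c(2, 4, 2))
idx <- idx[!duplicated(idx), ]  # drop the repeated pair

# Set the cells (1, 2) and (3, 4) to NA in one step
toy[as.matrix(idx)] <- NA

# Number of missing values per column
sapply(toy, function(x) sum(is.na(x)))
```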

We can see that all four measurement variables contain missing values, while Species is complete, and that some observations have missing values in more than one variable. In the next step, we impute several variables at once, with Species serving as the regressor.

imp_xgboost <- xgboostImpute(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_xgboost, delimiter = "_imp")
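
Since the true iris values are known, the imputation can also be checked numerically. The following sketch is one possible check, not the evaluation used by the package authors: it counts the missing values left after imputation and computes the mean absolute error of the imputed cells against the original data. The metric and the helper names num_cols, remaining and mae are choices made here for illustration, and the comparison assumes that xgboostImpute() keeps the rows in their original order.

```r
library(VIM)

df <- iris
colnames(df) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
num_cols <- colnames(df)[1:4]

# Same missing-value pattern as in the example above
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris) - 1, size = nbr_missing, replace = TRUE))
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA

imp_xgboost <- xgboostImpute(S.Length + S.Width + P.Length + P.Width ~ Species, df)

# Missing values left in each imputed variable
remaining <- sapply(num_cols, function(v) sum(is.na(imp_xgboost[[v]])))

# Mean absolute error of the imputed cells against the true iris values
# (assumes the row order of the result matches the input)
mae <- sapply(seq_along(num_cols), function(j) {
  was_missing <- is.na(df[[j]])
  mean(abs(imp_xgboost[[num_cols[j]]][was_missing] - iris[[j]][was_missing]))
})
names(mae) <- num_cols

remaining
round(mae, 2)
```

With Species as the only regressor, the model can do little more than predict a typical value per species, so the errors mainly reflect the spread of each measurement within a species.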

The plot indicates that all missing values have been imputed by xgboostImpute(). The following table displays the first five imputation results for all variables, rounded.