This vignette showcases the function xgboostImpute(), which can be used to impute missing values based on a gradient-boosted tree model fitted with xgboost::xgboost().
The following example demonstrates the functionality of xgboostImpute() using a subset of sleep. The columns have been selected deliberately to include some interactions between the missing values.
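The subset described above could be prepared roughly as follows. This is a minimal sketch: the text does not list the selected columns, so the choice of Dream, NonD, BodyWgt and Span is an assumption made for illustration.

```r
library(VIM)

# sleep ships with the VIM package.
data(sleep, package = "VIM")

# Assumption: the subset keeps these four columns; the vignette text
# does not name them explicitly.
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]

# Count the missing values per column and plot the missingness pattern.
colSums(is.na(dataset))
aggr(dataset)
```

Inspecting the pattern before imputing shows which variables are missing together and which are complete enough to serve as regressors.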
To invoke the imputation method, a formula specifies which variables are to be estimated and which variables serve as regressors. First, Dream will be imputed based on BodyWgt.
imp_xgboost <- xgboostImpute(formula = Dream ~ BodyWgt, data = dataset)
aggr(imp_xgboost, delimiter = "_imp")
The plot shows that all missing values of the variable Dream were imputed by the xgboostImpute() function.
As we can see in the next plot, the correlation structure of Dream and BodyWgt is preserved by the imputation method.
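One way to draw such a plot is VIM's marginplot(), which highlights imputed points in a scatterplot. The sketch below assumes that the imputed data frame carries the logical indicator column Dream_imp, which is what the "_imp" delimiter in the aggr() call above suggests. The two-column subset is rebuilt here only so that the block runs on its own.

```r
library(VIM)

data(sleep, package = "VIM")
dataset <- sleep[, c("Dream", "BodyWgt")]

# Impute Dream from BodyWgt, as in the call above.
imp_xgboost <- xgboostImpute(Dream ~ BodyWgt, data = dataset)

# Scatterplot of Dream against BodyWgt with imputed values highlighted.
# Assumption: the indicator column is named Dream_imp.
marginplot(imp_xgboost[, c("Dream", "BodyWgt", "Dream_imp")],
           delimiter = "_imp")
```

If the imputed points follow the same trend as the observed ones, the relationship between the two variables has been retained.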
To impute several variables at once, the formula can be specified with more than one column name on the left-hand side.
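A call of this form might look like the following sketch. The choice of Dream, NonD and Span as targets is an assumption for illustration, and imp_multi is a hypothetical object name.

```r
library(VIM)

data(sleep, package = "VIM")
# Assumption: illustrative column selection, not taken from the text.
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]

# Several targets on the left-hand side, joined by "+".
imp_multi <- xgboostImpute(Dream + NonD + Span ~ BodyWgt, data = dataset)

# Check that no missing values remain in the imputed columns.
colSums(is.na(imp_multi[, c("Dream", "NonD", "Span")]))
```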
To validate the performance of xgboostImpute(), the iris dataset is used. First, some values are randomly set to NA.
library(reactable)
data(iris)
df <- iris
colnames(df) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris) - 1, size = nbr_missing, replace = TRUE))
# drop duplicated (row, col) pairs so each cell is masked only once
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA
aggr(df)
We can see that there are missing values in all four numeric variables and that some observations have missing values in several variables. In the next step we perform a multiple-variable imputation in which Species serves as a regressor.
imp_xgboost <- xgboostImpute(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_xgboost, delimiter = "imp")
The plot indicates that all missing values have been imputed by the xgboostImpute() algorithm. The following table displays the first five imputation results for all variables, rounded.
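Because the true values are known, the imputed cells can be compared with them directly. The sketch below is one possible way to build such a comparison in base R. The object names truth, imputed and comparison are hypothetical, the long table layout is an assumption rather than the vignette's own, and head() stands in for reactable().

```r
library(VIM)

# Rebuild the masked data exactly as in the example above.
data(iris)
df <- iris
colnames(df) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris) - 1, size = nbr_missing, replace = TRUE))
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA

imp_xgboost <- xgboostImpute(S.Length + S.Width + P.Length + P.Width ~ Species, df)

# Numeric matrices of the true and the imputed measurements.
num_cols <- colnames(df)[1:4]
truth <- as.matrix(iris[, 1:4])
imputed <- as.matrix(imp_xgboost[, num_cols])

# Pair every masked cell with its true and its imputed value.
idx <- as.matrix(y)
comparison <- data.frame(
  variable = num_cols[y$col],
  true = truth[idx],
  imputed = round(imputed[idx], 2)
)
head(comparison, 5)

# Mean absolute error per variable as a simple summary.
tapply(abs(comparison$true - comparison$imputed), comparison$variable, mean)
```

A per-variable error summary complements the table, since five rows alone say little about overall accuracy.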