--- title: "Model based Imputation Methods" author: Gregor de Cillia output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Model based Imputation Methods} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.align = "center" ) ``` This vignette showcases the functions `regressionImp()` and `rangerImpute()`, which can both be used to generate imputations for several variables in a dataset using a formula interface. ## Data For data, a subset of `sleep` is used. The columns have been selected deliberately to include some interactions between the missing values. ```{r setup, message = FALSE} library(VIM) dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")] dataset$BodyWgt <- log(dataset$BodyWgt) dataset$Span <- log(dataset$Span) aggr(dataset) str(dataset) ``` ## Imputation In order to invoke the imputation methods, a formula is used to specify which variables are to be estimated and which variables should be used as regressors. We will start by imputing `NonD` based in `BodyWgt` and `Span`. ```{r} imp_regression <- regressionImp(NonD ~ BodyWgt + Span, dataset) imp_ranger <- rangerImpute(NonD ~ BodyWgt + Span, dataset) aggr(imp_regression, delimiter = "_imp") ``` We can see that for `regrssionImp()` there are still missings in `NonD` for all observations where `Span` is unobserved. This is because the regression model could not be applied to those observations. The same is true for the values imputed via `rangerImpute()`. ## Diagnosing the results As we can see in the next two plots, the correlation structure of `NonD` and `BodyWgt` is preserved by both imputation methods. In the case of `regressionImp()` all imputed values almost follow a straight line. This suggests that the variable `Span` had little to no effect on the model. ```{r, fig.height=5} imp_regression[, c("NonD", "BodyWgt", "NonD_imp")] |> marginplot(delimiter = "_imp") ``` For `rangerImpute()` on the other hand, `Span` played an important role in the generation of the imputed values. ```{r, fig.height=5} imp_ranger[, c("NonD", "BodyWgt", "NonD_imp")] |> marginplot(delimiter = "_imp") imp_ranger[, c("NonD", "Span", "NonD_imp")] |> marginplot(delimiter = "_imp") ``` ## Imputing multiple variables To impute several variables at once, the formula in `rangerImpute()` and `regressionImp()` can be specified with more than one column name in the left hand side. ```{r} imp_regression <- regressionImp(Dream + NonD ~ BodyWgt + Span, dataset) imp_ranger <- rangerImpute(Dream + NonD ~ BodyWgt + Span, dataset) aggr(imp_regression, delimiter = "_imp") ``` Again, there are missings left for both `Dream` and `NonD`. ## Performance of method In order to validate the performance of `regressionImp()` the `iris` dataset is used. Firstly, some values are randomly set to `NA`. ```{r} library(reactable) data(iris) df <- iris colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species") # randomly produce some missing values in the data set.seed(1) nbr_missing <- 50 y <- data.frame(row=sample(nrow(iris),size = nbr_missing,replace = T), col=sample(ncol(iris)-1,size = nbr_missing,replace = T)) y<-y[!duplicated(y),] df[as.matrix(y)]<-NA aggr(df) sapply(df, function(x)sum(is.na(x))) ``` We can see that there are missings in all variables and some observations reveal missing values on several points. In the next step we perform a multiple variable imputation and `Species` serves as a regressor. ```{r} imp_regression <- regressionImp(S.Length + S.Width + P.Length + P.Width ~ Species, df) aggr(imp_regression, delimiter = "imp") ``` The plot indicates that all missing values have been imputed by the `regressionImp()` algorithm. The following table displays the rounded first five results of the imputation for all variables. ```{r echo=F,warning=F} results <- cbind("TRUE1" = as.numeric(iris[as.matrix(y[which(y$col==1),])]), "IMPUTED1" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==1),])]),2), "TRUE2" = as.numeric(iris[as.matrix(y[which(y$col==2),])]), "IMPUTED2" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==2),])]),2), "TRUE3" = as.numeric(iris[as.matrix(y[which(y$col==3),])]), "IMPUTED3" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==3),])]),2), "TRUE4" = as.numeric(iris[as.matrix(y[which(y$col==4),])]), "IMPUTED4" = round(as.numeric(imp_regression[as.matrix(y[which(y$col==4),])]),2))[1:5,] reactable(results, columns = list( TRUE1 = colDef(name = "True"), IMPUTED1 = colDef(name = "Imputed"), TRUE2 = colDef(name = "True"), IMPUTED2 = colDef(name = "Imputed"), TRUE3 = colDef(name = "True"), IMPUTED3 = colDef(name = "Imputed"), TRUE4 = colDef(name = "True"), IMPUTED4 = colDef(name = "Imputed") ), columnGroups = list( colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")), colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")), colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")), colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4")) ), striped = TRUE, highlight = TRUE, bordered = TRUE ) ```