Title: | Visualization and Imputation of Missing Values |
---|---|
Description: | New tools for the visualization of missing and/or imputed values are introduced, which can be used for exploring the data and the structure of the missing and/or imputed values. Depending on this structure of the missing values, the corresponding methods may help to identify the mechanism generating the missing values and allow the data, including missing values, to be explored. In addition, the quality of imputation can be visually explored using various univariate, bivariate, multiple and multivariate plot methods. A graphical user interface available in the separate package VIMGUI allows easy handling of the implemented plot methods. |
Authors: | Matthias Templ [aut, cre], Alexander Kowarik [aut] , Andreas Alfons [aut], Gregor de Cillia [aut], Bernd Prantner [ctb], Wolfgang Rannetbauer [aut] |
Maintainer: | Matthias Templ <[email protected]> |
License: | GPL (>= 2) |
Version: | 6.2.4 |
Built: | 2025-01-04 04:57:16 UTC |
Source: | https://github.com/statistikat/vim |
This package introduces new tools for the visualization of missing or imputed values in R, which can be used for exploring the data and the structure of the missing or imputed values. Depending on this structure, they may help to identify the mechanism generating the missing values or errors that may have occurred in the imputation process. This knowledge is necessary for selecting an appropriate imputation method in order to reliably estimate the missing values. Thus the visualization tools should be applied before imputation and the diagnostic tools afterwards.
Detecting missing value mechanisms is usually done by statistical tests or models. Visualization of missing and imputed values can support the test decision, but also reveals more details about the data structure. Most notably, statistical requirements for a test can be checked graphically, and problems like outliers or skewed data distributions can be discovered. Furthermore, the included plot methods may also be able to detect missing value mechanisms in the first place.
A graphical user interface available in the package VIMGUI allows an easy handling of the plot methods. In addition, VIM can be used for data from essentially any field.
Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner
Maintainer: Matthias Templ [email protected]
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
M. Templ, A. Kowarik, P. Filzmoser (2011) Iterative stepwise regression imputation using standard and robust methods. Journal of Computational Statistics and Data Analysis, Vol. 55, pp. 2793-2806.
Calculate or plot the amount of missing/imputed values in each variable and the amount of missing/imputed values in certain combinations of variables.
Print method for objects of class "aggr".
Summary method for objects of class "aggr".
Print method for objects of class "summary.aggr".
aggr(x, delimiter = NULL, plot = TRUE, ...)

## S3 method for class 'aggr'
plot(
  x,
  col = c("skyblue", "red", "orange"),
  bars = TRUE,
  numbers = FALSE,
  prop = TRUE,
  combined = FALSE,
  varheight = FALSE,
  only.miss = FALSE,
  border = par("fg"),
  sortVars = FALSE,
  sortCombs = TRUE,
  ylabs = NULL,
  axes = TRUE,
  labels = axes,
  cex.lab = 1.2,
  cex.axis = par("cex"),
  cex.numbers = par("cex"),
  gap = 4,
  ...
)

## S3 method for class 'aggr'
print(x, ..., digits = NULL)

## S3 method for class 'aggr'
summary(object, ...)

## S3 method for class 'summary.aggr'
print(x, ...)
x |
an object of class |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
plot |
a logical indicating whether the results should be plotted (the
default is |
... |
Further arguments, currently ignored. |
col |
a vector of length three giving the colors to be used for observed, missing and imputed data. If only one color is supplied, it is used for missing and imputed data and observed data is transparent. If only two colors are supplied, the first one is used for observed data and the second color is used for missing and imputed data. |
bars |
a logical indicating whether a small barplot for the frequencies of the different combinations should be drawn. |
numbers |
a logical indicating whether the proportion or frequencies of the different combinations should be represented by numbers. |
prop |
a logical indicating whether the proportion of missing/imputed values and combinations should be used rather than the total amount. |
combined |
a logical indicating whether the two plots should be
combined. If |
varheight |
a logical indicating whether the cell heights are given by the frequencies of occurrence of the corresponding combinations. |
only.miss |
a logical indicating whether the small barplot for the
frequencies of the combinations should only be drawn for combinations
including missing/imputed values (if |
border |
the color to be used for the border of the bars and
rectangles. Use |
sortVars |
a logical indicating whether the variables should be sorted by the number of missing/imputed values. |
sortCombs |
a logical indicating whether the combinations should be sorted by the frequency of occurrence. |
ylabs |
if |
axes |
a logical indicating whether axes should be drawn. |
labels |
either a logical indicating whether labels should be plotted on the x-axis, or a character vector giving the labels. |
cex.lab |
the character expansion factor to be used for the axis labels. |
cex.axis |
the character expansion factor to be used for the axis annotation. |
cex.numbers |
the character expansion factor to be used for the proportion or frequencies of the different combinations |
gap |
if |
digits |
the minimum number of significant digits to be used (see
|
object |
an object of class |
Often it is of interest how many missing/imputed values are contained in each variable. Even more interesting, there may be certain combinations of variables with a high number of missing/imputed values.
If combined is FALSE, two separate plots are drawn for the missing/imputed values in each variable and the combinations of missing/imputed and non-missing values. The barplot on the left hand side shows the amount of missing/imputed values in each variable. In the aggregation plot on the right hand side, all existing combinations of missing/imputed and non-missing values in the observations are visualized. Available, missing and imputed data are color coded as given by col. Additionally, there are two possibilities to represent the frequencies of occurrence of the different combinations. The first option is to visualize the proportions or frequencies by a small bar plot and/or numbers. The second option is to let the cell heights be given by the frequencies of the corresponding combinations. Furthermore, variables may be sorted by the number of missing/imputed values and combinations by the frequency of occurrence to give more power to finding the structure of missing/imputed values.
If combined is TRUE, a small version of the barplot showing the amount of missing/imputed values in each variable is drawn on top of the aggregation plot.
The graphical parameter oma will be set unless supplied as an argument.
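The counting behind the two plots can be sketched in base R. This is only an illustration of the idea, not VIM's implementation: the toy data frame `df` and the `0`/`1` pattern strings are assumptions made for this example.

```r
# Toy data with missing values (hypothetical, for illustration only)
df <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c("x", "y", NA, NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Indicator matrix: TRUE where a value is missing
miss <- is.na(df)

# Amount of missing values in each variable (what the barplot shows)
n_miss_per_var <- colSums(miss)

# Encode each row's missingness pattern as a string such as "1:0:0"
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ":"))

# Frequency and percentage of each combination, most frequent first
comb_count <- sort(table(pattern), decreasing = TRUE)
comb_percent <- 100 * comb_count / nrow(df)
```

Sorting the combinations by frequency mirrors what sortCombs = TRUE is described as doing; sorting `n_miss_per_var` would correspond to sortVars = TRUE.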
for aggr, a list of class "aggr" containing the following components:
x the data used.
combinations a character vector representing the combinations of variables.
count the frequencies of these combinations.
percent the percentage of these combinations.
missings a data.frame containing the amount of missing/imputed values in each variable.
tabcomb the indicator matrix for the combinations of variables.
a list of class "summary.aggr" containing the following components:
missings a data.frame containing the amount of missing or imputed values in each variable.
combinations a data.frame containing a character vector representing the combinations of variables along with their frequencies and percentages.
Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments labs and names.arg can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use ylabs and labels instead.
Andreas Alfons, Matthias Templ, modifications for displaying imputed values by Bernd Prantner
Matthias Templ, modifications by Andreas Alfons and Bernd Prantner
Matthias Templ, modifications by Andreas Alfons
Andreas Alfons, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(sleep, package="VIM")
## for missing values
a <- aggr(sleep)
a
summary(a)

## for imputed values
sleep_IMPUTED <- kNN(sleep)
a <- aggr(sleep_IMPUTED, delimiter="_imp")
a
summary(a)

data(sleep, package = "VIM")
a <- aggr(sleep, plot=FALSE)
a

data(sleep, package = "VIM")
summary(aggr(sleep, plot=FALSE))

data(sleep, package = "VIM")
s <- summary(aggr(sleep, plot=FALSE))
s
Convert colors to semitransparent colors.
alphablend(col, alpha = NULL, bg = NULL)
col |
a vector specifying colors. |
alpha |
a numeric vector containing the alpha values (between 0 and 1). |
bg |
the background color to be used for alphablending. This can be used as a workaround for graphics devices that do not support semitransparent colors. |
a vector containing the semitransparent colors.
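One way to picture the bg workaround is as a weighted average of the color and an opaque background in RGB space. The helper `blend_rgb` below is hypothetical and uses only base R (grDevices); the formula is an assumption for illustration and is not taken from VIM's source. It handles a single alpha value.

```r
# Hypothetical helper: mix `col` with an opaque background `bg`
# using one alpha value between 0 and 1 (not VIM's implementation)
blend_rgb <- function(col, alpha, bg = "white") {
  fg <- grDevices::col2rgb(col) / 255               # 3 x n matrix, values in [0, 1]
  back <- as.vector(grDevices::col2rgb(bg)) / 255   # length-3 vector (R, G, B)
  # weighted average per channel; `back` is recycled down each column
  mixed <- alpha * fg + (1 - alpha) * back
  mixed <- pmin(pmax(mixed, 0), 1)                  # guard against rounding error
  grDevices::rgb(mixed[1, ], mixed[2, ], mixed[3, ])
}

blend_rgb("red", 0.6)
```

The result is an ordinary opaque color, which is why such a blend can stand in for transparency on devices that cannot draw semitransparent colors.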
Andreas Alfons
alphablend("red", 0.6)
Average log brain and log body weights for 28 Species
A data frame with 28 observations on the following 2 variables.
log body weight
log brain weight
The original data can be found in package MASS. 10 values on brain weight are set to be missing.
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p. 57.
Venables, W. N. and Ripley, B. D. (1999) Modern Applied Statistics with S-PLUS. Third Edition. Springer.
Templ, M. (2022) Visualization and Imputation of Missing Values. Springer Publishing. Upcoming book.
data(Animals_na)
aggr(Animals_na)
Barplot with highlighting of missing/imputed values in other variables by splitting each bar into two parts. Additionally, information about missing/imputed values in the variable of interest is shown on the right hand side.
barMiss(
  x,
  delimiter = NULL,
  pos = 1,
  selection = c("any", "all"),
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  border = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  labels = axes,
  only.miss = TRUE,
  miss.labels = axes,
  interactive = TRUE,
  ...
)
x |
a vector, matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
pos |
a numeric value giving the index of the variable of interest.
Additional variables in |
selection |
the selection method for highlighting missing/imputed
values in multiple additional variables. Possible values are |
col |
a vector of length six giving the colors to be used. If only one color is supplied, the bars are transparent and the supplied color is used for highlighting missing/imputed values. Else if two colors are supplied, they are recycled. |
border |
the color to be used for the border of the bars. Use
|
main, sub |
main and sub title. |
xlab, ylab |
axis labels. |
axes |
a logical indicating whether axes should be drawn on the plot. |
labels |
either a logical indicating whether labels should be plotted below each bar, or a character vector giving the labels. |
only.miss |
logical; if |
miss.labels |
either a logical indicating whether label(s) should be plotted below the bar(s) on the right hand side, or a character string or vector giving the label(s) (see ‘Details’). |
interactive |
a logical indicating whether variables can be switched interactively (see ‘Details’). |
... |
further graphical parameters to be passed to
|
If more than one variable is supplied, the bars for the variable of interest are split according to missingness/number of imputed missings in the additional variables.
If only.miss=TRUE, the missing/imputed values in the variable of interest are visualized by one bar on the right hand side. If additional variables are supplied, this bar is again split into two parts according to missingness/number of imputed missings in the additional variables.
Otherwise, a small barplot consisting of two bars is drawn on the right hand side. The first bar corresponds to observed values in the variable of interest and the second bar to missing/imputed values. Since these two bars are not on the same scale as the main barplot, a second y-axis is plotted on the right (if axes=TRUE). Each of the two bars is again split into two parts according to missingness/number of imputed missings in the additional variables. Note that this display does not make sense if only one variable is supplied, therefore only.miss is ignored in that case.
If interactive=TRUE, clicking in the left margin of the plot results in switching to the previous variable and clicking in the right margin results in switching to the next variable. Clicking anywhere else on the graphics device quits the interactive session. When switching to a continuous variable, a histogram is plotted rather than a barplot.
a numeric vector giving the coordinates of the midpoints of the bars.
Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments axisnames, names.arg and names.miss can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use labels and miss.labels instead.
Andreas Alfons, modifications to show imputed values by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(sleep, package = "VIM")
## for missing values
x <- sleep[, c("Exp", "Sleep")]
barMiss(x)
barMiss(x, only.miss = FALSE)

## for imputed values
x_IMPUTED <- kNN(sleep[, c("Exp", "Sleep")])
barMiss(x_IMPUTED, delimiter = "_imp")
barMiss(x_IMPUTED, delimiter = "_imp", only.miss = FALSE)
Dataset containing the original Wisconsin breast cancer data.
A data frame with 699 observations on the following 11 variables.
Sample ID
as integer from 1 - 10
as integer from 1 - 10
as integer from 1 - 10
as integer from 1 - 10
as integer from 1 - 10
as integer from 1 - 10, includes 16 missings
as integer from 1 - 10
as integer from 1 - 10
as integer from 1 - 10
benign or malignant
The data were downloaded and conditioned for R from the UCI machine learning repository, see https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original). This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. If you publish results when using this database, then please include this information in your acknowledgements. Also, please cite one or more of:
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).
data(bcancer)
aggr(bcancer)
Plot a background map.
bgmap(map, add = FALSE, ...)
map |
either a matrix or |
add |
a logical indicating whether |
... |
further arguments and graphical parameters to be passed to
|
Andreas Alfons
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
data(kola.background, package = "VIM")
bgmap(kola.background)
A plastic product is produced in three parallel reactors (TK104, TK105, or TK107). For each row in the dataset, we have the same batch of raw material that was split, and fed to the 3 reactors. These values are the brittleness index for the product produced in the reactor. A simulated data set.
A data frame with 23 observations on the following 3 variables.
Brittleness for batches of raw material in reactor 104
Brittleness for batches of raw material in reactor 105
Brittleness for batches of raw material in reactor 107
https://openmv.net/info/brittleness-index
data(brittleness)
aggr(brittleness)
This data set is the same as chorizon in package mvoutlier, except that values below the detection limit are coded as NA.
A data frame with 606 observations on the following 110 variables.
Each variable is a numeric vector.
For a more detailed description of this data set, see the help file chorizon in package mvoutlier.
Kola Project (1993-1998)
Reimann, C., Filzmoser, P., Garrett, R.G. and Dutter, R. (2008) Statistical Data Analysis Explained: Applied Environmental Statistics with R. Wiley.
data(chorizonDL, package = "VIM")
summary(chorizonDL)
This is a modified version of the original training data set taken from the UCI repository, see reference. The modifications only concern the levels of factor variables, which were set to appropriate values. This data set is about horse diseases, where the task is to determine whether the lesion of the horse was surgical or not.
A training data frame with 300 observations on the following 31 variables.
yes or no
1 equals an adult horse, 2 is a horse younger than 6 months
ID
rectal temperature
heart rate in beats per minute
a normal rate is between 8 and 10
temperature of extremities
factor with four categories
a clinical judgement. The longer the refill, the poorer the circulation. Possible values are 1 = < 3 seconds and 2 = >= 3 seconds
a subjective judgement of the horse's pain level
an indication of the activity in the horse's gut. As the gut becomes more distended or the horse becomes more toxic, the activity decreases
An animal with abdominal distension is likely to be painful and have reduced gut motility. A horse with severe abdominal distension is likely to require surgery just to relieve the pressure
This refers to any gas coming out of the tube. A large gas cap in the stomach is likely to give the horse discomfort
possible values are 1 = none, 2 = > 1 liter, 3 = < 1 liter. The greater the amount of reflux, the more likely it is that there is some serious obstruction to the fluid passage from the rest of the intestine
scale is from 0 to 14 with 7 being neutral. Normal values are in the 3 to 4 range
Rectal examination. Absent feces probably indicates an obstruction
abdomen. possible values 1 = normal, 2 = other, 3 = firm feces in the large intestine, 4 = distended small intestine, 5 = distended large intestine
packed cell volume. normal range is 30 to 50. The level rises as the circulation becomes compromised or as the animal becomes dehydrated.
total protein. Normal values lie in the 6-7.5 (gms/dL) range. The higher the value the greater the dehydration
Abdominocentesis appearance. A needle is put in the horse's abdomen and fluid is obtained from the abdominal cavity
abdominocentesis total protein. The higher the level of protein the more likely it is to have a compromised gut. Values are in gms/dL
What eventually happened to the horse?
retrospectively, was the problem (lesion) surgical?
type of lesion
type of lesion
type of lesion
temperature of extremities (ordered)
mucous membranes. A subjective measurement of colour
different recodings of mucous membranes
https://archive.ics.uci.edu/ml/datasets/Horse+Colic Creators: Mary McLeish & Matt Cecile, Department of Computer Science, University of Guelph, Guelph, Ontario, Canada N1G 2W1 Donor: Will Taylor
data(colic)
aggr(colic)
Subset of the collision data from December 20 to December 31, 2018, from NYCD.
Each record represents a collision in NYC by city, borough, precinct and cross street.
https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
data(collisions)
aggr(collisions)
Colored map in which the proportion or amount of missing/imputed values in each region is coded according to a continuous or discrete color scheme. The sequential color palette may thereby be computed in the HCL or the RGB color space.
colormapMiss(
  x,
  region,
  map,
  imp_index = NULL,
  prop = TRUE,
  polysRegion = 1:length(x),
  range = NULL,
  n = NULL,
  col = c("red", "orange"),
  gamma = 2.2,
  fixup = TRUE,
  coords = NULL,
  numbers = TRUE,
  digits = 2,
  cex.numbers = 0.8,
  col.numbers = par("fg"),
  legend = TRUE,
  interactive = TRUE,
  ...
)

colormapMissLegend(
  xleft,
  ybottom,
  xright,
  ytop,
  cmap,
  n = 1000,
  horizontal = TRUE,
  digits = 2,
  cex.numbers = 0.8,
  col.numbers = par("fg"),
  ...
)
x |
a numeric vector. |
region |
a vector or factor of the same length as |
map |
an object of any class that contains polygons and provides its
own plot method (e.g., |
imp_index |
a logical-vector indicating which values of ‘x’ have
been imputed. If given, it is used for highlighting and the colors are
adjusted according to the given colors for imputed variables (see
|
prop |
a logical indicating whether the proportion of missing/imputed values should be used rather than the total amount. |
polysRegion |
a numeric vector specifying the region that each polygon belongs to. |
range |
a numeric vector of length two specifying the range (minimum and maximum) of the proportion or amount of missing/imputed values to be used for the color scheme. |
n |
for |
col |
the color range (start and end) to be used. RGB colors may be
specified as character strings or as objects of class
" |
gamma |
numeric; the display gamma value (see
|
fixup |
a logical indicating whether the colors should be corrected to
valid RGB values (see |
coords |
a matrix or |
numbers |
a logical indicating whether the corresponding proportions or numbers of missing/imputed values should be used as labels for the regions. |
digits |
the number of digits to be used in the labels (in case of proportions). |
cex.numbers |
the character expansion factor to be used for the labels. |
col.numbers |
the color to be used for the labels. |
legend |
a logical indicating whether a legend should be plotted. |
interactive |
a logical indicating whether more detailed information about missing/imputed values should be displayed interactively (see ‘Details’). |
... |
further arguments to be passed to |
xleft |
left x position of the legend. |
ybottom |
bottom y position of the legend. |
xright |
right x position of the legend. |
ytop |
top y position of the legend. |
cmap |
a list as returned by |
horizontal |
a logical indicating whether the legend should be drawn horizontally or vertically. |
The proportion or amount of missing/imputed values in x of each region is coded according to a continuous or discrete color scheme in the color range defined by col. In addition, the proportions or numbers can be shown as labels in the regions.
If interactive is TRUE, clicking in a region displays more detailed information about missing/imputed values on the console. Clicking outside the borders quits the interactive session.
colormapMiss returns a list with the following components:
nmiss a numeric vector containing the number of missing/imputed values in each region.
nobs a numeric vector containing the number of observations in each region.
pmiss a numeric vector containing the proportion of missing values in each region.
prop a logical indicating whether the proportion of missing/imputed values has been used rather than the total amount.
range the range of the proportion or amount of missing/imputed values corresponding to the color range.
n either a positive integer giving the number of equally spaced cut-off points for a discretized color scheme, or NULL for a continuous color scheme.
start the start color of the color scheme.
end the end color of the color scheme.
space a character string giving the color space (either "rgb" for RGB colors or "hcl" for HCL colors).
gamma numeric; the display gamma value (see colorspace::hex()).
fixup a logical indicating whether the colors have been corrected to valid RGB values (see colorspace::hex()).
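The per-region quantities named nmiss, nobs and pmiss above can be pictured with a small base R sketch. The vectors `x` and `region` are made-up inputs, and the code is an illustration of the counting only, not VIM's implementation.

```r
# Hypothetical inputs: a variable with missing values and one region label per observation
x <- c(2.1, NA, 3.4, NA, NA, 1.0, 5.2, NA)
region <- factor(c("north", "north", "north",
                   "south", "south", "south",
                   "east", "east"))

nmiss <- tapply(is.na(x), region, sum)   # number of missing values per region
nobs  <- tapply(x, region, length)       # number of observations per region
pmiss <- nmiss / nobs                    # proportion of missing values per region
```

With prop = TRUE the color of a region would be driven by a quantity like `pmiss`, otherwise by a count like `nmiss`.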
Some of the argument names and positions have changed with versions 1.3 and 1.4 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments cex.text and col.text can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use cex.numbers and col.numbers instead.
Andreas Alfons, modifications to show imputed values by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
colSequence(), growdotMiss(), mapMiss()
Compute color sequences by linear interpolation based on a continuous color scheme between certain start and end colors. Color sequences may thereby be computed in the HCL or RGB color space.
colSequence(p, start, end, space = c("hcl", "rgb"), ...)

colSequenceRGB(p, start, end, fixup = TRUE, ...)

colSequenceHCL(p, start, end, fixup = TRUE, ...)
p |
a numeric vector with values between 0 and 1 giving values to be used for interpolation between the start and end color (0 corresponds to the start color, 1 to the end color). |
start, end |
the start and end color, respectively. For HCL colors,
each can be supplied as a vector of length three (hue, chroma, luminance) or
an object of class " |
space |
character string; if |
... |
for |
fixup |
a logical indicating whether the colors should be corrected to
valid RGB values (see |
A character vector containing hexadecimal strings of the form "#RRGGBB".
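Linear interpolation between a start and an end color can be sketched in base R for the RGB case. The helper `interp_rgb` is hypothetical and only illustrates the idea (0 maps to the start color, 1 to the end color); it is not the code behind colSequenceRGB and it does not cover the HCL color space.

```r
# Hypothetical helper: interpolate linearly between two colors in RGB space
interp_rgb <- function(p, start, end) {
  s <- as.vector(grDevices::col2rgb(start)) / 255   # start color as (R, G, B) in [0, 1]
  e <- as.vector(grDevices::col2rgb(end)) / 255     # end color as (R, G, B) in [0, 1]
  # one row per value of p, one column per channel
  m <- outer(1 - p, s) + outer(p, e)
  m <- pmin(pmax(m, 0), 1)                          # guard against rounding error
  grDevices::rgb(m[, 1], m[, 2], m[, 3])
}

interp_rgb(c(0, 0.3, 0.55, 0.8, 1), "white", "red")
```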
Andreas Alfons
Zeileis, A., Hornik, K., Murrell, P. (2009) Escaping RGBland: Selecting colors for statistical graphics. Computational Statistics & Data Analysis, 53 (9), 1259–1270.
colorspace::hex(), colorspace::sequential_hcl()
p <- c(0, 0.3, 0.55, 0.8, 1)

## HCL colors
colSequence(p, c(0, 0, 100), c(0, 100, 50))
colSequence(p, polarLUV(L=90, C=30, H=90), c(0, 100, 50))

## RGB colors
colSequence(p, c(1, 1, 1), c(1, 0, 0), space="rgb")
colSequence(p, RGB(1, 1, 0), "red")
Count the number of infinite or missing values in a vector.
countInf(x)
x |
a vector. |
countInf returns the number of infinite values in x.
countNA returns the number of missing values in x.
Andreas Alfons
data(sleep, package="VIM")
countInf(log(sleep$Dream))
countNA(sleep$Dream)
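As a rough base R illustration of what is being counted, the sketch below assumes that the two counts correspond to `is.infinite()` and `is.na()`. That is an assumption about the behaviour, not a copy of the package code. It also shows why the example above takes a logarithm first: zeros turn into `-Inf`.

```r
# Assumption: the counts correspond to is.infinite() and is.na()
x <- c(1, 0, NA, 4)
lx <- log(x)                   # log(0) is -Inf, log(NA) stays NA

n_inf <- sum(is.infinite(lx))  # infinite values after the transformation
n_na  <- sum(is.na(x))         # missing values in the original vector
n_inf
n_na
```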
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
A data frame with 768 observations on the following 9 variables.
Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age in years
Diabetes (yes or no)
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
https://www.kaggle.com/uciml/pima-indians-diabetes-database/data
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.
data(diabetes)
aggr(diabetes)
Various error measures evaluating the quality of imputations
evaluation(x, y, m, vartypes = "guess")
nrmse(x, y, m)
pfc(x, y, m)
msecov(x, y)
msecor(x, y)
x |
matrix or data frame |
y |
matrix or data frame of the same size as x |
m |
the indicator matrix for missing cells |
vartypes |
a vector of length ncol(x) specifying the variables types, like factor or numeric |
This function has been mainly written for procedures that evaluate imputation or replacement of rounded zeros. The ni parameter can thus, e.g., be used for expressing the number of rounded zeros.
the error measures value
Matthias Templ
M. Templ, A. Kowarik, P. Filzmoser (2011) Iterative stepwise regression imputation using standard and robust methods. Journal of Computational Statistics and Data Analysis, Vol. 55, pp. 2793-2806.
data(iris)
iris_orig <- iris_imp <- iris
iris_imp$Sepal.Length[sample(1:nrow(iris), 10)] <- NA
iris_imp$Sepal.Width[sample(1:nrow(iris), 10)] <- NA
iris_imp$Species[sample(1:nrow(iris), 10)] <- NA
m <- is.na(iris_imp)
iris_imp <- kNN(iris_imp, imp_var = FALSE)
evaluation(iris_orig, iris_imp, m = m, vartypes = c(rep("numeric", 4), "factor"))
msecov(iris_orig[, 1:4], iris_imp[, 1:4])
The relative consumption of certain food items in European and Scandinavian countries.
A data frame with 16 observations on the following 21 variables.
The numbers represent the percentage of the population consuming that food type.
https://openmv.net/info/food-consumption
data(food)
str(food)
aggr(food)
Computes the average missing value gap of a vector.
gapMiss(x, what = mean)
x |
a numeric vector |
what |
default is the arithmetic mean. One can supply one's own function that returns a vector of length 1 (e.g. median) |
The length of each sequence of missing values (gap) in a vector is calculated and the mean gap is reported.
The gap statistics
Matthias Templ based on a suggestion and draft from Huang Tian Yuan.
v <- rnorm(20)
v[3] <- NA
v[6:9] <- NA
v[13:17] <- NA
v
gapMiss(v)
gapMiss(v, what = median)
gapMiss(v, what = function(x) mean(x, trim = 0.1))
gapMiss(v, what = var)
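The gap computation described above can be sketched with run-length encoding in base R. The helper `gap_lengths()` is hypothetical and only illustrates the idea of measuring each run of consecutive missing values; it is not taken from the package source.

```r
# Hypothetical helper (not part of VIM): lengths of all runs of consecutive NAs
gap_lengths <- function(x) {
  runs <- rle(is.na(x))      # runs of TRUE (missing) and FALSE (observed)
  runs$lengths[runs$values]  # keep only the lengths of the missing runs
}

v <- c(1, NA, 3, 4, NA, NA, NA, NA, 9, 10)
gaps <- gap_lengths(v)       # one gap of length 1, one of length 4
mean(gaps)                   # average gap length
median(gaps)                 # any summary returning a single value works
```

Swapping `mean` for another summary function mirrors the role of the `what` argument.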
The function gowerD is used by kNN to compute the distances for numerical, factor, ordered and semi-continuous variables.
gowerD(
  data.x,
  data.y = data.x,
  weights = rep(1, ncol(data.x)),
  numerical = colnames(data.x),
  factors = vector(),
  orders = vector(),
  mixed = vector(),
  levOrders = vector(),
  mixed.constant = rep(0, length(mixed)),
  returnIndex = FALSE,
  nMin = 1L,
  returnMin = FALSE,
  methodStand = "range"
)
data.x |
data frame |
data.y |
data frame |
weights |
numeric vector providing weights for the variables, one per column of data.x (the default has length ncol(data.x)) |
numerical |
names of numerical variables |
factors |
names of factor variables |
orders |
names of ordered variables |
mixed |
names of mixed variables |
levOrders |
vector with number of levels for each orders variable |
mixed.constant |
vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability |
returnIndex |
logical if TRUE return the index of the minimum distance |
nMin |
integer number of values with smallest distance to be returned |
returnMin |
logical if the computed distances for the indices should be returned |
methodStand |
character either "range" or "iqr", iqr is more robust for outliers |
returnIndex=FALSE: a numerical matrix n x m with the computed distances
returnIndex=TRUE: a named list with "ind" containing the requested indices and "mins" the computed distances
data(sleep)
# all variables used as numerical
gowerD(sleep)

# split into numerical and ordered variables
gowerD(sleep, numerical = c("BodyWgt", "BrainWgt", "NonD", "Dream", "Sleep", "Span", "Gest"),
  orders = c("Pred","Exp","Danger"),
  levOrders = c(5,5,5))

# as before but only returning the index of the closest observation
gowerD(sleep, numerical = c("BodyWgt", "BrainWgt", "NonD", "Dream", "Sleep", "Span", "Gest"),
  orders = c("Pred","Exp","Danger"),
  levOrders = c(5,5,5), returnIndex = TRUE)
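For intuition, the textbook Gower distance for purely numerical variables with range standardization can be sketched in a few lines of base R. The function `gower_numeric()` is hypothetical; it ignores factor, ordered and semi-continuous variables as well as missing values, and it is not claimed to match the variation implemented in gowerD.

```r
# Hypothetical helper (not part of VIM): Gower distance for numeric columns,
# standardized by the range of each variable (compare methodStand = "range").
gower_numeric <- function(x, y = x, weights = rep(1, ncol(x))) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  # range of every variable over both data sets (assumed to be non-zero)
  rng <- apply(rbind(x, y), 2, function(col) diff(range(col)))
  d <- matrix(0, nrow(x), nrow(y))
  for (k in seq_len(ncol(x))) {
    # weighted, range-scaled absolute difference in variable k
    d <- d + weights[k] * abs(outer(x[, k], y[, k], "-")) / rng[k]
  }
  d / sum(weights)
}

x <- data.frame(a = c(0, 5, 10), b = c(1, 1, 3))
d <- gower_numeric(x)   # 3 x 3 matrix with entries in [0, 1]
d
which.min(d[1, -1]) + 1 # row closest to the first observation
```

The result is an n x m matrix, matching the shape described for returnIndex=FALSE.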
Map with dots whose sizes correspond to the values in a certain variable. Observations with missing/imputed values in additional variables are highlighted.
growdotMiss(
  x,
  coords,
  map,
  pos = 1,
  delimiter = NULL,
  selection = c("any", "all"),
  log = FALSE,
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  border = par("bg"),
  alpha = NULL,
  scale = NULL,
  size = NULL,
  exp = c(0, 0.95, 0.05),
  col.map = grey(0.5),
  legend = TRUE,
  legtitle = "Legend",
  cex.legtitle = par("cex"),
  cex.legtext = par("cex"),
  ncircles = 6,
  ndigits = 1,
  interactive = TRUE,
  ...
)
x |
a vector, matrix or |
coords |
a matrix or |
map |
a background map to be passed to |
pos |
a numeric value giving the index of the variable determining the dot sizes. |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
selection |
the selection method for highlighting missing/imputed
values in multiple additional variables. Possible values are |
log |
a logical indicating whether the variable given by |
col |
a vector of length six giving the colors to be used in the plot. If only one color is supplied, it is used for the borders of non-highlighted dots and the surface area of highlighted dots. Else if two colors are supplied, they are recycled. |
border |
a vector of length four giving the colors to be used for the
borders of the growing dots. Use |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
scale |
scaling factor of the map. |
size |
a vector of length two giving the sizes for the smallest and largest dots. |
exp |
a vector of length three giving the factors that define the shape of the exponential function (see ‘Details’). |
col.map |
the color to be used for the background map. |
legend |
a logical indicating whether a legend should be plotted. |
legtitle |
the title for the legend. |
cex.legtitle |
the character expansion factor to be used for the title of the legend. |
cex.legtext |
the character expansion factor to be used in the legend. |
ncircles |
the number of circles displayed in the legend. |
ndigits |
the number of digits displayed in the legend. Note that this is just a suggestion (see |
interactive |
a logical indicating whether information about certain observations can be displayed interactively (see ‘Details’). |
... |
for |
The smallest dots correspond to the 10% quantile and the largest dots to the 99% quantile. In between, the dots grow exponentially, with exp defining the shape of the exponential function. Missings/imputed missings in the variable of interest will be drawn as rectangles.
If interactive=TRUE, detailed information for an observation can be printed on the console by clicking on the corresponding point. Clicking in a region that does not contain any points quits the interactive session.
The function was renamed to growdotMiss in version 1.3. bubbleMiss is a (deprecated) wrapper for growdotMiss for backward compatibility with older versions. However, due to extended functionality, some of the argument positions have changed.
The code is based on bubbleFIN from package StatDA (removed from CRAN).
Andreas Alfons, Matthias Templ, Peter Filzmoser, Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
bgmap(), mapMiss(), colormapMiss()
data(chorizonDL, package = "VIM")
data(kola.background, package = "VIM")
coo <- chorizonDL[, c("XCOO", "YCOO")]
## for missing values
x <- chorizonDL[, c("Ca","As", "Bi")]
growdotMiss(x, coo, kola.background, border = "white")
## for imputed values
x_imp <- kNN(chorizonDL[,c("Ca","As","Bi" )])
growdotMiss(x_imp, coo, kola.background, delimiter = "_imp", border = "white")
Histogram with highlighting of missing/imputed values in other variables by splitting each bin into two parts. Additionally, information about missing/imputed values in the variable of interest is shown on the right hand side.
histMiss(
  x,
  delimiter = NULL,
  pos = 1,
  selection = c("any", "all"),
  breaks = "Sturges",
  right = TRUE,
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  border = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  only.miss = TRUE,
  miss.labels = axes,
  interactive = TRUE,
  ...
)
x |
a vector, matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
pos |
a numeric value giving the index of the variable of interest.
Additional variables in |
selection |
the selection method for highlighting missing/imputed
values in multiple additional variables. Possible values are |
breaks |
either a character string naming an algorithm to compute the
breakpoints (see |
right |
logical; if |
col |
a vector of length six giving the colors to be used. If only one color is supplied, the bars are transparent and the supplied color is used for highlighting missing/imputed values. Else if two colors are supplied, they are recycled. |
border |
the color to be used for the border of the cells. Use
|
main, sub |
main and sub title. |
xlab, ylab |
axis labels. |
axes |
a logical indicating whether axes should be drawn on the plot. |
only.miss |
logical; if |
miss.labels |
either a logical indicating whether label(s) should be plotted below the bar(s) on the right hand side, or a character string or vector giving the label(s) (see ‘Details’). |
interactive |
a logical indicating whether the variables can be switched interactively (see ‘Details’). |
... |
further graphical parameters to be passed to
|
If more than one variable is supplied, the bins for the variable of interest will be split according to missingness/number of imputed missings in the additional variables.
If only.miss=TRUE, the missing/imputed values in the variable of interest are visualized by one bar on the right hand side. If additional variables are supplied, this bar is again split into two parts according to missingness/number of imputed missings in the additional variables.
Otherwise, a small barplot consisting of two bars is drawn on the right hand side. The first bar corresponds to observed values in the variable of interest and the second bar to missing/imputed values. Since these two bars are not on the same scale as the main barplot, a second y-axis is plotted on the right (if axes=TRUE). Each of the two bars is again split into two parts according to missingness/number of imputed missings in the additional variables. Note that this display does not make sense if only one variable is supplied, therefore only.miss is ignored in that case.
If interactive=TRUE, clicking in the left margin of the plot results in switching to the previous variable and clicking in the right margin results in switching to the next variable. Clicking anywhere else on the graphics device quits the interactive session. When switching to a categorical variable, a barplot is produced rather than a histogram.
a list with the following components:
breaks the breakpoints.
counts the number of observations in each cell.
missings the number of highlighted observations in each cell.
mids the cell midpoints.
Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For backward compatibility, the arguments axisnames and names.miss can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use miss.labels instead.
Andreas Alfons, Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions:
aggr(), barMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(tao, package = "VIM")
## for missing values
x <- tao[, c("Air.Temp", "Humidity")]
histMiss(x)
histMiss(x, only.miss = FALSE)
## for imputed values
x_IMPUTED <- kNN(tao[, c("Air.Temp", "Humidity")])
histMiss(x_IMPUTED, delimiter = "_imp")
histMiss(x_IMPUTED, delimiter = "_imp", only.miss = FALSE)
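The split of each bin into a highlighted and a non-highlighted part can be reproduced numerically in base R. The sketch below uses made-up data and fixed breakpoints chosen for illustration; it only mimics how the counts and missings components of the returned list relate to each other and does not draw anything.

```r
# Illustrative data: `x` is the variable of interest, `other` an additional
# variable whose missing values determine the highlighting.
x     <- c(1.2, 2.5, 2.7, 3.1, 4.8, 4.9)
other <- c(NA,  5,   NA,  7,   8,   NA)

breaks <- 1:5                               # assumed breakpoints
bins   <- cut(x, breaks, right = TRUE)      # right-closed intervals

counts      <- table(bins)                  # observations per bin
highlighted <- table(bins[is.na(other)])    # of these, `other` is missing
rbind(counts, highlighted)
```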
Implementation of the popular Sequential, Random (within a domain) hot-deck algorithm for imputation.
hotdeck(
  data,
  variable = NULL,
  ord_var = NULL,
  domain_var = NULL,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  imp_var = TRUE,
  imp_suffix = "imp"
)
data |
data.frame or matrix |
variable |
variables where missing values should be imputed (not overlapping with ord_var) |
ord_var |
variables for sorting the data set before imputation (not overlapping with variable) |
domain_var |
variables for building domains and impute within these domains |
makeNA |
list of length equal to the number of variables, with values, that should be converted to NA for each variable |
NAcond |
list of length equal to the number of variables, with a condition for imputing a NA |
impNA |
TRUE/FALSE whether NA should be imputed |
donorcond |
list of length equal to the number of variables, with a donor condition given as a character string, e.g. ">5" or c(">5","<10"). If the list element for a variable is NULL, no condition will be applied for this variable. |
imp_var |
TRUE/FALSE; whether a TRUE/FALSE variable should be created for each imputed variable to show the imputation status |
imp_suffix |
suffix for the TRUE/FALSE variables showing the imputation status |
the imputed data set.
If the sequential hotdeck does not lead to a suitable donor, a random donor in the group will be used.
Alexander Kowarik
A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.
Other imputation methods:
impPCA(), irmi(), kNN(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), sampleCat(), xgboostImpute()
data(sleep)
sleepI <- hotdeck(sleep)
sleepI2 <- hotdeck(sleep,ord_var="BodyWgt",domain_var="Pred")

# Usage of donorcond in a simple example
sleepI3 <- hotdeck(
  sleep,
  variable = c("NonD", "Dream", "Sleep", "Span", "Gest"),
  ord_var = "BodyWgt", domain_var = "Pred",
  donorcond = list(">4", "<17", ">1.5", "%between%c(8,13)", ">5")
)

set.seed(132)
nRows <- 1e3
# Generate a data set with nRows rows and several variables
x <- data.frame(
  x = rnorm(nRows), y = rnorm(nRows),
  z = sample(LETTERS, nRows, replace = TRUE),
  d1 = sample(LETTERS[1:3], nRows, replace = TRUE),
  d2 = sample(LETTERS[1:2], nRows, replace = TRUE),
  o1 = rnorm(nRows), o2 = rnorm(nRows), o3 = rnorm(100)
)
origX <- x
x[sample(1:nRows,nRows/10), 1] <- NA
x[sample(1:nRows,nRows/10), 2] <- NA
x[sample(1:nRows,nRows/10), 3] <- NA
x[sample(1:nRows,nRows/10), 4] <- NA
xImp <- hotdeck(x,ord_var = c("o1", "o2", "o3"), domain_var = "d2")
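To make the idea of a sequential hot-deck concrete, here is a deliberately simplified base R sketch: the data are sorted by an ordering variable and each missing value takes the value of the preceding record in that order. The helper `seq_hotdeck()` is hypothetical. It handles a single vector, has no domains or donor conditions, and leaves a leading missing value untouched instead of drawing a random donor, so it should not be read as the algorithm implemented in hotdeck().

```r
# Hypothetical helper (not part of VIM): simplest form of a sequential hot-deck
seq_hotdeck <- function(x, ord) {
  o  <- order(ord)       # positions of the records in sorted order
  xs <- x[o]
  for (i in seq_along(xs)) {
    # the donor is the preceding record in the sorted order
    if (is.na(xs[i]) && i > 1) xs[i] <- xs[i - 1]
  }
  x[o] <- xs             # write the values back in the original order
  x
}

x   <- c(10, NA, 30, NA)
ord <- c(1, 2, 3, 4)
seq_hotdeck(x, ord)
```

Applying such a function separately within each group of a domain variable (for example via `split()`) would correspond to imputing within domains.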
Greedy algorithm for EM-PCA including robust methods
impPCA(
  x,
  method = "classical",
  m = 1,
  eps = 0.5,
  k = ncol(x) - 1,
  maxit = 100,
  boot = FALSE,
  verbose = TRUE
)
x |
data.frame or matrix |
method |
|
m |
number of multiple imputations (only if parameter |
eps |
threshold for convergence |
k |
number of principal components for reconstruction of |
maxit |
maximum number of iterations |
boot |
residual bootstrap (if |
verbose |
TRUE/FALSE if additional information about the imputation process should be printed |
the imputed data set. If boot = FALSE this is a data.frame. If boot = TRUE this is a list where each list element contains a data.frame.
Matthias Templ
Serneels, Sven and Verdonck, Tim (2008). Principal component analysis for data containing outliers and missing elements. Computational Statistics and Data Analysis, Elsevier, vol. 52(3), pages 1712-1727
Other imputation methods:
hotdeck(), irmi(), kNN(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), sampleCat(), xgboostImpute()
data(Animals, package = "MASS")
Animals$brain[19] <- Animals$brain[19] + 0.01
Animals <- log(Animals)
colnames(Animals) <- c("log(body)", "log(brain)")
Animals_na <- Animals
probs <- abs(Animals$`log(body)`^2)
probs <- rep(0.5, nrow(Animals))
probs[c(6,16,26)] <- 0
set.seed(1234)
Animals_na[sample(1:nrow(Animals), 10, prob = probs), "log(brain)"] <- NA
w <- is.na(Animals_na$`log(brain)`)
impPCA(Animals_na)
impPCA(Animals_na, method = "mcd")
impPCA(Animals_na, boot = TRUE, m = 10)
impPCA(Animals_na, method = "mcd", boot = TRUE)[[1]]
plot(`log(brain)` ~ `log(body)`, data = Animals, type = "n", ylab = "", xlab="")
mtext(text = "impPCA robust", side = 3)
points(Animals$`log(body)`[!w], Animals$`log(brain)`[!w])
points(Animals$`log(body)`[w], Animals$`log(brain)`[w], col = "grey", pch = 17)
imputed <- impPCA(Animals_na, method = "mcd", boot = TRUE)[[1]]
colnames(imputed) <- c("log(body)", "log(brain)")
points(imputed$`log(body)`[w], imputed$`log(brain)`[w], col = "red", pch = 20, cex = 1.4)
segments(x0 = Animals$`log(body)`[w], x1 = imputed$`log(body)`[w],
  y0 = Animals$`log(brain)`[w], y1 = imputed$`log(brain)`[w], lty = 2, col = "grey")
legend("topleft", legend = c("non-missings", "set to missing", "imputed values"),
  pch = c(1,17,20), col = c("black","grey","red"), cex = 0.7)
mape <- round(100* 1/sum(is.na(Animals_na$`log(brain)`)) *
  sum(abs((Animals$`log(brain)` - imputed$`log(brain)`) / Animals$`log(brain)`)), 2)
s2 <- var(Animals$`log(brain)`)
nrmse <- round(sqrt(1/sum(is.na(Animals_na$`log(brain)`)) *
  sum(abs((Animals$`log(brain)` - imputed$`log(brain)`) / s2))), 2)
text(x = 8, y = 1.5, labels = paste("MAPE =", mape))
text(x = 8, y = 0.5, labels = paste("NRMSE =", nrmse))
Rough estimation of missing values in a vector according to its type.
initialise(x, mixed, method = "kNN", mixed.constant = NULL)
x |
a vector. |
mixed |
a character vector containing the names of variables of type mixed (semi-continuous). |
method |
Method used for Initialization (median or kNN) |
mixed.constant |
vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability |
Missing values are imputed with the mean for vectors of class "numeric", with the median for vectors of class "integer", and with the mode for vectors of class "factor". Hence, x should be prepared in the following way: assign class "numeric" to numeric vectors, assign class "integer" to ordinal vectors, and assign class "factor" to nominal or binary vectors.
the initialized vector.
The function is used internally by some imputation algorithms.
Matthias Templ, modifications by Andreas Alfons
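The class-dependent rule described above (mean for "numeric", median for "integer", mode for "factor") can be sketched in base R as follows. The helper `init_sketch()` is hypothetical: it ignores the `mixed`, `method` and `mixed.constant` arguments, and rounding the median to an integer is an assumption made here to keep the vector of class "integer".

```r
# Hypothetical helper (not part of VIM): rough initialisation by class
init_sketch <- function(x) {
  miss <- is.na(x)
  if (is.factor(x)) {
    tab <- table(x)                         # NA is not counted by default
    x[miss] <- names(tab)[which.max(tab)]   # mode (first one in case of ties)
  } else if (is.integer(x)) {               # checked before is.numeric()
    x[miss] <- as.integer(round(median(x, na.rm = TRUE)))
  } else if (is.numeric(x)) {
    x[miss] <- mean(x, na.rm = TRUE)
  }
  x
}

init_sketch(c(1, NA, 3))                    # mean of the observed values
init_sketch(c(1L, NA, 5L, 7L))              # median of the observed values
init_sketch(factor(c("a", "b", "b", NA)))   # most frequent level
```

The integer branch has to come first because `is.numeric()` is also TRUE for integer vectors.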
In each step of the iteration, one variable is used as a response variable and the remaining variables serve as the regressors.
irmi(
  x,
  eps = 5,
  maxit = 100,
  mixed = NULL,
  mixed.constant = NULL,
  count = NULL,
  step = FALSE,
  robust = FALSE,
  takeAll = TRUE,
  noise = TRUE,
  noise.factor = 1,
  force = FALSE,
  robMethod = "lmrob",
  force.mixed = TRUE,
  mi = 1,
  addMixedFactors = FALSE,
  trace = FALSE,
  init.method = "kNN",
  modelFormulas = NULL,
  multinom.method = "multinom",
  imp_var = TRUE,
  imp_suffix = "imp"
)
x |
data.frame or matrix |
eps |
threshold for convergence |
maxit |
maximum number of iterations |
mixed |
column index of the semi-continuous variables |
mixed.constant |
vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability |
count |
column index of count variables |
step |
a stepwise model selection is applied when the parameter is set to TRUE |
robust |
if TRUE, robust regression methods will be applied |
takeAll |
takes information of (initialised) missings in the response as well for regression imputation. |
noise |
irmi has the option to add a random error term to the imputed values, this creates the possibility for multiple imputation. The error term has mean 0 and variance corresponding to the variance of the regression residuals. |
noise.factor |
amount of noise. |
force |
if TRUE, the algorithm tries to find a solution in any case, possibly by using different robust methods automatically. |
robMethod |
regression method when the response is continuous. Default is
MM-regression with |
force.mixed |
if TRUE, the algorithm tries to find a solution in any case, possibly by using different robust methods automatically. |
mi |
number of multiple imputations. |
addMixedFactors |
if TRUE add additional factor variable for each mixed variable as X variable in the regression |
trace |
Additional information about the iterations when trace equals TRUE. |
init.method |
Method for initialization of missing values (kNN or median) |
modelFormulas |
a named list with the names of the variables for the rhs of the formulas, which must contain a rhs formula for each variable with missing values; it should look like list(y1=c("x1","x2"),y2=c("x1","x3")) if factor variables for the mixed variables should be created for the regression models |
multinom.method |
Method for estimating the multinomial models (current default and only available method is multinom) |
imp_var |
TRUE/FALSE; whether a TRUE/FALSE variable should be created for each imputed variable to show the imputation status |
imp_suffix |
suffix for the TRUE/FALSE variables showing the imputation status |
The method works sequentially and iteratively. It can deal with a mixture of continuous, semi-continuous, ordinal and nominal variables including outliers.
A full description of the method can be found in the mentioned reference.
the imputed data set.
Matthias Templ, Alexander Kowarik
M. Templ, A. Kowarik, P. Filzmoser (2011) Iterative stepwise regression imputation using standard and robust methods. Journal of Computational Statistics and Data Analysis, Vol. 55, pp. 2793-2806.
A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.
Other imputation methods:
hotdeck(), impPCA(), kNN(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), sampleCat(), xgboostImpute()
data(sleep)
irmi(sleep)

data(testdata)
imp_testdata1 <- irmi(testdata$wna, mixed = testdata$mixed)

# mixed.constant != 0 (-10)
testdata$wna$m1[testdata$wna$m1 == 0] <- -10
testdata$wna$m2 <- log(testdata$wna$m2 + 0.001)
imp_testdata2 <- irmi(
  testdata$wna,
  mixed = testdata$mixed,
  mixed.constant = c(-10,log(0.001))
)
imp_testdata2$m2 <- exp(imp_testdata2$m2) - 0.001

#example with fixed formulas for the variables with missing
form = list(
  NonD = c("BodyWgt", "BrainWgt"),
  Dream = c("BodyWgt", "BrainWgt"),
  Sleep = c("BrainWgt" ),
  Span = c("BodyWgt" ),
  Gest = c("BodyWgt", "BrainWgt")
)
irmi(sleep, modelFormulas = form, trace = TRUE)

# Example with ordered variable
td <- testdata$wna
td$c1 <- as.ordered(td$c1)
irmi(td)
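The iteration scheme, in which each variable in turn serves as the response and the remaining variables as regressors, can be illustrated with a stripped-down base R sketch. The function `chained_lm()` is hypothetical and far simpler than irmi: it assumes numeric columns with syntactic names, initialises with column means, uses ordinary least squares only, adds no noise, and stops when the imputed values barely change.

```r
# Hypothetical sketch (not the irmi implementation): iterative regression
# imputation for a data frame of numeric columns.
chained_lm <- function(df, maxit = 20, eps = 1e-6) {
  miss <- is.na(df)
  targets <- which(colSums(miss) > 0)       # columns that need imputation
  for (j in targets) {                      # crude start: column means
    df[miss[, j], j] <- mean(df[[j]], na.rm = TRUE)
  }
  for (it in seq_len(maxit)) {
    old <- df
    for (j in targets) {
      # column j is the response, all other columns are regressors
      f   <- reformulate(names(df)[-j], response = names(df)[j])
      fit <- lm(f, data = df[!miss[, j], , drop = FALSE])
      df[miss[, j], j] <- predict(fit, newdata = df[miss[, j], , drop = FALSE])
    }
    change <- sum((as.matrix(df[targets]) - as.matrix(old[targets]))^2)
    if (change < eps) break                 # imputations have stabilised
  }
  df
}

toy <- data.frame(x = 1:6, y = 2 * (1:6))
toy$y[3] <- NA
out <- chained_lm(toy)
out$y[3]                                    # recovered from the linear relation
```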
k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continuous variables.
kNN(
  data,
  variable = colnames(data),
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE,
  methodStand = "range",
  ordFun = medianSamp
)
data |
data.frame or matrix |
variable |
variables where missing values should be imputed |
k |
number of Nearest Neighbours used |
dist_var |
names or variables to be used for distance calculation |
weights |
weights for the variables for distance calculation.
If |
numFun |
function for aggregating the k Nearest Neighbours in the case of a numerical variable |
catFun |
function for aggregating the k Nearest Neighbours in the case of a categorical variable |
makeNA |
list of length equal to the number of variables, with values, that should be converted to NA for each variable |
NAcond |
list of length equal to the number of variables, with a condition for imputing a NA |
impNA |
TRUE/FALSE whether NA should be imputed |
donorcond |
list of length equal to the number of variables, with a donor condition given as a character string, e.g. a list element can be ">5" or c(">5","<10"). If the list element for a variable is NULL, no condition will be applied for this variable. |
mixed |
names of mixed variables |
mixed.constant |
vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability |
trace |
TRUE/FALSE if additional information about the imputation process should be printed |
imp_var |
TRUE/FALSE if a TRUE/FALSE variable should be created for each imputed variable to show the imputation status |
imp_suffix |
suffix for the TRUE/FALSE variables showing the imputation status |
addRF |
TRUE/FALSE each variable will be modelled using random forest regression ( |
onlyRF |
TRUE/FALSE if TRUE only additional distance variables created from random forest regression will be used as distance variables. |
addRandom |
TRUE/FALSE if an additional random variable should be added for distance calculation |
useImputedDist |
TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables. |
weightDist |
TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step |
methodStand |
either "range" or "iqr" to be used in the standardization of numeric variables in the Gower distance |
ordFun |
function for aggregating the k Nearest Neighbours in the case of an ordered factor variable |
the imputed data set.
Alexander Kowarik, Statistik Austria
A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.
Other imputation methods: hotdeck(), impPCA(), irmi(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), sampleCat(), xgboostImpute()
data(sleep)
kNN(sleep)
library(laeken)
kNN(sleep, numFun = weightedMean, weightDist = TRUE)
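The distance-and-aggregate idea behind kNN can be sketched in base R. This is a simplified illustration, not the package implementation: the helpers gower_dist and knn_impute_one are hypothetical, only numeric (range-standardized) and categorical (0/1 mismatch) variables are handled, and missing values in the distance variables are simply skipped when averaging.

```r
# Simplified Gower-type distance between row i and every row of `data`.
# Numeric columns: absolute difference divided by the column range.
# Other columns: 0 if the values are equal, 1 otherwise.
# Hypothetical helper, not part of VIM.
gower_dist <- function(data, i, dist_var) {
  parts <- sapply(dist_var, function(v) {
    x <- data[[v]]
    if (is.numeric(x)) {
      rng <- diff(range(x, na.rm = TRUE))
      if (rng == 0) rep(0, length(x)) else abs(x - x[i]) / rng
    } else {
      as.numeric(as.character(x) != as.character(x[i]))
    }
  })
  # Average over the variables that are observed for both rows.
  rowMeans(parts, na.rm = TRUE)
}

# Impute one variable: for each missing value, aggregate the values of
# the k closest rows that have the variable observed (hypothetical helper).
knn_impute_one <- function(data, variable, dist_var, k = 5, numFun = median) {
  out <- data
  for (i in which(is.na(data[[variable]]))) {
    d <- gower_dist(data, i, dist_var)
    d[is.na(data[[variable]])] <- Inf   # donors must have an observed value
    donors <- order(d)[seq_len(k)]
    out[[variable]][i] <- numFun(data[[variable]][donors])
  }
  out
}

# Made-up toy data: row 3 is closest to rows 1 and 2.
toy <- data.frame(
  x = c(1, 2, 3, 10, 11),
  g = factor(c("a", "a", "a", "b", "b")),
  y = c(5, 6, NA, 50, 52)
)
imp <- knn_impute_one(toy, "y", dist_var = c("x", "g"), k = 2)
imp$y[3]
```

With k = 2 the donors for row 3 are rows 2 and 1, so the imputed value is the median of 6 and 5. Passing a different numFun changes only the aggregation step, which mirrors the role of the numFun argument described above.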
Coordinates of the Kola background map.
Kola Project (1993-1998)
Reimann, C., Filzmoser, P., Garrett, R.G. and Dutter, R. (2008) Statistical Data Analysis Explained: Applied Environmental Statistics with R. Wiley, 2008.
data(kola.background, package = "VIM")
bgmap(kola.background)
Map of observed and missing/imputed values.
mapMiss( x, coords, map, delimiter = NULL, selection = c("any", "all"), col = c("skyblue", "red", "orange"), alpha = NULL, pch = c(19, 15), col.map = grey(0.5), legend = TRUE, interactive = TRUE, ... )
x |
a vector, matrix or |
coords |
a |
map |
a background map to be passed to |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
selection |
the selection method for displaying missing/imputed values
in the map. Possible values are |
col |
a vector of length three giving the colors to be used for observed, missing and imputed values. If a single color is supplied, it is used for all values. |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
pch |
a vector of length two giving the plot characters to be used for observed and missing/imputed values. If a single plot character is supplied, it will be used for both. |
col.map |
the color to be used for the background map. |
legend |
a logical indicating whether a legend should be plotted. |
interactive |
a logical indicating whether information about selected observations can be displayed interactively (see ‘Details’). |
... |
further graphical parameters to be passed to
|
If interactive=TRUE, detailed information for an observation can be printed on the console by clicking on the corresponding point. Clicking in a region that does not contain any points quits the interactive session.
Matthias Templ, Andreas Alfons, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
bgmap(), bubbleMiss(), colormapMiss()
data(chorizonDL, package = "VIM")
data(kola.background, package = "VIM")
coo <- chorizonDL[, c("XCOO", "YCOO")]

## for missing values
x <- chorizonDL[, c("As", "Bi")]
mapMiss(x, coo, kola.background)

## for imputed values
x_imp <- kNN(chorizonDL[, c("As", "Bi")])
mapMiss(x_imp, coo, kola.background, delimiter = "_imp")
Create a scatterplot matrix with information about missing/imputed values in the plot margins of each panel.
marginmatrix( x, delimiter = NULL, col = c("skyblue", "red", "red4", "orange", "orange4"), alpha = NULL, ... )
x |
a matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
col |
a vector of length five giving the colors to be used in the marginplots in the off-diagonal panels. The first color is used for the scatterplot and the boxplots for the available data, the second/fourth color for the univariate scatterplots and boxplots for the missing/imputed values in one variable, and the third/fifth color for the frequency of missing/imputed values in both variables (see ‘Details’). If only one color is supplied, it is used for the bivariate and univariate scatterplots and the boxplots for missing/imputed values in one variable, whereas the boxplots for the available data are transparent. Else if two colors are supplied, the second one is recycled. |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
... |
further arguments and graphical parameters to be passed to
|
marginmatrix uses pairsVIM() with a panel function based on marginplot().
The graphical parameter oma will be set unless supplied as an argument.
Andreas Alfons, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
marginplot(), pairsVIM(), scattmatrixMiss()
Other plotting functions: aggr(), barMiss(), histMiss(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(sleep, package = "VIM")

## for missing values
x <- sleep[, 1:5]
x[, c(1, 2, 4)] <- log10(x[, c(1, 2, 4)])
marginmatrix(x)

## for imputed values
x_imp <- kNN(sleep[, 1:5])
x_imp[, c(1, 2, 4)] <- log10(x_imp[, c(1, 2, 4)])
marginmatrix(x_imp, delimiter = "_imp")
In addition to a standard scatterplot, information about missing/imputed values is shown in the plot margins. Furthermore, imputed values are highlighted in the scatterplot.
marginplot( x, delimiter = NULL, col = c("skyblue", "red", "red4", "orange", "orange4"), alpha = NULL, pch = c(1, 16), cex = par("cex"), numbers = TRUE, cex.numbers = par("cex"), zeros = FALSE, xlim = NULL, ylim = NULL, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ann = par("ann"), axes = TRUE, frame.plot = axes, ... )
x |
a |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
col |
a vector of length five giving the colors to be used in the plot. The first color is used for the scatterplot and the boxplots for the available data. In case of missing values, the second color is taken for the univariate scatterplots and boxplots for missing values in one variable and the third for the frequency of missing/imputed values in both variables (see ‘Details’). Otherwise, in case of imputed values, the fourth color is used for the highlighting, the frequency, the univariate scatterplot and the boxplots of imputed values in the first variable and the fifth color for the same applied to the second variable. A black color is used for the highlighting and the frequency of imputed values in both variables instead. If only one color is supplied, it is used for the bivariate and univariate scatterplots and the boxplots for missing/imputed values in one variable, whereas the boxplots for the available data are transparent. Else if two colors are supplied, the second one is recycled. |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
pch |
a vector of length two giving the plot symbols to be used for the scatterplot and the univariate scatterplots. If a single plot character is supplied, it is used for the scatterplot and the default value will be used for the univariate scatterplots (see ‘Details’). |
cex |
the character expansion factor to be used for the bivariate and univariate scatterplots. |
numbers |
a logical indicating whether the frequencies of missing/imputed values should be displayed in the lower left of the plot (see ‘Details’). |
cex.numbers |
the character expansion factor to be used for the frequencies of the missing/imputed values. |
zeros |
a logical vector of length two indicating whether the variables
are semi-continuous, i.e., contain a considerable amount of zeros. If
|
xlim , ylim
|
axis limits. |
main , sub
|
main and sub title. |
xlab , ylab
|
axis labels. |
ann |
a logical indicating whether plot annotation ( |
axes |
a logical indicating whether both axes should be drawn on the
plot. Use graphical parameter |
frame.plot |
a logical indicating whether a box should be drawn around the plot. |
... |
further graphical parameters to be passed down (see
|
Boxplots for available and missing/imputed data, as well as univariate scatterplots for missing/imputed values in one variable are shown in the plot margins.
Imputed values in either of the variables are highlighted in the scatterplot.
Furthermore, the frequencies of the missing/imputed values can be displayed by a number (lower left of the plot). The number in the lower left corner is the number of observations that are missing/imputed in both variables.
Some of the argument names and positions have changed with versions 1.3 and 1.4 due to extended functionality and for more consistency with other plot functions in VIM. For backward compatibility, the argument cex.text can still be supplied to ... and is handled correctly. Nevertheless, it is deprecated and no longer documented. Use cex.numbers instead.
Andreas Alfons, Matthias Templ, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(tao, package = "VIM")
data(chorizonDL, package = "VIM")

## for missing values
marginplot(tao[, c("Air.Temp", "Humidity")])
marginplot(log10(chorizonDL[, c("CaO", "Bi")]))

## for imputed values
marginplot(kNN(tao[, c("Air.Temp", "Humidity")]), delimiter = "_imp")
marginplot(kNN(log10(chorizonDL[, c("CaO", "Bi")])), delimiter = "_imp")
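The frequencies that a margin plot reports can be cross-checked by hand. The following base R sketch uses made-up data and illustrative object names; it computes the number of observations missing in both variables (the count the Details place in the lower left corner), the per-variable counts, and the values that would appear in a univariate margin scatterplot.

```r
# Toy data with missing values in both variables (made-up values).
xy <- data.frame(
  a = c(1.2, NA, 3.1, NA, 5.0, 6.3),
  b = c(NA, NA, 2.2, 4.1, 5.5, NA)
)
miss <- is.na(xy)

n_miss_a    <- sum(miss[, "a"])                 # missing in the first variable
n_miss_b    <- sum(miss[, "b"])                 # missing in the second variable
n_miss_both <- sum(miss[, "a"] & miss[, "b"])   # missing in both variables

# Values of b for observations where a is missing but b is observed:
# these correspond to the points of a univariate scatterplot in the margin.
b_when_a_missing <- xy$b[miss[, "a"] & !miss[, "b"]]

c(a = n_miss_a, b = n_miss_b, both = n_miss_both)
```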
Suitable donors are searched based on matching of the categorical variables. The variables are dropped in reversed order, so that the last element of 'match_var' is dropped first and the first element of the vector is dropped last.
matchImpute( data, variable = colnames(data)[!colnames(data) %in% match_var], match_var, imp_var = TRUE, imp_suffix = "imp" )
data |
data.frame, data.table or matrix |
variable |
variables to be imputed |
match_var |
variables used for matching |
imp_var |
TRUE/FALSE if a TRUE/FALSE variable should be created for each imputed variable to show the imputation status |
imp_suffix |
suffix for the TRUE/FALSE variables showing the imputation status |
The method works by sampling values from the suitable donors.
the imputed data set.
Johannes Gussenbauer, Alexander Kowarik
Other imputation methods: hotdeck(), impPCA(), irmi(), kNN(), medianSamp(), rangerImpute(), regressionImp(), sampleCat(), xgboostImpute()
data(sleep, package = "VIM")
imp_data <- matchImpute(
  sleep,
  variable = c("NonD", "Dream", "Sleep", "Span", "Gest"),
  match_var = c("Exp", "Danger")
)

data(testdata, package = "VIM")
imp_testdata1 <- matchImpute(testdata$wna, match_var = c("c1", "c2", "b1", "b2"))

dt <- data.table::data.table(testdata$wna)
imp_testdata2 <- matchImpute(dt, match_var = c("c1", "c2", "b1", "b2"))
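The donor search described for matchImpute (drop matching variables starting from the end of match_var until a donor is found, then sample from the donors) can be sketched in base R. This is an illustration under simplifying assumptions, not the package code: match_impute_sketch is a hypothetical helper that handles a single variable in a plain data.frame.

```r
# Hypothetical helper, not part of VIM.
match_impute_sketch <- function(data, variable, match_var) {
  out <- data
  for (i in which(is.na(data[[variable]]))) {
    vars <- match_var
    repeat {
      # Rows that agree with row i on all remaining matching variables ...
      same <- rep(TRUE, nrow(data))
      for (v in vars) same <- same & data[[v]] == data[[v]][i]
      # ... and that have an observed value to donate.
      donors <- which(same & !is.na(data[[variable]]))
      if (length(donors) > 0 || length(vars) == 0) break
      vars <- vars[-length(vars)]   # drop the last matching variable first
    }
    if (length(donors) > 0) {
      # Index into `donors` to avoid sample()'s special case for length-1 input.
      pick <- donors[sample.int(length(donors), 1)]
      out[[variable]][i] <- data[[variable]][pick]
    }
  }
  out
}

# Made-up toy data.
toy <- data.frame(
  region = c("N", "N", "S", "S", "S"),
  type   = c("a", "b", "a", "a", "b"),
  value  = c(10, NA, 30, 31, NA)
)
set.seed(1)
imp <- match_impute_sketch(toy, "value", match_var = c("region", "type"))
imp
```

In the toy data no row shares both region and type with rows 2 and 5, so type is dropped and donors are found within the same region: row 2 receives 10, row 5 receives either 30 or 31.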
Create a matrix plot, in which all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme, while missing/imputed data is visualized by a clearly distinguishable color.
matrixplot( x, delimiter = NULL, sortby = NULL, col = c("red", "orange"), fixup = TRUE, xlim = NULL, ylim = NULL, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, axes = TRUE, labels = axes, xpd = NULL, interactive = TRUE, ... )
x |
a matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
sortby |
a numeric or character value specifying the variable to sort
the data matrix by, or |
col |
the colors to be used in the plot. RGB colors may be specified
as character strings or as objects of class " |
fixup |
a logical indicating whether the colors should be corrected to
valid RGB values (see |
xlim , ylim
|
axis limits. |
main , sub
|
main and sub title. |
xlab , ylab
|
axis labels. |
axes |
a logical indicating whether axes should be drawn on the plot. |
labels |
either a logical indicating whether labels should be plotted below each column, or a character vector giving the labels. |
xpd |
a logical indicating whether the rectangles should be allowed to
go outside the plot region. If |
interactive |
a logical indicating whether a variable to be used for sorting can be selected interactively (see ‘Details’). |
... |
for |
In a matrix plot, all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme. To compute the colors via interpolation, the variables are first scaled to the interval between 0 and 1. Missing/imputed values can then be visualized by a clearly distinguishable color. It is thereby possible to use colors in the HCL or RGB color space. A simple way of visualizing the magnitude of the available data is to apply a greyscale, which has the advantage that missing/imputed values can easily be distinguished by using a color such as red/orange. Note that -Inf and Inf are always assigned the begin and end color, respectively, of the continuous color scheme.
Additionally, the observations can be sorted by the magnitude of a selected variable. If interactive is TRUE, clicking in a column redraws the plot with observations sorted by the corresponding variable. Clicking anywhere outside the plot region quits the interactive session.
This is a much more powerful extension to the function imagmiss in the former CRAN package dprep.
iimagMiss is deprecated and may be omitted in future versions of VIM. Use matrixplot instead.
Andreas Alfons, Matthias Templ, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(sleep, package = "VIM")

## for missing values
x <- sleep[, -(8:10)]
x[, c(1, 2, 4, 6, 7)] <- log10(x[, c(1, 2, 4, 6, 7)])
matrixplot(x, sortby = "BrainWgt")

## for imputed values
x_imp <- kNN(sleep[, -(8:10)])
x_imp[, c(1, 2, 4, 6, 7)] <- log10(x_imp[, c(1, 2, 4, 6, 7)])
matrixplot(x_imp, delimiter = "_imp", sortby = "BrainWgt")
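The color coding of a matrix plot (scale each variable to the interval between 0 and 1, map it onto a continuous scheme, and give missing cells a contrasting color) can be sketched in base R. This is a simplified illustration, not the package source: scale01 and cell_colors are hypothetical helpers, the greyscale direction is an arbitrary choice, and at least one finite value per variable is assumed.

```r
# Scale a numeric vector to [0, 1] using its finite range;
# -Inf and Inf are mapped to the two ends of the scale.
scale01 <- function(x) {
  rng <- range(x[is.finite(x)])
  s <- if (diff(rng) == 0) x * 0 + 0.5 else (x - rng[1]) / diff(rng)
  s[which(x == -Inf)] <- 0
  s[which(x == Inf)]  <- 1
  s
}

# One color per cell: greyscale for available data, `na_col` for missing.
cell_colors <- function(x, na_col = "red") {
  s <- scale01(x)
  cols <- rep(na_col, length(x))
  ok <- !is.na(s)
  cols[ok] <- grey(1 - s[ok])   # small values light, large values dark
  cols
}

x <- c(2, 4, NA, 6, Inf)
scale01(x)
cell_colors(x)
```

The missing value keeps the contrasting color while Inf shares the end color of the scale with the largest finite value.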
The function maxCat chooses the level with the most occurrences, picking one at random if the maximum is not unique.
maxCat(x, weights = NULL)
x |
factor vector |
weights |
numeric vector providing weights for the observations in x |
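The behaviour described for maxCat can be mimicked in a few lines of base R. The function max_cat_sketch below is a hypothetical re-implementation for illustration only; in particular, summing the weights per level is an assumption about how weights are used.

```r
# Most frequent level, with ties broken at random.
# Weights, if given, are summed per level instead of counting observations.
max_cat_sketch <- function(x, weights = NULL) {
  x <- factor(x)
  if (is.null(weights)) weights <- rep(1, length(x))
  tot <- tapply(weights, x, sum)
  tot[is.na(tot)] <- 0                     # empty levels get weight 0
  best <- names(tot)[tot == max(tot)]
  best[sample.int(length(best), 1)]        # random choice among tied levels
}

max_cat_sketch(c("a", "b", "b", "c"))                          # "b"
max_cat_sketch(c("a", "b", "b", "c"), weights = c(5, 1, 1, 1)) # "a"
max_cat_sketch(c("a", "b"))                                    # "a" or "b"
```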
The function medianSamp chooses the level as the median or randomly between two levels.
medianSamp(x, weights = NULL)
x |
ordered factor vector |
weights |
numeric vector providing weights for the observations in x |
Other imputation methods: hotdeck(), impPCA(), irmi(), kNN(), matchImpute(), rangerImpute(), regressionImp(), sampleCat(), xgboostImpute()
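A median for an ordered factor can be sketched as follows. The helper median_samp_sketch is hypothetical and ignores weights; it assumes that "randomly between two levels" refers to the two middle values when the number of observations is even.

```r
# Median level of an ordered factor; for an even number of observations
# one of the two middle values is drawn at random.
median_samp_sketch <- function(x) {
  stopifnot(is.ordered(x))
  codes <- sort(as.integer(x))             # sort() drops missing values
  n <- length(codes)
  mid <- if (n %% 2 == 1) {
    codes[(n + 1) / 2]
  } else {
    cand <- codes[c(n / 2, n / 2 + 1)]
    cand[sample.int(2, 1)]
  }
  factor(levels(x)[mid], levels = levels(x), ordered = TRUE)
}

sizes <- factor(c("S", "M", "M", "L", "L"),
                levels = c("S", "M", "L"), ordered = TRUE)
median_samp_sketch(sizes)            # odd n: the middle value, "M"
median_samp_sketch(sizes[1:4])       # even n: both middle values are "M"
median_samp_sketch(sizes[c(1, 4)])   # even n: "S" or "L", chosen at random
```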
Create a mosaic plot with information about missing/imputed values.
mosaicMiss( x, delimiter = NULL, highlight = NULL, selection = c("any", "all"), plotvars = NULL, col = c("skyblue", "red", "orange"), labels = NULL, miss.labels = TRUE, ... )
x |
a matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
highlight |
a vector giving the variables to be used for highlighting.
If |
selection |
the selection method for highlighting missing/imputed
values in multiple highlight variables. Possible values are |
plotvars |
a vector giving the categorical variables to be plotted. If
|
col |
a vector of length three giving the colors to be used for observed, missing and imputed data. If only one color is supplied, the tiles corresponding to observed data are transparent and the supplied color is used for highlighting. |
labels |
a list of arguments for the labeling function
|
miss.labels |
either a logical indicating whether labels should be plotted for observed and missing/imputed (highlighted) data, or a character vector giving the labels. |
... |
additional arguments to be passed to |
Mosaic plots are graphical representations of multi-way contingency tables. The frequencies of the different cells are visualized by area-proportional rectangles (tiles). Additional tiles are used to display the frequencies of missing/imputed values. Furthermore, missing/imputed values in a certain variable or combination of variables can be highlighted in order to explore their structure.
An object of class "structable" is returned invisibly.
This function uses the highly flexible strucplot framework of package vcd.
Andreas Alfons, modifications by Bernd Prantner
Meyer, D., Zeileis, A. and Hornik, K. (2006) The strucplot framework: Visualizing multi-way contingency tables with vcd. Journal of Statistical Software, 17 (3), 1–48.
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(sleep, package = "VIM")

## for missing values
mosaicMiss(sleep, highlight = 4, plotvars = 8:10, miss.labels = FALSE)

## for imputed values
mosaicMiss(kNN(sleep), highlight = 4, plotvars = 8:10,
           delimiter = "_imp", miss.labels = FALSE)
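The table underlying such a display can be built by hand: cross the categorical variables with an indicator for missingness in the highlight variable. The sketch below uses made-up data and base R only (graphics::mosaicplot rather than the strucplot framework), so it illustrates the idea rather than reproducing the output of mosaicMiss.

```r
# Made-up toy data: two categorical variables and one variable with NAs.
toy <- data.frame(
  grp   = factor(c("a", "a", "b", "b", "b", "a")),
  level = factor(c("lo", "hi", "lo", "hi", "hi", "lo")),
  y     = c(1.5, NA, 2.0, NA, 3.1, 0.7)
)

# Indicator for missingness in the highlight variable.
status <- factor(ifelse(is.na(toy$y), "missing", "observed"),
                 levels = c("observed", "missing"))

# Three-way contingency table: categorical variables by missingness status.
tab <- table(grp = toy$grp, level = toy$level, y = status)
ftable(tab)

# Tiles are area-proportional to the cell frequencies.
mosaicplot(tab, color = c("skyblue", "red"), main = "Missingness in y")
```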
Create a scatterplot matrix.
pairsVIM( x, ..., delimiter = NULL, main = NULL, sub = NULL, panel = points, lower = panel, upper = panel, diagonal = NULL, labels = TRUE, pos.labels = NULL, cex.labels = NULL, font.labels = par("font"), layout = c("matrix", "graph"), gap = 1 )
x |
a matrix or |
... |
further arguments and graphical parameters to be passed down.
|
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
main , sub
|
main and sub title. |
panel |
a |
lower , upper
|
separate panel functions to be used below and above the diagonal, respectively. |
diagonal |
optional |
labels |
either a logical indicating whether labels should be plotted in the diagonal panels, or a character vector giving the labels. |
pos.labels |
the vertical position of the labels in the diagonal panels. |
cex.labels |
the character expansion factor to be used for the labels. |
font.labels |
the font to be used for the labels. |
layout |
a character string giving the layout of the scatterplot
matrix. Possible values are |
gap |
a numeric value giving the distance between the panels in margin lines. |
This function is the workhorse for marginmatrix() and scattmatrixMiss().
The graphical parameter oma will be set unless supplied as an argument.
A panel function should not attempt to start a new plot, since the coordinate system for each panel is set up by pairsVIM.
The code is based on graphics::pairs(). Starting with version 1.4, infinite values are no longer removed before passing the x and y vectors to the panel functions.
Andreas Alfons, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
marginmatrix(), scattmatrixMiss()
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(sleep, package = "VIM")
x <- sleep[, -(8:10)]
x[, c(1, 2, 4, 6, 7)] <- log10(x[, c(1, 2, 4, 6, 7)])
pairsVIM(x)
Parallel coordinate plot with adjustments for missing/imputed values. Missing values in the plotted variables may be represented by a point above the corresponding coordinate axis to prevent disconnected lines. In addition, observations with missing/imputed values in selected variables may be highlighted.
parcoordMiss( x, delimiter = NULL, highlight = NULL, selection = c("any", "all"), plotvars = NULL, plotNA = TRUE, col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"), alpha = NULL, lty = par("lty"), xlim = NULL, ylim = NULL, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, labels = TRUE, xpd = NULL, interactive = TRUE, ... )
x |
a matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
highlight |
a vector giving the variables to be used for highlighting.
If |
selection |
the selection method for highlighting missing/imputed
values in multiple highlight variables. Possible values are |
plotvars |
a vector giving the variables to be plotted. If |
plotNA |
a logical indicating whether missing values in the plot variables should be represented by a point above the corresponding coordinate axis to prevent disconnected lines. |
col |
if |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
lty |
if |
xlim , ylim
|
axis limits. |
main , sub
|
main and sub title. |
xlab , ylab
|
axis labels. |
labels |
either a logical indicating whether labels should be plotted below each coordinate axis, or a character vector giving the labels. |
xpd |
a logical indicating whether the lines should be allowed to go
outside the plot region. If |
interactive |
a logical indicating whether interactive features should be enabled (see ‘Details’). |
... |
for |
In parallel coordinate plots, the variables are represented by parallel axes. Each observation of the scaled data is shown as a line. Observations with missing/imputed values in selected variables may thereby be highlighted. However, plotting variables with missing values results in disconnected lines, making it impossible to trace the respective observations across the graph. As a remedy, missing values may be represented by a point above the corresponding coordinate axis, which is separated from the main plot by a small gap and a horizontal line, as determined by plotNA. Connected lines can then be drawn for all observations. Nevertheless, a caveat of this display is that it may draw attention away from the main relationships between the variables.
If interactive is TRUE, it is possible to switch between this display and the standard display without the separate level for missing values by clicking in the top margin of the plot. In addition, the variables to be used for highlighting can be selected interactively. Observations with missing/imputed values in any or in all of the selected variables are highlighted (as determined by selection). A variable can be added to the selection by clicking on a coordinate axis. If a variable is already selected, clicking on its coordinate axis removes it from the selection. Clicking anywhere outside the plot region (except the top margin, if missing/imputed values exist) quits the interactive session.
Some of the argument names and positions have changed with versions 1.3 and 1.4 due to extended functionality and for more consistency with other plot functions in VIM. For backward compatibility, the arguments colcomb and xaxlabels can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use highlight and labels instead.
Andreas Alfons, Matthias Templ, modifications by Bernd Prantner
Wegman, E. J. (1990) Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association 85 (411), 664–675.
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(chorizonDL, package = "VIM")

## for missing values
parcoordMiss(chorizonDL[, c(15, 101:110)], plotvars = 2:11, interactive = FALSE)
legend("top", col = c("skyblue", "red"), lwd = c(1, 1),
       legend = c("observed in Bi", "missing in Bi"))

## for imputed values
parcoordMiss(kNN(chorizonDL[, c(15, 101:110)]), delimiter = "_imp",
             plotvars = 2:11, interactive = FALSE)
legend("top", col = c("skyblue", "orange"), lwd = c(1, 1),
       legend = c("observed in Bi", "imputed in Bi"))
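The remedy for disconnected lines (placing missing values on a separate level above the axes) can be sketched in base R. The helper to_parcoord and the offset 1.15 are illustrative assumptions, not values taken from the package; the sketch also assumes that no variable is constant.

```r
# Scale each variable to [0, 1] and move missing values to a separate
# level above the axes so that every observation yields a connected line.
to_parcoord <- function(data, na_level = 1.15) {
  scaled <- sapply(data, function(x) {
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / diff(rng)
  })
  scaled[is.na(scaled)] <- na_level
  scaled
}

# Made-up toy data.
toy <- data.frame(a = c(1, 2, 3), b = c(10, NA, 30), c = c(5, 7, NA))
pc <- to_parcoord(toy)

# Highlight observations with a missing value in b.
highlight <- is.na(toy$b)
matplot(t(pc), type = "l", lty = 1,
        col = ifelse(highlight, "red", "skyblue"),
        xaxt = "n", xlab = "", ylab = "scaled value")
axis(1, at = seq_len(ncol(pc)), labels = colnames(pc))
abline(h = 1.075, lty = 2)   # separates the level used for missing values
```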
Boxplot of one variable of interest plus information about missing/imputed values in other variables.
pbox( x, delimiter = NULL, pos = 1, selection = c("none", "any", "all"), col = c("skyblue", "red", "red4", "orange", "orange4"), numbers = TRUE, cex.numbers = par("cex"), xlim = NULL, ylim = NULL, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, axes = TRUE, frame.plot = axes, labels = axes, interactive = TRUE, ... )
x |
a vector, matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
pos |
a numeric value giving the index of the variable of interest.
Additional variables in |
selection |
the selection method for grouping according to
missingness/number of imputed missings in multiple additional variables.
Possible values are |
col |
a vector of length five giving the colors to be used in the plot. The first color is used for the boxplots of the available data, the second/fourth are used for missing/imputed data, respectively, and the third/fifth color for the frequencies of missing/imputed values in both variables (see ‘Details’). If only one color is supplied, it is used for the boxplots for missing/imputed data, whereas the boxplots for the available data are transparent. Else if two colors are supplied, the second one is recycled. |
numbers |
a logical indicating whether the frequencies of missing/imputed values should be displayed (see ‘Details’). |
cex.numbers |
the character expansion factor to be used for the frequencies of the missing/imputed values. |
xlim , ylim
|
axis limits. |
main , sub
|
main and sub title. |
xlab , ylab
|
axis labels. |
axes |
a logical indicating whether axes should be drawn on the plot. |
frame.plot |
a logical indicating whether a box should be drawn around the plot. |
labels |
either a logical indicating whether labels should be plotted below each box, or a character vector giving the labels. |
interactive |
a logical indicating whether variables can be switched interactively (see ‘Details’). |
... |
for |
This plot consists of several boxplots. First, a standard boxplot of the variable of interest is produced. Second, boxplots grouped by observed and missing/imputed values according to `selection` are produced for the variable of interest.
Additionally, the frequencies of the missing/imputed values can be represented by numbers. If so, the first line corresponds to the observed values of the variable of interest and their distribution in the different groups, the second line to the missing/imputed values.
If `interactive=TRUE`, clicking in the left margin of the plot results in switching to the previous variable and clicking in the right margin results in switching to the next variable. Clicking anywhere else on the graphics device quits the interactive session.
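The grouping behind this plot can be sketched in base R. This is a minimal illustration on made-up data, not the implementation used by `pbox()`; the data frame `d` and its columns `y` and `z` are hypothetical.

```r
# Hypothetical data: 'y' is the variable of interest, 'z' has missing values.
d <- data.frame(
  y = c(1.2, 3.4, 2.2, 5.1, 4.8, 2.9, 3.3, 6.0),
  z = c(10, NA, 12, NA, 9, 11, NA, 13)
)

# Group the variable of interest by missingness in the other variable.
grp <- factor(ifelse(is.na(d$z), "missing in z", "observed in z"),
              levels = c("observed in z", "missing in z"))

# plot = FALSE returns the boxplot statistics without opening a device;
# drop it to draw the grouped boxplots.
bp <- boxplot(d$y ~ grp, plot = FALSE)

# Frequencies per group, comparable to the numbers shown with the boxes.
n_per_group <- table(grp)
```

Comparing the two groups side by side is what lets the plot hint at whether missingness in one variable depends on the values of another.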
A list as returned by `graphics::boxplot()`.
Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in `VIM`. For backward compatibility, the arguments `names` and `cex.text` can still be supplied to `...` and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use `labels` and `cex.numbers` instead.
Andreas Alfons, Matthias Templ, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), scattJitt(), scattMiss(), scattmatrixMiss(), spineMiss()
data(chorizonDL, package = "VIM")
## for missing values
pbox(log(chorizonDL[, c(4,5,8,10,11,16:17,19,25,29,37,38,40)]))
## for imputed values
pbox(kNN(log(chorizonDL[, c(4,8,10,11,17,19,25,29,37,38,40)])), delimiter = "_imp")
This function is used by the VIM GUI for transformation and standardization of the data.
prepare( x, scaling = c("none", "classical", "MCD", "robust", "onestep"), transformation = c("none", "minus", "reciprocal", "logarithm", "exponential", "boxcox", "clr", "ilr", "alr"), alpha = NULL, powers = NULL, start = 0, alrVar )
x |
a vector, matrix or |
scaling |
the scaling to be applied to the data. Possible values are
|
transformation |
the transformation of the data. Possible values are
|
alpha |
a numeric parameter controlling the size of the subset for the
MCD (if |
powers |
a numeric vector giving the powers to be used in the Box-Cox
transformation (if |
start |
a constant to be added prior to Box-Cox transformation (if
|
alrVar |
variable to be used as denominator in the additive logratio
transformation (if |
Transformation:
"none": no transformation is used.
"logarithm": compute the logarithm (to the base 10).
"boxcox": apply a Box-Cox transformation. Powers may be specified or calculated with the function car::powerTransform().
Standardization:
"none": no standardization is used.
"classical": apply a z-transformation on each variable by using function scale().
"robust": apply a robustified z-transformation by using median and MAD.
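As a rough illustration of the "logarithm" transformation followed by the "robust" standardization, the following base R sketch works on a made-up vector. It assumes that the MAD is computed with `stats::mad()` and its default consistency constant; whether `prepare()` uses exactly this scaling is not stated here.

```r
# Hypothetical positive-valued data with one missing value.
x <- c(0.5, 2, 10, 50, NA, 400)

# "logarithm": base-10 logarithm, as described above.
x_log <- log10(x)

# "robust": centre by the median and scale by the MAD.
# Assumption: stats::mad() with its default consistency constant (1.4826).
x_rob <- (x_log - median(x_log, na.rm = TRUE)) / mad(x_log, na.rm = TRUE)
```

After this step the observed values have median 0 and MAD 1, and the missing value stays missing.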
Transformed and standardized data.
Matthias Templ, modifications by Andreas Alfons
scale(), car::powerTransform()
data(sleep, package = "VIM")
x <- sleep[, c("BodyWgt", "BrainWgt")]
prepare(x, scaling = "robust", transformation = "logarithm")
Pulp quality by lignin content remaining
A data frame with 301 observations on the following 23 variables.
Pulp quality is measured by the lignin content remaining in the pulp: the Kappa number. This data set is used to understand which variables in the process influence the Kappa number, and if it can be predicted accurately enough for an inferential sensor application. Variables with a number at the end have been lagged by that number of hours to line up the data.
https://openmv.net/info/kamyr-digester
K. Walkush and R.R. Gustafson. "Application of feedforward neural networks and partial least squares regression for modelling Kappa number in a continuous Kamyr digester", Pulp and Paper Canada, 95, 1994, p T7-T13.
data(pulplignin)
str(pulplignin)
aggr(pulplignin)
Impute missing values based on a random forest model using ranger::ranger()
rangerImpute( formula, data, imp_var = TRUE, imp_suffix = "imp", ..., verbose = FALSE, median = FALSE )
formula |
model formula for the imputation |
data |
A |
imp_var |
|
imp_suffix |
suffix used for the TRUE/FALSE imputation indicator variables |
... |
Arguments passed to |
verbose |
Show the number of observations used for training
and evaluating the RF-Model. This parameter is also passed down to
|
median |
Use the median (rather than the arithmetic mean) to average the values of individual trees for a more robust estimate. |
the imputed data set.
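The effect of the `median` argument can be illustrated without fitting a forest. The sketch below uses a made-up matrix of per-tree predictions (`tree_pred` is hypothetical) and only contrasts the two ways of aggregating them; it does not call `ranger` and is not the package's internal code.

```r
# Hypothetical per-tree predictions: rows are observations to impute,
# columns are individual trees (one tree gives an outlying prediction).
tree_pred <- rbind(
  obs1 = c(2.0, 2.2, 1.9, 2.1, 9.0),
  obs2 = c(5.0, 5.1, 4.9, 5.2, 5.0)
)

# median = FALSE: average the trees with the arithmetic mean.
imp_mean <- rowMeans(tree_pred)

# median = TRUE: use the median, which is less affected by the outlying tree.
imp_median <- apply(tree_pred, 1, median)
```

For `obs1` the single outlying tree pulls the mean well above the bulk of the predictions, while the median stays close to them.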
Other imputation methods: hotdeck(), impPCA(), irmi(), kNN(), matchImpute(), medianSamp(), regressionImp(), sampleCat(), xgboostImpute()
data(sleep, package = "VIM")
rangerImpute(Dream + NonD ~ BodyWgt + BrainWgt, data = sleep)
Impute missing values based on a regression model.
regressionImp( formula, data, family = "AUTO", robust = FALSE, imp_var = TRUE, imp_suffix = "imp", mod_cat = FALSE )
formula |
model formula to impute one variable |
data |
A data.frame containing the data |
family |
family argument for |
robust |
|
imp_var |
|
imp_suffix |
suffix used for the TRUE/FALSE imputation indicator variables |
mod_cat |
|
lm() is used for family "normal" and glm() for all other families. With robust=TRUE, lmrob() and glmrob() are used instead.
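The basic idea of regression imputation for the "normal" case can be sketched with `lm()` alone. This is a simplified illustration on made-up data, not the code of `regressionImp()`; the data frame `d` is hypothetical, and the `y_imp` column only mimics the indicator variables created when `imp_var = TRUE`.

```r
# Hypothetical data: response 'y' has missing values, predictor 'x' is complete.
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, NA, 8.2, NA, 12.1)
)

# Fit the regression on the rows where the response is observed
# (lm() drops rows with NA in the response by default).
fit <- lm(y ~ x, data = d)

# Flag the missing entries, then replace them with model predictions.
d$y_imp <- is.na(d$y)
d$y[d$y_imp] <- predict(fit, newdata = d[d$y_imp, , drop = FALSE])
```

Observed values are left untouched; only the flagged rows receive fitted values.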
the imputed data set.
Alexander Kowarik
A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.
Other imputation methods: hotdeck(), impPCA(), irmi(), kNN(), matchImpute(), medianSamp(), rangerImpute(), sampleCat(), xgboostImpute()
data(sleep, package = "VIM")
sleepImp1 <- regressionImp(Dream+NonD~BodyWgt+BrainWgt,data=sleep)
sleepImp2 <- regressionImp(Sleep+Gest+Span+Dream+NonD~BodyWgt+BrainWgt,data=sleep)
data(testdata)
imp_testdata1 <- regressionImp(b1+b2~x1+x2,data=testdata$wna)
imp_testdata3 <- regressionImp(x1~x2,data=testdata$wna,robust=TRUE)
Add a rug representation of missing/imputed values in only one of the variables to scatterplots.
rugNA( x, y, ticksize = NULL, side = 1, col = "red", alpha = NULL, miss = NULL, lwd = 0.5, ... )
x , y
|
numeric vectors. |
ticksize |
the length of the ticks. Positive lengths give inward ticks. |
side |
an integer giving the side of the plot to draw the rug representation. |
col |
the color to be used for the ticks. |
alpha |
the alpha value (between 0 and 1). |
miss |
a |
lwd |
the line width to be used for the ticks. |
... |
further arguments to be passed to |
If `side` is 1 or 3, the rug representation consists of values available in `x` but missing/imputed in `y`. Else if `side` is 2 or 4, it consists of values available in `y` but missing/imputed in `x`.
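Which values end up in the rug for each side can be written down in a few lines of base R. The vectors below are made up, and the sketch only reproduces the selection rule described above for missing values, not the drawing code of `rugNA()`.

```r
# Hypothetical paired measurements with missing values in both variables.
x <- c(1.0, 2.0, NA, 4.0, 5.0)
y <- c(NA, 1.5, 2.5, NA, 3.5)

# side = 1 or 3: values observed in x but missing in y.
rug_x <- x[!is.na(x) & is.na(y)]

# side = 2 or 4: values observed in y but missing in x.
rug_y <- y[!is.na(y) & is.na(x)]

# With base graphics these could be drawn with rug(), for example:
# plot(x, y); rug(rug_x, side = 1, col = "red"); rug(rug_y, side = 2, col = "red")
```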
Andreas Alfons, modifications by Bernd Prantner
data(tao, package = "VIM")
## for missing values
x <- tao[, "Air.Temp"]
y <- tao[, "Humidity"]
plot(x, y)
rugNA(x, y, side = 1)
rugNA(x, y, side = 2)
## for imputed values
x_imp <- kNN(tao[, c("Air.Temp","Humidity")])
x <- x_imp[, "Air.Temp"]
y <- x_imp[, "Humidity"]
miss <- x_imp[, c("Air.Temp_imp","Humidity_imp")]
plot(x, y)
rugNA(x, y, side = 1, col = "orange", miss = miss)
rugNA(x, y, side = 2, col = "orange", miss = miss)
The function sampleCat samples with probabilities corresponding to the occurrence of the level in the nearest neighbors (NNs).
sampleCat(x, weights = NULL)
x |
factor vector |
weights |
numeric vector providing weights for the observations in x |
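Sampling a level with probability proportional to its (weighted) occurrence can be sketched as follows. The factor and weights are made up, and the code is a base R illustration of the idea rather than the body of `sampleCat()`.

```r
# Hypothetical levels observed among the nearest neighbours, with weights.
x <- factor(c("a", "b", "a", "c", "a"))
weights <- c(1, 2, 1, 1, 1)

# Sum the weights per level and normalise them to probabilities.
w <- tapply(weights, x, sum)
prob <- w / sum(w)

# Draw one level with these probabilities.
set.seed(1)
draw <- sample(names(prob), size = 1, prob = prob)
```

With equal weights the probabilities reduce to the relative frequencies of the levels.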
Other imputation methods: hotdeck(), impPCA(), irmi(), kNN(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), xgboostImpute()
Synthetic subset of the Austrian structural business statistics (SBS) data, namely NACE code 52.42 (retail sale of clothing).
The Austrian SBS data set consists of more than 320,000 enterprises. Available raw (unedited) data set: 21669 observations in 90 variables, structured according to NACE revision 1.1 with 3891 missing values.
We investigate 9 variables of NACE 52.42 (retail sale of clothing).
From this confidential raw data set, a non-confidential, close-to-reality, synthetic data set was generated.
data(SBS5242)
aggr(SBS5242)
Create a bivariate jitter plot.
scattJitt( x, delimiter = NULL, col = c("skyblue", "red", "red4", "orange", "orange4"), alpha = NULL, cex = par("cex"), col.line = "lightgrey", lty = "dashed", lwd = par("lwd"), numbers = TRUE, cex.numbers = par("cex"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, axes = TRUE, frame.plot = axes, labels = c("observed", "missing", "imputed"), ... )
x |
a |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
col |
a vector of length five giving the colors to be used in the plot. The first color will be used for complete observations, the second/fourth color for missing/imputed values in only one variable, and the third/fifth color for missing/imputed values in both variables. If only one color is supplied, it is used for all. Else if two colors are supplied, the second one is recycled. |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
cex |
the character expansion factor for the plot characters. |
col.line |
the color for the lines dividing the plot region. |
lty |
the line type for the lines dividing the plot region (see
|
lwd |
the line width for the lines dividing the plot region. |
numbers |
a logical indicating whether the frequencies of observed and missing/imputed values should be displayed (see ‘Details’). |
cex.numbers |
the character expansion factor to be used for the frequencies of the observed and missing/imputed values. |
main , sub
|
main and sub title. |
xlab , ylab
|
axis labels. |
axes |
a logical indicating whether both axes should be drawn on the
plot. Use graphical parameter |
frame.plot |
a logical indicating whether a box should be drawn around the plot. |
labels |
a vector of length three giving the axis labels for the regions for observed, missing and imputed values (see ‘Details’). |
... |
further graphical parameters to be passed down (see
|
The amount of observed and missing/imputed values is visualized by jittered points. Thereby the plot region is divided into up to four regions according to the existence of missing/imputed values in one or both variables. In addition, the amount of observed and missing/imputed values can be represented by a number.
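The counts behind the regions of the plot correspond to a cross-tabulation of missingness in the two variables. The sketch below computes them for a made-up data frame `d` (hypothetical) in base R; it does not draw the jittered points.

```r
# Hypothetical bivariate data with missing values.
d <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(NA, NA, 2, 4, 1, 3)
)

# Cross-tabulate missingness: each cell corresponds to one plot region.
regions <- table(
  a = factor(is.na(d$a), levels = c(FALSE, TRUE), labels = c("observed", "missing")),
  b = factor(is.na(d$b), levels = c(FALSE, TRUE), labels = c("observed", "missing"))
)
```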
Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in `VIM`. For backward compatibility, the argument `cex.text` can still be supplied to `...` and is handled correctly. Nevertheless, it is deprecated and no longer documented. Use `cex.numbers` instead.
Matthias Templ, modifications by Andreas Alfons and Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattMiss(), scattmatrixMiss(), spineMiss()
data(tao, package = "VIM")
## for missing values
scattJitt(tao[, c("Air.Temp", "Humidity")])
## for imputed values
scattJitt(kNN(tao[, c("Air.Temp", "Humidity")]), delimiter = "_imp")
Scatterplot matrix in which observations with missing/imputed values in certain variables are highlighted.
scattmatrixMiss( x, delimiter = NULL, highlight = NULL, selection = c("any", "all"), plotvars = NULL, col = c("skyblue", "red", "orange"), alpha = NULL, pch = c(1, 3), lty = par("lty"), diagonal = c("density", "none"), interactive = TRUE, ... )
x |
a matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
highlight |
a vector giving the variables to be used for highlighting.
If |
selection |
the selection method for highlighting missing/imputed
values in multiple highlight variables. Possible values are |
plotvars |
a vector giving the variables to be plotted. If |
col |
a vector of length three giving the colors to be used in the plot. The second/third color will be used for highlighting missing/imputed values. |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
pch |
a vector of length two giving the plot characters. The second plot character will be used for the highlighted observations. |
lty |
a vector of length two giving the line types for the density
plots in the diagonal panels (if |
diagonal |
a character string specifying the plot to be drawn in the
diagonal panels. Possible values are |
interactive |
a logical indicating whether the variables to be used for highlighting can be selected interactively (see ‘Details’). |
... |
for |
`scattmatrixMiss` uses `pairsVIM()` with a panel function that allows highlighting of missing/imputed values.
If `interactive=TRUE`, the variables to be used for highlighting can be selected interactively. Observations with missing/imputed values in any or in all of the selected variables are highlighted (as determined by `selection`). A variable can be added to the selection by clicking in a diagonal panel. If a variable is already selected, clicking on the corresponding diagonal panel removes it from the selection. Clicking anywhere else quits the interactive session.
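The difference between the two `selection` modes can be made concrete in base R. The data frame and the `highlight` vector below are made up, and the sketch covers only the rule for missing values, not the plotting.

```r
# Hypothetical data; 'u' and 'v' play the role of the highlight variables.
d <- data.frame(
  u = c(1, NA, 3, NA, 5),
  v = c(NA, NA, 2, 4, 1),
  w = c(7, 8, 9, 10, 11)
)
highlight <- c("u", "v")

# Number of missing values per observation in the highlight variables.
n_miss <- rowSums(is.na(d[, highlight, drop = FALSE]))

# selection = "any": missing in at least one highlight variable.
hl_any <- n_miss > 0
# selection = "all": missing in every highlight variable.
hl_all <- n_miss == length(highlight)
```

With a single highlight variable both modes give the same result.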
The graphical parameter `oma` will be set unless supplied as an argument.
`TKRscattmatrixMiss` behaves like `scattmatrixMiss`, but uses tkrplot to embed the plot in a Tcl/Tk window. This is useful if the number of variables is large, because scrollbars allow moving from one part of the plot to another.
Some of the argument names and positions have changed with version 1.3 due to a re-implementation and for more consistency with other plot functions in `VIM`. For backward compatibility, the argument `colcomb` can still be supplied to `...` and is handled correctly. Nevertheless, it is deprecated and no longer documented. Use `highlight` instead. The arguments `smooth`, `reg.line` and `legend.plot` are no longer used and are ignored if supplied.
Andreas Alfons, Matthias Templ, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), spineMiss()
data(sleep, package = "VIM")
## for missing values
x <- sleep[, 1:5]
x[,c(1,2,4)] <- log10(x[,c(1,2,4)])
scattmatrixMiss(x, highlight = "Dream")
## for imputed values
x_imp <- kNN(sleep[, 1:5])
x_imp[,c(1,2,4)] <- log10(x_imp[,c(1,2,4)])
scattmatrixMiss(x_imp, delimiter = "_imp", highlight = "Dream")
In addition to a standard scatterplot, lines are plotted for the missing values in one variable. If there are imputed values, they will be highlighted.
scattMiss( x, delimiter = NULL, side = 1, col = c("skyblue", "red", "orange", "lightgrey"), alpha = NULL, lty = c("dashed", "dotted"), lwd = par("lwd"), quantiles = c(0.5, 0.975), inEllipse = FALSE, zeros = FALSE, xlim = NULL, ylim = NULL, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, interactive = TRUE, ... )
x |
a |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
side |
if |
col |
a vector of length four giving the colors to be used in the plot. The first color is used for the scatterplot, the second/third color for the rug representation for missing/imputed values. The second color is also used for the lines for missing values. Imputed values will be highlighted with the third color, and the fourth color is used for the ellipses (see ‘Details’). If only one color is supplied, it is used for the scatterplot, the rug representation and the lines, whereas the default color is used for the ellipses. Else if a vector of length two is supplied, the default color is used for the ellipses as well. |
alpha |
a numeric value between 0 and 1 giving the level of
transparency of the colors, or |
lty |
a vector of length two giving the line types for the lines and ellipses. If a single value is supplied, it will be used for both. |
lwd |
a vector of length two giving the line widths for the lines and ellipses. If a single value is supplied, it will be used for both. |
quantiles |
a vector giving the quantiles of the chi-square
distribution to be used for the tolerance ellipses, or |
inEllipse |
plot lines only inside the largest ellipse. Ignored if
|
zeros |
a logical vector of length two indicating whether the variables
are semi-continuous, i.e., contain a considerable amount of zeros. If
|
xlim , ylim
|
axis limits. |
main , sub
|
main and sub title. |
xlab , ylab
|
axis labels. |
interactive |
a logical indicating whether the |
... |
further graphical parameters to be passed down (see
|
Information about missing values in one variable is included as vertical or horizontal lines, as determined by the `side` argument. The lines are thereby drawn at the observed x- or y-value. In case of imputed values, they will additionally be highlighted in the scatterplot. Furthermore, percentage coverage ellipses can be drawn to indicate the shape of the bivariate data distribution.
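How a coverage ellipse follows from a chi-square quantile can be sketched in base R. The sample is simulated, and the sketch assumes the classical mean and covariance as centre and shape for simplicity; `scattMiss()` may use different (for example robust) estimates.

```r
# Hypothetical complete bivariate sample.
set.seed(42)
xy <- cbind(x = rnorm(200), y = rnorm(200))
xy[, "y"] <- 0.6 * xy[, "x"] + 0.8 * xy[, "y"]

# Assumption: classical mean and covariance are used for the centre and shape.
centre <- colMeans(xy)
S <- cov(xy)

# Squared radius from the chi-square distribution with 2 degrees of freedom.
r2 <- qchisq(0.975, df = 2)

# Map points on the unit circle to the ellipse via the Cholesky factor of S.
theta <- seq(0, 2 * pi, length.out = 100)
circle <- cbind(cos(theta), sin(theta))
ellipse <- sweep(sqrt(r2) * circle %*% chol(S), 2, centre, "+")

# lines(ellipse) would add the ellipse to an existing scatterplot.
```

Every point on the resulting curve has squared Mahalanobis distance `r2` from the centre, which is what ties the ellipse to the chosen quantile.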
If `interactive` is `TRUE`, clicking in the bottom margin redraws the plot with information about missing/imputed values in the first variable and clicking in the left margin redraws the plot with information about missing/imputed values in the second variable. Clicking anywhere else in the plot quits the interactive session.
The argument `zeros` has been introduced in version 1.4. As a result, some of the argument positions have changed.
Andreas Alfons, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattmatrixMiss(), spineMiss()
data(tao, package = "VIM")
## for missing values
scattMiss(tao[,c("Air.Temp", "Humidity")])
## for imputed values
scattMiss(kNN(tao[,c("Air.Temp", "Humidity")]), delimiter = "_imp")
Sleep data with missing values.
A data frame with 62 observations on the following 10 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Allison, T. and Cicchetti, D. (1976) Sleep in mammals: ecological and constitutional correlates. Science 194 (4266), 732–734.
The data set was imported from GGobi.
data(sleep, package = "VIM")
summary(sleep)
aggr(sleep)
Spineplot or spinogram with highlighting of missing/imputed values in other variables by splitting each cell into two parts. Additionally, information about missing/imputed values in the variable of interest is shown on the right hand side.
spineMiss( x, delimiter = NULL, pos = 1, selection = c("any", "all"), breaks = "Sturges", right = TRUE, col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"), border = NULL, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, axes = TRUE, labels = axes, only.miss = TRUE, miss.labels = axes, interactive = TRUE, ... )
x |
a vector, matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
pos |
a numeric value giving the index of the variable of interest.
Additional variables in |
selection |
the selection method for highlighting missing/imputed
values in multiple additional variables. Possible values are |
breaks |
if the variable of interest is numeric, |
right |
logical; if |
col |
a vector of length six giving the colors to be used. If only one color is supplied, the bars are transparent and the supplied color is used for highlighting missing/imputed values. Else if two colors are supplied, they are recycled. |
border |
the color to be used for the border of the cells. Use
|
main , sub
|
main and sub title. |
xlab , ylab
|
axis labels. |
axes |
a logical indicating whether axes should be drawn on the plot. |
labels |
if the variable of interest is categorical, either a logical indicating whether labels should be plotted below each cell, or a character vector giving the labels. This is ignored if the variable of interest is numeric. |
only.miss |
logical; if |
miss.labels |
either a logical indicating whether label(s) should be plotted below the cell(s) on the right hand side, or a character string or vector giving the label(s) (see ‘Details’). |
interactive |
a logical indicating whether the variables can be switched interactively (see ‘Details’). |
... |
further graphical parameters to be passed to
|
A spineplot is created if the variable of interest is categorical and a spinogram if it is numerical. The horizontal axis is scaled according to relative frequencies of the categories/classes. If more than one variable is supplied, the cells are split according to missingness/number of imputed values in the additional variables. Thus the proportion of highlighted observations in each category/class is displayed on the vertical axis. Since the height of each cell corresponds to the proportion of highlighted observations, it is now possible to compare the proportions of missing/imputed values among the different categories/classes.
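The two quantities that define the cells, the width per category and the highlighted proportion within it, can be computed directly. The data frame below is made up and the sketch covers only the categorical case with missing values; it is not the code of `spineMiss()`.

```r
# Hypothetical data: 'g' is the categorical variable of interest,
# 'z' is an additional variable with missing values.
d <- data.frame(
  g = factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c")),
  z = c(1, NA, 3, NA, NA, 6, 7, NA, 9, 10)
)

# Cell widths: relative frequencies of the categories (horizontal axis).
widths <- prop.table(table(d$g))

# Cell heights: proportion of observations with missing 'z' per category.
heights <- tapply(is.na(d$z), d$g, mean)
```

Because the heights are proportions within each category, they stay comparable even when the categories differ a lot in size.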
If `only.miss=TRUE`, the missing/imputed values in the variable of interest are also visualized by a cell in the spineplot or spinogram. If additional variables are supplied, this cell is again split into two parts according to missingness/number of imputed values in the additional variables.
Otherwise, a small spineplot that visualizes missing/imputed values in the variable of interest is drawn on the right hand side. The first cell corresponds to observed values and the second cell to missing/imputed values. Each of the two cells is again split into two parts according to missingness/number of imputed values in the additional variables. Note that this display does not make sense if only one variable is supplied, therefore `only.miss` is ignored in that case.
If `interactive=TRUE`, clicking in the left margin of the plot results in switching to the previous variable and clicking in the right margin results in switching to the next variable. Clicking anywhere else on the graphics device quits the interactive session.
a table containing the frequencies corresponding to the cells.
Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in `VIM`. For backward compatibility, the arguments `xaxlabels` and `missaxlabels` can still be supplied to `...` and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use `labels` and `miss.labels` instead.
The code is based on the function `graphics::spineplot()` by Achim Zeileis.
Andreas Alfons, Matthias Templ, modifications by Bernd Prantner
M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.
histMiss(), barMiss(), mosaicMiss()
Other plotting functions: aggr(), barMiss(), histMiss(), marginmatrix(), marginplot(), matrixplot(), mosaicMiss(), pairsVIM(), parcoordMiss(), pbox(), scattJitt(), scattMiss(), scattmatrixMiss()
data(tao, package = "VIM")
data(sleep, package = "VIM")
## for missing values
spineMiss(tao[, c("Air.Temp", "Humidity")])
spineMiss(sleep[, c("Exp", "Sleep")])
## for imputed values
spineMiss(kNN(tao[, c("Air.Temp", "Humidity")]), delimiter = "_imp")
spineMiss(kNN(sleep[, c("Exp", "Sleep")]), delimiter = "_imp")
Create a `reactable` table that highlights missing values and imputed values with the same colors as `histMiss()`.
tableMiss(x, delimiter = "_imp")
x |
a vector, matrix or |
delimiter |
a character-vector to distinguish between variables and
imputation-indices for imputed variables (therefore, |
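The role of `delimiter` can be illustrated on a data frame laid out like the output of an imputation function with `imp_var = TRUE`. The sketch below is base R on made-up values and assumes that imputation-index columns are recognized by their name ending in the delimiter; the exact matching rule used inside VIM is not shown here.

```r
# Hypothetical imputed data set with TRUE/FALSE imputation-index columns.
d <- data.frame(
  Air.Temp = c(27.1, 26.5, 25.9),
  Humidity = c(81.2, 79.0, 84.6),
  Air.Temp_imp = c(FALSE, TRUE, FALSE),
  Humidity_imp = c(FALSE, FALSE, TRUE)
)
delimiter <- "_imp"

# Assumption: columns whose names end with the delimiter hold the indices.
is_index <- endsWith(names(d), delimiter)
index_cols <- names(d)[is_index]

# Strip the delimiter to recover the variable each index column refers to.
variables <- sub(paste0(delimiter, "$"), "", index_cols)

# Logical matrix marking the imputed cells of the data columns.
imputed <- as.matrix(d[, index_cols, drop = FALSE])
colnames(imputed) <- variables
```

A matrix like `imputed` is what a display needs in order to colour imputed cells differently from observed ones.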
data(tao)
x_IMPUTED <- kNN(tao[, c("Air.Temp", "Humidity")])
tableMiss(x_IMPUTED[105:114, ])
x_IMPUTED[106, 2] <- NA
x_IMPUTED[105, 1] <- NA
x_IMPUTED[107, "Humidity_imp"] <- TRUE
tableMiss(x_IMPUTED[105:114, ])
A small subsample of the Tropical Atmosphere Ocean (TAO) project data, derived from the GGobi project.
A data frame with 736 observations on the following 8 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
zonal wind, i.e. latitude-parallel wind
meridional wind, i.e. longitude-parallel wind
All cases recorded for five locations and two time periods.
data(tao, package = "VIM")
summary(tao)
aggr(tao)
2 numeric, 2 binary, 2 nominal and 2 mixed (semi-continuous) variables
The format is: List of 4
$wna: a data.frame with 500 obs. of 8 variables:
  x1: numeric 10.87 9.53 7.83 8.53 8.67 ...
  x2: numeric 10.9 9.32 7.68 8.2 8.41 ...
  c1: Factor w/ 4 levels "a","b","c","d": 3 2 2 1 2 2 1 3 3 2 ...
  c2: Factor w/ 4 levels "a","b","c","d": 2 3 2 2 2 2 2 4 2 2 ...
  b1: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 1 1 ...
  b2: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 2 2 ...
  m1: numeric 0 8.29 9.08 0 0 ...
  m2: numeric 10.66 9.39 7.8 8.11 7.33 ...
$wona: a data.frame with 500 obs. of 8 variables:
  x1: numeric 10.87 9.53 7.83 8.53 8.67 ...
  x2: numeric 10.9 9.32 7.68 8.2 8.41 ...
  c1: Factor w/ 4 levels "a","b","c","d": 3 2 2 1 2 2 1 3 3 2 ...
  c2: Factor w/ 4 levels "a","b","c","d": 2 3 2 2 2 2 2 4 2 2 ...
  b1: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 1 1 ...
  b2: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 2 2 ...
  m1: numeric 0 8.29 9.08 0 0 ...
  m2: numeric 10.66 9.39 7.8 8.11 7.33 ...
$mixed: c("m1", "m2")
$outlierInd: NULL
data(testdata)
A 2-dimensional data set with additional information.
data frame with 100 observations and 12 variables. The first two variables represent the fully observed data.
data(toydataMiss)
Wine reviews from France, Switzerland, Austria and Germany.
A data frame with 9627 observations on the following 9 variables.
country of origin
the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
the cost for a bottle of the wine
the province or state that the wine is from
name of the person who tasted and reviewed the wine
Twitter handle for the person who tasted and reviewed the wine
the type of grapes used to make the wine (e.g. Pinot Noir)
the winery that made the wine
a broader category of the variety
The data was scraped from WineEnthusiast during the week of Nov 22nd, 2017. The code for the scraper can be found at https://github.com/zackthoutt/wine-deep-learning. This data set is slightly modified, i.e. only four countries are selected and broader categories of the variety have been added.
https://www.kaggle.com/zynicide/wine-reviews
data(wine)
str(wine)
aggr(wine)
Impute missing values based on a gradient-boosted tree model using xgboost::xgboost()
xgboostImpute(
  formula,
  data,
  imp_var = TRUE,
  imp_suffix = "imp",
  verbose = FALSE,
  nrounds = 100,
  objective = NULL,
  ...
)
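formula | model formula for the imputation |
data | A data.frame containing the data |
imp_var | TRUE/FALSE: whether a TRUE/FALSE indicator variable showing the imputation status should be created for each imputed variable |
imp_suffix | suffix used for the TRUE/FALSE imputation indicator variables |
verbose | Show the number of observations used for training and evaluating the model. This parameter is also passed down to xgboost::xgboost() |
nrounds | maximum number of boosting iterations, argument passed to xgboost::xgboost() |
objective | objective for xgboost, argument passed to xgboost::xgboost() |
... | Arguments passed to xgboost::xgboost() |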
the imputed data set.
Other imputation methods: hotdeck(), impPCA(), irmi(), kNN(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), sampleCat()
data(sleep)
# impute one, two or three response variables from the same predictors
xgboostImpute(Dream ~ BodyWgt + BrainWgt, data = sleep)
xgboostImpute(Dream + NonD ~ BodyWgt + BrainWgt, data = sleep)
xgboostImpute(Dream + NonD + Gest ~ BodyWgt + BrainWgt, data = sleep)

# impute a factor variable
sleepx <- sleep
sleepx$Pred <- as.factor(LETTERS[sleepx$Pred])
sleepx$Pred[1] <- NA
xgboostImpute(Pred ~ BodyWgt + BrainWgt, data = sleepx)
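The calls above print the imputed data but do not show how to check the result. The following sketch, which assumes the xgboost package is installed, stores the output and compares it with the input. The object name imputed is illustrative, and the added columns are found with setdiff() so that no particular indicator name has to be assumed.

```r
library(VIM)
data(sleep, package = "VIM")

imputed <- xgboostImpute(Dream ~ BodyWgt + BrainWgt, data = sleep)

# Missing values in the response before and after imputation
sum(is.na(sleep$Dream))
sum(is.na(imputed$Dream))

# Columns added by the imputation (indicator variables when imp_var = TRUE)
setdiff(names(imputed), names(sleep))
```

Setting imp_var = FALSE in the call should return the data without the additional indicator column.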