Title: | Simulation of Complex Synthetic Data Information |
---|---|
Description: | Tools and methods to simulate populations for surveys based on auxiliary data. The tools include model-based methods, calibration and combinatorial optimization algorithms, see Templ, Kowarik and Meindl (2017) <doi:10.18637/jss.v079.i10>) and Templ (2017) <doi:10.1007/978-3-319-50272-4>. The package was developed with support of the International Household Survey Network, DFID Trust Fund TF011722 and funds from the World bank. |
Authors: | Matthias Templ [aut, cre], Alexander Kowarik [aut] , Bernhard Meindl [aut], Andreas Alfons [aut], Mathieu Ribatet [ctb], Johannes Gussenbauer [ctb], Siro Fritzmann [ctb] |
Maintainer: | Matthias Templ <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.1.3 |
Built: | 2024-11-08 13:30:38 UTC |
Source: | https://github.com/statistikat/simpop |
The production of synthetic datasets has been proposed as a statistical disclosure control solution to generate public use files out of protected data, and as a tool to create “augmented datasets” to serve as input for micro-simulation models. Synthetic data have become an important instrument for ex-ante assessments of policies' impact. The performance and acceptability of such a tool relies heavily on the quality of the synthetic populations, i.e., on the statistical similarity between the synthetic and the true population of interest.
Multiple approaches and tools have been developed to generate synthetic data. These approaches can be categorized into three main groups: synthetic reconstruction, combinatorial optimization, and model-based generation.
The package: simPop is a user-friendly R-package based on a modular object-oriented concept. It provides a highly optimized S4 class implementation of various methods, including calibration by iterative proportional fitting and simulated annealing, and modeling or data fusion by logistic regression.
The following applications further shows the methods and package:
We firstly demonstrated the use of simPop by creating
a synthetic population of Austria based on the
European Statistics of Income and Living Conditions (Alfons et al., 2011)
including the evaluation of the quality of the generated population.
In this contribution, the mathematical details of functions simStructure
, simCategorical
,
simContinuous
and simComponents
are given in detail.
The disclosure risk of this synthetic population has been evaluated in (Templ and Alfons, 2012) using large-scale simulation studies.
Employer-employee data were created in Templ and Filzmoser (2014) whereby the structure of companies and employees are considered.
Finally, the R package simPop is presented in full detail in Templ et al. (2017). In this paper - the main reference to this work - all functions and the S4 class structure of the package are described in detail. For beginners, this paper might be the starting point to learn about the methods and package.
Package: | simPop |
Type: | Package |
Version: | 1.0.0 |
Date: | 20017-08-07 |
License: | GPL (>= 2) |
Bernhard Meindl, Matthias Templ, Andreas Alfons, Alexander Kowarik,
Maintainer: Matthias Templ <[email protected]>
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi: 10.1007/s10260-011-0163-2
M. Templ, P. Filzmoser (2014) Simulation and quality of a synthetic close-to-reality employer-employee population. Journal of Applied Statistics, 41 (5), 1053–1072. doi:10.1080/02664763.2013.859237
M. Templ, A. Alfons (2012) Disclosure Risk of Synthetic Population Data with Application in the Case of EU-SILC. In J Domingo-Ferrer, E Magkos (eds.), Privacy in Statistical Databases, 6344 of Lecture Notes in Computer Science, 174–186. Springer Verlag, Heidelberg. doi:10.1007/978-3-642-15838-4_16
Useful links:
## we use synthetic eusilcS survey sample data ## included in the package to simulate a population ## create the structure data(eusilcS) ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## in the following, nr_cpus are selected automatically simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) simPop <- simCategorical(simPop, additional=c("pl030", "pb220a"), method="multinom", nr_cpus=1) simPop class(simPop) regModel = ~rb090+hsize+pl030+pb220a ## multinomial model with random draws eusilcM <- simContinuous(simPop, additional="netIncome", regModel = regModel, upper=200000, equidist=FALSE, nr_cpus=1) class(eusilcM) ## this is already a basic synthetic population, but ## many other functions in the package might now ## be used for fine-tuning, adding further variables, ## evaluating the quality, adding finer geographical details, ## using different methods, calibrating surveys or populations, etc. ## -- see Templ et al. (2017) for more details.
## we use synthetic eusilcS survey sample data ## included in the package to simulate a population ## create the structure data(eusilcS) ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## in the following, nr_cpus are selected automatically simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) simPop <- simCategorical(simPop, additional=c("pl030", "pb220a"), method="multinom", nr_cpus=1) simPop class(simPop) regModel = ~rb090+hsize+pl030+pb220a ## multinomial model with random draws eusilcM <- simContinuous(simPop, additional="netIncome", regModel = regModel, upper=200000, equidist=FALSE, nr_cpus=1) class(eusilcM) ## this is already a basic synthetic population, but ## many other functions in the package might now ## be used for fine-tuning, adding further variables, ## evaluating the quality, adding finer geographical details, ## using different methods, calibrating surveys or populations, etc. ## -- see Templ et al. (2017) for more details.
add known margins/totals for a combination of variables for the population
to an object of class simPopObj
.
addKnownMargins(inp, margins)
addKnownMargins(inp, margins)
inp |
a |
margins |
a |
The function takes a data.frame containing known marginals/totals for a some variables that must exist in the population (stored in slot 'pop' of input object 'inp') and updates slot 'table' of the input object. This slot finally contains the known totals.
households are drawn from the data and new ID's are generated for the new households.
an object of class simPopObj
with updated slot
'table'.
Bernhard Meindl
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
data(eusilcS) data(eusilcP) ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") inp <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) inp <- simCategorical(inp, additional=c("pl030", "pb220a"), method="multinom",nr_cpus=1) margins <- as.data.frame( xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$region + eusilcP$gender + eusilcP$citizenship)) colnames(margins) <- c("db040", "rb090", "pb220a", "freq") inp <- addKnownMargins(inp, margins) str(inp) ## End(Not run)
data(eusilcS) data(eusilcP) ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") inp <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) inp <- simCategorical(inp, additional=c("pl030", "pb220a"), method="multinom",nr_cpus=1) margins <- as.data.frame( xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$region + eusilcP$gender + eusilcP$citizenship)) colnames(margins) <- c("db040", "rb090", "pb220a", "freq") inp <- addKnownMargins(inp, margins) str(inp) ## End(Not run)
addWeights
allows to modify sampling weights of an dataObj
or
simPopObj
-object. As input the output of
calibSample
must be used.
addWeights(object) <- value ## S4 replacement method for signature 'dataObj' addWeights(object) <- value ## S4 replacement method for signature 'simPopObj' addWeights(object) <- value
addWeights(object) <- value ## S4 replacement method for signature 'dataObj' addWeights(object) <- value ## S4 replacement method for signature 'simPopObj' addWeights(object) <- value
object |
|
value |
a numeric vector of suitable length |
data(eusilcS) data(totalsRG) ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## approx. 20 seconds ... addWeights(inp) <- calibSample(inp, totalsRG) ## End(Not run)
data(eusilcS) data(totalsRG) ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## approx. 20 seconds ... addWeights(inp) <- calibSample(inp, totalsRG) ## End(Not run)
A Simulated Annealing Algorithm for calibration of synthetic population data
available in a simPopObj
-object. The aims is to find,
given a population, a combination of different households which optimally
satisfy, in the sense of an acceptable error, a given table of specific
known marginals. The known marginals are also already available in slot
'table' of the input object 'inp'.
calibPop( inp, split = NULL, splitUpper = NULL, temp = 1, epsP.factor = 0.05, epsH.factor = 0.05, epsMinN = 0, maxiter = 200, temp.cooldown = 0.9, factor.cooldown = 0.85, min.temp = 10^-3, nr_cpus = NULL, sizefactor = 2, choose.temp = TRUE, choose.temp.factor = 0.2, scale.redraw = 0.5, observe.times = 50, observe.break = 0.05, n.forceCooldown = 100, verbose = FALSE, hhTables = NULL, persTables = NULL, redist.var = NULL, redist.var.factor = 1 )
calibPop( inp, split = NULL, splitUpper = NULL, temp = 1, epsP.factor = 0.05, epsH.factor = 0.05, epsMinN = 0, maxiter = 200, temp.cooldown = 0.9, factor.cooldown = 0.85, min.temp = 10^-3, nr_cpus = NULL, sizefactor = 2, choose.temp = TRUE, choose.temp.factor = 0.2, scale.redraw = 0.5, observe.times = 50, observe.break = 0.05, n.forceCooldown = 100, verbose = FALSE, hhTables = NULL, persTables = NULL, redist.var = NULL, redist.var.factor = 1 )
inp |
an object of class |
split |
given strata in which the problem will be split. Has to
correspond to a column population data (slot 'pop' of input argument 'inp')
. For example |
splitUpper |
optional column in the population for which decides the part
of the population from which to sample for each entry in |
temp |
starting temperatur for simulated annealing algorithm |
epsP.factor |
a factor (between 0 and 1) specifying the acceptance
error for contingency table on individual level. For example epsP.factor = 0.05 results in an acceptance error for the
objective function of |
epsH.factor |
a factor (between 0 and 1) specifying the acceptance
error for contingency table on household level. For example epsH.factor = 0.05 results in an acceptance error for the
objective function of |
epsMinN |
integer specifying the minimum number of units from which the synthetic populatin can deviate from cells in contingency tables.
This overwrites |
maxiter |
maximum iterations during a temperature step. |
temp.cooldown |
a factor (between 0 and 1) specifying the rate at which temperature will be reduced in each step. |
factor.cooldown |
a factor (between 0 and 1) specifying the rate at which the number of permutations of housholds, in each iteration, will be reduced in each step. |
min.temp |
minimal temperature at which the algorithm will stop. |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
sizefactor |
the factor for inflating the population before applying 0/1 weights |
choose.temp |
if TRUE |
choose.temp.factor |
number between (0,1) for rescaling |
scale.redraw |
Number between (0,1) scaling the number of households that need to be drawn and discarded in each iteration step.
The number of individuals currently selected through simulated annealing is substracted from the sum over the target population margins added to |
observe.times |
Number of times the new value of the objective function is saved. If |
observe.break |
When objective value has been saved |
n.forceCooldown |
integer, if the solution does not move for |
verbose |
boolean variable; if TRUE some additional verbose output is
provided, however only if |
hhTables |
Information on population margins for households. Can bei either a single |
persTables |
Information on population margins for persons. Can bei either a single |
redist.var |
single column in the population which can be redistributed in each 'split'. Still experimental! |
redist.var.factor |
numeric in the interval (0,1]. Used in combinationo with 'redist.var', still experimental! |
Calibrates data using simulated annealing. The algorithm searches for a (near) optimal combination of different households, by swaping housholds at random in each iteration of each temperature level. During the algorithm as well as for the output the optimal (or so far best) combination will be indicated by a logical vector containg only 0s (not inculded) and 1s (included in optimal selection). The objective function for simulated annealing is defined by the sum of absolute differences between target marginals and synthetic marginals (=marginals of synthetic dataset). The sum of target marginals can at most be as large as the sum of target marginals. For every factor-level in “split”, data must at least contain as many entries of this kind as target marginals.
Possible donors are automatically generated within the procedure.
The number of cpus are selected automatically in the following manner. The number of cpus is equal the number of strata. However, if the number of cpus is less than the number of strata, the number of cpus - 1 is used by default. This should be the best strategy, but the user can also overwrite this decision.
Returns an object of class simPopObj
with an
updated population listed in slot 'pop'.
Bernhard Meindl, Johannes Gussenbauer and Matthias Templ
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
data(eusilcS) # load sample data data(eusilcP) # population data ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) simPop <- simCategorical(simPop, additional=c("pl030", "pb220a"), method="multinom", nr_cpus=1) # add margins margins <- as.data.frame( xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$region + eusilcP$gender + eusilcP$citizenship)) colnames(margins) <- c("db040", "rb090", "pb220a", "freq") simPop <- addKnownMargins(simPop, margins) simPop_adj2 <- calibPop(simPop, split="db040", temp=1, epsP.factor=0.1, epsMinN=10, nr_cpus = 1) ## End(Not run) # apply simulated annealing ## Not run: simPop_adj <- calibPop(simPop, split="db040", temp=1, epsP.factor=0.1,nr_cpus = 1) ## End(Not run) ## Not run: ### use multiple different margins # person margins persTables <- as.data.frame( xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$region + eusilcP$gender + eusilcP$citizenship)) colnames(persTables) <- c("db040", "rb090", "pb220a", "Freq") # household margins filter_hid <- !duplicated(eusilcP$hid) eusilcP$hsize4 <- pmin(4,as.numeric(eusilcP$hsize)) hhTables <- as.data.frame( xtabs(rep(1, sum(filter_hid)) ~ eusilcP[filter_hid,]$region+eusilcP[filter_hid,]$hsize4)) colnames(hhTables) <- c("db040", "hsize4", "Freq") simPop@pop@data$hsize4 <- pmin(4,as.numeric(simPop@pop@data$hsize)) simPop_adj_2 <- calibPop(simPop, split="db040", temp=1, epsP.factor=0.1, epsH.factor = 0.1, persTables = persTables, hhTables = hhTables, nr_cpus = 1) ## End(Not run)
data(eusilcS) # load sample data data(eusilcP) # population data ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) simPop <- simCategorical(simPop, additional=c("pl030", "pb220a"), method="multinom", nr_cpus=1) # add margins margins <- as.data.frame( xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$region + eusilcP$gender + eusilcP$citizenship)) colnames(margins) <- c("db040", "rb090", "pb220a", "freq") simPop <- addKnownMargins(simPop, margins) simPop_adj2 <- calibPop(simPop, split="db040", temp=1, epsP.factor=0.1, epsMinN=10, nr_cpus = 1) ## End(Not run) # apply simulated annealing ## Not run: simPop_adj <- calibPop(simPop, split="db040", temp=1, epsP.factor=0.1,nr_cpus = 1) ## End(Not run) ## Not run: ### use multiple different margins # person margins persTables <- as.data.frame( xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$region + eusilcP$gender + eusilcP$citizenship)) colnames(persTables) <- c("db040", "rb090", "pb220a", "Freq") # household margins filter_hid <- !duplicated(eusilcP$hid) eusilcP$hsize4 <- pmin(4,as.numeric(eusilcP$hsize)) hhTables <- as.data.frame( xtabs(rep(1, sum(filter_hid)) ~ eusilcP[filter_hid,]$region+eusilcP[filter_hid,]$hsize4)) colnames(hhTables) <- c("db040", "hsize4", "Freq") simPop@pop@data$hsize4 <- pmin(4,as.numeric(simPop@pop@data$hsize)) simPop_adj_2 <- calibPop(simPop, split="db040", temp=1, epsP.factor=0.1, epsH.factor = 0.1, persTables = persTables, hhTables = hhTables, nr_cpus = 1) ## End(Not run)
Calibrate sample weights according to known marginal population totals. Based on initial sample weights, the so-called g-weights are computed by generalized raking procedures.
The methods return a list containing both the g-weights (slot
g_weights
) as well as the final weights (slot final_weights
)
(initial sampling weights adjusted by the g-weights.
The function provides methods with the following signatures.
Argument 'inp' must be an object of
class data.frame
, dataObj
or
simPopObj
and the totals must be specified in either
objects of class table
or data.frame
. If argument 'totals' is
a data.frame it must be provided in a way that in the first columns
n-columns the combinations of variables are listed. In the last column, the
frequency counts must be specified. Furthermore, variable names of all but
the last column must be available also from the sample data specified in
argument 'inp'. If argument 'total' is a table (e.g. created with function
tableWt
, it must be made sure that the dimnames match the
variable names (and levels) of the specified input data set.
This is a faster implementation of parts of
calib
from package sampling
. Note that the
default calibration method is raking and that the truncated linear method is
not yet implemented.
Andreas Alfons and Bernhard Meindl
Deville, J.-C. and Saerndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376–382. Deville, J.-C., Saerndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88(423), 1013–1020.
data(eusilcS) eusilcS$agecut <- cut(eusilcS$age, 7) ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## for simplicity, we are using population data directly from the sample, but you get the idea totals1 <- tableWt(eusilcS[, c("agecut","rb090")], weights=eusilcS$rb050) totals2 <- tableWt(eusilcS[, c("rb090","agecut")], weights=eusilcS$rb050) totals3 <- tableWt(eusilcS[, c("rb090","agecut","db040")], weights=eusilcS$rb050) totals4 <- tableWt(eusilcS[, c("agecut","db040","rb090")], weights=eusilcS$rb050) weights1 <- calibSample(inp, totals1) totals1.df <- as.data.frame(totals1) weights1.df <- calibSample(inp, totals1.df) identical(weights1, weights1.df) # we can also use a data.frame and an optional weight vector as input df <- as.data.frame(inp@data) w <- inp@data[[inp@weight]] weights1.x <- calibSample(df, totals1.df, w=inp@data[[inp@weight]]) identical(weights1, weights1.x) weights2 <- calibSample(inp, totals2) totals2.df <- as.data.frame(totals2) weights2.df <- calibSample(inp, totals2.df) identical(weights2, weights2.df) ## End(Not run) ## Not run: ## approx 10 seconds computation time ... weights3 <- calibSample(inp, totals3) totals3.df <- as.data.frame(totals3) weights3.df <- calibSample(inp, totals3.df) identical(weights3, weights3.df) ## approx 10 seconds computation time ... weights4 <- calibSample(inp, totals4) totals4.df <- as.data.frame(totals4) weights4.df <- calibSample(inp, totals4.df) identical(weights4, weights4.df) ## End(Not run)
data(eusilcS) eusilcS$agecut <- cut(eusilcS$age, 7) ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## for simplicity, we are using population data directly from the sample, but you get the idea totals1 <- tableWt(eusilcS[, c("agecut","rb090")], weights=eusilcS$rb050) totals2 <- tableWt(eusilcS[, c("rb090","agecut")], weights=eusilcS$rb050) totals3 <- tableWt(eusilcS[, c("rb090","agecut","db040")], weights=eusilcS$rb050) totals4 <- tableWt(eusilcS[, c("agecut","db040","rb090")], weights=eusilcS$rb050) weights1 <- calibSample(inp, totals1) totals1.df <- as.data.frame(totals1) weights1.df <- calibSample(inp, totals1.df) identical(weights1, weights1.df) # we can also use a data.frame and an optional weight vector as input df <- as.data.frame(inp@data) w <- inp@data[[inp@weight]] weights1.x <- calibSample(df, totals1.df, w=inp@data[[inp@weight]]) identical(weights1, weights1.x) weights2 <- calibSample(inp, totals2) totals2.df <- as.data.frame(totals2) weights2.df <- calibSample(inp, totals2.df) identical(weights2, weights2.df) ## End(Not run) ## Not run: ## approx 10 seconds computation time ... weights3 <- calibSample(inp, totals3) totals3.df <- as.data.frame(totals3) weights3.df <- calibSample(inp, totals3.df) identical(weights3, weights3.df) ## approx 10 seconds computation time ... weights4 <- calibSample(inp, totals4) totals4.df <- as.data.frame(totals4) weights4.df <- calibSample(inp, totals4.df) identical(weights4, weights4.df) ## End(Not run)
Construct a matrix of binary variables for calibration of sample weights according to known marginal population totals. The following methods are implemented:
calibVars.default(x)
calibVars.matrix(x)
calibVars.matrix(x)
calibVars.data.frame(x)
calibVars(x)
calibVars(x)
x |
a vector that can be interpreted as factor, or a matrix or
|
A matrix of binary variables that indicate membership to the corresponding factor levels.
Bernhard Meindl and Andreas Alfons
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi: 10.18637/jss.v079.i10
data(eusilcS) # default method ## Not run: aux <- calibVars(eusilcS$rb090) head(aux) # data.frame method aux <- calibVars(eusilcS[, c("db040", "rb090")]) head(aux) ## End(Not run)
data(eusilcS) # default method ## Not run: aux <- calibVars(eusilcS$rb090) head(aux) # data.frame method aux <- calibVars(eusilcS[, c("db040", "rb090")]) head(aux) ## End(Not run)
Compute (weighted) pairwise contingency coefficients.
contingencyWt(x, ...)
contingencyWt(x, ...)
x |
for the default method, a vector that can be interpreted as factor.
For the matrix and |
... |
for the generic function, arguments to be passed down to the methods, otherwise ignored. |
The function tableWt
is used for the computation of the
corresponding pairwise contingency tables. The following methods are implemented:
contingencyWt.default(x, y, weights = NULL, ...)
contingencyWt.matrix(x, weights = NULL, ...)
contingencyWt.data.frame(x, weights = NULL, ...)
Additional parameters are:
y: a vector that can be interpreted as factor (for the default method)
weights: an optional numeric vector containing sample weights
For the default method, the (weighted) contingency coefficient of
x
and y
.
For the matrix and data.frame
method, a matrix of (weighted) pairwise
contingency coefficients for all combinations of columns. Elements below
the diagonal are NA
.
Andreas Alfons and Stefan Kraft
Kendall, M.G. and Stuart, A. (1967) The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Charles Griffin & Co Ltd, London, 2nd edition.
data(eusilcS) ## default method contingencyWt(eusilcS$pl030, eusilcS$pb220a, weights = eusilcS$rb050) ## data.frame method basic <- c("age", "rb090", "hsize", "pl030", "pb220a") contingencyWt(eusilcS[, basic], weights = eusilcS$rb050)
data(eusilcS) ## default method contingencyWt(eusilcS$pl030, eusilcS$pb220a, weights = eusilcS$rb050) ## data.frame method basic <- c("age", "rb090", "hsize", "pl030", "pb220a") contingencyWt(eusilcS[, basic], weights = eusilcS$rb050)
Correct for age heaping using truncated (log-)normal distributions
correctHeaps(x, heaps = "10year", method = "lnorm", start = 0, fixed = NULL)
correctHeaps(x, heaps = "10year", method = "lnorm", start = 0, fixed = NULL)
x |
numeric vector |
heaps |
|
method |
a character specifying the algorithm used to correct the age heaps. Allowed values are
|
start |
a numeric value for the starting of the 5 or 10 year sequences (e.g. 0, 5 or 10) |
fixed |
numeric index vector with observation that should not be changed |
Age heaping can cause substantial bias in important measures and thus age heaping should be corrected.
For method “lnorm”, a truncated log-normal is fit to the whole age distribution. Then for each age heap (at 0, 5, 10, 15, ...) random numbers of a truncated log-normal (with lower and upper bound) is drawn in the interval +- 2 around the heap (rounding of degree 2) using the inverse transformation method. A ratio of randomly chosen observations on an age heap are replaced by these random draws. For the ratio the age distribution is chosen, whereas on an age heap (e.g. 5) the arithmetic means of the two neighboring ages are calculated (average counts on age 4 and age 6 for age heap equals 5, for example). The ratio on, e.g. age equals 5 is then given by the count on age 5 divided by this mean This is done for any age heap at (0, 5, 10, 15, ...).
Method “norm” replace the draws from truncated log-normals to draws from truncated normals. It depends on the age distrubution (if right-skewed or not) if method “lnorm” or “norm” should be used. Many distributions with heaping problems are right-skewed.
Method “unif” draws the mentioned ratio of observations on truncated uniform distributions around the age heaps.
Repeated calls of this function mimics multiple imputation, i.e. repeating this procedure m times provides m imputed datasets that properly reflect the uncertainty from imputation.
a numeric vector without age heaps
Matthias Templ, Bernhard Meindl, Alexander Kowarik
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi: 10.18637/jss.v079.i10
## create some artificial data age <- rlnorm(10000, meanlog=2.466869, sdlog=1.652772) age <- round(age[age < 93]) barplot(table(age)) ## artificially introduce age heaping and correct it: # heaps every 5 years year5 <- seq(0, max(age), 5) age5 <- sample(c(age, age[age %in% year5])) cc5 <- rep("darkgrey", length(unique(age))) cc5[year5+1] <- "yellow" barplot(table(age5), col=cc5) barplot(table(correctHeaps(age5, heaps="5year", method="lnorm")), col=cc5) # heaps every 10 years year10 <- seq(0, max(age), 10) age10 <- sample(c(age, age[age %in% year10])) cc10 <- rep("darkgrey", length(unique(age))) cc10[year10+1] <- "yellow" barplot(table(age10), col=cc10) barplot(table(correctHeaps(age10, heaps="10year", method="lnorm")), col=cc10) # the first 5 observations should be unchanged barplot(table(correctHeaps(age10, heaps="10year", method="lnorm", fixed=1:5)), col=cc10)
## create some artificial data age <- rlnorm(10000, meanlog=2.466869, sdlog=1.652772) age <- round(age[age < 93]) barplot(table(age)) ## artificially introduce age heaping and correct it: # heaps every 5 years year5 <- seq(0, max(age), 5) age5 <- sample(c(age, age[age %in% year5])) cc5 <- rep("darkgrey", length(unique(age))) cc5[year5+1] <- "yellow" barplot(table(age5), col=cc5) barplot(table(correctHeaps(age5, heaps="5year", method="lnorm")), col=cc5) # heaps every 10 years year10 <- seq(0, max(age), 10) age10 <- sample(c(age, age[age %in% year10])) cc10 <- rep("darkgrey", length(unique(age))) cc10[year10+1] <- "yellow" barplot(table(age10), col=cc10) barplot(table(correctHeaps(age10, heaps="10year", method="lnorm")), col=cc10) # the first 5 observations should be unchanged barplot(table(correctHeaps(age10, heaps="10year", method="lnorm", fixed=1:5)), col=cc10)
Correct a specific age heap in a vector containing age in years
correctSingleHeap( x, heap, before = 2, after = 2, method = "lnorm", fixed = NULL )
correctSingleHeap( x, heap, before = 2, after = 2, method = "lnorm", fixed = NULL )
x |
numeric vector representing age in years (integers) |
heap |
numeric or integer vector of length 1 specifying the year for which a heap should be corrected |
before |
numeric or integer vector of length 1 specifying the number of years before the heap that may be used to correct the heap. This input will be rounded! |
after |
numeric or integer vector of length 1 specifying the number of years after the heap that may be used to correct the heap. This input will be rounded!
|
method |
a character specifying the algorithm used to correct the age heaps. Allowed values are
|
fixed |
numeric index vector with observation that should not be changed |
a numeric vector without age heaps
Matthias Templ, Bernhard Meindl, Alexander Kowarik
## create some artificial data age <- rlnorm(10000, meanlog=2.466869, sdlog=1.652772) age <- round(age[age < 93]) barplot(table(age)) ## artificially introduce an age heap for a specific year ## and correct it age23 <- c(age, rep(23, length=sum(age==23))) cc23 <- rep("darkgrey", length(unique(age))) cc23[24] <- "yellow" barplot(table(age23), col=cc23) barplot(table(correctSingleHeap(age23, heap=23, before=2, after=3, method="lnorm")), col=cc23) barplot(table(correctSingleHeap(age23, heap=23, before=5, after=5, method="lnorm")), col=cc23) # the first 5 observations should be unchanged barplot(table(correctSingleHeap(age23, heap=23, before=5, after=5, method="lnorm", fixed=1:5)), col=cc23)
## create some artificial data age <- rlnorm(10000, meanlog=2.466869, sdlog=1.652772) age <- round(age[age < 93]) barplot(table(age)) ## artificially introduce an age heap for a specific year ## and correct it age23 <- c(age, rep(23, length=sum(age==23))) cc23 <- rep("darkgrey", length(unique(age))) cc23[24] <- "yellow" barplot(table(age23), col=cc23) barplot(table(correctSingleHeap(age23, heap=23, before=2, after=3, method="lnorm")), col=cc23) barplot(table(correctSingleHeap(age23, heap=23, before=5, after=5, method="lnorm")), col=cc23) # the first 5 observations should be unchanged barplot(table(correctSingleHeap(age23, heap=23, before=5, after=5, method="lnorm", fixed=1:5)), col=cc23)
Simulate variables of population data. The household structure of the population data needs to be simulated beforehand.
crossValidation( simPopObj, additionals, hyper_param_grid, fold = 3, method = c("xgboost"), type = c("categorical"), by = "strata", regModel = "available", nr_cpus = 1, verbose = FALSE )
crossValidation( simPopObj, additionals, hyper_param_grid, fold = 3, method = c("xgboost"), type = c("categorical"), by = "strata", regModel = "available", nr_cpus = 1, verbose = FALSE )
simPopObj |
a |
additionals |
a character vector specifying additional categorical
variables available in the sample object of |
hyper_param_grid |
a grid which can contain model specific parameters which will be passed onto the function call for the respective model. |
fold |
the number of k in k-fold crossvalidation |
method |
a character string specifying the method to be used for
simulating the additional categorical variables. Accepted value at the moment
only
|
type |
currently only "categorical" is implemented |
by |
defining which variable to use as split up variable of the estimation. Defaults to the strata variable. |
regModel |
allows to specify the variables or model that is used when simulating additional categorical variables. The following choices are available if different from NULL.
If method 'distribution' is used, it is only possible to specify a vector of length one containing one of the choices described above. If parameter 'regModel' is NULL, only basic household variables are used in any case. |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
verbose |
set to TRUE if additional print output should be shown. |
The number of cpus are selected automatically in the following manner. The number of cpus is equal the number of strata. However, if the number of cpus is less than the number of strata, the number of cpus - 1 is used by default. This should be the best strategy, but the user can also overwrite this decision.
An object of class simPopObj
containing survey
data as well as the simulated population data including the categorical
variables specified by argument additional
.
The basic household structure needs to be simulated beforehand with
the function simStructure
.
Bernhard Meindl, Andreas Alfons, Stefan Kraft, Alexander Kowarik, Matthias Templ, Siro Fritzmann
simStructure
, simRelation
,
simContinuous
, simComponents
, simCategorical
data(eusilcS) # load sample data ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## in the following, nr_cpus are selected automatically simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) grid <- expand.grid(nrounds = c(5, 10), max_depth = 10, eta = c(0.2, 0.3, 0.5), eval_metric = "mlogloss", stringsAsFactors = FALSE) simPop <- crossValidation(simPop, additionals=c("pl030", "pb220a"), nr_cpus=1, hyper_param_grid = grid) simPop ## End(Not run)
data(eusilcS) # load sample data ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## in the following, nr_cpus are selected automatically simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) grid <- expand.grid(nrounds = c(5, 10), max_depth = 10, eta = c(0.2, 0.3, 0.5), eval_metric = "mlogloss", stringsAsFactors = FALSE) simPop <- crossValidation(simPop, additionals=c("pl030", "pb220a"), nr_cpus=1, hyper_param_grid = grid) simPop ## End(Not run)
"dataObj"
Objects of this class contain information on a population or survey.
Objects can be created by calls of the form
new("dataObj", ...)
but are usually automatically created when using
simStructure
.
Bernhard Meindl and Matthias Templ
showClass("dataObj") ## show method, generate an object of class dataObj first data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", weight="rb050", strata="db040") ## shows some basic information: inp
showClass("dataObj") ## show method, generate an object of class dataObj first data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", weight="rb050", strata="db040") ## shows some basic information: inp
This data set is synthetically generated from real Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data 2013.
A data frame with 13513 observations on the following 62 variables.
integer; the household ID.
integer; the number of persons in the household.
factor; the federal state in which the household is
located (levels Burgenland
, Carinthia
, Lower Austria
,
Salzburg
, Styria
, Tyrol
, Upper Austria
,
Vienna
and Vorarlberg
).
integer; the person's age.
factor; the person's gender (levels
male
and female
).
personal ID
sampling weights
factor; the person's
economic status (levels 1
= working full time, 2
= working
part time, 3
= unemployed, 4
= pupil, student, further
training or unpaid work experience or in compulsory military or community
service, 5
= in retirement or early retirement or has given up
business, 6
= permanently disabled or/and unfit to work or other
inactive person, 7
= fulfilling domestic tasks and care
responsibilities).
factor; the person's citizenship
(levels AT
, EU
and Other
).
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
for details, see Eurostat's code book
The data set consists of 5977 households and is used as sample data in some
of the examples in package simPop
. Note that it is included
for illustrative purposes only. The sample weights do not reflect the true
population sizes of Austria and its regions.
62 variables of the original survey are
simulated for this example data set. The variable names are rather cryptic
codes, but these are the standardized names used by the statistical
agencies. Furthermore, the variables hsize
, age
and
netIncome
are not included in the standardized format of EU-SILC
data, but have been derived from other variables for convenience.
Matthias Templ
This is a synthetic data set based on Austrian EU-SILC data from 2013. The original sample was provided by Statistics Austria.
Eurostat (2013) Description of target variables: Cross-sectional and longitudinal.
data(eusilc13puf) str(eusilc13puf)
data(eusilc13puf) str(eusilc13puf)
This data set is synthetically generated from real Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data.
A data.frame
with 58 654 observations on the following 28
variables:
integer; the household ID.
factor; the federal state in which the household is
located (levels Burgenland
, Carinthia
, Lower Austria
,
Salzburg
, Styria
, Tyrol
, Upper Austria
,
Vienna
and Vorarlberg
).
integer; the number of persons in the household.
numeric; the equivalized household size according to the modified OECD scale.
numeric; a simplified version of the equivalized household income.
integer; the personal ID.
the household ID combined with the personal ID. The first five digits represent the household ID, the last two digits the personal ID (both with leading zeros).
integer; the person's age.
factor; the person's gender (levels male
and
female
).
factor; the person's economic status
(levels 1
= working full time, 2
= working part time, 3
= unemployed, 4
= pupil, student, further training or unpaid work
experience or in compulsory military or community service, 5
= in
retirement or early retirement or has given up business, 6
=
permanently disabled or/and unfit to work or other inactive person, 7
= fulfilling domestic tasks and care responsibilities).
factor; the person's citizenship (levels
AT
, EU
and Other
).
numeric; employee cash or near cash income (net).
numeric; cash benefits or losses from self-employment (net).
numeric; unemployment benefits (net).
numeric; old-age benefits (net).
numeric; survivor's benefits (net).
numeric; sickness benefits (net).
numeric; disability benefits (net).
numeric; education-related allowances (net).
numeric; income from rental of a property or land (net).
numeric; family/children related allowances (net).
numeric; housing allowances (net).
numeric; regular inter-household cash transfer received (net).
numeric; interest, dividends, profit from capital investments in unincorporated business (net).
numeric; income received by people aged under 16 (net).
numeric; regular inter-household cash transfer paid (net).
numeric; repayments/receipts for tax adjustment (net).
logical; indicates the main income holder (i.e., the person with the highest income) of each household.
The data set is used as population data in some of the examples in package
simFrame
. Note that it is included for illustrative purposes only.
It consists of 25 000 households, hence it does not represent the true
population sizes of Austria and its regions.
Only a few of the large number of variables in the original survey are
included in this example data set. Some variable names are different from
the standardized names used by the statistical agencies, as the latter are
rather cryptic codes. Furthermore, the variables hsize
,
eqsize
, eqIncome
and age
are not included in the
standardized format of EU-SILC data, but have been derived from other
variables for convenience. Moreover, some very sparse income components
were not included in the the generation of this synthetic data set. Thus the
equivalized household income is computed from the available income
components.
This is a synthetic data set based on Austrian EU-SILC data from 2006. The original sample was provided by Statistics Austria.
Eurostat (2004) Description of target variables: Cross-sectional and longitudinal. EU-SILC 065/04, Eurostat.
data(eusilcP) summary(eusilcP)
data(eusilcP) summary(eusilcP)
This data set is synthetically generated from real Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data.
A data frame with 11725 observations on the following 18 variables.
integer; the household ID.
integer; the number of persons in the household.
factor; the federal state in which the household is
located (levels Burgenland
, Carinthia
, Lower Austria
,
Salzburg
, Styria
, Tyrol
, Upper Austria
,
Vienna
and Vorarlberg
).
integer; the person's age.
factor; the person's gender (levels
male
and female
).
factor; the person's
economic status (levels 1
= working full time, 2
= working
part time, 3
= unemployed, 4
= pupil, student, further
training or unpaid work experience or in compulsory military or community
service, 5
= in retirement or early retirement or has given up
business, 6
= permanently disabled or/and unfit to work or other
inactive person, 7
= fulfilling domestic tasks and care
responsibilities).
factor; the person's citizenship
(levels AT
, EU
and Other
).
numeric; the personal net income.
numeric; employee cash or near cash income (net).
numeric; cash benefits or losses from self-employment (net).
numeric; unemployment benefits (net).
numeric; old-age benefits (net).
numeric; survivor's benefits (net).
numeric; sickness benefits (net).
numeric; disability benefits (net).
numeric; education-related allowances (net).
numeric; the household sample weights.
numeric; the personal sample weights.
The data set consists of 4641 households and is used as sample data in some
of the examples in package simPopulation
. Note that it is included
for illustrative purposes only. The sample weights do not reflect the true
population sizes of Austria and its regions. The resulting population data
is about 100 times smaller than the real population size to save computation
time.
Only a few of the large number of variables in the original survey are
included in this example data set. The variable names are rather cryptic
codes, but these are the standardized names used by the statistical
agencies. Furthermore, the variables hsize
, age
and
netIncome
are not included in the standardized format of EU-SILC
data, but have been derived from other variables for convenience.
This is a synthetic data set based on Austrian EU-SILC data from 2006. The original sample was provided by Statistics Austria.
Eurostat (2004) Description of target variables: Cross-sectional and longitudinal. EU-SILC 065/04, Eurostat.
data(eusilcS) summary(eusilcS)
data(eusilcS) summary(eusilcS)
simPopObj-class
.Using samp
samp<-
it is possible to extract or
rather modify variables of the sample data within slot data
in slot
sample
of the simPopObj-class
-object. Using
pop
pop<-
it is possible to extract or rather
modify variables of the synthetic population within in slot data
in
slot sample
of the simPopObj-class
-object.
obj |
An object of class |
var |
variable name or index for the variable in slot 'samp' of object
with the slot name to be accessed. If |
value |
Content replacing whatever the variable in slot |
Returns an object of class simPopObj-class
with the
appropriate replacement.
Bernhard Meindl
simPopObj-class
,pop
,
pop<-
, samp<-
, manageSimPopObj
data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) ## get/set variables in sample-object of simPopObj head(samp(simPopObj, var="age")) samp(simPopObj, var="newVar") <- 1 head(samp(simPopObj, var="newVar")) ## deleting is also possible samp(simPopObj, var="newvar") <- NULL head(samp(simPopObj, var="newvar")) ## extract multiple variables head(samp(simPopObj, var=c("db030","db040"))) ## get/set variables in pop-object of simPopObj head(pop(simPopObj, var="age")) pop(simPopObj, var="newVar") <- 1 head(pop(simPopObj, var="newVar")) ## deleting is also possible pop(simPopObj, var="newvar") <- NULL head(pop(simPopObj, var="newvar")) ## extract multiple variables head(pop(simPopObj, var=c("db030","db040")))
data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) ## get/set variables in sample-object of simPopObj head(samp(simPopObj, var="age")) samp(simPopObj, var="newVar") <- 1 head(samp(simPopObj, var="newVar")) ## deleting is also possible samp(simPopObj, var="newvar") <- NULL head(samp(simPopObj, var="newvar")) ## extract multiple variables head(samp(simPopObj, var=c("db030","db040"))) ## get/set variables in pop-object of simPopObj head(pop(simPopObj, var="age")) pop(simPopObj, var="newVar") <- 1 head(pop(simPopObj, var="newVar")) ## deleting is also possible pop(simPopObj, var="newvar") <- NULL head(pop(simPopObj, var="newvar")) ## extract multiple variables head(pop(simPopObj, var=c("db030","db040")))
Compute break points for categorizing continuous or semi-continuous
variables using (weighted) quantiles. This is a utility function that is
useful for writing custom wrapper functions such as simEUSILC
.
getBreaks( x, weights = NULL, zeros = TRUE, lower = NULL, upper = NULL, equidist = TRUE, probs = NULL, strata = NULL )
getBreaks( x, weights = NULL, zeros = TRUE, lower = NULL, upper = NULL, equidist = TRUE, probs = NULL, strata = NULL )
x |
a numeric vector to be categorized. |
weights |
an optional numeric vector containing sample weights. |
zeros |
a logical indicating whether |
lower , upper
|
optional numeric values specifying lower and upper bounds
other than minimum and maximum of |
equidist |
a logical indicating whether the (positive) break points should be equidistant or whether there should be refinements in the lower and upper tail (see “Details”). |
probs |
a numeric vector of probabilities with values in |
strata |
an optional vector specifying a strata variable (e.g household ids).
if specified, the mean of |
If equidist
is TRUE
, the behavior is as follows. If
zeros
is TRUE
as well, the 0%, 10%, ..., 90% quantiles
of the negative values and the 10%, 20%, ..., 100% of the positive
values are computed. These quantiles are then used as break points together
with 0. If zeros
is not TRUE
, on the other hand, the 0%,
10%, ..., 100% quantiles of all values are used.
If equidist
is not TRUE
, the behavior is as follows. If
zeros
is not TRUE
, the 1%, 5%, 10%, 20%, 40%, 60%, 80%,
90%, 95% and 99% quantiles of all values are used for the inner part of
the data (instead of the equidistant 10%, ..., 90% quantiles). If
zeros
is TRUE
, these quantiles are only used for the positive
values while the quantiles of the negative values remain equidistant.
Note that duplicated values among the quantiles are discarded and that the
minimum and maximum are replaced with lower
and upper
,
respectively, if these are specified.
The (weighted) quantiles are computed with the function
quantileWt
.
A numeric vector of break points.
Andreas Alfons and Bernhard Meindl
data(eusilcS) # semi-continuous variable, positive break points equidistant getBreaks(eusilcS$netIncome, weights=eusilcS$rb050) # semi-continuous variable, positive break points not equidistant getBreaks(eusilcS$netIncome, weights=eusilcS$rb050, equidist = FALSE)
data(eusilcS) # semi-continuous variable, positive break points equidistant getBreaks(eusilcS$netIncome, weights=eusilcS$rb050) # semi-continuous variable, positive break points not equidistant getBreaks(eusilcS$netIncome, weights=eusilcS$rb050, equidist = FALSE)
Categorize continuous or semi-continuous variables. This is a utility
function that is useful for writing custom wrapper functions such as
simEUSILC
.
getCat(x, breaks, zeros = TRUE, right = FALSE)
getCat(x, breaks, zeros = TRUE, right = FALSE)
x |
a numeric vector to be categorized. |
breaks |
a numeric vector of two or more break points. |
zeros |
a logical indicating whether |
right |
logical; if |
If zeros
is TRUE
, 0 is added to the break points and treated
as its own factor level. Consequently, intervals for negative values are
left-closed and right-open, whereas intervals for positive values are
left-open and right-closed.
A factor
containing the categories.
Andreas Alfons
data(eusilcS) ## semi-continuous variable breaks <- getBreaks(eusilcS$netIncome, weights=eusilcS$rb050, equidist = FALSE) netIncomeCat <- getCat(eusilcS$netIncome, breaks) summary(netIncomeCat)
data(eusilcS) ## semi-continuous variable breaks <- getBreaks(eusilcS$netIncome, weights=eusilcS$rb050, equidist = FALSE) netIncomeCat <- getCat(eusilcS$netIncome, breaks) summary(netIncomeCat)
Get the positions of missing values in the data. This function is used internally by other other functions of this package
getExclude(x, ...)
getExclude(x, ...)
x |
a vector, matrix, data.frame or data.table |
... |
other arguments, not currently used |
Interger vector with positions indicating missing values
This data set is synthetically generated from real GLSS (Ghana Living Standards Survey) data.
A data frame with 36970 observations on the following 14 variables.
integer; the household ID.
integer; the number of persons in the household.
factor; the region in which the household is located
(levels western
, central
, greater accra
, volta
,
eastern
, ashanti
, brong ahafo
, northern
,
upper east
and upper west
).
factor; the enumeration area.
integer; the person's age.
factor; the person's sex (levels male
and
female
).
factor; the relationship with the
household head (levels head
, spouse
, child
,
grandchild
, parent/parentlaw
, son/daughterlaw
,
other relative
, adopted child
, househelp
and
non_relative
).
factor; the person's
nationality (levels ghanaian birth
, ghanaian naturalise
,
burkinabe
, malian
, nigerian
, ivorian
,
togolese
, liberian
, other ecowas
, other africa
and other
).
factor; the person's ethnicity
(levels akan
, all other tribes
, ewe
, ga-dangbe
,
grusi
, guan
, gurma
, mande
and
mole-dagbani
).
factor; the person's religion
(levels catholic
, anglican
, presbyterian
,
methodist
, pentecostal
, spiritualist
, other
christian
, moslem
, traditional
, no religion
and
other
).
factor; the person's highest
degree of education (levels none
, mlsc
, bece
,
voc/comm
, teacher trng a
, teacher trng b
, gce 'o'
level
, ssce
, gce 'a' level
, tech/prof cert
,
tech/prof dip
, hnd
, bachelor
, masters
,
doctorate
and other
).
factor; the
person's occupation (levels armed forces and other security
personnel
, clerks
, craft and related trades workers
,
elementary occupations
, legislators, senior officials and
managers
, none
, plant and machine operators and assemblers
,
professionals
, service workers and shop and market sales
workers
, skilled agricultural and fishery workers
, and
technicians and associate professionals
).
numeric; the person's annual income.
numeric; the sample weights.
The data set consists of 8700 households and is used as sample data in some
of the examples in package simPopulation
. Note that it is included
for illustrative purposes only. The sample weights do not reflect the true
population sizes of Ghana and its regions. The resulting population data is
about 100 times smaller than the real population size to save computation
time.
Only some of the variables in the original survey are included in this example data set. Furthermore, categories are aggregated for certain variables due to the large number of possible outcomes in the original survey data.
This is a synthetic data set based on GLSS data from 2006. The original sample was provided by Ghana Statistical Service.
Ghana Statistical Service (2008) Ghana Living Standards Survey: Report of the fifth round.
data(ghanaS) summary(ghanaS)
data(ghanaS) summary(ghanaS)
adjust sampling weights to given totals based on household-level and/or individual level constraints
ipu(inp, con, hid = NULL, eps = 1e-07, verbose = FALSE)
ipu(inp, con, hid = NULL, eps = 1e-07, verbose = FALSE)
inp |
a |
con |
named list with each list element holding a constraint total with
list-names relating to column-names in |
hid |
character vector specifying the variable containing household-ids
within |
eps |
number specifiying convergence limit |
verbose |
if TRUE, ipu will print some progress information. |
Bernhard Meindl
library(data.table) # basic example inp <- as.data.frame(matrix(0, nrow=8, ncol=6)) colnames(inp) <- c("hhid","hh1","hh2","p1","p2","p3") inp$hhid <- 1:8 inp$hh1[1:3] <- 1 inp$hh2[4:8] <- 1 inp$p1 <- c(1,1,2,1,0,1,2,1) inp$p2 <- c(1,0,1,0,2,1,1,1) inp$p3 <- c(1,1,0,2,1,0,2,0) con <- list(hh1=35, hh2=65, p1=91, p2=65, p3=104) res <- ipu(inp=inp, hid="hhid", con=con, verbose=FALSE) # more sophisticated # load sample and population data data(eusilcS) data(eusilcP) # variable generation and preparation eusilcS$hsize <- factor(eusilcS$hsize) # make sure, factor levels in sample and population match eusilcP$region <- factor(eusilcP$region, levels = levels(eusilcS$db040)) eusilcP$gender <- factor(eusilcP$gender, levels = levels(eusilcS$rb090)) eusilcP$hsize <- factor(eusilcP$hsize , levels = levels(eusilcS$hsize)) # generate input matrix # we want to adjust to variable "db040" (region) as household variables and # variable "rb090" (gender) as individual information library(data.table) samp <- data.table(eusilcS) pop <- data.table(eusilcP) setkeyv(samp, "db030") hh <- samp[!duplicated(samp$db030),] hhpop <- pop[!duplicated(pop$hid),] # reg contains for each region the number of households reg <- data.table(model.matrix(~db040 +0, data=hh)) # hsize contains for each household size the number of households hsize <- data.table(model.matrix(~factor(hsize) +0, data=hh)) # aggregate persons-level characteristics per household # gender contains for each household the number of males and females gender <- data.table(model.matrix(~db030+rb090 +0, data=samp)) setkeyv(gender, "db030") gender <- gender[, lapply(.SD, sum), by = key(gender)] # bind together and use it as input inp <- cbind(reg, hsize, gender) # the totals we want to calibrate to con <- c( as.list(xtabs(rep(1, nrow(hhpop)) ~ hhpop$region)), as.list(xtabs(rep(1, nrow(hhpop)) ~ hhpop$hsize)), as.list(xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$gender)) ) # we need to have the same names as in 'inp' names(con) <- setdiff(names(inp), "db030") # run ipu und check results res <- ipu(inp=inp, hid="db030", con=con, verbose=TRUE) is <- sapply(2:(ncol(res)-1), function(x) { sum(res[,x]*res$weights) }) data.frame(required=unlist(con), is=is)
library(data.table) # basic example inp <- as.data.frame(matrix(0, nrow=8, ncol=6)) colnames(inp) <- c("hhid","hh1","hh2","p1","p2","p3") inp$hhid <- 1:8 inp$hh1[1:3] <- 1 inp$hh2[4:8] <- 1 inp$p1 <- c(1,1,2,1,0,1,2,1) inp$p2 <- c(1,0,1,0,2,1,1,1) inp$p3 <- c(1,1,0,2,1,0,2,0) con <- list(hh1=35, hh2=65, p1=91, p2=65, p3=104) res <- ipu(inp=inp, hid="hhid", con=con, verbose=FALSE) # more sophisticated # load sample and population data data(eusilcS) data(eusilcP) # variable generation and preparation eusilcS$hsize <- factor(eusilcS$hsize) # make sure, factor levels in sample and population match eusilcP$region <- factor(eusilcP$region, levels = levels(eusilcS$db040)) eusilcP$gender <- factor(eusilcP$gender, levels = levels(eusilcS$rb090)) eusilcP$hsize <- factor(eusilcP$hsize , levels = levels(eusilcS$hsize)) # generate input matrix # we want to adjust to variable "db040" (region) as household variables and # variable "rb090" (gender) as individual information library(data.table) samp <- data.table(eusilcS) pop <- data.table(eusilcP) setkeyv(samp, "db030") hh <- samp[!duplicated(samp$db030),] hhpop <- pop[!duplicated(pop$hid),] # reg contains for each region the number of households reg <- data.table(model.matrix(~db040 +0, data=hh)) # hsize contains for each household size the number of households hsize <- data.table(model.matrix(~factor(hsize) +0, data=hh)) # aggregate persons-level characteristics per household # gender contains for each household the number of males and females gender <- data.table(model.matrix(~db030+rb090 +0, data=samp)) setkeyv(gender, "db030") gender <- gender[, lapply(.SD, sum), by = key(gender)] # bind together and use it as input inp <- cbind(reg, hsize, gender) # the totals we want to calibrate to con <- c( as.list(xtabs(rep(1, nrow(hhpop)) ~ hhpop$region)), as.list(xtabs(rep(1, nrow(hhpop)) ~ hhpop$hsize)), as.list(xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$gender)) ) # we need to have the same names as in 'inp' names(con) <- setdiff(names(inp), "db030") # run ipu und check results res <- ipu(inp=inp, hid="db030", con=con, verbose=TRUE) is <- sapply(2:(ncol(res)-1), function(x) { sum(res[,x]*res$weights) }) data.frame(required=unlist(con), is=is)
simPopObj
.This functions allows to get or set variables in slots pop
and
sample
of simPopObj
-objects. This is a utility
function that is useful for writing custom wrapper functions.
manageSimPopObj(x, var, sample = FALSE, set = FALSE, values = NULL)
manageSimPopObj(x, var, sample = FALSE, set = FALSE, values = NULL)
x |
an object of class |
var |
character vector of length 1; variable name that should be set or extracted. |
sample |
a logical indicating whether |
set |
logical; if TRUE, argument 'values' is set to either the sample or population data stored in 'x', depending on argument 'sample'. If FALSE, the desired variable given by 'var' is returned from either the sample or the pop slot of 'x'. |
values |
vector; if 'set' is TRUE, then this vector is used to update the variable of sample or population data depending of choice of argument 'sample'. |
An object of class simPopObj
(if 'set' is TRUE)
or a vector (if 'set' is FALSE).
Bernhard Meindl and Matthias Templ
data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) (manageSimPopObj(simPopObj, var="age", sample=FALSE, set=FALSE)) (manageSimPopObj(simPopObj, var="age", sample=TRUE, set=FALSE))
data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) (manageSimPopObj(simPopObj, var="age", sample=FALSE, set=FALSE)) (manageSimPopObj(simPopObj, var="age", sample=TRUE, set=FALSE))
Compute quantiles taking into account sample weights. The following methods are implemented:
quantileWt.default(x, weights=NULL, probs=seq(0, 1, 0.25), na.rm=TRUE, ...)
quantileWt.dataObj(x, vars, probs=seq(0, 1, 0.25), na.rm=TRUE, ...)
Additional parameters are:
weights an optional numeric vector containing sample weights.
vars a character vector of length 1 specifying a variable name that
is available in the data-slot of x
and which is used for the
calculation.
probs a numeric vector of probabilities with values in .
na.rm a logical indicating whether any NA
or NaN
values
should be removed from x
before the quantiles are computed. Note
that the default is TRUE
, contrary to the function
quantile
.
quantileWt(x, ...)
quantileWt(x, ...)
x |
a numeric vector. |
... |
for the generic function |
If weights are not specified then quantile(x, probs, na.rm=na.rm,
names=FALSE, type=1)
is used for the computation.
Note probabilities outside cause an error.
A vector of the (weighted) sample quantiles.
Stefan Kraft and Bernhard Meindl
A basic version of this function was provided by Cedric Beguin and Beat Hulliger.
data(eusilcS) (quantileWt(eusilcS$netIncome, weights=eusilcS$rb050)) # dataObj-method inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") (quantileWt(inp, vars="netIncome"))
data(eusilcS) (quantileWt(eusilcS$netIncome, weights=eusilcS$rb050)) # dataObj-method inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") (quantileWt(inp, vars="netIncome"))
The function samples households from microdata containing personal and household information.
sampHH(pop, sizefactor = 1, hid = "hid", strata = "region", hsize = NULL)
sampHH(pop, sizefactor = 1, hid = "hid", strata = "region", hsize = NULL)
pop |
data frame containing households and persons |
sizefactor |
factor of how many times the initial population should be resampled |
hid |
string specifying the name of the household-id variable in the data. |
strata |
can be used to sample within strata. |
hsize |
string specifying the name of the household size variable in the data. |
households are drawn from the data and new ID's are generated for the new households.
the data frame of new households.
Bernhard Meindl, Matthias Templ and Johannes Gussenbauer
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi: 10.18637/jss.v079.i10
data(eusilcP) pop <- eusilcP colnames(pop)[3] <- "hhsize" system.time(x1 <- sampHH(pop, strata="region", hsize="hhsize")) dim(x1) ## Not run: ## approx. 10 second computation time ... system.time(x1 <- sampHH(pop, sizefactor=4, strata="region", hsize="hhsize")) dim(x1) system.time(x2 <- sampHH(pop, strata=NULL, hsize="hhsize")) pop <- pop[,-which(colnames(pop)=="hhsize")] system.time(y1 <- sampHH(pop, strata="region", hsize=NULL)) system.time(y2 <- sampHH(pop, strata=NULL, hsize=NULL)) ## End(Not run)
data(eusilcP) pop <- eusilcP colnames(pop)[3] <- "hhsize" system.time(x1 <- sampHH(pop, strata="region", hsize="hhsize")) dim(x1) ## Not run: ## approx. 10 second computation time ... system.time(x1 <- sampHH(pop, sizefactor=4, strata="region", hsize="hhsize")) dim(x1) system.time(x2 <- sampHH(pop, strata=NULL, hsize="hhsize")) pop <- pop[,-which(colnames(pop)=="hhsize")] system.time(y1 <- sampHH(pop, strata="region", hsize=NULL)) system.time(y2 <- sampHH(pop, strata=NULL, hsize=NULL)) ## End(Not run)
Various utility functions mainly used for simulating EU-SILC data
loadSILC( file = NULL, filed = NULL, filer = NULL, filep = NULL, fileh = NULL, year = 2013, country = "Austria" ) mergeSILC(filed, filer, fileh, filep) checkCol(x, y) chooseSILCvars( x, vars = c("db030", "db040", "rb030", "rb080", "rb090", "pl031", "pb220a", "py010g", "py021g", "py050g", "py080g", "py090g", "py100g", "py110g", "py120g", "py130g", "py140g", "hy040g", "hy050g", "hy060g", "hy070g", "hy080g", "hy090g", "hy100g", "hy110g", "hy120g", "hy130g", "hy140g", "db090", "rb050", "pb190", "pe040", "pl051", "pl111", "rb010"), country = NULL ) modifySILC(x, country = "Austria")
loadSILC( file = NULL, filed = NULL, filer = NULL, filep = NULL, fileh = NULL, year = 2013, country = "Austria" ) mergeSILC(filed, filer, fileh, filep) checkCol(x, y) chooseSILCvars( x, vars = c("db030", "db040", "rb030", "rb080", "rb090", "pl031", "pb220a", "py010g", "py021g", "py050g", "py080g", "py090g", "py100g", "py110g", "py120g", "py130g", "py140g", "hy040g", "hy050g", "hy060g", "hy070g", "hy080g", "hy090g", "hy100g", "hy110g", "hy120g", "hy130g", "hy140g", "db090", "rb050", "pb190", "pe040", "pl051", "pl111", "rb010"), country = NULL ) modifySILC(x, country = "Austria")
file |
data set in R binary format, csv or sav (SPSS) of merged EU-SILC data. |
filed |
data set including the household register information |
filer |
data set including the personal register information |
filep |
data set including the personal information |
fileh |
data set including the household information |
year |
year of origin |
country |
country |
x |
public-use file (for checkCol function) or orginal data |
y |
scientific-use file (for checkCol function) |
vars |
variables to be selected for function chooseSILCvars |
Collection of functions to import, select and modify data EU-SILC data. Either file (merged data) or single files have to be provided for loadSILC().
Matthias Templ
## Not run: x <- loadSILC("new_workfile.RData") filed <- "zielvar_d_eurostat2013.sav" filer <- "zielvar_r_eurostat2013.sav" filep <- "zielvar_p_eurostat2013.sav" fileh <- "zielvar_h_eurostat2013.sav" suf4 <- loadSILC(filed = filed, filer = filer, filep = filep, fileh = fileh) ## End(Not run) ## Not run: filed <- "zielvar_d_eurostat2013.sav" filer <- "zielvar_r_eurostat2013.sav" filep <- "zielvar_p_eurostat2013.sav" fileh <- "zielvar_h_eurostat2013.sav" suf4 <- loadSILC(filed = filed, filer = filer, filep = filep, fileh = fileh) suf <- mergeSILC(d = suf4[["d"]], r = suf4[["r"]], h = suf4[["h"]], p = suf4[["p"]]) ## End(Not run) data(eusilc13puf) ## instead of scientific-use file or ## original data we took the 2006 synthetic data data(eusilcS) ## check which columns of y are in x checkCol(eusilc13puf, eusilcS) ## Not run: ## on original silc data to extract needed variables for SGA project on SILC x <- loadSILC("new_workfile.RData") chooseSILCvars(x) ## End(Not run) ## Not run: ## wrapper to prepare SILC data ## on original silc data x <- loadSILC("new_workfile.RData") x <- chooseSILCvars(x) modifySILC(x) ## End(Not run)
## Not run: x <- loadSILC("new_workfile.RData") filed <- "zielvar_d_eurostat2013.sav" filer <- "zielvar_r_eurostat2013.sav" filep <- "zielvar_p_eurostat2013.sav" fileh <- "zielvar_h_eurostat2013.sav" suf4 <- loadSILC(filed = filed, filer = filer, filep = filep, fileh = fileh) ## End(Not run) ## Not run: filed <- "zielvar_d_eurostat2013.sav" filer <- "zielvar_r_eurostat2013.sav" filep <- "zielvar_p_eurostat2013.sav" fileh <- "zielvar_h_eurostat2013.sav" suf4 <- loadSILC(filed = filed, filer = filer, filep = filep, fileh = fileh) suf <- mergeSILC(d = suf4[["d"]], r = suf4[["r"]], h = suf4[["h"]], p = suf4[["p"]]) ## End(Not run) data(eusilc13puf) ## instead of scientific-use file or ## original data we took the 2006 synthetic data data(eusilcS) ## check which columns of y are in x checkCol(eusilc13puf, eusilcS) ## Not run: ## on original silc data to extract needed variables for SGA project on SILC x <- loadSILC("new_workfile.RData") chooseSILCvars(x) ## End(Not run) ## Not run: ## wrapper to prepare SILC data ## on original silc data x <- loadSILC("new_workfile.RData") x <- chooseSILCvars(x) modifySILC(x) ## End(Not run)
Simulate categorical variables of population data. The household structure of the population data needs to be simulated beforehand.
simCategorical( simPopObj, additional, method = c("multinom", "distribution", "ctree", "cforest", "ranger", "xgboost"), limit = NULL, censor = NULL, maxit = 500, MaxNWts = 1500, eps = NULL, nr_cpus = NULL, regModel = NULL, seed = 1, verbose = FALSE, by = "strata", model_params = NULL )
simCategorical( simPopObj, additional, method = c("multinom", "distribution", "ctree", "cforest", "ranger", "xgboost"), limit = NULL, censor = NULL, maxit = 500, MaxNWts = 1500, eps = NULL, nr_cpus = NULL, regModel = NULL, seed = 1, verbose = FALSE, by = "strata", model_params = NULL )
simPopObj |
a |
additional |
a character vector specifying additional categorical
variables available in the sample object of |
method |
a character string specifying the method to be used for
simulating the additional categorical variables. Accepted values are
|
limit |
if |
censor |
if |
maxit , MaxNWts
|
control parameters to be passed to
|
eps |
a small positive numeric value, or |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
regModel |
allows to specify the variables or model that is used when simulating additional categorical variables. The following choices are available if different from NULL.
If method 'distribution' is used, it is only possible to specify a vector of length one containing one of the choices described above. If parameter 'regModel' is NULL, only basic household variables are used in any case. |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
verbose |
set to TRUE if additional print output should be shown. |
by |
defining which variable to use as split up variable of the estimation. Defaults to the strata variable. |
model_params |
NULL or a named list which can contain model specific parameters which will be passed onto the function call for the respective model. |
The number of cpus are selected automatically in the following manner. The number of cpus is equal the number of strata. However, if the number of cpus is less than the number of strata, the number of cpus - 1 is used by default. This should be the best strategy, but the user can also overwrite this decision.
An object of class simPopObj
containing survey
data as well as the simulated population data including the categorical
variables specified by argument additional
.
The basic household structure needs to be simulated beforehand with
the function simStructure
.
Bernhard Meindl, Andreas Alfons, Stefan Kraft, Alexander Kowarik, Matthias Templ, Siro Fritzmann
B. Meindl, M. Templ, A. Kowarik, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi:10.1080/02664763.2013.859237
simStructure
, simRelation
,
simContinuous
, simComponents
data(eusilcS) # load sample data ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## in the following, nr_cpus are selected automatically simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) simPop <- simCategorical(simPop, additional=c("pl030", "pb220a"), method="multinom", nr_cpus=1) simPop ## End(Not run)
data(eusilcS) # load sample data ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") ## in the following, nr_cpus are selected automatically simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) simPop <- simCategorical(simPop, additional=c("pl030", "pb220a"), method="multinom", nr_cpus=1) simPop ## End(Not run)
Simulate components of continuous variables of population data by resampling fractions from survey data. The continuous variable to be split and any categorical conditioning variables need to be simulated beforehand.
simComponents( simPopObj, total = "netIncome", components = c("py010n", "py050n", "py090n", "py100n", "py110n", "py120n", "py130n", "py140n"), conditional = c(getCatName(total), "pl030"), replaceEmpty = c("sequential", "min"), seed )
simComponents( simPopObj, total = "netIncome", components = c("py010n", "py050n", "py090n", "py100n", "py110n", "py120n", "py130n", "py140n"), conditional = c(getCatName(total), "pl030"), replaceEmpty = c("sequential", "min"), seed )
simPopObj |
a |
total |
a character string specifying the continuous variable of dataP that should be split into components. Currently, only one variable can be split at a time. |
components |
a character vector specifying the components in
|
conditional |
an optional character vector specifying categorical
conditioning variables for resampling. The fractions occurring in
|
replaceEmpty |
a character string; if |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
An object of class simPopObj
containing survey
data as well as the simulated population data including the components of
the continuous variable specified by total
and components
.
The basic household structure, any categorical conditioning variables
and the continuous variable to be split need to be simulated beforehand with
the functions simStructure
, simCategorical
and
simContinuous
.
Stefan Kraft and Andreas Alfons and Bernhard Meindl
B. Meindl, M. Templ, A. Kowarik, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi:10.1080/02664763.2013.859237
simStructure
, simCategorical
,
simContinuous
, simEUSILC
data(eusilcS) ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090", "hsize", "pl030", "pb220a")) simPopObj <- simContinuous(simPopObj, additional = "netIncome", regModel = ~rb090+hsize+pl030+pb220a+hsize, method="multinom", upper=200000, equidist=FALSE, nr_cpus=1) # categorize net income for use as conditioning variable sIncome <- manageSimPopObj(simPopObj, var="netIncome", sample=TRUE, set=FALSE) sWeight <- manageSimPopObj(simPopObj, var="rb050", sample=TRUE, set=FALSE) pIncome <- manageSimPopObj(simPopObj, var="netIncome", sample=FALSE, set=FALSE) breaks <- getBreaks(x=unlist(sIncome), w=unlist(sWeight), upper=Inf, equidist=FALSE) simPopObj <- manageSimPopObj(simPopObj, var="netIncomeCat", sample=TRUE, set=TRUE, values=getCat(x=unlist(sIncome), breaks)) simPopObj <- manageSimPopObj(simPopObj, var="netIncomeCat", sample=FALSE, set=TRUE, values=getCat(x=unlist(pIncome), breaks)) # simulate net income components simPopObj <- simComponents(simPopObj=simPopObj, total="netIncome", components=c("py010n","py050n","py090n","py100n","py110n","py120n","py130n","py140n"), conditional = c("netIncomeCat", "pl030"), replaceEmpty = "sequential", seed=1 ) class(simPopObj) ## End(Not run)
data(eusilcS) ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090", "hsize", "pl030", "pb220a")) simPopObj <- simContinuous(simPopObj, additional = "netIncome", regModel = ~rb090+hsize+pl030+pb220a+hsize, method="multinom", upper=200000, equidist=FALSE, nr_cpus=1) # categorize net income for use as conditioning variable sIncome <- manageSimPopObj(simPopObj, var="netIncome", sample=TRUE, set=FALSE) sWeight <- manageSimPopObj(simPopObj, var="rb050", sample=TRUE, set=FALSE) pIncome <- manageSimPopObj(simPopObj, var="netIncome", sample=FALSE, set=FALSE) breaks <- getBreaks(x=unlist(sIncome), w=unlist(sWeight), upper=Inf, equidist=FALSE) simPopObj <- manageSimPopObj(simPopObj, var="netIncomeCat", sample=TRUE, set=TRUE, values=getCat(x=unlist(sIncome), breaks)) simPopObj <- manageSimPopObj(simPopObj, var="netIncomeCat", sample=FALSE, set=TRUE, values=getCat(x=unlist(pIncome), breaks)) # simulate net income components simPopObj <- simComponents(simPopObj=simPopObj, total="netIncome", components=c("py010n","py050n","py090n","py100n","py110n","py120n","py130n","py140n"), conditional = c("netIncomeCat", "pl030"), replaceEmpty = "sequential", seed=1 ) class(simPopObj) ## End(Not run)
Simulate continuous variables of population data using multinomial log-linear models combined with random draws from the resulting categories or (two-step) regression models combined with random error terms. The household structure of the population data and any other categorical predictors need to be simulated beforehand.
simContinuous( simPopObj, additional = "netIncome", method = c("multinom", "lm", "poisson", "xgboost"), zeros = TRUE, breaks = NULL, lower = NULL, upper = NULL, equidist = TRUE, probs = NULL, gpd = TRUE, threshold = NULL, est = "moments", limit = NULL, censor = NULL, log = TRUE, const = NULL, alpha = 0.01, residuals = TRUE, keep = TRUE, maxit = 500, MaxNWts = 1500, tol = .Machine$double.eps^0.5, nr_cpus = NULL, eps = NULL, regModel = "basic", byHousehold = NULL, imputeMissings = FALSE, seed, verbose = FALSE, by = "strata", model_params = NULL )
simContinuous( simPopObj, additional = "netIncome", method = c("multinom", "lm", "poisson", "xgboost"), zeros = TRUE, breaks = NULL, lower = NULL, upper = NULL, equidist = TRUE, probs = NULL, gpd = TRUE, threshold = NULL, est = "moments", limit = NULL, censor = NULL, log = TRUE, const = NULL, alpha = 0.01, residuals = TRUE, keep = TRUE, maxit = 500, MaxNWts = 1500, tol = .Machine$double.eps^0.5, nr_cpus = NULL, eps = NULL, regModel = "basic", byHousehold = NULL, imputeMissings = FALSE, seed, verbose = FALSE, by = "strata", model_params = NULL )
simPopObj |
a |
additional |
a character string specifying the additional continuous
variable of |
method |
a character string specifying the method to be used for
simulating the continuous variable. Accepted values are |
zeros |
a logical indicating whether the variable specified by
|
breaks |
an optional numeric vector; if multinomial models are
computed, this can be used to supply two or more break points for
categorizing the variable specified by |
lower , upper
|
optional numeric values; if multinomial models are
computed and |
equidist |
logical; if |
probs |
numeric vector with values in |
gpd |
logical; if |
threshold |
a numeric value; if |
est |
a character string; if |
limit |
an optional named list of lists; if multinomial models are computed, this can be used to account for structural zeros. The names of the list components specify the predictor variables for which to limit the possible outcomes of the response. For each predictor, a list containing the possible outcomes of the response for each category of the predictor can be supplied. The probabilities of other outcomes conditional on combinations that contain the specified categories of the supplied predictors are set to 0. Currently, this is only implemented for more than two categories in the response. |
censor |
an optional named list of lists or |
log |
logical; if |
const |
numeric; if |
alpha |
numeric; if |
residuals |
logical; if |
keep |
logical; if multinomial models are computed, this indicates
whether the simulated categories should be stored as a variable in the
resulting population data. If |
maxit , MaxNWts
|
control parameters to be passed to
|
tol |
if |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
eps |
a small positive numeric value, or |
regModel |
allows to specify the model that should be for the simulation of the additional continuous variable. The following choices are possible:
|
byHousehold |
if NULL, simulated values are used as is. If either |
imputeMissings |
if TRUE, missing values in variables that are used for the underlying model are imputed using hock-deck. |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
verbose |
(logical) if |
by |
defining which variable to use as split up variable of the estimation. Defaults to the strata variable. |
model_params |
adding optional parameter to the model, at the moment only implemented for xgboost hyperparameters |
If method
is "lm"
, the behavior for two-step models is
described in the following.
If zeros
is TRUE
and log
is not TRUE
or the
variable specified by additional
does not contain negative values, a
log-linear model is used to predict whether an observation is zero or not.
Then a linear model is used to predict the non-zero values.
If zeros
is TRUE
, log
is TRUE
and const
is specified, again a log-linear model is used to predict whether an
observation is zero or not. In the linear model to predict the non-zero
values, const
is added to the variable specified by additional
before the logarithms are taken.
If zeros
is TRUE
, log
is TRUE
, const
is
NULL
and there are negative values, a multinomial log-linear model is
used to predict negative, zero and positive observations. Categories for the
negative values are thereby defined by breaks
. In the second step, a
linear model is used to predict the positive values and negative values are
drawn from uniform distributions in the respective classes.
If zeros
is FALSE
, log
is TRUE
and const
is NULL
, a two-step model is used if there are non-positive values in
the variable specified by additional
. Whether a log-linear or a
multinomial log-linear model is used depends on the number of categories to
be used for the non-positive values, as defined by breaks
. Again,
positive values are then predicted with a linear model and non-positive
values are drawn from uniform distributions.
The number of cpus are selected automatically in the following manner. The number of cpus is equal the number of strata. However, if the number of cpus is less than the number of strata, the number of cpus - 1 is used by default. This should be the best strategy, but the user can also overwrite this decision.
An object of class simPopObj
containing survey
data as well as the simulated population data including the continuous
variable specified by additional
and possibly simulated categories
for the desired continous variable.
The basic household structure and any other categorical predictors
need to be simulated beforehand with the functions
simStructure
and simCategorical
, respectively.
Bernhard Meindl, Andreas Alfons, Alexander Kowarik (based on code by Stefan Kraft), Siro Fritzmann
B. Meindl, M. Templ, A. Kowarik, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi:10.1080/02664763.2013.859237
simStructure
, simCategorical
,
simComponents
, simEUSILC
data(eusilcS) ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090", "hsize", "pl030", "pb220a")) regModel = ~rb090+hsize+pl030+pb220a # multinomial model with random draws eusilcM <- simContinuous(simPop, additional="netIncome", regModel = regModel, upper=200000, equidist=FALSE, nr_cpus=1) class(eusilcM) # two-step regression eusilcT <- simContinuous(simPop, additional="netIncome", regModel = "basic", method = "lm", nr_cpus=1) class(eusilcT) ## End(Not run)
data(eusilcS) ## Not run: ## approx. 20 seconds computation time inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090", "hsize", "pl030", "pb220a")) regModel = ~rb090+hsize+pl030+pb220a # multinomial model with random draws eusilcM <- simContinuous(simPop, additional="netIncome", regModel = regModel, upper=200000, equidist=FALSE, nr_cpus=1) class(eusilcM) # two-step regression eusilcT <- simContinuous(simPop, additional="netIncome", regModel = "basic", method = "lm", nr_cpus=1) class(eusilcT) ## End(Not run)
Simulate population data for the European Statistics on Income and Living Conditions (EU-SILC).
simEUSILC( dataS, hid = "db030", wh = "db090", wp = "rb050", hsize = NULL, strata = "db040", pid = NULL, age = "age", gender = "rb090", categorizeAge = TRUE, breaksAge = NULL, categorical = c("pl030", "pb220a"), income = "netIncome", method = c("multinom", "twostep"), breaks = NULL, lower = NULL, upper = NULL, equidist = TRUE, probs = NULL, gpd = TRUE, threshold = NULL, est = "moments", const = NULL, alpha = 0.01, residuals = TRUE, components = c("py010n", "py050n", "py090n", "py100n", "py110n", "py120n", "py130n", "py140n"), conditional = c(getCatName(income), "pl030"), keep = TRUE, maxit = 500, MaxNWts = 1500, tol = .Machine$double.eps^0.5, nr_cpus = NULL, seed )
simEUSILC( dataS, hid = "db030", wh = "db090", wp = "rb050", hsize = NULL, strata = "db040", pid = NULL, age = "age", gender = "rb090", categorizeAge = TRUE, breaksAge = NULL, categorical = c("pl030", "pb220a"), income = "netIncome", method = c("multinom", "twostep"), breaks = NULL, lower = NULL, upper = NULL, equidist = TRUE, probs = NULL, gpd = TRUE, threshold = NULL, est = "moments", const = NULL, alpha = 0.01, residuals = TRUE, components = c("py010n", "py050n", "py090n", "py100n", "py110n", "py120n", "py130n", "py140n"), conditional = c(getCatName(income), "pl030"), keep = TRUE, maxit = 500, MaxNWts = 1500, tol = .Machine$double.eps^0.5, nr_cpus = NULL, seed )
dataS |
a |
hid |
a character string specifying the column of |
wh |
a character string specifying the column of |
wp |
a character string specifying the column of |
hsize |
an optional character string specifying a column of
|
strata |
a character string specifying the column of |
pid |
an optional character string specifying a column of |
age |
a character string specifying the column of |
gender |
a character string specifying the column of |
categorizeAge |
a logical indicating whether age categories should be used for simulating additional categorical and continuous variables to decrease computation time. |
breaksAge |
numeric; if |
categorical |
a character vector specifying additional categorical
variables of |
income |
a character string specifying the variable of |
method |
a character string specifying the method to be used for
simulating personal income. Accepted values are |
breaks |
if |
lower , upper
|
numeric values; if |
equidist |
logical; if |
probs |
numeric vector with values in |
gpd |
logical; if |
threshold |
a numeric value; if |
est |
a character string; if |
const |
numeric; if |
alpha |
numeric; if |
residuals |
logical; if |
components |
a character vector specifying the income components in
|
conditional |
an optional character vector specifying categorical
contitioning variables for resampling of the income components. The
fractions occurring in |
keep |
a logical indicating whether variables computed internally in the procedure (such as the original IDs of the corresponding households in the underlying sample, age categories or income categories) should be stored in the resulting population data. |
maxit , MaxNWts
|
control parameters to be passed to
|
tol |
if |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
An object of class simPopObj
containing the
simulated EU-SILC population data as well as the underlying sample.
This is a wrapper calling simStructure
,
simCategorical
, simContinuous
and
simComponents
.
Andreas Alfons and Stefan Kraft and Bernhard Meindl
simStructure
, simCategorical
,
simContinuous
, simComponents
data(eusilcS) # load sample data ## Not run: ## long computation time # multinomial model with random draws eusilcM <- simEUSILC(eusilcS, upper = 200000, equidist = FALSE , nr_cpus = 1) summary(eusilcM) # two-step regression eusilcT <- simEUSILC(eusilcS, method = "twostep", nr_cpus = 1) summary(eusilcT) ## End(Not run)
data(eusilcS) # load sample data ## Not run: ## long computation time # multinomial model with random draws eusilcM <- simEUSILC(eusilcS, upper = 200000, equidist = FALSE , nr_cpus = 1) summary(eusilcM) # two-step regression eusilcT <- simEUSILC(eusilcS, method = "twostep", nr_cpus = 1) summary(eusilcT) ## End(Not run)
This function allows to manipulate an object of class
simPopObj
in a way that a new variable containing
smaller regions within an already existing broader region is generated. The
distribution of the smaller region within the broader region is respected.
simInitSpatial( simPopObj, additional, region, tspatialP = NULL, tspatialHH = NULL, eps = 0.05, maxIter = 100, nr_cpus = NULL, seed = 1, verbose = FALSE )
simInitSpatial( simPopObj, additional, region, tspatialP = NULL, tspatialHH = NULL, eps = 0.05, maxIter = 100, nr_cpus = NULL, seed = 1, verbose = FALSE )
simPopObj |
an object of class |
additional |
a character vector of length one holding the variable name
of the variable containing smaller geographical units. This variable name
must be available as a column in input argument |
region |
a character vector of length one holding the variable name of
the broader region. This variable must be available in the input
|
tspatialP |
a data.frame (or data.table) containing three columns. The broader region
(with the variable name being the same as in input |
tspatialHH |
a data.frame (or data.table) containing three columns. The broader region
(with the variable name being the same as in input |
eps |
relative deviation of person counts if person and household counts are provided |
maxIter |
maximum number of iteration for adjustment if person and household counts are provided |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
verbose |
TRUE/FALSE if some information should be shown during the process |
The distributional information must be contained in an input table that holds combinations of characteristics of the broader region and the smaller regions as well as population counts (which may be available from a census).
An object of class simPopObj
with an additional
variable in the synthetic population slot.
Bernhard Meindl and Alexander Kowarik
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
library(data.table) data(eusilcS) data(eusilcP) library(data.table) # no districts are available in the population, so we have to generate those # we randomly assign districts within "region" in the eusilc population data # each hh has the same district simulate_districts <- function(inp) { hhid <- "hid" region <- "region" a <- inp[!duplicated(inp[,hhid]),c(hhid, region)] spl <- split(a, a[,region]) regions <- unique(inp[,region]) tmpres <- lapply(1:length(spl), function(x) { codes <- paste(x, 1:sample(3:9,1), sep="") spl[[x]]$district <- sample(codes, nrow(spl[[x]]), replace=TRUE) spl[[x]] }) tmpres <- do.call("rbind", tmpres) tmpres <- tmpres[,-c(2)] out <- merge(inp, tmpres, by.x=c(hhid), by.y=hhid, all.x=TRUE) invisible(out) } eusilcP <- data.table(simulate_districts(eusilcP)) # we generate the input table using the broad region (variable 'region') # and the districts, we have generated before. #Generate table with household counts by district tabHH <- eusilcP[!duplicated(hid),.(Freq=.N),by=.(db040=region,district)] setkey(tabHH,db040,district) #Generate table with person counts by district tabP <- eusilcP[,.(Freq=.N),by=.(db040=region,district)] setkey(tabP,db040,district) # we generate a synthetic population setnames(eusilcP,"region","db040") setnames(eusilcP,"hid","db030") inp <- specifyInput(data=eusilcP, hhid="db030", hhsize="hsize", strata="db040",population=TRUE) ## Not run: # use only HH counts simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "gender")) simPopObj1 <- simInitSpatial(simPopObj, additional="district", region="db040", tspatialHH=tabHH, tspatialP=NULL, nr_cpus=1) # use only P counts simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "gender")) simPopObj2 <- simInitSpatial(simPopObj, additional="district", region="db040", tspatialHH=NULL, tspatialP=tabP, nr_cpus = 1) # use P and HH counts simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "gender")) simPopObj3 <- simInitSpatial(simPopObj, additional="district", region="db040", tspatialHH=tabHH, tspatialP=tabP, nr_cpus = 1) ## End(Not run)
library(data.table) data(eusilcS) data(eusilcP) library(data.table) # no districts are available in the population, so we have to generate those # we randomly assign districts within "region" in the eusilc population data # each hh has the same district simulate_districts <- function(inp) { hhid <- "hid" region <- "region" a <- inp[!duplicated(inp[,hhid]),c(hhid, region)] spl <- split(a, a[,region]) regions <- unique(inp[,region]) tmpres <- lapply(1:length(spl), function(x) { codes <- paste(x, 1:sample(3:9,1), sep="") spl[[x]]$district <- sample(codes, nrow(spl[[x]]), replace=TRUE) spl[[x]] }) tmpres <- do.call("rbind", tmpres) tmpres <- tmpres[,-c(2)] out <- merge(inp, tmpres, by.x=c(hhid), by.y=hhid, all.x=TRUE) invisible(out) } eusilcP <- data.table(simulate_districts(eusilcP)) # we generate the input table using the broad region (variable 'region') # and the districts, we have generated before. #Generate table with household counts by district tabHH <- eusilcP[!duplicated(hid),.(Freq=.N),by=.(db040=region,district)] setkey(tabHH,db040,district) #Generate table with person counts by district tabP <- eusilcP[,.(Freq=.N),by=.(db040=region,district)] setkey(tabP,db040,district) # we generate a synthetic population setnames(eusilcP,"region","db040") setnames(eusilcP,"hid","db030") inp <- specifyInput(data=eusilcP, hhid="db030", hhsize="hsize", strata="db040",population=TRUE) ## Not run: # use only HH counts simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "gender")) simPopObj1 <- simInitSpatial(simPopObj, additional="district", region="db040", tspatialHH=tabHH, tspatialP=NULL, nr_cpus=1) # use only P counts simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "gender")) simPopObj2 <- simInitSpatial(simPopObj, additional="district", region="db040", tspatialHH=NULL, tspatialP=tabP, nr_cpus = 1) # use P and HH counts simPopObj <- simStructure(data=inp, method="direct", basicHHvars=c("age", "gender")) simPopObj3 <- simInitSpatial(simPopObj, additional="district", region="db040", tspatialHH=tabHH, tspatialP=tabP, nr_cpus = 1) ## End(Not run)
Fast simulation of new variables based on univariate distributions
univariate.dis(puf, data, additional, weights, value = "data", fNA = NA) conditional.dis( puf, data, additional, conditional, weights, value = "data", fNA = NA )
univariate.dis(puf, data, additional, weights, value = "data", fNA = NA) conditional.dis( puf, data, additional, conditional, weights, value = "data", fNA = NA )
puf |
data for which one additional column specified by function argument ‘additional’ is simulated |
data |
donor data |
additional |
name of variable to be simulated |
weights |
sampling weights from data |
value |
if “data” then the puf including the additional variable is returned, otherwise only the simulated vector. |
fNA |
only used with missing values if another code as NA should be used |
conditional |
conditioning variable |
Function uni.distribution: random draws from the weighted univariate distribution of the original data
Function conditional.dis: random draws from the weighted conditional distribution (conditioned on a factor variable)
This are simple functions to produce structural variables, variables that should have the same categories as given ones. For more advanced methods see simCategorical()
Lydia Spies, Matthias Templ
## we don't have original data, so let's use eusilc data(eusilc13puf) data(eusilcS) v1 <- univariate.dis(eusilcS, eusilc13puf, additional = "db040", weights = "rb050", value = "vector") table(v1) table(eusilc13puf$db040) ## we don't have original data, so let's use eusilc ##data(eusilc13puf) ##data(eusilcS) ##v1 <- conditional.dis(eusilcS, eusilc13puf, additional = "pb190", ## conditional = "db040", weights = "rb050") ##table(v1) / sum(table(v1)) ##table(eusilc13puf$pb190) / sum(table(eusilc13puf$pb190))
## we don't have original data, so let's use eusilc data(eusilc13puf) data(eusilcS) v1 <- univariate.dis(eusilcS, eusilc13puf, additional = "db040", weights = "rb050", value = "vector") table(v1) table(eusilc13puf$db040) ## we don't have original data, so let's use eusilc ##data(eusilc13puf) ##data(eusilcS) ##v1 <- conditional.dis(eusilcS, eusilc13puf, additional = "pb190", ## conditional = "db040", weights = "rb050") ##table(v1) / sum(table(v1)) ##table(eusilc13puf$pb190) / sum(table(eusilc13puf$pb190))
"simPopObj"
An object that is used throughout the package containing information on the
sample (in slot sample
), the population (slot pop
) and
optionally some margins in form of a table (slot table
).
Objects are automatically created in
function simStructure
.
Bernhard Meindl and Matthias Templ
showClass("simPopObj") ## show method: generate an object of class simPop first data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) class(eusilcP) ## shows some basic information: eusilcP
showClass("simPopObj") ## show method: generate an object of class simPop first data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) class(eusilcP) ## shows some basic information: eusilcP
Simulate categorical variables of population data taking relationships
between household members into account. The household structure of the
population data needs to be simulated beforehand using
simStructure()
.
simRelation( simPopObj, relation = "relate", head = "head", direct = NULL, additional, limit = NULL, censor = NULL, maxit = 500, MaxNWts = 2000, eps = NULL, nr_cpus = NULL, seed = 1, regModel = NULL, verbose = FALSE, method = c("multinom", "ctree", "cforest", "ranger"), by = "strata" )
simRelation( simPopObj, relation = "relate", head = "head", direct = NULL, additional, limit = NULL, censor = NULL, maxit = 500, MaxNWts = 2000, eps = NULL, nr_cpus = NULL, seed = 1, regModel = NULL, verbose = FALSE, method = c("multinom", "ctree", "cforest", "ranger"), by = "strata" )
simPopObj |
a |
relation |
a character string specifying the columns of |
head |
a character string specifying the category of the variable given
by |
direct |
a character string specifying categories of the variable given
by |
additional |
a character vector specifying additional categorical
variables of |
limit |
this can be used to account for structural zeros. If only one additional variable is requested, a named list of lists should be supplied. The names of the list components specify the predictor variables for which to limit the possible outcomes of the response. For each predictor, a list containing the possible outcomes of the response for each category of the predictor can be supplied. The probabilities of other outcomes conditional on combinations that contain the specified categories of the supplied predictors are set to 0. If more than one additional variable is requested, such a list of lists can be supplied for each variable as a component of yet another list, with the component names specifying the respective variables. |
censor |
this can be used to account for structural zeros. If only one
additional variable is requested, a named list of lists or
|
maxit , MaxNWts
|
control parameters to be passed to
|
eps |
a small positive numeric value, or |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
regModel |
allows to specify the variables or model that is used when
simulating additional categorical variables. The following choices are
available if different from
|
verbose |
set to |
method |
a character string specifying the method to be used for simulating the additional categorical variables. Accepted values are
|
by |
defining which variable to use as split up variable of the estimation. Defaults to the strata variable. |
The values of a new variable are simulated in three steps, where the second
step is optional. First, the values of the household heads are simulated
with multinomial log-linear models. Second, individuals directly related to
the corresponding household head (as specified by the argument
direct
) inherit the value of the latter. Third, the values of the
remaining individuals are simulated with multinomial log-linear models in
which the value of the respective household head is used as an additional
predictor.
The number of cpus are selected automatically in the following manner. The number of cpus is equal the number of strata. However, if the number of cpus is less than the number of strata, the number of cpus - 1 is used by default. This should be the best strategy, but the user can also overwrite this decision.
An object of class simPopObj containing survey
data as well as the simulated population data including the categorical
variables specified by additional
.
The basic household structure needs to be simulated beforehand with
the function simStructure()
.
Andreas Alfons and Bernhard Meindl
simStructure()
, simCategorical()
,
simContinuous()
, simComponents()
data(ghanaS) # load sample data samp <- specifyInput( data = ghanaS, hhid = "hhid", strata = "region", weight = "weight" ) ghanaP <- simStructure( data = samp, method = "direct", basicHHvars = c("age", "sex", "relate") ) class(ghanaP) ## Not run: ## long computation time ... ghanaP <- simRelation( simPopObj = ghanaP, relation = "relate", head = "head", additional = c("nation", "ethnic", "religion"), nr_cpus = 1 ) str(ghanaP) ## End(Not run)
data(ghanaS) # load sample data samp <- specifyInput( data = ghanaS, hhid = "hhid", strata = "region", weight = "weight" ) ghanaP <- simStructure( data = samp, method = "direct", basicHHvars = c("age", "sex", "relate") ) class(ghanaP) ## Not run: ## long computation time ... ghanaP <- simRelation( simPopObj = ghanaP, relation = "relate", head = "head", additional = c("nation", "ethnic", "religion"), nr_cpus = 1 ) str(ghanaP) ## End(Not run)
Simulate basic categorical variables that define the household structure (typically variables such as household ID, age and gender) of population data by resampling from survey data.
simStructure( dataS, method = c("direct", "multinom", "distribution"), basicHHvars, seed = 1, MaxNWts = 1e+07 )
simStructure( dataS, method = c("direct", "multinom", "distribution"), basicHHvars, seed = 1, MaxNWts = 1e+07 )
dataS |
an object of class |
method |
a character string specifying the method to be used for
simulating the household sizes. Accepted values are |
basicHHvars |
a character vector specifying important variables for the
household structure that need to be available in |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
MaxNWts |
optional; an integer value for the multinom method for controlling the maximum number of weights. |
An object of class simPopObj
containing the simulated
population household structure as well as the underlying sample that was
provided as input.
The function sample
is used, which gives results
incompatible with those from < 2.2.0 and produces a warning the first time
this happens in a session.
Bernhard Meindl and Andreas Alfons
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
simCategorical
, simContinuous
,
simComponents
, simEUSILC
data(eusilcS) ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) class(eusilcP) eusilcP ## End(Not run)
data(eusilcS) ## Not run: inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090")) class(eusilcP) eusilcP ## End(Not run)
Compute the statistics necessary for producing box-and-whisker plots of continuous or semi-continuous variables, taking into account sample weights.
spBwplotStats(x, weights = NULL, coef = 1.5, zeros = TRUE, do.out = TRUE)
spBwplotStats(x, weights = NULL, coef = 1.5, zeros = TRUE, do.out = TRUE)
x |
a numeric vector. |
weights |
an optional numeric vector containing sample weights. |
coef |
a numeric value that determines the extension of the whiskers. |
zeros |
a logical indicating whether the variable specified by
|
do.out |
a logical indicating whether data points that lie beyond the extremes of the whiskers should be returned. |
The function quantileWt
is used for the computation of
(weighted) quantiles. The median is computed together with the first and
the third quartile, which form the box. If range
is positive, the
whiskers extend to the most extreme data points that have a distance to the
box of no more than coef
times the interquartile range. For
coef = 0
, the whiskers mark the minimum and the maximum of the
sample, whereas a negative value causes an error.
A list of class "spBwplotStats"
with the following
components:
stats |
A vector of length 5 containing the (weighted) statistics for the construction of a box plot. |
n |
if |
nzero |
if |
out |
if |
Stefan Kraft and Andreas Alfons
spBwplot
, for producing (weighted) box plots of
continuous or semi-continuous variables.
quantileWt
for the computation of (weighted) sample quantiles.
boxplot.stats
for the unweighted statistics for box
plots (not considering semi-continuous variables).
data(eusilcS) ## semi-continuous variable spBwplotStats(eusilcS$netIncome, weights=eusilcS$rb050, do.out = FALSE)
data(eusilcS) ## semi-continuous variable spBwplotStats(eusilcS$netIncome, weights=eusilcS$rb050, do.out = FALSE)
Compute a (weighted empirical) cumulative distribution function for survey or population data. For survey data, sample weights are taken into account.
spCdf(x, weights = NULL, approx = FALSE, n = 10000)
spCdf(x, weights = NULL, approx = FALSE, n = 10000)
x |
a numeric vector. |
weights |
an optional numeric vector containing sample weights. |
approx |
a logical indicating whether an approximation of the cumulative distribution function should be computed. |
n |
a single integer value; if |
Sample weights are taken into account by adjusting the step height. To be
precise, the weighted step height for an observation is defined as its
weight divided by the sum of all weights
If requested, the approximation is performed using the function
approx
.
A list of class "spCdf"
with the following components:
x |
a numeric vector containing the |
y |
a
numeric vector containing the |
approx |
a logical indicating whether the coordinates represent an approximation. |
Andreas Alfons and Stefan Kraft
A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi:10.1007/s10260-011-0163-2
data(eusilcS) cdfS <- spCdf(eusilcS$netIncome, weights = eusilcS$rb050) plot(cdfS, type="s")
data(eusilcS) cdfS <- spCdf(eusilcS$netIncome, weights = eusilcS$rb050) plot(cdfS, type="s")
create an standardized input object of class 'dataObj' containing
information on weights, household ids, household sizes, person ids and
optionally strata. Outputs of this function are typically used in
simStructure
.
specifyInput( data, hhid = NULL, hhsize = NULL, pid = NULL, weight = NULL, strata = NULL, population = FALSE )
specifyInput( data, hhid = NULL, hhsize = NULL, pid = NULL, weight = NULL, strata = NULL, population = FALSE )
data |
a |
hhid |
character vector of length 1 specifying variable containing
household ids within slot |
hhsize |
character vector of length 1 specifying variable containing
household sizes within slot |
pid |
character vector of length 1 specifying variable containing
person ids within slot |
weight |
character vector of length 1 specifying variable holding
sampling weights within slot |
strata |
character vector of length 1 specifing variable name within
slot |
population |
TRUE/FALSE vector of length 1 specifing if the data object is a sample or a population NULL if such variable does not exist. |
Bernhard Meindl
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", weight="rb050", strata="db040") class(inp) inp
data(eusilcS) inp <- specifyInput(data=eusilcS, hhid="db030", weight="rb050", strata="db040") class(inp) inp
Create mosaic plots of expected (i.e., estimated) and realized (i.e., simulated) population sizes.
spMosaic(x, method = c("split", "color"), ...)
spMosaic(x, method = c("split", "color"), ...)
x |
An object of class |
method |
A character string specifying the plot method. Possible values
are |
... |
if |
If method
is "split"
, the two tables of expected and realized
population sizes are combined into a single table, with an additional
conditioning variable indicating expected and realized values. A conditional
plot of this table is then produced using cotabplot
.
Andreas Alfons and Bernhard Meindl
M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Survey, 79 (10), 1–38. doi:10.18637/jss.v079.i10
A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi:10.1080/02664763.2013.859237
set.seed(1234) # for reproducibility ## Not run: data(eusilcS) # load sample data samp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=samp, method="direct", basicHHvars=c("age","rb090")) abb <- c("B","LA","Vi","C","St","UA","Sa","T","Vo") tab <- spTable(eusilcP, select=c("rb090", "db040", "hsize")) # expected and realized population sizes spMosaic(tab, method = "split", labeling=labeling_border(abbreviate=c(db040=TRUE))) # realized population sizes colored according to relative # differences with expected population sizes spMosaic(tab, method = "color", labeling=labeling_border(abbreviate=c(db040=TRUE))) ## End(Not run)
set.seed(1234) # for reproducibility ## Not run: data(eusilcS) # load sample data samp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=samp, method="direct", basicHHvars=c("age","rb090")) abb <- c("B","LA","Vi","C","St","UA","Sa","T","Vo") tab <- spTable(eusilcP, select=c("rb090", "db040", "hsize")) # expected and realized population sizes spMosaic(tab, method = "split", labeling=labeling_border(abbreviate=c(db040=TRUE))) # realized population sizes colored according to relative # differences with expected population sizes spMosaic(tab, method = "color", labeling=labeling_border(abbreviate=c(db040=TRUE))) ## End(Not run)
Using the Sprague multipliers, the age counts are estimated for each year having 5-years interval data as input.
sprague(x)
sprague(x)
x |
numeric vector of age counts in five-year intervals |
The input is population counts of age classes 0-4, 5-9, 10-14, ... , 77-74, 75-79, 80+.
Population counts for age 0, 1, 2, 3, 4, ..., 78, 79, 80+.
Matthias Templ
G. Calot and J.-P. Sardon. Methodology for the calculation of Eurostat's demographic indicators. Detailed report by the European Demographic Observatory
## example from the world bank x <- data.frame(age=as.factor(c( "0-4", "5-9","10-14","15-19", "20-24", "25-29","30-34","35-39","40-44","45-49", "50-54","55-59","60-64","65-69","77-74","75-79","80+" )), pop=c(1971990, 2095820,2157190, 2094110,2116580, 2003840, 1785690, 1502990, 1214170, 796934, 627551, 530305, 488014, 364498, 259029,158047, 125941) ) s <- sprague(x[,2]) s all.equal(sum(s), sum(x[,2]))
## example from the world bank x <- data.frame(age=as.factor(c( "0-4", "5-9","10-14","15-19", "20-24", "25-29","30-34","35-39","40-44","45-49", "50-54","55-59","60-64","65-69","77-74","75-79","80+" )), pop=c(1971990, 2095820,2157190, 2094110,2116580, 2003840, 1785690, 1502990, 1214170, 796934, 627551, 530305, 488014, 364498, 259029,158047, 125941) ) s <- sprague(x[,2]) s all.equal(sum(s), sum(x[,2]))
Compute contingency tables of expected (i.e., estimated) and realized (i.e., simulated) population sizes. The expected values are obtained with the Horvitz-Thompson estimator.
spTable(inp, select)
spTable(inp, select)
inp |
an object of class |
select |
character; vector defining the columns in slots 'pop' and 'sample' of argument 'input' that should be used for tabulation. |
The contingency tables are computed with tableWt
.
A list of class "spTable"
with the following components:
expected |
the contingency table estimated from the survey data. |
realized |
the contingency table computed from the simulated population data. |
Sampling weights are automatically used from the input object 'inp'!
Andreas Alfons and Bernhard Meindl
set.seed(1234) # for reproducibility data(eusilcS) # load sample data ## Not run: samp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=samp, method="direct", basicHHvars=c("age", "rb090")) res <- spTable(eusilcP, select = c("age", "rb090")) class(res) res ## End(Not run)
set.seed(1234) # for reproducibility data(eusilcS) # load sample data ## Not run: samp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") eusilcP <- simStructure(data=samp, method="direct", basicHHvars=c("age", "rb090")) res <- spTable(eusilcP, select = c("age", "rb090")) class(res) res ## End(Not run)
Compute contingency tables taking into account sample weights.
tableWt(x, weights = NULL, useNA = c("no", "ifany", "always"))
tableWt(x, weights = NULL, useNA = c("no", "ifany", "always"))
x |
a vector that can be interpreted as a factor, or a matrix or
|
weights |
an optional numeric vector containing sample weights. |
useNA |
a logical indicating whether to include extra |
For each combination of the variables in x
, the weighted number of
occurence is computed as the sum of the corresponding sample weights. If
weights are not specified, the function table
is applied.
The (weighted) contingency table as an object of class table
,
an array of integer values.
Andreas Alfons and Stefan Kraft
data(eusilcS) tableWt(eusilcS[, c("hsize", "db040")], weights = eusilcS$rb050) tableWt(eusilcS[, c("rb090", "pb220a")], weights = eusilcS$rb050, useNA = "ifany")
data(eusilcS) tableWt(eusilcS[, c("hsize", "db040")], weights = eusilcS$rb050) tableWt(eusilcS[, c("rb090", "pb220a")], weights = eusilcS$rb050, useNA = "ifany")
Population characteristics Region times Gender from Austria.
Using samp
samp<-
it is possible to extract or
rather modify variables of the sample data within slot data
in slot
sample
of the simPopObj-class
-object. Using
pop
pop<-
it is possible to extract or rather
modify variables of the synthetic population within in slot data
in
slot sample
of the simPopObj-class
-object.
totalsRG: A data frame with 18 observations on the following 3 variables.
gender; a factor with levels
female
male
region; a factor with levels
Burgenland
Carinthia
Lower Austria,
Salzburg
Styria
Tyrol
Upper Austria
Vienna
Vorarlberg
totals; a numeric vector
totalsRGtab: a two-dimensional table holding the same information
totalsRG: A data frame with 18 observations on the following 3 variables.
gender; a factor with levels
female
male
region; a factor with levels
Burgenland
Carinthia
Lower Austria,
Salzburg
Styria
Tyrol
Upper Austria
Vienna
Vorarlberg
totals; a numeric vector
totalsRGtab: a two-dimensional table holding the same information
Population totals Region times Gender for Austria 2006
Population characteristics Region times Gender from Austria.
StatCube - statistical data base, http://www.statistik.at
StatCube - statistical data base, http://www.statistik.at/
data(totalsRG) totalsRG data(totalsRGtab) totalsRGtab data(totalsRG) totalsRG data(totalsRGtab) totalsRGtab
data(totalsRG) totalsRG data(totalsRGtab) totalsRGtab data(totalsRG) totalsRG data(totalsRGtab) totalsRGtab
Various utility measues that basically compares two data sets
utility( x, y, type = c("all", "compareColumns", "compareRows", "compareRowsHH", "compareNA"), hhid = NULL ) utilityModal(x, y, varx, vary = NULL) utilityIndicator(x, y)
utility( x, y, type = c("all", "compareColumns", "compareRows", "compareRowsHH", "compareNA"), hhid = NULL ) utilityModal(x, y, varx, vary = NULL) utilityIndicator(x, y)
x |
a data.frame, typically the original data set. For |
y |
a data.frame, typically the corresponding synthetic data set. For |
type |
which measure
|
hhid |
index or name of variable containing the houshold ID |
varx |
name or index of a variable in data.frame x |
vary |
NULL or name or index of a variable in data.frame y corresponding to variable varx in data.frame x. If NULL, the names of the selected variable should be the same in both x and y. |
the measure(s) of interest
utility()
: comparisons of two data sets
utilityModal()
: comparison of number of categories
utilityIndicator()
: difference between two values
Matthias Templ, Maxime Bergeaut
data(eusilcS) data(eusilcP) ## for fast caluclations, took a subsample eusilcP <- eusilcP[1:15000, ] utility(eusilcS, eusilcP) data(eusilcS) data(eusilcP) utilityModal(eusilcS, eusilcP, "age") utilityModal(eusilcS, eusilcP, "pl030", "ecoStat") data(eusilcS) data(eusilcP) m1 <- meanWt(eusilcS$age, eusilcS$rb050) m2 <- mean(eusilcP$age) utilityIndicator(m1, m2)
data(eusilcS) data(eusilcP) ## for fast caluclations, took a subsample eusilcP <- eusilcP[1:15000, ] utility(eusilcS, eusilcP) data(eusilcS) data(eusilcP) utilityModal(eusilcS, eusilcP, "age") utilityModal(eusilcS, eusilcP, "pl030", "ecoStat") data(eusilcS) data(eusilcP) m1 <- meanWt(eusilcS$age, eusilcS$rb050) m2 <- mean(eusilcP$age) utilityIndicator(m1, m2)
Compute mean, variance, covariance matrix and correlation matrix, taking into account sample weights.
meanWt
: a simple wrapper that calls mean(x, na.rm=na.rm)
if
weights
is missing and weighted.mean(x, w=weights,
na.rm=na.rm)
otherwise. Implemented methods for this generic are:
meanWt.default(x, weights, na.rm=TRUE, ...)
meanWt.dataObj(x, vars, na.rm=TRUE, ...)
varWt
: calls var(x, na.rm=na.rm)
if weights
is missing.
Implemented methods for this generic are:
varWt.default(x, weights, na.rm=TRUE, ...)
varWt.dataObj(x, vars, na.rm=TRUE, ...)
covWt
and covWt
: always remove missing values pairwise and call
cov
and cor
, respectively, if weights
is missing.
Implemented methods for these generics are:
covWt.default(x, y, weights, ...)
covWt.matrix(x, weights, ...)
covWt.data.frame(x, weights, ...)
covWt.dataObj(x, vars, ...)
corWt.default(x, y, weights, ...)
corWt.matrix(x, weights, ...)
corWt.data.frame(x, weights, ...)
corWt.dataObj(x, vars, ...)
The additional parameters are now described:
y: a numeric vector. If missing, this defaults to x
.
vars: a character vector of variable names that should be used for the calculation.
na.rm: a logical indicating whether any NA
or NaN
values
should be removed from x
before computation. Note that the default
is TRUE
.
weights: an optional numeric vector containing sample weights.
meanWt(x, ...) varWt(x, ...) covWt(x, ...) corWt(x, ...)
meanWt(x, ...) varWt(x, ...) covWt(x, ...) corWt(x, ...)
x |
for |
... |
for the generic functions |
For meanWt
, the (weighted) mean.
For varWt
, the (weighted) variance.
For covWt
, the (weighted) covariance matrix or, for the default
method, the (weighted) covariance.
For corWt
, the (weighted) correlation matrix or, for the default
method, the (weighted) correlation coefficient.
meanWt
, varWt
, covWt
and corWt
all make use of
slot weights
of the input object if the dataObj
-method is
used.
Stefan Kraft and Andreas Alfons
mean
, weighted.mean
,
var
, cov
,
cor
data(eusilcS) meanWt(eusilcS$netIncome, weights=eusilcS$rb050) sqrt(varWt(eusilcS$netIncome, weights=eusilcS$rb050)) # dataObj-methods inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") meanWt(inp, vars="netIncome") sqrt(varWt(inp, vars="netIncome")) corWt(inp, vars=c("age", "netIncome")) covWt(inp, vars=c("age", "netIncome"))
data(eusilcS) meanWt(eusilcS$netIncome, weights=eusilcS$rb050) sqrt(varWt(eusilcS$netIncome, weights=eusilcS$rb050)) # dataObj-methods inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090") meanWt(inp, vars="netIncome") sqrt(varWt(inp, vars="netIncome")) corWt(inp, vars=c("age", "netIncome")) covWt(inp, vars=c("age", "netIncome"))
The function calculates the original and modified Whipple index to evaluate age heaping.
whipple(x, method = "standard", weight = NULL)
whipple(x, method = "standard", weight = NULL)
x |
numeric vector holding the age of persons |
method |
“standard” or “modified” Whipple index. |
weight |
numeric vector holding the weights of each person |
The original Whipple's index is obtained by summing the number of persons in the age range between 23 and 62, and calculating the ratio of reported ages ending in 0 or 5 to one-fifth of the total sample. A linear decrease in the number of persons of each age within the age range is assumed. Therefore, low ages (0-22 years) and high ages (63 years and above) are excluded from analysis since this assumption is not plausible.
When the digits 0 and 5 are not reported in the data, the original Whipple index varies between 0 and 100, 100 if no preference for 0 or 5 is within the data. When only the digits 0 and 5 are reported in the data it reaches a to a maximum of 500.
For the modified Whipple index, age heaping is calculated for all ten digits (0-9). For each digit, the degree of preference or avoidance can be determined for certain ranges of ages, and the modified Whipple index then is given by the absolute sum of these (indices - 1). The index is scaled between 0 and 1, therefore it is 1 if all age values end with the same digit and 0 it is distributed perfectly equally.
The original or modified Whipple index.
Matthias Templ, Alexander Kowarik
Henry S. Shryock and Jacob S. Siegel, Methods and Materials of Demography (New York: Academic Press, 1976)
#Equally distributed age <- sample(1:100, 5000, replace=TRUE) whipple(age) whipple(age,method="modified") # Only 5 and 10 age5 <- sample(seq(0,100,by=5), 5000, replace=TRUE) whipple(age5) whipple(age5,method="modified") #Only 10 age10 <- sample(seq(0,100,by=10), 5000, replace=TRUE) whipple(age10) whipple(age10,method="modified")
#Equally distributed age <- sample(1:100, 5000, replace=TRUE) whipple(age) whipple(age,method="modified") # Only 5 and 10 age5 <- sample(seq(0,100,by=5), 5000, replace=TRUE) whipple(age5) whipple(age5,method="modified") #Only 10 age10 <- sample(seq(0,100,by=10), 5000, replace=TRUE) whipple(age10) whipple(age10,method="modified")