For the most part, this document
will present the functionalities of the function
surveysd::calc.stError()
which generates point estimates
and standard errors for user-supplied estimation functions.
In order to use a dataset with calc.stError()
, several
weight columns have to be present. Each weight column corresponds to a
bootstrap sample. In the following examples, we will use the data from
demo.eusilc()
and attach the bootstrap weights using
draw.bootstrap()
and recalib()
. Please refer
to the documentation of those functions for more detail.
## Warning: S3 method 'update.bootstrap' was declared in NAMESPACE but not found
set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region",
epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE)
dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]
## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
## year povertyRisk eqIncome onePerson pWeight w1 w2
## <num> <lgcl> <num> <lgcl> <num> <num> <num>
## 1: 2010 FALSE 16090.69 FALSE 504.5696 0.4499442 1008.0015
## 2: 2010 FALSE 16090.69 FALSE 504.5696 0.4499442 1008.0015
## 3: 2010 FALSE 16090.69 FALSE 504.5696 0.4499442 1008.0015
## 4: 2010 FALSE 27076.24 FALSE 493.3824 988.3080420 987.7743
## 5: 2010 FALSE 27076.24 FALSE 493.3824 988.3080420 987.7743
## w3 w4 w5
## <num> <num> <num>
## 1: 0.4474463 0.4486785 1018.4027
## 2: 0.4474463 0.4486785 1018.4027
## 3: 0.4474463 0.4486785 1018.4027
## 4: 984.3519175 0.4387304 996.8986
## 5: 984.3519175 0.4387304 996.8986
The parameters fun
and var
in
calc.stError()
define the estimator to be used in the error
analysis. There are two built-in estimator functions
weightedSum()
and weightedRatio()
which can be
used as follows.
povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)
Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period.
## Key: <year, n, N>
## year n N val_povertyRisk stE_povertyRisk
## <num> <int> <num> <num> <num>
## 1: 2010 14827 8182222 14.44422 0.5800070
## 2: 2011 14827 8182222 14.77393 0.6253285
## 3: 2012 14827 8182222 15.04515 0.6296422
## 4: 2013 14827 8182222 14.89013 0.5181864
## 5: 2014 14827 8182222 15.14556 0.5223758
## 6: 2015 14827 8182222 15.53640 0.4974983
## 7: 2016 14827 8182222 15.08315 0.3365409
## 8: 2017 14827 8182222 15.42019 0.5162417
## Key: <year, n, N>
## year n N val_eqIncome stE_eqIncome
## <num> <int> <num> <num> <num>
## 1: 2010 14827 8182222 162750998071 956103829
## 2: 2011 14827 8182222 161926931417 1285601987
## 3: 2012 14827 8182222 162576509628 1393971723
## 4: 2013 14827 8182222 163199507862 1144199513
## 5: 2014 14827 8182222 163986275009 1286636906
## 6: 2015 14827 8182222 163416275447 1141205133
## 7: 2016 14827 8182222 162706205137 1749220106
## 8: 2017 14827 8182222 164314959107 1696132770
Columns that use the val_
prefix denote the point
estimate belonging to the “main weight” of the dataset, which is
pWeight
in case of the dataset used here.
Columns with the stE_
prefix denote standard errors
calculated with bootstrap replicates. The replicates result in using
w1
, w2
, …, w10
instead of
pWeight
when applying the estimator.
n
denotes the number of observations for the year and
N
denotes the total weight of those persons.
In order to define a custom estimator function to be used in
fun
, the function needs to have at least two arguments like
the example below.
## define custom estimator
myWeightedSum <- function(x, w) {
sum(x*w)
}
## check if results are equal to the one using `surveysd::weightedSum()`
totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum)
all.equal(totalIncome$Estimates, totalIncome2$Estimates)
## [1] TRUE
The parameters x
and w
can be assumed to be
vectors with equal length with w
being numeric weight
vector and x
being the column defined in the
var
argument. It will be called once for each period (in
this case year
) and for each weight column (in this case
pWeight
, w1
, w2
, …,
w10
).
Custom estimators using additional parameters can also be supplied
and parameter add.arg
can be used to set the additional
arguments for the custom estimator.
## use add.arg-argument
fun <- function(x, w, b) {
sum(x*w*b)
}
add.arg = list(b="onePerson")
err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun,
period.mean = 0, add.arg=add.arg)
err.est$Estimates
## Key: <year, n, N>
## year n N val_povertyRisk stE_povertyRisk
## <num> <int> <num> <num> <num>
## 1: 2010 14827 8182222 273683.9 12509.646
## 2: 2011 14827 8182222 261883.6 11082.030
## 3: 2012 14827 8182222 243083.9 14383.509
## 4: 2013 14827 8182222 238004.4 14458.301
## 5: 2014 14827 8182222 218572.1 12941.891
## 6: 2015 14827 8182222 219984.1 12589.386
## 7: 2016 14827 8182222 201753.9 10780.028
## 8: 2017 14827 8182222 196881.2 6829.766
# compare with direct computation
compare.value <- dat_boot_calib[,fun(povertyRisk,pWeight,b=onePerson),
by=c("year")]
all((compare.value$V1-err.est$Estimates$val_povertyRisk)==0)
## [1] TRUE
The above chunk computes the weighted poverty ratio for single person households.
In our example the variable povertyRisk
is a boolean and
is TRUE
if the income is less than 60% of the weighted
median income. Thus it directly depends on the original weight vector
pWeight
. To further reduce the estimated error one should
calculate for each bootstrap replicate weight w the weighted median income medIncomew
and then define povertyRiskw
as
$$ povertyRisk_w = \cases{1 \quad\text{if Income}<0.6\cdot medIncome_{w}\\ 0 \quad\text{else}} $$
The estimator can then be applied to the new variable povertyRiskw. This can be realized using a custom estimator function.
# custom estimator to first derive poverty threshold
# and then estimate a weighted ratio
povmd <- function(x, w) {
md <- laeken::weightedMedian(x, w)*0.6
pmd60 <- x < md
# weighted ratio is directly estimated inside the function
return(sum(w[pmd60])/sum(w)*100)
}
err.est <- calc.stError(
dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
fun.adjust.var = povmd, adjust.var = "eqIncome")
err.est$Estimates
## Key: <year, n, N>
## year n N val_povertyRisk stE_povertyRisk
## <num> <int> <num> <num> <num>
## 1: 2010 14827 8182222 14.44422 0
## 2: 2011 14827 8182222 14.77393 0
## 3: 2012 14827 8182222 15.04515 0
## 4: 2013 14827 8182222 14.89013 0
## 5: 2014 14827 8182222 15.14556 0
## 6: 2015 14827 8182222 15.53640 0
## 7: 2016 14827 8182222 15.08315 0
## 8: 2017 14827 8182222 15.42019 0
The approach shown above is only valid if not grouping variables are
supplied (parameter group
). If grouping variables are
supplied one should use parameters fun.adjust.var
and
adjust.var
such that the povertyRiskw
is first calculated for each period
and then used for each
grouping in group
.
# using fun.adjust.var and adjust.var to estimate povmd60 indicator
# for each period and bootstrap weight before applying the weightedRatio
povmd2 <- function(x, w) {
md <- laeken::weightedMedian(x, w)*0.6
pmd60 <- x < md
return(as.integer(pmd60))
}
# set adjust.var="eqIncome" so the income vector is used to estimate
# the povmd60 indicator for each bootstrap weight
# and the resulting indicators are passed to function weightedRatio
group <- "gender"
err.est <- calc.stError(
dat_boot_calib, var = "povertyRisk", fun = weightedRatio, group = "gender",
fun.adjust.var = povmd2, adjust.var = "eqIncome")
err.est$Estimates
## Key: <year, n, N, gender>
## year n N gender val_povertyRisk stE_povertyRisk
## <num> <int> <num> <fctr> <num> <num>
## 1: 2010 7267 3979572 male 12.02660 0.6259582
## 2: 2010 7560 4202650 female 16.73351 0.7178332
## 3: 2010 14827 8182222 <NA> 14.44422 0.6302645
## 4: 2011 7267 3979572 male 12.81921 0.6207981
## 5: 2011 7560 4202650 female 16.62488 0.6324005
## 6: 2011 14827 8182222 <NA> 14.77393 0.5711159
## 7: 2012 7267 3979572 male 13.76065 0.4831183
## 8: 2012 7560 4202650 female 16.26147 0.7925504
## 9: 2012 14827 8182222 <NA> 15.04515 0.6215972
## 10: 2013 7267 3979572 male 13.88962 0.4733111
## 11: 2013 7560 4202650 female 15.83754 0.6729889
## 12: 2013 14827 8182222 <NA> 14.89013 0.5292087
## 13: 2014 7267 3979572 male 14.50351 0.5861250
## 14: 2014 7560 4202650 female 15.75353 0.6763902
## 15: 2014 14827 8182222 <NA> 15.14556 0.5854949
## 16: 2015 7267 3979572 male 15.12289 0.3999441
## 17: 2015 7560 4202650 female 15.92796 0.6201872
## 18: 2015 14827 8182222 <NA> 15.53640 0.4940599
## 19: 2016 7267 3979572 male 14.57968 0.2846372
## 20: 2016 7560 4202650 female 15.55989 0.4763188
## 21: 2016 14827 8182222 <NA> 15.08315 0.3459556
## 22: 2017 7267 3979572 male 14.94816 0.3756216
## 23: 2017 7560 4202650 female 15.86717 0.3769701
## 24: 2017 14827 8182222 <NA> 15.42019 0.3414891
## year n N gender val_povertyRisk stE_povertyRisk
In case an estimator should be applied to several columns of the
dataset, var
can be set to a vector containing all
necessary columns.
multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
multipleRates$Estimates
## Key: <year, n, N>
## year n N val_povertyRisk stE_povertyRisk val_onePerson
## <num> <int> <num> <num> <num> <num>
## 1: 2010 14827 8182222 14.44422 0.5800070 14.85737
## 2: 2011 14827 8182222 14.77393 0.6253285 14.85737
## 3: 2012 14827 8182222 15.04515 0.6296422 14.85737
## 4: 2013 14827 8182222 14.89013 0.5181864 14.85737
## 5: 2014 14827 8182222 15.14556 0.5223758 14.85737
## 6: 2015 14827 8182222 15.53640 0.4974983 14.85737
## 7: 2016 14827 8182222 15.08315 0.3365409 14.85737
## 8: 2017 14827 8182222 15.42019 0.5162417 14.85737
## stE_onePerson
## <num>
## 1: 0.2590413
## 2: 0.1893102
## 3: 0.3440835
## 4: 0.3695592
## 5: 0.2810260
## 6: 0.3758405
## 7: 0.3970361
## 8: 0.3501762
Here we see the relative number of persons at risk of poverty and the relative number of one-person households.
The groups
argument can be used to calculate estimators
for different subsets of the data. This argument can take the grouping
variable as a string that refers to a column name (usually a factor) in
dat
. If set, all estimators are not only split by the
reference period but also by the grouping variable. For simplicity, only
one reference period of the above data is used.
dat2 <- subset(dat_boot_calib, year == 2010)
for (att in c("period", "weights", "b.rep"))
attr(dat2, att) <- attr(dat_boot_calib, att)
To calculate the ratio of persons at risk of poverty for each federal
state of Austria, group = "region"
can be used.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates
## Key: <year, n, N, region>
## year n N region val_povertyRisk stE_povertyRisk
## <num> <int> <num> <fctr> <num> <num>
## 1: 2010 549 260564 Burgenland 19.53984 1.8075049
## 2: 2010 733 377355 Vorarlberg 16.53731 3.2902750
## 3: 2010 924 535451 Salzburg 13.78734 2.2504749
## 4: 2010 1078 563648 Carinthia 13.08627 1.7521514
## 5: 2010 1317 701899 Tyrol 15.30819 1.3893509
## 6: 2010 2295 1167045 Styria 14.37464 1.3507201
## 7: 2010 2322 1598931 Vienna 17.23468 1.0401741
## 8: 2010 2804 1555709 Lower Austria 13.84362 1.7019761
## 9: 2010 2805 1421620 Upper Austria 10.88977 0.8680154
## 10: 2010 14827 8182222 <NA> 14.44422 0.5800070
The last row with region = NA
denotes the aggregate over
all regions. Note that the columns N
and n
now
show the weighted and unweighted number of persons in each region.
In case more than one grouping variable is used, there are several
options of calling calc.stError()
depending on whether
combinations of grouping levels should be regarded or not. We will
consider the variables gender
and region
as
our grouping variables and show three options on how
calc.stError()
can be called.
Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore
nperiods ⋅ (nregions + ngenders + 1) = 1 ⋅ (9 + 2 + 1) = 12.
The last row is again the estimate for the whole period.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
group = c("gender", "region"))
povertyRates$Estimates
## Key: <year, n, N, gender, region>
## year n N gender region val_povertyRisk stE_povertyRisk
## <num> <int> <num> <fctr> <fctr> <num> <num>
## 1: 2010 549 260564 <NA> Burgenland 19.53984 1.8075049
## 2: 2010 733 377355 <NA> Vorarlberg 16.53731 3.2902750
## 3: 2010 924 535451 <NA> Salzburg 13.78734 2.2504749
## 4: 2010 1078 563648 <NA> Carinthia 13.08627 1.7521514
## 5: 2010 1317 701899 <NA> Tyrol 15.30819 1.3893509
## 6: 2010 2295 1167045 <NA> Styria 14.37464 1.3507201
## 7: 2010 2322 1598931 <NA> Vienna 17.23468 1.0401741
## 8: 2010 2804 1555709 <NA> Lower Austria 13.84362 1.7019761
## 9: 2010 2805 1421620 <NA> Upper Austria 10.88977 0.8680154
## 10: 2010 7267 3979572 male <NA> 12.02660 0.6303201
## 11: 2010 7560 4202650 female <NA> 16.73351 0.6255656
## 12: 2010 14827 8182222 <NA> <NA> 14.44422 0.5800070
state
and
gender
Split the data by all combinations of the two grouping variables. This will result in a larger output-table of the size
nperiods ⋅ (nregions ⋅ ngenders + 1) = 1 ⋅ (9 ⋅ 2 + 1) = 19.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
group = list(c("gender", "region")))
povertyRates$Estimates
## Key: <year, n, N, gender, region>
## year n N gender region val_povertyRisk stE_povertyRisk
## <num> <int> <num> <fctr> <fctr> <num> <num>
## 1: 2010 261 122741.8 male Burgenland 17.414524 2.2329915
## 2: 2010 288 137822.2 female Burgenland 21.432598 2.0940707
## 3: 2010 359 182732.9 male Vorarlberg 12.973259 3.1096981
## 4: 2010 374 194622.1 female Vorarlberg 19.883637 3.7576316
## 5: 2010 440 253143.7 male Salzburg 9.156964 1.9004827
## 6: 2010 484 282307.3 female Salzburg 17.939382 2.5268499
## 7: 2010 517 268581.4 male Carinthia 10.552149 2.0606134
## 8: 2010 561 295066.6 female Carinthia 15.392924 1.9858506
## 9: 2010 650 339566.5 male Tyrol 12.857542 1.0359935
## 10: 2010 667 362332.5 female Tyrol 17.604861 2.2100445
## 11: 2010 1128 571011.7 male Styria 11.671247 1.5114035
## 12: 2010 1132 774405.4 male Vienna 15.590616 1.3343058
## 13: 2010 1167 596033.3 female Styria 16.964539 1.3779181
## 14: 2010 1190 824525.6 female Vienna 18.778813 0.9289638
## 15: 2010 1363 684272.5 male Upper Austria 9.074690 1.1705213
## 16: 2010 1387 772593.2 female Lower Austria 16.372949 1.8354623
## 17: 2010 1417 783115.8 male Lower Austria 11.348283 1.6427264
## 18: 2010 1442 737347.5 female Upper Austria 12.574206 0.7665952
## 19: 2010 14827 8182222.0 <NA> <NA> 14.444218 0.5800070
In this case, the estimates and standard errors are calculated for
The number of rows in the output is therefore
nperiods ⋅ (nregions ⋅ ngenders + nregions + ngenders + 1) = 1 ⋅ (9 ⋅ 2 + 9 + 2 + 1) = 30.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
group = list("gender", "region", c("gender", "region")))
povertyRates$Estimates
## Key: <year, n, N, gender, region>
## year n N gender region val_povertyRisk stE_povertyRisk
## <num> <int> <num> <fctr> <fctr> <num> <num>
## 1: 2010 261 122741.8 male Burgenland 17.414524 2.2329915
## 2: 2010 288 137822.2 female Burgenland 21.432598 2.0940707
## 3: 2010 359 182732.9 male Vorarlberg 12.973259 3.1096981
## 4: 2010 374 194622.1 female Vorarlberg 19.883637 3.7576316
## 5: 2010 440 253143.7 male Salzburg 9.156964 1.9004827
## 6: 2010 484 282307.3 female Salzburg 17.939382 2.5268499
## 7: 2010 517 268581.4 male Carinthia 10.552149 2.0606134
## 8: 2010 549 260564.0 <NA> Burgenland 19.539837 1.8075049
## 9: 2010 561 295066.6 female Carinthia 15.392924 1.9858506
## 10: 2010 650 339566.5 male Tyrol 12.857542 1.0359935
## 11: 2010 667 362332.5 female Tyrol 17.604861 2.2100445
## 12: 2010 733 377355.0 <NA> Vorarlberg 16.537310 3.2902750
## 13: 2010 924 535451.0 <NA> Salzburg 13.787343 2.2504749
## 14: 2010 1078 563648.0 <NA> Carinthia 13.086268 1.7521514
## 15: 2010 1128 571011.7 male Styria 11.671247 1.5114035
## 16: 2010 1132 774405.4 male Vienna 15.590616 1.3343058
## 17: 2010 1167 596033.3 female Styria 16.964539 1.3779181
## 18: 2010 1190 824525.6 female Vienna 18.778813 0.9289638
## 19: 2010 1317 701899.0 <NA> Tyrol 15.308190 1.3893509
## 20: 2010 1363 684272.5 male Upper Austria 9.074690 1.1705213
## 21: 2010 1387 772593.2 female Lower Austria 16.372949 1.8354623
## 22: 2010 1417 783115.8 male Lower Austria 11.348283 1.6427264
## 23: 2010 1442 737347.5 female Upper Austria 12.574206 0.7665952
## 24: 2010 2295 1167045.0 <NA> Styria 14.374637 1.3507201
## 25: 2010 2322 1598931.0 <NA> Vienna 17.234683 1.0401741
## 26: 2010 2804 1555709.0 <NA> Lower Austria 13.843623 1.7019761
## 27: 2010 2805 1421620.0 <NA> Upper Austria 10.889773 0.8680154
## 28: 2010 7267 3979571.7 male <NA> 12.026600 0.6303201
## 29: 2010 7560 4202650.3 female <NA> 16.733508 0.6255656
## 30: 2010 14827 8182222.0 <NA> <NA> 14.444218 0.5800070
## year n N gender region val_povertyRisk stE_povertyRisk