Reconciling R and Stata Approaches
Getting predictions in R is and always has been pretty easy for the vast majority of packages providing modeling functions, as they also provide a predict method for the model objects. For those in the Stata world, they typically use margins
for this, but when they come to R, there is no obvious option for how to go about it in the same way1. Likewise, some in the R world catch a whiff of Stata’s margins and would want something similar, but may not be sure where to turn.
A little digging will reveal there are several packages that will provide the same sort of thing. In addition, there are numerous resources for both R and Stata for getting marginal results (i.e. predictions). However, here we note the issues that arise when models include an offset. Offsets are commonly used to model rates when the target variable is a count, but are used in other contexts as well. The Stata documentation for the margins command offers no specific details of how the offset/exposure is treated, and some R packages appear not to know what to do with it, or offer few options to deal with it. So even when the models are identical, marginal estimates might be different in R and Stata. Here we’ll try to sort some of this out.
We will use the Insurance data from the MASS package which most with R will have access to. From the helpfile:
The data given in data frame Insurance consist of the numbers of policyholders of an insurance company who were exposed to risk, and the numbers of car insurance claims made by those policyholders.
We do a bit of minor processing, and I save the data as a Stata file in case anyone wants to play with it in that realm.
library(tidyverse)
set.seed(123)
insurance = MASS::Insurance %>%
rename_all(tolower) %>%
mutate(
# create standard rather than ordered factors for typical output
age = factor(age, ordered = FALSE),
group = factor(group, ordered = FALSE),
# create a numeric age covariate for later
age_num = case_when(
age == '<25' ~ sample(18:25, n(), replace = T),
age == '25-29' ~ sample(25:29, n(), replace = T),
age == '30-35' ~ sample(30:35, n(), replace = T),
age == '>35' ~ sample(36:75, n(), replace = T),
),
# for stata consistency
ln_holders = log(holders)
)
haven::write_dta(insurance, 'data/insurance.dta')
Let’s take a quick peek to get our bearings.
$`Numeric Variables`
Variable N Mean SD Min Q1 Median Q3 Max Missing
1 holders 64 364.98 622.77 3.0 46.75 136.00 327.50 3582.00 0
2 claims 64 49.23 71.16 0.0 9.50 22.00 55.50 400.00 0
3 age_num 64 35.23 15.83 18.0 25.00 29.50 35.50 75.00 0
4 ln_holders 64 4.90 1.48 1.1 3.84 4.91 5.79 8.18 0
$`Categorical Variables`
# A tibble: 12 x 4
Variable Group Frequency `%`
<chr> <chr> <int> <dbl>
1 district 1 16 25
2 district 2 16 25
3 district 3 16 25
4 district 4 16 25
5 group <1l 16 25
6 group >2l 16 25
7 group 1-1.5l 16 25
8 group 1.5-2l 16 25
9 age <25 16 25
10 age >35 16 25
11 age 25-29 16 25
12 age 30-35 16 25
Starting out, we run a model in as simple a form as possible. I use just a standard negative binomial with a single covariate age
, so we can clearly see how the ouptut is being produced. Note that age
has four categories as seen above.
nb_glm_offset = MASS::glm.nb(claims ~ age + offset(ln_holders), data = insurance)
summary(nb_glm_offset)
Call:
MASS::glm.nb(formula = claims ~ age + offset(ln_holders), data = insurance,
init.theta = 28.40119393, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.29753 -0.79833 -0.04613 0.82465 3.00825
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.59233 0.09071 -17.554 < 2e-16 ***
age25-29 -0.12697 0.11743 -1.081 0.2796
age30-35 -0.25340 0.11558 -2.193 0.0283 *
age>35 -0.41940 0.10583 -3.963 7.4e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(28.4012) family taken to be 1)
Null deviance: 86.761 on 63 degrees of freedom
Residual deviance: 67.602 on 60 degrees of freedom
AIC: 444.38
Number of Fisher Scoring iterations: 1
Theta: 28.40
Std. Err.: 9.80
2 x log-likelihood: -434.385
Now we run it with Stata. We get the same result, so this means we can’t get different predictions if we do the same thing in both R or Stata2.
nbreg claims i.age, offset(ln_holders) nolog
Negative binomial regression Number of obs = 64
LR chi2(3) = 15.73
Dispersion = mean Prob > chi2 = 0.0013
Log likelihood = -217.1925 Pseudo R2 = 0.0349
------------------------------------------------------------------------------
claims | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
25-29 | -.1269666 .1182871 -1.07 0.283 -.3588049 .1048718
30-35 | -.2533991 .116535 -2.17 0.030 -.4818034 -.0249948
>35 | -.4193953 .1068907 -3.92 0.000 -.6288973 -.2098933
|
_cons | -1.592325 .0913742 -17.43 0.000 -1.771415 -1.413235
ln_holders | 1 (offset)
-------------+----------------------------------------------------------------
/lnalpha | -3.346431 .3544212 -4.041084 -2.651778
-------------+----------------------------------------------------------------
alpha | .0352098 .0124791 .0175784 .0705257
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 38.33 Prob >= chibar2 = 0.000
First let’s use emmeans, a very popular package for getting estimated marginal means, to get the predicted counts for each age group.
library(emmeans)
emmeans(
nb_glm_offset,
~ age,
type = "response",
offset = mean(insurance$ln_holders) # default
)
age | response | SE | df | asymp.LCL | asymp.UCL |
---|---|---|---|---|---|
<25 | 27.437 | 2.489 | Inf | 22.968 | 32.776 |
25-29 | 24.166 | 1.802 | Inf | 20.880 | 27.969 |
30-35 | 21.295 | 1.525 | Inf | 18.507 | 24.505 |
>35 | 18.038 | 0.983 | Inf | 16.211 | 20.072 |
How is this result obtained? It is just the prediction at each value of the covariate, with the offset held at its mean. We can duplicate this result by using the predict method and specifying a data frame with the values of interest.
nd = data.frame(
age = levels(insurance$age),
ln_holders = mean(insurance$ln_holders)
)
predict(
nb_glm_offset,
newdata = nd,
type = 'response'
)
prediction |
---|
27.437 |
24.166 |
21.295 |
18.038 |
As an explicit comparison, the intercept represents group ‘<25’, and if we exponentiate and add the mean offset we get:
\(exp(Intercept + \overline{ln\_holders}) = e^{-1.59+4.90} = \qquad\) 27.437
Now let’s look at Stata. First we want just the basic margins output.
margins age
Adjusted predictions Number of obs = 64
Model VCE : OIM
Expression : Predicted number of events, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
<25 | 74.25681 6.785153 10.94 0.000 60.95815 87.55546
25-29 | 65.40266 4.936493 13.25 0.000 55.72731 75.07801
30-35 | 57.63502 4.195113 13.74 0.000 49.41275 65.85729
>35 | 48.81971 2.747836 17.77 0.000 43.43405 54.20537
------------------------------------------------------------------------------
These values, while consistent in pattern, are much different than the emmeans output, so what is going on?
In this model, we only have the age
covariate and the offset, so there really isn’t much to focus on besides the latter. To replicate the Stata output in R, we will use all values of the offset for every level of age
, and subsequently get an average prediction for each age group. First, we create a data frame for prediction using expand.grid, get the predictions for all those values, then get mean prediction per group.
predictions =
expand.grid(
age = levels(insurance$age),
ln_holders = insurance$ln_holders
) %>%
mutate(prediction = predict(nb_glm_offset, newdata = ., type = 'response')) %>%
group_by(age) %>%
summarise(avg_prediction = mean(prediction))
age | avg_prediction |
---|---|
<25 | 74.257 |
25-29 | 65.403 |
30-35 | 57.635 |
>35 | 48.820 |
The emmeans doesn’t appear to allow one to provide all values of the offset, as adding additional values just applies them to each group and then recycles. In this case, it would just use the first four values of ln_holders
for each age group respectively, which is not what we want.
emmeans(
nb_glm_offset,
~ age,
type = "response",
offset = insurance$ln_holders
)
age response SE df asymp.LCL asymp.UCL
<25 40.1 3.64 Inf 33.6 47.9
25-29 47.3 3.53 Inf 40.9 54.8
30-35 38.8 2.78 Inf 33.8 44.7
>35 224.7 12.25 Inf 201.9 250.0
Confidence level used: 0.95
Intervals are back-transformed from the log scale
insurance$ln_holders[1:4]
[1] 5.283204 5.575949 5.505332 7.426549
If we add the offset to the spec
argument, it still just fixes it at the mean (and I tried variations on the spec). So at least using the standard approaches with this model does not appear to give you the same thing as Stata.
emmeans(
nb_glm_offset,
~ age + offset(ln_holders),
type = "response"
)
age ln_holders response SE df asymp.LCL asymp.UCL
<25 4.9 27.4 2.489 Inf 23.0 32.8
25-29 4.9 24.2 1.802 Inf 20.9 28.0
30-35 4.9 21.3 1.525 Inf 18.5 24.5
>35 4.9 18.0 0.983 Inf 16.2 20.1
Confidence level used: 0.95
Intervals are back-transformed from the log scale
Unfortunately Stata has the opposite issue. Trying to set the offset to the mean results in an error, and using atmeans
doesn’t change the previous result.
margins age, atmeans
margins age, at(ln_holders = 10.27)
Adjusted predictions Number of obs = 64
Model VCE : OIM
Expression : Predicted number of events, predict()
at : 1.age = .25 (mean)
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
<25 | 74.25681 6.785153 10.94 0.000 60.95815 87.55546
25-29 | 65.40266 4.936493 13.25 0.000 55.72731 75.07801
30-35 | 57.63502 4.195113 13.74 0.000 49.41275 65.85729
>35 | 48.81971 2.747836 17.77 0.000 43.43405 54.20537
------------------------------------------------------------------------------
variable ln_holders not found in list of covariates
r(322);
end of do-file
r(322);
The margins package explicitly attempts to duplicate Stata’s margins command, but here we can see it has an issue with the offset.
library(margins)
margins(
nb_glm_offset,
variables = 'age',
type = "response" # default
)
Error in offset(ln_holders): could not find function "offset"
The offset function is part of the stats package of the base R installation, so I tried rerunning the model using stats::offset
, but this makes the offset just like any other covariate, i.e. it did not have a fixed coefficient of 1. Changing the model to a standard glm
class with poisson and moving the offset to the offset
argument did work, and produces the results for the differences in predictions for each group from the reference group (dydx
in Stata), but we’ll visit this type of result later3. However, the offset
argument is not available to glm.nb, so we’re stuck for now.
pois_glm_offset = glm(
claims ~ age,
data = insurance,
family = 'poisson',
offset = ln_holders
)
margins(
pois_glm_offset,
variables = 'age',
type = "response" # default
)
age25-29 age30-35 age>35
-10.32 -18.46 -28.79
expand.grid(
age = levels(insurance$age),
ln_holders = insurance$ln_holders
) %>%
mutate(prediction = predict(pois_glm_offset, newdata = ., type = 'response')) %>%
group_by(age) %>%
summarise(avg_prediction = mean(prediction)) %>%
mutate(diff = avg_prediction - avg_prediction[1])
# A tibble: 4 x 3
age avg_prediction diff
<fct> <dbl> <dbl>
1 <25 73.4 0
2 25-29 63.1 -10.3
3 30-35 55.0 -18.5
4 >35 44.7 -28.8
In Stata, with categorical values we can also use the over
approach. What do we get in this case?
margins, over(age)
Predictive margins Number of obs = 64
Model VCE : OIM
Expression : Predicted number of events, predict()
over : age
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
<25 | 14.47052 1.322232 10.94 0.000 11.879 17.06205
25-29 | 26.16218 1.974682 13.25 0.000 22.29188 30.03249
30-35 | 29.67738 2.160145 13.74 0.000 25.44358 33.91119
>35 | 141.0984 7.941774 17.77 0.000 125.5328 156.6639
------------------------------------------------------------------------------
These are very different from our previous results for Stata, so what’s happening here?
This can be duplicated with the predict function as follows. While similar to the previous approach, here only the observed values of the offset for each group are used. We then make predictions for all values of the data and average them by group.
predictions_over = insurance %>%
group_by(age) %>%
group_modify(
~ data.frame(
prediction = mean(
predict(nb_glm_offset, newdata = ., type = 'response')
)
),
keep = TRUE
)
age | prediction |
---|---|
<25 | 14.471 |
25-29 | 26.162 |
30-35 | 29.677 |
>35 | 141.098 |
The pattern is actually in the opposite direction, which is unexpected, but probably just reflects the fact that we just don’t have much data. However, it’s good to note that these respective approaches would not necessarily tell you the same thing.
I currently don’t know of an equivalence for emmeans in this offset case, and initial searches didn’t turn up much, though it is hard to distinguish specific ‘average predictions’ from many other similar scenarios. I attempted the following, which keeps the values of ln_holders
, but it only keeps unique ones, and it’s not reproducing what I would expect, although it implies that it is averaging over the offset values.
rg = ref_grid(nb_glm_offset, cov.keep = c('ln_holders'))
# type is ignored for some reason, so we exponentiate after
em_over = emmeans(rg, ~age, type = 'response')
data.frame(em_over) %>%
mutate(count = exp(emmean))
age emmean SE df asymp.LCL asymp.UCL count
1 <25 2.055962 0.09071069 Inf 1.878173 2.233752 7.814354
2 25-29 2.835734 0.07457136 Inf 2.689577 2.981892 17.042911
3 30-35 3.010866 0.07161826 Inf 2.870497 3.151236 20.304984
4 >35 4.574641 0.05450461 Inf 4.467813 4.681468 96.993166
The over approach for the margins package is not explicitly supported. The package author states:
At present, margins() does not implement the over option. The reason for this is also simple: R already makes data subsetting operations quite simple using simple
[
extraction. If, for example, one wanted to calculate marginal effects on subsets of a data frame, those subsets can be passed directly to margins() via the data argument (as in a call to prediction()).
It would look something like the following, but we still have the offset problem for this negative binomial class, so I don’t show a result.
insurance %>%
group_by(age) %>%
group_map(~margins(nb_glm_offset, .), keep = TRUE)
Sometimes people want differences as you move from one level (e.g. the reference level) to the next for some covariate, the ‘average marginal effect’. In Stata this is obtained with the dydx
option.
margins, dydx(age)
Conditional marginal effects Number of obs = 64
Model VCE : OIM
Expression : Predicted number of events, predict()
dy/dx w.r.t. : 2.age 3.age 4.age
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
25-29 | -8.854149 8.375143 -1.06 0.290 -25.26913 7.56083
30-35 | -16.62179 7.959339 -2.09 0.037 -32.2218 -1.021769
>35 | -25.4371 7.297716 -3.49 0.000 -39.74036 -11.13383
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
In R, we can get this from our initial predictions that used all offset values by just taking the differences in the predicted values.
predictions %>%
mutate(dydx = avg_prediction - avg_prediction[1])
age | avg_prediction | dydx |
---|---|---|
<25 | 74.257 | 0.000 |
25-29 | 65.403 | -8.854 |
30-35 | 57.635 | -16.622 |
>35 | 48.820 | -25.437 |
Now we’ll consider a continuous covariate. Here we’ll again just focus on a simple example where we rerun the model, but with age as numeric rather than binned4. For comparison we’ll set the numeric age values at roughly the midpoint of the binned categories. We can do this using the at
option.
margins, at(age_num = (21, 27, 32, 50))
Negative binomial regression Number of obs = 64
LR chi2(1) = 12.58
Dispersion = mean Prob > chi2 = 0.0004
Log likelihood = -218.76542 Pseudo R2 = 0.0280
------------------------------------------------------------------------------
claims | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age_num | -.0079258 .0020657 -3.84 0.000 -.0119746 -.0038771
_cons | -1.521278 .0919801 -16.54 0.000 -1.701556 -1.341001
ln_holders | 1 (offset)
-------------+----------------------------------------------------------------
/lnalpha | -3.191873 .3194707 -3.818025 -2.565722
-------------+----------------------------------------------------------------
alpha | .0410948 .0131286 .0219712 .0768636
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 69.85 Prob >= chibar2 = 0.000
Adjusted predictions Number of obs = 64
Model VCE : OIM
Expression : Predicted number of events, predict()
1._at : age_num = 21
2._at : age_num = 27
3._at : age_num = 32
4._at : age_num = 50
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 67.50043 3.7343 18.08 0.000 60.18133 74.81952
2 | 64.36558 3.025254 21.28 0.000 58.43619 70.29497
3 | 61.86471 2.5763 24.01 0.000 56.81526 66.91417
4 | 53.63947 2.270244 23.63 0.000 49.18988 58.08907
------------------------------------------------------------------------------
Again, we can duplicate this with the basic predict function. We just predict at that value of the covariate for all values of the offset, and get the average prediction as we did before.
nb_glm_offset_cont = MASS::glm.nb(claims ~ age_num + offset(ln_holders), data = insurance)
predictions = expand.grid(
age_num = c(21, 27, 32, 50),
ln_holders = insurance$ln_holders
) %>%
mutate(pred =
predict(
nb_glm_offset_cont,
newdata = .,
type = 'response'
)
) %>%
group_by(age_num) %>%
summarise(average_prediction = mean(pred))
age_num | average_prediction |
---|---|
21 | 67.500 |
27 | 64.366 |
32 | 61.865 |
50 | 53.639 |
We can also get the dydx
for the continuous covariate, which is the derivative of the target with respect to the covariate. In linear models, this is just the regression coefficient, but here we have to do things a little differently.
margins, dydx(age_num)
Average marginal effects Number of obs = 64
Model VCE : OIM
Expression : Predicted number of events, predict()
dy/dx w.r.t. : age_num
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age_num | -.4255242 .1050737 -4.05 0.000 -.6314649 -.2195836
------------------------------------------------------------------------------
As noted for the categorical case, this value is the average marginal effect. As the Stata reference describes:
It is not necessarily true that dydx() = 0.5 means that “y increases by 0.5 if x increases by 1”. It is true that “y increases with x at a rate such that, if the rate were constant, y would increase by 0.5 if x increased by 1”
This qualified interpretation may not be of much value in contexts where the rate is not constant, but we can still see what Stata is doing.
For dydx
, when it comes to continuous covariates, there isn’t an obvious change in the covariate to use (i.e. the dx
) to evaluate at each point, as is the case with categorical variables, which can use a reference group. So what we do is use a small arbitrary difference (\(\epsilon\)) for the covariate at its observed values, get the predictions for the values above and below the observed value, and then average those differences in predicted values. For comparison to Stata, I set \(\epsilon\) to the value used by the margins
command. Note that we are only using the observed values for the offset.
h = function(x, epsilon = 1e-5) max(abs(x), 1) * sqrt(epsilon)
age_dx_plus = insurance %>%
select(age_num, ln_holders) %>%
mutate(age_num = age_num + h(age_num))
age_dx_minus = insurance %>%
select(age_num, ln_holders) %>%
mutate(age_num = age_num - h(age_num))
predictions_dydx =
tibble(
dy =
predict(nb_glm_offset_cont, age_dx_plus, type = 'response') -
predict(nb_glm_offset_cont, age_dx_minus, type = 'response'),
dx = age_dx_plus$age_num - age_dx_minus$age_num,
dydx = dy/dx
)
# summarise(predictions_dydx, ame = mean(dydx))
ame |
---|
-0.42552 |
So we just get predictions for a small difference in age for each value of age, and average that difference in predictions.
The emmeans package is primarily geared toward factor variables, but does have support for numeric variables interacting with factors. However, this isn’t what we’re really looking for here.
We can however use the margins package for this, and it provides the same result as before. For whatever reason, it doesn’t have an issue with the offset if we use the lower level dydx function.
dydx(
insurance,
nb_glm_offset_cont,
variable = 'age_num',
eps = 1e-5
) %>%
summarise(ame = mean(dydx_age_num))
ame |
---|
-0.42552 |
For more on the dydx
case for continuous variables in general, see the resources.
Obviously models will have more than one covariate, and in the particular case that was brought to my attention, there were also random effects. I may explore more in the future, but the general result should hold in those circumstances. As a quick example5, we can get the same age results for both, by getting the age group predictions with all values of the dataset (not just the offset).
nb_glm_offset_full = MASS::glm.nb(
claims ~ age + group + district + offset(ln_holders),
data = insurance
)
summary(nb_glm_offset_full)
Call:
MASS::glm.nb(formula = claims ~ age + group + district + offset(ln_holders),
data = insurance, init.theta = 449932.3841, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.46551 -0.50795 -0.03196 0.55554 1.94022
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82174 0.07679 -23.723 < 2e-16 ***
age25-29 -0.19101 0.08286 -2.305 0.021155 *
age30-35 -0.34495 0.08138 -4.239 2.25e-05 ***
age>35 -0.53667 0.06996 -7.671 1.70e-14 ***
group1-1.5l 0.16133 0.05054 3.192 0.001412 **
group1.5-2l 0.39281 0.05500 7.142 9.23e-13 ***
group>2l 0.56341 0.07232 7.791 6.67e-15 ***
district2 0.02587 0.04302 0.601 0.547637
district3 0.03853 0.05052 0.763 0.445686
district4 0.23421 0.06168 3.797 0.000146 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(449932.4) family taken to be 1)
Null deviance: 236.212 on 63 degrees of freedom
Residual deviance: 51.416 on 54 degrees of freedom
AIC: 390.74
Number of Fisher Scoring iterations: 1
Theta: 449932
Std. Err.: 4185430
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -368.745
nbreg claims i.age i.group i.district, offset(ln_holders) nolog
margins age
Negative binomial regression Number of obs = 64
LR chi2(9) = 81.37
Dispersion = mean Prob > chi2 = 0.0000
Log likelihood = -184.37077 Pseudo R2 = 0.1808
------------------------------------------------------------------------------
claims | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
25-29 | -.1910093 .0828564 -2.31 0.021 -.3534049 -.0286137
30-35 | -.3449496 .0813741 -4.24 0.000 -.5044399 -.1854592
>35 | -.5366602 .0699556 -7.67 0.000 -.6737706 -.3995497
|
group |
1-1.5l | .1613446 .0505323 3.19 0.001 .0623031 .260386
>2l | .5634109 .0723153 7.79 0.000 .4216755 .7051462
|
district |
2 | .0258648 .0430156 0.60 0.548 -.0584443 .1101739
3 | .0385174 .0505114 0.76 0.446 -.0604832 .137518
4 | .2341969 .0616732 3.80 0.000 .1133197 .3550742
|
_cons | -1.821741 .0767876 -23.72 0.000 -1.972242 -1.67124
ln_holders | 1 (offset)
-------------+----------------------------------------------------------------
/lnalpha | -19.19245 546.8434 -1090.986 1052.601
-------------+----------------------------------------------------------------
alpha | 4.62e-09 2.53e-06 0 .
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 7.1e-06 Prob >= chibar2 = 0.499
Predictive margins Number of obs = 64
Model VCE : OIM
Expression : Predicted number of events, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
<25 | 76.39696 5.069345 15.07 0.000 66.46123 86.33269
25-29 | 63.11343 3.143376 20.08 0.000 56.95253 69.27433
30-35 | 54.10861 2.544524 21.26 0.000 49.12143 59.09579
>35 | 44.66913 .9835578 45.42 0.000 42.7414 46.59687
------------------------------------------------------------------------------
To do this with predict, we make predictions for all observed values as if they were at each level of age. Then we average them for each age group, just like we did before.
predictions_full_model =
map_df(1:4, function(i) mutate(insurance, age = levels(age)[i])) %>%
mutate(
age = factor(age, levels(insurance$age)), # convert back to factor
prediction = predict(nb_glm_offset_full, newdata = ., type = 'response')
) %>%
group_by(age) %>%
summarise(avg_prediction = mean(prediction))
age | avg_prediction |
---|---|
<25 | 76.397 |
25-29 | 63.113 |
30-35 | 54.109 |
>35 | 44.669 |
To summarize R’s capabilities with Stata-like margins with models using an offset, we have a few options we can note. First, we can get the values using the predict method. Then there are the packages to help with getting these types of predictions. margins explicitly attempts to replicate Stata-like margins for standard and some more complex models, but there doesn’t appear to be documentation on how the offset is dealt with by default. Furthermore, care must be taken if it isn’t an explicitly supported model. As we have also seen, emmeans provides many predictions of the sort discussed here, supports many more models, produces results in a nice format, and has plotting capabilities. However, it’s mostly suited toward factor variables.
Beyond those, ggeffects uses predict and emmeans under the hood, so offers a nice way to do the same sorts of things, but with a more viable plot as a result. Other packages and functions are available for specific settings. For example, conditional_effects in the brms package provides predictions and visualization for the bayesian setting.
Hopefully this will clarify the discrepancies between R and Stata with models using an offset. Honestly, I pretty much always use the predict function with my specified data values because I know what it’s doing and I can understand the results without hesitation regardless of model or package used. Furthermore, if one knows their data at all, it should be possible to specify covariate values that are meaningful pretty easily. On the other hand, getting predictions at averages can cause conceptual issues with categorical variables in many settings, and getting average effects often also can be hard to interpret (e.g. nonlinear relationships).
One thing you don’t get with some of the averaged predictions using the predict function are interval estimates, but this could be obtained via bootstrapping. Otherwise, most predict methods provide the standard error for a prediction with an additional argument (e.g. se.fit = TRUE
), so if you getting predictions at key values of the variables it is trivial to get the interval estimate. In general, most R packages are just using predict under the hood, so being familiar with it will likely get you what you need on its own.
Marginal Effects- Rich Williams notes- 1
Marginal Effects- Rich Williams notes- 2
Marginal Effects- Rich Williams notes- 3
Marginal Effects Stata Article by Rich Williams
Josh Errickson’s comparisons of Stata, emmeans, and ggeffects
UCLA IDRE FAQ (Margins command section)
Stata FAQ (based on old mfx command)
Just for giggles, I did an average marginal effect for a GAM, though I find little utility in it for the relationship shown. Confirmed via gratia and margins.
library(mgcv)
set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
Gu & Wahba 4 term additive model
b <- gam(y ~ s(x2), data = dat)
visibly::plot_gam(b)
# set change step
h = 1e-5
b_dx_plus = dat %>%
select(x2) %>%
mutate(x2 = x2 + h)
b_dx_minus = dat %>%
select(x2) %>%
mutate(x2 = x2 - h)
predictions_dydx =
tibble(
x2 = dat$x2,
dy =
predict(b, b_dx_plus, type = 'response') -
predict(b, b_dx_minus, type = 'response'),
dx = b_dx_plus$x2 - b_dx_minus$x2,
dydx = dy/dx
)
gratia_result = gratia::derivatives(b, newdata = dat, eps = h)
margins_result = margins::dydx(dat, b, variable = 'x2', eps = h^2) # note that margins uses the h function specified previously
all.equal(as.numeric(predictions_dydx$dydx), gratia_result$derivative)
[1] "Mean relative difference: 5.596277e-05"
all.equal(as.numeric(predictions_dydx$dydx), as.numeric(margins_result$dydx_x2))
[1] TRUE
summarise(
predictions_dydx,
ame = mean(dydx),
ame_gratia = mean(gratia_result$derivative),
ame_margins = mean(margins_result$dydx_x2)
)
# A tibble: 1 x 3
ame ame_gratia ame_margins
<dbl> <dbl> <dbl>
1 1.76 1.76 1.76
Not the least of which is that most outside of econometrics don’t call predictions margins, since these days we aren’t adding results to the margin of a hand-written table.↩︎
For those in the R world, the i.age
tells Stata to treat the age factor as, well, a factor. Stata’s alpha
is 1/theta
from R’s output.↩︎
The margins package does do predictions rather than the marginal effects, but it, like others, is just a wrapper for the predict method, and doesn’t appear to average them, so I don’t demonstrate that.↩︎
There is rarely a justifiable reason to discretize age as near as I can tell, and doing so inevitably results in less satisfying and less reliable conclusions.↩︎
There is a weird print issue where the Stata output isn’t showing the coefficient for one of the levels of group
, but the model is correct and was verified directly using Stata.↩︎
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com//m-clark/m-clark.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Clark (2020, June 16). Michael Clark: Predictions with an offset. Retrieved from https://m-clark.github.io/posts/2020-06-15-predict-with-offset/
BibTeX citation
@misc{clark2020predictions, author = {Clark, Michael}, title = {Michael Clark: Predictions with an offset}, url = {https://m-clark.github.io/posts/2020-06-15-predict-with-offset/}, year = {2020} }