Bayesian IRT

The following demonstrates code for one- through four-parameter IRT models, though only the first two are explored in any detail. You can learn more about IRT models in general in my structural equation modeling document.

One Parameter IRT

Data Setup

This data set has the responses of 316 participants to 24 items of a questionnaire on verbal aggression. Other covariates are also provided. For simplicity I will focus on the four ‘Shout’ items for the ‘self’ situation.

library(tidyverse)

data("VerbAgg", package = "lme4")

glimpse(VerbAgg)
Rows: 7,584
Columns: 9
$ Anger  <int> 20, 11, 17, 21, 17, 21, 39, 21, 24, 16, 15, 18, 36, 22, 16, 18, 23, 16, 21, 25, 22, 15, 26, 13, 33, 17, 17, 22, 21, 17, 19, 18, 33, 19, 25, 17, 12, 14, 25, 22, 20, 25, 12, 16, 23, 19, 22, 15, 25, 35, 24…
$ Gender <fct> M, M, F, F, F, F, F, F, F, F, F, F, M, M, F, F, F, F, F, F, F, F, F, F, F, F, F, F, M, F, F, M, M, F, F, F, F, F, M, M, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, M, F, F, F, F, F, F, F, F, F…
$ item   <fct> S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantCurse, S1WantC…
$ resp   <ord> no, no, perhaps, perhaps, perhaps, yes, yes, no, no, yes, perhaps, yes, yes, yes, perhaps, perhaps, perhaps, perhaps, no, perhaps, perhaps, yes, yes, perhaps, no, no, yes, yes, perhaps, yes, yes, perhap…
$ id     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,…
$ btype  <fct> curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse, curse,…
$ situ   <fct> other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other, other,…
$ mode   <fct> want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want, want…
$ r2     <fct> N, N, Y, Y, Y, Y, Y, N, N, Y, Y, Y, Y, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, Y, N, N, Y, Y, Y, Y, Y, Y, N, Y, Y, Y, N, N, N, N, Y, N, Y, N, N, Y, Y, N, N, Y, N, N, Y, Y, N, Y, Y, Y, N, N, Y, Y, Y, Y, Y, N, Y, Y…
verbagg_items = VerbAgg %>% 
  filter(btype == 'shout', situ == 'self') %>% 
  select(id, item, r2)

head(verbagg_items)
  id        item r2
1  1 S3WantShout  Y
2  2 S3WantShout  N
3  3 S3WantShout  N
4  4 S3WantShout  N
5  5 S3WantShout  N
6  6 S3WantShout  Y
verbagg_items_wide = verbagg_items %>% 
  pivot_wider(id_cols = id, names_from = item, names_prefix = 'item_', values_from = r2)

head(verbagg_items_wide)
# A tibble: 6 x 5
  id    item_S3WantShout item_S4WantShout item_S3DoShout item_S4DoShout
  <fct> <fct>            <fct>            <fct>          <fct>         
1 1     Y                N                N              Y             
2 2     N                N                N              N             
3 3     N                N                N              N             
4 4     N                N                N              N             
5 5     N                N                N              N             
6 6     Y                N                N              N             

While we often think of such data in ‘wide’ form, with one row per person and a column for each item (the subsequent Stan code will use that format), the long format is generally both tidier and more straightforward for modeling, as it allows standard mixed model approaches (a quick lme4 sketch follows the equations below). In that case, r2 is the target variable of interest.

In the long format, the model for a single response is as follows, where \(Z_p\) is the latent score for person \(p\), \(\beta_i\) is the difficulty of item \(i\), and disc is a discrimination parameter held constant across items.

\[\textrm{logit}(\pi) = \textrm{disc}\cdot(Z_p - \beta_i)\]

Another formulation is the following, and corresponds to what brms will use.

\[\textrm{logit}(\pi) = \beta_i + \textrm{disc}\cdot Z_p\]
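As a quick non-Bayesian check of this setup, the long-format data can be fit as a standard logistic mixed model, for example with lme4. This is only a sketch: the item fixed effects correspond to the \(\beta_i\) above, and the estimated standard deviation of the person intercepts plays the role of disc.

library(lme4)

glmer_1pm = glmer(
  r2 ~ 0 + item + (1 | id),
  data   = verbagg_items,
  family = binomial
)

summary(glmer_1pm)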

Model Code

data {
  int N;                               // Number of people
  int J;                               // Number of items
  int Y[N,J];                          // Binary Target
}

parameters {
  vector[J] difficulty;                // Item difficulty
  real<lower = 0> discrim;             // Item discrimination (constant)
  vector[N] Z;                         // Latent person ability
}

model {
  matrix[N, J] lmat;
  
  // priors
  Z ~ normal(0, 1);
  
  discrim    ~ student_t(3, 0, 5);
  difficulty ~ student_t(3, 0, 5);

  for (j in 1:J){
    lmat[,j] = discrim * (Z - difficulty[j]);
  }
  
  // likelihood
  for (j in 1:J)  Y[,j] ~ bernoulli_logit(lmat[,j]);
  
}

Estimation

First we create a Stan-friendly data list and then estimate the model. The following assumes the previous model code has been compiled to a stanmodel object (bayes_irt1_model), e.g. via stan_model() from a character string or file.
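For instance, a minimal sketch of the compilation step, assuming the Stan code above is stored in a character string named bayes_irt1_code:

# compile once; sampling() expects the resulting stanmodel object
bayes_irt1_model = rstan::stan_model(model_code = bayes_irt1_code)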

verbagg_items_wide_mat = apply(
  as.matrix(verbagg_items_wide[, -1]) == 'Y',
  2,
  as.integer
)

stan_data = 
  list(
    N = nrow(verbagg_items_wide_mat),
    J = ncol(verbagg_items_wide_mat),
    Y = verbagg_items_wide_mat
  )

library(rstan)

fit_1pm = sampling(
  bayes_irt1_model,
  data = stan_data,
  thin = 4
)

Comparison

Now we compare to brms. I use the brms author’s (Bürkner’s) IRT article as a guide for this model, and note again that it follows the second parameterization depicted above.

library(brms)

# half-normal for the SD parameter (brms bounds it at zero), normal for coefficients
prior_1pm <-
  prior("normal(0, 3)", class = "sd", group = "id") +
  prior("normal(0, 3)", class = "b")

brms_1pm = brm(
  r2 ~ 0 + item + (1 | id),
  data   = verbagg_items,
  family = bernoulli,
  prior  = prior_1pm,
  thin   = 4,
  cores  = 4
)

If you want to compare to standard IRT in either parameterization, you can use the ltm package.

library(ltm)

irt_rasch_par1 = rasch(verbagg_items_wide_mat, IRT.param = FALSE)
irt_rasch_par2 = rasch(verbagg_items_wide_mat, IRT.param = TRUE)
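The estimates under either parameterization can then be inspected with coef():

coef(irt_rasch_par1)
coef(irt_rasch_par2)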
print(
  fit_1pm,
  digits = 3,
  par    = c('discrim', 'difficulty'),
  probs  = c(.025, .5, 0.975)
)
Inference for Stan model: 0715d0cee215483765f01043abb55ea9.
4 chains, each with iter=2000; warmup=1000; thin=4; 
post-warmup draws per chain=250, total post-warmup draws=1000.

               mean se_mean    sd  2.5%   50% 97.5% n_eff  Rhat
discrim       1.859   0.007 0.197 1.502 1.852 2.250   812 1.001
difficulty[1] 0.968   0.004 0.121 0.750 0.965 1.215   913 1.001
difficulty[2] 0.665   0.004 0.106 0.460 0.663 0.874   788 1.000
difficulty[3] 1.854   0.006 0.176 1.534 1.846 2.237   844 1.005
difficulty[4] 1.255   0.005 0.137 1.003 1.250 1.524   804 0.998

Samples were drawn using NUTS(diag_e) at Wed Nov 25 17:39:05 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
summary(brms_1pm)
 Family: bernoulli 
  Links: mu = logit 
Formula: r2 ~ 0 + item + (1 | id) 
   Data: verbagg_items (Number of observations: 1264) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 4;
         total post-warmup samples = 1000

Group-Level Effects: 
~id (Number of levels: 316) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.90      0.21     1.53     2.32 1.00      832      890

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
itemS3WantShout    -1.79      0.22    -2.25    -1.37 1.00      947      883
itemS4WantShout    -1.23      0.21    -1.66    -0.83 1.01      860      809
itemS3DoShout      -3.43      0.31    -4.08    -2.86 1.00      907      914
itemS4DoShout      -2.35      0.25    -2.86    -1.90 1.00      812      852

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
brms_diff = fixef(brms_1pm)[,'Estimate']
brms_discrim = VarCorr(brms_1pm)$id$sd[1]
fit_params = summary(fit_1pm, digits = 3, par = c('discrim', 'difficulty'))$summary[,'mean']

After extracting, we can show either parameterization for either model. For example, the brms item coefficients equal \(-\textrm{disc}\cdot\textrm{difficulty}\) from our model.
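A minimal sketch of the conversion, using the objects just extracted; this produces the difficulty rows of the model_par2 and brms_par1 columns in the table below.

# our model in the brms (second) parameterization
model_par2 = -fit_params['discrim'] * fit_params[-1]

# brms in the first parameterization
brms_par1  = -brms_diff / brms_discrim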

# A tibble: 5 x 5
  param         model  brms model_par2 brms_par1
  <chr>         <dbl> <dbl>      <dbl>     <dbl>
1 discrim       1.86   1.90       1.86     1.90 
2 difficulty[1] 0.968 -1.79      -1.80     0.942
3 difficulty[2] 0.665 -1.23      -1.24     0.647
4 difficulty[3] 1.85  -3.43      -3.45     1.81 
5 difficulty[4] 1.26  -2.35      -2.33     1.23 

Two Parameter IRT

Now we can try a two-parameter model. Data setup is the same as before.

Model Code

data {
  int N;
  int J;
  int Y[N, J];
}

parameters {
  vector[J] difficulty;
  vector<lower = 0>[J] discrim;    // Now per-item discrimination
  vector[N] Z;
}

model {
  matrix[N, J] lmat;

  // priors
  Z ~ normal(0, 1);
  
  discrim    ~ student_t(3, 0, 5);
  difficulty ~ student_t(3, 0, 5);
  
  for (j in 1:J){
    lmat[,j] = discrim[j] * (Z - difficulty[j]);
  }

  // likelihood
  for (j in 1:J)  Y[,j] ~ bernoulli_logit(lmat[,j]);

}

Estimation

First, our custom Stan model. The following again assumes the previous model code has been compiled to a stanmodel object (bayes_irt2_model).

library(rstan)

fit_2pm = sampling(
  bayes_irt2_model,
  data = stan_data,
  thin = 4,
  iter = 4000,
  warmup = 3000,
  cores = 4,
  control = list(adapt_delta = .99)
)

Comparison

Now we compare to brms. I again use the brms author’s (Bürkner’s) IRT article as a guide for this model, and note that it follows the second parameterization. It took a little over 30 seconds on my machine, though your mileage may vary.

library(brms)

# normal priors for the nonlinear coefficients and (half-normal) SD parameters;
# the person SD is fixed to 1 for identification
prior_2pm <-
  prior("normal(0, 5)", class = "b",                  nlpar = "Z") +
  prior("normal(0, 5)", class = "b",                  nlpar = "logdiscr") +
  prior("constant(1)",  class = "sd", group = "id",   nlpar = "Z") +
  prior("normal(0, 3)", class = "sd", group = "item", nlpar = "Z") +
  prior("normal(0, 3)", class = "sd", group = "item", nlpar = "logdiscr")

formula_2pm = bf(
  r2 ~ exp(logdiscr) * Z,
  Z  ~ 1 + (1 |i| item)  + (1 | id),
  logdiscr ~ 1 + (1 |i| item),
  nl = TRUE
)
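Written out, this nonlinear specification is

\[\textrm{logit}(\pi) = e^{\textrm{logdiscr}_i}(b_0 + b_i + Z_p)\]

so \(\textrm{disc}_i = e^{\textrm{logdiscr}_i}\) and \(\beta_i = -(b_0 + b_i)\), with the person standard deviation fixed to 1 for identification (the constant(1) prior above).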

brms_2pm = brm(
  formula_2pm,
  data    = verbagg_items,
  family  = bernoulli,
  prior   = prior_2pm,
  thin    = 4,
  iter    = 4000,
  warmup  = 3000,
  cores   = 4,
  control = list(adapt_delta = .99, max_treedepth = 15)
)
print(
  fit_2pm,
  digits = 3,
  par    = c('discrim', 'difficulty'),
  probs  = c(.025, .5, 0.975)
)
Inference for Stan model: 503e7ed2dc96cbd6b316fe17abc2a86f.
4 chains, each with iter=4000; warmup=3000; thin=4; 
post-warmup draws per chain=250, total post-warmup draws=1000.

               mean se_mean    sd  2.5%   50%  97.5% n_eff  Rhat
discrim[1]    1.033   0.008 0.249 0.599 1.003  1.567   891 1.003
discrim[2]    3.026   0.098 1.630 1.603 2.632  7.436   279 1.000
discrim[3]    1.462   0.014 0.389 0.803 1.425  2.321   784 0.998
discrim[4]    4.461   0.158 3.688 1.882 3.609 12.287   544 1.002
difficulty[1] 1.420   0.011 0.314 0.946 1.376  2.207   866 0.998
difficulty[2] 0.595   0.004 0.106 0.401 0.589  0.815   738 1.000
difficulty[3] 2.196   0.017 0.464 1.595 2.115  3.144   703 0.998
difficulty[4] 1.034   0.006 0.129 0.820 1.021  1.311   513 1.003

Samples were drawn using NUTS(diag_e) at Wed Nov 25 17:40:25 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
summary(brms_2pm)
 Family: bernoulli 
  Links: mu = logit 
Formula: r2 ~ exp(logdiscr) * Z 
         Z ~ 1 + (1 | i | item) + (1 | id)
         logdiscr ~ 1 + (1 | i | item)
   Data: verbagg_items (Number of observations: 1264) 
Samples: 4 chains, each with iter = 4000; warmup = 3000; thin = 4;
         total post-warmup samples = 1000

Group-Level Effects: 
~id (Number of levels: 316) 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Z_Intercept)     1.00      0.00     1.00     1.00 1.00     1000     1000

~item (Number of levels: 4) 
                                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Z_Intercept)                         1.09      0.73     0.33     3.07 1.00      706      816
sd(logdiscr_Intercept)                  0.94      0.79     0.05     2.87 1.00      515      648
cor(Z_Intercept,logdiscr_Intercept)     0.18      0.50    -0.87     0.91 1.00      781      758

Population-Level Effects: 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Z_Intercept           -1.22      0.67    -2.49     0.11 1.00      733      710
logdiscr_Intercept     0.66      0.56    -0.36     2.04 1.00      896      654

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
brms_diff    = coef(brms_2pm)$item[,,'Z_Intercept'][,'Estimate']
brms_discrim = exp(coef(brms_2pm)$item[,,'logdiscr_Intercept'][,'Estimate'])

fit_diff    = summary(fit_2pm, digits = 3, par = 'difficulty')$summary[,'mean']
fit_discrim = summary(fit_2pm, digits = 3, par = 'discrim')$summary[,'mean']
# A tibble: 8 x 3
  param         model   brms
  <chr>         <dbl>  <dbl>
1 discrim[1]    1.03   1.15 
2 discrim[2]    3.03   2.39 
3 discrim[3]    1.46   1.62 
4 discrim[4]    4.46   3.24 
5 difficulty[1] 1.42  -1.30 
6 difficulty[2] 0.595 -0.622
7 difficulty[3] 2.20  -1.98 
8 difficulty[4] 1.03  -1.09 
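Since the brms item effects are locations in the second parameterization, negating them puts them on the same scale as our model’s difficulties. A quick sketch:

cbind(
  difficulty      = fit_diff,
  brms_difficulty = -brms_diff,
  discrim         = fit_discrim,
  brms_discrim    = brms_discrim
)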

Here is the corresponding non-Bayesian demo, if interested.

library(ltm)

irt_2pm_par1 = ltm(verbagg_items_wide_mat ~ z1, IRT.param = FALSE)
irt_2pm_par2 = ltm(verbagg_items_wide_mat ~ z1, IRT.param = TRUE)

coef(irt_2pm_par1)
coef(irt_2pm_par2)

Three Parameter IRT

For the three-parameter model I only show the Stan code. This model adds to the two-parameter model a per-item guessing parameter, which serves as a lower bound on the response probability.
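In equation form, with guessing parameter \(g_i\):

\[\pi = g_i + (1 - g_i)\,\textrm{logit}^{-1}\big(\textrm{disc}_i(Z_p - \beta_i)\big)\]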

data {
  int N;
  int J;
  int Y[N,J];
}

parameters {
  vector[J] difficulty;
  vector<lower = 0>[J] discrim;
  vector<lower = 0, upper = .25>[J] guess;
  vector[N] Z;
}

model {
  matrix[N, J] pmat;

  // priors
  Z ~ normal(0, 1);
  
  discrim    ~ student_t(3, 0, 5);
  difficulty ~ student_t(3, 0, 5);
  
  guess ~ beta(1, 19);

  for (j in 1:J){
    pmat[,j] = guess[j] + (1 - guess[j]) * inv_logit(discrim[j] * (Z - difficulty[j]));
  }

  // likelihood
  for (j in 1:J)  Y[,j] ~ bernoulli(pmat[,j]);
  
}
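Estimation would proceed as before. A sketch, assuming the code above has been compiled to a stanmodel object bayes_irt3_model:

fit_3pm = sampling(
  bayes_irt3_model,
  data    = stan_data,
  thin    = 4,
  iter    = 4000,
  warmup  = 3000,
  cores   = 4,
  control = list(adapt_delta = .99)
)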

Four Parameter IRT

For the four-parameter model I only show the Stan code. This model adds to the three-parameter model a per-item ceiling parameter, which serves as an upper bound on the response probability.
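In equation form, with ceiling \(c_i\):

\[\pi = g_i + (c_i - g_i)\,\textrm{logit}^{-1}\big(\textrm{disc}_i(Z_p - \beta_i)\big)\]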

data {
  int N;
  int J;
  int Y[N,J];
}

parameters {
  vector[J] difficulty;
  vector<lower = 0>[J] discrim;
  vector<lower = 0, upper = .25>[J] guess;
  vector<lower = .95, upper = 1>[J] ceiling;
  vector[N] Z;
}

model {
  matrix[N, J] pmat;

  // priors
  Z ~ normal(0, 1);
  
  discrim    ~ student_t(3, 0, 5);
  difficulty ~ student_t(3, 0, 5);
  
  guess   ~ beta(1, 19);
  ceiling ~ beta(49, 1);

  for (j in 1:J){
    pmat[,j] = guess[j] + (ceiling[j] - guess[j])
               * inv_logit(discrim[j] * (Z - difficulty[j]));
  }

  // likelihood
  for (j in 1:J)  Y[,j] ~ bernoulli(pmat[,j]);

}
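To visualize what the guessing and ceiling parameters do, here is a small illustration of a four-parameter item characteristic curve in base R. The parameter values are made up for demonstration.

z = seq(-4, 4, length.out = 100)

# hypothetical item parameters
guess      = .10
ceil       = .98
discrim    = 1.5
difficulty = 0.5

p = guess + (ceil - guess) * plogis(discrim * (z - difficulty))

plot(z, p, type = 'l', ylim = c(0, 1), xlab = 'Z', ylab = 'P(y = 1)')
abline(h = c(guess, ceil), lty = 2)  # lower and upper asymptotes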