Comparison between pystan and numpyro in Bayesian item response theory: evaluation of agreement of estimated latent parameters and sampling performance

Purpose The purpose of this study is to compare two libraries dedicated to the Markov chain Monte Carlo method: pystan and numpyro. In the comparison, we mainly focused on the agreement of estimated latent parameters and the performance of sampling using the Markov chain Monte Carlo method in Bayesian item response theory (IRT). Materials and methods Bayesian 1PL-IRT and 2PL-IRT were implemented with pystan and numpyro. Then, the Bayesian 1PL-IRT and 2PL-IRT were applied to two types of medical data obtained from a published article. The same prior distributions of latent parameters were used in both pystan and numpyro. Estimation results of latent parameters of 1PL-IRT and 2PL-IRT were compared between pystan and numpyro. Additionally, the computational cost of the Markov chain Monte Carlo method was compared between the two libraries. To evaluate the computational cost of IRT models, simulation data were generated from the medical data and numpyro. Results For all the combinations of IRT types (1PL-IRT or 2PL-IRT) and medical data types, the mean and standard deviation of the estimated latent parameters were in good agreement between pystan and numpyro. In most cases, the sampling time using the Markov chain Monte Carlo method was shorter in numpyro than that in pystan. When the large-sized simulation data were used, numpyro with a graphics processing unit was useful for reducing the sampling time. Conclusion Numpyro and pystan were useful for applying the Bayesian 1PL-IRT and 2PL-IRT. Our results show that the two libraries yielded similar estimation result and that regarding to sampling time, the fastest libraries differed based on the dataset size.


INTRODUCTION
Item response theory (IRT) is a statistical framework used for analyzing test results and evaluating test items and test takers quantitatively.While IRT is commonly used in educational and psychological research (Embretson & Reise, 2000;Hays, Morales & Reise, 2000;Cappelleri, Jason Lundy & Hays, 2014), there are several applications of IRT to medical research.For example, Choi et al. (2010) used IRT for constructing the computer adaptive testing system of short-form patient-reported outcome measures with the data from the Patient-Reported Outcomes Measurement Information System project.Gershon et al. (2012) used IRT to build the quality of life item banks for adults with neurological disorders.The most notable example of computer adaptive testing system using IRT is the National Institutes of Health Patient-Reported Outcomes Measurement Information System (PROMIS) (https://www.healthmeasures.net/explore-measurement-systems/promis).PROMIS is an NIH-funded initiative to develop and validate patient reported outcomes for clinical research and practice.
Generally, IRT is applied to the results of binary responses to the test items (e.g., correct and incorrect answers).In medical diagnosis, the results of various diagnostic procedures are frequently defined as binary responses.Therefore, it is possible to apply IRT to the data of medical diagnosis.To apply IRT to the data of medical diagnosis, the following correspondence is assumed: (i) the patient as the test item, (ii) the doctor as the test taker, and (iii) the results of the binary responses obtained through medical diagnosis as test results.For example, Nishio et al. (2020) used IRT for analyzing the results of medical diagnoses by radiologists.
The Bayesian IRT can be implemented using probabilistic programming languages or dedicated libraries (e.g., JAGS, Stan, pystan, and numpyro) (Python Software Foundation, 2023;Depaoli, Clifton & Cobb, 2016;Carpenter et al., 2017;Phan, Pradhan & Jankowiak, 2019).For example, previous studies used Stan for the implementation of the Bayesian IRT, graded response model, and nominal response model (Luo & Jiao, 2017;Nishio et al., 2020;Nishio et al., 2022).The recent advances in hardware and software make it possible to use the Bayesian IRT efficiently.However, there is no study comparing the efficiency of the Bayesian IRT from the viewpoint of computational cost.
The purpose of the current study was to compare the results of the Bayesian IRT implemented with two dedicated libraries (pystan and numpyro).In the current study, the Bayesian 1PL-IRT and 2PL-IRT implemented with pystan and numpyro were applied to the two types of medical data obtained from Nishio et al. (2020).Our main contributions in this article are as follows; (i) the two IRT models could be fitted with pystan and numpyro, (ii) the two libraries yielded similar estimation results for the combinations of the two IRT models and the two types of medical data, and (iii) depending on the dataset size, we evaluated which package had better Markov chain Monte Carlo sampling performance.For reproducibility, our implementation of the Bayesian IRT in pystan and numpyro used in the current study is disclosed as open source through GitHub (https://github.com/jurader/irt_pystan_numpyro).Selections of this article were previously published as a preprint (Nishio et al., 2023).

MATERIALS & METHODS
Because this study used the medical data obtained from Nishio et al. (2020), institutional review board approval or informed consent of patients was not necessary.

Medical data
The two types of medical data (BONE and BRAIN data) were obtained from Nishio et al. (2020).Table 1 shows the characteristics of the two types of medical data.The BONE data include binary responses from 60 patients (test items) and seven radiologists (test takers), and the BRAIN data include those from 42 patients and 14 radiologists.The total numbers of the binary responses were 420 and 588 in the BONE and BRAIN data, respectively.While the data from two modalities (computed tomography and temporal subtraction) were used in Nishio et al. (2020), those from one modality (computed tomography) were used in this study.As a result, the total numbers of the binary responses were half in this study, compared with Nishio et al. (2020).

1PL-IRT
IRT is a statistical model for analyzing the results of binary responses.While there are several types of IRT models (Gelman et al., 2013), 1PL-IRT and 2PL-IRT were used.In the current study, latent parameters of IRT are estimated based on the results of medical diagnoses by test takers.In 1PL-IRT, one latent parameter (β i ) is used to represent the difficulty of test item i, and another latent parameter (θ j ) is used to represent the ability of test taker j.The following equations represent 1PL-IRT. Pr Here, • Pr r ij = 1 represents the probability that the response of test taker j to test item i is correct, • β i is the difficulty parameter of test item i, • θ j is the ability parameter of test taker j.

2PL-IRT
In 2PL-IRT, two latent parameters (α i and β i ) are used to represent test item i.The following equations represent 2PL-IRT. Pr Here, • α i and β i are the discrimination and difficulty parameters of test item i.

Experiments for agreement of latent parameters
The Bayesian 1PL-IRT and 2PL-IRT, implemented with pystan and numpyro, were applied to the BONE and BRAIN data.For 1PL-IRT, the following prior distributions were used: where N represents a normal distribution in which the first and second arguments are the average and variance of normal distribution, respectively.For 2PL-IRT, the following prior distribution was used in addition to those of 1PL-IRT: • log (α i ) ∼ N (0.5,1).The same prior distributions of the latent parameters were used in both pystan and numpyro.
The following parameters were used for sampling using Markov chain Monte Carlo method in both pytan and numpyro: number of chains (num_chains) = 6, number of samples per one chain (num_samples) = 8,000, number of samples for warmup (num_warmup) = 2,000.For numpyro GPU version, chain_method = 'parallel' was used.After the sampling, the posterior distributions of the latent parameters were obtained for the Bayesian 1PL-IRT and 2PL-IRT.The posterior distributions of the latent parameters were then compared between pystan and numpyro.

Experiments for computational cost
To evaluate the computation cost of 1PL-IRT and 2PL-IRT implemented with pystan and numpyro, simulation data were generated from the medical data (BONE and BRAIN data) and numpyro.The computer simulation was performed in the following steps: (i) estimating the posterior distributions of the latent parameters of 1PL-IRT and 2PL-IRT for the two types of medical data, and (ii) generating binary responses from the IRT equations and the estimated posterior distributions for the two types of medical data.The total number of binary responses in the simulation data were as follows: 420,840,2,100,4,200,8,400,21,000,42,000,84,000,210,000,and 420,000 for the BONE data;588,1,176,2,940,5,880,11,760,29,400,58,800,117,600,294,000,and 588,000 for the BRAIN data.This means that size of the simulation data ranged from the original size to 1,000 times.To evaluate the computational cost in pystan and numpyro, the sampling time using Markov chain Monte Carlo method was measured.In numpyro, both CPU version and GPU version were used for the sampling.The following parameters were used for the sampling in both pytan and numpyro: number of chains (num_chains) = 2, number of samples per one chain (num_samples) = 3,000, number of samples for warmup (num_warmup) = 500.For numpyro GPU version, chain_method = 'parallel' was used.Due to the limitation of Google Colaboratory, it was not possible to evaluate the sampling time in several large-sized simulation data.Therefore, in addition to Google Colaboratory, we performed the same experiment on a local workstation with Intel Core i9-9820X CPU and Nvidia Quadro RTX 8000.

Results for agreement of latent parameters
In the current study, we focused on the ability parameters of test takers, and the estimation results of test items were omitted.Tables 2-5 present the estimation results of the ability parameters of test takers.In addition, Figs. 1 and 2 show representative scatter plots of the estimation results between pytan and numpyro, which are obtained from values of Tables 2  and 5, respectively.Tables 2-5 show the mean, standard deviation, and credible interval (highest density interval) as the estimation results of the ability parameters of test takers.Based on Tables 2-5 and Figs. 1 and 2, we found that there was good agreement between pystan and numpyro for 1PL-IRT and 2PL-IRT of the BONE and BRAIN data.
From Tables 2-5, Lin's concordance correlation coefficients (CCC) (Lin, 1989) of the estimated mean of the ability parameters were calculated between (a) pystan v.s.numpyro CPU version, (b) pystan v.s.numpyro GPU version, and (c) numpyro CPU version v.s.numpyro GPU version.The results of CCC values are summarized in Table 6.The following criteria were used to evaluate CCC (Nishio et al., 2016;Kojita et al., 2021); low CCC values (<0.900) were considered to represent poor agreement, whereas higher CCC values represented moderate (0.900-0.950), substantial (0.951-0.990), and almost perfect agreement (>0.990).As shown in Table 6, the CCC values of the estimated mean of the ability parameters indicate almost perfect agreement for 1PL-IRT and 2PL-IRT of the BONE and BRAIN data.
In addition, when the number of samples were fewer (number of samples per one chain was less than 8,000), the agreement of the ability parameters was investigated.The results are shown in Table 7.When original-size simulation data were used, the sampling time was shorter in pystan than numpyro CPU version.However, in the simulation data of the BONE and BRAIN data except for the original size, the sampling time was shorter in numpyro CPU version than pystan.Moreover, when the large-sized simulation data (total number of binary responses > 30,000-50,000) were used, the sampling time was shorter in numpyro GPU version than numpyro CPU version.In addition to the experiments using Google Colaboratory, the results of computational cost on the local workstation are shown in Figs.7-10.The same trend was observed when using Google Colaboratory and the local workstation.

DISCUSSION
The current study aimed to compare the estimation results of the ability parameter of test takers using two different libraries (pystan and numpyro) for two different types of IRT models and medical data (BONE and BRAIN data).The study found that there was good agreement between pystan and numpyro for all the combinations of the IRT models and medical data; there was almost perfect agreement between pystan and numpyro in the CCC values of the estimated mean of ability parameters.The current study also compared the sampling time for the simulation data of the BONE and BRAIN data.We found that while the sampling time was shorter in pystan than numpyro CPU version for the original-size data, it was shorter in numpyro CPU version than pystan for the simulation data except  for the original size.For the large-sized simulation data, the sampling time was shorter in numpyro GPU version than numpyro CPU version.Our results show that there was almost perfect agreement in the ability parameters of the Bayesian IRT between pystan and numpyro.This suggests that researchers can choose either library for implementing the Bayesian IRT.While numpyro requires only Python, both Python and Stan (two different programming languages) are necessary for pystan.Many practitioners and researchers may find numpyro to be simple and straightforward.
Although we used the simulation data, our results of sampling time show that the fastest libraries differed based on the total number of binary responses.Specifically, pystan was the fastest for the original-size simulation data, while numpyro CPU version was the fastest for the small-sized and medium-sized data.For the large-sized simulation data, the sampling time was shorter in numpyro GPU version than numpyro CPU version.This implies that      practitioners and researchers should select either pystan or numpyro based on the data size.Figure 11 shows our recommendation for selecting pystan and numpyro.Tables 3 and 4 show that the total number of latent parameters may affect the usefulness of GPU for reducing the sampling time.In complex models, such as 2PL-IRT used in this study, numpyro GPU version may tend to be faster than numpyro CPU version.The effect of GPU on the sampling time should be evaluated in future studies.
This study had several limitations.First, we evaluated only Bayesian 1PL-IRT and 2PL-IRT.Further studies are needed to investigate the effectiveness of pystan and numpyro  in other types of Bayesian models.Second, although we evaluated the sampling time, we used the simulation data instead of real-world data.Future studies should use real-world data to evaluate the sampling time.Third, we mainly used Google Colaboratory.Although Google Colaboratory has several merits (e.g., ease of use and availability), our experiments were performed using limited types of hardware.Fourth, because the data and purpose of our study are different from those of Nishio et al. (2020), it is impossible to compare our results with those from that article.

Figures 3 -
Figures 3-6 show the sampling time for the simulation data of the BONE and BRAIN data.When original-size simulation data were used, the sampling time was shorter in pystan than numpyro CPU version.However, in the simulation data of the BONE and BRAIN data except for the original size, the sampling time was shorter in numpyro CPU version than pystan.Moreover, when the large-sized simulation data (total number of binary responses > 30,000-50,000) were used, the sampling time was shorter in numpyro GPU version than numpyro CPU version.In addition to the experiments using Google Colaboratory, the

Figure 1
Figure 1 Representative scatter plots of estimated ability parameters of 1PL-IRT between numpyro and pystan for BONE data.(A) Plot for mean of estimated ability parameters, (B) plot for SD of estimated ability parameters.Note: (A) and (B) are obtained from values from Table 2. Full-size DOI: 10.7717/peerjcs.1620/fig-1

Figure 2
Figure 2 Representative scatter plots of estimated ability parameters of 2PL-IRT between numpyro and pystan for BRAIN data.(A) Plot for mean of estimated ability parameters, (B) plot for SD of estimated ability parameters.Note: (A) and (B) are obtained from values from Table 5. Full-size DOI: 10.7717/peerjcs.1620/fig-2 i) Except for the number of samples per one chain (num_samples), the parameters for sampling using Markov chain Monte Carlo method are the same as those of the main experiment.(ii) In the main experiment, number of samples per one chain is 8000.(iii) Estimated mean of ability parameter was used in calculating CCC.(iv) Rhat values are not always less than 1.10 in this experiment.Abbreviation: Lin's concordance correlation coefficients, CCC.

Table 2 Estimation results of ability parameters of test takers in 1PL-IRT for BONE data.
a Note: theta[i] ofTable means θ i of the equation of 1PL-IRT.

Table 3 Estimation results of ability parameters of test takers in 2PL-IRT for BONE data.
Note: theta[i] ofTable means θ i of the equation of 2PL-IRT.

Table 4
Note: theta[i] ofTable means θ i of the equation of 1PL-IRT.

Table 5
Notes.Note: theta[i] ofTable means θ i of the equation of 2PL-IRT.