autoRasch: An R Package to Do Semi-Automated Rasch Analysis

The R package autoRasch has been developed to perform a Rasch analysis in a (semi-)automated way. The automated part of the analysis is achieved by optimizing the so-called in-plus-out-of-questionnaire log-likelihood (IPOQ-LL) or IPOQ-LL-DIF when differential item functioning (DIF) is included. These criteria measure the quality of fit on a pre-collected survey, depending on which items are included in the final instrument. To compute these criteria, autoRasch fits the generalized partial credit model (GPCM) or the generalized partial credit model with differential item functioning (GPCM-DIF) using penalized joint maximum likelihood estimation (PJMLE). The package further allows the user to reevaluate the output of the automated method and use it as a basis for performing a manual Rasch analysis and provides standard statistics of Rasch analyses (e.g., outfit, infit, person separation reliability, and residual correlation) to support the model reevaluation.


Introduction
A popular way to validate a questionnaire and to measure one's abilities given subjects' responses is the Rasch model (Rasch 1966) and its family. With the help of various kinds of software (e.g., RUMM, Winstep) and packages (e.g., eRm, ltm) listed in https: //www.rasch.org/software.htm, Rasch analysis is typically done in a manual iterative manner. Using the result of standard statistics (e.g., outfit, infit, local dependency, reliability, unidimensionality), misfit items are removed and the remaining items are then reevaluated until a psychometrically optimal set of items is obtained. Nevertheless, which statistics to begin with, which items need to be removed first, and when to stop are decisions to be made by experts. Consequently, this procedure is, undoubtedly, a laborious task that somewhat relies on expertise and human judgment. Moreover, different experts often favor non-identical but equally befitting items.
This article introduces the autoRasch package as a new tool for performing Rasch analyses in a (semi-)automatic manner. This package implements novel criteria that have been shown to naturally incorporate the standard properties of Rasch analyses. These criteria are called the in-plus-out-of-questionnaire log likelihood with differential item functioning (IPOQ-LL-DIF) for DIF incorporated cases and in-plus-out-of-questionnaire log-likelihood (IPOQ-LL) for non-DIF cases (Wijayanto, Bucur, Mul, Groot, van Engelen, and Heskes 2022;Wijayanto, Mul, Groot, van Engelen, and Heskes 2021). Both criteria are developed based on the Rasch model by considering the discrimination and differential functioning over items, which will be discussed in Section 2. Section 2 also explains the techniques for estimating the models and how they differ from what some other packages offer. The formulation of the criteria is then discussed in Section 3 together with the implemented optimization procedure.
As mentioned, autoRasch aims to automate the iterative process of a Rasch analysis as much as possible. In spite of that, as in any statistical analysis, the final results have to be treated and analyzed carefully to judge whether they satisfy the modeler's desiderata. Therefore, we have also implemented standard statistics of Rasch analyses to do pre and post-analyses of this automatic procedure's output. These features are discussed in Section 4. Also, in Section 4, the corresponding implementation in R (R Core Team 2021) is described using a real-world dataset. Finally, Section 5 summarizes our work and discusses some aspects related to the criteria and to the autoRasch package.

The Models
For the semi-automated procedure, we have implemented four models that generalize the Rasch model, i.e., the partial credit model (PCM) (Masters 1982), the generalized partial credit model (GPCM) (Muraki 1992), the partial credit model with differential item functioning (PCM-DIF) (Wijayanto et al. 2022), and the generalized partial credit model with differential item functioning (GPCM-DIF) (Wijayanto et al. 2022).

Model Expression
We first describe the most general form of the model, the GPCM-DIF. Having modeled both the discrimination and the differential functioning of items, GPCM-DIF contains the most parameters. The responses are represented as a data matrix X whose rows represent subjects and whose columns represent items. The GPCM-DIF models the general form of responses which are recorded into two or more ordered categories, x ∈ {0, 1, . . . , m i } where m i + 1 is the number of categories. According to the GPCM-DIF, the probability of subject n giving response x to item i reads for x > 0, and (2) Here, θ n is the ability parameter, β ij represents the difficulty (threshold) parameter of category j on item i, α i is the item discrimination parameter, and δ if is the DIF parameter that models the difference in the difficulty of item i for subjects from different groups. The binary matrix κ nf maps subject n to group f ∈ {1, 2, . . . , m f }, where m f denotes the number of potential DIF-inducing covariates: κ nf = 1 if respondent n is a member of group f and κ nf = 0 otherwise.
The GPCM-DIF is equivalent to the GPCM (Muraki 1992) if all δ if are zero, corresponding to the case without DIF. In case all α i are zero, we have the partial credit model with DIF (PCM-DIF) (Wijayanto et al. 2022). When both δ if and α i are all zero, the GPCM-DIF simplifies to the partial credit model (PCM) (Masters 1982). For binary responses, the PCM and GPCM are equivalent to the Rasch model and to the two probability logistic (2-PL) model, respectively (Rasch 1966;Lord and Novick 1968).

Penalized Joint Maximum Likelihood Estimation
Although we share the same model with the GPCMlasso package (Schauberger and Mair 2019), our autoRasch package requires a different estimation method: instead of marginal maximum likelihood estimation (MMLE) as in (Schauberger and Mair 2019), we use penalized joint maximum likelihood estimation (PJMLE) by adding penalty terms to the log likelihood.
Given the observed responses x ni , we define the log likelihood for a particular set of items S ⊆ {1, . . . , P } as with P (X = X ni |θ, β, α, δ) as in Equations (1) and (2).
Standard JMLE is known for having difficulties in estimating extreme scores (Linacre 2004). These extreme scores tend to result in very large estimated values (either positive or negative). Therefore, we propose to use Tikhonov (L 2 ) regularization in order to bound the estimated value of θ n . Furthermore, to avoid obtaining negative discrimination parameters, we use the parametrization α i = exp γ i . Additionally, to drive the GPCM-DIF toward the standard Rasch model, we add Tikhonov regularization to the natural logarithm of α i to drive it to one and an L 1 penalty to drive δ i to zero.
Adding these regularizations to (3) we obtain In the absence of the DIF effect, the log likelihood in (4) simplifies to

Optimization Method
To optimize the log likelihood in (5), we make use of quasi-Newton methods implemented in optim(). However, to obtain a stable solution for (4), we resort to a two-level coordinate descent method, described in Algorithm 1. The outer loop coordinate descent aims to maximize the log likelihood and to estimate the GPCM parameters (θ, β, and α), while fixing the δ values. The inner loop coordinate descent treats the δ parameters as separate coordinates and estimates them separately, while fixing the values of the GPCM parameters.

The Criteria
In (Wijayanto et al. 2021) and (Wijayanto et al. 2022), we introduced novel criteria to evaluate the quality of an instrument from a given original survey. In summary, when performing a standard Rasch analysis for a given original survey S, items that do not satisfy certain criteria are moved to the excluded set (S out ). Based on the remaining items, gathered in the included set (S in ), we then construct the most representative scale for the subjects. In our package, we do the same when estimating the GPCM-DIF parameters by optimizing (4) with S = S in : We refer to the log likelihood of these fitted parameters on the included itemset as the inquestionnaire log likelihood with DIF: By construction, the excluded items in S out no longer have an effect on the scale. Nevertheless, assuming they have not been considered in the initial survey without a reason, we still would like the estimated scale to also have a decent fit to these excluded items. To estimate the quality of this fit, we now clamp θ Sout toθ S in and estimate the other parameters of the excluded items by optimizing the log-likelihood We refer to the resulting as the out-of-questionnaire log likelihood with DIF. Taking the total of both log likelihoods leads us to one of our criteria: In absence of differential item functioning, we set δ = 0 and only estimate θ, β, and γ. This gives the simpler criterion IPOQ-LL (Wijayanto et al. 2021). Since both criteria are meant to score the quality of an itemset for constructing a scale based on the given responses, the higher these scores the better.

Searching for the Optimal Itemset
For any choice of included itemset S in and corresponding excluded itemset S out , we can compute the IPOQ-LL or IPOQ-LL-DIF. To search for an optimal itemset, autoRasch makes use of stepwise selection. Stepwise selection is a mix between backward elimination (eliminating variables one-by-one) and forward selection (adding variables one-by-one). The step-by-step procedure of the stepwise selection is presented in Algorithm 2. Stepwise selection starts with backward elimination on a full itemset to eliminate the worst item. OneStepBackwardElimination(S in ) in line 5 considers all possible itemsets with one item fewer and returns the highest in-plus-out-of-questionnaire log likelihood as well as the itemset that leads to this maximum. In lines 8 and 11, OneStepForwardSelection(S in ) does the same, but only considers itemsets that can be constructed by adding one additional item to S in . In practice, stepwise selection provides a good trade-off between very efficient greedy methods like backward elimination and forward selection and much more computationally demanding procedures such as best subset selection, which considers all possible combinations of items.

Implementation in autoRasch
The autoRasch package fits the models presented in Section 2.1 using penalized joint maximum likelihood estimation (PJMLE) as formulated in Section 2.2. We have implemented four fitting functions (i.e., gpcm_dif(), gpcm(), pcm_dif(), and pcm()) to estimate the parameters of the GPCM-DIF, GPCM, PCM-DIF, and PCM, respectively. The Rasch model, a special case of the PCM, can be fit using pcm() and the 2-PL model, a special case of the GPCM, can be fit using gpcm(). Each of these fitting functions returns an object of a class Algorithm 2 Pseudocode for stepwise selection (Wijayanto et al. 2021). end while 15: end procedure named after the function (e.g., pcm() returns a pcm object, pcm_dif() returns a pcm_dif object, etc.), for which the methods print() and summary() are available.
The semi-automated procedure searches automatically for the maximum score of either the IPOQ-LL or IPOQ-LL-DIF (see Section 3.1). These criteria can be computed using compute_score() for a single itemset and compute_scores() for multiple itemsets at once in parallel. The maximum of the criteria can be obtained using stepwise_search(), which implements Algorithm 2. The result of the search procedure can be plotted using plot_search().
Having obtained an itemset with the highest score, one could always reevaluate its performance using standard statistics of Rasch analyses or even perform an additional manual Rasch analysis starting from this itemset. The autoRasch package also implements some standard statistics (e.g., item goodness-of-fit, correlation of the residual, and separation reliability). The unweighted mean square (OutfitMnSq), weighted mean square (InfitMnSq), standardized unweighted mean square (OutfitZSTD), and standardized weighted mean square (InfitZSTD) are provided to measure how well an item fits to a model Masters 1982, 1990). These statistics can be computed using fitStats(). As for the local dependency, we can compute the correlation of the residual using residCor(). To measure the reliability of the itemset, one can use checkRel(), which implements the person/item separation reliability statistics (Wright and Masters 1982). Additionally, the package is also equipped with graphical tools, e.g., person item map (plot_PImap()) and item characteristic curves (plot_ICC()).

Application in the Interdisciplinary Education Perception Scale
In this section we will demonstrate how to perform a semi-automated Rasch analysis using autoRasch. We consider a real-world dataset from the interdisciplinary education perception scale (IEPS) (Vaughan 2019). The IEPS dataset consists of responses from 319 subjects to 18 items and four potential DIF-inducing covariates that contain subjects' information (i.e., university, year level, age group, and gender). For this demo, we only used two covariates: gender and university. After rescoring some items to eliminate disordered thresholds, the data is then analyzed using semi-automated Rasch analysis as described in (Vaughan 2019 First, the autoRasch package needs to be loaded with library(). The responses from n subjects to i items of the original survey are stored as a matrix or data.frame. To preview the dataset, head() can be used. Here columns 1 and 2 consist of respondents' background information and the responses are in columns 3 to 20. We then store the responses to iepsWorkData. Passing in these responses, we use stepwise_search() to find an optimal itemset in a stepwise manner based on the chosen criterion; criterion = "ipoqlldif" to take DIF effects into account and criterion = "ipoqll" otherwise. The ipoqlldif requires groups_map parameters, an n × m f binary matrix that maps n subjects into m f focal groups based on the subjects' background information for estimating the DIF effect. To accommodate this mapping process, createGroup() can be used to group the subjects automatically. Additionally, to speed up the search, stepwise_search() allows for parallelization by setting the number of cores through cores.
R> plot_search(search_result, type = "l", ylim = c(-5600,-5100), + cex.lab = 1.3, cex = 1.3, cex.axis = 1.1) Additionally, utilizing plot() we can track the highest IPOQ-LL-DIF score for every |S in |. In Figure 1, starting from the full set, items are removed sequentially and the scores first increase and then decrease with a peak at |S in | = 14. The IPOQ-LL-DIF and IPOQ-LL criteria score how well the responses fit the model, where the highest value is to be preferred. In this case, IPOQ-LL-DIF scores from |S in | = 16 to |S in | = 8 are very similar, suggesting that a shorter instrument can also be taken under consideration. Having obtained an itemset using the automated method, one could always reevaluate its performance using the statistics of standard Rasch analyses. If needed, a Rasch analysis can be initiated manually from the obtained itemset. To compute the parameter estimates of a partial credit model, one may use pcm(), which implements penalized joint maximum likelihood estimation as described in Section 2.2. In the presence of DIF, one should use pcm_dif() to incorporate the effect of DIF into the estimation procedure.
R> pcmResult <-pcm_dif(iepsWorkData [,c(1,2,4,5,7,9,10,12,13,14,15,16,17,18) Figure 2: The person-item map of 14 items that results in the highest IPOQ-LL-DIF. The _Name appended to a variable indicates the name of the covariate that induced DIF in a particular item. For example, V9 maps the difficulty of item V9 for the reference group whereas V9_sex maps the difficulty for the focal group when gender is the DIF-inducing covariate.

R> plot_ICC(pcmResult, itemno = 13)
The fitting function for the PCM-DIF model, pcm_dif(), estimates all parameters including delta, which models the difference in difficulty between groups for each item. The pcm_dif(), however, can only estimate abilities of subjects with background information and omit the subjects otherwise. Having these estimates, the estimated parameters can be plotted as a person-item map using plot_PImap(). On the person-item map, item 17 is highlighted in red, which indicates a disordered threshold. If needed, the item characteristic curve (ICC) of item 17 can be plotted using plot_ICC() as in Figure 3. The items' goodness-of-fit can be examined using the standard statistics of Rasch analyses, namely unweighted mean square (OutfitMnSq), weighted mean square (InfitMnSq), standardized unweighted mean square (OutfitZSTD), and standardized unweighted mean square (InfitZSTD) Masters 1982, 1990). These statistics are implemented in fitStats() for objects of either class pcm or pcm_dif.  Figure 4: Estimated alpha (α) values against Outfit mean square (OutfitMnSq) which is plotted using plot_fitStats().
The items' goodness-of-fit statistics can be summarized using itemfit(). Additionally, fitStats() also returns the corresponding discrimination parameters (alpha) that are obtained after fitting the data to either the GPCM or GPCM-DIF for objects of class pcm or pcm_dif, respectively. Normally, these five values are well correlated and any deviation implies the need to check the values, e.g., a high OutfitMnSq value because of surprising responses from a single subject (see Wijayanto et al. (2021)). The correlation can be visualized using plot_fitStats() as shown in Figure 4.
The local item dependency is another issue that needs to be checked since it has been shown to have a strong relation with unidimensionality. To check the local dependency status of an itemset, autoRasch has been equipped with corResid() so as to compute the correlation of the residual for evaluating the local dependency status of an itemset.

R> pcmLD <-corResid(pcmFit) R> summary(pcmLD)
Correlation of the standardized residual: There are 2 pair(s) of variable(s) with correlation >= 0.3 1. V9 and V4 2. V7 and V5 Another potential concern is the reliability of the estimates in the Rasch analysis. The typical statistic used is the separation reliability which is accessed via checkRel(). This function returns both the index and reliability of the separation of both person and item.

Summary and discussion
In this article, we have presented the autoRasch package, a tool for performing Rasch analyses in a semi-automated way. This approach has been shown to provide comparable, statistically indistinguishable results compared to standard Rasch analyses (see Wijayanto et al. (2021) and Wijayanto et al. (2022)).
Two criteria have been developed to be used in different circumstances. The IPOQ-LL-DIF is used to include the effect of the differential functioning in items. The IPOQ-LL disregards this DIF effect, making its computation faster. To select the items that make up the final instrument, we chose to implement a stepwise-search optimization method. Other optimization methods could be implemented in the future.
The autoRasch package can be used to automatically arrive at an instrument, without manual interventions. As argued before, we do recommend to reevaluate the quality of the final set based on one's expertise. Alternatively, autoRasch can be used to quickly arrive at an initial set of items that are used as a starting point for a standard Rasch analysis. Accompanied by one's expertise and guided by some typical statistics in Rasch analyses, this autoRasch package has the potential to find high-quality items with a lower cost in effort and time. autoRasch's functions are designed to be as user-friendly as possible without the need for complicated settings.