R-package LNIRT for joint modeling of response accuracy and times

In computer-based testing it has become standard to collect response accuracy (RA) and response times (RTs) for each test item. IRT models are used to measure a latent variable (e.g., ability, intelligence) using the RA observations. The information in the RTs can help to improve routine operations in (educational) testing, and provide information about speed of working. In modern applications, the joint models are needed to integrate RT information in a test analysis. The R-package LNIRT supports fitting joint models through a user-friendly setup which only requires specifying RA, RT data, and the total number of Gibbs sampling iterations. More detailed specifications of the analysis are optional. The main results can be reported through the summary functions, but output can also be analysed with Markov chain Monte Carlo (MCMC) output tools (i.e., coda, mcmcse). The main functionality of the LNIRT package is illustrated with two real data applications.


INTRODUCTION
In computer (adaptive) testing next to response accuracy (RA) response times (RTs) can be automatically recorded. The information from RTs can be used to improve the assessment of the test-takers (e.g., latent ability estimation), to improve the design of a test (e.g., adaptive item selection), and to learn more about a test-taker's response process (e.g., test cheating, item exposure). The traditional item response theory (IRT) models to measure ability and item characteristics (e.g., one-and two-parameter IRT models) offer no options to include this important information from RTs. These IRT models were constructed for paper-and-pencil tests and assume the RA data (response-accuracy observations to test items) represents the complete-data information. Therefore, to include the RT data (response-time observations to test items) in the psychometric-measurement analysis the traditional IRT models need to be modified.
IRT models have been adjusted to include RT information through a new parameter (e.g., a time parameter) (see, e.g., Roskam, 1997;Thissen, 1983;Verhelst, Verstraalen & Jansen, 1997). In another approach, the RT data is modeled separately using an RT-specific measurement model (Maris, 1993, Scheiblechner, 1979. Both approaches are not popular, mainly because the analysis of RT and RA data requires a joint modeling approach, which LNIRT package offers several important contributions for jointly analysing item patterns of RA and RTs: 1. Statistical computations are done through a powerful Gibbs sampling method that allows for fast MCMC convergence and efficient MCMC sampling. 2. Extensive residual analysis tools are available for the evaluation of item-and person fit, and outlier detection (Fox & Marianti, 2017).
3. Different parameterizations of the joint model can be applied. For instance, the lognormal RT model can be fitted with the parameterization of van der Linden (2007) or of Klein Entink, Fox & van der Linden (2009).
4. Explanatory variables can be included at the item and person level, and missing item response data can be treated as missing at random (MAR) or missing by design (i.e., incomplete test design) (van der Linden & Fox, 2016).
5. It offers a generalized measurement model for RTs in which working speed can be modeled through a latent growth component to allow for differential working speed across items (Fox & Marianti, 2016).
The remainder of this article discusses the main contributions of the package. Two detailed data examples are presented which show how the package can be used. In Section 2 the joint model is described, which includes the different measurement model parameterizations, the inclusion of explanatory variables at the item and person level, and (hyper) prior distributions. In "Model Fit", statistical tools are discussed to evaluate the fit of the model. This includes person-fit and item-fit tools to flag extreme persons and items under the model, respectively. "Applications" describes an illustration of the package by analysing the Credential data set collected from 1,636 candidates who took the licensure exam (Cizek & Wollack, 2016). Furthermore, the joint model with variable working speed is discussed using the Amsterdam chess data ( Van der Maas & Wagenmakers, 2005). Then, in "Summary and Discussion", the conclusions are given.

THE JOINT MODEL
The LNIRT (joint) model is represented by the description of the measurement models for speed and accuracy at level 1. At a second (hierarchical) level, the distribution is described for the latent variables speed and accuracy, which operate as level-1 parameters to describe the RA and RTs. Furthermore, the distribution for the item parameters, which also operate as level-1 parameters, is also described at level 2. The joint model for RA and RTs was introduced by van der Linden (2007) and generalized by among others van der Linden & Fox (2016) and Fox & Marianti (2016). The current description of the joint model follows the presentation given by Fox, Klein Entink & van der Linden (2007), who described the (Bayesian) joint model with multivariate normal priors for the person and item parameters.

RA model
The matrix of RA-observations Y contains the responses to k ¼ 1; . . . ; K items of the i ¼ 1; . . . ; N test takers. Each RA is described by item characteristics and a test-taker's accuracy. In the two-parameter IRT model, the probability of a correct response is defined given the accuracy of a test taker i, h i , and item parameters (see, e.g., Lord & Novick, 1968). Accordingly, the probability of a correct response to item k is defined as: (1) where a k and b k are generally known as the discrimination parameter and difficulty parameter of item k, respectively, and È denotes the normal cumulative distribution function. To define the item difficulties on the same scale as the ability scale, additional brackets need to be placed in the mean component. Then, the probability of a correct response to item k is given by, where h i andb k are defined on the same scale. In LNIRT, both parameterizations are implemented. The item difficulty parameters in Eqs. (1) and (2) are not directly comparable, and are defined on different scales. For the three-parameter model, a guessing parameter c k is introduced, representing the probability of guessing item k correctly, and this leads to the following measurement model for the success probability: (3)

RT model
The matrix of RT-observations RT contains the log-RTs of the N test takers to the K items. When assuming a constant working speed, each test taker works with a constant speed represented by f i . The time needed to complete an item also depends on item characteristic parameters. They are denoted as f k and k k , and can be seen as a time-discrimination and time-intensity parameter, respectively. The log-RTs, RT ik , are assumed to be normally distributed given the speed and item parameters, The time-intensity parameter k k represents the average time needed to complete the item (on a logarithmic scale). The speed parameter, f i , represents the working speed of test taker i, and the time-discrimination parameter, f k , the item-specific effect of working speed on the RT. When increasing the time-intensity, k k , the average time needed to complete the item will increase. When increasing speed f i the average time needed of person i to complete the item will decrease. In the same way as for the IRT model in Eq. (2), the time intensities can be defined on the same scale as the speed parameter. By including extra brackets in the mean term the time-discrimination parameter operates on the termk k À f i . Fox, Klein Entink & van der Linden (2007) and Klein Entink, Fox & van der Linden (2009) introduced this timediscrimination parameter to improve the modelling of the association between speed and RTs. This item-specific discrimination effect of speed on the RT represents the itemspecific change in RT, when increasing or decreasing speed. For a constant timediscrimination across items, a change in speed leads to item-invariant changes in RTs. For a non-constant item-specific time-discrimination, the change in speed leads to itemspecific changes in RTs. The latter is more natural, since the effect of a change in speed is likely to depend on an item-specific processes instead of a test-specific process. Fox & Marianti (2016) showed that the time-discrimination parameters in Eq. (4) contributes to the modeling of the covariance structure of the RTs. Furthermore, the itemspecific measurement error variance in Eq. (4) represents unexplained variance in RTs due to stochastic behavior of a test taker. When test takers operate with different speed values, take small pauses during the test, or change their time management, the RTs might show more systematic variation than explained by the structural mean term. The item-specific error component might accommodate for these differences and avoids bias in the parameter estimates.
The time-discrimination parameter defined in LNIRT differs from the one defined by van der Linden (2007). In his approach, the item-specific measurement-error precision, defined as the reciprocal of the standard deviation of the measurement error, is referred to as the time discrimination parameter. This parameter operates on the squared difference between observed RT and speed (and time intensity), instead of operating directly on the difference between speed and time intensity. Thus, the parameter does not represent the average change in RT due to a change in speed, but represents an effect of a change in the unexplained heterogeneity between RTs and speed. This parameterization also has the consequence that a measurement-error variance is not included in the RT model.
It is assumed that a test taker operates with a constant speed throughout the test conditioning on the time intensities. Each test taker's speed level does not change whatever the test conditions. In Section 3.4, the differential working-speed model is discussed to model changes in working speed, for example, due to fatigue or the adoption of a new strategy during the test, which is also implemented in the R-package LNIRT.
Level 2 and 3: structural person and item parameter models Persons A bivariate normal (population) distribution is defined for the ability and speed parameter, The covariance between the person parameters is represented by q. The level-2 model for speed and ability can be considered to represent a population distribution for the test takers. The test takers are defined to be exchangeable, and the distribution represents the prior for the person parameters. Without an identification restriction(s) on the variance parameters, the hyperprior for the covariance matrix S P is the inverse-Wishart distribution with degrees of freedom m P and scale parameter V P .

Items
A multivariate normal distribution is specified for the item parameters, Parameters a k and f k are restricted to be positive. The covariance matrix for the item parameters allows for correlation between item parameters. The hyperprior for ðl I ; S I Þ is a normal-inverse-Wishart distribution: where m I and V I are the degrees of freedom and scale matrix of the inverse Wishart distribution, l 0 is the prior mean and j the number of prior measurements, which is given a default value of one to specify a vaguely informative prior. A Beta prior is specified for the guessing parameter, c k , where the default hyper-parameter values are 20 and 80. This leads to a prior proportion of guessing of 1/5 with a standard deviation of 0.04.

Explanatory variables
The multivariate models for persons and item parameters can be extended to include explanatory variables. Let X h denote the predictors for the ability parameter and X f for the speed parameter. The mean component for the person parameters can be expressed as For the mean component of the difficulty and time-intensity parameters a similar extension is defined, Explanatory information can be included to explain differences between persons and item characteristics. Noninformative normal priors are defined for the regression parameters with a mean of zero and a large variance. In the LNIRT package, the option to include predictors for discrimination and time discrimination are not implemented, since the variance in parameter values is (very) small. For categorical predictor variables, dummy coding is required.

Parameter estimation and model identification
A summarized description of the estimation method and model identification rules is presented. A more detailed description of the parameter estimation can be found in Fox, Klein Entink & van der Linden (2007), Klein Entink, Fox &van der Linden (2009) andFox (2010). IRT models are usually identified by fixing the mean and variance of the latent variable to zero and one, respectively. Typically, this can be done directly by restricting the prior mean l h and variance r 2 h , or by putting restrictions on the item parameters. The joint model can be identified in the same way. However, for the identification rules in the package LNIRT restricting the variance of a person parameter has been avoided. When restricting the variance of a (random) person parameter, the covariance matrix in Eq. (6) is also restricted, and the inverse-Wishart distribution does not apply to a restricted covariance matrix. For this scenario, Klein Entink, Fox & van der Linden (2009) redefined the prior for the person parameters, where q becomes a regression parameter in the regression of working speed on ability with the working speed variance included in the error variance. However, this identification procedure would also complicate other modeling features (e.g., model-fit tools, variable working speed).
The LNIRT package provides two options to identify the model. For both options, the variance of the latent scales are identified by restricting the product of discriminations and of the time discriminations to one, Q k a k ¼ 1 and Q k f k ¼ 1, respectively. For option one (referred to as (R-code) ident=1), the mean of the scales are identified by restricting the sum of the difficulty and of the time-intensity parameters to zero, P k b k ¼ 0 and P k k k ¼ 0, respectively. For option two (referred to as (R-code) ident=2), the mean of the scales are identified by fixing the mean of the ability parameter to zero, l h ¼ 0, and of the speed parameter to zero, l f ¼ 0.
Model parameters are estimated through Gibbs sampling from their joint posterior distribution. The procedure involves the division of all unknown parameters into blocks, with iterative sampling of the conditional posterior distributions of the parameters in each block given the preceding draws for the parameters in all other blocks (Fox, 2010). In various simulation studies and real data analysis, the Gibbs samplers in the LNIRT package showed good convergence properties and generally returned efficient MCMC samples (e.g., Fox & Marianti, 2016, 2017Klein Entink, Fox & van der Linden, 2009).

Logistic vs probit model item parameter estimates
The LNIRT package uses the normal-ogive IRT model (Probit model) to model the probability of a correct/positive response. The item parameter estimates under the normal ogive (Probit) model can be transformed to a Logistic scale and vice versa. Therefore, the logistic scale factor is used (1.7) to transform parameters on a Logistic scale to those on a Probit scale. Let l L , r L and l P , r P denote the mean and variance of the latent scale under the Logistic and Probit model, respectively. Then, the item parameters a k and b k are invariant under both scales, when applying the logistic scale factor Assume that data were generated under the Logistic IRT model, and item parameter estimates were obtained under the Probit model. The Probit model estimates can be transformed to the Logistic model estimates; and, it follows that a k r P 1:7 ¼ a k r L : In the same way, the difficulty parameters under the Probit model can be transformed. When assuming that l P and l L are zero, the transformation iŝ

Person fit
The general idea is (1) to define a person-fit statistic for an RA and RT pattern, (2) to define classification variables, as a function of the statistics, to represent a non-aberrant and an aberrant state, and (3) to compute the posterior probability of the aberrant state. A personfit test for RA and RT patterns is discussed. For each test, a dichotomous classification variable is introduced, which states whether a pattern is considered extreme. Then, the person-fit test for the joint model is constructed from the dichotomous classification variables for RA and RT patterns.
Person-fit statistic for RA pattern Drasgow, Levine & Williams (1985) proposed a standardized version of Levine-Rubin statistic, which is shown to have statistical power to detect aberrant response patterns (Karabatsos, 2003). The log-likelihood is used to evaluate the fit of an RA pattern. Following a two-parameter IRT model, the person-fit statistic, denoted as l 0 , is given by, The person-fit statistic can be standardized and this standardized version, denoted as l y s , is approximately standard normally distributed. Under the 3PL model, a dichotomous classification variable S ik is introduced that classifies a correct response to be either a correct random guess with probability c i ðS ik ¼ 0Þ or a correct response according to the two-parameter IRT model ðS ik ¼ 1Þ. Then, l y s is defined conditionally on S ik ¼ 1 to evaluate the extremeness of non-guessed responses in the RA pattern. The guessed responses are ignored in the evaluation of the extremeness of an RA pattern.
To quantify the extremeness of an RA pattern, the posterior probability is computed that the person-fit statistic is greater than a threshold C, which can be defined according to its standard normal distribution. Note that the logarithm-of-response-probabilities are negative, thus increasing values of the negative log-likelihood correspond to misfit. The statistic is integrated over the prior distributions of all parameters and can therefore be interpreted as a prior predictive test. The (marginal) posterior probability of the statistic being greater than the threshold C is expressed as; where ' denotes the normal density function. This approach corresponds to the prior predictive approach to hypothesis testing advocated by Box (1980), since the posterior probability of an extreme RA pattern is computed by integrating over the prior distributions of the model parameters.
By introducing a classification variable, the posterior probability of an aberrant RA pattern for a given significance level can be computed. Let F y i denote a random variable that takes the value one when an observed pattern y i is marked as extreme and zero otherwise, The posterior probability of an extreme pattern (F y i equals one) is computed by integrating over the model parameters. Then, for a fixed critical level C, the posterior probability represents how likely it is that the RA pattern is extreme under the model.

Person-fit statistic for RT pattern
Analogously, the log-likelihood of the RT pattern of a test taker i is used to define a personfit statistic to quantify the extremeness of the pattern. The person-fit statistic, based on the log-likelihood of an RT pattern, is represented by The sum of standardized errors is an increasing function of the negative log-likelihood, and an unusually large person-fit value corresponds to a misfit. The l t i f i ; k; f; r 2 ; rt i ð Þ given the model parameters is chi-squared distributed with K degrees of freedom. For a threshold C, representing the boundary of a critical region, the posterior probability of an extreme RT pattern is expressed as A classification variable can be defined to quantify the posterior probability of an extreme RT pattern given a threshold value C. Let F t i denote the random variable which equals one when the RT pattern is flagged as extreme and zero otherwise; The posterior probability of an extreme RT pattern is computed for each pattern with MCMC.

Person-fit statistic for RA and RT pattern
An observed RT pattern rt i is reported as extreme when the F t i equals one with a least 0.95 posterior probability. To identify a joint pattern of RA and RT to be extreme, another classification variable is defined. Let F t;y i equal one, when both F y i and F t i are equal to one, and equal zero otherwise. This joint classification variable represents the situation that both patterns of a test taker are extreme or not. The classification variable for the joint pattern is defined as The posterior probabilities of the classification variables in Eqs. (13), (16), and (17) are Bayesian significance probabilities. They represent the posterior probability of an extreme person-fit statistic given the data. They are computed using MCMC taking into account dependencies in the joint model parameters.

Item fit
Without reproducing the equations for item-fit statistics, the person-fit tests for RA and RT patterns can be modified to examine the fit of RA and RT item patterns. Therefore, the log-likelihood of RA and of RTs of an item is considered to define an item-fit statistic for RA and RTs, respectively. The estimated posterior probability of each item-fit statistic represents the probability that the statistic is extreme under the model, which means that the pattern of responses to the item is extreme.

Residual analysis
Bayesian residual computation has been considered by Albert & Chib (1993), Johnson & Albert (2006), and Fox (2010, Chapter 5) to evaluate the fit of an IRT model. This approach is extended to the joint model, and latent residuals e ik are defined, which represent the difference between a latent continuous RA and the mean component. The latent continuous RA is defined through a data augmentation method to facilitate a Gibbs sampling algorithm (Fox, 2010, Chapter 3 and 4). Conditional expectation of a latent residual is derived by integrating out the augmented latent variable to obtain a Rao-Blackwell estimator for the latent residuals. The estimated latent residuals are used to quantify the total percentage of outliers per item and per test taker.
For the RA, the conditional expected latent residual is given by where ' and È are the normal density function and the cumulative distribution function, respectively. Subsequently, the posterior probability that a latent residual is greater than a threshold C is given by The residuals for the RTs, referred to as 2 ik , can be estimated directly as the difference between the RT ik and the mean, 2 ik ¼ ðrt ik À ðk k À f k f i ÞÞ. The extremeness of an RT residual is expressed as the posterior probability that the residual is greater than a threshold C. This posterior probability can be expressed as

Distribution RT residuals
The distribution of the RT residuals is evaluated using the Kolmogorov-Smirnov (KS) test.
The empirical distribution of the residuals is compared to the assumed normal distribution. The KS test is a goodness-of-fit test and the posterior probability is computed that RT residuals of an item are non-normally distributed. The empirical distribution is given by The implemented KS test represents the difference between the cumulative empirical distribution and the normal cumulative distribution function: The distribution of D N is the Kolmogorov distribution, which is used to compute the probability that D N is greater than a threshold C: The marginal posterior probability is computed using MCMC, and the estimated significance probability represents the probability that the RT residuals of item k are nonnormally distributed.

Differential Working Speed
The generalization of the joint model to allow differential working speed has been introduced by Fox & Marianti (2016). A summarized version of their procedure is given, which shows how the joint model can be modified to include differential working speed.
Differential working speed is defined as a change in working speed of a test taker during the test. This relaxes a basic assumption of the joint model to work with a constant speed throughout the test. The change in working speed is modeled using a latent growth model. A random intercept, a linear trend component, and a quadratic time component are considered to model the speed trajectory of each test taker. The random intercept, f i0 , represents the initial value of working speed. The linear trend component, f i1 , is used to model a linear change in speed. Test takers can start slowly (fast) to increase (decrease) their speed later on. The quadratic time component, f i2 , is used to decelerate or accelerate the linear trend. For instance, a positive linear trend in working speed can be decelerated by a negative quadratic component.
The log-normal differential speed model with a random trend and quadratic time variable is represented by where the time variable X ik represents the order in which the test items are solved. The beginning of the test is represented by the time-point X i1 ¼ 0 for each test taker i. This does not mean that each test taker should start the test at the same time. It is simply a reference point for the speed process of each test taker. The time scale is defined on an equidistant scale to reduce the MCMC computations. Let X ðiÞ ¼ X ði1Þ ; X ði2Þ ; . . . ; X ðiKÞ represent the order in which items are solved by test taker i. Then, a convenient time scale is defined by X ik ¼ ðX ðikÞ À 1Þ=K-the times are defined on a scale from 0 to 1, with 1, the upper bound, representing an infinite number of items. The time scale addresses the order in which the items are made and assumes an equidistant property of the speed measurements.
The average of the time intensities defines the average time to complete the test, since the random effects have a population mean of zero. The random intercept represents the speed measured at the first item. The population-average speed trajectory is constant, and shows no change in speed. Test takers can work faster (slower) than this populationaverage level, which corresponds to a positive (negative) initial speed level. A negative (positive) growth rate shows a decrease (increase) in speed, which can be decelerated (accelerated) by the quadratic time component. The random component variances represent the variance in growth parameters of the working speed trajectories. The covariance between random speed effects is modeled through the time-discrimination parameters. When the time-discrimination parameters are all fixed to one, the full covariance matrix of the random speed components is freely estimated.

Ability and differential working speed
A multivariate normal distribution is assumed for the random component ability and speed to allow for relationships between ability and the different speed components. This multivariate model for the random person parameters, ðh i ; f i Þ ¼ ðh i ; f 0i ; f 1i ; f 2i Þ, is given by The relationship between ability and the speed components is defined by the covariance components S h;f . The growth components defining the speed trajectory influence ability. When test takers do not vary speed, ability is only influenced by the random intercept speed. When test takers vary their speed, the trend and quadratic time component will also influence ability, which is a specific feature of this generalization of the constant speed model.
Whether a change in speed improves the accuracy of the responses depends on the application. The differential speed joint model can be used to examine different test strategies across test takers. For instance, it is possible to estimate the speed trajectories of test takers with different levels of ability. The speed trajectories of high-ability students can differ from the low-ability students. The speed trajectories of test takers may also differ across tests. The model can be used to explore effects of time limits on test takers' changes in working speed. Benefits of exploring heterogenous speed trajectories in relation to ability will depend on the application.

SOFTWARE
The main function of the package LNIRT is the LNIRT function to fit the joint model for RA and RTs. It checks the input, arranges the input for the MCMC algorithm, and constructs the data output. An object of class LNIRT is generated, and a summary function can be used to get a summarized view of the estimation results. In the most simple case, the user passes the data to the LNIRT function and uses the package's summary function to get an overview of the results. The complete functionality of the package can be accessed by making further input specifications.

Input
The following arguments are mandatory for the LNIRT function: RT: A matrix RT containing the log-normally transformed response times in wide format, (N) persons in rows and (K) items in columns. Y: A matrix Y containing the (binary) RA data in wide format, (N) persons in rows and (K) items in columns. The binary RA data is coded as zero (incorrect) and one (correct) (missing values as NA). data (optional if RT and Y are given): A list or a simLNIRT object (output object of the function simLNIRT) containing the RT and RA matrices and optionally predictors for the item and person parameters. If a simLNIRT object is provided, in the summary the simulated item and time parameters are shown alongside of the estimates. If the required variables cannot be found in the list, or if no data object is given, then the variables are taken from the environment from which LNIRT is called. XG: The integer number of MCMC iterations (the default is 1,000), this includes the burn-in period.
The remaining arguments are optional for the LNIRT function: burnin: The percentage of the total number of MCMC iterations (XG) which will serve as the burn-in period of the chains. The default is 10%. ident: Identification rule, (ident=1) restrict sum of difficulties and sum of time intensities to zero and (ident=2) restrict mean person parameters to zero. The default is (ident=2). The product of (time) discriminations is restricted to one. guess: A logical variable with the default FALSE, where TRUE (FALSE) represents (not) a guessing parameter in the IRT model. par1: A logical variable with the default FALSE, where TRUE represents the bracket notation for the mean term of the IRT component as in Eq. (2), and FALSE the nonbracket notation as in Eq. (1). In general, the MCMC performance is better for the nonbracket parameterization. XGresid: the number of MCMC iterations (default is 1,000) to be done before starting the residual computation. residual: A logical variable with the default FALSE. When TRUE, a complete residual analysis is done together with the estimation of the joint model parameters, as discussed in "Model Fit". The residual computations are started after XGresid MCMC draws. Therefore, XG should be greater than XGresid, and a sufficient number of MCMC iterations should be made after XGresid MCMC iterations to obtain accurate residual estimates. Preferably, at least 5,000 MCMC iterations are made, when doing a residual analysis. td: A logical variable with the default TRUE. When TRUE, the time-discrimination parameter is estimated. When FALSE, the time discrimination is restricted to one. WL: A logical variable with the default FALSE. When TRUE, the time-discrimination parameter represents the inverse of the measurement error variance parameter according to the parameterization of van der Linden (2007). alpha: An optional vector of length K of pre-defined item discrimination parameters.
beta: An optional vector of length K of pre-defined item difficulty parameters.
phi: An optional vector of length K of pre-defined time-discrimination parameters.
lambda: An optional vector of length K of pre-defined time-intensity parameters.
XPA: An optional matrix of predictor variables for the ability parameters, where the columns represent the predictor variables. Categorical predictor variables need to be dummy coded. XPT: An optional matrix of predictor variables for the speed parameters, where the columns represent the predictor variables. Categorical predictor variables need to be dummy coded. XIA: An optional matrix of predictor variables for the item difficulty parameters, where the columns represent the predictor variables. Categorical predictor variables need to be dummy coded. XIT: An optional matrix of predictor variables for the time-intensity parameters, where the columns represent the predictor variables. Categorical predictor variables need to be dummy coded. MBDY: An optional missing-by-design indicator matrix-of the same size as Y-for missing values (coded NA) in the Y matrix due to the test design (0 is missing by design, 1 is not missing by design). Multiple imputations are simulated for missing values (missing at random) in the Y matrix which are not assigned in MBDY as missing-bydesign. MBDT: An optional missing-by-design indicator matrix-of the same size as RT-for missing values (coded NA) in the RT matrix due to the test design (0 is missing by design, 1 is not missing by design). Multiple imputations are simulated for missing values (missing at random) in the RT matrix which are not assigned in MBDT as missing-by-design.

Output
The LNIRT function creates an object of class LNIRT, which stores the MCMC output, posterior draws and posterior mean estimates. The output of the function LNIRT is described in an itemized way, without including a residual computation. data: If available a data object from the function simLNIRT. -Item.Discrimination: Samples of discrimination parameters (XG by K).
-Item.Guessing: Samples of guessing parameters (XG by K). -Sigma2: Sampled values of measurement error variance parameters (XG by K).
-Var.Person.Ability: Sampled variance ability parameters. When the residual computation is included (residual=TRUE), then the LNIRT object includes the following additional output variables: EAPCP1: Posterior probability that the RT pattern is flagged as aberrant according to Eq. (16), using the posterior probability that the person-fit statistic l t is extreme as defined in Eq. (15) with a significance level of 0.05. EAPCP2: Posterior probability that the RA pattern is flagged as aberrant according to Eq. (13), using the posterior probability that the person-fit statistic, l y s , is significant (with a significance level of 0.05). EAPCP3: Posterior probability that both patterns (RA and RT) are flagged as aberrant according to Eq. (17). EAPKS: Posterior probability of an extreme KS-test result according to Eq. (22), which indicates that the RT residuals are not normally distributed. EAPKSA: A significance probability of an extreme KS-test that the latent residuals of RA items are not normally distributed. This significance test has no power. EAPresid: Posterior probability of an extreme (standardized) residual (RT data), which is greater than plus or minus two in absolute value. EAPresidA: Posterior probability of an extreme (standardized) latent residual (RA data), which is greater than plus or minus two in absolute value. IFl: The (negative) log-likelihood statistic for item fit under the IRT model (high values represent misfit). IFlp: The posterior significance probability of an extreme item fit under the IRT model. lZI: Item-fit statistic representing the posterior probability that the squared sum of item residuals are extreme under the model. EAPl0: The log-likelihood contribution of each RA observation, where a low value represents a misfit. PFl: The standardized (negative) log-likelihood contribution of each RA pattern, where a high value represents a misfit. PFlp: Posterior (significance) probability of observing a more extreme person-fit statistic for the RA pattern than the observed one. lZPA: Posterior significance probability of an extreme person-fit test for RA pattern based on latent residuals. This significance test has no power (under construction). lZPT: The (unstandardized) estimated person-fit statistic according to Eq. (14). lZP: Posterior (significance) probability of observing a more extreme person-fit statistic for the RT pattern than the observed one, according to Eq. (15).

MCMC output analysis
The posterior draws in the output of LNIRT are obtained through a Gibbs sampler. MCMC convergence tests are necessary to make sure that the MCMC chains converged before making statistical inferences. The coda package (Plummer et al., 2006) can be used to do an MCMC convergence analysis. The LNIRT function returns a single MCMC chain for each model parameter, which can be directly translated to an MCMC object with the function as.mcmc. The single-chain convergence diagnostics, for instance Geweke, and Heidelberg-Welch, can be used to examine if the chains did not converge. The multiplechain convergence diagnostics, for instance Gelman and Rubin's convergence statistic, requires multiple calls to LNIRT to create multiple MCMC chains with different starting values-LNIRT uses random starting values when parameter values are not pre-specified.
With the summary function of the package LNIRT, posterior means and standard deviations are computed and reported. Although the LNIRT Gibbs samplers produce efficient MCMC samples with low autocorrelation, it is possible that for a specific dataset some of the chains have a medium to high autocorrelation. Then, the reported standard deviation estimates may underestimate the standard deviation, since autocorrelation in the chains are ignored. The coda and the mcmcse package can be used to compute the autocorrelation and the Monte Carlo standard error. Note that the autocorrelation has no effect on the posterior mean estimate. The effective sample size-the sample size of independently distributed values with the same variance as the autocorrelated MCMC sample-can also be computed to examine if the run was long enough to make accurate and reliable inferences. A reasonable rule of thumb is to have an effective sample size of 400. Then, the Monte Carlo standard error is less than 5% of the overall uncertainty of the posterior mean.
The default burn-in period is 10% of the total number of MCMC iterations. This burnin period is used in the computation of the posterior estimates, which are also reported with the summary function. Extensive simulation studies showed that the burn-in period is usually below 100 MCMC iterations, but this could be higher for a specific data set. Furthermore, the MCMC properties are better for the RA model in Eq. (1) than for the RA model with the additional brackets given in Eq. (2).

APPLICATIONS The credential (Form 1) data
The credential data set of Cizek & Wollack (2016) is analyzed to illustrate the functionality of the package LNIRT. The credential data set concerns 1,636 test takers who applied for licensure. Form 1 of the test was administered, which consisted of 170 licensure exam items, and 30 pretest items. A total of 10 background variables of the test takers was stored -this includes the country where the candidate received his/her educational training, the state in which the test taker applied for licensure, and the center where the candidate took the exam. The RA and RT data were also stored. The collected data followed from a year of testing using a computer-based program that tests continuously.
Each candidate completed one of the three pretest forms, each consisting of 10 items. In Table 1, the test design is given, which shows the incomplete test design of in total 200 items for three groups.
The RA and RT data were extracted from the data to be used in the LNIRT() function. The RA data consisted of 1,636 test takers (in rows) and 170 exam items (in columns).
R> RT<-as.matrix(data[c(which(colnames(data)=="idur. In the first analysis, the (default) joint model was fitted to the data to explore the item parameter estimates and the item and person covariance matrix. The model was identified by restricting the population means of ability and speed to zero and by restricting the product of time discriminations and discriminations to one. A total of 5,000 MCMC iterations were computed, and the burn-in period was 500 (i.e., 10% of the total number of MCMC iterations).

MCMC convergence
Multiple MCMC chains were checked for convergence and run length by computing the Geweke and Heidelberger statistics, the effective sample sizes, and the MCMC standard errors. The convergence statistics did not show any problems of non-convergence for the examined chains. The lowest effective sample size was 427 for the discrimination parameter of item 155 (MCMC standard error of 0.007), and the highest 4,000 for the speed parameter of test taker 1 (MCMC standard error less than 0.000). The naive posterior standard deviation-when ignoring autocorrelation in the chain-was often just smaller than the times-series standard error estimate using the coda package. It was concluded that the burn-in period and the total length of the chains were sufficiently long to compute accurate parameter estimates.

Output analysis
The general summary function of LNIRT shows the item number (which corresponds to the column number of Y and of RT), and the posterior mean (labeled EAP) and standard deviation (labeled SD) estimates of the item and of the prior parameters of the items and persons. In Fig. 1, the estimation results for the first five item parameters are given. In Fig. 2, the item and person prior parameter estimates are given. The mean of the items, l I , is labeled in the output as mu_a (mean discrimination), mu_b (mean difficulty), mu_phi (mean time discrimination) and mu_lam (mean time intensity). The posterior mean estimate of the covariance matrix of the items, S I , is given under the label Sigma_I. The estimated mean item discrimination is around 1.19 with a variance of around 0.32. Around 10 items have an estimated item discrimination below 0.40. The item difficulty estimates ranges from −1.84 to.75, with a mean of −0.70 and a variance of 0.27. The estimated average time-discrimination and time-intensity is around 1.03 and 3.96, with a variance of 0.05 and 0.11, respectively. The time discrimination ranged from 0.47 to 1.42, Figure 1 Screenshot of the LNIRT summary output: item parameter estimates of the first five items.
Full-size  DOI: 10.7717/peerj-cs.1232/ fig-1 and it appears that the items show better discriminating performance in speed than in ability. However, the estimated working-speed variance is very small and around 0.03. The variance in RTs is mostly explained by the time-intensity parameters and hardly by differences in working speed across test takers. Thus, the time discriminations are somewhat higher than the item discriminations, but the time discriminations have a minor contribution in explaining variance in RTs. The covariance matrix of the item parameters shows that the less difficult items are more discriminating (correlation of −0.43), and the less time-intensive items are also more time-discriminating (correlation of −0.40). Differences in ability and speed are better measured with less difficult and less time-intensive items. The more difficult items are also more time-intensive (correlation of 0.46). The time-discriminating items are positively correlated with the item discriminations (correlation of 0.49).
The covariance matrix of the person parameters (ability and speed), S P , is given under the label Sigma_P. The variance in ability and speed across test takers are both small. There is a positive correlation between ability and speed of around 0.40, which states that more able test takers also work faster than the less able ones.

Explanatory variables
Dummy coded variables were created for the pretest groups, where effect coding is used such that the general intercept can be restricted to zero to identify the joint model. Differences in ability and speed were examined across pretest groups (stored in objects XA and XT for ability and speed), and the test-taker's total test time was used to explain the correlation between ability and speed (stored in object XA for ability). The observed total test times contained information on top of the speed values, since the latter were shrunk R> summary(out1) The effect of the total test time on ability was significant and around −0.07. Those who finished earlier scored on average higher than those who finished the test later. When accounting for the total test-time differences in measuring ability, the correlation between ability and speed was around 0.1. So, the total test-time explained around 75% of the correlation. There were no significant speed differences between the pretest groups, and the variance of speed was also small and around 0.03. There were estimated differences in ability between the pretest groups: group G 1 scored on average −0.13 lower, group G 2 0.07 higher, and group G 3 0.05 higher than the general average, but they were not significant.

Planned missing by design
In the planned missing data design, simultaneous parameter estimation for all examinees is possible using (R-code) LNIRT() function. It requires defining an indicator matrix for the planned missing data. In the matrix MBDM, a zero is a designed missing and a one a designed observation. The planned missing data matrix can be defined separately for the RA and RT data, and arguments MBDY and MBDT represent the design matrix for the RA and RT data, respectively. For instance, it is possible to include RA data in the analysis for which RT data was only partly collected. When including the pretest items in the measurement of ability and speed, the test design contains planned missing data. Then, the indicator matrix MBDM for the planned missing data is defined and included in the call to LNIRT: R> In the call to LNIRT, the arguments RTt and Yt contain the RT and RA data for all 200 items. Pre-specified item discriminations, alpha1, are used, since the model with free item discrimination parameters would not fit. The output is not discussed for reasons of brevity.

Model fit
The joint model-fit tools are illustrated by re-running the analysis with explanatory variables for the 170 items and activating the residual computation (R-code residual=TRUE).
R> out1 <-LNIRT(RT=RT,Y=Y,XG=5000,XPA=XA,XPT=XT, residual=TRUE) R> summary(out1) The summary report contains an overview of the extreme residuals under the header Residual Analysis. A total of 0.07% of RTs residuals were considered extreme with at least 95% posterior probability (Eq. (20)). This concerns RTs that were small and around 2-6 s or much higher than the item-average RTs. For the RA residuals, around 0.02% was estimated to be extreme with 95% posterior probability (Eq. (19)). This concerns incorrect responses from test takers with an above-average ability. For 62 items (36.5%) the assumption of log-normally distributed residuals was violated for which a significant probability of the KS test was computed (Eq. (22)). The variance in working speed and time intensities are small, and the estimated residual variance is around 0.26. Therefore, RT outliers more easily affect the fit of the log-normal distribution. The item-fit statistics (RA and RT data) did not identify a significant misfit of an item. Despite the outliers, loglikelihoods of item patterns were not significant under the joint model.
The EAPCP1 is reported and around 18.34%, representing the percentage of RT patterns which are considered extreme with 95% posterior probability. The percentage of significant extreme RT patterns is around 19.5%, when using a significance level of 0.05. The reported EAPCP2 is around 1.59% and the significant RA patterns is around 1.65%, which shows that there are only a few RA patterns identified as extreme. Finally, around 0.31% of the joint patterns (RA and RT) are extreme (object EAPCP3).
The heterogeneity in person-fit statistics for RA an RT patterns is examined with a linear regression of the statistics on the number of test attempts, country, and pretest group. The country variable was (dummy) recoded (US (Cgroup1=1),Philippines (Cgroup2=1),India (Cgroup3=1),Others (intercept)). The pretest groups were already represented by two dummy coded variables.
R> summary(lm(out1$PFl~as.factor(XFT$Attempt)+(XFT $Cgroup)+(XFT$Pgroup))) Coefficients: Estimate i statistic, the critical value is 201.4 when the significance level is 0.05 (l t i is chisquare distributed with 170 degrees of freedom). The intercept corresponds to test takers from pretest group 1 from citizens outside the US, India, and the Philippines, who took the test for the first time. For test takers with more attempts the person-fit statistic is on average higher and closer to the critical value. Those test takers are more likely to give a very fast or much slower response. Test takers from the US have on average much lower person-fit scores, and test takers from India have a much higher person-fit score.

The amsterdam chess data
The Amsterdam Chess data was collected by Van der Maas & Wagenmakers (2005) and is used to demonstrate a differential working speed analysis with the package LNIRT. The data study of Fox & Marianti (2016) is followed who also used this data set to illustrate discrimination item 1) to 9,000 (time intensity item 1). The naive posterior standard deviation were often close to the time-series standard error. The 10,000 MCMC iterations were sufficiently long and the chains showed convergence after 1,000 iterations. The summary function of LNIRTQ provides a similar output as the LNIRT function. The posterior mean estimates and (naive) standard deviations of the item parameters are given, in addition to the covariance matrix of the item and person parameters. For reasons of brevity, the correlation matrix of the items and the covariance matrix of the persons are discussed.
The correlation matrix of the items is again given in the order of a, b, f, and k. The correlation between discrimination and difficulty is around 0.37, and between time discrimination and intensity around −0.53. The difficult items tend to discriminate better in ability than the less difficult items. The high time-intensive items do not discriminate well between speed levels. RTs were hardly affected by increasing working speed for the high time-intensive items. For low time-intensive items this effect on the RTs is much higher. There is a strong correlation between item difficulty and time intensity of 0.79, which states that the difficult items also took more time to be completed.
---Covariance matrix Items ---Item Matrix 0.000 0.000 0.063 The covariance matrix of the random person parameters are given under the label Theta (ability), Intercept (speed intercept), Slope1 (speed trend), and Slope2 (speed quadratic component). The variance in working speed intercepts (i.e., the variance in starting speed of persons) is small (0.06). The speed trajectories of the persons show a variance of 0.11 in trends. The variance in deceleration/acceleration in working speed is around 0.06. The positive covariance between ability and the speed intercept shows that high-ability persons are more likely to start with a higher speed. The negative covariance between ability and the speed trend shows that the speed trajectories of high-ability persons has a negative trend (decrease in speed), in comparison to low-ability persons. The covariance between ability and the quadratic speed component is around −0.014, which means that highability persons are more likely to show an acceleration in the negative trend in speed.
For the records with missing values, the imputed data leads to population-average estimates of the random person effects. The ability estimate is around the population average of −0.162, and the speed components are around zero. For instance, for the record of 147 the estimated random person effects are (ability, intercept, trend, quadratic): R> outchess$Mtheta[147,] [1] -0.228 -0.018 -0.004 -0.004

SUMMARY AND DISCUSSION
Computer-based testing has made it possible to collect more information from respondents to improve our understanding and interpretation of test performances and of the behaviors of respondents. RT, the amount of time a test taker spends on answering an item, has shown to be a very useful source of information. RTs have been used to measure working speed, item time-intensity, or the relationship between speed and accuracy (i.e., speed-accuracy trade-off). The modeling of RTs and its integration in the measurement of ability has led to the development of joint models for RA and RTs. However, there is a lack of software tools that supports a joint model data analysis, which includes recent joint modeling extensions. The R-package LNIRT has been developed with the purpose to provide MCMC estimation methods for (hierarchical) joint models for RA and RT data, while including also new developments in this area. The IRT-based measurement models for RA and RT data have been implemented under different parameterizations to make the program suitable for different users. Imputation methods have been integrated to deal with missing data, while the program can also deal with planned missing by design data. Furthermore, Bayesian significance tests are included to evaluate person and item fit for RT and RA patterns. The developed person-fit statistics can be used to identify aberrant test takers with respect to their RT pattern, RA pattern or both patterns. Explanatory variables can be used to explain differences between persons and items.
The LNIRT package has been designed to estimate the parameters of a joint model with multiple link functions (linear, probit) and with non-exponential family distributions. The cross-classified nature of the (random) effects (item and person parameters) further complicates the use of standard (multilevel) software for parameter estimation. Furthermore, Bayesian simulation methods are preferred to handle the required highdimensional numeric integration for parameter estimation. However, black-box MCMC methods (e.g., JAGS) (1) can be very slow for medium to large data sets and for highdimensional models (2) cannot integrate identification restrictions in the simulation procedure-for instance re-scaling latent variables to an identified scale in each MCMC iteration-(3) necessary identifying restrictions on parameters can lead to complex priors for the model parameters, and (4) the computation of the model-fit tools is complex, since the tools need to be integrated in the model description. The MCMC sampler in the LNIRT package has been designed to collect posterior samples with low autocorrelation and a high effective sample size. This leads to a faster and more efficient algorithm than a black-box MCMC method, which is not designed to optimize the information content of the posterior samples.