Real-Time Bayesian Parameter Estimation for Item Response Models

Abstract. Bayesian item response models have been used in modeling educational testing and Internet ratings data. Typically, the statistical analysis is carried out using Markov chain Monte Carlo (MCMC) methods. However, these may not be computationally feasible when real-time data continuously arrive and online parameter estimation is needed. We develop an efficient algorithm based on a deterministic moment-matching method to adjust the parameters in real time. The proposed online algorithm works well for two real datasets, achieving good accuracy with considerably less computational time.


1 Introduction
Markov Chain Monte Carlo (MCMC) methods have revolutionized statistical computing in the past two decades. The developments in MCMC algorithms and software have enabled the use of Bayesian inference for item response theory (IRT) models in psychological and educational studies. For example, Johnson and Albert (1999) described Gibbs sampling methods for Normal Ogive (or Probit) IRT models; Patz and Junker (1999) developed a Metropolis-Hastings sampling method and illustrated it using the two-parameter logistic (2PL) IRT model; Fox and Glas (2001) proposed Gibbs sampling for multilevel IRT models; Ho and Quinn (2008a) fit a Bayesian ordinal item response theory (IRT) model via MCMC techniques for Internet ratings data, where the data are typically ordinal measurements on the quality of all kinds of items such as movies, consumer products, and so on; and Martin and Quinn (2002) and Wang et al. (2013) presented the MCMC strategy for dynamic item response models. For a brief survey of developments in item response modeling from a Bayesian perspective, see Albert (2015).
An MCMC method is a sampling, or nondeterministic, approach for implementing Bayesian inference. It is simple and often works well, even for complex models, but requires considerable computation. Take the ordinal IRT model in Ho and Quinn (2008a) for illustration: this model has advantages over traditional average ratings by accounting for the rater bias and item quality, but fitting the model by MCMC methods may not be feasible for large-scale real-time data. Indeed, for such a scenario, Ho and Quinn (2008a, Section 5) commented on the possible use of efficient deterministic approximation algorithms. The present paper aims to fill this gap by proposing an online deterministic algorithm for Bayesian parameter adjustment of the ordinal IRT models.
Online methods arise in several areas including statistics, machine learning, and control literature; see, for instance, Robbins and Monro (1951), Rosenblatt (1958), and Maybeck (1982). As opposed to an offline algorithm which processes the entire data at once, an online algorithm learns about one piece of data at a time and discards it after learning. On the one hand, since online algorithms do not see the entire data in advance, they often perform worse than their offline counterparts. On the other hand, online methods require less memory and have advantages when dealing with very large real-time data. In recent years, there has been a growing interest in online algorithms due to the emergence of large-scale Internet data. We refer to Saad (1998), Shalev-Shwartz (2011), and the references therein, for an overview of online methods.
Deterministic methods are useful alternatives to MCMC for Bayesian inference. Deterministic approaches such as variational Bayes (Bishop, 2008) and expectation propagation (Minka, 2001) are popular machine learning methodologies. They have been widely used by computer scientists to solve large data problems. Recently, the statistical community has devoted more attention to these methods; see, for example, Rue et al. (2009), Faes et al. (2011), and Hall et al. (2011), among others. Variational Bayes methods are a family of techniques that construct a lower bound for the marginal likelihood and provide an analytical approximation to the posterior distribution of the unobserved variables. Expectation propagation is an extension of assumed-density filtering (Maybeck, 1982; Boyen and Koller, 1998; also called "moment matching"): a general deterministic method that approximately minimizes the Kullback-Leibler divergence between the exact distribution p and its approximation q. With the constraint that q is in the Gaussian family, the solution follows from moment matching, or expectation constraints. It can be viewed as a variation of the extended Kalman filter, a popular sequential method for inference in dynamic systems that approximately updates the first two moments of the state distribution. Herbrich et al. (2007) employed the expectation propagation technique to develop TrueSkill™, the online ranking system for Xbox Live.
Recently, Weng and Lin (2011) proposed a new deterministic moment-matching approach based on a version of Stein's identity and exact calculation of certain integrals. When applied to online ranking of players, its accuracy is comparable to that of TrueSkill™, but both the running time and the code are much shorter. This approach has also been used in other applications. For example, Wistuba et al. (2012) modified it for move prediction in Computer Go and obtained promising experimental results; Chen et al. (2013) applied it to obtain an efficient online Bayesian ranking scheme in a crowdsourced setting. All of these applications involve models for ranked data.
The present paper extends the moment matching in Weng and Lin (2011) to IRT models. We begin by observing that, similar to the models for ranked data, IRT models often model the outcomes by normal or logistic distributions, though these can be more complicated. Next, we obtain online algorithms to adjust the parameters in certain ordinal IRT models, where the parameters may be allowed to be time-varying. Then we demonstrate the effectiveness of the proposed algorithm through two real datasets. Our experiments show that the proposed real-time estimation method works well for 100,000 ratings collected over time.
The organization of the paper is as follows. In Section 2, we introduce the motivating examples, review IRT models, and comment on parameter estimation. In Section 3, we describe the approximation method. Section 4 develops online algorithms. Section 5 presents experiments on simulated and real datasets. Section 6 concludes.
2 Online product ratings and IRT models

2.1 Motivating examples
Online consumer product ratings are growing rapidly and they play an increasingly important role in consumers' purchase decisions. For example, Chevalier and Mayzlin (2006) found a positive relationship between consumer book ratings and book sales; Liu (2006) showed that both positive and negative reviews of a movie increase its box office revenue. Here we discuss the statistical analysis of online ratings. Two online ratings datasets are considered: the ratings of news outlets from Mondo Times (http://www.mondotimes.com/) used in Ho and Quinn (2008a), and the ratings of movies from the GroupLens Research Project at the University of Minnesota (http://movielens.umn.edu).
Mondo Times is an online company that disseminates information on over 33,000 media outlets, including newspapers, television stations, and so on, in 213 countries. Registered members can submit five-point ratings of the content quality of news outlets, from 1 = awful, 2 = poor, 3 = average, 4 = very good, to 5 = great. The dataset presented in Ho and Quinn (2008a) consists of 4,511 ratings on 1,515 products (news outlets) from 946 raters, and is distributed with their Ratings package (available at http://cran.r-project.org/). The average number of ratings for a product is 3.0 (= 4,511/1,515) and the average number of ratings by a rater is 4.8 (= 4,511/946).
The movie ratings data were collected through the MovieLens web site during the seven-month period from September 19th, 1997, to April 22nd, 1998. This dataset consists of 100,000 movie ratings from 943 users on 1,682 movies. The ratings are also on a scale of 1 to 5. The average number of ratings for a movie is 59.5 (= 100,000/1,682), the average number of ratings by a user is 106.0 (= 100,000/943), and each user has rated at least 20 movies.
These online product ratings data are characterized by having numerous products and raters, with each rater rating only a small proportion of the products. For the Mondo Times data, the missing rate is 1 − 4,511/(946 × 1,515) ≈ 0.997; for the MovieLens data, it is 1 − 100,000/(943 × 1,682) ≈ 0.937. So, the rater-product matrix of ratings is sparse. The ratings data are typically displayed as a number of stars representing the mean rating of a product. Ho and Quinn (2008a) pointed out some potential problems with such displays. First, they do not take into account raters' rating behavior; for instance, some raters may be more inclined to use high or low ratings, while others may use all the rating categories. Weighting all raters equally may bias the results. Second, with just a number of stars for the mean rating, the display lacks a measure of statistical uncertainty. To address these problems, Ho and Quinn (2008a) proposed some graphical displays based on an ordinal IRT model, which can incorporate statistical uncertainty in the ratings and adjust for rater-specific factors. In the next subsection, we review some IRT models that are relevant to the present paper.
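The sparsity figures above follow directly from the counts; the small sketch below (ours, not from the paper) verifies the arithmetic:

```python
# Verify the missing rates of the two rater-product matrices quoted above.
def missing_rate(n_ratings, n_raters, n_products):
    """Fraction of empty cells in the rater-by-product ratings matrix."""
    return 1 - n_ratings / (n_raters * n_products)

mondo = missing_rate(4511, 946, 1515)        # Mondo Times
movielens = missing_rate(100000, 943, 1682)  # MovieLens
print(round(mondo, 3), round(movielens, 3))  # 0.997 0.937
```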

2.2 The IRT models
IRT models have been widely used in modeling dichotomously and polytomously scored data from educational tests; see van der Linden and Hambleton (2013) and De Ayala (2013).
Example 1: Basic IRT. The basic one-parameter IRT model is for analyzing dichotomously scored test data. It is based on the idea that the probability of a correct response to a test item is a function of the examinee's ability and the item's difficulty. The model has the form

P(Y_ij = 1 | θ_i, β_j) = F(θ_i − β_j),   (1)

where the item response variable Y_ij = 0 or 1, corresponding to whether the response to the jth item taken by the ith person is correct or not, θ_i represents the ability parameter of the ith person, β_j represents the difficulty parameter of the jth test item, and the item response function F(·) is a cumulative distribution function (c.d.f.) from a continuous distribution. Model (1) becomes the Rasch model (Rasch, 1961) when F(·) is the standard logistic c.d.f., and the Normal Ogive (or Probit) model when F(·) is the standard normal c.d.f.
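For concreteness, the Rasch response probability can be computed directly; the numbers below are an illustrative sketch of ours, not from the paper:

```python
import math

def rasch_prob(theta, beta):
    """P(Y = 1) = F(theta - beta), with F the standard logistic c.d.f. (Rasch model)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# An examinee whose ability matches the item difficulty succeeds half the time;
# ability one unit above the difficulty raises the probability to about 0.73.
print(rasch_prob(0.0, 0.0))            # 0.5
print(round(rasch_prob(1.0, 0.0), 2))  # 0.73
```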
Example 2: Ordinal IRT. Samejima (1969) introduced the graded response model to analyze ordered polytomous data. Let Y_ij denote the score of the ith person on item j. The model specifies the probability of the ith person responding in category c or higher on item j as

P(Y_ij ≥ c | θ_i) = F(β_j(θ_i − δ_{j,c})),   (2)

where θ_i is the proficiency of the ith person, β_j is the discrimination parameter for item j, δ_{j,c} is the item response parameter for item j and category c, c ∈ {0, 1, . . . , C_j}, and F(·) is the c.d.f. of a logistic or a normal distribution. Muraki (1990) proposed a modified graded response model suitable for Likert-type data, in which the categories are {1, 2, . . . , C} for all items. This model resolved the item response parameter δ_{j,c} into the item location parameter α_j and the category threshold parameter d_{c−1}, so that δ_{j,c} = α_j + d_{c−1}.

Ho and Quinn (2008a) proposed the following ordinal IRT model to fit online product ratings data. To begin, let Y_ij ∈ {1, 2, . . . , C} denote the rating of product i by rater j. They assume that the observed Y_ij is determined by an unobserved variable Y*_ij:

Y_ij = c  if  γ_{c−1} < Y*_ij ≤ γ_c,   (3)
Y*_ij = α_j + β_j θ_i + ε_ij,   (4)

where the γ_c are cutpoints with γ_0 = −∞ and γ_C = ∞. In (4), α_j captures the center location of rater j's ratings, β_j is assumed positive (to identify the sign of θ_i) and measures how well rater j discriminates between low and high quality, and θ_i represents the latent quality of product i. By (3), the probability of observing Y_ij in category c or higher is

P(Y_ij ≥ c | α_j, β_j, θ_i) = F(α_j + β_j θ_i − γ_{c−1}).   (5)

Note that ε_ij in (4) follows either the normal or the logistic distribution. We observe that model (5) is a variation of the graded response model and that it closely resembles (2), but they differ in the parameterization of the item location and category threshold parameters.
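To make the data-generating process concrete, the following sketch simulates ratings from the cutpoint formulation with standard normal noise; the parameter and cutpoint values are illustrative assumptions of ours, not estimates from the paper:

```python
import random

def simulate_rating(alpha_j, beta_j, theta_i, cutpoints, rng):
    """Draw one rating: Y* = alpha_j + beta_j*theta_i + eps with eps ~ N(0, 1),
    and Y = c iff gamma_{c-1} < Y* <= gamma_c (gamma_0 = -inf, gamma_C = +inf)."""
    y_star = alpha_j + beta_j * theta_i + rng.gauss(0.0, 1.0)
    return 1 + sum(1 for gamma in cutpoints if y_star > gamma)

rng = random.Random(0)
cutpoints = [-2.36, -0.44, 1.48, 3.88]  # interior cutpoints gamma_1..gamma_4
# A high-quality product (theta = 2) rated by a rater with alpha = 1, beta = 1
# mostly lands in the top categories.
ratings = [simulate_rating(1.0, 1.0, 2.0, cutpoints, rng) for _ in range(1000)]
```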
Example 3: Dynamic IRT. Many authors have proposed dynamic IRT models to capture the changes in a person's ability over time; for example, Embretson (1991), Martin and Quinn (2002), and Wang et al. (2013), among others. The simplest case has the form:

Observation equation: P(Y_i,j,t = 1 | θ_i,t) = F(θ_i,t − β_j),
State equation: θ_i,t = θ_i,t−1 + w_i,t,   (6)

where θ_i,t is the current ability of the ith person, θ_i,t−1 is the ability at the previous time point, w_i,t represents the random change in ability as in all linear dynamic models, and Y_i,j,t = 1 if the ith person answered test item j correctly on day t.
For fitting a modified graded response model to product ratings data, the temporal changes in θ i represent the evolution of product perception and popularity. Moreover, it is sensible that the raters' inclinations (α j , β j ) may change gradually. For example, see Koren et al. (2009) for modeling temporal effects for both products and raters in the context of movie ratings. We shall consider models with time-varying coefficients in Section 4.2.

2.3 Parameter estimation
There are maximum likelihood procedures for item parameters in IRT models; for example, joint maximum likelihood, conditional maximum likelihood, and marginal maximum likelihood. For a detailed account of these methods, see Bock and Aitkin (1981), Molenaar (1995), and the references therein. Albert (1992) and Kim (2001) found that the IRT parameter estimates using the maximum likelihood methods and the Gibbs sampling method with non-informative priors are similar.
Although the maximum likelihood methods for IRT models can give good parameter estimates in many cases, to the best of our knowledge, these estimation procedures have not been fully studied for large sparse data matrices in the literature. In fact, there are some potential problems. First, the maximum likelihood estimate may not be unique. For the Mondo Times data, many raters give very few ratings and many items receive very few ratings. Specifically, among the 946 raters, 410 rate just one item, and 138 rate exactly two items; among the 1,515 items, 777 receive just one rating, and 326 receive exactly two ratings. The histograms for "number of items rated by a rater" and "number of ratings an item receives" are given in Figure 1. However, for the ordinal IRT model (4) in Section 2.2, each rater is associated with an intercept parameter and a slope parameter; from (4), it is not difficult to see that these two parameters cannot be uniquely determined if a rater only rates one item. Second, obtaining the standard error estimates for the model parameters involves inverting a large Hessian matrix of the log-likelihood. For the Mondo Times data, this matrix is about 3,500 × 3,500. Even if the inverse exists, the computation may be burdensome, especially in an online setting. The MCMC approach of Ho and Quinn (2008a) works well for offline data; however, it may not be viable for large-scale real-time data.

3 The approximation method
In what follows, let φ and Φ denote the density and distribution function of a standard normal variable, and let φ(x|μ, σ) denote the density of the normal distribution with mean μ and standard deviation σ.

3.1 Moment equations
The proposed method is based on the moment equations (9) and (10), obtained from a version of Stein's identity. Stein's lemma (Stein, 1981) concerns the expectation of a function of a normally distributed random variable. The lemma is famous and of interest primarily because of its applications to the James-Stein estimator (James and Stein, 1961) and to empirical Bayes methods.
In the context of deriving sequential confidence levels, Woodroofe (1989) studied integrable expansions for posterior distributions and developed a variant of Stein's identity. It concerns the expectation with respect to a distribution of the form

p(z) ∝ φ(z) f(z),

where φ is the standard normal density; the degree of smoothness of f governs the order of the expansion. A motivation for considering such a distribution comes from large-sample theory: many suitably normalized quantities are asymptotically normally distributed and may be written in this form; therefore, by studying expectations with respect to it, one can possibly refine the normal approximation. Weng and Lin (2011) referred to this identity as Woodroofe-Stein's identity to distinguish it from Stein's lemma, and used the identity to obtain a Bayesian approximate moment-matching method.
As the identity involves complex notation, here we describe only some results necessary for the proposed online method. For a detailed account of the identity, we refer readers to Woodroofe and Coad (1997, Proposition 2) and Weng and Lin (2011, Corollary 2); see also Weng (2010, 2015) for some extensions. Let ψ* = (ψ*_1, . . . , ψ*_p) be a vector of parameters whose posterior density takes the form

p(ψ*) = C^{−1} f(ψ*) ∏_{i=1}^{p} φ(ψ*_i),   (8)

where the dependence on the data is suppressed in the notation and C is the normalizing constant. Further suppose that f is a twice continuously differentiable function. An application of Woodroofe-Stein's identity gives the following:

E[ψ*_i] = E[ (∂f(ψ*)/∂ψ*_i) / f(ψ*) ],   (9)
E[ψ*_i ψ*_q] − δ_iq = E[ (∂²f(ψ*)/∂ψ*_i ∂ψ*_q) / f(ψ*) ],   (10)

provided the expectations on both sides of (9) and (10) exist, where δ_iq = 1 if i = q and 0 otherwise, and [·]_iq indicates the (i, q) entry of a matrix.
From (9) and (10), the mean and variance of ψ*_i are

E[ψ*_i] = E[ f_i(ψ*)/f(ψ*) ],   (11)
Var[ψ*_i] = 1 + E[ f_ii(ψ*)/f(ψ*) ] − (E[ f_i(ψ*)/f(ψ*) ])²,   (12)

where f_i and f_ii denote the first and second partial derivatives of f with respect to ψ*_i. On the right-hand side of (9)-(12), we note that the terms inside the brackets do not involve the normalizing constant C, which is often intractable.
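The one-dimensional case of (9) can be checked numerically. In the sketch below, the likelihood factor f(ψ) = Φ(0.8ψ) is an illustrative choice of ours, and the identity E[ψ*] = E[f′(ψ*)/f(ψ*)] (expectations under the posterior) is verified by quadrature:

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # N(0,1) density
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))         # N(0,1) c.d.f.

# Posterior density proportional to phi(psi) * f(psi); the smooth factor
# f(psi) = Phi(0.8 * psi) is an illustrative likelihood, not from the paper.
f = lambda p: Phi(0.8 * p)
fp = lambda p: 0.8 * phi(0.8 * p)  # derivative of f

# Midpoint-rule quadrature on a grid wide enough for the N(0,1) prior.
n, lo, hi = 40000, -8.0, 8.0
w = (hi - lo) / n
xs = [lo + (k + 0.5) * w for k in range(n)]
C = sum(phi(x) * f(x) for x in xs) * w             # normalizing constant
mean = sum(x * phi(x) * f(x) for x in xs) * w / C  # E[psi*]
rhs = sum(fp(x) * phi(x) for x in xs) * w / C      # E[f'(psi*)/f(psi*)]
print(abs(mean - rhs) < 1e-6)  # True: the identity holds up to quadrature error
```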

3.2 A Bayesian moment-matching scheme
In the context of inferring individual skills from group comparisons, Weng and Lin (2011) proposed a two-step method where the first step approximates team skills and the second step relates the individual skill update to the team skill update. Both steps make use of the moment equations (9) and (10). Since the present paper shall exploit and extend the first step, below we briefly review this step. To begin, consider the paired comparison case where the strength parameter of player i is denoted by ψ_i. Assume that ψ_i follows N(μ_i, σ_i²), as in online rating systems such as Glicko (Glickman, 1999) and TrueSkill (Herbrich et al., 2007). Next, upon observing the game outcome D (i beats j), the skill distribution is updated to N(μ_i^new, (σ_i^new)²). The outcome is determined by the unobserved actual performances X_i:

D = {i beats j}  if and only if  X_i − X_j > 0.   (13)

When X_i − X_j follows the normal distribution, (13) corresponds to the Thurstone-Mosteller model (Thurstone, 1927), whereas, when X_i − X_j follows the logistic distribution, (13) becomes the Bradley-Terry model.

Now assume that the actual performance of player i is X_i ∼ N(ψ_i, η_i²). So, η_i represents the uncertainty in the actual performance, while σ_i represents the prior uncertainty in the strength ψ_i. Then the posterior density of ψ = (ψ_i, ψ_j) given the game outcome D (i beats j) is proportional to

φ(ψ_i | μ_i, σ_i) φ(ψ_j | μ_j, σ_j) f(ψ),   (14)

where

f(ψ) = F((ψ_i − ψ_j)/η̄),   (15)

η̄ = (η_i² + η_j²)^{1/2}, and F is the c.d.f. of the standardized performance difference (Φ in the Thurstone-Mosteller case). By (11), (14) and the chain rule, the posterior mean of ψ_i can be written in terms of the expectation of a partial derivative of f divided by f, as in (16). Weng and Lin (2011, Section 3.2) proceeded to show that one may evaluate the expectation on the right-hand side of (16) at ψ* = 0 and then correct the approximation by a scaling factor; that is, letting h(μ, σ, η̄) denote the resulting approximation as in (17), the mean update (18) follows with a scaling factor c defined in (19). The posterior variance can be expressed similarly: first, (12), (14) and the chain rule give the exact expression (20); then, evaluating at ψ* = 0 with the scaled argument yields an approximation in terms of h(μ, σ, cη̄), as in (21).
The idea of using a scaling factor to improve an estimate is not new; see, for instance, Spiegelhalter and Lauritzen (1990, Section 4.3) and MacKay (1992, Section 2.1) in the context of logistic regression.
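To illustrate the flavor of such moment-matched skill updates, here is a sketch of the classical Thurstone-Mosteller (probit) win update with the usual correction functions v and w, in the style of Glicko/TrueSkill systems; it is an illustration under our own parameter choices, not the exact update equations of Weng and Lin (2011):

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # N(0,1) density
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))         # N(0,1) c.d.f.

def win_update(mu_i, sig_i, mu_j, sig_j, eta=1.0):
    """Moment-matched (Gaussian) update of player i's skill after i beats j
    under the Thurstone-Mosteller (probit) model, TrueSkill-style; a sketch,
    not the exact update of Weng and Lin (2011)."""
    c = math.sqrt(sig_i ** 2 + sig_j ** 2 + 2 * eta ** 2)  # scale of X_i - X_j
    t = (mu_i - mu_j) / c
    v = phi(t) / Phi(t)  # mean correction
    w = v * (v + t)      # variance shrinkage, always in (0, 1)
    mu_new = mu_i + (sig_i ** 2 / c) * v
    var_new = sig_i ** 2 * (1 - (sig_i ** 2 / c ** 2) * w)
    return mu_new, math.sqrt(var_new)

mu, sig = win_update(25.0, 8.0, 25.0, 8.0)
# Beating an equally rated opponent raises the mean and shrinks the uncertainty.
```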
One may incorporate the time-varying strengths into the approximation technique described above. The very idea of modeling time-varying strengths for players of games has appeared in Glickman (1999), Glickman and Stern (1998), Dangauthier et al. (2008), and the references therein. Specifically, assume that individual i's strength has been updated from outcomes at time t − 1 (measurement update), and the strength is distributed as ψ i,t−1 ∼ N (μ i , σ 2 i ); and assume that the ability follows a Gaussian drift ψ i,t = ψ i,t−1 + w i,t , where w i,t ∼ N (0, ν 2 ) for some ν > 0. Then integrating the distribution of ψ i,t |ψ i,t−1 with respect to the prior distribution of ψ i,t−1 gives ψ i,t ∼ N (μ i , σ 2 i + ν 2 ) (time update), which is considered as the prior information for the test taken at time t.
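In filtering terms, this time update leaves the mean unchanged and simply inflates the variance; a minimal sketch:

```python
def time_update(mu, sigma, nu):
    """Gaussian drift psi_t = psi_{t-1} + w_t with w_t ~ N(0, nu^2):
    the prior for time t keeps the mean and inflates the variance by nu^2."""
    return mu, (sigma ** 2 + nu ** 2) ** 0.5

# A player with posterior N(0.2, 1) at time t-1 has prior N(0.2, 1.25) at time t
# when the drift variance is 0.25.
mu, sigma = time_update(0.2, 1.0, 0.5)
print(mu)  # 0.2
```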

4 Online inference of IRT models for Likert-type data
For the basic one-parameter IRT models (1), if the item response function F is taken as the standard logistic or normal c.d.f., then it is not difficult to derive real-time parameter adjustment by modifying Algorithms 1 and 3 in Weng and Lin (2011).

4.1 Sequential update
This subsection presents the sequential update rule when f in (24) comes from model (5), and comments on how similar procedures can be applied when f is as in (2).
For model (5), the posterior distribution of (α*_j, β*_j, θ*_i) given y_ij = c is of the form (24), with f determined by (25). This f resembles the f in (15), yet the η̄ in (15) is 1 here. The proposed estimates for the posterior means and variances are given in (26)-(31), where the scalar ν plays the same role as cη̄ in (19), and is derived by approximate calculation of a triple integral. The detailed derivations of these equations are relegated to Appendix A of the Supplementary Material (Weng and Coad, 2016). Since the variance approximations in (26)-(28) may be negative, in the algorithm below we set a small positive lower bound κ in (38)-(40) to avoid a negative variance.
The proposed procedure is described in Algorithm 1. Clearly, from (37), if a rater j is not discriminating (so μ_βj ≈ 0), then his/her ratings have little effect on μ_θi; similarly, from (36), if μ_θi ≈ 0, then a rating on item i has little effect on μ_βj.
The parameter update for model (2) can be derived similarly. To see how, first note that an observed Y_ij from model (2) can be determined by an unobserved variable Y†_ij falling in the interval (d_{c−1}, d_c]. Therefore, by approximating the distribution of Y†_ij/β_j, we can estimate the threshold parameters (d_1, . . . , d_{C−1}) in the same manner as for γ. The sequential update for the posterior means and variances of α_j, β_j, and θ_i can be derived analogously. We omit the details, but remark that the property that a rating on an item with μ_θi ≈ 0 has little effect on μ_βj does not hold for model (2). The reason is that β_j in (2) interacts with not only θ_i, but also α_j and d_{c−1}.

4.2 Time-varying parameters
At the end of Section 2.2, we reviewed the dynamic IRT model. In particular, we commented that both the product perception (θ) and the raters' inclinations (α, β) may evolve over time in the product ratings scenario. For the ordinal IRT models (2) and (5), one may simply extend (6) to all the variables α, β, and θ:

α_{j,t} = α_{j,t−1} + w_{αj,t},  β_{j,t} = β_{j,t−1} + w_{βj,t},  θ_{i,t} = θ_{i,t−1} + w_{θi,t},

where w_{αj,t}, w_{βj,t}, and w_{θi,t} are assumed to be normally distributed with a common unknown variance σ². The unknown σ² can be estimated by maximizing the marginal likelihood (also called the "evidence") over a series of N observations. This approach has been proposed by Jazwinski (1969) in the context of the Kalman filter and extended Kalman filter, and by de Freitas et al. (2000) in the context of neural networks, among others.
For the stationary ordinal IRT model (5), the marginal likelihood of an observation y_ij = c is

P(y_ij = c) ≈ F((γ_c − μ_αj − μ_βj μ_θi)/ν) − F((γ_{c−1} − μ_αj − μ_βj μ_θi)/ν),   (43)

where the approximation follows from the derivation in Appendix A of the Supplementary Material (Weng and Coad, 2016), and ν² = 1 + σ²_αj + σ²_βj μ²_θi + σ²_θi μ²_βj. For the dynamic case, the marginal likelihood is the same except that the prior variances for α_j, β_j, and θ_i are now σ̄²_αj = σ²_αj + σ², σ̄²_βj = σ²_βj + σ², and σ̄²_θi = σ²_θi + σ², respectively; and hence

P(y_ij = c) ≈ F((γ_c − μ_αj − μ_βj μ_θi)/ν̄) − F((γ_{c−1} − μ_αj − μ_βj μ_θi)/ν̄),   (44)

where

ν̄² = 1 + σ̄²_αj + σ̄²_βj μ²_θi + σ̄²_θi μ²_βj.   (45)

The unknown state noise parameter σ² can either be chosen by users or be estimated by maximizing the sum of the log-marginals over N ratings:

σ̂² = arg max_{σ²} Σ_{(i,j)} log P(y_ij),   (46)

where the sum runs over the N ratings. Once σ² is determined, the online parameter update is the same as Algorithm 1 except that ν(t) in (32) is now replaced by ν̄ in (45). The marginal likelihood for model (2) and the update of its time-varying parameters can be handled similarly; we omit the details.
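The evidence-maximization step just described can be sketched as a simple grid search. The probit form of the category probability and the inflated scale ν̄ follow the text, while the data layout, function names, and grid are our own illustrative assumptions:

```python
import math

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal c.d.f.

def log_marginal(y, mu_a, s2_a, mu_b, s2_b, mu_t, s2_t, gammas, s2_state=0.0):
    """Approximate log-marginal of one rating y (probit case): the prior
    variances are inflated by the state noise s2_state, and the squared scale
    is 1 + s2_a + s2_b * mu_t^2 + s2_t * mu_b^2, as in the text."""
    s2_a, s2_b, s2_t = s2_a + s2_state, s2_b + s2_state, s2_t + s2_state
    nu = math.sqrt(1 + s2_a + s2_b * mu_t ** 2 + s2_t * mu_b ** 2)
    g = [-math.inf] + list(gammas) + [math.inf]  # gamma_0, ..., gamma_C
    center = mu_a + mu_b * mu_t
    p = Phi((g[y] - center) / nu) - Phi((g[y - 1] - center) / nu)
    return math.log(max(p, 1e-300))              # guard against log(0)

def choose_state_noise(ratings, params, gammas, grid):
    """Pick the state noise variance on a grid by maximizing the summed
    log-marginal over ratings given as (i, j, y) triples."""
    return max(grid, key=lambda s2: sum(
        log_marginal(y, *params[i, j], gammas, s2) for (i, j, y) in ratings))

gammas = (-2.36, -0.44, 1.48, 3.88)
params = {(0, 0): (1.0, 1.0, 1.0, 20.0, 0.0, 1.0)}  # (mu_a, s2_a, mu_b, s2_b, mu_t, s2_t)
best = choose_state_noise([(0, 0, 3)], params, gammas, [0.0, 0.002, 0.01])
```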

5 Experiments
We conduct experiments to assess the performance of Algorithm 1 on simulated data and two Internet ratings datasets.

5.1 Numerical issues
There are some numerical issues in the implementation. Note that the functions Ω and Δ in (30) play important roles in the algorithm. If both the numerator and the denominator of some term on the right-hand side of (30) approach zero, computational difficulties may arise. To take a closer look at these functions, we introduce an auxiliary function Λ(x, a) and rewrite Δ(x, a) as Δ(x, a) = Λ(x, a) + (Ω(x, a))². Therefore, in our implementation, we exploit the symmetry properties of Ω and Δ and bound Ω(x, a) when |x| is too large, to avoid numerical difficulties.
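The kind of safeguard involved can be illustrated with the familiar hazard-type ratio φ(t)/Φ(t), which suffers the same 0/0 underflow in the far tail; the clipping bound below is our illustrative choice, not the paper's exact Ω/Δ code:

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
# Compute Phi via erfc so the lower tail does not round to exactly zero too early.
Phi = lambda x: 0.5 * math.erfc(-x / math.sqrt(2))

def v_naive(t):
    return phi(t) / Phi(t)  # 0.0/0.0 once both tails underflow (t << -30)

def v_safe(t, bound=-30.0):
    """phi(t)/Phi(t) with the argument clipped below; for t << 0 the exact
    ratio behaves like -t, so clipping introduces little error while avoiding
    0/0. An illustrative safeguard, not the paper's Omega/Delta code."""
    return v_naive(max(t, bound))

print(v_safe(-50.0))  # finite (about 30), whereas v_naive(-50.0) divides 0 by 0
```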

5.2 Simulation
We assume that each α_j follows N(1, 1), each θ_i follows N(0, 1), and each β_j follows N(1, 20), with all parameters mutually independent. This corresponds to setting the initial parameter values in Algorithm 1 as μ_αj = 1, σ²_αj = 1, μ_βj = 1, σ²_βj = 20 for j = 1, . . . , J, and μ_θi = 0, σ²_θi = 1 for i = 1, . . . , I. Note that these priors are exactly the same as in Ho and Quinn (2008a), except that their prior for β_j is N(−5, 20) truncated to be positive, while our β_j prior has a positive mean but no truncation. Our reason for not restricting β_j is as follows. The purpose of constraining β_j ∈ R+ for all raters j is to identify the sign of θ_i; however, this may result in misleadingly assigning β_j to be near zero for some raters. In fact, it is appropriate not to restrict the parameter β_j, because the induced sign identifiability problem can be resolved by constraining a particular θ_i to be positive; for example, see Ho and Quinn (2008b).
For the γ_c values, recall that γ_0 = −∞ and γ_5 = ∞ in (3). Ho and Quinn (2008a) proposed to constrain γ_1 = 0 and set improper uniform priors for the unknown cutpoints γ_c, c = 2, 3, 4. We propose instead to use relation (3) and set them by the following steps. First, calculate the observed proportions #{y_ij = c}/N for c = 1, . . . , 5, where N is a prespecified number of ratings. Next, find the z-scores corresponding to these proportions. Finally, convert the z-scores to the y* scale approximately. Now we specifically describe how we set the cutpoints for the Mondo data; the treatment for other data is similar. To begin, we randomly selected 1,000 ratings and found the relative frequencies for c = 1, . . . , 5 to be about 0.22, 0.13, 0.19, 0.20, 0.26. The z-values corresponding to these areas are z = (−0.7, −0.3, 0.1, 0.6). Note that β_j θ_i follows the normal product distribution, which is symmetric about zero; see Glen et al. (2004) and the references therein for a detailed description of this distribution. Therefore, roughly speaking, the unconditional distribution of Y* is approximately N(1, 23), where the variance is derived from the independence assumptions among the parameters. Then, converting the z-values to the y* scale gives

(γ_1, γ_2, γ_3, γ_4) = (−2.36, −0.44, 1.48, 3.88).   (47)
These cutpoints will be used in Section 5.3.
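The cutpoint construction just described can be sketched as follows. The bisection quantile is our own dependency-free stand-in for a normal quantile routine, and exact z-scores give slightly different values from the rounded ones used in the text:

```python
import math

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal c.d.f.

def z_from_p(p, lo=-10.0, hi=10.0):
    """Standard normal quantile by bisection (a dependency-free stand-in)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def cutpoints(props, mean=1.0, var=23.0):
    """Map observed category proportions to cutpoints on the Y* scale,
    assuming Y* is approximately N(mean, var) as argued in the text."""
    gammas, cum = [], 0.0
    for p in props[:-1]:  # cumulative proportions give the z-scores
        cum += p
        gammas.append(mean + math.sqrt(var) * z_from_p(cum))
    return gammas

g = cutpoints([0.22, 0.13, 0.19, 0.20, 0.26])
# With exact z-scores this gives roughly (-2.70, -0.85, 1.48, 4.09); the text's
# rounded z-values (-0.7, -0.3, 0.1, 0.6) lead to (-2.36, -0.44, 1.48, 3.88).
```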
For evaluating accuracy, we first compute the approximations (26) based on a single observation y ij . Given the parameter values as the initial settings, Table 1 presents our approximations and the estimates using numerical integration. Note that the initial μ αj and σ αj are both 1. In general, our approximations made smaller adjustments (so the estimates are closer to the initial values 1). Clearly, when the observed rating is higher, the posterior mean of α j tends to be larger; and the variance is bigger when the rating is high or low. Overall, the proposed approximation is quite accurate.
Next, we evaluate the performance of our algorithm on simulated data generated from the ordinal IRT model specified in (3) and (4). We take I = 50 and J = 200, and set the cutpoints to be (γ_1, γ_2, γ_3, γ_4) = (−6, −2, 2, 6). Then we follow the prior distributions in this section to generate the α_j, β_j, and θ_i, to set the initial parameter values for our algorithm, and to estimate the cutpoints by the empirical frequencies of the initial 500 ratings. Here we restrict θ_1 to be positive so as to identify the signs of the model parameters. A total of 10,000 ratings are generated. We conduct online estimation on 20 random permutations of the dataset. Table 2 reports the mean squared error (MSE) between the estimate θ* and the true θ for each of the 20 random permutations; that is, MSE = Σ_i (θ*_i − θ_i)²/I. In general, our approximations are satisfactory.

5.3 Mondo Times data
The Mondo Times data are taken from Ho and Quinn (2008a), who removed raters who rated fewer than five products, and removed products that are only rated by these raters. This leaves 3,249 ratings on 1,344 products from 232 raters. Hereafter, we shall call these the "sub-Mondo" data. Since there is no information about the timestamps of the ratings, we apply our algorithm to two randomly shuffled orderings of the ratings.
Next, we compare the starplots by the offline MCMC method and the proposed online method for the sub-Mondo data, both with fixed γ, as in (47). As the package Ratings is designed for random γ, we have modified Ratings (available upon request) in order to obtain MCMC estimates with fixed γ. The results are in the left and center plots of Figure 4. We observe that, compared with the starplot by MCMC with random γ (Figure 3), the results of MCMC with fixed γ are much closer to our online method. The right plot in Figure 4 is by our method for the whole-Mondo data. Overall, the three starplots are in good agreement.
We also compare the estimates of θ. Since the MCMC samples reveal that the posterior densities of θ_i for the twelve news outlets are roughly bell-shaped (not shown here), Table 3 reports only the posterior means and standard deviations. In Table 3, the column MCMC1 is obtained by the package Ratings with default initial γ = (γ_1, . . . , γ_4) = (0, 1, 3, 5), where γ_1 = 0 is fixed while γ_2, γ_3, γ_4 are to be estimated; MCMC2 is obtained from the modified package in which the γ values are fixed, as in (47). The two online columns are results from two random orderings of the data, using the same fixed γ. The two MCMC columns reveal that the posterior means with random γ are somewhat different from those with fixed γ, and that the standard deviations are generally smaller when γ is fixed. Moreover, for each news outlet, the 95% posterior intervals from the four columns mostly overlap, except for the Great Falls Tribune with MCMC2. Overall, our online method performs well.

5.4 MovieLens data
The MovieLens data contain a timestamp (in Unix seconds since 1/1/1970 UTC) for each rating. So, we sort the ratings according to the timestamp and study both the stationary and the dynamic model (Section 4.2). We set the initial parameter values as in Section 5.2. For the dynamic model, we set the state noise parameter σ² = 0.002, obtained by maximizing the approximate marginal likelihood (44) over 5,000 ratings. The top 10 movies are presented in Figure 5. The left panel is for the stationary case, using Algorithm 1; the right panel is for the time-varying case, which is the same as Algorithm 1 except that ν(t) in (32) is replaced by ν̄ in (45).

5.5 Remarks
The positive lower bound κ in (38)-(40) is set to be 0.0001 to avoid a negative variance. Our experiments show that the results for the two real datasets are not sensitive to the κ value.
As for the computational time, producing a starplot in Figure 3 for the sub-Mondo data takes about five minutes using the Ratings package, but just three seconds using the online algorithm written in pure R code. We also generated a subset of the MovieLens data, with about 68,000 ratings, for which the online algorithm takes about 20 seconds, while the MCMC method via Ratings takes about 30 minutes. The offline MCMC method may take even more time if real-time parameter adjustment is required. The code for the online method is available at http://www3.nccu.edu.tw/~chweng/BaIRT.zip

6 Concluding remarks
Online methods are necessary when large data continue to arrive and real-time parameter adjustment is needed. We have shown how to develop Bayesian online parameter estimation for IRT models with Likert-type data. The experiments on two real datasets are satisfactory.
We have also compared our real-time estimation method with the offline MCMC methods via the package Ratings and its modification for fixed γ. Though it sacrifices some accuracy, our proposed method in general achieves good performance with considerably less computational time. Thus, for situations where faster approximate methods are desirable, our proposed method can be a useful alternative to offline methods. That said, we have to point out some limitations of our method. First, with only mean and variance updates, our method cannot provide estimates of other posterior quantities of interest (such as quantiles), which are easily obtainable by MCMC methods. Moreover, our use of fixed cutpoints γ based on the initial observed proportions #{y_ij = c}/N (see Section 5.2) may be inappropriate. We may alleviate this weakness by updating the observed proportions sequentially, and then adjusting the γ values either sequentially or after collecting a certain number of observations. The adaptive γ can better reflect the evolution of the ratings distribution. In an unreported experiment, we adjusted γ after every 500 ratings for the sub-Mondo data (Section 5.3). The results are close to those for the unadjusted case, possibly because the ratings are randomly permuted; furthermore, the computation time does not increase much because the observed rating frequencies can be easily accumulated.
Some questions deserve further study. First, some theoretical analysis on the discrepancies between the offline method and the proposed online one would be worthwhile. This can be challenging because the rater-product matrix of ratings is very sparse and the number of parameters would increase with the data. Second, it is desirable to extend the proposed method to other IRT models, such as the Rating Scale model (Andrich, 1978) and the Partial Credit model (Masters, 1982). These models are direct extensions of the one-parameter IRT model in Example 1, and sufficient statistics exist for the model parameters. One may also consider a bias term for the product in model (5). In principle, similar derivations could be obtained straightforwardly.