Efficient moment calculations for variance components in large unbalanced crossed random effects models

Large crossed data sets, described by generalized linear mixed models, have become increasingly common and provide challenges for statistical analysis. At very large sizes it becomes desirable to have the computational costs of estimation, inference and prediction (both space and time) grow at most linearly with sample size. Both traditional maximum likelihood estimation and numerous Markov chain Monte Carlo Bayesian algorithms take superlinear time in order to obtain good parameter estimates. We propose moment based algorithms that, with at most linear cost, estimate variance components, measure the uncertainties of those estimates, and generate shrinkage based predictions for missing observations. When run on simulated normally distributed data, our algorithm performs competitively with maximum likelihood methods.


Introduction
Modern electronic activity generates enormous data sets with an unbalanced crossed random effects structure. The factors are customer IDs, URLs, product IDs, cookies, IP addresses, news stories, tweets, and query strings, among others. These variables could be treated as fixed effects, plain categorical variables that just happen to have a large number of levels. But in many cases, the specific category levels are evanescent. Customers turn over at some rate, cookies get deleted at an even faster rate, products or news stories grow in popularity but then fade. In such cases it is more realistic to treat such variables as random effects. We want our inferences to apply to the population from which the future and observed levels of those variables are sampled. Furthermore, for realism we should treat data in the same level of a factor as correlated.
The statistically efficient way to treat data sets with crossed random effects is through generalized linear mixed models (GLMMs), maximizing the likelihood with respect to both the parameters and the random effects. However, the cost of these computations is dominated by a Cholesky decomposition that takes time cubic in the number of distinct levels and space quadratic in that number; see Bates (2014) or Raudenbush (1993). Such costs are infeasible for big data.
It has been suggested to us that stochastic gradient descent (SGD) could provide an alternative way to maximize the likelihood. However, SGD approaches have only been developed for data that can be split into independent subsets, which is not possible for data sets with crossed random effects.
With GLMMs infeasible, it is natural to consider the Gibbs sampler and other Markov chain Monte Carlo (MCMC) methods. But, as shown in Section 2, those methods have, in the crossed random effects context, a computational cost that is superlinear in the sample size. This is very different from the great success that MCMC has on hierarchical models for data with a nested structure. See for instance Gelman et al. (2012), Snijders (2014) and Yu and Meng (2011).
With both likelihood and Bayesian methods running into difficulties, we turn to the method of moments. It seems ironic to use a 19th century method in this era of increased computer power. But data growth has been outpacing processing power for single-threaded computation, so it is appropriate to revisit methods from an earlier time when the data was large compared to the available computing power. A compelling advantage of the method of moments is that it is easily parallelizable. It also makes very weak assumptions, has no tuning parameters, and does not require cumbersome diagnostics.
We are motivated by generalized linear mixed models with linear predictors but we focus the present paper on a very special case. We consider a setting with identity link, just two factors that are both random, and intercept only regression. In this paper, we assume that the data follow Model 1. Two-factor crossed random effects: Y_ij = µ + a_i + b_j + e_ij, where the a_i, b_j, and e_ij are mutually independent with mean zero and variances σ²_A, σ²_B, and σ²_E respectively. In the available data we only see N of the Y_ij, where 1 ≤ N < ∞, in R distinct rows (i's) and C distinct columns (j's). We assume that observations are missing completely at random. See Section 7.1 for comments on informative missingness. Note that we do not make any distributional assumptions. We choose this model because it is the simplest case that exhibits the intrinsic difficulty of the large unbalanced crossed random effects setting, even though it may not describe real-world data well. Our goal is not to resolve the issue of analyzing massive crossed data sets via GLMMs in one go. Instead, we consider a simple GLMM for crossed data and study parameter estimation in that model, which is still a challenging problem.
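To make the setting concrete, the following sketch simulates data from model (1). It is only an illustration: the Gaussian effect distributions, the sizes R and C, and the 5 percent observation rate are arbitrary choices for the example, and the model itself makes no distributional assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; model (1) requires neither balance nor these values.
R, C = 200, 300
mu = 1.0
sigma2_A, sigma2_B, sigma2_E = 2.0, 0.5, 1.0

a = rng.normal(0.0, np.sqrt(sigma2_A), size=R)   # row effects a_i
b = rng.normal(0.0, np.sqrt(sigma2_B), size=C)   # column effects b_j
Z = rng.random((R, C)) < 0.05                    # cells observed, missing completely at random

rows, cols = np.nonzero(Z)
e = rng.normal(0.0, np.sqrt(sigma2_E), size=rows.size)
Y = mu + a[rows] + b[cols] + e                   # Y_ij = mu + a_i + b_j + e_ij

# Log-file representation: one (i, j, Y_ij) triple per observed cell.
triples = list(zip(rows.tolist(), cols.tolist(), Y.tolist()))
```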
Let θ = (σ²_A, σ²_B, σ²_E)^T be the vector of variance components. Our first task is to get an unbiased estimate θ̂ of θ at computational cost O(N) and using additional storage that is O(R + C), which is often sublinear in N.
Our second and more challenging task is to find the variance of θ̂, Var(θ̂ | θ, κ). This variance depends on both θ and the vector of kurtoses of the random effects, κ = (κ_A, κ_B, κ_E)^T. We develop formulas V(θ, κ) approximating Var(θ̂ | θ, κ) that can be computed in O(N) time and O(R + C) storage, given values for θ and κ. After developing an estimate κ̂ that can be computed in O(N) time and O(R + C) space, we use V(θ̂, κ̂) as our plug-in estimate of the variance of θ̂.
Notice that in order to achieve the complexity bounds, we choose to over-estimate Var(θ̂). Specifically, we require the functions V to satisfy diag(V(θ, κ)) ≥ diag(Var(θ̂ | θ, κ)) elementwise. There is a trade-off in selecting V though; the less conservative it is, the more time is needed to compute it.
For large data sets we might suppose that Var(θ̂) is necessarily very small and that getting exact values is not important. While this may be true, it is wise to check. The effective sample size (as defined in Lavrakas (2008)) in model (1) might be as small as R or C if the row or column effects dominate. Moreover, if the sampling frequencies of rows or columns are very unequal, then the effective sample size can be much smaller than R or C. For example, the Netflix data set (Bennett and Lanning, 2007) has N ≈ 10^8. But there are only about 18,000 movies and so for statistics dominated by the movie effect the effective sample size might be closer to 18,000. That the movies do not appear equally often would further reduce the effective sample size. Indeed, Owen (2007) shows that for some linear statistics the variance could be as much as 50,000 times larger than a formula based on IID sampling would yield. That factor is perhaps extreme but it would translate a nominal sample size of 10^8 into an effective sample size closer to 2,000.
An outline of this paper is as follows. Section 2 describes the difficulties with Gibbs sampling and other MCMC algorithms for crossed random effects, as suggested by theoretical results and shown through simulations. Section 3 introduces further notation and assumptions. Section 4 presents our linear-cost algorithm to estimate θ and conservatively approximate the variance of that estimate. Section 5 studies how knowledge of σ 2 A , σ 2 B , and σ 2 E can be used to construct shrinkage predictions of unknown Y ij . Section 6 illustrates the methods in Section 4 on both simulated Gaussian data and real world data. Section 7 concludes the paper and discusses informative missingness. The appendix, Section 8, has a proof of convergence rates for MCMC methods and tables of their simulation results. A supplement, Sections 9-19, develops the variance formulas for our moment estimates and provides proofs of our theorems about prediction. We conclude this section with a few more pointers to the literature.
Our procedure to find variance component estimates is similar to those of Henderson (1953) as described in Searle et al. (2009, Chapter 5). Some differences are that we use U-statistics, and that we find variance component estimates and variances of those estimates in time and space O(N). For one of Henderson's algorithms, even the point estimates require superlinear computation in inverting R × R or C × C matrices. Moreover, the majority of Searle et al. (2009) considers Gaussian data, which makes the kurtoses zero. Gaussian variables are not a reasonable assumption in our target applications and so we develop kurtosis estimates.
For crossed random effects models with missing data Clayton and Rasbash (1999) propose an alternating imputation-posterior (AIP) algorithm, which they show has good performance on fairly large data sets. It may be termed a 'pseudo-MCMC' method since it alternates between sampling the missing data from its distribution given the parameter estimates and sampling the parameters from a distribution centered on the maximum likelihood estimates. Because of this last step, we do not consider AIP to be scalable to Internet size problems.
In our model (1), for simplicity the variance components are homoscedastic. Alternatively, we could allow them to be heteroscedastic; see Owen (2007) or Owen and Eckles (2012), who study bootstrap variance estimates for means and smooth functions of means. The latter paper also considers a more complex model in the sense that there are more than two factors as well as interactions among factors.

MCMC for large crossed data
In this section we consider some common MCMC methods to estimate the parameters σ 2 A , σ 2 B , and σ 2 E of model (1). For this section only, we assume that a i , b j and e ij are normally distributed.
Balanced data is a fully sampled R×C matrix with Y ij for rows i = 1, . . . , R and columns j = 1, . . . , C. We present some analyses for the balanced case with interspersed remarks on how the general unbalanced case behaves. The balanced case allows sharp formulas that we find useful and that case is the one we simulate. In particular, we can obtain convergence rates for some MCMC algorithms.
To estimate σ²_A, σ²_B, and σ²_E we sample from the posterior distribution given the data: π = p(µ, a, b, σ²_A, σ²_B, σ²_E | Y), where a is the vector of a_i and b is the vector of b_j. Let S^(t), for t ≥ 1, denote the resulting chain. While MCMC is effective for hierarchical random effects models, it scales badly for crossed random effects models, as we see here. In limits where R, C → ∞, the dimension of our chain S^(t) approaches infinity. Convergence rates of many MCMC methods slow down as the dimension of the chain increases, making them ineffective for high dimensional parameter spaces. The MCMC methods we consider go over the entire data set at each iteration. There are alternative samplers that save computation time by only looking at subsets of data at each iteration. However, so far those approaches are developed for IID data and not the crossed random effects setting. To study convergence rates, we also consider the Gibbs sampler that, holding µ and the variance components fixed, samples from φ, the conditional distribution of the random effects (a, b) given the data and those parameters.
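For reference, one full sweep of a basic Gibbs sampler for π looks as follows. This is only a sketch under assumed Gaussian random effects, a flat prior on µ, and inverse-gamma priors on the variance components; it is one standard choice, not necessarily the exact sampler used in the simulations below. Each sweep touches every observation, so an iteration costs O(N).

```python
import numpy as np

def gibbs_sweep(rows, cols, Y, a, b, mu, s2A, s2B, s2E, rng, alpha0=1.0, beta0=1.0):
    """One sweep of a basic Gibbs sampler for the Gaussian crossed model.

    rows, cols, Y: length-N arrays holding the (i, j, Y_ij) triples.
    Assumed priors for this sketch: mu flat, variance components InvGamma(alpha0, beta0).
    """
    R, C, N = a.size, b.size, Y.size

    # Update each a_i from its normal full conditional.
    resid = Y - mu - b[cols]
    prec = np.bincount(rows, minlength=R) / s2E + 1.0 / s2A
    a[:] = rng.normal(np.bincount(rows, weights=resid, minlength=R) / s2E / prec,
                      np.sqrt(1.0 / prec))

    # Update each b_j from its normal full conditional.
    resid = Y - mu - a[rows]
    prec = np.bincount(cols, minlength=C) / s2E + 1.0 / s2B
    b[:] = rng.normal(np.bincount(cols, weights=resid, minlength=C) / s2E / prec,
                      np.sqrt(1.0 / prec))

    # Update mu (flat prior) and the three variance components (inverse gamma).
    mu = rng.normal(np.mean(Y - a[rows] - b[cols]), np.sqrt(s2E / N))
    s2A = 1.0 / rng.gamma(alpha0 + R / 2.0, 1.0 / (beta0 + 0.5 * np.sum(a ** 2)))
    s2B = 1.0 / rng.gamma(alpha0 + C / 2.0, 1.0 / (beta0 + 0.5 * np.sum(b ** 2)))
    err = Y - mu - a[rows] - b[cols]
    s2E = 1.0 / rng.gamma(alpha0 + N / 2.0, 1.0 / (beta0 + 0.5 * np.sum(err ** 2)))
    return a, b, mu, s2A, s2B, s2E
```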
Definition 2.1. Let θ^(t), for integer t ≥ 0, be a Markov chain with stationary distribution h. Its convergence rate is the minimum number ρ such that lim_{t→∞} r^{−t} |E[f(θ^(t))] − E_h(f(θ))| = 0 holds for all measurable functions f such that E_h(f(θ)²) < ∞ and all r > ρ.
Theorem 2.1. Let ρ be the convergence rate of X^(t), the Gibbs sampler chain for φ, as in Definition 2.1. Then, for balanced data, ρ = RC σ²_A σ²_B / ((C σ²_A + σ²_E)(R σ²_B + σ²_E)). Proof. See Section 8.1.
We see that ρ → 1 as R, C → ∞, outside of trivial cases with σ 2 A or σ 2 B equal to zero. If R and C grow proportionately then ρ = 1 − α/ √ N + O(1/N ) for some α > 0. We can therefore expect the Gibbs sampler to require at least some constant multiple of √ N iterations to approximate the target distribution sufficiently. When the data are not perfectly balanced numerical computation of ρ shows that Gibbs still mixes increasingly slowly as N → ∞. But in that case, the sampler requires O(N ) computation per iteration. In sum, Gibbs takes O(N 3/2 ) work to sample from φ, which is not scalable.
Because sampling from φ can be viewed as a subproblem of sampling from π, we believe that the Gibbs sampler that draws from π, which also requires O(N ) time per iteration, will exhibit the same slow convergence and hence require superlinear computation time.

Other MCMC algorithms
The Gibbs sampler is widely used for problems like this, where the full conditional distributions are tractable. But there are other MCMC algorithms that one could use. Here we consider random walk Metropolis (RWM), Langevin diffusion, and Metropolis adjusted Langevin (MALA). They also have difficulties scaling to large data sets.
At iteration t + 1 of RWM, a Gaussian random walk proposal S^(t+1) ∼ N(S^(t), σ²I) for σ² > 0 is made and the step is taken with the Metropolis-Hastings acceptance probability. If the target distribution is a product distribution of dimension d, the chain S̃^(t) ≡ S^(dt) (i.e. the chain formed by every dth state of the chain S^(t)) converges to a diffusion whose stationary distribution is the target distribution. We may interpret this as a convergence time for the algorithm that grows as O(d) (Roberts and Rosenthal, 2001).
For our problem, evaluating the acceptance probability requires time at least O(N), so the overall algorithm then takes O(N(R + C)) time. This is at best O(N^{3/2}), as we found for Gibbs sampling, and could be worse for sparse data where N ≪ RC. Our target distribution is not of product form, and we have no reason to expect that RWM mixes orders of magnitude faster here than for a distribution of product form. Indeed, it seems more likely that mixing would be faster for product distributions than for distributions with more complicated dependence patterns such as ours.
At iteration t + 1, Langevin diffusion steps S^(t+1) ∼ N(S^(t) + (h/2)∇ log π(S^(t)), hI) for h > 0. As h → 0, the stationary distribution of this process converges to π, as shown for general target distributions in Liu (2004). Because h > 0 in practice, the Langevin algorithm is biased. To correct this, the MALA algorithm uses the Metropolis-Hastings algorithm with the Langevin proposal S^(t+1). When the target distribution is a product distribution of dimension d, the chain S̃^(t) ≡ S^(d^{1/3} t) converges to a diffusion with stationary distribution π; the convergence time grows as O(d^{1/3}) (Roberts and Rosenthal, 2001). With similar reasoning as for RWM, the computation time is O(N(R + C)^{1/3}), which is at best O(N^{1+1/6}).

Simulation results
We carried out simulations of the four algorithms described above, as well as five others: the block Gibbs sampler ('Block'), the reparameterized Gibbs sampler ('Reparam.'), the independence sampler ('Indp.'), RWM with subsampling ('RWM Sub.'), and the pCN algorithm of Hairer et al. (2014). Descriptions of these five algorithms are given below with discussions of their simulation results. Every algorithm was implemented in MATLAB and run on a cluster using 4GB memory. For each algorithm and a range of values of R and C, we generated balanced data from model (1) with µ = 1, σ²_A = 2, σ²_B = 0.5, and σ²_E = 1. We ran 20,000 iterations of the algorithm, retaining the last 10,000 for analysis. We record the CPU time required, the median values of µ, σ²_A, σ²_B, and σ²_E, and the number of lags needed for their sample auto-correlation functions (ACF) to go below 0.5.

Table 1: Summary of simulation results for cases with R = C = 1000. The first row gives CPU time in seconds. The next four rows give median estimates of the 4 parameters. The next four rows give the number of lags required to get an autocorrelation below 0.5.
The entire process is repeated in 10 independent runs. Table 1 presents median values of the recorded statistics over the 10 runs for the case R = C = 1000. Tables 2 through 6 of the appendix collect corresponding results at a range of (R, C) sizes.
Block Gibbs, which updates a and b together to try to improve mixing, has computation time superlinear in the number of observations. Also to improve mixing, reparameterized Gibbs scales the random effects to have equal variance. This gives an algorithm equivalent to the conditional augmentation of Van Dyk and Meng (2001). For all three Gibbs-type algorithms, the parameter estimates are good, but µ mixes more slowly as R and C increase, while the variance components do not exhibit this behavior.
The computation times of Langevin diffusion ('Lang.') and MALA are approximately linear in the number of observations. However, σ 2 E tends to explode for large data sets in Langevin diffusion, while the chain does not mix well in MALA.
The independence sampler is a Metropolis-Hastings algorithm where the proposal distribution is fixed. We propose µ ∼ N(1, 1), a ∼ N(0, I_R), b ∼ N(0, I_C), and σ²_A, σ²_B, σ²_E ∼ InvGamma(1, 1). The computation time grows linearly with the data size. The parameters do not mix well, and their estimates are not good. It is possible that better results would be obtained from a different proposal distribution, but it is not clear how best to choose one in practice.
RWM and RWM with subsampling, the latter of which updates a subset of parameters at each iteration, both have computation time linear in the number of observations. Neither algorithm mixed well, and for RWM σ²_E tended to go to zero in large data sets. The pCN algorithm is Metropolis-Hastings where the proposals are Gaussian random walk steps shrunken towards zero: S^(t+1) ∼ N(√(1 − σ²) S^(t), σ²I), for σ² ≤ 1. Hairer et al. (2014) show that under certain conditions on the target distribution, the convergence rate of this algorithm does not slow with the dimension of the distribution. We include it here, even though our π does not satisfy those conditions. The computation time grows linearly with the data size. However, the estimates for µ and σ²_E are not good, and those for σ²_E even get worse as the data size increases. None of the parameters seem to mix well. In summary, for large data sets each algorithm mixes increasingly slowly or returns flawed estimates of µ and the variance components. We have also simulated some unbalanced data sets and slow mixing is once again the norm, with worse performance as R and C grow.

Further notation and assumptions
In this section, we go over pertinent notation and assumptions about the pattern of observations. Our data are realizations from model (1).
We refer to the first index of Y ij as the 'row' and the second as the 'column'. We use integers i, i , r, r to index rows and j, j , s, s for columns. The actual indices may be URLs, customer IDs, or query strings and are not necessarily the integers we use here.
The variable Z ij takes the value 1 if Y ij is observed and 0 otherwise. We assume that there can be at most one observation in position (i, j).
The sample size is N = Σ_ij Z_ij < ∞. The number of observations in row i is N_i• = Σ_j Z_ij and the number in column j is N_•j = Σ_i Z_ij. The number of distinct rows is R = Σ_i 1{N_i• > 0} and there are C = Σ_j 1{N_•j > 0} distinct columns. In the following, all of our sums over rows are only over rows i with N_i• > 0, and similarly for sums over columns. We state this because there are a small number of expressions where omitting rows without data changes their values. This convention corresponds to what happens when one makes a pass through the whole data set.
Let Z be the matrix containing Z_ij. Of interest are (ZZ^T)_{ii'} = Σ_j Z_ij Z_i'j, the number of columns for which we have data in both rows i and i', and (Z^T Z)_{jj'}, defined analogously for columns. Note that (ZZ^T)_{ii'} ≤ N_i• and furthermore Σ_{i'} (ZZ^T)_{ii'} = Σ_j Z_ij N_•j. Two other useful idioms are T_i• = Σ_j Z_ij N_•j and T_•j = Σ_i Z_ij N_i•. (2) Here T_i• is the total number of observations in all of the columns j that are represented in row i. Our notation allows for an arbitrary pattern of observations. Some special cases are as follows. A balanced crossed design can be described via Z_ij = 1{1 ≤ i ≤ R} 1{1 ≤ j ≤ C}. If max_i N_i• = 1 but max_j N_•j > 1 then the data have a nested structure with rows nested in columns. If max_i N_i• = max_j N_•j = 1, then the observed Y_ij are IID. Some patterns are difficult to handle. For example, if all the observations are in the same row or column, some of the variance components are not identifiable. We are motivated by problems that are not such worst cases.
The quantities ε_R = max_i N_i•/N and ε_C = max_j N_•j/N measure the extent to which a single row or column dominates the data set. We expect that these are both small, and in limiting arguments where N → ∞ we may assume that ε_R → 0 (3) and ε_C → 0 (4). It is also often reasonable to suppose that max_i T_i•/N and max_j T_•j/N are both small. In many data sets, the average row and column sizes are large, but much smaller than N. One way to measure the average row size is N/R. Another way to measure it is to randomly choose an observation and inspect its row size, obtaining an expected value of (1/N) Σ_i N²_i•. Similar formulas hold for the average column size. Therefore, we assume that as N → ∞ these average sizes grow while remaining a vanishing fraction of N, that is, (1/N) Σ_i N²_i• → ∞ and (1/N²) Σ_i N²_i• → 0, together with the corresponding column limits. (6) Notice that (1/N²) Σ_i N²_i• ≤ ε_R, and so the second part of (6) merely follows from (3) and (4). While the average row count may be large, many of the rows corresponding to newly seen entities can have N_i• = 1. In our analysis, it is not necessary to assume that all of the rows and columns contain at least some minimum number of observations. Thus, we avoid losing information by the practice of iteratively removing all rows and columns with few observations.
As a demonstration of the validity of our assumptions, the Netflix data has N = 100,480,507 ratings on R = 17,770 movies by C = 480,189 customers. Therefore R/N ≈ 0.00018 and C/N ≈ 0.0047. It is sparse, with N/(RC) ≈ 0.012. It is not dominated by a single row or column because ε_R ≈ 0.0023 and ε_C ≈ 0.00018, even though one customer has rated an astonishing 17,653 movies. Similarly, (1/N²) Σ_j N²_•j ≈ 6.43 × 10⁻⁶, so that the average row or column size is much larger than 1 and much smaller than N. There are various possible data storage models. We consider the log-file model with a collection of (i, j, Y_ij) triples, which for the purposes of this paper we assume are stored at the same location. A pass over the data proceeds via an iteration over all (i, j, Y_ij) triples in the data set. Such a pass may generate intermediate values that we assume can be retained for further computations.
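As an illustration of the log-file model, the following sketch makes one pass over the (i, j, Y_ij) triples to accumulate the pattern quantities above (N, R, C, N_i•, N_•j), and then obtains the co-observation totals T_i• and T_•j of equation (2) in a second pass. Dictionaries keyed by the raw identifiers are used because the indices need not be consecutive integers; the storage is O(R + C).

```python
from collections import defaultdict

def pattern_counts(triples):
    """One pass over (i, j, y) triples; the indices i and j may be arbitrary IDs."""
    N = 0
    N_row = defaultdict(int)   # N_i.
    N_col = defaultdict(int)   # N_.j
    for i, j, _y in triples:
        N += 1
        N_row[i] += 1
        N_col[j] += 1
    return N, len(N_row), len(N_col), N_row, N_col

def co_observation_totals(triples, N_row, N_col):
    """Second pass: T_i. = sum_j Z_ij N_.j and T_.j = sum_i Z_ij N_i. ."""
    T_row, T_col = defaultdict(int), defaultdict(int)
    for i, j, _y in triples:
        T_row[i] += N_col[j]
        T_col[j] += N_row[i]
    return T_row, T_col

# Usage: N, R, C, N_row, N_col = pattern_counts(triples)
#        T_row, T_col = co_observation_totals(triples, N_row, N_col)
```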

Moment estimates of variance components
Here we develop a method of moments estimateθ for θ = (σ 2 A , σ 2 B , σ 2 E ) T that requires one pass over the data. We also find an expression for Var(θ | θ, κ) and describe how to obtain an approximation of it after a second pass over the data.
Naturally, we would also want to estimate µ, and there are a number of ways to do so. The simplest is to let μ̂ = Ȳ_••, the sample mean. From Owen and Eckles (2012), Var(μ̂) = (σ²_A Σ_i N²_i• + σ²_B Σ_j N²_•j + σ²_E N)/N² ≤ ε_R σ²_A + ε_C σ²_B + σ²_E/N. (8) The upper bound in (8) is tight for balanced data, but otherwise it can be very conservative. We anticipate that 1 ≫ ε_R, ε_C ≫ 1/N holds for our motivating applications, as it did in the examples of Owen and Eckles (2012). The properties of this estimator have been well studied in the literature, so in this paper we focus on estimating the variance components.

U -statistics for variance components
We use U-statistics in our method of moments estimators. The usual unbiased sample variance estimate can be formulated as a U-statistic, which is more convenient to analyze. We use the following U-statistics:

U_a = Σ_i (1/N_i•) Σ_{j < j'} Z_ij Z_ij' (Y_ij − Y_ij')², U_b = Σ_j (1/N_•j) Σ_{i < i'} Z_ij Z_i'j (Y_ij − Y_i'j)², and U_e = Σ_{(i,j) < (r,s)} Z_ij Z_rs (Y_ij − Y_rs)². (9)

To understand U_a note that for each row i, the quantities Y_ij − µ − a_i are IID with variance σ²_B + σ²_E. Thus, U_a is a weighted sum of within-row unbiased estimates of σ²_B + σ²_E. The explanation for U_b is similar, while U_e is proportional to the sample variance estimate of all N observations. Lemma 4.1. Let Y_ij follow the two-factor crossed random effects model (1) with the observation pattern Z_ij as described in Section 3. Then the U-statistics defined at (9) satisfy E(U_a) = (N − R)(σ²_B + σ²_E), E(U_b) = (N − C)(σ²_A + σ²_E), and E(U_e) = (N² − Σ_i N²_i•) σ²_A + (N² − Σ_j N²_•j) σ²_B + (N² − N) σ²_E. Proof. See Section 10.1 of the supplement.
To obtain unbiased estimates σ̂²_A, σ̂²_B, and σ̂²_E given values of the U-statistics, we solve the 3 × 3 system of equations

M (σ̂²_A, σ̂²_B, σ̂²_E)^T = (U_a, U_b, U_e)^T, where M = ( 0, N − R, N − R ; N − C, 0, N − C ; N² − Σ_i N²_i•, N² − Σ_j N²_•j, N² − N ). (10)

For our method to return unique and meaningful estimates, the determinant of M must be nonzero. This is true when no row or column has more than half of the data, and at least one row and at least one column has more than one observation. To compute the U-statistics, notice that U_a = Σ_i S_i•, U_b = Σ_j S_•j, and U_e = N S_••, where S_i• = Σ_j Z_ij (Y_ij − Ȳ_i•)² is the within-row sum of squared deviations and S_•j and S_•• are defined analogously for columns and for the whole data set. In one pass over the data and time O(N), we compute N_i•, Y_i•, and S_i• for all R observed levels of i using the incremental algorithm described in the next paragraph. We can also compute N, R and C in such a pass if they are not known beforehand. Chan et al. (1983) show how to compute both Y_i• = N_i• Ȳ_i• and S_i• in a numerically stable one pass algorithm. At the initial appearance of an observation in row i, with corresponding column j = j(1), set N_i• = 1, Y_i• = Y_ij and S_i• = 0. After that, at the kth appearance of an observation in row i, with corresponding column j(k),

S_i• ← S_i• + ((k − 1) Y_{i j(k)} − Y_i•)² / (k (k − 1)), then Y_i• ← Y_i• + Y_{i j(k)} and N_i• ← k. (11)

Chan et al. (1983) give a detailed analysis of roundoff error for update (11) as well as generalizations that update higher moments from groups of data values. In that same pass over the data, U_e and the analogous quantities needed to compute U_b (S_•j, Ȳ_•j, N_•j) are also computed using the incremental algorithm. Finally, in additional time O(R + C), we calculate Σ_i S_i•, Σ_j S_•j, Σ_i N²_i•, and Σ_j N²_•j. Now, we have U_a, U_b, U_e, and all the entries of M.
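In code, the per-row, per-column, and overall counts, means, and corrected sums of squares can all be maintained in the same single pass. The sketch below uses the familiar Welford-style streaming form, which computes the same running quantities as update (11); it is a sketch, not a definitive implementation of the authors' code.

```python
from collections import defaultdict

def row_column_moments(triples):
    """One pass: counts, means, and corrected sums of squares per row, per column, and overall.

    Returns dictionaries with, e.g., S_row[i] = sum_j Z_ij (Y_ij - Ybar_i.)^2.
    """
    n_row, mean_row, S_row = defaultdict(int), defaultdict(float), defaultdict(float)
    n_col, mean_col, S_col = defaultdict(int), defaultdict(float), defaultdict(float)
    n_all, mean_all, S_all = 0, 0.0, 0.0
    for i, j, y in triples:
        for key, n, mean, S in ((i, n_row, mean_row, S_row),
                                (j, n_col, mean_col, S_col)):
            n[key] += 1
            delta = y - mean[key]
            mean[key] += delta / n[key]
            S[key] += delta * (y - mean[key])      # Welford update of the corrected SS
        n_all += 1
        delta = y - mean_all
        mean_all += delta / n_all
        S_all += delta * (y - mean_all)
    return (n_row, mean_row, S_row), (n_col, mean_col, S_col), (n_all, mean_all, S_all)
```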
Given U_a, U_b, U_e, and M we can calculate σ̂²_A, σ̂²_B, and σ̂²_E in constant time. Therefore, finding our method of moments estimators takes O(N) time overall.
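Putting these pieces together, the sketch below forms U_a, U_b, and U_e from the accumulated sums of squares, builds M, and solves system (10) for θ̂. It relies on the row_column_moments routine sketched above and on the moment equations stated in Lemma 4.1, and is meant as a minimal illustration.

```python
import numpy as np

def moment_estimates(triples):
    """Method of moments estimates of (sigma2_A, sigma2_B, sigma2_E).

    Moment equations used (Lemma 4.1):
      E(U_a) = (N - R)(s2B + s2E)
      E(U_b) = (N - C)(s2A + s2E)
      E(U_e) = (N^2 - sum_i N_i.^2) s2A + (N^2 - sum_j N_.j^2) s2B + (N^2 - N) s2E
    """
    (n_row, _, S_row), (n_col, _, S_col), (N, _, S_all) = row_column_moments(triples)
    R, C = len(n_row), len(n_col)
    sum_ni2 = sum(n * n for n in n_row.values())
    sum_nj2 = sum(n * n for n in n_col.values())

    U_a = sum(S_row.values())   # within-row corrected sums of squares
    U_b = sum(S_col.values())   # within-column corrected sums of squares
    U_e = N * S_all             # N times the overall corrected sum of squares

    M = np.array([[0.0,             N - R,           N - R],
                  [N - C,           0.0,             N - C],
                  [N**2 - sum_ni2,  N**2 - sum_nj2,  N**2 - N]])
    s2A, s2B, s2E = np.linalg.solve(M, np.array([U_a, U_b, U_e]))
    return s2A, s2B, s2E
```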

Variances of the estimators
In this section we show how to estimate the covariance matrix of θ̂ = (σ̂²_A, σ̂²_B, σ̂²_E)^T.

True variance of θ̂
This section discusses the finite sample covariance matrix of θ̂. Theorem 4.1 below gives the exact variances and covariances of our U-statistics.
Theorem 4.1. Let Y_ij follow the random effects model (1) with the observation pattern Z_ij as described in Section 3. Then the U-statistics defined at (9) have variances Var(U_a), Var(U_b), and Var(U_e) given by equations (12), (13), and (14) respectively, and covariances Cov(U_a, U_b), Cov(U_a, U_e), and Cov(U_b, U_e) given by equations (15), (16), and (17). Proof. Equation (12) is proved in Section 11.2 of the supplement and then equation (13) follows by exchanging indices. Equation (14) is proved in Section 11.7 of the supplement. Equation (15) is proved in Section 12 of the supplement. Equation (16) is proved in Section 13 of the supplement and then equation (17) follows by exchanging indices.

Now we consider Var(θ̂). Because θ̂ = M⁻¹(U_a, U_b, U_e)^T by (10), Var(θ̂) = M⁻¹ Var(U) (M⁻¹)^T, (18) where Var(U) denotes the covariance matrix of (U_a, U_b, U_e)^T. We show in Section 4.2.2 that while Var(U_e) and the covariances of the U-statistics may be exactly computed in time O(N), Var(U_a) and Var(U_b) cannot. Therefore, we approximate Var(U_a) and Var(U_b) such that when we apply formula (18) we get conservative estimates of Var(σ̂²_A), Var(σ̂²_B), and Var(σ̂²_E) (the values of primary interest). For intuition on what sort of approximation is needed, equations (19) and (20) give a linear expansion of Var(θ̂) in terms of the variances and covariances of the U-statistics, with error terms of order ε = max(ε_R, ε_C, R/N, C/N) as ε → 0. In light of equation (20), to find computationally attractive but conservative approximations of Var(θ̂) in finite samples, we use over-estimates of Var(U_a) and Var(U_b). We discuss how to do so in Section 4.2.2.
In practice, when estimating Var(θ̂), unless we are in the asymptotic situation described in Section 4.2.3, we plug σ̂²_A, σ̂²_B, σ̂²_E, and estimates of the kurtoses into the covariance matrix of the U-statistics, where Var(U_a) and Var(U_b) have been replaced by their over-estimates. Then we apply equation (18). We discuss estimating the kurtoses in Section 4.2.4.

Computable approximations of Var(U )
First, we show how to obtain over-estimates of Var(U a ) in time O(N ); the case of Var(U b ) is similar. In addition to N − R, Var(U a ) contains the following quantities The third and fourth quantities above can be computed in O(R) work after the first pass over the data. The first quantity is a sum over i and r, and cannot be simplified any further. Computing it takes more than O(N ) work. Since its coefficient σ 4 B (κ B + 2) is nonnegative, we must use an upper bound to obtain an over-estimate of Var(U a ). We have the bound which can be computed in O(N ) work in a second pass over the data. Other weaker bounds may be obtained without the second pass. An example is For the same reason the second quantity cannot be computed in time O(N ) and we upper bound it via (ZZ T ) ir N r• , getting which can be computed in O(N ) work on a second pass. All but one expression in Var(U e ) (see (14)) can be computed in O(R + C) time after the first pass over the data. The one expression is The second term in (21) requires a second pass over the data in time O(N ), because it is the sum over i and j of a polynomial of Z ij , N i• , and N •j . The quantity in (21) alternatively can be expressed as which shows that it is a kind of unnormalized test for row versus column independence in the observation process. Equation (22) is numerically more stable than (21) but requires O(RC) computation which is ordinarily too expensive. With the same reasoning as for the second term of (21), we see that Cov(U a , U b ) can be computed in a second pass over the data in time O(N ). This reasoning also shows that we can compute nearly every term in Cov(U a , U e ) in a second pass over the data; the exception is We compute T i• for each i in a second pass over the data. But, we must use additional time O(R) to get (23). Nevertheless, the total computation time is still O(N ). Symmetrically Cov(U b , U e ) can be computed in time O(N ) as well.

Asymptotic approximation of Var(θ)
Under asymptotic conditions, we may obtain simple, analytic approximate expressions for the covariance matrix of our method of moments estimators.
Theorem 4.2. As described in Section 3, suppose that hold for the same small δ > 0 and that hold. Then Finallyσ 2 A ,σ 2 B andσ 2 E are asymptotically uncorrelated as δ → 0 with Proof. See Section 15 of the supplement.
We think that the typical N_•j is large, and a similar argument applies for N_i•. Thus, the additional bounds in (24) seem very reasonable. However, it is possible that the pairs where Z_ij = 1 with large N_i• may have small N_•j and vice versa. Dyer and Owen (2011) report such a head-to-tail affinity in several data sets, but it would have to be quite extreme for (24) to require a large δ.
The variance ofσ 2 E is the same variance we would have gotten had σ 2 A = σ 2 B = 0 held. Similar remarks apply forσ 2 A andσ 2 B .

Estimating kurtoses
Under a Gaussian assumption, κ A = κ B = κ E = 0. If however the data have heavier tails than this, a Gaussian assumption will lead to underestimates of Var(θ). Therefore, we will estimate the kurtoses by U -statistics.
The fourth moment U -statistics we use are Theorem 4.3. Let Y ij follow the random effects model (1) with the observation pattern Z ij as described in Section 3. Then the statistics defined at (25) have means Proof. See Section 16 of the supplement.
Using Theorem 4.3, we compute estimates μ̂_{A,4}, μ̂_{B,4}, and μ̂_{E,4} by solving the 3 × 3 system of equations (26), where M is the same matrix that we used for the U-statistics in equation (10). We compute the statistics (25) via the centered fourth-power sums (28), beyond the quantities used to compute θ̂. These can be computed in a second pass over the data after Ȳ_i•, Ȳ_•j and Ȳ_•• have been computed in the first pass. They can also be computed in the first pass using update formulas analogous to the second moment formulas (11). Such formulas are given by Pébay (2008), citing an unpublished paper by Terriberry.
Because the kurtosis estimates are used in formulas for Var(θ̂) and those formulas already require a second pass over the data, it is more convenient to compute the sample fourth moments via (28) in a second pass. By a similar argument as in Section 4.1, obtaining κ̂_A, κ̂_B, and κ̂_E has space complexity O(R + C) and time complexity O(N), and is therefore scalable.
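As an illustration of that second pass, the sketch below accumulates the centered fourth-power sums of (28) per row, per column, and overall, and shows the standard conversion from a fourth central moment and a variance to an excess kurtosis. The intermediate step of solving system (26) from Theorem 4.3 for µ_{A,4}, µ_{B,4}, and µ_{E,4} is not reproduced here.

```python
from collections import defaultdict

def fourth_power_sums(triples, mean_row, mean_col, mean_all):
    """Second pass: centered fourth-power sums per row, per column, and overall."""
    Q_row, Q_col, Q_all = defaultdict(float), defaultdict(float), 0.0
    for i, j, y in triples:
        Q_row[i] += (y - mean_row[i]) ** 4
        Q_col[j] += (y - mean_col[j]) ** 4
        Q_all += (y - mean_all) ** 4
    return Q_row, Q_col, Q_all

def excess_kurtosis(mu4, sigma2):
    """Convert a fourth central moment and a variance into an excess kurtosis."""
    return mu4 / sigma2 ** 2 - 3.0
```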

Algorithm summary
For clarity of exposition, here we gather all of the steps in our algorithm to estimate σ 2 A , σ 2 B , and σ 2 E and the variances of those estimators. An outline is shown in Figure 1. We assume that all of the computations below can be done with large enough variable storage that overflow does not occur. This may require an extended precision representation beyond 64 bit floating point, such as that in the python package mpmath (Johansson, 2010).
The first task is to compute θ̂. In a first pass over the data, compute the counts N, R, C, the row values N_i•, Ȳ_i•, S_i• for all unique rows i in the data set, and the column values N_•j, Ȳ_•j, S_•j for all unique columns j in the data set, as well as Ȳ_•• and S_••. Incremental updates are used as described in (11).
Then compute U_a, U_b, U_e and the entries of M, and solve (10) for θ̂. The second task is to compute, approximately, the variance of θ̂. A second pass over the data computes the centered fourth moments in (28). Then one calculates the fourth order U-statistics of equation (27), solves (26) for the centered fourth moments, and converts them to kurtosis estimates, all in time O(R + C). In the second pass over the data, also compute the quantities ZN_{p,q} = Σ_ij Z_ij N_i•^p N_•j^q for the exponent pairs (p, q) used below, as well as T_i• and T_•j of equation (2) for all i and j in the data. Now we may verify whether the limiting approximations in Theorem 4.2 hold, by computing the quantities appearing in its conditions (24). If those conditions fail to hold, then more work must be done in the second pass. Some of these next computations require even more bits per variable than are needed to avoid overflow, because they involve subtraction in a way that will lose precision.
In this case, estimate the variances of the U -statistics. To estimate the variances of U a and U b , we apply the upper bounds discussed in Section 4.2.2 to (12) and (13) and plug in σ 2 A ,σ 2 B ,σ 2 E ,κ A ,κ B , andκ E , calculating using time and space O(R + C) and To estimate Var(U e ) and the covariances of the U -statistics, we again plug in the variance component and kurtosis estimates into Theorem 4.1 without approximation. We get Var(U e ) from (14), using ZN 1,1 from the second pass over the data. We get Cov(U a , U e ) from (16) using ZN −1,1 , ZN −1,2 and T i• , and Cov(U b , U e ) from (17) using ZN 1,−1 , ZN 2,−1 and T •j . We get Cov(U a , U b ) from (15) using ZN −1,−1 . It can be easily verified that these calculations also take time and space O(R + C).
The final plug-in estimator of the variance of θ̂ is obtained by substituting the estimated covariance matrix of the U-statistics, with the over-estimates of Var(U_a) and Var(U_b), into formula (18), where M is the matrix in (10). Aggregating the computation times and counting the number of intermediate values we must calculate, we see that our algorithm takes time O(N) and space O(R + C).

Predictions
Here we consider an application of variance component estimation to the prediction of a missing observation Y_ij at given values of i and j in model (1). An equivalent problem is predicting the expected value at those levels of the factors, µ + a_i + b_j.

Best linear predictor
A gold standard is the best linear predictor (BLP) (Searle et al., 2009, Chapter 7.3), which minimizes the MSE over the class of all predictors of the form Ŷ_ij(λ) = Σ_rs λ_rs Z_rs Y_rs, where λ is the vector of all λ_rs. In this section, we characterize the weights λ*_rs of the BLP. We begin with the MSE L(λ). Lemma 5.1. The MSE for the linear predictor Σ_rs λ_rs Z_rs Y_rs is the quadratic function of λ derived in Section 17.1 of the supplement. Proof. See Section 17.1 of the supplement.
The weights λ * rs of the BLP must satisfy the stationarity condition ∂L(λ * rs )/∂λ = 0. As shown in Section 17.2 of the supplement, when Z rs = 0, the condition holds no matter the value of λ * rs . When Z rs = 1, the condition becomes We can compute λ * rs by solving an N × N system of equations but that ordinarily costs O(N 3 ) time. Shortcuts are possible if there is a special pattern in the Z ij , such as balanced data, but we don't know of any faster way to solve (33) for general Z. Therefore, we consider a smaller class of linear predictors called shrinkage predictors.

Shrinkage predictors
It is reasonable to suppose that the most important observations for predicting Y_ij are those in its row and column. Therefore we consider predicting Y_ij through a linear combination of the overall average, the average in row i, and the average in column j. We use estimators of the form Ŷ_ij(λ) = λ_0 Y_•• + λ_a Y_i• + λ_b Y_•j, (34) where λ = (λ_0, λ_a, λ_b)^T and Y_••, Y_i•, and Y_•j are the overall, row i, and column j totals. Then λ_0, λ_a, and λ_b are chosen to minimize L(λ). By writing (34) in terms of row and column totals we avoid complicated treatments for the situation where row or column means are unavailable because N_i• = 0 or N_•j = 0 (or both). Lemma 5.2. The MSEs for the linear predictor (34) are derived in Section 17.3 of the supplement. Proof. See Section 17.3 of the supplement.
Theorem 5.1. The weights λ* = (λ*_0, λ*_a, λ*_b)^T minimizing L(λ) satisfy Hλ* = c, where H is a symmetric 3 × 3 matrix with upper triangular elements, and c is a 3-vector, both determined by µ, θ, and the observation pattern. Proof. See Section 17.4 of the supplement.
Given estimates of µ and θ we can plug them in to get estimates of the optimal λ for prediction at (i, j). Assuming that the algorithm to computeθ and its variance has been executed, all of c and most of H can be computed using quantities found in the first pass over the data. All of the quantities (2) are available after a second pass.
Therefore, since solving Hλ * = c takes time O(1), λ * for predicting a given Y ij can be found in time O(N ). If we wanted to find λ * for k different sets of i and j, the computation cost is O(N + k); we simply would have to store k different H's and c's.
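Operationally, prediction at a chosen (i, j) is then a constant-time solve once the pass quantities are in hand. The sketch below assumes a hypothetical helper build_H_and_c that fills H and c from Theorem 5.1 (whose entries are not reproduced here), and a stats object holding the totals from the two passes; it only illustrates the mechanics of applying the theorem.

```python
import numpy as np

def predict_cell(i, j, stats, theta_hat, mu_hat):
    """Shrinkage prediction of a missing Y_ij via the weights of Theorem 5.1.

    build_H_and_c: hypothetical helper filling the 3x3 matrix H and vector c.
    stats: hypothetical container of overall/row/column totals from the passes.
    """
    H, c = build_H_and_c(i, j, stats, theta_hat, mu_hat)  # O(1) system per cell
    lam0, lam_a, lam_b = np.linalg.solve(H, c)
    # Predictor (34) combines the overall, row i, and column j totals.
    return (lam0 * stats.total_sum
            + lam_a * stats.row_sum.get(i, 0.0)
            + lam_b * stats.col_sum.get(j, 0.0))
```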
Predicting a missing Y ij using Theorem 5.1 is simple. Next we look at some special cases to understand how it performs.

Special case: Y ij in new row and new column
In this case, Z_rj = Z_is = 0 for all r and s, so N_i• = N_•j = 0. The only nonzero entry of H is the one corresponding to λ_0, and the prediction Ŷ_ij is then a shrinkage estimate based on the overall mean Ȳ_••. In practice we would plug in estimates of µ and the variance components. As we would expect, this estimate is very close to Ȳ_•• for large N, provided µ ≠ 0 and the limits (6) hold.

In that case, the corresponding MSE is L ≈ σ²_A + σ²_B + σ²_E, which can be verified to be approximately the same as the MSE of the BLP.

Special case: Y ij in new row but old column
Suppose that Z is = 0 for any s but ∃r where Z rj = 1 , so N i• = 0 and N •j > 0. We would expect most of the weight to be onȲ •j , the average in the column containing Y ij . This is indeed the case if T •j is not large compared to N , that is, if the rows that are co-observed with column j do not comprise a large fraction of the data.
Let c_k denote the kth entry of c and H_kℓ the entry of H in row k and column ℓ. In this case, c_2 is zero, as are the second row and second column of H. Therefore, without loss of generality we can take λ*_a = 0, and λ̃* = (λ*_0, λ*_b)^T can be computed by solving the corresponding 2 × 2 system. The following theorem describes the relative size of λ*_0 and λ*_b in the big data limit.
Proof. See Section 17.5 of the Supplement.
Note that λ*_0 is the coefficient of a sum of N observations, while λ*_b is the coefficient of a sum of N_•j observations. Therefore, to more equitably compare the importances of the overall average and the column average for predicting Y_ij, we consider the ratio of the total weight N_•j λ*_b placed on column j to the total weight N λ*_0 placed on the overall sum. We may interpret this as the column j average being some multiple of N_•j times as important as the overall average. This makes sense because the more data we have in column j, the better estimate we would be able to get of µ + b_j; the overall average only tells us about µ. Also, note that the larger σ²_E is relative to σ²_B, the more weight we put on the overall average; we do not trust using only the column average.

Special case: large N_i• and large N_•j

Next we show that if both row i and column j have a very large number of observations, and the observation matrix Z is not too extreme, then Ŷ_ij is approximately Ȳ_i• + Ȳ_•j − Ȳ_•• as we might expect. As a result, the customized weights in Theorem 5.1 are most useful for cases where one or both of N_i• and N_•j are not very large.
Proof. See Section 18 in the supplement.
Experimental Results

Simulations
First, we compare the performance of our method of moments algorithm ('MoM'), described in Section 4.3, to the commonly used R package for mixed models, lme4. lme4 computes the maximum likelihood estimates of the parameters under an assumption of normality. For our algorithm, we consider a range of data sizes, with R = C ranging from 10 to 500. At each fixed value of R = C, for 100 iterations, we generate data according to model (1) with normally distributed random effects and σ²_A = 2, σ²_B = 0.5, and σ²_E = 1. Exactly 25 percent of the cells were randomly chosen to be observed. We measure the CPU time needed to obtain the variance component estimates σ̂²_A, σ̂²_B, and σ̂²_E (labeled short) and the CPU time needed to obtain the variance component estimates as well as upper bounds on the variances of those estimates (labeled long). In addition, we measure the mean squared errors of the variance component estimates. At the end, those five measurements were averaged over the 100 iterations.
With regard to lme4, our simulation steps are nearly the same, with the following differences. Due to the slowness of lme4, we only consider data sizes with R = C up to 300. In addition, because lme4 finds the maximum likelihood variance component estimates, the variances of those estimates were computed asymptotically using the inverse expected Fisher information matrix. The simulation results are shown in Figure 2.
Note that lme4 always takes more time than our algorithm. From Figure 2a, we see that our method of moments algorithm takes time at most linear in the data size to compute both the variance component estimates and upper bounds on the variances of those estimates. For lme4 the computation time is clearly superlinear in the data size, for data sets large enough that the startup cost of the package is no longer dominant.

Figure 2: Simulation results: log-log plots of the five recorded measurements against R * C, which is proportional to the number of observations. The slope of a fitted line through the scatterplot describes the effect of the x-axis quantity on the y-axis quantity; a slope of 1 indicates a linear relationship, greater than 1 a superlinear relationship, and less than 1 a sublinear relationship.

The MSEs of σ̂²_A for our algorithm and lme4 are comparable. Moreover, both decrease at most linearly with the data size. The same is true for the MSEs of σ̂²_B. However, the MSE of σ̂²_E in lme4 is noticeably smaller than that of our algorithm; this appears to be the price we pay for the decreased computation time. In both cases, though, the MSE of σ̂²_E decreases approximately linearly with the data size.

Real World Data
We illustrate our algorithm, coded in Python, on three real world data sets that are too large for lme4 to handle in a timely manner.
The first, from Yahoo!-Webscope (2015a), contains a random sample of ratings of movies by users, which are grades from A+ to F converted into a numeric scale. There are 211,231 ratings by 7,642 users on 11,916 movies, filtered with the condition that each user rates at least ten movies. Only 0.23 percent of the user-movie matrix is observed.
The estimated variances of the user random effect, the movie random effect, and the error are 2.57, 2.86, and 7.68. The estimated kurtoses are −2, −2, and 6.56. Estimated upper bounds on the variances of the estimated variance components are 0.0030, 0.0018, and 0.0060.
The second data set, also from Yahoo!-Webscope (2015b), contains ratings of 1000 songs by 15400 users, on a scale of 1 to 5. The first group of 10000 users were randomly selected on the condition that they had rated at least 10 of the 1000 songs. The rest of the users were randomly selected from responders on a survey that asked them to rate a random subset of 10 of the 1000 songs. The songs were selected to have at least 500 ratings. Here, about 2 percent of the user-song pairs were observed.
The estimated variances of the user random effect, the song random effect, and the error are 0.97, 0.24, and 1.30. The estimated kurtoses are −2, −2, and 3.31. Estimated upper bounds on the variances of the estimated variance components are 4.5 × 10 −5 , 10 −5 , and 5.8 × 10 −5 . For determining the rating, the user effect is dominant over the song effect.
The third data set, from Last.fm (2015), contains the numbers of times artists' songs are played by about 360,000 users. Only the counts for the top k (for some k) artists for each user are recorded. The users are randomly selected. This data set is extremely sparse; only about 0.03 percent of user-artist pairs are observed.
The estimated variances of the user random effect, the artist random effect, and the error are 1.65, 0.22, and 0.27. The estimated kurtoses are 0.019, −2, and 23.14. Estimated upper bounds on the variances of the estimated variance components are 1.68 × 10 −5 , 4.06 × 10 −7 , and 1.37 × 10 −6 . The biggest source of variation in the number of plays is the user, not the artist. The kurtosis of the row effect is nearly zero, indicating possible normality.
In all three data sets at least one of the estimated kurtoses was −2, which would be unexpected if the model is correctly specified. However, if model (1) does not fit the data well, such behavior may occur. For example, the expected rating of a movie may not be additively decomposable into a movie effect, a user effect, and an error.

Conclusion
Using both theory and simulations, we have found that fitting large two-factor crossed unbalanced random effects models by traditional maximum likelihood or MCMC methods has costs that are superlinear in the number of data points, N. With the method of moments it is possible to get, in linear time, parameter estimates and somewhat conservative estimates of their variance. The space requirements are proportional to the number of distinct levels of the factors; this will often be sublinear in N. We also developed shrinkage predictors of missing data that utilize our method of moments estimates.
Through simulations on normally distributed data, we show that our method of moments estimates are competitive with maximum likelihood estimates. We trade off a small increase in the MSE of one variance component for a dramatic decrease in computation time as N gets large.
As stated in the introduction, the crossed random effects model we consider here is the simplest one for which we felt that there was no useful prior solution. We expect that richer models, which are the basis of our future work, will provide better fits to real world data.
In some cases we may be expecting a repeat observation in the ij-cell and then it may be possible to get a better estimate of µ + a i + b j than Y ij is. Section 19 of the supplement considers this problem.

Informative Missingness
We have assumed throughout that the missingness pattern in Z ij is not informative. But in many applications the observed values are likely to differ in some way from the missing values. For instance, in movie ratings data people may be more likely to watch and rate movies they believe they will like, and so missing values could be lower on average than observed ones. In general, the observed ratings may have both high and low values oversampled relative to middling values.
From observed values alone we cannot tell how different the missing values would be. To do so requires making untestable assumptions about the missingness mechanism. Even in cases where followup sampling can be made, e.g., giving some users incentives to make additional ratings, there will still be difficulties such as users refusing to make those ratings, or if forced, making inaccurate ratings. Methods to adjust for missingness have to be designed on a case by case basis, using whatever additional data and assumptions can be brought to bear. The uncertainties of the estimates from such methods can be quantified using, with further development, the techniques of this paper.

Proof of Theorem 2.1

In the balanced case we may assume that i ∈ {1, 2, . . . , R} and j ∈ {1, 2, . . . , C}. The posterior distribution of the parameters is given by π = p(µ, a, b, σ²_A, σ²_B, σ²_E | Y).

Then, φ is given by the conditional distribution of (a, b) given Y, µ, and the variance components. Therefore, the posterior distribution of a and b is jointly normal with precision matrix

Q = ( (C/σ²_E + 1/σ²_A) I_R, (1/σ²_E) 1_R 1_C^T ; (1/σ²_E) 1_C 1_R^T, (R/σ²_E + 1/σ²_B) I_C ).

From Theorem 1 of Roberts and Sahu (1997), for the Gibbs sampler described in Section 2.1, we have the following result. Let A = I − diag(Q⁻¹_11, Q⁻¹_22) Q, where Q_11 denotes the upper left block of Q and Q_22 denotes the lower right block. Let L be the block lower triangular part of A, and U = A − L. Then, the convergence rate ρ is given by the spectral radius of the matrix B = (I − L)⁻¹ U. Now, we compute ρ. First,

B = (I − L)⁻¹ U = ( 0, −Q⁻¹_11 Q_12 ; 0, Q⁻¹_22 Q_21 Q⁻¹_11 Q_12 ),

and the lower right block is proportional to 1_C 1_C^T. Clearly, B has rank one. Then, its spectral radius must be equal to its nonzero eigenvalue, which is also the trace of B. Hence,

ρ = tr(Q⁻¹_22 Q_21 Q⁻¹_11 Q_12) = RC σ²_A σ²_B / ((C σ²_A + σ²_E)(R σ²_B + σ²_E)).
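The rate can also be evaluated numerically for moderate R and C. The following sketch does so for the balanced case, with µ and the variance components held fixed, by forming Q and B exactly as above; it is an illustration rather than the authors' own code.

```python
import numpy as np

def gibbs_rate_balanced(R, C, s2A, s2B, s2E):
    """Spectral radius of B = (I - L)^{-1} U for the balanced two-factor Gibbs sampler."""
    Q11 = (C / s2E + 1.0 / s2A) * np.eye(R)
    Q22 = (R / s2E + 1.0 / s2B) * np.eye(C)
    Q12 = np.full((R, C), 1.0 / s2E)
    Q = np.block([[Q11, Q12], [Q12.T, Q22]])
    D = np.block([[np.linalg.inv(Q11), np.zeros((R, C))],
                  [np.zeros((C, R)), np.linalg.inv(Q22)]])
    A = np.eye(R + C) - D @ Q
    L = np.tril(A, -1)                 # equals the block lower part here
    U = A - L
    B = np.linalg.solve(np.eye(R + C) - L, U)
    return max(abs(np.linalg.eigvals(B)))

# For example, gibbs_rate_balanced(100, 100, 2.0, 0.5, 1.0) is about 0.976,
# and the rate approaches 1 as R and C grow.
```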

Partially observed random effects model
The random effects model is Y_ij = µ + a_i + b_j + e_ij, with a_i ∼ F_a, b_j ∼ F_b, and e_ij ∼ F_e independent of each other. These random variables have mean 0, variances σ²_A, σ²_B, σ²_E and kurtoses κ_A, κ_B, κ_E, respectively. We will not need their skewnesses.
We use letters i, i , r, r to index rows. Letters j, j , s, s are used for columns. In internet applications, the actual indices may be people rating items, items being rated, cookies, URLs, IP addresses, query strings, image identifiers and so on. We simplify the index set to N for notational convenience. One feature of these variables is that we fully expect future data to bring hitherto unseen levels. That is why a countable index set is appropriate.
We will want to estimate σ 2 A , σ 2 B , σ 2 E and get a formula for the variance of those estimates. Many, perhaps most, of the Y ij values are missing. Here we assume that the missingness is not informative. For further discussion see Section 7.1 of the main document.
The variable Z_ij ∈ {0, 1} takes the value 1 if Y_ij is available and 0 otherwise. The total sample size is N = Σ_ij Z_ij. We assume that 1 ≤ N < ∞. We also need N_i• = Σ_j Z_ij and N_•j = Σ_i Z_ij. The number of unique observed rows and columns are, respectively, R = Σ_i 1{N_i• > 0} and C = Σ_j 1{N_•j > 0}. In the sums above, only finitely many summands are nonzero. When we sum over i, i′, r, r′, the sum is over the set {i | N_i• > 0}. Similarly sums over column indices j, j′, s, s′ are over the set {j | N_•j > 0}. These ranges are what one would naturally get in a pass over data logs showing all records.
We frequently need the number of columns jointly observed in two rows such as i and i . This is j Z ij Z i j = (ZZ T ) ii . Similarly, columns j and j are jointly observed in i Z ij Z ij = (Z T Z) jj rows. The matrix Z encodes several different measurement regimes as special cases. These include crossed designs, nested designs and IID sampling, as follows. A crossed design with an R × C matrix of completely observed data can be represented via Z ij = 1 1 i R 1 1 j C .
If max i N i• = 1 and max j N •j > 1 then the data have a nested structure, with N •j distinct rows in column j and (Z T Z) jj = 0 for j = j . Similarly max j N •j = 1 with max i N i• > 1 yields columns nested in rows. If max i N i• = max j N •j = 1 then we have N IID observations. We note some identities: We need some notation for equality among index sets. The notation 1 ij=rs means 1 i=r 1 j=s . It is different from 1 {i,j}={r,s} which we also use. Additionally, 1 ij =rs means 1 − 1 ij=rs .

Weighted U statistics
We will work with weighted U-statistics for weights u_i, v_j and w_ij chosen below. We can write U_a = Σ_i u_i N_i• (N_i• − 1) s²_i•, where s²_i• is an unbiased estimate of σ²_B + σ²_E from within any row i with N_i• ≥ 2. Under our model the values in row i are IID with mean µ + a_i and variance σ²_B + σ²_E, and their kurtosis is the same for every j within the row. Thus the variance of s²_i• is, up to a factor depending on that kurtosis, inversely proportional to a value between N_i• − 1 and N_i•. Inverse variance weighting then suggests that we weight s²_i• proportionally to a value between N_i• and N_i• − 1. Weighting proportional to N_i• − 1 has the advantage of zeroing out rows with N_i• = 1. This consideration motivates us to take u_i = 1/N_i•, and similarly v_j = 1/N_•j.
If U e is dominated by contributions from e ij then the observations enter symmetrically and there is no reason to not take w ij = 1. Even if the e ij do not dominate, the statistic U e compares more data pairs than the others. It is unlikely to be the information limiting statistic. So w ij = 1 is a reasonable default.
If the data are IID then only U e above is nonzero. This is appropriate as only the sum σ 2 A + σ 2 B + σ 2 E can be identified in that case. For data that are nested but not IID, only two of the U-statistics above are nonzero and in that case only one of σ 2 A and σ 2 B can be identified separately from σ 2 E . The U-statistics we use are then Because we only sum over i with N i• > 0 and j with N •j > 0, our sums never include 0/0.
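For concreteness, one way of writing the three statistics that is consistent with the weight choices u_i = 1/N_i•, v_j = 1/N_•j, and w_ij = 1 above, and with the expectations used in the main text, is

U_a = Σ_i (1/N_i•) Σ_{j < j'} Z_ij Z_ij' (Y_ij − Y_ij')²,
U_b = Σ_j (1/N_•j) Σ_{i < i'} Z_ij Z_i'j (Y_ij − Y_i'j)²,
U_e = Σ_{(i,j) < (r,s)} Z_ij Z_rs (Y_ij − Y_rs)²,

where the last sum runs over unordered pairs of distinct observed cells.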

Expected U -statistics
Here we find the expected values for our three U -statistics.

The matrix M

Writing the expectations of the three U-statistics as a linear system gives M(σ²_A, σ²_B, σ²_E)^T = (E(U_a), E(U_b), E(U_e))^T with

M = ( 0, N − R, N − R ; N − C, 0, N − C ; N² − Σ_i N²_i•, N² − Σ_j N²_•j, N² − N ). (106)

Our moment based estimates are

(σ̂²_A, σ̂²_B, σ̂²_E)^T = M⁻¹ (U_a, U_b, U_e)^T. (107)

They are only well defined when M is nonsingular. The determinant of M is

det(M) = (N − R)(N − C)(N² + N − Σ_i N²_i• − Σ_j N²_•j).

The first factor is positive so long as max_i N_i• > 1, and the second factor requires max_j N_•j > 1. We already knew that we needed these conditions in order to have all three U-statistics depend on the Y_ij. It is still of interest to know when the third factor is positive. It is sufficient that no row or column has over half of the data.

The variance
From equation (107),

Var(θ̂) = M⁻¹ Var((U_a, U_b, U_e)^T) (M⁻¹)^T,

where M is given at (106). So we need the variances and covariances of the three U-statistics.
To find variances, we will work out E(U 2 ) for our U -statistics. Those involve × (a r − a r ) 2 + (b s − b s ) 2 + (e rs − e r s ) 2 + 2(a r − a r )(b s − b s ) + 2(a r − a r )(e rs − e r s ) + 2(b s − b s )(e rs − e r s ) .
This expression involves 8 indices and it has 36 terms. Some of those terms simplify due to independence and some vanish due to zero means. To shorten some expressions we use with mnemonics bilinear, diagonal and quartic. There are similarly defined terms for component B. For the error term we have B E,iji j ,rsr s ≡ E((e ij − e i j )(e rs − e r s )) D E,ij,i j ≡ E((e ij − e i j ) 2 ), and, Q E,iji j ,rsr s ≡ E((e ij − e i j ) 2 (e rs − e r s ) 2 ).

The generic contribution
The other 24 terms are zero.

Variance parts
Here we collect expressions for the quantities appearing in the generic term of our squared U -statistics.
Lemma 11.1. In the random effects model (100), Proof. The first one follows by expanding and using E(a i a r ) = σ 2 A 1 i=r , et cetera. The other two use the same argument.
Lemma 11.2. In the random effects model (100), Proof. Take i = r and i = r in Lemma 11.1.

Variance of U a
We will work out E(U 2 a ) and then subtract E(U a ) 2 . First we write For E(U 2 a ) we use the special case i = i and r = r of (108),  after eliminating terms that are always 0. We handle these five sums in the next subsubsections.

U
The expression ir (ZZ T ) ir simplifies to j N 2 •j , changing it from a 'row quantity' to a 'column quantity'. But the other parts of this expression are equivalent to sums of terms such as N −1 i• Z ij N •j making the column version less convenient to work with. Term 1.3 is the same as term 1.2 by symmetry of indices.
Term 1.4 is σ 4 B times ijj rss Summing terms 1.1 through 1.4 yields by the same process that evaluated term 1.1. Term 2.2 is σ 4 E (κ E + 2)/4 times ijj rss The last expression resembles the diagonal part of term 1.2. Term 2.3 is the same is the same as term 2.2. Term 2.4 is σ 4 E times ijj rss This is the same sum as the coefficient in term 1.4 has except that it has the additional constraint i = r. Imposing i = r on that quantity yields Term 2 is thus 11.2.3 U 2 a terms 3 and 4 These terms are equal by symmetry. We evaluate term 3.
by the same steps. Therefore term 3 of E(U 2 a ) equals σ 2 B σ 2 E (N − R) 2 and the sum of terms 3 and 4 is 2σ 2 11.2.4 U 2 a term 5 The term equals ijj rss Term 5 is then which we call terms 5.1, 5.2, 5.3 and 5.4. Next we find the coefficients of σ 2 B σ 2 E in these four terms.
For term 5.1, we get ijj ss For term 5.2, we get − ijj ss as well. Terms 5.3 and 5.4 are also N i• − 1, by the steps used for terms 5.2 and 5.1 respectively. As a result term 5 equals 4σ 2 B σ 2 E (N − R).

Combination
Combining the results of the previous sections, we have

Checks
We can check some special cases of this formula.

Rows nested in columns
If for instance rows are nested within columns, then N = R, and all N i• = N r• = 1 and in this case U a = 0. The above formula gives Var(U a ) = 0 for this case.

Columns nested in rows
If columns are nested in rows, then (ZZ T ) ir = 1 i=r N i• and equation (109) yields When columns are nested in rows, then U a = i (N i• − 1)s 2 i• and because the rows are then independent, U a has variance The kurtosis of b j + e ij is Therefore for columns nested in rows matching the expression (110) that comes from equation (109) for Var(U a ).
11.4.3 σ 2 B = 0 If σ 2 B = 0 then Var(U a ) should be the same as it is for columns nested in rows. In this case equation (109) reduces to If instead we first take the columns nested in rows special case from equation (110) and then substitute σ 2 B = 0, we get the same expression.
11.4.4 σ 2 E = 0 and κ B = −2 In this special case we take σ 2 E = 0 and take b j ∼ U (±1). Then σ 2 B = 1 and κ B = −2. Then In this case To get the variance of U a we need E(b j b j b s b s ) = 1 j=j 1 s=s + 1 j=s 1 j =s + 1 j=s 1 j =s − 2 × 1 j=s 1 j =s 1 s=s .

Carrying out this computation, we get a result matching the one from (109).
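The fourth-moment identity for $b_j \sim U\{\pm 1\}$ used above can be spot-checked by simulation; the sketch below (illustration only) compares the Monte Carlo moment with the indicator formula over all index patterns.

```python
import numpy as np
from itertools import product

# Check E(b_j b_j' b_s b_s') for iid b ~ U{-1, +1} against the indicator identity.
rng = np.random.default_rng(2)
n_levels, n_rep = 4, 400_000
b = rng.choice([-1.0, 1.0], size=(n_rep, n_levels))

def predicted(j, jp, s, sp):
    return ((j == jp) * (s == sp) + (j == s) * (jp == sp) + (j == sp) * (jp == s)
            - 2 * (j == s) * (jp == sp) * (s == sp))

for j, jp, s, sp in product(range(n_levels), repeat=4):
    mc = (b[:, j] * b[:, jp] * b[:, s] * b[:, sp]).mean()
    assert abs(mc - predicted(j, jp, s, sp)) < 0.02
print("fourth-moment identity verified on all index patterns")
```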

Crossed design
In a crossed design $N_{i\bullet} = C$ for all $i$ and $(ZZ^T)_{ir} = C$ for all $i$ and $r$. Here the variance takes a correspondingly simplified form.
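A two-line check of these facts for a complete grid (toy sizes, illustration only):

```python
import numpy as np

R, C = 5, 7
Z = np.ones((R, C))           # fully crossed: every cell observed once
print(Z.sum(axis=1))          # N_i. = C for every row
print(np.unique(Z @ Z.T))     # (Z Z^T)_ir = C for every pair (i, r)
```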

Variance of $U_b$
This case is exactly symmetric to the one above, with $\mathrm{Var}(U_a)$ given by (109). Therefore $\mathrm{Var}(U_b)$ is obtained by interchanging the roles of rows and columns (and of the A and B components) in that expression.

11.6 Variance of $U_e$

As before, we find $E(U_e^2)$ and then subtract $E(U_e)^2$. Now $E(U_e^2)$ expands as a sum over the indices $ii'jj'rr'ss'$. From (108), after eliminating the terms that vanish, twelve sums remain.
We handle the twelve sums in the next subsections.
11.6.1 $U_e^2$ term 1

As before, we split term 1 into four parts.

Term 1 is $\tfrac14$ times a sum over the indices $ii'jj'rr'ss'$. Summing terms 1.1 to 1.4 gives term 1.

11.6.2 $U_e^2$ term 2

We can use the symmetry of the roles of A and B and their indices. Therefore, term 2 is equal to the term 1 expression with rows and columns interchanged.

11.6.3 $U_e^2$ term 3

As before, we split term 3 into four parts.

Term 3 equals
$$\frac14 \sum_{ii'jj'rr'ss'} Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}\, Q_{E,iji'j',rsr's'}.$$
Term 3.2 is $\sigma^4_E(\kappa_E+2)/4$ times a sum over $ii'jj'rr'ss'$. By symmetry of indices, term 3.3 is the same as term 3.2. For term 3.4, we have $\sigma^4_E$ times a sum over $ii'jj'rr'ss'$. Summing terms 3.1 to 3.4 gives term 3.

11.6.4 $U_e^2$ term 4

This term is $\tfrac14$ times a sum over $ii'jj'rr'ss'$ that factors into two pieces. The first factor is a sum over $ii'jj'$; by the same argument, the second factor is the corresponding sum over $rr'ss'$, and term 4 is their product divided by 4.

11.6.5 $U_e^2$ term 5

This term is $\tfrac14$ times a sum over $ii'jj'rr'ss'$ that also factors, with second factor $\sum_{rr'ss'} Z_{rs}Z_{r's'}\, D_{E,rs,r's'}$. The first factor was computed in the previous section, and the second factor evaluates similarly. Thus term 5 is their product divided by 4.

11.6.6 $U_e^2$ term 6

By symmetry of indices, this is the same as term 4.

11.6.7 $U_e^2$ term 7

This is like term 5 with the factors for A and B interchanged. Thus term 7 equals the corresponding product.

11.6.8 $U_e^2$ term 8

By symmetry of indices, this is the same as term 5.

11.6.9 $U_e^2$ term 9

By symmetry of indices, this is the same as term 7.

11.6.10 $U_e^2$ term 10

This term is a sum over $ii'jj'rr'ss'$ of $Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}$ multiplied by a signed combination of indicator products of the form $1_{i=r}1_{j=s}$, with the primed and unprimed row and column indices appearing in all combinations.

11.6.11 $U_e^2$ term 11

This term is a sum over $ii'jj'rr'ss'$ of $Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}$ multiplied by a signed combination of indicators of the form $1_{ij=rs}$ (both indices matching) and products such as $1_{i=r}1_{i'j'=r's'}$, again over combinations of primed and unprimed indices.

11.6.12 $U_e^2$ term 12

We can use the symmetry with term 11, interchanging rows and columns. Thus, term 12 is the corresponding expression with the roles of rows and columns swapped.

Combination
Summing up the results of the previous twelve sections gives $E(U_e^2)$; subtracting $E(U_e)^2$ then gives $\mathrm{Var}(U_e)$. Next, we simplify the form of this expression. The coefficient of $(\kappa_A+2)\sigma^4_A$ simplifies, and similarly for that of $(\kappa_B+2)\sigma^4_B$. The coefficient of $(\kappa_E+2)\sigma^4_E$ simplifies as well. Applying these simplifications yields the final form of $\mathrm{Var}(U_e)$. The coefficient of $\sigma^2_A\sigma^2_B$ is a measure of how close the data are to a regular $R\times C$ grid.

IID sampling
If $\max_i N_{i\bullet} = \max_j N_{\bullet j} = 1$, then the observations are IID with variance $\sigma^2_Y = \sigma^2_A+\sigma^2_B+\sigma^2_E$ and kurtosis $(\kappa_A\sigma^4_A + \kappa_B\sigma^4_B + \kappa_E\sigma^4_E)/\sigma^4_Y$. Now $U_e$ is $N(N-1)$ times the sample variance of all $N$ observations, so its variance follows from the classical formula for IID data. In this case the general formula for $\mathrm{Var}(U_e)$ should reduce to the same thing. If we set all positive $N_{i\bullet}=1$ and all positive $N_{\bullet j}=1$ then $\sum_i N_{i\bullet}^2 = N$ because there are now $R=N$ rows in the data. Similarly $N_{i\bullet}^3$ and $N_{i\bullet}^4$ sum to $N$, and these powers of $N_{\bullet j}$ also sum to $N$. Next $\sum_{ij} Z_{ij}N_{i\bullet}N_{\bullet j} = \sum_{ij} Z_{ij} = N$. The most subtle of these sums is $\sum_{ij} N_{i\bullet}^2 N_{\bullet j}^2 = \sum_{ij} 1 = N^2$, because the indices run over all $i$ with $N_{i\bullet}>0$ and all $j$ with $N_{\bullet j}>0$.
Making these substitutions we get an expression which matches equation (114). Equation (113) gives the corresponding result for this case.
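The reduction of $U_e$ to a multiple of the sample variance rests on the identity $\tfrac12\sum_{u\ne v}(Y_u - Y_v)^2 = N(N-1)s^2$; the sketch below checks it numerically. The $\tfrac12\sum_{u\ne v}$ normalization is assumed here for illustration and may differ from the paper's definition of $U_e$.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=50)
N = len(y)

pairwise = 0.5 * sum((a - b) ** 2 for a in y for b in y)   # (1/2) * sum over u != v
identity = N * (N - 1) * y.var(ddof=1)                     # N(N-1) times the sample variance
print(pairwise, identity)   # equal up to floating point error
```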

IID sampling again
If $\max_i N_{i\bullet} = 1$ and $\sigma^2_B = 0$, then once again the observations are IID. The formula gives a result matching (115).

Covariance of $U_a$ and $U_b$
We use the formula $\mathrm{Cov}(U_a, U_b) = E(U_aU_b) - E(U_a)E(U_b)$, so we just need to compute $E(U_aU_b)$. Using our preferred normalization, we expand the product and take expectations term by term.
We consider each term separately.

$U_aU_b$ term 1

For 1.2, we have $\sigma^4_E(\kappa_E+2)/4$ times a sum over the indices $ijj'rr's$, since the last indicator implies $r = i$ and $r' = i$ but the second one is $1_{r'=r}$. Summing up, term 1 is equal to the resulting expression.

Combination
Adding up the four terms gives $E(U_aU_b)$ and hence $\mathrm{Cov}(U_a,U_b)$. Notice that $\mathrm{Cov}(U_a, U_b) = 0$ when $\sigma^2_E = 0$. This can be verified by noting that when $\sigma^2_E = 0$ then $U_a$ is a function only of the $b_j$ while $U_b$ is a function only of the $a_i$. Therefore $U_a$ and $U_b$ are independent when $\sigma^2_E = 0$.
13 Covariance of $U_a$ and $U_e$

We use the formula $\mathrm{Cov}(U_a, U_e) = E(U_aU_e) - E(U_a)E(U_e)$, so we just need to compute $E(U_aU_e)$. We expand the product and take expectations term by term.
We consider each term separately.
13.1 $U_aU_e$ term 1

Term 1 is $\tfrac14$ times a sum over the indices $ijj'rr'ss'$ involving the factor $(1_{j\in\{s,s'\}} + 1_{j'\in\{s,s'\}})$. Term 1.2 is equal to $\sigma^4_B(\kappa_B+2)/4$ times a sum over $ijj'rr'ss'$. Summing the four terms, we find the value of term 1.

13.2 $U_aU_e$ term 2

Term 2 is $\tfrac14$ times a sum over $ijj'rr'ss'$. For 2.2, we get $\sigma^4_E(\kappa_E+2)/4$ times a sum over $ijj'rr'ss'$. Adding up the four terms gives term 2.

13.3 $U_aU_e$ term 3

Term 3 is $\tfrac14$ times a sum over $ijj'rr'ss'$.

13.4 $U_aU_e$ term 4

Term 4 is $\tfrac14$ times a sum over $ijj'rr'ss'$.

13.5 $U_aU_e$ term 5

Term 5 is $\tfrac14$ times a sum over $ijj'rr'ss'$, evaluated using the result for term 3.

Combination
We add up the seven terms, replacing some $N_{r\bullet}$ and $N_{\bullet s}$ expressions by equivalents using $N_{i\bullet}$ and $N_{\bullet j}$, getting $E(U_aU_e)$. The product $E(U_a)E(U_e)$ contains terms equalling several of those in $E(U_aU_e)$ above. Subtracting those terms from $E(U_aU_e)$ yields the expression for $\mathrm{Cov}(U_a, U_e)$.

14 Covariance of $U_b$ and $U_e$

By interchanging the roles of the rows and columns in $\mathrm{Cov}(U_a, U_e)$, we find the corresponding expression for $\mathrm{Cov}(U_b, U_e)$.

Asymptotic approximation: proof of Theorem 4.2
We suppose that the following inequalities all hold for the same small $\epsilon > 0$. The first six inequalities are assumed in the theorem statement. The last two follow from the first two. We also assume a further bound stated below. Note that we can bound $\sigma^2_A\sigma^2_B$, $\sigma^2_A\sigma^2_E$, and $\sigma^2_B\sigma^2_E$ away from 0 and $\infty$ uniformly with those other quantities after replacing $m$ by $\min(m, m^2)$ and $M$ by $\max(M, M^2)$.
We also suppose the bounds in (116). These seem reasonable, but it appears that they cannot be derived from the first eight bounds above. We begin with the coefficient of $\sigma^4_B(\kappa_B+2)$ in $\mathrm{Var}(U_a)$ from equation (12). The third, fourth and fifth terms in $\mathrm{Var}(U_a)$ are all $O(\epsilon)$ relative to the lead term. The second term is also of smaller order than the lead term. As a result $\mathrm{Var}(U_a)$ is dominated by its lead term up to a factor $1+O(\epsilon)$. Turning to the covariances, $\mathrm{Cov}(U_a,U_b)$ is handled first; next, $\mathrm{Cov}(U_a,U_e)$ contains a $\sigma^4$ term, and similarly for $\mathrm{Cov}(U_b,U_e)$. The last variance is $\mathrm{Var}(U_e)$, which is treated in the same way. Next we verify that these variance estimates are asymptotically uncorrelated; ignoring the $1+O(\epsilon)$ factors, the required covariances are negligible relative to the corresponding variances.

Estimating Kurtoses
To estimate the kurtoses $\kappa_A$, $\kappa_B$ and $\kappa_E$ in the above variance expressions, it suffices to estimate fourth central moments such as $\mu_{A,4} = \sigma^4_A(\kappa_A+3)$ and the similarly defined $\mu_{B,4}$ and $\mu_{E,4}$. Given $\hat\sigma^2_A$, $\hat\sigma^2_B$, and $\hat\sigma^2_E$, we can do this via GMM. Consider a set of estimating equations and their expectations. Using previous results and symmetry, these expectations are all linear in the fourth moments. Therefore, given estimates of $\sigma^2_A$, $\sigma^2_B$, and $\sigma^2_E$, we can solve another three-by-three system of equations to get estimates of the fourth moments.
Letting $M$ be the matrix in equation (106), we can write this system compactly. For plug-in method of moments estimators we replace expected W-statistics by their sample quantities, replace the variance components by their estimates, and solve the matrix equation, getting $\hat\mu_{A,4}$ et cetera. Then $\hat\kappa_A = \hat\mu_{A,4}/\hat\sigma^4_A - 3$ and so on.
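Computationally, this plug-in step is a single small linear solve. The sketch below uses placeholder numbers for the matrix from (106) and for the adjusted W-statistics, since those quantities are not reproduced here; it only illustrates the shape of the calculation.

```python
import numpy as np

# Placeholder inputs (NOT values from the paper): a 3x3 coefficient matrix playing
# the role of M from equation (106), and adjusted sample W-statistics.
M = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.1],
              [0.3, 0.3, 1.0]])
w = np.array([5.0, 4.0, 3.5])

mu4 = np.linalg.solve(M, w)           # estimates of (mu_{A,4}, mu_{B,4}, mu_{E,4})

sigma2 = np.array([1.1, 0.9, 1.0])    # plug-in (sigma_A^2, sigma_B^2, sigma_E^2) estimates
kappa = mu4 / sigma2 ** 2 - 3.0       # kappa_hat = mu4_hat / sigma_hat^4 - 3
print(kappa)
```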

Best linear predictor
Here we consider linear prediction of $Y_{ij}$. We begin with predictions of the form $\hat Y_{ij} = \hat Y_{ij}(\lambda) = \sum_{rs}\lambda_{rs}Z_{rs}Y_{rs}$. Then we consider predictions of a reduced form that use only the totals in row $i$, in column $j$, and in the whole data set.
17.1 Proof of Lemma 5.1

Now suppose that we consider the loss $L = E\bigl(((\mu + a_i + b_j) - \hat Y_{ij})^2\bigr)$. To do so we replace $\mathrm{Var}(Y_{ij})$ and $\mathrm{Cov}(Y_{ij}, \hat Y_{ij})$ above by $\mathrm{Var}(a_i + b_j)$ and $\mathrm{Cov}(a_i + b_j, \hat Y_{ij})$ respectively, yielding the stated result.

Stationary conditions
The partial derivative of $L$ with respect to $\lambda_{r''s''}$ contains, from the $\sigma^2_A$ part, sums of terms of the form $Z_{rs}Z_{rs'}(\lambda_{rs}1_{rs'=r''s''} + \lambda_{rs'}1_{rs=r''s''})$, with similar contributions from the other components. After taking account of the indicator functions we get a simpler expression. We can replace $Z_{is''}1_{i=r''}$ by $1_{i=r''}$ because of the leading factor $Z_{r''s''}$. This, together with a corresponding change to the other coefficient, simplifies the expression further. The simplified expression no longer requires the double primes, and so we find the partial derivative of $L$ with respect to $\lambda_{rs}$.

Proof of Lemma 5.2
Here we consider the reduced-form predictor $\hat Y_{ij}$. The mean squared error is $L = E((Y_{ij} - \hat Y_{ij})^2)$. Expanding it, we set about finding the individual terms. The first terms follow directly; the remaining terms use somewhat longer arguments.
Combining these pieces gives the expansion of $L$. To target $\mu + a_i + b_j$ instead of $Y_{ij}$, we must replace $\mathrm{Var}(Y_{ij})$ by $\mathrm{Var}(\mu + a_i + b_j) = \sigma^2_A + \sigma^2_B$ and remove the $\sigma^2_E Z_{ij}$ terms from the covariances with $Y_{ij}$. The result is the expression stated in the lemma.

Proof of Theorem 5.1
From the result of Lemma 5.2, we see that $L$ is quadratic in $\lambda$. Since $L$ is bounded below by 0, it attains its minimum on $\mathbb{R}^3$ at any solution of the stationarity condition $\nabla_\lambda L = 0$. We find the components of this gradient.
One half of each component of the gradient can be written compactly. Using $T_{i\bullet} \equiv \sum_s Z_{is}N_{\bullet s}$ and $T_{\bullet j} \equiv \sum_r Z_{rj}N_{r\bullet}$, some of these expressions simplify.
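Numerically, once the gradient components are assembled into the $3\times 3$ stationarity system $H\lambda = c$, the optimal weights come from one small solve. The $H$ and $c$ below are placeholders (illustration only), not quantities computed from the paper's formulas.

```python
import numpy as np

# Placeholder 3x3 stationarity system for (lambda_0, lambda_a, lambda_b); the real
# entries come from the gradient components derived above.
H = np.array([[2.0, 0.5, 0.5],
              [0.5, 1.5, 0.2],
              [0.5, 0.2, 1.5]])
c = np.array([1.0, 0.8, 0.8])

lam0, lam_a, lam_b = np.linalg.solve(H, c)
print(lam0, lam_a, lam_b)
```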

Proof of Theorem 5.2
To begin with, we note that $N_{\bullet j} = \sum_r Z_{rj} \le \sum_r N_{r\bullet}Z_{rj} \le N$. We write the relevant quantities accordingly; then the stated approximation holds up to a factor $1 + O(\epsilon)$.
Next, $\det H = H_{11}H_{33} - H_{13}^2$, which to leading order is $\mu^2 N^2$. As a result the prediction for a new row in a large column is essentially that column average plus $O(1/N_{\bullet j})$ times the global average.

17.6 Special case $N_{i\bullet} = 0$ and $N_{\bullet j} = 1$

Now suppose that we have no data in the target row and exactly one older observation in the target column. Let $i'$ be the single row with $Z_{i'j} = 1$. There are enough large rows and columns that the usual conditions $N \ll \sum_i N_{i\bullet}^2 \ll N^2$ hold, but there are also some lightly observed rows and columns. We work out the determinant of $H$ and the numerator for $\lambda^*_0$. Similarly, the numerator for $\lambda^*_b$ is
$$-H_{13}(N\mu^2 + \sigma^2_B) + H_{11}(\mu^2 + \sigma^2_B) \approx -(\mu^2 N)N\mu^2 + \mu^2 N^2(\mu^2 + \sigma^2_B) = \mu^2\sigma^2_B N^2,$$
and so $\lambda^*_b$ follows. In this case, the prediction for $Y_{ij}$ follows by combining these weights.

18 Asymptotic weights: proof of Theorem 5.3

Here we have a collection of order bounds. The first five follow easily from $1 < 1/\epsilon \le N_{i\bullet}, N_{\bullet j} \le \epsilon N$. The last four follow from the others. For instance $\sum_r N_{r\bullet}^2 \le \sum_r N_{r\bullet}(\epsilon N) = \epsilon N^2$, and $\sum_r N_{r\bullet}Z_{rj} \le \sum_r Z_{rj}(\epsilon N) = \epsilon N_{\bullet j}N$. We also have $0 < m \le \mu^2, \sigma^2_A, \sigma^2_B, \sigma^2_E \le M < \infty$. Then, using symbolic computation (via Wolfram|Alpha, September 6, 2015), we expand the entries of $H^{-1}$. The determinant of $H^{-1}$ is $(\sigma^2_A\sigma^2_B\mu^2 N_{i\bullet}^2 N_{\bullet j}^2 N^2)^{-1}(1+O(\epsilon))$, so we need $N_{i\bullet} \gg 1$ and $N_{\bullet j} \gg 1$ to make matrix inversion a continuous operation. Similar expansions hold for the other entries. Thus, ignoring the $O(\epsilon)$ terms, we can read off the limiting weights. The end result $-1/N$ is of the same order of magnitude as the original terms. Therefore $\lambda^*_0 = (-1/N)(1 + O(\epsilon))$. Similar computations give $\lambda^*_a$ and $\lambda^*_b$, and both of these approximations involve multiplication by $1 + O(\epsilon)$. In this limit, then, the prediction reduces to a form which makes intuitive sense as $(\hat\mu + \hat a_i) + (\hat\mu + \hat b_j) - \hat\mu$.
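For intuition, a naive plug-in version of this limiting form is just the row mean plus the column mean minus the grand mean for the target cell. The sketch below uses raw means on a toy data set rather than the shrinkage quantities $\hat\mu + \hat a_i$ and $\hat\mu + \hat b_j$, so it only illustrates the structure of the limit.

```python
import numpy as np

# Toy sparse data set; illustration of the limiting additive prediction only.
rows = np.array([0, 0, 0, 1, 1, 2])
cols = np.array([0, 1, 2, 0, 2, 1])
y    = np.array([1.0, 2.5, 0.3, -1.0, 0.7, 2.2])

def additive_prediction(i, j):
    row_mean   = y[rows == i].mean()
    col_mean   = y[cols == j].mean()
    grand_mean = y.mean()
    return row_mean + col_mean - grand_mean   # (mu + a_i) + (mu + b_j) - mu, with raw means

print(additive_prediction(0, 2))
```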

Smoothing predictors
In some cases we may want a better estimate of $E(Y_{ij})$ than $Y_{ij}$ itself. Such a predictor could take the form (117), which puts either extra or reduced weight on $Y_{ij}$ itself, depending on the sign of $\lambda_{ab}$. This predictor is only useful when $Z_{ij} = 1$, so it does not apply in the new row or new column cases. It is only nontrivial when our goal is to estimate $\mu + a_i + b_j$, not $Y_{ij}$ itself. So we only consider $L = E((\hat Y_{ij} - \mu - a_i - b_j)^2)$ here.
Lemma 19.1. The MSE for the linear predictor (117) is given by a quadratic function of the weights.
Proof. This problem only arises when $Z_{ij} = 1$, which we assume for the rest of this section. Expanding the squared error and gathering up the coefficient of $\mu^2$ gives the stated expression. Half of the derivative of this squared error with respect to $\lambda_{ab}$ shows that, given the other $\lambda$ choices, the squared error is decreasing at $\lambda_{ab} = 0$ (hence we favor positive self-weight) when a certain condition on the variance components holds. Furthermore, the optimal self-weight, given the other $\lambda$'s, follows by setting that derivative to zero. The point of this predictor is that we might expect another observation to be made later in row $i$ and column $j$. Then estimating $\mu + a_i + b_j$ is a better way to predict than repeating the earlier $Y_{ij}$. To use Lemma 19.1 after a second pass, one can compute $L$ as the given quadratic function in the four variables $\lambda_0$, $\lambda_a$, $\lambda_b$ and $\lambda_{ab}$. The minimizer of that quadratic gives weights to apply in prediction. When $\sigma^2_E$ is very small, $Y_{ij}$ is already close to $\mu + a_i + b_j$, and placing special weight on $Y_{ij}$ will be advantageous.