Quasi-Monte Carlo Methods for Binary Event Models with Complex Family Data

Abstract The generalized linear mixed model for binary outcomes with the probit link function is used in many fields but has a computationally challenging likelihood when there are many random effects. We extend a previously used importance sampler, making it much faster in the context of estimating heritability and related effects from family data, by adding a gradient and a Hessian approximation and by providing a faster implementation. Additionally, a graph-based method is suggested to simplify the likelihood when there are thousands of individuals in each family. Simulation studies show that the resulting method is orders of magnitude faster, incurs a negligible efficiency loss, and yields confidence intervals with nominal coverage. We also analyze data from a large study of obsessive-compulsive disorder based on Swedish multi-generational data. In this analysis, the proposed method yielded results similar to a previous analysis, but was much faster. Supplementary materials for this article are available online.


Introduction
Generalized linear mixed models (GLMMs) are commonly used for the analysis of grouped data, either to control for, or to estimate, group-specific effects. This article focuses on GLMMs for binary outcomes with the probit link function, which are commonly referred to as mixed probit models or liability-threshold models. These models have been widely used in many disciplines, such as econometrics (Wilhelm and de Matos 2013), quantitative genetics (de Villemereuil et al. 2016), animal breeding (Thompson 1990), psychiatry (Lichtenstein et al. 2009; Sandin et al. 2017; Bai et al. 2019; Mahjani et al. 2020), and family-based medicine studies (Pawitan et al. 2004; Brown and Prescott 2014). However, estimation of GLMMs with many random effects per cluster is computationally challenging, as it requires approximations of high-dimensional integrals. We particularly focus on estimation of heritability and related effects, where the number of random effects in each cluster is equal to the cluster size. This is problematic, for example, with multi-generational data from national registers, where clusters can have thousands of observations.
Fast and deterministic procedures have been shown to be imprecise for Bayesian inference for the models we consider (Holand et al. 2013). Alternatively, Markov chain Monte Carlo (MCMC) is attractive because of its asymptotic properties and because it can handle very large clusters. However, a common problem with MCMC is slow convergence or, in the worst case, lack of practical convergence.
In frequentist inference, the likelihood of these GLMMs does not have a closed form expression. Deterministic and fast approximations exist (Breslow and Clayton 1993; Pinheiro and Bates 1995), but these can yield biased results, particularly with low event rates or high variability (Rodríguez and Goldman 2001; Pinheiro and Chao 2006; Joe 2008). Adaptive Gauss-Hermite quadrature is effective with few random effects per cluster but scales very poorly in the number of random effects per cluster (Pinheiro and Bates 1995).
The general expression of the likelihood of the GLMMs is in the form of Gaussian weighted integrals, although for the particular class of models used in this article, the likelihood can also be written as cumulative distribution functions (CDFs) of multivariate normal distributions (Christoffersen et al. 2021). Pawitan et al. (2004) make use of this alternative form of the likelihood using the importance sampling method proposed by Genz (1992) and Genz and Bretz (2002), which is based on randomized quasi-Monte Carlo (RQMC), to approximate the likelihood of each cluster. This method has been shown to provide a very efficient means of approximating the likelihood even in higher dimensions. However, a gradient approximation has not previously been used; previous implementations have used either derivative-free optimization methods or approximate derivatives from finite differences, resulting in slow estimation. This has limited how many random effects per cluster researchers have used. For example, previous applications have restricted analysis to smaller cluster sizes by dividing clusters and duplicating observations (Pawitan et al. 2004; Lichtenstein et al. 2009; Bai et al. 2019). This results in a pseudo-likelihood and makes it possible to estimate the model with previous integral approximations, but at the price of efficiency and invalid inference, unless corrections are made (Varin et al. 2011).
We develop a computationally efficient method to approximate the likelihood of GLMMs with binary outcomes and the probit link function for large datasets with many random effects per cluster. We extend the importance sampler developed by Genz (1992) and Genz and Bretz (2002) to approximate the gradient of the log-likelihood, which can be used by gradient based optimization methods, along with an extension that provides a Hessian approximation for fast inference. A highly efficient C++ implementation of our method is provided, which supports parallel processing. In addition, we propose an approach based on graph partitioning to split very large clusters with thousands of random effects, simplifying the likelihood without losing much information in the context of random effect structures that are defined for heritability and related effects. Our efficient approximations and the graph partition method allow us to perform frequentist inference orders of magnitude faster than with a similar Bayesian model using Gibbs sampling.
The remainder of the article is organized as follows. Section 2 introduces the class of models for family-based studies that we use in this article. Section 3 describes the importance sampler we use to approximate the likelihood, the gradient of the log-likelihood, and the Hessian of the log-likelihood. We also describe our graph-based method to simplify the likelihood. Section 4 presents the results of the simulation studies, where we compare our method with two MCMC implementations. A study of possible efficiency loss, bias, and coverage of likelihood ratio-based and Wald-based confidence intervals when using our graph partitioning method and the pseudo-likelihood is provided. Section 5 presents our analysis of data from 897,210 individuals collected from the Swedish national patient register and the Swedish Multi-Generation Registry (SMGR) (Ekbom 2010). We estimate the heritability of obsessive-compulsive disorder (OCD) and compare the computation time with a Gibbs sampler. Section 6 provides the discussion and conclusion. The implementation of our methods is available as an R package called pedmod at https://github.com/boennecd/pedmod.

Models
Statistical inference methodologies have been used to understand the heredity and risk factors of phenotypes (individuals' observable traits). The fundamental step in understanding the inheritance of a phenotype is to partition the variability of that phenotype in a population into genetic and environmental components. For binary outcomes, the most flexible method is based on GLMMs, which can use data from families of varying sizes and structures (Tenesa and Haley 2013). GLMMs partition the variation of a phenotype into additive genetic effects (narrow-sense heritability) and environmental effects using random effects. Common examples of environmental effects are shared childhood environment effects and maternal effects. Maternal effects are the influences on the offspring phenotype that result from maternal phenotypes. They can be partitioned into genetic maternal effects and environmental maternal effects.
In this section, we show how to use GLMMs to partition the variability of a phenotype. Assume that we want to estimate K different random effects (e.g., direct additive genetic or maternal effects) and adjust for a number of fixed effects.
Let m be the number of independent families, with each family i ∈ {1, . . ., m} consisting of n_i members on which we observe outcomes. Each outcome is a binary trait Y_ij ∈ {0, 1} for the jth observation in the ith family with corresponding known fixed effect covariates X_ij. We model the Y_ij's using a GLMM of the form

Y_ij = 1_{β′X_ij + ε_ij > 0},   (ε_i1, . . ., ε_in_i)′ ∼ N^(n_i)(0_{n_i}, Σ_{k=1}^K σ²_k C_ik + I_{n_i}),   (1)

where 1_{·} is one if the condition in the subscript is true and zero otherwise, 0_l is an l-dimensional vector of zeros, I_l is the l-dimensional identity matrix, N^(v)(θ, Ψ) is the v-dimensional multivariate normal distribution with mean θ and covariance matrix Ψ, β is a vector of unknown fixed effect coefficients, and σ² = (σ²_1, . . ., σ²_K)′ are unknown scale parameters. C_ik is the known relationship matrix for family i and effect k. Another way to write the model is

U_ik ∼ N^(n_i)(0_{n_i}, σ²_k C_ik),   Y_ij | U_i1 = u_i1, . . ., U_iK = u_iK ∼ Bin(1, Φ(β′X_ij + Σ_{k=1}^K u_ikj)),   (2)

where Bin(m, p) is the binomial distribution with m trials and probability p, and Φ(x) is the standard normal distribution's CDF evaluated at x, with corresponding density function denoted by φ(x).
In this form, it may be easier to recognize the model as a GLMM.
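As a concrete illustration of the threshold mechanism, the sketch below (not part of the pedmod package; all names and numbers are illustrative) simulates outcomes for a sibling pair with a single additive genetic effect, where the relationship matrix has ones on the diagonal and 0.5 off the diagonal:

```python
# Illustrative sketch: draw binary outcomes from the liability-threshold
# model for a full-sibling pair with one additive genetic effect.
import random
from statistics import NormalDist

def simulate_sib_pair(beta0, sigma2_add, rng):
    """Y_ij = 1{beta0 + eps_ij > 0}, Cov(eps) = sigma2_add * C + I with
    C = [[1, 0.5], [0.5, 1]], the additive relationship matrix for siblings."""
    rho = 0.5 * sigma2_add            # covariance between the two liabilities
    var = sigma2_add + 1.0            # total liability variance per person
    # draw correlated liabilities via a hand-written 2x2 Cholesky factor
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    l11 = var ** 0.5
    l21 = rho / l11
    l22 = (var - l21 ** 2) ** 0.5
    e1 = l11 * z1
    e2 = l21 * z1 + l22 * z2
    return int(beta0 + e1 > 0), int(beta0 + e2 > 0)

rng = random.Random(1)
draws = [simulate_sib_pair(-0.5, 3.0, rng) for _ in range(200_000)]
prev = sum(y1 + y2 for y1, y2 in draws) / (2 * len(draws))
# the marginal prevalence should be close to Phi(beta0 / sqrt(1 + sigma2_add))
print(round(prev, 3), round(NormalDist().cdf(-0.25), 3))
```

The empirical prevalence matches Φ(β_0/√(1 + σ²_Add)), the marginal event probability implied by the model.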
After estimating the parameters, we can calculate the proportion of phenotypic variation explained by each variance component after controlling for the fixed effects,

h_k = σ²_k / (1 + Σ_{k'=1}^K σ²_{k'}),

fixing the residual variance to one as in Equation (1) (i.e., σ²_0 = 1) and assuming that all relationship matrices have ones on the diagonal. An alternative and equivalent model can be obtained by replacing the covariance matrix in Equation (1) with σ̃²_0 I_{n_i} + Σ_{k=1}^K σ̃²_k C_ik subject to σ̃²_0 + Σ_{k=1}^K σ̃²_k = 1. The corresponding fixed effect coefficients are β̃_j = β_j / √(1 + Σ_{k=1}^K σ²_k), and it follows that σ̃²_k = h_k. This parameterization may be preferable for reporting results, as the scale parameters equal the proportion of variance attributable to each effect and the fixed effect coefficients are easier to compare in magnitude with those of models without random effects.
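To make the mapping between the two parameterizations concrete, here is a small numeric sketch with assumed (hypothetical) values of the scale parameters; it converts liability-scale parameters to proportions of variance and rescales the fixed effects:

```python
# Illustrative numbers only: liability-scale variance components and fixed
# effects, converted to the proportion-of-variance parameterization.
sigma2 = {"Add": 2.0, "Mat": 1.0}     # assumed variance components
beta = [-0.5, 0.2]                    # assumed fixed effects, residual var = 1
total = 1.0 + sum(sigma2.values())    # total liability variance

h = {k: v / total for k, v in sigma2.items()}   # proportions, e.g. heritability
beta_std = [b / total ** 0.5 for b in beta]     # rescaled coefficients

print(h)          # {'Add': 0.5, 'Mat': 0.25}
print(beta_std)   # [-0.25, 0.1]
```

With these toy values, half the liability variance is attributed to the additive genetic effect and a quarter to the maternal effect.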
A primary concern with estimating parameters for these GLMMs is that the log marginal likelihood is intractable. The log marginal likelihood term of the ith family is given by

log L_i(β, σ²) = log ∫_{ε: S_i(X_i β + ε) > 0_{n_i}} φ^(n_i)(ε; 0_{n_i}, Σ_{k=1}^K σ²_k C_ik + I_{n_i}) dε,   S_i = diag(2Y_i1 − 1, . . ., 2Y_in_i − 1),   (3)

where the inequality is elementwise, X_i = (X_i1, . . ., X_in_i)′, and φ^(v)(x; θ, Ψ) is the density of N^(v)(θ, Ψ) evaluated at x. In Section 3, we propose to use RQMC to approximate the log marginal likelihood in Equation (3) along with the gradient with respect to β and σ². We focus on modeling a single binary phenotype; however, as we discuss in Section 6, our approach can be easily extended to multiple phenotypes and survival outcomes.
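Before introducing the importance sampler, it is useful to see the brute-force version of the problem. The sketch below (illustrative, stdlib only) estimates one cluster's likelihood by plain Monte Carlo: draw the correlated residuals and count how often they reproduce the observed outcomes. This is the quantity the sampler of Section 3 approximates far more efficiently:

```python
# Illustrative sketch: the cluster likelihood as a multivariate-normal
# region probability, estimated by crude Monte Carlo.
import random

def mc_orthant(mu, chol, y, n_samples, rng):
    """Estimate P(all liabilities mu_j + eps_j agree in sign with y_j),
    where eps ~ N(0, L L^T) and chol = L is lower triangular."""
    hits = 0
    n = len(mu)
    for _ in range(n_samples):
        z = [rng.gauss(0, 1) for _ in range(n)]
        ok = True
        for j in range(n):
            e = mu[j] + sum(chol[j][k] * z[k] for k in range(j + 1))
            # outcome y_j = 1 requires liability > 0; y_j = 0 requires <= 0
            if (e > 0) != bool(y[j]):
                ok = False
                break
        hits += ok
    return hits / n_samples

rng = random.Random(2)
# two siblings, sigma2_add = 3: Cov(eps) = [[4, 1.5], [1.5, 4]]
chol = [[2.0, 0.0], [0.75, (4 - 0.75 ** 2) ** 0.5]]
lik = mc_orthant([-0.5, -0.5], chol, [1, 1], 200_000, rng)
print(round(lik, 3))
```

The drawback is the slow O(s^{-1/2}) convergence; the importance sampler described next attains the same target with far fewer samples.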

Methods
In this section, we describe how we fit GLMMs for complex family data by extending and improving the method by Genz and Bretz (2002), making it faster and adding a gradient approximation and a Hessian approximation (Section 3.1). In Section 3.2, we introduce a novel graph-based method to simplify the likelihood.

Estimations of GLMMs using Importance Sampling
The primary issue with evaluating the log marginal likelihood in Equation (3) is to approximate

L_i(μ, Σ) = ∫_{a_1}^{b_1} · · · ∫_{a_{n_i}}^{b_{n_i}} φ^(n_i)(x; 0_{n_i}, Σ) dx,   (4)

where μ = −X_i β, Σ = Σ_{k=1}^K σ²_k C_ik + I_{n_i}, and (a_j, b_j) = (μ_j, ∞) if Y_ij = 1 and (a_j, b_j) = (−∞, μ_j) if Y_ij = 0. Here, we provide details of the RQMC method we use to approximate this quantity, along with the gradient of the log-likelihood. We use importance sampling to approximate L_i for a general n_i. In particular, we follow the approach by Genz (1992), Genz and Bretz (2002), and the Geweke, Hajivassiliou, and Keane (GHK) simulator used by Hajivassiliou et al. (1996), and use a series of conditional truncated normal distributions as the importance distribution. Though special and fast algorithms are available for n_i = 2, 3 (Genz and Bretz 2009), our algorithm is for the general case.
First, we rewrite L_i in terms of the Cholesky decomposition Σ = AA′, where A is a lower triangular matrix. Next, let h(·) be the importance distribution's density function and let

â_ij(u_{1:(j−1)}) = (a_j − Σ_{l=1}^{j−1} A_jl u_l) / A_jj,   b̂_ij(u_{1:(j−1)}) = (b_j − Σ_{l=1}^{j−1} A_jl u_l) / A_jj,   w_j(u) = Φ(b̂_ij(u_{1:(j−1)})) − Φ(â_ij(u_{1:(j−1)})),   (5)

where x_{l:g} = (x_l, x_{l+1}, . . ., x_g)′ is the lth to the gth element of a vector x. Then

L_i = E_h[Π_{j=1}^{n_i} w_j(u)],   (6)

where, under h, each u_j is drawn from a standard normal distribution truncated to (â_ij(u_{1:(j−1)}), b̂_ij(u_{1:(j−1)})). It is computationally cheap to recursively sample u_j ∈ (â_ij(u_{1:(j−1)}), b̂_ij(u_{1:(j−1)})) conditionally from h(u), and it is cheap to evaluate Π_{j=1}^{n_i} w_j(u). Thus, each integrand evaluation in the importance sampling method is computationally cheap.
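The conditional sampling scheme described above can be sketched in a few lines. The following is a bare-bones GHK-style estimator (stdlib Python, illustrative; it omits the variable reordering, the RQMC point sets, and all of the performance work described in the text):

```python
# Illustrative GHK-style importance sampler for P(a < x < b), x ~ N(0, L L^T).
import random
from statistics import NormalDist

PHI = NormalDist().cdf
QUANT = NormalDist().inv_cdf

def ghk(lower, upper, chol, n_samples, rng):
    n = len(lower)
    total = 0.0
    for _ in range(n_samples):
        w, y = 1.0, []
        for j in range(n):
            shift = sum(chol[j][k] * y[k] for k in range(j))
            lo = (lower[j] - shift) / chol[j][j]
            hi = (upper[j] - shift) / chol[j][j]
            p_lo, p_hi = PHI(lo), PHI(hi)
            w *= p_hi - p_lo            # conditional probability weight
            # sample from the standard normal truncated to (lo, hi);
            # the min() guards against rounding the argument to exactly 1
            p = p_lo + rng.random() * (p_hi - p_lo)
            y.append(QUANT(min(p, 1 - 1e-16)))
        total += w
    return total / n_samples

rng = random.Random(3)
inf = float("inf")
# same bivariate example as before: P(eps1 > 0.5, eps2 > 0.5),
# Cov = [[4, 1.5], [1.5, 4]] with Cholesky factor chol
chol = [[2.0, 0.0], [0.75, (4 - 0.75 ** 2) ** 0.5]]
est = ghk([0.5, 0.5], [inf, inf], chol, 20_000, rng)
print(round(est, 3))
```

Because each factor w_j is a conditional probability rather than a 0/1 indicator, the estimator has far lower variance than crude Monte Carlo at the same sample size.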
The variance of the estimator can be reduced by using a heuristic variable reordering (Genz and Bretz 2009). Our implementation is based on the Fortran code from Genz et al. (2020), which uses such a heuristic variable reordering (see supplementary material S1 for further details on variable reordering). The original implementation also uses RQMC based on lattice rules. This yields an O(s^{-1} (log s)^k) convergence rate for some k ≤ n_i, where s is the number of samples, compared with O(s^{-1/2}) for a Monte Carlo (MC) method (Caflisch 1998). We have added support for scrambled Sobol sequences (Joe and Kuo 2003) using the scrambling method of Owen (1998). However, the latter RQMC method had a higher variance at a fixed computation time in preliminary experiments compared with the lattice rules. Therefore, we use the lattice rules.
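For intuition, a randomly shifted rank-1 lattice rule can be written down in a few lines. The sketch below is illustrative; the generating vector is an arbitrary choice for demonstration, not the one used by the Fortran code:

```python
# Illustrative randomly shifted rank-1 lattice rule.
import math
import random

def shifted_lattice(n_points, gen, rng):
    """Yield one randomized lattice: points ({i * g_d / n + shift_d})_d."""
    shift = [rng.random() for _ in range(len(gen))]
    for i in range(n_points):
        yield [((i * g) / n_points + s) % 1.0 for g, s in zip(gen, shift)]

# integrate f(u) = prod_d u_d over [0, 1]^3 (true value 1/8), averaging
# the unbiased estimates from a few independent random shifts
rng = random.Random(4)
gen = (1, 433, 287)            # ad hoc generating vector for n = 1021
ests = []
for _ in range(8):
    pts = shifted_lattice(1021, gen, rng)
    ests.append(sum(math.prod(p) for p in pts) / 1021)
est = sum(ests) / len(ests)
print(round(est, 4))           # close to 0.125
```

The random shift keeps each per-shift estimate unbiased, while the lattice structure delivers the faster convergence rate quoted above for smooth integrands.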
We have rewritten the code from Genz et al. (2020) in C++. Our implementation supports computation in parallel using OpenMP, with Boost's random header through the BH package (Eddelbuettel et al. 2020). This is a major advantage given the availability of multicore processors and the fact that the method is compute-bound even for moderate to high n_i. A substantial amount of the CPU time is spent on evaluating Φ and Φ^{-1} even for moderately large dimensions (say n_i = 50). For this reason, we allow the user to replace the 16-digit precision method for Φ^{-1} from Wichura (1988), used by the Fortran implementation through R (R Core Team 2020), with a seven-digit precision method. We have also implemented monotone cubic Hermite interpolation for Φ based on the Fritsch-Carlson method. This yields an error no greater than approximately ±10^{-7}. Using both of these alternatives seems to give a substantial reduction in the computation time while not affecting the estimates.
Importantly, we have extended the algorithm by Genz et al. (2020) to also provide an approximation of the gradient with respect to β and σ². Given the log-likelihood and gradient approximations, the model can be estimated using a gradient based nonlinear optimizer. We use the BFGS implementation in the stats package in R, fixing the seed used in both approximations. Following Hajivassiliou et al. (1996) (see supplementary material S2 for an explicit link to our work), we find that the gradient is given by

∂L_i/∂β = −X_i′ ∇_μ L_i(μ, Σ),   (7)

∂L_i/∂σ²_k = vec(C_ik)′ vec(∇_Σ L_i(μ, Σ)),   k = 1, . . ., K,   (8)

where ∇_Σ L_i(μ, Σ) = (∂L_i(μ, Σ)/∂ψ_lk)_{l,k=1,...,n_i} for Σ = (ψ_lk)_{l,k=1,...,n_i}, ∇_μ L_i(μ, Σ) is defined similarly, and vec(·) is the vectorization operator that transforms a matrix into a column vector by stacking its columns. Thus, we can use an approximation of Equations (7) and (8), since the partial derivative of the log-likelihood is ∂ log L_i/∂x = (∂L_i/∂x)/L_i, where we already have an approximation of the denominator. Applying steps similar to those used to obtain Equation (6) to the gradient terms in Equations (7) and (8), together with an application of the chain rule, shows that these terms can also be written as expectations under the importance distribution and approximated with the same samples. We use the variable reordering and the RQMC method used by Genz and Bretz (2002) for both L_i and the derivatives. The reordering seems to be beneficial also for the derivatives, although it is designed to heuristically reduce the variance of the integrand in Equation (6). The approximation is unbiased for the likelihood when a finite number of samples is used, but it is biased for the log-likelihood and the gradient (Hajivassiliou et al. 1996). However, the bias can be expected to be negligible; we motivate this in supplementary material S3, where we show that the bias of the log-likelihood is approximately proportional to the variance of the estimator. Moreover, we show that the bias of the gradient of the log-likelihood arising from the division by the likelihood is approximately bounded by the variance of the estimate of the inverse of the likelihood. Both cases yield a negligible bias if the variances of the respective quantities are small. Further details of the implementation are given in Appendix A in the supplementary material, including formulas for the importance sampler for the Hessian approximation, which are omitted here due to space limitations.
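The magnitude of this bias is easy to check numerically. The following sketch uses an illustrative stand-in distribution (not the actual importance sampler) to confirm that the bias of the log of an unbiased estimator is roughly −Var/(2L²):

```python
# Numerical check: E[log(Lhat)] - log(L) is approximately -Var(Lhat)/(2 L^2)
# for an unbiased estimator Lhat of L. Distribution chosen for illustration.
import math
import random

rng = random.Random(5)
L, sd = 0.2, 0.02                  # assumed true likelihood and estimator sd
logs = []
for _ in range(200_000):
    lhat = rng.gauss(L, sd)        # stand-in for an unbiased likelihood estimate
    logs.append(math.log(lhat))
bias = sum(logs) / len(logs) - math.log(L)
approx = -sd ** 2 / (2 * L ** 2)   # second-order Taylor prediction
print(round(bias, 4), round(approx, 4))
```

The empirical bias matches the Taylor prediction closely, and shrinks with the variance of the likelihood estimate, consistent with the argument above.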

Recursive Partitioning for Handling Large Families
In the absence of genotypic data, estimating genetic contributions to phenotypes/diseases requires large datasets with detailed knowledge of family relationships. Such datasets are good examples of complex structured data. Here we analyze data from a sample of 897,210 individuals born in Sweden between 1982 and 1990, for which information on diagnosis of OCD has previously been obtained from the Swedish National Patient Register (Mahjani et al. 2020). Information on the family relationships for a maximum of three (prior) generations has also previously been obtained for these individuals from the SMGR, which holds information on family relatedness for more than nine million individuals. In the dataset we analyze, most individuals belong to marginally independent families that are quite small, but there are also a few very large families, with the largest cluster containing 167,279 individuals, which makes it computationally challenging to fit the GLMM. Importance sampling is often limited in terms of scaling to very high dimensional integrals, as the variance of the estimate often depends on the dimension of the integral. This is true for the importance sampler we use, and thus we cannot work with a 167,279-dimensional integral (Botev 2017). Consequently, we have to make some clusters smaller in order to apply importance sampling. Previously, authors have tended to use a pseudo-likelihood, possibly without addressing that standard asymptotics are no longer applicable (Varin et al. 2011). The result is a loss of efficiency and invalid confidence intervals and p-values unless adjustments are made. However, the approach simplifies the likelihood substantially and has been necessary with previously used estimation methods. Examples are Pawitan et al. (2004), Lichtenstein et al. (2009), and Bai et al. (2019), in which the datasets are similar to the three generation dataset that we use. Typically, clusters are defined in terms of grandparents, where children and grandchildren belong to the same cluster. This leads to duplication of outcomes, as grandchildren will have grandparents on both their mother's and father's side, as illustrated in the supplementary material of Bai et al. (2019). In this article, we instead suggest simplifying the pedigree without duplicating observations. The individuals in the larger families are usually connected by long parent-child paths in the three generation data. This is illustrated with a family with 48 members in Figure 1. This suggests that it may be possible to limit the family sizes in the data by removing a small number of parent-child relationships and thereby splitting the pedigree.
To formalize this, we represent a family as an undirected graph G = (V, E), with the vertices V corresponding to individuals and the edges E corresponding to parent-child relations. Partitioning V into two disjoint connected sets V_1 and V_2 implies that, if we ignore the edges between V_1 and V_2, we only have to approximate two integrals of dimensions given by the numbers of observed outcomes in V_1 and V_2, rather than one integral over the whole family; that is, the likelihood factorizes over the two sets. This suggests the following procedure to simplify the clusters. Let w be a vertex weight function and h be an edge weight function. We set w to be one if we observe an outcome and 10^{-5} otherwise, and h to be 10 if the child has an observed outcome and one otherwise. We discuss other strategies for selecting h in Section 6. We then find a partition with connected sets V_1, . . ., V_k for some k ≥ 1 that solves

minimize Σ_{e ∈ {d ∈ E : d crosses two sets}} h(e)   subject to   Σ_{v ∈ V_j} w(v) ≤ N for all j = 1, . . ., k,

where N is a user-specified maximum dimension of the integrals (i.e., the maximum number of individuals with observed outcomes in a cluster). Thus, the objective is to find the minimum weighted set of parent-child relationships to remove such that we end up with some k ≥ 1 families within which the individuals' weights sum to at most N. With our vertex weight function, this amounts to requiring that no family has more than N observed outcomes. The above problem is similar to the p-way partitioning problem from graph theory (Buluç et al. 2016), except that what matters in our application is connected sets, not possibly unconnected sets. The p-way partitioning problem is NP-complete, but many heuristics have been suggested (Buluç et al. 2016).
We have implemented a recursive partitioning method similar to the method suggested by Simon and Teng (1997). This requires a graph partitioning method, for which we have implemented a greedy algorithm similar to those suggested by Kernighan and Lin (1970) and Fiduccia and Mattheyses (1982). However, we found quite large improvements by starting the greedy partitioning method from an approximation of the maximally balanced connected partition. For the latter, we implemented the algorithm suggested by Chlebíková (1996), using the algorithm suggested by Hopcroft and Tarjan (1973) to find the biconnected components. More details are provided in supplementary material S4. It is likely that one of the recent multilevel methods (Buluç et al. 2016) would perform better if adapted to our problem. However, our implementation seems to work well on the OCD data, as we show in Section 5. We have focused on simplifying the computational problem when the relationship matrices can be defined in terms of parent-child relationships. The motivation is that many other effects have dependencies which are also defined in terms of parent-child relationships (e.g., maternal effects, paternal effects, and shared childhood environment).
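To illustrate the objective (though not the Kernighan-Lin/Fiduccia-Mattheyses refinement itself), here is a much simplified greedy splitter in the same spirit: repeatedly delete a cheapest edge inside any component whose observed outcomes exceed N, until all components are small enough. All names and the toy pedigree are illustrative:

```python
# Illustrative (much simplified) pedigree splitter: delete cheap edges
# until every connected component has at most N observed outcomes.
from collections import defaultdict

def components(vertices, edges):
    """Connected components of an undirected graph via depth-first search."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def split(vertices, edges, weight, observed, N):
    """Greedily drop the cheapest edge of any oversized component."""
    edges = set(edges)
    while True:
        big = [c for c in components(vertices, edges)
               if sum(v in observed for v in c) > N]
        if not big:
            return components(vertices, edges), edges
        comp = big[0]
        # cheapest edge with both endpoints inside the oversized component
        e = min((d for d in edges if d[0] in comp and d[1] in comp),
                key=weight)
        edges.discard(e)

# toy pedigree: a chain of 6 individuals, all with observed outcomes
verts = list(range(6))
chain = {(i, i + 1) for i in range(5)}
comps, kept = split(verts, chain, weight=lambda e: 1, observed=set(verts), N=2)
print(sorted(len(c) for c in comps))
```

The actual implementation replaces the greedy deletion with balanced-connected-partition starting points and local refinement, but the constraint it enforces is the same.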

Simulations
We conducted multiple simulation studies to compare the computation time and the coverage of Wald-based confidence intervals, and to examine any potential biases, with our RQMC method and MCMC methods. We made comparisons with MCMC methods since these methods can handle large families. The two MCMC implementations we compare with are MCMCglmm (Hadfield 2010) and a Gibbs sampler implemented in thrgibbs1f90b (part of the BLUPF90 family of programs; Misztal et al. 2018). In another set of simulation studies, we investigated potential biases introduced by using our graph-based method and the pseudo-likelihood method for simplifying large clusters, described in Section 3.2.
For the first set of simulation studies, we sampled 250 families of size n_i = 10. All families had the same pedigree, which is shown in Figure 2. We used the model described in Section 2 to generate the outcomes. In the first study, we simulated data using only an additive genetic effect with σ²_Add = 3, corresponding to a heritability of h_Add = 0.75. In the second study, we simulated data using an additive genetic effect and a childhood environment effect with (σ²_Add, σ²_Env) = (2, 1). The marginal probability of observing an event was 0.208 in both simulation studies. Four threads were used with our RQMC method, using a fast heuristic to get the starting values, followed by optimizing the log marginal likelihood using 5000 to 25,000 samples per term. 25,000 samples were used for the Hessian approximations. For the first study (additive effects only), 1000 datasets were analyzed with our RQMC method, but only the first ten datasets were analyzed using the MCMC approximations due to their long computation times. We used 1,000,000 samples with MCMCglmm, with 100,000 used as burn-in. We used a multivariate normal distribution prior for the fixed effect coefficients, β, with 100 I_3 as the covariance matrix and an F-distribution with 1 and 10 degrees of freedom as the prior for σ²_Add. The latter yielded a somewhat flat prior on the heritability (Villemereuil et al. 2013) and solved some of the mixing problems. The average effective sample size of σ²_Add was 335.6 with a standard deviation of 283.6. An improper prior was used with BLUPF90, with 200,000 samples of which 10,000 were used as burn-in. The average effective sample size for σ²_Add was 661.2 with a standard deviation of 48.6. The MCMCglmm and RQMC implementations were run on a laptop with an Intel® Core™ i7-8750H CPU, and the code was compiled with GCC 10.1. The BLUPF90 program was run on a server with an Intel® Xeon® Processor E5-2630 v3. The results are shown in Table 1.
There is no, or only minor, evidence of bias in the simulation studies with our RQMC method. Similar conclusions apply to the MCMC methods, based on the first ten datasets. Importantly, our RQMC method is much faster than the MCMC implementations. This makes it possible to compute, for example, profile likelihood-based confidence intervals, as we do later. For the second study, with additive genetic and childhood environment effects, there was similarly no, or only minor, evidence of bias. The coverage of the Wald-based confidence intervals was in both cases close to the nominal level. The time taken to compute the Hessian is included in the estimation time. We also estimated the models for the first 100 datasets without using the gradient. On average, it took 3.71 and 4.13 times longer to fit the models without and with the environment effect, respectively.
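For completeness, the step from a Hessian approximation to a Wald-based confidence interval is short. A sketch with made-up numbers (one parameter; the Hessian here is the second derivative of the log-likelihood at the optimum):

```python
# Illustrative Wald interval from the observed information at the optimum.
from statistics import NormalDist

def wald_ci(theta_hat, hessian, level=0.90):
    """hessian = d^2 loglik / dtheta^2 at the maximum (a negative number);
    the standard error is the square root of the inverse observed information."""
    se = (-1.0 / hessian) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return theta_hat - z * se, theta_hat + z * se

lo, hi = wald_ci(theta_hat=2.0, hessian=-25.0)   # illustrative values
print(round(lo, 3), round(hi, 3))                # → 1.671 2.329
```

In the multi-parameter case, the same construction uses the diagonal of the inverse of the negative Hessian.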
Next, we studied potential biases introduced by simplifying pedigrees using the approach we suggest in Section 3.2. It is necessary to simplify the likelihood for families with more than a few hundred observations when using the RQMC method. For each dataset and family, we simulated a random pedigree for a three generation dataset similar to the observational data we use later in Section 5. Details of the method used to simulate the pedigrees are provided in supplementary material S5. We retained only the outcomes of the last generation, to mimic the observational data. The families we simulate each have about 100 observed outcomes. We then used the graph-based method of Section 3.2 to split the families such that the integrals had a maximum dimension of 33. The pseudo-likelihood method mentioned in Section 3.2 was also used. We used 100,000 samples per log-likelihood term because the dimensions of the integrals were larger than before. 250,000 samples were used for the Hessian approximations. One RQMC sequence was used for each term, as the number of samples was fixed and the error decreases at a faster rate in the length of each RQMC sequence than in the number of sequences. For this simulation we used a maternal effect, as in Mahjani et al. (2020), instead of the childhood environment effect which we used earlier. We set (σ²_Add, σ²_Mat) = (2, 1) and β_0 = −0.5 to get a prevalence closer to 50%. There were 25 families in each dataset, and 100 datasets were sampled.
Bias estimates, coverage of Wald-based confidence intervals, and average computation times are shown in Table 2. There was no evidence of any bias when our RQMC method was applied to the full datasets, the reduced datasets, or the datasets from the pseudo-likelihood. The estimation times with the reduced datasets were shorter than for the full datasets, as expected. They were also shorter than with the pseudo-likelihood. An explanation is that the algorithm behaves like O(n_i) when n_i is not too large and the number of samples is fixed, as we mention in Appendix A in the supplementary material. It is, though, conceivable that we could use fewer samples with the pseudo-likelihood to get the same precision of the likelihood estimate as with the two other likelihoods. The standard errors show that neither the reduced datasets nor the pseudo-likelihood incur a large efficiency loss. However, the Wald-based confidence intervals for the pseudo-likelihood-based method do not have the right coverage and are anti-conservative, as expected. 90% profile likelihood-based confidence intervals of h_Add were also constructed. The coverage was 89% with our suggested graph-based method and 84% with the pseudo-likelihood-based method, consistent with the Wald-based confidence intervals. It is possible to correct the latter, but this is not straightforward (Varin et al. 2011). A note on the construction of the confidence intervals is provided in supplementary material S6. In Table 2, the first row shows the bias estimates on the full data and the "reduced" row shows the bias estimates after using the method to reduce the family sizes that is described in Section 3.2. All bias estimates are scaled by 100. The "time" column shows the average computation times. The coverage rows show the coverage of 90% Wald-based confidence intervals.

Analysis of Data from Swedish National Registries for Estimating the Heritability of OCD
In this section, we analyze a dataset using a GLMM to estimate the direct additive genetic effect for the risk of OCD. We chose a sample of 897,210 individuals, as described in Section 3.2. Family relationships were constructed using information from three generations. A total of 6888 individuals (in the last generation, that is, born between 1982 and 1990) had an OCD diagnosis. This dataset is similar to the one used by Mahjani et al. (2020), but not identical, since the two datasets were extracted from the registry at different times and maintained by different institutes.
We applied the recursive partitioning method described in Section 3.2. In the largest family, which had 167,279 outcomes, we removed 3076 parent-child relationships by requiring that we perform no more than N = 200 dimensional integration. Note that with this many outcomes and a maximum dimension of 200, the smallest possible number of removed relationships would be 167,279/200 − 1 ≈ 836, in the (unrealistic) case where the graph is, for example, a chain. In total, we removed 3131 parent-child relations out of the 3,944,737 relations in the entire dataset. This yielded a reduction of 1.32% in the unique cousin pairs and 0.038% in the unique sibling pairs with observed outcomes. These are relevant metrics for the amount of removed information, as the individuals with observed outcomes are either siblings, half-siblings, or cousins. Less information was removed from this data than in the second simulation study in Section 4. In that (simulation) study, 11.7% of the unique cousin pairs and 0.0159% of the unique sibling pairs were removed in the reduced datasets, on average. We also computed the Euclidean norm of the difference between the reduced and the full relationship matrix divided by the Euclidean norm of the full relationship matrix, ignoring the diagonal entries, which are always identical. This ratio was 0.0440 in the applied data, and the average ratio was 0.1521 in the simulation study. We therefore expected the impact of reducing the OCD data on the analysis to be minimal.
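The matrix distance used here is simple to state in code. An illustrative sketch with a toy 3 × 3 additive relationship matrix (two siblings and one cousin; removing the cousin link zeroes the 0.125 entries):

```python
# Illustrative off-diagonal Frobenius-norm ratio between a full and a
# reduced relationship matrix; all numbers are toy values.
def offdiag_norm(a):
    """Euclidean (Frobenius) norm of the off-diagonal entries."""
    n = len(a)
    return sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) ** 0.5

full = [[1.0, 0.5, 0.125],
        [0.5, 1.0, 0.125],
        [0.125, 0.125, 1.0]]
reduced = [[1.0, 0.5, 0.0],     # cousin link to individual 3 removed
           [0.5, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
diff = [[full[i][j] - reduced[i][j] for j in range(3)] for i in range(3)]
ratio = offdiag_norm(diff) / offdiag_norm(full)
print(round(ratio, 3))          # → 0.333
```

Small ratios indicate that the dependence structure lost by the partitioning is a small fraction of the total dependence in the family.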
In the analysis of the data in Mahjani et al. (2020), fitted models included fixed effects for sex and age of the mother (AOM). AOM was included as a binary covariate (≤ 34, > 34). We used the same covariates, which allowed us to reduce the number of log marginal likelihood terms from a total of 237,332 to 37,865 unique terms; that is, some of the terms were identical in terms of the relationship matrix, covariates, and outcomes, and these could be weighted according to their frequency. This approach of using weights is only possible for discrete covariates and has been used by Pawitan et al. (2004) to speed up computation. However, in our study this approach was less useful because of the large cluster sizes, as the larger clusters take substantially more time to approximate and most of the large families are unique. We used 100,000 samples per log marginal likelihood term. One RQMC sequence was used for each log marginal likelihood term, as in the second simulation study in Section 4. We estimated the model on a server with two Intel Xeon E5-2660 v2 processors, using 20 logical cores for everything except the Hessian, for which we used 10 logical cores, as the current implementation did not run faster with more threads on this server's CPUs. It took 2 hr and 52 min to fit the model, and an additional 3 hr and 18 min was spent on approximating the Hessian with 250,000 samples per log marginal likelihood term. In contrast, it takes about a month to generate a chain of length 100,000 using the BLUPF90 software. The estimates obtained using our approach are shown in Table 3. They are similar to the posterior means obtained by Mahjani et al. (2020) using BLUPF90. The Bayesian analysis of Mahjani et al. (2020), based on improper priors, obtained posterior means of log 1.27 = 0.239 for β_sex, log 1.17 = 0.157 for β_AOM, and 0.482 for h_Add. All values are consistent with our results. We also computed a 95% profile likelihood-based confidence interval for h_Add. The interval was (0.387, 0.545), which is very close to the Wald-based confidence interval of (0.394, 0.543). However, the interval was wider than the credible interval of (0.431, 0.526) reported by Mahjani et al. (2020). We are unable to explain this discrepancy but suspect that it is related to our having fewer cases (Mahjani et al. 2020 have 7184). It took 7 hr and 49 min to compute the confidence interval.
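The weighting of identical log marginal likelihood terms described above can be sketched as follows. This is a hypothetical illustration, assuming each cluster is encoded by a hashable key built from its relationship matrix, covariates, and outcomes:

```python
from collections import Counter


def weight_unique_terms(cluster_keys):
    # Collapse identical clusters into (unique key, frequency) pairs.
    # "Identical" means equal relationship matrix, covariates, and
    # outcomes; any canonical, hashable encoding of these works.
    return Counter(cluster_keys)


def weighted_log_likelihood(weights, log_marginal):
    # Weighted sum over the unique terms; this equals the plain sum of
    # the log marginal likelihood over all clusters, but each unique
    # term is approximated only once.
    return sum(w * log_marginal(key) for key, w in weights.items())
```

With 37,865 unique terms instead of 237,332, each unique term is approximated once and multiplied by its frequency.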

Discussion
In this article, we have explored approaches for fitting generalized linear mixed models for binary outcomes with the probit link function, a model class used in a number of fields. We have in particular focused on analyzing complex family data and have presented an analysis of data from a large study of OCD in Swedish families. The suggested graph-based partitioning method has allowed us to use almost the full information from the pedigrees even though some clusters have thousands of members. Using an extension of the importance sampler from Genz and Bretz (2002) to compute gradient approximations of the log-likelihood, fast estimation was possible. We used simulation studies to show that the graph-based partitioning approach to simplifying the likelihood resulted in little, if any, loss of efficiency and in nominal coverage of likelihood ratio-based confidence intervals. The importance sampler was also extended to provide a Hessian approximation, which can be used for fast frequentist inference. The simulation studies also showed that the Wald-based confidence intervals had close to nominal coverage.
The method suggested by Genz (1992) and Genz and Bretz (2002) has been shown to be a very competitive means of approximating the likelihood in Equation (4). However, Botev (2017) has recently extended their work using exponential tilting of the importance distribution in Equation (5). Botev (2017) shows that his minimax tilting method may yield an improvement over the RQMC method in general, particularly for certain rare events.
We have used one particular graph-based method to simplify the pedigree structures. Other procedures have of course been suggested to simplify pedigrees, but these have primarily focused on far fewer individuals related through much more detailed pedigrees, and they have been tailored for specific contexts (Liu et al. 2008; Kirichenko et al. 2009). Falchi et al. (2004) propose a method which, like ours, is graph-based. They, however, construct the graph by including edges between individuals with nonzero kinship coefficients and look for cliques. This method is less useful in our application, as we only have small cliques. Modifications of our graph partitioning method are also possible. For example, the edge weight function, h, in Section 3.2 could be set to the sum of the kinship coefficients for the observed outcomes that will be set to zero if the parent-child relation is removed. This would be computationally easy to update in our three-generation data, as relatively few edge weights need to be updated after the removal (or addition) of a parent-child relationship. Our graph-based method may also be used to improve inference for Bayesian methods that factorize the covariance matrix. It will be faster to work with the many smaller covariance matrices rather than the full covariance matrix, even when exploiting the sparsity of the latter.
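Such a modification could, for instance, take the form of a greedy procedure: repeatedly remove the cheapest parent-child edge inside any component that is still too large. The sketch below is our own illustration, not the algorithm of Section 3.2; the weight function h is passed in, so it could be the proposed sum of kinship coefficients or any other criterion:

```python
from collections import defaultdict, deque


def components(nodes, edges):
    # Connected components of an undirected graph given as an edge list.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps


def greedy_partition(nodes, edges, h, max_size):
    # Remove parent-child edges, cheapest first according to h, until
    # every connected component has at most max_size members. h could
    # be, e.g., the sum of kinship coefficients lost by removing the
    # edge (a hypothetical choice; Section 3.2 uses a different rule).
    kept, removed = list(edges), []
    while True:
        big = next((c for c in components(nodes, kept)
                    if len(c) > max_size), None)
        if big is None:
            return kept, removed
        # Cheapest remaining edge inside the oversized component.
        edge = min((e for e in kept if e[0] in big and e[1] in big), key=h)
        kept.remove(edge)
        removed.append(edge)
```

The greedy choice is myopic, so it will not in general remove the minimal number of edges, but each step only requires updating the weights of edges in the affected component.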
In one of the simulation studies in Section 4, we used shared childhood environment effects. In general, for observational data analysis, strong assumptions have to be made to specify the correlation structure of the environmental effects for each unique pedigree. An important issue is what assumptions to make, for example, about the environment shared between children and their parents and between cousins. Unfortunately, these assumptions may affect the heritability estimate and whether one can distinguish between heritability and environmental effects in a given study design.
We have focused on univariate outcomes. The approach described in this article can, however, easily be extended to handle multiple traits, that is, for studying genetic and environmental associations between traits. Importantly, our method and software can handle this extension, as outlined in supplementary material S7.
Finally, we also note that it is easy to extend our approach to handle time-to-event data with censored outcomes. Christoffersen et al. (2021) have previously shown how the approximation method suggested in this article can be used for a class of mixed generalized survival models (GSMs) (Liu et al. 2017) and related mixed transformation models (Hothorn et al. 2018). An advantage of these models is that the proportion of variance attributable to each effect can be quantified on an estimated scale in a similar manner to the model presented for binary traits. This is unlike other mixed survival models (Scurrah et al. 2000). Further details are given in supplementary material S8.

Figure 1. Example of a family with 48 members and its graph representation. The left plot shows the pedigree of the family. Circles are females and squares are males. Dashed lines point to the same individual: the individual is shown with the siblings at one end and with the partner at the other end. The right plot shows the family illustrated as a graph with edges between parents and children as in Section 3.2.
5), and β = (−3, 1, 2). C_Add is the relationship matrix for the additive genetic effect, and C_Env is the relationship matrix for the childhood environment effect. Both matrices are shown in supplementary material S5.

Figure 2. Pedigree used for all families in the simulation studies. Circles indicate females and squares indicate males.

Table 1. Bias estimates, coverage estimates, and average computation times plus or minus one standard error for the estimates in the simulation studies. Both the posterior means and modes are shown for MCMCglmm. The bias estimates are on the scale where the total variance is constrained to be one, and all values are multiplied by 100. The coverage rows show the coverage of 95% Wald-based confidence intervals. The last column shows the average computation times in seconds.

Table 2. Bias estimates and average computation times plus or minus one standard error for the estimates in the simulation study with an additive genetic and a maternal effect.

Table 3. Estimates of the direct additive genetic effect for OCD using our proposed method.