Kinship Solutions for Partially Observed Multiphenotype Data

Current work for multivariate analysis of phenotypes in genome-wide association studies often requires that genetic similarity matrices be inverted or decomposed. This can be a computational bottleneck when many phenotypes are present, each with a different missingness pattern. A common approach in this case is to decompose, for each phenotype, the subset of the kinship matrix corresponding to the samples observed for that phenotype. We provide a new method for decomposing these kinship matrices that can reduce the computational complexity by an order of magnitude by propagating low-rank modifications along a tree spanning the phenotypes. We demonstrate that our method provides speed improvements of around 40% under reasonable conditions.


INTRODUCTION
Understanding the etiology and biological pathways involved in health and disease requires multivariate analysis of phenotypic data in genome-wide association studies (GWAS). This sort of analysis is becoming more common due to increased compute power (Cox et al., 2018), and only recently have such analyses become tractable through improvements to the efficiency of linear mixed models and related models (Lippert et al., 2011; Listgarten et al., 2012, 2013; Dahl et al., 2016). Further multivariate analysis is required for the study of complex diseases such as cancer (Knox, 2010), for groups of complex phenotypes such as brain imaging phenotypes (Elliott et al., 2018), and for the analysis of large consortia involving multimodal data such as the U.K. Biobank (Bycroft et al., 2018).
The use of multiphenotype (or multivariate) data often requires that a scaled version of the kinship matrix (or genetic similarity matrix, or genetic relationship matrix; Patterson et al., 2006) serve as the covariance matrix of a matrix-variate normal distribution defined on the phenotypes in a GWAS (Dutilleul, 1999). And so, solutions to $K^{-1/2}y$, in which $y$ is a phenotype matrix (with a row for each sample and a column for each phenotype), must be computed to obtain maximum likelihood estimates for use in expectation maximization, variational Bayes, or data whitening, or to obtain Markov chain Monte Carlo updates for Bayesian posterior simulation. Here $K$ is an $n \times n$ positive definite genetic similarity matrix, and $y$ is an $n \times d$ partially observed phenotype matrix such that $y_{ij}$ is the value of phenotype $j$ for sample $i$, $n$ is the number of samples in the study, and $d$ is the number of phenotypes in the study. The computation of $K^{-1/2}y$ is also of general interest to other machine learning fields such as kernel methods (Gretton, 2003).
Phenotype matrices may be partially observed (e.g., outliers are removed, leading to column-specific missingness, or the experimental paradigm varies among subjects, leading to blockwise missingness in phenotype categories). To compute $K^{-1/2}y$ in the presence of missingness in $y$, imputation may be performed (Dahl et al., 2016), but imputation can be slow and it adds additional model-specific biases into the analysis. A standard and classical method for dealing with missingness in $y$ is the construction of $(K_{\mathrm{obs}(j),\mathrm{obs}(j)})^{-1/2}\, y_{\mathrm{obs}(j),j}$ for each $j$. This is equivalent to marginalizing out the missing samples under the assumption of Gaussianity (i.e., it does not add assumptions beyond those already made by mixed models). Here $\mathrm{obs}(j)$ is the set of indices of samples observed for phenotype $j$, and, for a matrix $A$ and subsets $B$ and $C$ of $\{1, \dots, n\}$, $A_{B,C}$ denotes the submatrix of $A$ formed by selecting the rows with indices in $B$ and the columns with indices in $C$. For ease of subscript usage, we adopt the notation $K_j = K_{\mathrm{obs}(j),\mathrm{obs}(j)}$ and $y_j = y_{\mathrm{obs}(j),j}$. Computing $K_j^{-1/2} y_j$ for each phenotype $j$ is costly and precludes scaling of multivariate GWAS analysis.
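This per-phenotype computation can be sketched as follows (a minimal illustration in Python/NumPy, not part of the kgen software; the variable names and the NaN encoding of missingness are our assumptions):

```python
import numpy as np
from scipy.linalg import solve_triangular

def naive_whiten(K, Y):
    """For each phenotype j, Cholesky-decompose K restricted to the samples
    observed for j and solve the triangular system L_j x = y_j.
    Each decomposition costs O(#obs(j)^3): the bottleneck addressed here."""
    n, d = Y.shape
    solutions = []
    for j in range(d):
        obs = np.flatnonzero(~np.isnan(Y[:, j]))       # obs(j)
        L_j = np.linalg.cholesky(K[np.ix_(obs, obs)])  # O(#obs(j)^3)
        x_j = solve_triangular(L_j, Y[obs, j], lower=True)  # O(#obs(j)^2)
        solutions.append(x_j)
    return solutions
```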
In this article, we present an efficient algorithm for computing the Cholesky decomposition (Benoît, 1924) of each $K_j$, for use in the calculation of $K_j^{-1/2} y_j$. The algorithm works by computing the full $O(n^3)$ Cholesky decomposition $L_{j_0}$ of $K_{j_0}$ for a fixed $j_0$, and then performing rank-1 modifications (updates and "downdates"; Benoît, 1924) and other $O(n^2)$ operations to propagate the Cholesky decomposition of $K_{j_0}$ to that of $K_j$ for all $j: 1 \le j \le d,\ j \ne j_0$. The decomposition is propagated along a minimum spanning tree of a complete graph with one vertex per phenotype, and with edge weights given by the number of rank-1 Cholesky modifications required to propagate the decomposition along that edge. In Figure 1, a comparison is provided between the runtime and asymptotics of the full Cholesky decomposition and those of the Cholesky modifications for a range of sample sizes, indicating the efficiency of these modifications. In this figure, genetic similarity matrices are simulated by drawing from a Wishart distribution with mean given by the identity matrix and with 10,000 degrees of freedom (simulating a study typed at 10,000 markers). We refer to our algorithm as kgen (for kinship generation), and we implement it in an open source software package called the kgen software. A manual for this software is provided in Appendix C of the Supplementary Material.
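For reference, a genetic similarity matrix of this kind can be simulated as follows (a sketch; a Wishart with scale $I/m$ and $m$ degrees of freedom has mean $I$, matching the simulation described above):

```python
import numpy as np

def simulate_kinship(n, m=10_000, seed=0):
    """Draw K ~ Wishart(df=m, scale=I/m), which has mean I.
    Equivalently, K = G G^T / m for an n x m standard normal G,
    mimicking a kinship matrix estimated from m markers."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, m))
    return G @ G.T / m
```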
The asymptotic complexity of performing Cholesky decompositions in a naive way for all $K_j$ is $O(dn^3)$ (assuming the missingness patterns of the phenotypes are not identical). In contrast, the worst-case asymptotic complexity of the kgen algorithm is

$O(n^3 + d\,r\,n^2).$ (1)

Here $r$ is defined as $\max_{1 \le j_1 < j_2 \le d} \{\#(\mathrm{obs}(j_1) \setminus \mathrm{obs}(j_2)) + \#(\mathrm{obs}(j_2) \setminus \mathrm{obs}(j_1))\}$ (this is the maximum number of rank-1 Cholesky modifications required to propagate a Cholesky decomposition between two phenotypes), $\#A$ denotes the size of a set $A$, and $B \setminus C$ for sets $B$ and $C$ denotes the set difference between $B$ and $C$.

Related work
To reduce the computational complexity of genetic similarity matrix operations, several research programs have stored and manipulated sparse representations of the genetic similarity matrix (Shor et al., 2019). In these representations, researchers set a threshold and zero out elements of the genetic similarity matrix with absolute value less than this threshold. The computational gains of such an approach may be large, but in theory it could lead to a loss of power. Furthermore, this approach is less suitable for data originating from pedigrees or small isolated populations. To our knowledge, our work is the first to leverage low-rank Cholesky modifications to improve the efficiency of genetic similarity matrix-based inference.

METHODS
In this section, we describe the genetic similarity matrices for which the kgen algorithm is appropriate and then provide the details of the kgen algorithm.

Genetic similarity matrices
The definition of the kinship matrix we use is that of a genetic similarity matrix centered at population-level minor allele frequencies. This definition is based on Patterson et al. (2006), but note that it involves population-level normalization instead of sample-level normalization. Let $G_{i\ell}$ be the genotype of the $i$-th subject at the $\ell$-th marker (suppose that there are $m$ markers in total), and let $q_\ell$ be the minor allele frequency of the $\ell$-th marker with respect to the population from which the samples are drawn (we assume $q_\ell > 0$). Then, the genetic similarity matrix is an $n \times n$ positive semidefinite matrix such that

$K_{ij} = \frac{1}{m} \sum_{\ell=1}^{m} \frac{(G_{i\ell} - 2q_\ell)(G_{j\ell} - 2q_\ell)}{2 q_\ell (1 - q_\ell)}.$ (2)

Cholesky decompositions and their modifications

The Cholesky decomposition of an $n \times n$ positive definite matrix $K$ is the unique lower triangular matrix $L$ with positive diagonal such that $LL^T = K$, and so $L^{-1}y$ is a solution to $K^{-1/2}y$. Efficient algorithms exist for computing $L$ (Anderson et al., 1999), and although the computation of $L$ has asymptotic complexity $O(n^3)$, it is often much faster than a matrix inversion performed on a matrix of the same size. Due to the triangular structure of $L$, $L^{-1}y$ may be computed with asymptotic complexity $O(n^2)$. Given a Cholesky decomposition $L$ of $K$ and any vector $v$ of length $n$, $L$ may be updated to form the Cholesky decomposition of $K + vv^T$ (i.e., the sum of $K$ and the rank-1 matrix $vv^T$) or downdated to form the Cholesky decomposition of $K - vv^T$. These update and downdate operations have asymptotic complexity $O(n^2)$. For more detail on Cholesky decompositions and their modifications, we refer to Seeger (2008), Osborne et al. (2010), and Benoît (1924).
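As an illustration, the classical $O(n^2)$ rank-1 update and downdate can be written as follows (a minimal sketch of the textbook algorithm, not the kgen implementation itself):

```python
import numpy as np

def chol_update(L, v):
    """Given lower-triangular L with L @ L.T == K, return the Cholesky
    factor of K + v v^T. Runs in O(n^2), versus O(n^3) to re-decompose."""
    L, v = L.copy(), v.astype(float).copy()
    for k in range(v.size):
        r = np.hypot(L[k, k], v[k])            # new diagonal entry
        c, s = r / L[k, k], v[k] / L[k, k]     # Givens-style rotation
        L[k, k] = r
        L[k+1:, k] = (L[k+1:, k] + s * v[k+1:]) / c
        v[k+1:] = c * v[k+1:] - s * L[k+1:, k]
    return L

def chol_downdate(L, v):
    """Cholesky factor of K - v v^T (requires K - v v^T positive definite)."""
    L, v = L.copy(), v.astype(float).copy()
    for k in range(v.size):
        r = np.sqrt(L[k, k] ** 2 - v[k] ** 2)  # new diagonal entry
        c, s = r / L[k, k], v[k] / L[k, k]     # hyperbolic rotation
        L[k, k] = r
        L[k+1:, k] = (L[k+1:, k] - s * v[k+1:]) / c
        v[k+1:] = c * v[k+1:] - s * L[k+1:, k]
    return L
```

These $O(n^2)$ row operations are the building blocks from which the kgen algorithm is assembled.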

The kgen algorithm
A meditation on these asymptotics suggests the kgen algorithm. The Cholesky modifications can be arranged so as to add and remove rows and columns of $K$ (Osborne et al., 2010), and so the computation of $K_1^{-1/2}y_1, \dots, K_d^{-1/2}y_d$ can be done by performing only one $O(n^3)$ full Cholesky decomposition (instead of $d$ such operations) and then propagating it to the rest of the decompositions, provided that the number of rows and columns that must be added and removed to propagate the Cholesky decompositions is small with respect to $n$, yielding Eq. (1).
Procedures to arrange Cholesky modifications in a way that adds and removes rows and columns of $K$ are provided in Osborne et al. (2010). We refer to these operations as delete and insert, and describe them formally as follows. Let $K^-$ be an $(n-1) \times (n-1)$ positive definite matrix, let $v$ be an $n \times 1$ vector, let $1 \le i \le n$, and let $K^+$ be the $n \times n$ positive definite matrix formed by inserting $v$ into $K^-$ as the $i$-th row and column:

$K^+ = \begin{pmatrix} K^-_{1:i-1,\,1:i-1} & v_{1:i-1} & K^-_{1:i-1,\,i:n-1} \\ v_{1:i-1}^T & v_i & v_{i+1:n}^T \\ K^-_{i:n-1,\,1:i-1} & v_{i+1:n} & K^-_{i:n-1,\,i:n-1} \end{pmatrix},$ (3)

so that, conversely, $K^-$ is recovered from $K^+$ by deleting the $i$-th row and column:

$K^- = K^+_{N \setminus \{i\},\, N \setminus \{i\}}, \quad N = \{1, \dots, n\}.$ (4)

If $L^+$ is the Cholesky decomposition of the matrix $K^+$ given in Equation (3), then $L^- = \mathrm{delete}(L^+, i)$ is the Cholesky decomposition of the matrix $K^-$ given in Equation (4). Conversely, if $L^-$ is the Cholesky decomposition of $K^-$, then $L^+ = \mathrm{insert}(L^-, v, i)$ is the Cholesky decomposition of $K^+$. These insert and delete operations can be performed in $O(n^2)$ time. Descriptions of these operations are provided in Algorithms S1 and S2 of the Supplementary Material. For our delete operation, we use the procedure from Osborne et al. (2010). For our insert operation, we use a procedure slightly different from that of Osborne et al. (2010), and we provide a proof of our procedure in Appendix A of the Supplementary Material.
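To make the mechanics concrete, delete and insert can be built from the rank-1 kernels sketched above (a minimal illustration, not the exact procedures of Algorithms S1 and S2; it reuses chol_update and chol_downdate from the previous sketch, and indices are 0-based):

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_delete(L, i):
    """Given lower-triangular L with L L^T = K^+, return the Cholesky
    factor of K^+ with its i-th row and column removed, in O(n^2)."""
    L1 = np.delete(np.delete(L, i, axis=0), i, axis=1)
    w = L[i+1:, i].copy()         # column i below the diagonal
    if w.size:
        # The trailing rows lost column i; a rank-1 update of the
        # trailing block restores L1 L1^T = K^-.
        L1[i:, i:] = chol_update(L1[i:, i:], w)
    return L1

def chol_insert(L, v, i):
    """Given lower-triangular L with L L^T = K^-, return the Cholesky
    factor of K^+ with v inserted as the i-th row and column
    (v[i] is the new diagonal element), in O(n^2)."""
    n = L.shape[0] + 1
    Lp = np.zeros((n, n))
    Lp[:i, :i] = L[:i, :i]
    w = solve_triangular(L[:i, :i], v[:i], lower=True) if i else np.empty(0)
    dkk = np.sqrt(v[i] - w @ w)                  # new diagonal entry
    Lp[i, :i], Lp[i, i] = w, dkk
    if i < n - 1:
        Lp[i+1:, :i] = L[i:, :i]
        l32 = (v[i+1:] - L[i:, :i] @ w) / dkk    # new column below diagonal
        Lp[i+1:, i] = l32
        # The trailing block must shed l32 l32^T to stay consistent.
        Lp[i+1:, i+1:] = chol_downdate(L[i:, i:], l32)
    return Lp
```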
We now describe the kgen algorithm in detail; it is listed as Algorithm 1 in this article. The algorithm assumes that an $n \times n$ positive definite genetic similarity matrix $K$ is provided, as in Eq. (2), and that an $n \times d$ phenotype matrix $Y$ is provided, with missing entries indicated. The phenotype matrix is used only to find the sets of observed entries $\mathrm{obs}(j): 1 \le j \le d$; the particular values of the phenotype matrix are not used. Cholesky decompositions $L_j$ of $K_j = K_{\mathrm{obs}(j),\mathrm{obs}(j)}$ are returned, providing fast access to $L_j^{-1} y_j$. The kgen algorithm works by first finding a phenotype $j_0$ such that the number of observed entries in the $j_0$-th column of $Y$ is less than or equal to the number of observed entries in any other column (so that the single full decomposition is performed on the smallest submatrix). It then finds a tree spanning all phenotypes, chosen such that the sum over the edges of the number of samples that must be added or removed to cross each edge (i.e., the sum of the edge weights) is minimized. For an edge from phenotype $j_1$ to phenotype $j_2$, a sample must be added if it is in $\mathrm{obs}(j_2)$ but not in $\mathrm{obs}(j_1)$, and a sample must be removed if it is in $\mathrm{obs}(j_1)$ but not in $\mathrm{obs}(j_2)$. Note that this count of samples to be added or removed is symmetric, and so the weighted complete graph with vertices given by the columns of the phenotype matrix is an undirected graph. The vertex $j_0$ is taken as the root of this minimum spanning tree. In our implementation of this algorithm, we use Kruskal's algorithm to find the minimum spanning tree (Kruskal, 1956).
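This graph construction can be sketched in Python as follows (using scipy's generic MST and BFS routines in place of the Kruskal implementation in the kgen software; the boolean-mask encoding of observedness is our assumption):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def phenotype_tree(obs_mask):
    """obs_mask: n x d boolean array, obs_mask[i, j] == True iff sample i
    is observed for phenotype j. Returns the root j0 and the tree edges
    (parent, child) enumerated breadth-first from the root."""
    n, d = obs_mask.shape
    W = np.zeros((d, d))
    for j1 in range(d):
        for j2 in range(j1 + 1, d):
            # Edge weight = samples to insert + samples to delete
            # (a symmetric count, so the graph is undirected).
            W[j1, j2] = np.sum(obs_mask[:, j1] != obs_mask[:, j2])
    # Caveat: scipy's sparse graph treats zero weights as absent edges,
    # so identical missingness patterns should be deduplicated first.
    T = minimum_spanning_tree(W)
    j0 = int(np.argmin(obs_mask.sum(axis=0)))     # fewest observed samples
    order, preds = breadth_first_order(T + T.T, j0, directed=False)
    return j0, [(preds[v], v) for v in order[1:]]  # (E_i1, E_i2) pairs
```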
After this minimum spanning tree is created, a breadth-first enumeration of the edges of the tree is constructed, such that the first edge includes the root of the tree. This enumeration should be breadth-first because propagating Cholesky decompositions along an edge may involve hard-drive reads and writes: finished decompositions may have to be written to and read from disk, as the RAM (random access memory) provisioned on supercomputers often cannot hold the Cholesky decompositions of >1,000 phenotypes with 20,000 samples. Software implementing the kgen algorithm will therefore read the decomposition for the "source" vertex of an edge from disk unless that decomposition has recently been read. Ensuring that the decomposition for the "source" vertex has most often been recently read is equivalent to traversing the spanning tree in a breadth-first way.

Algorithm 1 The kgen algorithm

1: Inputs: (a) an $n \times n$ positive definite matrix $K$; (b) an $n \times d$ phenotype matrix $Y$.
2: Outputs: a list of Cholesky decompositions $L_j$ for $1 \le j \le d$, wherein $L_j$ is the $\#\mathrm{obs}(j) \times \#\mathrm{obs}(j)$ Cholesky decomposition of the positive definite matrix $K_j$.
3: Let $G$ be the weighted undirected complete graph on $d$ vertices such that the weight of the edge between vertices $j_1$ and $j_2$ is $\#(\mathrm{obs}(j_1) \setminus \mathrm{obs}(j_2)) + \#(\mathrm{obs}(j_2) \setminus \mathrm{obs}(j_1))$.
4: Let $T$ be a minimum spanning tree of $G$.
5: Let $j_0$ be a vertex such that $\#\mathrm{obs}(j_0) \le \#\mathrm{obs}(j)\ \forall\ 1 \le j \le d$.
6: Let $E_1, \dots, E_{d-1}$ be a breadth-first enumeration of all of the edges of $T$, together with an ordering of the vertices of each edge (so the two vertices defining the edge $E_i$ are given in order by $E_{i1}$ and $E_{i2}$), such that the vertex $E_{i1}$ always appears in the set $\{j_0, E_{12}, \dots, E_{i-1,2}\}$ and such that $E_{11} = j_0$.
7: $L_{j_0} \leftarrow \mathrm{chol}(K_{j_0})$
8: for $j = 1 \dots d-1$ do
9:   $e_1 \leftarrow E_{j1}$
10:  $e_2 \leftarrow E_{j2}$
11:  $L' \leftarrow L_{e_1}$
12:  $S \leftarrow \mathrm{obs}(e_1)$
13:  $k \leftarrow 1$
14:  for $i = 1 \dots n$ do
15:   if $i \in \mathrm{obs}(e_1)$ and $i \in \mathrm{obs}(e_2)$ then
16:    $k \leftarrow k + 1$
17:   else if $i \in \mathrm{obs}(e_1)$ and $i \notin \mathrm{obs}(e_2)$ then
18:    $S \leftarrow S \setminus \{i\}$
19:    $L' \leftarrow \mathrm{delete}(L', k)$
20:   else if $i \notin \mathrm{obs}(e_1)$ and $i \in \mathrm{obs}(e_2)$ then
21:    $S \leftarrow S \cup \{i\}$
22:    $L' \leftarrow \mathrm{insert}(L', K_{S,i}, k)$
23:    $k \leftarrow k + 1$
24:  assert $S = \mathrm{obs}(e_2)$
25:  $L_{e_2} \leftarrow L'$
26: return $L_1, \dots, L_d$

After the enumeration is created, the kgen algorithm computes the Cholesky decomposition of $K_{j_0}$, and then for each edge $(e_1, e_2)$ in the enumeration, it modifies the Cholesky decomposition of $K_{e_1}$ by inserting the samples that are in $\mathrm{obs}(e_2) \setminus \mathrm{obs}(e_1)$ and deleting the samples that are in $\mathrm{obs}(e_1) \setminus \mathrm{obs}(e_2)$. It then proceeds to the next edge, repeating this procedure until the $d - 1$ edges of the minimum spanning tree on the phenotypes are exhausted.
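Putting the pieces together, the propagation loop of Algorithm 1 might look as follows in Python (a sketch under the same assumptions as the previous snippets, reusing chol_delete, chol_insert, and phenotype_tree; the kgen software additionally manages disk I/O and is implemented independently of this illustration):

```python
import numpy as np

def kgen(K, obs_mask):
    """Return a dict {j: L_j}, where L_j is the Cholesky factor of K
    restricted to obs(j). One O(n^3) decomposition at the root j0; every
    other factor is reached by O(n^2) inserts and deletes along the tree."""
    n, d = obs_mask.shape
    j0, edges = phenotype_tree(obs_mask)
    idx0 = np.flatnonzero(obs_mask[:, j0])
    L = {j0: np.linalg.cholesky(K[np.ix_(idx0, idx0)])}
    for e1, e2 in edges:                   # breadth-first edge order
        Lp = L[e1].copy()
        S = list(np.flatnonzero(obs_mask[:, e1]))  # current sample indices
        k = 0                                      # position within S
        for i in range(n):
            in1, in2 = obs_mask[i, e1], obs_mask[i, e2]
            if in1 and in2:
                k += 1
            elif in1 and not in2:
                del S[k]
                Lp = chol_delete(Lp, k)
            elif not in1 and in2:
                S.insert(k, i)
                Lp = chol_insert(Lp, K[S, i], k)   # inserted vector K_{S,i}
                k += 1
        assert S == list(np.flatnonzero(obs_mask[:, e2]))
        L[e2] = Lp
    return L
```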

A worked example of the kgen algorithm
In Figure 2, we provide a worked example of the kgen algorithm involving 10 samples and 8 phenotypes. In this example, the root of the minimum spanning tree is phenotype two, and a breadth-first enumeration of the edges of the minimum spanning tree (such that the root is contained in the first edge) is $(2, 1), (1, 4), (4, 3), (4, 7), \dots$. The Cholesky decomposition on the right of Figure 2e can be formed by performing rank-1 modifications after computing the Cholesky decomposition in Figure 2d.

EXPERIMENTS
We now consider two experiments on simulated data and compare the speed and accuracy of the kgen algorithm to that of a naive algorithm in which the full Cholesky decomposition is computed for each phenotype. In the first experiment, we vary the number of samples and the missingness rate of the phenotype measurements, assume that the data are missing at random, and fix the number of phenotypes at 100 per condition. In the second experiment, we examine a blockwise missingness pattern.

FIG. 2. (a) Left: The phenotype matrix; missing data are denoted by the string NA. Middle: A complete undirected weighted graph on the 8 phenotypes. Weights are omitted, but can be inferred from Y (e.g., the weight between phenotypes one and two is two, as two samples have to be inserted to move from phenotype one to phenotype two). Right: The minimum spanning tree over the phenotypes. (b) The simulated genotypes (x-scale indicates marker index, y-scale indicates sample index; colors are white for homozygous minor, gray for heterozygous, and black for homozygous major). (c) The kinship matrix implied by the genotypes (to two decimals of precision). (d, e) The kinship matrix restricted to the observed subjects for phenotype two or one (resp.) and the corresponding Cholesky decompositions.
These experiments (as well as the timing results displayed in Fig. 1) were conducted on Intel Xeon E5-2683 CPUs, and the numerical matrix operations were performed using Intel's MKL (Math Kernel Library) implementation of the Netlib routines (Anderson et al., 1999). The smallest positive normalized double-precision value on this CPU was 2.2e-308.

Experiment 1
We consider missingness rates of 0.01%, 0.05%, 0.1%, 0.15%, and 0.2%, with a missing-at-random missingness pattern over 100 phenotypes, and we consider n = 10,000, 15,000, 20,000, 25,000, and 30,000 samples. For each condition, we run five independent replicates, and we sample the kinship matrix from the same Wishart distribution that was used in Figure 1 (i.e., with mean given by the identity matrix and with 10,000 degrees of freedom). The difference in runtime between kgen and the naive method (in which a full Cholesky decomposition is performed for each phenotype), averaged over the five independent replicates for each condition, is displayed in Figure 3. The maximum entrywise absolute difference between the two methods over all phenotypes and conditions was 1.1768e-14 (in the units of the entries of the Cholesky decomposition), indicating close agreement and low numerical imprecision.
A histogram of all of the elementwise absolute differences is displayed in Supplementary Figure S1 in the Supplementary Material. The runtime of each replicate (for both the kgen algorithm and the naive method) and the maximum entrywise absolute difference between the Cholesky decompositions of each method are shown in Supplementary Table S1 of Appendix B in the Supplementary Material.

Experiment 2
In our second experiment, we consider a blockwise missingness pattern and vary the number of phenotypes. We fix the number of samples at n = 15,000 and consider a missing-at-random rate of 0.1% together with a blockwise missingness pattern in which each block of 50 consecutive phenotypes has 10% of the samples masked (the same 10% of samples are masked for all 50 phenotypes in the block). The genetic similarity matrix is drawn from the same Wishart distribution that was used in Experiment 1. This situation is similar to a massively multiphenotyped version of the Wellcome Trust Case Control Consortium (2007). The number of phenotypes is varied in the set {100, 200, 300, 400, 500}. The runtimes of kgen and the naive method for Experiment 2 are shown in Figure 4. The improvements shown in Figure 3 are generally realized only for <0.1% missingness. And so, for Experiment 2 (and in our released kgen software), we institute a new rule: if >100 samples must be added or removed to propagate along an edge between one phenotype and another in the minimum spanning tree, a full Cholesky decomposition is instead performed for the target phenotype. This "short circuits" the kgen algorithm and allows good performance even in cases for which the overhead of the low-rank modifications in the kgen algorithm could swamp the gains (the choice of 100 depends on this overhead and could be revisited in the future to reflect newer, more powerful hardware).
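In the propagation sketch given earlier, this rule would amount to a small guard before each edge is processed (the threshold constant and the helper propagate_edge, which stands in for the insert/delete loop of the kgen sketch above, are our naming assumptions):

```python
import numpy as np

RANK1_THRESHOLD = 100  # beyond this, a full decomposition is cheaper

def propagate_or_recompute(K, L_parent, obs1, obs2):
    """Follow one edge of the spanning tree, short-circuiting to a full
    Cholesky decomposition when too many rank-1 modifications are needed."""
    weight = np.sum(obs1 != obs2)      # inserts + deletes for this edge
    if weight > RANK1_THRESHOLD:
        idx = np.flatnonzero(obs2)
        return np.linalg.cholesky(K[np.ix_(idx, idx)])  # O(n^3) fallback
    return propagate_edge(K, L_parent, obs1, obs2)      # O(weight * n^2)
```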

RESULTS
In Figure 3, we see an improvement of between 1 and 6 hours for sample sizes greater than 15,000, missingness rates <0.1%, and 100 phenotypes. In theory, and as indicated in Figure 4, these improvements are linear in the number of phenotypes. And so, a study with 10,000 phenotypes under the right conditions may benefit from a 25-day reduction in compute on our hardware through the kgen algorithm.

TABLE 1. Runtimes for Experiment 1. The "n" column indicates the number of samples, the "rate" column indicates the missingness rate (in basis points), and the "mean (kgen)," "std (kgen)," "mean (naive)," and "std (naive)" columns indicate means and standard deviations over five independent replicates for kgen and the naive method (in seconds), resp.

FIG. 4. Runtimes for Experiment 2 versus the number of phenotypes, displaying approximate linearity in d (varied in the set {100, 200, 300, 400, 500}). Lines show means over five independent replicates. Shaded regions indicate standard deviations. The kgen algorithm improves runtimes by ~40% in all conditions.

Because Figure 3 indicates only the difference in runtimes between kgen and the naive method, we also provide means and standard deviations of the runtimes for both methods in Table 1. This table indicates that the kgen algorithm provides between a 0% and 40% improvement in runtime for missing-at-random rates of less than 0.1%. Figure 4 indicates around a 40% improvement for situations similar to the Wellcome Trust Case Control Consortium (2007).

CONCLUSION
Multivariate GWAS are limited by computational resources. We have provided a new method to create and manipulate the genetic similarity matrices required by linear mixed models for multivariate GWAS. On our hardware, our method provides an improvement of around 40% under reasonable simulation settings. Software implementing our method is released under an open source license.