Inferring trait-specific similarity among individuals from molecular markers and phenotypes with Bayesian regression

Modeling covariance structure based on genetic similarity between pairs of relatives plays an important role in evolutionary, quantitative and statistical genetics. Historically, genetic similarity between individuals has been quantified from pedigrees via the probability that randomly chosen homologous alleles between individuals are identical by descent (IBD). At present, however, many genetic analyses rely on molecular markers, with realized measures of genomic similarity replacing IBD-based expected similarities. Animal and plant breeders, for example, now employ marker-based genomic relationship matrices between individuals in prediction models and in estimation of genome-based heritability coefficients. Phenotypes convey information about genetic similarity as well. For instance, if phenotypic values are at least partially the result of the action of quantitative trait loci, one would expect the former to inform about the latter, as in genome-wide association studies. Statistically, a non-trivial conditional distribution of unknown genetic similarities, given phenotypes, is to be expected. A Bayesian formalism is presented here that applies to whole-genome regression methods where some genetic similarity matrix, e.g., a genomic relationship matrix, can be defined. Our Bayesian approach, based on phenotypes and markers, converts prior (markers only) expected similarity into trait-specific posterior similarity. A simulation illustrates situations under which effective Bayesian learning from phenotypes occurs. Pinus and wheat data sets were used to demonstrate applicability of the concept in practice. The methodology applies to a wide class of Bayesian linear regression models, it extends to the multiple-trait domain, and can also be used to develop phenotype-guided similarity kernels in prediction problems. © 2019 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Assessing genetic or genomic similarity among species or individuals is a central topic in evolutionary and quantitative genetics (Lynch and Walsh, 1998; Walsh and Lynch, 2018). Sethuraman (2018) reviewed areas where estimation of genetic relatedness is important, including paternity and maternity assignments, forensic, association and linkage studies, and inference and prediction in quantitative genetics.
Until recently, many quantitative-trait analyses such as estimation of genetic variances and covariances and prediction of unobservable genotypic values (e.g., breeding values in animal and plant breeding), have relied on modeling covariances based on pedigree-based genetic relatedness between relatives; such covariances enter into the dispersion structure of mixed effects and Bayesian models. Historically, agricultural breeders have quantified genetic similarity by the probability that randomly chosen homologous alleles are identical by descent (IBD) in a pair of individuals. These probabilities are used to form Sewall Wright's numerator relationship matrix (A), which is proportional to the covariance matrix between additive genetic values of individuals. Eventually, elements of A enter into quantitative genetic models via the notion of covariances between relatives (Kempthorne, 1954). The calculations of IBD probabilities are deterministic and rely on the notion of a conceptual ancestral population from which descendants evolve in the absence of selection and mutation, following some equilibrium laws.
The most widely used G (Van Raden, 2008; G VR hereinafter) is built from genotype codes (e.g., 0, 1, 2 for bb, Bb and BB individuals, respectively, with B being the reference allele) deviated from their means and observed at p single nucleotide polymorphism (SNP) loci in each of n individuals. Typically, the genotype scoring encodes additive genetic effects. With X denoting an observed matrix of mean-deviated marker scores, the n×n matrix G VR is proportional to XX′. Under Hardy-Weinberg equilibrium and in the absence of selection or mutation, the expected value of G VR (after some normalization) is A, the pedigree-based similarity matrix (Habier et al., 2007; Gianola et al., 2009). G VR weights all SNPs alike and does not exploit linkage disequilibrium (LD) information beyond what is conveyed by colinearity among columns of X. To see this point, note that the i, jth element of G VR is proportional to ∑_{k=1}^p x_ik x_jk, where x_ik and x_jk are the scores at locus k for individuals i and j. An alternative similarity matrix could take the more general form XWX′ for some p × p weight matrix W constructed such that LD-based or effect-size based differential weights enter into G; in G VR, W is actually a p × p identity matrix. Speed et al. (2012) proposed a genomic relationship matrix that exploits LD information.
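The point that G VR is the special case W = I of the general form XWX′ can be verified numerically. The following numpy sketch (with arbitrary simulated genotypes, not data from the paper) builds the unweighted similarity matrix from centered scores and checks that it coincides with XWX′ when W is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 50                       # individuals, SNP loci

# Genotype codes 0/1/2 (copies of the reference allele B) at p loci.
M = rng.integers(0, 3, size=(n, p)).astype(float)

# Center each column: X holds mean-deviated marker scores.
X = M - M.mean(axis=0)

# Unweighted similarity: element (i, j) is sum_k x_ik * x_jk.
G_unweighted = X @ X.T

# General form X W X' with a p x p weight matrix W;
# W = I recovers the unweighted matrix above.
W = np.eye(p)
G_weighted = X @ W @ X.T
```

Replacing `np.eye(p)` by a non-identity diagonal matrix is all that is needed to weight loci differentially, which is the route taken by the trait-specific matrices introduced later.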
Importantly, G VR also ignores information that phenotypes may convey about genetic similarities. A central dogma of quantitative genetics is that genetic similarity generates covariance among relatives, so phenotypic similarities (after accounting for environmental sources of variation) are expected to reflect similarity due to allele sharing at quantitative trait loci (QTL) affecting a trait in question. Conversely, phenotypic similarity should be expected to inform about genetic (genomic) similarity. Further, if distinct QTL affect different traits, similarities on genotypic values for feed or nutrient consumption, say, should differ from genotypic similarities due to resistance to disease. Use of trait-specific similarity may be useful for enhancing prediction of complex traits in animal and plant breeding and in personalized medicine (de los Campos et al., 2011). Hence, G VR is not necessarily the best prescription for all traits. Thompson (1975) is representative of approaches that employ likelihood-based inference of kinship relationships from genetic data. Bravington et al. (2016) and Wang et al. (2017) pointed out that existing measures of kinship did not exploit information such as order along a chromosome (linkage or LD), and addressed estimation of genetic similarities using formal statistical procedures such as the Day-Williams method (Day-Williams et al., 2011). Making use of estimation theory is clearly a step forward towards ascertaining molecular kinship, as properties of estimators can be examined in a well-defined theoretical framework. However, neither Bravington et al. (2016) nor Wang et al. (2017) incorporated phenotypic information into their models, perhaps because in many areas (e.g., evolutionary biology) the relevant phenotypes are not easily available. On the other hand, animal breeders have attempted to incorporate phenotypic values into calculations of trait-specific genomic similarity. Zhang et al. 
(2010) used a two-stage approach to obtain a trait-specific genomic relationship matrix. In the first stage, Bayesian multiple-regression methods produce estimates of substitution effects at the SNP loci. In the second stage, weights computed from the effect-size estimates are used to form a genomic relationship matrix that is a weighted average of identity-by-state matrices computed for each SNP locus. The actual weight assigned to a given marker was the ''estimated genetic variance at that locus'', obtained using estimates of allelic frequency and of substitution effects. The expressions used to represent ''estimated genetic variance'' cannot always be justified as formal metrics for such parameter. In particular, one representation assumed Hardy-Weinberg and linkage equilibrium among markers; the latter assumption is manifestly violated in plant and animal breeding data.
In another approach (Wang et al., 2012), the weights of Zhang et al. (2010) were applied iteratively. In each iteration, GBLUP was used to estimate genomic breeding values, with estimates of SNP effects obtained indirectly as in Strandén and Garrick (2009). In a subsequent iteration, an idea suggested by Van Raden et al. (2009) was used to form a weighted G matrix where the contribution from each locus was weighted by the ''estimated variance at that locus''. Sun et al. (2012) described a method that is identical to that of Wang et al. (2012) except in the weights used for computing G. Sun et al. (2012) derived weights from an EM algorithm (Dempster et al., 1977) that presumably converged to produce the joint posterior mode of SNP effects under the Bayes A model of Meuwissen et al. (2001). In the context of genome-wide association studies, Liu et al. (2016) described an iteration between fixed and random effects models which would lead to some phenotype-informed similarity matrix. Karaman et al. (2018) proposed an ad-hoc multiple-trait method that used a Bayes A-type (Bayes AS) procedure hybridized with a GBLUP approach. All these approaches, although intuitively appealing, are heuristic so it is difficult to characterize their properties from a formal statistical perspective.
We present a Bayesian single-stage method for inferring genomic similarities among individuals that uses both marker and phenotypic information. Since the Bayesian model can be solved by sampling from the joint posterior distribution of a similarity matrix, uncertainty can be characterized fully and precisely in the framework of a well-established theory. Our method is adapted to several different Bayesian regression models and is illustrated employing simulation and with analyses of Pinus and wheat data sets. It is also shown how the concept can be adapted to prediction in a training-testing setting, and multiple-trait extensions are suggested.

Linear regression connecting markers to phenotypes
Let X = {x_ij} be an n×p (n = number of individuals; p = number of markers) observed matrix of centered molecular scores used as covariates in the linear regression model
y = Xβ + e, (1)
where y (phenotypes) and e ∼ N(0, Iσ²_e) (σ²_e is a variance parameter) are n-dimensional vectors, and β is a p × 1 vector of regression coefficients; typically n < p. Assume that nuisance location effects have been eliminated. One way of dealing with the lack of likelihood identification caused by n < p is to adopt a Bayesian position and assign some prior distribution to the vector of regression coefficients (Meuwissen et al., 2001). For instance, Bayes A uses a t-distribution as prior, the Bayesian Lasso (BL) adopts a conditional double exponential prior, and Bayes R assigns a mixture of four normal distributions with known variances. These methods were reviewed and discussed by, e.g., de los Campos et al. (2013) and Gianola (2009, 2013).

General considerations
The notion of a genetic similarity matrix is motivated here. In a pedigree-based model, if g is a vector of additive effects of n individuals, their covariance matrix (Henderson, 1976) is
Cov(g) = Aσ²_a, (2)
where A, defined earlier, is a similarity matrix satisfying
A = Cov(g)/σ²_a, (3)
and σ²_a is the additive genetic variance. The most widely used form of genome-based similarity matrix (n × n) is denoted generically as G ∝ XX′. Such G appears naturally if regression coefficients in (1) are treated as random variables. Assume β ∼ F(0, Bσ²_β), where F is some prior distribution with covariance matrix Bσ²_β and σ²_β is a variance parameter.
From (1), with g = Xβ and B = I, it follows that
Cov(g) = XX′σ²_β, (4)
so, consistently with (3), an a priori genomic similarity matrix can be defined as
G = Cov(g)/σ²_g = XX′σ²_β/σ²_g, (5)
where σ²_g is the genomic variance encoded by the model. The distribution of g depends on F.
As an example, suppose the method of inference is best linear unbiased prediction (BLUP). Here, σ²_β is a known variance parameter representing the variability of marker effects over conceptual repeated sampling. The estimates of β obtained with BLUP are numerically identical (at the same level of regularization) to those from ridge regression, so the procedure is often called ''ridge-regression BLUP'' (RRBLUP). The two needed variance components in BLUP, σ²_β and σ²_e, can be inferred using Bayesian or likelihood-based methods, and then kept fixed in the BLUP computations as if they were true values. Van Raden (2008) defined as ''genomic relationship'' matrix
G VR = XX′ / ∑_{j=1}^p 2q_j(1 − q_j) = XX′/(pH̄), (6)
where q_j is the frequency of a reference allele at marker locus j and H̄ is average heterozygosity, taken over markers. Observe that only X informs about similarity in G VR. The connection between σ²_β and σ²_g in BLUP can be deduced readily. The regression model in scalar form is g_i = ∑_{j=1}^p x_ij β_j. If it is assumed that all x_ij (marker genotypes) and β_j (regression coefficients) are independently distributed random variables with null means, unconditionally with respect to the x_ij's and β_j's, the genomic variance in GBLUP can be defined as
σ²_g = σ²_β ∑_{j=1}^p 2q_j(1 − q_j) = pH̄σ²_β, (7)
assuming Hardy-Weinberg equilibrium holds at each locus; independence of x_ij from any x_ij′ implies linkage equilibrium. This representation of genomic variance holds for any regression model, but the specific form of σ²_β depends on the prior adopted for marker effects; in BLUP, σ²_β is the variance of the common normal prior assigned to each marker effect. It is instructive to contrast (7) with the classical formula for additive genetic variance (V_A) in a model with a finite (K, say) number of QTL under linkage equilibrium. Here
V_A = ∑_{k=1}^K 2q_k(1 − q_k)α²_k, (8)
where q_k is allelic frequency at QTL k and α_k is a fixed substitution effect, for k = 1, 2, . . . , K (Falconer and Mackay, 1996; Lynch and Walsh, 1998). There are important conceptual differences between σ²_g and V_A (e.g., de los Campos et al., 2015).
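The relationship σ²_g = σ²_β ∑ 2q_j(1 − q_j) can be checked by Monte Carlo: under Hardy-Weinberg equilibrium and linkage equilibrium, genotypes are independent binomial(2, q_j) draws and marker effects are independent with variance σ²_β. A numpy sketch with arbitrary allele frequencies (not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, reps = 200, 20000
q = rng.uniform(0.1, 0.9, size=p)     # reference-allele frequencies
sigma2_beta = 0.01

# HWE genotypes, centered at their mean 2*q_j; independence across
# loci corresponds to linkage equilibrium.
x = rng.binomial(2, q, size=(reps, p)) - 2 * q
beta = rng.normal(0.0, np.sqrt(sigma2_beta), size=(reps, p))
g = (x * beta).sum(axis=1)            # one genomic value per replicate

var_mc = g.var()
var_theory = sigma2_beta * np.sum(2 * q * (1 - q))
# var_mc should be close to sigma2_beta * sum_j 2 q_j (1 - q_j)
```

With both genotypes and effects resampled each replicate, the empirical variance of g agrees with the theoretical genomic variance up to Monte Carlo error.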
In particular, additive genetic variance stems from randomness of genotypes and fixed effects at QTL, whereas σ 2 β (variance of marker effects distribution, with perhaps even none of the markers being a QTL) and σ 2 g (genomic variance) stem from a model where marker substitution effects are random in frequentist or Bayesian senses.
In the latter case, σ 2 β is a metric for prior Bayesian uncertainty (Gianola et al., 2009;Gianola, 2013) that appears as a hyperparameter in a GBLUP model. Note that G VR derives from a single realization of X, an observable matrix. On the other hand, in classical quantitative genetics X is viewed as varying at random according to a distribution reflecting equilibrium or disequilibrium laws, typically the latter in artificial breeding. Hence, G VR would be a random matrix as well; in the absence of selection and mutation and under Hardy-Weinberg equilibrium, it can be shown (e.g., Habier et al., 2007) that E (G VR ) = A.

Trait-specific (effects based) similarity matrices
Write model (1) as g = ∑_{j=1}^p X_j β_j, where X_j is the jth column of X. The notion of a trait-specific similarity matrix is introduced by defining
G(β) = X D(β) X′, (9)
where D(β) is an unknown p × p diagonal matrix with typical element
d_j = β²_j / ∑_{k=1}^p β²_k; j = 1, 2, . . . , p,
whose values range between 0 and 1. If β_j = 0, locus j does not contribute to trait-specific similarity. Further, loci with stronger (positive or negative) effects contribute more towards similarity than loci with weak effects. Note that, if marker effects are independent and identically distributed, for any prior distribution
E(d_j) = 1/p. (10)
As will be seen throughout the paper, this definition produces an expected prior similarity that is proportional to common, marker-derived, measures of similarity. Subsequently, Bayesian learning incorporates phenotypic information, producing measures of posterior similarity.
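The definition above is easy to compute: square the effects, normalize so the weights sum to one, and form a weighted cross-product. The following numpy sketch (simulated genotypes and effects, purely illustrative) also verifies that loci with β_j = 0 drop out of the similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 40
X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)                  # centered marker scores

beta = rng.normal(size=p)
beta[: p // 2] = 0.0                 # half the loci have no effect on this trait

# d_j = beta_j^2 / sum_k beta_k^2; weights lie in [0, 1] and sum to 1.
d = beta**2 / np.sum(beta**2)
G_trait = X @ np.diag(d) @ X.T       # effect-size based similarity

# Loci with beta_j = 0 contribute nothing: dropping them leaves G unchanged.
keep = beta != 0
G_drop = X[:, keep] @ np.diag(d[keep]) @ X[:, keep].T
assert np.allclose(G_trait, G_drop)
```

With two traits governed by different subsets of loci, the same genotypes yield two different G matrices, which is the motivation for trait-specific similarity.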
Once β is assigned a prior distribution F, a prior distribution is automatically induced for G(β). If F is such that marker effects are independent and identically distributed with variance σ²_β, the prior expectation of the similarity matrix is
E[G(β)] = XX′/p = H̄ G VR, (11)
where H̄ is average heterozygosity, as defined earlier. Hence, prior similarity is proportional to the similarity conveyed by Van Raden's genomic relationship matrix, i.e., a priori all markers contribute to expected similarity.
If Bayesian learning for β takes place, it must also occur for G(β), producing a posterior distribution of similarities [G(β)|X, y]. Using definition (9), the posterior expectation of the similarity matrix is
E[G(β)|X, y] = X E[D(β)|X, y] X′. (12)
Approximately,
E(d_j|X, y) ≈ E(β²_j|X, y) / ∑_{k=1}^p E(β²_k|X, y).
Provided the model holds and assuming likelihood-identified parameters, standard asymptotic Bayesian theory (Sorensen and Gianola, 2002) indicates that, as n → ∞, d_j converges to β²_{j,true}/∑_{k=1}^p β²_{k,true}, which is the ''true'' contribution of locus j towards similarity. In finite samples, if two markers are inferred at the same level of precision, the one with stronger effects will be assigned more weight. The extent of Bayesian learning stemming from markers and phenotypes, relative to the information conveyed from markers only, can be evaluated by computing Frobenius distances (dist) between random draws from the prior and posterior distributions of G(β) or D(β). For instance, the Frobenius distance between pairs of samples of G(β) is
dist(s, s′) = ||G(β^(s)) − G(β^(s′))||_F; s, s′ = 1, 2, . . . , S. (13)
The S samples can be used to estimate the distribution of dist.
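The Frobenius-distance comparison can be sketched with stand-ins for MCMC output: here "prior draws" are i.i.d. normal effect vectors and "posterior draws" are simulated as concentrated around a fixed effect vector (an assumption for illustration only, not an actual sampler):

```python
import numpy as np

def similarity(X, beta):
    # Effect-size based similarity X D(beta) X' with d_j = beta_j^2 / sum_k beta_k^2.
    d = beta**2 / np.sum(beta**2)
    return (X * d) @ X.T             # same result as X @ diag(d) @ X.T

def frobenius(G1, G2):
    return np.sqrt(np.sum((G1 - G2) ** 2))

rng = np.random.default_rng(3)
n, p, S = 5, 30, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)

beta_true = np.zeros(p)
beta_true[:5] = rng.normal(size=5)   # only 5 loci affect the trait

prior_draws = rng.normal(0.0, 1.0, size=(S, p))
post_draws = beta_true + rng.normal(0.0, 0.05, size=(S, p))

G_true = similarity(X, beta_true)
dist_prior = np.array([frobenius(similarity(X, b), G_true) for b in prior_draws])
dist_post = np.array([frobenius(similarity(X, b), G_true) for b in post_draws])
```

When the posterior concentrates near the true effects, posterior distances to the target similarity matrix are much shorter than prior distances, which is exactly the diagnostic used in the simulation section.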
The effects-based notion of similarity proposed holds for any member of the Bayesian alphabet (Gianola et al., 2009;Gianola, 2013), i.e., irrespective of the prior adopted. Further, the use of MCMC samples permits a full characterization of the posterior distribution of dist since any of its features can be estimated directly from the draws.

Implementation-specific similarity matrices
In contrast to the generic effect-size matrix defined in (9), measures of similarity can be constructed based on specific features of members of the Bayesian alphabet referred to previously. Here, we present the main ideas focusing on Bayes A (Meuwissen et al., 2001); specific details pertaining to some other members of the Bayesian alphabet (Gianola et al., 2009) are given in the Appendix

Bayes A
Bayes A is a linear regression model with a two-stage hierarchical prior assigned to marker effects. The model assumes
β_j | σ²_{β_j} ∼ N(0, σ²_{β_j}) and σ²_{β_j} ∼ IID ν_β S²_β χ⁻²_{ν_β}, (14)
for each j = 1, 2, . . . , p; IID means ''independent and identically distributed''. The first stage poses a marker-specific variance σ²_{β_j} and the second stage assumes that all such variances follow the same scale inverted chi-square distribution on ν_β degrees of freedom and with scale parameter S²_β; both ν_β and S²_β are hyper-parameters. Gianola et al. (2009) pointed out that the unconditional prior distribution assigned to each of the marker effects is actually
β_j ∼ t(0, ν_β, S²_β), (15)
that is, all markers are assumed, a priori, to follow the same t-distribution with variance
Var(β_j) = ν_β S²_β / (ν_β − 2). (16)
Meuwissen et al. (2001) interpreted the meaning of σ²_{β_j} incorrectly and attempted to connect such parameter to some region-specific genetic variance. In fact, markers are homoscedastic, as the same t-distribution is assigned throughout; the incorrect interpretation reappears in the literature (e.g., Zhang et al., 2010).
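The claim that the two-stage prior collapses to a single t-distribution for every marker can be verified by simulation: drawing a scaled inverse chi-square variance and then a normal effect for each of many hypothetical markers should reproduce the marginal variance ν_β S²_β/(ν_β − 2). A numpy sketch with arbitrary hyper-parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
nu, S2 = 5.0, 0.02                   # hyper-parameters nu_beta and S^2_beta
draws = 200000

# Stage 2: sigma2_j ~ nu * S2 * inv-chi-square(nu) (scaled inverse chi-square).
sigma2 = nu * S2 / rng.chisquare(nu, size=draws)
# Stage 1: beta_j | sigma2_j ~ N(0, sigma2_j).
beta = rng.normal(0.0, np.sqrt(sigma2))

# Marginally, beta_j is t-distributed with variance nu * S2 / (nu - 2).
var_mc = beta.var()
var_theory = nu * S2 / (nu - 2)
```

Every marker is drawn from the same two-stage recipe, so all effects share one marginal t-distribution: the hierarchy does not create marker-specific heterogeneity a priori, which is the homoscedasticity point made above.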
In Bayes A the covariance matrix of marker effects is
Cov(β) = I ν_β S²_β / (ν_β − 2). (17)
Hence, the vector of genomic breeding values g = Xβ has 0 as mean vector and XX′ν_β S²_β/(ν_β − 2) as covariance matrix of the prior distribution.
The unknown prior distribution of each element of g is that of a linear combination of independent t-random variables (Fisher, 1935; Sukhatme, 1938; Walker and Saw, 1978). It follows that the ''prior genomic relationship matrix'' in Bayes A has the same form as (6), but the genomic variance here is
σ²_g = [ν_β S²_β / (ν_β − 2)] ∑_{j=1}^p 2q_j(1 − q_j).
Interpretation of σ²_g in Bayes A cannot be disassociated from the hyper-parameters intervening in the distribution of marker effects; given σ²_g, an increase in ν_β must be compensated by an increase in S²_β. An unobserved similarity matrix in Bayes A can be defined following (4) and (6) as
G BA = X D BA X′, (18)
where D BA is a p × p diagonal matrix with typical element σ²_{β_j}/∑_{k=1}^p σ²_{β_k}. A priori,
E(G BA) = XX′/p,
i.e., the prior expectation is the same as that of the effect-size based similarity given in (11). A posterior distribution can be defined for G BA as well. The Bayes A method allows for some (but limited) Bayesian learning about the σ²_{β_j} parameters and, therefore, about D BA and G BA. Bayes A is implemented with MCMC using a Gibbs sampler (e.g., Meuwissen et al., 2001; Pérez and de los Campos, 2014). In the sampler, draws are made from the conditional posterior distributions
σ²_{β_j} | ELSE ∼ (ν_β S²_β + β²_j) χ⁻²_{ν_β + 1},
where ELSE denotes all other parameters, hyper-parameters and data. Note that only 1 degree of freedom is gained over the prior, thus posing an upper limit to the learning that can be attained for each σ²_{β_j}, with the same holding for D BA (Gianola et al., 2009). If S draws are available from the posterior distributions of each of the σ²_{β_j} parameters, the posterior distribution of G BA can be estimated from the MCMC samples
G^(s)_BA = X D^(s)_BA X′; s = 1, 2, . . . , S.
The posterior expectation of the distribution is estimated as ∑_{s=1}^S G^(s)_BA/S, and the posterior variance of the similarity between individuals i and j can be assessed as ∑_{s=1}^S [g^(s)_{BA,ij} − ḡ_{BA,ij}]²/S. Frobenius distances between draws from the prior and posterior distributions of D BA can be used to quantify the extent of Bayesian learning about similarity stemming from use of phenotypes over and above what is learned from markers only.

Learning similarity from multiple-trait models
Similarity matrices can also be learned from multiple-trait analyses. For example, consider a two-trait Bayes A regression, e.g., a joint analysis of stature (trait 1) and body weight (trait 2) in a group of subjects, and assume that both traits are measured in each of n individuals. A bivariate Bayes A model poses the hierarchy
β_j | B_j ∼ N(0, B_j) and B_j ∼ IW(Ω, ν); j = 1, 2, . . . , p.
Above, B_j is a 2 × 2 marker-specific covariance matrix and IW denotes an inverted Wishart distribution with known Ω = {Ω_ij} (scale matrix) and ν (degrees of freedom) parameters. All B_j matrices are independent and identically distributed, a priori, so the bivariate Bayes A model assigns the same IW distribution to each of the p matrices B_j. The marginal prior distribution of the vector of marker effects is bivariate t, with null mean vector and covariance matrix Ω ν/(ν − 2). Sorting individuals within traits, let g₁ = Xβ₁ and g₂ = Xβ₂, where β₁ is the p × 1 vector of marker effects on trait 1, and so on. The model is
[y₁′, y₂′]′ = [g₁′, g₂′]′ + [e₁′, e₂′]′,
where I is an n × n identity matrix, ⊗ is the Kronecker product and the e's are model residuals, with residual covariance matrix given by a 2 × 2 residual covariance matrix Kronecker-multiplied with I. The genetic variance-covariance matrix of genomic values is therefore
Cov([g₁′, g₂′]′) = [Ω ν/(ν − 2)] ⊗ XX′.
Hence, trait-specific genomic variances in the bivariate Bayes A model are
σ²_{g_t} = Ω_tt [ν/(ν − 2)] ∑_{j=1}^p 2q_j(1 − q_j); t = 1, 2,
and the genomic covariance is
σ_{g₁g₂} = Ω₁₂ [ν/(ν − 2)] ∑_{j=1}^p 2q_j(1 − q_j).
As in (18), an unknown (normalized) similarity matrix can be defined for trait t as
G BAt = X D BAt X′; t = 1, 2,
where D BAt is a p × p diagonal matrix with typical element B_{j,tt}/∑_{k=1}^p B_{k,tt}. In a Gibbs sampling context, it can be shown that B_j|ELSE is an inverse Wishart distribution. The sampler provides draws from the posterior distribution of B_j for each marker, thus leading to draws from the posterior distributions of D BAt and of G BAt. The two n × n blocks, G BA1 and G BA2, represent the similarity matrices for each of the two traits in the bivariate analysis.
Given S samples from the posterior distributions of G BA1 and G BA2, the posterior distribution of the Frobenius distance between the two similarity matrices can be estimated from
dist^(s) = ||G^(s)_BA1 − G^(s)_BA2||_F; s = 1, 2, . . . , S.
If the distribution of the Frobenius distance between G BA1 and G BA2 is away from 0, such finding would support the view that trait-specific similarity matrices are required.
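A single prior draw of the bivariate construction can be sketched in numpy (all values hypothetical; the inverse Wishart draw uses the standard construction via an inverted Wishart sample, valid for integer degrees of freedom at least equal to the dimension):

```python
import numpy as np

def inv_wishart(nu, Omega, rng):
    # If W ~ Wishart(nu, Omega^{-1}), then W^{-1} ~ IW(nu, Omega)
    # (integer nu >= dimension); W is built from nu Gaussian vectors.
    Z = rng.multivariate_normal(np.zeros(2), np.linalg.inv(Omega), size=nu)
    return np.linalg.inv(Z.T @ Z)

rng = np.random.default_rng(5)
n, p, nu = 5, 30, 6
X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)

# One 2x2 covariance matrix of marker effects per locus: B_j ~ IW(Omega, nu).
Omega = np.array([[0.02, 0.005], [0.005, 0.03]])
B = np.array([inv_wishart(nu, Omega, rng) for _ in range(p)])  # shape (p, 2, 2)

# Trait-specific weights: the t-th diagonal of each B_j, normalized over loci.
d1 = B[:, 0, 0] / B[:, 0, 0].sum()
d2 = B[:, 1, 1] / B[:, 1, 1].sum()
G1 = (X * d1) @ X.T                  # similarity for trait 1
G2 = (X * d2) @ X.T                  # similarity for trait 2

frob = np.sqrt(np.sum((G1 - G2) ** 2))
```

Because the two diagonals of the B_j draws differ locus by locus, G1 and G2 differ even a priori; in the actual analysis the comparison of interest is between posterior draws of the two matrices.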
Bayes B and Bayes C π (see Appendix for a description of the basics of these methods) can be adapted to the multiple trait case following Cheng et al. (2018). Likewise, a multiple-trait Bayesian Lasso (MBL) that assumes a conditional multivariate Laplace distribution for marker effects can be considered as well.
As in Gianola and Fernando (2018), ordering the p × 1 vectors of marker effects (β_i) within trait, the conditional prior for bivariate marker effects is
β_j | v²_j, Σ ∼ N(0, v²_j Σ); j = 1, 2, . . . , p,
where the v²_j are marker-specific weights collected into the diagonal matrix D_v, a priori, and Σ is the scale matrix of a bivariate Laplace distribution. Hence, marginally, marker effects follow a multivariate Laplace distribution. As shown in Gianola and Fernando (2018), the diagonal elements of D_v have unrecognizable conditional posterior distributions, and a Metropolis-Hastings algorithm was suggested by these authors for drawing samples. Gianola and Fernando (2018) give the prior distribution adopted for the weights; T = number of traits. The similarity matrix in MBL can be defined as
G BL = X D̃_v X′,
where D̃_v is diagonal with typical element v²_j/∑_{k=1}^p v²_k. Note that there is a single similarity matrix for the MBL model, as v²_j takes the same value across traits. A priori, for T = 2, E(G BL) = XX′/p, by symmetry of the weights. As before, the extent of Bayesian learning can be evaluated by comparing the prior and posterior distributions of G BL.

Setting
A genome consisting of 10 chromosomes, each of length one Morgan and containing 2000 SNP loci per chromosome, was simulated using the XSim package developed in the Julia programming language environment (Cheng et al., 2015). Random mating was practiced in a population of size n = 100 for one hundred generations to generate linkage disequilibrium between loci. Thereafter, the population was expanded to n = 500, 2000 or 4000 to produce increasingly large training samples and to evaluate the extent of Bayesian learning as a function of sample size. One hundred loci were randomly drawn from across the genome and designated as QTL; substitution effects for the QTL were simulated from a standard normal distribution. Phenotypic values were generated as
y = Qα + e,
where Q is the matrix of QTL genotypes and α is a fixed vector of ''true'' substitution effects. The residuals in vector e were sampled independently from a normal distribution with null mean and variance σ²_e chosen to generate a trait with a heritability of 0.5.
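The phenotype-generating step can be sketched in a few lines of numpy (simplified stand-in genotypes rather than XSim output): setting σ²_e equal to the sample variance of the genetic values fixes heritability near 0.5.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_qtl = 500, 100

# Stand-in QTL genotypes (0/1/2) and fixed "true" substitution effects.
Q = rng.integers(0, 3, size=(n, n_qtl)).astype(float)
alpha = rng.normal(size=n_qtl)       # standard normal, as in the simulation

g = Q @ alpha                        # true genetic values
# Heritability = var(g) / (var(g) + sigma2_e) = 0.5 when sigma2_e = var(g).
sigma2_e = g.var()
y = g + rng.normal(0.0, np.sqrt(sigma2_e), size=n)

h2_realized = g.var() / y.var()      # close to 0.5 up to sampling noise
```

The realized heritability fluctuates around the target because residuals are a finite sample; in the paper's setting the same logic applies with genotypes carrying simulated linkage disequilibrium.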
The 20,000 simulated SNP genotypes (including the 100 designated as QTL) and the phenotypes from the training samples were used in Bayes Cπ analyses (Habier et al., 2011), with the proportion of loci with null substitution effects, π, treated as unknown and assigned a uniform prior. For the Bayes Cπ model we constructed an effect-size based similarity matrix as defined in (9). The Julia package JWAS (Cheng et al., 2016) was used to draw 60,000 Gibbs samples of model unknowns for each Bayes Cπ analysis, and every 20th posterior sample of the vector of effects, β, was saved for inference; hence, 3000 samples were employed for inference on genomic similarity.
Since genotypes and their effects are known in the simulation, true genomic similarity at the QTL level could be computed, and this was done for illustrative purposes for the first 100 individuals from each of the three population sizes. Using (9), the ''true'' genomic similarity matrix employed for these individuals was
G Q = Q_c D(α) Q_c′,
where Q_c is the matrix of centered QTL genotype scores for the 100 individuals at the 100 QTL and α holds the true (simulated) QTL effect sizes. For each of the 3000 MCMC posterior samples, a similarity matrix was computed from marker effects and marker genotypes of the 100 individuals as
G[β^(s)] = X D(β^(s)) X′; s = 1, 2, . . . , 3000.
Similarly, 3000 samples from the prior distribution of β were used to produce draws from the prior distribution of G(β). The Frobenius distance between G Q and each of the 3000 G[β^(s)] matrices drawn from either the prior or posterior distributions of G was calculated as
dist^(s) = ||G[β^(s)] − G Q||_F.
Bayesian learning on similarity was evaluated by comparing the prior and posterior distributions of the Frobenius distances away from G Q. An overlap of distributions would indicate that phenotypes do not inform about similarity between individuals beyond what is conveyed by marker data only. Conversely, when phenotypes inform about similarity beyond markers, distances between G Q and posterior similarities should be shorter than those between G Q and prior similarities. Fig. 1 shows overlap between posterior and prior distributions of Frobenius distances when training data set size was 500, but draws from the posterior distribution were closer to G Q than prior draws. As the size of the training data set increased to 2000 (Fig. 2) and 4000 (Fig. 3), the overlap between posterior and prior distributions disappeared. Frobenius distances based on posterior draws were closer to similarity at the QTL level than those based on samples from the prior; further, the posterior distribution of distances was sharper than the prior distribution.
Note, however, that even though the QTL were in the marker panel, learning was still imperfect, as distances near 0 were not assigned appreciable posterior probability. In the limit, as n tends to infinity, the posterior distribution of G(β) is expected to converge to G Q, the genetic similarity at the QTL level, since QTL genotypes are included in the marker panel. The results corroborate, at least for Bayes Cπ, that when a training data set has sufficient size, the phenotypic data provide information beyond markers on trait-specific genetic similarities between individuals.

Pinus taeda data
The loblolly pine (Pinus taeda) data described in Cheng et al. (2018) were employed to test whether Bayesian learning could be attained in a real data set. After edits, there were n = 807 individuals with p = 4828 SNP markers and phenotypic measurements on Rust bin scores (presence or absence) and Rust gall volume, two disease traits; see Cheng et al. (2018) for additional information on these two disease traits.
A bivariate Bayes Cπ method was used as in Cheng et al. (2018); these authors assumed bivariate normality for the sampling model. The MCMC scheme had a burn-in period of 10,000 iterations, with an additional 100,000 samples drawn from the posterior distribution and sub-sampled every 50th round, such that the sample size for inference was 2000. Further, 2000 samples were drawn from the prior distribution of the similarity matrices. Frobenius distances between G VR and prior or posterior draws of the similarity matrices were calculated, leading to the distributions shown in Figs. 4 and 5 for Rust gall volume and Rust bin, respectively. The distributions had practically no overlap, indicating that use of phenotypic information did modify knowledge about similarity from what was conveyed by markers only. Note that the posterior distributions of the Frobenius distance away from G VR were similar between traits. As shown in Fig. 6, Frobenius distances varied concomitantly for the two traits. Fig. 7 displays the distribution of the Frobenius distance between the similarity matrices for the two traits. Since there was no appreciable probability mass near zero, the analysis supports the view that trait-specific similarities differed. Hence, G VR is not necessarily a suitable prescription for all traits.

In the wheat data set, lines were genotyped with binary markers, with a score indicating presence or absence of a marker locus in a given line. Grain yields in environments 1 and 2 were employed to compare outcomes between analyses based on bivariate GBLUP and the bivariate Bayesian Lasso (Gianola and Fernando, 2018). In the bivariate model, yields in the two environments are treated as distinct traits, conceptually. This type of data structure arises in dairy cattle breeding (milk production of daughters of bulls in different countries treated as different traits) and in multi-environment situations in plant breeding.

Wheat data
Using the wheat data, we constructed similarity matrices derived from bivariate GBLUP and bivariate Lasso analyses. For GBLUP, the estimates of variance and covariance components were obtained via a maximum likelihood procedure; the bivariate Lasso model was implemented with a Gibbs/Metropolis-Hastings procedure based on six parallel chains, leading to 12,000 samples post-convergence from the joint posterior distribution (Gianola and Fernando, 2018). The picture was similar to that obtained with the Pinus data set. Fig. 8 displays a scatter plot of 5000 randomly chosen posterior samples of Frobenius distances away from G VR for wheat yield 1 and wheat yield 2. Clearly, the posterior samples of distances for the two traits were very weakly correlated; the green straight line depicts the fitted values of a linear regression of distances for trait 2 on trait 1, with a slope of 0.13 and R² = 0.01. The horizontal and vertical lines give the Frobenius distances between G(β̂) and G VR, where β̂ is the BLUP of marker effects for either trait 1 or trait 2. The location of the bivariate distribution of Lasso samples for the two traits was far from the centroid defined by the intersection of the distances between G(β̂₁) and G VR, and G(β̂₂) and G VR, where the subscript denotes the trait. Hence, Fig. 8 corroborates that phenotypes inform about trait-specific similarity.

FBLUP
All members of the Bayesian alphabet are self-contained from a predictive point of view. For instance, if regression coefficients β in the standard Bayesian linear model g = X_Trn β are unknown and inferred from a training set, then in a testing set with known genotypes X_Tst the mean of the predictive distribution is X_Tst β̄, where β̄ is the posterior expectation in the training set. On the other hand, our procedures allow learning similarity matrices in testing sets with genotypes but without phenotypes. Such trait-specific similarity could be exploited further in a GBLUP context while producing prediction machines that are different from GBLUP. To illustrate the basic concept, let X = [X′_Trn, X′_Tst]′ collect the marker genotypes of training and testing individuals, and consider the weighted similarity matrix XDX′. Above, D is p × p and learned from training data only, and the training process assigns differential weights to SNPs; let D̄ be the posterior expectation of D obtained from the training process. A pseudo-BLUP (FBLUP, F stands for ''fake'') procedure could be used to produce predictions of genetic values in the testing set as
ĝ_F,Tst = X_Tst D̄ X′_Trn (X_Trn D̄ X′_Trn)⁻¹ ĝ_Trn,
where ĝ_Trn is the posterior expectation obtained from the Bayesian analysis of the training set. Note that ĝ_Trn and D̄ derive from training set data only, e.g., from fitting Bayes Cπ, Bayes R or any other variable selection method, while ĝ_F,Tst is based on a BLUP logic. FBLUP has not been evaluated empirically yet, but the idea is intriguing, as it represents a ''hybrid'' between some Bayesian approach and BLUP with, perhaps, predictive synergy. FBLUP is not a linear procedure, since neither D̄ nor ĝ_Trn are linear functions of the phenotypes. The focus here is introducing the methodology, as opposed to conducting an exhaustive evaluation of its behavior as a way of generating novel prediction machines. Nevertheless, we compared the predictive ability of univariate GBLUP with that of univariate FBLUP using the data for wheat yield 1 and a training-testing set layout reconstructed at random 1000 times.
In each training-testing instance, the training and testing sets had sizes $n_{Trn} = 450$ and $n_{Tst} = 149$, respectively. In FBLUP, $G_{VR}$ was replaced by $G(\hat{\beta}_1)$, where $\hat{\beta}_1$ represents the univariate BLUP estimates of marker effects on wheat yield 1 obtained from the appropriate training set, then used to form predictions evaluated against the appropriate testing sets. Fig. 9 displays training (for GBLUP) and testing (GBLUP and FBLUP) correlations and mean-squared errors. As usual, training set correlations (between fitted values and targets) were larger than the corresponding predictive correlations; likewise, training mean-squared errors were smaller than the mean-squared errors of prediction. The commonly observed negative relationship between fitting and predictive abilities held as well: the better the model fitted the training data, the worse its predictions were. No clear differences in favor of FBLUP over GBLUP were noted. For instance, the median predictive correlations for GBLUP were 0.523 and 0.491 for traits 1 and 2, respectively; for FBLUP, the corresponding median correlations were 0.518 and 0.486. Differences in predictive mean-squared errors were small as well. Since FBLUP can be used in conjunction with any method in which marker effects are learned, the term actually defines a new family of prediction machines.
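The FBLUP logic can be sketched numerically. Everything in this illustration is simulated: the diagonal SNP weights stand in for the posterior mean $\bar{D}$ that a Bayesian training analysis would deliver, and the vector standing in for the posterior mean of training-set genetic values is random; the small ridge added before inversion is a numerical safeguard, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trn, n_tst, p = 450, 149, 600

# Genotype matrix partitioned into training and testing blocks: X = [X_trn; X_tst].
X = rng.choice([0.0, 1.0, 2.0], size=(n_trn + n_tst, p))
X -= X.mean(axis=0)                               # centre columns
X_trn, X_tst = X[:n_trn], X[n_trn:]

# Stand-ins for training-set output: D_bar (posterior mean of the diagonal
# SNP-weight matrix) and g_hat_trn (posterior mean of training genetic values).
d_bar = rng.gamma(shape=2.0, scale=0.5, size=p)   # hypothetical differential weights
g_hat_trn = rng.normal(size=n_trn)                # hypothetical posterior means

# FBLUP: g_hat_tst = G_{tst,trn} [G_{trn,trn}]^{-1} g_hat_trn, with G = X D X'.
G_trn_trn = (X_trn * d_bar) @ X_trn.T
G_tst_trn = (X_tst * d_bar) @ X_trn.T
ridge = 1e-6 * np.eye(n_trn)                      # numerical safeguard only
g_hat_tst = G_tst_trn @ np.linalg.solve(G_trn_trn + ridge, g_hat_trn)

print(g_hat_tst.shape)   # (149,)
```

Once $\bar{D}$ and the training-set posterior means come from an actual Bayesian fit, the resulting prediction is non-linear in the phenotypes, which is what distinguishes FBLUP from GBLUP.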

Discussion
Genomic relationship matrices (e.g., VanRaden, 2008) are a central part of GBLUP, arguably the prediction method most widely used in animal and plant improvement programs. In GBLUP, some $G$ such as $G_{VR}$ replaces the pedigree-based kinship matrix $A$ used in classical BLUP. VanRaden's $G_{VR}$ is based entirely on markers that are assigned the same weight, disregarding the possibility that some genomic regions may be more important than others in trait determination. Also, linkage disequilibrium relationships among markers (beyond what is encoded by the columns of the genotype matrix $X$) are not exploited in $G_{VR}$. Bravington et al. (2016) assumed linkage equilibrium in kinship estimation in the context of mark-recapture studies. On the other hand, LD information was incorporated into a marker-based similarity measure suggested by Speed et al. (2012).
The construction of trait-specific $G$'s has been considered in previous research, although not formally addressed as an estimation problem. Several studies, e.g., Zhang et al. (2010) and Wang et al. (2012), used information on phenotypes in the construction of $G$ to produce trait-specific similarity (kernel) matrices. Fan et al. (2016) suggested an algorithm that iterates between fixed-effects and random-effects models and can be used to arrive at trait-informed relationship matrices. Although these methods have intuitive appeal, a cohesive theoretical foundation was lacking. Here, we propose a formal Bayesian approach and implement it in an MCMC framework. The point of departure is to regard a similarity matrix $G$ as an unknown parameter and to define it as $G = XDX'$, where $D$ is a diagonal matrix involving unknown quantities. For example, we let $G(\beta)$ be a function of unknown effect sizes, define a prior density $p(G(\beta) \mid X, Hyp)$ and arrive at a posterior density $p(G(\beta) \mid X, y, Hyp)$. The unknown similarity matrix can be inferred from MCMC draws from the posterior. The extent of Bayesian learning is assessed from the samples of Frobenius distances between matrices drawn from the prior and posterior distributions of $G(\beta)$. It is important to note that the proposed method does not use information beyond what is conveyed by the regression model. Rather, it creates a connection between phenotypes and similarity that is ignored when the latter is inferred from markers only. Specifically, $G(\beta)$ is treated as an unknown (non-linear) function of $\beta$, so given a posterior distribution of the vector of regression coefficients, a posterior distribution of the similarity matrix is induced automatically. The basic idea applies to all members of the Bayesian alphabet, e.g., Bayes A, B, C, C$\pi$, R and L, since obtaining realizations of $\beta$ is common to all such methods.
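The mechanics of inducing a posterior on $G$ from draws of $\beta$ can be illustrated with a short numerical sketch. The particular diagonal $D$ used here (squared effects, normalized to sum to one) is one illustrative choice among several, and the simulated prior and ''posterior'' draws of $\beta$ are stand-ins for real MCMC output.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, S = 50, 120, 200

X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
q = X.mean(axis=0) / 2.0                          # observed allele frequencies
Z = X - 2.0 * q                                   # centred genotypes
G_vr = Z @ Z.T / np.sum(2.0 * q * (1.0 - q))      # VanRaden-type reference matrix

def G_of_beta(Z, beta):
    """Similarity induced by effect sizes: G(beta) = Z D Z' with
    D = diag(beta_j^2 / sum_k beta_k^2), an illustrative choice of D."""
    d = beta**2 / np.sum(beta**2)
    return (Z * d) @ Z.T

# Hypothetical MCMC output: S prior draws and S "posterior" draws of beta.
prior_betas = rng.normal(0.0, 1.0, size=(S, p))
post_betas = rng.normal(0.0, 1.0, size=(S, p)) * 0.1
post_betas[:, :10] += rng.normal(2.0, 0.2, size=(S, 10))   # a few "real" QTL

# Frobenius distances away from G_VR under the prior and the posterior.
def frob(B):
    return np.linalg.norm(B - G_vr, "fro")

d_prior = np.array([frob(G_of_beta(Z, b)) for b in prior_betas])
d_post = np.array([frob(G_of_beta(Z, b)) for b in post_betas])
```

Comparing the two distance distributions (e.g., by their overlap) is exactly the device used in the paper to assess how much the phenotypes have moved similarity away from its prior.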
Using simulation, it was found that Bayesian learning about similarity does take place and that the prior and posterior distributions of $G(\beta)$ become increasingly distinct as sample size grows. A Pinus data set analyzed with Bayes C$\pi$ provided proof of concept. The distributions of Frobenius distances away from $G_{VR}$ of prior and posterior samples suggested that phenotypes conveyed information on similarities beyond what is provided by markers, this holding for the two traits evaluated. Furthermore, there was evidence that the marker-phenotype informed similarity matrices differed between the two disease traits. A similar picture emerged from the analyses carried out with the wheat data set.
The Bayesian approach provides information about the uncertainty of similarities between any pair of individuals, as it is easy to estimate the posterior distribution of any element of $G(\beta)$.
Such variability is expressed conditionally on $X$ and $y$; it measures statistical uncertainty about similarity, but it is not comparable to the metrics presented by Hill and Weir (2011). These authors assumed conceptual repeated sampling of genotypes from a population. The distinction with the Bayesian interpretation is crucial, as some of the genome-enabled prediction literature is unclear with respect to what is fixed and what is random in marker-based models. Genetic variability and covariability are defined in quantitative genetics using the notion that allelic substitution effects are fixed and genotypes are random (Gianola et al., 2009; de los Campos et al., 2015; Gianola et al., 2015), as in Fisher's (1918) model. On the other hand, the Bayesian approach regards substitution effects as random and marker genotypes as fixed. In our paper, we treat $X$ as a fixed matrix, and the randomness in $G = XD(\beta)X'$ derives from Bayesian uncertainty about $D(\beta)$ which, in turn, depends on the uncertainty about model parameters.
Bayes C$\pi$ was used for the Pinus data, and a different method would surely have produced a different posterior distribution of Frobenius distances away from $G_{VR}$, perhaps overlapping with the one produced by Bayes C$\pi$, perhaps not. As noted earlier, similarities based on the Bayesian Lasso applied to wheat grain yield differed from those based on a genomic similarity matrix derived from BLUP point estimates of marker effects. Bayesian learning about $G$ is entirely dependent on how much is learned about $\beta$, which is a function of the trait-environment-method combination examined. For methods 1 and 2, say, with similarity matrices $G_k = XD_kX'$ ($k = 1, 2$) and an orthonormal regression model ($X'X = I$), the squared Frobenius distance between the corresponding similarity matrices is
$$\|G_1 - G_2\|_F^2 = \sum_{j=1}^p\left(d_{1j} - d_{2j}\right)^2,$$
where $d_{kj}$ is the $j$th diagonal element of $D_k$. Clearly, differences in similarity matrices depend on how marker effects are inferred by the various methods, a question that is entirely context dependent. If a formal model comparison favors method 1, say, one could state that $G_1$ provides a better inference of $G$ than $G_2$. However, if $G_1$ is used in a prediction machine that performs better than one based on $G_2$, this finding does not necessarily imply that the inference of similarity based on method 1 is better than the one based on method 2. The constellation of complex traits (and the number of members of the Bayesian alphabet and of prediction methods) is enormous, and it does not seem sensible to offer general prescriptions, especially from a prediction perspective. Successful application of the concepts developed in this paper depends on the availability of a reliable MCMC implementation. For suitable inference, evidence must indicate that the sampler has reached the target distribution and that sampling has been intensive enough that Monte Carlo error does not swamp statistical signals.
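The orthonormal special case can be checked numerically: with $X'X = I$, the squared Frobenius distance between two matrices of the form $XD_kX'$ reduces to the sum of squared differences of the diagonal weights. The sketch below verifies this with a random matrix whose columns are made orthonormal via QR.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 10

# Orthonormal regression model: columns of X satisfy X'X = I (reduced QR).
X, _ = np.linalg.qr(rng.normal(size=(n, p)))

d1 = rng.uniform(size=p)   # diagonal weights learned by "method 1"
d2 = rng.uniform(size=p)   # diagonal weights learned by "method 2"

G1 = (X * d1) @ X.T        # G_1 = X D_1 X'
G2 = (X * d2) @ X.T        # G_2 = X D_2 X'

# ||G1 - G2||_F^2 = tr[(D1 - D2)^2] = sum_j (d1_j - d2_j)^2 when X'X = I.
lhs = np.linalg.norm(G1 - G2, "fro") ** 2
rhs = np.sum((d1 - d2) ** 2)
print(np.isclose(lhs, rhs))   # True
```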
If the prior distribution is a mixture, for example, so is the posterior (Albert, 2009), but detecting and attaining convergence (or diagnosing lack thereof) with high-dimensional models is a challenge. With $p$ markers and a four-component mixture, as in Bayes R, the joint posterior has $3p$ parameters, so at least $3^p$ states must be visited, as there are three ''free'' indicator variables per marker. Rajaratnam and Sparks (2015) studied ''convergence complexity'' in $n \ll p$ situations and derived formulae that can be used to calculate the minimum amount of sampling required before reaching the equilibrium distribution. Calculations using such formulae (not shown here) suggest that convergence may be computationally difficult to attain when all polymorphic nucleotides available in a DNA sequence are used as covariates in a model, so much more sampling may be required than is often done in practice. Conceivably, burn-in periods of at least half a million iterations may be needed. Our calculations support observations made by Celeux et al. (2000) on difficulties encountered in convergence of Bayesian mixture (variable selection) models. Inference must be done with caution when high-dimensional data are encountered, especially if sampling is not intensive enough. The approaches presented here do not escape such pitfalls.

Acknowledgments
DG and CCS acknowledge financial support from the Deutsche Forschungsgemeinschaft (DFG; grant no. SCHO 690/4-1). DG was also partially supported by the Jay L. Lush Endowment at Iowa State University. RLF was supported by US Department of Agriculture, Agriculture and Food Research Initiative National Institute of Food and Agriculture Competitive Grant No. 2018-67015-27957. The Technical University of Munich (TUM) School of Life Sciences, the TUM Institute for Advanced Study, the University of Wisconsin-Madison, Iowa State University and the Institut Pasteur de Montevideo are thanked for providing office space, library and computing resources and logistic support.

A.0.1. Bayes L
The Bayesian Lasso (Bayes L) assigns the same conditional double exponential distribution, with parameter $\lambda$, as prior to each of the regression coefficients; marker effects are assumed (given $\lambda$) independently distributed a priori. Bayes L can be parameterized and implemented in various manners (Park and Casella, 2008; Kärkkäinen and Sillanpää, 2012). Here, we adopt the scale mixture of normals developed by Gómez-Sánchez-Manzano et al. (2007) and used by Gianola and Fernando (2018) in a multiple-trait generalization of Bayes L. The multi-trait Bayes L implementation in Gianola and Fernando (2018) reduces to a univariate Bayes L model when the number of traits is 1.
The hierarchy is
$$\beta_j \mid \sigma^2_{\beta_j} \sim N\left(0, \sigma^2_{\beta_j}\right), \quad \sigma^2_{\beta_j} \mid \lambda \sim Exp\left(\frac{\lambda^2}{2}\right),$$
for all $j = 1, 2, \ldots, p$. Then $\beta_j \mid \lambda$ follows a double exponential distribution. The structure of the problem is as in Bayes A. The variance of a marker effect (the prior expectation is zero) is $Var(\beta_j \mid \lambda) = 2/\lambda^2$, the variance of the double exponential distribution. Hence, the genomic variance in the Bayesian Lasso (given $\lambda$) is
$$\sigma^2_g = \frac{2}{\lambda^2}\sum_{j=1}^p 2q_j\left(1 - q_j\right).$$
Given some estimate of $\sigma^2_g$ and of allelic frequencies at marker loci, $\lambda$ can be assessed using (43).
A similarity matrix can be defined from the hierarchical representation as $G_{BL} = XD_{BL}X'$, where $D_{BL}$ is diagonal with the (normalized) variances $\sigma^2_{\beta_j}$ on its diagonal. The prior expectation of $G_{BL}$ is approximately proportional to $XX'$. Sampling from the prior is straightforward, since the $\sigma^2_{\beta_j}$ are independently distributed as $Exp(\lambda^2/2)$ random variables.
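Sampling $G_{BL}$ from its prior can be sketched as follows. The normalization of the diagonal (each variance divided by their sum) is one convenient convention, and the comparison of the Monte Carlo mean with $XX'/p$ follows from exchangeability of the $\sigma^2_{\beta_j}$.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, S, lam = 40, 100, 500, 3.0

X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
X -= X.mean(axis=0)

# Draw S prior realizations of G_BL = X D_BL X', with the diagonal of D_BL
# built from sigma2_j ~ Exp(rate = lambda^2 / 2), i.e. mean 2 / lambda^2.
draws = []
for _ in range(S):
    sigma2 = rng.exponential(scale=2.0 / lam**2, size=p)
    d = sigma2 / sigma2.sum()            # normalized variances on the diagonal
    draws.append((X * d) @ X.T)

G_mean = np.mean(draws, axis=0)
# By exchangeability, E[sigma2_j / sum_k sigma2_k] = 1/p, so the prior mean
# of G_BL under this normalization is X X' / p.
print(np.allclose(G_mean, X @ X.T / p, atol=0.5))
```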

A.0.2. Bayes B and Bayes Cπ
Bayes B (Meuwissen et al., 2001) and Bayes C and C$\pi$ (Habier et al., 2011) pose a two-component mixture model as prior distribution. All marker effects follow the same prior distribution and are regarded as independently distributed, a priori. In Bayes B, the first component consists of a null effect with known prior probability $\pi$. The second component assigns a t-distribution to each marker effect, with parameters as in Bayes A and with probability $1 - \pi$. Bayes C also assumes known $\pi$, but differs from Bayes B in that, for the second component, a multivariate t-distribution with null mean is assigned as joint prior to the non-null effects; the scale matrix in Bayes C is $IS^2_\beta$ and the degrees of freedom are as in Bayes B. A subtle but important difference is that in Bayes B effects are independent a priori, but this is not so for Bayes C. In the latter, marker effects are uncorrelated but not independent, a priori. Finally, Bayes C$\pi$ has an additional layer in the hierarchy, with a prior distribution assigned to $\pi$. In what follows, we take Bayes B as prototype, as the treatment for Bayes C is similar.
Using properties of mixtures (e.g., Gianola et al., 2006), the marginal prior distribution of a marker effect in Bayes B has mean zero and variance
$$\sigma^2_\beta = \left(1 - \pi\right)S^2_\beta\frac{\nu_\beta}{\nu_\beta - 2}.$$
If hyper-parameter values were set as in Bayes A, the prior variance would be smaller in Bayes B because of the strong assumption that a fraction $\pi$ of the markers has a null prior variance. The vector of genomic values in Bayes B has mean $0$ and covariance matrix $\sigma^2_\beta XX'$; the distribution of $g$ is unknown and cannot be written in closed form. The Bayes B genomic variance is structured as
$$\sigma^2_g = \left(1 - \pi\right)S^2_\beta\frac{\nu_\beta}{\nu_\beta - 2}\sum_{j=1}^p 2q_j\left(1 - q_j\right).$$
The original formulation of Bayes B (Meuwissen et al., 2001) uses an implementation based on variance parameters. Given $\sigma^2_{\beta_1}, \sigma^2_{\beta_2}, \ldots, \sigma^2_{\beta_p}$ (see the hierarchical model for Bayes A), a marker effect has variance $\sigma^2_{\beta_j}$ with probability $1 - \pi$, or variance $0$ with probability $\pi$. Put
$$D_{BB} = diag\left\{I_1, I_2, \ldots, I_p\right\},$$
where $I_j$ takes the value $\sigma^2_{\beta_j}/\sum_{j=1}^p\sigma^2_{\beta_j}$ ($j = 1, 2, \ldots, p$) with prior probability $1 - \pi$, or $0$ with probability $\pi$. The Bayes B MCMC sampler produces an indicator of whether $0$ or the realized (normalized) value of $\sigma^2_{\beta_j}$ is placed in position $j$ along the diagonal of $D_{BB}$. In Bayes C$\pi$, the mixing proportion varies at random because $\pi$ is unknown. A priori, in Bayes B, $E(I_j) = (1 - \pi)/p$, so
$$E\left(D_{BB}\right) = \frac{1 - \pi}{p}I.$$
Given $S$ samples from each of the prior and posterior distributions of $D_{BB}$, Frobenius distances can be compared as discussed previously. The posterior distribution of $G_{BB}$ can be estimated from posterior samples
$$G^{(s)}_{BB} = XD^{(s)}_{BB}X'; \quad s = 1, 2, \ldots, S.$$
Bayes B can also be implemented as a standard two-component mixture model, where assignment to one of the two components of the distribution pertains to an effect and not to a variance. Use is made of data augmentation by including binary (0, 1) indicator variables $\delta = (\delta_1, \delta_2, \ldots, \delta_p)'$ denoting the component of the mixture responsible for the effect of a marker covariate on phenotypes. The scheme produces metrics for variable selection based on the posterior distribution of the $\delta$'s.
Given $\delta$, the linear regression model (assuming no intercept) for datum $i$ is
$$y_i = \sum_{j=1}^p x_{ij}\delta_j\beta_j + e_i,$$
or, in matrix notation,
$$y = X\Delta\beta + e = X^*\beta + e,$$
where $\Delta = diag(\delta_j)$ and $X^* = X\Delta$. Conditionally on $\Delta$,
$$Var\left(g \mid \Delta\right) = \sigma^2_\beta X\Delta X',$$
where $\sigma^2_\beta = S^2_\beta\frac{\nu_\beta}{\nu_\beta - 2}\left(1 - \pi\right)$; note that $\Delta^2 = \Delta$. Using (48), an unknown similarity matrix is
$$G_{BB}\left(\Delta\right) = X\Delta X'.$$
Since $\Delta$ is unknown, $G_{BB}(\Delta)$ is a random matrix. If $\delta_j \sim Bernoulli(\pi)$, $j = 1, 2, \ldots, p$, its prior expectation is
$$E\left[G_{BB}\left(\Delta\right)\right] = \pi XX'.$$
Now, the similarity matrix in (53) and in (55) can be decomposed as
$$G_{BB}\left(\Delta\right) = \sum_{j=1}^p\delta_jx_jx'_j = \sum_{j=1}^p G_{BB,j},$$
which shows clearly how variable selection affects similarity; $G^{(s)}_{BB,j} = \delta^{(s)}_jx_jx'_j$ is the contribution to similarity made by locus $j$ in sample $s$.
The posterior expectation of $G_{BB,j}(\Delta)$ is estimated as
$$\hat{G}_{BB,j} = \hat{\delta}_jx_jx'_j,$$
where $\hat{\delta}_j$ is the estimated posterior expectation of $\delta_j$, also an estimate of the probability of ''inclusion'' of marker $j$ in the model. Since the prior probability of inclusion is $0 < \pi < 1$, the averaging process results in all markers contributing to similarity, but to different degrees, as
$$\hat{G}_{BB} = \sum_{j=1}^p\hat{\delta}_jx_jx'_j.$$
The difference between similarity matrices drawn from their prior and posterior distributions is driven by the difference between the prior and posterior distributions of $\Delta$. In Bayes B, $\pi$ is taken as a known hyper-parameter, so draws from the prior distribution of $\Delta$ can be obtained by drawing each of its elements from $\delta_j \sim Bernoulli(\pi)$. In Bayes C$\pi$, draws from the prior distribution can be obtained by composition sampling: (1) draw $\pi$ from its prior distribution (e.g., a uniform prior) and (2) conditionally on the realization $\pi^*$, draw $\delta_j \sim Bernoulli(\pi^*)$, repeating the two steps for each marker. Draws from the posterior distribution of $\Delta$ follow directly from the MCMC algorithm; prior and posterior distributions can be compared by examining the distribution of Frobenius distances.
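The indicator-based similarity is easy to sketch numerically. Here the prior draws of $\Delta$ use a fixed inclusion probability (the Bayes B case), and the posterior inclusion probabilities $\hat{\delta}_j$ are invented stand-ins for averages over MCMC draws, not output of a real analysis.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, S = 30, 80, 400

X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
X -= X.mean(axis=0)

pi_incl = 0.1   # prior probability of inclusion for each marker

# Prior draws of G_BB(Delta) = X Delta X', Delta = diag(delta_j),
# delta_j ~ Bernoulli(pi). For Bayes C-pi, one would first draw pi from its
# Uniform(0, 1) prior (composition sampling) before drawing the delta_j.
prior_G = [(X * rng.binomial(1, pi_incl, size=p)) @ X.T for _ in range(S)]

# Hypothetical posterior inclusion probabilities delta_hat_j (stand-ins for
# averages of MCMC draws of delta_j); every marker ends up contributing.
delta_hat = np.full(p, 0.02)
delta_hat[:8] = 0.9                    # a handful of markers with strong support

G_hat = (X * delta_hat) @ X.T          # G_hat = sum_j delta_hat_j x_j x_j'
contrib_j = delta_hat[0] * np.outer(X[:, 0], X[:, 0])   # locus 1's share
```

The single-locus term `contrib_j` makes explicit how variable selection reweights each marker's contribution to pairwise similarity.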
It is important to emphasize that (51) and (55) define two distinct similarity matrices. The prior and posterior distributions of (51) depend on the $\sigma^2_{\beta_j}$ parameters, on which limited Bayesian learning takes place (Gianola et al., 2009). The second matrix involves the $\delta_j$ indicators, which enable learning about $\pi$ in Bayes C$\pi$.
A.0.3. Bayes R
Erbe et al. (2012) proposed a four-component mixture distribution as prior for each of the markers entering into the linear regression model; the procedure was termed Bayes R. The prior for the effect of marker $j$ is
$$\beta_j \mid \pi_1, \pi_2, \pi_3, \pi_4, Hyp \sim \pi_1N_1 + \pi_2N_2 + \pi_3N_3 + \pi_4N_4; \quad j = 1, 2, \ldots, p,$$
where $N_i$ ($i = 1, 2, 3, 4$) denotes a normal distribution with null mean and variance $\sigma^2_i$, and $\pi = \{\pi_i\}$ is the vector of probabilities of membership. In Bayes R, the hyper-parameters in $Hyp$ include the variances of the four normal distributions, which are set arbitrarily. Actually, component 1 has null mean and null variance ($\sigma^2_1 = 0$) in Bayes R, so it represents a point mass at zero with probability $\pi_1$, producing a ''spike and slab'' model. In turn, the probabilities of membership $\pi$ are assigned a four-dimensional Dirichlet prior distribution with hyper-parameters $\alpha = (1, 1, 1, 1)'$, resulting in a uniform prior. Hence, the variance of a marker effect, given $\pi$, is
$$Var\left(\beta_j \mid \pi\right) = \sum_{i=1}^4\pi_i\sigma^2_i,$$
recalling that $\sigma^2_1 = 0$. Because of the nullity of the means of the component distributions, averaging over the Dirichlet distribution yields
$$\sigma^2_\beta = \frac{1}{4}\sum_{i=1}^4\sigma^2_i,$$
so the variance of the marginal prior distribution of a marker effect is an unweighted average of the variances of the mixture components. For Bayes R, the genomic variance takes the form
$$\sigma^2_g = \sum_{j=1}^p 2q_j\left(1 - q_j\right)\sigma^2_\beta.$$
As in any mixture model, the joint posterior can be augmented with indicator variables. In Bayes R, a $4 \times 1$ vector $\phi_j = (\phi_{j1}, \phi_{j2}, \phi_{j3}, \phi_{j4})'$ points to the distribution ''responsible'' for the effect of marker $j$; e.g., if marker $j$ has a null effect, then $\phi_j = (1, 0, 0, 0)'$. From the point of view of learning a similarity matrix informed by phenotypes, what matters is whether a given marker has a null effect or not. Hence, one can define a binary indicator taking the value 1 if the marker has an effect ($\phi_{j1} = 0$, i.e., either $\phi_{j2}$, $\phi_{j3}$ or $\phi_{j4}$ takes the value 1) and 0 otherwise; let the binary indicator be $\gamma_j$ ($j = 1, 2, \ldots, p$). A similarity matrix for Bayes R would be
$$G_{BR}\left(\Gamma\right) = X\Gamma X', \quad \Gamma = diag\left\{\gamma_1, \gamma_2, \ldots, \gamma_p\right\},$$
with prior expectation $(1 - \pi_1)XX'$, where $\pi_1$ is the prior probability of a marker having a null effect on the trait. The MCMC sampler produces draws from the posterior distribution of each $\phi_j$ (and therefore, of each $\gamma_j$). The mean of the posterior distribution of $G_{BR}(\Gamma)$ is estimated as
$$\hat{G}_{BR} = \frac{1}{S}\sum_{s=1}^S X\Gamma^{(s)}X'.$$
At any round $s$ of the MCMC, $\gamma^{(s)}_j = 0$ if $\phi^{(s)}_{j1} = 1$, and $\gamma^{(s)}_j = 1$ otherwise ($j = 1, 2, \ldots, p$). The difference between the prior and posterior distributions of $\Gamma$ informs how much phenotypes affect similarity between individuals over and above markers. Equivalently, one can examine the distribution of Frobenius distances between pairs of matrices drawn from the prior and posterior distributions of $G_{BR}(\Gamma)$.
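The mapping from component labels to the binary $\gamma_j$ and the Monte Carlo estimate of $G_{BR}(\Gamma)$ can be sketched as follows; the component labels here are simulated stand-ins for posterior draws of the $\phi_j$ that an actual Bayes R sampler would produce.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, S = 25, 60, 300

X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
X -= X.mean(axis=0)

# Hypothetical MCMC draws of the component labels (1..4) for each marker;
# label 1 is the point mass at zero, labels 2-4 are the non-null components.
labels = rng.choice([1, 2, 3, 4], p=[0.85, 0.05, 0.05, 0.05], size=(S, p))

# gamma_j^(s) = 0 if phi_j1^(s) = 1 (marker in the null component), else 1.
gamma = (labels != 1).astype(float)

# Posterior mean of G_BR(Gamma): average of X Gamma^(s) X' over the S samples.
G_draws = np.einsum("ni,si,mi->snm", X, gamma, X)   # stacked X Gamma^(s) X'
G_hat = G_draws.mean(axis=0)
```

Frobenius distances between these draws and draws built from prior $\gamma_j \sim Bernoulli(1 - \pi_1)$ indicators would then quantify the phenotype-driven shift in similarity.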