Elsevier

Livestock Science

Volume 177, July 2015, Pages 1-7

A multi-compartment model for genomic selection in multi-breed populations

https://doi.org/10.1016/j.livsci.2015.03.027

Highlights

  • Enhancing the estimation of SNP effects when pooled data is justified.

  • Implementing a hierarchical Bayesian model where the effect of an SNP marker could be different across breeds or lines.

  • Increasing genomic prediction accuracies by 17 to 47%.

  • The model is simple and easy to implement.

Abstract

Genome wide evaluation methods are often conducted using purebred populations. Estimation, and often validation, are carried out using primarily select elite animals. This process is successful when estimated SNP effects are used to predict genomic breeding values of animals of a similar breed. The approach fails when SNP estimates from one breed are used for genomic prediction in other breeds. In this study, we proposed a multi-compartment model where the effect of an SNP marker could differ between breeds. Two simulation scenarios were carried out using an admixed population of two divergent lines (A and B), the first using a low density panel (300 SNPs) and the second using a high density panel (60 k SNPs). Divergence between the two lines was artificially created by multiplying marker effects in one line by a variable α, which was sampled from different uniform or normal distributions. The proposed method was compared to the pooled data approach based on the accuracy of predicting the true breeding values. In the first simulation scenario, the prediction accuracy using the pooled data approach for line A was 0.40, 0.39 and 0.38 when α was generated from a uniform distribution on [−2, 2], [−4, 4] and [−8, 8], respectively. Using our proposed method, the corresponding accuracies were 0.47, 0.46 and 0.46, respectively. A similar trend was observed for line B, with a clear superiority of the multi-compartment model over the pooled data approach: the increase in accuracy ranged from 17 to 47% and grew as the divergence between lines increased. In the second scenario, when α was sampled from a uniform [−2, 2], accuracy for line A (B) was 0.32 (0.30) using the pooled data model and 0.33 (0.32) using the multi-compartment model. Although the gain is smaller than in the first simulation scenario, the proposed method still shows a superiority of 3 to 7%. Similar performance was observed when α was sampled from a uniform [−4, 4].
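The percentage gains quoted above can be checked against the accuracies reported in the abstract. The short sketch below assumes that "increase" means the relative gain of the multi-compartment model over the pooled data model, (multi − pooled) / pooled; the dictionary labels and the function name `relative_gain` are illustrative and not taken from the paper.

```python
# Accuracies quoted in the abstract as (pooled data model, multi-compartment model).
reported = {
    "scenario 1, line A, U[-2, 2]": (0.40, 0.47),
    "scenario 1, line A, U[-4, 4]": (0.39, 0.46),
    "scenario 1, line A, U[-8, 8]": (0.38, 0.46),
    "scenario 2, line A, U[-2, 2]": (0.32, 0.33),
    "scenario 2, line B, U[-2, 2]": (0.30, 0.32),
}


def relative_gain(pooled, multi):
    """Relative gain in accuracy, in percent (assumed definition)."""
    return 100.0 * (multi - pooled) / pooled


gains = {label: relative_gain(*pair) for label, pair in reported.items()}
for label, gain in gains.items():
    print(f"{label}: {gain:.1f}%")
```

Under this definition the line A gains in the first scenario start at about 17%, which agrees with the lower end of the reported range, and the second-scenario gains fall at roughly 3% and 7%. The upper end of the 17 to 47% range refers to line B in the first scenario, whose accuracies are not given in the abstract, so it cannot be reproduced here.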

Introduction

One key benefit of genomic selection is a more accurate selection of animals that inherited genes or chromosome segments of superior merit (Meuwissen et al., 2001). In dairy cattle, for example, the accuracies of genomically estimated breeding values (GEBVs) are 30 to 70% higher than those of their counterparts obtained using the classical BLUP approach (VanRaden et al., 2009, Harris and Johnson, 2010, Su et al., 2012). Additionally, genomic selection allows for a significant reduction of the generation interval, as young animals can be genomically evaluated at birth or even before, thus reducing or even eliminating the need to wait several years (depending on the species and the trait) until enough phenotypic information is collected and a reliable genetic evaluation is conducted. Thus, it is not surprising that genomic selection is quickly becoming the method of choice for genetic evaluation, encouraged by the continuous decrease in genotyping costs despite the substantial increase in the density of commercial single nucleotide polymorphism (SNP) marker panels. Currently, genome wide association studies and genomic selection are often conducted using purebred populations. Estimation and often validation of SNP effects are carried out using a select elite set of purebred animals (i.e. proven sires). This process is successful when estimated SNP effects are used to predict genomic breeding values of animals of the same breed. However, when these SNP estimates are used for genomic prediction in other breeds or crossbred animals, the approach fails to varying degrees depending on the genetic similarity between the breeds in the mixture (Pryce et al., 2011). Unfortunately, this situation is not rare in several segments of the livestock industry (beef cattle, swine or poultry) where the traits of interest are measured in crossbred or mixed populations with uncertain breed composition (Toosi et al., 2009).
The main reasons that genomic selection is less successful when predicting genetic merit in admixed or crossbred populations are the change in linkage disequilibrium (LD) between markers and QTLs, the inconsistency of linkage phase across subpopulations, and the variation in allele frequencies between breeds (De Roos et al., 2008, Kizilkaya et al., 2010, Thomasen et al., 2012).

The accuracy of genome wide evaluation methods depends crucially on the extent of LD between markers and QTLs as well as on the size of the reference population (Daetwyler et al., 2008, De Roos et al., 2009, Goddard, 2009, Lund et al., 2011, Brøndum et al., 2011). A large enough reference population is not always available, especially for breeds with a limited number of genotyped and phenotyped animals (VanRaden et al., 2009, Hayes et al., 2009, Thomasen et al., 2014). Thomasen et al. (2014) showed the negative impact of small reference data sets on the reliability of genomic predictions. To deal with these limitations, one plausible solution is to pool data from different breeds, thus creating a large enough reference population. Although this approach will resolve or at least alleviate the lack of power due to the limited size of the training population, it intrinsically assumes that the SNP effects (and indirectly the QTL effects) are constant across all breeds in the admixed population. Several simulation and real data studies have been conducted to evaluate the adequacy of different pooling strategies for the training and validation sets (Toosi et al., 2009, Daetwyler et al., 2012, Olson et al., 2012, Hoze et al., 2014). Their results were mixed and even contradictory. In general, prediction accuracy increased when the subpopulations were genetically close and decreased as the genetic distance between the components of the admixed population increased (Daetwyler et al., 2012, Wientjes et al., 2013, De Los Campos et al., 2013). More recently, Kachman et al. (2013) showed that using a multi-breed training population did not increase prediction accuracies compared to single breed analysis when a reasonable number of animals is available in each breed. However, prediction accuracy increased for breeds with a small number of genotyped animals.

Given the limitations of the pooled data approach, several other methods have been proposed. These methods can be clustered into two broad groups based on how they accommodate differences between breeds: either through the SNP effects or through the genomic relationship matrix. Ibanez-Escriche et al. (2009) proposed a method where marker effects were estimated based on their population of origin. Unfortunately, this method was not successful for high density SNP panels, in the sense that accounting for breed specific marker effects did not improve prediction accuracy. Karoui et al. (2012) proposed using a multi-breed training population that accounts for the difference in genetic correlations between breeds. Their results showed little to no increase in accuracy compared to the classical data pooling approach. Through modifications to the genomic relationship matrix, Harris and Johnson (2010) proposed a generalization of the regression technique used to derive the relationship matrix, and Makgahlela et al. (2012) adopted a random regression type approach that accounts for breed proportions in the population, which performed slightly better than models ignoring breed-specific effects. In plant breeding, Schulz-Streeck et al. (2012) proposed a model that combines marker main effects that are consistent across sub-populations with population-specific marker effects. Although their results in general showed a slight increase in accuracy for the population specific marker model compared to the main marker effects model, there were some cases in which the population specific model performed worse.

It is clear that current approaches for dealing with admixed and crossbred populations in genomic selection are far from providing a global answer to this relevant issue. Their results are data dependent and could lead to a reduction in accuracies for animals in the purebred populations. The objective of this study is to develop a model where the effect of an SNP (and indirectly of a QTL) can differ between breeds or lines and is parameterized as a function of its effect in one of the breeds in the pooled population through a one-to-one mapping function.
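To make the simulation design described in the abstract concrete, the following minimal sketch generates two lines in which the marker effects of line B equal α times those of line A, then fits a pooled model that ignores this divergence. Many details are assumptions made only for illustration and are not taken from the paper: one α per SNP, normally distributed effects in line A, allele frequencies drawn independently per line, the sample sizes, the training/validation split, and the penalty `lam`. Ridge regression stands in for the paper's hierarchical Bayesian model, and accuracy is taken to be the correlation between predicted and true breeding values.

```python
import numpy as np


def simulate_two_lines(n_per_line=300, n_snps=300, half_width=2.0, h2=0.3, seed=0):
    """Simulate genotypes, line-specific marker effects, and phenotypes."""
    rng = np.random.default_rng(seed)
    effects_a = rng.normal(0.0, 1.0, n_snps)              # assumed effect distribution
    alpha = rng.uniform(-half_width, half_width, n_snps)  # assumed: one alpha per SNP
    effects = {"A": effects_a, "B": alpha * effects_a}    # line B = alpha * line A
    out = {"alpha": alpha}
    for line in ("A", "B"):
        freq = rng.uniform(0.1, 0.9, n_snps)              # assumed allele frequencies
        geno = rng.binomial(2, freq, size=(n_per_line, n_snps)).astype(float)
        tbv = geno @ effects[line]                        # true breeding values
        var_e = tbv.var() * (1.0 - h2) / h2               # residual variance for target h2
        pheno = tbv + rng.normal(0.0, np.sqrt(var_e), n_per_line)
        out[line] = {"geno": geno, "tbv": tbv, "pheno": pheno, "effects": effects[line]}
    return out


def ridge_effects(geno, pheno, lam):
    """Ridge estimate of marker effects on centered data (stand-in estimator)."""
    x = geno - geno.mean(axis=0)
    y = pheno - pheno.mean()
    return np.linalg.solve(x.T @ x + lam * np.eye(x.shape[1]), x.T @ y)


def accuracy(predicted, tbv):
    """Correlation between predicted and true breeding values."""
    return float(np.corrcoef(predicted, tbv)[0, 1])


sim = simulate_two_lines()
n_train = 200                                             # assumed split: 200 train, 100 validate
pooled_geno = np.vstack([sim[line]["geno"][:n_train] for line in ("A", "B")])
pooled_pheno = np.concatenate([sim[line]["pheno"][:n_train] for line in ("A", "B")])
pooled_hat = ridge_effects(pooled_geno, pooled_pheno, lam=100.0)

pooled_accuracy = {}
for line in ("A", "B"):
    pred = sim[line]["geno"][n_train:] @ pooled_hat
    pooled_accuracy[line] = accuracy(pred, sim[line]["tbv"][n_train:])
    print(f"pooled model, line {line}: accuracy = {pooled_accuracy[line]:.2f}")
```

Because the pooled estimator fits a single effect per SNP, it cannot represent markers whose effect changes sign or magnitude between lines, which is the limitation the multi-compartment model is designed to address. The numbers this sketch prints depend on the assumed settings and are not comparable to the accuracies reported in the paper.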


Material and methods

SNP effects often change between breeds or crossbred groups due to several factors, including changes in minor allele frequency, in the strength of LD between markers and QTLs, and in the linkage phase between marker and QTL alleles. Hereafter we will refer to breeds or crossbred groups (F1, F2, etc.) in an admixed population as lines. Our idea is based on the possibility of inferring the change in SNP effects between lines as a function of their genetic similarity. Our hypothesis is that the genetic similarity

Results and discussion

To establish a base for comparison, the two lines were analyzed separately. When training and validation were conducted within line, accuracy for lines A and B, respectively, was 0.47 and 0.29 when heritability was equal to 0.3, and 0.67 and 0.40 when heritability was equal to 0.50. When validation was conducted in the line that was not used in the training, the accuracy was −0.01 for line B and from −0.16 to −0.06 for line A (Table 1) when α was sampled from uniform [−2, 2] and similar results

Conclusions

Pooling data from lines or breeds in the training set when conducting genome wide evaluation studies seems an attractive approach since it benefits from the increase in power. However, its performance is variable and depends largely on the genetic similarity between the sub-populations in the mixture. When the sub-populations are very close genetically, the pooled data approach, even in its basic form, will result in an increase in accuracies, especially for the lines with limited recording. As the

Conflict of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that

References (27)

  • A.P.W. De Roos et al. Reliability of genomic predictions across multiple populations. Genetics (2009)

  • A.P.W. De Roos et al. Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics (2008)

  • M.E. Goddard. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica (2009)