Using Machine Learning To Improve the Accuracy of Genomic Prediction on Reproduction Traits in Pigs

Background: Machine learning (ML) has recently become attractive for genomic prediction, but its superiority over conventional methods and the choice of the optimal ML method require further investigation.

Results: In this study, 2566 Chinese Yorkshire pigs with reproduction trait records were used; they were genotyped with the GenoBaits Porcine SNP 50K and PorcineSNP50 panels. Four ML methods were implemented: support vector regression (SVR), kernel ridge regression (KRR), random forest (RF) and Adaboost.R2. The genomic prediction abilities of the ML methods were explored through 20 replicates of five-fold cross-validation. Our results indicated that the ML methods significantly outperformed genomic BLUP (GBLUP), single-step GBLUP (ssGBLUP) and the Bayesian method BayesHE. The prediction accuracy of the ML methods was improved by 19.3%, 15.0% and 20.8% on average over GBLUP, ssGBLUP and BayesHE, ranging from 8.9% to 24.0%, 7.6% to 17.5% and 11.1% to 24.6%, respectively. In addition, the ML methods yielded smaller mean squared error (MSE) and mean absolute error (MAE) in all scenarios. ssGBLUP yielded an average improvement of 3.7% over GBLUP, and the performance of BayesHE was close to that of GBLUP. Among the four ML methods, SVR and KRR had the most robust prediction abilities, yielding higher accuracies, lower bias, lower MSE and MAE, and computing efficiency comparable to GBLUP. RF showed the lowest prediction ability and computational efficiency among the ML methods.

Conclusion: Our findings demonstrate that ML methods are more efficient than traditional genomic selection methods and could be new options for genomic prediction.

…more accurately estimate the genomic breeding values.

Driven by applications in intelligent robots, self-driving cars, automatic translation, face recognition, artificial-intelligence games and medical services, machine learning (ML) has gained considerable attention in the past decade. Several characteristics of ML methods make them potentially attractive for dealing with high-order non-linear relationships in high-dimensional genomic data: they allow the number of variables to be larger than the sample size [13], they can capture the hidden relationship between genotype and phenotype in an adaptive manner, and they impose little or no specific distributional assumption on the predictor variables, unlike GBLUP and Bayesian methods [14,15].

Studies have shown that random forest (RF), support vector regression (SVR), kernel ridge regression (KRR) and other machine learning methods gained an advantage over GBLUP, BayesB and related models [16-18]. Ornella et al. compared […] with other ML methods and linear models in the genomic prediction of the rust resistance of wheat [18]. Additionally, ML algorithms have been widely used in gene screening, genotype imputation, and protein structure and function prediction [20-23], demonstrating their superiority as well. However, one challenge for […]

[…] NBA were both 0.12.

Derivation of corrected phenotypes

In order to avoid double counting of parental information, corrected phenotypes (yc) derived from the estimated breeding values (EBV) were used as the response variable in genomic prediction. Pedigree-based BLUP with a single-trait repeatability model was performed to estimate the breeding values for each trait separately:

y = Xb + Za + Wpe + e

where y is the vector of phenotypic records; b is the vector of fixed effects, including herd-year-season; a is the vector of additive genetic effects, following a normal distribution N(0, Aσ²a), where A is the pedigree-based relationship matrix and σ²a is the additive genetic variance; pe is the vector of permanent environmental effects with normal distribution N(0, Iσ²pe), where σ²pe is the permanent environmental variance; e is the vector of random errors, following N(0, Iσ²e), where σ²e represents the residual variance; and X, Z and W are the incidence matrices linking b, a and pe to y. A total of 3893 individuals were traced to construct the A matrix. Their EBVs were calculated using the DMUAI procedure of the DMU software [24]. The yc were calculated as the EBV plus the average estimated residual over the multiple parities of a sow, following Guo et al. [25].

Support vector regression (SVR)

[…] to deal with regression problems. The model formulation of SVR can be expressed as:

f(x) = ω′h(x) + b₀

in which h(x) is the kernel function, ω is the vector of weights, and b₀ is the bias.

Generally, the formalized SVR is given by minimizing the following restricted loss function:

min ½‖ω‖² + C Σᵢ₌₁ⁿ (ξᵢ + ξᵢ*)

subject to yᵢ − ω′h(xᵢ) − b₀ ≤ ε + ξᵢ, ω′h(xᵢ) + b₀ − yᵢ ≤ ε + ξᵢ*, and ξᵢ, ξᵢ* ≥ 0, in which C is the regularization constant, ε is the width of the insensitive band, and ξᵢ, ξᵢ* are slack variables.

Kernel ridge regression (KRR)

[…] The KRR prediction for a test sample x can be written as:

ŷ(x) = k′(K + λI)⁻¹y (7)

where K is the kernel matrix of the training samples, λ is the regularization parameter, I is the identity matrix, and k′ is the vector with elements kᵢ = k(xᵢ, x), i = 1, 2, 3, …, n, where n is the number of training samples and x is the test sample. In the expanded form, ŷ(x) = Σᵢ₌₁ⁿ αᵢk(xᵢ, x) with α = (K + λI)⁻¹y. The grid search was used to find the most suitable kernel function and λ in this study, and an internal 5-fold CV strategy was used for tuning the hyper-parameters.
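As a concrete illustration of this tuning step, the following is a minimal sketch using scikit-learn (the Sklearn package, V0.22, is the implementation reported later in this section). The candidate kernels and grid values are illustrative assumptions, not the authors' exact search space; the hyper-parameters actually selected are in Table S1. The last lines numerically check the KRR predictor of Eq. (7) against sklearn's KernelRidge.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics.pairwise import pairwise_kernels

# Placeholder data: n x p SNP genotypes coded 0/1/2 and corrected phenotypes yc
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 500)).astype(float)
y = rng.standard_normal(200)

# SVR: grid search over kernel and regularization constant C, internal 5-fold CV
svr_grid = GridSearchCV(
    SVR(),
    param_grid={"kernel": ["linear", "poly", "rbf", "sigmoid"],
                "C": [0.1, 1.0, 10.0]},
    cv=5, scoring="neg_mean_squared_error")
svr_grid.fit(X, y)

# KRR: grid search over kernel and lambda (called alpha in sklearn)
krr_grid = GridSearchCV(
    KernelRidge(),
    param_grid={"kernel": ["linear", "polynomial", "rbf", "sigmoid"],
                "alpha": [0.01, 0.1, 1.0]},
    cv=5, scoring="neg_mean_squared_error")
krr_grid.fit(X, y)
print(svr_grid.best_params_, krr_grid.best_params_)

# Numeric check of Eq. (7), y_hat(x) = k'(K + lambda*I)^-1 y, with a linear kernel
lam = 0.1
K = pairwise_kernels(X, X, metric="linear")       # n x n training kernel matrix
k = pairwise_kernels(X[:1], X, metric="linear")   # kernels between test and train
y_hat = k @ np.linalg.solve(K + lam * np.eye(len(X)), y)
krr = KernelRidge(kernel="linear", alpha=lam).fit(X, y)
assert np.allclose(y_hat, krr.predict(X[:1]))
```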

Adaboost.R2

Adaboost.R2 [35] is an ad hoc modification of Adaboost.R and an extension of Adaboost.M2 to regression problems. It repeatedly applies a weak learner (typically a regression tree), increasing the weights of incorrectly predicted samples and decreasing the weights of correctly predicted samples at each round. It builds a "committee" by integrating multiple weak learners, making its predictions better than those of the individual weak learners [36]. After the t-th weak learner fₜ(x) is trained, the weight distribution Dₜ(i) becomes Dₜ₊₁(i):

Dₜ₊₁(i) = Dₜ(i)βₜ^(1−Lᵢ) / Zₜ

in which Lᵢ is the loss of training sample i, βₜ = L̄ₜ/(1 − L̄ₜ) with L̄ₜ the average loss in round t, and Zₜ is a normalization factor chosen such that Dₜ₊₁(i) will be a distribution.
In the current study, SVR and KRR were respectively used as the weak learners of Adaboost.R2.
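A minimal sketch of this setup, assuming sklearn's AdaBoostRegressor (which implements the Adaboost.R2 algorithm) is the implementation used; in version 0.22 the weak learner is passed via base_estimator. The hyper-parameter values here are illustrative placeholders, not those selected in the study (Table S1).

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge

# Adaboost.R2 with SVR as the weak learner (Adaboost.R2_(SVR))
ada_svr = AdaBoostRegressor(base_estimator=SVR(kernel="rbf", C=1.0),
                            n_estimators=50, loss="linear", random_state=0)

# Adaboost.R2 with KRR as the weak learner (Adaboost.R2_(KRR))
ada_krr = AdaBoostRegressor(base_estimator=KernelRidge(kernel="rbf", alpha=0.1),
                            n_estimators=50, loss="linear", random_state=0)

# ada_svr.fit(X, y); ada_krr.fit(X, y)   # X, y as in the earlier sketch
```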

For these four ML methods, the vectors of genotypes (coded as 0, 1, 2) were used as the input independent variables and yc as the response variable, and the Sklearn package for Python (V0.22) was used for genomic prediction.
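The evaluation protocol (20 replicates of five-fold cross-validation, scored by predictive accuracy, MSE and MAE) can be sketched as below. This is an assumed reading of the protocol: accuracy is taken here as the Pearson correlation between predictions and yc in the validation fold, and all names are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error

def evaluate(model, X, y, n_replicates=20, n_splits=5, seed=0):
    """Repeated k-fold CV returning mean accuracy (correlation), MSE and MAE."""
    acc, mse, mae = [], [], []
    for rep in range(n_replicates):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in kf.split(X):
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            acc.append(np.corrcoef(pred, y[test_idx])[0, 1])
            mse.append(mean_squared_error(y[test_idx], pred))
            mae.append(mean_absolute_error(y[test_idx], pred))
    return np.mean(acc), np.mean(mse), np.mean(mae)

# Example with the tuned SVR from the grid search sketched earlier:
# print(evaluate(svr_grid.best_estimator_, X, y))
```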

Meanwhile, the optimal hyper-parameters for SVR, KRR, RF and Adaboost.R2 according to the grid search are shown in Table S1. On the other hand, mean squared error (MSE) and mean absolute error (MAE) were also used to assess the performance of the different methods. As shown in […] This was also confirmed by other studies: a certain improvement can be achieved by adding a smaller reference population for traits with medium or high heritability [2,44]. (I) […] GBLUP except for RF in TNB, with an improvement from 0.5% to 8.2% (Table S2).

(II) ML methods can handle numbers of parameters larger than the sample size, which is very efficient with high-density genetic markers in GS [45]. (III) ML methods make no distribution assumptions about the genetic determinism underlying the trait, enabling them to capture possible non-linear relationships between genotype and phenotype in a flexible way [45]. This differs from GBLUP and Bayesian methods, which assume that all marker effects follow the same normal distribution or apply different classes of shrinkage to different SNP effects. In addition, ML methods can take the correlation and interaction of markers into account, while linear models based on pedigree and genomic relationships may not provide a sufficient approximation of the genetic signals generated by complex genetic systems [14]. Consequently, when traits are affected by non-additive effects, especially epistasis,

ML methods can achieve more accurate predictions [23]. These properties give ML methods a large advantage over GBLUP and BayesHE, even though they use only genotyped animals.

Computing time (table fragment):
Adaboost.R2_(SVR): 1h 35min 13s; 1h 15min 28s
Adaboost.R2_(KRR): 5min 03s; 5min 16s

Figure 1 Imputation accuracy. Imputation accuracy of GenoBaits Porcine SNP 50K to PorcineSNP50 BeadChip at different minor allele frequency (MAF) intervals (a) and chromosomes (b). DR2, the estimated squared correlation between the estimated allele dose and the true allele dose; genotype concordance rate (CR), the ratio of correctly imputed genotypes; genotype correlation (COR), the correlation coefficient between the imputed variants and the true variants.

Supplementary Files
Supplymentary.docx