Comparison of three statistical approaches for feature selection for fine-scale genetic population assignment in four pig breeds

  • Regular Articles
  • Tropical Animal Health and Production

Abstract

Background

Assigning animals to their breeds using breed-informative single-nucleotide polymorphisms (SNPs) is required in many fields, for instance in the traceability and authentication of meat and other livestock products. SNP information for several pig breeds is now accessible thanks to dense SNP chips, which cover a large number of molecular markers distributed across the entire genome. To identify the pig breed from a sample of industrial meat, one must analyze a large panel of genetic markers, whose size depends on the SNP chip used. Analyzing such large datasets is labor-intensive. This motivates the design of less dense chips built from a reduced number of breed-informative SNPs, so that analyzing the resulting genotyping data requires less time and effort.

Aim

The objective of this study is to find the most informative SNPs for the discrimination between four pig breeds, namely Duroc, Landrace, Large White, and Pietrain.

Method

The Illumina Porcine 60 k SNP chip was used to genotype SNPs distributed across the genome of each individual. First, we applied three statistical approaches for feature selection: (i) principal component analysis (PCA), (ii) least absolute shrinkage and selection operator (LASSO), and (iii) random forest (RF). Each approach identified its own set of SNPs. We then combined the results of the three methods into a final panel containing only the SNPs that appear in all three sets.
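
The combination step described above amounts to a set intersection across the three panels. The following is a minimal sketch of that idea only, not the authors' pipeline; the function name `consensus_panel` and the SNP identifiers are hypothetical and chosen purely for illustration.

```python
from functools import reduce


def consensus_panel(*panels):
    """Return the sorted SNP identifiers that appear in every panel."""
    if not panels:
        return []
    shared = reduce(set.intersection, (set(panel) for panel in panels))
    return sorted(shared)


# Hypothetical SNP identifiers, not taken from the study.
pca_panel = ["snp_a", "snp_b", "snp_c", "snp_d"]
lasso_panel = ["snp_b", "snp_c", "snp_e"]
boruta_panel = ["snp_c", "snp_b", "snp_f"]

final_panel = consensus_panel(pca_panel, lasso_panel, boruta_panel)
print(final_panel)
```

Sorting the result keeps the final panel in a stable order regardless of the order in which each method reported its SNPs.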

Results

Separately, each method produced a panel of its most discriminating SNPs. PCA, LASSO, and random forest with the Boruta algorithm highlighted 28,816, 50, and 286 SNPs, respectively. The number of SNPs selected by PCA is high compared with Boruta and LASSO because PCA retains variables so as to preserve as much information about the data as possible. A downside of LASSO regression is that, among a group of correlated variables, it tends to select only one and ignore the others regardless of their importance. In contrast, the Boruta algorithm considers the interdependence between SNPs and selects informative variables even if they are correlated and have the same effect. The three panels shared 23 SNPs; the distribution of the individuals according to these SNPs grouped each breed into a well-defined cluster, with no overlap between clusters.
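
The tendency of LASSO to keep one variable from a correlated group can be seen on a toy problem. The sketch below is an assumption-laden illustration rather than the model used in the study: it fits a plain linear-regression LASSO by cyclic coordinate descent (the study's breed classification would call for a classification variant), and the two predictors and the response are hypothetical numbers, not genotypes. The second predictor is a scaled copy of the first, so either one could explain the response.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(columns, y, penalty, n_iter=50):
    """Minimise (1/(2n)) * ||y - Xw||^2 + penalty * ||w||_1 by cyclic updates."""
    n = len(y)
    weights = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Partial residual: y minus the contribution of every other feature.
            partial = [
                y[i] - sum(weights[k] * columns[k][i]
                           for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(col[i] * partial[i] for i in range(n)) / n
            scale = sum(v * v for v in col) / n
            weights[j] = soft_threshold(rho, penalty) / scale
    return weights


# Hypothetical centred predictors: x2 is perfectly correlated with x1.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [0.9 * v for v in x1]
y = [3.0 * v for v in x1]

weights = lasso_coordinate_descent([x1, x2], y, penalty=0.1)
print(weights)  # x1 receives a non-zero weight; x2 is dropped (weight 0.0)
```

Once the first predictor has absorbed the signal, the remaining correlation between the second predictor and the residual falls below the penalty, so its weight is thresholded to zero even though it carries the same information.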

Conclusions

The biological pathways represented by the 23 breed-informative SNPs identified by combining PCA, LASSO, and Boruta should be explored in further analyses. The results of our study are promising for applying this method to other livestock species.

Data availability

Data will be made available from the corresponding author on reasonable request.

Code availability

The manuscript does not contain any software application or custom code.

Author information

Contributions

BB and SB conceived and designed the research. IH analyzed the data. MA, BB, SB and IH interpreted the results. IH and BB wrote the manuscript. All authors read and approved the manuscript.

Corresponding author

Correspondence to Bouabid Badaoui.

Ethics declarations

Ethics approval

The manuscript does not contain clinical studies or patient data.

Consent to participate

All the authors approved the final manuscript.

Consent for publication

All the authors consented to the publication of the final manuscript.

About this article

Cite this article

Hayah, I., Ababou, M., Botti, S. et al. Comparison of three statistical approaches for feature selection for fine-scale genetic population assignment in four pig breeds. Trop Anim Health Prod 53, 395 (2021). https://doi.org/10.1007/s11250-021-02824-x

  • DOI: https://doi.org/10.1007/s11250-021-02824-x
