Differentiation of the domestic pig and wild boar using genotyping-by-sequencing

,

For many reasons, pigs are one of the most important livestock species for humans. First, pork is a major source of protein in global animal production. In addition, the pig is an excellent animal model to study the molecular components and processes of some human diseases including cancer (Groenen et al. 2012).
Food adulteration practices commonly include the substitution of wild boar meat products with widely accessible and cheaper, but poorer quality, pig meat (Koutsogiannouli et al. 2010). Therefore, the differentiation of both the pig (Sus scrofa f. domestica) and the wild boar (Sus scrofa scrofa) is of huge importance for the identification and differentiate between a species and subspecies (Mary et al. 2022). Rubin et al. (2012) used a whole-genome resequencing approach. They identified 33 selective sweeps localised on 18 loci covering the NR6A1, PLAG1, LCORL, and OSTN genes. All these genes have a known effect on the phenotype and are potential marker candidates. Another technology, based on a Porcine 60K SNP chip, was used to study the diversity of European pig breeds and to differentiate between domestic pigs and wild boars (Wilkinson et al. 2013). The genomes mostly differed in the regions covering the genes for bone formation (on SSC1), growth (on SSC7) and fat deposition (on SSC7 and SSCX). However, the ability to genetically distinguish between phylogenetically close taxa like pigs and wild boar with a proven interspecies gene flow needs further elucidation.
Although microarrays are the most commonly used technology in genomic studies for small projects, their use is financially unjustified. A microarray study also requires details on the SNP type and the surrounding DNA sequence. For studying a poorly annotated species, the discovery of genome-wide SNPs and population-wide genotyping, genotypingby-sequencing (GBS) could be an alternative to microarrays (Gurgul et al. 2019).
Genotyping-by-sequencing (GBS), which was originally developed for use in plants (crops), is a relatively simple approach that is applied in population genomic studies. GBS is both versatile and can be used for any species. It is based on the high-throughput, next generation sequencing (NGS) of genomic loci that are targeted by restriction enzymes (Elshire et al. 2011). The technique involves a digestion of the genomic DNA with one of the available frequently-cutting endonucleases and, following the adapters ligation, the sequencing of the obtained short DNA fragments using NGS. This approach reduces the target genome regions for NGS, thus allowing for an analysis of multiple samples (multiplexing abilities) in a single sequencing run. The GBS technique can be used to identify many markers with a relatively low error rate, and the detected polymorphisms are distributed throughout the whole genome. Another advantage of the technique is that it allows for the simultaneous discovery and genotyping of SNPs within a large number of samples. These features facilitate different types of genome-wide analyses such as relatedness studies, genome-wide association studies and genomic selection (Kim et al. 2016), as well as population studies, including among other prevention of food fraud cases. A reliable method for species identification is needed by consumers, food and animal feed producers, as well as by the prosecuting authorities (Lorenzini et al. 2020). Molecular tests are commercially available that identify animal components in human food, as well as in animal feed. Based on PCR techniques (e.g. RFLP, Real-Time) it is possible to identify the DNA of cattle, horses and poultry in food and in feed for animals (Natonek-Wiśniewska et al. 2013;Safdar et al. 2014). Nonetheless, although a porcine component can be identified in a simple manner, the differentiation between a wild boar and a domestic pig in meat products remains challenging (Koutsogiannouli et al. 2010;Lorenzini et al. 2020;Kaltebrunner et. al. 2019Kaltebrunner et. al. , 2020. Fajardo et al. 2008 conducted studies for the first time that were aimed at the differentiation of wild boars and commercial pig breeds based on combining both nucleus (melanocortin 1 receptor, MC1R) and mitochondrial DNA (D-loop). The analysis of the MC1R gene proved to be more effective than the most common method used for species identification based on the polymorphism of mitochondrial DNA (Fajardo et al. 2008). The NR6A1 gene has also been studied, which is a strong candidate gene for a vertebrae QTL. A proline to leucine substitution at NR6A1 codon 192 (p.Pro192Leu) resulted in an increased number of vertebrae in commercial pig breeds (Mikawa et al. 2007). A method of differentiating pigs and wild boars was also tested using short tandem repeats (STRs), a marker type commonly used in parentage verification and population studies (Radko et al. 2021). The STRs alone (Rębala et al. 2016), combined with MC1R polymorphism (Nikolov et al. 2017) and MC1R, NR6A1 or mitochondrial D-loop polymorphism, revealed the hybridisation that exists between pigs and wild boars (Lorenzini et al. 2020;Scandura et al. 2008). Szemethy et al. 2021 tested another approach based on identifying the insertions or deletions (InDels) specific to wild boars. In their case, the bioinformatics pipeline indicated three InDels to marker development during the performed routine genetic testing.
Introducing single nucleotide polymorphism (SNP) based molecular techniques to (sub-)species discrimination has many benefits. The most important of them is the significant number of markers (up to tens of thousands) that can be analysed for one individual. SNP-based molecular techniques have been implemented to seek the SNP markers which, under certain conditions (sufficient genotyping density and relevant reference populations), can use of GBS Barcode Generator software. Before the library preparation process, the ligation-ready double-stranded adapters were prepared, i.e. the oligonucleotides (forward and reverse complements) were synthesised with unique barcodes (4-9 nt long) and annealed with a common adapter. The laboratory workflow covered the following steps: overnight incubation of the genomic DNA (200 ng in total) with a tenfold PstI-HF enzyme excess (100,000 U/ml, New England Biolabs, USA) at 37°C; digestion and ligation of the products at 22°C for 60 min with 4.8 ng of each adapter, using the T4 DNA ligase (400 U/µl; New England Biolabs, USA); cleanup step; amplification of the libraries by PCR in 18 cycles, using universal primers (12.5 pmol/µl each) and 2x Taq Master Mix (New England Biolabs, USA); purification of the amplified libraries; quality assessment using Agilent's TapeStaion2200 system (Agilent Technologies, USA) and quantification using a DNA binding dye (Qubit DS DNA assay; Thermo Fisher Scientific, USA). The oligonucleotide sequences used here were the same as the ones used in our previous study (Gurgul et al. 2019).
The library pools were ultimately sequenced in a single-end 50-bp run on the HiScanSQ system (Illumina, USA) using the TruSeq v3 HiSeq flowcell and v3 sequencing reagents (Illumina, USA). The raw reads were deposited in the NCBI Sequence Read Archive (SRA) database under the BioProject Number ID PRJNA658252.

SNPs calling and filtering
The quality of the generated sequencing reads was assessed with the use of FastQC software. The subsequent analysis of the reads was done using the TASSEL 5 GBS v2 Pipeline (Glaubitz et al. 2014) with the default parameters, except maximum kmer Length that was set to 40 and the minimum count of reads for a tag to be output that was set to 5. The pig Sscrofa11.1 genome assembly was used as a reference for the TASSEL pipeline including the read tags mapping with a Bowtie2 aligner (Langmead & Salzberg 2012). Polymorphisms resulting from the default TASSEL GBS v2 production pipeline were initially filtered as following: indels and multiallelic markers were removed; individual genotypes were removed if the obtained coverage was low (< 5 reads); and finally, SNPs located on the X and Y chromosomes were removed due to a lack of knowledge of the wild boars' sexes.
We also removed from the analysis markers with a minor allele frequency (MAF) lower than 0.01, missing genotypes in more than 40% of the things the identification of the selection signatures (Baazaoui et al. 2020;Wang et al. 2018). However, the analysis of GBS-derived SNP datasets is complicated by frequent cases of missing genotypes. This is especially relevant in highly heterozygous organisms, due to a dependence on the read depth to correctly identify both alleles (Dodds et al. 2015).
For the first time, we used a GBS approach to study the genetic differentiation of wild boars and domestic pigs in Poland. The aim of the research is to identify the DNA regions that underwent strong selection during the domestication of the pig and to give an insight into the genetic diversity of the Polish wild boar and domestic pigs. The study also focuses on the identification of admixed individuals in both subspecies.

Material
The material for the study was genomic DNA extracted with the use of the Sherlock AX Kit for DNA purification (A&A Biotechnology, Poland). The DNA was extracted from the ear tissues of 20 wild boars (WB) from the South region of Poland, as well as from the blood samples of unrelated and randomly selected pigs of the following breeds: Polish Landrace (PBZ, n=2), Large White Polish (WBP, n=4), Duroc (DUR, n=1), Puławska (PUL, n=3) and Pietrain (n=2). The wild boars were legally hunted in accordance with the national regulations during the 2009-2011 hunting seasons. The sample collection was performed by licensed hunters as a part of routine wildlife management measures. Blood samples from the pigs were taken by the representatives of the Polish Pig Breeders Association, "POLSUS", and were collected in accordance with the routine parentage controls conducted at the National Research Institute of Animal Production in Balice, Poland; therefore, no permission from the local ethics commission was needed.

Library preparation and sequencing
We prepared the GBS libraries according to the previously described protocol (Elshire et al. 2011) with minor modifications. In short, the PstI endonuclease was used for the DNA digestion and ligation of 48 indexed adapters designed with the which allowed us to account for a random locus-bylocus variation in the alleles frequency. Subsequently, the FST-windows falling into the top 1% of the highest FST observations were analysed in detail for their gene content, in order to identify the genes that were strongly affected by the wild boar domestication and the artificial selection of pigs. The genes within regions with the highest FST were identified using the UCSC Genome Browser (Haeussler et al. 2019) and were analysed in terms of their functions using Kobas3.0 (Xie et al. 2011). The overrepresentation tests in the separate GO, KEEG and Reactome pathways categories were conducted with respect to all known pig genes using Fisher's exact test and the FDR (Benjamini & Hochberg 1995) correction for multiple testing. Some annotation analyses were completed using the PANTHER Classification System (Mi et al. 2013).

Sequencing read statistics
The sequencing of the libraries with the HiScanSQ System generated a total of 84.3 million (M) raw reads for all the animals. Of the reads, 36,285,567 (1.3 M on an average, per sample) had a proper barcode and PstI cut site. These reads were collapsed into 463,756 different sequence tags matching the reference genome. Finally, an average of about 1.1 M reads were mapped to the identified sequence tags and were used for SNP calling in the separate samples (Table 1).

SNPs panel characteristics
The initial TASSEL GBS pipeline allowed for the identification of 29,647 variants common for the domestic pigs and wild boars. Filtering of the markers for the average read depth and variants character (keeping only the biallelic SNPs) resulted in 7,785 SNPs, which were additionally evaluated individuals and those deviating from the Hardy-Weinberg equilibrium with a Chi-Square test p-value<0.000001. Moreover, the individuals with a more than 40% missing genotypes (three) were also removed. All these filtering steps were done using PLINK (Purcell et al. 2007) or TASSEL (Bradbury et al. 2007) software.

Phylogenetic and admixture analysis
The analysis of the genetic distance between the individuals was done in a similar way as in our previous work (Gurgul et al. 2019). MEGAX software (Kumar et al. 2018) was used to construct the phylogenetic tree constructed using the maximum likelihood (ML) method, along with subsequent bootstrapping. The additional visualisation of the population differentiation was done by a multidimensional scaling (MDS) applied to the IBS (identity by state) distance matrix calculated using TASSEL software. To further analyse the population structure and the patterns of admixture among the pig and wild boar populations, a Structure software (Falush et al. 2003) was used and was run 10 times per each K, initially from K1 up to K6. An admixture model was used, where the parameters were as follows: 100,000 iterations and a 100,000 burn-in period. The best K for this analyse was inferred using Structure Harvester (Earl & Vonholdt 2012) software based on the Delta K parameter. Finally, CLUMPAK software was used to visualise the results (Kopelman et al. 2015).

Genetic differentiation between the populations
Apart from the global genetic differentiation, we also attempted to identify genome regions showing the largest genetic differences between the analysed domestic pigs and wild boars. To this end, based on the filtered SNPs, we calculated the SNP-by-SNP pairwise FST generic distances (Weir & Cockerham 1984). Next, a 5-SNP sliding window was applied to obtain the moving average for the pairwise FST, between the studied populations. The MDS allowed for the identification of clearly separated populations of domestic pigs and wild boars, with a visibly lower genetic variation present within the wild boar cluster. Among the domestic pigs, a visible subclustering was observed that largely corresponded to the included breeds (Figure 1). Similar observations were made when the maximum-likelihood method was applied to create a phylogenetic tree. This analysis clearly separated the two analysed populations (Figure 2). To further validate this result, the Structure software was used to fully reveal the pattern of individual admixtures. For both the groups studied, we verified our hypothesis that more than two parental populations existed (at least one for the domestic representatives and a putative two parental lines, i.e. Asian and European, for the wild boar group). Therefore, we tested different K values ranging from one to six. However, the most appropriate K=2 was established using the Delta K parameter (Suppl. Mat. 1 - Fig. 1). This analysis showed a clear subdivision of the analysed animals into two subpopulations corresponding to the domestic pig and the wild boar populations (Figure 3).
for their quality and variation parameters (MAF, percentage of genotypes called and Hardy-Weinberg equilibrium (HWE)). The final SNP panel included 7,298 markers (ranging from 20 on SSC18 to 587 on SSC6) showing satisfying polymorphism parameters and distributed across the Sscrofa11.1 genome assembly with an average inter-marker distance of 327kb (ranging from 205 to 484 kb for separate chromosomes) ( Table 2). The mean MAF was 0.266 for all the markers and for both populations. A total of 5243 SNPs were polymorphic in both populations, and 2055 SNPs were monomorphic in at least one of them. The average observed heterozygosity of the SNPs was high and reached 0.346, on average. The polymorphism parameters were similar in both the analysed populations ( Table 2). The average FIS for all individuals was 0.037, with a slightly higher value observed in the wild boars (0.044) than in the pigs (0.026).

Genetic differentiation of the pig and the wild boar captured using GBS
The final filtered panel of SNPs was used to analyse and visualise the genetic differentiation    (Jukes & Cantor 1969). The tree with the highest log likelihood (-65198,14) is shown. The percentage of trees in which the associated taxa are clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with the superior log likelihood value. The tree is drawn to scale, with the branch lengths measured according to the number of substitutions per site. B -The bootstrap consensus tree inferred from 1000 replicates is shown to represent the evolutionary history of the taxa that were analysed (Felsenstein 1985). The branches corresponding to partitions reproduced in less than 50% of the bootstrap replicates are collapsed. The percentage of replicate trees in which the associated taxa were clustered together in the bootstrap test (1000 replicates) are shown next to the branches (Felsenstein 1985). Initial tree(s) for the heuristic search were obtained automatically by applying the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with the superior log likelihood value. PBZ -Polish Landrace; WBP -Large White Polish; DUR -Duroc; PUL -Puławska; PIETR -Pietrain.
The overall global genetic differentiation of the analysed population captured by the GBS SNPs was relatively low, with mean and weighted FST values of 0.099 and 0.118, respectively, meaning that only about 10% of the global genetic variation can be attributed to the differentiation between the populations.
The overlapping top 5-SNP windows were merged and the covered genome regions were analysed in detail. This approach allowed for the detection of the genome regions with fixed genetic differences between the pig and wild boar (Suppl. Mat. 1 -Fig. 2). The regions were distributed on 17 different autosomes, with the highest number located on SSC1, SSC6, SSC9 and SSC17 (3 regions each). The size of the regions ranged from 210 kb to 4.6 Mb, with a mean of 1.7 Mb. In the overlapped regions with 50 different RefSeq genes, 48 genes were annotated (Suppl. Mat. 2). Together, the genes enriched GO terms (FDR<0.05) associated inter alia with DNA-binding transcription activator activity, RNA polymerase II-specific and the regulation of oogenesis; KEEG ps was associated with the Wnt signalling pathway, arginine and proline metabolism and pyrimidine metabolism; and the Reactome pathways connected with, e.g., interconversion of nucleotide di-and triphosphates, metabolism of nucleotides and amino acids, VEGFR2 mediated cell proliferation and many other pathways (Table 3; Suppl. Mat. 2). In addition, the gene functions were supported by an extensive literature search.  and one Landrace) had a negligible proportion of the wild boar genome. This observation is also clearly demonstrated in the phylogenetic tree. Our studies did not coincide with that research on the European population that reported the frequent hybridisation of these two subspecies (Fontanesi et al. 2014;Iacolina et al. 2018;Scandura et al. 2008). There are farms (e.g. in Bulgaria and Sardinia) where pigs are kept in semi-wild conditions, and it is highly likely that in these cases the domestic pigs sporadically crossbred with wild boars (Scandura et al. 2008). The frequent hybridisation of pigs and wild boar was found in studies on the European population (Scandura et al. 2008;Fontanesi et al. 2014;Iacolina et al. 2018), while studies of the MC1R gene in the Polish population of wild boars also proved the existence of admixed genotypes (Dzialuk et al. 2018;Babicz et al. 2013). However, these results are probably due to the limited number of samples representing wild boars, which were restricted to one region of Poland. The breeding of Polish pigs also prevents crossing between domestic pigs and wild boars.

Genetic differentiation of the pig and the wild boar
In the present study, we identified genome regions that were differentially selected between the pig and the wild boar. Within these regions, 48 different genes were identified. Three databases -the KEGG pathway, Reactome and GO terms -were further used to assign a functional significance to the identified genes determined to have the greatest differences between the two groups. A detailed analysis of the enriched GO terms allowed for the identification of genes that are associated with the axonal transport of mitochondria (NEFL, UCHL1), DNA-binding transcription activator activity, RNA polymerase II-specific (WT1, PAX6, MYOG, NOBOX, MEOX2) and the positive regulation of transcription by RNA polymerase II (WT1, NOBOX, NME2, AHR, PAX6, MAVS, MYOG). The KEEG pathway and Reactome grouped the selected genes into biological processes: interconversion of nucleotide di-and triphosphates, myogenesis, RIG-I-like receptor signalling pathway, extra-nuclear oestrogen signalling and muscle contraction. These are processes that are presumably under selection during domestication and animal breeding. Furthermore, we found several genes that were potentially under selection in early domestication, as well as in the later artificial selection of pigs. These are associated with inter alia muscle development (MYOG, MEOX2), pre-weaning mortality stress (MYO7A), immune response (CSTF3, APC) and sensory perception (TAS1R3).

Discussion
We used the GBS approach for the first time, with an aim to differentiate between closely related subspecies. GBS is a technique based on the random digestion of genomic DNA with one frequentlycutting restriction enzyme (REs) and ligating the cut site-specific adapters, including unique indexes following a common primer binding site. The selection of appropriate REs is a pivotal step in the GBS lab protocol, but it also generates the highest financial cost. The PstI enzyme was selected based on our previous experiences of farm animals as producing sufficient fragments for a genome-wide analysis (Gurgul et al. 2019). GBS takes advantage of a simplified hands-on lab protocol, which was achieved mostly by reducing the sequencing library preparation steps and simplifying the procedure for the generation of restriction fragments with barcoded adapters (Elshire et al. 2011). Finally, the time and cost were diminished by reducing the DNA purification steps and omitting DNA size selection (Gurgul et al. 2019).
While considering the GBS data analysis, the identification of the polymorphism and sample genotyping does not require a reference genome. This gives the GBS technique a broader purpose than more commonly used microarrays.

Genetic diversity and identification of hybrids and interspecies admixtures
The European and Asian wild boar populations diverged around 1 million years ago (mya), resulting in very different allele frequencies at millions of genomic loci and over a million loci representing alternative alleles (Groenen et al. 2012). While wild boars are the ancestors of modern domestic pigs, it is believed that the domestication process occurred separately for the species in Asia and Europe. Studies of the genetic diversity of wild boars and pigs can give insights into the phylogenetic background of the subspecies.
In our study, we found a low level of genetic diversity in the Polish wild boar population compared to the domestic pig population. Domestic pigs appear more divergent, as they originated from a highly diverse range of wild ancestors and were bred for several traits resulting in a large number of phenotypically diverse breeds (Yang et al. 2017). All the methods (MDS, Structure) we implemented to study the genetic structure of both subspecies proved that the analysed wild boars are not admixed with the gene pool of the domestic pigs. In the domestic pig group, two samples (one from the Duroc breed in pigs (mostly related to salty, umami and sweet tastes). Among the genes coding taste receptors, we found the one related to the sweet taste (TAS1R3) in the divergent gene regions.
Despite a limited sample number, our results proved that the GBS as a whole-genome screening technique is a useful and valuable tool for population genetic studies and for the differentiation of two closelyrelated species and their possible hybrids. A limited number of SNPs (5K SNPs) was sufficient to grasp the genetic distance between pigs and wild boars. However, the number of SNPs might be improved in further experiments regarding this subject. We revealed novel genes with different patterns of variation in domestic pigs and wild boars. The genes differentiating both subspecies were related to both breeding selection (fertility and meat efficiency) and to biological processes associated with adaptation to the altered environment (host defence and sensory perception.). The polymorphism of these genes has the potential to differentiate wild boars from domestic pigs, but further studies with more samples representing both subspecies and their hybrids are needed.

Author Contributions
Research concept and design: