Assessing breed integrity of Göttingen Minipigs

Göttingen Minipigs (GMP) is the smallest commercially available minipig breed under a controlled breeding scheme and is globally bred in five isolated colonies. The genetic isolation harbors the risk of stratification which might compromise the identity of the breed and its usability as an animal model for biomedical and human disease. We conducted whole genome re-sequencing of two DNA-pools per colony to assess genomic differentiation within and between colonies. We added publicly available samples from 13 various pig breeds and discovered overall about 32 M loci, ~ 16 M. thereof variable in GMPs. Individual samples were virtually pooled breed-wise. FST between virtual and DNA pools, a phylogenetic tree, principal component analysis (PCA) and evaluation of functional SNP classes were conducted. An F-test was performed to reveal significantly differentiated allele frequencies between colonies. Variation within a colony was quantified as expected heterozygosity. Phylogeny and PCA showed that the GMP is easily discriminable from all other breads, but that there is also differentiation between the GMP colonies. Dependent on the contrast between GMP colonies, 4 to 8% of all loci had significantly different allele frequencies. Functional annotation revealed that functionally non-neutral loci are less prone to differentiation. Annotation of highly differentiated loci revealed a couple of deleterious mutations in genes with putative effects in the GMPs . Differentiation and annotation results suggest that the underlying mechanisms are rather drift events than directed selection and limited to neutral genome regions. Animal exchange seems not yet necessary. The Relliehausen colony appears to be the genetically most unique GMP sub-population and could be a valuable resource if animal exchange is required to maintain uniformity of the GMP.


Background
The Göttingen Minipig (GMP) is an animal model with growing importance [1]. Created in the 1960's by crossing Minnesota Minipigs, Vietnamese Potbellied Pigs and the German Landrace, the breed has been under a fully documented closed breeding scheme ever since. The first colony was founded at the research farm of the University of Göttingen in Friedland, Germany, and later resettled to the Relliehausen research farm. Due to the growing customer interest of using GMPs, this facility could not satisfy the demand anymore and therefor collaboration with Ellegaard Göttingen Minipigs A/S in Dalmose, Denmark, including a larger colony, was established in 1992. In 2003, animals from this colony were brought to Marshall BioResources, North Rose, New York, USA as the basis of a North American GMP breeding colony. In 2009, a second barrier colony was established in Dalmose, based on breeders from the first barrier colony to increase the production. Finally, animals from Dalmose were brought to Oriental Yeast Co., Ltd. in Nisshin, Japan, in 2013 to establish a barrier colony in Japan. After the initial animal transfer, all colonies remained under closed breeding without any genetic exchange, albeit being under a common controlled breeding scheme, coordinated by the animal breeding and genetics group at the University of Göttingen, Germany. Today all GMPs under that management are bred for two main traits, number of piglets born alive and an index comprising body weight measures at different ages. It is inherent in breeding practice that there is also a non-documented co-selection on tame behavior and against appearing malformations such as unwanted pigmentation [2]. Different intensities of conservational breeding techniques, such as adjustment of selective pressures and balancing selective pressures between the sexes, is used to account for the different herd sizes, eventually results in all colonies having comparable effective population sizes.
Managing the GMP in independent colonies in closed barriers is beneficial from a safety point of view. Additionally, a production unit close to the main sales market minimizes negative effects on animal welfare through long transports and prevents import complications and unnecessary bureaucracy. On the other hand, splitting a population reduces the effective population size of each sub-population, which increases the risk of genetic drift or the manifestation of recessive disorders [3]. Two concepts to counter these risks are purging of deleterious variants [4] or maintenance of genetic diversity [5]. Lacy [6] argues that drift is the most important factor in loss of genetic variance when effective population sizes are low, as in the case of the GMP [7], and the only effective measure to mitigate adverse effects would be animal exchange.
In this study we try to assess whether the genetic management was able to maintain the uniformity of the GMP breed, or if the isolated production units are already genetically diversified such that an exchange of breeders is inevitable. This was done by re-sequencing two representative DNA pools from each unit: candidates were sampled for low average relationship within a pool, but elevated relationship towards the remaining colony, allowing an assessment of the diversity within and between units.

Results
Sampling of the optimally representative candidates for pooling based on relationship measures resulted in candidate sets which exhibited lower inner-set mean relationship coefficients (a) compared to the mean relationship of the candidate set with the remaining colony mates (b). Both, the absolute level of relationship and the difference between a and b were lowest for Relliehausen (RE) and highest for North Rose (NR), while Dalmose barrier 2 (DA2) and 3 and Nisshin (NI) were at the same level and exhibited similar difference between a and b (Table 1).

Principal component analysis
Principal component analysis (PCA) on the reference allele frequencies shows clear separation of the GMPs, Asian and European breeds, with the Mini-LEWE clustering close to the Asian breeds, by the first component ( Fig. 1). The second component separates the Asian breeds from GMPs and Europeans. In the context of multiple breeds, no structure can be observed within the GMPs. Separate analysis shows three sub-groups within the GMPs consisting of the RE colony, NR colony and a composite group of DA2, DA3 and NI.

Differentiation and distance measures
As measures of differentiation and genetic distance between the different breeds, F ST and the Reynolds genetic distance (D R ) were estimated. When applied to the variable set of large breeds and minipig pools, both measures provided a similar picture of three strongly differentiated groups (Table 2). In principle, these three groups were the minipigs, the European breeds and the Asian breeds, respectively, with the exception, that the Mini-LEWE pool clustered with the Asian group. Comparing F ST against D R , the latter showed generally higher estimates, relatively inflated at moderate levels of differentiation/ distance (Fig. 2), but provided in general a very similar picture. Therefore, only F ST was used for later purposes, such as the functional annotation. Focusing on the differentiation within the three groups, the GMP exhibited the lowest average differentiation (F ST : 0.07; D R : 0.11; see also Additional File 1: Supplementary  Table 2). The average differentiation to other groups was higher than the differentiation within the groups, clearly so for minipigs and for European breeds, but not that clearly for the Asian breeds. The latter exposed an even lower average D R (0.36 vs 0.37) and F ST (0.26 vs. 0.27) to the GMP than within the group of Asian breeds.

Phylogeny
The UPGMA tree (Fig. 3   resampling probability between 18 and 87%, with the exception of the Japanese wild boar, that behaves like an outgroup sample. Even though, the European and the GMP clusters are distinct, the order within the clusters is variable. The node support within the European cluster spans from 56 to 72%, and between 18 to 86% in the GMP cluster. The most stable structure with 86% contains the RE pools and the least stable (18%) contains the DA and NI herds.

Stratification within the GMP
The genetic differences within the GMP were determined by comparing pools in terms of allele frequency differences, such as oppositely fixed alleles, extreme F ST values between colonies, differences in the average expected heterozygosity within pools by a variation based approach employing an F-test statistic. Resulting loci detected by the aforementioned statistics were functionally annotated and imbalances between the various classes were checked for potential biases towards differentiated loci.

Significance test of pool allele frequencies between and within colonies
The F-test compared the variation between the two pools with the variation between one of the pools against one foreign pool and could, in contrast to F ST, add probabilistic evidence on differentiation between pools. On average, the NI colony had the lowest proportion of significantly (p = 0.05, Bonferroni corrected) differentiated loci, overall 4.7%, followed by DA3 and 2 with 5.3 and 5.4%, respectively, and RE with 5.7%. With 6.9%, NR had the highest proportion of significantly differentiated loci (Table 3; Supplementary Table 2). Focusing on the colonies separately (Fig. 4), only RE had comparable amounts of differentiated loci with all others. From the perspective of DA, NI, and NR, the level of differentiation to RE was clearly highest throughout all comparisons, while the number of evaluated loci was relatively even throughout the comparisons. This means, e.g. RE pools could be significantly different from NR at a locus, since Variation within NR is low, and the average allele frequency in RE is far while using the more variable RE as a basis, the NR frequency is ranging within the variation of RE. Mostly both tested pools of a colony showed a similar amount of differentiation with the exception of NR versus the two NI pools. The highest proportion of differentiated loci was found, when NR was tested against the two RE pools.

Expected heterozygosity
Expected heterozygosity, as measure of variation within a pool, revealed that all pools exhibit similar levels of expected heterozygosity (Table 4) with RE and NI being highest and NR being lowest. When estimated for single pools, expected heterozygosity was between 0.271 and 0.283 for DA2 and DA3 and NR and between 0.280 and 0.285 in NI and RE. Estimated from the virtual union of both pools per colony, the values were about 0.01 higher, following the same trends (Table 5).
Fixed alleles and private polymorphisms Table 6 depicts the correlation of allele frequencies of loci that had complete recordings and where each colony was fixed for either the reference or the alternative allele (Supplementary Table 3). Only 506 loci fulfilled this criterion. The correlations between the colonies based on these loci ranged between − 0.10 and + 0.14, with the highest being between RE and NI.
On the other hand, RE held by far the largest number of still variable loci while the other pools were fixed at one allele. Out of the 1′203'000 loci fulfilling the criterion of being variable in one colony while all others were fixed, 555′591 belonged to RE (Table 7). NR (192′896) and NI (156′502) with about 80′000 loci carried more than the DA units (163′853 and 134′158).

Annotation
Functional annotation of loci significant in F-test, showing oppositely fixed alleles and exhibiting extreme F-test values revealed that most loci were in intergenic or intronic regions (compare Table 8, i.e. F-test: 65% intron and 19% intergenic) followed by~13% upstream and downstream variants. Exonic variants were present to an extent of less than 1%. Potential protein changing variants like start or stop codons were barely present at a 5% significance level in the F-test and nearly absent among loci with oppositely fixed alleles. Compared to the unselected background, intergenic, intron, up-and downstream variants were slightly higher represented in both, the 5% F-test level and for the oppositely fixed loci, while exonic variants were in majority less frequent, especially for stop codons and in oppositely fixed loci only five such stop codons were present. Annotating SNPs in different levels of F ST supported these findings. Start and stop codon changes could only be found at lower F ST levels, while synonymous and missense mutations showed a decline in frequency towards high F ST values, while up-and downstream, intron and intergenic variants were unaffected or increased in frequency (Fig. 5).
Annotation of loci with variability in only one colony, while they were fixed in all other colonies, resembled the fractions of functional classes already known from the Ftest and F ST annotations (Table 10), but due to the higher number of private polymorphic loci in RE, the absolute numbers of loci annotated to potentially protein changing classes, such as missense mutations, was therefore higher in RE (2′380) than in all other colonies. DA3 colony carried the lowest number of missense mutations (538). Still, every colony carried at least one stop codon gain or loss, RE even 15.

Discussion
The aim of our study was to determine whether the integrity of the breed Göttingen Minipigs was compromised by the current production and the genetic management system that relies on genetic isolation of production units. First, the classification of the GMP samples within the context of various pig breeds representing worldwide porcine genetic variation was evaluated with phylogenetic and population genetic methods. Secondly, genetic identity of the breed was assessed by multiple approaches describing variability within and differentiation between the separated barrier colonies.

Discriminability of Göttingen Minipigs from other pig breeds
Our PCA, genetic distance and phylogenetic results show clearly distinct groups of European pigs, Asian pigs and Göttingen Minipigs. The distance between the   European and Asian breeds reflects the current scientific consensus that domestication happened independently in Europe and Asia about 9000 years ago [8]. The European breeds appear generally closer to each other, which might be explained through the different domestication processes in both centers: while the European breeds emerged more or less directly from relatively uniform wild boar strains [9], the Asian domestication history is characterized by complex human driven dispersal of domesticated pigs in the South East Asian archipelagos, sometimes interrupted by feral states, before pigs eventually reached the Asian mainland [10]. This might explain why the European group clusters closely together in the UPGMA tree with higher resampling support than the Asian group. The tree, based on genome wide SNPs, clusters together, on one hand Xiang, Meishan and the South Chinese wild boars and on the other hand Jiangquhai and the North Chinese wild boars, interrupted by the Mini-LEWE. This is in contradiction to Ai et al. [11] who found that Meishan clustered together with the North Chinese wild boars, and could also support the low resampling probabilities found among the Asian breeds. The Mini-LEWE, a composite miniature breed, developed by crossing Vietnamese Potbellied Pigs, Saddlebacks and German Landrace, is in our study represented by a DNA pool of 10 females and a virtual pool made up from two sequenced individuals. Although it appears that individual sequences are not fully comparable to pools, since mixing of individual sequences and pool sequences leads to clustering of the respective sample types (results not shown). Still, the virtual and the DNA Mini-LEWE pools are clearly identified as one breed. Therefore, the virtual pooling seems to be a suitable measure to make different types of data comparable. In the case of the GMP, both types were mixed and both analyses, PCA and the UPGMA tree show that it is easily discriminable from all other breeds. The phylogenetic tree supports a GMP clade with 100% resampling support, which is located among the Asian pig breeds. This can be explained by the cross-breeding history in which Vietnamese Potbellied Pigs, Minnesota Minipigs and German Landrace were involved [12]. An earlier study [7] estimated that about 70% of the GMP genome are of Asian origin. In the PCA, the first component explains the difference between the three breed groups as the main source of variation, accounting for 19% of the genetic variability, while the second component discriminates the GMP and European breeds from the Asian breeds. Following the interpretation of Kim et al. [13], the average F ST between the three groups, ranging between 0.26 and 0.33, suggests that still a major part of the total variability can be assigned to differences among individuals. Anyway, albeit using

Variation and differentiation within and between the GMP pools
While it seems particularly easy to distinguish the GMP from other pig breeds, it is more difficult, but relevant from a breeders' point of view, to determine, if genetic  isolation of the five breeding colonies has led to differentiated subpopulations. Applying PCA on the 10 GMP pools, we were able to see a trend to three subgroups consisting of NR, RE, and a cluster comprising NI and DA, respectively. The presence of a certain level of stratification is expected and has been observed beforehand, i.e. in studies in dogs [14] or sheep [15].  [16,17]. F ST of~0.1 was found between relatively similar breeds, for example Large White and Landrace, while values higher 0.3 indicated major differentiation, such as between Nellore and Holstein cattle or Asian and European pig breeds. These values matched our findings between the European, Asian and GMP groups. Within the GMP, even two randomly composed pools from the same unit had a minimum differentiation of about 0.05. Between the aforementioned clusters F ST was about 0.06 to 0.8 and therefore between the differentiation observed between the sheep breeds from separate origins and clearly distinct breeds. We explain this by genetic drift and slight differences in the actual breeding management, since the three clusters are confounded with the three partners in GMP breeding, even though all follow the same general breeding goal. We also assume that ascertainment bias affecting the variant discovery procedure might have an elevating effect on differentiation since incorporation of various pig breeds in the discovery sample might allow for calling more variants and especially more heterogeneous variants than calling in a GMP only sample [18]. Comparison of our results with the F ST levels found in the aforementioned studies implies that our colonies are at the edge of splitting into sub-populations, and we did not expect all genomic regions between all pairwise combinations of the five units being similarly differentiated, when focusing on individual loci. The F-test (Table 3) identified about 4 to 8% of the genome to be objected by differentiation which is similar to the range found in a comparable study by Amaral et al. [19]. We hypothesize that genetic differentiation should be attributed to drift rather than to selection, if it affects neutral loci relatively more than loci with putative harmful consequences on protein translation, such as stop codon gains or deleterious missense mutations. This was supported by an underrepresentation of detrimental variation among highly differentiated loci. The 10 loci representing deleterious missense mutations with maximum F ST were partly located in genes with known function. Although the RVIS values from the genetic intolerance analysis suggest that the respective genes are relatively tolerant to changes in functional variation, there should be further research on the real functionality, since some genes may have influence on important features of GMPs. While high differentiation in one candidate, CAMK2B, seems even vital to maintain cellular functions [20,21], for example, ARRB1 has found to be involved in feed conversion in pigs [22], AIRE is an important pathway gene of auto-immune response [23], and CHST12 is differentially expressed in Oocytes after nuclear transfer in mice [24]. Nine of these SNPs are identified when the DA2 and DA3  subpopulation are involved in a pairwise comparison, three of them are between DA2 and 3 and five are between a DA unit and NR, indicating that there is detectable differentiation between the DA units and NR. When focusing on the subset of loci where all units are fixed for either allele, it is interesting, that the correlations between all units at such loci range around zero, indicating no clear pattern of shared fixed loci among colonies.
When we looked at expected heterozygosity as a measure of variability, RE and NI exhibit just slightly higher values than the other colonies, but it is notable, that RE holds about as many private polymorphisms as all other units together, making it an indispensable resource of genetic variability. We explain this with the consequent implementation of the mating scheme based on the optimum genetic contribution concept [25] in RE.
Not only is the preservation of a common genetic identity for all colonies of the GMP important [26], but also the risk of inbreeding depression and loss of variation due to drift is increased in artificially reduced subpopulations [6]. To counter this in future, two strategies appear feasible: First, the exchange of genetic material, e.g. via artificial insemination, and second, selection of a most diverse set of breeders as basis for future breeding. The first strategy, in which semen from RE, the major reservoir of remaining variability, would be used to inseminate breeders in the other units, as it is commonly used by dog breeders, would harbor various risks of spreading diseases and disorders between units. The second option of selecting a most diverse set of breeders from the respective unit also has the potential to increase heterozygosity, as can be observed in NI whose founding population was established in that very way. It can be taken as an example of the Bulmer effect [27] that genetic variation in the relatively large colony of DA3 wasn't lost while selection and assortative mating was conducted, and could be largely recovered when the NI founders were chosen for maximum diversity.

Conclusions
Our study based on assessment of differentiation found that the GMP as a breed is easily discriminable from other pig breeds. The finding of genetic distances and differentiation between the isolated breeding colonies of the GMP being minor in comparison to such between distinct breeds is taken as evidence of a successful conservation breeding program, even though some indications of stratification were detected. Functional annotation revealed that loci with functional impact are less differentiated than neutral loci, which implies that the force differentiating the colonies appears to be rather random drift than selective pressure. Selection pressures, through the current breeding activities, might on the other hand have ensured similarity of all colonies in regions which are expected to underlie traits important for the use of the GMPs as animal models. The detection of putatively deleterious highly differentiated variations in candidate genes with functions important for GMPs suggest that research needs to be undertaken to confirm their functionality in our animals.
Albeit animal exchange seems not yet necessary, the RE subpopulation harbors the highest amount of genetic variation while not being especially differentiated from all other colonies.

Samples
A joint pedigree was created from the pedigrees of all colonies in the five separated barrier facilities (Research Farm Relliehausen: Relliehausen (RE); Ellegaard Göttingen Minipigs A/S: Dalmose barrier 2 and 3 (DA2, DA3); Marshall BioResources: North Rose (NR); Oriental Yeast Co., Ltd.: Nisshin (NI)). Numerator relationship matrices were constructed with Wrights coefficient of relationship [28] for each colony and all animals alive within a colony in November 2015. A set of 30 individuals was selected for blood sampling with the following procedure in each colony, respectively: all candidates available for blood sampling consisting of only non-pregnant, healthy sows without genetic disorders were identified. A subset of 30 animals was randomly sampled from this list and the relationship within the set (a) and between the animals in the set and all remaining animals in the colony (b) were calculated. Both values were combined in an index I = 0.8*a − 0.2*b, to minimize relationship within the samples while maximizing relationship with the sample and the remaining colony. This sampling was repeated up to 25′000 times and restarted every time a new index value went below the previously recorded one. The procedure was stopped after 25′000 rounds without improvement.
DNA of two times ten animals per colony, randomly chosen from the available samples from the previously selected 30 candidates, was pooled using equimolar amounts of the individual DNA. 150 bp paired-end sequencing was done on an Illumina HiSeq 4000 with an aim coverage of 30X and an insertsize of about 420 bp. Publicly available data from 13 various pig breeds [29][30][31] as used in a previous study [32] were incorporated (see also Table 11). Raw data was aligned to the reference genome Sscrofa11.1 [33] with BWA 0.7.12 [34], sorting, merging of different libraries and marking duplicates were done with Picard tools 2.10.5 [35], base qualities were recalibrated with GATKs BQSR [36,37] using the available SNPs from dbSNP as validation [38]. Biallelic SNPs were called with the Haplotype Caller from GATK 4.0.8.1 in gVCF-mode. SNPs were filtered with the VQSR tool of GATK that uses machine learning to assess the validity of a SNP. SNPs contained in the Affymetrix Axiom_PigHD_v1 array were used to train the model incorporating the variant attributes Qualityby-Depth (QD), MappingQuality (MQ), MQRankSumTest, ReadPositionRankSumTest, FisherStrand (FS), Stran-dOddsRatio (SOR) and depth (DP). A truth sensitivity filter level of 99.0 was applied. Monomorphic loci, Xchromosomal loci and loci without records were removed.

Principal component analysis (PCA)
PCA was constructed from matrix X, which contained samples in rows and genetic loci in columns. Elements of X were the respective reference allele frequencies. Every locus was centered and scaled beforehand. Due to the nature of data from short-read sequencing, matrix X contained numerous missing values. To account for these, missing information was recoded in matrix N with the same dimensions as X, in which records and missing positions were recoded as 1 or 0, respectively. The adjusted covariance structure was subsequently modeled as E ¼ XX 0 NN 0 . Eigenvectors V and eigenvalues λ were achieved bei eigenvalue decomposition of E. Loadings L where the corrected eigenvalues where negative values were set to zero. Principal components were calculated by multiplying the eigenvectors with a diagonal-matrix containing square roots of L.

Fixation index and Reynolds distance
Fixation index (F ST ) and Reynolds distance (D R ) were estimated between breed pools. Therefore read information of individuals was virtually pooled by breed-wise summation of reads supporting the reference and the alternative allele, respectively.
Reference allele frequency in each breed k per locus was estimated as p k ¼ , where i reflects the i th allele at a biallelic locus, namely the reference allele or the alternative allele, respectively [39]. Both measures were averaged over all pairwise complete loci to gain genome-wide values

Phylogeny
A phylogenetic tree was constructed from genome-wide F ST values from all autosomal loci, using the clustering algorithm UPGMA as implemented in the package "phangorn" [40]. The resulting tree reliability was determined by comparison to 100 trees constructed from 100 randomly sampled loci each.

Test of allele frequency differences between pools
We employed an F-test based statistic to determine statistically significant variation patterns between pools for every locus (eq. 1).
where V I is the pooled variance within a unit, e.g. RE1 and RE2, estimated as V I ¼ p RE1 þp RE2 2 Ãð1− p RE1 þp RE2 2 ÞÃ 2 10 , and where V O represents the variance between the aforementioned unit and a remote pool, e.g. NI1 estimated as The degrees of freedom where assumed to be nine, since every pool was made up from ten animals.

Heterozygosity, fixed alleles and private polymorphisms
Expected heterozygosity at locus i was estimated from original pools and the virtual pool for each colony as H exp i ¼ 2Ãp i Ãð1−p i Þ, where p i is the reference allele frequency. It was further assessed whether a single colony was fixed for one allele, while the others were fixed for the other allele. To assess variability remaining in only one colony, loci where all colonies apart from one were fixed were identified. This was done both for the subset of loci without missing information and for loci where single colonies had missing information.

Annotation
Loci identified in the aforementioned tests were functionally annotated with the Ensemble Genes database (version 98; Sscrofa11.1 [41]). Genetic intolerance of predicted deleterious missense mutations at highly differentiated loci was assessed by calculation of Residual Variation Intolerance Scores (RVIS [42]).