Assessing the power of principal components and wright’s fixation index analyzes applied to reveal the genome-wide genetic differences between herds of Holstein cows

Due to the advent of SNP array technology, a genome-wide analysis of genetic differences between populations and breeds has become possible at a previously unattainable level. The Wright’s fixation index (Fst) and the principal component analysis (PCA) are widely used methods in animal genetics studies. In paper we compared the power of these methods, their complementing each other and which of them is the most powerful. Comparative analysis of the power Principal Components Analysis (PCA) and Fst were carried out to reveal genetic differences between herds of Holsteinized cows. Totally, 803 BovineSNP50 genotypes of cows from 13 herds were used in current study. Obtained Fst values were in the range of 0.002–0.012 (mean 0.0049) while for rare SNPs with MAF 0.0001–0.005 they were even smaller in the range of 0.001–0.01 (mean 0.0027). Genetic relatedness of the cows in the herds was the cause of such small Fst values. The contribution of rare alleles with MAF 0.0001–0.01 to the Fst values was much less than common alleles and this effect depends on linkage disequilibrium (LD). Despite of substantial change in the MAF spectrum and the number of SNPs we observed small effect size of LD - based pruning on Fst data. PCA analysis confirmed the mutual admixture and small genetic difference between herds. Moreover, PCA analysis of the herds based on the visualization the results of a single eigenvector cannot be used to significantly differentiate herds. Only summed eigenvectors should be used to realize full power of PCA to differentiate small between herds genetic difference. Finally, we presented evidences that the significance of Fst data far exceeds the significance of PCA data when these methods are used to reveal genetic differences between herds. LD - based pruning had a small effect on findings of Fst and PCA analyzes. Therefore, for weakly structured populations the LD - based pruning is not effective. In addition, our results show that the significance of genetic differences between herds obtained by Fst analysis exceeds the values of PCA. Proposed, to differentiate herds or low structured populations we recommend primarily using the Fst approach and only then PCA.


Background
Farmed animals should have large genetic variation in exterior, production and fitness traits. Genetic variation is the basis for survival and maintaining of cattle populations. On genome level variation appears as considerable allelic diversity and heterozygosity. Genomic data help to track herds' genetic divergence at molecular level. Knowledge of genetic diversity is also important for small breed conservation and crossbreeding strategies [1,2]. Contemporary technologies allow to use massive SNPs data for these goals.
Several tools could be used to estimate genetic diversity in populations. The most effective and commonly used are principal component (PCA) and Wright's fixation index (F st ) analysis. Both methods are widely used to estimate genetic difference between populations. Genomic PCA finds the eigenvectors of the covariance matrix derived from the genotypes of animals. These eigenvectors provide the efficient linear combination of marker data which the most effectively differentiate of various samples, without requiring apriori sample classification information. The resolution of highly structured populations with PCA depends on non-random patterns of genetic variation. To reduce the impact of this factor, one should filter the data by removing a marker from every pair of markers which are in tight LD [3][4][5], or implement a shrinkage PCA [6], and apply iterative pruning PCA [7,8].
More than 70 years have passed since S. Wright introduced a fixation index to measure genetic difference between populations [9]. His approach proved to be very fruitful for the further development of population genetics. Over these years many F st statistics have been proposed. Among them the most commonly used estimators are those presented by Weir & Cockerham [10] and Nei [11]. But, the first one is sensitive to sample size and the second one consistently overestimates F st [12]. Another approach used in the Hudson's estimator [13]. It is not sensitive to sample size ratio, not systematically overestimate F st , and it is accurate and stable under various ascertainment schemes [12].
Generally accepted that the rare alleles play an important role in evolution. Analysis performed by Gorlov et al. [14] suggests that including rare SNPs in genotyping platforms will advance identification of causal SNPs in casecontrol association studies, particularly as sample sizes increase. This effect is confirmed by the genomic breeding value evaluation of dairy cattle [15] and the effect of rare alleles on estimated genomic relationships from whole genome sequence data [16]. For animals the frequency of SNPs alleles in the range from 0 to 0.5 obtained with arrays is nearly uniform while for sequence data this distribution is substantially biased to rare alleles [5]. Obtained phenomenon can lead to ascertainment bias in the evaluation between populations difference with SNPs arrays. Removing low-frequency and rare SNPs alleles (MAF < 0.02) can significantly distort results of PCA analysis [17]. Human population studies have shown inflation of ascertainment schemes on F st values calculation [18][19][20]. In other studies were observed upward bias in F st values [5,21]. Clark et al. [22] on human populations demonstrated that data sets based on different ascertainment schemes gave different patterns of F st values. Moreover, the raw array data and those with polymorphic SNPs in the wild chicken samples underestimated pairwise F st values between breeds which had low F st (< 0.15) in the whole genome resequencing (WGS) data, and overestimated this parameter for high WGS F st (> 0.15) [5]. It should be borne in mind that F st value can depend heavily on the level of variation present in a sample and the frequency of the most frequent allele [23]. Indeed, Jost [24] argued that F st may be so affected by genetic diversity that it should not be used as a measure of population differentiation, gene flow or relatedness.
The Leningrad region is the highest average milk yield producing region in Russia

Evaluation the correspondence of MAF and linkage disequilibrium
The effect of LD -based pruning on the number of SNPs was large (see Additional file 1: Figure S1). To estimate impact of LD -based pruning on MAF of SNPs we calculated the distribution of MAF in eight bins (Fig. 1). The proportion of SNPs regarding the MAF bins in the complete and the pruned data was noticeably different. LDbased pruning completely removed monomorphic SNPs, disproportionally removed SNPs with MAF 0.2-0.4 while proportion of rare and common SNPs with MAF 0.0001-0.1 and 0.5 increased (Fig. 1). It can be suggested that in average SNPs with MAF 0.1-0.4 distributed in genome closer to each other than remaining SNPs leading to the largest LD between them.

Assessing the impact of outliers removal on F st data
On the first step we evaluated the impact of the outliers on F st values. We calculated F st values for Pairwise set of complete data both with correction and without correction on outliers (see Additional file 1: Table S1). After outliers correction in EIGENSOFT 799 cows and 46,626 SNPs were remained.
Absence of the outliers correction leads to a bias of F st values but only for 6 from 78 pairs of the herds. In all cases the difference between F st values was ±0.001 with exception of 4 and 13 pair of the herds having 0.002 difference. Nearly the same F st values was also stored for Pairwise set where was excluded SNPs with MAF < 0.01. Among these F st values only six pairs of the herds differed by 0.001 from F st values for complete data in Table  S2 (see Additional file 1) and three of them were the same as in result of outliers correction.

Evaluation LD based pruning methodology on F st values
Linkage disequilibrium pruning (LD < 0.1) had the same effect size on F st values ±0.001 as the outliers had but affected more pairs of the herds 22 vs. 6 for outliers effect (Table 1 and see Additional file 1:  Table 1). In point of fact the effect size on F st was not large despite of considerable decrease in the total number of SNPs (5827 vs. 48,108) and their proportion in the SNPs bins (Fig. 1). Thus, LD -based pruning had a moderate effect on F st values but it affected more pairs of herds.

Evaluation impact of SNPs MAF on F st values
To evaluate impact of SNPs MAF on F st values, we divided the entire MAF interval 0.0001-0.5 into 6 bins and calculated for each of them the mean F st value across Pairwise sets formed from complete and pruned data (Fig. 2). The rare SNPs alleles with MAF 0.0001-0.005 had the smallest mean F st value (0.0027) across all herds than those for remaining SNPs (see Additional file 1: Table S3). It can be concluded that in average between herds differences calculated for rare alleles were less of those for common alleles. For MAF in the range of 0.1-0.5 the difference between the mean F st values across beans for two Data sets was not significant. As a result of mutual compensation of the mean F st values in complete and pruned data in whole MAF range, the total summed value of F st value between them was insignificant (see Additional file 1: Table S3). Thus, these results again confirm a small effect size of LDbased pruning on F st values only for rare SNPs not common SNPs alleles.

Evaluation significance of F st values
To assess significance of F st values in the Table 1 we carried out the pairwise herds permutations of the cows treating them as H null-distribution. The results of these F st values calculations are listed in Table S4 (see Additional file 1). Then, we calculate P-values for each pair of the herds in Pairwise set using Student's t-test ( Table 2). All of them were with Pvalues in the range from 1.0e-06 to 3.6e-60 with mean 6.5e-18 and median 3.6e-40, thereby it is higly skewed distribution. To calculate F st for H null-distribution we carried out only 5 permutations for each of 78 pairs of the herds as it was time consuming process and result of Pvalues estimates would be only slightly underestimated. In the Table 1 the minimum F st values were 0.002-0.003. The pairs of the herds corresponding to these values are the candidates for genetically most similar herds. However, when comparing these herds in Table 1 the errors were not taken into account. The probabilities of making a type 1 error for all 78 herd combinations are given in Table 2. To evaluate the genetic differences between the herds we have chosen cut off P ≤ 1.0e-30 (P ≤ 1.28e-32 taking into account the Bonferroni correction) in which, as a rule, the differences between the herds at F st values 0.002-0.003 should be insignificant. The results are shown in the Table 3. Insignificant pairs of herds were 2 and 8, 9, 11, 12 (4 pairs); 3 and 5, 8, 9, 10 (4 pairs); 8 and 2, 3, 9, 11 (4 pairs); 9 and 2, 3, 11, 12 (4 pairs). The pairs of herds with 2, 3, 8 and 9 herds had 4-6 F st values 0.002-0.003 (Table 1). Therefore, the results of identifying insignificant pairs of herds (Table 3) correspond to the minimal F st data in the Table 1. In the Table 3 most significant pairs of herds at this cut off were the herd 4 (10 pars), 7 (12 pairs), 13 (12 pairs) or a more stringent level of significance at cut off P ≤ 1.28e-39 the herds 4, 7 and 13 had 10, 8 and 11 significant pairs of the herds (Table 3).
It was necessary to determine the most significant pairs of herds. The most significant at cut off P ≥ 1.28e-50 pairs of the herds were 2 and 5, 6; 4 and 2, 3, 5, 12; 5 and 11; 7 and 1, 2, 9; 13 and 5, 9, 12 ( Table 2). These pairs of the herds correspond to the most genetically different pairs of the herds, while F st data errors were taken into account as well. Summarizing the results of Pvalues calculating we can assert about a high level of significance the F st analysis.     (1) and (2) can be calculated the variances to be used after summing ten eigenvectors. It were 8.17% for complete data and 6.47% for pruned data. Whence, the more eigenvectors will be summed, the more value of variance will be used.
Having the small F st values and gradual decrease of the eigenvalues we calculated the mean for every herd in the PC scales to statistical description between herds genetic differences revealed by PCA. The plot of the means for all herds along PC 1 and PC 3 is shown on Fig. 3 and along PC 1 and PC 4 is shown on Fig. 4. To assess significance of genetic difference between 13 herds based on PC 1 we listed (+) (denoting between herds significance) in Table 4 obtained from Pvalues in Table 2 where cut off of significance was taken at P ≤ 0.05 but given the Bonferroni correction we get P ≤ 6.4e-4. Further, for brevity, we write P ≤ 0.05 instead P ≤ 6.4e-4. For PC 1 among 78 pairs of the herds there were 14 significant pairs of the herds. Most often significant data were observed for herds 4 and 13. Some significant results obtained with F st statistic also confirmed with PCA for eigenvectors 1. For example, the greatest pairwise F st values for herd 4 were confirmed by noticeably higher   level of significance revealed by PCA (Table 2). Furthermore, insignificant pairs of the herds 1 and 4, 4 and 6, 4 and 13 for PC 1 correspond to smallest F st values for pairs of the herds formed with the herd 4 (Table 1). It should be noted a negligible effect size of LD based pruning on between herds' significance for eigenvector 1 ( Table 2). The same procedure was carried out for PC 3 (Table 4). Among Pairwise set there were 16 significant pairs of herds. The most often significant data were obtained as well for herd 4 not herd 13. Out of fourteen significant pair of the herds revealed PC 1 only 9 coincide with sixteen significant pair of the herds revealed PC 3. Thus, PC 3 score is different from the PC 1 one. Obviously, it would be incorrect to make a conclusion about between herds significant differences if we used data for a separate eigenvector ( Table 2).
Comparing the visible pattern of location the mean values of the herds along PC 1-3 and PC 1-4 we can draw some general conclusions (Figs. 3 and 4). The trajectory connecting herds 4-7-6-13-1 preserved on both figures. Other herds visually shifted relative to each other although not all of those displacements were significant at P < 0.05 as was shown along eigenvectors 1 and 3. However, the difference between these pairs of the herds was highly significant when we measured them with F st statistics (Table 2). Thus, visual differences of the herds positions on Figs. 3 and 4 might be incorrect if we used only visual information along separate eigenvectors.
The lack of overall significance (P < 0.20) of second eigenvector for complete data and insignificance of the most pairs of the herds in Pairwise set indicates that there are not between herds genetic difference for this axis. Therefore, these data were excluded from consideration.
Furthermore, based on complete data, we tested the level of PCA data significance if Pvalues for Pairwise set were calculated from summed ten PC. Appropriate P values are given in the Table 2 and significant pairs of the herds which were denoted as (+) are listed in the Table 5 at cutoff P ≤ 0.05. Among them there were 47 pairwise significant combinations of the herds for summed PC 1-10. The most significant result was obtained for herd 4 and 7 while insignificant results for herd 8, 9 and 10. Thus, giving summed genetic variance from 10 eigenvectors lead to noticeably increase the level of significance and change conclusions about data significance as was shown for PC 1 and PC 3.
To verify the level of significance further we calculated Pvalues for Pairwise set of the herds from complete data across summed 20 eigenvectors (Table 2). It turned out that for cutoff at P ≤ 0.05, 61 from 78 pairs of the herds were significant ( Table 5). The most significant pairs of herds were 1, 4, 7, 8, 11 and 12 and the most insignificant pair of the herds was formed with the herd 3. Considering the data for summed ten and twenty eigenvectors, it is important to note that significant pairs of the herds varied greatly with an increase in the number of summed eigenvectors. Thus, increasing the number of summed eigenvectors leads to overall increase of significance level.  To include complete variance available from PCA analysis we calculated Pvalues for 100 summed eigenvectors (Table 2). For complete data P-values distribution had mean 2.2e-07 and median 2.2e-15, thereby the distribution is highly skewed. The herd 3 had minimum Pvalues with other herds ( Table 2) therefore based on these values we selected significant pairs of the herds at cutoff P ≤ 1.0e-10 and given the Bonferroni correction P ≤ 1.28e-12. The results are shown in Table 6. The herd 3 formed 6 insignificant pairs of herds 3 and 6, 8, 9, 10, 11, 12 and herd 8 formed 9 insignificant pairs of the herds 8 and 1, 2, 3, 5, 6, 9, 10, 11, 12. Thus, the herd 8 and 3 was the most genetically related with other herds and this result do not contradict F st values 0.002 and 0.003 prevailing in pairwise set for these herds (Table 1). It was necessary to determine the most significant pairs of the herds. The most significant pairs of the herds at cutoff P ≤ 1.28e-20 were 4 and 2, 3, 7, 9, 12, 13 (6 pairs); 7 and 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13 (11pairs); 13 and 2, 3, 4, 7, 9, 10, 12 (7 pairs). This result for herds 7 and 13 is due to smaller standard errors for these herds than for herd 4 (see Additional file 1: Table 5).

Assessing the power of F st and PCA analyzes
In the Table 2 listed Pvalues for Pairwise set of the herds calculated with PCA and F st analyzes. According to these data for summed 100 eigenvectors, Pvalues were the smallest of those for any other eigenvector or summed 10 and 20 eigenvectors. This result was due to use the complete variance from initial data. Further, comparing Pvalues of PCA and F st analyzes draw a conclusion that F st Pvalues were many orders of magnitude less those of summed 100 eigenvectors. Across Pairwise set the PCA calculated power was within the range of 0.8-1.0, while for F st it was within the range of 0.9-1 that is the probabilities of a type II error are similar. In total, considering by several orders of magnitude smaller Pvalues for F st , we can conclude that probability type I error for the F st analysis was far less the PCA one. Therefore, it should be accepted that the data from the F st analysis are more reliable.

Discussion
Verification of the genetic diversity across herds is useful in a variety of biological context, especially in breeding, selection and conservation of breeds as well as crossbreeding strategies. In fact, the maintenance of genetic variation in the breeds is extremely important. The problem is to know whether such differentiation reflects meaningful differences. Genomewide data allow to carry out the population analysis at unprecedented earlier level. We can expect herein to be able to resolve some diversity across herds' genetic data. To solve this problem we apply two tools. First one was Wright's F st statistics [9]. Second one was recently proposed PCA tool as an alternative approach to determine within and between populations diversity [3].

Comparing the power of F st and PCA analyzes
Natural models of population structure predict that the most of the eigenvalues of theoretical covariance will be «small», nearly equal, and arise from sampling noise, while just a few eigenvalues will be «large», reflecting past demographic events [3]. It is not relevant for commercial dairy herds. Monotonic decrease of the eigenvalues was observed. This indicates a relatively homogeneous genetic structure of the herds due to a big enough gene flow between herds as result of artificial selection. For example the proportions of the cows born from the same bulls in 78 pair of the herds was up to 32% [25]. In fundamental analysis of genetic diversity with PCA, Patterson et al. [3] discovered a threshold (as measured, for example, by F st ) below which the population structure was essentially undetectable. They proposed that for two equal size subpopulations, there is a threshold value of F st calculated as 1/ √nm (where n is the number of animals and m is the number of SNPs), below which there will be essentially no evidence of the populations structure. In our study this threshold is within 0.0005-0.0006 for complete data which is considerably lower obtained minimal F st value 0.002 and threshold 0.0015-0.0018 for pruned data which is comparable with minimal F st value 0.002 (Table 1). Findings show that proximity to the threshold of PCA analysis for pruned data did not affect samples testing (Table 1).
Therefore, we predict that the power of our PCA analysis would be sufficient to reveal detectable across herds genetic differences. But, PCA calculates correctly if the markers are independent (between them have not sizable LD) [3,17]. Several approaches have been proposed to achieve this goal, namely, shrinkage PCA [6], iterative pruning PCA [7,8] and LD -based pruning [3]. We used LD -based pruning. For our local Holsteinized herds LD -based pruning has no affect on F st and PCA data analysis. The same result for PCA was obtained when studying the genetic diversity of Spanish beef cattle at much greater F st values (0.026-0.068) [4]. However, for human populations LD -based pruning has a sizable effect (e. g [3,17].). Perhaps the PCA would be sensitive to LD -based pruning when populations or herds have pronounced genetic structure. It should be noted that after LD -based pruning the second eigenvector overall P -value became significant 6.4E-06 (see (2)). Consequently, insignificance of the second eigenvector may be possible result of LD between SNPs. Thus, the effect of LD on estimates between herds' differences is moderate but not as great as for human population [3]. We propose this is a consequence of genetic relatedness of the cows in the herds. Really, the cows from 13 herds had considerable genetic relatedness owing to use high proportion up to 32% the same sires in the herds [25].
To clarify effect of LD on between herds differences, consider the results of F st analysis. The pattern of MAF before and after LD -based pruning changes considerably (Fig. 1). Despite of this effect on MAF as pairwise F st values (Table 1) and mean F st values in MAF bins (see Additional file 1: Table S3) have not changed considarably. Meanwhile, the rare alleles have the smaller mean F st values than common alleles, particularly for MAF 0.0001-0.005 (see Additional file 1: Table S3) and F st values gradually increase up to MAF 0.1. Thus, the rare SNPs alleles are less differentiated between the herds than common alleles but they did not have substantial effect on F st values. The less differentiation rare alleles may be suggested as a by-effect of artificial selection on the highest breeding values.
It was proposed if population has gone through bottleneck then F st values could be greater for rare alleles as compared with common alleles [3]. We have observed the opposite effect, a decrease of F st values for rare alleles. This means the bottleneck event in the breeding history of the herds if it had occurred it might have revealed by PCA.
The PCA plots of the herds means along first, third and fourth PC for the cows from complete data are shown on Figs. 3 and 4. The position of forth herd is outstanding on these images. The mean pairwise F st value with other 12 herds for herd 4 is also the greatest 0.0087 compared with those of other 12 herds (0.0038-0.0063) [25]. We assume that this result is caused by the heavy use of bulls from the Netherlands between 2000 and 2007 years in herd 4, while bulls imported mainly from the USA and Canada were more recently used in the other 12 herds. Therefore, all pairwise combinations of the forth herd with other twelve herds are highly significant for first and third (except of pair the herds 1 and 4, 6 and 4), summed PC 1-10 and PC 1-20, summed 10-100 eigenvectors (except 4 and 6 pair of the herds) ( Table 2).
Unlike the position of the herd 4, the position of third and eighth herds was nearly in the middle of the cluster which located herds 3, 8 10, 11, 12 on Fig. 3. These herds have minimal mean pairwise F stvalues with other 12 herds 0.0038 and 0.0039 [25]. The herd 3 forms many insignificant at P ≤ 0.05 pairwise herds combinations with other herds revealed by PCA 1-20 (Table 5) and at P ≤ 1.26e-12 by PC 1-100 (Table 6). Such properties of third herd are the result of genetic relatedness of the cows from these herd with the cows of other 12 herds mostly due to a large percentage of cows (up to 32%) born from the same bulls used in other 12 herds [25]. The same is true for herd 8 revealed by PC 1-100 ( Table 6). The herd 13 is in the middle of both images on Figs. 3 and 4. Therefore, the herd 13 forms many highly significant pairs of herds revealed by PC 1-100 and F st (Tables 2,  6 and 7).
Thus, mutual position of the herds and their pairwise significance depends on the eigenvectors since they are orthogonal and in each of them used only a part of genetic variance. It cannot be used the certain eigenvector to evaluate genetic differences for low genetically different pairs of the herds. Only summed eigenvectors are able to accurately assess these differences not contradictory to F st approach. This conclusion is fully confirmed by the results obtained from summed 100 eigenvectors (Table 2).

Examples between cattle breeds genetic differences
For the largescale SNP data the PCA and F st are widely used to summarize the structure of genetic variation in the populations. Consider some findings available from publications studying a moderate between populations difference. Analysis of Russian cattle breeds demonstrate a very low differentiation of Black and White breed from Holstein -Friesian breed (Fst = 0.02) [26]. The authors did not use PCA. In another research F st value for Black and White and Holstein breeds was 0.035 and Black and White breed formed a cluster with the breeds from Northern Europe on multi dimentional scaling (MDS) images [27]. PCA analysis applied to a distance matrix based on identity by state (IBS) showed a grouping of Spanish beef cattle breeds [4]. The degree of genetic differentiation was small to moderate as the pairwise fixation index of genetic differentiation among breeds estimates ranged from 0.026 to 0.068. Obtained results indicate large within-breed diversity and a low degree of divergence among the autochthonous Spanish beef cattle breeds studied. Among 47 worldwide breeds the USA and French Holstein have F st value 0.004 and they are indistinguishable across PC 1 -PC 2 [28]. Authors concluded that PCA may fail to detect spatial structuring if this is not associated with the most pronounced genetic differentiation. Some degree of differentiation was shown with PCA between the USA and New Zealand Jersey bulls and cows [29]. The mean (max) F st across the genome for AU versus US cows was 0.008 (0.12) and the average (max) for US versus AU, US versus NZ, and AU versus NZ was 0.006 (0.08), 0.029 (0.21) and 0.009 (0.07), respectively. Authors suggest that although some differentiation based on F st was seen, especially for US versus NZ cows, the other populations appear to be similar. Noteworthy, differentiation between Australian and the USA Jersey cow populations was marginal in comparison with populations of the bulls. On PC 1 -Pc 2 image it was impossible to differentiate them geographically. Taken together the PCA and F st results show that two artificially unselected breeds were not well differentiated and still cover a considerable part of the original genetic diversity [30]. On the contrary, artificially selected breeds show significantly highest differentiation. The highest overlap of genetic variation was found between Anatolian Black and Illyrian Mountain Buša (F st = 0.037). This breeds were very close to each other in the PC 1 -PC 2 and PC 1 -PC 3 images and statistical prove on genetic differences between them are not given. Most of the remaining breeds also had their smallest F st value (F st = 0.037-0.096) when compared to Illyrian Mountain Buša. In indigenous six cattle populations of Ethiopia and Korea, PCA evidently distinguishes Ethiopian cattle populations from Hanwoo breed [31]. The most similar populations are Ambo -Arsi, Horro wih F st 0.002 (P < 0.01) and they are very close to each other on PC 1 -PC 2 image but statistical data are not shown. Ancestry analysis of the new world cattle evidences that the first axis of PC was associated with the indicinetaurine split and the second PC axis was associated with the divergence between European and African taurine cattle [32]. The authors calculated the overall P-values based on TracyWidom test and shown that 154 axes in the 50 k dataset were statistically significant. In another research PCA and F st showed minimal structure within the Guernsey breed, with no complete segregation of animals reflecting geographic origin (Fst = 0.006-0.011) and PCA with no distinct visual animal separation [33].
It is important to note that as a rule in above mention publications the eigenvalues (variance) decrease faster than in our study. Apparently this is the result a considerably low between herds genetic difference comparing to the breeds. We should keep in mind that visual evaluations of the genetic distance between herds on PCA images may be incorrect without of statistical prove. As a rule, any statistical treatment of PCA between populations' results was not given and such images are only illustrative.
What conclusions from obtained results should be done regarding the power of F st and PCA analyzes? Wright's F st is based on maximization of allele frequencies differences between populations through variance component. Used by us Hudson's estimator provides the genetic difference between populations compared to the genetic difference within populations through variance component as well. Applied in our research PCA relies on covariance matrix of SNPs alleles among animals [3] and it able to find between herds genetic variation. That is F st and PCA based on similar mathematical approaches and additional simulation analysis is needed to determine why F st analysis gives more significant data. Summarizing the results of F st and PCA analyzes should be noted that the power of both analyzes similar but probability of making a type I error is much less for F st approach. It can be concluded at that point the F st analysis is superior to PCA.

Conclusions
Firstly, despite of genetic relatedness of the cows in the herds, F st and PCA analyzes are able to differentiate between herds genetic differences. But, PCA applied to the herds might only be efficient when summed results of several eigenvectors will be used. Secondly, despite of considerable change in the number of SNPs and their MAF spectrum due to LD -based pruning, it has a small effect on the results of F st and PCA analyzes. We suggest that this is a consequence of homogeneous genetic structure of the herds. Our findings show that F st method give the more significant data than PCA but PCA approach might be useful due to visualization of some genetics features of the herds.

Animal resources and SNP genotyping
Data and genotypes were obtained from Committee on Agro-Industrial Complex of Leningrad region. Cows genotypes were available from 13 breeding herds locating in Leningrad region (Russia) born in range from 2010 to 2013. Animals for genotyping were selected by farmers with respect to pedigree structure of the herd. Number of animals genotyped depends on number of sires used in herd at least one daughter was presented by sire. In case of multiple daughters were presented per single sire, sire of dams were different. Sampled animals are presented 8-15% of total number of milking cows (see Additional file 1:

Principal components analysis
For calculation of PCA the EIGENSOFT 6.0.1 software was used [3]. The outliers removal was carried out during each iteration. We used 6 standard deviations which an animal must exceed along one of the top principal components in order to be removed as an outlier. ANOVA P-values between each pair of the herds along principal components were calculated. For each pair of the herds, the above mentioned ANOVA statistics are summed across 10, 20 up to 100 eigenvectors. The distributions were approximately chisq with d.f. equal to the number of eigenvectors minus one. Likewise Pvalues were calculated for each 78 pairs of herds including summed PC 1-10, PC 1-20 and PC 1-100. For each of leading component PC 1, PC 3 and PC 4 the mean herd value was calculated. Then, they were plotted with R software environment [36]. Power of PCA analysis for summed PC 1-100 eigenvectors was calculated with «powerAnalysis» software in R software environment [36]. When comparative evaluating P -values in Table 2 Bonferroni corrections by formula: P = α/m was used, where α is the desired overall alpha level and m is the number of hypotheses.
Additional file 1:  . Table S6. Description of the herds and number of the genotyped cows. * -Country of origin of the sires of the genotyped cows, NLthe Netherlands. Figure S1. Effect of LD -based pruning on the number of SNP in the complete data.