Ancestry Prediction Comparisons of Different AISNPs for Five Continental Populations and Population Structure Dissection of the Xinjiang Hui Group via a Self-Developed Panel

Ancestry informative markers are genetic markers that show distinct genetic divergences among different populations. These markers can be utilized to discern population substructures and estimate the ancestral origins of unknown individuals. Previously, we developed a multiplex system of 30 ancestry informative single nucleotide polymorphism (AISNP) loci to facilitate ancestral inferences in different continental populations. In the current study, we first compared the ancestry resolution of the 30 AISNPs with that of other previously reported AISNP panels for African, European, East Asian, South Asian and American populations. Next, the genetic components of the Xinjiang Hui group were explored in comparison to these continental populations based on the 30 AISNPs. Genetic divergence analyses of the 30 AISNPs in these five continental populations revealed that most of the AISNPs showed high genetic differentiation between these populations. Ancestry analysis comparisons revealed that the 30 AISNPs had comparable efficiency to other published AISNP panels. Genetic relationship analyses among the studied Hui group and other continental populations demonstrated that the Hui group had close genetic affinities with East Asian populations and might share genetic ancestry with them. Overall, the 30 AISNPs can be used to predict the bio-geographical origins of different continental populations. Moreover, the genetic data obtained for the 30 AISNPs in the Hui group enrich the extant reference data and can be used as reference data for ancestry analyses of the Hui group.


Introduction
A bio-geographical origin analysis can determine the population substructures in a genome-wide association study [1]. This type of analysis also has wide applications in forensic research. For example, ancestral inferences of unknown individuals may provide valuable information that can assist forensic investigations by narrowing the detection scope; it can also help corroborate eyewitness accounts [2]. In human genome diversity research, forensic geneticists have selected and reported some AISNP

Sample Information
Bloodstain samples of 98 Hui individuals in China were collected with their written informed consent. Twenty-six populations from five continents were used as reference populations; the genetic data for the 30 AISNPs in these populations were obtained from the 1000 Genomes Project [20]. This research was carried out in accordance with the Declaration of Helsinki. Moreover, the study protocol was approved by the ethics committee of Xi'an Jiaotong University Health Science Center (2019-1039).

DNA Extraction and Primer Design
DNA samples were isolated from bloodstain cards using a Magbead Blood Spots DNA Kit (CWBio, Beijing, China) according to the manufacturer's instructions. The DNA concentration of each sample was determined using a NanoDrop 2000 instrument (Thermo Fisher Scientific, Waltham, MA, USA). Primers for the 30 AISNP loci were designed using the Primer Premier v6.23 software (Premier Biosoft International, Palo Alto, Santa Clara, CA, USA), and these primers were then combined into the Primer Mix. Primer information for the 30 AISNPs is presented in Supplementary Table S1.

Library Preparation, NGS and Data Analysis
The DNA library was prepared by two-round PCR. Detailed information on the two-round PCR is shown in Supplementary Figure S1. Barcode sequences used in this study are given in Supplementary Table S2. The library was then quantified using a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA).
Paired-end sequencing with a read length of 150 bp was conducted on the Illumina NextSeq 500 platform, which can produce 100-200 G of data per run. A total of 150 cycles were used for sequencing; other parameters were set according to the manufacturer's recommendations. Raw data for each individual were saved in FASTQ format after sequencing. Clean data were obtained from the raw data by removing low-quality reads (<80 bp) and sequencing adapters with the Cutadapt software [21]. After quality control, these data were mapped to the UCSC hg19 human reference genome by the mem algorithm of the BWA software [22] with default parameters. Duplicate reads were removed using Picard tools (http://broadinstitute.github.io/picard/), and the mapped reads were used to detect variations. Genetic variations of the 30 AISNP loci were detected by the GATK [23] and the mpileup2cns algorithm (min coverage > 30, min var freq > 0.005, p value > 0.05, output vcf = 1, min reads > 2) in the VarScan software [24].

Statistical Analysis
The allele coverage ratio (ACR) of each AISNP locus was calculated by the following formula: minor allele coverage/major allele coverage. The Hardy-Weinberg equilibrium (HWE) tests of the 30 AISNPs in the Xinjiang Hui group were performed using the Genepop software v4.0 [25] with the probability test described by Guo et al. [26]. The allelic frequency differences (δ), fixation index (Fst) and informativeness-for-assignment (In) values of the 30 AISNPs between one continent and the other four continents were calculated by the AncestrySNPminer online tool (https://research.cchmc.org/mershalab/AncestrySNPminer). Then, the Fst and In values of the 30 AISNPs in all continental populations were calculated according to the previous report [2]. By selecting "Hardy-Weinberg principle applies to your marker set", verbose cross-validation analyses of the five continental populations were conducted through the "Thorough analysis of population data" option in Snipper v2.5 (http://mathgene.usc.es/snipper/). The correlation coefficient (r²) values of pairwise AISNP loci in the Hui group were estimated by the Haploview software v4.2 [27]. The heterozygosity values and minor allele frequencies (MAF) of the 30 AISNPs in the Hui group were also calculated by the Haploview software v4.2. The forensic parameters of the 30 AISNPs in the Hui group were estimated by the PowerStats program (Promega, Madison, WI, USA).
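As an illustration of the three divergence measures named above, the following minimal Python sketch computes δ, Fst and In for a single biallelic locus from allele frequencies. It is not the AncestrySNPminer implementation: it assumes equal population weights, uses Wright's heterozygosity-based Fst and Rosenberg's definition of In, and all function names are hypothetical.

```python
import math


def delta(p1, p2):
    """Absolute allele frequency difference between two populations."""
    return abs(p1 - p2)


def fst(freqs):
    """Wright's Fst for one biallelic locus (assumption: equal population weights).

    freqs holds the frequency of the same allele in each population.
    """
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                              # locus is monomorphic everywhere
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def _xlogx(x):
    """x * ln(x) with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def informativeness(freqs):
    """Rosenberg's informativeness for assignment (In) for one biallelic locus."""
    k = len(freqs)
    total = 0.0
    # Sum over both alleles: the counted allele and its complement.
    for allele_freqs in (freqs, [1 - p for p in freqs]):
        p_bar = sum(allele_freqs) / k
        total += -_xlogx(p_bar) + sum(_xlogx(p) for p in allele_freqs) / k
    return total
```

For a locus fixed for different alleles in two populations, this sketch gives δ = 1, Fst = 1 and In = ln 2, which is the maximum In for two populations.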
The population genetic relationships between the Hui group and the 26 reference populations from five continents were determined based on the 30 AISNPs. Principal component analyses (PCAs) of the Hui group and different continental populations were performed using the XLSTAT program (https://www.xlstat.com). Nei's DA distances among the Hui group and 26 reference populations were estimated by the DISPAN software [28]. A neighbor-joining tree of the Hui group and reference populations was reconstructed by the MEGA software v6.0 [29], based on their Nei's DA values. The pairwise Fst values of the Hui group and other reference populations were calculated using the Genepop software v4.0, and the heatmap of pairwise Fst values was built with the R software v3.3 [30]. Based on the ADMIXTURE software v1.3 [31], the genetic structure analyses among the Hui group and other reference populations were performed for each K value from 2 to 6; the detailed parameters used in the ADMIXTURE software were as follows: the block relaxation algorithm was used as the optimization method, and the termination criterion was a log-likelihood increase of less than 10⁻⁴. Then, the estimated ancestral components of these populations were displayed as a bar plot by the CLUMPAK online tool [32].
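To make the distance measure concrete, the sketch below computes Nei's DA distance between two populations for biallelic loci, using the standard definition DA = 1 − (1/r) Σ over loci Σ over alleles √(x·y), where r is the number of loci. It is an illustrative reimplementation, not the DISPAN code, and the function name is hypothetical.

```python
import math


def nei_da(freqs_x, freqs_y):
    """Nei's DA distance between populations X and Y for biallelic loci.

    freqs_x[j] and freqs_y[j] are the frequencies of the same allele at
    locus j in each population; the second allele has frequency 1 - p.
    """
    if len(freqs_x) != len(freqs_y) or not freqs_x:
        raise ValueError("need the same non-empty set of loci in both populations")
    total = 0.0
    for p, q in zip(freqs_x, freqs_y):
        # Sum of sqrt(x_ij * y_ij) over the two alleles of this locus.
        total += math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))
    return 1.0 - total / len(freqs_x)
```

Identical allele frequencies give a distance of 0, and populations fixed for different alleles at every locus give a distance of 1.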

Depth of Coverage and the ACR of the 30 AISNPs in the Hui Group
Depth of coverage (DOC), the number of sequencing reads covering a target region, is commonly used as a metric to evaluate data generated by massively parallel sequencing. For the ForenSeq™ DNA Signature Prep Kit, more than 20 reads was considered the detection threshold for the sequencing data of 230 genetic markers in previous research [33]. Guo et al. additionally proposed that more than 30 reads could be used as the interpretation threshold to aid in the analyses of loci with heterozygote alleles [34]. In this study, we detected the genetic profiles of 30 AISNPs in the Hui group using NGS. The average DOC values for the 30 AISNPs in the Hui group ranged from 821 (rs2075509) to 60,855 (rs1205357), as shown in Figure 1. The lowest DOC value was observed at the rs723220 locus with a DOC value of 30 (data not shown), which was more than the detection threshold and equal to the interpretation threshold mentioned above.
ACR evaluates heterozygote balance (intralocus balance): a locus with a higher ACR is more beneficial for mixture analysis [35]. Since the rs3176921 locus was homozygous in all Hui individuals, its ACR was not estimated. The average ACR values of the remaining 29 AISNPs are also shown in Figure 1. They ranged from 0.7635 to 0.9773, indicating that these 29 AISNPs had good intralocus balance. Therefore, most of the 30 AISNPs may be useful for disentangling mixtures based on the obtained ACR values, which remains to be validated in future research.
Genes 2020, 11, x FOR PEER REVIEW 5 of 13
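The ACR calculation described in the Statistical Analysis section can be sketched as follows. This is a minimal illustration that assumes "minor" and "major" refer to the less and more deeply covered allele at a heterozygous locus; the function names and the example read counts are hypothetical, and the default threshold of 30 reads is the interpretation threshold cited above.

```python
def allele_coverage_ratio(reads_allele_1, reads_allele_2):
    """ACR at a heterozygous locus: lower allele coverage / higher allele coverage."""
    major = max(reads_allele_1, reads_allele_2)
    minor = min(reads_allele_1, reads_allele_2)
    if major == 0:
        raise ValueError("no reads at this locus")
    return minor / major


def meets_interpretation_threshold(reads_allele_1, reads_allele_2, threshold=30):
    """Check total depth of coverage against a read-count threshold."""
    return reads_allele_1 + reads_allele_2 >= threshold


# Hypothetical read counts for one heterozygous genotype.
acr = allele_coverage_ratio(450, 500)
```

An ACR of 1 means both alleles were sequenced to the same depth; values closer to 0 indicate stronger imbalance.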

Genetic Divergences of the 30 AISNPs among the Five Continental Populations
Based on the genetic data of the five continental populations in the 1000 Genomes Project, we evaluated the genetic divergences of the 30 AISNPs among these continental populations. The δ value is the allelic frequency difference of a genetic marker between populations and measures the genetic divergence of that marker [2]. Generally speaking, a locus with a high δ value is suitable as an ancestry informative marker (AIM) for ancestry analyses. First, the δ values of the 30 AISNPs in one continent vs. the other four continents were estimated, as shown in Figure

Ancestry Resolution Comparisons of Different AISNPs among Five Continental Populations
The cross-validation analysis in the Snipper software removes one sample at a time, re-estimates the allelic frequencies of the genetic markers in the training populations, and then infers the ancestral origin of the removed individual based on the remaining dataset. This analysis can evaluate the value of a set of novel AISNPs for inferring ancestry [36]. Cross-validation analyses of five continental populations were conducted based on the 30 studied AISNPs and previously published AISNPs [3,5,7], as presented in Table 1. For the 30 studied AISNPs, most individuals from the five continents could be assigned to their corresponding continental origins. However, some individuals were classified into other continental populations, particularly individuals from American populations. Among these four AISNP panels (Table 1), the 55 AISNPs selected by Kidd et al. provided the best ancestry resolution performance, even for admixed American populations, followed by the 33 AISNPs. Nonetheless, the 30 studied AISNPs displayed slightly higher accuracy for the ancestry analyses of African and East Asian populations than the 33 AISNPs. Given the results in Table 1, the 30 studied AISNPs were able to achieve ancestry analyses of four continental populations (African, East Asian, European and South Asian), which further corroborated the previous findings [19]. We also noted that these 30 AISNPs performed less efficiently in differentiating American populations from other continental populations. The American populations collected by the 1000 Genomes Project possess different proportions of ancestries from European, African and indigenous American populations, which might create challenges for the ancestry analyses of American populations. Therefore, the power of the 30 studied AISNPs to differentiate American populations from other continental populations should be further evaluated with indigenous American individuals. Moreover, highly informative SNPs that are useful for the ancestral analyses of American populations should be incorporated into the developed AISNP panel in the future.
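The leave-one-out procedure described at the start of this section can be sketched in Python as below. This is an illustrative stand-in rather than Snipper's code: it assumes genotypes are coded as the number of copies (0, 1 or 2) of one allele per locus, scores each population by a genotype likelihood under Hardy-Weinberg proportions, and adds one pseudo-count per allele so that no frequency is exactly 0 or 1. All names are hypothetical.

```python
import math


def allele_freqs(genotypes, n_loci):
    """Per-locus allele frequency with one pseudo-count per allele (sketch assumption)."""
    return [
        (sum(g[locus] for g in genotypes) + 1) / (2 * len(genotypes) + 2)
        for locus in range(n_loci)
    ]


def log_likelihood(genotype, freqs):
    """Log-likelihood of a multilocus genotype under HWE and independent loci."""
    ll = 0.0
    for copies, p in zip(genotype, freqs):
        q = 1 - p
        # Index by copy number: 0 -> q^2, 1 -> 2pq, 2 -> p^2.
        ll += math.log((q * q, 2 * p * q, p * p)[copies])
    return ll


def leave_one_out(populations):
    """Return the fraction of correctly assigned individuals per population.

    populations maps a population label to a list of genotype vectors.
    """
    n_loci = len(next(iter(populations.values()))[0])
    correct = {pop: 0 for pop in populations}
    for true_pop, samples in populations.items():
        for i, held_out in enumerate(samples):
            scores = {}
            for pop, members in populations.items():
                # Remove the held-out individual from its own population only.
                training = members[:i] + members[i + 1:] if pop == true_pop else members
                scores[pop] = log_likelihood(held_out, allele_freqs(training, n_loci))
            if max(scores, key=scores.get) == true_pop:
                correct[true_pop] += 1
    return {pop: correct[pop] / len(populations[pop]) for pop in populations}


# Toy data: two populations fixed for opposite alleles at three loci.
toy = {"A": [[2, 2, 2]] * 4, "B": [[0, 0, 0]] * 4}
accuracy = leave_one_out(toy)
```

With real panels, the per-population accuracies from such a procedure correspond to the kind of classification rates compared across panels in Table 1.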

Genetic Distributions and Forensic Parameters of 30 AISNPs in the Hui Group
The p-values of the HWE tests for the 30 AISNPs in the Hui group are presented in Supplementary Table S3. Since the rs3176921 locus was homozygous in all studied Hui individuals, the HWE test for this locus was not conducted. For the other 29 AISNP loci, the p-values were larger than 0.05, except for the rs590086 locus. Nonetheless, all 29 AISNP loci conformed to HWE in the Hui group after applying Bonferroni correction (p = 0.05/29 = 0.0017). Linkage disequilibrium analyses of the 30 AISNPs in the studied Hui group showed that the pairwise r² values of these AISNPs were less than 0.1 (Supplementary Figure S3). These relatively low r² values indicated that these AISNPs had weak correlations and could be viewed as independent of each other in the Hui group.
An SNP locus can be viewed as a valuable marker for forensic individual identification once its MAF is greater than 0.2 [37]. As shown in Figure 3, 14 of the 30 AISNP loci had a MAF greater than 0.2. Furthermore, we also evaluated the heterozygosity of the 30 AISNPs in the Hui group (Figure 3 and Supplementary Table S3). The results showed that the observed heterozygosity (Ho) and expected heterozygosity (He) of the 30 AISNPs in the Hui group ranged from 0.0000 (rs3176921) to 0.5310 (rs723220), and from 0.0000 (rs3176921) to 0.4950 (rs748144), respectively. Nine out of 30 AISNPs showed relatively high He values (>0.4), suggesting that these loci could be used as individual identification SNPs for forensic applications in the Hui group. The forensically relevant parameters of the 30 AISNPs in the studied group are given in Supplementary Table S3. The average matching probability, power of discrimination (PD), polymorphism information content and power of exclusion (PE) values of the 30 AISNPs in the studied Hui group were 0.5653, 0.4347, 0.2441 and 0.0782, respectively. The cumulative power of discrimination (CPD) and cumulative power of exclusion (CPE) values in the studied Hui group were 0.999 999 987 and 0.9183, respectively. Compared to the results of the 30 InDels [14] and STRs [10,15] in the Hui group, these 30 AISNPs were of less value for individual identification and paternity testing. Nevertheless, the CPD value (0.999 999 987) of the 30 AISNPs in the Hui group demonstrated that these AISNPs could also be used as a supplementary tool for forensic individual identification.
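For readers unfamiliar with these forensic parameters, the sketch below shows how matching probability, PD, expected heterozygosity and the cumulative values relate to one another. It uses the standard definitions (matching probability as the sum of squared genotype frequencies, PD = 1 − matching probability, cumulative value = 1 − the product of per-locus complements) and assumes independent loci; it is not the PowerStats implementation, and the function names and example counts are hypothetical.

```python
from math import prod


def matching_probability(genotype_counts):
    """Probability that two random individuals share a genotype at one locus."""
    n = sum(genotype_counts)
    return sum((count / n) ** 2 for count in genotype_counts)


def power_of_discrimination(genotype_counts):
    """PD = 1 - matching probability."""
    return 1.0 - matching_probability(genotype_counts)


def expected_heterozygosity(p):
    """He for a biallelic locus with allele frequencies p and 1 - p."""
    return 1.0 - p ** 2 - (1 - p) ** 2


def cumulative_power(per_locus_values):
    """Combine per-locus PD (or PE) values, assuming independent loci."""
    return 1.0 - prod(1.0 - value for value in per_locus_values)


# Hypothetical genotype counts (AA, AB, BB) at one locus.
pd_one_locus = power_of_discrimination([25, 50, 25])
```

Because the cumulative value multiplies the per-locus complements, a panel of many loci with modest individual PD values can still reach a high CPD, as reported for this panel.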

Phylogenetic Relationships and Population Structure Analyses of the Hui Group and Other Continental Populations
Based on the 30 selected AISNPs, we explored the population genetic relationships among the studied Hui group and other continental populations using multiple methods. PCA is a multivariate analysis method that extracts the most important variables, which account for most of the information in the raw dataset. By reducing the dimensionality of the dataset, the studied subjects can be graphically represented in a two-dimensional plane, which makes the relationships between these subjects visually recognizable [36]. Thus, we first conducted a PCA of the studied Hui group and other continental populations at PC1 and PC2, as shown in Figure 4A. We found that the populations from four continents (African, European, East Asian and South Asian) formed four clusters, and the populations in the same continent clustered together; the American populations were distributed among European, East Asian and South Asian populations. The PCA for these populations was also conducted at an individual level (Supplementary Figure S4), which showed similar distribution patterns. For the studied Hui group, most individuals overlapped with the East Asian individual cluster. Nei's DA distance refers to the genetic distance (related to mutation and genetic drift) between pairwise populations. The DA distance can give a reliable population phylogenetic tree [38]. Next, we constructed a phylogenetic tree among the Hui group and other populations based on Nei's DA distances, as shown in Figure 4B. Three apparent branches could be seen in the phylogenetic tree: five East Asian populations, the studied Hui group, and two American populations (Peruvian in Lima and Mexican Ancestry in Los Angeles) were located in the same branch. The four European populations and the other two American populations (Colombian in Medellin and Puerto Rican in Puerto Rico) were located in one branch, while African and South Asian populations were positioned in another branch. Moreover, our findings for Nei's DA distances demonstrated that the Hui group featured minor genetic differences from East Asian populations. We also assessed the pairwise Fst values of the Hui group and other continental populations, as shown in Supplementary Figure S5. We found that the Hui group had relatively small pairwise Fst values with East Asian populations. The ADMIXTURE software can discern a population's substructure, estimate ancestry components, and study the admixtures between populations [31]. To further dissect the population structure of the studied Hui group, a genetic structural analysis of the Hui group was conducted in comparison to the other continental populations using the ADMIXTURE software (Figure 4C). Firstly, with an increase in the K values, populations from the same continent showed similar genetic component distributions. However, American populations showed admixed ancestral proportions of European and East Asian populations. Secondly, no further distinctions could be made between the Hui group and East Asian populations, revealing close genetic affinities between the Hui group and these East Asian populations.
Yao et al. investigated the genetic structure of the Hui group residing in Gansu province via autosomal STR loci and found that the Hui group might have common genetic ancestry with East Asian populations [39]. He et al. comprehensively explored the genetic background and ancestry components of the Hui group from the Ningxia region and found that the Hui group had close genetic relationships with Chinese Han populations that showed prominent East Asian ancestry components [18]. Zhou et al. explored the admixture signals of the Ningxia Hui group based on a set of InDels and found that the East Asian populations provided greater genetic contributions to the Ningxia Hui group than western Eurasian populations [40]. For the Xinjiang Hui group, similar conclusions were drawn in our previous research [10,14,15]. Here, we further dissected the genetic components of the Xinjiang Hui group based on a set of AISNP loci. The obtained results provided evidence for the East Asian origin of the studied Hui group, which might be related to the greater gene flow between the Hui group and East Asian populations. However, analyses of Y-chromosomal markers in Hui groups from different regions revealed genetic substructure among them [13]. Therefore, further analyses of the genetic components in the Hui groups from other regions should be conducted based on the developed AISNP panel.

Conclusions
In this study, we compared the ancestry resolution of the 30 previously selected AISNPs with that of other published AISNP panels and found that the 30 studied AISNPs could be used for ancestry analyses of African, East Asian, European and South Asian populations. The obtained population data of the 30 AISNPs in the Xinjiang Hui group can be employed as reference data for ancestry analysis of the Hui group. Furthermore, population genetic analysis between the studied Hui group and other continental populations based on the 30 AISNPs revealed that the Hui group might share ancestral origins with East Asian populations.

Conflicts of Interest:
The authors declare no conflict of interest.
Data Availability Statement: Genetic data of the 30 AISNPs in the Xinjiang Hui group were used only for scientific research. Data in this study are available from the corresponding author. Personal information of participants will not be shared with any individuals or organizations.