Introduction

The State of Kuwait is located on the Arabian Gulf in the southwest of the Asian continent and in the heart of the Middle East. Kuwait is bordered by the Kingdom of Saudi Arabia to the south, the Republic of Iraq to the north and west, and Iran to the east, across the Arabian Gulf. Kuwait’s population is about 4.8 million, comprising 1.4 million Kuwaiti nationals and 3.4 million foreign nationals, according to the 2019 census (https://www.paci.gov.kw/stat/Default.aspx). Currently, forensic DNA analysis in Kuwait is carried out by the Kuwaiti Identification DNA Laboratory (KIDL) using only short tandem repeat (STR) markers, including autosomal STRs, Y-chromosome STRs (Y-STRs) and X-chromosome STRs (X-STRs). Autosomal STRs are routinely used for both individual identification and paternity testing, whereas Y-STRs and X-STRs are used less frequently, and only in specific scenarios.

To date, few papers have been published investigating the forensic utility and genetic diversity of autosomal STR markers in the Kuwaiti population. In 2008, Alenizi and colleagues reported the allele frequencies of 15 STR loci included in the AmpFℓSTR Identifiler kit (Thermo Fisher Scientific, MA, USA)1. Based on these 15 STRs, the FST distances between Kuwaiti nationals and foreign nationals from seven other populations residing in Kuwait were found to be consistent with their geographical distances2. Another recent study investigated the forensic utility of 25 autosomal STRs included in two separate kits: the PowerPlex CS7 system and the PowerPlex 21 system (Promega Corporation, WI, USA)3. Although these existing STRs are efficient for analysing cases of simple relationships, more STRs are increasingly required, particularly for complex paternity cases or to increase the discrimination power in cases of partial DNA profiles and DNA mixtures.

Recently, Promega launched the PowerPlex Fusion 6C kit, a six-dye kit that can amplify 27 loci, including the 20 autosomal loci in the expanded CODIS set (CSF1PO, FGA, TH01, TPOX, vWA, D1S1656, D2S1338, D2S441, D3S1358, D5S818, D7S820, D8S1179, D10S1248, D12S391, D13S317, D16S539, D18S51, D19S433, D21S11, and D22S1045)4, three additional autosomal STRs (PentaE, PentaD, and SE33) to increase the power of discrimination, two sex chromosome markers (Amelogenin and DYS391), and two rapidly mutating Y-STRs (DYS570 and DYS576)5. The PowerPlex Fusion 6C kit was validated by multi-laboratory evaluation following SWGDAM guidelines5.

Before utilising this kit for criminal and relationship cases in Kuwait, population and forensic statistical data for the loci in the kit must be evaluated. In this study, we aim to increase the amount of genetic data available for the Kuwaiti population, using the 23 autosomal STRs in the PowerPlex Fusion 6C kit, of which four loci (D10S1248, D22S1045, D2S441 and SE33) have not been reported before for Kuwait. In addition, we aim to evaluate the forensic utility of these autosomal STRs in this underrepresented region, and to investigate the utility of these markers in population genetic differentiation by examining the genetic distance between the Kuwaiti population and other global populations for which data are available.

Materials and methods

Samples and genotyping

Blood samples were collected on Whatman FTA cards (GE Healthcare Life Sciences, IL, USA) from 400 unrelated Kuwaiti individuals (253 males and 147 females). DNA was amplified directly, without quantification, from a 1.2 mm FTA card punch, according to the directions in the PowerPlex Fusion 6C manual, using a SureCycler 8800 thermal cycler (Agilent Technologies, CA, USA). Separation and detection of the DNA fragments were carried out using an Applied Biosystems 3500 Genetic Analyzer (Thermo Fisher Scientific) with the internal lane standard WEN ILS 500 and the allelic ladder provided with the PowerPlex Fusion 6C kit. Genotype determination and allele calling were carried out for the 23 autosomal loci only, using GeneMapper ID-X software version 1.4 (Thermo Fisher Scientific).

Statistical analysis

Data analysis was carried out for the 23 autosomal loci only (the sex chromosomes are not included in this paper). Arlequin statistical software version 3.5 was used to calculate allele frequencies, to test for linkage disequilibrium, and to test for deviation from the Hardy–Weinberg Equilibrium6. Forensic parameters, including the random match probability (RMP), discrimination power (DP), power of exclusion (PE), typical paternity index (TPI) and polymorphic information content (PIC), were calculated using STRAF (http://cmpg.unibe.ch/shiny/STRAF/), an online tool for STR data analysis7.
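The forensic parameters listed above can be illustrated with a short sketch. It assumes one commonly used set of per-locus definitions (match probability from observed genotype frequencies, PE and TPI from observed heterozygosity and homozygosity, and PIC from allele frequencies); whether STRAF uses exactly these formulas should be checked against its documentation. The function name and the toy genotypes are hypothetical and are not data from this study.

```python
from collections import Counter
from itertools import combinations


def forensic_parameters(genotypes):
    """Per-locus statistics from a list of (allele, allele) genotypes."""
    n = len(genotypes)

    # Allele frequencies by gene counting: each individual carries two copies.
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = [count / (2 * n) for count in allele_counts.values()]

    # Observed genotype frequencies; allele order within a genotype is ignored.
    genotype_counts = Counter(tuple(sorted(pair)) for pair in genotypes)
    rmp = sum((count / n) ** 2 for count in genotype_counts.values())

    het = sum(1 for a, b in genotypes if a != b) / n  # observed heterozygosity
    hom = 1 - het                                     # observed homozygosity

    # PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2
    pic = (1 - sum(p ** 2 for p in freqs)
           - sum(2 * p ** 2 * q ** 2 for p, q in combinations(freqs, 2)))
    pe = het ** 2 * (1 - 2 * het * hom ** 2)
    tpi = 1 / (2 * hom) if hom > 0 else float("inf")

    return {"RMP": rmp, "DP": 1 - rmp, "PE": pe, "TPI": tpi, "PIC": pic}


# Invented toy genotypes for one locus (not data from this study).
toy_locus = [(9, 9), (9, 10), (10, 11), (9, 11)]
print(forensic_parameters(toy_locus))
```

In practice the function would be applied to each locus in turn, with microvariant alleles such as 9.3 stored as floats or strings so that they sort consistently.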

Intra-population genetic structure among Kuwaitis

Countries in the Arabian Peninsula, including Kuwait, have a high rate of consanguineous marriage, which causes differential distribution of alleles among families and tribes, resulting in population genetic stratification8, 9. Newly introduced markers must therefore be assessed for population structure, to avoid calculating forensic parameters from allele frequencies of the total population when those of the relevant subpopulation would be more accurate. Stratification also reduces discrimination power, because the chance of random individuals possessing similar genotypes is higher within a subpopulation than within the total population10. Two methods were therefore used to detect genetic structure in the population: principal component analysis (PCA) and a Bayesian method implemented in STRUCTURE version 2.3.411,12,13.

In order to demonstrate whether these two methods were able to cluster the samples into their real subpopulations, each sample was categorised into one of three ancestral subgroups (K = 3) based on the donor’s surname. It has previously been found that the Kuwaiti population is mainly composed of settlers coming from three different regions: the Arabian Peninsula (from Saudi Arabia), the desert (representing nomadic tribes), and Persian countries (mainly from Iran)9, 14,15,16,17. On this basis, the samples were categorised into three groups: KW-1 (n = 162) representing individuals originating from the Arabian Peninsula, KW-2 (n = 163), which consists of those coming from Persian countries and Iraq (north), and KW-3 (n = 75) composed of Bedouin individuals coming from nomadic tribes. PCA was carried out on allele frequencies at the 23 autosomal STR loci for the different population groups KW-1–KW-3 using R software33 and visualised using the factoextra package34.

In contrast to PCA, which is an unsupervised clustering algorithm, STRUCTURE (a Bayesian-based approach) takes a range of numbers of populations (K) in order to calculate the proportion of the genome of each individual in the sample originating from each inferred population11. STRUCTURE calculates the likelihood of the data (X) for a range of K values, and the true value of K is nominally indicated by the maximal value of Ln P(X|K). However, Evanno et al.18 found that the maximal value does not always identify the correct value of K. Instead, the maximal rate of change (Delta K) in Ln P(X|K) between successive K values more accurately infers the true number of genetic clusters in the data18. As such, both Ln P(X|K) and Delta K at each K were calculated and reported. STRUCTURE was run without population information, as recommended in the STRUCTURE documentation, in order to check whether the results approximately agreed with the separation of samples into their subgroups. Thus, the predefined groups (KW-1 to KW-3) were only included as a population label rather than as prior information for the analysis. The parameters for the analysis were set as follows: ‘admixture’ and ‘correlated allele frequencies’ models using 100,000 Markov Chain Monte Carlo (MCMC) steps for each run, with the first 100,000 discarded as a burn-in, and the inferred number of K was set from 1 to 10. At each K, the analysis was repeated five times in order to test the results for consistency. The results were visualised using CLUMPAK (Clustering Markov Packager Across K, available at http://clumpak.tau.ac.il/index.html)19, and the best K was calculated using STRUCTURE HARVESTER (available at http://taylor0.biology.ucla.edu/structureHarvester/)20.
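The Delta K criterion can be sketched as follows. This is a minimal illustration that computes the absolute second-order difference of the mean Ln P(X|K) across replicate runs and divides it by the sample standard deviation of Ln P(X|K) at that K; the exact averaging and standard deviation conventions used by STRUCTURE HARVESTER are an assumption here. The function name and the log-likelihood values are invented for the example.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Map each interior K to Delta K.

    lnp_by_k: dict of K -> list of Ln P(X|K) values from replicate runs.
    Delta K is undefined for the smallest and largest K tested, because
    it needs the likelihood at both K - 1 and K + 1.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            sd = stdev(lnp_by_k[k])  # sample SD across replicate runs
            result[k] = second_diff / sd if sd > 0 else float("inf")
    return result


# Invented replicate log-likelihoods (two runs per K), not study results.
toy_runs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -904.0],
    3: [-890.0, -894.0],
    4: [-888.0, -892.0],
}
scores = delta_k(toy_runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because the denominator is the spread between replicate runs, a K with unstable likelihood estimates (high standard deviation) is penalised, which is why replicate runs at every K are needed.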

Inter-population genetic structure and population relationships

To assess the genetic distance between the Kuwaiti sample and other global populations, PCA was conducted based on allele frequencies of the 23 autosomal loci for 57 global populations grouped into seven continental regions: Africa (AFR), America (AMR), Central and South Asia (C_S_ASIA), the Middle East (ME), Europe (EUR) and East Asia (E_ASIA). Data for the global populations were obtained from the HGDP-CEPH Human Genome Diversity Panel (HGDP-CEPH) using the online forensic STR frequency browser, popSTR (http://spsmart.cesga.es/popstr.php)21,22,23. Data from Lebanon (LEB) and an Indian (IND) population from Madhya Pradesh typed for the 23 autosomal loci24 were also included in the analysis. Genetic distance was also assessed at a regional level using allele frequencies for 13 of the 23 autosomal loci (CSF1PO, D13S317, D16S539, D18S51, D21S11, D3S1358, D5S818, D7S820, D8S1179, FGA, TH01, TPOX, vWA) that are shared between the data reported in this study and other studies of Kuwait and neighbouring countries: Kuwait (KW13 and KW22), Iran (IRN25 and IRN12), Saudi Arabia (SA26 and SA127), Qatar (QAT28), Oman (OMN29), Yemen (YEM29), United Arab Emirates (UAE30), Bahrain (BAH31), and Iraq (IRQ32 and IRQ12). PCA was conducted using R software33 and visualised using the factoextra package34.
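The idea of a PCA on population allele frequencies can be sketched in a few lines. The analysis in this study was run in R; the snippet below is only a dependency-free stand-in that centres each allele-frequency column and finds the first principal component by power iteration. The function name, the starting vector and the toy frequencies are all assumptions made for illustration.

```python
from math import sqrt


def first_pc_scores(freq_rows, iterations=500):
    """Scores of each population on the first principal component.

    freq_rows: one row per population, one column per allele frequency.
    """
    n, m = len(freq_rows), len(freq_rows[0])

    # Centre every column (allele) on its mean across populations.
    col_means = [sum(row[j] for row in freq_rows) / n for j in range(m)]
    x = [[row[j] - col_means[j] for j in range(m)] for row in freq_rows]

    # Power iteration on X^T X. The start vector is deliberately not
    # constant: centred frequencies at one locus sum to zero in each row,
    # so a constant vector would be mapped to zero.
    v = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        proj = [sum(r[j] * v[j] for j in range(m)) for r in x]             # X v
        w = [sum(x[i][j] * proj[i] for i in range(n)) for j in range(m)]   # X^T (X v)
        norm = sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]

    return [sum(r[j] * v[j] for j in range(m)) for r in x]


# Invented frequencies for three alleles at one locus in four populations.
toy = [
    [0.60, 0.30, 0.10],  # population A
    [0.58, 0.32, 0.10],  # population B, similar to A
    [0.20, 0.30, 0.50],  # population C
    [0.22, 0.28, 0.50],  # population D, similar to C
]
print(first_pc_scores(toy))
```

Populations with similar allele frequencies receive similar scores, so they plot close together; the sign of a principal component is arbitrary, and only relative positions are meaningful.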

In addition to the PCA, we studied the genetic relationship between the Kuwaiti samples and the other populations at both the continental and regional levels using phylogenetic trees. Pairwise genetic distances (DA) based on Nei et al.35 were calculated from the allele frequencies of the populations using POPTREE2 software36, and Neighbour-joining (NJ) trees were constructed from these distances using Mega X software version 10.0.537.
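As an illustration of the distance measure, the DA distance of Nei et al. between two populations is one minus the average, over loci, of the sum over alleles of the square root of the product of the two populations' allele frequencies. The sketch below implements that definition directly; it is not the POPTREE2 implementation, and the function name, data layout and frequencies are hypothetical.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's DA distance between two populations.

    pop_x, pop_y: dict of locus -> dict of allele -> frequency.
    Only loci present in both populations are used.
    """
    loci = sorted(set(pop_x) & set(pop_y))
    if not loci:
        raise ValueError("the two populations share no loci")

    total = 0.0
    for locus in loci:
        fx, fy = pop_x[locus], pop_y[locus]
        # Alleles missing from either population contribute zero.
        total += sum(sqrt(fx[a] * fy[a]) for a in set(fx) & set(fy))
    return 1 - total / len(loci)


# Invented frequencies at a single locus, not values from this study.
pop_a = {"TH01": {6: 0.5, 9.3: 0.5}}
pop_b = {"TH01": {6: 0.5, 7: 0.5}}
print(nei_da(pop_a, pop_b))
```

The distance is 0 for populations with identical allele frequencies and 1 for populations that share no alleles; a matrix of such pairwise values is the input to Neighbour-joining tree construction.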

Ethics statement

The study was performed in accordance with the University of Strathclyde code of practice on investigations involving human beings, and ethical approval (reference number DEC18/PAC06) was granted by the Department of Pure and Applied Chemistry Ethics Committee. Written, informed consent was obtained from all participants prior to sampling.

Results and discussion

Allele frequencies and forensic performance

Full PowerPlex Fusion 6C STR profiles were recovered from blood samples taken from 400 Kuwaiti individuals. Table 1 shows the allele frequencies and forensic parameters calculated for these samples. Similar to studies of other global populations38, 39, SE33 was the most discriminative locus in the Kuwaiti population, having 45 different alleles (PIC = 0.945). In contrast, TPOX was the least discriminative locus, with only eight different alleles (PIC = 0.616). The calculated combined match probability (CMP) was 7.37 × 10^−30, meaning that the probability of observing two identical profiles for the 23 autosomal loci in the Kuwaiti population was 1 in 1.36 × 10^29. The TPI ranged between 1.439 (TPOX) and 8.333 (SE33), and the combined PE was > 99.9999%. These high values indicate the usefulness of the PowerPlex Fusion 6C kit for both human identification and paternity testing in the Kuwaiti population.
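The combined statistics follow from the per-locus values under the assumption that the loci are independent. A brief sketch: the combined match probability is the product of the per-locus match probabilities, the combined PE is one minus the product of the per-locus non-exclusion probabilities, and the "1 in N" figure is the reciprocal of the CMP. The function names and the example per-locus values are illustrative only; the CMP of 7.37 × 10^−30 is the value reported above.

```python
from math import prod


def combined_match_probability(per_locus_rmp):
    """Product of per-locus match probabilities (assumes independent loci)."""
    return prod(per_locus_rmp)


def combined_power_of_exclusion(per_locus_pe):
    """1 minus the product of per-locus non-exclusion probabilities."""
    return 1 - prod(1 - pe for pe in per_locus_pe)


# Invented per-locus values for two loci, to show how the values combine.
print(combined_match_probability([0.1, 0.2]))
print(combined_power_of_exclusion([0.5, 0.5]))

# Reciprocal of the CMP reported in the text.
cmp_reported = 7.37e-30
one_in = 1 / cmp_reported
print(f"1 in {one_in:.2e}")
```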

Table 1 Allele frequencies among 400 Kuwaiti individuals typed at 23 autosomal STR loci in the PowerPlex Fusion 6C kit.

Statistical analysis of populations

No significant deviation from the expectations of the Hardy–Weinberg Equilibrium was detected at any locus in the Kuwaiti genotypic data. The two alleles at each PowerPlex Fusion 6C autosomal STR locus can therefore be treated as independent, and genotype frequencies can be estimated from allele frequencies. Association between alleles at all possible pairwise combinations of loci was evaluated using the linkage disequilibrium test. Significant linkage disequilibrium was detected between 22 (of a total of 253) pairs of loci (p < 0.05). However, after Bonferroni correction of the significance level for the number of tests (0.05/253 = 0.000198), none of the pairs of loci showed significant linkage disequilibrium, indicating that all loci are statistically independent. Their per-locus genotype frequencies can therefore be multiplied together to estimate match probabilities in the Kuwaiti population.
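The arithmetic behind the Bonferroni correction can be checked with a short sketch: 23 loci give 23 × 22 / 2 = 253 locus pairs, and dividing the nominal significance level by the number of tests gives the corrected threshold. The helper function and the example p-values are hypothetical.

```python
from math import comb

n_loci = 23
alpha = 0.05

n_pairs = comb(n_loci, 2)      # number of pairwise locus comparisons
threshold = alpha / n_pairs    # Bonferroni-corrected significance level

print(n_pairs)                 # 253
print(round(threshold, 6))     # 0.000198


def significant_after_bonferroni(p_values, alpha=0.05):
    """Indices of tests that remain significant after Bonferroni correction."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Invented p-values for three tests: only the first is below 0.05 / 3.
print(significant_after_bonferroni([0.001, 0.03, 0.2]))
```

With 253 tests at a nominal level of 0.05, about 253 × 0.05 ≈ 12.7 pairs would be expected to fall below the uncorrected threshold by chance alone if all loci were independent, which is why correction for multiple testing is applied.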

Off-ladder and novel alleles

Alleles that could not be identified using the GeneMapper allelic ladder for the PowerPlex Fusion 6C kit were assigned as off-ladder (OL) alleles, and were observed in 13 samples. These samples were re-amplified for confirmation and all OL alleles were confirmed. OL alleles were observed at the PentaE (5 alleles), PentaD (1 allele), D22S1045 (1 allele), SE33 (5 alleles), and D18S51 (1 allele) loci. The samples were previously sequenced using the Verogen ForenSeq DNA Signature Prep kit (manuscript in preparation), and these data were examined to determine whether the undesignated alleles at the PentaE, PentaD, D22S1045 and SE33 loci could be identified; the repeat structure sequences from this dataset are shown in Table 2 and permitted all alleles to be identified. The D18S51 locus is not included in the ForenSeq kit; therefore, its OL allele was identified using the allelic ladder bins created in GeneMapper software.

Table 2 Off-ladder alleles in the Kuwaiti population identified using sequencing data.

All of the identified alleles have been reported previously in the STRBase database (an online STR database created by the United States National Institute of Standards and Technology (NIST)40), except for the PentaD 11.2 allele, which is a novel allele not reported before in the literature.

Intra-population genetic structure

Markers used for human identification may have weaker discrimination power in structured populations than in unstructured ones, because the presence of subpopulation groups affects the random match probability. Individuals from the same subpopulation tend to share alleles, so the likelihood of two random individuals possessing similar genotypes increases in the presence of genetic structure10. Although no significant deviation from the expectations of the Hardy–Weinberg Equilibrium was detected at any locus in this study, which suggests that there is no genetic stratification, it is still useful to assess whether the markers reveal any genetic clusters within the data. To this end, PCA was carried out on the DNA profiles obtained from the Kuwaiti samples for the 23 autosomal PowerPlex Fusion 6C markers. PCA is an unsupervised method that does not require any prior information about the ancestral origin of the samples; it places samples according to their similarity to each other, so that homogeneous groups of individuals appear as clusters on a PCA plot. As expected, the PCA plot (Fig. 1) did not show any pattern of segregation that could be related to the ancestral population of origin of the individuals, indicating that there is no detectable genetic structure within the sample.

Figure 1

Principal component analysis (PCA) plot showing samples from 400 individuals from the Kuwaiti population, typed for the 23 autosomal STR markers in the PowerPlex Fusion 6C kit. The samples were colour-coded based on their subpopulation of origin KUW-1 to KUW-3. The size of the point represents the number of samples (generated using factoextra package34 in R software33).

Another widely used method to infer population structure in genetic data is the Bayesian-based model implemented in the STRUCTURE software, which calculates how likely each individual in the data is to belong to each of K populations (K being predetermined by the user), and then uses this information to assign individuals to population subgroups18. The analysis was run without population information, and the mean log likelihood across five repeated runs of the analysis for each value of K (from 1 to 10) was estimated. The results showed inconsistency in estimating the log likelihood at K = 5 and over, which is indicated by the high standard deviation (SD), as presented in Supplementary Figure S1A. Based on the method described in Evanno et al.18, the most likely inferred value of K was 7, as this is the number of populations at which the highest Delta K value was recorded (Supplementary Figure S1B).

However, whilst the results indicated that the data are most probable at K = 7, there was no clear genetic differentiation between individuals in the sample. This can be seen in Supplementary Figure S2, which shows no clear signal of structuring between the three subpopulation groups, in terms of the proportion of each individual’s genetic ancestry assigned to each population, regardless of the number of populations assumed. This is further supported by the relatively small increases in mean log likelihood and Delta K values from K = 2 to K = 3, suggesting that there is limited evidence for any genetic structuring within the Kuwaiti population sample, in agreement with the PCA above.

Genetic distance

To investigate the genetic distance between the Kuwaiti population and other global populations, allele frequencies for the 23 autosomal STRs in the PowerPlex Fusion 6C kit were obtained from the HGDP-CEPH global panel, which contains 57 populations grouped into seven global regions, and consists of eight African (N = 507), six American (N = 551), nine Central and South Asian (N = 202), four Middle Eastern (N = 160), 11 European (N = 2135), and 17 East Asian (N = 227) populations. An Indian population (N = 374) and a Lebanese population (N = 505) have also been typed for the 23 loci, and were therefore added to the analysis. Both PCA and phylogenetic analyses were carried out, and the resulting plots characterise the genetic differentiation between populations. The distribution of the populations on the PCA plot (Fig. 2) and the genetic distances between them on the NJ tree (Fig. 4A) show that the Kuwaiti population is genetically closest to the Lebanese and Middle Eastern groups, the latter of which include Mozabite, Druze, Palestinian and Bedouin populations. This is consistent with gene flow between these geographically close locations, which leads to more similar allele frequency distributions among them.

Figure 2

Principal component analysis of genetic distance based on allele frequencies of 23 autosomal PowerPlex Fusion 6C loci shared between the population from Kuwait (KUW; this study) and those from Africa (AFR), America (AMR), Central and South Asia (C_S_ASIA), the Middle East (ME), Europe (EUR), East Asia (E_ASIA), Lebanon (LEB) and India (IND; Madhya Pradesh) (generated using factoextra package34 in R software33).

At the regional level, genetic distance was assessed based on the 13 loci shared between this study and studies examining other populations from the Arabian Peninsula. The resulting PCA plot (Fig. 3), and NJ tree (Fig. 4B) show that our Kuwaiti dataset broadly clustered with the previously published Kuwaiti data, and was genetically closer to the countries in the north of the region such as Iraq and Iran. In contrast, there was a higher level of genetic differentiation between Kuwait and Saudi Arabia, Yemen and Qatar, which were clustered in the upper-right part of the PCA plot, and Bahrain, Oman and UAE, which were clustered in the lower part of the plot.

Figure 3

Principal component analysis of genetic distance based on allele frequencies of 13 autosomal PowerPlex Fusion 6C loci (CSF1PO, D13S317, D16S539, D18S51, D21S11, D3S1358, D5S818, D7S820, D8S1179, FGA, TH01, TPOX, vWA) shared between the dataset reported here (KUW), previously published Kuwaiti studies (KUW1 and KUW2), and studies of the populations of neighbouring countries Iraq (IRQ and IRQ1), Iran (IRN and IRN1), Saudi Arabia (SA and SA1), United Arab Emirates (UAE), Oman (OMN), Yemen (YEM), and Bahrain (BAH) (generated using factoextra package34 in R software33).

Figure 4

Neighbour-joining tree based on pairwise DA values from allele frequencies of (A) the 23 autosomal PowerPlex Fusion 6C loci in Kuwait and populations from Africa (AFR), America (AMR), Central and South Asia (C_S_ASIA), the Middle East (ME), Europe (EUR), East Asia (E_ASIA), Lebanon (LEB) and India (IND; Madhya Pradesh), and (B) the 13 loci shared between the dataset reported here (KUW), previously published Kuwaiti studies (KUW1 and KUW2), and studies of the populations of neighbouring countries Iraq (IRQ and IRQ1), Iran (IRN and IRN1), Saudi Arabia (SA and SA1), United Arab Emirates (UAE), Oman (OMN), Yemen (YEM), and Bahrain (BAH). The nodes represent a common ancestor and the branching tips are descendants of that common ancestry (generated using Mega X software version 10.0.537).

In this study, 30% of individuals declared their origins as being from the north (Iraq and Iran), 39% from the south region (Saudi Arabia, Bedouin and Bahrain), and 24% had parents of different origin (admixed). Therefore, the closer genetic relationship of our samples to the northern region might be due to the presence of these individuals. There is no information available about the population of origin for the samples collected in the two previous Kuwaiti studies (KW13 and KW22). It is therefore not possible to determine whether sampling from different sub-populations could explain why, in contrast to our sample, these two Kuwaiti samples cluster more closely with the Saudi Arabian sample than the samples from Iran and Iraq. Overall, it can be seen that the allele frequencies of the 23 autosomal markers in the PowerPlex Fusion 6C kit can be successfully used to separate both geographically distant global populations and closely related populations on the basis of their genetic distance, making them a good choice for detecting genetic differentiation between populations.

Conclusion

This study evaluated the forensic utility of the 23 autosomal STR loci included in the Promega PowerPlex Fusion 6C kit for the Kuwaiti population. Among these loci, D10S1248, D22S1045, D2S441 and SE33 are reported for the first time for Kuwait. The genetic data indicate that these 23 autosomal STRs are highly polymorphic in the Kuwaiti population and are of high value for human identification and paternity testing. STRUCTURE and PCA analysis show no signature of genetic structuring of the Kuwaiti population into subpopulations. Comparison of the Kuwaiti population to other global populations indicates that Kuwait clusters with other Middle Eastern populations, and shows a close relationship with Iran and Iraq, suggesting that they may share common ancestry.