A study of genomic diversity in populations of Maharashtra, India, inferred from 20 autosomal STR markers

This study evaluated the genetic diversity of the admixed and Teli (a Hindu caste) populations of Maharashtra, India, using 20 autosomal Short Tandem Repeat (STR) genetic markers. We further investigated the genetic relatedness of the studied populations with other Indian populations. The studied populations showed a wide range of observed heterozygosity, viz. 0.690 to 0.918 for the admixed population and 0.696 to 0.942 for the Teli population, which might be due to multi-directional gene flow. The admixed and Teli populations also showed a high degree of polymorphism, which ranged from 0.652 to 0.903 and from 0.644 to 0.902, respectively. Their combined matching probabilities over all the studied loci were 4.29 × 10⁻²⁵ and 5.01 × 10⁻²⁴, respectively. The Neighbor-Joining tree and Principal Component Analysis showed that the studied populations clustered with the general populations of Jharkhand, Uttar Pradesh, Rajasthan and the Central Indian states, as well as with specific populations of Maharashtra (Konkanastha Brahmins) and Tamil Nadu (Kurumans). Overall, the obtained data showed a high degree of forensic efficacy and would be useful for forensic applications as well as genealogical studies.


Introduction
The state of Maharashtra is located in the western peninsular region of India. It is the third-largest state by area and the second-most populous state in the country. It shares its geographical boundaries with the states of Karnataka and Goa in the South, Telangana in the South-east, Chhattisgarh in the East, Gujarat and Madhya Pradesh in the North, Dadra-Nagar Haveli in the North-west, and [3,[5][6][7]. Maharashtra is home not only to people from varied regions but also to people of different castes (Hindu hierarchical groups), such as the Teli caste. The 'Teli' community derives its name from the Sanskrit word 'talika' or 'taila', which means oil and points to the traditional occupation of the Teli community: extracting oil from sesame and mustard seeds. According to one Hindu mythological reference, the first Teli individual was created by 'Lord Shiva' to rub him with oil [8].
The overall social, cultural, and linguistic diversity of the state of Maharashtra led us to evaluate the genomic diversity of the admixed and Teli populations of this state. The genomic data of the selected populations were evaluated using in-silico (computational) techniques through various population-data software and servers such as GeneMapper™ ID-X, Arlequin v3.5, POPTREE2, PAST 3.02a, etc. In-silico techniques are an efficient approach for evaluating very large genomic data sets such as STRs, SNPs, large sequences and NGS data [9][10][11][12] because they can analyze large data sets quickly, with high throughput and accuracy.

Main text
To investigate the genetic diversity of the admixed and Teli populations of Maharashtra, we randomly selected 158 and 69 unrelated healthy adults, respectively. The subjects in the admixed group belonged to almost all the population groups residing in the state of Maharashtra and hence represented the diverse population of Maharashtra. In contrast, the subjects in the Teli group were recruited only from the Teli community. An online randomization tool, the randomizer (www.random.org), was used to randomly allocate subjects to each group prior to sample collection.
Fig. 1 a Geographical locations of the studied and compared populations (map is not under copyright; map was created with mapchart.net). b Phylogenetic distance between studied admixed population of Maharashtra and compared populations. c Phylogenetic distance between studied Teli population of Maharashtra and compared populations
First, an interview was conducted to confirm that each participant's ancestors had resided within the geographical boundaries of Maharashtra for more than three generations. Next, blood samples were collected from each participant following the ethical guidelines and the Declaration of Helsinki [13]. DNA was extracted from the collected blood samples by the Phenol-Chloroform Isoamyl Alcohol (PCIA) organic extraction method [14]. The extracted DNA was quantified using the PowerQuant® DNA Quantification kit (Promega, Madison, USA) in a Real-Time Polymerase Chain Reaction machine (RT-PCR-7500) (Thermo Fisher Scientific, CA, USA) as recommended by the manufacturer (except that a half-reaction volume was used). A 500 pg DNA template was used to amplify the loci of the PowerPlex® 21 System (Promega), which comprises 20 autosomal STR loci and Amelogenin, on a Veriti™ 96-Well Fast Thermal Cycler (Thermo Fisher Scientific, CA, USA) as per the manufacturer's recommendations (again except for the half-reaction volume). The amplified DNA fragments were separated by capillary electrophoresis using POP™-4 polymer and a 36 cm capillary array on a Genetic Analyzer 3500XL (Thermo Fisher Scientific, CA, USA) as recommended by the manufacturer. The allelic ladder provided with the kit was used to assign allele numbers at each locus. The DNA profiles were evaluated using the GeneMapper™ ID-X v1.5 software (Thermo Fisher Scientific, CA, USA). Positive and negative controls were included in the experiment for quality control. Additionally, the authors conducting this study have passed the proficiency test conducted by GITAD, Spain (http://gitad.ugr.es/principal.htm).
The obtained genetic data were analyzed using statistical software. The GenAlEx 6.5 software [15] was used to calculate the allele frequencies, and the PowerStats v1.2 spreadsheet program [16] was used to calculate forensic parameters, namely polymorphic information content (PIC), power of discrimination (PD), power of exclusion (PE), matching probability (PM) and paternity index (PI). The observed heterozygosity (Hobs), expected heterozygosity (Hexp) and Hardy-Weinberg equilibrium (HWE) were evaluated using the Arlequin v3.5 software [17]. The POPTREE2 program [18] was used to calculate Nei's genetic distances [19] among the compared populations and to draw the neighbor-joining (NJ) tree. The PAST 3.02a software [20] was used for the graphical representation of genetic distances among the compared populations, based on principal component analysis (PCA). A maximum likelihood (ML) phylogenetic tree was reconstructed as described earlier [21].
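As an illustration of how per-locus forensic parameters of this kind can be derived from genotype data, the following minimal Python sketch uses the formulas commonly associated with PowerStats-style spreadsheets. It is not the software used in this study: the function name `forensic_parameters`, the input layout (a list of allele pairs for one locus) and the example genotypes are assumptions made for illustration only.

```python
from collections import Counter


def forensic_parameters(genotypes):
    """Per-locus forensic statistics from a list of (allele, allele) genotypes.

    Hypothetical helper for illustration; not the PowerStats implementation.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one genotype is required")

    # Allele frequencies by gene counting (2n allele copies in n diploid individuals).
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    # Genotype frequencies; sorting makes (12, 14) and (14, 12) the same genotype.
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    geno_freqs = [c / n for c in geno_counts.values()]

    h_obs = sum(1 for a, b in genotypes if a != b) / n  # observed heterozygosity
    hom = 1.0 - h_obs                                   # observed homozygosity

    sum_p2 = sum(p ** 2 for p in freqs.values())
    sum_p4 = sum(p ** 4 for p in freqs.values())

    pm = sum(f ** 2 for f in geno_freqs)  # matching probability
    return {
        # PIC = 1 - sum(p_i^2) - sum_{i<j} 2 p_i^2 p_j^2
        "PIC": 1.0 - sum_p2 - (sum_p2 ** 2 - sum_p4),
        "PM": pm,
        "PD": 1.0 - pm,
        # PE = h^2 * (1 - 2 h H^2), with h = heterozygosity and H = homozygosity
        "PE": h_obs ** 2 * (1.0 - 2.0 * h_obs * hom ** 2),
        # Typical paternity index = 1 / (2 H); undefined when no homozygotes are seen
        "PI": float("inf") if hom == 0 else 1.0 / (2.0 * hom),
        "Hobs": h_obs,
    }


# Made-up genotypes at a single locus, for illustration only.
example = forensic_parameters([(10, 11), (11, 11), (10, 12), (11, 12)])
```

The sketch computes everything from observed genotype counts, so the values depend on sample size; the expected heterozygosity and the HWE test reported by Arlequin are not reproduced here.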
A total of 228 alleles, with an average of 11.4 alleles per locus, were observed in the admixed population group, while a total of 194 alleles, with an average of 9.7 alleles per locus, were observed in the Teli population group. In the admixed population group, locus D3S1358 showed the minimum number of alleles (5), and loci Penta E and D21S11 showed the maximum (19). In the Teli population group, loci D3S1358 and TPOX showed the minimum number of alleles (5), and locus Penta E showed the maximum (17). The allele frequencies ranged from 0.003 to 0.427 in the admixed group and from 0.007 to 0.435 in the Teli group. Allele 11 of locus TPOX was the most frequent allele in both the admixed (Table 1) and Teli (Additional file 1: Table S1) population groups. All the studied loci in both population groups conformed to Hardy-Weinberg equilibrium after Bonferroni correction (corrected significance threshold P = 0.05/20 = 0.0025).
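The Bonferroni step above is simple arithmetic: the nominal significance level of 0.05 is divided by the number of loci tested (20), and a locus is taken to depart from HWE only if its p-value falls below that corrected threshold. A minimal sketch follows; the function names and the example p-values are hypothetical and are not results from this study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Corrected per-test significance threshold for n_tests comparisons."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def loci_rejecting_hwe(p_values: dict, alpha: float = 0.05) -> list:
    """Loci whose HWE p-value falls below the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sorted(locus for locus, p in p_values.items() if p < threshold)


# Threshold for the 20 loci tested here: 0.05 / 20.
threshold_20_loci = bonferroni_threshold(0.05, 20)

# Hypothetical p-values for three made-up loci (illustration only).
example_p_values = {"LocusA": 0.0400, "LocusB": 0.0010, "LocusC": 0.6200}
departing = loci_rejecting_hwe(example_p_values)
```

In the hypothetical example, LocusA would be flagged at the uncorrected 0.05 level but not after correction, which is the situation the correction is meant to guard against when many loci are tested at once.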
The obtained forensic efficacy parameters for the admixed and Teli populations of Maharashtra are shown in Table 1 and Additional file 1: Table S1, respectively. Penta E was the most polymorphic locus in both population groups, with a value of 0.903 in the admixed and 0.902 in the Teli population group. In contrast, TPOX was the least polymorphic of all the studied loci, with a value of 0.652 in the admixed and 0.644 in the Teli population group. The wide range of observed heterozygosity (Hobs) in the admixed (0.690 to 0.918) and Teli (0.696 to 0.942) groups might have resulted from gene flow into the studied populations from various directions.
The power of discrimination (PD) for the admixed population group ranged from 0.857 (TPOX) to 0.980 (Penta E) and the PD for the Teli population group ranged from 0.849 (TPOX) to 0.974 (Penta E), with a combined value over all the studied loci of 1 for both groups. In the admixed group, the power of exclusion (PE) ranged from 0.413 (CSF1PO) to 0.832 (D1S1656), with a combined value over all the studied loci of 0.999999998666, whereas in the Teli group, the PE ranged from 0.422 (D5S818) to 0.882 (D1S1656), with a combined value over all the studied loci of 0.999999999652. The combined matching probability over all 20 studied autosomal STR loci was 4.29 × 10⁻²⁵ (Table 1) for the admixed group and 5.01 × 10⁻²⁴ (Additional file 1: Table S1) for the Teli group.
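Combined values of this kind are conventionally obtained by treating the loci as independent: the combined matching probability is the product of the per-locus PM values, the combined PD is one minus that product, and the combined PE is one minus the product of the per-locus (1 − PE) terms. The sketch below illustrates that arithmetic under the independence assumption; the function names are hypothetical, and only the two PD extremes quoted above for the admixed group are used as example inputs.

```python
import math


def combined_matching_probability(pm_values):
    """Product of per-locus matching probabilities (loci assumed independent)."""
    return math.prod(pm_values)


def combined_power_of_discrimination(pm_values):
    """One minus the combined matching probability."""
    return 1.0 - math.prod(pm_values)


def combined_power_of_exclusion(pe_values):
    """One minus the probability that no locus excludes."""
    return 1.0 - math.prod(1.0 - pe for pe in pe_values)


# Two-locus illustration using the PD extremes quoted in the text for the
# admixed group (TPOX and Penta E), with PM taken as 1 - PD.
pm_two_loci = [1.0 - 0.857, 1.0 - 0.980]
cpm_two_loci = combined_matching_probability(pm_two_loci)

# With a combined matching probability near 1e-25, the combined PD cannot be
# distinguished from 1 in double-precision arithmetic.
cpd_rounds_to_one = combined_power_of_discrimination([4.29e-25]) == 1.0
```

The last line shows why a combined PD is usually reported simply as 1: the difference from 1 is far below the resolution of standard floating-point numbers, so the combined matching probability is the more informative figure.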
A neighbor-joining (NJ) tree (Fig. 2a) based on Nei's genetic distance, constructed using the POPTREE2 software, was used to investigate the genetic affinity between the studied (admixed and Teli) populations and the reported Indian populations, namely the Konkanastha Brahmins (Maharashtra) [22]; the Mahadev Kolis (Maharashtra) [22]; the Iyengars (Tamil Nadu) [22]; the Kurumans (Tamil Nadu) [22]; the Yerukulas (Andhra Pradesh) [23]; the Koras (West Bengal) [24]; the Baniyas (Punjab) [25]; the population of Jharkhand [26]; the population of Uttar Pradesh [27]; the population of Rajasthan [28]; the populations of Central India [29]; and the pooled populations belonging to the geographical boundaries of India [30]. The NJ tree revealed that the studied admixed and Teli populations of Maharashtra pooled into one cluster with the Konkanastha Brahmins of Maharashtra and the Kurumans of Tamil Nadu. The populations of Rajasthan, Uttar Pradesh, Madhya Pradesh and Jharkhand also pooled with the studied populations, which might be the result of ancestral relatedness [3,4]. The Koras of West Bengal, the Baniyas of Punjab and the pooled populations of the Indian geographical region were outliers in the NJ tree, which could be attributed to isolation by distance [31]. The Mahadev Koli population of Maharashtra, despite being geographically close to the studied populations, was genetically distinct, which might be the result of a small effective sample size, the founder effect and drift [32]. The maximum likelihood (ML) phylogram (Fig. 2b) was consistent with the NJ tree with respect to the scattering pattern of the studied and compared populations. In the NJ tree, three of the eleven nodes had bootstrap values above 50 percent and three nodes had bootstrap values of more than 25 percent.
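A distance-based tree of this kind starts from a matrix of pairwise genetic distances computed from allele frequencies. The text does not state which of Nei's distances was used, so the sketch below assumes Nei's D_A distance, D_A = 1 − (1/L) Σ_loci Σ_alleles √(x·y), purely for illustration; the function name, the nested-dictionary input layout and the example frequencies are likewise assumptions.

```python
import math


def nei_da_distance(pop_x, pop_y):
    """Nei's D_A distance between two populations (assumed distance measure).

    pop_x, pop_y: dict mapping locus name -> dict of allele -> frequency.
    Only loci present in both populations are compared.
    """
    shared_loci = sorted(set(pop_x) & set(pop_y))
    if not shared_loci:
        raise ValueError("the populations share no loci")

    total = 0.0
    for locus in shared_loci:
        fx, fy = pop_x[locus], pop_y[locus]
        # An allele absent from one population contributes sqrt(0) = 0.
        alleles = set(fx) | set(fy)
        total += sum(math.sqrt(fx.get(a, 0.0) * fy.get(a, 0.0)) for a in alleles)
    return 1.0 - total / len(shared_loci)


# Made-up single-locus frequencies, for illustration only.
pop_a = {"TPOX": {8: 0.5, 11: 0.5}}
pop_b = {"TPOX": {8: 0.2, 11: 0.8}}
distance_ab = nei_da_distance(pop_a, pop_b)
```

Applying such a function to every pair of populations yields the symmetric distance matrix that a neighbor-joining algorithm takes as input; bootstrap support is then typically estimated by resampling loci and rebuilding the tree.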
In the ML phylogram, two of the eleven nodes had bootstrap values of over 50 percent and four nodes had bootstrap values higher than 25 percent. The NJ tree and the ML phylogram thus showed similar patterns of bootstrap values, suggesting a low level of confidence in the inferred topologies. To validate the genetic relatedness observed in the NJ tree and the ML phylogram despite the low bootstrap values, principal component analysis (PCA) and locus-wise Fst distance calculations between the studied and compared populations were undertaken. In the PCA plot (Fig. 2c), both studied populations clustered in patterns similar to those observed in the NJ tree and the ML phylogram. In the pairwise Fst analysis across 15 loci, the admixed population of Maharashtra showed significant variations at ten loci with the Yerukulas (Andhra Pradesh), at nine loci with the Koras (West Bengal), at seven loci with the Mahadev Kolis (Maharashtra), at four loci with the Konkanastha Brahmins (Maharashtra) and the Baniyas (Punjab), at three loci with the pooled Indian populations and the population of Rajasthan, at two loci with the Kurumans (Tamil Nadu), the Central Indian population and the population of Jharkhand, and at one locus with the Iyengars (Tamil Nadu) and the population of Uttar Pradesh. Similarly, the pairwise Fst analysis for the Teli population group showed significant variations at ten loci with the Koras (West Bengal), at six loci with the Mahadev Kolis (Maharashtra), at four loci with the Konkanastha Brahmins (Maharashtra) and the Yerukulas (Andhra Pradesh), at two loci with the Baniyas (Punjab), the population of Jharkhand, the population of Uttar Pradesh and the pooled Indian populations, and at one locus with the Iyengars (Tamil Nadu), the Kurumans (Tamil Nadu), the population of Rajasthan and the Central Indian population.
Between the two studied populations, no significant variation was observed at any of the 15 compared loci (Additional file 2: Table S2); correspondingly, the Teli population group was similar to the admixed population group at all 15 compared loci (Additional file 3: Table S3). Interestingly, both studied populations showed a similar pattern of Fst distances with the compared populations. The mean Fst values of the studied and compared populations, irrespective of their geographical locations, are shown in Fig. 1a-c. Overall, the results of the Principal Component Analysis and the Fst distance study were consistent with each other and support the genetic relatedness observed in the neighbor-joining tree and the maximum likelihood phylogram.
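To make the locus-wise Fst comparison concrete, the following sketch computes a simple heterozygosity-based Fst, (H_T − H_S)/H_T, for two populations at one locus. It is a simplified estimator chosen for illustration; it is not the AMOVA-based estimator or the permutation test that population-genetics packages such as Arlequin provide, and the function names and example frequencies are hypothetical.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for one locus."""
    return 1.0 - sum(p ** 2 for p in freqs.values())


def pairwise_fst(freqs_a, freqs_b):
    """Simplified two-population Fst at one locus, with equal population weights.

    freqs_a, freqs_b: dict mapping allele -> frequency in each population.
    """
    alleles = set(freqs_a) | set(freqs_b)
    # Allele frequencies of the pooled (total) population.
    pooled = {a: (freqs_a.get(a, 0.0) + freqs_b.get(a, 0.0)) / 2.0 for a in alleles}
    h_t = expected_heterozygosity(pooled)
    if h_t == 0.0:
        return 0.0  # both populations are fixed for the same allele
    # Mean within-population expected heterozygosity.
    h_s = (expected_heterozygosity(freqs_a) + expected_heterozygosity(freqs_b)) / 2.0
    return (h_t - h_s) / h_t


# Made-up single-locus frequencies, for illustration only.
fst_example = pairwise_fst({8: 0.5, 11: 0.5}, {8: 0.2, 11: 0.8})
```

Values near 0 indicate that the two populations have nearly identical allele frequencies at the locus, while values approaching 1 indicate that they share almost no alleles; whether a given value is statistically significant requires a separate test, such as permuting individuals between populations.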
Since the obtained genetic data showed a high degree of polymorphism and forensic efficacy, they might be useful for forensic DNA applications and for genetic and genealogical studies, and may enrich the national autosomal STR database.

Limitations
The small sample size was the main limitation of this study. Nevertheless, the analyzed samples adequately describe the polymorphic nature of the studied genetic markers and the genetic affinity of the studied populations with the previously reported populations. We further propose the use of a larger sample size and Next Generation Sequencing (NGS) studies.