Analysis of Population Structure and Genetic Diversity in Rice Germplasm Using SSR Markers: An Initiative Towards Association Mapping of Agronomic Traits in Oryza sativa

Genetic diversity is the main source of variability in any crop improvement program. It serves as a reservoir for identifying superior alleles controlling key agronomic and quality traits through allele mining or association mapping. Association mapping based on linkage disequilibrium (LD), the non-random association between causative loci and phenotype in natural populations, is highly useful for dissecting the genetic basis of complex traits. For any successful association mapping program, understanding the population structure and assessing kinship is essential before correlating superior alleles with traits. The present study aimed to evaluate the genetic variation and population structure in a collection of 192 rice germplasm lines, including local landraces, improved varieties and exotic lines of diverse origin. The lines were genotyped using 61 genome-wide SSR markers to assess molecular genetic diversity and genetic relatedness. Genotyping produced a total of 205 alleles, with PIC values of up to 0.756. Population structure analysis using model-based and distance-based approaches grouped the germplasm lines into two distinct subgroups. AMOVA showed that 14 % of the variation was due to differences between groups, with the remaining 86 % attributable to differences within groups. Based on these analyses of population structure and genetic relatedness, a core collection of 150 rice germplasm lines was assembled as an association mapping panel for establishing marker-trait associations.


Background
Rice, the staple food crop for more than 50 % of the world's population, is cultivated on 163 million hectares with a production of 491 million tonnes. About 90 % of the world's rice is produced in Asia, and India contributes 20 % of the world's production. This record level of production and productivity is due to the availability and exploitation of the rich genetic diversity existing in the rice germplasm of India. Precise genetic manipulation of complex quantitative traits such as yield, tolerance to biotic and abiotic stresses, and quality requires a thorough understanding of the genetic and molecular basis of the target traits.
The genetic basis of important agronomic traits has been unraveled through Quantitative Trait Loci (QTL) mapping, either through linkage mapping (bi-parental mapping populations) or through LD mapping (natural populations). Although traditional linkage-based QTL mapping has become an important tool in gene tagging of crops, it has a few limitations: 1) classical linkage mapping involves very high cost; 2) it has low resolution, as it can resolve only a few alleles; and 3) it is of limited use for fine mapping of QTLs, as it needs BC-NILs. These limitations can be overcome by the LD-based approach of "Association Mapping" using natural populations. Association mapping serves as a tool to mine elite genes by structuring the natural variation present in a germplasm collection. It has been successfully exploited in various crops such as rice, maize, barley, durum wheat, spring wheat, sorghum, sugarcane, sugarbeet, soybean, grape, forest tree species and forage grasses (Abdurakhmonov and Abdukarimov 2008).
Before performing an association analysis in a population, it is essential to determine the population structure. Accounting for it reduces type I and II errors in association mapping, because unequal allele frequency distributions between subgroups cause spurious associations between molecular markers and the trait of interest (Pritchard et al. 2000). Similar attempts were recently undertaken to define population structure in rice using different germplasm lines and by developing core collections from national and international collections (Ebana et al. 2008; Jin et al. 2010; Zhang et al. 2011; Agrama et al. 2010; Liakat Ali et al. 2011). Simple sequence repeat (SSR) markers have been commonly used in genetic diversity studies in rice because of their high level of polymorphism, which helps to establish the relationship among individuals even with a small number of markers (McCouch et al. 1997). For similar studies, SSR markers were used alone by Jin et al. (2010), Hesham et al. (2008), Sow et al. (2014), Das et al. (2013) and Choudhury et al. (2013), or along with SNP markers by Courtois et al. (2012). The objectives of the present study were to evaluate the genetic variation and to examine the population structure of 192 rice germplasm accessions comprising local landraces, improved varieties and exotic lines of diverse origin.

Genetic Diversity
All 192 rice germplasm lines were genotyped using 61 SSR (microsatellite) markers, which produced a total of 205 alleles (Additional file 1: Figure S1). Among these 205 alleles, 5 % were considered rare (allele frequency < 5 %). The number of alleles per locus varied from 2 to 7, with an average of 3 alleles per locus. The highest number of alleles was detected at locus RM316 (7) and the lowest (2) at a group of markers, viz., RM171, RM284, RM455, RM514, RM277, RM5795, HvSSR0247, RM559, RM416 and RM1227. The PIC value represents the relative informativeness of each marker; in the present study, the average PIC value was 0.468. The highest genetic diversity is explained by the landraces included in this study, with a mean PIC value of 0.416. PIC values ranged from 0.146 for RM17616 to 0.756 for RM316. Heterozygosity was very low, which may be due to the autogamous nature of rice. Expected heterozygosity or gene diversity (H_e), computed according to Nei (1973), varied from 0.16 (RM17616) to 0.75 (RM287) with an average of 0.52 (Table 1).
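The two marker statistics reported above can be illustrated with a minimal sketch. It assumes the widely used definitions H_e = 1 - Σ p_i² (Nei 1973) and PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j²; the source does not state which software or exact PIC formula was used, and the function names and allele sizes below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(calls):
    """Allele frequencies at one locus, one call per inbred line (None = missing)."""
    scored = [c for c in calls if c is not None]
    counts = Counter(scored)
    return [n / len(scored) for n in counts.values()]


def gene_diversity(freqs):
    """Expected heterozygosity: H_e = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphic information content (assumed formula, see lead-in)."""
    homozygous = sum(p * p for p in freqs)
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygous - cross


# Hypothetical allele sizes (bp) for one SSR locus scored in five lines.
example_calls = [150, 150, 150, 160, None]
example_freqs = allele_frequencies(example_calls)
example_he = gene_diversity(example_freqs)
example_pic = pic(example_freqs)
```

For a locus with two alleles, PIC is always lower than H_e, which is consistent with the average PIC (0.468) being lower than the average gene diversity (0.52) in this panel.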

STRUCTURE Analysis
The population structure of the 192 germplasm lines was analysed by a Bayesian approach. The membership fractions of the 192 accessions were estimated for values of K ranging between 2 and 5 (Fig. 1). The log likelihood reported by STRUCTURE was optimal at K = 2. Similarly, the ad hoc measure ΔK reached its maximum at K = 2 (Fig. 2), which indicated that the entire population can be grouped into two subgroups (SG1 and SG2). Based on the membership fractions, accessions with a probability of ≥ 80 % were assigned to the corresponding subgroup, with the others categorized as admixture (Fig. 3).
SG1 consisted of 134 accessions, comprising most of the landraces and varieties of Indian origin, and SG2 consisted of 38 accessions, composed of non-Indian accessions. Twenty accessions were classified as admixture. Subgroup SG1 was dominated by the indica subtype, whereas subgroup SG2 consisted mostly of the japonica group. When the number of subgroups was increased from two to five, the accessions in both subgroups were classified into sub-sub groups (Table 2). As SG1 consisted of 134 accessions mostly of Indian origin, an independent STRUCTURE analysis was performed for this subgroup. ΔK showed its maximum value at K = 3, which indicated that SG1 could be further classified into three sub-sub groups (Fig. 4). Differences in origin and seasonal differentiation of the rice varieties contributed to this clustering.
Clustering analysis based on the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) using DARwin separated the accessions into two main groups, in agreement with the STRUCTURE analysis. Group I in the UPGMA tree consists of both indigenous and agronomically improved varieties, whereas the other group consists of exotic accessions. Within each of the two groups, the accessions clustered into smaller subgroups based on their origin and type. Most of the landraces and varieties clustered in the upper branches of the tree, whereas the exotic accessions clustered in the lower branches (Fig. 5). Hence the two classification methods showed a high level of similarity in clustering the genotypes. PCoA was used to characterize the subgroups of the germplasm set. A two-dimensional scatter plot involving all 192 accessions showed that the first two PCoA axes accounted for 12.6 and 4.9 % of the genetic variation among populations (Fig. 6).
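The ΔK criterion mentioned above can be sketched as follows. The sketch assumes the common form ΔK = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), computed from the mean and standard deviation of the log likelihoods over replicate runs at each K. The log-likelihood values are invented for illustration and are not the values from this study. Note that ΔK at K = 2 can only be computed if runs at K = 1 are also available.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: deltaK} for every K that has runs at K-1 and K+1.

    lnp_by_k maps K to a list of log likelihoods from replicate runs.
    """
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in lnp_by_k or k + 1 not in lnp_by_k:
            continue  # second-order difference is undefined at the ends
        spread = stdev(lnp_by_k[k])
        if spread == 0:
            continue  # deltaK is undefined when replicate runs are identical
        second_diff = abs(
            mean(lnp_by_k[k + 1]) - 2 * mean(lnp_by_k[k]) + mean(lnp_by_k[k - 1])
        )
        result[k] = second_diff / spread
    return result


# Hypothetical log likelihoods, three replicate runs per K.
example_runs = {
    1: [-10000.0, -10002.0, -9998.0],
    2: [-9000.0, -9001.0, -8999.0],
    3: [-8900.0, -8905.0, -8895.0],
    4: [-8850.0, -8860.0, -8840.0],
}
example_dk = delta_k(example_runs)
best_k = max(example_dk, key=example_dk.get)
```

In this invented example the sharp gain in likelihood between K = 1 and K = 2, followed by a plateau, gives the largest ΔK at K = 2.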
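The assignment rule described above (membership probability of at least 80 % for a subgroup, otherwise admixture) can be written as a short sketch. The accession names and membership fractions are hypothetical; only the 0.80 threshold and the labels SG1, SG2 and admixture come from the text.

```python
from collections import Counter


def assign_subgroup(fractions, labels=("SG1", "SG2"), threshold=0.80):
    """Assign an accession to the subgroup with the largest membership
    fraction if that fraction reaches the threshold, else to 'Admixture'."""
    best = max(range(len(fractions)), key=lambda i: fractions[i])
    return labels[best] if fractions[best] >= threshold else "Admixture"


# Hypothetical membership fractions (Q values) at K = 2.
example_q = {
    "acc_A": (0.95, 0.05),
    "acc_B": (0.10, 0.90),
    "acc_C": (0.60, 0.40),
    "acc_D": (0.80, 0.20),
}
example_assignment = {name: assign_subgroup(q) for name, q in example_q.items()}
example_counts = Counter(example_assignment.values())
```

Applied to the full panel, this rule gave 134 accessions in SG1, 38 in SG2 and 20 admixed accessions, which together account for all 192 lines.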
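For readers unfamiliar with UPGMA, the following is a minimal standard-library sketch of the algorithm, not the DARwin implementation used in the study. At each step the two closest clusters are merged, and the distance from the merged cluster to every other cluster is the size-weighted average of the two former distances. The accession names and distances are hypothetical.

```python
def upgma(names, dist):
    """Build a UPGMA tree as nested tuples.

    dist maps frozenset({name_a, name_b}) to a pairwise dissimilarity.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {
        frozenset((i, j)): dist[frozenset((names[i], names[j]))]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        i, j = sorted(min(d, key=d.get))  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        merged = {}
        for k in clusters:
            d_ik = d[frozenset((i, k))]
            d_jk = d[frozenset((j, k))]
            # Arithmetic mean weighted by the number of leaves in each cluster.
            merged[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical dissimilarities: A and B are close, C and D are close.
example_names = ["A", "B", "C", "D"]
example_dist = {
    frozenset(("A", "B")): 0.10,
    frozenset(("C", "D")): 0.20,
    frozenset(("A", "C")): 0.80,
    frozenset(("A", "D")): 0.90,
    frozenset(("B", "C")): 0.85,
    frozenset(("B", "D")): 0.95,
}
example_tree = upgma(example_names, example_dist)
```

The resulting tree separates the pair (A, B) from the pair (C, D) at the root, analogous to the two main groups recovered in the panel.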

Genetic Variance Analysis
The hierarchical distribution of molecular variance by AMOVA and pairwise analysis revealed highly significant genetic differentiation among the groups. It revealed that 14 % of the total variation was between the groups, while 86 % was among individuals within groups (Tables 3 and 4). Calculation of Wright's F statistics at all SSR loci revealed that F_IS was 0.50 and F_IT was 0.56. F_ST for the polymorphic loci across all accessions was 0.14, which implies high genetic variation (Table 4). The pairwise F_ST estimate among subgroups indicated that the two groups are significantly different from each other (Table 3).
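The three F statistics are linked by Wright's identity (1 - F_IT) = (1 - F_IS)(1 - F_ST), which gives a quick consistency check on the reported values. The sketch below only applies this identity to the rounded figures quoted above; the function name is hypothetical and the calculation is not part of the original analysis.

```python
def f_it_from(f_is, f_st):
    """Wright's identity: (1 - F_IT) = (1 - F_IS) * (1 - F_ST)."""
    return 1.0 - (1.0 - f_is) * (1.0 - f_st)


reported_f_is = 0.50
reported_f_st = 0.14
reported_f_it = 0.56

implied_f_it = f_it_from(reported_f_is, reported_f_st)
difference = abs(implied_f_it - reported_f_it)
```

The implied F_IT is 0.57, within rounding of the reported 0.56. The F_ST of 0.14 also matches the 14 % of variation that AMOVA attributes to differences between groups.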

Discussion
Genetic diversity is the key determinant of germplasm utilization in crop improvement. A population with a high level of genetic variation is a valuable resource for broadening the genetic base in any breeding program. (Nachimuthu et al. 2014). This panel is important because its major component is traditional landraces with valuable agronomic traits that are cultivated in small pockets of Tamil Nadu, India. Molecular markers help us to understand the level of genetic diversity that exists among traditional landraces, varieties and exotic accessions, which can be exploited in rice breeding programs. The genetic architecture of diverse germplasm lines can be precisely estimated by assessing the structure of the population using molecular markers such as SSRs or SNPs (Horst and Wenzel 2007; Powell et al. 1996; Varshney et al. 2007). In this study, the genetic diversity among the accessions was evaluated by model-based and distance-based clustering approaches using the SSR genotypic data.
Regarding the genetic divergence of the population consisting of local landraces, exotic cultivars and breeding lines, 61 polymorphic markers detected a total of 205 alleles. (Garris et al. 2005; Ram et al. 2007). In the current study, the average number of alleles (3 alleles/locus) is slightly lower than the 3.88 alleles/locus reported by Zhang et al. (2011) in a rice core collection of 150 rice varieties from south Asia and Brazil, and the 3.9 alleles/locus reported by Jin et al. (2010) in 416 rice accessions collected from China. Using three sets of germplasm lines (Thai (47), IRRI germplasm (53) and other Oryza species (5)), Chakhonkaen et al. (2012) reported 127 alleles for all loci, with a mean of 6.68 alleles per locus and a mean polymorphic information content (PIC) of 0.440, by screening with 19 InDel markers. Chen et al. (2011) reported an average gene diversity of 0.358 and a PIC of 0.285 from 300 rice accessions from different rice-growing areas of the world with 372 SNP markers. The gene diversity detected in this study (0.52) is comparable to the overall gene diversity of the rice core collection (0.544) from China, North Korea, Japan, Philippines, Brazil, Celebes, Java, Oceania and Vietnam (Zhang et al. 2011). It is higher than that of the US accession panel, with an average gene diversity of 0.43 (Agrama and Eizenga 2008), and the Chinese rice accession panel of Jin et al. (2010), with an average gene diversity of 0.47. The gene diversity in our study is lower than the gene diversity (0.68) reported by Liakat Ali et al. (2011). Most diversity panels with global accessions have a gene diversity of 0.5 to 0.7 (Garris et al. 2005; Liakat Ali et al. 2011; Ni et al. 2002). These results on global accessions suggest that this diversity panel of 192 germplasm lines represents a large proportion of the genetic diversity that exists across the major rice-growing regions of the Asian continent.
The PIC value averaged 0.468 and varied from 0.146 for RM17616, with only two alleles, to 0.756 for RM316 and ... Zhang et al. (2011). In this study, a significant number of rare alleles was identified, which indicates that these rare alleles contribute well to the overall genetic diversity of the population. The model-based approach implemented in STRUCTURE is frequently used for studying population structure (Agrama et al. 2007; Agrama and Eizenga 2008; Garris et al. 2005; Zhang et al. 2007, 2011; Jin et al. 2010; Liakat Ali et al. 2011; Chakhonkaen et al. 2012; Courtois et al. 2012; Das et al. 2013). Courtois et al. (2012) detected two subgroups in their study population and assigned rice varieties to two groups with a few admixed lines. Jin et al. (2010) identified seven subpopulations among 416 rice accessions from China. Das et al. (2013) grouped a collection of 91 accessions of rice landraces from eastern and north-eastern India into four groups.
Figure caption: Graph of estimated membership fractions for K = 2. The maximum of the ad hoc measure ΔK, determined by STRUCTURE HARVESTER, was found at K = 2, which indicated that the entire population can be grouped into two subgroups (SG1 and SG2).
The ancestry threshold used to assign genotypes to subgroups varies between research groups. Population structure analyses of different rice diversity panels have indicated the existence of two to eight subpopulations in rice (Zhang et al. 2007; Zhang et al. 2011; Garris et al. 2005; Agrama et al. 2007; Liakat Ali et al. 2011; Chakhonkaen et al. 2012; Das et al. 2013). In the current rice diversity panel of 192 accessions, based on the criterion of maximum membership probability, 134 accessions were assigned to SG1, which is dominated by the indica subtype and contains most of the landraces and varieties of Indian origin, and 38 accessions were assigned to SG2, which is composed mostly of japonica accessions of exotic origin. A similar population structure of two subgroups was observed by Zhang et al. (2009) in a collection of 3024 rice landraces in China. Zhang et al. (2011) reported two distinct subgroups in a rice core collection. Courtois et al. (2012) classified two subgroups as japonica and non-japonica accessions in a European core collection of rice. The results indicate that the two subgroups reflect the different adaptation of accessions to different ecological environments, as indica and japonica accessions have evolved independently and the Indian rice accessions originated from indica cultivars. Hence the major criterion for population structure in this panel is the indica-japonica subtype. This study includes a large number of traditional landraces and varieties from the Indian subcontinent and a few exotic accessions randomly selected from the IRRI worldwide collection. It clarifies the relationship between Indian germplasm and exotic accessions, indicates that germplasm lines vary according to their ecology, and shows that a high level of genetic diversity exists within this population.
Further STRUCTURE analysis of SG1, which consisted of 134 lines, indicated that it can be subdivided into three sub-sub groups. Ecosystem and seasonal variation are the major factors underlying this classification. As this subgroup contains indica accessions, this result is in accordance with the inference of various researchers that the indica group has higher genetic diversity than japonica accessions (Gao et al. 2005; Lu et al. 2005; Lapitan et al. 2007; Caicedo et al. 2007; Liakat Ali et al. 2011; Garris et al. 2005; Qi et al. 2006; Qi et al. 2009). Liakat Ali et al. (2011) substantiated this statement on the grounds that the indica subpopulation occupies the largest rice-growing region, which has varied environments, ecological conditions and soil types. The result of the model-based analysis is in accordance with the clustering pattern of the Neighbour joining tree and Principal Coordinate Analysis. The first two principal coordinates explained 12.6 and 4.8 % of the molecular variance. A similar pattern of molecular variance was observed by Zhang et al. (2011) for two population subgroups.
Calculation of Wright's F statistics at all loci revealed deviation from Hardy-Weinberg expectations for molecular variation within the population. The F_ST result indicates high divergence between the subgroups of the population. The high F_IT, which is measured at the subgroup level in the whole population, indicates a lack of equilibrium across the groups and a lack of heterozygosity, most likely due to the inbreeding nature of rice.
The present study revealed several unexploited landraces of Tamil Nadu, India, which are widely cultivated by farmers in different parts of the state. Ecological and evolutionary history contributes to the genetic diversity maintained in a population. Varieties from diverse ecosystems and wide eco-geographical conditions contribute to the genetic diversity among rice varieties in this population.
For establishing a core collection for association studies, the two-step approach followed by Breseghello and Sorrells (2006) and Courtois et al. (2012) was used. This approach involves determining the population structure and then sampling based on the relatedness of the accessions in the population. Accessions that show a high degree of genetic relatedness can be eliminated to develop a core collection of diverse representatives. On this basis, 150 of the 192 accessions (Table 5) were selected to form the association mapping panel, which can be used for either genome-wide or candidate gene-specific association mapping to link genotypic and phenotypic variation.
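The second step, eliminating highly related accessions, can be illustrated with a simple greedy filter. This is only a sketch of the idea: the source does not state the relatedness measure or the cut-off actually used, so the distance threshold, the accession names and the distances below are all hypothetical, and the outcome of a greedy pass depends on the order in which accessions are visited.

```python
def thin_by_relatedness(names, dist, min_dist):
    """Keep an accession only if it is at least min_dist away from
    every accession already kept (greedy, order-dependent)."""
    kept = []
    for name in names:
        if all(dist[frozenset((name, other))] >= min_dist for other in kept):
            kept.append(name)
    return kept


# Hypothetical pairwise dissimilarities: acc_B is nearly identical to acc_A.
example_accessions = ["acc_A", "acc_B", "acc_C", "acc_D"]
example_pairwise = {
    frozenset(("acc_A", "acc_B")): 0.05,
    frozenset(("acc_A", "acc_C")): 0.60,
    frozenset(("acc_A", "acc_D")): 0.70,
    frozenset(("acc_B", "acc_C")): 0.62,
    frozenset(("acc_B", "acc_D")): 0.71,
    frozenset(("acc_C", "acc_D")): 0.40,
}
example_core = thin_by_relatedness(example_accessions, example_pairwise, min_dist=0.10)
```

In practice the filter would be applied within each subgroup identified in the first step, so that both subgroups remain represented in the core collection.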

Conclusion
This study analyzed the pattern of divergence that exists in a population of 192 rice accessions constituting our rice diversity panel for association mapping. Based on various statistical methods, we identified two subgroups within the 192 rice accessions selected for establishing the association mapping panel. The average number of alleles per locus and the gene diversity indicated the existence of a broad genetic base in this collection. The result of the structure analysis is in accordance with the clustering by neighbor joining tree and principal coordinate analysis. Thus, the results of this study, which indicate the genetic diversity of the accessions, can be used to guide approaches such as association analysis, classical mapping population development, parental line selection in breeding programs and hybrid development, exploiting the natural genetic variation that exists in this population.

Plant Material
A collection of 192 rice accessions was used in this study. It consists of landraces and varieties collected from nine different states of India as well as from Argentina, Bangladesh, Brazil, Bulgaria, China, Colombia, Indonesia, Philippines, Taiwan, Uruguay, Venezuela and the United States (Table 6).

Microsatellite Genotyping
DNA Isolation and PCR Amplification
DNA was extracted from leaf tissue by grinding with liquid nitrogen using the CTAB method (Saghai-Maroof et al. 1984). It was diluted to a final concentration of 30 ng/μl for polymerase chain reactions. DNA amplification parameters such as specificity, efficiency and fidelity are strongly influenced by the components of the PCR reaction and by thermal cycling conditions (Caetano-Anolles and Brant 1991). Careful optimization of reaction components and conditions therefore results in more reproducible and efficient amplification. The concentrations of primers, template DNA and Master Mix, and the annealing temperature, were optimized on eight diverse accessions for 156 SSR markers distributed on the 12 chromosomes by a modified Taguchi method (Cobb and Clarkson 1994). Microsatellite primer sequences, annealing temperatures and chromosomal locations were obtained from the GRAMENE database (http://archive.gramene.org/markers/microsat/). Sixty-one SSR primer pairs that produced polymorphic allele amplification were chosen to genotype the entire germplasm collection.
The PCR reaction volume was 10 μl. The reaction mixture contained 0.4 mM dNTPs, 4 mM MgCl2, 150 mM Tris-HCl, 10 pmol of forward and reverse primer, 0.05 U Taq polymerase and 30 ng of DNA. The polymerase chain reaction was performed in a Bio-Rad thermal cycler using the following program: 94 °C for 2 min; 35 cycles of 94 °C for 45 s, 50-60 °C for 1 min and 72 °C for 2 min; and a final extension at 72 °C for 10 min.

Polyacrylamide Gel Electrophoresis
Amplified products were size-separated by native polyacrylamide gel electrophoresis on a 6 % (w/v) polyacrylamide gel according to Sambrook et al. (2001), in a vertical electrophoresis tank with 1X TBE at 150 V. The size of the amplified fragments was determined using standard molecular weight size markers after the bands were detected by silver staining.

Allele Scoring
For most of the markers, the bands were visualized in clusters of two to six in the stained gels. Based on the expected product size given on the GRAMENE website (Additional file 2: Table S1), the size of the most intensely amplified bands around the expected product size for each microsatellite marker was identified using standard molecular weight size markers (20 bp DNA ladder, GeNeI Company). The stained gel was then dried and documented using a light box. Alleles were scored based on the presence of an allele of a particular size in each germplasm line. Presence of an allele was denoted as 1 and absence as 0, and the scores were rechecked manually (Additional file 3: Table S2).

Data Analysis
A 1/0 matrix was constructed based on the presence and absence of alleles for the set of 61 markers. This SSR genotype data was analyzed for genetic diversity and population structure.
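The construction of the 1/0 matrix can be sketched as follows, together with one possible pairwise dissimilarity computed from it. The Jaccard coefficient is chosen here purely for illustration, since the source does not state which dissimilarity index was used. The accession names and allele sizes are hypothetical, and a missing call is simply treated as absence, which is a simplification.

```python
def build_binary_matrix(calls):
    """Turn {accession: {marker: allele_size}} into a presence/absence matrix.

    Returns (columns, matrix), where each column is a (marker, allele) pair.
    """
    columns = sorted(
        {(m, a) for genotype in calls.values() for m, a in genotype.items() if a is not None}
    )
    matrix = {
        acc: [1 if genotype.get(m) == a else 0 for (m, a) in columns]
        for acc, genotype in calls.items()
    }
    return columns, matrix


def jaccard_dissimilarity(x, y):
    """1 - (alleles shared by both) / (alleles present in either)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


# Hypothetical allele sizes (bp) at two of the markers used in the study.
example_scores = {
    "acc_A": {"RM316": 190, "RM171": 320},
    "acc_B": {"RM316": 200, "RM171": 320},
    "acc_C": {"RM316": 190, "RM171": 330},
}
example_columns, example_matrix = build_binary_matrix(example_scores)
example_d_ab = jaccard_dissimilarity(example_matrix["acc_A"], example_matrix["acc_B"])
example_d_bc = jaccard_dissimilarity(example_matrix["acc_B"], example_matrix["acc_C"])
```

Each marker contributes one column per observed allele, so 61 markers with 205 alleles in total would give a matrix of 192 rows by 205 columns.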