Genome‐wide association and polygenic risk score estimation of type 2 diabetes mellitus in Kinh Vietnamese—A pilot study

Abstract A genome‐wide association study (GWAS) is a powerful tool in investigating genetic contribution, which is a crucial factor in the development of complex multifactorial diseases, such as type 2 diabetes mellitus. Type 2 diabetes mellitus is a major healthcare burden in the Western Pacific region; however, there is limited availability of genetic‐associated data for type 2 diabetes in Southeast Asia, especially among the Kinh Vietnamese population. This lack of information exacerbates global healthcare disparities. In this study, 997 Kinh Vietnamese individuals (503 with type 2 diabetes and 494 controls) were prospectively recruited and their clinical and paraclinical information was recorded. DNA samples were collected and whole genome genotyping was performed. Standard quality control and genetic imputation using the 1000 Genomes database were executed. A polygenic risk score for type 2 diabetes was generated in different models using East Asian, European, and mix ancestry GWAS summary statistics as training datasets. After quality control and genetic imputation, 107 polymorphisms reached suggestive statistical significance for GWAS (≤5 × 10−6) and rs11079784 was one of the potential markers strongly associated with type 2 diabetes in the studied population. The best polygenic risk score model predicting type 2 diabetes mellitus had AUC = 0.70 (95% confidence interval = 0.62–0.77) based on a mix of ancestral GWAS summary statistics. These data show promising results for genetic association with a polygenic risk score estimation in the Kinh Vietnamese population; the results also highlight the essential role of population diversity in a GWAS of type 2 diabetes mellitus.


| INTRODUC TI ON
Type 2 diabetes mellitus (T2DM) is a significant healthcare burden worldwide. 14][5] It has been shown that family history significantly increases the risk of developing T2DM, suggesting heritability of the disease. 6,7derstanding the genetic contribution to T2DM is essential, as it provides useful information for T2DM prediction and prevention.A powerful tool that can be used to obtain genetic insights into complex diseases such as T2DM is a genome-wide association study (GWAS).To date, most GWAS results for T2DM relate to Caucasian and East Asian populations, although the large populations in Southeast Asia also have significant levels of T2DM and have been found to be genetically susceptible to T2DM. 8,9The lack of T2DMrelated GWAS data not only hinders the application of these results in Asia but also exacerbates healthcare disparities.Therefore, the need for GWAS data on T2DM in Southeast Asian populations is strongly required.
Located in Southeast Asia, Vietnam is one of the most populated countries in the Western Pacific region, with more than 100 million inhabitants.Kinh Vietnamese account for more than 80% of the total population.Before 2022, only a small, single GWAS study had been performed on Kinh Vietnamese to investigate the genetic basis for schizophrenia; there is no available GWAS data for T2DM or cardiovascular diseases. 10GWAS results are much more reliable with larger sample sizes, and the first GWAS conducted on a Vietnamese population used more than 300 subjects. 10Our aim was to perform the first GWAS on T2DM in Kinh Vietnamese.Briefly, we recruited 503 T2DM cases and 494 controls, the participants' DNA samples were used to perform genome-wide genotyping.The output data underwent quality control, genotype imputation, and genetic association analysis.Polygenic risk score (PRS) for T2DM was also calculated.These early findings should provide further information on potential novel genetic associations and PRS estimations for T2DM, but the wider aim is to provide genetic data on Kinh Vietnamese that can be used in future GWAS meta-analyses.

| Subject recruitment
The study protocol was approved by the Ethical Committee of the University of Medicine and Pharmacy in Ho Chi Minh City (Approval number: 350/HDDD-DHYD).Recruitment took place from May 2020 to September 2021 in the University Medical Center of the University of Medicine and Pharmacy.The participants all self-identified as Kinh Vietnamese and signed the informed written consent.Two sample sets were collected for this study: a T2DM set and a control set.Subjects with T2DM were recruited if they had either a history of T2DM or were newly diagnosed with T2DM, based on the American Diabetes Association 2020 criteria. 11T2DM cases were excluded from the study if they had type 1 diabetes or elevated plasma glucose due to endocrine diseases or drug usage.The controls were recruited randomly among visitors attending for a health check-up who had no history of diabetes and who had normal plasma glucose and HbA1c levels.
The exclusion criteria for the controls were a history of diabetes, pregnancy, or cancer.
All participants provided basic clinical information and underwent a detailed physical examination.Demographic information (age, sex), duration of T2DM, and anthropometric parameters (weight, height, waist circumference, hip circumference, and systolic and diastolic blood pressure) were documented.Laboratory data were obtained from fasting (minimum 8 h) blood samples for all participants using a Beckman Coulter AU2700 Chemistry Analyzer (Beckman Coulter, California, USA).Fasting plasma glucose, HbA1c, total serum cholesterol, high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, triglycerides, and creatinine levels were recorded.Each participant also provided 2 mL of peripheral venous blood for genetic analysis.

| Genome-wide genotyping
Genomic DNA was extracted from 997 peripheral whole blood samples using a QIAGEN Blood Mini Kit (QIAGEN, Hilden, Germany).DNA samples were quantified using a Qubit™ 3 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA) and Qubit™ dsDNA Quantification Assay Kits (Thermo Fisher Scientific).Genome-wide genotyping of the DNA samples was performed using an Infinium Global Screening Array-24 v2.0 BeadChip (Illumina, California, USA) following the manufacturer's protocol.The intensity files (.idat), which are the raw data files from the BeadChip, were read by GenomeStudio (v2.0) (Illumina, San Diego, CA).The primary filter criteria were set up with Call Rate ≥0.98, GenTrain Score >0.7, Cluster Sep Score >0.3, and Call Freq ≥0.95.Qualified samples were exported in PLINK format as .pedand .mapfiles.

| Quality control
Quality control on the samples was carried out using PLINK1.9 with default parameters.The criteria for single nucleotide polymorphisms

| Genotype imputation
Before conducting the imputation, the dataset was pre-phased using SHAPEIT2 (University of Oxford). 12Genotype imputation was performed using the imputation stepwise approach implemented in IMPUTE2 (with default parameters), 13 specific parameters included buffer size 250 kb, imputing into an East Asian dataset, and allele frequency >0.01 to reduce the cost of computing.The imputation reference data were 1000 Genomes haplotype phase 3 with 77,818,332 SNPs.

| Genetic association analysis
Analysis of single SNP associations was carried out using SNPTEST v.2.5.6 (University of Oxford). 14After imputation, SNPs with imputation quality (INFO) <0.8 were excluded prior to genetic analyses.The genome-wide association was illustrated by Manhattan and QQ plots.For association analysis, p-value ≤5 × 10 −8 was considered statistical significance, while p-value ≤5 × 10 −6 was considered suggestive statistical significance.Besides Bonferroni correction, Benjamini-Hocberg procedure was also applied to adjust p-value for multiple comparisons with a false discovery rate of 20%.

| Polygenic risk scoring
PRS analysis was performed using the PRSice-2 package. 15Briefly, the approach constructs a polygenic score by summing all traitassociated alleles in a target sample, weighted by the effect size of each allele in a base GWAS.We obtained summary statistics (base GWAS) with effect sizes of associated SNPs from the GWAS Catalogue of the European Bioinformatics Institute to calculate PRS for individuals in our target dataset. 16Genome-wide summary statistics were 'clumped' by various thresholds of p-value significance (5 × 10 −2 ; 5 × 10 −4 ; 5 × 10 −6 ; 5 × 10 −8 ) within 250 kb and -clumpr 2 = 0.2.Different ethnic datasets (base) summary statistics for the T2DM trait in the GWAS Catalogue were used: (i) East-Asian (EAS) specific meta-analysis (77,418 cases and 356,122 controls), 17 (ii) European (EUR) specific meta-analysis (26,676 cases and 132,532 controls), 18 and (iii) mix ancestral (MIX) meta-analysis (84,244 cases and 583,280 controls). 19We built models based on logistic regression to make predictions.In simulation studies, logistic regression was performed using PRS as an independent variable.Our samples were split into training and testing datasets, and used three ratios, 80:20, 50:50, and 20:80.The performance of the prediction models was assessed by the area under the curve (AUC) and confidence intervals of AUC using the pROC package within R.

| Characteristics of the recruited samples
A total of 997 qualified samples were used, with 503 cases of T2DM and 494 controls.The demographic characteristics of all the participants are described in Table 1.Notably, the age of T2DM onset in the T2DM group was significantly lower than the age at recruitment of the control group.The T2DM group also had significantly greater BMI and waist-hip ratio compared to the control group.

| Quality control
After inputting raw data for the 997 samples to GenomeStudio, six samples with Call Rate <0.98 were excluded; 991 samples with 665,539 SNPs each were exported in PLINK format containing .ped and .mapfiles.Twenty-four samples which did not meet quality control (QC) criteria were further excluded from the analysis.Finally, 957 samples with 379,428 SNPs each passed the QC, and these results were used for genotype imputation.After imputation, 957 samples with 7,587,388 SNPs each were used for genetic association analysis.

| Genetic association analysis
Before imputation, SNP tests showed that rs11079784 was the strongest SNP associated with T2DM (p-value = 8.8 × 10 −7 ) with OR = 1.58.After imputation, the results of the SNP tests showed that 107 SNPs reached the suggestive threshold of statistical significance for GWAS (p-value ≤5 × 10 −6 ).These SNPs are mainly located in chromosomes 2, 6, 10, 11, and 17.By using Benjamini-Hochberg correction, there were 37 SNPs statistically associated with T2DM and these SNPs are mainly located in chromosome 17.Post-QC Manhattan and QQ plots did not show any unexpected genomewide significant associations (Figure 1).A comprehensive list of these associated SNPs is in Table S1.

| PRS calculation
The final PRS calculation consisted of 957 individuals (489 T2DM cases and 468 controls), with 7,587,388 SNPs each.The generated PRS was significantly associated with T2DM; however, the predictive capabilities of PRS in the studied population varied with the differences in training datasets.The strongest effect sizes were observed when PRS was calculated from EAS summary statistics, which can explain 8.4% of the variance in T2DM liability (pvalue = 5 × 10 −14 ).On the other hand, PRS calculated from MIX and EU GWAS summary statistics only accounted for 7.3% and 5.7% of the variability in T2DM risk with p-value = 2.1 × 10 −12 and 5 × 10 −10 , respectively (Figure 2).The best model for T2DM prediction derived from MIX summary statistics showed AUC = 0.70 (95% confident interval = 0.62-0.77)achieved with a 'clump' p-value = 0.05 and 80:20 Training: Testing ratio (

| DISCUSS ION
Genetics play an essential role in the pathophysiology of T2DM; however, defining a genetic model for T2DM is not always feasible due to the heterogeneous nature of the disease. 20,21Several attempts have been made to identify the genetic associations between certain loci and T2DM in Kinh Vietnamese, 22,23 but more information is required for a comprehensive risk prediction model for T2DM.
In this study, the T2DM age at onset was 46, which is relatively young, and this reflected two issues.First, the genetic contribution to T2DM may be greater in this population.Second, the early onset age will create a significant healthcare burden for T2DM in the future. 24,25BMI, waist-hip ratio and systolic blood pressure were also significantly greater in the T2DM group compared to the controls.
These parameters could be applied to the logistic model to predict T2DM; however, it would be best if the parameters could be obtained before treatment, as several management paradigms of T2DM could affect patient BMI, waist-hip ratio, and systolic blood pressure.There was no BMI criterion used for recruitment in the control and T2DM group.Overweight/obesity is a distinct phenotype and a strong risk factor for T2DM development that can be explained by both genetic and environmental components.Higher BMI observed in the T2DM group, therefore, may result from different genetic characteristics.
With the sample size in this study, we could not detect any 'truly' statistically significant SNPs associated with the T2DM phenotype (p-value ≤5 × 10 −8 ).It is reasonable to use this cut-off to avoid a coincident association based on Bonferroni correction, as millions of SNP tests were performed. 26However, several authors have suggested that these criteria are too stringent and should be adjusted

Frequencies
and percentages were used to express qualitative variables.Quantitative variables were tested for normal distribution using the Kolmogorov-Smirnov test, and are presented as mean with standard deviation (SD) if normally distributed or median with interquartile range [Q1-Q3] if not normally distributed.Differences in characteristics were compared using chi-squared for qualitative variables and Student's t-test or the Mann-Whitney U-test for quantitative variables, based on the data distribution.SPSS Statistics for Windows version 20.0 (IBM Corp., Armonk, NY, USA) was used for statistical analysis.A p-value of <0.05 was considered to be statistically significant.

F I G U R E 1 F I G U R E 2
Manhattan plot after quality control.Chromosome position in ascending order is indicated on the x axis, the −log10 (p-values) of included autosomal SNPs derived from the association analysis are plotted on the y axis.The variance explained by polygenic risk score at different p-value thresholds.(A) East Asian, (B) European, (C) mix ancestral population.

Table 2 )
. The distribution of PRS among T2DM cases and controls with p-value = 5 × 10 −2 is illustrated