Use of a gene score of multiple low-modest effect size variants can predict the risk of obesity better than the individual SNPs

Obesity is a complex disorder, the development of which is modulated by a multitude of environmental, behavioral and genetic factors. The common forms of obesity are polygenic in nature, which means that many variants in the same or different genes act synergistically and affect body weight quantitatively. The aim of the current study was to use information from many common variants previously identified to affect body weight to construct a gene score and observe whether it improves the associations observed. The SNPs selected were G2548A in the leptin (LEP) gene, Gln223Arg in the leptin receptor (LEPR) gene, Ala54Thr in the fatty acid binding protein 2 (FABP2) gene, rs1121980 in the fat mass and obesity associated (FTO) gene, rs3923113 in Growth Factor Receptor Bound Protein 14 (GRB14), rs16861329 in Beta-galactoside alpha-2,6-sialyltransferase 1 (ST6GAL1), rs1802295 in Vacuolar protein sorting-associated protein 26A (VPS26A), rs7178572 in high mobility group 20A (HMG20A), rs2028299 in adaptor-related protein complex 3 (AP3S2), and rs4812829 in Hepatocyte Nuclear Factor 4 Alpha (HNF4A). A total of 475 subjects were genotyped for the selected SNPs using different genotyping techniques. The study subjects' age, weight, height, BMI, waist and hip circumference, serum total cholesterol, triglycerides, LDL and HDL were measured. A summation term, the genetic risk score (GRS), was calculated using SPSS. The results showed a significantly higher mean gene score in obese cases than in non-obese controls (9.1 ± 2.26 vs 8.35 ± 2.07, p = 2 × 10⁻⁴). Among the traits tested for association, the gene score appeared to significantly affect BMI, waist circumference, and all lipid traits. In conclusion, a gene score is a better way to estimate the overall genetic risk from common variants than the individual risk variants considered separately.


Background
Obesity is defined as an excess of body fat. It is a complex disorder, influenced by a number of genetic and environmental factors. There has been a dramatic increase in the number of overweight and obese individuals, both children and adults, globally [1]. Pakistan, with a total population of 184.35 million in 2012-13, is the 6th most populous country in the world [2]. According to the Global Burden of Disease Study, Pakistan ranked 9th out of 188 countries in terms of obesity [3].
Traditionally, before the advent of high-throughput genotyping methodologies, the contribution of genes to the risk of developing a disease was recognized through the increased risk of disease in the proband's relatives. The genetic component was then expressed as heritability estimates or variance components. However, rapid developments in high-throughput genetic technologies have led to genome-wide association studies (GWAS) [4]. GWASs analyze common variation by genotyping a large number of SNPs (~0.5-1 million) in a case-control study design. The results are then used to determine which of these SNPs reach genome-wide significance for the outcome (mostly the disease) [5]. One problem with common variants is their small effect sizes (the contribution of a SNP to the genetic variance of a trait), which account for only a small fraction of the variance in disease risk. Familial clustering of complex diseases suggests that the heritable risk factors have large effect sizes; a GWAS is unable to detect such variants because of their very low frequency. The situation is further complicated by epistatic effects resulting from the interaction of variants in different genes. The epistatic effects thus confound the search for new loci because their probability is the product of the probabilities of the low-frequency individual variants [6].
The genetics of complex disease is inherently based on statistical methods because a complex disorder such as obesity is itself probabilistic by definition. In order to draw meaningful results from a dataset, various statistical methods are needed. The commonly used statistical procedures include risk prediction algorithms (relative risk, odds ratio), family analyses (liability threshold models) and regression methods (linear/logistic regression) [7]. These methods are based on assumptions which can be very different and even incompatible. In GWASs, the inclusion of a large number of SNPs leads in theory to more accurate gene identification because it is based on the frequency of individual risk alleles. However, this theoretical advantage is reduced either by multiple testing correction (due to the inclusion of many SNPs) or by the increased degrees of freedom. The use of a weighted score test (WST) or gene score with only one degree of freedom has been suggested to handle the above-mentioned limitation [8].
Gene score is defined as the sum of all the risk alleles of the selected variants present in each study participant. However, this approach faces a problem when some of the SNPs are positively associated with the outcome of interest (i.e., increase the risk of disease), while some are protective (i.e., decrease the risk of disease). In order to overcome this limitation, SNP coding is adjusted before a gene score test in such a way that all the alleles are positively correlated with the outcome [9]. Another problem encountered in gene scoring is the use of information from all of the SNPs, although some SNPs have low and others have high effect sizes, which reduces the study power. The current ways to deal with these issues include modified forward multiple regression (MFMR), which has higher power to detect weak genetic effects and a limited number of false positives; Bonferroni correction, which counteracts the problem of multiple comparisons, particularly when many SNPs are included simultaneously, by setting the p-value cutoff for statistical significance at 0.05 divided by the number of SNPs; the false discovery rate (FDR), a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons; and randomization tests, which are significance tests whose false rejection rate is always equal to the significance level of the test [10].
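The two multiple-testing corrections named above can be illustrated with a short Python sketch. This is not part of the study's analysis: the p-values are hypothetical, the function names are our own, and the FDR procedure shown is the Benjamini-Hochberg step-up method, which is one common way of controlling the FDR and is assumed here only for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff under Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the test is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten SNPs (illustrative only, not study results).
example_p = [0.001, 0.008, 0.02, 0.04, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9]

cutoff = bonferroni_threshold(0.05, len(example_p))
bonferroni_hits = [p <= cutoff for p in example_p]
fdr_hits = benjamini_hochberg(example_p, alpha=0.05)
```

With these made-up values the Bonferroni cutoff is 0.05/10 = 0.005, so only the smallest p-value passes, whereas the step-up FDR procedure also retains the second smallest. This reflects the usual trade-off: Bonferroni is the more conservative of the two.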
There has been little research on the genetics of obesity in Pakistan, and most of it has focused on monogenic forms. We chose SNPs that are either candidate variants or GWAS hits for involvement in the energy regulation pathway. For the current investigation, ten SNPs were chosen because it is a pilot study and we were limited by resources. Care was taken that the SNPs chosen had intermediate MAFs and had previously been shown to predispose to obesity in other ethnicities. A gene score approach has not been tried for these SNPs in Pakistani subjects, and we are the first to apply this approach to our ethnic group. We therefore aimed to look for any difference that the use of a genetic risk score can make in comparison to the individual risk variants.

Study subjects
The study was an observational case-control study and included 475 subjects (250 cases and 225 controls). Study subjects were recruited from various cities of Punjab, Pakistan. The recruitment details and the inclusion and exclusion criteria have been described elsewhere in detail [11]. The inclusion criteria were the BMI and WHR cutoffs previously defined for Asian populations (for obese cases: BMI > 23 kg/m² as overweight and > 26 kg/m²; for controls: BMI < 23 kg/m²). Exclusion criteria for both cases and controls included pregnancy, presence of malignancies and recent infections. The study was approved by the institutional ethics committee (Ethical Committee, School of Biological Sciences, University of the Punjab, Pakistan). Subjects gave written informed consent, and all procedures were carried out in compliance with the Helsinki declaration.

Anthropometric measurements, blood sampling and biochemical analyses
Body weight (kg), height (m), and waist and hip circumference (cm) were measured according to standard procedures as described previously [12]. BMI (body mass index, kg/m²) and WHR (waist to hip ratio) were calculated for each study subject. Blood samples were taken after 8-12 h of fasting; half of each sample was used for DNA isolation and the remaining half was used to obtain serum. Serum was separated by centrifuging gel vacutainers at 14,000 rpm for 10 min, collected in sterilized Eppendorf tubes and screened for infectious agents (HBV, HCV, HIV). Any positive samples were discarded and safe samples were used for the lipid profile determination. Serum total cholesterol (TC), triglycerides (TG), high density lipoprotein cholesterol (HDL-C), and low density lipoprotein cholesterol (LDL-C) were measured using commercially available kits (Spectrum Diagnostics, Egypt). An Epoch microplate reader (BioTek Instruments, Highland Park) was used for all optical density measurements.

Genotyping
Genomic DNA was isolated from blood leukocytes using the Wizard® Genomic DNA purification kit (Promega, USA). DNA was quantified using a NanoDrop (ND-8000, USA) and diluted to a concentration of 5 ng/μl. The variants included common SNPs in genes involved in energy regulation (candidate genes) or implicated by GWAS (non-candidate genes) (Additional file 1: Table S1). The SNPs were genotyped by PCR-RFLP, tetra-ARMS PCR or TaqMan methods. The leptin (LEP) gene SNP G2548A, the leptin receptor (LEPR) SNP Gln223Arg and the fatty acid binding protein 2 (FABP2) SNP Ala54Thr were genotyped by PCR-RFLP; the FTO gene SNP was genotyped by tetra-ARMS PCR; and rs3923113 near growth factor receptor bound protein (GRB14), rs16861329 in sialyltransferase 6 galactosidase 1 protein (ST6GAL1), rs1802295 in vacuolar protein sorting associated protein (VPS26A), rs7178572 in high mobility group protein 20 A (HMG20A), rs2028299 in adaptor related protein complex (AP3S2) and rs4812829 in hepatocyte nuclear factor (HNF4A) were genotyped by TaqMan allelic discrimination assay. The reaction mixture composition and PCR conditions have been described previously [11][12][13][14].

Gene score (GS) calculation & statistical analysis
For the GRB14 and ST6GAL1 SNPs, the major alleles were the risk alleles, while for the remaining SNPs the minor alleles were the risk alleles. SPSS was used to construct the gene score of the included variants. The SNPs were coded as 0, 1, and 2 for the presence of no, one and two risk alleles, i.e., homozygous protective, heterozygous and homozygous risk genotype, respectively. A new variable named 'Gene Score' was computed in SPSS by adding up the number of risk alleles for all the SNPs in each subject (e.g., if a subject has the allele profile for all variants as 0, 0, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, and 0, the gene score would be 0 + 0 + 1 + 2 + 0 + 1 + 1 + 2 + 0 + 1 + 1 + 2 + 0 = 11). The distribution of the gene score in cases and controls was examined with a normal distribution curve, and its effect on anthropometric and biochemical traits was tested using linear regression, taking obesity or lipid traits as the dependent variable and the gene score as the independent variable. The analyses were adjusted for confounders including age, gender, socioeconomic status (SES), hypertensive, diabetic, CVD status, etc. The difference between the mean gene score in cases and controls was tested by the independent sample t-test. Due to the inclusion of multiple SNPs, a corrected p-value (0.05/10 = 0.005) was used as the significance cutoff.
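The scoring rule described above can be sketched in Python. The study computed the score in SPSS, so this is only an illustration under stated assumptions: genotypes are assumed to be stored as minor-allele counts (0, 1 or 2) with no missing calls, the SNPs are keyed by gene symbol for readability, and the example subject is hypothetical.

```python
# Gene symbols used as SNP labels (an assumption made for readability).
SNPS = ["LEP", "LEPR", "FABP2", "FTO", "GRB14",
        "ST6GAL1", "VPS26A", "HMG20A", "AP3S2", "HNF4A"]

# For these two SNPs the major allele is the risk allele, as stated in the text.
MAJOR_ALLELE_IS_RISK = {"GRB14", "ST6GAL1"}


def risk_allele_count(snp, minor_allele_count):
    """Convert a minor-allele count (0, 1 or 2) into a risk-allele count."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError(f"invalid genotype code for {snp}: {minor_allele_count}")
    if snp in MAJOR_ALLELE_IS_RISK:
        # Flip the coding so that a higher value always means more risk alleles.
        return 2 - minor_allele_count
    return minor_allele_count


def gene_score(minor_counts):
    """Unweighted gene score: the sum of risk alleles over all ten SNPs."""
    return sum(risk_allele_count(snp, minor_counts[snp]) for snp in SNPS)


# Hypothetical subject, given as minor-allele counts per SNP.
subject = {"LEP": 1, "LEPR": 0, "FABP2": 1, "FTO": 2, "GRB14": 0,
           "ST6GAL1": 1, "VPS26A": 0, "HMG20A": 1, "AP3S2": 1, "HNF4A": 0}

score = gene_score(subject)  # 1 + 0 + 1 + 2 + 2 + 1 + 0 + 1 + 1 + 0 = 9

# Every SNP contributes at most two risk alleles, so ten SNPs give a maximum of 20.
max_score = 2 * len(SNPS)
```

Flipping the coding for the two major-allele-risk SNPs before summing keeps every term positively oriented with respect to risk, which is the adjustment described in the Background.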

Results
The study subject characteristics have been published previously (Table 1) [11]. The reference SNPs' information, including name, respective gene and the minor allele frequency in the cases and the controls, is given in Additional file 1: Table S1. Table 1 shows that all the parameters except height differed significantly between the cases and the controls as tested by the independent sample t-test. The lipid profile parameters deviated from normal ranges (with TC, TG and LDL-c significantly increased and HDL-c significantly decreased) in the cases as compared to the controls. The genotyping call rates for all the SNPs were ~98%.

Gene score distribution in the cases and the controls
The comparison of the gene score between cases and controls is given in Fig. 1. The curve is shifted towards the right in the cases, indicating that a greater number of cases possessed a higher gene score, whereas the majority of the controls had a lower number of risk alleles. The mean gene score of the participants is given in Table 1; it differed significantly between controls (8.35 ± 2.07) and cases (9.1 ± 2.26) (p = 2 × 10⁻⁴). As ten SNPs were included in the analysis, the maximum number of risk alleles an individual could possess is twenty. The descriptives of the gene score are summarized in Table 1.
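As a rough plausibility check of the comparison above, the test statistic can be approximated from the reported summary statistics alone. The sketch below is not the SPSS analysis used in the study. It assumes the group sizes given under Study subjects (250 cases, 225 controls) with no subjects lost to missing genotypes, uses the unequal-variance (Welch) form of the statistic, and replaces the t distribution with a normal approximation, which is reasonable only because the sample is several hundred subjects.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and approximate two-sided p-value from summary statistics."""
    # Standard error of the difference in means, without assuming equal variances.
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean1 - mean2) / se
    # Two-sided tail probability under a standard normal approximation.
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


# Reported gene score mean ± SD: cases 9.1 ± 2.26, controls 8.35 ± 2.07.
t_stat, p_value = welch_t_from_summary(9.1, 2.26, 250, 8.35, 2.07, 225)
```

Under these assumptions the statistic comes out at roughly 3.8 and the approximate p-value is of the order of 10⁻⁴, which is in line with the reported p = 2 × 10⁻⁴.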

Comparison of the effect of gene score and individual variants on obesity
In order to check whether the gene score approach improves the association of the genetic component with obesity as compared to the individual variants, we performed a linear regression analysis. The individual variants had either marginal or no significant association with obesity, whereas the gene score was highly significantly associated with obesity. The p-values in the table indicate the strength of association of the single SNPs and the gene score (Table 2).

The effect of the gene score on anthropometric parameters
The effect of the gene score on anthropometric parameters is presented in Tables 3 and 4: Table 3 summarizes the increase in the mean values of a parameter with increasing gene score, and Table 4 shows the quantitative increase per additional risk allele. It is clear from Table 3 that an increase in the number of risk alleles (i.e., the gene score) increased the weight, BMI, WC, HC and WHR. This is further clarified in Fig. 2, which shows a graphical plot of the relationship between the gene score and the anthropometric traits. The effect of the gene score on the selected anthropometric traits appeared to be quantitative and is shown in Table 4. The beta effect means the per risk allele increase in a parameter and the p-value shows whether this increase is significant or not.
It is clear that the per allele increase in weight, height, HC and WHR is insignificant, while a highly significant increase is observed for BMI and WC.

Fig. 1 Histograms of gene score in cases and controls. The histograms show the normal distribution of the risk allele count of the study participants, the top half for the controls and the bottom half for the cases. The gene score is plotted on the x-axis and the frequency on the y-axis. The bars show the frequency of study subjects in each group at that particular gene score.
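The per-allele beta effect can be illustrated with a minimal least-squares sketch in Python. It is a simplification of the reported analysis: the regressions in this study were adjusted for confounders, which requires a multiple regression, whereas the code below fits a single predictor. The data are a hypothetical, exactly linear toy example and are not study values.

```python
def per_allele_beta(gene_scores, trait_values):
    """Slope and intercept of a simple least-squares fit of trait on gene score."""
    n = len(gene_scores)
    mean_x = sum(gene_scores) / n
    mean_y = sum(trait_values) / n
    # Sum of squares of the predictor and cross-product with the trait.
    sxx = sum((x - mean_x) ** 2 for x in gene_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(gene_scores, trait_values))
    beta = sxy / sxx                      # change in trait per extra risk allele
    intercept = mean_y - beta * mean_x    # fitted trait value at a gene score of 0
    return beta, intercept


# Hypothetical toy data: the trait rises by exactly 0.5 units per risk allele.
toy_scores = [6, 7, 8, 9, 10, 11]
toy_trait = [22.0 + 0.5 * s for s in toy_scores]

beta, intercept = per_allele_beta(toy_scores, toy_trait)
```

In this toy example the fitted slope recovers the 0.5 units per allele that were built into the data. In the study, the corresponding slope for each trait is the beta effect reported in the tables.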

The effect of the gene score on biochemical parameters
The change in biochemical traits with increasing gene score is given in Table 5, and the per risk allele increase in the mean values is shown in Table 6. An increasing number of risk alleles makes the lipid profile more dyslipidemic, as indicated in Table 5 by the increase in the values of TC, TG and LDL and the decrease in the concentration of HDL-C with an increasing gene score. The quantitative effect of the gene score on the selected biochemical traits in Table 6 indicated a strongly significant increase in the levels of all the lipid parameters with the presence of each risk allele. A Bonferroni correction was applied to the analyses, and a corrected p-value (0.005) was used for testing the significance of the association of the gene score because ten SNPs were included. The gene score appeared to be significantly associated only with BMI, WC and all lipid traits.

Discussion
The use of a genetic risk score is not a completely new idea. In many developed countries it is used in the risk scoring of heart diseases, in addition to the conventional risk factors, to decide on the appropriate therapeutic options [15], and a recent study in Pakistan also proposed the use of this approach [16]. It has been used for risk scoring of many polygenic disorders; however, its use is somewhat new for obesity. A recent study found a significant association of a genetic risk score for 32 loci with obesity in obese subjects with major depressive disorder [17]. A 32-locus genetic risk score was also found to be a statistically significant predictor of body mass index and obesity in White subjects from the Atherosclerosis Risk in Communities (ARIC) cohort [18], whereas another study reported the association of a gene score with serum triglyceride levels in morbidly obese Mexican subjects [19]. We used the gene score to study the combined effect of risk alleles on obesity. This is a robust approach, particularly when the sample size is small, as it gives information regarding the additive effects of multiple variants in different genes in the same individual. The effect size assigned to each variant is independent of the effect estimated from the current small study, such that the power problem is somewhat overcome. We selected ten variants from different genes, which means that an individual could have a maximum of twenty risk alleles. By using a gene score approach, we found a significant difference in the mean gene scores between cases and controls, which indicates the role of this index in the development of disease.
Among anthropometric traits, the gene score appeared to be significantly associated with BMI and WC only. These are indices of central and abdominal obesity, and the association of the gene score with these parameters shows that the effect of each risk allele may be small on its own, but that the alleles combined can affect the overall fat distribution and disturb fat metabolism, resulting in an increase in BMI and WC. It is important to note, however, that although we could detect association with these two indices only, the trends of HC and WHR are also in the same direction. The lack of a statistically significant association may be due either to the small sample size or to the possibility that other variants not included in the study influenced the effect of the SNPs.
It has been observed that many lipid/lipoprotein abnormalities are prevalent in obesity. Such abnormalities are collectively termed dyslipidemia; however, these dyslipidemias are often hyperlipidemias, wherein the majority of lipids are shifted towards the upper limit of the range or above it. Obesity-associated dyslipidemia is characterized by an increase in total cholesterol (TC), triglycerides (TG) and low density lipoproteins (LDL-c), and a decrease in high density lipoproteins (HDL-c), with the changes in TG and HDL being the most consistent and pronounced. One study considered fat distribution as an important factor for determining the differential distribution of TG, HDL and lipoproteins in both sexes and indicated the lipid profile in obese persons as an important factor for progression to cardiovascular diseases [20][21][22][23]. We observed that the associations with the lipid traits became less significant when adjusted for BMI, while the association with TG was no longer significant. This showed that the associations were mediated somewhat by BMI. It is thus unclear whether lipid levels or obesity are causal for one another, or whether the genes have pleiotropic effects on both traits. The genetic contribution to obesity is well known in the present era, and many remarkable achievements have been made in elucidating the role of genetics in the development of obesity. There is a long list of candidate and non-candidate genes known to be associated with obesity, and we have comprehensively reviewed it previously [24]. We selected only a few variants from this list, and from other sources retrieved through various search engines, because the available resources supported only the current analysis. Because of this consideration, we tried to select a representative set of variants from a number of genes, so that this pilot study can provide information about the significance of studying the role of these variants in the context of obesity in Pakistani subjects.
The study has the limitations of a relatively small sample size, the inability to include more SNPs in the analyses, and the use of different genotyping strategies for different SNPs. For the first two limitations, future studies should be planned to identify a panel of common variants associated with obesity in the Pakistani population. The limitation of genotyping techniques was relatively overcome by adopting stringent control over the genotyping call rate; the genotyping was repeated wherever a discrepancy was

The beta effect means the rise or fall in the selected trait per each risk allele increase. For example, for TC, a beta effect of 0.168 means that with each risk allele increase, the serum total cholesterol increases by 0.168 mmol/L.