Integrative 1H-NMR-based Metabolomic Profiling to Identify Type-2 Diabetes Biomarkers: An Application to a Population of Qatar

Ullah E1, Shahzad M3, Rawi R1, Dehbi M2, Suhre K4, Selim M5 and Bensmail H1*

1Computational Sciences and Engineering, Qatar Computing Research Institute, Education City, Qatar Foundation, Doha, Qatar
2Qatar Biomedical Research Institute, Education City, Qatar Foundation, Doha, Qatar
3Information Systems Department, Carnegie Mellon University, US
4Department of Physiology and Biophysics, Weill Cornell Medical College in Qatar, Education City, Qatar Foundation, Doha, Qatar
5Dermatology Department, Hamad Medical Corporation, Doha, Qatar


Introduction
Many chronic diseases, such as Type II Diabetes (T2D) and its complications, may be preventable by avoiding the factors that trigger the disease process. Accurate prediction and identification using biomarkers would be useful for disease prevention and for initiating proactive therapies in those individuals most likely to develop the disease. Recent technological advances in proton (1H) Nuclear Magnetic Resonance (NMR) spectroscopy techniques for metabolomic profiling offer great opportunities for biomarker discovery [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18]. Because of experimental limitations of the technical equipment, the levels of some metabolites cannot be universally determined. As the number of measured metabolites often exceeds the number of samples, dimensionality reduction methods are required.
In this study, we present a possible analysis workflow for mining 1H-NMR spectra from a sample of subjects with T2D and controls (see Methods) using robust statistical approaches, namely regularized principal component and regularized cluster analysis methods, as an integrative approach to discover new metabolites and possibly new biomarkers (Figure 1).
QMDiab is a 2012 study from the Dermatology Department of Hamad Medical Corporation in Doha, Qatar. The incentive was the high prevalence of T2D mellitus in Qatar, which ranked 21st worldwide in 2013 (International Diabetes Federation, 2014). Metabolite analysis was performed on blood and urine biofluids of 348 subjects with T2D and controls (here we use the urine biofluid only; 173 males and 175 females), of whom at least 100 patients were Qatari. The subject characteristics are shown in Table 1.
In the first round of analysis, the complete spectra were binned into different bin sizes and normalized using the total peak area normalization method [8]. The qualitative analysis of the major variances in the spectra was performed directly by using a newly developed flexible and robust PCA (which we named fPCA) that preprocesses noisy and correlated 1H-NMR data (Figure 2). The fPCA was able to cluster all the samples without diabetes, whereas the samples with diabetes had a wide spread.

Diabetes is usually a lifelong chronic disease characterized by an above-average concentration of sugar in the blood and urine. It is characterized as a disease of affluence and hence affects a considerable portion of the population of the developed world. Diabetes is caused by a reduction in insulin production by the pancreas or a decreased response of body cells to insulin. The prevalence of diabetes in Qatar is higher in females than in males [1]. Risk factors also increase with age, obesity, hypertension, heart disease and smoking habits [1]. Family history also affects a person's predisposition to diabetes, which shows that there is a significant genetic component. In Qatar, marriages between close cousins are common, and this is a cause for concern. The Qatar Diabetes Association (QDA), which was set up by Qatar Foundation (www.qf.org.qa), is leading the fight against diabetes by educating the general population about the risk factors.

Approach
Technologies to measure high-throughput biomedical data in proteomics, chemometrics, and genomics have led to a proliferation of high-dimensional data that pose many statistical challenges. Because metabolites are biologically interconnected, the variables in these data sets are not only far more numerous than the samples but are often highly correlated and noisy. More generally, methods such as PLS, PCA and SPCA can be used as dimension reduction techniques that find projections of the data that maximize the covariance between the data and the response [15]. During the last decade, several methods have been proposed to encourage sparsity in these projections, or loadings vectors, in order to select relevant features in high-dimensional data [13,14]. There are several motivations for regularizing the PCA loadings vectors. Several authors have shown that the PCA projection vectors are asymptotically inconsistent in high-dimensional settings, and encouraging sparsity in the loadings has been shown to yield consistent projections [11][12][13][14][15]. However, the computational cost is high when a large number of loadings is required, so it is desirable to find an approach that regularizes the loading scores, reduces features and speeds up the computation of PCA. The PCA loading vectors can be used as a data compression technique when making future predictions; sparsity further compresses the data. As many variables in high-dimensional data are noisy and irrelevant, sparsity provides a method for automatic feature selection. This leads to results that are easier to interpret and visualize. While sparsity in PCA is important for high-dimensional data, there is also a need for more general and flexible regularized methods. Consider our NMR spectroscopy as a motivating example. This high-throughput technology measures the spectrum of chemical resonances of all the latent metabolites, or small molecules, present in a biological sample.
Typical experimental data consist of discretized, functional, and nonnegative spectra with variables numbering in the thousands for only a small number of samples. Additionally, variables in the spectra have complex dependencies arising from correlation at adjacent chemical shifts, metabolites resonating at more than one chemical shift, and overlapping resonances of latent metabolites. Because of these complex dependencies, there is a long history of using PCA to reduce the NMR spectrum for supervised data [16]. Classical PCA or Sparse PCA, however, are not optimal for this type of data, as they do not account for the non-negativity or functional nature of the spectra and do not encourage sparsity or group sparsity.
In this paper, we seek a more flexible framework for regularizing the PCA loadings of high-dimensional 1H-NMR data, one that encourages sparsity, group sparsity, or smoothness, and also leads to a more computationally efficient and fast numerical algorithm. Metabolomic studies require an accurate and fast analysis of spectra from hundreds and even thousands of biological samples. Statistical inference is the only approach that can yield objective and reproducible conclusions from such data. At present, the statistical tools available for this task are of limited performance.

QMDiab Study
This study was embedded in the Qatar Metabolomics Study on Diabetes (QMDiab), a cross-sectional case-control study with 348 subjects (Tables 1-3). The work was a joint collaboration between Hamad Medical Corporation and Weill Cornell Medical College Qatar. Patients were asked to enroll between February and June 2012. The study was approved by the Institutional Review Boards (IRB) of Hamad Medical Corporation and Weill Cornell Medical College Qatar and is in accordance with the Helsinki Declaration of 1975. Written informed consent was obtained from all participants. The study measured metabolites in 348 individuals aged 17 to 81. The metabolites were measured in three body fluids: non-fasting blood plasma, urine, and saliva. Between February and June 2012, 1107 samples were taken from the participants, covering 1563 metabolites including amino acids, peptides, carbohydrates and lipids, together with age, gender, ethnicity, weight, height, Body Mass Index (BMI) and personal history of T2D [17].
The samples were analyzed by three companies: Metabolon Inc., Chenomx Inc., and Biocrates Life Sciences AG, which used, respectively, liquid/gas chromatography coupled with mass spectrometry, targeted profiling using NMR, and Multiple Reaction Monitoring (MRM). The study found that ethnicity, gender and smoking all had a strong effect on a diabetes risk factor, advanced glycation end products: women, Arabs, Filipinos, and smokers were more strongly affected than men, South Asians, and non- or irregular smokers [17].

Statistical Analysis
NMR binned data: When dealing with high-resolution NMR spectra, it is in general impracticable to work with all the data points of the spectra, which are usually on the order of 32K points or more. The most common strategy used to reduce the number of variables consists in dividing each spectrum into a defined number of regions, the so-called bins. Several binning strategies are available today, from regular binning, where bins have fixed width, to more sophisticated strategies such as Gaussian or dynamic adaptive binning [8]. Here we used regular binning to preprocess the high-resolution data (∼65,536 data points in a single spectrum) and remove any anomalies. This was motivated by the fact that, when dealing with an array of NMR spectra, regular binning with a fixed number of bins over the stacked spectra generates a matrix, whereas it is not possible to generate a similar matrix directly from deconvolved peaks (peak lists), since the number and position of peaks vary from spectrum to spectrum. In our case, we used a binning approach that automates the binning of the assembled NMR spectra using imposed alignment of each spectrum. We had 348 files containing NMR coordinates, each with approximately 65,000 data points, so our algorithm had to iterate through 348 x 65,000 = 22,620,000 (22 and 1/2 million) data points. Each bin gives rise to a new value which is representative of the bin. We used a bin interval of 0.007 ppm. Using Java, we iterated through all x values in each interval and calculated the mean and standard deviation; we then kept the values inside m ± 3σ and calculated their mean. The matrix obtained after processing the data was of size 348 x 2960.
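The binning step described above can be sketched as follows. This is a minimal illustration only (the original pipeline was in Java): it assumes an already aligned spectrum stored as (ppm, intensity) arrays, and applies the 0.007 ppm bin width and the m ± 3σ trimming described in the text to a synthetic spectrum.

```python
import numpy as np

def bin_spectrum(ppm, intensity, bin_width=0.007):
    """Regular binning of one aligned NMR spectrum.

    Each bin value is the mean of the intensities falling inside
    m +/- 3*sigma, where m and sigma are the bin's mean and standard
    deviation (the trimming step described in the text).
    """
    edges = np.arange(ppm.min(), ppm.max() + bin_width, bin_width)
    values = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        x = intensity[(ppm >= lo) & (ppm < hi)]
        if x.size == 0:
            values.append(0.0)
            continue
        m, s = x.mean(), x.std()
        kept = x[np.abs(x - m) <= 3 * s]
        values.append(kept.mean() if kept.size else m)
    return np.array(values)

# Toy example: a synthetic 1H spectrum over 0-10 ppm with one peak at 3.2 ppm.
rng = np.random.default_rng(0)
ppm = np.linspace(0, 10, 65536)
intensity = np.exp(-((ppm - 3.2) / 0.01) ** 2) + 0.01 * rng.standard_normal(ppm.size)
binned = bin_spectrum(ppm, intensity)
```

Stacking one such binned vector per sample yields the spectra-by-bins matrix (348 x 2960 in the study) used in all downstream analyses.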
Sparse PCA with elastic net: Consider the linear regression model with n observations and p predictors.
Let $y = (y_1, ..., y_n)^T$ be the response vector and $X = [X_1, ..., X_p]$ the predictors, where $X_j = (x_{1j}, ..., x_{nj})^T$, $j = 1, ..., p$. After a location transformation, we can assume that all the $X_j$ and $y$ are centered. The lasso is a penalized least squares method that imposes a constraint on the $\ell_1$ norm of the regression coefficients. Thus, the lasso estimates $\hat\beta^{lasso}$ are obtained by minimizing the lasso criterion

$$\hat\beta^{lasso} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$$

where $\lambda$ is non-negative. The lasso continuously shrinks the coefficients toward zero, and achieves its prediction accuracy via the bias-variance trade-off. Due to the nature of the $\ell_1$ penalty, some coefficients are shrunk to exactly zero if $\lambda$ is large enough. The elastic net [12] generalizes the lasso to overcome its drawbacks, while enjoying its other favorable properties. For any non-negative $\lambda_1$ and $\lambda_2$, the elastic net estimates $\hat\beta$ are given by

$$\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2.$$

The connection between robust regression methods and PCA has been discussed by [11], where sparse PCA becomes equivalent to an optimization problem of this form. It is worth noting that (i) for $p \gg n$, the augmented data set has $p + n$ observations and $p$ variables, which can slow the computation considerably; (ii) if the original design matrix is normalized, there is no guarantee that the augmented design matrix behaves similarly, which can cause the loss of part of the interpretability of the data; and (iii) the coordinate-descent algorithm proceeds one coordinate at a time, i.e., it minimizes the loss function over $\beta_j$ while keeping the components $\beta_k$, $k \neq j$, fixed at their current values, so a Gauss-Seidel scheme cannot be developed for a grouped variable selection problem. To overcome these limitations, we derive a unified algorithm based on the alternating direction method of multipliers to handle sparse principal component selection, which aims at selecting important components and penalizing the others through $\beta$ [20].
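The ℓ1 shrinkage behaviour described above can be demonstrated with a generic textbook coordinate-descent lasso on made-up toy data (this is not the authors' code; the data and penalty values are illustrative only):

```python
import numpy as np

def soft_threshold(a, kappa):
    """Soft-thresholding operator S_kappa(a)."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Toy data: only the first two of ten predictors are truly active.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

beta_small = lasso_cd(X, y, lam=1.0)    # mild shrinkage
beta_large = lasso_cd(X, y, lam=500.0)  # heavy shrinkage: all coefficients zero
```

Increasing λ drives more coefficients to exact zero, which is the feature-selection property exploited later for the loading vectors.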
We propose a doubly regularized model with a general penalty term of the form

$$P_{\lambda,\mu}(\beta) = \lambda \sum_{j=1}^{p} \omega_j |\beta_j| + \frac{\mu}{2}\, \beta^T Q \beta,$$

where $\lambda, \mu \geq 0$ are two tuning parameters, and $\omega = (\omega_1, ..., \omega_p)^T$ and $Q = (q_{ij})_{1 \leq i,j \leq p}$ are weights associated with the $\ell_1$ and $\ell_2$ norms respectively, which are fixed in advance.
The advantages of our algorithm are: (1) It provides a general framework that addresses the limitations of unweighted versions of lasso-type estimates; with a proper choice of the weights, a weighted version possesses the oracle property of selecting the subset of interesting variables, increasing the number of hits and decreasing the number of false positives. (2) It combines the strengths of the Lasso and a quadratic penalty designed to capture additional structure on the features in the high-dimensional settings that are frequent in high-throughput data generated from 1H-NMR spectroscopy. (3) It leads to an easy and fast algorithm using the Alternating Direction Method of Multipliers (ADMM) approach to find the optimal estimator without augmenting or normalizing the data (see next section).

Alternating Direction Method of Multipliers (ADMM):
Recently, the alternating direction method of multipliers has been revisited and successfully applied to solving large-scale problems arising from different applications. In this section we give an overview of ADMM. Consider the following optimization problem:

$$\min_{\beta, \xi} \; f(\beta) + g(\xi) \quad \text{subject to} \quad \beta - \xi = 0,$$

where $f$ and $g$ are two convex functions and $\beta, \xi \in \mathbb{R}^p$. In this optimization problem we have two sets of variables with a separable objective. The augmented Lagrangian for this problem is:

$$L_\tau(\beta, \xi, \delta) = f(\beta) + g(\xi) + \delta^T(\beta - \xi) + \frac{\tau}{2}\|\beta - \xi\|_2^2,$$

where $\delta$ is the dual variable for the constraint $\beta - \xi = 0$ and $\tau > 0$ is a penalty parameter. Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method and, in particular, to yield convergence without strong assumptions such as strict convexity or finiteness of $f$ and $g$.
At iteration $k$, the ADMM algorithm consists of three steps:

$$\beta^{k+1} := \arg\min_\beta L_\tau(\beta, \xi^k, \delta^k), \qquad \xi^{k+1} := \arg\min_\xi L_\tau(\beta^{k+1}, \xi, \delta^k), \qquad \delta^{k+1} := \delta^k + \tau(\beta^{k+1} - \xi^{k+1}).$$

Stopping criteria: the primal and dual residuals at iteration $k$ have the forms $r^k = \beta^k - \xi^k$ and $s^k = \tau(\xi^k - \xi^{k-1})$. The ADMM algorithm terminates when the primal and dual residuals satisfy a stopping criterion; a typical criterion, given in [5], is to terminate when $\|r^k\|_2 \leq \epsilon^{pri}$ and $\|s^k\|_2 \leq \epsilon^{dual}$. In our flexible elastic-net formulation, Equation (13), $\lambda, \mu$ are two non-negative tuning parameters, $Q$ is a positive semi-definite matrix, $y^{**} = \Sigma^{1/2}\alpha_j$, $X^{**} = \Sigma^{1/2}$, and $\Sigma$ is the covariance matrix of $X$.
Equation (13) combines the strengths of regularized techniques of Lasso type and a quadratic penalty designed to capture additional structure on the features. When $\hat\omega_j = 1$, it is straightforward to show that all lasso-type models (Lasso, Enet, Slasso, L1Cp and Wfusion) are particular cases of (13) under an augmented data reparameterization. Therefore, any efficient algorithm developed to find the whole solution path of the Lasso, such as least angle regression or the coordinate descent algorithm, can be applied. Unfortunately, the good properties of these two optimization techniques are overshadowed by difficulties (i), (ii) and (iii). To deal with these problems, we propose to solve (13) using the ADMM algorithm. The idea is simple and straightforward. First, we rewrite (13) in ADMM form (14), so that (13) becomes (5), with $f$ and $g$ two convex functions. Applying the ADMM algorithm to (14), we have to perform the following steps at each iteration, starting with the β-minimization step.
This step updates $\beta^k$ by:

$$\beta^{k+1} := (X^T X + \mu Q + \tau I_p)^{-1}\left(X^T y + \tau(\xi^k - \eta^k)\right).$$

The ξ-minimization step applies the soft-thresholding function

$$S_\kappa(a) = \begin{cases} a - \kappa & \text{if } a > \kappa, \\ 0 & \text{if } |a| \leq \kappa, \\ a + \kappa & \text{if } a < -\kappa, \end{cases}$$

introduced and analyzed by [6]. The dual-update step is straightforward and consists of updating $\eta^k$ by $\eta^{k+1} := \eta^k + \beta^{k+1} - \xi^{k+1}$. It is worth noticing that, since $\tau > 0$, $\mu \geq 0$, and $X^T X$ and $Q$ are positive semi-definite matrices, $(X^T X + \mu Q + \tau I_p)$ is always invertible. If $p > n$, letting $M = \mu Q + \tau I_p$, we can exploit the Woodbury formula for $(X^T X + M)^{-1}$ to reduce the cost of the calculations. Algorithm 2 gives the complete details of the flexible elastic net with ADMM, and Algorithm 3 summarizes the flexible PCA.
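The three updates just described can be sketched in a few lines of numpy. This is a minimal illustration of scaled-form ADMM for an elastic-net-type objective, not the authors' Algorithm 2: it assumes $Q = I_p$ (plain elastic net, no weights ω) and uses made-up toy data.

```python
import numpy as np

def soft_threshold(a, kappa):
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def admm_enet(X, y, lam, mu, tau=1.0, n_iter=1000):
    """Scaled-form ADMM for (1/2)||y - X b||^2 + lam ||b||_1 + (mu/2) b^T Q b.

    beta-step: linear solve; xi-step: soft thresholding; eta: dual update.
    Here the quadratic weight matrix Q is taken to be the identity.
    """
    n, p = X.shape
    Q = np.eye(p)
    A = X.T @ X + mu * Q + tau * np.eye(p)  # always invertible since tau > 0
    Xty = X.T @ y
    xi = np.zeros(p)
    eta = np.zeros(p)
    for _ in range(n_iter):
        beta = np.linalg.solve(A, Xty + tau * (xi - eta))
        xi = soft_threshold(beta + eta, lam / tau)
        eta = eta + beta - xi
    return xi  # the sparse iterate

# Toy data with a single active predictor out of twenty.
rng = np.random.default_rng(2)
X = rng.standard_normal((80, 20))
y = 2.0 * X[:, 0] + 0.05 * rng.standard_normal(80)
b = admm_enet(X, y, lam=5.0, mu=1.0)
```

Because the ξ-step is an exact soft-thresholding, the returned iterate contains exact zeros, which is what makes the loadings interpretable.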

Tuning parameters selection:
In practice, it is important to select appropriate tuning parameters in order to obtain good prediction precision and to control the amount of sparsity in the model. Choosing the tuning parameters can be done by minimizing an estimate of the out-of-sample prediction error. If a validation set is available, this can be estimated directly; lacking a validation set, one can use ten-fold cross-validation. In our experiments, λ takes 100 logarithmically equally spaced values, µ ∈ {0, 0.1, 1, 10, 100} and γ ∈ {0.5, 1, 2.5, 5, 25}.
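A minimal sketch of this tuning strategy, using a held-out validation set and closed-form ridge regression as a stand-in for the flexible elastic net (the 100-point logarithmic grid mirrors the one above, but the data and model are illustrative only):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (stand-in for the penalized model)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 30))
beta_true = np.concatenate([np.ones(5), np.zeros(25)])
y = X @ beta_true + 0.5 * rng.standard_normal(200)

# Train/validation split in place of a separate validation cohort.
X_tr, X_val = X[:150], X[150:]
y_tr, y_val = y[:150], y[150:]

# 100 logarithmically spaced candidate values, as in the text.
lams = np.logspace(-3, 3, 100)
val_errors = [np.mean((y_val - X_val @ ridge_fit(X_tr, y_tr, lam)) ** 2)
              for lam in lams]
best_lam = lams[int(np.argmin(val_errors))]
```

With ten-fold cross-validation, the validation error above is simply replaced by the average held-out error over the ten folds before taking the argmin.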

Results
fPCA was applied to examine similarities and/or differences in the 1H-NMR spectra. A flexible principal component is a weighted linear combination of the original NMR variables, so that the original data matrix is compressed into a smaller number of variables; the NMR data may be compressed into three to four fPCs in cases where the changes between groups, or due to specific treatments, are quite large. Figure 3 shows the projection of the processed urine samples, with uniform 0.007 ppm bin widths, onto the first and second fPC axes, and Figure 4 summarizes the loading scores. From this projection, the fPCA analysis shows two perpendicular clustered groups with an overlap between diabetic and non-diabetic samples.
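The projection step is, at its core, ordinary PCA scoring. The following sketch uses plain (unregularized) PCA via SVD on a made-up binned matrix in place of the paper's fPCA, to show how loading spectra and sample scores of the kind plotted in Figures 3 and 4 are obtained:

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy stand-in for the 348 x 2960 binned urine matrix: two latent groups.
n, p = 60, 300
group = np.repeat([0, 1], n // 2)
B = rng.standard_normal((n, p)) * 0.1
B[group == 1, :5] += 1.0  # group shift concentrated on a few "bins"

# Center, then take the top two right-singular vectors as loadings.
Bc = B - B.mean(axis=0)
U, S, Vt = np.linalg.svd(Bc, full_matrices=False)
loadings = Vt[:2].T          # p x 2, analogous to the fPC1/fPC2 loading spectra
scores = Bc @ loadings       # n x 2, the coordinates plotted in the score plot
```

In the toy data the group difference dominates the first component, mirroring how fPC1 captures the diabetic versus non-diabetic contrast in the study.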
Interestingly, after further analysis, we identified that the overlap corresponds to patients with controlled diabetes. Any supervised learning algorithm may lead to invalid results due to the large overlap of diabetic and non-diabetic samples and provide a one-cluster summary. Here, flexible principal component 1 (fPC1) provides the maximum variance across diabetic and non-diabetic samples, while flexible principal component 2 (fPC2) summarizes the maximum variance among samples within the diabetic or non-diabetic groups. Sixty metabolites were identified by 1H-NMR spectrum analysis of the original data and the loading spectra of fPC1 and fPC2. Twenty-four metabolites from major energy sources, such as carbohydrates, lipids, and proteins, were identified by NMR spectrum analysis of the original data and the loading spectrum of fPC1, and can be used as potential biomarkers for the detection of diabetes. In total, 24 metabolites were detected at statistically different concentrations (Table 4).
Many compounds detected at higher levels in T2D were end products of gluconeogenesis, including glucose and its polymer. Glycine-betaine (betaine) and glutamate, two of the major osmoprotectants used by S. Typhimurium, were found at higher concentrations. Other compounds more abundant in gestational diabetes mellitus were 3-hydroxyisovalerate and 2-hydroxyisobutyrate, probably due to altered biotin status and amino acid and/or gut metabolisms (the latter possibly related to higher BMI values). The major compounds detected at higher levels were the upper TCA cycle intermediate succinate and lactate/malate, linked to the dual-function transhydrogenase Delta(1)-piperideine-2-carboxylate/Delta(1)-pyrroline-2-carboxylate reductase, the first member of a novel subclass in a large family of NAD(P)-dependent oxidoreductases [7]. The compounds identified by analysis of the loading spectrum of fPC2 are those that indicate most of the variation between normal and diabetic samples (Table 6). The fPCA was able to cluster all the samples without diabetes, while the samples with diabetes had a wide spread. An important observation in this case is the overlap of diabetic and non-diabetic samples. The overlap may be a result of controlled diabetes in the diabetic samples, which resulted in normal concentrations of metabolites compared to the metabolite concentrations of the non-diabetic samples. The distribution of samples in fPC1-fPC2 space shows that diabetic and non-diabetic samples are distributed along fPC1 and that variation among the samples within the diabetic and non-diabetic groups is along fPC2. The spectra for fPC1 and fPC2 are shown in Figure 4. The spectra were processed to compute the concentrations of metabolites.
Since the reference compound was present in all samples, it was not included in fPC1 and fPC2; therefore, the concentrations of compounds in fPC1 and fPC2 could not be computed, but the maximum concentration of each compound was calculated and is shown in Table 5. Moreover, 9 people out of 178 had potential Paraquat poisoning based on abnormal concentrations of citrate, glutinane and alanine, and 24 out of 178 had salicylate (Aspirin) detected in their blood and urine. Regarding the Aspirin abnormality, we note that people with diabetes are often encouraged to take Aspirin, as it may reduce the risk of heart attack due to coronary obstruction, a risk many diabetics face (Figure 5).

Conclusion
In this study, we presented an integrative analysis that revealed metabolites correlated with diabetes in a subset of the Qatari population; furthermore, we identified specific metabolites affected by medication, which constrains the differentiation between diabetic and control patients. Despite significant advances, no single profiling method currently allows the simultaneous analysis of all the metabolites in the metabolome. The main achievement of our study is an integrative statistical method for mining raw 1H-NMR data. Challenges appear in handling big data, where the number of peaks is larger than the number of samples, which limits the use of traditional statistical methods. Our next step is the continued development of computational methods for the analysis of complex 1H-NMR datasets and their integration with equally complex genomic, transcriptomic, and proteomic profiles, as well as metabolome-integrated network analysis.