Automated analysis for multiplet identification from ultra-high resolution 2D- 1H, 13C-HSQC NMR spectra

Background: Metabolism is essential for cell survival and proliferation. A deep understanding of the metabolic network and its regulatory processes is often vital to understand and overcome disease. Stable isotope tracing of metabolism using nuclear magnetic resonance (NMR) and mass spectrometry (MS) is a powerful tool to derive mechanistic information of metabolic network activity. However, to retrieve meaningful information, automated tools are urgently needed to analyse these complex spectra and eliminate the bias introduced by manual analysis. Here, we present a data-driven algorithm to automatically annotate and analyse NMR signal multiplets in 2D- 1H, 13C-HSQC NMR spectra arising from 13C - 13C scalar couplings. The algorithm minimises the need for user input to guide the analysis of 2D- 1H, 13C-HSQC NMR spectra by performing automated peak picking and multiplet analysis. This enables non-NMR specialists to use this technology. The algorithm has been integrated into the existing MetaboLab software package. Methods: To evaluate the algorithm performance two criteria are tested: is the peak correctly annotated and secondly how confident is the algorithm with its analysis. For the latter a coefficient of determination is introduced. Three datasets were used for testing. The first was to test reproducibility with three biological replicates, the second tested the robustness of the algorithm for different amounts of scaling of the apparent J-coupling constants and the third focused on different sampling amounts. Results: The algorithm annotated overall >90% of NMR signals correctly with average coefficient of determination ρ of 94.06 ± 5.08%, 95.47 ± 7.20% and 80.47 ± 20.98% respectively. Conclusions: Our results indicate that the proposed algorithm accurately identifies and analyses NMR signal multiplets in ultra-high resolution 2D- 1H, 13C-HSQC NMR spectra. It is robust to signal splitting enhancement and up to 25% of non-uniform sampling.


Amendments from Version 1
We thank the reviewers for their comments and have amended the manuscript accordingly. This included suggestions for the optimal spectral resolution required for the presented algorithm to be able to accurately determine multiplet components and how this changes with increasing amounts of enhancement of multiplet splittings due to J-coupling. We also added an analysis of a fully sampled HSQC spectrum from the qtmge dataset to the extended data repository, tables S2 and S5.
We clarified that there are three HSQC spectra overlaid in Figure 1. We also commented on reduction of data acquisition times through use of splitting enhancement due to J-coupling. We added a mathematical expression to clarify our vector normalisation. We also added a comment on how to minimise pH variations between different samples during sample preparation.
Finally, we corrected several typing errors.

Introduction
An efficient metabolism is the basis for the survival of any living organism 1,2 . Inside each cell, the network of metabolic reactions is organised into discrete but intrinsically linked metabolic pathways. Inside the cell, these pathways then function within different cellular compartments. Just as healthy metabolism requires the efficient functioning of the metabolic network, perturbations within one compartment can result in disease by spreading through and altering the entire metabolic network 3,4 . This is a direct consequence of inherent redundancy of the metabolic network, which means that many metabolites can be anabolised or catabolised by several different metabolic pathways.
Metabolomics studies allow the measurement of the concentrations of metabolites present at a given time but provide no direct information as to the metabolic source of these metabolites. Feeding metabolic precursors enriched with low-abundance stable isotopes such as 13 C, for example [1,[2][3][4][5][6][7][8][9][10][11][12][13] C] glucose or [U- 13 C] glutamine, enables the tracing of specific metabolic pathways usage. The contribution of the different metabolic pathways leading to the production of this metabolite can be untangled through the analysis of the 13 C distribution within the metabolite [5][6][7] .
While technologies such as gas chromatography-mass spectrometry (GC-MS) are highly sensitive but low-resolution, nuclear magnetic resonance (NMR) spectroscopy is a high-resolution technique, albeit with relatively low sensitivity. In our previous work, we developed and published a method for the combined analysis of NMR spectroscopic and MS data that harnesses the strengths of both technologies to produce highly-resolved metabolism information in the form of metabolite isotopomers 8 . The 2D-1 H, 13 C-HSQC NMR spectrum contains information about the relative incorporation of 13 C labelled nuclei into metabolites. This information is found in the multiplets of each resonance, with the pattern of the multiplet being due to the presence or absence of 13 C nuclei in positions adjacent to the detected nuclei. The extent of peak splitting due to isotope incorporation (known as J-coupling) is dependent on the nuclei involved and the local chemical structure. Each multiplet is therefore virtually unique for a specific resonance of each metabolite. As many different combinations of labelling patterns may be present in each sample, the overall peaks pattern can be complex but can always be derived as a linear superposition of the individual multiplet components.
Signal annotation and multiplet analysis are the biggest challenges in the analysis of such 2D-1 H, 13 C-HSQC NMR spectra, as precise resonance positions of metabolite resonances can change as a result of minor pH or sample composition changes. This can provide significant challenges in crowded areas of the NMR spectrum, especially for multiplet signals with similar, albeit distinct 13 C/ 13 C-Jconstants. In addition, the overall shape of a specific multiplet differs quite drastically depending on the choice of tracer (e.g. [U-13 C] vs [1,2-13 C] glucose), as well as the specific activity of metabolic pathways. An example of these challenges is depicted in Figure 1, where the signal for HC(2) of lactate is shown for three different samples of different tracers. In addition, the chemical shifts of the three lactate signals are different for the three samples because of differences in pH. It becomes quite clear why the analysis of such 2D-1 H, 13 C-HSQC NMR spectra requires an in-depth knowledge and expertise in 2D-1 H, 13 C-HSQC NMR spectroscopy of complex mixtures in order to perform an efficient and an unbiased analysis of the data. Automated data analysis of 2D-1 H, 13 C-HSQC NMR tracing data provides an avenue free of operator bias and allows researchers with limited NMR and data analysis expertise to use this powerful analytical technique with ease.
The necessity for automated algorithms has been recognised in other NMR spectroscopy areas such as automatic assignment and data analysis for metabolomics applications 9,10 and machine learning applications for HSQC data analysis for drug discovery 11,12 . However, no such approaches have been implemented so far for the automated analysis of 13 C -multiplets present in 2D-1 H, 13 C-HSQC NMR spectra used for metabolic tracing NMR experiments.
Here, we present a machine-learning based algorithm, which automatically annotates and analyses multiplets in 2D-1 H, 13 C-HSQC NMR. After signal annotation and multiplet decomposition, the algorithm assesses quality of data analysis, which will inform the user about how much trust they can put into the result. The software developed here is fully integrated in the MetaboLab software package 13 and can be run as a fully automated procedure to be used by researchers with biological expertise without the necessity of expert NMR or data analysis knowledge.

Algorithm development Outline
Mathematically, the identification of a metabolite signal presents itself as a signal unmixing problem. The aim is the decomposition of the NMR spectrum into statistically independent components by using a data-driven approach and without prior knowledge on how the multiplet components mix in the 2D spectrum. A 2D-1 H, 13 C-HSQC NMR spectrum is a linear superposition of all resonances from each 13 CH pair in each metabolite within the sample. The NMR multiplets of each metabolite are independent of the resonances of all other metabolites. Thus, it is theoretically possible to disentangle major spectrum components in a data-driven manner. The problem can be formulated as an inverse modelling problem, where multivariate observations (e.g. the 2D-1 H, 13 C-HSQC NMR spectrum) relate to lower-dimensional vectors of statistically independent variables (e.g. the contribution of each metabolite to the 2D-1 H, 13 C-HSQC NMR spectrum), using a linear model. Once a metabolite multiplet has been identified and localised within the spectrum, the contribution of each multiplet component to the entire signal must be quantified. The algorithm presented here uses linear least square regression (LS) with parameters constrained to non-negative values only 14 to estimate the relative contribution of each multiplet component to the experimentally found resonance.
As a result, in order to implement an algorithm capable of automated signal annotation and multiplet decomposition, the computational problem for each resonance of each metabolite was subdivided into several steps ( Figure 2 In the following sections, the pipeline of the algorithm developed in this work is presented, each section explains a principal element of the algorithm (Figure 2). To provide an overview, a pseudo-code representation of the implemented algorithm can be found in the extended data repository (githubRef ,  Table S1 and Algorithm 1).

Metabolite localisation in the HSQC NMR spectra
The assignment of NMR resonances to specific metabolite nuclei is based on a peak list obtained from a spectral library included in the MetaboLab software which was derived from publicly available databases [15][16][17][18] .
For a given metabolite, the algorithm loops through each resonance and derives expected 1 H and 13 C chemical shifts from the built-in database. The next logical step would be to identify the precise location of the spectral resonance. However, because this algorithm works without any a priori knowledge about metabolic pathways or which metabolic tracer was used, we do not have any knowledge about neither the exact chemical shift values of the resonance nor the multiplet composition. In order to identify the correct multiplet in the 2D-1 H, 13 C-HSQC NMR spectrum, we can use information about 13 C/ 13 C-Jcouplings, stored in the spectral database included with the software. While differences in pH and sample matrix composition may lead to a change in the precise resonance location in the 2D-1 H, 13 C-HSQC NMR spectrum, the 1 H and 13 C chemical shift values will still be similar to the values given by the database. To improve the software's computational efficiency, the search space (i.e. the portion of 2D-1 H, 13 C-HSQC NMR spectrum to be examined) for each multiplet signal is restricted from the entire spectrum to an area of ±maxWidth1H ppm (±0.15 ppm) and ±maxWidth13C ppm (±3.0 ppm) around the 1 H and 13 C library shift of the metabolite respectively. The maxWidth parameters refer to user changeable parameters with the standard values indicated if the user does not provide these values. This range allows the algorithm to accommodate changes in chemical shift due to the aforementioned factors.
Multiplet identification: Independent Component Analysis (ICA) As mentioned earlier, the computational problem to identify a particular metabolite multiplet in the restricted spectral area can be understood to be a signal unmixing problem. One particularly efficient way to solve this problem is Independent Component Analysis (ICA). Mathematically, we perceive the NMR spectrum columns (i.e. 13 C 1D sub-spectra) of the chosen spectral area to be N observations (y 1 (ω), y 2 (ω),..,y i (ω),..,y N (ω)), where each observation y i (ω) ∈ ℝ T

Figure 2. Schematic representation of the main steps of the algorithm for automated 2D-1 H, 13 C-HSQC NMR spectrum analysis.
The computational routine uses a list of metabolite names as input and outputs a metabolite composition as well as the software reliability for each resonance analysis. Optionally, the user can choose to produce a graphical report of the analysis in form of a pdf file.
contains contributions of a smaller number of K NMR signals and noise. The K number of signals are also known as latent variables (s 1 (ω),s 2 (ω),..,s i (ω),..,sK (ω)). While the noise of each spectral column is assumed to be independent, we assume each observation y i (ω) to be a mixture of K latent variables weighted by unknown coefficients. We can then define a mathematical model as follows, using a vector-matrix notation and dropping the frequency dependence ω without loss of generality: where Y ∈ ℝ N ×T is the observation matrix, W ∈ ℝ N ×K is the unknown linear mapping matrix from the latent space S ∈ ℝ K ×T to the observation space.
In the case of a 2D-1 H, 13 C-HSQC NMR spectrum of a metabolite mixture, all the multiplets of the metabolites within the restricted spectrum area are latent variables (s 1 , ..., s T ) and the restricted area of the 2D-1 H, 13 C-HSQC NMR spectrum is the observation. While a 1D-ICA generally requires multiple 1-dimensional observations, here one 2-dimensional observation is available (i.e. the search area of the 2D-1 H, 13 C-HSQC NMR spectrum).
The shortcoming of having only a single 2D observation is addressed by utilising the proton dimension as the observation dimension. This means that all the 13 C NMR spectra at different proton chemical shifts of the 2D-spectrum are considered as observations, therefore the proton dimension is the observation dimension, and the problem can be solved using 1D-ICA. The K latent factors are the underlying spectral components (e.g. the mutiplets of the metabolites). The K components are assumed to be statistically independent as required for the ICA algorithm 19 . This assumption is also physically/chemically plausible, as there is no interaction between metabolites at different 1 H-shifts 20 . Moreover, the linearity of the model can be justified, as the metabolites mix in an additive way in the spectrum 20 . These assumptions justify the choice of using ICA among other unmixing algorithms such as non-negative matrix decomposition (NMF), since ICA finds a decomposition of the observed data to retrieve latent components which are as independent as possible. For the practical implementation of this algorithm, we chose to use an implementation of the FastICA 21 algorithm. Details on the working principle of 1D-ICA and its adaptation to solve the proposed problem are described in the Supporting Information (Section: Independent Component Analysis for mutiplets identification). Figure 3 provides an example of some of the ICA components (independent components are shown in green and experimental 13 C NMR sub-spectra are plotted in blue) provided by the ICA algorithm during the analysis for HC(2) of Lactate. The 1 H chemical shift range for the shown independent components was 3.95 to 4.2 ppm.
Localisation along the proton dimension by crosscorrelation While ICA is very successful in identifying statistically independent resonances in the 13 C dimension, information about the 1 H localisation is lost. In order to assign a specific 1 H chemical shift to each latent component a cross-correlation between each component and all columns from the search area of the experimental NMR spectrum is required. Because each of the NMR columns is associated with a specific 1 H ppm value, the column which receives the largest cross-correlation value determines the assigned 1 H chemical shift of that component.
To avoid selection of latent components representing spectral noise only, the algorithm uses a minimum threshold for the correlation value (minCorr) below which the latent component will be discarded from further evaluation. The standard value of this minimum correlation is 0.8, which can be adjusted by the user if needed.

Latent component identification
The latent components localised within the spectrum might have captured multiplet patterns of metabolites within the searching area, which will contain the resonance of the metabolites of interest if this actually exists in the current spectrum. However, these components will also contain resonances belonging to CH pairs of other metabolites. To determine if the resonance of interest exists in one of the estimated latent components, the following steps are necessary: 1. The multiplet of interest is simulated using chemical shift values and 13 C/ 13 C-J-coupling constants stored in MetaboLab's database are used assuming equal intensities for each of the multiplet components as shown in Figure 4 panel A. The pygamma NMR simulation library (version 4.3.4) is used for multiplet simulations 22 .
2. This multiplet is then aligned with each latent component through cross-correlation (of the simulated multiplet with the latent components) and the 13   Each of these combinations is aligned with the latent component and a regression analysis is performed to obtain the coefficient of determination ρ scores, which quantifies the goodness of the fitting. The combination of multiplet components which gives the highest ρ value is chosen and the ρ score saved. Theoretically, this should result in the highest ρ score for the correct multiplet simulation at the right 13 C chemical shift.
-However, the ρ value is sensitive to underlying noise and NMR resonances close to the resonance of interest. Therefore, in each latent components only NMR signals with maxima greater than 50% of the global maximum within the search area are retained and noise related issues can be avoided. In addition, the score ρ is weighted by the squared deviation ∆ppm of the determined 13 where γ H and γ C are the 1 H and 13 C gyromagnetic ratios.
Multiplet analysis: contribution of multiplet components to overall signal As the last step, the relative intensity of each multiplet component is then determined by a constrained linear least square regression. Prior to this step we optimize the alignment along the proton dimension using the hill climbing heuristic optimization algorithm 24 to ensure that the 1 H chemical shift obtained in the previous localisation step corresponds to the maximum of the experimental NMR resonance in the 1 H dimension. The regression is defined as follows 5: where F is the simulated multiplet with adjusted intensities of all multiplet components, X is a matrix containing the different multiplet sub-spectra as input and B ≥ 0 is a vector containing the relative contribution values, which is the vector of unknowns that needs to be determined. After the constrained linear regression, vector B is normalised, so that all relative contributions add up to be 1 (B = B/∑ i b(i), where b(i) are the vector components).

Sample preparation
For the test NMR samples, cells were cultured in the presence of isotopically labelled tracer, chosen dependent on the pathway of interest. For instance, cells cultured in the presence of [1,2-13 C 2]-D-glucose can provide information regarding flux through pyruvate dehydrogenase (PDH) and into the tricarboxylic acid (TCA) cycle via the isotomomer composition of C4 of glutamate. Metabolites were extracted using a biphasic system to remove proteins and lipophilic molecules whilst retaining the polar molecules 25 .
NMR methodology 2D-1 H, 13 C-HSQC NMR spectra were collected using the standard procedures described previously 8,26,27 . Spectra were acquired on either a Bruker NEO 800-MHz or a Bruker Avance 600-MHz spectrometer. Spectrometers were equipped with 1.7mm z-PFG TCI CryoProbes. All 2D-HSQC NMR spectra were acquired using echo/anti-echo gradient selection with a presaturation pulse during the 1.5s interscan delay to suppress the water resonance. Typically, spectra were acquired using 2 scans, 512 complex data points and a spectral width of 15.6ppm in the 1 H dimension. The 13 C dimension was acquired using 25% of 8,192 complex data points and a spectral width of 189.8 ppm resulting in an experiment time of approximately 4 hours. To evaluate the efficacy of the algorithm with regards to the scaling of the splitting due to J-couplings experiments using scaling of 0, 2, 4 or 8 were collected. The non-uniformly collected spectra were reconstructed using the IRLS algorithm using 20 iterations with MDDNMR (version 2.5) 28,29 and then processed using NMRPipe (Version 9.2) 30 . A polynomial baseline correction was applied following manual phase correction.

Implementation
While the biggest part of the algorithm is implemented in MATLAB (source code is available here: https://zenodo.org/ record/7120367#.YzR6FC0w30o), the multiplet simulation is performed within python (version 3.9) using the pygamma NMR simulation library (version 4.3.4, https://pypi.org/project/ pygamma/) 22  The datasets used as discussed in the Underlying data section 32,33 .

Operation
The algorithm is tested on a laptop with an Intel i7-6700HQ 2.60GHz CPU and a GTX 960M GPU. The analysis of a single spectrum is performed in less than 5 minutes.

Results and discussion
Program output While all resulting information is stored inside the MetaboLab data structure, the user has the option to create a clear text report in the form of a pdf file (report: on). This report (see Supporting Information) contains sample information originally stored inside TopSpin's title file, a list of the metabolites analysed and the user defined variables for the multiplet assignment and analysis (maxWidth1H, maxWidth13C, minCorr, maxRange, nReps, R2, R2 (scaled), dataSets, experiments). The following information is output for each observable carbon of each selected metabolite: coefficient of determination ρ for each multiplet, the percentage contribution for each component of the multiplet, the distance of the assigned multiplet from the library shift, a zoomed region of the 2D-1 H, 13 C-HSQC NMR spectrum showing the multiplet of interest and the 1 H shift of the independent component, and an overlay of the experimental 1D-13 C column and simulated data. The coefficient of determination ρ and distances from the library chemical shifts are colour coded. The colours indicate whether the software considers the assignment and multiplet estimation to be trustworthy (coefficient of determination ≥ 80%), borderline (70% > coefficient of determination < 80%) or not trustworthy (coefficient of determination ≤ 70%).

Evaluation of algorithm performance
To evaluate the algorithm performance, the following metric is used. The coefficient of determination ρ quantifies the percentage of observed variance that can be explained by the linear regression model. Considering that the model structure is supported by the theoretical assumption that each observed multiplet is a linear combination of single multiplet components, ρ is indicative of the goodness of the fit. In this work, ρ is used to indicate the analysis reliability. If the experimental 1D spectrum is given as t(M × 1) and the estimated spectrum as t (M × 1), the coefficient of determination ρ can be calculated as follows: ∑ t t is the unexplained sum of the squared distance between the observations and predictions. The coefficient of determination ρ indicates not only the quality of multiplet estimation but incorporate also information on the correctness of the multiplet assignment. For this reason, ρ evaluates the overall performance of the algorithm: it indicates if the multiplet has been correctly localised and consequently the linear regression method is able to closely fit the experimental data. If a signal has a poor signal-to-noise ratio or cannot be found at all, the corresponding ρ value will be set to 0. Sample reports are provided in the supporting information, including examples of metabolites with very low concentrations where the algorithm fails to assign and analyse the multiplet correctly.

Validation of the model and analysis of the results
Overall algorithm evaluation. The performance of the algorithm developed in this work is evaluated by reporting the results of multiple testing scenarios using different datasets. We manually inspected the datasets to evaluate the software output and in particular to confirm the following outcomes: a) the multiplet is not correctly assigned and the software warns the user about the unreliability of the output (i.e. low coefficient of determination, the distance from the library shifts exceeds the set threshold) b) the multiplet is not correctly identified and the software fails in flagging the issue to the user, giving as output a high coefficient of determination. The coefficient of determination ρ is generally affected by the noise and peaks surrounding the multiplet being analysed, as demonstrated in Figure 5 and Supporting Information Figures S1-S5. Therefore, ρ is thought to provide not only a measure of the goodness of the fit but also information about the signal-to-noise ratio (SNR). In heavily congested areas or for signals with poor SNR the computed value of ρ is likely to be low. In this case, the user is also warned and the software decision making may be checked manually if desired.
Analysis was performed on four metabolites (alanine, aspartate, glutamate and lactate) on all 13 test 2D-1 H, 13 C-HSQC NMR spectra. In each spectrum we expect 11 multiplets consisting of two for lactate, two for alanine, four for glutamate and three for aspartate. In total three different and previously published datasets were used. The three datasets contain a total of 13 2D-1 H, 13 C-HSQC NMR spectra. All NMR data is available via public data repositories. The first dataset contains 2D-1 H, 13 C-HSQC NMR spectra with [1,2-13 C] glucose as metabolic tracer (MTBLS241 8 ). This dataset contains spectra from three biological replicates. The second dataset contains 4 different 2D-1 H, 13 C-HSQC NMR spectra from one sample, but with different amounts of apparent signal-splitting due to J-coupling (eqhn3 27 ). The third dataset contains 6 2D-1 H, 13 C-HSQC NMR spectra from one sample using different amounts of sampling in a non-uniform sampling scheme (qtmge 26 ). The characteristics of each dataset are summarised in Supporting Information Table S2 The proton and carbon chemical shifts ρ values and percentage contribution of each multiplet component for these are given in Supporting Information Tables S3-S5. A quantitative analysis of the results is provided below for each dataset.

Dataset 1 -CANMS (MTBLS241)
Manual inspection of the automated analysis reveals that all the 33 metabolite multiplets (for lactate, alanine, glutamate and aspartate) are correctly identified and localised. The average coefficient of determination ρ is 94.06 ± 5.08%.
The automatic peak-picking is compared against the manual procedure of dataset MTBLS241 to assess the quality of the automated algorithm and compute the accuracy (see figure S6).

Dataset 2 -Scaling of apparent scalar couplings (qtmge)
Manual inspection of the automated analysis reveals that all the 44 metabolite multiplets are correctly identified and localised. The average coefficient of determination ρ is 95.47 ± 7.20%. Compared to the previous dataset, the slightly higher standard deviation is due to alanine (C2) (ρ = 56.7%). The algorithm is compatible with the technique of enhancement of splitting due to J-coupling 27 meaning that the automated assignment can be used in conjunction with rapidly collected data on high field spectrometers. The degree of splitting enhancement is detected by the software and algorithm parameters automatically altered. The average coefficients of determination for the different values of enhancement of splitting due to J-coupling (1,2,4,8) are ρ = 96.14 ± 3.96%, ρ = 97.92 ± 1.67%, ρ = 96.54 ± 3.65%, ρ = 91.30 ± 12.78% respectively. The results show that a splitting enhancement of 4 is dealt with effectively by the algorithm but an enhancement of 8 is not recommended as it can lead to poorer estimations on less intense multiplets. With an enhancement of splitting due to J-scaling of 1 the number of complex data points collected in the indirect dimension should be 8192. With increasing enhancement, the number of data points acquired can be reduced proportionately. The enhancement of splitting due to J-coupling can also be utilised to reduce the experiment time from 4 hours to 1 hour 26 .

Dataset 3 -NUS-sampling (eqhn3)
The algorithm gives good results with data collected with reduced sampling down to 25%. At sampling of below 25% the signal to noise is decreased such that only intense peaks can be analysed with confidence and the resulting coefficient of determination of weaker peaks is low. We therefore do not recommend collected data with sampling of less than 25%. 50 out of 66 metabolite multiplets have been picked correctly by the software. The average coefficient of determination (80.47 ± 20.98%) is significantly lower compared to the previous results. This is due to aspartate, whose co-efficient of determination is low in all the replicates due to poor signal-to-noise ratio because of very low aspartate concentrations. Considering the 16 multiplets that have not been correctly picked, 12 out of 16 have corresponding low coefficient (red flag). In this case the user is warned about the software not being able to provide a reliable analysis and we therefore consider this as a positive outcome for the software (true negative). Instead, 4 multiplets have wrongly been picked by the software and the corresponding coefficients of determination are > 85% (false positive). This last case arises for glutamate. The method described here localises a metabolite along the proton dimension by relying on the cross-correlation metric, as explained in the Algorithm development section. With well defined and resolved peaks the peak is easily identified by the algorithm as shown by the large area with a high coefficient of determination for instance with Lactate C(3) (Supporting Information Figure S2). However, if two or more metabolites have multiplets with almost identical patterns and carbon shift, as seen with C(2) (Supporting Information Figure S3) and C(3) (Supporting Information Figure S4) of glutamine and glutamate, the algorithm occasionally fails to correctly pick the metabolite as each metabolite assignment is carried out without knowledge of other metabolites' assignments. This can almost always be identified by the larger than expected deviation of the chemical shifts from the library values. In one of our test cases the downfield 1 H of glutamate C(3) is misassigned to glutamine C(3) while the upfield 1 H is correctly assigned. The misassigned resonance can again be identified by the larger deviation from the library shift and thus analysis can be limited to the correctly assigned 1 H resonance. The misassignment is due to the glutamine being unlabelled and thus the ease of the algorithm fitting a singlet in close proximity to the library shifts. Care must be taken in the analysis of spectra with significant signals from glutamate and glutamine. Where a carbon has two attached protons, as with aspartate C(3) and glutamate C(3), the algorithm may select the same multiplet independent of the actual 1 H chemical shift. As the 13 C/ 13 C-J coupling constants and the 13 C chemical shift and the multiplet composition are all identical for the two protons, this can be ignored. If the sample preparation results in the collected data differing significantly from the library shifts the algorithm can adapt in most cases except where the coupling constants of a nearby resonance match that of the actual metabolite. This can be minimised by correctly the pH of the polar phase (methanol:water) prior to drying the sample and resuspension in NMR buffer. This is shown in Supporting Information Figure S3 where if the sample conditions results in a 0.1 ppm 1 H shift then a glutamate (C2) can be misassigned as creatine with a high coefficient of determination. However, the rest of the dataset would then show very poor algorithm results allowing the user to realise that there may be an issue with the sample. If proper sample preparation is performed, then such problems should not be an issue.

Conclusion
In this article, we have described a data-driven algorithm which functions within the existing MetaboLab software. The algorithm automatically assigns, annotates and analyses multiplets arising from 13 C -13 C scalar couplings present in 2D-1 H, 13 C-HSQC spectra collected to study metabolism using isotopic tracers. The method is robust and operates with a high degree of accuracy. We have tested the algorithm on three different datasets and show that the algorithm can analyse data collected with a range of tracers with different labelling, at different frequencies, with multiple biological repeats and with different sampling amounts. All spectra were successfully analysed and the algorithm is also able to correctly identify when enhancement of splitting due to J-coupling has been used and to automatically compensate for this effect prior to analysis. The coefficient of determination of the algorithm in the analysis is displayed conveniently in a colour coded form to aid the user in determining the reliability of the result. The algorithm functions successfully in all but the most challenging of situations allowing extensive analysis to be performed even by those with limited knowledge of the field. differing significantly from the library shifts the algorithm can adapt in most cases except where the coupling constants of a nearby resonance match that of the actual metabolite''. So, what proper sample preparation should be suggested before collecting data? "...poor signal-to-noise ratio because of very load aspartate concentrations". Here, please correct "load" to "low".

6.
Is the work clearly and accurately presented and does it cite the current literature? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Partly The use of labelled tracers allows metabolic pathway information to be determined, however, identification of these metabolic products can be challenging. HSQC spectra provide information on CH pairs within a molecule. By incorporating 13C-labelled tracers, additional splittings are observed along the F1 axis arising from 13C-13C couplings in the metabolites. In this sense these experiments contain substantially larger amounts of information that just HSQC spectra of the metabolites.
The authors present a framework built upon the Metabolab platform for the identification and quantification of the metabolites. The paper clearly presents the logic of the algorithm developed and provides a good overview of its function and the quality of the results obtained.
There is clearly a requirement for high digital resolution in the 13C dimension for this approach to be successful, and the data presented has this. It would be useful for the authors to comment on what the limits of this digital resolution are. What is the minimum digital resolution required in order for the their algorithm to be effective?
Additionally they comment on the use of non-uniform sampling schemes. Typically sparse sampling schemes work best for sparse datasets, e.g. better performance with lower sampling schemes for HSQC than NOESY. It would be useful to give a comparison between NUS and fully sampled data on the same sample.
There are a couple of minor typos on page 7, column 2, second item, and page 10, column 2, 2nd paragraph Is the work clearly and accurately presented and does it cite the current literature? Yes

Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others?