Predicting glycan structure from tandem mass spectrometry via deep learning

Glycans constitute the most complicated post-translational modification, modulating protein activity in health and disease. However, structural annotation from tandem mass spectrometry (MS/MS) data is a bottleneck in glycomics, preventing high-throughput endeavors and relegating glycomics to a few experts. Trained on a newly curated set of 500,000 annotated MS/MS spectra, here we present CandyCrunch, a dilated residual neural network predicting glycan structure from raw liquid chromatography–MS/MS data in seconds (top-1 accuracy: 90.3%). We developed an open-access Python-based workflow of raw data conversion and prediction, followed by automated curation and fragment annotation, with predictions recapitulating and extending expert annotation. We demonstrate that this can be used for de novo annotation, diagnostic fragment identification and high-throughput glycomics. For maximum impact, this entire pipeline is tightly interlaced with our glycowork platform and can be easily tested at https://colab.research.google.com/github/BojarLab/CandyCrunch/blob/main/CandyCrunch.ipynb. We envision CandyCrunch to democratize structural glycomics and the elucidation of biological roles of glycans.


low scoring glycans.
2) To enable this manual evaluation, a few things are essential in the output of the tool, including the compositional annotation of the glycan often used in MS data, indicating the number of hexoses, HexNAcs, fucoses, etc. in the glycan (e.g., H4N5F1 or Hex4HexNAc5Fuc1), the theoretical mass of the annotated glycan, and the ppm error between the data and the structure found. Importantly, it should be easy to look at the fragmentation spectrum used, including the annotated peaks for the glycan fragments (see comment 3).
3) The current solution to evaluate the annotated fragmentation spectrum (CandyCrumbs) is not sufficient to allow an easy look at the data, mainly because: a. Multiple peaks are annotated with the same fragments. b. You only see the fragments when you mouse over a peak, and often the content of the mouse-over is not readable because of some screen dimension issues. c. There is no indication of the ppm error between the theoretical fragment and the observed peak, making it very difficult to evaluate the trueness of an annotation. It would be good if the mass accuracy of the used method could be specified upfront, as in that way only peaks within this range can be annotated.
4) Next to negative mode PGC data, positive mode C18-orbitrap data with a 2-AB label was also used to test the tool by the reviewer. Unfortunately, this did not result in any hits, while the manuscript indicated inclusion of this type of data in the tool training. It would be very valuable if other data types could also be analyzed with the current tool, but as this seems less functional at the moment, it should be accurately described in the manuscript for what data types the tool is working and for what data types not (yet). Also, the title of the manuscript can be adjusted to be more specific, accordingly.
5) Minor issues encountered: a. For the plotting of the annotated spectra in step 4, the code had to be changed: plot_width/height into width/height, to prevent an error of not recognizing "plot_width" and "plot_height". b. The dimensions of the output on the webpage are not optimal for easy viewing; the alternative glycan structures appeared out of screen without a scrolling possibility.
6) Next to knowing what MS signals could be assigned to what glycan structures, other valuable information would be what MS/MS spectra were not assigned to a glycan and, more importantly, which of these do contain diagnostic glycan fragments? The latter would indicate unexpected glycoforms/monosaccharides/modifications. To not overlook these unexpected glycoforms, it is essential that these are pointed out so they could be targeted for manual evaluation. It is unclear to what extent the tool provides this information.
7) The authors state "we set out to augment our pipeline to allow for, limited, zero-shot prediction outside our 3,508 defined glycans". It is unclear how automated and integrated this already is in the current tool.
8) The manuscript highlights the importance of active maintenance and updates for these types of tools and the fact that this is not done for other tools in the field. It is indeed a large frustration of researchers in the field that tools are often not completely functional or long-term supported. The long-term value of the current tool will only be high when actively maintained and regularly updated with new data. How will the authors ensure this in a dynamic and demanding academic environment?
9) Especially for low-resolution ion trap data, m/z calibration can be particularly bad. Furthermore, PGC systems often show shifts in retention time. How does this affect the performance of the tool; is precalibration/RT alignment necessary? And would it improve annotations if accuracy could be identified upfront?
10) Currently, only options for trap data are provided in the user interface; what about data acquired on, e.g., qTOF systems? Can these be processed, or will this be an option in the future?
11) It is already pointed out that "we do not currently support every type of glycan modification within CandyCrunch and CandyCrumbs". Still, this would be highly valuable with newly emerging modifications and labels used in (functional) glycan analytics, as well as for the detection of uncommon monosaccharides not covered in the training data. How do the authors envision enabling these types of analysis in the future, possibly even without the training data available to incorporate these diverse modifications into the training models?
12) Finally, the visual appearance of especially Figure 1 
Reporting summary: https://www.nature.com/documents/nr-reporting-summary.zip
Editorial policy checklist: https://www.nature.com/documents/nr-editorial-policy-checklist.zip
If your paper includes custom software, we also ask you to complete a supplemental reporting summary.
Software supplement: https://www.nature.com/documents/nr-software-policy.pdf
Please submit these with your revised manuscript. They will be available to reviewers to aid in their evaluation if the paper is re-reviewed. If you have any questions about the checklist, please see http://www.nature.com/authors/policies/availability.html or contact me.
Please note that these forms are dynamic 'smart pdfs' and must therefore be downloaded and completed in Adobe Reader. We will then flatten them for ease of use by the reviewers. If you would like to reference the guidance text as you complete the template, please access these flattened versions at http://www.nature.com/authors/policies/availability.html.

IMAGE INTEGRITY
When submitting the revised version of your manuscript, please pay close attention to our Digital Image Integrity Guidelines and to the following points below:
-- that unprocessed scans are clearly labelled and match the gels and western blots presented in figures.
-- that control panels for gels and western blots are appropriately described as loading on sample processing controls.
-- that all images in the paper are checked for duplication of panels and for splicing of gel lanes.
Finally, please ensure that you retain unprocessed data and metadata files after publication, ideally archiving data in perpetuity, as these may be requested during the peer review and production process or after publication if any issues arise.

DATA AVAILABILITY
Please include a "Data availability" subsection in the Online Methods. This section should inform readers about the availability of the data used to support the conclusions of your study, including accession codes to public repositories, references to source data that may be published alongside the paper, unique identifiers such as URLs to data repository entries, or data set DOIs, and any other statement about data availability. At a minimum, you should include the following statement: "The data that support the findings of this study are available from the corresponding author upon request", describing which data is available upon request and mentioning any restrictions on availability. If DOIs are provided, please include these in the Reference list (authors, title, publisher (repository name), identifier, year). For more guidance on how to write this section please see: http://www.nature.com/authors/policies/data/data-availability-statements-data-citations.pdf

CODE AVAILABILITY
Please include a "Code Availability" subsection in the Online Methods which details how your custom code is made available. Only in rare cases (where code is not central to the main conclusions of the paper) is the statement "available upon request" allowed (and reasons should be specified).
We request that you deposit code in a DOI-minting repository such as Zenodo, Gigantum or Code Ocean and cite the DOI in the Reference list. We also request that you use code versioning and provide a license.
For more information on our code sharing policy and requirements, please see: https://www.nature.com/nature-research/editorial-policies/reporting-standards#availability-of-computer-code

MATERIALS AVAILABILITY
As a condition of publication in Nature Methods, authors are required to make unique materials promptly available to others without undue qualifications.
Authors reporting new chemical compounds must provide chemical structure, synthesis and characterization details. Authors reporting mutant strains and cell lines are strongly encouraged to use established public repositories.
More details about our materials availability policy can be found at https://www.nature.com/natureportfolio/editorial-policies/reporting-standards#availability-of-materials

SUPPLEMENTARY PROTOCOL
To help facilitate reproducibility and uptake of your method, we ask you to prepare a step-by-step Supplementary Protocol for the method described in this paper. We encourage authors to share their step-by-step experimental protocols on a protocol sharing platform of their choice and report the protocol DOI in the reference list. Nature Portfolio's Protocol Exchange is a free-to-use and open resource for protocols; protocols deposited in Protocol Exchange are citable and can be linked from the published article. More details can be found at www.nature.com/protocolexchange/about.

ORCID
Nature Methods is committed to improving transparency in authorship. As part of our efforts in this direction, we are now requesting that all authors identified as 'corresponding author' on published papers create and link their Open Researcher and Contributor Identifier (ORCID) with their account on the Manuscript Tracking System (MTS), prior to acceptance. This applies to primary research papers only. ORCID helps the scientific community achieve unambiguous attribution of all scholarly contributions. You can create and link your ORCID from the home page of the MTS by clicking on 'Modify my Springer Nature account'. For more information please visit www.springernature.com/orcid.

Please do not hesitate to contact me if you have any questions or would like to discuss these revisions further. We look forward to seeing the revised manuscript and thank you for the opportunity to consider your work.

Reviewers' Comments:
Reviewer #1: Remarks to the Author: At its core, CandyCrunch is still a spectral matching method. To evaluate the correctness of the predicted structure, CandyCrunch compares the experimentally observed fragmentation pattern with that predicted by AI from many tandem mass spectra of the given structure in the training dataset. Effectively, it utilizes a "consensus" spectrum rather than an individual reference spectrum, leading to a (presumably) improved accuracy of top-ranked predictions (around 90%). The experimental parameters (ionization mode, LC types, glycan modifications, MS instrument types, glycan classes, etc.) are embedded in the CandyCrunch model architecture, allowing it to (somewhat) overcome significant variations in experimental conditions among data sets. The (normalized) retention time information and precursor m/z were also utilized by CandyCrunch to aid prediction. CandyCrunch further utilizes domain knowledge and biosynthetic knowledge to filter and refine predictions. Although it is conceptually infeasible to predict structures absent from the training set, the authors demonstrated limited subset zero-shot predictions utilizing the biosynthetic network. As the authors pointed out in their rebuttal letter, the manuscript also contains several independent advances. For example, the curated set of glycomics MS2 data would be a valuable resource; CandyCrumbs is a useful tool for spectral annotation; and MD simulations elucidate fragmentation mechanisms. Although these are valuable, they, on their own, do not warrant publication in Nature Methods. The critique below will focus on the core advance of this study, namely the development of an AI prediction model, but will also comment on the values and flaws of other aspects of the manuscript when applicable.
1. A true zero-shot prediction should be made with synthetic glycan structures that are not present in the training dataset. However, this is not likely to succeed with the present approach that relies on the identification of existing structures in the sample set and biosynthetic network, especially for: (a) synthetic glycan structures that do not conform to known biosynthetic knowledge; and/or (b) identification of individual glycan structure that may or may not be biologically relevant, from an isolated tandem MS spectrum. An alternative approach to demonstrate the limited feasibility of zero-shot prediction is to purposely remove all tandem mass spectra of selected structures from the training set, that is, perform the training/test split at the structure level rather than the sample level, and see if CandyCrunch can recover these structures using the present approach. This would constitute a better testing strategy on the performance of zero-shot prediction, as the structures removed are validated structures rather than tenuous assignments (that are likely made on low-quality spectra or low-abundance species). The authors should be able to perform this evaluation with the available data, and if successful, could significantly strengthen the manuscript.
2. CandyCrunch utilizes domain and biosynthetic knowledge, as well as retention time information to (presumably) improve the prediction accuracy, which underlined the limitations of utilizing only fragmentation pattern for structure prediction. This is not surprising because: (1) biosynthetic knowledge would reduce the number of possible structures for a given mass by many orders of magnitude; and (2) many isomers have similar fragmentation patterns, especially when acquired by CID in ion trap instruments. The authors should provide a quantitative evaluation of how the application of each filter/refining process improves the prediction accuracy (e.g., prediction accuracy with only the MS2 spectra; with the addition of retention time; with the addition of the biosynthetic knowledge; etc.)
3. According to the authors, "even erroneous predictions are structurally close to the correct solution". However, "closeness" is often insufficient to answer many biological questions. In fact, even a slight change in one structural variable can have considerable impact on the biological function of a glycan. A better indication of the evaluation accuracy would be the confidence score. It is advisable for the authors to develop a confidence score to indicate the likelihood of correct assignment. Once this is established, the authors can present a statistical distribution of the confidence score for correct assignments, and for incorrect assignments. This is important to inform the analyst how much they should trust the result.
4. The authors stated CandyCrunch is suitable for analysis of glycans with common modifications, such as permethylation. It would be useful for the authors to provide a statistical evaluation of the prediction accuracy utilizing different experimental approaches (beyond the statement that negative-mode CID of reduced glycans tends to produce the best results).
5. How would CandyCrunch perform on tandem mass spectra of mixtures? Co-elution is common for isomeric glycans. Would it be able to identify the presence of multiple structures in a single chimeric spectrum?
6. Retention time can vary significantly from one data set to another, especially for PGC-LC analysis. The authors perform retention time normalization by dividing the absolute retention time by the respective maximal retention time. I do not think this is the correct way to normalize retention time. The retention time should be normalized using the glucose unit (dextran ladder) or, in the worst-case scenario, against known structures present in the sample.
7. The authors also performed grouping of averaged spectra in chunks of 0.5-minute retention time window, but many isomers would co-elute in that time frame.
8. Biosynthetic network analysis would only be feasible if the entire glycome is analyzed. What would be the strategy for analysis of a single isolated structure?
9. The authors stated that they also checked missed Neu5Gc-substituted Neu5Ac-glycans and vice versa with a mass shift of 16 Da and corresponding diagnostic ions. Please comment on whether this check extends beyond single substitution.
10. The authors noted that "analyzing the data at higher resolution does not give rise to higher accuracy". The simulation of higher resolution data was performed by increasing the number of bins (Supporting Table 3), leading to an increased "effective resolution per bin". They further stated that the fact that they "used m/z remainder approach overrides binning resolution and allows us to recover exact masses as long as only one fragment is contained in a bin (which Supplementary Fig. 1 supports)". This is inaccurate. Without a sufficiently high resolution, it would be impossible to differentiate isobaric fragments (e.g., those that differ in composition of CH4 vs. O): no binning can improve the resolution of the original data (at best, it can preserve the resolution), and no m/z remainder approach can improve the mass accuracy of the original data. The true test should be done by utilizing the subset of real high-resolution data obtained on TOF or Orbitrap instruments, and comparing the performance. Additionally, Supplementary Figure 1 is suspect, as it only considers glycosidic bond cleavages. Cross-ring cleavages and neutral losses are common in glycan tandem mass spectra. Figure S1 does not reflect the true extent of potential occurrence of multiple fragments in a single bin.
11. The authors noted that "a model only trained on O-glycans performed worse for predicting O-glycans (Supplementary Table 4; topology: 84% accuracy, structure: 79% accuracy) than the model trained on all classes", and hypothesized that "this was due to the structure-based loss function we used for training, as well as shared information between spectra of different classes, stemming from shared glycan motifs across classes". This is an interesting point. How is the motif information incorporated into the training algorithm? Would CandyCrunch be able to predict structures that are built from existing motifs but not in the training set? Again, this could be tested by removing certain structures entirely from the training set.
12. CandyCrumbs is a useful tool, but the way it prioritizes assignments when multiple assignments are possible is suspect (Figure S10). There are often cases where a neutral loss or double cleavages are more abundant than a single-cleavage fragment. Is it possible to also learn the assignment from the annotated training set?
Overall, I think the manuscript is worth consideration for publication in Nature Methods, but there are serious flaws that need to be addressed and it needs further and more robust evaluation.
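Point 6 above contrasts two ways of normalizing retention time. The following is a minimal sketch of that contrast, not the CandyCrunch implementation: the function names, the ladder and analyte values, and the use of linear interpolation between ladder peaks are all assumptions made for illustration. It compares division by the maximal retention time with conversion to glucose units (GU) against a reference ladder, for one hypothetical analyte measured in two runs with shifted gradients.

```python
from bisect import bisect_right

def normalize_by_max(rt, max_rt):
    """Scale a retention time by the maximal retention time of the run."""
    return rt / max_rt

def rt_to_glucose_units(rt, ladder_rt, ladder_gu):
    """Convert a retention time to GU by linear interpolation.

    ladder_rt must be sorted in increasing order, and rt must lie within
    the range covered by the ladder (illustrative assumption).
    """
    if not ladder_rt[0] <= rt <= ladder_rt[-1]:
        raise ValueError("retention time outside the ladder range")
    # Index of the ladder peak just above rt (clamped to the last peak).
    i = min(bisect_right(ladder_rt, rt), len(ladder_rt) - 1)
    t0, t1 = ladder_rt[i - 1], ladder_rt[i]
    g0, g1 = ladder_gu[i - 1], ladder_gu[i]
    return g0 + (g1 - g0) * (rt - t0) / (t1 - t0)

# Hypothetical values: same analyte, two runs, shifted retention times.
ladder_gu = [2.0, 3.0, 4.0, 5.0]
run_a = {"ladder_rt": [10.0, 14.0, 18.0, 22.0], "analyte_rt": 16.0, "max_rt": 40.0}
run_b = {"ladder_rt": [12.0, 17.0, 22.0, 27.0], "analyte_rt": 19.5, "max_rt": 40.0}

for name, run in (("A", run_a), ("B", run_b)):
    by_max = normalize_by_max(run["analyte_rt"], run["max_rt"])
    gu = rt_to_glucose_units(run["analyte_rt"], run["ladder_rt"], ladder_gu)
    print(f"run {name}: rt/max = {by_max:.4f}, GU = {gu:.2f}")
```

In this toy case the ladder-based value agrees between the two runs while the max-normalized value does not, which is the kind of behavior the reviewer's suggestion is aiming for. Real ladders are not necessarily linear between peaks, so this is only an illustration of the idea.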
Reviewer #3: Remarks to the Author: The rebuttal addresses all my comments well and the authors updated and improved the manuscript. The described tool has a lot of options to tune it for a specific experiment and train it with new data, and it will likely be very valuable in LC-MS based glycomics experiments in the future. Although the authors provided an easily accessible Google Drive interface to make the tool accessible for non-Python specialists, this is not an ideal interface to bring the tool to its full potential (as indicated by the authors themselves as well). I understand that this is not the main purpose of the manuscript, but I would like to encourage the developers to equip the Python tool with a user-friendly interface to make the tool accessible for all glyco LC-MS specialists, independent of their programming skills. As a start, a clearer "read me" can be provided on how to get started with the tool using the local installation. I'm sure there will be enough "dummy testers" in the field happy to help validate this.
Reviewer #4: Remarks to the Author: This manuscript presents a highly novel tool for glycomics data annotation which is made available via an ad hoc Python notebook. The authors have addressed both the novelties of the software as well as the drawbacks or points for users to be aware of when using the software. In particular, it is very important to make potential users aware of what NOT to expect. The data and methodology are rigorously explained, and the manuscript is written clearly overall.

Suggestions for improvement:
1. A table or at least a description of the various options that are available could be summarized. While the authors responded to one of the other reviewer's comments and added a statement that the given option (zero-shot) is available, amongst the comments to the reviewers there seemed to be many such options that are available, but the user would not be aware of them. So it would help the manuscript greatly to indicate that all of the various options that often need to be considered when annotating glycomics spectra are addressed, and the list would indicate how to set those options.
2. While users could use the outputted annotations as figures to include in their own publications, how do the authors consider feeding back the annotations to databases such as in the GlySpace Alliance? In particular, are GlyTouCan IDs assignable to the annotations?
Reviewer #5: Remarks to the Author: The manuscript by Urban et al. describes a method for identification of glycan structures from glycomics mass spectrometry data using deep learning. The method clearly provides a dramatic improvement in capability vs existing software methods, both in performance and generalizability to a variety of experimental and instrumental setups. This is both a very challenging and very necessary endeavor, making the manuscript likely of great interest to a broad glycoscience community. The manuscript is well written and, particularly following the revisions from earlier reviews, clearly describes the method, results, and current limitations. I have some specific comments on the method and results below, but most are very minor issues.
The only major issue I see is the requirement that the software be accessed programmatically in Python, with no provision of any user interface or command line accessibility beyond the CoLab notebook demo. While this is not an issue for experienced programmers, to truly "democratize structural glycomics" beyond a few specialists, this represents a major roadblock for glycobiologists and other glycoscientists who may not have the programming skills (or time) to write their own code to use the method. The authors refer to the CoLab demo as not a main part of the manuscript in responding to the initial review comments about its limited functionality, but these comments are a clear indication of the need and desire for a functional interface to fully realize the potential of the method. This dovetails with concerns about the method receiving long-term support and engagement: if the barrier to use is too high and few people adopt the method, there is less feedback and external motivation to keep maintaining and improving it. I would urge the authors to provide at least a minimal command line execution option (e.g., pass in a config file that has the parameters and data paths) to run the analysis and generate at least a basic output table or report. Further development (e.g., of a graphical interface) could come later or as warranted from community feedback, but I think there needs to be some provision to run the software without having to write Python code for it to be considered a community resource as opposed to a specialist tool for bioinformaticians. That said, the CoLab demo and source code clearly show a high level of programming expertise, so I have confidence in the authors' capability to do this.

Minor Comments:
• "Applied to fully unseen datasets, CandyCrunch routinely achieved high performance (Supplementary Table 4; topology: 92% accuracy, structure: 84%)". These numbers appear to be Top5 structure accuracies from Supp. Table 4, whereas the previous data in Figure 1 was presented as Top1 accuracy. This needs to be noted, and ideally, the range of Top1 accuracies should be provided as well or instead.
• With regard to Supp. Table 3, the authors state that higher m/z resolution appeared not to necessarily translate to higher prediction performance. The resolutions tested (0.7 to 1.4 Da per bin) are 2 orders of magnitude lower than high resolution mass spectrometry data, so it would be worth noting the difference between "higher" resolution than the model default (i.e., 2x higher) and "high" resolution in a mass spectrometry context (i.e., 100x higher).
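The resolution point above, together with Reviewer #1's example of fragments that differ by CH4 vs. O, can be made concrete with a short calculation. This is a sketch under stated assumptions: the 500.2 m/z fragment, the `bin_index` helper, the zero bin origin, and the 0.007 Da width (chosen as roughly 100x finer than 0.7 Da) are hypothetical and are not taken from CandyCrunch. Only the 0.7 and 1.4 Da bin widths come from the comment, and the atomic masses are standard monoisotopic values.

```python
# Monoisotopic masses in Da (standard values).
MASS_C = 12.0
MASS_H = 1.00782503
MASS_O = 15.99491462

def bin_index(mz, bin_width, min_mz=0.0):
    """Index of the fixed-width bin that contains mz (hypothetical helper)."""
    return int((mz - min_mz) // bin_width)

# Mass difference between CH4 and O, about 0.0364 Da.
delta = MASS_C + 4 * MASS_H - MASS_O

# Hypothetical pair of fragments that differ only by CH4 vs. O.
mz_a = 500.2
mz_b = mz_a + delta

# 1.4 and 0.7 Da are the tested bin widths; 0.007 Da is an assumed
# "100x finer" width used only for comparison.
for width in (1.4, 0.7, 0.007):
    same_bin = bin_index(mz_a, width) == bin_index(mz_b, width)
    print(f"bin width {width} Da: same bin = {same_bin}")

# Resolving power (m / delta m) needed to separate the pair at this m/z.
print(f"required resolving power: about {mz_a / delta:.0f}")
```

For this pair, both tested bin widths place the two fragments in the same bin, and only the much finer width separates them. Whether two such peaks are distinguishable at all still depends on the resolution of the acquired data, which binning cannot improve.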

Author Rebuttal, first revision:
We thank all reviewers for their insightful comments and suggestions for improvement. We have fully addressed these comments in our substantially revised manuscript by engaging in extensive new analyses, leading to new supplemental figures and tables, as well as numerous text modifications and additions, and improvements to our Python package and beyond. In summary, we have:
(i) demonstrated zero-shot prediction capabilities by purposefully removing structures from the training data (new Fig. S8),
(ii) added the option of using CandyCrunch via a command line interface (new CandyCrunch v0.3),
(iii) performed ablation experiments to illustrate how much of our performance comes from different sources of information (MS2 spectrum, retention time, etc.) (new Table S3),
(iv) demonstrated that correctly predicted structures, on average, exhibit higher confidence scores (new Table S7),
(v) added subgroup analyses describing the performance of CandyCrunch on various glycan modifications (e.g., permethylation) (new Table S5) as well as high-resolution data (e.g., orbitrap) (new Table S6),
(vi) added a multi-sample workflow that aligns retention times across samples from the same experiment and accommodates shifts in retention time (new Fig. S9),
(vii) provided more options to customize retention time precision when using CandyCrunch (e.g., minimum/maximum cut-offs, time window resolution) (new CandyCrunch v0.3 and new Table S12),
(viii) extended the documentation of CandyCrunch, with particular emphasis on available options and how to fill them, and a guide on how to work with the software when installed locally (new Fig. S2, new Table S12, new CandyCrunch v0.3),
(ix) expanded our strategy of searching for more extensively Neu5Gc-/Neu5Ac-substituted glycans, as well as GlcNAc-/GlcNAc6S-substituted O-glycans, to increase zero-shot capabilities (new CandyCrunch v0.3),
(x) documented the inherent cross-training with our approach by showing that glycans with shared motifs need fewer training spectra to be correctly captured by the model (new Fig. S6),
(xi) added GlyTouCan IDs as a new column in our output table, whenever available (new CandyCrunch v0.3), and
(xii) demonstrated that our approach is robust to retention time overlaps (new Fig. S3).
Changes in the manuscript and point-by-point responses here are colored in blue. We believe that these changes have substantially improved our manuscript, contextualized our findings, and will allow readers to better evaluate our analyses and findings.

Reviewer #1:
Remarks to the Author: At its core, CandyCrunch is still a spectral matching method. To evaluate the correctness of the predicted structure, CandyCrunch compares the experimentally observed fragmentation pattern with that predicted by AI from many tandem mass spectra of the given structure in the training dataset. Effectively, it utilizes a "consensus" spectrum rather than an individual reference spectrum, leading to a (presumably) improved accuracy of top-ranked predictions (around 90%). The experimental parameters (ionization mode, LC types, glycan modifications, MS instrument types, glycan classes, etc.) are embedded in the CandyCrunch model architecture, allowing it to (somewhat) overcome significant variations in experimental conditions among data sets. The (normalized) retention time information and precursor m/z were also utilized by CandyCrunch to aid prediction. CandyCrunch further utilizes domain knowledge and biosynthetic knowledge to filter and refine predictions. Although it is conceptually infeasible to predict structures absent from the training set, the authors demonstrated limited subset zero-shot predictions utilizing the biosynthetic network. As the authors pointed out in their rebuttal letter, the manuscript also contains several independent advances. For example, the curated set of glycomics MS2 data would be a valuable resource; CandyCrumbs is a useful tool for spectral annotation; and MD simulations elucidate fragmentation mechanisms. Although these are valuable, they, on their own, do not warrant publication in Nature Methods. The critique below will focus on the core advance of this study, namely the development of an AI prediction model, but will also comment on the values and flaws of other aspects of the manuscript when applicable.
We thank the reviewer for engaging with our work and providing us with the opportunity to improve it. Below, we have addressed each of the individual points and describe how this has improved our manuscript.
1. A true zero-shot prediction should be made with synthetic glycan structures that are not present in the training dataset. However, this is not likely to succeed with the present approach that relies on the identification of existing structures in the sample set and biosynthetic network, especially for: (a) synthetic glycan structures that do not conform to known biosynthetic knowledge; and/or (b) identification of individual glycan structure that may or may not be biologically relevant, from an isolated tandem MS spectrum. An alternative approach to demonstrate the limited feasibility of zero-shot prediction is to purposely remove all tandem mass spectra of selected structures from the training set, that is, perform the training/test split at the structure level rather than the sample level, and see if CandyCrunch can recover these structures using the present approach. This would constitute a better testing strategy on the performance of zero-shot prediction, as the structures removed are validated structures rather than tenuous assignments (that are likely made on low-quality spectra or low-abundance species). The authors should be able to perform this evaluation with the available data, and if successful, could significantly strengthen the manuscript.
We first would like to respectfully challenge this with our view that zero-shot prediction is a niche case in the current practice of glycomics, especially given our extensive prediction repertoire of ~3,400 unique structures. The most common use case, especially in the realm of high-throughput studies, where CandyCrunch would reap the greatest benefits, involves samples which do not exhibit structures that would lie outside our ~3,400 structures (e.g., serum glycans etc.). Very few people work on the characterization of highly novel structures found in exotic species and this cannot be viewed as a common use case under any definition. Further, we would make the conjecture that few, if any, humans would have experience in annotating ~3,400 structures, such that many annotations that would be zero-shot for any given human would not be so for our model. However, we acknowledge the point that we should objectively demonstrate the capability of our additional workflows to engage in proof-of-concept zero-shot predictions, to substantiate our claim. For this purpose, we have added new analyses, in which, as suggested by the reviewer, we have removed all spectra of individual structures from the training set and measured whether these structures are captured via zero-shot predictions. This indeed worked robustly for the structures we analyzed, both for biosynthetic modeling as well as database inference approaches, the two main zero-shot approaches we employ in this work. The results from this analysis can be found in the new Supplementary Fig. 8.
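The structure-level hold-out described in this exchange can be sketched as follows. This is an illustrative sketch only, not the authors' evaluation code: the function name `split_by_structure`, the `(spectrum_id, glycan_structure)` data layout, the held-out fraction, and the placeholder structure labels are all assumptions. The point it shows is that every spectrum of a held-out structure leaves the training set, so no held-out structure is ever seen during training.

```python
import random

def split_by_structure(spectra, held_out_fraction=0.2, seed=0):
    """Hold out every spectrum of a random subset of structures.

    spectra: list of (spectrum_id, glycan_structure) pairs.
    Returns (train, test) lists whose structure sets are disjoint.
    """
    structures = sorted({glycan for _, glycan in spectra})
    rng = random.Random(seed)
    n_held_out = max(1, round(held_out_fraction * len(structures)))
    held_out = set(rng.sample(structures, n_held_out))
    train = [s for s in spectra if s[1] not in held_out]
    test = [s for s in spectra if s[1] in held_out]
    return train, test

# Hypothetical toy data: spectrum IDs paired with placeholder structure labels.
spectra = [(i, f"glycan_{i % 5}") for i in range(20)]
train, test = split_by_structure(spectra)

train_structures = {glycan for _, glycan in train}
test_structures = {glycan for _, glycan in test}
# An empty intersection confirms that the split is at the structure level.
print(len(train), len(test), train_structures & test_structures)
```

A sample-level split, by contrast, would assign whole samples to train or test, so the same structure could appear on both sides and the test would not probe zero-shot behavior.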
2. CandyCrunch utilizes domain and biosynthetic knowledge, as well as retention time information to (presumably) improve the prediction accuracy, which underlined the limitations of utilizing only fragmentation pattern for structure prediction. This is not surprising because: (1) biosynthetic knowledge would reduce the number of possible structures for a given mass by many orders of magnitude; and (2) many isomers have similar fragmentation patterns, especially when acquired by CID in ion trap instruments. The authors should provide a quantitative evaluation of how the application of each filter/refining process improves the prediction accuracy (e.g., prediction accuracy with only the MS2 spectra; with the addition of retention time; with the addition of the biosynthetic knowledge; etc.)
This is a very important observation and we agree with the reviewer. We have now added the new Supplementary Table 3 to show the results of the suggested ablation experiment on model performance. While we do see decreases in performance when removing certain modes of information, we note that (i) binned intensities alone are responsible for a large fraction of the performance and (ii) there is redundancy between information channels (e.g., one could imagine that, for an AI model, the information contained in precursor mass could be mostly reconstructable from MS2 information). Since neither computational resources, inference time, nor the acquisition of the required information are current bottlenecks in any sense, we opt to retain everything that gives us a boost in performance, no matter how minuscule. Importantly, the model using all types of provided information yields the highest prediction performance.
3. According to the authors, "even erroneous predictions are structurally close to the correct solution". However, "closeness" is often insufficient to answer many biological questions. In fact, even a slight change in one structural variable can have considerable impact on the biological function of a glycan. A better indication of the evaluation accuracy would be the confidence score. It is advisable for the authors to develop a confidence score to indicate the likelihood of correct assignment. Once this is established, the authors can present a statistical distribution of the confidence score for correct assignments, and for incorrect assignments. This is important to inform the analyst how much they should trust the result.
We agree that precision is important when evaluating biological functions. We still maintain, however, that it constitutes an unequivocal improvement to have errors, if they do occur, be closer to the truth. This is not something to be taken for granted, especially in machine learning, in which errors often strongly deviate from human intuition, and is one of the many aspects that distinguish our work from those of others. Further, the preponderance of structural ambiguities (e.g., "?" or "HexNAc") present in nearly all published glycomics datasets showcases that biological insights can be gleaned even in the absence of structural perfection.
We also note that the expectation of a reliable confidence score outstrips the current expectations the field has for human analysts, for whom no measures of confidence need to be specified. Further, CandyCrunch already produces a confidence score (the predicted probability of the final structure), which is present in the user output and used within the workflow, for instance to filter out predictions with too low confidence values. What we cannot provide, and what would be an unrealistic expectation, also with respect to other machine learning applications, is a single, universal cut-off value, above which a user can trust the prediction. The main reason for this lies in combinatorics and probability density functions: larger glycans, even if correctly predicted, will always have smaller probability scores, on average. That is because the probability density is stretched across all possible isomers, dampening the apex of this function (which should correspond to the correct prediction).
What we can do, however, to provide some guidance, is to compare the confidence scores of the same structure for correct and incorrect top-1 predictions. While this will not result in a universal cut-off, it at least shows that, on average and controlling for the combinatorial potential of a specific glycan composition, correct predictions showcase statistically significantly higher predicted probabilities than incorrect predictions. In fact, ~94% of structures exhibit a higher confidence score if they are indeed the correct prediction (~78% even reaching statistical significance after stringent multiple testing correction). This is, in part, ensured by the prediction confidence calibration that we perform via Platt scaling. We demonstrate this in the new Supplementary Table 7.
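The per-structure comparison described above can be sketched as follows. This is an illustrative simplification, not the analysis behind Supplementary Table 7: the function name and the (predicted_glycan, probability, is_correct) record layout are assumptions, and the significance testing with multiple testing correction is omitted.

```python
from collections import defaultdict
from statistics import mean

def fraction_higher_when_correct(records):
    """records: iterable of (predicted_glycan, probability, is_correct).

    Returns the fraction of glycans whose mean predicted probability is
    higher for correct than for incorrect top-1 predictions, considering
    only glycans that occur in both groups (hypothetical helper)."""
    correct, incorrect = defaultdict(list), defaultdict(list)
    for glycan, prob, is_correct in records:
        (correct if is_correct else incorrect)[glycan].append(prob)

    # Comparing within one structure controls for glycan size and isomer count.
    shared = [g for g in correct if g in incorrect]
    if not shared:
        return float("nan")
    higher = sum(mean(correct[g]) > mean(incorrect[g]) for g in shared)
    return higher / len(shared)
```

Comparing within a structure, rather than pooling all predictions, is what sidesteps the problem that larger glycans receive lower probabilities on average.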
4. The authors stated CandyCrunch is suitable for analysis of glycans with common modifications, such as permethylation. It would be useful for the authors to provide a statistical evaluation of the prediction accuracy utilizing different experimental approaches (beyond the statement that negative-mode CID of reduced glycans tends to produce the best results).
We agree with the reviewer and have now added a new supplemental table detailing model performance for modifications for which we have sufficient amounts of data (new Supplementary Table 5). We note that, as for any machine learning effort, our model will always perform best, on average, with the settings that contributed the most data. We added appropriate cautionary notes to that effect into the revised manuscript.
5. How would CandyCrunch perform on tandem mass spectra of mixtures? Co-elution is common for isomeric glycans. Would it be able to identify the presence of multiple structures in a single chimeric spectrum?
In principle, this can be detected by CandyCrunch, though we caution that this would also be substantially harder for human analysts, compared to non-chimeric spectra. Since the model ranks structures by probability, the situation can arise that multiple structures all have reasonably high predicted probabilities. We also always output the top ranked structures in our standard prediction output, so this information is inherently available to users. We note though, that in the column denoting the "final" prediction we, by default, always have a single structure. In the mentioned case, this would then result in the more dominant structure being chosen for this column (we assume here that an exact 50/50 mixture is a rare exception). As further discussed below, as long as there is no perfect overlap in co-elution (i.e., the peaks do not exactly and completely overlap), we still routinely detect both isomers in the output table, with the minor caveat that we might have some inaccuracies in the relative quantification of the respective isomers, due to overlap. We raise these important points in the revised manuscript.
We also tested this empirically by searching for isomers eluting with a low mean retention time difference. We first would like to note that we had major difficulty finding cases like this in our dataset of ~500,000 MS2 spectra, indicating either that this is not prevalent or that human experts do not commonly annotate co-eluting structures either. One of the few examples we identified that fulfilled at least some criteria (while not a strict isomer, the m/z difference was below our default cut-off of 0.5 for establishing precursor ion identity, making the structures isobars for our model) can be found in the new Supplementary Fig. 3. Herein, we show that, despite overlaps in retention time, CandyCrunch is able to confidently assign both structures, including in the overlapping range.
6. Retention time can vary significantly from one data set to another, especially for PGC-LC analysis. The authors perform retention time normalization by dividing the absolute retention time by the respective maximal retention time. I do not think this is the correct way to normalize retention time. The retention time should be normalized using the glucose unit (dextran ladder) or at the worst-case scenario, against known structures present in the sample.
We concur that retention time variation is an issue and concede that our approach to normalize retention time is inferior to absolute normalization methods, such as via glucose units. However, this approach is currently infeasible because we simply do not have this information for our training data. It would also not work to only use glucose unit-normalized retention times at the time of prediction, if the model has not been trained on similar data. Similarly, normalization against known structures would lead to varying retention time associations across samples, contingent on whether a certain structure is present or not. We also would like to note that, while such approaches might further improve model performance, our currently reported model performance, achieved with our current retention time approach, is substantially above alternative methods.
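For reference, the relative normalization under discussion (dividing each retention time by the maximal retention time of the run) amounts to the following sketch. The function name is hypothetical and this is not taken from the CandyCrunch code base.

```python
def normalize_retention_times(retention_times):
    """Scale absolute retention times (minutes) to the range (0, 1] by
    dividing by the maximal retention time of the run (hypothetical helper)."""
    max_rt = max(retention_times)
    if max_rt <= 0:
        raise ValueError("maximal retention time must be positive")
    return [rt / max_rt for rt in retention_times]

# A peak eluting at 22.5 min in a 45-minute run maps to 0.5.
normalized = normalize_retention_times([5.0, 22.5, 45.0])
```

This needs no external standard, which is why it can be applied to all training data, but it only corrects for run length and not for column- or gradient-specific shifts, which is the limitation the reviewer raises.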
However, since we acknowledge the importance of retention time variation, we have developed a new workflow for using CandyCrunch with a batch of samples. In this workflow, next to our already described advances, we align retention times across samples, if possible and suitable, to build a prediction library and ensure that shifts in retention time between samples are accommodated. In the new Supplementary Fig. 9, we demonstrate that this is a useful addition to further improve and harmonize predictions. Of course, we do acknowledge that this does not entirely cover the concern of the reviewer, as it would, for now, only apply to multiple samples of the same experiment, yet we still view it as a meaningful advance toward this purpose and are confident that further developments will make this even more potent. This analysis, or rather the evaluation of it, also highlighted the occasional presence of single outlier spectra (e.g., m/z 384 close to 40 minutes in a 45-minute LC run, yet still containing glycan fragments such as m/z 204) that were discarded by the original analyst and thus present a large retention time difference in our evaluation. In case users wish to avoid the occurrence of these spectra in the output, we have used this opportunity to showcase three solutions: (i) the above-mentioned multi-file workflow, which remedies these outliers, (ii) setting the new "rt_max" keyword argument to avoid investigating the end of the LC run, or (iii) using "experimental = True", as we have now added a new preliminary step of removing retention time outliers to this part of the workflow (new v0.3 of CandyCrunch). We expect that the last option, after more careful and comprehensive testing, will eventually make its way into being always applied within wrap_inference.
7. The authors also performed grouping of averaged spectra in chunks of 0.5-minute retention time window, but many isomers would co-elute in that time frame.
We acknowledge the concern of the reviewer and have now added the option in the CandyCrunch software for users to more precisely specify the treatment of retention time (new v0.3 of CandyCrunch). This includes the settings of minimum and maximum cut-offs, avoiding potential false-positives of predicting glycans at very high retention times, as well as the mentioned value for retention time windows.
We still would like to clarify that this is only a concrete problem for CandyCrunch if isomers co-elute in a perfect manner. Partial co-elution is not an inherent problem. As long as there is a 0.5-minute window that predominantly contains one isomer (even if the spectra would be technically chimeric and not from the central part of the peak), predictions would reflect both isomers, which would survive our deduplication efforts, and be contained within our annotation output table. As mentioned above, at worst this will moderately affect the estimated abundance of each isomer. We here would again like to point to our new Supplementary Fig. 3, to substantiate these arguments.
Our dataset of ~500,000 MS2 spectra shows a paucity of isomers with mean retention time differences below 0.5 minutes, which supports our view that co-elution is predominantly partial and thus not a fundamental problem for our approach.
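The windowing argument can be illustrated with a small sketch that groups spectra into fixed 0.5-minute retention time windows. The function name and the (retention_time, payload) layout are assumptions; the actual workflow presumably also keys on precursor m/z, which is left out here for brevity.

```python
from collections import defaultdict

def group_by_rt_window(spectra, window=0.5):
    """Group (retention_time_min, payload) pairs into consecutive
    retention time windows of the given width (hypothetical helper)."""
    groups = defaultdict(list)
    for rt, payload in spectra:
        groups[int(rt // window)].append(payload)
    return dict(groups)

# Two partially co-eluting isomers: most spectra of isomer_1 fall into the
# 10.0-10.5 min window, most spectra of isomer_2 into the 10.5-11.0 min window.
example = group_by_rt_window([
    (10.1, "isomer_1"), (10.4, "isomer_1"), (10.4, "isomer_2"),
    (10.6, "isomer_1"), (10.6, "isomer_2"), (10.9, "isomer_2"),
])
```

In this toy case each window is dominated by a different isomer, so both would be predicted and retained. Only if the two elution profiles overlapped completely would every window be equally chimeric.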
8. Biosynthetic network analysis would only be feasible if the entire glycome is analyzed. What would be the strategy for analysis of a single isolated structure?
It is correct that biosynthetic network analysis would be inapplicable to single isolated structures. In that case, this additional analysis would just not contribute anything to the annotation, though the rest of our workflow would still normally apply. We have added this consideration to the revised manuscript. We note that biosynthetic network analysis in its current form is a supplemental analysis and also optional within the workflow.
9. The authors stated that they also checked missed Neu5Gc-substituted Neu5Ac-glycans and vice versa with a mass shift of 16 Da and corresponding diagnostic ions. Please comment on if this check extends beyond single substitution.
Previously, this substitution related to replacing a single Neu5Ac with Neu5Gc (or vice versa). In the revised manuscript and CandyCrunch package (new v0.3 onward), we now replace up to all of Neu5Ac with Neu5Gc (or vice versa; i.e., all possible options), to have a broader coverage. This was the only check of that sort at the time. One could envision extending this to other similar scenarios (such as checking for otherwise identical O-glycans with/without a GlcNAc6S modification) and we have now made the first step toward this by adding a step to check/impute for sequences with GlcNAc/GlcNAc6S connected to the reducing end GalNAc (new v0.3 of CandyCrunch). We will explore this approach further in the future and thank the reviewer for encouraging this improvement.
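Enumerating "all possible options" of Neu5Ac/Neu5Gc exchange can be sketched as below. This is an illustration only and not the CandyCrunch implementation: the function name is hypothetical, the glycan is assumed to be an IUPAC-condensed string, and the expected mass shift is taken as one oxygen (about 15.9949 Da, the "16 Da" mentioned above) per exchanged residue.

```python
import re
from itertools import product

OXYGEN_MASS = 15.9949  # monoisotopic mass of O; Neu5Gc = Neu5Ac + O

def sialic_acid_variants(glycan):
    """Return {variant_sequence: mass_shift_in_Da} for every combination of
    Neu5Ac/Neu5Gc at the sialic acid positions, excluding the input itself
    (hypothetical helper)."""
    parts = re.split(r"(Neu5Ac|Neu5Gc)", glycan)
    slots = [i for i, p in enumerate(parts) if p in ("Neu5Ac", "Neu5Gc")]
    original_gc = sum(parts[i] == "Neu5Gc" for i in slots)

    variants = {}
    for choice in product(("Neu5Ac", "Neu5Gc"), repeat=len(slots)):
        candidate = parts.copy()
        for i, residue in zip(slots, choice):
            candidate[i] = residue
        sequence = "".join(candidate)
        if sequence != glycan:
            # Each additional Neu5Gc adds one oxygen relative to the input.
            variants[sequence] = (choice.count("Neu5Gc") - original_gc) * OXYGEN_MASS
    return variants

variants = sialic_acid_variants("Neu5Ac(a2-3)Gal(b1-3)[Neu5Ac(a2-6)]GalNAc")
```

For a disialylated glycan this yields three candidates (two single exchanges and one double exchange), each of which would then be checked against the observed precursor mass and the corresponding diagnostic ions.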
10. The authors noted that "analyzing the data at higher resolution does not give rise to higher accuracy". The simulation of higher resolution data was performed by increasing the number of bins (Supporting Table 3), leading to an increased "effective resolution per bin". They further stated that the fact that they "used m/z remainder approach overrides binning resolution and allows us to recover exact masses as long as only one fragment is contained in a bin (which Supplementary Fig. 1 supports)". This is inaccurate. Without a sufficiently high resolution, it would be impossible to differentiate isobaric fragments (e.g., those that differ in composition by CH4 vs. O): no binning can improve the resolution of the original data (at best, it can preserve the resolution), and no m/z remainder approach can improve the mass accuracy of the original data. The true test should be done by utilizing the subset of real high-resolution data obtained on TOF or Orbitrap instruments, and compare the performance. Additionally, Supplementary Figure 1 is suspect, as it only considers glycosidic bond cleavages. Cross-ring cleavages and neutral losses are common in glycan tandem mass spectra. Figure S1 does not reflect the true extent of potential occurrence of multiple fragments in a single bin.
This is an important point and we rephrased the statement in our revised manuscript to reflect that the m/z remainder approach can, at best, only recover the resolution of the mass spectrometer. It still stands, however, that we currently do not see any improvements in prediction when we increase the effective resolution of the model (not the mass spectrometer) by binning more finely, as that should also make the m/z remainder recovery more robust. If the hypothesis were true that higher-resolution data, on average, lead to more correct annotations in theory, then this effect should already become apparent here, which we do not find in the old Supplementary Table 3 (now Supplementary Table 4). While there could be a theoretical threshold effect, in that performance only increases when reaching a certain value of resolution and not before, this seems like a highly elaborate scenario to the authors.
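The interplay of binning and m/z remainders discussed in this exchange can be illustrated with a minimal sketch. It is not the CandyCrunch implementation: the function name, the m/z range, the number of bins, and the choice to keep only the most intense peak per bin are all assumptions made for illustration.

```python
def bin_spectrum(peaks, min_mz=0.0, max_mz=3000.0, n_bins=2048):
    """Bin (m/z, intensity) peaks into fixed-width bins, storing per bin the
    intensity of its most intense peak and that peak's m/z remainder, i.e.
    its offset from the lower bin edge (hypothetical helper)."""
    width = (max_mz - min_mz) / n_bins
    intensities = [0.0] * n_bins
    remainders = [0.0] * n_bins
    for mz, intensity in peaks:
        if not (min_mz <= mz < max_mz):
            continue
        idx = int((mz - min_mz) / width)
        if intensity > intensities[idx]:
            intensities[idx] = intensity
            remainders[idx] = mz - (min_mz + idx * width)
    return intensities, remainders, width

# With 1 Da bins, m/z 204.087 and m/z 204.5 share a bin; only the more
# intense peak keeps its remainder.
ints, rems, width = bin_spectrum(
    [(204.087, 10.0), (204.5, 3.0), (384.15, 5.0)],
    min_mz=0.0, max_mz=1000.0, n_bins=1000,
)
recovered_mz = 204 * width + rems[204]
```

The sketch shows both sides of the argument: the remainder recovers the measured m/z of a lone peak regardless of bin width, but it cannot exceed the accuracy of the original measurement, and a second fragment in the same bin is lost.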
In the revised manuscript, we have taken up the suggestion of the reviewer and have added a comparison to the Orbitrap data (we caution that we have insufficient data from TOF set-ups to draw representative conclusions yet), to further test the importance of resolution, found in the new Supplementary Table 6. We further caution, however, that comparisons will always be flawed to some extent, as, in our experience, annotations from the ion trap data subset are, on average, much better, because they usually are refined with additional exoglycosidase digestions etc. (seen for instance in the best-in-class performance on set-ups using the amaZon ion trap, which all derive from researchers at the Leiden University Medical Center). As we tend to see less of that in the TOF/Orbitrap data, resulting in more structures that are either only partially defined or lack important reporting details including retention time, this constitutes an inextricable confounder. We also would like to take this opportunity to argue that the even higher performance on amaZon data (>95% accuracy in the new Supplementary Table 6), despite highly diverse and complex O-glycan samples, already indicates that, with higher-quality data, even higher prediction performance can be reached with CandyCrunch than we currently report across all types of data, making us very optimistic and excited about the future potential of this technology.
11. The authors noted that "a model only trained on O-glycans performed worse for predicting O-glycans (Supplementary Table 4; topology: 84% accuracy, structure: 79% accuracy) than the model trained on all classes", and hypothesized that "this was due to the structure-based loss function we used for training, as well as shared information between spectra of different classes, stemming from shared glycan motifs across classes". This is an interesting point. How is the motif information incorporated into the training algorithm? Would CandyCrunch be able to predict structures that are built from existing motifs but not in the training set? Again, this could be tested by removing certain structures entirely from the training set.
We agree with the reviewer and think this is very promising. Our structure-based loss function works such that two glycans receive a higher value (a greater distance) if they share fewer substructures. As the model tries to minimize the loss, it is guided toward predicted structures that have similar substructures or motifs. Our software library glycowork has potent capabilities to recognize and count motifs in glycan sequences and we use this functionality to construct this initial structure-based distance matrix for the glycans in our training data. This matrix is then used as a kind of lookup table during training. As discussed in more depth above, CandyCrunch (ignoring biosynthetic modeling etc.) is inherently incapable of predicting unseen glycans, no matter how similar to seen glycans. However, this shared information that is being considered during training allows us to perform better on glycans for which we have few training spectra, if that glycan has many shared motifs with other glycans, for which we do have more training spectra. We tested this cross-training hypothesis by analyzing the dependence of glycan performance on number of training spectra, in the context of whether its motifs are found in other glycans in the data, shown in the new Supplementary Fig. 6. This analysis indeed confirms that, with statistical significance, a higher support (i.e., a greater number of spectra from structurally related glycans in the training set) lowers the probability of low-accuracy predictions for glycans with few training spectra on their own.
12. CandyCrumbs is a useful tool, but the way it prioritizes assignments when multiple assignments are possible is suspect (Figure S10). There are often cases where a neutral loss or double cleavages are more abundant than a single-cleavage fragment. Is it possible to also learn the assignment from the annotated training set?
We agree with the reviewer that, at best, our fragment prioritization is a heuristic. First, we would like to note that fragment prioritization is only an option within CandyCrumbs, controlled by a keyword argument, and our method can readily output all possible fragment assignments at a given m/z value. The trade-offs here are that the agnostic output of all possible fragments might be most useful for expert users but not for intermediate users, who might be, for instance, overwhelmed by the inclusion of, theoretically possible, triple-cleavage fragments. This is why we support both modes of analysis. We note that, while there are certainly cases where our heuristic breaks down, the main aim of a heuristic is to be more right than wrong in most of the cases. It also needs to be generalizable. We thus maintain that fragment prioritization, if chosen to be activated, adds value. Yet we, of course, are optimistic that this will be improved in the future.
It is theoretically absolutely possible to learn assignments and we would love nothing more. The main issue here is that there is not enough available data for this, as only a minuscule fraction of deposited glycomics data is accompanied by annotated fragments (and then usually in the form of pictures in supplemental figures). Further, if not accompanied by MS3 data, assignments in the case of multiple options are often not based on rigorous criteria and vary extensively between analysts. We are thus hopeful that, once sufficient data of that type become available, CandyCrumbs can be revisited as a machine learning problem.
Overall, I think the manuscript is worth consideration for publication on Nature Methods, but there are serious flaws that need to be addressed and it needs further and more robust evaluation.
We thank the reviewer for the constructive feedback and we are convinced that our point-by-point responses, as well as the associated changes we made to the revised manuscript, have improved our work substantially.

Reviewer #3:
Remarks to the Author: The rebuttal addresses all my comments well and the authors updated and improved the manuscript. The described tool has a lot of options to tune it for a specific experiment and train it with new data, and it will likely be very valuable in LC-MS based glycomics experiments in the future. Although the authors provided an easily accessible Google Drive interface to make the tool accessible for non-python-specialists, this is not an ideal interface to bring the tool to its full potential (as indicated by the authors themselves as well). I understand that this is not the main purpose of the manuscript, but I would like to encourage the developers to equip the Python tool with a user friendly interface to make the tool accessible for all glyco LC-MS specialists, independent of their programming skills. As a start, a clearer "read me" can be provided on how to get started with the tool using the local installation. I'm sure there will be enough "dummy testers" in the field happy to help validating this.
We thank the reviewer for engaging with our work and helping us improve it. As further detailed in the response to Reviewer #5, we have decided to at least also implement a command line interface for using CandyCrunch via simple command line commands (new v0.3 of CandyCrunch). While we realize that this may still exceed the capabilities or interest of some potential users, we are confident that this still constitutes an improvement over the current state and hope to construct a true graphical interface for this in the future as well. We have also expanded our "Read me" document to be more specific about the usage of CandyCrunch in case of local installation. We are also hopeful that the additions to the revised manuscript (new Supplementary Fig. 2, new Supplementary Table 12), detailing usage and options of CandyCrunch, will aid in making this method more accessible.

Reviewer #4:
Remarks to the Author: This manuscript presents a highly novel tool for glycomics data annotation which is made available via an ad hoc Python notebook. The authors have addressed both the novelties of the software as well as the drawbacks or points for users to be aware of when using the software. In particular, it is very important to make potential users aware of what NOT to expect. The data and methodology are rigorously explained, and the manuscript is written clearly overall.
We are grateful for the evaluation and feedback on our herein described method and address the suggestions for improvement below.

Suggestions for improvement:
1. A table or at least a description of the various options that are available could be summarized. While the author responded to one of the other reviewer's comments and added a statement that the given option (zero-shot) is available, amongst the comments to the reviewers there seemed to be many such options that are available, but the user would not be aware of them. So it would help the manuscript greatly to indicate that all of the various options that often need to be considered when annotating glycomics spectra are addressed, and the list would indicate how to set those options.
We thank the reviewer for this suggestion and have added a schematic description of what needs to be set in using CandyCrunch, and which considerations go into setting parameters (new Supplementary Fig. 2), as well as a new supplemental table that describes the options for each of these parameters (new Supplementary Table 12).This is also accompanied by a now more extensive "Read me" section on the GitHub page of our CandyCrunch package (new v0.3 of CandyCrunch).
2. While users could use the outputted annotations as figures to include in their own publications, how do the authors consider feeding back the annotations to databases such as in the GlySpace Alliance? In particular, are GlyTouCan IDs assignable to the annotations?
This is a very valuable suggestion and we have now linked the glycans in our predictions to GlyTouCan IDs, whenever available. Thus, from CandyCrunch v0.3 onward, GlyTouCan IDs will also be present in the annotation tables of the output of this method and can be used to interface findings with the resources of the GlySpace Alliance.

Reviewer #5:
• "Applied to fully unseen datasets, CandyCrunch routinely achieved high performance (Supplementary Table 4; topology: 92% accuracy, structure: 84%)". These numbers appear to be Top5 structure accuracies from Supp. Table 4, whereas the previous data in Figure 1 was presented as Top1 accuracy. This needs to be noted, and ideally, the range of Top1 accuracies should be provided as well or instead.
This is correct and we have added this information to the revised manuscript. We now report the corresponding values for Top1 predictions, updated for the re-trained model from the previous round of revisions, as well as improvements in the post-prediction workflow (now from the updated Supplementary Table 8).
• With regard to Supp. Table 3, the authors state that higher m/z resolution appeared not to necessarily translate to higher prediction performance. The resolutions tested (0.7 to 1.4 Da per bin) are 2 orders of magnitude lower than high resolution mass spectrometry data, so it would be worth noting the difference between "higher" resolution than the model default (i.e., 2x higher) and "high" resolution in a mass spectrometry context (i.e., 100x higher).
We thank the reviewer for bringing up this important distinction and have added this information to the revised manuscript. We are also enthusiastic about revisiting this avenue of research once sufficient amounts of high-resolution mass spectrometry data become publicly available.

Decision Letter, second revision:
Dear Daniel, Thank you for submitting your revised manuscript "Predicting glycan structure from tandem mass spectrometry via deep learning" (NMETH-A52883C). It has now been seen by the original referees and their comments are below. The reviewers find that the paper has improved in revision, and therefore we'll be happy in principle to publish it in Nature Methods, pending minor revisions to satisfy the referees' final requests and to comply with our editorial and formatting guidelines.
We are now performing detailed checks on your paper and will send you a checklist detailing our editorial and formatting requirements within two weeks or so. Please do not upload the final materials and make any revisions until you receive this additional information from us.
TRANSPARENT PEER REVIEW Nature Methods offers a transparent peer review option for new original research manuscripts submitted from 17th February 2021. We encourage increased transparency in peer review by publishing the reviewer comments, author rebuttal letters and editorial decision letters if the authors agree. Such peer review material is made available as a supplementary peer review file. Please state in the cover letter 'I wish to participate in transparent peer review' if you want to opt in, or 'I do not wish to participate in transparent peer review' if you don't. Failure to state your preference will result in delays in accepting your manuscript for publication.
other structures are biologically relevant.
(2) The authors' statement that it is rare to find cases where isomers co-elute within a half-minute retention time window may seem unexpected. For instance, a recent study investigating high-mannose and paucimannose N-glycans (J. Proteome Res. 2024, 23, 939-955) demonstrates significant overlap of isomers, with many falling within a 0.5-minute retention time window (Table 1, noting that 1 dextran index corresponds to approximately a 2-3-minute gap in retention time). However, it should be noted that the JPR study did not involve reduction, resulting in more complex chromatograms due to anomerism. The authors' statement that the examples shown in Figure S3 are isobars and not strictly isomers is incorrect. Gal?1-?GalNAcα1-3(Neu5Acα2-6)GalNAc and Fucα1-?GalNAcα1-3(Neu5Gcα2-6)GalNAc are indeed isomers, as both should have an m/z value of 878.326 in their reduced and deprotonated form, or 878.325 in their underivatized and protonated form. The difference in their m/z values in the datasets is likely due to insufficient mass measurement accuracies.
(3) The authors stated that they currently do not see any improvements in prediction when they increase the effective resolution of the model by binning more finely, but acknowledged that the performance may only increase when reaching a certain value of resolution and not before. They also stated that utilization of the higher resolution Orbitrap data did not lead to performance enhancement. However, it's important to note that Table S4 shows that the highest effective resolution per bin tested is merely 0.72 Da, which is significantly greater than the difference between common isobars observed in glycan fragments (0.036 Da for the CH4 and O splitting). Therefore, it is not surprising that the benefits of higher mass resolution cannot be realized in the present setting. The authors also acknowledged in their rebuttal letter that direct comparison between performance on data acquired on different instrument platforms is confounded by the variance in other experimental variables, such as validation from exoglycosidase digestions, potential MS3 data (maybe?), annotation quality, and retention time information. These are all valid points that should be incorporated into the main text to inform the readers, rather than being solely addressed in the rebuttal letter. Mass calibration is another complicating factor that the authors should address in their manuscript. Currently, further reducing the bin size may not lead to the desired performance enhancements due to two main factors: (1) to effectively account for the CH4-O isobar, the bin size would need to be reduced by more than an order of magnitude, which may not be practical; and (2) the variations in mass accuracy of the training data across different datasets do not justify the effort of further reducing the bin size. Instead of conducting performance evaluation at further refined bin sizes, it may be sufficient for the authors to acknowledge these caveats in the revised manuscript.
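The arithmetic behind the CH4 versus O argument can be checked with a few lines. This worked example is an illustration added for clarity; it uses standard monoisotopic masses and the 0.72 Da bin width quoted above, and the variable names are arbitrary.

```python
# Standard monoisotopic masses in Da.
MONOISOTOPIC_MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}

def composition_mass(composition):
    """Monoisotopic mass of an elemental composition such as {"C": 1, "H": 4}."""
    return sum(MONOISOTOPIC_MASS[element] * n for element, n in composition.items())

# Mass difference between fragments that differ by CH4 versus O.
ch4_vs_o = composition_mass({"C": 1, "H": 4}) - composition_mass({"O": 1})

# Finest effective resolution per bin tested, as quoted from Table S4.
finest_bin_width = 0.72

# Factor by which bins would have to shrink just to match the splitting.
fold_reduction = finest_bin_width / ch4_vs_o
```

The splitting comes out at roughly 0.036 Da, so the finest bin tested is about 20 times wider than the splitting. This is consistent with the statement that the bin size would need to shrink by more than an order of magnitude before this isobar pair could land in separate bins.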
Reviewer #3 (Remarks to the Author): The manuscript and tool are greatly improved and in my opinion ready for publication. I look forward to future developments and more user-friendly interfaces.
Reviewer #3 (Remarks on code availability): I was able to run the code; a README is present.
Reviewer #4 (Remarks to the Author):
please let the production team know when you receive the proof of your article to ensure there is sufficient time to coordinate. Further information on our embargo policies can be found here: https://www.nature.com/authors/policies/embargo.html
To assist our authors in disseminating their research to the broader community, our SharedIt initiative provides you with a unique shareable link that will allow anyone (with or without a subscription) to read the published article. Recipients of the link with a subscription will also be able to download and print the PDF.
As soon as your article is published, you will receive an automated email with your shareable link.
If you are active on Twitter/X, please e-mail me your and your coauthors' handles so that we may tag you when the paper is published.
You can now use a single sign-on for all your accounts, view the status of all your manuscript submissions and reviews, access usage statistics for your published articles and download a record of your refereeing activity for the Nature journals.
Please note that you and any of your coauthors will be able to order reprints and single copies of the issue containing your article through Nature Portfolio's reprint website, which is located at http://www.nature.com/reprints/author-reprints.html. If there are any questions about reprints please send an email to author-reprints@nature.com and someone will assist you.
Please feel free to contact me if you have questions about any of these points.

Best regards, Arunima
Arunima Singh, Ph.D. Senior Editor Nature Methods
should be improved: there is a large variation in font sizes used, panel C is hard to understand, panel D and F are very small. It is unclear what panel E is based on, can this be included in the figure? Panel H is very large and contains relatively little information (and what does the 9 mean?). Text in G is hardly readable (same for Fig 2A).