Deep scanning lysine metabolism in Escherichia coli

Abstract Our limited ability to predict genotype–phenotype relationships has called for strategies that allow testing of thousands of hypotheses in parallel. Deep scanning mutagenesis has been successfully implemented to map genotype–phenotype relationships at a single‐protein scale, allowing scientists to elucidate properties that are difficult to predict. However, most phenotypes are dictated by several proteins that are interconnected through complex and robust regulatory and metabolic networks. These sophisticated networks hinder our understanding of the phenotype of interest and limit our capabilities to rewire cellular functions. Here, we leveraged CRISPR‐EnAbled Trackable genome Engineering to attempt a parallel and high‐resolution interrogation of complex networks, deep scanning multiple proteins associated with lysine metabolism in Escherichia coli. We designed over 16,000 mutations to perturb this pathway and mapped their contribution toward resistance to an amino acid analog. By doing so, we identified different routes that can alter pathway function and flux, uncovering mechanisms that would be difficult to rationally design. This approach sets a framework for forward investigation of complex multigenic phenotypes.

Thank you again for submitting your work to Molecular Systems Biology. We have now heard back from the two referees who agreed to evaluate your study. As you will see below, the reviewers acknowledge that the study seems potentially interesting. They raise however a series of concerns, which we would ask you to address in a revision of the manuscript.
Reviewer #2 provides constructive suggestions on additional analyses that will enhance the impact of the study. I think that overall the recommendations of the reviewers are rather clear, but please let me know in case you would like to discuss further any of the reviewers' comments. In this work the authors present the application of deep scanning mutagenesis to multiple proteins involved in lysine metabolism with the goal of better understanding the multiple factors that lead to increased production. The work appears to be scientifically sound, and in fact, most of the findings are backed up by other historic studies that have looked at various aspects of this system. Overall, I found the work interesting and clearly presented, but throughout struggled to see the leap in understanding that the paper seemed to suggest in the introduction. I also felt that a more balanced presentation of the methodology in this setting was required to explain not only the potential of the approach, but also the constraints, limits and scalability (see comment 2 below). In light of this, I'd also recommend the authors think about whether the paper might work better if presented more like a method than an article. MSB do have a new category of paper for this.
Below are some major points for consideration: 1. The introduction nicely outlined the importance of considering multiple points in a pathway or complex regulatory network to assess the role of mutations on fitness. Based on this I was expecting to see combinations of mutations explored to look at synergetic or antagonistic effects of changes to multiple elements in the system. However, if I have interpreted the approach correctly, each mutation only affects one gene. While I agree that it's very useful to be able to use pooled sequencing, surely this approach is no different to producing 17 CREATE libraries one for each gene in isolation and then pooling to sequence everything together? Is this approach not significantly hampered by not considering mutations in multiple genes at once? (although this will cause further problems with scalability) 2. One of the challenges with combinatorial approaches is the large numbers of possibilities that arise as the system of interest grows even moderately. As the approach presented here is all about scaling beyond single proteins, how large a system could be investigated with current chip oligo synthesis technologies and sequencing methods? How well does is scale with the number of proteins in a pathway? I would expect to see some calculations and discussion on these properties and specifically the limits of what is currently possible. The paper currently lacks a balanced perspective on limits and capabilities of the approach.
3. When presenting the enrichment of synonymous mutations in LysP you mention some of the mechanisms that might play a role, but do not provide any information as to whether this is supported by, for example, folding energy of the modified transcript, whether the structure of the LysP protein is likely to require co-translational folding ( 4. There was a lack of data regarding the sequencing performed. For example, the number of reads achieved for each replicate, number of mapped reads, number of reads after filtering, etc. This information would be useful to include as a Supplementary File to assess this crucial part of the process.
I also had a few minor comments: Line 521: the "-16" should be superscript. I'd advise the authors also checking throughout to ensure that subscripts and superscripts are as they should be. Figure 1D: I didn't find the CREATE workflow particularly clear. It would help to include more details about the steps you take and describing these either visually in the figure or briefly in the caption?
All figures: I found much of the text in the figures barely readable, which makes them frustrating to understand. It would help if it was ensured all text was larger than 6 pt when composed into a figure.
Reviewer #2: Bassalo et al., manuscript named "Deep scanning lysine metabolism in Escherichia coli" is well written and provides new insights in lysine metabolism. Although, this study is a continuation of already published CREATE method, it provides broader insights in one of the applications of the method. Since, the method itself is published, more in depth analysis of proposed application needs to be considered to retain novelty. Please see below a few comments, which in reviewers opinion would strengthen the manuscript.
Major points: 1. Can authors consider combining uncovered mutants in a single strain? Would such a strain have improved production of lysine? 2. L120: To better characterize the method, can authors show if all the residues were targeted in chosen catalytic domains without selection pressure. In other words, did genome edits matched with the mutagenesis library? If not all the residues were targeted can authors determine the efficiency of mutagenesis. 3. Only a fraction of targeted genes were investigated closer. What was the reason for targeting other genes if they were not investigated after? 4. L290: The benchmarking of ALE with the CREATE method is irrelevant, as all the 19 genes/815 sites selected as CREATE targets are rationally inferred as related to lysine metabolism (L107). This is not the case for ALE, in which both neutral, but also beneficial variants based on seemingly unrelated pleiotropic effects on lysine metabolism may arise. When Figure 6D lists mutations based on 30 hrs this does not take into considerations the 100s of man-years spent investigating the lysine metabolism, which also founded the basis for selection of the 19 genes targeted by CREATE in the first place. The reviewer acknowledges the authors' interest to benchmark the two methods, but the benchmark really does not make much sense beyond serving to illustrate that CREATE allows scanning lysine metabolism "hotspots" in greater depth compared to ALE. Moreover, if including this data in the final manuscript, this reviewer also suggests the authors to put in the lysine quantifications for the 15 whole-genome sequenced strains from the ALE experiment. This indeed would be a relevant benchmark of the two methods. 5. L519: "The intracellular concentration was calculated using the total CFUs present in the pellet (estimated by plating at the harvested stage) and the estimated volume of a single E. coli cell (4.96×10-16 L) (Neidhardt & Curtiss, 1996)". Were there no changes in colony sizes observed for the lysine metabolism mutants? If so, the lysine quantifications should be performed relative to dry cell weight and not CFU. Please comment. acid changes across the genome is particularly valuable as this allows the investigation of the contribution of individual mutations to the phenotype of interest, which is often a starting point in more sophisticated combinatorial search strategies.
2. One of the challenges with combinatorial approaches is the large numbers of possibilities that arise as the system of interest grows even moderately. As the approach presented here is all about scaling beyond single proteins, how large a system could be investigated with current chip oligo synthesis technologies and sequencing methods? How well does is scale with the number of proteins in a pathway? I would expect to see some calculations and discussion on these properties and specifically the limits of what is currently possible. The paper currently lacks a balanced perspective on limits and capabilities of the approach.
As mentioned in comment #1, we included a table (Table 1) containing more detailed metrics on actual genomic edits that we observed in the current study. We also included extensive changes in the manuscript (L291-336; L391-423) with a more thorough discussion on the current limits and capabilities of this approach. In these sections, we discuss some of the concerns raised by the reviewer such as sequencing depth and scalability. Further, we suggest several technical and experimental changes that could alleviate some of the current limitations in future implementation of this technology.
3. When presenting the enrichment of synonymous mutations in LysP you mention some of the mechanisms that might play a role, but do not provide any information as to whether this is supported by, for example, folding energy of the modified transcript, whether the structure of the LysP protein is likely to require co-translational folding This is a great suggestion by the reviewer. An expansion on this topic was added to the manuscript (L225-230).
4. There was a lack of data regarding the sequencing performed. For example, the number of reads achieved for each replicate, number of mapped reads, number of reads after filtering, etc. This information would be useful to include as a Supplementary File to assess this crucial part of the process.
We included a new table (Dataset EV4) with descriptive information on our sequencing runs.
I also had a few minor comments: Line 521: the "-16" should be superscript. I'd advise the authors also checking throughout to ensure that subscripts and superscripts are as they should be.
We fixed these issues throughout the manuscript. Figure 1D: I didn't find the CREATE workflow particularly clear. It would help to include more details about the steps you take and describing these either visually in the figure or briefly in the caption?
A more detailed description on the actual steps was included in the figure caption.
All figures: I found much of the text in the figures barely readable, which makes them frustrating to understand. It would help if it was ensured all text was larger than 6 pt when composed into a figure.
Font size for all figures was increased.

Reviewer #2:
Bassalo et al., manuscript named "Deep scanning lysine metabolism in Escherichia coli" is well written and provides new insights in lysine metabolism. Although, this study is a continuation of already published CREATE method, it provides broader insights in one of the applications of the method. Since, the method itself is published, more in depth analysis of proposed application needs to be considered to retain novelty. Please see below a few comments, which in reviewers opinion would strengthen the manuscript.
Major points: 1. Can authors consider combining uncovered mutants in a single strain? Would such a strain have improved production of lysine?
As suggested, we performed new experiments to combine several of the mutants described in this manuscript into a single strain. We did not observe any further improvements in the combined strains. This is not surprising given the complex set of parameters that govern pathway flux (we observed similar neutral or even antagonistic effects in our previous studies along these lines, Sandoval et al., PNAS, 2012). We agree that a combinatorial engineering approach would be extremely valuable and is a promising future direction from technologies such as CREATE. Such an approach requires, however, that search strategies are employed that attempt to explicitly consider pleiotrophic effects rather than the simple combining of the best individual mutants (as we show here, and in more depth in our prior PNAS paper cite above).
2. L120: To better characterize the method, can authors show if all the residues were targeted in chosen catalytic domains without selection pressure. In other words, did genome edits matched with the mutagenesis library? If not all the residues were targeted can authors determine the efficiency of mutagenesis. This is a great suggestion from the reviewer, which we addressed with new experimentation. Although a full assessment of the genomic edits and library coverage would not be feasible on a genome-wide scale (hence the value of technologies such as CREATE), we selected a few regions from targeted genes to deep sequence and investigate actual genomic edits. The results from this analysis was included in Table 1 and as Figure EV2 in the manuscript.
3. Only a fraction of targeted genes were investigated closer. What was the reason for targeting other genes if they were not investigated after?
We wanted to target all genes related to lysine metabolism so that they could compete against each other under selection. That way, we could assess the relative contribution of each gene to the selective pressure. We could not investigate all targeted genes in more details for two main reasons. First, validation and mechanistic elucidation of target edits is the most time-consuming step, with the study of many of these genes in isolation being a manuscript itself. Therefore, we believe that mapping of multiple of these genes in parallel followed by validation of a few of them, picked to highlight different aspects of the technology, is in itself a significant contribution. Second, current limits on depth and noise prohibits detailed investigation of every single designed edit. We included an extensive discussion on this issue to clarify the current limitations and provide a more balanced manuscript. 4. L290: The benchmarking of ALE with the CREATE method is irrelevant, as all the 19 genes/815 sites selected as CREATE targets are rationally inferred as related to lysine metabolism (L107). This is not the case for ALE, in which both neutral, but also beneficial variants based on seemingly unrelated pleiotropic effects on lysine metabolism may arise. When Figure 6D lists mutations based on 30 hrs this does not take into considerations the 100s of man-years spent investigating the lysine metabolism, which also founded the basis for selection of the 19 genes targeted by CREATE in the first place. The reviewer acknowledges the authors' interest to benchmark the two methods, but the benchmark really does not make much sense beyond serving to illustrate that CREATE allows scanning lysine metabolism "hotspots" in greater depth compared to ALE. Moreover, if including this data in the final manuscript, this reviewer also suggests the authors to put in the lysine quantifications for the 15 whole-genome sequenced strains from the ALE experiment. This indeed would be a relevant benchmark of the two methods.
We agree with the reviewer that CREATE and ALE are different strategies, and our intentions by comparing them was not to suggest CREATE as a replacement (or better) approach than ALE. The value of ALE is beyond discussion, with decades of contribution to the field. As the reviewer mentioned, CREATE allowed us to scan with Thank you again for sending us your revised study. We have now heard back from reviewer #2 who was asked to evaluate your study. As you will see below, the reviewer thinks that all issues have been satisfactorily addressed and is supportive of publication.
Before we formally accept your manuscript for publication, we would ask you to address a couple of remaining editorial issues. In the pink boxes below, please ensure that the answers to the following questions are reported in the manuscript itself. Every question should be answered. If the question is not relevant to your research, please write NA (non applicable). We encourage you to include a specific subsection in the methods section for statistics, reagents, animal models and human subjects.
definitions of statistical methods and measures: a description of the sample collection allowing the reader to understand whether the samples represent technical or biological replicates (including how many animals, litters, cultures, etc.).
Please fill out these boxes ê (Do not worry if you cannot see all your text once you press return) a specification of the experimental system investigated (eg cell line, species name).

B--Statistics and general methods
the assay(s) and method(s) used to carry out the reported observations and measurements an explicit mention of the biological and chemical entity(ies) that are being measured. an explicit mention of the biological and chemical entity(ies) that are altered/varied/perturbed in a controlled manner.

Data
the data were obtained and processed according to the field's best practice and are presented to reflect the results of the experiments in an accurate and unbiased manner. figure panels include only data points, measurements or observations that can be compared to each other in a scientifically meaningful way. graphs include clearly labeled error bars for independent experiments and sample sizes. Unless justified, error bars should not be shown for technical replicates. if n< 5, the individual data points from each experiment should be plotted and any statistical test employed should be justified the exact sample size (n) for each experimental group/condition, given as a number, not a range; Each figure caption should contain the following information, for each panel where they are relevant:

Captions
The data shown in figures should satisfy the following conditions: Source Data should be included to report the data underlying graphs. Please follow the guidelines set out in the author ship guidelines on Data Presentation.

YOU MUST COMPLETE ALL CELLS WITH A PINK BACKGROUND ê
Sample size was chosen to allow statistical tests and comparison between samples. Generally, 3 or 5 samples were chosen. The initial edited library was done as two biological replicates.

NA
All samples were included in the analyses, with the proper filters described in the Methods section.

NA
For the analyzes of the library biological replicates, a weighted enrichment score was applied. While this allows for bias, reconstruction and validation are essential to make sure the phenotype observed is real, as discussed in the manuscript.