Proteogenomic Analysis Provides Novel Insight into Genome Annotation and Nitrogen Metabolism in Nostoc sp. PCC 7120

ABSTRACT Cyanobacteria, capable of oxygenic photosynthesis, play a vital role in nitrogen and carbon cycles. Nostoc sp. PCC 7120 (Nostoc 7120) is a model cyanobacterium commonly used to study cell differentiation and nitrogen metabolism. Although its genome was released in 2002, a high-quality genome annotation remains unavailable for this model cyanobacterium. Therefore, in this study, we performed an in-depth proteogenomic analysis based on high-resolution mass spectrometry (MS) data to refine the genome annotation of Nostoc 7120. We unambiguously identified 5,519 predicted protein-coding genes and revealed 26 novel genes, 75 revised genes, and 27 different kinds of posttranslational modifications in Nostoc 7120. A subset of these novel proteins were further validated at both the mRNA and peptide levels. Functional analysis suggested that many newly annotated proteins may participate in nitrogen or cadmium/mercury metabolism in Nostoc 7120. Moreover, we constructed an updated Nostoc 7120 database based on our proteogenomic results and presented examples of how the updated database could be used to improve the annotation of proteomic data. Our study provides the most comprehensive annotation of the Nostoc 7120 genome thus far and will serve as a valuable resource for the study of nitrogen metabolism in Nostoc 7120.

IMPORTANCE Cyanobacteria are a large group of prokaryotes capable of oxygenic photosynthesis and play a vital role in nitrogen and carbon cycles on Earth. Nostoc 7120 is a commonly used model cyanobacterium for studying cell differentiation and nitrogen metabolism. In this study, we presented the first comprehensive draft map of the Nostoc 7120 proteome and a wide range of posttranslational modifications. In addition, we constructed an updated database of Nostoc 7120 based on our proteogenomic results and presented examples of how the updated database could be used for system-level studies of Nostoc 7120. Our study provides the most comprehensive annotation of Nostoc 7120 genome and a valuable resource for the study of nitrogen metabolism in this model cyanobacterium.

Reviewer #1 (Public repository details (Required)):
There is a large dataset of Mass Spectrometry analysis of Nostoc proteins. The authors state that the data are deposited in iProX database with the identifier IPX0002995000. However, the data are not publicly available at present. They should be available upon acceptance.

Reviewer #1 (Comments for the Author):
In this work the authors use MS technology to identify proteins in Nostoc in different conditions. The high depth proteomic analysis allows the refinement of the annotation, identify new ORFs and refine the N' terminal of many proteins. In addition, they identify PTMs in a systematic way resulting in the discovery of new PTMs not previously known in bacteria. Both the new annotation and the PTM data represent an invaluable repository of information for the scientific community.
However, there are several major issues to be solved.
1-The authors used as reference the Cyanobase annotation. However, there is a newer annotation in NCBI (https://www.ncbi.nlm.nih.gov/nuccore/NC_003272.1). Some of the ORFs the authors claim to have "discovered" are annotated in NCBI, as detailed below. This does not represent a methodological problem for the analysis of the MS data because in addition to the Cyanobase they use the six-frame genome database. But they should screen their GSSP against NCBI. In fact, in lines 287-289 they mention that the "novel" refined model for alr5269 was already reported in NCBI.
2-The results of this work allows for an updated annotation of ORFs in Nostoc. This is one of the most useful contributions of this work. However, the updated annotation is not easily available from the data presented. The authors should present the annotation as a table similar to table S1E with two additional columns: the updated coordinates for each ORF, and the Refseq new name of each ORF, if available.
3-There is an overall confusion between newly identified proteins and confirmation of proteins already annotated.
4-The "new" genes in Fig 5B, C and updated 5' end in 5E are already annotated in RefSeq.
5-There is disagreement in Fig. 6 between transcriptomic and RT-PCR. An example out of several: RNAseq: NG-2 increases at 12 h; RT-PCR: NG-2 decrease at 12 h.
6-Furthermore, there is disagreement between RNA data in Fig. 6 and protein quantification data in Fig.7. For instance, NG-7 protein amount increase upon nitrogen deprivation, but both RNASeq and RT-PCR indicate that there is a reduction in the amount of mRNA. The authors should at least discuss these discrepancies.

Other points
Line 41: photosynthesis
Lines 171-173: When describing the shared proteins identified they should mention in the text the three heterocyst specific proteins with split genes that are in Table S2.
Line 182-185: I find this speculation baseless.
Lines 199-203. The statements in this paragraph are not relevant because the authors do not perform a functional enrichment analysis of the different PTMs. Table 4B reflects just the general distribution of proteins.
Line 216-217. The PTM data are clearly very interesting, novel, and are one of the main strengths of the paper. From figure 4C it seems that at least propionyl and crotonyl modifications are altered in the -N sample for some proteins. This is a very interesting result that should be highlighted in the text.
Line 267: Insert here reference 50 so that the reader knows what RNASeq data are used.
Line 273. Is "transcriptomic evidence" here the same than "RNA-Seq evidence" in 267?
In addition to the detailed comments above, another expert in the field raised concerns regarding the use of this strain, claiming it is a mutant and highly mutable. Please comment on the time this strain has been in culture since the initial genome sequence and consider this in your interpretation of differences between the initially deposited sequence annotation and the update you provide.

Preparing Revision Guidelines
To submit your modified manuscript, log onto the eJP submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin the revision process. The information that you entered when you first submitted the paper will be displayed. Please update the information as necessary. Here are a few examples of required updates that authors must address:
• Point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Each figure must be uploaded as a separate file, and any multipanel figures must be assembled into one file.
Please return the manuscript within 60 days; if you cannot complete the modification within this time period, please contact me. If you do not wish to modify the manuscript and prefer to submit it to another journal, please notify me of your decision immediately so that the manuscript may be formally withdrawn from consideration by Microbiology Spectrum.
If you would like to submit an image for consideration as the Featured Image for an issue, please contact Spectrum staff.
If your manuscript is accepted for publication, you will be contacted separately about payment when the proofs are issued; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published. For a complete list of Publication Fees, including supplemental material costs, please visit our website.
Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.

Reviewer #1 (Public repository details (Required)):
There is a large dataset of Mass Spectrometry analysis of Nostoc proteins. The authors state that the data are deposited in iProX database with the identifier IPX0002995000. However, the data are not publicly available at present. They should be available upon acceptance.
Authors reply: We appreciate the valuable comments from the reviewer. We have submitted the data to iProX database with the identifier IPX0002995000 and they are publicly available now.

Reviewer #1 (Comments for the Author):
In this work the authors use MS technology to identify proteins in Nostoc in different conditions. The high depth proteomic analysis allows the refinement of the annotation, identify new ORFs and refine the N' terminal of many proteins. In addition, they identify PTMs in a systematic way resulting in the discovery of new PTMs not previously known in bacteria. Both the new annotation and the PTM data represent an invaluable repository of information for the scientific community.
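However, there are several major issues to be solved.
it has 2 amino acids extension at the N-terminus. (Figure 5D).

2-The results of this work allows for an updated annotation of ORFs in Nostoc. This is one of the most useful contributions of this work. However, the updated annotation is not easily available from the data presented. The authors should present the annotation as a table similar to table S1E with two additional columns: the updated coordinates for each ORF, and the Refseq new name of each ORF, if available.
Authors reply: We thank the reviewer for this constructive comment. The updated coordinates for each ORF are displayed in columns H-J of Table S6B and Table S6C, and the RefSeq names of each ORF have been updated in Tables S6D and S6E.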

3-There is an overall confusion between newly identified proteins and confirmation of proteins already annotated.
Authors reply: We apologize for this lack of clarity. In this study, all raw MS data were analyzed against two databases: (I) the Cyanobase database and (II) a six-frame translated genome database of Nostoc sp. PCC 7120. Identified peptides that exclusively matched the six-frame genome database were designated genome search-specific peptides (GSSPs). The GSSPs were then mapped to unique genomic loci using BLAST to identify open reading frames (ORFs). ORFs that mapped only to genome regions without overlap with any known gene model were designated novel protein-coding genes, that is, the newly identified proteins.

5-There is disagreement in Fig. 6 between transcriptomic and RT-PCR. An example out of several: RNAseq: NG-2 increases at 12 h; RT-PCR: NG-2 decrease at 12 h.
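
The GSSP definition given in the reply to comment 3 can be illustrated with a minimal Python sketch: a peptide counts as genome search-specific when it is absent from every annotated protein but present in a six-frame translation of the genome. The function names, the toy sequences, and the use of plain substring matching (in place of an MS search engine and BLAST mapping) are illustrative assumptions and do not reproduce the authors' pipeline.

```python
from itertools import product

# Standard codon table in NCBI TCAG order. Bacterial table 11 assigns the
# same amino acids and differs only in the permitted start codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(genome):
    """Map (strand, offset) to the translated amino acid string."""
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def genome_search_specific_peptides(peptides, annotated_proteins, genome):
    """Return peptides found only in the six-frame translation.

    Simplifying assumption: exact substring matching stands in for the
    database search and the BLAST mapping described in the reply.
    """
    frames = six_frame_translations(genome)
    gssps = {}
    for pep in peptides:
        if any(pep in prot for prot in annotated_proteins):
            continue  # explained by an existing gene model
        hits = [frame for frame, aa in frames.items() if pep in aa]
        if hits:
            gssps[pep] = hits
    return gssps


if __name__ == "__main__":
    # Hypothetical toy genome encoding MKT and MPL in frame +0.
    toy_genome = "ATGAAAACCTAAATGCCGCTGTAA"
    print(genome_search_specific_peptides(["MKT", "MPL"], ["MKT"], toy_genome))
```

In a real analysis, each GSSP would additionally have to map to a single genomic locus, and the resulting ORF would have to be checked for overlap with known gene models before being called novel.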
Authors reply: We appreciate the comment of the reviewer. We also noticed the disagreement between the transcriptomic and RT-PCR data. We believe this divergence may be due to differences in samples and in detection and analysis methods, as well as to shorter transcripts and lower expression levels (2, 3). Therefore, Figure 6 is mainly intended to show that these new genes can be detected at the mRNA level and are differentially expressed under nitrogen deficiency. We have mentioned this disagreement in the revised manuscript in response to the reviewer's comment. (Line 317-318)

6-Furthermore, there is disagreement between RNA data in Fig. 6 and protein quantification data in Fig.7. For instance, NG-7 protein amount increase upon nitrogen deprivation, but both RNASeq and RT-PCR indicate that there is a reduction in the amount of mRNA. The authors should at least discuss these discrepancies.
Authors reply: We thank the reviewer for this constructive suggestion. It is widely accepted that mRNA levels usually correlate poorly with the corresponding protein levels (4,5). Regulatory processes beyond transcription, including post-transcriptional, translational, and degradation regulation, may contribute at least as much as transcription itself to determining protein concentrations. We have discussed this in the revised manuscript, as the reviewer suggested. (Line 349-352)

Other points:
Line 41: photosynthesis
Authors reply: We are sorry for the mistake. We have revised the spelling, and the whole manuscript has been carefully checked.
Lines 171-173: When describing the shared proteins identified they should mention in the text the three heterocyst specific proteins with split genes that are in Table S2.
Authors reply: We appreciate the reviewer's comment. We have added this to the revised manuscript (Line 175-177), as the reviewer suggested.
Line 182-185: I find this speculation baseless.
Authors reply: We apologize for this lack of clarity. We have deleted the controversial speculation in the revised manuscript.
Table 4B reflects just the general distribution of proteins.
Authors reply: We appreciate the reviewer's comment. As mentioned above, there are discrepancies between different versions of the genome annotation. We have shown the sequence alignment of the novel genes in Table S6D. As shown in the table, only NG-7 is consistent with the 2019 and 2020 versions of the genome annotation. Although NG-1 also has homologous sequences, its N-terminus is 10 amino acids shorter than that in NCBI RefSeq. According to the BLAST analysis, neither NG-2 nor NG-6 has been annotated in NCBI RefSeq.
Lines 361-363, As indicated above in relation to lines 182-185, this speculation is baseless. I don't understand the meaning of "weaker protein coding potential".
Table S6 are already annotated in Refseq.
Authors reply: We appreciate the reviewer's comment. As mentioned above, three different versions of the genome annotation for Nostoc sp. PCC 7120 have been published in the NCBI database. In this study, we selected the Cyanobase annotation because it has been widely used as a reference database. As shown in Table S6D and Figure 5C, we acknowledge that NG-20 is present in the latest genome annotation. However, this protein is missing from the Cyanobase database and the 2019 version of the genome annotation. The sequence alignments of the other novelties with genes annotated in the different versions of the genome are also summarized in Tables S6D and S6E.
Paragraph starting at line 412. The extended discussion on proteins related to heavy metal metabolism (including fig. 8) is rather out of the main focus of the paper, could be deleted or strongly reduced.

Authors reply: We thank the reviewer for this constructive suggestion. We have strongly reduced this discussion section in the revised manuscript.
Table S1E has a recurrent typo in column D: putative "proteion".
Authors reply: Thanks for pointing out these mistakes and we have corrected them in the revised manuscript.
References should be carefully reviewed. References 10 and 41 are duplicated.
References 66 and 101 are incomplete or wrongly formatted.
Authors reply: We are sorry for the mistakes. We have corrected them, and the whole revised manuscript has been carefully checked.

Reviewer #2:
In addition to the detailed comments above, another expert in the field raised concerns regarding the use of this strain, claiming it is a mutant and highly mutable.
Please comment on the time this strain has been in culture since the initial genome sequence and consider this in your interpretation of differences between the initially deposited sequence annotation and the update you provide.
Authors reply: We highly appreciate the comment raised by the reviewer. Nostoc sp. PCC 7120 is commonly used as a model strain for studying cell differentiation and multicellular pattern formation. At present, the widely used substrains come mainly from the University of Chicago, Michigan State University, or the PCC. Comparative genome analysis has shown that these three substrains have undergone microevolution, including single nucleotide polymorphisms (SNPs), small insertions/deletions (indels; 1 to 3 bp), fragment deletions, and transpositions [4] (Table R1-2). The genomic locations of the novelties identified in this study are listed in Table R3. Comparing these regions with the sequence mutations reported by Xu et al. revealed no overlapping region. In addition, genome microevolution events have also been reported in other organisms, such as Escherichia coli, Bacillus subtilis, and Synechocystis sp. PCC 6803 (6)(7)(8). Genome reannotations based on proteogenomic analysis, similar to this study, have also been performed in these organisms (9)(10)(11).
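
The comparison described above, checking whether any novel region coincides with a reported mutation, amounts to an interval-overlap test. The following Python sketch shows one way such a check could be written. The `Region` type, the function names, and all coordinates are hypothetical and are not taken from Tables R1-R3.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    replicon: str  # chromosome or plasmid identifier
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive


def overlaps(a, b):
    """True when two closed intervals on the same replicon share a position."""
    return a.replicon == b.replicon and a.start <= b.end and b.start <= a.end


def find_overlaps(novel_regions, mutation_sites):
    """Return (novel name, mutation name) pairs whose coordinates overlap."""
    return [
        (n.name, m.name)
        for n in novel_regions
        for m in mutation_sites
        if overlaps(n, m)
    ]


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    novel = [Region("novel_A", "chromosome", 1000, 1400)]
    mutations = [
        Region("snp_1", "chromosome", 2500, 2500),
        Region("indel_1", "plasmid_alpha", 1200, 1202),
    ]
    print(find_overlaps(novel, mutations))
```

An empty result corresponds to the "no overlap" outcome reported in the reply. The pairwise loop is adequate for a few dozen regions; an interval tree would be the usual choice for larger sets.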
The substrain used in this study was obtained in 2017 from Jindong Zhao (Peking University), who had brought it from the University of Chicago. The Cyanobase annotation is also based on the genome sequence of the strain from the University of Chicago. Therefore, although the strain has undergone microevolution during cultivation, the comparison above indicates that the novelties identified in this study are unlikely to be caused by these mutations, and we consider the methodology employed in this study to be reasonable.

Thank you for your detailed modifications to the manuscript. Please note that while you have provided accession numbers to data, you should provide the following in a "Data Availability" paragraph at the end of the Materials and Methods section of full-length articles (or at the end of the text in shorter article types): data description, name of the repository, and DOIs or accession numbers.
Your manuscript has been accepted, and I am forwarding it to the ASM Journals Department for publication. You will be notified when your proofs are ready to be viewed.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.
As an open-access publication, Spectrum receives no financial support from paid subscriptions and depends on authors' prompt payment of publication fees as soon as their articles are accepted.