Unraveling the hidden universe of small proteins in bacterial genomes

Abstract Identification of small open reading frames (smORFs) encoding small proteins (≤ 100 amino acids; SEPs) is a challenge in the fields of genome annotation and protein discovery. Here, by combining a novel bioinformatics tool (RanSEPs) with “‐omics” approaches, we were able to describe 109 bacterial small ORFomes. Predictions were first validated by performing an exhaustive search of SEPs present in the Mycoplasma pneumoniae proteome via mass spectrometry, which illustrated the limitations of shotgun approaches. Then, RanSEPs predictions were validated and compared with other tools using proteomic datasets from different bacterial species and SEPs from the literature. We found that up to 16 ± 9% of proteins in an organism could be classified as SEPs. Integration of RanSEPs predictions with transcriptomics data showed that some annotated non‐coding RNAs could in fact encode SEPs. A functional study of SEPs highlighted an enrichment in the membrane, translation, metabolism, and nucleotide‐binding categories. Additionally, 9.7% of the SEPs included a predicted N‐terminal signal peptide. We envision RanSEPs as a tool to unmask the hidden universe of small bacterial proteins.

Thank you again for submitting your work to Molecular Systems Biology. We have now heard back from the three referees who agreed to evaluate your manuscript. As you will see from the reports below, the referees find the topic of your study of potential interest. They raise, however, substantial concerns on your work, which should be convincingly addressed with further experimentation and analyses.
The major concerns raised by the reviewers refer to the need of more solid, orthogonal evidence for the existence of a "significant subset of the small proteins reported" and an assessment of false/true positive rates.
If you feel you can satisfactorily deal with these points and those listed by the referees, you may wish to submit a revised version of your manuscript. Please attach a covering letter giving details of the way in which you have handled each of the points raised by the referees. A revised manuscript will be once again subject to review and you probably understand that we can give you no guarantee at this stage that the eventual outcome will be favorable.

This study represents a significant amount of work, and the identification of unannotated small open reading frame-encoded proteins is important. However, given the potential artifacts associated with the identification and characterization of small proteins due to their low information content, the authors need to be much more careful about their analyses, the presentation of the experiments and data, and their conclusions. A flood of insufficiently substantiated small protein data could do more harm than good.

1. A key component of the authors' analysis is the supposed validation of proteins by mass spectrometric analysis. Given the difficulties in mass spectrometric analysis of small proteins in particular (proteins with two or more UTPs could also be artifacts), the authors need to provide independent evidence of small protein synthesis, such as immunoblot detection of tagged proteins expressed from the chromosome, for a significant subset of the proteins reported.
2. The authors are not sufficiently explicit about experimental details and results. As just two examples, the conditions and mutants for the 116 shotgun mass spectrometric experiments should be easy to decipher, and the data for all 109 bacterial smORFomes should be easily accessible (www.ranseps.crg.es was not available).
3. In the absence of any experimental tests, the authors need to be more critical of the output of their computational analysis of the small protein features and functions. A problem inherent in small proteins is the low information content which makes hydrophobicity and homology searches tenuous. How well do their analyses perform on known and random data sets? For example, I question whether the authors can really distinguish between signal peptide sequences and transmembrane regions to provide such specific numbers (9.7% and 15%)?
More minor comments: 4. Page 3, line 8: Since the small proteins being described appear to be encoded by specific genes rather than being cleaved from larger proteins, they should be referred to as proteins rather than peptides.
5. Page 6, lines 2-4 from bottom: These two sentences are contradictory: "In fact we found 19 decoy SEPs with ≥1 UTP... All decoy proteins were not detected when considering ≥1 UTP."

6. Typographical error: "aand"

Reviewer #2: The characterization of short open reading frames (smORFs) that code for small proteins has become a very interesting and important topic. Unfortunately, most gene assembly and annotation tools ignore this arena, and thus this important category of molecules is often unknown. To address this issue, this manuscript reports an approach that combines a novel bioinformatics tool (RanSEPs) with omics measurements to examine numerous bacterial smORFomes. They find that about 25% of proteins in a bacterium are SEPs, and assign functional categories to many of them.
In general, this paper is scientifically sound and presents interesting results. The layout and discussion are logical and systematic. Numerous comments are given below:

7. Page 7, bottom - this discussion of RanSEPs is nice and somewhat offsets my comment #4 above, but there is still some confusion about how TPR and FPR are set and validated.

8. Page 9, bottom - the authors should be careful here in the header to state that "RanSEP is the only tool..." There are others available, and they have been demonstrated on other systems, including plants.

-New data for other bacterial species: we have extended the bacterial organisms studied by mass spectrometry (MS) from the original 6 Mycoplasma species to an additional 6 species.
-We have looked for SEPs in total cell extracts of Escherichia coli, Pseudomonas aeruginosa and Staphylococcus aureus. Briefly, after running tricine SDS gels to recover SEPs, we cut a band corresponding to <10 kDa and extracted the proteins for MS.
-Also, we have re-analyzed (with the same parameters as in the self-generated datasets) publicly available MS datasets generated to detect SEPs and reported in the literature for: Lactococcus lactis (PRD000266), Synechocystis sp. PCC6803 (PXD001246) and Helicobacter pylori (PXD000054).

Independent of MS:
-We have used ribosome profiling in E. coli and M. pneumoniae to support the claim that two or more UTPs unequivocally confirm the existence of a SEP, a claim that was questioned by one of the referees.
-Also, datasets for M. pneumoniae have been generated in our group. We plan to publish these in an independent manuscript. However, these datasets have been used to provide additional support to the results observed in E. coli.
Computational approaches:
-During the review process we have improved specific points pertaining to the functioning and availability of RanSEPs:
-The first version required the genome sequence in FASTA format and the CDS in a multi-FASTA file. RanSEPs now accepts GenBank files as input, so that only a single file is required containing the genome and coding sequences in addition to gene feature information.
-Previously, RanSEPs was using a predefined database. The current version accepts user-defined databases. This change makes RanSEPs more flexible and users will have more control over how the predictions are computed.
-The first version required the database of putative ORFs to be generated and a conservation study to define the negative training set. This step was redundant when multiple predictions were required for the same bacterial species (e.g., testing different parameters). RanSEPs now accepts these required files as arguments, eliminating unnecessary computational steps and saving computational costs.
-RanSEPs predictions are now supported by conservation studies to detect: -Novel SEPs in an organism of interest with function/annotation in a different species.
-Pseudogenes and repeated sequences that would count as false positives.
-Re-design of www.ranseps.crg.es to include an 'about' section with the main description of the approach and the possibility to download the program directly (as opposed to downloading it from the GitHub repository).
-To evaluate the rate of false positives we used the previous set of negative SEPs without random sampling (~15,000 entries).
-We re-calculated the prediction quality metrics for RanSEPs and the rest of the software tools using the previous sets as a reference.
-A computational time cost comparative study has been included in the article.
-CPC predictions used in the comparative analysis have been updated to include the scores provided by the new version CPC2. The first version of the program (desktop and web server) is currently out of service and we could not run it for the updated validation test.
Consequently, we decided to recompute the results to ensure reproducibility.
-The ESPPredictor server that was used to predict responsive peptides became unavailable, and no functional desktop version was found (more information
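For readers unfamiliar with the putative-ORF database step mentioned above, a minimal sketch of smORF enumeration on one strand is given below. This is purely illustrative: the start/stop codon sets and the length cutoffs are common defaults, not necessarily the ones RanSEPs uses, and the real pipeline also scans the reverse complement and applies conservation filters.

```python
# Hypothetical sketch: enumerating candidate smORFs (<= 100 aa) on the
# forward strand of a genome sequence.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}

def find_smorfs(genome, min_aa=10, max_aa=100):
    """Return (start, end, sequence) tuples for candidate smORFs."""
    orfs = []
    n = len(genome)
    for frame in range(3):
        pos = frame
        while pos + 3 <= n:
            if genome[pos:pos + 3] in STARTS:
                # extend codon by codon until an in-frame stop codon
                end = pos + 3
                while end + 3 <= n and genome[end:end + 3] not in STOPS:
                    end += 3
                if end + 3 <= n:  # a stop codon was found
                    aa_len = (end - pos) // 3  # codons before the stop
                    if min_aa <= aa_len <= max_aa:
                        orfs.append((pos, end + 3, genome[pos:end + 3]))
            pos += 3
    return orfs
```

A real database builder would also deduplicate nested ORFs sharing a stop codon and record strand and frame for each entry.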

A key component of the authors' analyses is the supposed validation of proteins by mass
spectrometric analysis. Given the difficulties in mass spectrometric analysis of small proteins in particular (proteins with two or more UTPs could also be artifacts), the authors need to provide independent evidence of small protein synthesis, such as immunoblot detection of tagged proteins expressed from the chromosome, for a significant subset of the proteins reported.
We understand the concern of the reviewer about mass spectrometry data. We would like to clarify that the positive set (n=135) included 97 experimentally validated SEPs collected from the literature, most of which were identified by targeted and integrative approaches, not only by label-free proteomics. In addition, we showed that the threshold of two or more unique peptides (≥2 UTPs) is strict enough to avoid false positives (validated by targeted proteomics, first section of Results, 3rd paragraph). We apologize that this was not properly described in the manuscript, and we have now tried to explain it better in the current version. However, we agree with the point made by the reviewer that more evidence of small protein synthesis is required, independent of MS. To address this issue we carried out two different approaches, one MS-dependent and one MS-independent.
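The ≥2 UTPs criterion discussed above can be illustrated with a short sketch. The function name and data layout are hypothetical, not the authors' actual MS pipeline; a peptide is treated as "unique" when it maps to exactly one protein in the searched database.

```python
# Illustrative sketch of the ">=2 unique tryptic peptides (UTPs)" filter
# used to call a SEP confidently detected by shotgun MS.
from collections import defaultdict

def confident_seps(peptide_hits, min_utps=2):
    """peptide_hits: iterable of (peptide_sequence, protein_id) pairs.
    Returns the set of proteins supported by >= min_utps unique peptides."""
    proteins_per_peptide = defaultdict(set)
    for pep, prot in peptide_hits:
        proteins_per_peptide[pep].add(prot)
    utps = defaultdict(set)
    for pep, prots in proteins_per_peptide.items():
        if len(prots) == 1:  # peptide maps to exactly one protein: unique
            utps[next(iter(prots))].add(pep)
    return {prot for prot, peps in utps.items() if len(peps) >= min_utps}
```

With `min_utps=1` the filter reproduces the looser threshold that, as discussed above, admitted decoy hits.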
In the first approach, we aimed to validate SEPs present in the proteomes of other bacterial species by MS, as described in the new datasets above.
In the second approach, we used MS-independent experimental evidence to validate the cutoff of ≥2 UTPs as a way to unequivocally decide that a SEP is real. After evaluating the reviewer's suggestion to perform immunoblot detection of tagged proteins expressed from the chromosome, we concluded that this technique was not the most appropriate and instead decided to apply ribosome profiling combined with ultra-sequencing (RiboSeq). Among the reasons against using immunoblotting, endogenous promoters can be transcriptionally regulated, so the expression of SEPs could depend on specific conditions; for example, the alternative sigma factor MPN626 and its targets are not seen under normal growth conditions in M. pneumoniae.
For all these reasons, we decided to study the correlation between ribosome profiling and detection of proteins by MS. As we mentioned in the introduction of the manuscript, we are aware that RiboSeq is not the best technique to detect SEPs, since the binding of the ribosome does not indicate the frame in which the mRNA is translated. However, this technique has been shown to be reliable (Appendix Figure S5).

We apologize for possible difficulties during the revision process due to missing or inaccessible information. We have improved the descriptions and information by:

-Adding a description of each mutant or condition for the different shotgun experiments in an additional column in the Supplementary file (Dataset EV2).
-Making accessible the data for all the 109 bacterial smORFomes in the webpage www.ranseps.crg.es (login requirement will be removed upon publication; username: reviewer and password: reviewerCRG123456). This was previously unavailable due to privacy requirements.
-Thoroughly reviewing the content of the Supplementary tables so that it is more self-explanatory and does not require merging information from different tables:
-NCBI descriptions added to 6 ORF databases when the protein was found annotated in Datasets EV1, EV6-EV16.
-New Dataset EV17 listing all the SEPs detected in this work, with all the information required to validate the observation (UTPs detected, sequences, annotation, RanSEPs scores, etc.).

In the absence of any experimental tests, the authors need to be more critical of the output of their computational analysis of the small protein features and functions. A problem inherent in small
proteins is the low information content which makes hydrophobicity and homology searches tenuous. How well do their analyses perform on known and random data sets? For example, I question whether the authors can really distinguish between signal peptide sequences and transmembrane regions to provide such specific numbers (9.7% and 15%)?
We agree with the reviewer that we lacked proper validation for the results presented in the functional characterization of predicted SEPs. In the revised version, we have improved the validation of the results obtained by the three external tools used: PeptideSieve, BlastP and Phobius.
-PeptideSieve was used to predict the number of responsive UTPs in order to estimate whether or not to expect the detection of a protein by MS (Mallick et al, 2007). This tool is independent of the annotation length, as it digests proteins in silico and predictions are computed over the resulting UTPs if they are longer than 5 amino acids. Therefore, the probabilities computed by PeptideSieve depend only on the amino acid composition of the UTP, whilst the length of the parent protein has no impact on the prediction.
-BlastP is used in several steps of our work: definition of negative sets for RanSEPs, prediction of pseudogenes and prediction of functions. We implemented multiple conditions for considering homology between SEPs in order to ensure that these predictions were not biased by the size of the annotation being explored (explained in Material and Methods, 'Conservation analyses' section). As an additional test to address the reviewer's concerns, we repeated the prediction of functions with the 'decoy' dataset used in MS, including randomly generated sequences, and found that none of them passed the threshold used in the case of predicted SEPs. This has been included in the manuscript at the end of the 2nd paragraph in the last section of the Results, "Functional assessment to novel SEPs": 'We repeated this search with the dataset of 'decoy' proteins used for MS as the target, and found that no sequence passed the thresholds required to be considered homologous. As such, we would not expect to have false positives by chance.'
-Finally, we did not use RanSEPs to predict proteins with signal peptides or transmembrane segments, but used Phobius (Käll et al, 2004). To test the efficiency of Phobius in predicting transmembrane segments and signal peptides, we performed the same test as in the previous paragraph and found that less than 1.2% of the sequences were predicted to have these features (sample size=20,100 SEPs). This shows that the percentage observed for predicted SEPs (9.7%) was higher than what we would expect by chance. In addition, we performed another test where we subsetted annotated proteins of M. pneumoniae with a size >200 aa and a signal peptide.
Then, we sequentially shortened their C-terminus. This was done to check for a loss of accuracy in predicting signal peptides/transmembrane segments due to protein size. This analysis showed that Phobius works as expected for SEPs and thus we are able to trust its predictions. This information is presented in the manuscript at the end of the last paragraph of the Results section entitled "Functional assessment to novel SEPs": 'The percentage of SEPs with a signal peptide was higher than expected by chance when compared with the same 'decoy' set of SEPs used in MS (9.7% for predicted SEPs, 1.2% for 'decoy' SEPs, unpaired two-tailed t-test p-value=0.018). Moreover, to confirm that the results obtained with Phobius are meaningful with regards to SEPs, and that protein size did not bias the analysis, we ran a test over a set of annotated standard proteins in which we sequentially shortened their C-terminus (Appendix Figure S11).'
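As background to the PeptideSieve discussion above, the in-silico digestion its predictions build on can be sketched as follows. The cleavage rule (after K or R, but not before P) and the >5-amino-acid cutoff are common conventions; the tool's exact rules may differ.

```python
import re

def tryptic_peptides(protein, min_len=6):
    """Rough in-silico tryptic digest: cleave after K or R unless the next
    residue is P (a common simplification), then keep peptides longer than
    5 amino acids, mirroring the UTP length cutoff mentioned in the text."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in peptides if len(p) >= min_len]
```

In a full pipeline, each retained peptide would then be checked for uniqueness against the whole protein database before being counted as a UTP.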

This analysis has been extensively detailed in the last section of the Appendix Supplementary Methods file, and Appendix Figure S14 has been added to represent the results.
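The sequential C-terminus shortening test described above can be expressed as a simple loop over truncated sequences. The `predictor` callable here is a placeholder standing in for an external tool such as Phobius; the step size and minimum length are illustrative.

```python
def truncation_scan(protein_seq, predictor, step=10, min_len=30):
    """Sketch of the C-terminus shortening test: call a signal-peptide
    predictor on progressively shorter versions of an annotated protein
    and record whether the prediction is retained at each length."""
    results = []
    length = len(protein_seq)
    while length >= min_len:
        results.append((length, predictor(protein_seq[:length])))
        length -= step
    return results
```

A stable prediction across all truncation lengths would indicate, as argued in the response, that protein size does not bias the predictor.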

Page 3, line 8:
Since the small proteins being described appear to be encoded by specific genes rather than being cleaved from larger proteins, they should be referred to as proteins rather than peptides.
We agree with the reviewer and have changed the terminology accordingly. In the description of SEPs we now explain the term as small encoded proteins instead of peptides. The whole first section of the Results, about key factors and determinants in the detection of SEPs, has been edited to include a suggestion from the second referee, and we have thoroughly checked the numbers to avoid mistakes similar to the one pointed out by the referee.

Typographical error: "aand"
We apologize for the error, which has been corrected in the current version of the manuscript. This version has been externally reviewed by a native English scientific editor to improve the quality and clarity of the text.

Referee 2
We have added the following sentence to the introduction to highlight RanSEPs' potential and the specificities that make it better suited for SEP prediction (last sentence, 4th paragraph): "The better prediction accuracy of our method is due to the iterative randomization of the training set, a technique that enables the capturing of additional protein-related information during training. In addition, as the training set is biased to include more SEPs, it places a higher level of importance on the possible alternative features in the classification."
The test performed consisted of running each tool on 8 bacterial genomes ranging in size from 0.5 Mb to 9.1 Mb. With this approach we showed that the running time of RanSEPs per prediction is comparable to those of Glimmer and Prodigal. However, the time performance of our tool increases with the number of iterations selected by the user; this additional cost is justified by the significant increase in SEP prediction accuracy. In terms of CPU, the three tools showed a similar percentage of CPU use.
In the manuscript, this analysis is presented in the Appendix Supplementary Methods, as well as in Appendix Figure S8.
In terms of accuracy, we have significantly improved the validation and method comparison section by increasing the number of entries in the validation set. This analysis replaces the previous validation section in the manuscript; we have added multiple assessment metrics for the 6 tools compared: sensitivity, specificity, AUC, and accuracy, and supported the results with ROC curve and Precision-Recall visualizations (Figure 3C and Appendix S7, Datasets EV18-EV19).
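The iterative training-set randomization quoted in this response can be sketched generically: each iteration trains on a different random, balanced subset of positives and negatives, and the final score averages the per-iteration outputs. The function names, subset sizes, and balancing scheme below are illustrative, not RanSEPs' actual implementation.

```python
import random

def ransep_style_score(candidate, positives, negatives, train_fn,
                       n_iter=20, seed=0):
    """Average the scores of models trained on randomized balanced subsets.
    train_fn(pos, neg) must return a callable scoring a single candidate."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        pos = rng.sample(positives, k=min(len(positives), 50))
        neg = rng.sample(negatives, k=len(pos))  # balanced subset
        model = train_fn(pos, neg)
        scores.append(model(candidate))
    return sum(scores) / len(scores)
```

Averaging over many randomized training sets is what lets the classifier see more of the (scarce) SEP feature space than any single training split would.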

Page 5, line 13 - the observation of a much larger proportion of SEPs in bacterial genomes immediately raises the concern of "noise".
We have extensively modified the first section of the Results (now entitled 'Key factors and criteria for the experimental identification of SEPs') to include observations obtained when adding the thresholds suggested (≥1 UTP and ≥1 NUTP). By targeted proteomics we concluded that the suggested threshold was not appropriate for defining a set of actual novel SEPs, as it produced false positive putative SEPs (signal by label-free proteomics but not by targeted MS). These observations are presented in paragraph 2 of the first section of the Results and in Figure 2:
"Searches were performed using a mass accuracy enforcement of 7 ppm, which fits accordingly with the accuracy of the Orbitrap mass analyzer, and a product ion tolerance of 0.5 Da.
Resulting data files were filtered for FDR < 1%."
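The FDR filtering step in the quoted Methods text is typically implemented as a target-decoy filter. The sketch below is a hedged illustration of that general technique, not the authors' actual search-engine settings: walk down the score-ranked peptide-spectrum matches and stop once the estimated FDR (decoys over targets) exceeds the cutoff.

```python
def fdr_filter(psms, max_fdr=0.01):
    """psms: list of (score, is_decoy) tuples. Keep the highest-scoring
    prefix of the ranked list whose estimated FDR stays below max_fdr."""
    ranked = sorted(psms, key=lambda t: t[0], reverse=True)
    kept, decoys, targets = [], 0, 0
    for score, is_decoy in ranked:
        d = decoys + (1 if is_decoy else 0)
        t = targets + (0 if is_decoy else 1)
        if t and d / t > max_fdr:  # estimated FDR would exceed the cutoff
            break
        decoys, targets = d, t
        kept.append((score, is_decoy))
    return kept
```

For small proteins, where few peptides are observable, such a global cutoff is exactly why the additional ≥2 UTPs criterion matters.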

Page 7, bottom -this discussion of RanSEP is nice and somewhat offsets my comment #4 above, but there is still some confusion about how TPR and FPR are set and validated?
The validation section in the Results ('RanSEPs validation and method comparative') has been rewritten to include information about how we define the validation datasets (extended details in Material and Methods), the software comparison in terms of accuracy, TPR and FPR, and a final test designed to control and evaluate the impact of false positives on our predictions.

Thank you again for submitting your work to Molecular Systems Biology. We have now heard back from the referees who accepted to evaluate the revised study. As you will see, referee #2 is now positive. Reviewer #1 is, however, less supportive.
Reviewer #1 was not convinced that the ribosome profiling data provide 'unequivocal' evidence and would have preferred to see confirmation by Western blotting. We agree that having such evidence would be much better and we strongly encourage you to provide it if available. On the other hand, given the potential difficulties mentioned with regard to conducting conclusive validation by Western blot, and to unblock the situation, we would ask that you tone down the strength of the conclusions derived from the ribosomal profiling data and, indeed as suggested by reviewer #1, present RanSEPs as a 'prediction tool' rather than an 'annotation tool'. Point #2 raised by reviewer #1 should also be explicitly addressed. With regard to point #3, we would also ask you to reduce speculation as suggested by the reviewer.

REFEREE REPORTS
Reviewer #1: The identification of novel small open reading frames is essential for full understanding of the complex layers of regulation that occur within each cell, and computational approaches such as RanSEPs could have an integral role in the process of uncovering these genes. This version of the manuscript is improved, in many ways, over the previous version (the increased stringency for utilization of the mass spectrometry data removes false positives that were present before, and the ribosome profiling data serves as an independent source of data to support those hits) and represents a significant amount of work.
However, some key issues were insufficiently addressed, leaving this reviewer unconvinced of the efficacy of this method for the annotation of new SEPs. The core issue, as mentioned in the previous review, is that the mass annotation of genomes for SEPs without sufficient corroboration could do more harm than good. Thus, it was disappointing that the authors refused to experimentally verify predicted proteins by immunoblot detection of tagged proteins expressed from the chromosome, despite being able to choose from 12 different organisms in which to carry out the validation. Although RanSEPs identifies many known smORFs, the goal is to demonstrate that new genes with detectable products can be identified with this method. One reason given for not carrying out experimental validation (that detection of small proteins by immunoblot analysis has a low success rate) demonstrates a lack of understanding, both of the field and of the study that was cited. The point is that many are false predictions (coming from computational analysis, mass spectrometry and ribosome profiling). Contrary to the authors' claim, ribosome profiling data does not "unequivocally confirm the existence of a SEP". While RanSEPs may be an incredibly powerful tool for predicting possible SEPs, it does not meet the burden of proof for small protein annotation. The manuscript would be more acceptable if the authors denoted RanSEPs a predictive tool rather than an annotation tool.
Additional comments: 1. The authors' response mentions a 77% success rate for the proteins that were identified in a recent study (Van Orsdel et al. 2018) as a means of providing credibility of RanSEPs over conventional tagging. What is the success rate for the candidates not detected by western analysis?
2. An obvious route of obtaining supporting data for the predictions was the examination of available ribosome profiling data for the predicted new SEPs of E. coli (which was used as validation of the mass spectrometry hits). Was this done or is there a reason this was not carried out?
3. The "functional analysis" of the SEPs carried out by the authors consists primarily of using BLAST and Phobius. While these are both useful tools, many proteins are misannotated or annotated solely by homology -both of these lead to the propagation of incorrectly annotated functions, especially for sequences that are short and/or are poorly conserved. Additionally, the emphasis that any predicted transmembrane or secreted SEP has a role in quorum sensing or as a toxin ignores a wide range of possibilities associated with poorly conserved small proteins. The presence of "a N-terminus predicted signal peptide" cannot be equated with a role in "quorum sensing and/or signaling". The authors may want to better familiarize themselves with the literature regarding what is known about small proteins with a transmembrane domain in both bacteria and eukaryotes (for example, AcrZ, myoregulin and sarcolipin). In general, the authors should limit their speculation about possible functions.
4. The text was significantly improved by the external review by a native English scientific editor, but grammatical errors persist.

Reviewer #2: The authors have done a careful and thorough job of responding to the detailed reviewer comments. In particular, they added key new work, clarified poor quality text, expanded/clarified validation approaches, and better qualified the novelty and framework of their new tool.
While I have some remaining minor concerns about this overall approach, many of these reflect the general field rather than the authors' specific work. In general, I feel that they have now done a detailed and appropriate job in carefully and defensibly defining their approach. As such, I believe that this manuscript is now suitable for publication consideration.

We have carefully ensured that RanSEPs is presented as a prediction tool rather than an annotation tool throughout the manuscript to avoid misleading conclusions regarding its usage. Changes that reflect this are:
-'Genome annotation' in the keywords section has been replaced by 'protein prediction'.
-Fourth line, paragraph 4 of the ´Introduction´ section: 'we developed RanSEPs, a random forest-based tool for the unbiased identification of SEPs in any bacterial genome' has been changed to 'we developed RanSEPs, a random forest-based tool for the unbiased prediction of SEPs in any bacterial genome´.
-Fourth line from the end of the last paragraph in the ´Introduction´ section: 'thereby suggesting that they could play important roles in quorum sensing or signalling' has been toned down and improved with additional references:

Reviewer #1:
The identification of novel small open reading frames is essential for full understanding of the complex layers of regulation that occur within each cell, and computational approaches such as

RanSEPs could have an integral role in the process of uncovering these genes. This version of the manuscript is improved, in many ways, over the previous version (the increased stringency for
utilization of the mass spectrometry data removes false positives that were present before, and the ribosome profiling data serves as an independent source of data to support those hits) and represents a significant amount of work.
We thank the reviewer for pointing out the strengths of our approach, and we agree that the computational prediction of these proteins is an important element in the field of protein discovery. We have made a great effort to improve the quality of the manuscript and thus appreciate the reviewer noting this. We understand the reviewer's concern and would like to take advantage of this revision to further clarify our approach and the assertions made throughout the study.
First of all, we agree with the reviewer's concern that 'mass annotation of genomes for SEPs without sufficient corroboration could do more harm than good'. In the first round of revision we understood that the reviewer's concern was related to using ≥ 2 UTPs as the benchmark for validating our predictions. That is why we supported such candidates by integrating ribosome profiling and demonstrating that SEPs detected with ≥ 2UTPs also present higher coverages by ribosome profiling (Appendix Figure S5). We understand and agree that we cannot consider RanSEPs as an annotation tool but rather as a prediction tool, and we apologize if it was not correctly expressed throughout the manuscript. Thus, we have now carefully ensured that RanSEPs is presented as a prediction tool and not as an annotation tool to avoid misleading conclusions regarding the usage of RanSEPs. Changes that reflect this are: -'Genome annotation' in the keywords section has been replaced by 'protein prediction'.
-Fourth line, paragraph 4 of the ´Introduction´ section: 'we developed RanSEPs, a random forest-based tool for the unbiased identification of SEPs in any bacterial genome' has been changed to 'we developed RanSEPs, a random forest-based tool for the prediction of SEPs in any bacterial genome´.
In any case, we believe that an important application of RanSEPs could be to prioritize the candidates that should be included in future mass spectrometry searches and to guide which candidates should be further explored by immunoblotting or other techniques. This idea has now been included in the discussion of the manuscript as: 'When no experimental information is available, RanSEPs can guide the selection of candidate SEPs for validation and further characterization.'

Additional comments: As described in our previous comment, we used ribosome profiling data to further support using the ≥2 UTPs criterion when selecting positive candidates (Appendix Figure S5).
We found that SEPs predicted as positive showed significantly higher RCV levels compared with candidates predicted as negative (Mann-Whitney one-sided test p-value=1x10-7) and with ncRNAs (Mann-Whitney one-sided test p-value=1x10-4, Figure 3C).
Additionally, RanSEPs positive predictions presented RCV values closer to the scores of annotated proteins, although still significantly lower (Mann-Whitney two-sided test, p-value=1x10-10), whereas negative predictions were more similar to annotated ncRNAs (no significant differences by Mann-Whitney two-sided test, p-value=0.13).'
Although we agree with the reviewer that ribosome profiling presents false positives, we believe that overall it still supports RanSEPs predictions. To provide a more cautious view of the function estimation, we mention these concerns at the end of the second paragraph of the 'Functional assessment of novel SEPs' section in the Results:

The "functional analysis" of the SEPs carried out by the authors consists primarily of using BLAST and Phobius.
'Although we have assigned functionality to most of the predicted SEPs in the 109 genomes, one needs to be cautious as sequence homology and functional annotation of small proteins is not always reliable.'
We have also toned down the conclusions made throughout the discussion in order to provide a more cautious point of view regarding the functions and the relationship with signalling and quorum sensing. This can be seen in the penultimate paragraph of the discussion section: 'However, this analysis should be taken with caution as sequence homology and functional annotation of SEPs is challenging (VanOrsdel et al, 2018).'
We agree with the reviewer about our conclusions regarding quorum sensing and signalling for the predicted SEPs, and we have modified the manuscript to mention only the enrichment in polypeptides with at least one membrane segment, supporting the idea with two additional references. This change has been applied to the last paragraph of the 'Introduction' section.
We apologize for the grammatical errors; they could have occurred while incorporating the revisions during the previous round. We have now carefully reviewed the text to double-check and correct any remaining grammatical errors. Furthermore, prior to this latest submission, the manuscript and all the additional content were reviewed again by a native English scientific editor.

--Expanded view data sets: The authors need to provide a key or table of contents for their Data sets.
We followed the indications in the 'Author Guidelines' section of the journal webpage by including, for each expanded dataset, a first sheet with the legend of that dataset. To make exploration of these datasets easier, we have now also included an 'Index of datasets' in the Supplementary Information (see Appendix). This is indicated in the 'Data and Software availability' section.

Reviewer #2:
For animal studies, include a statement about randomization even if no randomization was used.
4.a. Were any steps taken to minimize the effects of subjective bias during group allocation and/or when assessing results (e.g. blinding of the investigator)? If yes, please describe.
Do the data meet the assumptions of the tests (e.g., normal distribution)? Describe any methods used to assess it.
Is there an estimate of variation within each group of data?
Experimental studies relied on "-omics" datasets where the whole annotation space was considered; thus we did not discriminate against any sequence. Computational analyses during validation were performed with no experimental reference and were compared afterwards to assess the results presented in the validation section of the article.
No animal studies required in this work.
Statistical tests were justified and described in this work.
We assessed the normality of the tested distributions with the Shapiro-Wilk test. When normality was satisfied, we performed a t-test to provide statistical evidence for our comparison. For non-normal distributions, we checked that the population size was >500 and relied on the central limit theorem to compare the two related distributions.
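The test-selection strategy described above can be sketched as follows. This is a minimal illustration using SciPy, not the authors' actual code: the function name, the alpha threshold, and the n > 500 cutoff for invoking the central limit theorem are taken from the description above, and the Mann-Whitney fallback for small non-normal samples mirrors the test used elsewhere in this response.

```python
import numpy as np
from scipy import stats

def compare_distributions(a, b, alpha=0.05, clt_n=500):
    """Pick a two-sample test following the strategy described above:
    Shapiro-Wilk normality check, then a t-test if both samples look
    normal; otherwise rely on large sample sizes (central limit
    theorem), falling back to Mann-Whitney for small non-normal data."""
    normal = (stats.shapiro(a).pvalue > alpha and
              stats.shapiro(b).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_ind(a, b).pvalue
    if len(a) > clt_n and len(b) > clt_n:
        # With n > 500 the sample means are approximately normal (CLT),
        # so a Welch t-test on the means remains reasonable.
        return "welch-t", stats.ttest_ind(a, b, equal_var=False).pvalue
    # Small non-normal samples: non-parametric test.
    return "mann-whitney", stats.mannwhitneyu(
        a, b, alternative="two-sided").pvalue

rng = np.random.default_rng(0)
name, p = compare_distributions(rng.normal(0, 1, 600),
                                rng.normal(0.3, 1, 600))
```

With both samples above the n > 500 cutoff, the procedure never falls through to the non-parametric branch, which matches the description in the answer above.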
We have supported the observations with the associated standard deviation, shown in the figures as error bars and shaded areas.
Please fill out these boxes ↓ (Do not worry if you cannot see all your text once you press return.) In our specific case, we study small proteins in bacterial genomes.
No animal studies required in this work.
No samples were excluded in this study.
The method presented in this work has been validated at two different steps. First, we studied the proteome of M. pneumoniae in detail and used the results to specify the parameters of our tool. This configuration was then used to explore a set of 570 real small proteins. As a control, we used a collection of ~15,000 small open reading frames with no evidence of being coding. The difference in test-set sizes is expected to prioritize the True Positive Rate over the False Positive Rate.
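Given the unbalanced validation sets above (570 positives vs ~15,000 negatives), the two rates can be made explicit. The sketch below is only illustrative: the split of the predictions (540/30 and 150/14,850) is invented for the example and does not represent RanSEPs' actual results.

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate (sensitivity) and false positive rate
    from binary labels (1 = coding SEP, 0 = non-coding smORF)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

# Set sizes from the validation described above; the prediction
# counts are purely hypothetical.
y_true = [1] * 570 + [0] * 15000
y_pred = [1] * 540 + [0] * 30 + [1] * 150 + [0] * 14850
tpr, fpr = tpr_fpr(y_true, y_pred)
```

Because negatives outnumber positives roughly 26-fold here, even a small false positive rate corresponds to many false calls in absolute terms, which is exactly the trade-off such an unbalanced control set is designed to expose.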
No animal studies required in this work.
EMBO PRESS
YOU MUST COMPLETE ALL CELLS WITH A PINK BACKGROUND ↓ PLEASE NOTE THAT THIS CHECKLIST WILL BE PUBLISHED ALONGSIDE YOUR PAPER

Reporting Checklist For Life Sciences Articles (Rev. June 2017)
This checklist is used to ensure good reporting standards and to improve the reproducibility of published results. These guidelines are consistent with the Principles and Guidelines for Reporting Preclinical Research issued by the NIH in 2014. Please follow the journal's authorship guidelines in preparing your manuscript.

A-Figures
1. Data
The data shown in figures should satisfy the following conditions:
- the data were obtained and processed according to the field's best practice and are presented to reflect the results of the experiments in an accurate and unbiased manner;
- figure panels include only data points, measurements or observations that can be compared to each other in a scientifically meaningful way;
- graphs include clearly labeled error bars for independent experiments and sample sizes; unless justified, error bars should not be shown for technical replicates;
- if n < 5, the individual data points from each experiment should be plotted and any statistical test employed should be justified;
- Source Data should be included to report the data underlying graphs. Please follow the guidelines set out in the authorship guidelines on Data Presentation.

2. Captions
Each figure caption should contain the following information, for each panel where they are relevant:
- a specification of the experimental system investigated (e.g. cell line, species name);
- the assay(s) and method(s) used to carry out the reported observations and measurements;
- an explicit mention of the biological and chemical entity(ies) that are being measured;
- an explicit mention of the biological and chemical entity(ies) that are altered/varied/perturbed in a controlled manner;
- the exact sample size (n) for each experimental group/condition, given as a number, not a range;
- a description of the sample collection allowing the reader to understand whether the samples represent technical or biological replicates (including how many animals, litters, cultures, etc.);
- a statement of how many times the experiment shown was independently replicated in the laboratory;
- definitions of statistical methods and measures; any descriptions too long for the figure legend should be included in the Methods section and/or with the source data.

In the pink boxes below, please ensure that the answers to the following questions are reported in the manuscript itself. Every question should be answered. If the question is not relevant to your research, please write NA (not applicable). We encourage you to include a specific subsection in the Methods section for statistics, reagents, animal models and human subjects.

B-Statistics and general methods