Yersiniomics, a Multi-Omics Interactive Database for Yersinia Species

ABSTRACT The genus Yersinia includes a large variety of nonpathogenic and life-threatening pathogenic bacteria, which cause a broad spectrum of diseases in humans and animals, such as plague, enteritis, Far East scarlet-like fever (FESLF), and enteric redmouth disease. Like most clinically relevant microorganisms, Yersinia spp. are currently subjected to intense multi-omics investigations whose numbers have increased extensively in recent years, generating massive amounts of data useful for diagnostic and therapeutic developments. The lack of a simple and centralized way to exploit these data led us to design Yersiniomics, a web-based platform allowing straightforward analysis of Yersinia omics data. Yersiniomics contains a curated multi-omics database at its core, gathering 200 genomic, 317 transcriptomic, and 62 proteomic data sets for Yersinia species. It integrates genomic, transcriptomic, and proteomic browsers, a genome viewer, and a heatmap viewer to navigate within genomes and experimental conditions. For streamlined access to structural and functional properties, it directly links each gene to GenBank, the Kyoto Encyclopedia of Genes and Genomes (KEGG), UniProt, InterPro, IntAct, and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) and each experiment to Gene Expression Omnibus (GEO), the European Nucleotide Archive (ENA), or the Proteomics Identifications Database (PRIDE). Yersiniomics provides a powerful tool for microbiologists to assist with investigations ranging from specific gene studies to systems biology studies.

IMPORTANCE The expanding genus Yersinia is composed of multiple nonpathogenic species and a few pathogenic species, including the deadly etiologic agent of plague, Yersinia pestis. In 2 decades, the number of genomic, transcriptomic, and proteomic studies on Yersinia grew massively, delivering a wealth of data. We developed Yersiniomics, an interactive web-based platform, to centralize and analyze omics data sets on Yersinia species. The platform allows user-friendly navigation between genomic data, expression data, and experimental conditions. Yersiniomics will be a valuable tool to microbiologists.

The authors combined literature data and their own data into a multi-omics database for Yersinia species. I can only applaud this effort, as the transcriptomic data accumulated in this area over the past two decades are quite a mess. Nevertheless, the authors created a workable and useful interactive database in which the genomic and phylogenetic data are integrated with transcriptomics and proteomics. It has a user-friendly interface and is relatively easy to follow. I am confident that this will be a good tool for the Yersinia community and for others from different areas of bacterial pathogenesis. The planned inclusion in the database of genome-wide yeast two-hybrid interactome data, small RNAs, transcriptional start sites, and riboswitches, as well as the definition of mutant phenotypes (p. 15, lines 336-341), is an appropriate direction for further expansion.
Nevertheless, a critical feature is missing from this database. The omics data should ideally be integrated with biochemical pathways, for example, through the KEGG Pathways database. Genes expressed under certain conditions could then be linked directly to the pathways, which would make the data set truly interactive. This will be particularly important for the proposed definition of mutant phenotypes. The lack of plans to expand the database to biochemical pathways significantly reduced the reviewer's enthusiasm for the future of this project.
The minor comments are:
1. p. 6, lines 116-120. To address the phylogeny, the authors used their 500-gene cgMLST scheme. For Y. pestis, I suggest additionally designating the phylogenetic branch of each strain in the database based on SNP analysis, as this is widely used these days to assess the phylogenetic relatedness of this pathogen. This should appear on the phylogenetic tree displayed in the left panel of Figure 3.

In this manuscript, Le Bury et al. have compiled most of the publicly available processed or unprocessed genomics, transcriptomics, and proteomics data of Yersinia species in an interactive, user-friendly database called Yersiniomics. The database is built on BacNet, which was previously used for Listeria. The authors have spent a tremendous amount of effort, and I assume time, to build the database. I believe Yersiniomics provides a great resource and tool for many researchers in the field of infection biology, especially for those working with different species and even strains of Yersinia as model organisms to study bacterial infections. I appreciated the design of the database, which allows cross-comparisons of species/strains and of different omics data sets. For example, I highly appreciated that the authors linked the homologous genes in different species/strains, which makes it possible to trace the gene sequence, transcript level, protein level, and differential expression at the transcript and protein levels across strains and species. Moreover, the interactive nature of the database allows user-defined settings for certain tools embedded in the database. Additionally, the inclusion of both new and old locus tags for each gene helps users combine the information from this database with other databases that use either the old or the new locus tags. I appreciate that the authors were aware of this confusion and recorded all this information in one place. Finally, the authors indicated that novel data types such as yeast two-hybrid interactome data and small RNA data will be embedded in Yersiniomics, which will give more depth to the database.
Even though I am very much impressed by the idea of constructing Yersiniomics and by the well-thought-out details of its design, I have some concerns. They relate to the content and analysis of the transcriptomics data and to the usage of the database, and I list them below.

The content of the transcriptomics data
The authors claimed that they have used all Yersinia omics data published to date. They retrieved transcriptomics data for 251 biological conditions, of which 151 were originally generated with microarrays and retrieved from GEO and 100 were originally generated with RNA-seq and retrieved from ENA. I wonder if the authors are aware of SRA in GEO, which today contains 644 biosamples (biological conditions) associated with Yersinia and generated with RNA-seq. Why did the authors not retrieve these data? They should include these data in Yersiniomics as well. If not, they should provide strong justification for not doing so.

The analysis of transcriptomics data
• The authors used RPKM values instead of TPM values as the normalized expression level. I would like to know why they preferred RPKM. This could be addressed in the Discussion section.
• They generated a co-expression network from RPKM values using the Pearson correlation coefficient via the BacNet platform. Why did they prefer this method when well-established co-expression network construction methods such as WGCNA and ICA exist? Did the authors compare these methods?
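
To make the two points above concrete, here is a minimal Python sketch. It is not the authors' pipeline or the BacNet implementation. It assumes per-sample RPKM values are available as a dictionary, uses hypothetical gene names (geneA, geneB, geneC), and picks an arbitrary correlation threshold of 0.9. It relies on the identity that, within one sample, TPM is the RPKM value rescaled so that all genes sum to one million, and it builds co-expression edges from pairwise Pearson correlations.

```python
from math import sqrt


def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million (TPM)."""
    total = sum(rpkm.values())
    if total == 0:
        raise ValueError("all RPKM values are zero")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    genes = sorted(profiles)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(profiles[g1], profiles[g2])
            if abs(r) >= threshold:
                edges.append((g1, g2, r))
    return edges


# Hypothetical single-sample RPKM values.
sample_rpkm = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
sample_tpm = rpkm_to_tpm(sample_rpkm)

# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = coexpression_edges(profiles)
print(sample_tpm)
print(edges)
```

Because TPM values sum to the same total in every sample, they are easier to compare across samples than RPKM values, which is presumably the motivation for the question. The conversion is a per-sample rescaling, so the Pearson correlation of two genes across conditions can differ depending on which unit is used. Methods such as WGCNA add soft thresholding and module detection on top of a correlation matrix of this kind.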

The usage of the database
In the Genomics browser
• The number of replicons is shown as the number of chromosomes. There should either be separate columns for chromosomes and plasmids or a single 'Number of chromosomes/plasmids' column in which the numbers are shown as, for example, 1/3 (1 chromosome and 3 plasmids).
• For many strains, the number of genes, the number of proteins, the species and strain names, and the CladeID are missing. Why are they missing while the numbers of CDSs, rRNAs, and tRNAs are consistently present for all of them?
• When browsing the genome of a particular species, 'Download gene selection as a table' generated an empty txt file named 'Listeria Genomic Table' even though multiple genes were selected. This should be corrected.
• I could run the Synteny function only once, and only for a Yersinia pestis strain. If possible, it should also work for Y. pseudotuberculosis and Y. enterocolitica. Does the webpage work equally well on Windows and macOS?
In the Transcriptomics browser
• The 'Strain array' column is used even for RNA-seq data. Rows with RNA-seq data should have an empty cell in this column or indicate 'Not applicable'.
• Neither the transcriptomics heatmap nor the main text mentions which statistical analysis was performed to establish the significance of the differential expression. Did the authors apply a p-value or adjusted p-value cutoff? If yes, they should state it in the main text; if not, they should discuss why not.
• When visualizing transcriptomics data sets in the Genome viewer and using AddTranscriptomics data, the webpage gives an error and does not allow the addition.
• When visualizing transcriptomics data sets in the Genome viewer, the webpage does not allow switching from Absolute expression to Relative expression data.
Yersiniomics wiki
• Access Yersiniomics wiki directs users to Listeriomics. This should be corrected.
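
On the question above about a p-value versus an adjusted p-value cutoff for differential expression, the sketch below shows one common choice, the Benjamini-Hochberg adjustment. It is an illustration only: the manuscript does not state which correction, if any, was applied, and the gene names, p-values, and 0.05 cutoff are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
genes = ["geneA", "geneB", "geneC", "geneD"]
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)

cutoff = 0.05  # assumed threshold, for illustration only
significant_raw = [g for g, p in zip(genes, raw) if p <= cutoff]
significant_adj = [g for g, q in zip(genes, adj) if q <= cutoff]
print(significant_raw)  # ['geneA', 'geneB', 'geneC']
print(significant_adj)  # ['geneA']
```

In this toy example, three genes pass a raw p-value cutoff of 0.05 but only one passes after adjustment. This is why the choice of cutoff should be stated alongside the heatmap.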
Lines 82-86: This sentence should be rewritten, as it suggests that only Illumina produces short reads and only PacBio produces long reads. Other technologies also produce short and long reads.
Lines 355-356: Did the authors specifically download only 'Illumina reads'? If not, they should use 'sequencing reads' instead.
Line 353: 'formated' should be 'formatted'.

Preparing Revision Guidelines
To submit your modified manuscript, log onto the eJP submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin the revision process. The information that you entered when you first submitted the paper will be displayed. Please update the information as necessary. Here are a few examples of required updates that authors must address:
• Point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Each figure must be uploaded as a separate file, and any multipanel figures must be assembled into one file.
For complete guidelines on revision requirements, please see the journal Submission and Review Process requirements at https://journals.asm.org/journal/Spectrum/submission-review-process. Submissions of a paper that does not conform to Microbiology Spectrum guidelines will delay acceptance of your manuscript.
Please return the manuscript within 60 days; if you cannot complete the modification within this time period, please contact me. If you do not wish to modify the manuscript and prefer to submit it to another journal, please notify me of your decision immediately so that the manuscript may be formally withdrawn from consideration by Microbiology Spectrum.
Thank you for submitting your paper to Microbiology Spectrum.

Reviewer #1
The authors combined literature data and their own data into a multi-omics database for Yersinia species. I can only applaud this effort, as the transcriptomic data accumulated in this area over the past two decades are quite a mess. Nevertheless, the authors created a workable and useful interactive database in which the genomic and phylogenetic data are integrated with transcriptomics and proteomics. It has a user-friendly interface and is relatively easy to follow. I am confident that this will be a good tool for the Yersinia community and for others from different areas of bacterial pathogenesis. The planned inclusion in the database of genome-wide yeast two-hybrid interactome data, small RNAs, transcriptional start sites, and riboswitches, as well as the definition of mutant phenotypes (p. 15, lines 336-341), is an appropriate direction for further expansion.
We thank the reviewer for highlighting the value of our database.
Nevertheless, a critical feature is missing from this database. The omics data should ideally be integrated with biochemical pathways, for example, through the KEGG Pathways database. Genes expressed under certain conditions could then be linked directly to the pathways, which would make the data set truly interactive. This will be particularly important for the proposed definition of mutant phenotypes. The lack of plans to expand the database to biochemical pathways significantly reduced the reviewer's enthusiasm for the future of this project.
We thank the reviewer for this suggestion. For most reference strains, we have now added tabs in the gene viewer with dynamic links to KEGG, UniProt, InterPro, IntAct, and STRING (p. 13, lines 278-288 of the final manuscript in PDF format). These tabs allow users to browse these databases directly inside Yersiniomics and are automatically updated to the entry for the selected gene. These new functionalities will help decipher gene functions by cross-referencing known structural or functional data and pathways with experimental results obtained in Yersinia.
1. p. 6, lines 116-120. To address the phylogeny, the authors used their 500-gene cgMLST scheme. For Y. pestis, I suggest additionally designating the phylogenetic branch of each strain in the database based on SNP analysis, as this is widely used these days to assess the phylogenetic relatedness of this pathogen. This should appear on the phylogenetic tree displayed in the left panel of Figure 3.
Following the reviewer's suggestion, we added "lineage" and "sublineage" columns in the genomics browser (p.7 line 137), in which lineages were determined by cgMLST and Yersinia pestis sublineages were determined by SNP analysis (p.7 lines 137-139).