Strain-Level Dynamics Reveal Regulatory Roles in Atopic Eczema by Gut Bacterial Phages

ABSTRACT The vast population of bacterial phages or viruses (virome) plays pivotal roles in the ecology of human microbial flora and health conditions. Obstacles, including poor viral sequence inference, strain-sensitive virus-host relationship, and the high diversity among individuals, hinder the in-depth understanding of the human virome. We conducted longitudinal studies of the virome based on constructing a high-quality personal reference metagenome (PRM). By applying long-read sequencing for representative samples, we could build a PRM of high continuity that allows accurate annotation and abundance estimation of viruses and bacterial species in all samples of the same individual by aligning short sequencing reads to the PRM. We applied this approach to a series of fecal samples collected for 6 months from a 2-year-old boy who had experienced a 2-month flare-up of atopic eczema (dermatitis) in this period. We identified 31 viral strains in the patient’s gut microbiota and deciphered their strain-level relationship to their bacterial hosts. Among them, a lytic crAssphage developed into a dozen substrains and coordinated downregulation in the catabolism of aromatic amino acids (AAAs) in their host bacteria which govern the production of immune-active AAA derivates. The metabolic alterations confirmed based on metabolomic assays cooccurred with symptom remission. Our PRM-based analysis provides an easy approach for deciphering the dynamics of the strain-level human gut virome in the context of entire microbiota. Close temporal correlations among virome alteration, microbial metabolism, and disease remission suggest a potential mechanism for how bacterial phages in microbiota are intimately related to human health. IMPORTANCE The vast populations of viruses or bacteriophages in human gut flora remain mysterious. However, poor annotation and abundance estimation remain obstacles to strain-level analysis and clarification of their roles in microbiome ecology and metabolism associated with human health and diseases. We demonstrate that a personal reference metagenome (PRM)-based approach provides strain-level resolution for analyzing the gut microbiota-associated virome. When applying such an approach to longitudinal samples collected from a 2-year-old boy who has experienced a 2-month flare-up of atopic eczema, we observed thriving substrains of a lytic crAssphage, showing temporal correlation with downregulated catabolism of aromatic amino acids, lower production of immune-active metabolites, and remission of the disease. The PRM-based approach is practical and powerful for strain-centric analysis of the human gut virome, and the underlying mechanism of how strain-level virome dynamics affect disease deserves further investigation.

The raw sequencing data should be published and is noted as having been submitted to the GSA database to be released on publication.
Reviewer #1 (Comments for the Author): The authors present a longitudinal case study of the fecal metagenome from a 2-year-old patient before, during, and after onset of atopic eczema. The authors assemble and annotate a personalized, de novo metagenome for the subject and use this personalized metagenome to assess longitudinal patterns of differential abundance of phages/prophages, their associate bacteria, and metabolites. Overall, the study presents data that support a hypothesis that a strain-specific interaction between phages and their host bacteria could be related to atopic eczema etiology and severity of symptoms. This hypothesis is then further evaluated using data from another, previously published longitudinal study. While the data from the case study is interesting and valuable, I list below several concerns with the manuscript as written.
Major comments: 1. The authors claim that the "Personal Reference Metagenomes (PRM)" methodology is a new approach. However, previously published, similar work has been done in the context of Metagenome-Assembled Genomes (MAGs) and MAG binning (Stewart et al 2018, "Assembly of 913 microbial genomes...", Nat Commun; Stewart et al. 2019, "MAGpy..." Bioinformatics). Given that the importance section in this manuscript focuses on the PRM approach, the existing area of bioinformatics methods should be addressed, and discussion should be included on how the PRM approach improves upon published methodology. 2. Additional discussion should be included on the limitations of results derived from a single subject in this work. While case studies from individual patients can be valuable, more discussion of the limitations of the findings is warranted than what is included currently in the manuscript.
Minor comments: 1. (Lines 118-125) "To construct PRM for individual..." and (Lines 137-139) "The final PRM..." -The authors claim that long read assemblies are satisfactory for accurate viral genome identification and taxonomic assignment but no numerical or summary data are provided to support the claim. 2. (Lines 156-159) "When mapping the short reads..." -The authors claim that an average short-read mapping rate indicates "high quality for our PRM assembly" that enables "exact estimation of the relative abundance of all viruses in our samples." It's not clear that a high mapping rate indicates high quality assemblies, particularly if the short read data were used in the construction of the assemblies; a different metric may be appropriate here. It's unclear what is meant by "exact estimation" when the data have been normalized by total sum scaling and/or  "Interestingly, such strain-level..." -The conclusion that the data display "a complex ecological relationship between viruses and their host" is overstated based on a sample size of one. 4. (Line 200, section title) "Upsurge of crAssphage substrains correlates..." -"Correlates" should be supported by a statistical analysis or measurement of correlation in the context of a longitudinal study; perhaps a different term would be appropriate here, since the authors analyzed differential abundance and not, for example, correlation with another measure like EASI, as was done for AAA metabolite concentration later in this section. 5. P-values, summary metrics, and statistical testing results listed in figures, tables, or figure/table captions (including supplementary material) should be reported in the main body text at the appropriate location if used to support statements made in the text. 6. Minor but noticeable grammatical and spelling errors occur throughout the work.
Reviewer #2 (Comments for the Author): 1. There are numerous grammatical and spelling errors that should be addressed before publication. 2. I find that the construction of the Personal Reference Metagenomes (PRM) could be a source of bias. The PRM were obtained using only two of the samples in the cohort and used to estimate genome abundance for every sample. Therefore, there could be viral and/or prokaryotic populations in the other samples that are not represented in the PRM, thus confounding the abundance results. An alternative way to do this, which would not require generating more sequence data, would be to assemble genomes from all of the samples to create an assembly for each sample. Then use the assembled contigs from each sample to produce the PRM in the same way that is described in the manuscript. I recognize that the lack of nanopore reads from the other samples in the cohort would make obtaining complete genomes from the other samples more difficult, but it may reveal populations that were missed using the current approach. 3. Performing metagenomic binning on the contigs obtained from each sample using a tool such as metabat2, concoct, or maxbin could yield more complete PRM. 4. Using a marker gene based taxonomic annotation tool, such as GTDBTK, could be a better option than using Kraken2, which assings taxonomic labels based on kmer similarities. 5. Including completion and contamination metrics using checkM and checkV on the PRM could provide higher confidence in the analysis. 6. Adding more specifics on how the abundance was estimated to the methods section would increase confidence in the results of the analysis. The figure axis are labelled as RPKM, but the methods section does not thoroughly explain how the abundance values were obtained. Additionally, explaining why RPKM was used could benefit the analysis. Generally, TPM is the favored normalization technique when comparing abundances between samples.

Preparing Revision Guidelines
To submit your modified manuscript, log onto the eJP submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin the revision process. The information that you entered when you first submitted the paper will be displayed. Please update the information as necessary. Here are a few examples of required updates that authors must address: • Point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER. • Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file. • Each figure must be uploaded as a separate file, and any multipanel figures must be assembled into one file. For complete guidelines on revision requirements, please see the journal Submission and Review Process requirements at https://journals.asm.org/journal/Spectrum/submission-review-process. Submissions of a paper that does not conform to Microbiology Spectrum guidelines will delay acceptance of your manuscript. " Please return the manuscript within 60 days; if you cannot complete the modification within this time period, please contact me. If you do not wish to modify the manuscript and prefer to submit it to another journal, please notify me of your decision immediately so that the manuscript may be formally withdrawn from consideration by Microbiology Spectrum.
If your manuscript is accepted for publication, you will be contacted separately about payment when the proofs are issued; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published. For a complete list of Publication Fees, including supplemental material costs, please visit our website.
The authors present a longitudinal case study of the fecal metagenome from a 2-year-old patient before, during, and after onset of atopic eczema. The authors assemble and annotate a personalized, de novo metagenome for the subject and use this personalized metagenome to assess longitudinal patterns of differential abundance of phages/prophages, their associate bacteria, and metabolites. Overall, the study presents data that support a hypothesis that a strain-specific interaction between phages and their host bacteria could be related to atopic eczema etiology and severity of symptoms. This hypothesis is then further evaluated using data from another, previously published longitudinal study. While the data from the case study is interesting and valuable, I list below several concerns with the manuscript as written.
Major comments: 1. The authors claim that the "Personal Reference Metagenomes (PRM)" methodology is a new approach. However, previously published, similar work has been done in the context of Metagenome-Assembled Genomes (MAGs) and MAG binning (Stewart et al 2018, "Assembly of 913 microbial genomes…", Nat Commun; Stewart et al. 2019, "MAGpy…" Bioinformatics). Given that the importance section in this manuscript focuses on the PRM approach, the existing area of bioinformatics methods should be addressed, and discussion should be included on how the PRM approach improves upon published methodology. 2. Additional discussion should be included on the limitations of results derived from a single subject in this work. While case studies from individual patients can be valuable, more discussion of the limitations of the findings is warranted than what is included currently in the manuscript.
Minor comments: 1. (Lines 118-125) "To construct PRM for individual…" and (Lines 137-139) "The final PRM…" -The authors claim that long read assemblies are satisfactory for accurate viral genome identification and taxonomic assignment but no numerical or summary data are provided to support the claim. 2. (Lines 156-159) "When mapping the short reads…" -The authors claim that an average shortread mapping rate indicates "high quality for our PRM assembly" that enables "exact estimation of the relative abundance of all viruses in our samples." It's not clear that a high mapping rate indicates high quality assemblies, particularly if the short read data were used in the construction of the assemblies; a different metric may be appropriate here. It's unclear what is meant by "exact estimation" when the data have been normalized by total sum scaling and/or RPKM. 3. (Lines 171-196) "Interestingly, such strain-level…" -The conclusion that the data display "a complex ecological relationship between viruses and their host" is overstated based on a sample size of one. 4. (Line 200, section title) "Upsurge of crAssphage substrains correlates…" -"Correlates" should be supported by a statistical analysis or measurement of correlation in the context of a longitudinal study; perhaps a different term would be appropriate here, since the authors analyzed differential abundance and not, for example, correlation with another measure like EASI, as was done for AAA metabolite concentration later in this section. 5. P-values, summary metrics, and statistical testing results listed in figures, tables, or figure/table captions (including supplementary material) should be reported in the main body text at the appropriate location if used to support statements made in the text. 6. Minor but noticeable grammatical and spelling errors occur throughout the work.

Point-by-point response to the reviewers
Our response is in blue.

Reviewer comments:
Reviewer #1 (Public repository details (Required)): The raw sequencing data should be published and is noted as having been submitted to the GSA database to be released on publication.
Response: Thanks for the suggestion. The raw sequence data reported in our manuscript have been deposited in a publicly accessible repository, the Genome Sequence Archive in National Genomics Data Center, China National Center for Bioinformation https://ngdc.cncb.ac.cn/gsa (GSA: CRA003594). To facilitate the manuscript review, you can access the data from the temporary link: https://ngdc.cncb.ac.cn/gsa/s/Jd2K2S7i before Mar 17. We hope to keep the data private until acceptance. We have noted it in the "Data Availability" Section of the revised manuscript.

Reviewer #1 (Comments for the Author):
The authors present a longitudinal case study of the fecal metagenome from a 2- year-old patient before, during, and after onset of atopic eczema. The authors assemble and annotate a personalized, de novo metagenome for the subject and use this personalized metagenome to assess longitudinal patterns of differential abundance of phages/prophages, their associate bacteria, and metabolites. Overall, the study presents data that support a hypothesis that a strain-specific interaction between phages and their host bacteria could be related to atopic eczema etiology and severity of symptoms. This hypothesis is then further evaluated using data from another, previously published longitudinal study. While the data from the case study is interesting and valuable, I list below several concerns with the manuscript as written.
Major comments: Response: Thank you for your insightful comment. In this work, to disentangle the role of gut virome in eczema and its relationship to gut microbes, we have to calculate the relative abundance of each virus to the entire microbiome besides the virome, for which a non-redundant reference is required through read-mapping.
Actually, before the PRM based on the long-read sequencing approach, we tried the binning strategy to obtain assembled the reference genomes of both bacteria and viruses, i.e., Metagenome-Assembled Genomes (MAGs) reconstructions from shotgun metagenomic sequence data. However, the results are not satisfying for our particular sample set, i.e., dynamic samples from a single patient. The binning process lost many sequences even though we pooled all samples together,    Table S5). The final MAG-based PRM also lost many sequences even after pooling bins from all the samples due to the trimmed contigs in the binning process, which leads to a significantly lower mapping rate of 85.55±0.04%, indicating its poorer representativeness." 2. (Lines 156-159) "When mapping the short reads..." -The authors claim that an average short-read mapping rate indicates "high quality for our PRM assembly" that enables "exact estimation of the relative abundance of all viruses in our samples." It's not clear that a high mapping rate indicates high quality assemblies, particularly if the short read data were used in the construction of the assemblies; a different metric may be appropriate here. It's unclear what is meant by "exact estimation" when the data have been normalized by total sum scaling and/or RPKM.
Response: Thank you for the insightful comments. We agree that the wording in these sentences is not accurate. We mean that our PRM can represent the microbial DNA in the sample very well when the mapping rate reaches as high as 99%. As for the abundance, we want to highlight that the relative abundance of each virus in the whole microbial flora than in the virome can be more accurately evaluated by mapping the reads to the reference than marker-gene-based abundance evaluation, especially for the relative abundance between viruses and bacteria. We have rephrased these sentences in the revised manuscript to be more precise and accurate as follows. Response: Thank you for your valuable comment and suggestion. We complement the correlation analysis in the revised version. Table R2 and Figure R1 show the correlations between phages abundance and EASI score, evaluated with a Spearman's rank correlation test (|ρ|>0.5 and P <0.01). The lytic crAssphage_99kb and circular_phage_158kb show the strongest negative correlations to the EASI score, whereas a variety of prophages exhibit strong positive correlations. We have added this table and figure to the revised manuscript. Response: Thank you for the suggestion. We complemented them in revised manuscript as you suggested.
6. Minor but noticeable grammatical and spelling errors occur throughout the work.
Response: Thank you for the critique. We have combed through any grammar and spelling errors in the revised manuscript and corrected them. We are sorry for these mistakes.
Reviewer #2 (Comments for the Author): 1. There are numerous grammatical and spelling errors that should be addressed before publication.
Response: Thank you for the critique. We have combed through any grammar and spelling errors in the revised manuscript and corrected them. We are sorry for these mistakes.
2. I find that the construction of the Personal Reference Metagenomes (PRM) could be a source of bias. The PRM were obtained using only two of the samples in the cohort and used to estimate genome abundance for every sample. Therefore, there could be viral and/or prokaryotic populations in the other samples that are not represented in the PRM, thus confounding the abundance results. An alternative way to do this, which would not require generating more sequence data, would be to assemble genomes from all of the samples to create an assembly for each sample.
Then use the assembled contigs from each sample to produce the PRM in the same way that is described in the manuscript. I recognize that the lack of nanopore reads from the other samples in the cohort would make obtaining complete genomes from the other samples more difficult, but it may reveal populations that were missed using the current approach.
Response: Thank you for the insightful comment. In most metagenomic studies, which are cross-sectional, reference genomes assembled from one sample cannot represent that of others due to the great diversity between individuals. However, in our study, all the samples were from a single patient within a short period, and the Pooling all samples is a common approach in assembling metagenome but inappropriate in our case because pooling samples from the same subject can result in overwhelming redundancy, which is not easy to remove completely using the cdhit tool. The redundant sequences containing the same repeating segments might undervalue the abundance.
3. Performing metagenomic binning on the contigs obtained from each sample using a tool such as metabat2, concoct, or maxbin could yield more complete PRM.
Response: Thank you for the valuable suggestion. In this work, we focused on the relative abundance of each virus to the entire microbiome instead of that to the virome, for which a non-redundant reference is required through read-mapping.
Actually, before the PRM based on the long-read sequencing approach, we tried the binning strategy to obtain the assembled reference genomes of both bacteria and viruses, i.e., Metagenome-Assembled Genomes (MAGs) reconstructions from shotgun metagenomic sequence data. However, we found that the binning process not only lost many sequences, including the crAssphage, even though we pooled all samples together, but also led to considerable redundancy in the reference.
Although  We performed MAG binning and taxonomical classification on the assembly of 24 samples' shotgun sequencing data as follows: Filtered Illumina Hiseq reads of each sample were assembled using metaSPAdes. For each sample, contigs longer than 5k are clustered based on sequence using CD-HIT-EST (v4.8.1) with '-c 0.95'.
The comparison shows that the PRM annotated more viruses than the binned MAGs after pooling all 24 samples, especially for the complete and circular virus, which depends on the length of contigs. Furthermore, the crAssphage we focused on was missed from the MAG-based PRM, and the lower mapping rate to it indicated a considerable proportion of sample DNA material was not represented in reference.
Therefore, we think the PRM we proposed is more appropriate for dynamic samples from a single subject. And we added this comparison in the revised manuscript as follows.
Lines 168-178, "Metagenome-Assembled Genomes (MAGs) reconstructions from shotgun metagenomic sequence data is another approach to obtaining the PRM that needs no extra long-read sequencing (24-26). Here, we also tried this strategy by binning the short-read assembled contigs of each sample using MetaWRAP. Then we pooled all the bins from the 24 samples and removed redundancy using dRep to obtain the MAG-based PRM. Compared with our long-read-based PRM, the MAG-based PRM performs worse in virus identification, especially for the circular or high-quality genomes, due to the relatively short length of its contigs (Supplementary Table S5). The final MAG-based PRM also lost many sequences even after pooling bins from all the samples due to the trimmed contigs in the binning process, which leads to a significantly lower mapping rate of 85.55±0.04%, indicating its poorer representativeness." 4. Using a marker gene based taxonomic annotation tool, such as GTDBTK, could be a better option than using Kraken2, which assings taxonomic labels based on kmer similarities. Response: Thank you for the valuable suggestion. We assessed the completeness and contamination of virus contigs/genomes using checkV. The quality evaluation of most viruses mentioned in this study, as shown in Table R5, is consistent with that evaluated using VIBRANT, and these results were supplemented in the revised manuscript.
As the bacterial contigs are not binned in the PRM, we do not perform CheckM to assess their completeness and contamination. Cj, the number of reads mapping to gene/fragment j; Lj, the length of gene/fragment j; N, total contigs in PRM.
The "relative abundance" of each bacterial species and virus in the original manuscript is calculated as follows, and we add this section in Methods of the revised manuscript. Your manuscript has been accepted, and I am forwarding it to the ASM Journals Department for publication. However, there are some comments from the editor that you must address before final publication, please make the necessary adjustments.
(1) There are still some minor grammatical errors that should be addressed throughout.
For instance, Lines 290 -294. Here, the authors tried to validate their results by analyzing previously published metagenomic datasets. Therefore, the sentences should be written in past tense or part participle. line 293 -what does sub-optimally mean here? I don't think it is necessary. The sentence can begin with "Two of the studies ....." line 294 -collected not collect line 294 -only one "had" not has line 295 -the phrase "which metagenomic and metadata are fully released" is not clear. Are there some data that was "partially" released? I will re-write the sentence to say -" Two of the studies collected cross-sectional cohorts, and only one (45) had longitudinal metagenomic and metadata for gut samples from each subject.
You will be notified when your proofs are ready to be viewed.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.