Correcting modification-mediated errors in nanopore sequencing by nucleotide demodification and reference-based correction

The accuracy of Oxford Nanopore Technology (ONT) sequencing has significantly improved thanks to new flowcells, sequencing kits, and basecalling algorithms. However, novel modification types untrained in the basecalling models can seriously reduce the quality. Here we report a set of ONT-sequenced genomes with unexpectedly low quality due to novel modification types. Demodification by whole-genome amplification significantly improved the quality but lost the epigenome. We also developed a reference-based method, Modpolish, for correcting modification-mediated errors while retaining the epigenome when a sufficient number of closely-related genomes is publicly available (default: top 20 genomes with at least 95% identity). Modpolish significantly improved the quality not only of in-house sequenced genomes but also of public datasets sequenced by R9.4 and R10.4 (simplex). Our results suggest that novel modifications are prone to ONT systematic errors. Nevertheless, these errors are correctable by nucleotide demodification or Modpolish without prior knowledge of modifications.

A) The title and abstract does not state that this is a reference-based method. If Modpolish were a de novo method, it would be of great use to the vast majority of ONT R9.4 users. However, it is not. Please add this directly in the title and abstract. This is a large difference to e.g. Racon and Medaka, which the authors sometimes compare to in their review response.
B) The abstract does not mention that the method is only useful in the cases where a high number of very closely related genomes exists in the database. Basically, limiting the scope of Modpolish to clinical samples of common pathogens. Please add the recommended number of genomes and the identity cutoffs directly in the abstract (20 genomes within 99% ANI and 90% ASI, as far as I understood from the methods).
C) The abstract mentions: "Modpolish not only significantly improved the quality of in-house sequenced genomes but also public datasets sequenced by R9.4/R10.4 flowcells". However, from figure 4 it is only R9.4 where significant improvements are seen in Q-score. Results for R10.4 (which everyone hopefully is using now) show that there is no improvement or very limited. For example, estimated from figure 4, Bacillus subtilis improves from Q43 (medaka) to Q60 (modpolish), but for R10.4, medaka and Modpolish is both at Q42, and no improvement is seen for any of the genomes. As you show on the Zymo mock data, that the improvement from ONT R9.4 to R10.4, means that there is no advantage in Modpolish anymore I think it is essential to show if this is also the case for "novel" modifications on the Listeria monocytogenes genomes. Currently the only thing your data support is that Modpolish works for R9.4 - you have not shown it for R10.4. This needs to be very clear in the abstract.
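As a reading aid for the Q-scores quoted in comment C, the sketch below converts a Q-score into an expected number of errors per genome. It assumes the Q-scores in figure 4 are Phred-scaled (Q = -10 log10 of the per-base error rate), which is the usual convention for assembly quality but is not stated in this correspondence. The 4 Mb genome length and the function names are illustrative assumptions, not values from the manuscript.

```python
def phred_to_error_rate(q):
    """Per-base error rate implied by a Phred-scaled Q-score."""
    return 10 ** (-q / 10)


def expected_errors(q, genome_length):
    """Expected number of erroneous bases in a genome of the given length."""
    return genome_length * phred_to_error_rate(q)


# Hypothetical 4 Mb genome, used only to illustrate the scale of the difference.
genome_length = 4_000_000
print(expected_errors(43, genome_length))  # about 200 errors at Q43
print(expected_errors(60, genome_length))  # about 4 errors at Q60
```

Under these assumptions, the move from Q43 to Q60 corresponds to roughly a fifty-fold reduction in errors, whereas two tools that both sit at Q42 leave the same expected number of errors.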

Minor comments:
- The authors have made a sensitivity and specificity analysis based on my comments (Table S13-15). However, I likely did not make myself clear enough on what I thought was needed. I am concerned about the cases where few related organisms are present in the database and what that has of impact. Furthermore, what impact coverage has on the ability to correct and identify the errors? For table S13-15, I would like that for each species it is explicitly stated how many closely related genomes were used for correction and what the coverage was. The coverage is only broadly stated as 10x in Table S15 - which also seems strangely low.
- From the methods, it is a little unclear what the cutoff in genomes used for correction is. Methods state that the top-20 closest related genomes are taken. However, it seems like 20 was before ANI+ASI cutoffs? In my mind, the correct cutoff has to be defined per position as some positions might not be present in all closely related genomes?
- Throughout the manuscript, please add which ONT version was used directly in the text. I.e. R9.4 or R10.4.
- From the methods and review response, it is mentioned that both HAC and SUP base-calling was used. I could not see from the figures if HAC or SUP was used. Please make sure that all your analysis only uses SUP. It adds confusion and makes no sense to compare Modpolish with HAC.
- I do not think the data supports your claim that modpolish significantly improves the accuracy over R10.4 (line 154-166) to me it looks the same at 50x? As sequencing cost is such a small part of the total price to sequence a genome I do not think that anyone would generally go for less than 100x anyway for isolate sequencing. We usually get 1000x+++ because we can not multiplex enough genomes. Please re-analyse the data and rephrase the section (also relates to Major Comment part C).
- Figure 4: It is very difficult to compare the R9.4 and R10.4 data. Please integrate the R9.4 and R10.4 results into 1 figure (Q score and mismatches separately). Furthermore, if the Q-score barplot is chosen, please compare using 50x coverage for all datasets so the data can be compared (I could not see what coverage was used for the R9.4 data currently?). Optimally I would like to see the coverage profile as shown for the R10.4 data for both R9.4 and R10.4 combined. Finally, do not show the flye error rate as it is not essential for the comparison (also relates to Major Comment part C).
- Line 184-188: I do not see Modpolish currently being useful in a metagenome context, as the vast majority of genomes recovered from metagenomes will have no or very few closely related genomes in the database. This is by far the largest challenge.
- Line 220-228: Add the limitations directly to the text. What were the default settings you implemented? It is difficult to read this from the methods (number of genomes, ANI, ASI, % consensus).
Limitations to this review: Unfortunately, I did not have sufficient time to test the software itself but I hope other reviewers had, as it is an essential part of the review. Furthermore, I'm not an expert in restriction systems and can not evaluate the scientific findings in this regard.
Reviewer #2 (Remarks to the Author): I think the authors have adequately addressed most reviewers' comments and the manuscript have improved. But some minor points raised below would need to be addressed.
Line 201 and 204: I think it is more common to express errors caused by DNA amplification as "amplification errors", rather than "polymerization errors".
Line 211-216: References supporting the descriptions should be cited.

Correcting Modification-Mediated Errors in Nanopore Sequencing by Nucleotide Demodification and Reference-Based Correction (Revision Report)
General Response: Dear editor and reviewers, we have revised the manuscript following Reviewer #1's suggestions (e.g., restricting the stated scope of Modpolish to R9.4 and updating all numbers to the latest SUP model). The remaining comments center on explicitly stating the limitations of the program, with which we agree. However, we hope the editor and reviewers can also see that this work highlights the merits of nucleotide demodification by whole-genome amplification and the novel modification system in Listeria that is untrained in the Nanopore basecalling model. Below please find point-by-point responses to Reviewer #1's comments. We hope the editor and reviewers can accept the manuscript for publication.

Reviewer #1 (Remarks to the Author): Major comments:
A) The title and abstract does not state that this is a reference-based method. If Modpolish were a de novo method, it would be of great use to the vast majority of ONT R9.4 users. However, it is not. Please add this directly in the title and abstract. This is a large difference to e.g. Racon and Medaka, which the authors sometimes compare to in their review response.
Ans: The title has been changed to "Correcting Modification-Mediated Errors in Nanopore Sequencing by Nucleotide Demodification and Reference-Based Correction" in the revised manuscript. We have also revised the abstract by adding reference-based correction in the corresponding sentences.

B) The abstract does not mention that the method is only useful in the cases where a high number of very closely related genomes exists in the database. Basically, limiting the scope of Modpolish to clinical samples of common pathogens. Please add the recommended number of genomes and the identity cutoffs directly in the abstract (20 genomes within 99% ANI and 90% ASI, as far as I understood from the methods).
Ans: We have revised the sentences in the abstract: "We developed a reference-based method, Modpolish, for correcting modification-mediated errors while retaining the epigenome when a sufficient number of closely-related genomes is publicly available (default: 20 genomes with 95% ANI)." The reviewer's original suggestion would be misleading because a hierarchical similarity estimation was implemented, i.e., 95% ANI by Mash first, then 99% ANI by FastANI, and finally 90% ASI. In practice, the majority of species can improve quality solely with related genomes at 95% ANI by Mash. The other two filters (i.e., 99% ANI by FastANI and 90% ASI) are only beneficial for some species/strains. The requirement of 99% ANI by FastANI and 90% ASI is ignored when insufficient genomes are retained, in which case all genomes exceeding 95% ANI by Mash are used. We found it difficult to explain this hierarchical filtration in the abstract, and hope the reviewer can accept this shorter version, which covers most cases. We have also revised the methods to clarify this hierarchical filtration (L355-360).
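The hierarchical filtration described in this answer can be sketched as follows. This is a minimal illustration under stated assumptions, not Modpolish's actual code: the `Candidate` record, the function name, and in particular `min_strict` (the point at which the stricter set counts as "insufficient") are hypothetical, since the response does not give that threshold. The similarity values are assumed to have been computed beforehand by Mash, FastANI, and the ASI step.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    mash_ani: float     # percent ANI estimated by Mash (precomputed)
    fastani_ani: float  # percent ANI estimated by FastANI (precomputed)
    asi: float          # percent ASI (precomputed)


def select_related(candidates, top_n=20, mash_min=95.0,
                   fastani_min=99.0, asi_min=90.0, min_strict=5):
    """Pick related genomes with a Mash -> FastANI -> ASI hierarchy.

    min_strict is a hypothetical cutoff for "insufficient genomes";
    the real tool may decide this differently.
    """
    # Level 1: genomes passing the Mash cutoff, closest first, at most top_n.
    level1 = sorted((c for c in candidates if c.mash_ani >= mash_min),
                    key=lambda c: c.mash_ani, reverse=True)[:top_n]
    # Levels 2 and 3: stricter FastANI and ASI cutoffs applied to that set.
    strict = [c for c in level1
              if c.fastani_ani >= fastani_min and c.asi >= asi_min]
    # Fallback: if too few genomes survive, ignore the stricter cutoffs.
    return strict if len(strict) >= min_strict else level1
```

With this shape, a common species with many near-identical strains ends up with the strict set, while a species with only moderately related genomes still gets the Mash-filtered set rather than nothing.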
C) The abstract mentions: "Modpolish not only significantly improved the quality of in-house sequenced genomes but also public datasets sequenced by R9.4/R10.4 flowcells". However, from figure 4 it is only R9.4 where significant improvements are seen in Q-score. Results for R10.4 (which everyone hopefully is using now) show that there is no improvement or very limited. For example, estimated from figure 4, Bacillus subtilis improves from Q43 (medaka) to Q60 (modpolish), but for R10.4, medaka and Modpolish is both at Q42, and no improvement is seen for any of the genomes. As you show on the Zymo mock data, that the improvement from ONT R9.4 to R10.4, means that there is no advantage in Modpolish anymore I think it is essential to show if this is also the case for "novel" modifications on the Listeria monocytogenes genomes. Currently the only thing your data support is that Modpolish works for R9.4 - you have not shown it for R10.4. This needs to be very clear in the abstract.
Ans: We have followed the reviewer's suggestion by emphasizing the improvement on R9.4 only in the abstract.

Minor comments:
- The authors have made a sensitivity and specificity analysis based on my comments (Table S13-15). However, I likely did not make myself clear enough on what I thought was needed. I am concerned about the cases where few related organisms are present in the database and what that has of impact. Furthermore, what impact coverage has on the ability to correct and identify the errors? For table S13-15, I would like that for each species it is explicitly stated how many closely related genomes were used for correction and what the coverage was. The coverage is only broadly stated as 10x in Table S15 - which also seems strangely low.
Ans: We have disclosed the limitation to common species in the abstract of this revision: "We developed a reference-based method, Modpolish, for correcting modification-mediated errors while retaining the epigenome when a sufficient number of closely-related genomes is publicly available."
Second, all the genomes tested in this manuscript (Tables S13-15) are common species and thus can retrieve the default 20 related genomes (>95% ANI by Mash). As we explicitly sequenced and evaluated rare bacteria in our previous publication to justify this limitation (Huang et al., 2020), we hesitate to do this again, as little improvement would be expected (please also refer to response #1 in the previous round of revision). We hope the reviewer can understand that it is extremely difficult to culture some of the rare species (e.g., anaerobic microbes).
The coverage was added in Supplementary Table S13 according to the sequencing and assembly stats in Supplementary Tables S2 and S4.
Supplementary Table S13
The two public datasets, Supplementary Tables S14-15, are metagenomic sequencing data of the ZymoBIOMICS Microbial Community Standard, so the coverage of individual microbes in the mixed community is difficult to assess. We hope the reviewer can leave this part unspecified.
- From the methods, it is a little unclear what the cutoff in genomes used for correction is. Methods state that the top-20 closest related genomes are taken. However, it seems like 20 was before ANI+ASI cutoffs? In my mind, the correct cutoff has to be defined per position as some positions might not be present in all closely related genomes?
Ans: Please refer to the response to comment #1. A multi-level similarity estimation was implemented, i.e., 95% ANI by Mash first, then 99% ANI by FastANI, and finally 90% ASI. The 20 genomes refer to those selected at 95% ANI by Mash.
The number of genomes may vary at different loci due to structural variations (SVs) (e.g., transposon-like IS elements). In practice, it is hard to optimize this parameter at per-base resolution because the distribution of SVs also varies among the related genomes. We tested this parameter at the early development stage (from 10 to 30) at the whole-genome scale and found that ~20 can roughly handle both common and rare species. In fact, this default number (20) is much higher than needed, as only a few truly closely-related strains are sufficient for correction in practice.
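To make the per-locus variation concrete, the sketch below counts how many related genomes actually cover each position of a draft assembly. It is a toy illustration only and does not reflect how Modpolish stores alignments: it assumes each related genome has already been aligned to the draft and is given as a string of equal length, with `-` marking positions the genome does not cover (for example, inside a structural variation).

```python
def support_per_position(aligned_refs, gap="-"):
    """Count, per draft position, how many related genomes have a base there."""
    if not aligned_refs:
        return []
    length = len(aligned_refs[0])
    if any(len(row) != length for row in aligned_refs):
        raise ValueError("all aligned rows must have the same length")
    # A genome supports a position only if it is not a gap at that column.
    return [sum(row[i] != gap for row in aligned_refs) for i in range(length)]


# Three hypothetical related genomes aligned to an 8-base draft segment.
refs = ["ACGT--AC",
        "ACGTTTAC",
        "AC--TTAC"]
print(support_per_position(refs))  # [3, 3, 2, 2, 2, 2, 3, 3]
```

Even with a fixed number of selected genomes, the count available at a given position can be lower, which is the effect the reviewer's per-position question and the answer above both refer to.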
- Throughout the manuscript, please add which ONT version was used directly in the text. I.e. R9.4 or R10.4.
Ans: Done.
- From the methods and review response, it is mentioned that both HAC and SUP base-calling was used. I could not see from the figures if HAC or SUP was used. Please make sure that all your analysis only uses SUP. It adds confusion and makes no sense to compare Modpolish with HAC.
Ans: We have updated all the numbers to the SUP model (sequenced by ourselves) in the revised manuscript.
- I do not think the data supports your claim that modpolish significantly improves the accuracy over R10.4 (line 154-166) to me it looks the same at 50x? As sequencing cost is such a small part of the total price to sequence a genome I do not think that anyone would generally go for less than 100x anyway for isolate sequencing. We usually get 1000x+++ because we can not multiplex enough genomes. Please re-analyse the data and rephrase the section (also relates to Major Comment part C).
Ans: We are sorry for not clarifying the simplex vs duplex modes. We have revised the description of R10.4 (duplex mode) to state no improvement in the revised manuscript (Figure 4(c)(d)). In the simplex R10.4 dataset, marginal improvement can still be seen (Supplementary Figure S11).
- Figure 4: It is very difficult to compare the R9.4 and R10.4 data. Please integrate the R9.4 and R10.4 results into 1 figure (Q score and mismatches separately). Furthermore, if the Q-score bar-plot is chosen, please compare using 50x coverage for all datasets so the data can be compared (I could not see what coverage was used for the R9.4 data currently?). Optimally I would like to see the coverage profile as shown for the R10.4 data for both R9.4 and R10.4 combined. Finally, do not show the flye error rate as it is not essential for the comparison (also relates to Major Comment part C).
Ans: We have updated Figure 4 as suggested. We have clarified the sentences to indicate that there is no improvement in the duplex mode and that marginal improvement can be observed in the simplex mode: "Therefore, the qualities of ONT R10.4 flowcells, in particular the duplex mode, are not only higher than those of R9.4 but also require nearly no further correction by Modpolish. In the simplex mode, marginal improvement can be seen."
- Line 184-188: I do not see Modpolish currently being useful in a metagenome context, as the vast majority of genomes recovered from metagenomes will have no or very few closely related genomes in the database. This is by far the largest challenge.
Ans: We have rephrased the sentence to convey the intended meaning: "Hence, these untrained modification-mediated errors are better removed by WGA demodification or Modpolish (viable only when large contigs can be obtained)." We fully understand the limitations of Modpolish and would like to note that WGA demodification is another option suggested in the manuscript.
- Line 220-228: Add the limitations directly to the text. What were the default settings you implemented? It is difficult to read this from the methods (number of genomes, ANI, ASI, % consensus).
Ans: See the response to comment #1. We have emphasized these limitations in the abstract.
Response to Reviewer 2
I think the authors have adequately addressed most reviewers' comments and the manuscript have improved. But some minor points raised below would need to be addressed.
Line 201 and 204: I think it is more common to express errors caused by DNA amplification as "amplification errors", rather than "polymerization errors".
Ans: Done.
Line 211-216: References supporting the descriptions should be cited.
Ans: Done.