Antibiotic resistance prediction for Mycobacterium tuberculosis from genome sequence data with Mykrobe

Two billion people are infected with Mycobacterium tuberculosis, leading to 10 million new cases of active tuberculosis and 1.5 million deaths annually. Universal access to drug susceptibility testing (DST) has become a World Health Organization priority. We previously developed a software tool, Mykrobe predictor, which provided offline species identification and drug resistance predictions for M. tuberculosis from whole genome sequencing (WGS) data. Performance was insufficient to support the use of WGS as an alternative to conventional phenotype-based DST, due to mutation catalogue limitations. Here we present a new tool, Mykrobe, which provides the same functionality based on a new software implementation. Improvements include i) an updated mutation catalogue giving greater sensitivity to detect pyrazinamide resistance, ii) support for user-defined resistance catalogues, iii) improved identification of non-tuberculous mycobacterial species, and iv) an updated statistical model for Oxford Nanopore Technologies sequencing data. Mykrobe is released under MIT license at https://github.com/mykrobe-tools/mykrobe. We incorporate mutation catalogues from the CRyPTIC consortium et al. (2018) and from Walker et al. (2015), and make improvements based on performance on an initial set of 3206 and an independent set of 5845 M. tuberculosis Illumina sequences. To give estimates of error rates, we use a prospectively collected dataset of 4362 M. tuberculosis isolates. Using culture based DST as the reference, we estimate Mykrobe to be 100%, 95%, 82%, 99% sensitive and 99%, 100%, 99%, 99% specific for rifampicin, isoniazid, pyrazinamide and ethambutol resistance prediction respectively. We benchmark against four other tools on 10207 (=5845+4362) samples, and also show that Mykrobe gives concordant results with nanopore data. We measure the ability of Mykrobe-based DST to guide personalized therapeutic regimen design in the context of complex drug susceptibility profiles, showing 94% concordance of implied regimen with that driven by phenotypic DST, higher than all other benchmarked tools.


Introduction
The software tool Mykrobe predictor 1 , released in 2015, identified isolates to the species level and predicted the drug susceptibility testing (DST) profile of Staphylococcus aureus and Mycobacterium tuberculosis directly from genomic sequencing data. Mykrobe predictor was developed to address four needs. First, it provided robust genotyping of single nucleotide polymorphisms (SNPs), insertions and deletions (indel) and gene alleles, independently of the location of mutations (where "location" is used both in the sense of coordinate, and in the sense of chromosomal versus plasmid). It also did this with error rates that did not depend on the phylogenetic position of the isolate of interest, a characteristic of all reference-genome methods 2 . We did this by building a de Bruijn "genome graph" representation of known variation in the species, incorporating both resistant and susceptible alleles, and also nearby variation that might affect k-mer matching (equivalent to neighbouring mutations affecting a PCR probe (see Figure 1 of 1)). Second, it identified target species, with prior knowledge of likely contaminants, and prevented misdiagnosis of resistance due to shared elements. Third, it predicted phenotype from genotype, including from low (within-isolate) frequency alleles. Fourth, it integrated these functionalities in a fast, lightweight, internet-free and user-friendly platform.
In that initial publication 1 , for M. tuberculosis, the drug resistance conferring mutation panel was based on first and second generation line probe assays (LPA) (HAIN Lifesciences, Nehren, Germany) resulting in low sensitivity but high specificity. We also showed that detecting minor populations of resistant alleles improved the ability to predict resistance for some second-line drugs (amikacin, capreomycin, ofloxacin), although the low number of samples with associated second-line phenotypic data left this analysis underpowered, and needing replication.
There were a number of improvements we wished to make over the 2015 Mykrobe predictor codebase. First, we wanted a cleaner codebase. Second we wanted users to be able to specify their own resistance catalogue. The latest Mykrobe predictor catalogue, updated since publication, was based on Walker et al. 3 . Walker et al. derived this catalogue by analysing the same 3500 samples which were used in the initial Mykrobe predictor performance analysis 1 -thus no independent test or validation dataset were available to properly estimate error rates. Finally, we wanted to further develop the support for nanopore data which had been improved by the use of a slightly different statistical model 4 , but only tested on N = 5 replicates of a single strain of M. bovis substrain BCG.
A number of other tools for detection of resistance-associated alleles have been released: ABRicate (Seemann), ARIBA 5 , ARGs-OAP 6 , ARG-ANNOT 7 , CASTB 8 , KvarQ 9 , MTBseq 10 , PhyResSe 11 , RAST 12 , ResFinder 13 , RGI 14 , SRST2 15 , SSTAR 16 , and TB-Profiler 17 . The majority look for gene presence, primarily in enterobacteriaceae, but some also look for SNP and indel mutations. In particular, SRST2 which maps reads to a panel of alleles (tested on S. aureus, Streptococcus pneumoniae, Salmonella enterica serovar Typhimurium, Shigella sonnei, Enterococcus faecium, Listeria monocytogenes and Klebsiella pneumoniae), and ARIBA which performs local assembly (tested on E. faecium, S. sonnei and Neisseria gonorrhoeae). In the specific case of M. tuberculosis, where there is very little recombination, resistance prediction is dominated by the ability to correctly genotype SNPs and indels. Various tools have been developed to achieve this. These include the web-tool PhyResSE 11 , KvarQ 9 which uses k-mers, and TB-Profiler 17 which uses mapping; earlier versions were benchmarked in 18. Another tool, MTBseq 10 is a general mapping-based WGS pipeline for M. tuberculosis which can in particular be used to predict resistance. Our expectation was that for M. tuberculosis, the resistance panel would be the key determinant of sensitivity and specificity, as was previously demonstrated for S. aureus 19 . We confirm this below.
In a recent study on resistance to first-line tuberculosis (TB) drugs (rifampicin, isoniazid, ethambutol and pyrazinamide) 20 the authors show, for the first time, that sequencing-based prediction of susceptibility is accurate enough for clinical use. They classify resistance-associated gene mutations as i) resistance-causing mutations, ii) definitely not resistance-causing mutations, or iii) unknown significance mutations. Their resistance mutation catalogue included variants from a systematic review from Miotto et al. 21 and some new pncA mutations from 22. It also characterised all frame-shifts in pncA and katG as causing resistance to pyrazinamide or isoniazid respectively. Using phenotyping as gold standard, they measured performance characteristics of WGS-based DST, including sensitivity and specificity, on the samples for which a prediction can be made with high confidence. They propose a workflow where a sample is sequenced, and if an "unknown significance" mutation is seen, then no resistance/susceptibility prediction is made, but the sample is instead referred for phenotyping. This provides a means whereby healthcare systems can leverage the advantages of WGS for rapid and comprehensive DST, be informed when a given sample is pushing against the limits of our knowledge, and revert to slower but well-understood and trusted phenotyping in order to provide clinicians with reliable results, and prospectively enrich available databases.
In this study we evaluate our new lightweight offline species-identification and resistance prediction tool, named simply Mykrobe 23 . The species-identification methodology is unchanged since 1 , but after publication Mykrobe predictor was heavily evaluated by Public Health England, and in order to pass their acceptance criteria, we improved our species-informative probes. We report the results of applying these prospectively at PHE below, but this study primarily concerns improvements in resistance prediction. We first combine the resistance catalogue from The CRyPTIC Consortium et al. 20 with the pre-existing second-line drug resistance catalogue from Walker et al. 3 . We treat this catalogue as our first candidate catalogue and improve it in an iterative process, evaluating it on successive datasets.
We benchmark Mykrobe against a range of other tools: ARIBA (using the same catalogue as Mykrobe, as a control), and also KvarQ, MTBseq, and TB-profiler using their inbuilt catalogues. In contrast to the approach introduced in 20, Mykrobe has no concept of "unknown significance", predicting susceptibility if there is no known resistance mutation. We quantify the trade-offs between these approaches.
The World Health Organisation (WHO) publishes guidelines for TB treatment including drug susceptible, single drug-resistant, multidrug-resistant (MDR-TB) and extensively drug-resistant TB (XDR-TB) 24-27 . These guidelines are regularly reviewed and updated to appropriately reflect, and adapt to, the evolution of M. tuberculosis resistance and the availability of novel TB drugs, whether new or re-purposed. Based on latest WHO recommendations, we simulate the ability of Mykrobe to guide personalized therapeutic regimen design and compare the Mykrobe-inferred regimen against that implied by phenotypic testing. We find that Mykrobe compares favourably with the gold-standard approach and is superior to the other tools we benchmark. We believe that this is a novel and important metric for the evaluation of an in silico DST tool.
One ongoing challenge for the development of in silico DST, is that only limited phenotype and sequence data are shared openly for reuse and validation. We address both of these problems in this paper. First, we make available all our underlying data both in public and in easily-reusable text format. Second, we chose to publish in this journal specifically because we can update the paper as the catalogue is modified and when competitor tools are updated, to give current results pointing to the latest underlying data.

Species identification
Mykrobe predictor was evaluated extensively by Public Health England (PHE) for species identification of Mycobacteria in 2016 and a need to improve on the identification of non-tuberculous Mycobacteria (NTMs) was identified. As a result, a new set of probes was built by augmenting the initial training set to 1018 samples, building a pooled de Bruijn graph, and searching for species informative contigs (method described in 1) to use as probes. These were evaluated in a PHE laboratory where all mycobacterial samples were prospectively sequenced over the course of one year, and the results published 28 : of 1902 samples for which Line Probe Assay (LPA) testing had identified a clinically important mycobacterium and whole genome sequencing data was also available, 1825 (96%) were correctly identified to the species level by Mykrobe (treating the LPA as gold standard). Mykrobe 23 incorrectly identified 33 isolates as a different species within the same complex (6 MTBC, 3 M. abscessus complex, 7 M. avium complex, and 17 M. fortuitum complex isolates). We have adopted the same probes in Mykrobe, and no further evaluation has been done in this study.
Drug resistance prediction Data. We re-use the sequencing and phenotype data from 1,3 and 20; the latter included only first-line phenotypes, so in this study we augment that data with second-line phenotype data. We also remove in advance those samples which were identified as likely sample-swaps in 20. We split the samples into three disjoint datasets: • Training set: N = 3206 isolates, from 1 • Prospective set: N = 4362 isolates from 20; specifically these were the only isolates that were sampled prospectively (from Italy, Germany, the Netherlands and the UK) to enable realistic error-estimates.
See Figure 1 and Source data sample_data.tsv 29 for a per-country breakdown of the datasets. Resistance prediction performance on Global dataset. We show in Figure 3 a comparison of three sets of predictions on the Global dataset: those from 20, those from Mykrobe using the panel ("Walker-2015") used by our previous software Mykrobe predictor, and those from Mykrobe using CP2 (complete results are in Extended data file accuracy_stats.tsv 33 ). The greatest difference in performance based on Walker-2015 and CP2 is in pyrazinamide, where sensitivity increases from 25% in Walker-2015 to 74% in CP2, at the cost of specificity falling from 98% to 94%. Numerous rare variants were the source of reduced pyrazinamide specificity in CP2, most of which also cause true-positives (Extended data file variant_counts.tsv 34 ). However, 21 of these variants resulted in 29 false-positives and no true-positives and were therefore removed from CP2 to make CP3. We did however notice that of these 21, 16 were in the resistance catalogue from 22, which provides direct experimental evidence for their impact, and so we may reinstate them in future given further data.
The difference between Walker-2015 (86% sensitivity, 89% specificity) and CP2 (84% sensitivity, 89% specificity) for ethambutol is explained by two variants in the embB gene. The first site, M306, is included in all panels. However, in Walker-2015 it is present as any non-synonymous change, whereas in NEJM-2018 (and hence CP1 and CP2) only the more specific M306I and M306V are included. The variant M306L (in all samples   found as ATG changed to CTG) causes 27 correct resistant calls, but also 14 false-positives. The second site is Q497K, which is in both Walker-2015 and CP1, causes 15 true-positive and 5 false-positive calls on the Global dataset, but was excluded from CP2 because its only contribution was 3 false-positives on the Training dataset.
In terms of bottom-line statistics on this, our largest dataset containing the widest range of resistance alleles, Very Major Error rate (VME, missed resistance) for the first-line drugs was 4.9%, 6.2%, 14.7%, 26.0% for rifampicin, isoniazid, ethambutol, pyrazinamide respectively, with corresponding negative predictive values of 95.7%, 93.6%, 93.8%, 90.1%. The Major Error rate (false resistance prediction) was: 2.0%, 2.0%, 11.0%, 6.9% respectively. These results can be seen in Table 1b. We discuss below how the approach taken in 20 improved on this VME despite using essentially the same catalogue.
Resistance prediction performance on Prospective dataset Different resistance mutations occur at varying frequencies across the world. Therefore, in order for an error-rate estimate to be generalisable into clinical practise, the underlying data needs to be sampled in a way that is representative of prevalence somewhere. We used the Prospective dataset to get realistic estimates of error rates in the (low burden) countries from which these data were sampled. We show in Figure 4 a comparison of the predictions for that dataset from 20, Mykrobe using the old Mykrobe predictor panel (Mykrobe-2015) 3 , and Mykrobe with CP3. Sensitivity, specificity, and error rates are shown in Table 1c. Very Major Error rate (missed resistance) for the first-line drugs was 0.0%, 5.2%, 1.4%, 18.4% for rifampicin, isoniazid, ethambutol, pyrazinamide respectively, with corresponding negative predictive values of 100.0%, 99.6%, 100.0%, 99.4%. Thus, in this prevalence setting, Mykrobe predicting susceptibility to the four first-line drugs would meet clinical requirements. The Major Error rate (false resistance prediction) was also low: 0.8%/0.5%/1.1%/0.6% respectively.
We noted that ciprofloxacin sensitivity was lower for CP3, compared to Walker-2015. This was caused by the (standard line probe assay) variant D94A in the gyrA gene having been removed from CP2 to make CP3, because it caused one false-positive and no true-positive calls on the Global data set. However, on the Prospective data set its inclusion causes four true-positives and no false-positives.  Table 2.
Performance evaluation on Oxford Nanopore Technologies data We evaluated Mykrobe performance on five samples where we had both Oxford Nanopore Technologies (ONT) and Illumina sequence data. The resistance calls made by Mykrobe on the Illumina and ONT data were in complete agreement, with the same variant calls made on each sample and therefore the same resistance profiles. Three of the samples were susceptible to all drugs (ERS3036287, ERS3036289, and ERS3036290), one sample (ERS3036288) was resistant to isoniazid, and the fifth sample (ERS3036286) was resistant to seven of the 11 drugs called by Mykrobe (Source data ont.tsv) 37 .

Minor alleles improve prediction
Phenotypic tests by the indirect proportion method are defined in such a way as to call an isolate resistant if more than some proportion P of bacilli are resistant 38 -typically P = 0.01. Whole genome sequencing to a depth of 30-50x after several weeks of culture is by comparison relatively insensitive to minor populations. If we exclude for a moment the issues of clonal interference between resistant alleles, and selection due to culture, any single minor resistant allele detectable by sequencing should easily be detectable by phenotyping. This is essentially the motivation for Mykrobe using minor alleles to predict resistance. To do this, instead of specifying a minimum frequency threshold to detect minor alleles, Mykrobe performs a likelihood comparison between two statistical models: first, coverage on minor allele due to sequencing error (at rate around 0.01) and second, coverage on minor allele due to a minor population at frequency 0.2. In practise, with coverage around 30-50x, this results in a minor population being detected when has frequency above around 8-10%.
Having made minor changes to the catalogue from CP1 to CP2, we measured performance on the Global dataset, to finalise catalog and parameters (described below) before evaluating on the (final) Prospective dataset. The one remaining parameter that could be changed was the frequency of minor allele in the second model above, set to 0.2. We measured the impact of resistance calls driven by minor alleles using CP2 on the Global set -results are shown in Table 3. These results are broadly in concordance with those reported in 1 on the Training set. For second line drugs, the use of minor variants increases sensitivity without impacting false positive rate. In fact for amikacin, kanamycin, ofloxacin and streptomycin there were no false positive calls at all, and the proportion of resistance which was explained by minor alleles varied from around 1%-10%. For rifampicin and isoniazid, although the net contribution was positive, the error rate was slightly higher (8.3% and 6.7% of minor-allele-driven resistance calls were false). For ethambutol and pyrazinamide, however, although there were 29 and 45 true resistant calls made due to minor alleles, the proportion of false positive minor-driven resistant calls was even higher: 38% and 15%, respectively. If we compare the error rates in minor-allele-driven resistance calls for first-line drugs (6.7%, 8.3%, 38.3, 15.0% for rifampicin, isoniazid, ethambutol, pyrazinamide, see Table 3) with the overall rates driven primarily by major-allele calls (2.0%, 2.0%, 11.0%, 6.9%, see Table 1b) we see the relative differences between drugs are preserved -indeed the minor-allele-driven rates are all 3-4x the overall rates. Thus, the catalogue drives some or much of the between-drug differences. There are several potential explanations for the higher error rates for minor-allele calls. First, mutations that cause a low-level increase in MIC (minimum inhibitory concentration) may not cause a resistant phenotype if in a minor population -this is particularly likely for ethambutol where most mutations only increase the MIC slightly. Second, for pyrazinamide the phenotypic test is defined differently to most other drugs -the critical proportion P at which resistance is detected is 0.1 rather than 0.01 as for other drugs 38 . In other words, the pyrazinamide phenotypic test is less sensitive to low frequency populations than those for other drugs. In contrast to the above, for second-line drugs the pattern is different, with lower error rates for minordriven calls. The number of samples with second-line phenotypes is too low to draw strong conclusions. We would like to see these results replicated with comprehensive phenotyping and deep sequencing.
In theory the Mykrobe statistical test described above could have been modified to use a second model with higher minor-allele frequency than the current value of 0.2 -essentially demanding a higher frequency to reduce false positives. Indeed, it might be possible to fit different thresholds per drug, improving results on our dataset. However, we had concerns about generalisability. A recent publication 39 showed how sequencing the same isolate with Illumina HiSeq, MiSeq and NextSeq could give different pictures of minor allele variation, including very different minor allele frequencies. We therefore elected to leave the model unchanged.

Evaluating policy of ignoring unknown mutations in resistance genes
A key component of the approach taken by the CRyPTIC consortium in 20 was to refuse to make predictions if an unknown mutation was detected in a resistance-associated gene, and divert such samples to phenotyping. By contrast, Mykrobe genotypes known SNPs and indels, but does not detect novel mutations, and makes predictions of phenotypes based on these data only. In particular, if no resistance mutation from its catalogue is detected, Mykrobe makes a Susceptible prediction. Our prior expectation was that this would come at the cost of a higher false susceptible rate for Mykrobe. This was indeed the case -see the red components of the bars on the right hand side of Figure 3. However, in addition, on both the Global and Prospective datasets, we find that Mykrobe calls more true positives and negatives (i.e. correct resistant and susceptible calls) -see the grey segments of bars on Figure 3 (in particular on the left) where the CRyPTIC approach avoids making a call. The results (see Table 4) are consistent across the four first-line drugs: on the Global set Mykrobe misses around 60-170 resistant samples that the CRyPTIC process theoretically detects by reverting to phenotyping (with the caveat that phenotypic assays are themselves not 100% reliable). Although this is accompanied by more rapid detection of 100-300 more susceptible samples (which CRyPTIC detects later, by phenotyping), this provides strong motivation to add the ability to detect novel variants to Mykrobe in future releases. We note that this functionality is also missing from the other tools we benchmarked against, and that all of them could add this. Finally, we recall again that the global set is a heterogeneous combination of datasets sampled in different ways. The prospective set is prospectively sampled, though from low-burden countries in Europe. On that dataset we find Benchmarking against other tools In order to separate the effect of mutation catalogue and genotyping method, we gave ARIBA the same catalogue (namely the final release panel described above) as Mykrobe (with small differences, see Methods for details). For the other tools, we simply used their default catalogues in their latest versions at the time of benchmarking. The sensitivity, specificity, and error rates for Mykrobe, ARIBA, KvarQ, MTBseq, and TB-Profiler when applied to the Prospective dataset, summarised across all drugs, is given in Table 5 (full data are in the Extended data file accuracy_stats.tsv 33 ).
On each of the three datasets, Mykrobe is the most sensitive, has the lowest very major error rate (false susceptible error rate), and the highest negative predictive value. Mykrobe has the best specificity, major error rate (false resistant rate) and positive predictive value on the Training dataset, and the best or second-best results for those statistics on the Global and Prospective datasets.
A breakdown of the results on the Prospective dataset, showing each drug separately, is given in Figure 5.
Mykrobe and ARIBA, as expected, are very similar except for the high false-positive rate for isoniazid for ARIBA, resulting in a PPV of 82% for ARIBA compared to 94% for Mykrobe. Manual inspection revealed that the extra incorrect calls were caused by assembly errors in the katG gene, resulting in false-positive calls because the gene appeared to be incomplete. The memory usage and run time averaged across all runs of each tool is given in Table 6, and in Extended data files we have: summary boxplots in run_time_boxplots.pdf 40

Impact of Mykrobe on initial choice of WHO-recommended regimen
The World Health Organisation (WHO) now recommends systematic access to drug susceptibility testing and specific TB treatment regimens based on individual drug resistance profiles. We therefore ask, if we were to base the choice of a patient's initial multi-drug therapy purely on the genotyping results from Mykrobe, how accurate would this choice of treatment regimen be? We encoded the most recently published WHO TB treatment recommendations (https://github.com/iqbal-lab-org/tb-amr-benchmarking/blob/master/ python/evalrescallers/who_treatment.py), and used this to infer both the recommended drug regimen implied by the known phenotype of each sample, and that implied by Mykrobe's resistance prediction (on the combined global and prospective datasets). The most recent WHO recommended regimens for MDR-TB involve a number of drugs for which phenotyping is rarely performed and for which the genetic basis of drug resistance is only partially understood (e.g. bedaquiline, linezolid). In these cases, drug administration and implementation of WHO recommendations in clinical contexts most often relies on the assumption of drug susceptibility. Given that, and the scarcity of available genotypic to phenotypic correlation data for those drugs, all isolates were treated as susceptible to bedaquiline and linezolid in this analysis. Also, given the absence of recent updated (2018) WHO recommendations for pre-XDR and XDR TB, this analysis model relies on previous recommendations 24, 25 which therefore limits the clinical significance of the analysis for the limited number of extensively resistant TB isolates. It also highlights the need for data, new guidance and subsequent update of the Mykrobe performance analysis for those most resistant isolates. Figure 6 presents comprehensive resistance profiles, associated WHO-recommended regimen and a comparison of phenotype-and Mykrobe-driven therapies for all isolates. We immediately observe that for 98.6% (= 4842/(4842 + 69)) of phenotypically pan-susceptible isolates, Mykrobe would correctly imply first-line treatment (regimen 1 in the figure). We excluded those pan-susceptible isolates from the figure for increased clarity. Full results are available in Extended data file who_regimen_counts.tsv 43 .

Key error modes for Regimens 2-9
The most frequent clinically significant error mode was wrongly recommending a pan-susceptible TB regimen for mono-isoniazid resistant isolates, which occurred in 53 (12%) of the 451 mono-resistant (regimen 2) isolates. We describe below how all other benchmarked tools shared this error mode. Figure 6 also illustrates other, rarer, cases of non-MDR-TB and non-XDR-TB resistant isolates (regimens 3 to 9) for which a disagreement between phenotype-driven and Mykrobe-driven regimens results from undetected resistance (lines moving upwards as we go from left to right) or false-resistant genotyping results (line moving downwards as we go from left to right). Of the 187 isolates assigned to regimens 3-9 using phenotype data, Mykrobe suggested a pan-susceptible regimen (regimen 1) in 45 cases, and a RR-, MDR-or XDR-TB regimens (10, 11 or 12, respectively) in 3, 16 or 7 cases, respectively.

RR-TB, MDR-TB and pre-XDR-TB regimens. A significant number of discrepancies between phenotypedriven and
Mykrobe-driven regimens revolve around regimen 10 (RR-TB), regimen 11 (the most recent WHOrecommended MDR-TB regimen), and regimen 12 (pre-XDR-TB) (see Figure 6 for regimen details). Regimens 10 and 11 only differ in that regimen 10 includes isoniazid, and is recommended if the isolate is known to be susceptible to that drug, or if the phenotype is unknown. Conversely, when comprehensive DST results are available, if they confirm a resistance profile including resistance to isoniazid, rifampicin a fluoroquinolone and/or an injectable aminoglycoside, a customized regimen including as many drugs to which the isolate is confirmed to be susceptible as possible is recommended. Given the limited number of XDR-TB isolates in our data, this analysis focused on quinolone resistant pre-XDR samples (regimen 12) for which we had sufficient data to recognise appropriate or erroneous treatment patterns. There were 2903 rifampicin resistant M. tuberculosis isolates for which first-line drug phenotyping justified initiation of RR-TB (N = 253), MDR-TB (N = 2619), or pre-XDR-TB (N = 31) treatment regimen. Mykrobe assigned 96 of the 253 phenotype-inferred RR-TB (regimen 10) isolates to that same regimen (8 of these had an unknown isoniazid phenotype). False-negative rifampicin calls by Mykrobe in 33 of the 253 (=13.0%) isolates resulted in a predicted pan-susceptible treatment (regimen 1). 54 of the 253 isolates were classified as MDR-TB (regimen 11) by Mykrobe, comprising 35 isolates with unknown isoniazid phenotype and 19 false-positive isoniazid calls. The remaining 70 isolates all had unknown phenotypes for isoniazid and moxifloxacin, whereas Mykrobe predicted resistance to these drugs and assigned them to XDR-TB regimen 12.
There were 2619 isolates classified as MDR-TB (regimen 11) from the phenotype data. The majority of isolates (1727/2619=65.9%) are appropriately assigned by Mykrobe to the MDR-TB regimen. A minority of isolates (103/2619=3.9%) are falsely identified as being less resistant and consequently directed (upwards in the figure) towards regimens 1-9. False-negative isoniazid calls resulted in 92 of the 2619  isolates (=3.5%) to be wrongly allocated to regimen 10 (RR-TB). The remaining 692 of the 2619 isolates (=26.4%) were identified as resistant to moxifloxacin, classifying them as regimen 12. Moxifloxacin phenotypic susceptibility testing results for 689 of these isolates was unknown, whereas Mykrobe determined them to be resistant and so directed the samples to a pre-XDR-TB regimen instead of an MDR-TB regimen. We can estimate what proportion of these 692 samples were correctly assigned to an XDR-TB regimen instead of to MDR-TB using the positive predictive value estimated from those samples with phenotypes (see Table 2). We estimate that for 644/692 (=93%) samples, the Mykrobe-driven regimen choice was accurate. We emphasise that this model relies on WHO recommendations which are expected to be updated in the specific case of pre-XDR-TB and XDR-TB.
Comparison with other tools. We find Mykrobe resulted in the correct regimen choice for 93.9% of samples using the latest panel, a 1.2% improvement over Mykrobe with the 2015 panel, and a 1% improvement over ARIBA using the same 2019 panel. We counted a tool as correct for RR-TB and MDR-TB isolates where phenotype data for isoniazid and/or moxifloxacin resistance was unknown, but the tool assigned MDR-TB or XDR-TB because of resistance calls for one or both of those drugs. All benchmarked tools performed reasonably similarly, with the worst performance by TB-profiler (89% correct). Per-tool accuracy of regimen choice is shown in Table 7, and Table 8, with the best-performing tool for each regimen highlighted. For 4 of the 11 regimens (numbers 1 (pan-susceptible), 2 (INH mono-resistant), 8 (pyrazinamide-mono resistant) and 10 (rifampicin-mono resistant)), Mykrobe achieves the highest success rate. In particular, the INH-mono-resistant case, highlighted above as a concern for Mykrobe, is also an issue for other tools.

Discussion
Although sequencing-based diagnostic information for M. tuberculosis has been available for some years 1,9,11,17,44 , the results have not been sufficiently good to justify replacing (or partially replacing) phenotyping. However, very recently the CRyPTIC consortium showed that for first-line drugs, a prediction of susceptibility was now   sufficiently accurate to support therapeutic management in routine clinical settings 20 . Indeed this has been implemented now in Public Health England, RIVM (Netherlands) and the Wadsworth Centre (New York). In this study we introduce our reimplemented resistance prediction tool, Mykrobe, with initial mutation catalogue taken from that study 20 for first-line drugs, and 3 for second-line drugs. We evaluate Mykrobe in a number of ways. We show that the new catalogue does indeed give improved results over that from 3 used in our old implementation Mykrobe predictor, in particular increasing power to detect pyrazinamide resistance from 25% to 74% (global set, containing 1014 resistant samples) and 72% to 82% (prospective set, containing 125 resistant samples). This is achieved by considering all frameshifts of length 1 or 2 in pncA as causing pyrazinamide resistance, and comes at the cost of a increased major error rate (rising from 2.2% on the global set for the previous catalogue based on 3 to 5.7% for current Mykrobe (although there was no increase in major error rate in the prospective set, changing from 0.7% to 0.5%)). Given the poor performance of pyrazinamide phenotypic testing, some proportion of the discrepancies are probably due to phenotypic error 45,46 .
A key component in the approach of 20 was to flag novel mutations in resistance-associated genes, and refuse to make a prediction in those cases. In a putative clinical setting this would delay getting (conservative, correct) results until phenotypic DST was completed. By contrast Mykrobe does not do this, and predicts susceptibility when there is no mutation from the catalogue. We are able to compare and contrast these approaches, as we are using the same data, and conclude that the novel-mutation-aware approach is preferable to that used by Mykrobe (and, as far as we can tell, all other benchmarked tools). Essentially, novel (off-catalogue) mutations are an effective flag for very major errors (missed resistance). Our feeling is that the cost (delays for susceptible samples containing novel mutations) is outweighed by the benefit (avoiding some false susceptible calls). Modifying Mykrobe to follow that policy is therefore something we would like to do in the future.
Of the other tools we tested, MTBSeq and TB-profiler would most easily be able to adopt this approach, as they both already do full SNP discovery for every sample.
In general Mykrobe outperforms all the other benchmarked tools, having the best sensitivity and best or second-best specificity on each of the three datasets. Mykrobe supports error-prone ONT reads, and we show that ONT and Illumina data produce identical results, although we would wish to have tested on more than five samples. Motivated by the fact that low frequency resistant populations will sweep to fixation under drug pressure, and by the results in the Mykrobe predictor paper 1 , Mykrobe predicts resistance if it detects known resistance alleles at low frequencies (in practise above 10%, for illumina data only). For isoniazid, rifampicin and the second line drugs, this results in detection of more resistant isolates, at limited cost in false positives (see Table 3). However for ethambutol and pyrazinamide the proportion of these extra calls that are false is unacceptably high (38% and 15%, respectively). Possible reasons for different performance for these drugs are given above; future versions of Mykrobe are likely to exclude minor alleles for these drugs.
We introduce the idea of measuring how sequencing-based diagnostic for TB drug resistance impact potential regimen choice, and include a powerful way to visualise it. The idea is explicitly not to suggest this could directly advise physicians, but instead to provide a different metric for clinically important errors. Our analysis has limitations. First, we have only partial phenotyping data for second line TB drugs, and in fact none for the new and repurposed TB drugs. Second, WHO guidelines are ever evolving and may differ from clinical expert opinions when it comes to the choice of personalized therapeutic regimens. Nevertheless, we can clearly see that the richer resistance profile from WGS data leads to more differentiated choices of regimen. We look forward to impending large and fully phenotyped datasets, for example from the CRyPTIC consortium, where tens of thousands of global samples will have quantitative phenotype data for 14 drugs. In addition, CRyPTIC is set up to deliver improved resistance catalogues, which we intend to incorporate into Mykrobe.
The pipelines for running the analyses in this paper are fully automated, all the way to generating figures, tables of results and Extended data, and we intend to update the paper as results improve (and as competitor tools are updated; a future update will benchmark the latest TB-profiler release, which appeared during the preparation of this paper). Our recommendations for in silico DST tools such as Mykrobe reflect our own learnings: use minor populations to improve predictions, flag novel mutations in resistance genes, and allow user-defined catalogues to facilitate comparisons and testing.

Conclusions
Antibiotic resistance prediction is of critical importance for the roll-out of sequencing-based diagnostics for TB. We have demonstrated here our new tool Mykrobe, supporting both nanopore and Illumina data. Mykrobe provides simple, automated and lightweight results which we evaluate thoroughly on over 10,000 isolates. We find that Mykrobe outperforms other tools, and (by implementing our catalogue in a second tool, ARIBA) confirm that the primary determinant of success is the resistance catalogue. We find that considering minor populations, and flagging novel (off catalogue) mutations in resistance genes improve results. Finally, we introduce a new metric to highlight errors which impact initial choice of therapeutic regimen, achieving an accuracy of 95%.
The field has further to go before we can say we are fully fit for purpose, especially for the second-line, novel and repurposed drugs. A genomic resistance predictor must by necessity be a living and growing project, and our intention is that this document should also be updated in parallel, providing evidence and also open underlying data. We will be updating it to include benchmarking of future panels of Mykrobe, and updated versions of other tools.

Implementation
Species identification. A hierarchy of sequence probes are used to identify species and "phylogroup" (in this case Non Tuberculous Mycobacteria (NTM)) or Mycobacterium tuberculosis Complex (MTBC)). This is done by building a de Bruijn graph of a training set of samples of known species, identifying which species each unitig (maximal non-branching path) is seen in, and then selecting probes (unitigs) which are highly differentiated between species, and between phylogroups. The method was previously described in Bradley et al. 1 .

Lineage identification with MTBC.
Mykrobe currently uses the lineage-informative SNPs from 47 to assign lineage within the MTBC.

New statistical model for genotyping.
In order to use a de Bruijn graph to genotype a catalogue of variants (SNPs and indels) we first convert the list of variant sites into a set of sequence 'probes' of length 2k − 1.
If the resistance mutations are defined in amino acid space, we first convert these into a set of possible codon changes in DNA space. In the simple case, we then represent these alleles as two short sequences (of length 2k − 1) -one "reference" or "susceptible" allele, and the other "alternate" or "resistant" allele. However, if there is another SNP or indel within k bases, any sample with this variant would have different susceptible or resistant allele k-mers. We therefore build equivalent susceptible or resistant alleles for each variant within k bases of our SNP of interest. These lead to at least two alleles for each SNP, and several more if there is background variation within k bases. To construct a catalogue of these background variants, we used all SNPs and indels called in 200 samples selected at random from the Training set.
From this set of probes, we construct our reference de Bruijn graph. Given an input FASTQ file, the de Bruijn graph is created, and intersected with this reference graph. For each probe allele we calculate the proportion of k-mers with non-zero k-mer coverage, and the median k-mer depth on each probe. For all mutations in the panel we iterate through all possible nucleotide changes that would generate the specified amino acid change (if appropriate), and find the resistant allele and susceptible allele with the highest coverage. We then compare three competing k-mer producing Poisson models: homozygous reference ('0/0'), heterozygous ('0/1') (frequency=50%), and homozygous alternate ('1/1') specified as follows 0/0: where KmerCount() is a function returning the number of k-mers observed from a given allele (i.e., the sum of the k-mer coverages along an allele), L is the number of k-mers in that allele, D is the genome-wide average depth of coverage, which corresponds to an expected k-mer coverage of D′ = D(R − k + 1)/R, D is the expected k-mer count along the allele (D = LD′ ), and ε is the expected error rate. The log-likelihood of each allele is summed, and the maximum likelihood model is chosen. The confidence is given by the difference between the log-likelihoods of the two most likely models.
The following filters are applied to the resulting genotype calls resulting in a NULL call for the variant: • LOW_GT_CONF: If the confidence of the call is below a certain value (default 150 for Illumina, dynamically assigned for nanopore) • LOW_PERCENT_COVERAGE: If proportion of k-mers on the allele with non-zero coverage is not 100% • LOW_TOTAL_DEPTH: If the total depth on a variant (reference + alternate) is less than min_proportion_ expected_depth (default 0.30) of expected depth.
Oxford Nanopore evaluation. For Oxford nanopore data, which has a much higher per-base error rate than Illumina, the LOW_GT_CONF filter default value is determined empirically from the input data, as follows. First, the SNPs/indels are genotyped as described above using a default error rate of 15%. Restricting to variants called as homozygous (and assuming these calls are correct), the rate of error k-mers appearing on the wrong allele is estimated by dividing the total k-mer depth on all the non-called alleles by the total k-mer depth seen across both alleles in all calls. Next, the coverages on 10,000 SNPs are simulated as follows: for each SNP sample the depth on the "true" allele from a Poisson distribution with mean equal to the mean depth, and the depth on the "false" allele by sampling from a binomial distribution, with the number of trials equal to the mean depth, and probability equal to the estimated incorrect k-mer error rate. The genotype confidence of each SNP is calculated, and together these 10,000 simulated SNPs give us a modelled genotype confidence distribution. We then choose the genotype confidence cutoff to be the value that would retain 90% of the simulated SNP calls (this is the default -the precise value can be changed by the user). Genotyping is then re-run with the estimated k-mer error rate and this GT_CONF threshold.

Operation
Mykrobe is available for Linux (command-line only), Mac (command-line or graphical user-interface (GUI) application) and Windows (GUI only). For Linux, Python 2 or 3 (minimum versions 2.7 or 3.4+) is required to install and run the software.

Command-line version.
Installation of the command line version is via cloning the git repository and running some install commands (documented here: https://github.com/Mykrobe-tools/mykrobe), or by using bioconda: conda install -c bioconda mykrobe Drug resistance predictions can be obtained from a sample using the single command of the form: mykrobe predict --format json --seq reads.fastq --output out.json sample_name species where "sample_name" can be replaced with the name of the sample being run (any text will do). The "species" parameter can be "tb" or "staph" to use the built-in panels, or "custom" for a usergenerated panel (an example is shown later). The output file "out.json" contains variant and drug resistance calls, and details of the run, such as the version number of Mykrobe and input parameters. It is in the standard JSON format, which can be easily parsed by standard libraries in most programming languages. Below is an example excerpt of the "susceptibility" section of the output, which contains resistance calls. In the above example, the sample has been called as susceptible to Isoniazid ("predict": "S") and resistant to rifampicin ("predict": "R") because it has the variant I491F in the rpoB gene. The string ATC761277TTC tells us that this is a change from ATC to TTC at position 761277 of the reference genome H37Rv3. The various tag meanings are given in Table 9.
Results can optionally be exported to a CSV file. mykrobe predict --format csv --seq reads.fastq --output out.csv sample_name species The above example would export as follows where the format of the called_by column is variant_name:reference_kmer_count:alt_kmer_ count:conf. Using Mykrobe with user-defined panels for M. tuberculosis. As described above, Mykrobe requires a set of sequence probes, which are generated from the list of variant sites. The panel of variant sites need to be in a tab-delimited file that has the gene name, variant, and whether the variant is a change in protein or DNA sequence. For example: There are three variants in this example. The first is a change from G to GC at position 12 in the DNA sequence of the pncA gene. The second variant is an amino acid change from D to G at position 94 in the embB gene. The final variant is eight nucleotides upstream of the fabG1 gene, where the T in the reference genome is changed to A, C, or G (i.e. X is a wildcard). Note that all positions are relative to the gene, not the complete genome sequence. The following command produces a FASTA file of probe sequences mykrobe variants make-probes reference.fasta -g reference.gb -t panel.tsv > probes.fasta where reference.fasta and reference.gb are FASTA and GenBank format files of the reference sequence and annotation respectively. Note that background variants can also be included in the probe set, which requires the variants in one or more VCF files, and MongoDB to be installed. The following BASH commands will add variants to a new database.
db=$PWD/mongo-db/ mkdir $db mongod --quiet --dbpath $db & sleep 10 for f in 'cat vcfs.txt' do mykrobe variants add --db_name mtb $f reference.fasta done In the above commands, the file vcfs.txt contains a list of the VCF file names (one file name per line), and the variants from each VCF file are added in turn to the database. Then the same command as above can be run to make the probe sequences, but with the extra option --db_name mtb.
In addition to the probes FASTA file, a JSON file that maps each variant to a list of drugs is required to run the "predict" subcommand of Mykrobe. For example, the JSON file corresponding to the above example three variants is as follows.  Figure 2). We started with the catalogue of variants from 20 for the first line drugs, which we call NEJM-2018. This was used together with the remaining drugs from the existing Mykrobe predictor release panel (Walker-2015), and the fluoroquinolones separated into ciprofloxacin, moxifloxacin and ofloxacin. This combined set of variant sites was called candidate panel 1 (CP1). Note that this includes all frameshifts of length 1 and 2 in pncA and katG, implying resistance to pyrazinamide and isoniazid respectively. Mykrobe was run on the Training dataset using CP1 and the results were compared against the known phenotypes for those samples. All variants that had positive predictive value less than 5% were removed from CP1, to make candidate panel 2 (CP2). Next, the process was repeated, removing variants from CP2 to make candidate panel 3 (CP3), based on the results of running Mykrobe with CP2 on the Global data set. All removed variants are provided in the Source data file removed_variants.tsv 32 . Finally, three removed variants were reinstated into the panel because they were found to have positive predictive value greater than 5% on later datasets (these were identified using the results from the Walker-2015 panel). The final panel is included as the default set of variants in release 0.7.0 of Mykrobe.
Comparison with other tools. Since our combined Training, Global, and prospective datasets consist of more than ten thousand samples, we only benchmarked Mykrobe against tools which could be run on the commandline (not web or user-interface applications). Specifically, we tested ARIBA (version 2.13.2), KvarQ (commit d693f561d205c9a3f9b9c705e2fefecdeb715cc8), MTBseq (version 1.0.3) and TB-Profiler (commit 327e431c3e 9de2897a885fabe6bfede1421b2470). In order to separate the impact of catalogue from the different genotyping processes applied by different methods, we modified ARIBA to use the same catalogue as Mykrobe, but for the other tools we used their in-built catalogues.
ARIBA was modified by adding an option to use the same variant panel as the final Mykrobe release panel, but with the following differences. First, since ARIBA performs local assembly, it was modified to call resistance to pyrazinamide or isoniazid if the pncA or katG genes respectively were not completely assembled without internal stop codons. This is different from the Mykrobe variant panel, which instead includes all nonsense mutations and indels of length 1 and 2 in those genes. Second, unlike Mykrobe, ARIBA was not set to report resistance-conferring synonymous changes in gene sequences. We note that ARIBA code version 2.13.2 was run for this study, but using the built-in panel from version 2.13.3, which corresponds to the final Mykrobe release panel. These versions only differ in that 2.13.2 has a variant panel corresponding to CP3.
Mykrobe and ARIBA explicitly have the fluoroquinolones ciprofloxacin, moxifloxacin, and ofloxacin in their variant catalogues. However, KvarQ, MTBseq and TB-Profiler simply call fluoroquinolone resistance without distinguishing between the three drugs. Therefore when reporting results, a fluoroquinolone call was taken to imply resistance to all three of ciprofloxacin, moxifloxacin, and ofloxacin.
The benchmarking pipeline was implemented using Nextflow 48  trigger the use of the dynamic genotype confidence cutoff, as described above.

WHO regimen analysis
Using previously published WHO guidelines for treatment of drug susceptible and drug resistant M. tuberculosis, we established a set of rules to associate any given drug resistance profile to a recommended drug regimen (Extended data file who_regimens.tsv 54 presents treatment rules) 24, 25 . Regimens including multiple new drugs (such as bedaquiline and delamanid, where the basis of drug resistance is poorly understood and phenotyping is rare), were included in this model, but isolates were assumed to be susceptible, since phenotype data for these drugs was unavailable, and is not assumed to be available for the WHO protocols. 26 .

1.
Reviewer Expertise: WGS of mycobacteria, phylogenetics, molecular epidemioogy, clinical bioinformatics I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. 29 January 2020 Reviewer Report https://doi.org/10.21956/wellcomeopenres.17090.r37245 © 2020 MacCannell D. This is an open access peer review report distributed under the terms of the Creative Commons , which permits unrestricted use, distribution, and reproduction in any medium, provided the original Attribution License work is properly cited.

Duncan MacCannell
Office of Advanced Molecular Detection, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA Thank you for the opportunity to review this manuscript from Hunt and colleagues on the development and validation of genotypic antibiotic resistance prediction for Mycobacterium tuberculosis using the open source Mykrobe platform. Building upon the existing Mykrobe Predictor platform, Mykrobe includes a number of important improvements for mycobacteriological analysis, including significant refactoring of the codebase, reengineering and algorithmic improvements, an updated catalog of functional susceptibility data and corresponding genotypic predictors, expansion to include new sequencing platforms, such as the ONT MinION, and renewed emphasis of user-centric and user-configurable design.
Major comments: PAGE 5: The geographic breakdown of training, prospective and global sample subsets seems disproportionate. Presumably, this is due to the training and prospective datasets from specific sites having more reliable or complete data and functional susceptibility information? The multi-phased development and validation of the resistance prediction panel is otherwise well-described (text and Figure 2) and adequate, but the geographic sampling bias should at the very least be explained. PPV and NPV performance metrics were all impressive, with low overall error rate; critical discrepancies were adequately addressed or explained. Without delving into the primary data, the dataset tables on Page 10 using the final Mykrobe panel would seem to suggest significant underlying differences in PPV/NPV/VME/ME between datasets.
This could be an issue if the characteristics of the three test populations used to develop and refine the final Mykrobe panel are not relatively homogenous.
Minor comments: PAGE 9: Interestingly (Figure 4), ciprofloxacin resistance false positivity seems to have increased slightly with Mykrobe-CP3, without corresponding shifts in the prediction of other fluoroquinolones. This is noted on Page 9, which implicates D94A in gyrA, and discusses adjustments that were made to CP3 to accomodate. While I'm not by any means an expert in the specific mechanisms of fluoroquinolone resistance in TB, presumably Mfx and Ofx should also show some degree of https://doi.org/10.21956/wellcomeopenres.17090.r37246 © 2019 Colman R. This is an open access peer review report distributed under the terms of the Creative Commons , which permits unrestricted use, distribution, and reproduction in any medium, provided the original Attribution License work is properly cited.

Rebecca E. Colman
Foundation for Innovative New Diagnostics (FIND), Geneva, Switzerland University of California, San Diego, CA, USA Hunt and co-authors describe the improvements and testing of an antibiotic resistance prediction tool for using NGS data. The authors clearly describe the improvements on the Mycobacterium tuberculosis original software resulting in this new version, Improvements being: mutation Mykrobe predictor Mykrobe. catalog update, ability to use other mutation catalogs, improved identification of non-tuberculosis Mycobacterial species, and ability to use tool with Oxford Nanopore data. The software is freely available at mykrobe.com and has source code and documentation on Github. Additionally the data presented in the paper can be found on figshare. The development of the mutation panels and data interpretation is demonstrated on three datasets: training, prospective, and global set. The benchmarking against four other prediction tools is useful to show improvements and performance. Although only five Oxford Nanopore samples were used, they did show agreement in drug resistant variant calls. The case study of using predictions to guide personalized regimens is an extremely useful comparison and Mykrobe illustrates valuable clinical application of this type of tool.
The manuscript is scientifically sound without further changes. However, the following minor comments should be considered to increase the readability of the manuscript.
The paper is very lengthy and full of information, however it is hard to read and digest all the information contained within the manuscript. The abstract is concise and clear it would be beneficial for structure and format to be mirrored throughout the manuscript. For example, the introduction contains a lot of text dedicated to explaining the first version, and does not put the tool in context of Mykrobe predictor, tuberculosis until the fourth paragraph. Perhaps clearly outlining the need for this tool and then summarize how the original version has been improved will streamline the introduction and increase the impact of the work, as right now some of the benefits of the software are buried in a large amount of text.
Throughout the results section, there are portions that make more sense in the methods section, and then present on the results for only this study in the results section. For example, in the results species identification section, it is unclear which are the results for this manuscript versus previous findings. Additionally in the minor alleles prediction section, the rationale behind the work should be moved to the methods.
In the results section, the table 1 does not include sensitivity/specificity as stated in the body. The data is there to calculate these values but the specific sensitivity and specificity values are not included. Additionally, the sensitivity and specificity stated in the abstract are not found elsewhere in the paperthey can be calculated with data given, but would be beneficial to have in results. The methods section should be organized to mirror the order of the results section as this would enable readers to quickly reference methods when reading the results section. In the methods variant panel development section, it is unclear what data sets CP2 and CP3 are run on to generate the next panels. This is clear in the figure 2, but text was unclear.