Mycobacterium tuberculosis complex lineage 5 exhibits high levels of within-lineage genomic diversity and differing gene content compared to the type strain H37Rv

Pathogens of the Mycobacterium tuberculosis complex (MTBC) are considered to be monomorphic, with little gene content variation between strains. Nevertheless, several genotypic and phenotypic factors separate strains of the different MTBC lineages (L), especially L5 and L6 (traditionally termed Mycobacterium africanum) strains, from each other. However, this genome variability and gene content, especially of L5 strains, has not been fully explored and may be important for pathobiology and current approaches for genomic analysis of MTBC strains, including transmission studies. By comparing the genomes of 355 L5 clinical strains (including 3 complete genomes and 352 Illumina whole-genome sequenced isolates) to each other and to H37Rv, we identified multiple genes that were differentially present or absent between H37Rv and L5 strains. Additionally, considerable gene content variability was found across L5 strains, including a split in the L5.3 sub-lineage into L5.3.1 and L5.3.2. These gene content differences had a small knock-on effect on transmission cluster estimation, with clustering rates influenced by the selected reference genome, and with potential overestimation of recent transmission when using H37Rv as the reference genome. We conclude that full capture of the gene diversity, especially high-resolution outbreak analysis, requires a variation of the single H37Rv-centric reference genome mapping approach currently used in most whole-genome sequencing data analysis pipelines. Moreover, the high within-lineage gene content variability suggests that the pan-genome of M. tuberculosis is at least several kilobases larger than previously thought, implying that a concatenated or reference-free genome assembly (de novo) approach may be needed for particular questions.

Several phenotypic and genotypic features separate L5 and L6 strains from strains of the other human-adapted lineages. Some TB diagnostics have a lower performance for L5 and L6 strains, compared to strains of other MTBC lineages [25,26] and these lineages are less likely to grow in culture [27], with dysgonic appearance on solid medium [18,26,28,29], despite the inclusion of pyruvate-supplemented medium [21,27]. Mutations in genes essential for growth in culture were identified for L6 strains [30], yet for L5 strains the reasons for the difficulty in growth remain unclear. L6 strains and those of the closely related animal strains/lineages such as Mycobacterium bovis are reported to be less virulent in humans than those from other human-adapted MTBC lineages in populationbased studies [20,31,32] and in genome studies of mutations in virulence regulation genes/systems [33,34]. Infection with an L6 strain progresses slowly to TB disease, and is associated with impaired immunity in some settings but not all (e.g. HIV infection [20]). In contrast, although some authors have studied the genomics of L5 strains [35][36][37], protein secretion and in vivo immunogenicity [35], much remains to be learned on the genomics, virulence and disease progression of L5 strains.
Whole-genome sequencing (WGS) based on next-generation sequencing (NGS) of MTBC strains often involves data analysis by mapping of short sequence reads of the strains to a complete reference genome [38], usually H37Rv [39,40] or a reconstructed ancestor with the same gene content as H37Rv [22,41,42]. The resulting SNPs are then used for drug resistance determination, subtyping and transmission analyses [38]. However, since H37Rv is a L4 strain, it might not be representative for the genome of strains of other MTBC lineages. Thus, if H37Rv is used as basis for genome analysis, several genes or larger genomic regions may be missed, resulting in an underestimation of genome diversity among the strains of other lineages that may, however, provide additional information, e.g. for transmission analysis.
The members of the MTBC evolved from an environmental organism to an obligate pathogen through genome reduction and acquisition of new genes [43]. In addition, it is known that some differences in gene content exist between strains of different lineages [10,38,[44][45][46][47][48]. Furthermore, it was reported that genes had a higher genetic diversity among

Impact Statement
The Mycobacterium tuberculosis complex (MTBC) consists of nine human-associated lineages. Although many of these have been described for decades, little is known about the gene content variation both between and within strains of these lineages. This is most pronounced for strains of lineage 5 (L5), once part of the Mycobacterium africanum species. We compared the genomes of over 350 clinical L5 strains, the largest dataset gathered to date, to each other and the H37Rv reference strain to look for gene content variation and the potential impact this would have on clinical use of genome sequence data. We found that multiple genes are differentially present or absent between H37Rv and L5 strains, and that there is high within-L5 gene content variability, resulting in the split of the sub-lineage L5.3 into L5.3.1 and L5.3.2. We quantified the potential impact of this gene content difference on transmission clustering estimation. We found that the current H37Rv-centric approach widely used in MTBC epidemiology would overestimate the clustering rate of L5 strains since it misses single-nucleotide polymorphisms present in L5-only genes. Thus, for highresolution outbreak analysis, MTBC epidemiological studies may need to move away from the H37Rv-centric approach, especially when looking at transmission in countries where L5 is prominent.
L6 strains compared to L5 ones [36], but the difference in gene content between strains of these two lineages was not investigated, but had been alluded to in an earlier abstract publication [37]. Further, little is known about the gene differences between genomes of L4 and L5 strains and the potential limitations this may impose for in-depth analysis of genome studies of L5 strains (e.g. sub-lineage detection and transmission tracking). Similarly, little is known about within-lineage diversity in terms of gene content.
To address these questions, we assessed the gene content diversity of L5 strains in the context of WGS and reference selection. To this end, we analysed 358 genomes of L5 strains, including 3 complete genomes, and compared them to H37Rv and strains of closely related lineages [L6 and M. tuberculosis var. bovis (hereafter called M. bovis) [19]. Our main focus was to determine particularities in the genomes of L5 strains compared to the MTBC strain type (H37Rv, a L4) and strains of MTBC lineages phylogenetically closely related to L5 (L6 and M. bovis); deduce hypotheses for L5 phenotype/biology; and define the level of within-lineage gene content differences among strains of this lineage of the MTBC.

Whole genomic DNA extraction
Genomic DNA extraction was performed on growth from fresh Löwenstein-Jensen slants using the semi-automated Maxwell 16 Cell DNA purification kit in the Maxwell 16 machine, or from the late exponential phase of growth in 7H9 medium, using the CTAB method [6,52].

Assembly and annotation of the complete (PacBiosequenced) genomes
The Benin PacBio-sequenced genome was assembled using HGAP [53] and Quiver [53] and checked for sufficient quality and coverage (10 000 bp sliding window coverage was always above 65× and average coverage across entire genome was 107×). The other two PacBio-sequenced genomes were already assembled as described previously [1]. The genome sequences of all three complete (PacBio-sequenced) genomes were annotated using Prokka-1.12 [54] based on the reference genome H37Rv annotation.

Gene presence-absence analysis
Gene content differences were assessed with an all-vs-all blastn approach, using blast+ version 2.8.1 [55,56]. For a specific genome-genome comparison, the following procedure was used: the gene sequences of genome 1 (ffn file; coding regions only) were compared to those of genome 2 using an evalue cut-off of 1e −05 and a minimum similarity of 70 % to look for any homology for each gene. Those genes found in genome 1 and not in genome 2 were then compared to the complete (PacBio-sequenced) genome (fna file; coding and non-coding regions) of genome 2 to look for pseudogenes (herein qualified as 'suspected pseudogenes') using blastn with the same cut-offs. These suspected pseudogenes were then confirmed using tblastn of the genome 1 protein sequences (faa file) compared to the complete (PacBio-sequenced) genome (fna file) of genome 2. This procedure was used to compare the complete (PacBio-sequenced) genomes of all three L5 strains to each other as well as each to H37Rv. Those genes present or absent in H37Rv (or pseudogenes in either) were compared in a similar manner to the L6 and M. bovis reference strains to determine whether these genes/pseudogenes are L5-specific (i.e. present in L5, but not in H37Rv, L6 and M. bovis).
The ffn files of H37Rv, the complete (PacBio-sequenced) genomes of the three L5 strains and the complete genome of the L6 reference strain were also aligned using progres-siveMauve (v20150226) [57] to identify rearrangements and examine synteny.

L5 Illumina-sequenced (short reads) genomes
In total, whole genomes from 355 L5 strains from various countries sequenced on the Illumina platform were included in the study. These genomes were derived from a larger study on the genetic diversity of L5 and L6 [6]. After reducing isolate redundancy (i.e. 1 representative retained for those that were extremely closely related), genomes from 205 L5 strains formed a non-redundant dataset. These 205 strains (genomes) originated from West, South, East and Central Africa (Table S8), but primarily (n=155) from two regions within West Africa: the western part of West Africa (including The Gambia, Sierra Leone, Ivory Coast, Liberia, Guinea and Mali) and the eastern part of West Africa (including Ghana, Benin and Nigeria) [6]. In general, L5 and L6 are geographically restricted to West and Central Africa, making this genome selection from L5 strains representative of most of the geographical regions affected.

Mapping of L5 Illumina reads to H37Rv and L5 complete (PacBio-sequenced) genomes
Raw reads (fastq files) of the 205 non-redundant Illuminasequenced genomes were mapped respectively to H37Rv and each of the PacBio-sequenced genomes of the 3 L5 strains using the MTBseq pipeline [58]. The depth mapping coverage of the samples in the clinical Illumina-sequenced genomes against the reference genomes (percentage read mapping, unambiguous coverage mean) was compared between the PacBio-sequenced genomes of the three L5 strains and between the PacBio-sequenced genome of each L5 strain and H37Rv. Mapping statistics parameters such as percentage unambiguous total base, uncovered bases, SNPs, deletions, insertions, substitutions and percentage genes mapped to reference were also compared using the PacBio-sequenced genome of each L5 strain or H37Rv as a reference.
Checking unique versus missing genes in Illuminasequenced genomes of L5 strains and SNPs in those confirmed as L5-specific The genes found to be present or absent in L5 based on genome comparisons of the PacBio-sequenced genomes of the three L5 strains with H37Rv were checked for their expected presence or absence in the Illumina-sequenced genomes of L5 strains. Using the position tables produced by the MTBseq pipeline, a gene was considered absent if 95 % of its position in the genome had fewer than eight reads covering them. From these data, a gene presence/absence matrix was generated for Illumina-sequenced genomes of L5 strains mapped to H37Rv and each of the three complete (PacBio-sequenced) genomes of L5 strains. Genes found to be L5-specific (present in complete PacBio-and Illumina-sequenced genomes) were also checked for SNPs in these genes using each of the complete (PacBio-sequenced) genomes of the three L5 strains as a reference. The script used for undertaking this analysis can be found at https:// github. com/ conmeehan/ pathophy.

Calculating the effect of reference genome selection on pairwise SNP distances
Transmission analysis of MTBC strains often involves clustering strains together based on specific SNP cut-offs [38,59,60]. We assessed whether the selection of the reference genome (H37Rv, PcbL5Ben, PcbL5Gam or PcbL5Nig) changed the clustering rate of L5 strains. For this we used the non-redundant set of L5 genomes from a previous study on L5 diversity [6], which included the 205 detailed above and a further 145 strains that were closely related to at least 1 of these 205 strains. Additionally, the TB-Profiler online SRA search tool (https:// tbdr. lshtm. ac. uk/ sra) was used to identify a further 5 isolates that were not in this dataset, resulting in 355 strains being included in this clustering study.
SNP alignments of the Illumina-sequenced genomes were created by first mapping to each of the four reference genomes (H37Rv and three PcbL5). The Amend function of MTBseq does this automatically for H37Rv, including masking of repetitive regions, accounting for 10 % of the genome [40], and exclusion of columns with 95 % ambiguous calls. To undertake this for the reference genomes from the three L5 strains, repetitive regions were first determined from the annotations of the genes. All genes whose description contained one of the following words was excluded: integrase, PE family, PE-PGRS, phage, transposase. SNP alignments were then created using MTBseq as done for H37Rv. Pairwise distance matrices, one for each alignment based on each reference genome, were created using a Python script that can be found at https:// github. com/ conmeehan/ pathophy. Loose transmission clusters were created from these matrices at a cut-off of 1, 5 and 12 SNPs, as described previously [59]. Clustering rates and the presence of a strain in a transmission cluster was then compared across the four reference genome mapping approaches.

Determination of putative function of genes differentially present or absent in the complete (PacBio-sequenced) genomes of L5 strains
To find the (putative) function of genes present or absent only in L5 strains, and genes present in L5 strains but pseudogenes in H37Rv/L6/M. bovis or vice versa, the fasta sequences of the genes were searched against the NCBI's NR database using blastx2.8.1+ [61] as well as the Tuberculist database (http:// tuberculist. epfl. ch), Mycobrowser (https:// mycobrowser. epfl. ch/) [62] and the literature. Furthermore, the gene function group/class was found using the COG database [63] and Mycobrowser.

Determination of the sub-lineage of the L5 strains
The sub-lineage of the strains was determined by looking for sub-lineage-defining SNPs in the genomes of L5 strains as previously described [6,35]. The sub-lineage-defining SNPs were searched for in the SNP data generated using the MTBseq pipeline with H37Rv as the reference genome.
The size of the complete genomes of L5 strains was respectively 4 438 262 bp for PcbL5Ben, 4 424 447 for PcbL5Gam and 4 417 534 for PcbL5Nig (data in Figshare [2]). The size was 4 411 532 bp for H37Rv and 4 493 502 for the M. bovis reference genome (data from annotation with Prokka).
The visualization of the structure of the genomes showed that all three complete genomes of all L5 strains had a region that was absent in H37Rv. Furthermore, the PacBio-sequenced genome of the Nigerian L5 strain (PcbL5Nig) was missing an additional region (herein called L5Nig-Del) that was present in the genomes of the Benin and the Gambian L5 strains and also present in H37Rv.
The genome of the Benin strain contained a syntenic block of three genes (1148, 1104, 2840 bp) that were present in neither the genome of the Nigerian strain nor the one of the Gambian strain (Tables 1 and 2). These increased the pangenome of L5 by at least 5092 bp. The Benin and Gambian strains each contained 33 genes that were not present in the Nigerian strain (Table 2 and S1). Interestingly, 32 of these genes were also present in H37Rv (Table 2 and S1). These 32 genes were thus missing in the Nigerian L5 strain exclusively, meaning that its ancestor has genes with a subsequent loss in the Nigerian L5 strain. Those 32 genes -absent in the Nigerian L5 strain only -were sequentially contiguous and formed 3 blocks of 19, 11 and 2 genes (Table S1). We found that 30 of those genes, the blocks of 19 and 11, were separated by 1 gene and represented the L5Nig-Del region (mentioned in the paragraph above, Table 3). Additionally, 11 genes were shared by the 3 PacBio-sequenced genomes of L5 strains, which were absent in H37Rv (Table S2), while 9 genes present in H37Rv were not present in any of the 3 complete (PacBio-sequenced) genomes of L5 strains (Table  S3). Two (Rv2073c, Rv2074) of those nine genes were only present in H37Rv but absent in the complete genomes of L5, L6 and M. bovis strains. Note that for the gene presence/ absence analysis, the genes are often referred to by the Rv designation but are actually putative orthologues of those H37Rv genes.
Six of the genes shared by the three complete (PacBiosequenced) genomes of L5 strains were confirmed pseudogenes in H37Rv (Table S4). Three genes present in H37Rv were confirmed to be pseudogenes in the three complete (PacBio-sequenced) genomes of the L5 strains (Table S5).

Gene presence/absence and related SNPs in lineage-specific genes in a wider set of clinical strains
The mapping estimates of the Illumina-sequenced genomes are presented in Table 4. Three (ERR502505, ERR751302, ERR1215478) of the 205 Illumina-sequenced genomes were identified as mixed infection strains based on their MTBseq output, and thus excluded from the analysis. Mapping quality and coverage against a complete L5 reference was superior to the H37Rv reference approach (Table 3, Fig.  S1), as expected. Using the complete (PacBio-sequenced) genomes from Benin and The Gambia strains, yielded similar mapping estimates that were better than those of the genome of the Nigerian L5 strain, likely due to the large deletion in this genome. The genes specific to PcbL5Ben [3, region PcbL5Ben_1893 through PcbL5Ben_1895; none specific to PcbL5Gam (0), PcbL5Nig (0)] were each found in 77.7 % (157/202, Table 1) of the 202 Illumina-sequenced genomes of L5 strains.
Interestingly, 3 % of the Illumina-sequenced genomes of L5 strains (6/202) had similar patterns of large gene loss   The two blocks contain genes whose annotations include mutB, mazE4, mazF4 and others with various putative functions (Table 3, Table S6); none of them were essential genes (Mycobrowser).
Four of the 11 genes present in the 3 complete (PacBiosequenced) genomes of the L5 strains but absent in H37Rv were found in all 202 Illumina-sequenced genomes of the L5 strains (  (Tables 5 and S2). Importantly for phylogenetic purposes, SNPs were detected in all 4 genes in 1.5-3.5 % of the 202 Illumina-sequenced genomes of L5 strains (Table S2). Predicted functions for all L5-specific genes are listed in Table S2.
Six (Rv1977, Rv1979c, Rv1993c, Rv1995, Rv2073c, Rv2074) of the 9 genes that were present in H37Rv and absent in the 3 complete (PacBio-sequenced) genomes of L5 strains were absent in all 202 Illumina-sequenced genomes of L5 strains (Table 5 and S3). Rv1977 is a conserved hypothetical, probably a peptidase; Rv1979c is involved in 'cell wall and cell processes' , probably a permease; Rv1993c and Rv1995 are conserved hypotheticals; and Rv2073c and Rv2074 are involved in 'intermediary metabolism and respiration' , with Rv2073c probably a dehydrogenase and Rv2074 a pyridoxamine-5-phosphate oxidase (Mycobrowser, Table 5). Four of these genes (Rv1977, Rv1979c, Rv1993c, Rv1995) were absent (did not have orthologues) in L5 strains only (i.e. present in L6 and M. bovis reference genomes), while the other two (Rv2073c and Rv2074) were also absent in L6 and M. bovis ( Table 5). The three other genes absent from the complete (PacBio-sequenced) genomes were present in a minority of the L5 Illuminasequenced genomes (Table S3, Fig. S1). Predicted functions for all nine genes are listed in Table S3.

Sub-lineage of the strains (PacBio and Illuminasequenced genomes)
The determination of the strains sub-lineage revealed that the PcbL5Ben strain is an L5.3 strain that contains the 30 genes (2 blocks of 19 and 11 genes: Rv1493 through Rv1509 and Rv1511 through Rv1521), whereas the PcbL5Nig strain is also a L5.3 strain, but placed in a different clade (Fig. 1

Impact of the reference genome selection on genetic distances used for transmission clustering rates
Transmission analysis was undertaken on an expanded set of 352 L5 strains using each of the four reference genomes (H37Rv, PcbL5Ben, PcbL5Gam and PcbL5Nig) for the SNP mapping. The current gold standard is to use the H37Rv genome (NC_000962.3) for calling SNPs and then creating transmission clusters at a specific SNP cut-off differentiating two and more strains. For this dataset, using a 12 SNP cutoff, it was found that 40.6 % (n=144) of strains were within a transmission cluster with at least 1 other isolate (Table 6), for a total of 55 clusters. Using the PcbL5Ben genome instead of the H37Rv genome as the reference reduced the number of clusters to 54, resulting in a clustering rate of 39.7 % at a 12 SNP cut-off (Table 6). A similar reduction in transmission clustering was observed when using the PcbL5Nig genome as a reference but not when using the PcbL5Gam (Table 6). When a more conservative SNP cut-off was used, such as one SNP or five SNPs, the reduction in transmission clustering rates was more pronounced and was observed for all the PcbL5 genomes (Table 6). This demonstrates that even with this small dataset, transmission cluster estimations are affected by the selection of the reference genome, with potential overestimation of recent transmission when using H37Rv as the reference genome.

DISCUSSION
Comparison of complete (PacBio-sequenced) genomes and Illumina-sequenced genomes from clinical strains revealed gene content diversity both within L5 strains and between L5 strains and strains of other lineages. This diversity also had an impact on clinical epidemiology analysis of L5 strains, with   Several genes were found to be absent from the genome of either L5 strains only or also from strains of the closely related lineages L6 and M. bovis. Most of these genes are located within RDs previously described as being absent in one or all of these lineages.
RD9, consisting of Rv2073c and Rv2074 and partial Rv2072 and Rv2075c [13], has previously been reported to be absent in L5, L6 and M. bovis [10,12,15]. In this study, we confirmed that Rv2073c and Rv2074 are absent in strains of these lineages. Rv2072 and Rv2075c were present in their truncated form in the genomes of all L5 strains as previously described [13]. Rv2073c is an NAD(P)-dependent oxidoreductase (NCBI) that catalyzes a wide range of reactions and is involved in redox sensor mechanisms [64,65]. Rv2074 was previously thought to be a pyridoxine (vitamin B6) oxidase, but is now known to be a F420-dependent biliverdin reductase, a cofactor of vitamin B6 synthesis [66,67]. Vitamin B6 is essential for the survival and virulence of M. tuberculosis [68] and its cofactor (F420-dependent biliverdin reductase) is  implicated in immune-evasive mechanisms to allow bacterial persistence [66,67] (Table S3). Recently, it has been reported that the synthesis of F420 might be stimulated by phosphoenolpyruvate [69], which is the precursor to pyruvate, a supplement often added to culture medium to improve the in vitro growth of L5, L6 and M. bovis strains. Furthermore, there is a suggestion that F420 is needed for the activation of the antituberculosis drugs pretomanid and delamanid [69].
Another gene absent in L5 strains is Rv1977. This gene is one of the 14 genes contained in RD7 [13]), which is known to be lacking completely from strains of L6 and animal-adapted lineages such as M. bovis [12]. Similarly, Rv1979c, 1 of 10 genes that make up RD2, was absent from genomes of L5 strains; this RD is also absent from late-generation M. bovis BCG such as M. bovis BCG Pasteur strains [12]. Rv1978, another of the 10 genes in RD2, and located in the region Rv1977-Rv1978-Rv1979c was absent in the 3 complete (PacBio-sequenced) genomes of L5 strains, but present in 2 L5 strains Illumina-sequenced genomes [2/202, 1 %; further confirmed using blast search and manual check of the gene in the reads of the 2 L5 strains genomes (ERR439931 and ERR4192386)], making up the region containing Rv1977, Rv1978 and Rv1979c, a region absent in most of L5 strains (99 %, 200/202). Rv1979c has been associated with clofazimine and bedaquiline resistance [70,71], two of the drugs used for the treatment of rifampicin-and multidrug-resistant TB [72,73], but minimal inhibitory concentration testing of clofazimine in five L5 strains did not find that the deletion conferred resistance [74]. Further studies and protein function discovery is needed to investigate the consequences of the absence of those genes in genomes of L5 strains. Rv1978 is required for bacterial survival in macrophages [75].
Rv1993c and Rv1995 were absent in all L5 genomes. Rv1994c a gene outside of known RDs, located in the region Rv1993c-Rv1994-Rv1994 was absent in the three complete (PacBiosequenced) genomes of L5 strains, yet present in one L5 strain Illumina-sequenced genome [1/202, 0.5 %; further confirmed, using blast search and manual checking of Rv1994 in the reads of the concerned L5 strain genome (ERR439931)]. This resulted in the absence, in most of the L5 strains (99.5 %, 201/202) of the region containing Rv1993c, Rv1994c and Rv1995. Rv1994c is involved in the regulation and transport (efflux) of toxic metals, especially copper, which is toxic in excess and may hamper in vitro growth [76][77][78][79][80] (Table S3).
In combination, the absence of these genes in L5 strains suggests that they would be less likely to survive in macrophages (Rv1978), have reduced growth in vitro (Rv1994c, as previously found by [27]), and be less immune-evasive, less persistent and less virulent (Rv2074) than L4 strains (at least H37Rv), as previously suggested for L5 and L6 strains [5,32] In contrast to the absence of some genes in RD2 and RD7, or all genes in RD9, RD5 was found to be present in its entirety in nearly all sequenced genomes of L5 strains. Some authors reported that up to 45 % of L5 strains missed RD5, while others reported that RD5 is present in most L5 strains [35,81]. The RD5 region includes Rv2346 (truncated), Rv2347, Rv2348, Rv2349c (plcC), Rv2350c (plcB), Rv2351c (plcA) and a PPE gene [13]. Our findings showed that of the 202 Illumina-sequenced L5 strains, the majority (196, 97 %) had the RD5 region in its entirety, whereas 6 L5 strains (3 %) missed a part (one or 2 genes among the non-PPE genes) of the RD5 region. Likewise, in contrast to a previous report [10], our data did not confirm that all L5 strains lack RD711 (coordinates: 1 501 713-1 503 655 in H37Rv [82]) composed of Rv1333 (truncated), Rv1334, Rv1335 and Rv1336 (truncated). The lacking of RD711 was rather observed in all L5.1 strains in our dataset, which is in agreement with the report of Ates et al. [35] and also observed among 58.6 %(17/29) of L5 strains of unknown sub-lineages (Table S7).
We found that four genes, present in all of the L5 strains in our multi-country collection, were absent (no orthologues) in the H37Rv genome. Those genes represent 2382 bp, which is around 0.05 % of the total length (base pair) of a typical genome of an MTBC strain. Our findings are in line with those of other reports indicating that some genes present in genomes of clinical MTBC strains were absent (no orthologues) in the H37Rv genome [46,47]. Although these gene differences are small, they did affect the distance matrices used for transmission clustering analyses (Table 6). Overall, H37Rv-based mapping was found to place a slightly higher percentage of strains in transmission clusters than any L5-based mapping approach, especially at lower SNP cutoffs. This has implications for molecular epidemiology in West Africa, where most L5 strains are found. Additionally, if similar scenarios exist for other lineages, this may result in changes to all non-L4 transmission analyses.
Our findings support past recommendations to use additional reference genomes that are different from H37Rv [46,83,84]. Although another study [85] concluded that there is no need to use a lineage-specific reference genome, their observation was only based on the analysis of L4 clinical strains and focused on SNPs and short indels, not larger gene deletions or SNPs within these regions. In contrast, our findings indicate that mapping of NGS data from L5 strains to L4 reference will have an impact in terms of both reference genome coverage and coverage of lineage-specific genes.
While the use of a single L5 genome as reference would have many benefits over H37Rv for particular study questions, several gene content differences were still observed within the L5 lineage. One (PcbL5Ben) of the three genomes of L5 strains sequenced with PacBio had genes that were not shared with the two others that were all found in Illumina-sequenced genomes of L5 strains (thus excluding exogenous contamination) ( Table 1). The L5NigDel was also found in the genomes of six L5 strains from different countries. These six strains formed a monophyletic group of the (SNP-defined) L5.3 sublineage [6] (Fig. 1), suggesting that the loss of these 30 genes is a marker for strains of L5.3.2 sub-lineage, compared to the sub-lineage L5.3.1 strains that have these genes intact. Indeed, deletions, as described for MTBC lineages [10,45,48], may also be limited to sub-lineages.
The between-and within-lineage differences in both gene content and potential functionality indicate a need for more closed genomes of MTBC sub-lineages to be constructed. Due to the impact of reference choice on transmission studies, there is now a need for varying approaches to NGS data analysis to be considered. For instance, sub-lineage-specific reference genomes could improve resolution, although such ad hoc (e.g. outbreak-specific) reference genomes are only specific to that particular situation/outbreak/population/lineage and cannot be used in another context, making comparisons between lineages and settings difficult. Alternatively, a pangenome-based reference genome(s) capturing all the known diversity may be required instead. This can take two forms: a composite genome containing all the genes found in strains of all lineages and sub-lineages of the MTBC (i.e. both the core and accessory genome) [86], represented as a graph instead of a single sequence [87][88][89][90][91][92], or a selection of reference genomes, with mapping to all or a subset undertaken, as has been done with strains of Mycobacterium chimaera and other pathogens [93,94]. Other authors have also reported that, because of the genetic variability between strains, using a single strain genome as reference genome lacks accuracy [95][96][97]. This MTBC-wide pangenome approach has been suggested before, including for other organisms [38,47,[95][96][97][98][99][100]. However, such an approach also has its own drawbacks, including difficulty in mapping reads that bridge the boundaries between accessory genes and the rest of the genome, comparing strains of different lineages including phylogenetic analyses and retention of gene names and codes in clinical use, where H37Rv is deeply embedded [89]. The specific sequence of each orthologue gene would also need to be chosen for such a reference, with the inferred ancestral genome representative of MTBC lineages approach being the most likely method [22,42,101].
Another alternative approach is a de novo assembly referencefree approach [38,96,102,103], where strains are either assembled into contigs without the use of a reference genome or compared to each other without first calling SNPs. This would allow for clustering of samples regardless of lineage, e.g. for genotyping or transmission analyses using Mash (software for fast/meta -genome distance estimation technique) [104] but would require new cut-offs for defining clusters and many additional steps for further gene annotation and between-sample comparisons, making its clinical use potentially confusing.
A limitation of this study is that the complete (PacBio) and short-read (Illumina) genomes were derived from positive cultures, excluding possible minority L5 strain diversity, as L5 strains are overrepresented in negative cultures [27]. WGS applied directly to sputum is increasingly needed, especially for ancestral lineages (including L5 and L6), where negative culture or dysgonic isolates are more common and are a challenge for DNA extraction [26,27]. There is also a limitation regarding the comparison of gene presence/absence in strains of L5 versus L4 (including H37Rv), L6, M. bovis complete genomes; which is that our study only included a single complete genome of one strain of L4, L6 and M. bovis, respectively, while strains of these lineages may display similar variability to the intra-L5 variability we observed in our study. Of note, the genes that are present in L5 strains but not specific to L5 strains could be either active or inactive (mutated), requiring additional in vitro or in vivo validations to fully elucidate the metabolic profile of these strains.
In conclusion, the use of a (sub-)lineage reference genome can increase resolution for strain comparison in comparison to a H37Rv-based mapping approach for L5 genome analyses for epidemiology (transmission), phylogeny and sub-lineage determination. Still, the use of a (sub-)lineage reference genome may miss some within-lineage gene differences. For drug resistance detection in clinical L5 strains or strains of other lineages, H37Rv could still be used as a reference genome as resistance-related mutations are usually among the core genes (shared across all lineages). The high withinlineage gene content variability suggests that the pangenome of MTBC strains may be larger (at least by 5092 bp) than previously thought, implying that a reference-free genome assembly (de novo assembly) approach may be needed. in resources. S. N. was involved in conceptualization, reviewing/editing the manuscript and supervision. J. P., was involved in investigation, data curation and reviewing/editing the manuscript. S. H. was involved in conceptualization, investigation, data curation and reviewing/editing the manuscript. D. Y. was involved in conceptualization, resources and reviewing/editing the manuscript. S. G. was involved in conceptualization and reviewing/editing the manuscript. L. R. was involved in conceptualization, resources, reviewing/editing the manuscript and supervision. D. A. was involved in resources, reviewing/editing the manuscript and supervision. B. C. d. J. was involved in conceptualization, methodology, writing of the original draft, reviewing/editing the manuscript and supervision. C. J. M. was involved in conceptualization, methodology, formal analysis, investigation, writing of the original draft, reviewing/editing the manuscript, supervision and visualization.

Conflicts of interest
The authors declare that there are no conflicts of interest.

Ethical statement
The PacBio-sequenced L5 strain from Benin was isolated during the multicentric OFLOTUB study approved by the National Ethics Committee in Cotonou, Benin and each of the participating countries (Merle et al.