Transmission Dynamics of a Mycobacterium tuberculosis Complex Outbreak in an Indigenous Population in the Colombian Amazon Region

ABSTRACT Whole genome sequencing (WGS) has become the main tool for studying the transmission of Mycobacterium tuberculosis complex (MTBC) strains; however, the clonal expansion of one strain often limits its application in local MTBC outbreaks. The use of an alternative reference genome and the inclusion of repetitive regions in the analysis could potentially increase the resolution, but the added value has not yet been defined. Here, we leveraged short and long WGS read data of a previously reported MTBC outbreak in the Colombian Amazon Region to analyze possible transmission chains among 74 patients in the indigenous setting of Puerto Nariño (March to October 2016). In total, 90.5% (67/74) of the patients were infected with one distinct MTBC strain belonging to lineage 4.3.3. Employing a reference genome from an outbreak strain and highly confident single nucleotide polymorphisms (SNPs) in repetitive genomic regions, e.g., the proline-glutamic acid/proline-proline-glutamic-acid (PE/PPE) gene family, increased the phylogenetic resolution compared to a classical H37Rv reference mapping approach. Specifically, the number of differentiating SNPs increased from 890 to 1,094, which resulted in a more granular transmission network as judged by an increasing number of individual nodes in a maximum parsimony tree, i.e., 5 versus 9 nodes. We also found in 29.9% (20/67) of the outbreak isolates, heterogenous alleles at phylogenetically informative sites, suggesting that these patients are infected with more than one clone. In conclusion, customized SNP calling thresholds and employment of a local reference genome for a mapping approach can improve the phylogenetic resolution in highly clonal MTBC populations and help elucidate within-host MTBC diversity. IMPORTANCE The Colombian Amazon around Puerto Nariño has a high tuberculosis burden with a prevalence of 1,267/100,000 people in 2016. Recently, an outbreak of Mycobacterium tuberculosis complex (MTBC) bacteria among the indigenous populations was identified with classical MTBC genotyping methods. Here, we employed a whole-genome sequencing-based outbreak investigation in order to improve the phylogenetic resolution and gain new insights into the transmission dynamics in this remote Colombian Amazon Region. The inclusion of well-supported single nucleotide polymorphisms in repetitive regions and a de novo-assembled local reference genome provided a more granular picture of the circulating outbreak strain and revealed new transmission chains. Multiple patients from different settlements were possibly infected with at least two different clones in this high-incidence setting. Thus, our results have the potential to improve molecular surveillance studies in other high-burden settings, especially regions with few clonal multidrug-resistant (MDR) MTBC lineages/clades.

In recent years, TB incidence in the WHO/Pan American Health Organization (PAHO) region of the Americas has been on the rise (3), especially in the indigenous population of LATAM, with nine times more burden than in the general population, with 11,608 cases reported per 100,000 indigenous people in 2018, 5.4% of which were from Colombia, particularly in communities of the Amazon (2). Members of these communities are at high risk of developing the disease given the isolated location and health, sociological, and economic inequalities that have hindered timely access to medical services (8,9). In addition, the ongoing Coronavirus disease 2019 (COVID- 19) pandemic has potentially worsened their situation and is therefore threatening public health efforts to end TB by 2035 (2,3,10,11).
To advance faster in the control and elimination of the disease in these vulnerable populations, the World Health Organization (WHO) has endorsed whole-genome sequencing (WGS) as a tool for understanding local epidemics to enable better public health interventions (12,13). WGS analyses allow for the rapid analysis of the genomes of clinical MTBC strains, enabling genome comparisons with high resolution, e.g., for outbreak analysis, reliable strain genotyping, drug resistance prediction, and identification of transmission chains (14)(15)(16). While MTBC genomic surveillance is already implemented in developed countries (17,18), it is poorly integrated into the public health practices of LATAM due to high initial investments, equipment restrictions, and lack of expertise (14).
To better understand the spread of an MTBC outbreak that occurred in 2016 among 23 indigenous settlements of the Ticuna, Cocama, and Yagua ethnicities in the remote setting of Puerto Nariño, in the Colombian Amazon Region (19), we performed a WGS analysis of 74 MTBC isolates obtained from March to October 2016. This outbreak led to a reported TB prevalence of 1,267/100,000 inhabitants, which raised concerns among national and international public health authorities (20). Due to the low genetic diversity of the circulating strain, the outbreak could not be fully unraveled with the conventional molecular epidemiology methods available in the country (21). Aiming for an increased phylogenetic resolution and an improved understanding of the transmission dynamics in this indigenous and remote patient population, we investigated the outbreak isolates with a standard WGS approach and extended the analytical range to repetitive regions, low frequency variant detection, and mapping on a newly established "outbreak genome."

RESULTS
Demographic characteristics of the populations studied. The TB outbreak comprised 80 positive cases out of 6,310 individuals screened from March to October 2016 (19), from which only 74 MTBC isolates (one isolate per patient) were recovered, and 70 of the patients self-identified as indigenous. These patients were inhabiting 16 indigenous settlements on the shore of the Loretoyaco and Amazon rivers ( Fig. 1 and 2 Classical genotyping of the LAM-MTBC outbreak. Initially, we defined the genotypes of all 74 isolates with 24-loci mycobacterial interspersed repetitive-unit-variablenumber tandem-repeat (MIRU-VNTR) typing. The MTBC population structure was found to be highly homogeneous; 74/74 (100%) of the isolates belonged to the Euro American lineage, also known as lineage 4 (L4), 72/74 of which (97.2%) were Latin American Mediterranean (LAM), and 2/74 (2.7%) were identified as the Haarlem genotype. Of the LAM isolates, 31/72 (43.05%) and 41/72 (56.94%) were from indigenous settlements based on the Amazon and Loretoyaco Rivers, respectively (see Table S1 in the supplemental material). One large cluster comprised 49 patients with identical 24loci MIRU-VNTR patterns, including 16 children. A smaller cluster was observed with 11 patients and 2 alleles distances. Patients in both clusters originated from different settlements in the study region, and 8/74 (10.8%) isolates had mixed MIRU-VNTR alleles ( Fig. S1A and B).
Whole-genome sequencing analysis. WGS was successfully performed for all 74 MTBC isolates. Phylogenetic classification of the isolates based on specific single nucleotide polymorphism (SNP) signatures described previously by Coll (Table  S1). Following an H37Rv reference-mapping approach allowed us to build a maximum parsimony tree (MPT) based on the SNP distance matrix of 890 SNPs, which confirmed the clonal population structure of the outbreak. Cluster analysis revealed two main molecular transmission chains of LAM 4.3.3. strains named cluster I and cluster II (Fig. 3). Furthermore, two other clusters were noted, i.e., cluster III, comprising four LAM 4.3.3. strains separated by more than 200 SNPs from cluster I/II, and cluster IV, comprising two Haarlem strains (Fig. 3).
The largest cluster, cluster I, contained 42 identical (42/74, 56.8%) isolates and 7 isolates differing by a maximum of two SNPs (Fig. 3). Out of the 49 patients in this largest  Table S1). The second cluster, cluster II comprised 18 identical (18/ 74, 24.3%) isolates, with 11 SNPs separating both clusters. This suggests a relatively recent common ancestor; however, the clear separation by 11 SNPs indicates two independent short-term transmission chains in the study region.
Extended WGS analysis of the LAM-MTBC outbreak. In a first attempt to increase genotyping resolution, we explored whether the inclusion of SNPs in repetitive and resistance-associated genes increases the typing resolution for the outbreak. This approach improved the phylogenetic resolution and led to a subdivision of the largest cluster, cluster I (Fig. 4). Overall, the number of SNPs included in the concatenated SNP alignment rose from 890 to 1,062 SNPs, resulting in the increase in the number of nodes in the outbreak from 5 to 8 and 3 additional branches in the MPT, i.e., three new putative transmission links ( Table 2; Fig. 4). Cluster I was split into three new nodes: one isolate at a 2-SNP distance (node E), two isolates at a 4-SNP distance (node F), and 12 isolates (node D) at a 5-SNP distance toward the parental node A (Fig. 4). The 12 isolates within node G were characterized by the presence of 5 novel SNPs (Table 3): Rv1194c Trp244Arg, Rv3621c (PPE65) Gln253Arg, Rv3478 (PPE60), Rv3478 (PPE60) Ala326Ala, and Rv1945 Ala327Ala.
Next, we assembled the outbreak-specific genome from isolate COL-2 de novo and used it as a reference genome. The COL-2 genome comprised 4,402,676 nucleotides  (two contigs, 3,947,682 and 454,994 bp) and was slightly smaller than the H37Rv reference genome with 4,411,532 nucleotides ( Table 2). The gene content between COL-2 and H37Rv was highly similar; however, we observed 6 disrupted and 10 newly annotated coding sequences due to larger insertions and deletions ( Table 2 and Table S2). Employing the COL-2 genome as a reference for the mapping approach resulted in a further increase in the number of differentiating SNPs in the concatenated sequence from 1,062 to 1,094, which also resulted in an increase of outbreak nodes in the MPT from 8 to 9 ( Table 2). The overall population structure in the MPT remained consistent, and node F was further differentiated into F1 and F2, resulting in two new branches in the MPT (Fig. 5). Lastly, we investigated potential mixed infections in all outbreak isolates (n = 67) based on the allele frequencies of 16 SNPs, differentiating especially cluster I, node G (5 SNPs), and cluster II (11 SNPs) isolates (Fig. 6). Overall, 22/67 isolates were considered clonal infections; i.e., we did not find heterogeneous alleles in any of the 16 genome positions (Table S3). However, in 20/67 isolates, either all cluster II-specific SNPs or all cluster I (node G)-specific SNPs were identified at a lower frequency (,100%), suggesting a mixed infection with at least two clones of the outbreak. In the remaining 25/67 isolates, heterogeneous alleles were present in some but not all of the node-specific positions.

DISCUSSION
Here, we employed a newly assembled outbreak reference genome and alternative variant calling parameters to enhance the resolution for analysis of an MTBC outbreak compared to classical comparative MTBC genomic analysis. Our alternative high-resolution approach revealed additional putative transmission events, which can improve epidemiological investigations and interventions in MTBC outbreaks or in settings with a very clonal MTBC population structure. Such investigations are urgently needed since indigenous peoples in the Amazon are facing the highest TB burden in the region of the Americas and Colombia but the lack of access to these patients limits our understanding of MTBC population structure and transmission dynamics in these remote settings (2, 23-26). Including repetitive and resistance-associated regions of the reference genome allowed us to detect additional variants differentiating the outbreak and further suggested that multiple patients in this remote setting were likely infected with more than one clone. In this study, WGS analysis confirmed the clonal expansion of one lineage 4 MTBC strain that infected most of the patients in this remote Amazon setting in our short sampling time from March to October 2016. It has been suggested that L4 strains were introduced into the indigenous pre-Columbian societies, presumably by the European colonization (27). In particular, our finding that LAM 4.3.3 MTBC strains accounted for 97% of the isolates in our study is in line with previous studies reporting LAM MTBC strains as the most common sublineage in northern South America (Ecuador, Peru, Brazil, and Venezuela), followed by Haarlem, Orphan, and T (20).
Our SNP-based phylogenetic analysis revealed that the LAM 4.3.3 outbreak comprises two clusters defined by distinct mutations. Patients within these clusters originate from different settlements, and there was no apparent infection hot spot. However, we speculate that many infections may have occurred in the municipality of Puerto Nariño, which is a major hub in the region and is frequently visited by people from different settlements. In fact, 24 out of 42 cases of the main transmission chain (cluster I, node A) were from Puerto Nariño and nearby settlements such as San Francisco, Ticoya, Veinte de Julio, and Puerto Esperanza, which is the main gathering area (i.e., fishing, shopping, and schooling) and is located the Amazon River port that connects the indigenous settlements with the modern world.
The classical MTBC genomic analysis indeed did not provide a significantly higher resolution than the 24-loci MIRU-VNTR genotyping investigation in our study setting. Here, only an extended variant calling approach could further increase the resolution of this outbreak scenario and identify new yet unrecognized transmission chains. First, we included repetitive regions such as proline-glutamic acid/proline-proline-glutamicacid (PE/PPE) and resistance-associated genes in our analysis. Of note, PPE/PE genes account for nearly 10% of the genes of MTBC and might play a role in the virulence FIG 4 Genetic relationship of patient isolates based on an H37Rv reference mapping approach, including repetitive regions and drug resistance-associated genes. The maximum parsimony tree (MPT) was based on 1,062 single nucleotide polymorphisms (SNPs) of 74 Mycobacterium tuberculosis complex (MTBC) isolates recovered from 16 indigenous settlements. The size of each node is proportional to the number of isolates. The genetic distance is indicated on branches as the number of SNPs that differ between nodes. Samples are color-coded based on the patient's settlement and identified with an ID number. In the data sets, the sample ID number is preceded by the label COL, which refers to Colombia. Outbreak clusters are named I, II, III, and IV. Clusters comprising the clonal events are shown with dashed boxes, and nodes of cluster I are termed A, B, C, and D. New nodes derived from node A are termed E, F, and G. LAM, Latin American Mediterranean; MDR, multidrug-resistant; AR, Amazon River; LR, Loretoyaco River. mechanisms of host-pathogen interaction (28,29); however, their high guanine cytosine (GC) content and sequence repetitions challenge standard WGS pipelines (22,30,31). Nevertheless, recent work has shown that the inclusion of PPE/PE genes with strict variant calling thresholds does not negatively impact the analysis and can provide novel and reliable information on mutations (31,32).
Second, we assembled a new reference genome from an outbreak isolate of the largest node, node A, using long-read sequencing data. Molecular epidemiological approaches in recent years mostly have used H37Rv as a mapping reference. However, its use as a sole reference genome has been discussed, as this does not consider the genetic diversity across all 9 distinct human-adapted MTBC lineages and sublineages (33,34). Thus, it is intriguing to consider whether the use of local reference genomes enables an improved resolution power, especially those caused by lineages other than lineage 4. Along these lines, Lee et al. have suggested deep sequencing in combination with the assembly of an outbreak-specific reference genome based on long-read sequencing as an alternative approach to enhance resolution and identify superspreaders in highly clonal outbreaks (35). However, this method is limited by high costs. Here, both approaches, i.e., including  repetitive regions for variant calling and a new outbreak reference genome, led to an increased outbreak resolution and a number of differentiating SNPs. With this new resolution, we also found evidence that multiple patients are potentially infected with more than one outbreak strain. Others, however, have found considerable intrapatient heterogeneity, often related to one infecting clone (36). Previous work has documented that mixed infections might account for 30% of TB cases, for which the mixed infection rates, i.e., with distinct strains, are estimated to be from 19% in pulmonary samples to 51% in extrapulmonary and pulmonary combined samples (37)(38)(39)(40). However, in settings with a clonal population structure and with classical comparative MTBC genomic pipelines, the assessment of mixed infections is extremely difficult (41). This is also particularly true for regions with a high incidence of MDR-TB, which are often dominated by few major clades with drastically reduced genetic diversity (42,43). Here, it is crucial for molecular diagnostics to also identify minority MTBC FIG 5 Genetic relationships of patient isolates based on a mapping approach including repetitive regions, drug resistance-associated genes, and a newly assembled genome from the outbreak strain 2 as reference. The MPT is based on 1,094 SNPs of 74 Mycobacterium tuberculosis complex (MTBC) isolates from 16 indigenous settlements. The size of each node is proportional to the number of isolates. The genetic distance is indicated on branches as the number of SNPs that differ between nodes. Samples are color-coded based on the patient's settlement and identified with an ID number. In the data sets, the sample ID number is preceded by the label COL, which refers to Colombia. Outbreak clusters are named I, II, III, and IV. Clusters comprising the clonal events are shown with dashed boxes, and their nodes are termed A, B, C, and D. New nodes derived from node A are termed E, F, and G. The partition of node F into F1 and F2 is shown. LAM, Latin American Mediterranean; MDR, multidrug-resistant; AR, Amazon River; LR, Loretoyaco River. populations in one patient, which could have a huge impact on diagnosis, treatment decisions, and clinical outcomes.
Our study has limitations, such as the short study duration, rendering the comprehensive analysis of the TB epidemiology in the study region difficult. Future work should be based on longitudinal sampling enabling prospective molecular surveillance and allowing us to trace the evolution of drug resistance in locally dominating clades/ strains. Our analysis was focused on SNPs, and the inclusion of insertions and deletions could further improve the phylogenetic resolution.
Conclusions. In conclusion, we confirmed that the TB epidemiology in the indigenous setting of Puerto Nariño in the Southwest Colombian Amazon is mainly driven by the ongoing transmission of one outbreak strain. Almost one-third of the patients were possibly infected with at least two clones of the local outbreak strain, highlighting the need for interventions to break transmission chains and better TB control in the population. We also demonstrated that customized variant calling pipelines and a new local reference genome can moderately increase the resolution of MTBC outbreaks. The potential benefit of these customized reference-mapping approaches for molecular surveillance studies needs to be investigated in other outbreaks and may vary depending on the strain genetic background.

MATERIALS AND METHODS
Patient cohort. We recovered a total of 74 MTBC isolates from primary sputum samples on Löwenstein-Jensen medium (LJ) from March to October 2016. These samples were obtained from patients with pulmonary TB allocated to 16 indigenous settlements based on the shore of the Amazon (AR) and Loretoyaco Rivers (LR) in Puerto Nariño (19). The sociodemographic and clinical characteristics of the TB patients were recorded. These included Age, sex, health system affiliation, household conditions, history of TB, Bacillus Calmette-Guérin (BCG) vaccination, recreational drug usage, alcohol, smoking, self-reported human immunodeficiency virus (HIV) status, and other medical diagnoses.
Statistical analysis. We calculated descriptive statistics of the demographic data of the population studied, including proportions and median and interquartile ranges using the functions of base R and RStudio (43).
DNA extraction and classical genotyping. We extracted genomic DNA using the PureLink genomic DNA minikit (Thermo Fisher Scientific) according to the manufacturer's instructions. DNA samples were genotyped using 24-loci mycobacterial interspersed repetitive-unit-variable-number tandem-repeat (MIRU-VNTR) as follows. Multiplex PCR amplification was performed with a GenoScreen typing kit, and fragment separation was run on the ABI 3500XL automated sequencer as described previously (45). Fragment analysis was accomplished using GeneMapper software (PE Applied Biosystems). Genotypes were identified on the web-based databank MIRU-VNTRplus (https://www.miru-vntrplus.org/MIRU/index .faces). Phylogenetic analyses were performed with BioNumerics software version 7.6 (Applied Maths). MIRU clusters comprised nodes with more than one isolate with an identical genotype. A 24-loci MIRU-VNTR dendrogram was calculated using the categorical distance and the unweighted pair group method with an arithmetic mean algorithm (UPGMA).
Whole-genome sequencing and bioinformatic analysis. For WGS, DNA libraries were prepared based on the Baym protocol with the Nextera XT kits and sequenced with the NextSeq 500 sequencing platform from Illumina (151 bp, paired end) according to the manufacturer's instructions (46). We analyzed the sequencing data with the MTBseq pipeline, thus using a reference mapping approach (22,42,47) and inferring the phylogenetic lineage and resistance-associated variants (48). Briefly, FASTQ files were mapped to the reference genome Mycobacterium tuberculosis H37Rv (GenBank ID NC_000962.3) using the Burrows-Wheeler Aligner Alignment tool (49), and refined mapping was done with the Genome Analysis Toolkit software package (50). For inference of the phylogeny, a concatenated SNP alignment was built from SNP positions with the MTBseq default thresholds: 4 reads mapped in forward and reverse orientations, a minimum of 4 reads supporting the allele with a Phred score of not less than 20, and 75% allele frequency. We excluded SNP positions within repetitive regions and drug resistanceassociated genes and finally combined positions that matched these thresholds in .95% of the isolates. We submitted sequencing data to the European Nucleotide Archive (accession numbers are given in Table S1). In addition, we generated a concatenated SNP alignment with the thresholds mentioned above but also including repetitive regions (51).
Cluster analysis was performed using the MTBseq pipeline with a threshold of 5 SNPs (48,52).
A new reference genome of one outbreak strain named COL-2 was assembled using the Single Molecule Real Time (SMRT), long-read technology of the Sequel II system (Pacific Biosciences [PacBio]) to increase the resolution of the outbreak. We selected the local isolate COL-2 from the largest clonal cluster to serve as our reference. The DNA library was prepared with the SMRTbell Express Template prep kit version 2.0 with barcoded adapters from Integrated DNA Technologies (IDT) and sequenced on the Sequel II system. De novo genome assembly was performed using the PacBio SMRTlink software version 9.0 and its microbial assembly application, with the genome length set to 4.4 Mb and a seed coverage of 30. The assembly of 54,267 subreads (N 50 length, 6,696 bp; coverage, 73.5Â) resulted in two polished contiguous sequences (contigs) of 3,947,682 and 454,994 bp, respectively.
Maximum parsimony trees (MPT) were generated with BioNumerics either based on a distance matrix derived from the standard 24-loci MIRU-VNTR data or derived from the three reference-mapping approaches based on a concatenated SNP alignment with classical SNP calling thresholds: (i) the classical comparative MTBC genomic pipeline, (ii) including repetitive regions and drug resistance-associated regions, and (iii) reference mapping to a new outbreak genome and including repetitive regions. All approaches employed the default MTBseq conservative SNP-calling parameters described above. In the data sets, the sample ID number is preceded by the label COL-which refers to Colombia; however, only the ID number is depicted on the MPT.
We searched the binary alignment map (BAM) files of the outbreak isolates (n = 67) for variants with variable allele frequencies in 16 outbreak-differentiating SNP positions using binoSNP as well as in resistanceassociated regions (i.e., alternative alleles with P values of ,0.01 were considered) (53). The 16 phylogenetic informative positions were further validated with a visual inspection of the bam.gatk alignment files using the integrative genome viewer (IGV). The frequencies of these variants were represented for all outbreak isolates in a heat map using GraphPad Prism version 9 software.
To identify the corresponding positions of variants of interest in the PacBio assembly, we performed a genome alignment between H37Rv (lineage 4 [L4]) (GenBank ID NC_000962.3) (54-56) and PacBio COL-2 (L4, Latin American Mediterranean [LAM]) reference genomes using the Mauve plugin (57) with Geneious Prime software. Both genomes were annotated with Prokka with default parameters to identify new genes and putative loss/gain of functions due to larger insertions and deletions (58). Repetitive regions, mobile elements, hypothetical genes, and smaller insertions and deletions not affecting the coding sequence annotation are not listed in Table S2.
The geographical maps shown are an adaptation of Google and Colombian National Maps: Comunidades Indígenas de Puerto Nariño and Asociación de Indígenas, Tikuna, Cocama and Yagua (ATICOYA). Open data were provided by Gobernación del Amazonas at https://www.datos.gov.co.
Data availability. The Illumina short-read sequences of the 74 genomes of the outbreak and the PacBio long-read sequence corresponding to the local reference genome COL-2 are publicly accessible in the European Nucleotide Archive under the projects PRJEB57971 and PRJEB57950 (see Table S1).