Bayesian reconstruction of Mycobacterium tuberculosis transmission networks in a high incidence area over two decades in Malawi reveals associated risk factors and genomic variants

Understanding host and pathogen factors that influence tuberculosis (TB) transmission can inform strategies to eliminate the spread of Mycobacterium tuberculosis (Mtb). Determining transmission links between cases of TB is complicated by a long and variable latency period and undiagnosed cases, although methods are improving through the application of probabilistic modelling and whole-genome sequence analysis. Using a large dataset of 1857 whole-genome sequences and comprehensive metadata from Karonga District, Malawi, over 19 years, we reconstructed Mtb transmission networks using a two-step Bayesian approach that identified likely infector and recipient cases, whilst robustly allowing for incomplete case sampling. We investigated demographic and pathogen genomic variation associated with transmission and clustering in our networks. We found that whilst there was a significant decrease in the proportion of infectors over time, we found higher transmissibility and large transmission clusters for lineage 2 (Beijing) strains. By performing evolutionary convergence testing (phyC) and genome-wide association analysis (GWAS) on transmitting versus non-transmitting cases, we identified six loci, PPE54, accD2, PE_PGRS62, rplI, Rv3751 and Rv2077c, that were associated with transmission. This study provides a framework for reconstructing large-scale Mtb transmission networks. We have highlighted potential host and pathogen characteristics that were linked to increased transmission in a high-burden setting and identified genomic variants that, with validation, could inform further studies into transmissibility and TB eradication.

. Demographic characteristics associated with recent infection, defined as cases for whom the source of infection has been inferred within the last 5 years. Cases Table S3. Genes found to be associated with transmissibility in Nebenzahl- Guimaraes et al. 2017 in Mtb cases collected from the Netherlands. We identified 91 loci that had variation within the Karonga population in these genes, with 52 loci found in only one strain. SNPs and INDELs identified in these genes in more than one Karonga strain are shown, with variants found in both the Karonga population and the Netherlands population italicised. The number of transmitter and non-transmitter Karonga strains harbouring each variant is shown. The number of missing calls in the total strains are also shown. No significant association with transmissibility was found in these genes with the phyC or GWAS analyses. Δ = deletion, + = insertion.

SUPPLEMENTARY METHODS AND RESULTS
The R software, TransPhylo, was used to reconstruct transmission networks, allowing for within host evolution and incomplete sampling. This approach requires the user to define model priors for the underlying transmission reconstruction algorithm. To define appropriate prior values for each parameter we determined initial values based on a review of relevant literature and population level calculations of values from our data. We then conducted a sensitivity analysis, varying specified prior values, on two large transmission cluster of cases linked by up to 50 SNP differences, with the likelihood of transmission links and direction of transmission between cases analysed.
TransPhylo parameters Figure S2. An illustration of the stages of infection within the generation time and sampling time distributions.

Generation time distribution
This is the time between a host becoming infected and infecting another individual ( Figure S2), modelled as a gamma distribution. This can be highly variable in TB as an individual can develop active, infectious disease relatively quickly after infection in a few weeks or there can be a long period of latency of many years 1,2 , though the chances of progression are greatest in the first few years 3 . This rate can also be affected by HIV, which can increase the chances of progression to active disease 4 .
The generation time distribution will also include the time that a host may remain infectious and transmit to a recipient while in active treatment, which can be up to two months with effective treatment, though with decreasing infectivity 5,6 . Additionally, it will also include any time between diagnosis and the initiation of treatment. This can be variable depending on whether the patient returns to a healthcare centre for treatment, though the Karonga Prevention Study (KPS) provides support to reduce this risk by including measures such as home visits to patients to encourage them to come for treatment 7 .
To account for the variability in the generation time, the gamma distribution choice was modelled to include the chance for a reasonably quick progression to active disease but without penalising transmission trees that include potential long latency in the host. Therefore, prior parameter values chosen were characterised by a relatively rapid rise but with a long tail for latency. We chose three gamma distribution parameters to test in the sensitivity analysis ( Figure S3) and posterior distributions were analysed.

Sampling time distribution
This is the total time between a host becoming infected and the sample collection. This includes the time between becoming infected to developing symptoms, and the total delay time between a patient becoming symptomatic and receiving a TB diagnosis, including the patient delay time in seeking healthcare and the healthcare system delay in diagnosis ( Figure S2).
Systematic reviews into delay times in pulmonary TB diagnosis in low-income settings placed the median time from a patient becoming symptomatic to diagnosis to be around 67 days 8,9 ; the KPS provides relatively effective detection rates and management of TB 10 and so we would not expect median delay times to exceed this estimate.
In the Karonga Prevention Study (KPS), samples are taken by project staff at hospitals and health centres at the time of diagnosis after screening for symptoms 11 , and the first available culture positive sample for each case is used in our analysis. In this population, the sampling time will be similar to the generation time as it will include the all aspects of the generation time apart from the relatively short time remaining infectious after diagnosis (and sample collection) and any delay in the initiation of treatment. Again, the sampling time can be affected by HIV status as these patients may present at healthcare centres more frequently and thus diagnosis may be quicker.
Therefore, we set prior parameters of the sampling time distribution to allow for a large variability in the time to sampling after initial infection. We have used the same distribution as the generation time distribution gamma shape and scale parameter values subject to sensitivity analysis ( Figure S3).

Reproductive number -R
This is the population-level estimate of the expected number of secondary cases caused by each case. The smear-positive adult TB incidence in the Karonga region has dropped from 124/100,000 per year in the mid-90s to 87/100,000 per year in 2013 7 , so R will be < 1. Therefore, the initial value of R in the sensitivity analysis was set as 0.8 and updated through MCMC iterations.

Sampling density -π
Over the period of the study there were about 3305 cases of TB, including 3130 pulmonary and 2585 smear positive. After removing isolates that failed QC, have a high probability of mixed infection 12 , and that are multiple samples from the same episode of TB in a patient, there were 1857 cases in our dataset. A sampling density (π) of 1 would be equal to the complete sampling of TB cases. Our dataset contains WGS data for 56% of the reported cases of TB in the region (1857 of 3305). This value though does not include undiagnosed cases and cases of transmission from outside the study area. To account for this uncertainty, we have varied the sampling density in the sensitivity analysis with π = 0.4, 0.5 and 0.6, which correspond levels of ~30%, 11% and 0% undiagnosed cases in the population.

Within-host coalescent rate -Neg
If there is no within-host heterogeneity then the transmission tree will correspond directly to the phylogenetic tree, with all transmission events taking place at the same time as phylogenetic coalescent events (nodes of the tree) and all genetic difference between cases evolving at the time of transmission. Incorporating the within-host coalescent rate allows for coalescent events between two cases to be further back in time than the transmission date, i.e. the time to the most recent common ancestor between two cases may be prior to transmission and within a single host. Therefore, there is the possibility that the within-host lineage that transmitted to the secondary case may have been different to the one that is sampled.
This parameter is the rate at which within-host lineages will coalesce (the average time that all lineages within a host (based on the effective population size) can be traced back to their most recent common ancestor), measured in years. The underlying model in TransPhylo employs an extension of Kingsman's coalescent theory 13,14 . The parameter value used corresponds to the product of the effective population size within a host and the generation time. For example, if an organism has a within-host effective population size of 100 and a generation time of 1 day, Neg will equal 100*(1/365) = 0.274 years. In TransPhylo, this parameter is applied as a single value over the population assuming a constant coalescent rate across all samples as well as a bottleneck where only a single within-host lineage is transmitted.
The effective population size, the number of individual bacilli to account for the genetic variation within an organism, will be linked to the mutation rate 15 . The within-host mutation rate for the KPS population has been calculated previously through analysis of SNP differences in longitudinal samples 16 , resulting in a rate of 0.45 SNPs per genome per year. A within-host coalescent rate of 1.48 years has been calculated previously for a TB outbreak with a comparable mutation rate (0.48 SNPs per genome per year) 2 . This single value is unlikely to capture the full picture of within-host diversity in all samples and the parameter has been difficult to accurately estimate in simulations with the program 17 , with large variation in inferred values. To allow for transmission events before phylogenetic events to not be penalised in inferred transmission trees but owing to the difficulty in estimating this value a priori, an initial value of 1.48 was set and updated through MCMC iterations.

Probability thresholds and penalisation for extrapulmonary TB
The output of TransPhylo is a posterior sample set of transmission trees from which the probability of pairwise transmission events between patients can be calculated. Results using the TransPhylo algorithm on a simulated dataset of known transmission events shows that a probability threshold of > 0.5 is sufficient for a specificity rate of 99% and sensitivity rate of 72% 17 . Accordingly, we have chosen a probability threshold of 0.5.
Inferred transmission chains with the transmitter deemed to be a host with extrapulmonary TB were excluded; extrapulmonary TB does not transmit 18,19 so we specified that these cases can only be recipient cases in our analysis.

Sensitivity analysis sampling
The methods used to analyse the sequence data and perform the TransPhylo analysis are described in the methods section of the main paper. A sensitivity analysis was conducted on large transmission clusters, defined as cases linked by up to 50 SNPs, from lineage 2 (n =27) and lineage 3 (n = 37). The generation and sampling time distribution and the sampling density were varied to determine the effect of parameterisation on the inference of transmission links and direction of transmission ( Figure S4).

Generation and sampling time distribution
Sampling density 6   Table S4. The generation time and sampling time gamma distributions used in the sensitivity analysis, repeated for two large transmission clusters from lineages 2 and 3. Gamma distributions A, B and C are shown in Figure 2.

Sensitivity analysis results
TransPhylo runs with generation distributions B and C reached convergence over the 10 5 MCMC iterations in the lineage 2 cluster (Figure S4A), and all runs reached convergence in the lineage 3 cluster (Figure S4B). Considering the posterior generation time distributions for each run, which is the interval for all transmission events including sampled and inferred non-sampled cases with a probability over 0.5 (Figure S5), we found that generation time distributions B and C fit the data well. The runs using generation distribution A tended towards a shorter median generation time, which resulted in transmission trees with branches containing an unrealistically high number of inferred, non-sampled hosts with very short transmission times (Figure S7A and S7D). Changing the sampling density within each prior generation distribution did not affect the posterior generation time distribution significantly. The reproductive number was updated through the MCMC iterations and the posterior ranged between 1 and 2.5 (Figure S6), thus an initial prior of 1.75 was used for the full analysis and updated through MCMC iterations.  We found a general agreement in the results of the transmission links inferred by TransPhylo when varying the generation time between distributions B and C, with distribution A leading to varying results (Tables S5 and S6). The number of SNPs between sampled cases that were inferred to be directly linked was between 0 and 2 SNPs in both clusters, with a median of 0 SNP distance (except runs A-4 and A-6).
Inspecting the transmission trees produced by TransPhylo, we found that setting a sampling density of π = 0.4 led to branches where there were a large number of inferred, non-sampled cases in a short time (illustrated at terminal branches in Figure S7A and S7D). While the rates of undiagnosed cases vary across high-incidence countries, it has been reported to be as high as 30-50% [20][21][22] , which would better correspond to a sampling proportion ≤ 0.4. The justification for using a higher sampling density in this study are two-fold. Firstly, the KPS has been working in the Karonga region since 1986 to improve case findings and lower the proportion of undiagnosed TB cases, and this may result in a higher sampling density in our study population than has been reported in other high-incidence areas. Secondly, the starting trees for TransPhylo analysis are subtrees of the whole population that may have different characteristics. Robust surveillance and follow-up of confirmed TB cases with household contact tracing and screening can lead to a higher TB case detection 23 and this may be higher around clustered cases, increasing the within cluster sampling density as compared to the whole population.
Considering only generation distributions B and C and sampling densities 0.5 and 0.6, we found the transmission trees very similar (Figure S7B and S7C, and Figure S7E and S7F), and the characterisation of infector strains to be identical in all but one case (ERR181707 in lineage 3, Table S6). As it is unlikely that there are no undiagnosed cases within the transmission clusters (corresponding to a sampling density of ~ 0.6), we chose to set the final generation distribution for the full analysis of all clusters in the population as distribution B and a sampling density of 0.5, checking and validating transmission events using INDELs and epidemiologically-linked cases where possible.    ERR221542  1  1  1  1  1  1  1  1  1  9  ERR190401  0  0  0  0  0  0  0  0  0  0   ERR221545  0  0  0  0  0  0  0  0  0  0   ERR212126  1  1  1  1  1  1  1  1  1  9 ERR181916