Phylodynamic Analysis of Ebola Virus in the 2014 Sierra Leone Epidemic

Background: The Ebola virus (EBOV) epidemic in Western Africa is the largest in recorded history and control efforts have so far failed to stem the rapid growth in the number of infections. Mathematical models serve a key role in estimating epidemic growth rates and the reproduction number (R0) from surveillance data and, recently, molecular sequence data. Phylodynamic analysis of existing EBOV time­stamped sequence data may provide independent estimates of the unobserved number of infections, reveal recent epidemiological history, and provide insight into selective pressures acting upon viral genes. Methods: We fit a series mathematical models of infectious disease dynamics to phylogenies estimated from 78 whole EBOV genomes collected from distinct patients in May and June of 2014 in Sierra Leone, and perform evolutionary analysis on these genomes combined with closely related EBOV genomes from previous outbreaks. Two analyses are conducted with values of the latent period that have been used in recent modelling efforts. We also examined the EBOV sequences for evidence of possible episodic adaptive molecular evolution during the 2014 outbreak. Results: We find evidence for adaptive evolution affecting L and GP protein coding regions of the EBOV genome, which is unlikely to bias molecular clock and phylodynamic analyses. We estimate R0=2.40 (95% HPD:1.54­3.87 ) if the mean latent period is 5.3 days, and R0=3.81, (95% HPD:2.47­6.3) if the mean latent period is 12.7 days. The estimated coefficient of variation (CV) of the number of transmissions per infected host is very high, and a large proportion of infections yield no transmissions. Conclusions: Estimates of R0 are sensitive to the unknown latent infectious period which can not be reliably estimated from genetic data alone. EBOV phylogenies show significant evidence for superspreading and extreme variance in the number of transmissions per infected individual during the early epidemic Sierra latent period

AI036214). EMV was partly supported by the UK Medical Research Council (MRC) and the UK Department for International Development (DFID) under the MRC/DFID Concordat agreement. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The 2014 Ebola virus in Western Africa is the largest Ebola epidemic in history and the number of infections continues to grow exponentially. The unprecedented rate of growth has negatively impacted the quality of epidemiological surveillance and has made it difficult to map and characterize the spread of the epidemic . As local health and surveillance systems are overwhelmed, an unknown proportion of cases are unreported, are not isolated, and do not receive adequate treatment. Predictive modelling and evaluation of intervention efforts is also hampered by the rapid rate of increase and imperfect case reporting.
In the absence of complete surveillance data and contact tracing, mathematical models have provided valuable insights into the rate of epidemic growth and the reproduction number ( ). The reproduction number is a useful parameter for characterizing the difficulty of eradication. Early analyses based on case reporting by the World Health Organization (WHO) indicated that differed substantially between countries . In some instances, estimates of based on different models are not in agreement, implying that they are sensitive to the assumptions of the mathematical framework used, and the exact data sets used for parameter estimation. Althaus estimated (2.412.67) for Sierra Leone based on WHO case reports through late August 2014, whereas Fisman et al. estimated for Sierra Leone using a similar data set. More recently, Towers et al. estimated (1.0,1.5) for Sierra Leone using a longer timeseries of case reports and a model with timedependent reproduction numbers. And, the WHO Ebola Response Team presents an estimate of (1.7892. 26) for Sierra Leone, which additionally makes use of new information about the incubation period and serial interval for the current epidemic. The analysis by Althaus modeled the natural history of infection by including a mean 5.3 day latent period before cases become infectious, whereas the analysis by Fisman et al. used a much longer latent period, but did not explicitly consider the lack of infectiousness during the latent period.
We conduct a phylodynamic analysis of 78 Ebola virus genetic sequences discussed in Gire et al. . These data provide an independent source of information about epidemic growth rates and and may corroborate previous estimates based on case reporting data. To examine the sensitivity of estimates to the unknown latent period, we repeat our analysis with two values (5.3 and 12.7 days) which have been estimated from previous Ebola outbreaks.
Recently, Stadler et al. conducted a similar phylodynamic analysis of the same data. We conclude our analysis with a discussion of the primary differences in the analytic approach and findings of these two studies.
An advantage of phylodynamic analysis is that estimates are robust to incomplete sampling of cases, and the proportion of cases which are unreported does not enter directly into our model . Sequence data may also be informative about epidemiological parameters where standard surveillance data are unhelpful. Previous phylodynamic analyses have shown how sequence data can be highly informative about who infected whom and risk factors for transmission . In addition to , we estimate parameters that describe heterogeneity in transmission rates between infected individuals. Fitting these models allows us to characterize superspreading as well as estimate the proportion of cases which do not yield secondary infections.
We also examine the virus genomes for evidence of natural selection, which can potentially bias phylodynamic analyses by violating assumptions of neutral evolution. Because the primary analysis of EBOV isolates found a large number of nonsynonymous mutations in whole length genomes, and because strong selective pressures can bias molecular clocks , and violate the assumptions of the standard coalescent process , we performed an exhaustive analysis of all genes in the EBOV genome for evidence of episodic diversifying natural selection using sensitive codonsubstitution evolutionary models. Data. We conduct a secondary analysis of EBOV phylogenies presented by Gire et al. Samples were collected for wholegenome deep sequencing from 78 patients between 25 May and 20 June in Sierra Leone. In situations where multiple samples were available for a single patient, only the first sample was used in the phylogenetic analysis. Dates of common ancestry for all pairs of samples were estimated with using Bayesian relaxed clock methods . Further details of the sequencing protocol and models used for phylogenetic analysis can be found in Gire et al. This procedure yields a sample from the posterior distribution of dated phylogenies, from which we sampled 1,000 trees to make computation tractable.
For molecular selection analyses, we augmented the 2014 outbreak sequences with 1737 (depending on the gene) additional isolates from previous EBOV outbreaks (19762007), which were also included in the original Bayesian relaxed clock analysis .
Models.The starting point for the analysis is the SEIR model, which has previously been applied to EBOV outbreaks and has recently been applied to the 2014 epidemic .
The parameter will be the transmission rate per infectious individual, will be the rate that infected progress from the latent period to the infectious period, and will be the rate that infectious cases are removed due to death and burial or by effective isolation and treatment. The ordinary differential equations for the deterministic SEIR model are: These equations describe the dynamics of the number exposed noninfectious individuals and the number of infectious individuals . We make the approximation that the majority of the population is in the susceptible category ( ) for this and subsequent models, such that an equation for the dynamics of is not needed. We will refer to this as the ODE SEIIR model.
A stochastic version of the SEIIR model was also fitted in order to account for any bias due to noisy dynamics during the early exponential growth phase of the epidemic. The equations for this model are given by the following stochastic differential equations: where are independent Wiener (standard Brownian motion) processes, and . This system accounts for noise in incidence and deaths, but does not account for noise in the composition of into the and categories because of the difficulty of fitting such a system to data. Note that the stochastic terms in the equations for and are the same and multiplied by and respectively. We will refer to this model as the SDE SEIIR model. The euler method with a time step of one day was used to simulate trajectories from the SDEs.
Statistical analysis. The epidemiological models were fitted to EBOV phylogenies using the rcolgem package in R , which computes the likelihood of epidemiological parameters given a phylogeny. When fitting the ODE and SDE SEIIR models, likelihoods were calculated for each phylogeny with the boundary condition that each sample is a superspreader with probability , which is estimated. Models were fitted using a Bayesian Markov chain within Metropolis (MCWM) algorithm , which integrates over the distribution of phylogenies previously estimated in Gire et al. . This algorithm was implemented by customizing the mcmc package in R. At each step of the MCWM algorithm, the likelihood of the set of trees given a solution of the epidemiological model is approximated by The genealogies used in the approximation are drawn uniformly at random from the distribution estimated by Gire et al. If fitting a stochastic model, a doublemarginalization is required over genealogies and simulations of the stochastic model. Such a Markov chain will sample the posterior distribution regardless of the choice of sample size , however the value used will influence the efficiency of the algorithm. We chose for ODE models and for SDE models to match the architecture of our high performance computing cluster.
In the SEIR model, two parameters were estimated: and the time when the epidemic was initiated in Sierra Leone. In the ODE SEIIR and SDE SEIIR model, the parameter was also estimated, which controls the proportion of cases in . A diffuse lognormal prior (mean 3.2, standard deviation 2.5) was used for , a uniform(0,1) prior was used for , and a normal prior for with mean April 23, and standard deviation of 6 days (based on the results by Althaus ).
We compared models using the approximate AICM method . AICM is a summary statistic for the goodness of model fit for Bayesian analyses, and is analogous to the Akaike information criterion (AIC ) used for model selection in the maximum likelihood framework. AICM was calculated using Tracer1.6 . At least two MCWM chains were sampled for each model and combined with 20% of samples removed for burnin and effective sample sizes were computed to confirm adequate sample size. tree into the foreground (the 2014 Western Africa EBOV clade) and background (all other branches) segments, and fitted two 3bin distributions jointly to all the branches in each partition. A likelihood ratio test of the unconstrained model versus the null model where all (i.e. negative selection or neutral evolution) was used to establish significance for evidence of diversifying positive selection affecting a proportion of sites along a proportion of 2014 EBOV lineages. All analyses have been implemented and run using HyPhy v2.12 . Estimates of are sensitive to the latent period which could not be estimated from the genetic data alone. Published estimates of the duration of the latent period based on earlier Ebola outbreaks are highly variable , but we present results based on two values that have been used in recent modelling studies of the current epidemic in Western Africa. If the latent period is a mean of 12.7 days, the fitted ODE SEIIR model provides an estimate of (95% HPD:2.476.31). http://currents.plos.org/outbreaks/article/phylodynamicanalysisofebolavirusinthe2014sierraleoneepidemic/ 14/21 Figure 4 shows the estimated cumulative number of infections through time as predicted by the ODE SEIIR model with days. These estimates, while imprecise, are consistent with reported cases by WHO which were not used for model fitting. The slightly larger number of cases predicted by the model may in part reflect underreporting of cases in the WHO data.
We find strong evidence for superspreading. The fitted SEIIR models which account for supserspreading have higher median posterior log likelihoods of 346.9 versus 380.1 for the SEIR model. The ODE SEIIR model is also superior to the SEIR model by the AICM criterion. For two distinct Markov chain samples, we find AICM=38.7 and 25.6 in favor of the ODE SEIIR model.
In all fits of the SEIIR model, the estimated proportion of cases in the high transmission rate category is less than 63% and posterior median estimates are approximately 10%. Figure 5 shows the EBOV phylogeny with maximum posterior probability. Branches are colored with the probability that the virus lineage inhabits a superspreading host. The superspreader lineage probabilities are based on the median posterior parameter estimates with the ODE SEIIR model. When a lineage occupies a superspreading host (shaded red), it is much more likely to undergo a coalescent event, that is, to have common ancestry with other sampled lineages. This process yields phylogenies with very imbalanced topologies. It also introduces correlation between the lengths of neighboring ancestral and daughter branches, as a lineage in a superspreading host is likely to undergo several coalescent events in short succession.  16 Selection analyses. Table S1 provides an overview of codonbased selection analyses of the seven EBOV protein coding regions. There is a notable variation in nucleotide level diversity across genes (total tree length), with the glycoprotein (GP) showing the highest diversity. About 0.5% of branchsite combinations in the long RNA polymerase (L) gene appear to be under strong diversifying selection ( ) in the 2014 clade, whereas the entirety of sequence evolution in the GP gene is comprised of nonsynonymous changes ( at 100% of branchsite combinations). The remaining genes do not contain significant positive selection signal.
When we asked which individual sites showed evidence of episodic diversification (using the MEME method with pvalue < 0.95), sites 388 and 389 in the heavily glycosylated mucinlike domain of GP, and sites 1396, 1492, 1722 in L were identified.
Phylodynamic analysis of EBOV sequences provides a new perspective on and epidemic growth rates that is independent of previous analyses based on WHO  While we find that it is not possible to estimate the latent period from genetic data alone, Stadler et al. have conducted a phylodynamic analysis of the same EBOV data and estimated a mean incubation period (assumed equivalent to the latent period) of 4.92 days (95% HPD 2.1123.20). Stadler et al.'s inference of incubation periods was made possible by using additional information, namely the times of genetic sequence sample collection, which were assumed to be collected at a constant percapita rate. By calibrating the exponential growth rates of the epidemic to match the rate of sample collection, other parameters are rendered identifiable. Incorporating a model of the sampling process into phylodynamic inference can greatly increase statistical power , however it can also bias estimates if the sampling process is misspecified. We do not find evidence that the sampling rate was constant as required by the analysis in Stadler at al. , but rather that it increased steadily over the sample collection period. By comparing the cumulative number of infections reported by WHO to the cumulative number of samples collected, we find that the sampling rate varied from 20% early in the epidemic to 70% near the end of the sample collection period. An alternative to using times of sample collection to calibrate growth rates would be to use the WHO case reports. Unfortunately, there is very little overlap between the timestamped EBOV phylogenies and WHO case reports because all samples were collected during the early portion of the epidemic.
In contrast to the analysis by Stadler et al. , we find statistically significant support for a model which features superpreading (heterogeneous transmission rates). These divergent findings may be due to differences in the population genetic models used (coalescent and birthdeathsampling). The discrepancy may also be due to a different parameterisation of superspreading. The analysis by Stadler et al. required two additional parameters to describe superspreading. We chose a model of superspreading which required only one additional parameter, thereby increasing discriminatory power at the expense of some realism. The quantitative estimates of the CV of the reproduction number may be biased upwards by unrealistic distributional assumptions in our model. It is unlikely that transmission heterogeneity is well described by a mixture of only two transmission rates.
It is possible to characterize superspreading patterns from virus phylogenies because the variance in transmissions per case alters the genetic relatedness of a random sample of EBOV sequences . The EBOV phylogenies are highly imbalanced, and neighbouring branch lengths are highly correlated. We hypothesize that these features are a consequence of high variance in transmission rates, and we have proposed an epidemiological population genetic model that reproduces these features. Our epidemiological model of superspreading lacks some realism, however our parameterisation of the transmission process allows us to easily estimate variance in transmission rates.
High variance in transmission rates may hamper contact tracing efforts, since a single missed contact may trigger a sizeable outbreak. Epidemics which feature a highly skewed distribution of transmissions per infected individual differ substantially from epidemics where the number of transmissions cluster around . In epidemics with many superspreading events, the probability of epidemic extinction is greater, and the probability that a single introduction into a susceptible population will trigger an epidemic is also lower. But, when outbreaks do occur, they are more explosive and contact tracing may be more difficult. Furthermore, intervention strategies that are targeted towards individuals with higher transmission risk are likely to be more effective in epidemics with superspreading events. We estimate that a small proportion of infected cases are responsible for a majority of transmissions and a large proportion of infections yield no transmissions.
Our molecular selection analyses suggest that episodic diversifying selection may be operating on L and GP genes. When analysing recent viral isolates, much of the selection signal could be driven by overall maladaptive substitutions along terminal branches due to intrahost evolution . When additional isolates become available, some of the techniques for filtering out such substitutions (e.g. analysing only internal branches ) may prove fruitful. The functional importance of sites subject to such forces remains to be elucidated.
Many factors contribute to the uncertainty of our findings, including uncertainty in the EBOV phylogenies, dates of common ancestry, and inherent noisiness of the epidemic process during the early period of exponential growth. Our estimates are based on a relatively small sample of EBOV sequences, and much greater precision could be achieved if a larger proportion of cases are sequenced over a longer period of time. Genetic sequence data are only available for the very early portion of the epidemic in Sierra Leone. Estimates may differ in other countries and settings, as well as through time as intervention efforts are scaled up and the population adapts to the growing epidemic. Phylodynamic methods are robust to variable and incomplete sampling of cases, so that virus sequences may be a useful supplement to epidemic surveillance if a growing proportion of cases are not reported to health systems. Selection analysis results on EBOV genes; N, number of sequence; L, number of codons; T, total tree length, expected substitutions/nucleotide site.
is the inferred distribution of the ratio of nonsynonymous to synonymous substitution rates under the modified branchsite REL model (see text) on the clade comprising 2014 outbreak isolates (weight of each category is indicated in the parentheses); for the two genes with significant pvalues, we also indicate approximate 95% confidence intervals (likelihood profile) for the ω categories with estimates over 1 in square brackets; is the corresponding distribution for the remainder of the tree (pre2014); is the pvalue for the likelihood ratio test for evidence of episodic diversifying selection anywhere in the 2014 clade.