Going back to the roots: Evaluating Bayesian phylogeographic models with discrete trait uncertainty

https://doi.org/10.1016/j.meegid.2020.104501
Under a Creative Commons license
open access

Highlights

  • In discrete virus phylogeography, the root (origin) of the outbreak is often studied.

  • We studied how phylogeographic models perform root state classification.

  • We simulated data sets and increased the number of sequences and discrete traits.

  • Phylogeographic models tend to perform best at intermediate sequence data set sizes.

  • The Kullback-Leibler divergence increases with both discrete state space size and data set size.

Abstract

Phylogeography is a popular way to analyze virus sequences annotated with discrete, epidemiologically relevant trait data. For applied public health surveillance, a key quantity of interest is often the state at the root of the inferred phylogeny. In epidemiological terms, this represents the geographic origin of the observed outbreak. Since determining the origin of an outbreak is often critical for public health intervention, it is prudent to understand how well phylogeographic models perform this root state classification task under various analytical scenarios. Specifically, we investigate how the size of the discrete state space and the size of the sequence data set influence root state classification accuracy. We performed phylogeographic inference on several simulated DNA data sets while i) increasing the number of sequences and ii) increasing the total number of possible discrete trait values. We show that phylogeographic models tend to perform best at intermediate sequence data set sizes. Further, we demonstrate that a popular metric used for evaluation of phylogeographic models, the Kullback-Leibler (KL) divergence, increases with both discrete state space size and data set size. Moreover, by modeling phylogeographic root state classification accuracy using logistic regression, we show that KL is not supported as a predictor of model accuracy, indicating its limited utility for assessing phylogeographic model performance on empirical data. These results suggest that relying solely on the KL metric may lead to artificially inflated support for models with finer discretization schemes and larger data set sizes. These findings will be important for public health practitioners seeking to use phylogeographic models for applied infectious disease surveillance.
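The link between KL and the number of discrete states can be illustrated with a small sketch. It assumes, as is common in discrete phylogeography but not stated in the abstract, that KL is measured from a uniform prior over K states to the posterior root state distribution. The function names and the "peaked" posterior (0.9 on one state, the rest spread evenly) are hypothetical and chosen only for illustration; they are not the authors' simulation setup.

```python
import math


def kl_from_uniform(posterior):
    """KL(posterior || uniform prior) in nats over K discrete states.

    Assumption: the prior on the root state is uniform (1/K per state).
    """
    k = len(posterior)
    prior = 1.0 / k
    # Terms with zero posterior mass contribute nothing to the sum.
    return sum(p * math.log(p / prior) for p in posterior if p > 0.0)


def max_kl(k):
    """Largest possible KL against a uniform prior: all mass on one state."""
    return math.log(k)


def peaked_posterior(k, top=0.9):
    """Hypothetical posterior: `top` on one state, remainder spread evenly."""
    rest = (1.0 - top) / (k - 1)
    return [top] + [rest] * (k - 1)


# Same posterior "shape", increasing number of discrete states K.
for k in (2, 4, 8, 16):
    kl = kl_from_uniform(peaked_posterior(k))
    print(f"K={k:2d}  KL={kl:.3f}  upper bound log(K)={max_kl(k):.3f}")
```

Under these assumptions the KL value grows with K even though the posterior is equally concentrated on its top state in every case, and the attainable maximum, log(K), grows as well. This is consistent with the caution above about comparing KL across discretization schemes, but it says nothing about classification accuracy itself.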

Keywords

Phylogenetics
Phylogeography
Bayesian statistics
Model evaluation
