Global estimates of mammalian viral diversity accounting for host sharing

Present estimates suggest there are over 1 million virus species found in mammals alone, with about half a million posing a possible threat to human health. Although previous estimates assume linear scaling between host and virus diversity, we show that ecological network theory predicts a non-linear relationship, produced by patterns of host sharing among virus species. To account for host sharing, we fit a power law scaling relationship for host–virus species interaction networks. We estimate that there are about 40,000 virus species in mammals (including ~10,000 viruses with zoonotic potential), a reduction of two orders of magnitude from present projections of viral diversity. We expect that the increasing availability of host–virus association data will improve the precision of these estimates and their use in the sampling and surveillance of pathogens with pandemic potential. We suggest host sharing should be more widely included in macroecological approaches to estimating biodiversity. A re-analysis of virus diversity in mammals that now takes into account host sharing finds that previous global estimates have been overstated by two orders of magnitude.

M easuring global biodiversity is one of the longest-standing problems in ecology. Over several decades, methods have been proposed that range from back-of-the-envelope calculations to sophisticated mechanistic macroecological models. In most cases, these methods estimate diversity from the asymptote of sampling curves over time, effort or space 1 . Although new species are described every year, the diversity of several major groups of life, like vertebrates and vascular plants, is now well-established. On the other hand, invertebrates and microbes, which harbour most of the global diversity of life, continue to pose a challenge. By some calculations, most life on the planet is made up by groups like viruses, helminths and parasitoid wasps that have become hyperdiverse through coevolution 2,3 . However, chronic data deficiencies prevent the use of most macroecological methods to estimate the diversity of microbes and parasites and the number of many taxa is still growing exponentially [4][5][6][7] .
To work around these challenges, several methods have been developed that estimate the richness of 'affiliate' groups like parasites or mutualists on the basis of the richness of their hosts. The simplest way to estimate the diversity of affiliate species is to multiply host richness by an independent estimate of the mean per-host affiliate richness for a particular pair of host and affiliate groups. For example, a recent study suggested that if every arthropod species has at least one host-specific parasite, there could be at least 81.6 million species of nematode parasites of arthropods 2 . However, this approach deliberately excludes generalist parasites and can only be appropriately used to estimate the number of host-specific species 8 . A few other studies focused on parasite diversity have acknowledged the existence of host sharing, adapting the 'linear estimation' method by dividing estimates of macroparasite diversity by the average degree of host specificity 6,9 : = × A H Total affiliates Mean affiliates per host Mean hosts per affiliate Total hosts (1) Although this method seems intuitive, one recent study resampled host-helminth association networks to show that diversity actually scales non-linearly. The authors describe this pattern as a power law scaling of host and parasite richness (A ∝ H~0 .3−0.7 ). In most cases, using this method led to substantially reduced estimates of helminth diversity in vertebrates 10 .
This non-linear scaling between host and affiliate richness can be reproduced for several types of species interactions (Fig. 1). The reason for this non-linearity can be described intuitively: the 1,000th host sampled will on average share more parasites with the first 999 hosts than the tenth host will share with the first nine. In network terms, the total number of possible parasites each host can add is governed by the degree distribution of hosts, while the expected number of parasites is reduced on the basis of host sharing, that is the probability each potential new parasite shares a host with a parasite already sampled. This scaling pattern has only recently been proposed as a power law 10 ; the pattern may be subtle enough at smaller scales to not have been evident without the large network data that is increasingly available in community ecology 11,12 . In this study, we build on recent advances to estimate the global diversity of mammalian viruses, a problem with applied significance far beyond macroecology. The emergence of viral threats, such as Ebola and Zika, demands an improved understanding of the landscape of viral emergence and several ambitious projects are working to catalogue global viral diversity, with the ultimate aim of predicting outbreaks and preventing pandemics. Recent work has estimated a global richness of over 1.6 million virus species in mammals and birds, extrapolating total diversity from viral sampling data from the Indian flying fox (Pteropus giganteus) and the rhesus macaque (Macaca mulatta) 13 . A fundamental gap in projections is the exclusion of host-sharing patterns from these methods (see Supplementary Information).
Here, we show that non-linear scaling in bipartite networks implies viral diversity has been overestimated by approximately two orders of magnitude and discuss how a network perspective can help optimize viral sampling. In the absence of theoretical expectations, we introduce a new simulation method that performs iterative resampling, curve-fitting and extrapolation on bipartite networks and we apply it to the most detailed list of mammal-virus associations available 14 . With 511 viruses catalogued from 753 mammals Global estimates of mammalian viral diversity accounting for host sharing Colin J. Carlson *, Casey M. Zipfel , Romain Garnier and Shweta Bansal Present estimates suggest there are over 1 million virus species found in mammals alone, with about half a million posing a possible threat to human health. Although previous estimates assume linear scaling between host and virus diversity, we show that ecological network theory predicts a non-linear relationship, produced by patterns of host sharing among virus species. To account for host sharing, we fit a power law scaling relationship for host-virus species interaction networks. We estimate that there are about 40,000 virus species in mammals (including ~10,000 viruses with zoonotic potential), a reduction of two orders of magnitude from present projections of viral diversity. We expect that the increasing availability of host-virus association data will improve the precision of these estimates and their use in the sampling and surveillance of pathogens with pandemic potential. We suggest host sharing should be more widely included in macroecological approaches to estimating biodiversity.
(excluding humans), the network covers about 10% of mammal diversity. Our approach estimates viral diversity in two steps. First, we use a power law to extrapolate from sampled hosts to all hosts (100% host sampling but incomplete viral sampling). Second, we extrapolate to the unsampled portion of viral diversity, by using an estimate of sampling completeness derived from the bat and macaque datasets used in previous studies. We repeat this analysis separately for DNA and RNA virus networks on the basis of approximate zoonotic rates in both groups and ultimately estimate the number of zoonotic viruses in all mammals.

Results
To understand the general relationship between affiliate and host diversity, we resampled the association networks of plants and seed dispersers, plants and mycorrhizal fungi, plants and pollinators and mammals and nematodes. We show in Figs. 1 and 2 that each network produced a non-linear, concave sampling curve. To describe this pattern, we compared the fit of three simple functional forms (linear, power law and logarithmic) and compared model performance using Akaike information criterion (AIC). We reproduced the result of ref. 10 Tables 2, 3). Examining the residuals, we noted a non-random effect, where residuals were lower (the curves overpredicted) at the lowest and highest values ( Supplementary Fig. 4). One important consequence of this error structure was that power laws fit to a smaller portion of the network produced higher estimates (see Fig. 1e), with z values closer to 1. We took advantage of this property to create what we called 'upper bound' estimates, using 50% of the network to generate estimates that reflect an upper bound on the overall size of the network.
Given the pattern we observed in the residuals, we investigated whether more complex forms of power laws might better describe the data given this curvature, with six candidate models drawn from the species-area relationship literature (Supplementary Table 4). We found that these models improved predictions, with the quadratic power law most often ranking with the lowest AIC; but no one model consistently performed the best and more complex functional forms fit to these curves all produced substantially lower estimates of viral diversity (Supplementary Table 5, 6). In the absence of analytic expectations, we erred towards parsimony and limited our main results to classical power laws, noting that they were probably prone to overestimation, even when using 100% of the network.
To estimate mammalian viral diversity, we iteratively resample mammal-virus associations and use the same power law approach ( Fig. 2a-c). We estimate 1,434 viral species would be affiliated to 5,291 mammals (100% host sampling but incomplete viral sampling per host). Using the viral profiles of the bat and the macaque, we then estimated that our host-virus association dataset covered roughly 6% of viral diversity for sampled hosts. This coverage allows us to extrapolate our overall estimate of viral diversity to 23,419 virus species (Table 1). Iteratively fitting models to 50% of the network and projecting out for an upper confidence bound gives an estimate for all mammals of 1,860 virus species or an extrapolated total of 30,368 species ( Table 2). The same method estimates a total of 603 viruses (95% confidence interval: 590-617) for the total network compared to a true value of 511 species, highlighting the small overestimation.
Next, we address the issue of heterogeneity and structure in the association network. At the broadest level, RNA viruses generally have a higher host breadth (number of host species that can be infected) than DNA viruses, which should reduce their scaling exponent 14 ; we observed this reduction in the curves we generated (Fig. 2a-c). We thus evaluate the richness for RNA and DNA viruses separately before combining them. We estimated a total 26,315 DNA viruses and 14,573 RNA viruses or a total 40,878 viruses together (Table 1). Even using the 50% upper bound method, which produces a substantially increased estimate, we only calculated 55,784 possible total viruses, still less than 10% of previous estimates (Table 2). Furthermore, diversity is distributed non-homogeneously among different pairs of host and viral families, with some host groups forming disproportionate reservoirs of viruses with high host sharing 15 . Curves could be constructed for each of these sub-groups but would be sensitive to existing taxonomic biases and simply adding these estimates together would ignore the high degree of sharing among host groups (Fig. 2d). Finally, we estimate the total number of potentially zoonotic mammal viruses. Given that RNA viruses have an apparently higher rate of zoonotic infection, we use the separate RNA and DNA virus richness estimates to estimate the number of total potential zoonoses. Using the zoonotic rate in DNA and RNA viruses in the dataset (14.1% and 41.7% respectively), this would suggest a total of 3,710 zoonotic DNA viruses and 6,077 zoonotic RNA viruses-a total of 9,787 compared to the previous estimate of 493,856-689,285 13 (Table 1). Even using the 50% estimation method as an upper bound, we estimate a total of 12,941 zoonoses; although higher, this is still approximately 2-3% of previous estimates ( Table 2).

Discussion
Network science is useful for studying biotic interactions in modern ecology and offers powerful ways to understand data such as host-virus associations 16 . Here, we highlight how a simple scaling property of bipartite networks enables a method of estimating diversity for affiliate groups like parasites and pathogens. Using our computational approach, we found that global viral diversity in mammals has probably been overestimated by about two orders of magnitude due to the omission of host-sharing patterns. The possible power law we identified here has fundamental implications for biodiversity research but we have found no analytical solution to the underlying mathematical problem: under a certain set of expectations (for example, a power law or exponential degree distribution), what is the expected scaling of edges in subsamples drawn randomly from a bipartite network? Fitting a power law to these data appears to be adequate for our purposes but the possibility remains that, like the species-area relationship, the scaling of affiliate richness is scale-dependent and described by a more complex pattern 17,18 ; our evidence suggests this scale-dependence may exist and lead to overestimation, as the slope may collapse at broader scales (Fig. 1e, Supplementary Information). This is a promising topic for future research in mathematics and network science and a solution might have broader implications beyond the biological sciences. An analytical expectation for this scaling pattern will also improve the precision and confidence of future species richness estimates.
Our study suggests that there are about 40,000 viruses in mammals, of which about 10,000 have zoonotic potential. Whereas previous estimates assumed 289.5 unique virus species per host, our study suggests there are about five to ten times as many virus species as mammal species, with most viruses shared by a few hosts (mean = 4.79, median = 2). While our estimate corrects for undersampling of viruses per host, it does not account for the likelihood that host breadth is also being underestimated, which might further reduce richness estimates. Our broader finding that viral diversity has probably been overestimated is congruent with the limited literature on the subject. Parallel work focused on phage diversity has used rarefaction curves and the Pacific Ocean Virome metagenomic dataset to suggest that the size of the broader global virome (defined by genetic diversity rather than species counts, which are based in challenging species concepts) may have been similarly overestimated in the early 2000s (ref. 19 ).
Our results highlight the need for completeness not just in viral inventories but in host-virus association data. Even with the development of viral sequencing techniques allowing easier access to diversity estimates in (and potentially between) hosts 20 , the need for completeness makes the problem of cataloguing viral diversity exponentially more intensive. Targeting specific groups may make this problem more manageable: groups like bats, rodents and primates harbour disproportionate viral richness, even accounting for sampling bias due to their high zoonotic rate 16,[21][22][23] . Moreover, zoonotic viruses in these groups may account for much of viral sharing over broad phylogenetic distances 15,24 . Focusing on describing viral sharing in and among these groups might reduce the effort needed to approximate the overall level of host sharing in the network and the effort needed to update viral richness estimates.
On the other hand, it is difficult to assess how much the dominance of zoonoses in sharing networks is a feature, not an artefact, of current sampling schemes; separating the zoonotic and non-zoonotic viruses in our association data shows a tight coupling between sampling, sharing and existing priorities for zoonotic virus description (Fig. 2d). Even in well-sampled groups like bats, sampling priorities may poorly reflect underlying patterns of viral richness 25,26 ; for groups that are less common reservoirs of zoonoses, there is almost certainly a disproportionate level of undersampling in hostvirus associations and a disproportionately high observed zoonotic rate. The methods we use here can help standardize estimates of viral richness for sampling effort and, in conjunction with real-time data collection, dynamically target hotspots of undiscovered viral richness for sampling 27 . Advances in machine learning that predict possible host-virus links 28,29 may help further target sampling. Future evidence may change conventional knowledge about the structure of the mammalian viral sharing network and decouple the tight correlation between zoonotic sampling and the centrality of groups like bats and carnivores in it.
Mammal viruses are only a subset of the hyperdiverse affiliate taxa on earth and many groups remain unassessed using methods that account for host sharing. Bird viral diversity is a logical next target, as the existing estimate was calculated using the same estimates derived from one monkey and one bat species 13 . The viral diversity of all vertebrates is an important end goal, given recent work showing that RNA viruses are widely distributed across all five classes of vertebrates-even viral families, including the Filoviridae or Flaviviridae, that pose some of the greatest emerging threats to human health 30 . Although viruses like Wenzhou shark flavivirus or Wenling triplecross lizardfish picornavirus may never pose a threat to human health, they remain an important part of understanding, defining and measuring the global virome 30,31 .

Methods
In this study we estimate the global diversity of viruses in mammal hosts by re-analysing data that have been previously used to provide a linear estimate by Carroll et al. 13    mycorrhizal interaction networks, we used a dataset on fungal associations in 150 Japanese plant species/taxa (not all resolved to species level), including 8,080 total operational taxonomic units; we only used data on arbuscular mycorrhizae, for convenience 36 . Finally, for helminth-vertebrate interactions, we used the helminthR package to compile a global interaction web of nematode-mammal interactions, with 849 mammal species and 2,248 nematode species 37,38 .
To develop our estimates of mammalian viral diversity, we constructed a viral interaction network using the raw data made available by Olival et al. 14 .
Humans are disproportionately represented in this dataset, so much so that constructing resampled curves produces two distinct curves depending on whether Homo sapiens is included or not in a given subsample (Supplementary Fig. 1). Consequently, we removed humans from all our network analyses. The remaining network includes 511 viruses hosted by 753 mammal species. Several features in the database, such as host classification and virus classification, were used in subsequent analyses; for analyses involving zoonotic proportions, the non-stringent classifications of zoonotic risk were used. The proportion of viruses described was derived using the proportion of estimated viral diversity known from P. giganteus and M. mulatta viral metagenomics and by constructing a rarefaction curve over the number of animals sampled (as in ref. 13 ).

Bipartite richness estimators.
We developed a new R package, 'codependent' 39 , to streamline bipartite richness estimation. The method subsamples a network with H host species and A affiliate species and ∀i ∈ [1, H], subsamples i host species n times and counts the number of affiliate species â. (This assumes every host has at least one affiliate species and in some cases overestimates affiliate richness for this reason.) A power law function is then fit of the form a ∝ bi z using non-linear least-squares regression (nls), with initial parameters ̂=̂= . b z 1, 0 5. The copredict function in codependent returns the point estimates for curve parameters with a 95% confidence interval using the confint2 function in the nlstools package and then extrapolates the curve to the total number of host species (in this case, an estimate of 5,291 mammal species), including a 95% confidence interval.
For our viral richness estimates for mammals, we resampled a curve with every number of hosts (i) between 1 and H = 753, each n = 1,000 times and used the copredict function to project to 5,291 total mammal species. We repeated this separately for DNA and RNA viruses, which have different overall patterns of diversity and host specificity. We multiply these by the proportion reported as zoonotic in the Olival et al. dataset 14 to obtain total estimates of zoonotic viral richness. The true proportion of viruses with zoonotic potential may be higher, as many viruses have yet to emerge in human populations or it may be lower, as zoonotic viruses sampled from hyper-reservoirs make up a disproportionate share of known viral diversity. The total number of zoonotic viruses is still bounded in the 0% and 100% of total viral richness estimates, which are much smaller than previous estimates of zoonoses alone.
As a final method for bounding uncertainty, we use the codependent.ci function, which iterates the same rarefaction method on 50% of the network (half the total number of hosts) and projects it out to a given proportion of hosts. Fitting the curve on smaller portions of the network leads to z values closer to 1 and therefore the method overestimates (see Fig. 1); this makes this confidence bound method an absolute outer bound on plausible richness. For example, using the helminth network in Fig. 1, fitting a curve with n = 100 iterations each gives an estimate of 2,291 nematode species (95% confidence interval: 2,271-2,311) compared to a true richness of 2,248 species. We apply this methodology to the virus network with 200 iterations again and project over the total network (753 mammal species) and out to total mammal richness (5,291 species).
Correcting for sampling. To estimate how comprehensive the Olival et al. dataset 14 is, we compare the number of recorded viruses in those data versus the viral metagenomics dataset. For both the bat and macaque, we first count the number of virus species recorded in the Olival dataset 14 (host-virus associations). Next, we estimate the 'true richness' by adding the number of known virus species (from the metagenomic data) to the number of undescribed species estimated using the Chao-1 method (also included in the previous metagenomic estimates). The estimates of undescribed diversity come with their own lower and upper 95% confidence bounds, which we used to create upper and lower 95% bounds respectively on the proposed sampling rate. We average these two rates between the bats and macaques to estimate a sampling rate of 6.1%, with a 95% confidence interval of 3.4-7.4%. While these rates should be derived from a larger and representative sample of species, we note that the bat and macaque datasets are unusual in their completeness.
This estimate is the most tenuous in our analysis but uses much the same logic as the linear extrapolation used by Carroll et al. 13 , without making their assumption that every host-virus family association is equally possible. In reality, there are disproportionate associations due to a combination of 'forbidden links' (in the sense of ref. 40 ) and non-random coevolutionary diversification. It is probably also a liberal estimate of undersampling, given that bats (especially Pteropus, a major zoonotic reservoir) have a disproportionately high underlying viral richness 21 .
Using one sampling rate for all host-virus group pairs is a simplifying assumption and there are several interacting and difficult-to-quantify biases probably contained in this host-virus association dataset 41 . Ideally, at a minimum, we would be able to derive separate sampling rate estimates for DNA and RNA viruses. However, our ability to do so is increasingly limited by sample size: for example, no DNA viruses are recorded for Pteropus in the main dataset. If we used these methods, we could derive a DNA virus sampling rate of 25.5% (95% confidence interval: 18.1-26.0%) and an RNA virus sampling rate of 7.2% (95% confidence interval: 3.3-8.8%); both independent estimates reduce the total unsampled viral diversity. In Supplementary Table 7 we show how using these numbers would reduce overall estimates.

Network analyses.
To generate a unipartite network of host sharing by viruses, we analysed associations between viruses and their hosts. We classified hosts by their orders (separating out H. sapiens from primates) and represented these orders as the nodes in the network. Links between these nodes represent instances of shared viruses between host species belonging to different orders. We ignored viral sharing between host species in the same order (that is, self links were removed). Edges were weighted proportional to the Jaccard index 42 , which is defined by where A and B are the number of viruses in two orders, respectively, and C are the number shared between orders. This network was created separately for zoonotic and non-zoonotic viruses. There were 296 viruses with more than one non-human host recorded and 149 zoonotic viruses with more than one non-human host recorded. Also, there were 116 viruses with more than one order recorded and 86 zoonotic viruses with more than one order recorded. Networks were constructed and analysed using the networkx package in Python 43 and visualizations were constructed with Gephi 44 and PhyloPic (http://www.phylopic.org).
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data in this study are from previous studies and available online for researchers to reproduce our results. Original sources can be found as follows: global diversity of viruses in mammal hosts can be found in Carroll et al. 13 ; plant-pollinator interactions can be found in Robertson 32 and reproduced in Marlin et al. 33 ; myccorhizal networks are described in Toju et al. 36 ; and the host-helminth network can be obtained from the Natural History Museum London's Helminth Database, through the helminthR API (ref. 37 ). All data are also available on the Github repository for the project, at github.com/cjcarlson/brevity

Code availability
All code is available on a Github repository found at github.com/cjcarlson/brevity. The codependent R package 39 is available at github.com/cjcarlson/codependent