Predictors of human-infective RNA virus discovery in the United States, China, and Africa, an ecological study

Background: Variation in pathogen type and the spatial heterogeneity of predictors make the generality of any associations with pathogen discovery debatable. Our previous work confirmed that the association of a group of predictors differed across different types of RNA viruses, yet there have been no previous comparisons of the specific predictors for RNA virus discovery in different regions. The aim of the current study was to close this gap by investigating whether predictors of discovery rates within three regions (the United States, China, and Africa) differ from one another and from those at the global level. Methods: Based on a comprehensive list of human-infective RNA viruses, we collated published data on the first discovery of each species in each region. We used a Poisson boosted regression tree (BRT) model to examine the relationship between virus discovery and 33 predictors representing climate, socio-economics, land use, and biodiversity across each region separately. The discovery probability in the three regions in 2010–2019 was mapped using the fitted models and historical predictors. Results: The numbers of human-infective virus species discovered in the United States, China, and Africa up to 2019 were 95, 80, and 107, respectively, with China lagging behind the other two regions. In each region, discoveries were clustered in hotspots. BRT modelling suggested that in all three regions RNA virus discovery was better predicted by land use and socio-economic variables than by climatic variables and biodiversity, although the relative importance of these predictors varied by region. Maps of virus discovery probability in 2010–2019 indicated several new hotspots outside historical high-risk areas. Most new virus species since 2010 in each region (6/6 in the United States, 19/19 in China, 12/19 in Africa) were discovered in high-risk areas as predicted by our model.
Conclusions: The drivers of spatiotemporal variation in virus discovery rates vary in different regions of the world. Within regions virus discovery is driven mainly by land-use and socio-economic variables; climate and biodiversity variables are consistently less important predictors than at a global scale. Potential new discovery hotspots in 2010–2019 are identified. Results from the study could guide active surveillance for new human-infective viruses in local high-risk areas. Funding: FFZ is funded by the Darwin Trust of Edinburgh (https://darwintrust.bio.ed.ac.uk/). MEJW has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 874735 (VEO) (https://www.veo-europe.eu/).


Introduction
RNA viruses are the primary cause of emerging infectious diseases with epidemic potential, given that they have a high rate of evolution and a high capacity to adapt to new hosts (Woolhouse et al., 2016). In recent decades, infectious diseases caused by severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus (MERS-CoV), Bundibugyo Ebola virus, and SARS-CoV-2 have presented major threats to the health and welfare of humans (Albariño et al., 2013; Ksiazek et al., 2003; Mackay and Arden, 2015; World Health Organisation, 2020). Detecting formerly unknown human-infective RNA viruses at the earliest stage after their emergence is essential for controlling the infections they cause. Early detection requires not only advanced diagnostic techniques (Lipkin and Firth, 2013) but, more importantly, knowing where to look for such viruses (so-called hotspots) (Morse, 2012).
Socio-economic, environmental, and ecological factors related to both virus natural history and research effort have been found to affect the discovery of emerging RNA viruses (Jones et al., 2008; Morse, 2012; Rosenberg, 2015; Zhang et al., 2020). However, these factors are highly spatially heterogeneous, making the generality of any associations with discovery debatable. For example, the United States, China, and Africa have experienced different rates of socio-economic, environmental, and ecological change in the last one hundred years. The United States has always had better resources to discover new viruses. For example, the Rockefeller Foundation, a U.S. foundation, supported the discovery of 23 arboviruses in Latin America, Africa, and India in 1951-1969 (Rosenberg et al., 2013). China has seen urban land coverage more than double and GDP per capita increase sevenfold since the 1980s (Ritchie, 2018; Roser, 2013). Nine out of 223 human-infective RNA viruses were originally discovered in China, and all were discovered after 1982 (Zhang et al., 2020). In contrast, effective surveillance is challenging in less developed regions such as large parts of Africa, given resource constraints (Petti et al., 2006).
There have been no previous comparisons of the specific predictors for RNA virus discovery in different regions. In this study, we applied a methodology similar to that of our previous study of global patterns of discovery of human-infective RNA viruses (Zhang et al., 2020) to investigate whether predictors of discovery rates within three regions (the United States, China, and Africa) differ from one another and from those at the global level, using three new virus discovery data sets. We also mapped discovery probability in the three regions in 2010-2019 using the fitted models and historical predictors. According to findings from our previous study (Zhang et al., 2020), the main predictors for virus discovery at the global scale were GDP-related. This suggests that the patterns of virus discovery we identified may have been driven largely by research effort rather than by the underlying biology. In this study, by focusing on more restricted and homogeneous regions where research effort is less variable, we expected to identify predictors more closely associated with virus biology.

Data sets of human-infective RNA viruses in three regions
We performed an ecological study in which the unit of interest is each human-infective RNA virus species. With reference to a full list of human-infective RNA virus species (Zhang et al., 2020), we geocoded the first report of each species in humans in the United States, China, and Africa separately. The latest version of the list as of 31 December 2019 included 223 species (Appendix 1-table 1), with Human torovirus abolished and a new species, Heartland banyangvirus, added by the International Committee on Taxonomy of Viruses (ICTV) in 2018 (International Committee on Taxonomy of Viruses, 2018). Data used in this study were not subsets of our previous global analysis; information on discovery locations and discovery dates for each virus species was re-collated for each specific geographical region.
We used the same search terms, databases, and inclusion and exclusion criteria as for our global data set (Woolhouse and Brierley, 2018). In each region, we established whether or not each virus species had been discovered in humans according to the peer-reviewed literature. Reference databases included PubMed, Web of Science, Google Scholar, and Scopus. Two Chinese databases [i.e. China National Knowledge Infrastructure (CNKI) and Wanfang Data] were also searched when collecting data for China. Reference lists of relevant studies and reviews were also checked manually to find potential earlier discovery papers. The following key words were used for the retrieval: virus full name or abbreviations or virus synonyms; and human* or person* or case* or patient* or worker* or infection* or disease* or outbreak* or epidemic*; and region name (Chin* or Taiwan or Hong Kong or Macau; United States or US or USA or America*; Africa* or all African country names). Virus synonyms and abbreviations include early names used in the discovery paper and all subtypes provided by the ICTV online report (International Committee on Taxonomy of Viruses, 2018). Evidence from the peer-reviewed literature was included if it met the following criteria: (a) Diagnostic methods for RNA virus infection in humans were clearly described, through either viral isolation or serological methods; (b) Specific virus species name or subtypes falling under that species were clearly provided; (c) Both natural infection and iatrogenic or occupational infections were accepted.
Evidence was excluded if it met any of the following criteria: (a) Uncertain species due to cross-reactivity with related viruses; (b) Diagnostic methods for virus infection were not specified; (c) Descriptions of clinical symptoms or pathogenicity alone, which were not considered evidence of human infection with a particular virus species; (d) Reports of '[virus name]-like' or 'potential [virus name] infections'; (e) Intentional infections, including experimental inoculation or in vitro infections; (f) Non-peer-reviewed literature, including media reports, theses, or unpublished data. Literature selection was performed by two individuals independently, and discrepancies were resolved by discussion with a third individual.
We defined the discovery location as the place where the initial human case was exposed to or infected with the virus, as suggested in the first report of human infection in the peer-reviewed literature. All locations were geolocated as precisely as possible using methods from our previous paper (Zhang et al., 2020). For each region, a polygon was created for those locations at administrative level 3 (county for the United States; city for China; for Africa, it varies between countries) and above. Details of the data types in the virus discovery database for the three regions are summarised in Appendix 1-table 2. Although the majority of discovery locations in the United States and Africa involved point data and in China the majority involved polygon data at province level, the average number of grid cells per virus was similar in the three regions. A bootstrap resampling procedure was developed for polygon data covering more than one grid cell (details below). The discovery date of human infection was defined as the publication year in the scientific literature. temporal coverage were extrapolated back to 1901; both following methods from our previous paper (Zhang et al., 2020).

Boosted regression trees modelling
We used a Poisson boosted regression trees (BRT) model to examine the relationship between the discovery of RNA viruses and 33 predictors for each grid cell at 1° resolution across each region separately, following code from our previous study (Zhang et al., 2020) and one previous paper (Allen et al., 2017). As a tree-based machine learning method, the BRT model can automatically capture complex relationships and interactions between variables, and can also account for spatial autocorrelation within the data (Crase et al., 2012). We compared Moran's I values of the raw virus data and the model residuals to estimate the ability of the BRT model to account for spatial autocorrelation (Cliff and Ord, 1981). To minimise the effect of spatial uncertainty in the virus discovery data, we performed bootstrap resampling 1000 times for those discovery locations reported as polygons. We assumed each grid cell in the polygon had an equal chance of being selected, and for each virus record we selected one grid cell randomly from the polygon for each subsample. Each subsample had a presence-to-absence ratio of 1:2; that is, for each grid cell with virus discovery, two grid cells with no discovery were randomly selected from 'virus discovery free' areas at all time points within the region. Taking the United States as an example, each subsample included 95 grid cells with virus discovery and 190 with no virus discovery. We then matched the virus data with all predictors by geographical coordinates and decade (using the nearest decade for time-varying predictors).
We assumed that the virus count in any given grid cell in each decade followed a Poisson distribution, and we calculated the virus discovery count in each grid cell by decade as the response variable. We also performed further sensitivity analyses by (i) matching virus discovery data and time-varying covariate data by year and (ii) testing for lag effects by matching virus discovery at year t with predictors at years t-1 to t-5 (Appendix 4).
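The subsampling scheme described above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the R code used for the analysis: the function name `draw_subsample`, the record layout, and the toy data are all hypothetical, and merging records that fall in the same cell and decade into a single count row is our reading of the text rather than a documented step.

```python
import random
from collections import Counter


def draw_subsample(records, absence_pool, rng, ratio=2):
    """Build one bootstrap subsample of grid-cell-by-decade rows.

    records      -- list of (virus, decade, cells); `cells` holds every 1-degree
                    grid cell covered by the reported discovery location
                    (one cell for point data, several for polygon data)
    absence_pool -- list of (cell, decade) pairs with no recorded discovery
    ratio        -- absences drawn per presence record (1:2 in the text)
    """
    # One cell per virus record; every cell in a polygon is equally likely.
    presences = [(rng.choice(cells), decade) for _, decade, cells in records]
    # Discovery count per cell and decade, used as the Poisson response.
    counts = Counter(presences)
    # Discovery-free cell-decades, sampled without replacement.
    absences = rng.sample(absence_pool, ratio * len(presences))
    rows = [(cell, decade, n) for (cell, decade), n in counts.items()]
    rows += [(cell, decade, 0) for cell, decade in absences]
    return rows


# Hypothetical toy data: three virus records and ten discovery-free cell-decades.
toy_records = [
    ("virus A", 1950, [(40, -75)]),             # point location
    ("virus B", 1950, [(40, -75), (41, -75)]),  # polygon covering two cells
    ("virus C", 2000, [(35, -100)]),
]
toy_absences = [((lat, -90), 1950) for lat in range(30, 40)]

# The text uses 1000 replicates; three are enough to show the idea.
replicates = [draw_subsample(toy_records, toy_absences, random.Random(seed))
              for seed in range(3)]
```

Each replicate is then matched to the predictors and passed to one BRT fit, so the spread across replicates reflects the spatial uncertainty of the polygon records.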
All BRT models were fitted in R v. 3.6.3, using the packages dismo and gbm. BRT models require the user to balance three parameters: tree complexity, learning rate, and bag fraction. Tree complexity reflects the order of interaction in a tree; learning rate shrinks the contribution of each tree to the growing model; bag fraction specifies the proportion of data drawn from the full training data at each step. We set these parameters as recommended by Elith et al., 2008, and made sure each resampling model contained at least 1000 trees. BRT models identified the final optimal number of trees in each model using a 10-fold cross-validation stagewise function (Elith et al., 2008). The three parameter values of the optimal model as well as the mean optimal number of trees across 1000 replicate models for all three regions are summarised in Appendix 1-table 3.
By fitting 1000 replicate BRT models, the relative contribution plots and partial dependence plots with 95% quantiles were plotted. We defined variables with a relative contribution greater than the mean (3.03%) as influential predictors in all three regions (Shearer et al., 2018). The partial dependence plots depict the influence of each variable on the response while controlling for the average effects of all the other variables in the model. The map of virus discovery probability across each region in 2010-2019 was derived from the means of the predictions of 1000 replicate models, using values of the 33 predictors in 2015. In order to show discovery hotspots, we converted the prediction map of virus count to a map of probability.
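Two of the quantities in this paragraph are simple arithmetic and can be illustrated directly. The sketch below, in Python, shows where the 3.03% cut-off comes from (relative contributions sum to 100% across 33 predictors) and one way a predicted count can be turned into a probability. The conversion used here, the Poisson probability of at least one discovery, is an assumption on our part, since the text does not state the formula; the function names are hypothetical.

```python
import math

N_PREDICTORS = 33


def influential(contributions, n_predictors=N_PREDICTORS):
    """Keep predictors whose relative contribution (%) exceeds the mean.

    Relative contributions sum to 100%, so the mean is 100 / n_predictors.
    """
    threshold = 100.0 / n_predictors
    return {name: value for name, value in contributions.items()
            if value > threshold}


def discovery_probability(expected_count):
    """Assumed conversion: P(at least one discovery) for a Poisson mean."""
    return 1.0 - math.exp(-expected_count)


# Values taken from the United States results reported in the text.
us_top = {"urbanized land": 35.8, "GDP growth": 10.0, "GDP": 5.7}
print(round(100.0 / N_PREDICTORS, 2))  # mean relative contribution
print(sorted(influential(us_top)))
```

Under this assumed conversion, a cell with a predicted count close to zero maps to a probability close to zero, and the probability rises monotonically with the predicted count, so the ranking of cells is unchanged.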
Two statistics were calculated to evaluate the model's predictive performance: (a) the deviance of the bootstrap model (Elith et al., 2008), (b) intraclass correlation coefficient (ICC) calculated from 50 rounds of 10-fold cross-validation, by following methods from our previous paper (Zhang et al., 2020). For the 10-fold cross-validation, we selected 50 data sets randomly from the 1000 bootstrapped subsamples. We took the first data set and partitioned into 10 subsets. For each round of 10-fold cross-validation, the unique combinations of nine subsets constituted the training sets and were used to fit models, and the remaining one was used as a test set to evaluate the predictive performance of the model. We repeated the same process as above for the remaining 49 data sets. One intraclass correlation coefficient (ICC) was calculated from each round of validation and the median with 95% quantiles across all 50 rounds was calculated. The ICC varies between 0 and 1, with an ICC of less than 0.40 representing a poor model, 0.40-0.59 representing a fair model, 0.60-0.74 representing a good model, and 0.75-1 representing an excellent model (Cicchetti, 1994).
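The partitioning step and the interpretation bands above can be sketched as follows. This is a minimal Python illustration with hypothetical function names; it covers only the fold construction and the Cicchetti (1994) bands quoted in the text, not the model fitting or the ICC calculation itself.

```python
import random


def ten_fold_splits(n_rows, rng, k=10):
    """Yield (train, test) row indices for one round of k-fold cross-validation."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Deal the shuffled rows into k subsets of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for fold in folds[:i] + folds[i + 1:] for row in fold]
        yield train, test


def icc_category(icc):
    """Interpretation bands for the ICC as quoted in the text."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# One round on a subsample the size of the United States example (95 + 190 rows).
splits = list(ten_fold_splits(285, random.Random(0)))
```

Repeating the round on 50 bootstrapped data sets gives 50 ICC values, from which the median and 95% quantiles are taken.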
We performed exploratory subgroup analyses distinguishing viruses first discovered in the region from those that had previously been discovered elsewhere in the world. We used the same BRT modelling approach as described above, and the relative contribution of each predictor was calculated for each subgroup. We were unable to perform the subgroup analysis for China because only nine human-infective RNA viruses were first discovered there, and the BRT model cannot be fitted to a sample that small.
R software, version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria) was used for all statistical analyses. All maps were visualised by using ArcGIS Desktop 10.5.1 (Environmental Systems Research Institute).

Results
The numbers of human-infective virus species discovered in the United States, China, and Africa up to October 2019 were 95, 80, and 107, respectively (Appendix 1-table 1). Most first discoveries have been in the eastern United States (especially in areas around Maryland, Washington, D.C., and New York), eastern China (developed cities including Beijing, Hong Kong, Shanghai, and Guangzhou), and southern and central Africa (Pretoria and Johannesburg, South Africa; Borno State and Ibadan, Nigeria) (Figure 1). A total of 60 virus species were previously reported in all three regions, and 27, 12, and 37 species were found only in the United States, China, and Africa, respectively (Figure 2). The 60 shared species were also disproportionally vector-borne [11.7% (7/60)] and strictly zoonotic [7% (4/60), Figure 2].
The discovery curves for the United States and Africa have seen a broadly similar pattern, with China lagging behind these two regions ( Figure 3). The median time lag between the original discovery year of each virus in the world and the discovery year of each virus in each region was 0 [interquartile range (IQR): 2.5], 12 (IQR: 29.5), and 2 (IQR: 10.5) years in the United States, China, and Africa, respectively (Appendix 3- figure 2). In China, the time lag was noticeably shorter for viruses discovered after 1975 [before 1975: a median lag of 30.5 (IQR: 30.5) years; after 1975: 2.5 (IQR: 7) years, p value of Wilcoxon rank sum test < 0.001].
In the United States, six variables including three predictors related to land use [urbanized land: relative contribution of 35.8%, urbanization of cropland (i.e. the percentage of land area change from cropland to urban land): 8.0%, growth of urbanized land: 4.1%], two socio-economic variables (GDP growth: 10.0%; GDP: 5.7%), and one climatic variable (diurnal temperature change: 4.9%) were identified as important predictors for discriminating between locations with and without virus discovery ( Figure 4A). The partial dependence plots shown in Appendix 3-figure 3 suggested non-linear relationships between the probability of virus discovery and most predictors. All important predictors presented a positive trend over narrow ranges at lower values.
In China, twelve variables including four socio-economic variables (GDP: 12.7%, university count: 7.5%, GDP growth: 4.6%, population growth: 4.4%), five predictors involving land use [pasture: 8.3%, urbanized land: 8.1%, vegetation: 5.8%, cropland: 5.3%, urbanization of secondary land (the percentage of land area change from secondary land to urban land; secondary land is natural vegetation that is recovering from previous human disturbance): 3.3%], and three climatic variables (maximum precipitation: 4.5%, precipitation change: 3.8%, diurnal temperature range: 3.3%) were identified as important predictors for discriminating between locations with and without virus discovery ( Figure 4B). GDP, urbanized land, university count, vegetation, GDP growth, maximum precipitation, population growth, and urbanization of secondary land presented a positive trend over narrow ranges at lower levels; pasture, cropland, precipitation change, and diurnal temperature range had non-monotonic/ negative impacts, with highest risks at lower values (Appendix 3- figure 4).
In Africa, ten variables including two socio-economic variables (GDP growth: 21.2%, GDP: 13.0%), seven predictors related to land use (urbanized land: 9.4%, growth of cropland area: 5.6%, urbanization of cropland: 5.5%, growth of urbanized land: 5.1%, urbanization of pasture: 3.8%, vegetation: 3.7%, cropland: 3.2%), and one biodiversity variable (mammal species richness: 3.1%) were identified as important predictors for discriminating between locations with and without virus discovery (Figure 4C). All important predictors presented a positive trend over narrow ranges at lower positive values, except mammal species richness, whose positive trend extended over a large range (Appendix 3-figure 5).
Our BRT models reduced Moran's I value below 0.15 in all three regions (Appendix 3-figure 6), suggesting that BRT models with 33 predictors adequately accounted for spatial autocorrelation in the raw virus data in all three regions. The model validation statistics for each region are shown in Appendix 1-table 4. Combining these measures, our BRT model predictions range from fair to good (Cicchetti, 1994).
In our sensitivity analyses based on data matched by year (Appendix 3-figure 7) and with a 1-5 year lag (results for a 1 year lag are shown in Appendix 3-figure 8), the relative contributions changed in several cases, but the top predictors were broadly consistent with our main model based on data matched by decade (Figure 4).
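For readers unfamiliar with the statistic, Moran's I, which the analysis above uses to compare spatial autocorrelation in the raw data and in the model residuals, can be computed on a grid as sketched below. This Python illustration uses the standard definition of Moran's I with binary rook-adjacency weights; the choice of weights is an assumption, since the text does not specify the weight matrix, and the function names are hypothetical.

```python
def rook_weights(cells):
    """Binary weights: 1 for grid cells sharing an edge (assumed neighbourhood)."""
    cells = set(cells)
    weights = {}
    for row, col in cells:
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbour = (row + d_row, col + d_col)
            if neighbour in cells:
                weights[((row, col), neighbour)] = 1.0
    return weights


def morans_i(values, weights):
    """Moran's I for `values` (cell -> value) and `weights` ((i, j) -> w_ij)."""
    n = len(values)
    mean = sum(values.values()) / n
    deviation = {cell: value - mean for cell, value in values.items()}
    weight_sum = sum(weights.values())
    cross = sum(w * deviation[i] * deviation[j] for (i, j), w in weights.items())
    spread = sum(d * d for d in deviation.values())
    return (n / weight_sum) * cross / spread


# Toy 4 x 4 grids: a clustered pattern and a checkerboard pattern.
grid = [(row, col) for row in range(4) for col in range(4)]
w = rook_weights(grid)
clustered = {(row, col): 1.0 if col < 2 else 0.0 for row, col in grid}
checkerboard = {(row, col): float((row + col) % 2) for row, col in grid}
i_clustered = morans_i(clustered, w)
i_checkerboard = morans_i(checkerboard, w)
```

Values near zero indicate little spatial autocorrelation, which is why a drop in Moran's I from the raw counts to the residuals is read as the model having absorbed the spatial structure.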
In comparison with the whole world, human-infective RNA virus discovery was more associated with land use and socio-economic variables than climatic variables and biodiversity in all three regions ( Figure 5). The comparison of four groups of predictors between three regions showed that: the greatest contribution of climatic variables to the discovery of human-infective RNA viruses was in China; the greatest contribution of land use was in the United States; the greatest contribution of socio-economic variables and biodiversity was in Africa and least in the United States.
We mapped human-infective RNA virus discovery probability in 2010-2019 for the three regions, based on the fitted BRT models and values of all 33 predictors in 2015 (Appendix 3-figure 9 to Appendix 3-figure 11). Outside contemporary risk areas where human-infective RNA viruses were previously discovered in the United States (Figure 1A), we predicted high probabilities of virus discovery across southern Michigan, central North Carolina, central Oklahoma, southern Nevada, and north-eastern Utah (Figure 6A). Outside contemporary risk areas where human-infective RNA viruses were previously discovered in China (Figure 1B), we predicted high probabilities of virus discovery across other areas of eastern China as well as two western areas, south-central Shaanxi and north-eastern Sichuan (Figure 6B). Outside contemporary risk areas where human-infective RNA viruses were previously discovered in Africa (Figure 1C), we predicted high probabilities of virus discovery across northern Morocco, northern Algeria, northern Libya, south-eastern Sudan, central Ethiopia, and western Democratic Republic of the Congo (Figure 6C). Most new virus species since 2010 in each region (6/6 in the United States, 19/19 in China, 12/19 in Africa) were discovered in high-risk areas (85% percentiles of predicted probability across each region) as predicted by our model.
Based on our subgroup analysis distinguishing viruses first discovered in the region from those that had previously been discovered elsewhere in the world, discoveries of human-infective RNA viruses first discovered in either the United States or Africa were better predicted by climatic and biodiversity variables, while discoveries of viruses that had previously been discovered elsewhere in the world were better predicted by socio-economic variables (Appendix 3-figure 12).

Discussion
To our knowledge, this analysis represents the first investigation of human-infective RNA virus discovery in three large regions of the world which have experienced distinct socio-economic, ecological, and environmental changes over the last 100 years. In total, 95 human-infective RNA virus species had been found in the United States, 80 in China, and 107 in Africa. The discovery maps of human-infective RNA viruses in the three regions indicated areas with historically high discovery counts: eastern and western United States, eastern China, and central and southern Africa. BRT modelling suggested that the relative contribution of 33 predictors to human-infective RNA virus discovery varied across the three regions, though climatic and biodiversity variables were consistently less important in all three regions than at a global scale. Our maps of the probability of human-infective RNA virus discovery in 2010-2019 indicated that the probability would remain high in historical hotspots; in addition, we identified several new hotspots in central-eastern and southwestern United States, eastern and western China, and northern Africa. These results offer a tool for public health practitioners and policymakers to better understand local patterns of virus discovery and to invest efficiently in surveillance systems at the local level.
In recent decades, factors that drive pathogen discovery have been comprehensively studied (e.g. Morse, 2012). In general, evidence has come from three forms of analysis: analysis of single emergence events such as SARS, AIDS, and Ebola (Parrish et al., 2008); quantification of the spillover (or host switching/cross-host transmission) risk using traits of both hosts and viruses (Kreuder Johnson et al., 2015; Olival et al., 2017; Pulliam and Dushoff, 2009); and records of first emergence/discovery events in humans globally over time (Allen et al., 2017; Jones et al., 2008; Zhang et al., 2020). Of these, the last form of analysis has linked the distribution of emerging infectious diseases across the globe to ecological, environmental, and socio-economic factors, predicted the high-risk areas for discovery of emerging zoonoses, and helped identify priority regions for investment in surveillance systems for new human viruses (Allen et al., 2017; Jones et al., 2008; Zhang et al., 2020). In addition to these analyses, our current regional analyses identified more precise hotspots for virus discovery in three large regions of the world. Because zoonotic viruses are responsible for most historical endemic and epidemic diseases, several projects such as the Global Virome Project (GVP), the PREDICT project, and the Vietnam Initiative on Zoonotic Infections (VIZIONS) were launched to construct a comprehensive data set of unknown viruses with epidemic potential from specific animals likely to harbour high-risk viruses, humans with a high rate of contact with animals, and animal-human interfaces with high spill-over probability (Carroll et al., 2018; Morse, 2012; Rabaa, 2015). These hotspot analyses indicate priority regions for surveillance for new viruses for these projects.
In all three regions, GDP and/or GDP growth were identified as important predictors for virus discovery. This is consistent with our previous finding that GDP and GDP growth play a major role in the discovery of viruses (Zhang et al., 2020). In general, sufficient economic, human, and material resources, the availability of advanced infrastructure and technology, and greater research capabilities in relatively higher income areas enable virus discovery (Rosenberg et al., 2013). That this effect applied both within one continent and within single countries such as the United States and China suggests that most virus discoveries were likely passive; that is, the viruses were detected when they arrived in a location with the resources to detect them. This is plausible because in all regions in our study human-transmissible viruses accounted for the larger proportion, and our previous analysis suggested that richer areas were more likely to be the first to capture transmissible viruses (e.g. Influenza virus, Rhinovirus, Rabies lyssavirus, Measles morbillivirus, Mumps orthorubulavirus, Rubella virus, and Norwalk virus) capable of spreading to multiple areas (Zhang et al., 2020). Temporally, in China the rate of discovery increased after economic growth accelerated in the 1980s (Figure 3). We note that, according to the publications describing first virus discoveries, most historical virus discoveries in Africa received support from the United States and Europe, and this may explain why Africa saw an increased number of virus discoveries after 1950, 30 years earlier than China (Figure 3). Notably, in contrast to Africa, university count was found to be associated with virus discovery in China, suggesting that virus discovery is likely a significant area of research in Chinese universities. Our model also suggested that socio-economic factors overall contributed less in the United States than in the other two regions. A possible explanation is that the socio-economic level across the whole United States is relatively high and homogeneous.
Predictors other than GDP and university count are likely to be linked to virus natural history. In all three regions, the area of urban land and further urbanization contributed greatly to virus discovery. This reinforces previous findings that urbanization is linked to the detection of new human pathogens through denser urban populations, increased human-wildlife contact rates, spill-over of human infection from enzootic cycles, and the contamination of the urban environment with microbial agents (Hassell et al., 2017; Olival et al., 2017; Weaver, 2013). In the United States, land use contributed more to virus discovery than in other regions: urbanized land, urbanization of cropland, and growth of urbanized land alone had a relative contribution of 47.9%. It is possible that land use change in the US is driving both the emergence of novel viruses and their discovery, as has been suggested for Heartland virus (Mansfield et al., 2017; Savage et al., 2013) and several hantaviruses (Hassell et al., 2017).
Climate had less influence on human-infective RNA virus discovery in all three regions in comparison to other predictors, in contrast to virus discovery at a global scale (Zhang et al., 2020). The underlying reason may be that the proportion of vector-borne viruses, whose distribution and abundance are strongly associated with the impact of climate on vector populations (Li et al., 2014), was lower in all three regions (United States: 23.2%; China: 21.3%; Africa: 27.1%) than in the world (41.7%) (Figure 3). Vector-borne viruses tend to have more restricted global ranges, so are less likely to appear in a study of any one region (Zhang et al., 2020).
In addition, the proportion of strictly zoonotic viruses was smaller in the three regions (United States: 30.5%; China: 16.3%; Africa: 33.6%) than in the world (58.7%) (Figure 2), so biodiversity contributed less to virus discovery in the three regions than in the world (Zhang et al., 2020). Exposure to a higher density of mammals played a slightly larger role in virus discovery in Africa than in China and the United States (Appendix 3-figure 9 to Appendix 3-figure 11).
Our discovery probability maps for 2010-2019 in the three regions captured most historical hotspots, though several small new areas in central-eastern and southwestern United States, eastern and western China, and northern Africa would also contribute more to virus discovery (Figure 6). Our model has good predictive ability, given that 84% (37/44) of new virus species in 2010-2019 were discovered in the high-risk areas we defined (85% percentiles of discovery probability within each region). Further, 35% (13/37) of those viruses discovered in high-risk areas since 2010 were discovered in potential new hotspots where there had not been any virus discoveries in the past.
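The percentages quoted above follow directly from the regional counts, as the short Python check below shows. It also includes a sketch of how high-risk cells might be selected from a probability map; the nearest-rank percentile rule and the function name `high_risk_cells` are assumptions, since the text does not say how the 85% percentile was computed.

```python
# New species since 2010: (found in high-risk areas, total), from the text.
new_species = {"United States": (6, 6), "China": (19, 19), "Africa": (12, 19)}

in_high_risk = sum(hit for hit, _ in new_species.values())
total_new = sum(total for _, total in new_species.values())
share_high_risk = 100 * in_high_risk / total_new   # about 84%
share_new_hotspots = 100 * 13 / in_high_risk       # about 35%


def high_risk_cells(probability_by_cell, percentile=85):
    """Cells at or above the given percentile of predicted probability.

    Uses the nearest-rank rule (an assumption for illustration).
    """
    ranked = sorted(probability_by_cell.values())
    rank = -(-percentile * len(ranked) // 100)  # ceiling division
    threshold = ranked[rank - 1]
    return {cell for cell, p in probability_by_cell.items() if p >= threshold}


# Hypothetical map with 20 cells and evenly spaced probabilities.
toy_map = {("cell", i): i / 20 for i in range(20)}
toy_high_risk = high_risk_cells(toy_map)
```

Because the threshold is applied within each region, a cell's high-risk status depends on its rank among cells in the same region rather than on the absolute predicted probability.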
Our subgroup analyses, which distinguished viruses first discovered in a region from those previously discovered elsewhere in the world, suggested that in both the United States and Africa discoveries of viruses first discovered in the region were more likely to be associated with climatic and biodiversity variables, whereas discoveries of viruses previously discovered elsewhere were more likely to be associated with socio-economic variables. This is plausible because, after a novel virus has been discovered elsewhere in the world, it is usually areas with a higher socio-economic level that first capture the virus in the local region.
This study had limitations. First, one common problem for data collected from literature review is the time lag between virus discovery and publication, in which case the virus data are likely to be matched to covariates in later decades. Second, we acknowledge that we may not have identified the earliest report for some well-known viruses such as yellow fever virus and measles virus, especially in the post-vaccination era. Third, we were unable to identify robust and comprehensive data for all three regions on virus discovery effort (e.g. government transparency, laboratory infrastructure and technology), although we interpret GDP and university count as indirect measures of the resources available for this activity. Previous studies have tried to use bibliographic data to correct for discovery effort (; ). However, this strategy worked less well for our data, as the frequency of papers published in virus-related scientific journals has only a weak link to publications on novel human-infective RNA viruses (Appendix 3-figure 1).
The study adds to our previous study (Zhang et al., 2020) in several ways. First, we constructed, for the first time, data sets of human-infective RNA virus discovery reflecting the viral richness in three broad regions of the world. Second, by focusing on regions we reduced the heterogeneity of the predictors, including those reflecting research effort. Research effort is less variable within restricted regions and therefore has less effect on virus detection. This implies that our predicted hotspots are closer to the geographic distribution of viruses in nature. Third, the predicted hotspots derived from the regional analysis are more precise than those at a global scale; for example, specific areas in the United States and China were identified as hotspots in the regional analysis, rather than the whole eastern area as in the global analysis. This helps target areas for future surveillance.
In conclusion, a heterogeneous pattern of virus discovery-driver relationships was identified across three regions and the globe. Within regions, virus discovery is driven more by land use and socio-economic variables; climate and biodiversity variables are consistently less important predictors than at a global scale. Our maps for 2010-2019 predicted, with good accuracy, that areas where human-infective RNA viruses had previously been discovered would continue to be discovery hotspots and that, in addition, several new areas in each region would contribute substantially to virus discovery. Results from the study could guide active surveillance for new human-infective viruses in high-risk areas.

Cumulative relative contribution of predictors to human-infective RNA virus discovery by group in each subgroup model. Subgroup 1 represents viruses first discovered in the region (United States or Africa); Subgroup 2 represents viruses first discovered elsewhere in the world. In the United States, the virus counts of Subgroup 1 and Subgroup 2 were 52 and 43, respectively. In Africa, the virus counts of Subgroup 1 and Subgroup 2 were 39 and 68, respectively. The relative contributions of all explanatory factors sum to 100% in each model, and each colour represents the cumulative relative contribution of all explanatory factors within each group.