Optimal capacity sharing for global genomic surveillance


 Recent technological advances and substantial cost reductions have made the genomic surveillance of pathogens during pandemics feasible. Our paper focuses on full genome sequencing as a tool that can serve two goals: the estimation of variant prevalences, and the identification of new variants. Assuming that capacity constraints limit the number of samples that can be sequenced, we solve for the optimal distribution of these capacities among countries. Our results show that if the principal goal of sequencing is prevalence estimation, then the optimal capacity distribution is less than proportional to the weights (e.g., sizes) of countries. If, however, the main aim of sequencing is the detection of new variants, capacities should be allocated to countries or regions that have the most infections. Applying our results to the sequencing of SARS-CoV-2 in 2021, we provide a comparison between the observed and a suggested optimal capacity distribution worldwide and in the EU. We believe that following such quantifiable guidance will increase the efficiency of genomic surveillance for pandemics.



Introduction
The SARS-CoV-2 pandemic has increased the scope and magnitude of genomic surveillance. By October 2022, more than 13.2 million genomic sequences of SARS-CoV-2 isolates had been shared through the Global Initiative on Sharing All Influenza Data genomic data repository (GISAID, 2022), originally established to track influenza variants. This international effort engendered by the pandemic allowed researchers to identify and characterize emerging mutations of the virus. However, the share of isolates sequenced presents large inequalities, especially between high/middle-income countries and developing nations (Mestanza et al., 2022; Chen et al., 2022; Crawford and Williams, 2021; Brito et al., 2021; Shey et al., 2020).
In this paper, we build a model that can guide global genomic surveillance strategy. Given the total available sequencing capacity, we derive how many samples should be sequenced in each country. To the best of our knowledge, the optimal distribution of sequencing capacity among countries has not yet been addressed by a formal model. We show how this optimal distribution depends on the prevalent number of infections, and the relative importance of policy goals.
Both health organizations (ECDC, 2021a,b,c,d; WHO, 2021a,b) and experts (Gardy et al., 2015; Gardy and Loman, 2018; Priesemann et al., 2021; Robishaw et al., 2021) have consistently urged countries to strengthen their efforts in genomic surveillance. The European Commission asks EU Member States to sequence at least 5%, and preferably 10%, of all SARS-CoV-2 positive test results (EC, 2021). Similarly, in September 2021, the WHO asked African countries to attain a 5% sequencing rate (WHO, 2021c). However, the source of this 5% threshold is typically unspecified. Furthermore, based on simulations relying on Danish data, Vavrek et al. (2021) show that 5% sampling of all positive tests allows the detection of emerging strains when they have a prevalence of 0.1% to 1.0%. However, their model takes into account only variant detection as the goal of sequencing. It also ignores the international aspect of sequencing efforts, which is the main focus of our paper.
Other experts have called for increased international cooperation in the domain of genomic sequencing (Lancet, 2021; Crawford and Williams, 2021; Grubaugh et al., 2021). We aim to reinforce their arguments by building a model that quantifies the advantages gained therefrom, and by providing specific recommendations on how cooperation may maximize these benefits. Our model's contributions are twofold. First, we explicitly identify two goals of sequencing, and show that they lead to different optimal capacity allocations. Second, we give specific guidelines for optimal distribution of sequencing capacity, based on the weights assigned to various goals. We demonstrate that, contrary to existing recommendations, it is generically suboptimal to sequence the same share of isolates in every country. Our model is general in terms of the pathogen concerned, and we believe it can provide guidance for genomic surveillance beyond SARS-CoV-2, for future pandemics.
https://doi.org/10.1016/j.epidem.2023.100690. Received 5 December 2022; Received in revised form 10 March 2023; Accepted 21 May 2023.
Our paper proceeds as follows. In Section 2, we construct a model of sequencing capacity allocation step by step, by considering two main goals of sequencing first separately, and then jointly. In Section 3, we apply the model to derive optimal sequencing capacity distribution in a global context, and within the European Union. Section 4 concludes and reflects on possible extensions of our framework.

A model of sequencing capacity allocation
Variant sequencing is an essential tool for epidemiology for a number of reasons. Based on the literature, we group these reasons into three classes. First, it provides information on current variant prevalence as a guide for control measures. For example, Brito et al. (2021), Crawford and Williams (2021) and Nadon et al. (2022) emphasize this goal.
Second, it allows the identification and characterization of new mutations as they emerge (see e.g. Burki, 2021; Duarte et al., 2021, 2022; Furuse, 2021; Grubaugh et al., 2021). The WHO uses information gained from sequencing to classify pathogen variants as Emerging Variants or Variants of Concern. Longitudinally, sequencing enables determining the mutation rates of various infectious agents. Sequencing may also be necessary to determine whether mutated pathogens have the ability to escape antibodies or vaccines. Combining sequencing data also enables the construction of phylogenetic trees.
Third, relatively rarely mentioned in the literature, it enables the analysis of transmission networks both between and within species (Quick et al., 2016). In this paper, we focus on the first two objectives, and ignore the third.
The only study we are aware of that takes account of these same two objectives (variant prevalence and variant detection) is Wohl et al. (2022). Their framework aims to calculate appropriate sample sizes for sequencing-based surveillance studies. It ignores, however, the aspect of international cooperation, which is the focus of this study.
We emphasize that our model aims to provide a short-term perspective for the analysis of optimal capacity allocation. This implies that sequencing capacities are fixed at a certain level. Further, our short-term approach allows us to abstract away from the problem of the timeliness of variant identification/detection. It also means that we can ignore the complexities of modeling virus transmission dynamics.

Goal 1: Estimating variant prevalence
We begin our analysis by focusing on estimating variant prevalence. Assume each country $i$ has a capacity constraint of $\bar{s}_i$ for full genome sequencing that determines the maximum number of sequenced samples, for a total available capacity of $S = \sum_i \bar{s}_i$. For parsimony, we ignore any costs related to the transportation of the isolates between countries. Suppose the genomes of $s_i$ virus-positive samples are sequenced. Within these samples, $s_i^1$ are found to be of variant 1, and $s_i^2 = s_i - s_i^1$ of variant 2. The best estimate regarding the prevalence of variant $j$ is given by the sample mean, $s_i^j / s_i$. The actual prevalence of this variant in country $i$ is denoted by $p_i^j$. The government of each country is interested in identifying the true variant ratio in the population so that it may adjust public health policy accordingly. For example, consider that one variant poses severe epidemiological risk, while the other does not. If the more risky variant is widespread, the optimal response requires strong measures to limit social contacts, strict lock-downs, etc. In contrast, if the less risky variant dominates, the best public health policy may be relatively lenient, forgoing the costs of limiting economic and social activity. Deviation in either direction may be costly for society. This is the reason behind the government's objective of identifying the true variant ratio.
One-country model. For simplicity, we assume that the objective function of the government is to minimize the quadratic difference between the estimated and the actual variant ratios. We call this difference the 'mistake function'. Specifically, country $i$'s decision problem is captured as:
$$\min_{s_i} \; E\left[\left(\frac{s_i^1}{s_i} - p_i^1\right)^2\right] \quad \text{subject to } s_i \le \bar{s}_i.$$
The decision variable of the government is the number of samples sequenced, $s_i$. We assume that sequencing within the capacity constraint is free.
Since $E[s_i^1] = s_i p_i^1$, we get that the variance of $s_i^1$ equals $\mathrm{Var}(s_i^1) = E\left[(s_i^1 - s_i p_i^1)^2\right]$. Therefore, the objective function can be written in terms of $\mathrm{Var}(s_i^1)$ and $s_i$:
$$E\left[\left(\frac{s_i^1}{s_i} - p_i^1\right)^2\right] = \frac{\mathrm{Var}(s_i^1)}{s_i^2}.$$
As a first approximation of this function, assume that $s_i^1$ is described by a binomial distribution with parameters $s_i$ (number of trials) and $p_i^1$ (probability of finding variant 1 in each trial). This assumes sampling with replacement, the outcome of each trial being independent. In this case, $s_i^1$ has variance $s_i p_i^1 (1 - p_i^1)$, and we get:
$$\frac{\mathrm{Var}(s_i^1)}{s_i^2} = \frac{p_i^1 (1 - p_i^1)}{s_i}.$$
This mistake function is decreasing in $s_i$, and thus, more sequencing leads to more accurate estimates of the distribution of variants within the infected, and a public policy more adapted to the epidemiological situation. Another key property of the mistake function is that it is convex in the capacity used, $s_i$. In other words, the informational benefit of each additional sequenced sample is strictly decreasing. Convexity has two important consequences. First, as Fig. 7 in the Appendix shows, there is an optimal, finite number of samples to be sequenced in case sequencing is costly. Second, in the two-country model, convexity will be key to determining the optimal allocation of sequencing capacity between countries. In Appendix A, we show that these main results hold under the more realistic assumption of sampling without replacement as well.
Finally, as is apparent from Fig. 1, the closer the distribution of prevalences is to fifty-fifty, the higher the mistake for any given amount sequenced. This result will carry over to the two-country model, and will be discussed in more detail below.
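The properties of the one-country mistake function can be checked numerically. The following is a minimal sketch, assuming binomial sampling as in the text; the function names `mistake` and `simulated_mistake` are illustrative and do not come from the paper. It evaluates the closed form p(1 − p)/s and compares it with a Monte Carlo estimate of the expected squared estimation error.

```python
import random


def mistake(p: float, s: int) -> float:
    """Closed-form mistake p(1 - p)/s under binomial sampling (with replacement)."""
    if s <= 0:
        raise ValueError("the number of sequenced samples must be positive")
    return p * (1.0 - p) / s


def simulated_mistake(p: float, s: int, runs: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[(s1/s - p)^2], where s1 ~ Binomial(s, p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        # Draw s independent samples; each is variant 1 with probability p.
        s1 = sum(1 for _ in range(s) if rng.random() < p)
        total += (s1 / s - p) ** 2
    return total / runs


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(p, mistake(p, 200), simulated_mistake(p, 200))
```

Under these assumptions, the closed form is decreasing and convex in the number of samples, and it is largest when the prevalence equals 0.5, in line with the discussion above.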
Two-country model. We analyze the optimal allocation of sequencing capacity between two countries ($A$ and $B$) from the perspective of a social planner. While we show in Appendix B that the results of our model carry over to the case of three or more countries, for expositional simplicity, we here adopt the two-country perspective.
The social planner aims to allocate the total sequencing capacity of $S$ in a way that minimizes the weighted sum of mistakes:
$$\min_{s_A, s_B} \; w_A \, E\left[\left(\frac{s_A^1}{s_A} - p_A^1\right)^2\right] + w_B \, E\left[\left(\frac{s_B^1}{s_B} - p_B^1\right)^2\right].$$
The intuitions behind this formula and the presence of weights are as follows. We assume that during a pandemic, the public health measures taken depend on the relative prevalence of more and less risky variants. The impact of these measures, however, is greater for 'larger' countries, since more people are affected, and the economic impact is also more significant. Some natural and convenient choices for the weights $w_i$ would be the population size of a country, or the size of its economy. Broadly, how the weights should be chosen is a problem for moral philosophy that also involves intricate empirical considerations, and is therefore beyond the scope of this paper. We believe that the population size of a country provides a good first approximation of appropriate weights to represent the preferences of a 'fair' social planner, and thus, in our empirical analysis in Section 3, we associate the weights $w_i$ with countries' population sizes.
As before, we assume that the sequencing outcomes $s_i^1$ follow a binomial distribution. The objective function of the social planner thus becomes:
$$\min_{s_A, s_B} \; w_A \frac{p_A^1 (1 - p_A^1)}{s_A} + w_B \frac{p_B^1 (1 - p_B^1)}{s_B},$$
subject to $s_A + s_B \le S$.
Standard optimization leads to:
$$s_A^* = S \, \frac{\sqrt{w_A \, p_A^1 (1 - p_A^1)}}{\sqrt{w_A \, p_A^1 (1 - p_A^1)} + \sqrt{w_B \, p_B^1 (1 - p_B^1)}}$$
and
$$s_B^* = S \, \frac{\sqrt{w_B \, p_B^1 (1 - p_B^1)}}{\sqrt{w_A \, p_A^1 (1 - p_A^1)} + \sqrt{w_B \, p_B^1 (1 - p_B^1)}}.$$
We examine the optimal solution by focusing on the ratio of optimal allocations $s_A^* / s_B^*$:
$$\frac{s_A^*}{s_B^*} = \sqrt{\frac{w_A}{w_B}} \cdot \sqrt{\frac{p_A^1 (1 - p_A^1)}{p_B^1 (1 - p_B^1)}}.$$
We find that the optimal allocations are determined by two factors, namely, the relative weight of countries and the relative extremeness of prevalences. We disentangle these two effects by assuming, first, equal extremeness and second, equal weights.
For equal prevalence of variants in the two countries ($p_A^1 = p_B^1$), which implies equal extremeness, the optimal allocation of capacity simplifies to a square-root rule:
$$\frac{s_A^*}{s_B^*} = \sqrt{\frac{w_A}{w_B}},$$
see Fig. 2. This has clear policy-relevant implications. Consider country weights to be chosen according to countries' population size. For example, the population of Spain is approximately four times that of Portugal. Since $\sqrt{4} = 2$, our model implies that in the optimal allocation of sequencing capacity, Spain should sequence only twice as many samples as Portugal. In general, for determining public policy based on variant prevalence, the optimal allocation of sequencing capacity between countries of uneven weights is less than proportional to country weights.
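The square-root rule can be illustrated with a short sketch. It assumes equal variant prevalences across countries, so that each country's allocation is proportional to the square root of its weight; the function name `sqrt_rule`, the capacity of 3000 samples, and the stylized 4:1 weights for Spain and Portugal are illustrative assumptions rather than values from the paper.

```python
from math import sqrt


def sqrt_rule(weights, capacity):
    """Split `capacity` proportionally to the square roots of the weights.

    Assumes equal variant prevalences in all countries, so only the
    weights matter for the allocation.
    """
    roots = [sqrt(w) for w in weights]
    total = sum(roots)
    return [capacity * r / total for r in roots]


# Stylized example: Spain's weight is four times Portugal's.
spain, portugal = sqrt_rule([4.0, 1.0], capacity=3000)
print(spain, portugal, spain / portugal)
```

In this stylized case the ratio of allocations is 2 rather than 4, which is the less-than-proportional pattern described above.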
For equal country weights ($w_A = w_B$), the optimal allocation of sequencing is a function of the extremeness of variant prevalences. As $p_i^1 (1 - p_i^1)$ achieves its maximum at $p_i^1 = 0.5$, countries with more evenly shared variants (i.e., where the prevalence of the two variants is closer to fifty-fifty) should receive a higher share of the total sequencing capacity. Fig. 3 illustrates the ratio of optimal allocations as a function of the variant prevalences in the two countries.
To illustrate the practical implications of our model, consider the following scenario. Initially, a certain variant is fully dominant in countries $A$ and $B$. A new, more virulent variant appears in country $A$, starts spreading there first, and appears only later in $B$. Let $p_A$ and $p_B$ denote the prevalence of the new variant in the two countries. Initially, the term $p_A (1 - p_A)$ will be larger than its counterpart in $B$, that is, variant shares are less extreme in $A$. Thus, initially, more sequencing should be done for isolates from country $A$, where the variant first appeared. Some time after the new variant takes over in $A$, reaching near-total prevalence ($p_A \sim 1$), the term $p_A (1 - p_A)$ will drop below that in $B$. From this point on, more sequencing capacity should be allocated for isolates from country $B$, where there is still epidemiological competition between the new and the old variants.
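The scenario above can be traced numerically. The sketch below assumes the two-country optimum in which each country's allocation is proportional to the square root of its weight times p(1 − p); the function name `optimal_share_a` and the prevalence values for the "early" and "late" phases are hypothetical and chosen only for illustration.

```python
from math import sqrt


def optimal_share_a(w_a: float, p_a: float, w_b: float, p_b: float) -> float:
    """Share of total capacity allocated to country A when minimizing the
    weighted sum of mistakes under binomial sampling.

    Raises ZeroDivisionError if p(1 - p) is zero in both countries, in
    which case the allocation is not pinned down.
    """
    root_a = sqrt(w_a * p_a * (1.0 - p_a))
    root_b = sqrt(w_b * p_b * (1.0 - p_b))
    return root_a / (root_a + root_b)


# Hypothetical prevalences of the new variant, equal country weights.
early = optimal_share_a(1.0, 0.30, 1.0, 0.05)  # variant spreading in A, rare in B
late = optimal_share_a(1.0, 0.98, 1.0, 0.60)   # variant dominant in A, contested in B
print(early, late)
```

With these hypothetical values, country A receives more than half of the capacity in the early phase and less than half in the late phase, mirroring the verbal argument.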

Goal 2: Identifying a new variant
So far we have assumed that the governments engage in sequencing in order to estimate the share of different variants among the infected as precisely as possible. In this section, we focus on another objective: detecting emerging variants. Indeed, one of the stated aims of sequencing is to identify emerging variants and potentially label them as "Variants of Interest (VOI)", "Variants of Concern (VOC)" or "Variants of High Consequence (VOHC)". This classification requires full genome sequencing.
We assume that detecting a new variant early on, and classifying it correctly, brings a fixed benefit $V$ to society. Moreover, as such a discovery is shared almost instantly all over the world, its benefit is enjoyed by all countries. The benefit incorporates (in monetary terms) all the expected benefits of future research, as well as the ability of governments to adapt to the new epidemiological situation. Assume again that there are two countries $A$ and $B$. A new variant that can out-compete the existing ones may emerge in either country $A$ or country $B$. We regard the possibility of two such variants emerging simultaneously to be vanishingly small. Let $q$ denote the probability that such a new (mutant) variant does not emerge over a unit period of time in either country. If the time period is short, $q$ is very close to one. Assuming that mutations appear randomly, and that the characteristics of the infected population do not differ between the countries, the conditional probability that the variant emerges in country $i$ is proportional to the number of infected in that country, $I_i$. The probabilities of a new mutant arising in country $A$ and $B$ are thus given by:
$$(1 - q) \frac{I_A}{I_A + I_B} \quad \text{and} \quad (1 - q) \frac{I_B}{I_A + I_B}.$$
Suppose that a new mutant indeed emerges in country $i$. Let $r$ denote the expected share of the new variant among all the infected after one unit of time. In other words, $r$ is proportional to how fast the new variant spreads. Then, if $s_i$ samples are sequenced, the probability that at least one of the mutant-containing samples is sequenced is $1 - (1 - r)^{s_i}$. Since $r$ is small, this probability can be approximated by $r s_i$. Substituting in the objective function, we get:
$$\max_{s_A, s_B} \; V (1 - q) \, r \, \frac{I_A s_A + I_B s_B}{I_A + I_B}$$
subject to $s_A + s_B \le S$.
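The linear approximation of the detection probability used above can be checked with a few lines of code. This is a sketch under the assumption that each sequenced sample independently contains the new variant with probability r; the function names and the numerical values are illustrative only.

```python
def detection_probability(r: float, s: int) -> float:
    """Exact probability that at least one of s sequenced samples
    contains the new variant, each independently with probability r."""
    return 1.0 - (1.0 - r) ** s


def linear_approximation(r: float, s: int) -> float:
    """First-order approximation r * s of the detection probability."""
    return r * s


# Illustrative values: a rare variant and a moderate number of samples.
for r, s in [(1e-4, 100), (1e-4, 1000), (1e-2, 500)]:
    print(r, s, detection_probability(r, s), linear_approximation(r, s))
```

The approximation never falls below the exact probability and is close when the product of r and s is small; for larger products (the last illustrative pair) it can exceed one, so the linear form should be read as a small-r simplification.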
Using that the total capacity constraint will be binding in optimum, i.e., $s_B = S - s_A$, the maximization problem can be simplified to:
$$\max_{s_A} \; V (1 - q) \, r \, \frac{(I_A - I_B) s_A + I_B S}{I_A + I_B}.$$
This leads to a bang-bang solution: if the only objective of sequencing is the detection of new variants, it is optimal to allocate all the sequencing capacity to the country with the larger number of infections. Formally, the optimal number of sequenced samples in country $A$ satisfies:
$$s_A^* = S \quad \text{if and only if} \quad I_A > I_B.$$
This result is in stark contrast with the recommendation derived from goal 1, i.e., when the objective is to estimate the prevalence of existing variants. Recall that under that objective, the optimal allocation of sequencing capacity is less than proportional to the relative size of countries. When the objective is to detect new variants, the optimal allocation of sequencing is more than proportional, in an extreme way: it is all-or-nothing.
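The all-or-nothing result follows from the linearity of the detection objective in the allocation. A minimal sketch, assuming the linearized detection probability and using hypothetical parameter values (the names `v`, `q`, `r` and their defaults are illustrative, and the even split on ties is an arbitrary convention since any split is then optimal):

```python
def expected_benefit(s_a, s_b, infected_a, infected_b, v=1.0, q=0.99, r=1e-4):
    """Expected detection benefit with the linearized detection probability."""
    total = infected_a + infected_b
    return v * (1.0 - q) * r * (infected_a * s_a + infected_b * s_b) / total


def detection_allocation(infected_a, infected_b, capacity):
    """Corner solution: all capacity goes to the country with more infections."""
    if infected_a > infected_b:
        return capacity, 0.0
    if infected_a < infected_b:
        return 0.0, capacity
    # Tie: every split yields the same benefit; return an even split by convention.
    return capacity / 2.0, capacity / 2.0


s_a, s_b = detection_allocation(200, 100, capacity=1000.0)
print(s_a, s_b, expected_benefit(s_a, s_b, 200, 100))
```

Because the objective is linear in the allocation, any interior split yields a weakly lower expected benefit than the corner returned here.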

Combined goals
Both choosing a policy that fits the epidemiological situation and detecting new variants are important when determining the allocation of sequencing capacity. In this subsection, we integrate these considerations within a unified, two-country framework.
With the notation of the preceding subsections, the decision problem becomes:
$$\min_{s_A, s_B} \; w_A \frac{p_A^1 (1 - p_A^1)}{s_A} + w_B \frac{p_B^1 (1 - p_B^1)}{s_B} - V (1 - q) \, r \, \frac{I_A s_A + I_B s_B}{I_A + I_B}$$
subject to $s_A + s_B \le S$.
For given parameter values, this problem is solvable with standard numerical methods. To get a qualitative sense of the effects of various principal parameters on the optimal capacity allocation, we adopt two simplifying assumptions. First, we use the share of infected to calculate the weights, in particular, we let $w_A = \frac{I_A}{I_A + I_B}$, $w_B = \frac{I_B}{I_A + I_B}$. Second, we assume that the variant shares are equal across countries, i.e., $p_A^1 = p_B^1 = p^1$. Let $\alpha = \frac{(1 - q) \cdot r V}{p^1 (1 - p^1)}$, representing the (relative) importance of identifying new variants. Indeed, the more likely mutations are, the faster they spread, and the greater the expected benefit associated with finding them, the larger $\alpha$ becomes; lower extremeness (variant shares closer to fifty-fifty), on the other hand, implies a lower $\alpha$. Ultimately, the social planner estimates the value of $V$, and thus, its preferences have a direct influence on the optimal solution by way of $\alpha$, the importance of identifying new variants.
With these simplifications at hand, using $s_B = S - s_A$, the social planner's problem is equivalent to:
$$\min_{s_A} \; \frac{I_A}{s_A} + \frac{I_B}{S - s_A} - \alpha (I_A - I_B) s_A.$$
Fig. 4A and B contrast the effect of parameter $\alpha$ on the optimal allocation when country $A$ has twice or half as many infected as country $B$, respectively. In Appendix D, we show that the optimal allocation to country $A$ is increasing in $\alpha$ if and only if country $A$ has more infected than $B$. This makes intuitive sense, as parameter $\alpha$ captures the relative importance of finding a new variant. Higher values of $\alpha$ lead to a reallocation of the sequencing capacity to the country with more infected. With $\alpha = 0$, new mutations are completely irrelevant, and we get back our model from Section 2.1, and the optimal allocation of capacity will follow our square-root rule. Conversely, with very high values of $\alpha$, we converge to the framework of Section 2.2, and the entire sequencing capacity will be allocated to the country with more infected. Fig. 4C and D generalize these relationships to arbitrary $I_A / I_B$ ratios.
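The combined problem has no simple closed form, but its first-order condition can be solved numerically. The sketch below assumes a one-dimensional objective of the form I_A/s + I_B/(S − s) − α(I_A − I_B)s, which is one way to write the simplified problem (weights equal to infection shares, equal variant shares, binding capacity constraint); this form, the function name `combined_share_a`, and the parameter values are assumptions made for illustration. Since such an objective is strictly convex on (0, S), bisection on its derivative finds the minimizer.

```python
def combined_share_a(i_a: float, i_b: float, alpha: float,
                     capacity: float = 1.0, tol: float = 1e-12) -> float:
    """Optimal share of capacity for country A in the combined problem.

    Minimizes f(s) = i_a/s + i_b/(capacity - s) - alpha*(i_a - i_b)*s
    over 0 < s < capacity by bisection on f'(s), which is increasing.
    """
    def derivative(s: float) -> float:
        return -i_a / s**2 + i_b / (capacity - s)**2 - alpha * (i_a - i_b)

    lo, hi = 0.0, capacity
    while hi - lo > tol * capacity:
        mid = (lo + hi) / 2.0
        if derivative(mid) < 0.0:
            lo = mid  # the minimizer lies to the right of mid
        else:
            hi = mid  # the minimizer lies to the left of mid
    return (lo + hi) / 2.0 / capacity


# Country A has twice as many infected as country B (illustrative values).
for alpha in (0.0, 1.0, 10.0, 1000.0):
    print(alpha, combined_share_a(2.0, 1.0, alpha))
```

Under these assumptions, α = 0 reproduces the square-root rule, and larger values of α shift capacity toward the country with more infected, approaching the all-or-nothing allocation.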

Genomic surveillance in the case of SARS-CoV-2: Reality and opportunities
While our theoretical model is general, and its insights are applicable to any pathogen, in this section we adapt our results to the SARS-CoV-2 pandemic, based on data from 2021. According to our datasets, there were approximately 204 million SARS-CoV-2 infections worldwide in 2021. GISAID reports that nearly 6.33 million sequences were submitted to its database, which means that ∼3.1% of all positive samples were sequenced, falling somewhat short of the 5% recommendation of health agencies (ECDC, 2021b,c,d; WHO, 2021a,b). However, there are large inequalities in sequencing efforts, especially between developed countries and the global south, see Fig. 5, Panel A. Out of 183 countries in our dataset, 103 did not sequence even 1% of their sample pool, including, surprisingly, well-off countries such as Saudi Arabia, Taiwan or Cyprus. Only 27 countries managed to reach a sequencing rate of 5%.
In order to derive recommendations based on our model, we focus exclusively on Goal 1, i.e., identifying variant prevalence for public policy. This way, we avoid arbitrarily choosing the relative importance of the two goals. Further, we identify weights with the population size of each country. Fig. 5, Panel B shows the share of global sequencing capacity that should be dedicated to each country based on our model from Section 2.1. The contrast with the actual distribution is apparent.
We acknowledge that transportation costs, legal constraints, as well as other transaction costs may make the global cooperation required to reach the optimum difficult to achieve. Therefore, we next focus on capacity sharing within the European Union, where such hurdles should be easier to overcome. Fig. 6, Panel A shows the actual share sequenced by countries in the EU, while Panel B represents the optimal distribution, based on the same assumptions as for Fig. 5. Finally, Table 1 compares the actual amounts sequenced with the recommendations of our model, as well as the infection-proportional sharing recommendation of the European Commission. Indeed, the Commission recommends sequencing 5% of all positive samples in each country (ECDC, 2021b,c,d). As the European Union only sequenced 3.99% of positive isolates collectively, indicating a capacity constraint even at the EU level, we use the 3.99% level for the infection-proportional sharing rule. Three observations can be made based on Table 1. First, countries in the North and West of Europe over-perform, both compared to the 3.99% recommendation and to our proposed distribution, while countries in the South and East of the EU under-perform. There are some positive (Denmark) and negative (Hungary and Cyprus) outliers. Second, unsurprisingly, we find a positive correlation between this sequencing surplus or deficit and the logarithm of per capita GDP, with a correlation coefficient of 0.56. Third, both the infection-proportional rule and our proposed rule improve the value of the objective function by an order of magnitude relative to the current sequencing allocation. Moreover, our proposed allocation rule entails an improvement of more than 25% over the rule advocated by international health agencies.

Table 1
Genomic sequencing within the European Union in 2021. Second column shows the number of sequences submitted to GISAID. Third column indicates the desired sequencing amounts when the same share of all positive samples is analyzed in each country (3.99%). Fourth column represents the desired sequencing amounts for estimating variant prevalence when country weights are determined by population, and the total capacity is equal to the sequencing capacity of 2021. Last row shows the expected value of the objective function (i.e., the mistake to be minimized) under the different sequencing scenarios.
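The kind of comparison summarized in the last row of Table 1 can be sketched as follows. The code assumes equal variant prevalences across countries, so that the population-weighted optimum for several countries is proportional to the square roots of the weights. All inputs below are HYPOTHETICAL placeholders, not data from the paper or from GISAID, and the function names are illustrative.

```python
from math import sqrt


def proportional_allocation(infections, capacity):
    """Same sequencing rate everywhere: capacity split in proportion to infections."""
    total = sum(infections)
    return [capacity * x / total for x in infections]


def square_root_allocation(weights, capacity):
    """Capacity split in proportion to the square roots of the country weights."""
    roots = [sqrt(w) for w in weights]
    total = sum(roots)
    return [capacity * r / total for r in roots]


def weighted_mistake(weights, allocation, p=0.5):
    """Weighted sum of binomial mistakes, with a common prevalence p."""
    return sum(w * p * (1.0 - p) / s for w, s in zip(weights, allocation))


# HYPOTHETICAL three-country example (populations and infections in millions).
populations = [80.0, 10.0, 1.0]
infections = [6.0, 2.0, 0.1]
observed = [9000.0, 500.0, 500.0]  # hypothetical "actual" sequencing counts
capacity = sum(observed)

for name, alloc in [
    ("observed", observed),
    ("infection-proportional", proportional_allocation(infections, capacity)),
    ("square-root", square_root_allocation(populations, capacity)),
]:
    print(name, weighted_mistake(populations, alloc))
```

For any inputs of this form, the square-root allocation attains the lowest value of this objective among allocations that use the same total capacity; the size of the gap depends entirely on the data supplied.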

Conclusion
Our paper provides a model of sequencing capacity sharing by specifying the two main goals of genomic surveillance: variant prevalence estimation and the identification of new pathogen variants. While Section 3 uses SARS-CoV-2 as a case study, our results are general, and do not depend on the type of the pathogen. Due to its novelty and, until recently, high cost, the principal uses of genomic surveillance were for influenza, Ebola, and SARS-CoV-2. Given the substantially increased probability of extreme epidemics due to environmental change (Marani et al., 2021), the relevance of finding optimal mechanisms for pathogen identification and control will only increase. Indeed, there is a wide consensus regarding the importance of genomic surveillance for ending the health threat posed by SARS-CoV-2 (Lazarus et al., 2022).
An advantage of our model is that the optimal distribution of sequencing takes into account the relative importance of various public policy goals, which can be parametrized by the policy-maker. For example, in some contexts, only the identification of Variants of Concern (i.e., goal 2) may be policy-relevant. We can get policy recommendations for this scenario as a special case of our model.
One limitation of our model is that it assumes that the transportation of isolates between countries/sequencing centers is costless. When transportation is costly, the optimal distribution of total sequencing capacity will be closer to each country's individual capacity. This point holds not only for financial, but also temporal costs. The more important the timeliness of detection is, the less international capacity sharing improves social outcomes. We hope that future work on the problem can address these issues more directly.
Another, related limitation of our framework is that we also ignore other transaction costs, such as legal and political constraints on the international transport of pathogen isolates. Anecdotal evidence suggests that these can create important barriers for international cooperation for genomic surveillance. However, if such barriers are present, the geographic domain of optimal capacity redistribution can be adjusted to the appropriate set within which these barriers are not present, or are manageable (e.g., EU, or NAFTA). Moreover, our model can also be adapted to solve capacity distribution within a country, e.g., considering the states of the U.S. or Germany, or the provinces of Canada or China.
We also abstract away from the complexity arising from each country pursuing its self-interest. Instead, our goal is to explore the theoretical maximum of gains on a collective level. A full game-theoretic analysis of these problems is beyond the scope of this paper.
Health experts have already highlighted the necessity of large-scale international cooperation in the efforts to track and control pandemics. Our work quantifies the gains that could be realized from such cooperation. We believe that instead of genomic autarky (i.e., each country focusing its sequencing efforts on infections within its borders), sequencing capacity sharing can improve outcomes for all parties, especially in the short run. In the long run, countries should aim for building up their sequencing capacities. However, for countries with limited material and human resources, and especially those that do not currently engage in genomic sequencing, this may take a significant amount of time.
In our view, the identification of new pathogen variants, especially variants of concern, should be treated as a global public good. In other words, capacity sharing has positive externalities, and thus, genomic sequencing potentially benefits everyone in the world. Thus, countries that contribute more to global sequencing efforts should not be penalized for identifying new variants, as was the case for South Africa after it identified the first Omicron variant. Furthermore, countries with larger capacities should sequence isolates from their neighbors and regional partners.

CRediT authorship contribution statement
Zsombor Z. Méder: Conceived and designed the analysis, Collected the data, Performed the analysis, Wrote the paper. Robert Somogyi: Conceived and designed the analysis, Collected the data, Performed the analysis, Wrote the paper.

Declaration of competing interest
The authors declare no conflict of interest.

If costs are equal, i.e., $c_A = c_B$, we get back the formula derived under capacity constraints in Section 2.1. If sequencing costs differ between countries, their impact is again less than proportional, and follows a square-root rule.
Appendix D. Relationship between the relative importance of the two policy goals, the number of infected, and the optimal allocation.
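We show that the optimal allocation to country $A$ is increasing in $\alpha$ if and only if country $A$ has more infected than $B$ when considering both goals in Section 2.3. Mathematically, we need to prove that the optimal allocation, denoted $s_A^*$ and given by:
$$s_A^* = \arg\min_{s_A} f(s_A), \qquad f(s_A) = \frac{I_A}{s_A} + \frac{I_B}{S - s_A} - \alpha (I_A - I_B) s_A,$$
satisfies $\frac{\partial s_A^*}{\partial \alpha} > 0$ if and only if $I_A > I_B$. First, we show that the objective function $f(s_A)$ is strictly convex. Indeed, straightforward calculations lead to:
$$f''(s_A) = \frac{2 I_A}{s_A^3} + \frac{2 I_B}{(S - s_A)^3} > 0.$$
Thus the following first-order condition is sufficient for global optimality:
$$f'(s_A^*) = -\frac{I_A}{(s_A^*)^2} + \frac{I_B}{(S - s_A^*)^2} - \alpha (I_A - I_B) = 0.$$
Applying the implicit function theorem to the above equation:
$$\frac{\partial s_A^*}{\partial \alpha} = \frac{I_A - I_B}{f''(s_A^*)}.$$
Using that $f''$ is strictly positive, we conclude that $\frac{\partial s_A^*}{\partial \alpha} > 0 \iff \frac{I_A}{I_B} > 1$. □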

Fig. 1. Mistake (i.e., expected difference between the estimated and actual variant ratios) for three different actual variant ratios at a sequencing capacity of 200 samples. More extreme variant prevalences lead to fewer mistakes.

Fig. 2. Share of total available capacity (%) used for sequencing in country $A$ ($s_A / S$) in optimum as a function of country $A$'s relative weight. Country $A$ is assumed to have a larger weight.

Fig. 3. Share of total available capacity (%) used for sequencing in country $A$ ($s_A / S$) in optimum as a function of the prevalence of variant 1 in countries $A$ and $B$. Countries with more extreme variant distributions require a lower share of the capacity.

Fig. 4. Share of total available capacity (%) used for sequencing in country $A$ in optimum when both estimating variant prevalence and identifying a new variant are important. Panels A and B: Share in optimum as a function of the relative importance of identifying a new variant when the number of infected in $A$ is twice (A)/half (B) that in country $B$. Panels C and D: Share in optimum as a function of the relative importance of identifying a new variant and relative number of infected. Country $A$ has more (C)/fewer (D) infected than $B$.

Fig. 5. SARS-CoV-2 sequencing worldwide in 2021. Panel A: Actual sequencing as a share of global available capacity. Panel B: Optimal sequencing allocation for estimating variant prevalence as a share of global available capacity. Country weights are determined by population size.

Fig. 6. SARS-CoV-2 sequencing in the European Union in 2021. Panel A: Actual sequencing as a share of available EU capacity. Panel B: Optimal sequencing allocation for estimating variant prevalence as a share of available EU capacity. Country weights are determined by population size.