Design and Analysis of Elimination Surveys for Neglected Tropical Diseases

Abstract As neglected tropical diseases approach elimination status, there is a need to develop efficient sampling strategies for confirmation (or not) that elimination criteria have been met. This is an inherently difficult task because the relative precision of a prevalence estimate deteriorates as prevalence decreases, and classic survey sampling strategies based on random sampling therefore require increasingly large sample sizes. More efficient strategies for survey design and analysis can be obtained by exploiting any spatial correlation in prevalence within a model-based geostatistics framework. This framework can be used for constructing predictive probability maps that can inform in-country decision makers of the likelihood that their elimination target has been met, and where to invest in additional sampling. We evaluated our methodology using a case study of lymphatic filariasis in Ghana, demonstrating that a geostatistical approach outperforms approaches currently used to determine an evaluation unit’s elimination status.


S(x) :
x ∈ A is a stationary Gaussian process with mean zero, variance σ 2 and correlation function ρ(u/ϕ), where u denotes distance. 2. Z(x) is uncorrelated Gaussian white noise with mean zero and variance ν 2 . 3. Conditional on S(·) and Z(·), the numbers, Y i , of positive test results are independent, binomially distributed with binomial denominators m i and binomial probabilities P (x i ) defined by the following equation In the above definition, the parameter ϕ determines the range of the spatial correlation in the log-odds of prevalence whilst the two variance components σ 2 and ν 2 determine its strength.
The conditional binomial distribution for the Y i follows from elementary probability theory.

Parameter estimation
For the geostatistical model defined above, we estimate the model parameters θ = (α, β, σ, ϕ, ν) by Monte Carlo maximum likelihood (Geyer and Thompson, 1992;Christensen, 2004). Maximum likelihood is a generally efficient method of estimation, whilst its Monte Carlo version extends its applicability to models whose likelihood functions are intractable. A general property of maximum likelihood estimators,θ, is that in large samples they are approximately unbiased and multivariate Normally distributed with a variance matrix, Σ, that can also be estimated as a by-product of the maximisation algorithm for findingθ. In the current context we apply the Monte Carlo maximum likelihood method to a transformed set of parameters to improve the multivariate Normal approximation (Diggle and Giorgi, 2019).
As with any numerical optimisation method, Monte Carlo maximum likelihood needs to be handled with care. In particular, it relies on the user providing reasonable starting values for the covariance parameter estimates σ 2 , ϕ and ν 2 . A useful tool for this is the variogram. The variogram, V (u), of a stationary spatial stochastic process is defined to be one half of the variance of the difference between values of the process at locations a distance u apart. For our model, the variogram of the log-odds of prevalence is The left-hand panel of Figure S1 gives a schematic picture of a generic variogram whilst the right-hand panel shows an estimated variogram for data on LF prevalence in Ghana, which we will analyse in our case-study. The shaded area on the estimated variogram covers the range of estimated variograms under repeated random permutations of the empirical logit-transformed prevalence values amongst the sampled locations. This represents the sampling distribution of the estimated variogram in the absence of spatial correlation, and confirms the presence of spatial correlation in the data with a range of very roughly 100km. We choose the shape of the correlation function and initial values for the model parameters by matching the generic form to the estimate; a "by eye" matching is usually good enough and is facilitated by the eyefit() function in the PrevMap package. The nugget is another name for the sampling variance of the response variable (in the current context, the empiricallogit-tranformed prevalence). The sill is another name for total variance of the response variable; the difference between the sill and the nugget is the variance of the spatially correlated conmponent of variance. The practical range is the distance u at which the spatial correlation function ρ(u) decays to 0.05. Right-hand panel: an estimated variogram calculated from the data on lymphatic filariasis prevalence in Ghana. The shaded area covers the range of estimated variograms under repeated random permutations of the empirical logit-transformed prevalence values among the sample locations.

Spatial prediction
For prevalence mapping, interest typically lies not in the parameters themselves but in what they tell us about the prevalence surface P = {P (x * j ) : j = 1, ..., N }. As noted above, the predictive distribution of P is its conditional distribution given the data. For plug-in prediction, we draw samples from this conditional distribution with the elements of θ held fixed at their Monte Carlo maximum likelihood estimates. This ignores parameter uncertainty. In many cases the effects of parameter uncertainty are negligible relative to the predictive uncertainty in P (x), because all of the data contribute information about θ whereas only data from locations relatively close to x contribute information about P (x). Nevertheless, to account for parameter uncertainty we can draw samples from the conditional distribution of P given θ, in which the value of θ for each sample is itself drawn from a multivariate Normal distribution with meanθ and variance matrixΣ.
The sampled prevalence surfaces, {P k : k = 1, ..., s} say, from the predictive distribution of P can then be used to make probability statements about the underlying true prevalence surface, or any of its properties. Formally, if T = T (P) is any predictive target, then {T k = T (P k ) : k = 1, ..., s} is a sample from the predictive distribution of T .
Reasonable choices for a "best guess" map of prevalence over the area of interest would be the sample mean or median of the sampled values of prevalence at each location. If, as is typically the case in elimination surveys, the primary objective is to assess whether a pre-specified prevalence theshold has been met, we advocate mapping the predictive exceedance probability, i.e. the probabilty for each location or area of interest that prevalence exceeds the pre-specified threshold. In the absence of a pre-declared threshold, we recommend constructing a sequence of such maps over a range of prevalence thresholds.  Figure S2: A set of 100 random data-locations (left-hand panel) and a more efficient, spatially regulated configuration of 100 data-locations (right-hand panel) placed randomly subject to the constraint that no two of them may be separated by a distance less than δ = 0.07

Model fitted to baseline data
We fitted the geostatistical model described in the methods section to baseline LF prevalence data shown in Figure S3, using Monte Carlo maximum likelihood. The resulting parameter estimates with their respective 95% confidence intervals are reported in   Figure S4 summarises the predicted prevalence surface at baseline.

Determining the eligibility of an EU
A pre-TAS survey is conducted in each EU after completion of five rounds of MDA with populaton coverage greater than 65% (WHO, 2011). If EU prevalence is less than 2% in all sentinel and spot-check sites then the EU is eligible for the TAS. For our analysis, we sampled three sentinel sites at random in each EU and classified an EU as eligible or not for the TAS accordingly. This resulted in 164 EUs being declared eligible for TAS (see Figure S5) and therefore included in our simulations.  Figure S4: Baseline predicted ICT mean prevalence at 5km pixel level (A); baseline predicted mean population weighted ICT prevalence at EU level (B); probability map for non-exceedance of 2% prevalence at pixel level (C); probability map for non-exceedance of 2% population weighted ICT prevalence at EU level (D).

Estimating the numbers of six to seven year old children
For the analyses reported in the paper, we treat each district as an EU. We downloaded population estimates for Ghana in five-year age bands at a resolution of 100m from WorldPop (www.worldpop.org). We multiplied the six-to-ten year old population raster by 2/5 and summed all pixels inside each EU to obtain an estimate of the total number of six-to-seven year old children ( Figure S6).  Figure S6: Population estimates (on the log10 scale) of 6-7 years old at pixel level (left), population estimates of 6-7 years old children at EU level (right). Figure S7 shows the locations of villages eligible to be sampled. Their locations were retrieved by combining data from OpenStreetMap (https://www.openstreetmap.org/) and Geonames (http://www.geonames.org).

Assessing the TAS survey design parameters
Using the information provided by Table A.5.2 in WHO (2011) and the estimated populations of six-to-seven year old children in each EU, we selected the design parameters to emulate the TAS. These are, for each EU: the total number of children to be tested; the number of clusters (villages or schools) to be sampled; the critical cut-off (maximum number of children testing positive to declare elimination).