A threshold-free approach with age-dependency for estimating malaria seroprevalence

In malaria serology analysis, the standard approach to obtain seroprevalence, i.e the proportion of seropositive individuals in a population, is based on a threshold which is used to classify individuals as seropositive or seronegative. The choice of this threshold is often arbitrary and is based on methods that ignore the age-dependency of the antibody distribution. Using cross-sectional antibody data from the Western Kenyan Highlands, this paper introduces a novel approach that has three main advantages over the current threshold-based approach: it avoids the use of thresholds; it accounts for the age dependency of malaria antibodies; and it allows us to propagate the uncertainty from the classification of individuals into seropositive and seronegative when estimating seroprevalence. The reversible catalytic model is used as an example for illustrating how to propagate this uncertainty into the parameter estimates of the model. This paper finds that accounting for age-dependency leads to a better fit to the data than the standard approach which uses a single threshold across all ages. Additionally, the paper also finds that the proposed threshold-free approach is more robust against the selection of different age-groups when estimating seroprevalence. The novel threshold-free approach presented in this paper provides a statistically principled and more objective approach to estimating malaria seroprevalence. The introduced statistical framework also provides a means to compare results across studies which may use different age ranges for the estimation of seroprevalence.


Introduction
Thanks to increased diagnostic capacity, preventative measures and a scale-up of interventions, there has been an overall decrease in malaria burden worldwide [1,2]. However, malaria still remains a significant global public health threat in sub-Saharan Africa, where Plasmodium falciparum (P. falciparum) is the predominant parasite. A total 229 million cases and 409,000 deaths have been estimated globally in 2019 [3]. Additionally, the decrease in malaria is heterogeneous across regions, countries and communities [2][3][4][5][6], posing additional challenges to malaria elimination efforts. These challenges require robust surveillance mechanisms which can adapt to the changing epidemiology, enabling a more targeted approach to intervention strategies [4,7].
To estimate malaria exposure and transmission, analysis of human serology data has emerged as a viable alternative approach to disease risk metrics that are based on the detection of malaria parasites in humans and mosquito populations [8][9][10]. Because of the persistence of antibodies after infection, their concentration is less influenced by the seasonality of transmission and can be used as an indicator of the cumulative exposure to malaria. Additionally, antibodies, unlike the Plasmodium parasite, can be easily detected even in low transmission areas [8,[11][12][13].
Analysis of seroprevalence-i.e the proportion of 'seropositive' individuals-is often carried out using reversible catalytic models (RCM). These models allow for the estimation of seroconversion rates which quantify the transmission intensity and correspond to the rate at which individuals convert from seronegative to seropositive through exposure to malaria parasites over time [8,9]. Alternatively, continuous antibody measurements can be used in antibody acquisition models to estimate boosting rates, another measure of transmission intensity, which represents the rate at which antibodies are boosted upon exposure to parasites [9,10,14]. Such indicators of transmission intensity can be used to inform decisions on intervention strategies by identifying hot-spots of transmission where individuals are likely to exceed a specified degree of exposure [15,16].
To estimate seroprevalence, classification of individuals as seropositive or seronegative is required. The most commonly used approach is to identify a suitable threshold of antibody density beyond which individuals are classified as seropositive, and below as seronegative [8,9,11]. To this end, mixture distributions are first fitted to the antibody density data, assuming that continuous antibody measurements consists of two latent distributions, one for the seronegative and one for the seropositive populations. By using the point estimates of the mean, µ S − and standard deviation, σ S− , of the seronegative distribution S − , the seropositivity threshold is often set to µ + 3σ [9,[17][18][19], while other studies have instead used µ + 2σ [20][21][22]. An alternative to this approach is to define thresholds based on the predictive probability of being seropositive resulting from the fitted mixture distribution [9].
The major drawback of threshold-based approaches is that the choice of the threshold is arbitrary and it is unclear to what extent this affects the results of the statistical analysis of serological data, as biased estimates of seroprevalence can in fact arise from the misclassification of individuals as seronegative or seropositive [23]. Additionally, in the case of the probability thresholds, individuals whose probability of belonging to either the seronegative or seropositive groups is close to 50% are often classified as 'intermediate' , and are therefore excluded from analysis [9,23]. Furthermore, the uncertainty around the estimated thresholds and probabilities used for the classification of individuals, is ignored.
In addition to these drawbacks, classical analysis of malaria serology data does not account for the age dependency of antibody distribution when calculating thresholds. Typically in mixture models, a threshold is obtained by assuming a constant mixing probability across all ages [14]. This assumption is questionable since, in malaria endemic settings, it is well known that antibody levels are in fact age-dependent [24,25] and thus the likeihood of being seropositive is expected to increase with age. A 2011 study by Ster [26] incorporated age-dependency for varicella zoster virus serology mixture models, however, this principle has not been applied to malaria serology data To address these issues, Kyomuhangi and Giorgi [14] proposed a unified modelling framework for the analysis of malaria serology data that uses the continuous antibody measurement rather than thresholds to estimate transmission parameters. However, as acknowledged by the authors, this modelling framework requires a larger amount of data than is usually available in serological studies to reliably estimate all the model parameters, thus limiting its applicability.
This paper proposes a novel modelling approach for the analysis of serological data that retains the same properties of the approach proposed in Kyomuhangi and Giorgi [14], but is also more parsimonious. More specifically, this novel approach satisfies the following requirements: (1) it accounts for age dependency of antibody levels; (2) it avoids the use of any threshold; and (3) it enables accounting for and propagating the uncertainty in the classification of seropositive and seronegative individuals. Using cross-sectional antibody data from Western Kenya, this paper demonstrates (1) the properties of this new methodology for estimating malaria seroprevalence, and (2) how to incorporate the uncertainty around the resulting seroprevalence estimates, using the standard RCM as an example. The discussion section in this paper explains how the principles used to develop this novel approach can be extended to more complex analysis of serological data.

Existing methods for estimating seroprevalence
This section outlines the most commonly used methods in the analysis of malaria serology data, to classify individuals as seropositive and seronegative, using a twocomponent mixture distribution.
Let Y i denote the log-transformed antibody measurement for the i-th individual in a sample, S − denote the seronegative classification, and S + denote the seropositive classification. Assuming independent and identically distributed realizations for a sample of n individuals, and µ to be the mean level of antibodies in the S − distribution, the density function of Y = (Y 1 , . . . , Y n ) is where f S − is a univariate log-normal distribution with mean µ and variance σ 2 S − for the S − population, and analogously for S + , with δ > 1 being a multiplicative factor accounting for higher mean antibodies in the S + distribution. p is the probability of being S + . Let C i and C * i denote the random variables representing classification based on the mixture model and true classification of the i-th individual, respectively. Based on the seropositivity threshold κ , the classification of individuals, say C i , into S + and S − is defined as Since most statistical analyses of malaria serology data use κ = µ S − + 3σ S − as the threshold, this will also be used in this paper.

Proposed method for estimating seroprevalence
This paper proposes a novel modelling framework that overcomes the limits of the approach described in the previous section, by incorporating age-dependency into the mixture distribution in (1), and by propagating the uncertainty in the classification of individuals into S + and S − using a Monte Carlo approach.
In this framework, age dependency is introduced into (1) using linear regression, as described in Kyomuhangi and Giorgi [14]. Assuming µ(a i ) to be the mean level of antibodies in the S − distribution for a given age a i , (1) becomes where p(a i ) is the probability of being S + at age a. Note that the seronegative distribution is also modelled as agedependent to account for potential residual antibody levels due to previous infections. The age dependencies in ( p(a) and µ(a) are modeled using logit linear and log linear regression, respectively, such that where g 2 (a) is a function of age that can be specified through empirical inspection of the data. In the case of g 1 (a) , identifying a suitable specification may be more problematic because of the need to dichotomize the data. However, because it is well established that p(a) increases for increasing a, a pragmatic approach would be, for example, to specify a logit-linear regression on a as illustrated later in this paper. Note that predictor for these models can take other functional forms such as polynomials and smoothing splines to increase their flexibility.
Using the resulting mixture distribution, the predictive probability of belonging to the S + distribution for each sampled individual is computed by conditioning on the observed antibody measurement Y i = y i and age a i , to give where θ S − = (µ(a i ), σ 2 S − ) and θ S + = (δµ(a i ), σ 2 S + ). Based on the above expressions, when then simulate 10,000 classifications C * i for a every single sampled individual. The resulting 10,000 data-sets generated from this process are then fed into the second stage of the analysis, which is explained in the next section.
There are two main advantages of this modelling approach. The first is that it avoids the use of a threshold κ as in (2) and uses the generated samples C i to propagate the uncertainty of the classification into S + and S − . The second is that the empirical approach used to account the age-dependency combines information across all ages as described in (4), and is therefore more efficient than fitting separate mixtures distribution for each age.
The structure of this modeling framework is illustrated in Fig. 1.

The reversible catalytic model
The RCM assumes that individuals are born S − and, after becoming S + upon exposure to malaria, can revert to S − in the absence of exposure. Since antibody data are believed to represent the cumulative exposure of individuals during their lifetime, an individual's age prior to the sample collection is used as proxy for historical time.
Let (a) denote the seroconversion rate for an individual at age a and ω the seroreversion rate. According to the RCM, the temporal dynamics that regulate the proportion of S + individuals of age a, i.e. p(a), are expressed by the following differential equation The seroconversion rate (a) can be modelled using a variety of approaches, the simplest of which assumes constant transmission, i.e. (a) = for all a. Due to poor identifiability of the seroreversion rate ω , this is typically fixed and assumed to be constant across ages [9,10,14,27]. Assuming a constant and ω in (6) gives the following solution More flexible models could also be used to account for the temporal variation in , including a step-wise reduction or linear reduction in transmission [9,27]. Additionally, other specifications of the RCM, for example the superinfection RCM [19] could be applied in the proposed approach. However in this paper, while comparing existing methods and the proposed approach described in the previous sections, attention is restricted to the RCM defined in the above equation for simplicity.
In order to propagate the uncertainty in classification of individuals as S + and S − , for the purpose of estimating parameters of the RCM, the likelihood of a Binomial (6) dp da = (a)(1 − p(a)) − ωp(a).
distribution with probability p(a) is maximized, as indicated in (7), for each of the 10,000 data-sets for the outcome C i as described in the previous sections. This gives 10,000 different estimates for , which is summarized by taking their mean and 2.5% and 97.5% quantiles.
The estimation of the model parameters is conducted using the maximum likelihood estimation method. Let z i denote the binary variable indicating seropositivity ( z i = 1 ) or seronegativity ( z i = 0 ) for the i-th individual; the likelihood function for the RCM in (7) is then

Data
Data is taken from a cross-sectional survey which was conducted in Rachuonyo South District (34.75 to 34.95 • E, 0.41 to 0.52 • S), in the western Kenyan highlands (1400 m to 1600 m altitude), in 2011 over a 100 km 2 area. This survey was the baseline for a cluster-randomized controlled trial whose aim was to determine the community effect of interventions targeted at malaria prevalence hotspots. Further details of the study protocol can be found in [28]. At the time of the survey, malaria transmission in this area was described as low but highly heterogeneous, and seasonal, following peaks in rainfall, typically between March-June and October-November [16,28].
The majority of malaria cases were attributed to P. falciparum, with Anopheles gambiae sensu stricto (s.s.), Anopheles arabiensis, and Anopheles funestus being the predominant vector species. Malaria control  3) and (4) interventions at the time included distribution of insecticide-treated nets which had been ongoing for many years, and indoor residual spraying which started in 2009 [29,30].
To generate the serology data, finger prick blood samples were collected from participants on filter paper and used to detect total Immunoglobulin G (IgG) antibodies against the blood-stage P. falciparum antigens, apical membrane antigen 1 (PfAMA1) and merozoite surface protein-1 19 (PfMSP1 19 ). Standard Enzyme-linked immunoassay (ELISA) methods [11,31] were used to obtain Optical density (OD) values. Further details of the study design and data collection can be found in [28].
Analysis is first restricted to individuals between 1 and 16 years. Additional analysis on 1-20 year olds, 1-30 year olds, and 1-50 year olds is presented in Additional file 1. The data is split this way in order to investigate the effect of selecting different age-groups on the performance of M1 and M2. In what follows, the focus of analysis is the 1-16 year old age group.
The data-set consists of n = 9549 children for the PfAMA1 analysis and n = 9576 for the PfMSP1 19 analysis. Figure 2 shows the age and OD distributions of the individuals included in the analyses.

Specifications of the model components
In this analysis, a comparison is conducted between two modelling approaches in the estimation of seroconversion rates, for both PfAMA1 and PfMSP1 19 .
The first, which is denoted as M1, is the classic threshold-based approach as defined in (1), which considers seropositivity according to (2). After dichotomization of the antibody measurements, the RCM, as described by (6), is fitted using the maximum likelihood method.
The second modelling approach, which is denoted as M2, is the proposed threshold-free approach described in the previous sections. For this analysis, the age-dependency of the mixture models for the two antigens is modelled using an empirical approach. Based on the Fig. 3a for PfAMA1, a linear spline with a knot at the age of 10 years is used. This is formally expressed as  Fig. 3b, a log-linear model is used. This is given by To account for the age dependency in p(a), age is introduced as a logit-linear predictor of p(a), i.e.
Note that M1 is recovered when all the regression parameters except β 0 and β 0 in (9), (10) and (11) are set to 0. Therefore for M1, only the estimates for β 0 and β 0 will be reported.
For both M1 and M2, due to the truncated nature of the antibody distributions, truncated log-normal distributions are used for both antigens. The upper limit, say y max (a i ) , of the truncation is estimated for each age group as the maximum observed value of OD. Hence, the likelihood function in (3) now becomes where F S + and F S − are the cumulative distribution functions of seropositive and seronegative probability distributions, respectively.
Finally, for the RCM, a range of values from 0.01 to 1 for ω are used, hence assuming that seroreversion events for individuals would occur between 1 and 100 years [8,11,15,32]. Profile likelihood analysis indicated flat likelihood surfaces for PfMSP1 19 , and a tendency to ω = 0 for PfAMA1 (see Additional file 1: Fig. S1), therefore ω is set to three values, namely 0.01, 0.5 and 1 to represent low, medium and high seroconversion rate respectively. In what follows, results are presented for the best performing value of ω for each antigen, i.e. ω = 0.01 for PfAMA1 and ω = 1 for PfMSP1 19 . Note that these values are not the maximum likelihood estimates for ω , but rather the best performing values out of the three choices stated above.
A summary of model parameters to estimate in this analysis is provided in Table 1. In order to compare how well M1 and M2 fit the data, the Akaike information criterion (AIC) is used. This is defined as 2p − 2 log(L) , where p is the number of parameters in the model and L is the value of the likelihood function evaluated at the maximum likelihood estimate. The AIC is used to quantify the goodness of fit of a model to the data while penalizing models that contain a larger number of parameters. The AIC can be used to compare models that are not nested, i.e. models that are not contained within each other. A lower AIC usually indicates a better fit to the data. All statistical analyses are conducted in the R version 4.1.1 (2021-08-10) [33] software environment, and  Figure 4 shows the antibody distribution and seropositivity thresholds for both antigens, as derived from M1. PfAMA1 shows greater separation between the components, as well as lower seropositivity threshold. A comparison of AIC in Table 2 19 ), indicating that the age-dependent mixture model in M2 is a better fit to the data compared to M1, which assumes a single mixture distribution across all ages. This age dependency is illustrated in Figs. 5 and 6, which show an increase in mean antibody levels and the mixture distribution with age. Of note, the increase is much more prominent for PfAMA1, than for PfMSP1 19 .

Results
Additionally, in both M1 and M2, the separation between the two components of the mixture distribution is more prominent in PfAMA1 (Fig. 5) than in PfMSP1 19 (Fig. 6) where there is poor separation of the S + and S − distributions. In the M2 PfAMA1 analysis, the bi-modal distribution is more pronounced between the ages of 5 to 10 years, and less so in younger and older individuals.
Figs. 5 and 6 also indicate that age modulates the seropositivity threshold. Figure 7 shows the difference in seroprevalence estimation between M1 and M2, with overall higher estimates across age in the latter model. For both antigens, the uncertainty resulting from M2, as quantified by the 95% confidence intervals (CIs), in the seroprevalence estimates of the RCM is considerably larger than M1. This is because the M2 estimates are obtained by incorporating the uncertainty in the seropositivity classification, while M1 ignores this uncertainty, resulting in very narrow confidence intervals for M1. Figure 7 also shows that the RCM fitted using M2, provides a good interpolation of the seroprevalence for PfMSP1 19 but less so for the PfAMA1. Although most of the seroprevalence points fall within the 95% confidence interval, it is evident that, approaching 15 years of age, where the observed seroprevalence is not contained within the 95% intervals, the model underestimates seroprevalence. This is made more clear by visualizing the the y-axis of the plot in Fig. 7 on the logit-scale (see Additional file 1: Fig. S2). This indicates that, in the case of PfAMA1, the assumptions of the standard RCM may not be fully supported by the data, which is undetected by the standard threshold-based model M1.
The distributions of estimates derived from M2 for both antigens are shown in Fig. 8

Discussion
This paper presents a threshold-free method for estimating seroprevalence that incorporates the age dependency of malaria antibodies in the classification of individuals into seropositive and seronegative. Additionally, the paper demonstrates how the uncertainty of this classification can be accounted for in a the RCM. Note that this approach can be applied to other types of analyses that require the use of models different from the RCM. For example, if the goal of the study is to map seroprevalence data within a study area, the simulated classifications (previously denoted by C i ) could be used as the input of a geostatistical model whose results are then summarized in a similar fashion as presented for the RCM in this paper.
In the application of the proposed modelling framework to the RCM, seroprevalence is modelled into two different stages, using two different approaches: first, in a mixture distribution, using a logit-linear regression; and secondly, in an RCM, using Eq. (7). This raises the question of a mathematical inconsistency since both equations cannot be simultaneously true. Note that this issue also applies to previous work which uses threshold-based RCMs [8,11,15,17], whereby the threshold is first generated using a constant mixing probability, which would correspond to an intercept-only logit-linear regression in this paper, and is then modelled using Eq. (7). To avoid this issue, one solution would be to replace the logit-linear regression on age for seroprevalence, with Eq. (7), hence embedding the assumptions of the RCM directly into the mixture distribution. However, the preference remains with the approach illustrated in this paper for the following reasons. First, the use of a logit-linear regression on age in the mixture distributions allows us to develop an empirical approach that is more flexible than an RCM and can better capture the variations of the antibody distributions across age. Secondly, the use of the RCM-based Eq. (7) for seroprevalence also in the mixture distributions would yield a circular argument, whereby the outcome to be modelled with the RCM would be already generated under an RCM, thus making any validation of the RCM assumptions a vain exercise. As shown in the case-study with western Kenya data, the approach presented in this paper can in fact better detect the inadequacy of the RCM than the current threshold-based approach.
The results in this paper show clear age-dependency in the mean antibody levels, the mixture distribution, and the threshold. The differences between PfAMA1 and PfMSP1 19 indicate that the magnitude of this dependency is likely dependent on the type of antigen and the dynamics of the immune response to it. Notably, results provide evidence that different combinations of age-groups in analysis lead to different seropositivity thresholds and, therefore different seroprevalence estimates. This inconsistency has significant implications for control programmes which rely on these results to direct intervention strategies. A key advantage of the threshold-free approach is that it is unaffected by the age limits considered for the analysis.
Furthermore, different definitions of the seropositivity threshold (i.e. between 2 and 5 standard deviations of the mean of the seronegative distribution) are an additional source of inconsistency in current literature. This makes the comparability of results reported across malaria serology studies more difficult. Avoiding the use of an arbitrary threshold, as described in this paper, provides a statistically rigorous solution to this problem and facilitates the comparison of results across studies.
The limitations of dichotomizing continuous measurements into positive and negative for statistical analysis are well established in the literature, and include loss of information which affects the ability to reliably recover regression relationships, as well as reducing the the precision of parameter estimates [34][35][36]. However when the scientific interest is in estimating seroprevalence-as this paper sets out to do-rather than modeling the dynamics that affect mean antibody antibody levels, dichotomization may be appropriate. This is because the approach presented in this paper results in a more parsimonious model than the unified mechanistic model presented in Kyomuhangi and Giorgi [14], allowing for a more efficient estimation of parameters that only modulate seroprevalence.
Depending on the degree of overlap between the seronegative and seropositive populations in the sample, mixture models can be difficult to estimate. The PfMSP1 19 analysis illustrates this key limitation. Due to the poor separation of the seronegative and seropositive populations, the estimate for shows a large value, which is inconsistent with other epidemiological data from the study site. This poor separation could be a biological feature of the antibody response to PfMSP1 19 , or due to poor dynamic range of the serological assay that generated the data. Similarly, in areas of high transmission where the majority of the population is seropositive [10,13], or in elimination settings where there are very few seropositive cases, estimating the model parameters may be difficult. In these scenarios, if prior knowledge on some of the components of the model is available, Bayesian methods of inference can be used to alleviate estimation issues though the specification of suitable prior distributions. Additionally, to deal with skewness of the antibody distributions which can still persists after taking the logarithmic transformation, a mixture of skew-Normal distributions can be used in the mixture model to model the left asymmetry of the seropositive population [37].
When fitting the RCM, the seroreversion rate may also be difficult to estimate, hence ω is usually fixed [9]. In this paper, the simplest form of the RCM, which assumes constant transmission was used. This ignores possible changes in transmission due to, for example interventions in the recent past. While the resulting seroprevalence curves from the RCM do not fit the data very well in Fig. 7, the majority of seroprevalence points fall within the 95% CIs of the seroprevalence curves. Several studies have proposed modifications which relax this assumption of constant transmission [9,17,27,38], and each of these can be fitted by using the Monte Carlo approach proposed in this paper to propagate the uncertainty in the classification of seropositive individuals.

Conclusion
This paper proposes a new threshold-free method for estimating malaria seroprevalence which accounts for age dependency of antibodies through regression, and incorporates uncertainty around the estimates in subsequent analysis of the data. This method is more robust to varying conditions of analysis and provides more consistent estimates than the traditional threshold-based approach.

Learn more biomedcentral.com/submissions
Ready to submit your research Ready to submit your research ? Choose BMC and benefit from: ? Choose BMC and benefit from: