Joint quantile disease mapping with application to malaria and G6PD deficiency

Statistical analysis based on quantile methods is more comprehensive, flexible and less sensitive to outliers when compared to mean methods. Joint disease mapping is useful for inferring correlation between different diseases. Most studies investigate this link through multiple correlated mean regressions. We propose a joint quantile regression framework for multiple diseases where different quantile levels can be considered. We are motivated by the theorized link between the presence of malaria and the gene deficiency G6PD, where medical scientists have anecdotally discovered a possible link between high levels of G6PD and lower than expected levels of malaria initially pointing towards the occurrence of G6PD inhibiting the occurrence of malaria. Thus, the need for flexible joint quantile regression in a disease mapping framework arises. Our model can be used for linear and nonlinear effects of covariates by stochastic splines since we define it as a latent Gaussian model. We perform Bayesian inference using the R integrated nested Laplace approximation, suitable even for large datasets. Finally, we illustrate the model’s applicability by considering data from 21 countries, although better data are needed to prove a significant relationship. The proposed methodology offers a framework for future studies of interrelated disease phenomena.


INTRODUCTION
Malaria is considered a leading cause of mortality worldwide, and the disease is most prominent in Africa.It has been estimated that malaria in 2017 affected about 219 million people and causing around 435,000 deaths 1 .The Malaria Atlas Project 2 provides a global database on malaria risk in order to solve critical questions.This project disseminates free, accurate, and up-to-date geographical on malaria and associated topics.One of their research outputs points out a relationship between malaria and Glucose 6 phosphate dehydrogenase (G6PD) deficiency, a genetic disorder that affects red blood cells.The G6PD is a gene that provides instructions for making the glucose-6-phosphate dehydrogenase enzyme.The research by the Malaria Atlas Project found that G6PD deficiency is common in populations that have a high level of malaria infection 3 .Studies dating back to the early 1960s, 4,5 , postulated that G6PD deficiency inhibits the occurrence of malaria.The reasoning was that G6PD deficiency leads to the accumulation of oxygen radicals inside the red blood cells ( 2 2 ).This accumulation offers resistance against malaria infection because the Plasmodium falciparum parasite (the parasite that causes malaria) does not have any antioxidant mechanism, which makes them more vulnerable to oxygen radicals 6,7 .The hypothesis that G6PD deficiency provides some protection against Plasmodium falciparum malaria was further supported by a review by Greene 8 , published in 1993, based on experimental and population studies.At the same time, it was acknowledged that there is not enough data in population studies, due to limited sample sizes, to produce concluding evidence 9,8 .However, there are opposing arguments, also based on limited population studies, stating that G6PD deficiency by itself is unlikely to produce a significant protection against malaria, see 10 .In 1995, Ruwende et al. 11 suggest, from two case-control studies of more than 2,000 African children, that G6PD deficiency reduced the risk of severe malaria by around 50%.In 2017, a systematic review by Mbanefo et al. 12 based on a selection of 28 various studies arrived at that G6PD deficiency could potentially offer some protection against uncomplicated malaria, but less likely so for severe malaria.
Following the results of 5,9 , it is of interest to perform a statistical inference of such a relationship between diseases and quantify the uncertainties involved.For this case, we propose using a quantile-based joint model, instead of the standard joint that models correlation of the means of the two diseases, since the G6PD deficiency may act as a resistance factor against malaria, while not the other way around.Thus, to identify possible directional correlation, this study looks at the joint quantiles between the two diseases by modeling the high quantile of G6PD deficiency and the low quantile of malaria.The joint quantile model can be applied to other disease mapping problems.Quantile regression was introduced by Koenker and Basset ? .After that the quantile regression has been widely used, in particular for Bayesian spatial analysis 13 .The R package bayesQR proposed by 14 can be used to estimate the parameters in quantile regression using a Bayesian approach with the asymmetric Laplace distribution.This package supports both continuous-dependent and binary-dependent variables.In 15 , the authors proposed using a negative-binomial regression -quantiles approach with an ecological regression model with application to disease mapping of lip cancer.
The main difference to our work is that we consider joint quantile regression with two diseases, instead of a single one.In joint quantile regression, one can model spatial dependence through a Gaussian or t-copula process of the quantile levels 16 , which could provide certain benefits for cases with heavy-tailed spatial data.One of the approaches to spatial quantile regression is to use the Asymmetric Laplace Process (ALP) for modeling the data 17 .However, this assumes the data is coming from the ALP, regardless of the actual generating distribution of the data.A quantile regression-based Bayesian joint modeling analysis of longitudinal-survival data has been proposed in 18 and it extends the use of the asymmetric Laplace process as in 17 to joint quantile regression.Markov chain Monte Carlo (MCMC) methods have been used for parameter estimation in Bayesian quantile regression models, for example in 19 for multivariate quantile regression.However, we advocate the use of INLA over MCMC for practical disease mapping due to its computational advantages.Spatial quantile regression is widely used with applications ranging from modeling of wildfire risk 16 to studying healthy life years expectancy 20 to economics 17 .In 21 , a Bayesian multiple quantile regression method is proposed for linear models, and they used the working likelihood instead of the likelihood of the generating distribution.In contrast, the quantile regression in 22 was developed such that the likelihood of the generating distribution is respected.For a comprehensive introduction to quantile regression for spatial data, see 23 , and for multivariate disease mapping modeling we refer to 24 , which includes many practical exercises and examples, often provided with R-code implementation.As far as the authors are aware, there is no available literature on joint quantile disease mapping is available, which we aim to contribute in this study.

DISEASE MAPPING
Disease mapping, also known as spatial epidemiology, analyzes the incidence of disease using geographical information.In other words, Disease mapping describe the spatial variation of disease.The two characteristics of disease mapping are the location of the events, which is called spatial or geographical distribution, and the disease. 25.The Poisson distribution is well representing the disease count for the data that have low disease count for a relatively large population 25 .For the region that consists of non-overlapping areas 26 , let denote the number of cases in regions .Often is assumed to be distributed as : where is the mean of .The mean function often consists of two components.The first component is usually called the relative risk, which represents the risk within a region, it is unknown and the purpose of this work to estimate these values.The second component is usually called standardization, which represents the expected local count.The expected local count is the value that represents our expectation if the population locally behaved the way the standard population behaves.The expectation of the cases in region can be written as follows: where is the expected number for the th area, which is usually assumed to be a fixed quantity 25 .The expected number can be obtained by using indirect standardization as follows: here, ( ) denotes the disease rate of the standard population in stratum j, the rate is the number of cases divided by the population, ( ) is the size in stratum j of area 27 , and is the relative risk for th area.Here = 1 means there is no augmented risk in comparison with the whole study area; > 1 , < 1 indicates higher risk and lower risk than the average respectively 28 .The maximum likelihood estimator of is ̂ = ∕ which is correspond to the standardised mortality ratio (SMR).However, mapping SMRs directly are misleading and insufficient for counties with small populations.Therefore the covariates need to be incorporated in order to smooth extreme values because of the small sample sizes by borrowing information from neighboring counties.The model considered in this work for disease mapping is formulated as follow: where is the mean of unit , 0 is the intercept that follow a weakly informative Gaussian prior with mean zero and large variance, ∑

=1
is the fixed effect of the covariates .Random effects such us splines for non linear effect of covariates is included through the functions { } =1 , is the spatial effects.For the spatial effects, , different spatial models for areal data can be assumed such as the Besag model 29 , or the extended Besag-York-Mollie model 29 , the Leroux model 30 , or the Dean's model 31 .

Prior specification and posterior propriety
We assume prior independence amongst the parameters and as such we assign Gaussian priors to the latent field elements and various other prior to the hyperparameters as set out next.
For the latent field elements assume the following: so that the joint prior for this part of the latent field is where −1 has a block diagonal structure as formed from (7).The vector of hyperparameters, is assigned a joint prior ( ) which is composed of independent marginal proper priors of any shape (not necessarily Gaussian).
The joint posterior of the unknown parameters, and from ( 6) and ( 7) is and based on the prior structures the posterior propriety holds.

QUANTILE REGRESSION
Quantile regression describes the conditional quantile of the response variable given the explanatory variables, instead of the conditional mean.Let be a real valued random variable.The th quantile of is given by where ( ) = ( ≤ ) is the Cumulative Distribution Function (CDF) of the random variable .Like the mean regression, a loss function is used in order to infer the parameters.The loss function of the quantile regression is the check loss function.
Given that 0 ≤ ≤ 1, ∀ ∈ ℝ the quantile loss function is defined as An estimate of the th quantile of the random variable can be obtained by minimizing the following risk function: When depends on the explanatory variables X, then it is called a conditional quantile.The estimate of the conditional quantile is called quantile regression.The quantile regression summarizes the relationship between X and the quantile of Y.The estimate of the quantile regression can be written as Then ( ̂ ) = , where ( ) is the CDF of the random variable .

Model-based Quantile Regression
The goal of the statistical analysis based on the Bayesian methods is to make inference from the posterior distribution for unknown parameters.Model-based quantile regression is an approach for quantile regression that considers the quantiles of the generating distribution proposed by 22 .This approach extends the Generalized Linear Mixed Model (GLMM) framework from modelling means to modelling quantiles.Two steps can do this extension.The first step is modeling the quantile; in this step, the quantile of the distribution is linked to the linear predictor through an invertible function .The second step is mapping the quantile; in this step, the quantile is mapped to the parameter of the distribution through a map function ℎ.This approach can be applied to both frequentist and Bayesian frameworks.The resulting parameters of the Bayesian framework are all identifiable, making model-based quantile regression appealing in the Bayesian inference.To see these steps, let ( ; ) be the distribution of | , where is the parameter of the distribution.Given 0 ≤ ≤ 1, the ℎ quantile of | is , = ( | ).The two steps can be written as follows: Step 1 -Modelling.The quantile , of the distribution ( , ) is modeled as follows: where is an invertible function, and is the linear predictor for the level quantile for = 1, … , .The linear predictor can include fixed effects, random effects, or both.Moreover, parametric or semi-parametric models can be included in this approach in order to study the impact of the covariates at different levels of the distribution and non-parametric models can be used for prediction.
Step 2 -Mapping.The quantile , is mapped to the parameter of the distribution ( ; ) as where ℎ is an invertible map function.The map ℎ can be obtained by two steps.First, taking the inverse of the CDF ( ( ; )) which give you the quantile function ( , ).Then, we write the parameter as a function of the quantile, and that function is the map ℎ.In this approach, the parameter is modeled indirectly by the link between the quantiles of the generating distribution and .Unlike mean regression, when the parameter of the generating distribution links to the linear predictor through a function = ( ), in model-based quantile regression the parameter of the generating distribution is linked to the linear predictor through a composition function = ℎ( ( )).In other words, in the mean regression, the (GLMM) have a link function = ( ) to link the parameter of the generating distribution to the linear predictor.

Model-based quantile regression for count data
The extension of model-based quantile regression for discrete random variables is not straight-forward since the objective function in ( 9) is non-differentiable for discrete random variables.The positive mass of the points for the discrete variable prevent the sample quantile from having an asymptotic distribution.Additionally, it is not easy to apply the modeling and mapping steps of model-based quantile regression to discrete data.First, in the modeling step, the common models for are the log for count data and the logit for binary data, and they are continuous functions.Therefore, the model , = , is not appropriate, since the quantile which is on the left hand side is discrete whereas the function is continuous.The second reason, in the mapping step, it is hard to get the map ℎ because the CDF of the discrete is non-invertible, which implies that there is no unique to generate each quantile, as one can be seen in Figure 1 .
To address these issues, 22 approximated discrete distributions by continuous counterparts, and then model the quantile for the continuous version instead of the discrete.The continuous counterpart is obtained by interpolating the cumulative distribution function (CDF) of the discrete random variable.The model-based quantile method can be applied to discrete variables if their CDF can be expressed as where is a continuous function, and is a discrete random variable.The interpolation can be obtained by removing the floor operator, so that ( , ) is the CDF of the continuous version of , assigned ′ .By definition of the floor, for all integers The continuous distribution of ′ is considered as a continuous generalization of the original variable because the two CDFs are equal for all integer values .The advantage of working with the continuous version of a discrete distribution in the Bayesian framework is that a likelihood function can be obtained by using the model-based quantile method, since the sample quantiles for a discrete random variable are generally not asymptotically normal 22 .

Continuous Poisson
Here we present the details on the approximation of the discrete Poisson distribution with a continuous Poisson counterpart.
The CDF of a Poisson distribution can be expressed as the ratio of an incomplete and regular Gamma function as follows: where is the upper incomplete Gamma function.Following Section 3.3, the Continuous Poisson is then defined from (11) as The reason for changing the support from ≥ 0 to > −1 is to avoid mass at 0, so there will be no jump on the CDF of the Continuous Poisson (CP)as illustrated in Figure 1 .If the support remains the same, then the value of the CP will be 0 if < 0 and about 0.4 at = 0, which introduces a jump at zero.However, if the support is > −1, then there will be no jump at zero because the CDF of the CP will be 0 at = −1, then an interpolation will be applied from = −1 to = 0.The Continuous and discrete Poisson random variables can be related as = ⌈ ′ ⌉.
The model-based quantile regression model for Poisson data is then defined for | a continuous Poisson random variable with parameter as

Model-based quantile regression for disease mapping
From Sections 2 and 3.3 we can define a model-based quantile regression model for disease mapping.One issue that remains is how to decompose the expected number of cases into the local expectation, and the relative risk .In the case of modeling the quantile instead of the mean there are two options as discussed by 22 : • Include in the linear model as an offset • Consider it as a scaling of the parameter of the distribution These two approaches are equivalent in the Poisson mean regression, but not equal in the Poisson quantile regression and the choice of approach depends on the purpose of the analysis.If the focus of the study is to infer a quantile-specific model then ( 13) is more appropriate whereas ( 14) can be considered as a model for the parameter .

BAYESIAN JOINT QUANTILE DISEASE MAPPING
The main goal of disease mapping is to estimate the relative risk of diseases across regions.Sometimes specific diseases have similar spatial patterns due to sharing the same etiologies.In this case, these diseases have some dependence, and it would be more appropriate to model them jointly rather than separately.Moreover, sometimes the dependence might be in different quantiles between the diseases or some diseases could inhibit the occurence of another disease.The proposed joint quantile disease mapping model links different quantiles of multiple diseases by a more general framework by considering dependence not in the mean, but in the quantiles.

Model specification
The joint quantile model for two diseases can be formulated as: where is the mean of unit for disease and it is mapped to the level quantile , as in (12).In the modeling part, is a disease-specific intercept that follows a weakly informative Gaussian prior with zero mean and large variance, is a spatial random effect and is the shared spatial component.The model also incorporates fixed effects of covariates and by in (15) and in (16), respectively.Various random effects such as splines for non-linear effects of covariates and is included through functions { } 1 =1 in (15) and { } 2 =1 in ( 16), respectively.
For the disease-specific spatial effects, , various spatial models for areal data can be assumed such as the Besag model 29 , or the extended Besag-York-Mollie model 29 , the Leroux model 30 , or the Dean's model 31 .
The shared spatial component, , that links the two diseases through their quantiles is assumed to be a Besag effects with precision matrix = , where for ≠ , = + , = − , and is the number of neighbours of node .The parameter ∈ ℜ is used to scale the shared component and correlate the two diseases in space., , , , 1 , 2 }, then we have that the data = { 1 , 2 , } is conditionally independent given the latent field and the hyperparameters such that the likelihood function is

Prior specification and posterior propriety
We assume prior independence amongst the parameters and as such we assign Gaussian priors to the latent field elements and various other prior to the hyperparameters as set out next.
The shared spatial field is assumed to follow Besag model but with an additional parameter , to ensure a proper prior of as follows: | , ∼ (0 0 0, −1 ) with the entries of as follows: for ≠ , and is in the neighbourhood of .
The disease-specific spatial fields 1 and 2 are assumed to follow uncorrelated BYM/CAR models where we reparameterize the precision matrix similar to 32 to have more orthogonal parameters resulting in useful practical interpretation of the weight parameter, .One issue with the proper CAR parameterization proposed by 32 is that the weight parameter is still not practically a weight since the unstructered effect and the Besag field might have different generalized variances.To alleviate this issue, we scale both the unstructured and Besag components to have the same geometric mean and define the proper scaled BYM field as with 1 a scaled IID effect and 2 a scaled Besag effect as in (19).With this formulation, can be interpreted as the proportion of the marginal variance explained by the spatial effect, and 1 − is the proportion of the marginal variance explained by the unstructured effect.
The joint posterior of the unknown parameters, and from ( 17) and ( 18) is and based on the prior structures the posterior propriety holds.

Approximate inference using INLA
Computational Bayesian inference can be achieved largely in one of two ways, either through sampling-based methods like Markov Chain Monte Carlo (MCMC) and deviants or approximately using approximate methods like Variational methods or Laplace approximations like Integrated Nested Laplace Approximation (INLA).INLA, as introduced by 33 , has been shown to be widely applicable to various statistical models; in particular, to the latent Gaussian models class of which disease mapping models are included 34,35,36,37 INLA employs a series of Laplace approximations and numerical integration to perform approximate Bayesian inference through numerically approximating the posterior densities of the latent field and hyperparameters.For data , latent field and hyperparameters , INLA can be summarized as follows: 1. Find the -variate Gaussian approximation of ( | , ) at the mode ( ) with matching curvature using the Hessian of ( | , ) at the mode ( ).

Let
and locate the mode of ̃ ( | ) and find a set of integration points , = 1, 2, ..., in the area of the highest probability mass.

Calculate
where we note that this is a low-dimensional integral since is generally small.
Various simplifications to the approximations have been proposed as well in order to achieve increased computational efficiency such as an empirical Bayes approach where the integration points are all set to the mode of ̃ ( | ), which is named a Simplified Laplace approximation strategy.
In this part, simulated independent and correlated data was added to the Pennsylvania map, which is considered as a connected graph of size 67. Figure 2 shows one realization of the correlated data that was added to the Pennsylvania map.
The correlated data was generated as follows.
where is the shared component that follow a Besag proper model (19) with precision matrix , where = + 1 and = −1 for ≠ .In other words, the extra term added on the diagonal is = 1, and the precision parameter = 1.There are two hyperparameters for this model: a parameter for controlling the properness, and a parameter > 0 that is a scaling parameter.The hyperpriors defined on the log scale of these hyperparameters are as follows.
The estimated values of the parameters obtained by R-INLA are similar to the true values as seen in Tables 1 and 2 , and Figures 3 and 4 .mean sd 0.025quant 0.975quant  2 The estimated hyperparameters and the 95% credible intervals.
Model selection criteria are presented in Tables 3 and 4 and it shows a preference for the joint quantile model when the data are correlated and a preference for the separate models, which are ( 15) and ( 16) without the shared components, when the data are independent.This indicates stable estimation and the model's ability to distinctly estimate an associated joint model if needed.4 The DIC and WAIC for independent data.

JOINT QUANTILE DISEASE MAPPING MODEL FOR MALARIA AND G6PD
In this section, we fit the Bayesian model by using R-INLA to estimate the risks of malaria and G6PD deficiency in some African countries by using separate and joint quantile models.The code for this analysis is available at https://github.com/JanetVN1201/Code_for_papers/tree/main/Joint%20quantile%20disease%20mapping%20.

Exploratory data analysis
The malaria cases and G6PD cases per region is obtained from https://malariaatlas.org/.Various country-level covariates can be used in our model but for the motivating example the emphasis is placed on the joint component, even though for a thorough analysis of the data itself, various fixed and random effects might be considered.
We selected only the countries for which information for both Malaria and G6PD is available, as indicated in Figure 5 .
According to Figure 5 , the countries are distributed around the world.Since we want to investigate the spatial correlation we consider the African continent so that most countries have some neighbours as in Figure 5 .In Figure 6 , 6 b and 6 e the SMRs for malaria and G6PD deficiency.It can be seen that, in general, the risk of G6PD deficiency is higher than the risk of malaria because G6PD deficiency has a higher SMR.Some countries like Abidjan and Madagascar that considered to have the highest risk of G6PD deficiency, they have the lowest risk of malaria according to the SMR values, which could indicate a prohibitive relationship between these two diseases.The observed cases for malaria and G6PD can be seen in Figures 6 a and 6 d respectively.Kenya has the highest number of malaria cases, while Nigeria has the highest number of G6PD deficiency cases.

Results
It is believed that G6PD deficiency limits the occurrence of malaria 9,5 .We expect a correlation between a high quantile of G6PD and a low quantile of malaria.Therefore we applied the joint quantile model that was discussed in section 4 where 1 and 2 represent the cases of malaria and G6PD deficiency, respectively.The quantile levels are 1 = 0.2 and 2 = 0.8.
Table 5 shows the overall means for the malaria and the G6PD deficiency.Table 6 shows the estimation of the hyperparameters.The precision of the random effects indicate that most of the spatial variability comes from the shared component.There is a significant correlation between a high quantile of G6PD deficiency and a low quantile of malaria as observed from the point estimate and the credible interval of , which is the coefficient of the shared component.This finding is consistent with the studies 9,5 .
In contrast, Table 7 shows the estimation of the hyperparameters for a high quantile of malaria with a low quantile of G6PD deficiency.As can be seen based on the credible interval of , there is no significant correlation between these two quantiles.This is expected because having G6PD deficiency protects you from having malaria, but malaria does not influence G6PD deficiency.
Figure 10 presents maps of the spatial effects.Figures 10 a and 10 d show the shared spatial effect for malaria and G6PD deficiency, respectively.The structure of the effects appear similar.However, the shared spatial effect for the G6PD deficiency is much lower.This is expected because the value of is smaller than 1.The disease-specific spatial effects in Figures 10 b and 10 e are very low compared to the shared spatial effect.This can also be seen from the posterior precision estimates for 1 and 2 compared to that of the shared component in Table 6 .Because the disease-specific spatial effects are smaller than the shared spatial effect, the total spatial effect (that is, the sum of the shared and the specific-spatial effects) is very similar to those for the shared effects, see Figures 10 c and 10 f.The iid random effect for G6PD deficiency, see Figures 11 b and 11 , is higher than the iid for malaria.This is understandable because the value of Phi for malaria is bigger than the one for G6PD deficiency, which means G6PD deficiency accounts for more iid effect.
The relative risks in Figures 9 a and 9 b show similar relative risks as obtained by the separate models.However, observe that the relative risk for G6PD deficiency from the joint model is higher than the one from the separate.This difference between the relative risks is due to borrowing strength from the spatial pattern of malaria through the shared component.The joint quantile model predicts the cases well for both diseases, as can be seen in Figures 11 c and

CONCLUDING REMARKS
The motivation stemmed from estimating the relative risk of Malaria and G6PD deficiency, jointly, on the African continent.The G6PD deficiency is considered as a resistance against malaria based on anecdotal medical studies (see 9 and 5 ).In this case, joint mean disease mapping will not provide the information needed to investigate these initial findings.Therefore, we considered a joint quantile disease mapping of different quantiles for the diseases.The approach is successful since considering the joint quantile model allows a possible investigation of the correlation between any level of the conditional distributions of the random variables that represents the number of cases, not only the correlation between their means.An advantage of the proposed approach is that the computationally efficient INLA method is used for statistical inference, such as estimating the relative risk.
Our main contribution is two-fold.Firstly, we propose a very general joint quantile disease mapping model where the correlation between different quantiles can be inferred and multiple diseases can be considered, together with an efficient computational framework for the inference thereof.Secondly, the significant correlation between a high quantile of G6PD cases and a low quantile of Malaria cases encourages further investigation based on expanded data collection efforts as already underway at the Malaria Atlas Project.This analysis provides a solid statistical framework to the anecdotal findings as remarked by medical professionals, and could underpin future studies in this direction.

FIGURE 1 ( 22 FIGURE 2
FIGURE 1 (Top) The CDF of the discrete Poisson (dashed line), and the CDF of the continuous Poisson (solid line).(Bottom) Quantile function of the discrete (dashed line) and continuous (continuous line) Poisson distributions 22

FIGURE 3 FIGURE 4 FIGURE 5 FIGURE 6 FIGURE 7
FIGURE 3The posterior distribution for the intercepts

TABLE 1
The estimated intercepts and the 95% credible intervals.

TABLE 3
The DIC and WAIC for correlated data.

TABLE 5
11 d.The model comparison shows a preference for the joint quantile model over the separate models because the values of DIC and WAIC for the joint quantile are less than the sum of the tests for the separate models.The estimated intercepts and the 95 % credible intervals.

TABLE 6
The estimation of the hyperparameters for low quantile of malaria and high quantile of G6PD deficiency.

TABLE 7
The estimation of the hyperparameters for high quantile of malaria and low quantile of G6PD deficiency.The values for the model choice criteria, DIC and WAIC, are given in Table8.

TABLE 8
Model choice criteria.