Spatial Distribution of HIV Prevalence among Young People in Mozambique

Mozambique has a high burden of HIV and is currently ranked sixth worldwide for adult prevalence. In Mozambique, HIV prevalence is not uniformly distributed geographically and throughout the population. We investigated the spatial distribution of HIV infection among adolescents and young people in Mozambique using the 2009 AIDS Indicator Survey (AIS). Generalized geoadditive modeling, combining kriging and additive modeling, was used to study the geographical variability of HIV risk among young people. The nonlinear spatial effect was assessed through radial basis splines. The estimation process was done using two-stage iterative penalized quasi-likelihood within the framework of a mixed-effects model. Our estimation procedure is an extension of the approach by Vandendijck et al., estimating the range (spatial decay) parameter in a binary context. The results revealed the presence of spatial patterns of HIV infection. After controlling for important covariates, the results showed a greater burden of HIV/AIDS in the central and northern regions of the country. Several socio-demographic, biological, and behavioral factors were found to be significantly associated with HIV infection among young people. The findings are important, as they can help health officials and policy makers to design targeted interventions for responding to the HIV epidemic.


Introduction
Mozambique is among the 10 countries most affected by HIV in the world, with the world's sixth highest prevalence among adults aged 15-49 [1]. In 2009 the HIV prevalence in adults was estimated as 11.5%, with a prevalence of 13.1% for women as compared to a prevalence of 9.2% for men [2]. A more recent AIS reported the HIV prevalence in the adult population as 13.2%, where the HIV prevalence among women was estimated at 15.4% and 10.1% among men within the same age group [3]. The HIV epidemic in Mozambique is not uniformly distributed throughout the population. Adolescents and youth aged 15-24 are one of the most vulnerable populations, with an estimated prevalence of 7.9% in 2009 and 6.9% in 2015 [2,3]. Moreover, in this age group, women account for a disproportionate number of infections, with a prevalence (11.1%) three times higher than that of their male peers (3.7%). The epidemic also shows substantial variation throughout the country, with the prevalence among this age group ranging from 2.9% in Niassa province, in the north, to 13.1% in Sofala province, in the center of the country [2].
There are many potential factors that position adolescents and young people at high risk of HIV. A recent study conducted in Maputo City points to low educational level and early sexual debut as important factors contributing to high HIV prevalence among youth [4]. Furthermore, Dias et al. [5] mention numerous biological, socio-economic, and socio-cultural characteristics that are linked to higher HIV prevalence among women in Mozambique, including age, sex of household head, wealth status, and marital status. Although the knowledge of biological, socio-economic, and socio-cultural drivers of the HIV epidemic is quite extensive, there is still a knowledge gap concerning the geographical distribution of HIV/AIDS across the country. Simply put, HIV/AIDS prevalence is not only influenced by socio-demographic, biological, and sexual behavioral factors; it can also differ significantly across districts and regions. Therefore, a thorough understanding of the geographical variability of HIV/AIDS is of paramount importance as it will help health officials formulate targeted interventions and identify where to prioritize the allocation of limited resources [6].
In this paper, we seek to understand the spatial distribution of HIV/AIDS in Mozambique among adolescents and youth by using generalized geoadditive models. The rationale for the application of this type of model is that spatial information is generally regarded as a surrogate of unobserved risk factor which may appear hard to quantify [7,8]. Our approach is an extension of the model introduced by Kammann and Wand [9]. Vandendijck et al. [10] proposed to estimate the range parameter in this model, instead of fixing it. The novelty of our approach lies in extending the estimation method of the range parameter in the context of a binary outcome. In addition, our method takes the survey design and the sampling frame into account. The structure of the paper is as follows. Section 2 gives a description of the data, a brief discussion on traditional methods to analyze geostatistical data, the statistical methods used, and details on the translation of the estimation method proposed by Vandendijck et al. [10] to the binary setting. Section 3 summarizes the results, comparing the estimates under different covariance functions and identifying the risk factors that are related to HIV prevalence in young people. We close the manuscript with a discussion including conclusions and opportunities for further research.

Data Description
To study the spatial distribution of HIV prevalence and the risk factors associated to HIV infection among young people aged 15-24 years in Mozambique, we used data from the first population-based nationally representative survey on HIV sero-prevalence, AIS 2009, which is publicly available for secondary research at https://dhsprogram.com/. Ethical approval for the survey was obtained from the National Bioethics Committee for Health (CNBS) in Mozambique and the Centers for Disease Control and Prevention (CDC) and ICF Macro in the USA [2]. The Mozambique AIS 2009 is a complex survey with a stratified multistage cluster sampling design that collected a variety of information on knowledge, attitude, and risk behavior related to HIV infection; socio-demographic and cultural factors; fertility, marriage, and sexual activity; testing and counseling in health, etc. The information was collected from four questionnaires: a household questionnaire, an individual questionnaire for young people and adults aged 15-64 years, an individual questionnaire for adolescents aged 12-14 years, and an individual questionnaire for the parents or guardians of children aged 0-11 years. Moreover, the survey collected geographic coordinates of the enumeration areas where households were sampled. Further information related to the sampling process and how the survey was conducted can be found in the Mozambique AIS 2009 final report [2]. The analysis in this study was limited to a sub-sample of 2589 adolescent and young people, aged 15-24 years, who consented to being tested for HIV. For each sampled individual, a sampling weight is available, representing the complex sampling scheme.
HIV sero-status was the outcome variable used to investigate the spatial distribution of HIV infection and the factors associated to it, coded as one if the individual tested positive, and zero if the individual tested negative. To determine the HIV sero-status of the individuals, finger-prick blood spot specimens were taken from all individuals who consented to biological testing for HIV. Blood specimens were tested for HIV using two sequential enzyme-linked immunosorbent assay (ELISA) tests. Reactive blood specimens in the first test of the sequence (Vironostika) were then submitted to a second confirmatory test using Murex. If a reactive sample in Vironostika was found to be negative in Murex, a third test was used for confirmation (GeneScreen). More details on the testing algorithm are described elsewhere [2].
The explanatory variables considered in this study can be grouped into socio-demographic factors, biological factors, and HIV knowledge and behavioral factors. The socio-demographic factors were age, sex of the respondent, educational level, occupation, marital status, sex of the head of household, wealth status, mass media exposure, religion, and type of place of residence (urban, rural). Biological factors comprised variables such as STIs in the last 12 months and blood transfusion. HIV/AIDS knowledge and attitudes included awareness, stigma/prejudice, previous HIV testing, and knowing someone with AIDS. Behavioral factors included alcohol use, multiple sexual partners, sexual debut, and condom use. A detailed description of the variables used in this study is presented in Table A1. Variables such as media exposure, HIV/AIDS awareness, and HIV/AIDS stigma are composite indices derived through principal component analysis (PCA) using a set of correlated variables. The summary indices for each dimension were constructed based on the first component scores. The PCA scores were divided into tertiles following a similar approach to Magadi and Desta [11] and Dias et al. [5].

Generalized Geoadditive Model
Kriging is a common strategy for analyzing and understanding point-reference data, also known as geostatistical data. This technique makes predictions at one or more non-observed locations from a collection of data observed at n sampled locations. The most popular kriging model assumes that the underlying process is a Gaussian process. The Gaussian assumption, generally imposed in the underlying process, is often not appropriate, as the observed process might be continuous though highly skewed, or related to counts or dichotomous data. Classical geostatistical approaches such as log-normal and indicator kriging are often recommended for non-Gaussian data [12,13]. However, resulting estimates from these approaches are difficult to interpret and are also subject to bias induced by the back-transformation [14]. Additionally, kriging itself does not allow study of the effect of possible risk factors on the outcome variable, as high prevalence in a region might be linked, for instance, to lower education level and other behavioral factors. Universal kriging is another extension, allowing for the inclusion of covariates. However, it can only accommodate for linear effects. For a continuous response, Kammann and Wand [9] introduced a geoadditive model, which is a combination of an additive model that accounts for non-linear effects and a kriging model that accounts for spatial dependence.
To study the spatial distribution of HIV prevalence among young people across the country, we used a generalized geoadditive model-an extension of the geoadditive model proposed by Kammann and Wand [9] and previously used by Diggle et al. [15] and French and Wand [16]. Let the data be denoted by (x i , s i , y i ), 1 ≤ i ≤ n, where y i is the HIV sero-status of the i th individual, x i and s i ∈ R 2 represent the row vector of the covariate values and the geographical location of the i th individual, respectively. Thus, to model the spatial structure for binary data we consider the following model: where β is a vector of parameters related to the different covariates and S(s i ) represents the geographical component of the model, acting as a surrogate for all unmeasured spatially referenced covariates. S(s i ), the underlying latent bivariate trend, is modeled using a penalized spline function. One way of specifying the spline function S(s i ) is to use tensor products of spline bases [17]. Nevertheless, its implementation comes with a price. The number of coefficients increases with the number of knots, and the tensor products of splines depend on the orientation of the coordinate axes [17]. Alternatively, Powell [18] and Ruppert et al. [17] suggest the use of radial basis functions.
Similar to Vandendijck et al. [10], we model the spatial component S(s i ) through a radial spline basis function of the form: where φ τ (·) corresponds to a covariance or generalized covariance function similar to what is used in kriging [19,20]. The vector (u s 1 , u s 2 , · · · , u s K ) contains the K s unknown coefficients of the radial basis which are penalized for overfitting. The knots {k s 1 , . . . , k s K s } are a representative subset of {s 1 , . . . , s n } which are used for construction of the basis function. The spatial component in (2) depends on τ, a strictly positive decay parameter which controls how fast the spatial dependence diminishes with increasing distance h . In Equation (2) and throughout, · denotes the Euclidean distance between two locations.
In model (2) we have to make two choices: the basis function φ(·) and the knots {k s 1 , . . . , k s K s }. In the papers by Kammann and Wand [9] and French and Wand [16], the Matérn covariance function φ τ (h) = exp (− h /τ)(1 + h /τ) is used as a radial basis function for the spatial component in (2). In the current paper, we do not restrict ourselves to their choice, but consider a broader range of covariance functions used in kriging [12]. Table 1 summarizes the covariance functions used to model the spatial component (2). A popular way of knot selection in a bivariate dimension is through the efficient space-filling algorithm [21,22]. Regarding the number of knots to be used, Ruppert et al. [17] give a thorough explanation on knot selection. Concurring with Ruppert's advice, a sensible number of knots is given by K = {max 20, min (n/4, 150)}. Nevertheless, Wand [23] argues that knot specification is of minor relevance when using penalized splines. This is seconded by Ruppert [24], who mention that as the smoothness is controlled by a penalty parameter, the number of knots is of minor importance. Therefore, considering that the number of locations with spatial information is not substantially large in our setting, all different geographical data points are used as knots. One appealing feature of the model proposed by Kammann and Wand [9] is the fact that fixing the τ in advance enables one to use a mixed model formulation for fitting, which also holds for the logistic type of model (1). In a mixed model formulation, penalization of the u s k coefficients is analogous to treating them as random effects [17]. Let , and Ω = φ k k − k k 1≤k≤K,1≤k ≤K , then the generalized geoadditive model (1) can be expressed as where u ∼ N(0, σ 2 s Ω −1 ). The reparametrizationZ = ZΩ −1/2 andũ = Ω 1/2 u, results in logit(π) = X β +Zũ,ũ ∼ N(0, σ 2 u I).
The flexibility to turn a penalized spline smoother into a mixed model enables one to use standard mixed-model software such as R or SAS for fitting the generalized geoadditive model. Furthermore, fitting generalized geoadditive models using mixed model machinery has the advantage that the amount of smoothing is selected automatically. The inclusion of additional linear or non-linear covariates is straightforward. The clustering of observations can be included in the model by adding, for example, a random intercept. In addition, the complex sampling design can be included by using a weighted likelihood, with weights corresponding to the inverse of the sampling probabilities.

Estimation of τ: Two-Stage Iterative Process
Several approaches exist to estimate the range parameter τ. Kammann and Wand [9] proposed to select the maximum distance among the observations, that is,τ = max 1≤i,j≤n s i − s j . Kneib [25] uses a similar approach, albeit he re-scales the range parameter such that the correlation among observations at the estimated distance is quite small (e.g., 0.001). More recently, Vandendijck et al. [10] estimated τ via a likelihood-based approach. The choice of τ affects the smoothness of the estimated surface, where large values of τ produce smoother surfaces [16]. Although the smoothness of the estimated surface depends on τ, Zhang and Wang [26] states that the predictions are not substantially affected by the choice of τ. Furthermore, Vandendijck et al. [10] observed that, in fact, the method used by Kammann and Wand [9] does not produce biased prediction, albeit they note that large values of τ inflate the standard errors. The different opinions regarding the estimation and the impact of τ in the predictions highlights the need of a reasonable choice of τ. In this paper, we investigate the estimation of τ in a similar way as done by Vandendijck et al. [10], but in the context of binary data.
Fitting model (1) involves estimating the vector of parameters β and the parameter τ in the covariance function used in (2). Knowing that the likelihood function of (3) is intractable, we use penalized quasi-likelihood (PQL), an approximate inference technique for generalized linear mixed models, to estimate the vector of parameters β. Details of PQL estimation are provided elsewhere [27]. The use of PQL enables us to extend the method proposed by Vandendijck et al. [10] for estimation of τ to the binary context, and here we term it the two-stage iterative PQL-based estimation method. In the first stage the generalized linear mixed model in (3) is estimated fixing τ at its current value, and in the second stage the τ parameter is optimized. This process is iterated until convergence is attained. To be more specific, the estimation method consists of the following steps: (2), fit the generalized linear mixed model in (3) using PQL. This will yield estimates of the variance parameter (σ 2 u ) (k+1) , the fixed effectsβ (k+1) , and random effectsû (k+1) . (iii) Using (σ 2 u ) (k+1) andβ (k+1) , maximize the approximate profile quasi-likelihood function for the variance component considering the pseudo-response , and g(µ u i ) is the link function. The approximate profile quasi-likelihood function is given by (Breslow and Clayton [27]): with respect to τ, V = ZDZ T + W −1 , and D = σ 2 u I, and W is an n × n diagonal matrix with diagonal elements w i = {var(y i |u)[g (µ u i )] 2 } −1 . The value of τ that maximizes this function is denoted byτ (k+1) .

Model Building
Using model building, we selected the set of covariates and the covariance function. Therefore, model building proceeded in two steps. First, to identify risk factors to be included in the generalized geoadditive model (1), multivariate logistic regression analyses were performed, considering all pairwise interactions of the selected variables presumed to be related to HIV sero-positivity. The selection of candidate variables was done using a backward stepwise selection technique. The search was done using the stepGAIC() function from the gamlss R package (resulting in the set of variables minimizing the Akaike's information criterion (AIC) value). Second, three model-building scenarios were considered to select the best covariance function: (1) models with only a spatial geographical component, (2) models with both risk factors and spatial geographical component using a two-stage iterative PQL-based estimation method, and (3) models with both risk factor and spatial geographical component but with fixed τ, using the proposal by Kammann and Wand [9]. The choice of the "suitable" covariance function in each of the three scenarios was performed with the aid of AIC, Bayesian information criterion (BIC) and corrected AIC (AICc). Note that in geostatistical models as described in (2-4), AIC fails to properly account for the degrees of freedom in the penalized spline models [10]. Furthermore, Hoeting et al. [28] argue that AIC and BIC ignore the spatial dependence in the data, and advocate the use of AICc [17]. Model performance was assessed by means of confusion matrix, a classification table comparing predicted values against the observed, and the receiver operating characteristic (ROC) curve. The model with the highest balanced accuracy, the average accuracy obtained on either class (positive and negative HIV sero-status), and the highest area under the ROC curve were considered to have good performance. Alternatively, one could use the "regular" accuracy, the number of correctly classified observations overall. Nevertheless, the latter is recommended for balanced data [29]. For this reason, the former was preferred.
All of the analyses in this paper were performed in R software version 3.6.1 [30]. To implement the described estimation method in Section 2.2.2, for estimating the τ parameter, a set of R functions were written. The R codes are provided in the supplementary material. Table 2 gives a summary of participant characteristics by HIV sero-positivity status. HIV prevalence by socio-demographic characteristic varied from 3.73% to 21.15%. The smallest proportion was observed among males and the highest was observed among divorced/widowed persons. Regarding religion, the lowest proportion was observed among those professing other religions (4.07%) and the highest among Protestants (9.93%). HIV prevalence was 10.15% among individuals in urban areas and 6.43% in rural areas. The results also show that the observed HIV prevalence was higher among young individuals living in female-headed households (10.66%). It was also observed that the prevalence was almost two times higher among high-income individuals than among low-income individuals. Prevalence in individuals with more than one sexual partner was almost two times higher when compared to those with only one sexual partner. Additionally, the prevalence among people who reported having ever had any kind of sexually transmitted infection (STI) was considerably larger (12.27%) than among those who reported having never had any kind of STI (7.62%). It can also be noticed that the prevalence of HIV among individuals aged 20-24 years was about two times higher than the prevalence among individuals aged 15-19 years.  Table 3 shows AICc, AIC, BIC, degrees of freedom (df), and parameter estimates for τ and σ u under different covariance functions for the three aforementioned scenarios. In the first scenario, based on AICc and AIC, it can be seen that the models with spherical, exponential, and circular covariance functions appeared to fit the data best. Nevertheless, according to BIC criteria, the model with the Gaussian covariance function outperformed the others. When covariates were included in the model, AICc, AIC, and BIC dropped drastically. Again, it can be observed that AICc and AIC point to models with spherical, exponential, and circular covariance functions as the best ones, while BIC outstandingly points to the model with the Gaussian covariance function as the single best one. Other models that exhibit lower BIC value were the models with the multiquadratic inverse and Matérn covariance function. When τ was fixed, no considerable changes were seen in the AICc and AIC, except for the Gaussian covariance function where it is seen a substantial increase. Therefore, there was no clear winner. Based on BIC, the results point to a model with the Gaussian covariance function as the best fit. Comparing the models where we fixed versus estimated τ, we observe that based on BIC fixing τ to the maximum distance led to a better fit, whereas when we compare using AIC, the two-stage iterative process tended to lead to a better fit. Although AICc and AIC favored spherical, exponential, and circular covariance functions, amongst them no clear winner was observed. For the model with no covariates, AICc, AIC, and BIC were not reported for multiquadratic inverse covariance as convergence was not attained. A covariance function-a member of isotropic covariance functions-that would also be of consideration is the thin plate spline. Nevertheless, all the models failed to converge under this covariance function.

Spatial Trend
Our primary interest was in the spatial distribution of the HIV sero-status. In this subsection we analyze the spatial variation of HIV infection across the country. This was achieved through a visualization of the radial spline basis in a map. We again considered the three scenarios. Figure A1 in the appendix shows the spatial distribution of HIV infection when considering a model with only the spatial component for different covariance functions. Visual inspection suggests that the predictions are insensitive to the choice of the covariance function, indicating robustness to covariance-structure. This was similarly observed for AIC and BIC values, as no clear winner was observed. The maps reveal four different places that exhibit high prevalence, pointing to the effect of a spatial geographical component in the sero-status. Figure 1 shows the spatial variation of HIV after adjustment for the covariates. In this case, the spline component refers to the geographical residual effect, after adjustment for the covariates. As a result, the variation of S(s) under the two-stage iterative adjusted model was less pronounced, illustrating that a substantial part of the variability was explained by the covariates. Again, the spatial variation of HIV under all covariance functions looked very similar, with the Gaussian covariance function showing the smoothest surface. Once more, the similarity of HIV spatial variation across different covariance function indicates that the predictions of HIV prevalence were quite robust to the choice of the covariance function. Further, note that there remains an important geographical trend in HIV prevalence that is not explained by the risk factors in the model.
Results were different when τ was fixed to the maximum distance. Although the model with Gaussian covariance was selected as the best based on BIC (Table 3), the maps in Figrue A2 in the Appendix shows that the spatial distribution of HIV reveals an inconsistent behavior, with no variation in S(s). Additionally, the estimated variance of the random effects was almost zero (see Table 3). This points to either estimation instability or a very smooth spatial process. a similar behavior was observed for the Matérn covariance function. This indicates that fixing τ to the maximum distance among the observations is not recommended, as it may lead to parameter instability. For other covariance functions, the spatial distribution of HIV was similar to the one shown by the models fitted using the two-stage iterative process, though the inverse multiquadratic function also showed a much smoother pattern.
The spatial analysis revealed the presence of some trends in the data. After controlling for important covariates, HIV prevalence appeared to be more pronounced in the center of the country, with an odds of becoming HIV infected of about two times higher than in the north-eastern region of the country and three times higher than in the southern region.  Table 4 shows the parameter estimates for five different models. The first model is the traditional logistic regression, the second, third, and fourth are generalized geoadditive models fitted using the two-stage iterative estimation process under Gaussian, spherical, and exponential covariance functions, respectively, and the fifth uses the Gaussian covariance function fixing τ to the maximum distance among the observation. From this table it is apparent that parameter estimates from traditional logistic regression and generalized geoadditive model with the Gaussian covariance function when τ was fixed were quite similar. This suggests that fixing τ under the Gaussian covariance function did not lead to any improvement when compared to conventional logistic regression. In fact, this is confirmed by the variance of the random effect, which was almost zero. Additionally, the classification table for this model shows a high imbalance between sensitivity and specificity, also indicating a poor predictive performance (Tables A3 and A5) . We also observe that the parameter estimates for the models fitted using the two-stage iterative process under the Gaussian, spherical, and exponential covariance function exhibited similar parameter estimates. This again indicates that inference was quite insensitive to the choice of the covariance function.

Covariate Effects
The results from Table 4 reveal no substantial differences between parameter estimates of models fitted using two-stage iterative PQL estimation under the Gaussian, spherical, and exponential covariance functions. Nevertheless, the results in Table 3 point to the model fitted under the Gaussian covariance function as a "suitable" choice. This finding was also supported by the classification tables, where a good balance between sensitivity and specificity was observed by the model fitted under the Gaussian covariance function. Additionally, the accuracy in Tables A3 and A4, and the ROC curve in Figure A3 also point to the model fitted under two-stage iterative PQL estimation with Gaussian covariance function. For these reasons, the model with the Gaussian covariance function was chosen for inferential purposes.
Results show that HIV sero-positivity was significantly related to sex, where female individuals were about five times more likely to be HIV positive than their male counterparts. Results also show a significant relationship between HIV sero-positivity and religion; individuals with no religion appeared to be almost two times more likely to be HIV positive than those with any religion. We also observed that age was significantly associated with HIV; individuals in the age range of 20-24 were more likely to be HIV positive as compared to adolescent individuals (age range [15][16][17][18][19].
The results further reveal significant interaction effects of place of residence and number of sexual partners, place of residence and sex of household head, wealth index and HIV/AIDS awareness, wealth index and stigma HIV/AIDS, stigma HIV/AIDS and work occupation, and stigma HIV/AIDS and education. We observed that individuals with two sexual partners living in a rural area were about four times more likely to be HIV positive than those with one partner in the same area, and for individuals living in an urban area the chances were almost half when compared to those living in a rural area. It can also be observed that high-income individuals with the highest HIV/AIDS awareness and highest HIV/AIDS stigma beliefs were about 18% less likely to be HIV positive than poor individuals with the highest HIV/AIDS awareness and highest stigma beliefs, whereas the odds of being HIV positive for those with the lowest HIV/AIDS awareness and the lowest HIV/AIDS stigma beliefs was about 13 times higher than for young poor individuals in the same category. Individuals with the highest stigma beliefs who had secondary education were about 67% less likely to be HIV positive as compared to those with no education and with the highest stigma, and those with primary education having highest stigma were about 65% less likely to be HIV positive as compared to those with no education and having highest stigma.
The results also reveal that individuals living in a female-headed household in a rural area are two times more likely to be HIV positive than those living in a male-headed household, whereas individuals living in female-headed households in urban areas were almost 32% less likely to be HIV positive than those in male-headed households. From the results it is also apparent that individuals with three or more partners were three times more likely to be HIV positive than those with only one partner.

Discussion
Analyses of the spatial distribution of HIV are very crucial, especially in developing countries where resources are scarce, as the results might direct health officials to turn their attention to locations where more resources are needed and build policies to address the burden of HIV/AIDS in specific locations. In this paper, we applied generalized geoadditive models to analyze the spatial distribution of HIV prevalence among young people, by including the geographical location in the estimation process. The inclusion of the geographical location prompted us to use radial spline basis to study the spatial effect of geographical location on the HIV sero-status of the individuals. We extended the method proposed by Vandendijck et al. [10] to the binary setting using PQL estimation. The method allowed us to estimate the effect of both the non-spatial covariates and the decay parameter in the radial basis function describing the spatial component in the generalized geoadditive model.
We observed that the two-stage iterative process PQL was robust to the choice of the covariance function. Further, we noted that the proposal by Kammann and Wand [9] for estimation of the range parameter for some covariance function led to either parameter instability or to a very flat surface (e.g., for Gaussian and Matérn covariance functions no residual variation was observed for the spatial component). Additionally, it is important to note that the Gaussian covariance function resembles a very smooth spatial process. Waller and Gotway [31] argue that such cases rarely occur in practice. Moreover, Davis and Morris [32] state that although it is a valid covariance function, it can often lead to singularities in spatial prediction equations. Thus, results under the Gaussian covariance function should be interpreted with caution. Overall, results based on the two-stage iterative PQL-based estimation method were consistent for the spherical, exponential, and Gaussian covariance functions.
The analyses revealed three main regions with high HIV prevalence, namely southern, central, and the north-eastern regions. This is in agreement with what was observed in the Mozambique AIS report [2]. The high prevalence observed can be explained partly by early marriages and early pregnancies. In the center of Mozambique, the Zambezia and Manica provinces have the highest number of child marriages [33]. The practice of early marriage subjects children to early sexual debut, increasing their lifetime risk for HIV [34]. In addition to early sexual debut, they are also exposed to frequent unprotected sex, generally with men older than them who have multiple sexual partners. In turn, this increases their risk of HIV infection [35,36]. It is also worth emphasizing that in the northern and central parts of the country, before they are forced into early marriage adolescent girls go through initiation rituals, a socio-cultural ceremony where adults pass to young people attitudes and beliefs about sexuality necessary for transitioning to adulthood, which in turn increases their risk of HIV infection [37]. The high prevalence in high-burden regions may partly be explained by the presence of the high-risk population in these areas. A study conducted in Maputo, Beira, and Nampula City, Mozambique found that most of female sex workers were at the ages of 15-24 years [38].
Several socio-demographic, knowledge and attitude-related, behavioral, and biological factors were identified to be associated with HIV infection among individuals aged 15-24 years in Mozambique. In this study, we found that young women were about six times more likely be HIV infected than their male counterparts. Gender inequality has been observed in several studies in sub-Saharan Africa. This can be partly explained by poverty, as poor women are often financially dependent on men, which drives them to early marriage and subsequent low decision-making power. Additionally, oftentimes vulnerable women have little choice but to adopt behaviors that put them at a higher risk of HIV, such as transactional and intergenerational sexual relationships [39]. Furthermore, the large differences of HIV prevalence between young women and men of the same age may also be explained by the fact than women tend to partner with older men, who are much more likely to be HIV positive [40,41]. We found that young individuals aged 20-24 years were more likely to be HIV positive than adolescents aged 15-19 years. This is in line with other studies conducted in Southern Africa [42,43].
The findings indicate that higher education has a protective effect against HIV infection. Generally, it is assumed that educated individuals are more knowledgeable about HIV risk behavior and more likely to be empowered to make decisions about sexual debut, condom negotiation, etc. Therefore, the chances of adopting preventive measures are higher. Additionally, young individuals who know more about HIV/AIDS are more likely to know their HIV status, thus contributing to their health-seeking behaviors and consequently helping to lower the risk of HIV transmission [44]. Our findings are supported by a study done in South Africa showing that adolescent and young women with tertiary education were less likely to be HIV positive [43]. We also found that higher-income people with low HIV/AIDS awareness were more likely to be HIV positive than poor young individuals with the same level of awareness. In fact, there is evidence that wealthier people tend to have more high-risk behaviors, such as multiple sex partners [45,46]. This is an indication of the importance of awareness campaigns on HIV-related matters in the fight against HIV infection.
The analysis further revealed that young individuals in female-headed households in rural areas were more than two times more likely to be HIV positive than young individuals in the male-headed households in rural areas. A different scenario was observed in the urban areas, where the odds of becoming infected for individuals in female-headed households were 32% lower as compared to individuals in male-headed households. In rural areas, many women who become head of household due to the loss of their husband are confronted with many vulnerable experiences, such as loss of property, lack of access to many traditional services, and increased workload to support their basic necessities. Such experiences increase the vulnerability of women, which may result in engaging in practices that increase their risk of HIV. For instance, a study done in Kenya noted that in order to cope with the loss after the death of an adult member, some adolescent daughters engaged in activities that exposed them to higher HIV risk infection, such as early marriage or exchange of sex for money [47]. Although the literature suggests that female-headed households tend to be poor [48], different scenarios may be observed in urban areas as poverty is much more pronounced in rural areas.
Potential limitations of our study include: the geographical locations used were at the level of enumeration areas, and all households in the same enumeration area were assumed to have the same geographical location. When there is a lack of information on the sampled units, Bocci [49] advocates to treat this as a measurement error and assume a distribution on the location inside each enumeration area. The radial basis functions considered in the analysis are only able to describe an isotropic process. In practice, many geostatistical processes are anisotropic. In future, research anisotropy in this type of model should be taken into account. As future research, it would also be interesting to consider the estimation of the decay parameter using a full-likelihood-based estimation approach. Additionally, this was a cross-sectional study, and thus causality of the identified factors cannot be proved. Our approach can also be used for cases where the observed data are counts.
Despite these limitations, the findings are very important and can help health officials and policy makers to design targeted interventions for responding to the HIV epidemic. The results revealed that provinces from the center and the north should be prioritized when allocating resources to fight HIV. Moreover, specific targeted care services and HIV prevention and awareness campaigns are urgently needed among adolescents and youth. To increase awareness, educational campaigns should include messages addressing the specific characteristics of various socio-demographic profiles. Additionally, there is a need to promote education among youth, particularly in rural and low-income areas. Special attention should be given to gender-equity and female empowerment when designing intervention measures.

Conflicts of Interest:
The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Average media exposure index. The PCA scores are classified into quartiles, the lowest being equivalent to lowest media exposure. Respondent had any STI, genital sore/ulcer, or discharge in the last 12 months Figure A1. Image plots of the radial spline basis under different covariance functions. The images correspond to the model with only the spatial component . Model is fitted using two-stage iterative estimation process. The radial spline basis is plotted on natural log-odds scale. This illustrates the spatial variation of HIV risk in Mozambique.