Evaluating distributional regression strategies for modelling self-reported sexual age-mixing

The age dynamics of sexual partnership formation determine patterns of sexually transmitted disease transmission and have long been a focus of researchers studying human immunodeficiency virus. Data on self-reported sexual partner age distributions are available from a variety of sources. We sought to explore statistical models that accurately predict the distribution of sexual partner ages over age and sex. We identified which probability distributions and outcome specifications best captured variation in partner age and quantified the benefits of modelling these data using distributional regression. We found that distributional regression with a sinh-arcsinh distribution replicated observed partner age distributions most accurately across three geographically diverse data sets. This framework can be extended with well-known hierarchical modelling tools and can help improve estimates of sexual age-mixing dynamics.


Introduction
Patterns in sexual mixing across ages determine patterns of transmission of sexually transmitted infections (STIs). Consequently, sexual age-mixing has been of great interest to researchers studying the human immunodeficiency virus (HIV) since the beginning of the global epidemic. Anderson et al. (1992) used a model a log-logistic distribution to model partner age differences for women aged 15 to 45 years, assuming that the partner age difference distributions did not vary over respondent age. More recently, as an input to a model of Chlamydia trachomatis, Smid et al. (2018) fit skew normal distributions to each age-/sexspecific partner age distribution and used a secondary regression model to smooth the estimated skew normal parameters across respondent age. They observed substantial changes in the estimated skew normal parameters with respect to respondent age. Although this method allows for non-linear variation across respondent age, their two-stage estimation process makes uncertainty propagation complex. Replacing this process with a single "distributional" regression model, in which all distributional parameters (e.g. the location, scale, skewness, etc.) are modelled as functions of data (Kneib & Umlauf, 2017), allows for complex variation across respondent age while still robustly incorporating uncertainty.
More broadly, no previous work has systematically evaluated the wide variety of distributions potentially available to model partner age distributions. These data are skewed, heavy-tailed, and otherwise dissimilar to conventional statistical distributions due to personal preferences, social dynamics, demographic change, and any number of other factors. We were specifically interested in distributions that introduce parameters to control tail weight, which may capture intergenerational mixing that could sustain endemic HIV and STI transmission.
This led us to test the ability of the four-parameter "sinh-arcsinh" distribution originally proposed by Jones & Pewsey (2009) to fit to these data.
We hypothesized that integrating the sinh-arcsinh distribution into a distributional modelling framework would allow us to replicate observed partner age distributions more accurately than prior modelling strategies. We tested this theory by comparing a variety candidate strategies, which varied along three dimensions: the parametrisation of the dependent variable, the choice of distribution, and the method for incorporating variability across respondent age and sex.

Methods
We conducted two model comparison experiments to identify which of a set of strategies best replicated partner age distributions. First, in our probability distribution comparison, we identified which of a set of distribution-dependent variable combinations fit best to age-/sex-specific data subsets, and then, in our distributional regression evaluation, we tested whether distributional regression methods could be used to estimate age-/sex-specific partner age distributions by sharing strength across observations. We divided the model comparison into two separate experiments to make the probability distribution comparison as fair as possible (accounting for the possibility that certain distributions would perform particularly well under certain regression specification).

Data
We analysed data on sexual partner age distributions from three sources: the Africa Centre Demographic Information System, a health and demographic surveillance site in uMkhanyakude district, South Africa collected by the African Health Research Institute (AHRI) (Gareta et al., 2021;Gareta et al., 2020aGareta et al., , 2020b, the Manicaland General Population Cohort in Zimbabwe (Gregson et al., 2017), and 2017. We did not incorporate the weights associated with the survey into this analysis because our primary interest was in statistical modelling of partner age distribution as a function of respondent age, not producing population representative statistics for the Haitian population.
These data sets consisted of individuals' reports of their own age and sex and the ages of each of their sexual partners from the last year. Let i ∈ (1, ..., N) index reported partnerships, a i ∈ [15, 64] and s i ∈ {0, 1} be the age and sex of the respondent in partnership i with s = 1 indicating female, and p i be the age of non-respondent partner in partnership i. These questionnaires do not ask specifically about partner sex, but self-reporting of non-heterosexual partnerships in these populations is thought to be low.
Respondents in each of these data sets are disproportionately likely to report that their partners' ages are multiples of five or multiples of five away from their own age, leading to distinct "heaping" in the empirical partner age (or age difference) distributions at multiples of five. We tested the sensitivity of our results to heaping by developing a simple "deheaping" algorithm, applying it to the AHRI data, and running each analysis on the deheaped AHRI data. We present these results in Appendix section "Age heaping."

Probability distribution comparison
To identify the best probability distribution for modelling sexual partner age distributions, we split each data set into 12 subsets by sex and five-year age bin ranging from 20 to 50, resulting in 36 subsets, and fit a number of distributiondependent variable combinations to each subset.
Distributions. We tested five candidate probability distributions: normal, skew normal, beta, gamma, and sinh-arcsinh. Table 1 summarises the domains, parameters, and probability density functions (PDFs) of these distributions. Because the gamma distribution is always right-skewed and men typically partner with women who are younger than them, we transformed data among male respondents to be right-skewed when using the gamma distribution. Specifically, we multiplied the men's partners' ages by -1 to reflect the distribution horizontally across the y-axis, and added 150 to the reflected ages to ensure that all resulting values were positive. Similarly, the beta distribution is only defined on the interval (0, 1), so, only when using a beta distribution, we scaled all partner ages to be between zero and one using upper and lower bounds of 0 and 150.

Distribution Parameters
Domain PDF the density of this distribution with µ = 0 and σ = 1 for a variety of values of skewness and tail weight.

Dependent variable transformations.
We considered the possibility that certain distributions could interact well with particular transformations of the dependent variable (partner age) by testing a set of four potential outcome parametrisations.
For example, if X is a positive-valued, right-skewed random variable, then assuming log X is normally distributed might be more effective than assuming that X itself is normal.
Let y i be the dependent variable value for partnership i, and let a i and p i be the respondent age and partner age of partnership i, respectively. We tested the following dependent variables: 1. Linear age: y i = p i . This is untransformed partner age, included as a baseline. It has the undesirable quality of being able to predict negative ages.
2. Age difference: y i = p i − a i . If changes in expected partner age are consistent across respondent age then this variable would be more consistent across respondent age than the linear age. This parametrisation also allows for negative partner age predictions. 3. Log-age: y i = log p i . We can use a log link function to ensure that our predictions will be positive-valued.
4. Log-ratio: y i = log(p i /a i ). Finally, we can combine the link function and differencing approaches by modelling the log of the ratio of partner to respondent age. This variable will only produce positive predictions and should exhibit the same relative stability as the age difference variable.
Because the gamma and beta distributions are not defined on the entire real line, we only fit them with the linear age dependent variable with the previously discussed transformations.
To identify which distribution-dependent variable combination best modelled the characteristics of sexual partner age distributions, we stratified each of our three data sets by sex and five-year age bin from 20-24 through 45-49. We fit every viable distribution-dependent variable combination to all 36 subsets independently. Given that we fit only the linear age dependent variable to the gamma and beta distributions, comprising a total of 504 models (14 per data set).
We fit each model using the brms R package (Bürkner, 2018), defining custom families as necessary.

Distributional regression evaluation
Given a probability distribution that accurately replicated the non-Gaussian characteristics of partner age distributions, we tested whether or not distri-butional regression would allow us to pool data across age and sex without sacrificing fit. In distributional regression, we make all of our distributional parameters, not just the mean, functions of data (Kneib & Umlauf, 2017). Taking conventional Bayesian regression as an example, we have where β and log σ are free parameters. There is an explicit assumption in this model that the standard deviation of the generating distribution is constant across all observations. We can use distributional regression to relax this assumption, making σ a function of data: where β µ and β σ are now our free parameters. Note that we have not assumed that X µ = X σ . If X σ is a column of ones, this model is identical to the conventional case. This approach increases the complexity of the model and requires more data, but, based on previously described characteristics of how the distribution of partnership age distribution changes with age, even a simple model for our distributional parameters could yield large improvements.
In this case, we used a sinh-arcsinh distribution and specified a model for each of its four parameters. We fit a series of increasingly complex distributional regression specifications to the three data sets using brms (Bürkner, 2018), which has deep support for distributional regression.  in the "Model specification details" section of the Appendix.

Model comparison
Across both analyses, we used two metrics to measure model fit. First, we calculated the expected log posterior density (ELPD), which estimates the density of the model at a new, unobserved data point (Vehtari et al., 2017). In cases where we wanted to compare across dependent variables, we multiplied the posterior densities of any variables resulting from non-linear transformations of observed partner ages by the Jacobians of the transformations. For example, if our observation model was defined on the log-age dependent variable y i = log p i , we divided the posterior density by p i . We used the loo R package (Vehtari et al., 2020) to calculate ELPD values.
To measure the ability of our models to replicate partner age distributions in an objective and interpretable way, we found the root mean squared error (RMSE) between the observed and posterior predictive quantiles. We calculated quantiles from 10 to 90 in increments of 10 by age bin and sex in the data and in the posterior predictions, and found the error in model prediction of each quantile. This measure tells how well our model predicts the entire distribution in the same units as our predictions. It is equivalent to finding the mean squared or median absolute distance from the line of equality in a quantile-quantile (QQ) plot.

Software
We conducted all of these analysis using the R programming language (R Core Team, 2020) and the brms library (Bürkner, 2018). We used the loo library to estimate all ELPDs (Vehtari et al., 2020), and produced all plots in this paper with the ggplot2 library (Wickham, 2016). We cannot provide the data we used for this analysis, but we do provide code and data for an simulated case on GitHub (https://github.com/twolock/distreg-illustration).

Results
The AHRI data included 77,619 partnerships, Manicaland had 58,676, and the Haiti DHS had 12,447. As an illustrative example of the distribution of partner ages, Figure 2 presents histograms of reported partner ages among women aged 35-39 for each of our three data sets. Figure 3 shows the sex-and age bin-specific empirical moments for the three data sets. Mean partner age increased with respondent age consistently for both sexes across all three data sets: among women, mean partner age increased by 26.0, 22.7, and 23.7 years in the AHRI data, Haiti DHS data, and Manicaland data, respectively, between age bins 20-24 and 45-49. However, higher order moments were less consistent: the standard deviation of women's partners' ages changed by 2.3, 0.5, and 3.5 years in the AHRI data, Haiti DHS data, and Manicaland data, respectively.
Within each data set, there is systematic variation across sex. For example, the standard deviation of partner ages in the Haiti DHS increased by 2.5 years among men and only by 0.5 years among women. These summary statistics illustrate the heterogeneity of partner age distributions across age and sex.

Probability distribution comparison
To identify the probability distribution that most accurately described the variation in sexual partner age distributions, we first determined the dependent variable with the highest ELPD for each distribution-dependent variable combination. Figure 4 illustrates each probability distribution's best fit to AHRI data among women aged 35-39 with each of the best distribution-specific dependent  The best dependent variable varied across data subset and probability distribution. Table 3 provides the share of data sets for which each dependent variable has the highest ELPD given each distribution. The log-ratio dependent variable was best in 50.0% of subsets with a normal distribution, but it was best in only 27.8% of subsets with a skew normal distribution. The dependent variable that was best in a plurality of subsets in each probability distribution (i.e. the variable with the highest percentage in each column of Table 3) used a log link function. We restricted all remaining comparisons to each distribution-subset combination's best dependent variable.    ELPD, the absolute value of the ratio of the difference between the two best ELPDs and the estimated standard error of the difference was greater than 2, indicating that the sinh-arcsinh distribution was significantly better than the alternatives in the majority of cases. In one case, men aged 20-24 in the Haiti DHS, the skew normal distribution resulted in a slightly higher ELPD than the sinh-arcsinh distribution, but the standard error of the difference was greater than the difference. These results were not affected by deheaping the data (Appendix section "Age heaping").
To summarise each distribution's performance, we calculated the average ELPD and QQ RMSE across the three data sets (

Distributional regression evaluation
We fit all five distributional regression specifications to all three of our data sets with sinh-arcsinh distributions and log-ratio dependent variables and compared the ELPDs and QQ RMSEs as before (provided in Table 5). Across all three data sets, the most complex distributional model (Distributional 4) had the highest ELPD and lowest QQ RMSE. When fit to the AHRI and Manicaland data sets (but not for the Haiti DHS), the most complex distributional model was a least two standard errors better than the next best model. Notably, the largest ELPD improvements came from moving from conventional regression (Conventional) to the simplest distributional model (     The red estimates (Conventional Regression) of the three higher order parameters were constant across age and sex, whereas the blue estimates (Distributional Model 1) included independent, linear age and sex effects. The orange estimates (Distributional Model 4) were generated sex-specific splines with respect to age, allowing for flexible variation across age and sex.
The third row of plots in Figure 6, which corresponds to the skewness parameter,   Figure 6: Estimated sinh-arcsinh distributional parameters from the conventional regression model, and distributional models 1 and 4 fit to the AHRI data. "Conventional" assumes no variation across age and sex, "Distributional 1" allows for independent age and sex effects, and "Distributional 4" includes sex-specific splines with respect to age.

Discussion
We found that the sinh-arcsinh distribution reproduced observed sexual partner age distributions better than a number of other possible distributional assumptions across age and sex in three distinct data sets. We integrated this finding into a distributional regression framework using existing statistical modelling software. Even the simplest distributional regression in our set of candidate models far outperformed conventional regression, in which all moments except the first are estimated as constants. Our most complex distributional model fit better than all other models in all three data sets, suggesting that modelling these data benefits from the additional complexity.
These results indicate that distributional regression models with sinh-arcsinh distributions can accurately replicate age-/sex-specific sexual partner age distributions. This approach presents a number of advantages over previous methods.
First, like Smid et al., it allows a unique distribution for every age-sex combination. As Figure 3 illustrates, partner age distributions can exhibit substantial, systematic variation across age and sex in any of the first four moments, so we must consider modelling strategies that allow for such variation. Second, distributional regression offers a principled method to propagate uncertainty through this estimation process.
Finally, distributional regression implemented through brms provides access to a deep set of hierarchical modelling tools that could enable estimation in a variety of low-data settings. We evaluated a small set of relatively simple distributional models in this work, but, theoretically, each distributional parameter could have its own, arbitrarily complex hierarchical regression model. Using these tools, one could estimate unique partner age distributions across levels of stratification that are substantively interesting but do not provide sufficient sample size for independent estimation (e.g. study sites or geographic areas).
We have identified several limitations in this approach. First, the amount of data required to produce usefully precise estimates is not tested. Each additional distributional parameter introduces model parameters, so this method is more complex than conventional regression. The sinh-arcsinh distribution did fail to produce the highest ELPD in our smallest data subset (N = 170), but it was not significantly worse than the best distribution. More importantly, by integrating these data into a distributional modelling framework, we gain the ability to impose structure on these parameters, which could easily offset the cost of any additional model parameters.
Interpreting the inferred model parameters in sinh-arcsinh regression can also be difficult. Whereas conventional regression estimates the effects of covariates on expected values, the sinh-arcsinh distribution is parametrised in terms of a location parameter. This parameter correlates closely with the central tendency of the distribution, but it is not strictly equal to the mean. We can reparametrise the distribution so that we estimate a mean (and therefore effects of covariates on the expected value), but it is not currently possible in the probabilistic programming software that underlies brms.
Third, our analysis assumed that we were operating at a level of stratification at which partnerships are basically comparable, but any number of factors could lead to fundamentally different partner age distributions. For example, we did not control for whether the partnership was same-sex or the type of the partnership (married, casual, etc.). That said, our distributional framework would allow us to incorporate data on any of those factors directly into the model.
Despite these limitations, we believe that the strategy we present will work well in future projects that require estimates of partner age distributions. We plan to use these methods to produce age-mixing matrices to inform epidemic models of HIV, but there are a number of additional directions that could be explored.
We are specifically interested in leveraging the spatio-temporal structure of the survey data used here. Hierarchical mapping exercises with household survey data are increasingly common in epidemiology, but estimating spatially varying partner age distributions would require an evaluation of how best to model higher order moments over space. We would, for example, need to consider how the variance of partner age distributions varies by urbanicity.
Similarly, population-based studies typically collect far more detailed information on partnerships than we took advantage of here. Relationship type is a key confounder of the association between respondent age and partner age (that we ignored for the purposes of our experiments). We might expect the age distributions of casual partners to vary substantially from those of long-term cohabiting partners. Because we have built our model in a pre-existing regres-sion framework, incorporating new covariates into any of the distributional regression specifications should be simple.
We believe that our framework offers a flexible, accurate, and robust method for smoothing and interpolating sexual partner age distributions, but these methods are not specific to partner age distributions. The sinh-arcsinh distribution is relatively easy to implement without incurring high computational cost, so it could be applied in many settings. Even without the distributional regression framework we have used here, allowing the third and fourth moments of the distribution to vary from the "default" normal values could be valuable across a variety of applications.
Distributional regression is also underutilised in social science applications.
We often work with large surveys that would comfortably support models for higher order parameters. Data requirements will vary by application and model, but, as we have shown here, even a simple distributional model can improve fit and avoid biasing estimates.

TMW's work is funded by the Imperial College President's PhD Scholarship.
We thank all of the people who participated in the three studies that shared their data with us, as well as the survey and data management teams, without whom this work would not be possible.

Age heaping
Respondents in each of these data sets are disproportionately likely to report that their partners' ages are multiples of five or multiples of five away from their own age, leading to distinct "spikes" in the empirical partner age (or age difference) distributions at multiples of five. The left panel of Figure 8 illustrates this phenomenon among women aged 24 years in the AHRI data.
These spikes, widely referred to as "heaping," could bias our results towards certain probability distributions, so we developed a simple deheaping algorithm, applied it to the AHRI data.
To account for the possibility that heaping affected the results, we developed a simple deheaping algorithm and treated the deheaped AHRI data as a fourth data set. Due to the structure of the questionnaire ("how many years older or younger is your partner than you?"), the AHRI partner age data exhibit strong heaping on partner ages that are multiples of five years from the respondent's age. For example, among women aged 24 years, we observe far more partners aged exactly 29 years than expected.
Let n s,a,p be the number of observed partnerships with s i = s, a i = a, and p i = p.
Fixing age to be a and sex to be s, we can find the expected count at partner age p,n s,a,p by fitting a Nadaraya-Watson estimator to all ordered pairs (p, n s,a,p ) such that p − a is not a multiple of five. We can then find the positive-valued excess counts at all p such that p − a is a multiple of five: e s,a,p = max(n s,a,p −n s,a,p , 0). This quantity, e s,a,p , is what the Nadaraya-Watson estimator has identified as number of heaped observations. Fixing p to be a partner age such that (p − a) mod 5 ≡ 0, we assume that all of the excess mass at p will be allocated to the four partner ages on either side of p . We find the share of e s,a,p to be allocated to each of (p − 2, ..., p + 2), denoted b s,a,p , as b s,a,p = n s,a,p ∑ 2 i=−2 n s,a,p +i , substituting inn s,a,p for n s,a,p wherever applicable. Finally, we find the number of individuals to be reassigned from p to each p within two years of y as d s,a,p = b s,a,p · e s,a,p . Note that each partner age can only "receive" partnerships from its nearest multiple of five and that each multiple of five can only "send" partnerships to itself and the four partner ages on either side of it. For each y within two years of y , we randomly select d s,a,p individuals to move from p to p. We apply this method for both sexes and all respondent ages with at least two observations separately. Figure 8 illustrates the effect of this process on data among women aged 24 in the AHRI data. Despite its simplicity, the deheaping algorithm seems to produce distributions that should be sufficiently plausible for the purposes of this experiment. If our results differed substantially between the original and deheaped AHRI data, we would need to consider the possibility that our results could be an artefact of heaping.
This method is quite simple, but it seems to work reasonably well on the AHRI data. Regardless, we do not need a perfect deheaping algorithm for this application; we just need one that will give us a plausibly deheaped version of the AHRI data. Share of observations Figure 9: Observed sexual partner age distributions among women in the AHRI data. The left panel is original data, and the right panel is the same data set after deheaping age differences from multiples of five.
deheaped data sets (i.e. if one probability distribution works perfectly only on the deheaped data), then we will know that our results are sensitive to irregularities in the data. Figure 9 shows the presence of age heaping among women in the AHRI data, as well as the effects of our deheaping algorithm. Visible diagonal lines indicate that women were disproportionately likely to report that the difference between their partner's age and their own age was a multiple of five. Heaping to partner ages (not partner age differences) would manifest as horizontal lines. As we can see in the right panel, the deheaping procedure resolves the majority of the heaping. We cannot validate the algorithm, but for the purposes of this experiment, simply producing plausibly deheaped age distributions should be sufficient. Table 6 provides ELPD and QQ RMSE values for all five regression models fit to

Model specification details
We modelled the log-ratio dependent variable using the four-parameter sinharcsinh distribution: where β µ , β σ , β , and β δ are free parameters. We placed essentially arbitrary shrinkage priors on all coefficients: First, we fit a conventional regression, in which only the location parameter, µ, is a function of data. Specifically, we allowed for linear sex and age effects and a linear interaction between respondent sex and age (s i and a i , respectively) in the model of µ: In the second model, we allowed the three higher order distributional parameters to vary by age and sex: In the third model, all four distributional parameters had age, sex, and age-sex interaction effects: To allow for the possibility of non-linear variation with respect to age in the fourth model, we modelled the location parameter using sex-specific natural splines on age: where K is the number of columns in the spline design matrix. By including a second set of basis function values that are multiplied by s i , we are estimating an additional, female-specific trend with respect to age.
Finally, we fit a fifth model, in which all four distributional parameters were modelled as sex-specific splines with respect to age:     Figure 18: Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among men in the Manicaland data set. Here, we plot the posterior predicitve distribution associated with each distribution's highest-ELPD dependent variable.