Creating area level indices of behaviours impacting cancer in Australia with a Bayesian generalised shared component model

This study develops a model-based index creation approach called the Generalized Shared Component Model (GSCM) by drawing on the large field of factor models. The proposed fully Bayesian approach accommodates heteroscedastic model error, multiple shared factors and flexible spatial priors. Moreover, our model, unlike previous index approaches, provides indices with uncertainty. Focusing on Australian risk factor data, the proposed GSCM is used to develop the Area Indices of Behaviors Impacting Cancer product - representing the first area level cancer risk factor index in Australia. This advancement aids in identifying communities with elevated cancer risk, facilitating targeted health interventions.


Introduction
In the past two decades, the resolution and accessibility of health data for small geographic areas have improved considerably.While these improvements allow for more granular insights into spatial disparities in specific diseases or health characteristics, the increasing quantity and complexity of such data can make it more challenging to collate, draw meaningful conclusions and formulate effective policy decisions [1].Recently, interactive and publicly available health atlases (e.g., Australian Cancer Atlas [2] or PLACES platform [3]) have gained popularity as valuable tools for disseminating health data for small areas [2,3,4].These atlases often provide estimates and measures of uncertainty (e.g., measurement error) for multiple health measures, opening up avenues for further spatial analysis [5,6].
Multidimensional data often contain redundancy; that is, some dimensions may not provide substantive additional information.In this case, the data could be reasonably summarised by a lower-dimensional latent space, called an index [1].An index should measure a multifaceted concept that cannot be captured by a single feature.Indices often offer a simple interpretation of a complex phenomenon, can help determine policy priorities, and can be used to benchmark or monitor the performance of current policies [7].
There are a plethora of methods routinely used to derive indices [7,8] including rank sum, standardisation [7] and principal component analysis (PCA) [9].However, these methods do not naturally accommodate spatial autocorrelation and area-specific uncertainty, which are common in atlas data.Since atlas data are primarily derived from models, their uncertainty is largely model-based.Like other work [10], we use the generic term measurement error to denote this model-based uncertainty.
While spatial PCA can accommodate spatial autocorrelation [11,12], model-based approaches are gaining popularity in the creation of indices [1,13,14,15] as these approaches can also accommodate measurement error and provide uncertainty measures.Furthermore, previous work highlights that by fitting Bayesian models [13,14,16], researchers can provide probabilistic statements about index values.This facilitates a sophisticated decision-making process that uses the distributions of index values rather than point estimates.With access to distributions, researchers can more effectively determine whether variations in index values are truly meaningful.Shared component models (SCMs) [17] have been used to create indices from spatial data [14].These models assume that the input data originate from a linear combination of a single shared factor and residual errors [18,19].However, conventional SCMs do not naturally accommodate measurement error [20,21,22,23,24] and can be limiting when the input data lack strong correlations [20].
This study introduces a Bayesian generalised SCM (GSCM) capable of accommodating heteroscedastic measurement error, multiple shared factors and flexible spatial priors [25].Innovations are introduced with respect to the modelling, computation, and application.In terms of the model, while Bayesian spatial factor models have been employed previously to generate health indices with uncertainty measures [13,15], the GSCM bridges the mathematical and intuitive gaps between SCMs and Bayesian spatial factor models (BSFM); a connection that, to our knowledge, has yet to be acknowledged.To help distinguish between SCMs and BSFMs, we define a SCM as a model with a single shared factor and spatial dependencies in both the shared and residual terms.A BSFM, on the other hand, has at least one shared factor and spatial dependency in at least one component of the model.
On the computational front, the model was fitted using Stan [26], a flexible tool that facilitates efficient analysis of diverse Bayesian models.As a side effect of this work, we also present an efficient implementation of the Leroux prior [27] in Stan, which, to our knowledge, represents the first feasible implementation of this increasingly common spatial prior using Hamiltonian Monte Carlo.
The motivation for this work is a significant public health challenge; cancer.In light of the substantial global health burden [28], cancer prevention strategies focus on reducing the prevalence of unhealthy behaviors (e.g., risk factors) that can lead to cancer such as drinking, smoking, obesity, poor diet, and inadequate activity [29].Previous research has found spatial disparities in unhealthy behaviors that can lead to cancer [3,4,30,31], such as smoking [32].Yet, while area-level health indices are commonplace [33], cancer-specific indices have primarily been developed at the individual level [34,35].Area-level cancer risk indices have remained unexplored.
To address this gap and showcase the potential of the GSCM approach, we conduct a comprehensive case study using publicly available modelled data on unhealthy behaviors in Australia.The aim of this study is threefold: (1) generate a summary measure of unhealthy behaviors in Australia to help understand any underlying spatial patterns; (2) identify areas with elevated prevalence of unhealthy behaviors that may cause cancer; (3) provide decision support to help policymakers focus their cancer prevention strategies.
To do this, we developed the first area-level cancer risk index in Australia called the Area Indices of Behaviors Impacting Cancer (AIBIC), which provides a relative summary measure to compare the prevalence of unhealthy behaviors between areas.Unlike other applications of model-based index creation, which provide a single index only [13,14,15], we provide four indices in the AIBIC and explore their distinct properties.The first two indices stem directly from the fitted GSCM, while the latter two are unique transformations of the former pair.

Data
The presented analyses utilize data from our previous Bayesian spatial small area estimation study [31] which used data from the 2017-2018 National Health Survey (NHS).The NHS, conducted by the Australian Bureau of Statistics (ABS), is an Australia-wide population-level health survey conducted every 3-4 years [36,37].In our previous study [31] we generated modelled estimates and measures of uncertainty for the proportion of persons aged 15 years and older in 2017-18 engaging in behaviors that increase the risk of being diagnosed with cancer [29].Modelled results were provided for 2221 small areas, known as Statistical Area Level 2 or "SA2s", as defined by the 2016 Australian Statistical Geography Standard (ASGS) [38].
For the AIBIC product, we used five features from Hogg et al. [31], namely, the proportion of the following unhealthy behaviors: current smokers, risky alcohol consumption, overweight/obese, inadequate diet and inadequate physical activity based on leisure only.The selection of these unhealthy behaviors followed the same rationale as Hogg et al. [31], involving consultations with experts, a review of relevant literature [29,39,40,41] and the evaluation of data availability in the 2017-2018 NHS.
These data on unhealthy behaviors were available as point estimates and standard deviations of proportions.For all the unhealthy behaviors, each was defined so that a higher proportion indicated a higher prevalence of poorer health behavior [31].To enable the use of a Gaussian likelihood in our model we first applied the inverse logit transformation, resulting in Y k and S k being N -dimensional vectors of estimates and corresponding standard deviations for risk factor k = 1, . . ., 5, where N = 2221 is the total number of small areas (i.e., SA2s).Finally, these logit-transformed point estimates were centered and scaled, with the corresponding standard deviations appropriately adjusted (see Section A of the Additional File).The final set of input features is available on Github [42].
Alongside the data on unhealthy behaviors, in this study, we utilised population data for persons 15 years and older at the SA2 level.The vector of population counts, P = (P 1 , . . ., P N ), was sourced from the ABS's Estimated Resident Population product and calculated as the average annual count for 2017 and 2018 [43].

Risk factors
Current smoking was defined as those who reported to be current smokers (including daily, weekly or less than weekly), and had smoked at least 100 cigarettes in their lifetime.
Risky alcohol consumption was defined as individuals who exceeded the revised 2020 National Health and Medical Research Council (NHMRC) guidelines [44] of up to 10 standard drinks/week and no more than 4 standard drinks in any day.Compliance with the guidelines was assessed using self-reported alcohol consumption during the last three drinking days from the preceding seven days.
Inadequate diet, which was based on self-reported diet, was defined as those who did not meet both the fruit (2 serves/day) and vegetable (5 serves/day) 2013 NHMRC Australian Dietary guidelines [45].Participants of the 2017-18 NHS were asked to report the number of servings of fruit and serves of vegetables they usually ate each day.
Overweight/obese was defined as persons with a Body Mass Index greater or equal to 25.The NHS data collects both measured and self-reported height and weight.Measurements were voluntary, so 40% of the measured values in the NHS were imputed by the ABS using the hot-decking method [46].
Inadequate physical activity based on leisure only was defined as those who did not meet the 2014 Department of Health Physical Activity guidelines [47].This guideline stipulated that each week, adults aged 18-64 should either do 2 1/2 to 5 hours of moderate-intensity physical activity or 1 1/4 to 2 1/2 hours of vigorous-intensity physical activity or an equivalent combination of both, plus muscle-strengthening activities at least 2 days each week.

Statistical Model
Shared component models (SCM) are a multivariate statistical tool developed in the context of disease mapping.The traditional assumption underpinning the SCM is that the input features share a single common factor [20,21] or latent field that "manifests" in the separate features.To accommodate residual heterogeneity not captured by the shared factor, SCMs also include feature-specific residual errors [17].In disease mapping, these models assume spatial dependence in both the shared factor and the feature-specific residual errors.Borrowing from the massive literature on factor models, incorporating multiple factors into the SCM enhances its ability to explain variation in the multivariate data, improves the approximation of the covariance structure, and facilitates more informative inferences.
Given the expressed interest in accommodating the uncertainty of the input features, there is a need to include areaspecific measurement error.Following the classical measurement error framework [48], which is abundant in spatial models [49,50], we adopt a similar method to that of Jahan et al. [5].With the constraint that the number of shared factors (L) was much fewer than the number of features (i.e., L << K), the proposed GSCM is written as a N -dimensional multivariate normal distribution for the kth feature, where Y k , σ k and µ k are N -dimensional vectors of estimates, corresponding standard deviations and true values for the kth risk factor.Unlike Jahan et al. [5], who placed a prior on their standard deviation estimates, in this work σ k is taken to be equal to S k (i.e., the standard deviations from Hogg et al. [31]).
In line with the classical measurement error framework [48], while the estimates are assumed independent, the true values are assumed to exhibit a complex correlation structure, captured in the µ k 's.z is a N × L matrix of shared factor scores, with independent columns and ϵ k a N -dimensional vector of feature-specific residual errors for the kth feature with associated variance τ 2 k .Λ k is the kth row of the K by L matrix of factor loadings.We assume that the ϵ k 's are independent of each other and also from z.The shared factors, z, are assumed homoscedastic, where the columns have fixed variances equal to 1 [1,51].In this work, these shared factors are used to simplify our understanding of the spatial patterns of unhealthy behaviors and represent the building blocks for the AIBIC product described later.The covariance structure of the GSCM is described in Section C of the Additional File.
Since factor models are notorious for being unidentifiable, there is an abundance of research in this area [52,53,54].In this work, we use the hierarchical structural constraint [55] where the factor loading matrix, Λ, is constrained to be lower triangular with strictly positive diagonal entries λ ii > 0 [1].This constraint enforces bounds on the number of free parameters in Λ and the maximum number of shared factors.In this work, where K = 5, L was set to 2, requiring the estimation of nine parameters in Λ.

Spatial dependence
Although the N -dimensional vectors ϵ k and z are independent of each other, we assume spatial dependence within each vector.Whilst previous research has used spatial priors on the loading matrix [56,57], shared factors [1,13], or feature-specific residual errors only [58], traditional SCMs assume spatial dependence in the shared factors and featurespecific residual errors [14,17].Although marginal spatial priors have been explored, particularly in geostatistical applications [13,16,51], conditional autoregressive priors [59] are more common in SCMs [15,20,21] and disease mapping applications in general [25].
In this work, we fit independent N -dimensional conditional spatial priors on the columns of z and the ϵ k 's.To allow for a blend of spatially structured and unstructured variation, we used the Leroux prior [27], hereafter called the LCAR (see Section B of the Additional File).
The LCAR represents a generalisation of two priors due to the inclusion of a spatial autocorrelation parameter ranging from 0 to 1.When the spatial autocorrelation parameter is equal to 1, the LCAR becomes the common intrinsic conditional autoregressive (ICAR) prior [22], which has been used in SCMs and BSFMs before [14,56].When the spatial autocorrelation parameter is equal to 0, the LCAR collapses to an unstructured independent and identically distributed (IID) normal prior.The spatial autocorrelation parameters from the shared factors and feature-specific residual errors are denoted as ρ l and κ k , respectively.
The GSCM represents several generic models as special cases.The conventional (non-measurement error) SCM is recovered when uncertainty of the input data is ignored (e.g., setting σ nk = 0, ∀n, k) and L = 1.Continuing to ignore measurement error, a BSFM is recovered similar to that of Nethery et al. [1] when all κ k = 0 or that of Mezzetti et al. [58] when all ρ k = 0. Furthermore, when all ρ k = 0, κ k = 0 the GSCM becomes a generic Bayesian factor model.

Computation
We used fully Bayesian inference with MCMC via the R package rstan Version 2.26.11 [26].Where possible we used the non-mean centered parameterisation for the shared factors and feature-specific residual errors.Although efficient implementations of the proper CAR [60] and ICAR [61] are available, to the best of our knowledge this is the first implementation of the LCAR in Stan.This computational novelty is widely applicable to any spatial model fitted using Stan, not just the model proposed in the current paper.Code and details can be found on Github [42] or in Section B.1 of the Additional File.
We used 4000 warmup and 6000 post-warmup draws for each of the four chains.For storage reasons we thinned the final posterior draws by three, resulting in 8000 draws for inference.Convergence of the models was assessed using trace and autocorrelation plots, effective sample size, and R [62].

Model fit metrics
To formally compare models we used the Deviance Information Criterion (DIC), Widely Applicable Information Criterion (WAIC) [63], and mean absolute error (MAB) between the estimates, Y k , and the posterior median of the true values, µ k .For these metrics, smaller values were preferred.

Summaries
To address the first aim of this study, which was to generate a summary measure of unhealthy behaviors in Australia, the results from the GSCM are used to acquire draws for four N -dimensional vectors.Each vector represents the underlying scores for the four indices in the AIBIC product described later.
The first two vectors are the columns of the shared factor scores, z.The third is a weighted sum of these [7], where the weights, w 1 , w 2 , are based on the squared factor loadings (see the Additional File for details), via The fourth version is obtained by weighting the third version by the SA2 population.
After acquiring these four vectors, the posterior draws were converted into posterior ranks and posterior percentiles [7,9,15].Posterior ranks were derived by ranking each draw, while posterior percentiles were derived by assigning each draw a percentile value within the range of 1 to 100.Although the posterior percentiles and ranks provide similar inference, the ranks provide slightly more granular information, whilst percentiles are slightly easier to interpret.
The final step to produce the AIBIC product was to appropriately summarise the draws of these vectors.We use the posterior median and 95% highest posterior density interval (HPDI).To address the second and third aims of this study, which were to identify areas with elevated cancer risk and provide decision support to easily select appropriate areas, we calculated the posterior probabilities of the vectors exceeding the 80th, 95th, and 99th percentiles [13,14].Additionally, we derived the posterior probability of the vectors being ranked within the top 10, 20, and 100 areas.
The point estimates, HDPIs, and posterior probabilities for the vectors, percentiles, and ranks are available on Github [42].These summaries are provided for each of the four indices, with percentiles and ranks computed both nationally and for each of the eight states and territories of Australia.

Exploratory Analyses
This section explores the unhealthy behaviors in terms of spatial dependency and correlation structure (Figures 1  and 2).This exploratory analysis is used to inform the fitting of the GSCM in the following sections.Section A in the Additional File provides descriptive statistics for the input features.
The correlation coefficients of each pair of features are illustrated in Figure 1.It indicates a strong positive relationship between current smoking, inadequate physical activity and overweight/obese, and a negative correlation between risky alcohol consumption and inadequate physical activity.Exploring the standard errors of the input data reveals that risky alcohol consumption has the lowest error (median of 0.32), while inadequate diet (median of 0.56) and overweight (median of 0.51) have the highest.Univariate Moran's I tests [64] indicate that all the risk factors exhibit a high level of spatial autocorrelation.
Initial exploratory analysis also involved a Principal Components Analysis (PCA).The first two components accounted for 72% of the total variation (Figure 2).Notably, the first principal component captured less than half of the total variation (42%) suggesting a single shared factor would be insufficient.Furthermore, risky alcohol consumption exhibited negative or weak correlations with the majority of the other unhealthy behaviors and had a very low loading with the first principal component.Thus, using a second factor allowed us to effectively capture the distinct patterns associated with risky alcohol consumption, ensuring that these patterns were not diluted within a single shared factor or lost within the residuals.
The results from the PCA were used to inform the ordering of the features in Λ in the GSCM [56].This ordering does not impact the theoretical model or its predictions [55,65], but can influence the shared factors and consequently, the interpretation of the indices.Given that a maximum of two shared factors is possible, only the ordering for the first two features matter in this case.We used risky alcohol consumption followed by current smoking; a choice driven by the negligible loading on risky alcohol consumption with principal component 1 (PC1) and strong positive loading with PC2.Moreover, current smoking was chosen as the second feature due to its strong positive loading with both PC1 and PC2.This ordering aimed to decorrelate the two shared factors.

GSCM model results
To identify the best specification, we initially fit a range of GSCMs with increasing complexity, as described in the following paragraphs.This process allowed us to assess the impact of measurement error, the number of shared factors and spatial structures, with a small section dedicated to each.Following this assessment, we select a specific GSCM specification, which serves as the basis for developing the AIBIC product.
Note that to improve identifiability and convergence, we used a non-spatial (e.g., IID) prior for the first feature-specific residual error in all models.

Impact of measurement error
To assess the impact of ignoring measurement error, we compare the results from two-factor GSCMs where one model accommodates the measurement error and the other does not.Both models use LCARs for the shared factors and IID priors for the feature-specific residual errors.Figure 3, which compares the posterior median and 95% HDPIs, confirms the expected result; the GSCM with measurement error provides shared factor 1 scores with distinct patterns (plot (a)) and considerably increased uncertainty (plot (b)).The outliers on the right of plot (a) are mostly lower socioeconomic areas within major cities. Figure S1 in Section E of the Additional File provides a similar plot for the shared factor 2 scores, illustrating slightly more agreement between the two models, but a similar increase in uncertainty.
The factor loadings are considerably impacted when the measurement error is accommodated (Table 1).The large relative changes, particularly for overweight/obese and inadequate diet, are an artefact of the higher standard errors of these input features.The largest change is for overweight/obese which loads with 0.40 for factor 1 when measurement error is ignored, but 0.07 when measurement error is accommodated.This represents both an over 5-fold decrease in central tendency and an over 4-fold increase in relative standard error.
Note the 46% decrease in the loading of risky alcohol consumption with factor 1 (Table 1).This is noteworthy, particularly since this feature loads the strongest with this factor and has the smallest standard errors among the features, potentially explaining the marked differences in factor loadings when the measurement error is accommodated.
Considering the significant impact of measurement error, all subsequent models will accommodate the known measurement error.

Impact of multiple shared factors
To assess the impact of using L > 1 in the GSCM, we compared the first shared factor when fitting a one-factor and two-factor GSCM with LCARs for both the shared factors and feature-specific residual errors.Figure 4, which compares the posterior median and 95% HDPIs for the scores, confirms our expectation: the inclusion of the second factor in the GSCM has a negligible effect on both the central tendency and uncertainty of the factor 1 scores.
More broadly though, the two-factor model offers a more accurate representation of the data's covariance structure, leading to the WAIC favoring it as the superior model.The WAIC for the one-factor and two-factor models are 16744 and 15591, respectively.Furthermore, Table 2 illustrates the large loadings of all the features with the second factor, supporting its utility in this work.The factor 1 loadings are only minimally affected when factor 2 is included, with the greatest change occurring for current smoking and inadequate diet.The difficulty in interpreting factor 1, given the negative loadings, is discussed in Sections 3.3 and 4.

Impact of spatial priors
We explored which combination of spatial priors provided the best fitting two-factor GSCM to these data.For the shared factors and feature-specific residual errors, three priors were possible (IID, LCAR, and ICAR) resulting in nine possible models.The results (Table 3) show that the models with spatial priors for both the shared factors and feature-specific residual errors provided a superior fit to the multivariate data.

Area Indices of Behaviors Impacting Cancer (AIBIC)
By considering the model selection criteria together (Table 3), the two-factor GSCM with LCAR priors for the shared factors and ICAR priors for the feature-specific residual errors was considered the optimal model.The factor loadings are given in Table 4, and summaries of the other model parameters are provided in Section D of the Additional File.The shared factors from this model are used to construct the AIBIC product, which provides a relative summary measure to compare the prevalence of unhealthy behaviors between SA2s.A low index value for an area is likely to be an area where a higher proportion of the residents engage in unhealthy behaviors that can lead to cancer.
The AIBIC product helps to address the three aims of this study: generating a summary measure of unhealthy behaviors, identifying areas with a higher prevalence of unhealthy behaviors, and providing decision support for cancer prevention in Australia.The product is tailored for broad cancer prevention strategies that aim to simultaneously address multiple unhealthy behaviors that can lead to cancer.
In this section, we propose and compare the four AIBIC indices, each with a unique purpose and interpretation (Table 5).
While the first two are derived directly from the selected GSCM, the second two are unique summations of the first two.
The AIBIC product can be found on Github [42].

Factor 1
This first shared factor (z 1 ), mapped in Figure S4 of the Additional File, could be considered an Alcohol, Activity Index as the loadings with the highest absolute magnitude appear on risky alcohol consumption and inadequate physical activity (Table 4).The direction of the loadings suggests that areas with a high factor 1 score are areas with a high prevalence of risky alcohol consumption, but a low prevalence of inadequate physical activity.We discuss the implications of the negative loadings later (Section 4).

Factor 2
This second shared factor (z 2 ), mapped in Figure S5 in the Additional File, could be considered a Smoking, Activity and Weight Index, given its strong loadings with these unhealthy behaviors.Thus, factor 2 represents an approximate average of the unhealthy behaviors, excluding risky alcohol consumption.

Health Behaviors Index (HBI)
Although factor 1 or 2 could feasibly be used to address the first and second aims of this study, justifying the sole use of either is challenging.While factor 1 can be interpreted as risky alcohol consumption minus the average of the other features, factor 2 averages all the features, except risky alcohol consumption.
Rather than select one, we followed previous advice in the Handbook of Constructing Composite Indicators [7], and derived a weighted average of factors 1 and 2 instead.This weighted average is the Health Behaviors Index (HBI) (Table 5).The Pearson correlations are 0.66 between factor 1 and the HBI, and 0.90 between factor 2 and the HBI (Figure 5); reflecting the uneven weights of 0.45 and 0.55 (Section 2.2.5).
The HBI provides a single value for each area that is more representative of the five unhealthy behaviors, both in terms of point estimates and uncertainty.Figure 6 illustrates the spatial disparities in the HBI with lower values occurring in the major cities, indicating a generally lower prevalence of unhealthy behaviors in these areas.Moreover, within each remoteness category, HBI values are generally lower for less socioeconomically disadvantaged areas (Figure S6 in the Additional File).These results generally agree with the spatial patterns of the underlying unhealthy behaviors [31].

Population Adjusted Health Behaviors Index (PAHBI)
Considering the third aim of this study (e.g., decision support), areas with high HBI values do not necessarily correspond with areas with large populations, where the impact may be more pronounced.To address this aim, the Population Adjusted HBI (PAHBI) is a population-weighted version of the HBI (Table 5).
In the PAHBI, areas with high (or low) HBI values but small populations are shrunk.These areas can be seen in the PAHBI versus HBI scatter plot of Figure 5 as the line of observations near the 50th percentile.On the other hand, areas with large populations have their index value made more extreme; ranking the area higher (or lower).
Similar to the HBI scores, the PAHBI scores (Figure 7) illustrate strong spatial disparities of unhealthy behaviors.
Although remote areas (central Australia) do have lower PAHBI values than HBI values, even after accounting for population these regions continue to represent areas with the greatest potential for impact.

Comparison of indices
The motivation for the creation of an index is to enable policymakers to prioritize health interventions in a geographically targeted manner.Figure S3 in the Additional File shows the correlations between the AIBIC indices and the individual unhealthy behaviors.
In this section, we compare how the different indices result in unique decisions when a policymaker can only, due to feasibility or monetary constraints, select four areas to apply a new health campaign aimed at reducing unhealthy behaviors.Table 6 provides these decisions based on the AIBIC derived from the GSCM and those from the non-modelbased methods of rank sum and PCA applied to the input estimates.
For the non-model-based methods, areas were selected according to the point estimates only.For the four indices from the AIBIC, the selected areas are those with the highest posterior probability of having percentiles above the 95th.Only areas with populations greater than 99 were included.Table S5 in the Additional File provides point estimates and standard errors for the selected areas in Table 6.
While the non-model-based methods can be used to efficiently select the areas with the highest prevalence of unhealthy behaviors, these methods do not accommodate the area-specific uncertainty or population size (Table 6).Considering PC1 it selects four areas with very high current smoking and inadequate activity rates, with the standard errors of the estimates for these unhealthy behaviors exceeding the 98th percentile of the standard errors.Similarly, PC2 selects four areas with very high current smoking and risky alcohol consumption rates, coupled with small populations (some under 200).The standard errors for these areas are all above the 97th percentile of the standard errors.
Considering the AIBIC, specifically the PAHBI, the standard errors for the unhealthy behaviors of the areas selected are all below the 92nd percentile (median 70th) of the standard errors.The PAHBI helps to target over 71100 people, compared to around 11300 and 5770 with PC1 and PC2, respectively.Note that the selection of which index to use is contingent on the specific requirements and context of the use case.
Table 6: Comparison of the selected top four areas when using a variety of indices.Only areas with populations greater than 99 were considered.For rank sum, PC1 and PC2, the four areas with the highest value are chosen.For the AIBIC indices, the selected areas have the highest posterior probability of having percentiles above the 95th.Areas that are italicised are also in the top four areas according to their posterior probability of also being ranked in the top 10 areas across Australia for the same index.For each selected area and unhealthy behavior, we provide the actual proportion estimates (and standard errors) as percentages on the left and the nationwide percentiles of the same proportion estimates (and standard errors) on the right.A percentile of 100 should be interpreted as a very high prevalence estimate (or high standard error) of that unhealthy behavior across Australia.

Discussion
Our results indicate that for the data utilised, ignoring the heteroscedastic measurement error, relying only on a single shared factor, or accommodating spatial autocorrelation in only the shared factors or feature-specific residual errors could lead to incomplete or inefficient policy decisions.Our solution, the GSCM, is a flexible model-based approach to index creation.Bolstering its applicability, the GSCM encompasses several previous models [1,58] used for index creation as special cases.With the computational benefits of Stan [26] and a fast implementation of the LCAR [27], the GSCM can feasibly be used to model a wide range of input data with varying degrees of spatial autocorrelation and measurement error.Although we illustrated our approach using atlas data from one data source, the GSCM can feasibly be applied to any geographical health dataset or datasets, with or without uncertainty.It would be feasible to fit a GSCM where some features have heteroscedastic measurement error whilst others remain free of such error.Moreover, our model could be easily incorporated into larger, more complex models such as spatial structural equation models [16,66].
As anticipated, the incorporation of the heteroscedastic measurement error considerably altered the shared factor scores (Figure 3), but the inclusion of a second factor had minimal influence on the scores of the first factor (Figure 4).However, the latter comparison was made using a GSCM with spatial shared factors and spatial feature-specific residual errors.If instead, IID priors were used for the feature-specific residual errors, the shared factor 1 scores exhibited very different patterns, with the one-factor model producing scores with higher uncertainty (see Figure S2 in the Additional File).This result supports the use of traditional SCMs (which assume spatial dependence in both the shared and feature-specific parts) over BSFMs (which assume spatial dependence in only one of these).By allowing for spatial dependence to be captured by a greater number of model terms, our results became more robust.Further, GSCMs with IID feature-specific residual errors, akin to Nethery et al. [1] BSFM, underperformed irrespective of whether the shared factors were spatial (Table 3).Future research could investigate whether the BSFM could be improved by using alternative spatial priors like the BYM [59] and BYM2 [67].Consistent with previous research [1,15], the worst-performing model was the standard Bayesian factor model which excluded spatial dependence entirely (Table 3).
Given the negative loadings in factor 1 (Table 4), interpreting the overall cancer risk profile based on factor 1 or factor 2 alone posed a challenge [9].This partly motivated the combined indices.Fortunately, the negative loadings were anticipated based on previous work and our exploratory analysis of the features (Figure 1).Previous work indicates an inverse relationship between alcohol consumption and area level socioeconomic disadvantage [4,68,69].
The three aims of this work were met by creating the Area Indices of Behaviors Impacting Cancer (AIBIC).The AIBIC, developed from the output of the GSCM, provides four indices designed to help identify areas with a high (or low) prevalence of persons who engage in behaviors that can lead to cancer, including current smoking, risky alcohol consumption, overweight/obese, inadequate diet and inadequate physical activity.While factor 1 and 2 are specific to certain unhealthy behaviors, the HBI is more general and thus more suitable for future spatial regression models or analyses.The PAHBI could be used for direct policy decisions, particularly in contexts where population density is relevant.Our analyses shed light on the different policy decisions that could be made by examining indices crafted with different methods.Indeed, indices inherently involve a loss of information, rendering them less suitable for decisions relating to a single feature (e.g., current smoking only).In such cases, a direct examination of the specific feature is always preferable.
The AIBIC (Figures 6 and 7) reveals a pronounced spatial autocorrelation in unhealthy behaviors and appears to reflect the remoteness [70] and socioeconomic [9] patterns of Australia (see Figures S6 and S7 in the Additional File).This pattern could be an artifact of the modelling [31] that produced the input data for this work.Although this initially raises concerns regarding the novelty of the indices, subsequent analyses demonstrated that the AIBIC does not accurately predict the Index of Relative Socio-Economic Disadvantage from SEIFA [9], with the highest accuracy being a mere 16%.On the other hand, remoteness [70] proves to be an adequate predictor of AIBIC scores, evidenced by a maximum adjusted R 2 of 45%.Thus, future research could explore the benefits or limitations of using the AIBIC, SEIFA, and remoteness variables in the same ecological model.
Several limitations were encountered in this study.Firstly, the absence of an established area-level cancer risk index posed challenges for validation [71].We sought to address this by directly comparing policy decisions derived from our indices to the underlying features.Secondly, we faced challenges with convergence and identifiability in our case study, primarily due to the strong association of risky alcohol consumption with shared factor 1, which led to the feature-specific factor for this feature being weakly identified.To mitigate this, we used more informative hyperpriors and replaced the spatial prior with an IID prior.
The third limitation is the use of only five features and two shared factors.This may limit the transferability of the GSCM to more extensive dimension reduction tasks.Our implementation of the LCAR, although efficient, still encounters computational challenges when fitting many spatial priors.Subsequent development of these models could explore the use of direct sampling [72] or approximation methods such as variational inference [73].
The usefulness of an index is constrained by the quality and breadth of the input data.Unfortunately, we lacked small area level data on factors like sun exposure, a significant determinant of melanoma, and dietary factors, including red meat and fibre consumption, which are closely linked to colorectal cancer [29,39].As these data become available, future work could update the AIBIC.
Future methodological work could continue to tailor or generalize the GSCM to include dynamic factor loadings.We experimented with using a unique factor loading matrix for each of the five remoteness categories across Australia.
Although this significantly improved model fit, it made interpretation difficult and so wasn't pursued.Another avenue of future work could be the use of non-Gaussian distributions for the likelihood, such as the Beta distribution.
Unlike Nethery et al. [1], we did not include population as a feature in our model.Our method of incorporating population into the PAHBI represents just one approach, with alternative methods warranting future exploration.Moreover, there is a lack of development in alternative methods of combining numerous factors into a single index [7].Finally, our methodology, like other factor models, is impacted by the ordering of the features, the chosen approach for ensuring identifiability and the aggregation level of the spatial data [74].Different choices can result in marked variability in inferences, highlighting possible avenues to improve the robustness of model-based methods of index creation.
In conclusion, in the current paper, we have proposed a generalised shared component model that can effectively model complex correlations in multidimensional health data, which is often derived from atlas platforms such as the Australian Cancer Atlas [2] and Social Health Atlases of Australia [4].The GSCM allows for multiple shared factors and heteroscedastic measurement error and provides uncertainty measures for improved decision-making.
We applied this model to small area-level data on unhealthy behaviors that can lead to cancer, to develop the first area-level cancer risk index in Australia; the Area Indices of Behaviors Impacting Cancer (AIBIC).Successfully addressing the three aims of this study, the AIBIC product provides a relative summary measure to compare the prevalence of unhealthy behaviors between areas and is predominantly relevant for analyses and decision-making for cancer prevention.But, the AIBIC is also applicable to other disease modelling tasks and depending on the research question, could be used in conjunction with other indices such as SEIFA.

A Data transformation
Let π nk and σ nk be the proportion estimate and standard error for the nth area and kth risk factor from Hogg et al. [1].
To transform these to the unconstrained (log-odds) scale, .
Now let θk and s k be the empirical mean and standard deviation of the point estimates, (θ 1k , θ 2k , . . ., θ N k ) for the kth risk factor.
The last step is to mean-center and scale the estimates, θ nk , and standard errors, Snk , which are the input data to the GSCM proposed in the main paper.In this section, we describe the LCAR [2] and provide our efficient Stan implementation.Let W be a N × N binary weight matrix where W nñ = 1 for row n and column ñ if area n and area ñ are neighbours and zero otherwise.In addition, let D be a N × N diagonal matrix with elements j W 1j , . . ., j W N j and ρ ∈ [0, 1) be the unknown SA parameter which controls the level of spatial dependence.
For a latent vector, z l , the unit-scale LCAR prior is a N -dimensional multivariate normal distribution, Eq. B.0.1 where the covariance matrix is specified as such to enforce dependence between neighbouring areas and allow for local smoothing.When ρ = 0 the LCAR collapses to an independent and identically distributed (IID) standard normal distribution, while when ρ = 1, the LCAR collapses to the intrinsic CAR (ICAR) prior, Eq. B.0.2 which has a singular covariance matrix.To correct for the ICAR not being a proper probability distribution, a sum-tozero constraint is placed on z l .Given this property, the LCAR can become unstable when ρ → 1.For this reason, we restricted p ∈ [0, 0.99], which explains why the upper bounds of the 95% HPDIs in Table S4 are 0.99, not 1.
Another consequence of the improper nature of the ICAR is that it cannot be utilized as the likelihood for data.Thus, in our setup, we cannot fit a BSFM where the feature-specific residual errors are spatially correlated.We avoid this when fitting the GSCM as the LCAR is used as a prior distribution.
We enact the No Island assumption, which posits that all rows in W must have a sum greater than zero.This implies that no areas exist in isolation [3].To ensure the neighborhood structure was fully connected [4], we treated the eastern and western SA2s at the top of Tasmania as neighbors of the furthest south SA2s in Victoria.The GSCM can accommodate complex covariance structures in the multivariate spatial data.See Section 2.2 of the main paper for notational details.
Assuming for simplicity that K = 3 1 , we can write the first component of Eq.C.1.1 using block matrix notation.
We can derive a similar expression when L = 1; the conventional SCM.

C.2 Decomposing the variance of the input features
Using the derivations provided above, in this section we provide mathematical expressions showcasing the covariance structure imposed by the GSCM when L = 2.

D Model results
To derive the HBI, we take the sum of the squared loadings for each shared factor (Table 4 in the main paper).The weights are the percentage of these sums for each shared factor.The result is about 0.45 and 0.55 for shared factors 1 and 2, respectively.This approach is described in Section 6.1 (page 89) of the Handbook on Constructing Composite Indicators [6].
Table S3: Posterior medians and 95% highest posterior density confidence intervals (HPDI) for the feature-specific standard deviations across the nine different two-factor GSCMs.

Figure 1 :
Figure 1: Visualisation of the correlation matrix for the point estimates of the five unhealthy behaviors.

Figure 2 :
Figure 2: Visualisation of the loadings from a Principal Components Analysis (PCA) applied to the point estimates for the five unhealthy behaviors.The horizontal bars are colored according to the direction of the loading with red indicating a negative loading.Above the plots, the percentage of variation explained by Principal Component 1 (PC1) and PC2 are given.

Figure 3 :
Figure 3: Comparison of the first shared factor when estimated using a two-factor GSCM with and without accommodating the measurement error (ME).Scatter plot (a) gives the posterior median and 95% HPDIs from the GSCM with (y-axis) and without (x-axis) ME.The red diagonal line represents equivalence of the x and y axes.Caterpillar plot (b) compares the posterior standard deviations ranked by the "no ME" model.

Figure 4 :
Figure 4: Comparison of the first shared factor when estimated using the 1-factor GSCM and the 2-factor GSCM where both have LCARs for the shared factors and feature-specific residual errors.Scatter plot (a) gives the posterior median and 95% HPDIs from the 1-factor (y-axis) and the 2-factor (x-axis) models.The red diagonal line represents the equivalence of the x and y axes.Caterpillar plot (b) compares the posterior standard deviations ranked by the 2-factor model.

Figure 5 :
Figure 5: Pairs plot of the four indices in the AIBIC product.Posterior medians of the percentiles are used.The plots in the lower triangular squares are scatter plots, the diagonal elements are density representations and the empirical correlations between the four vectors are given in the upper triangular.

Figure 6 :
Figure 6: Choropleth maps displaying the HBI, which is based on the summation of the two shared factor scores from the two-factor GSCM.The results are the posterior median of the posterior percentiles for 2221 SA2s across Australia.Higher values indicate areas with a higher prevalence of unhealthy behaviors.The map includes insets for the eight capital cities for each state and territory, with white boxes on the main map indicating the location of the inset.White lines represent the boundaries of the eight states and territories of Australia.

Figure 7 :
Figure 7: Choropleth maps displaying the PAHBI, which is based on the summation of the two shared factor scores from the two-factor GSCM.The results are the posterior median of the posterior percentiles for 2221 SA2s across Australia.Higher percentiles indicate areas with a higher prevalence of unhealthy behaviors.The map includes insets for the eight capital cities for each state and territory, with white boxes on the main map indicating the location of the inset.White lines represent the boundaries of the eight states and territories of Australia.

Figure S1 :
Figure S1: Comparison of the second shared factor when estimated using a two-factor GSCM with and without accommodating the measurement error (ME).Scatter plot (a) gives the posterior median and 95% HPDIs from the GSCM with (y-axis) and without (x-axis) ME.The red diagonal line represents equivalence of the x and y axes.Caterpillar plot (b) compares the posterior standard deviations ranked by the "no ME" model.

Figure S2 :
Figure S2: Comparison of the first shared factor when estimated using the 1-factor GSCM and the 2-factor GSCM where both have LCARs for the shared factors and non-spatial (IID) priors for the feature-specific residual errors.Scatter plot (a) gives the posterior median and 95% HPDIs from the 1-factor (y-axis) and the 2-factor (x-axis) models.The red diagonal line represents the equivalence of the x and y axes.Caterpillar plot (b) compares the posterior standard deviations ranked by the 2-factor model.

Figure S3 :
Figure S3: Correlation matrix of the five unhealthy behaviors, three non-model-based indices (rank-sum, PC1, and PC2), and the four indices from the AIBIC product.

Figure S4 :
Figure S4: Choropleth maps displaying factor 1, which are the posterior medians of the posterior percentiles for the shared factor 1 for 2221 SA2s across Australia.Higher percentiles indicate regions with a higher prevalence of unhealthy behaviors.The map includes insets for the eight capital cities for each state and territory, with white boxes on the main map indicating the location of the inset.White lines represent the boundaries of the eight states and territories of Australia.

Figure S8 :
Figure S8: Choropleth maps of Perth, Western Australia showing percentiles for eight characteristics of this analysis.The top row displays the SA2-level prevalence of risky alcohol consumption, current smoking, overweight/obesity, and the population.The bottom row displays the four indices from the AIBIC product.AIBIC index values are posterior medians, whereas selected risk factor and population values are based on point estimates only.White areas indicate SA2s lacking data on unhealthy behaviors.

Figure S9 :
Figure S9: Choropleth maps of Melbourne, Victoria showing percentiles for eight characteristics of this analysis.The top row displays the SA2-level prevalence of risky alcohol consumption, current smoking, overweight/obesity, and the population.The bottom row displays the four indices from the AIBIC product.AIBIC index values are posterior medians, whereas selected risk factor and population values are based on point estimates only.White areas indicate SA2s lacking data on unhealthy behaviors.

Table 1 :
Posterior medians (95% highest posterior density confidence intervals (HPDI)) of the factor loadings from two-factor GSCMs where one accommodates the measurement error (ME), while the other does not.Both models use LCARs for the shared factors and non-spatial priors for the feature-specific residual errors.

Table 2 :
Posterior medians (95% highest posterior density confidence intervals (HPDI)) of the factor loadings from a one-factor and two-factor GSCM with LCARs for both the shared factors and feature-specific residual errors.

Table 3 :
Comparison of the two-factor GSCM when altering the prior structure for the shared factors and feature-specific residual errors.Model comparison is carried out using the Deviance Information Criterion (DIC), Widely Applicable Information Criterion (WAIC), and mean absolute error (MAB).The bolded quantities represent the lowest value in each column where appropriate.

Table 4 :
Posterior medians (and 95% highest posterior density confidence intervals (HPDI)) of the factor loadings from the selected GSCM.

Table 5 :
Terminology for the four indices of the AIBIC product.In the table, zn1 is the first shared factor raw score for the nth area and Pn is the corresponding population count.
Thus, not only is the proposed model a useful generalisation of a variety of commonly used model-based methods for creating indices, but the set of indices created in this work have wide applicability in public health.Centre for Data Science, Queensland University of Technology, 2 George St, Brisbane City, 4000, Queensland, Australia 2 Australian Centre for Health Services Innovation, School of Public Health and Social Work, Queensland University of Technology, 2 George St, Brisbane City, 4000, Queensland, Australia 3 Viertel Cancer Research Centre, Cancer Council Queensland, 553 Gregory Terrace, Fortitude Valley, 4006, Queensland, Australia

Table S1 :
Descriptive statistics for the input features, Y nk .

Table S2 :
Descriptive statistics for the standard deviations, S nk , of the input features.

Table S4 :
Posterior medians and 95% highest posterior density confidence intervals (HPDI) for the SA parameters in the five different two-factor GSCMs with LCAR priors.Note that the SA parameter is only used for the LCAR.κ1 is omitted as an IID prior was used for ϵ1 in all models.