Categorization of continuous covariates and complex regression models – friends or foes in intersectionality research

Objective: Studies of intersectionality are increasing to examine health inequalities. Different proposals for examining intersections have recently been published. One approach (1) considers models specified with 1 st and all 2 nd -order effects and another (2) the stratification based on multiple covariates; both categorize continuous covariates. A simulation study was conducted in order to review both methods with regard to correct identification of intersections, rate of false positive results, and generalizability to independent data compared to an established approach (3) of backward variable elimination according to Bayesian information criterium (BE-BIC). Study design and setting: Two basically different settings were simulated with 1000 replications: (1) comprised the covariates age, sex, body mass index, education, and diabetes in which no association was present between covariates and a continuous response and (2), comprising the same covariates, and a non-linear interaction term of age and sex, i.e., a non-linear increase in females above middle age formed the intersection of interest. The sample size (N=200 to N=3000) and signal to noise ratios (SNR, 0.5 to 4) were varied. In each simulated dataset bootstrap with replacement was used to fit the model to internal learning data and to predict outcomes using the fitted models in these data as well as the internal validation data. In both, the mean squared error (MSE) was calculated. Results: In simulation setting 1, approaches 1/2 generated spurious effects in more than 90% of simulations across all sample sizes. In smaller sample size, approach 3 (BE-BIC) selected 36.5% the correct model, in larger sample size in 89.8% and always had a lower number of spurious effects. MSE in independent data was generally higher for approaches 1/2 when compared to 3. In simulation setting 2, approach 1 selected most frequently the correct interaction but frequently showed spurious effects (>75%). Across all sample sizes and SNR, approach 3 generated least often spurious results and had lowest MSE in independent data.


Introduction
Reducing inequalities in morbidity and mortality between socioeconomic groups remains a major public health challenge [1].Intersectionality provides a framework for an integrated understanding on how characteristics of social privilege and disadvantage are involved in shaping inequalities in health and mortality [2].Introduced by Kimberle Crenshaw, to express the synergistic experiences of Black women who experience multiple forms of oppression as both being Black and being female [3], the framework of intersectionality is being adopted to a wide range of population based research [4][5][6].Essentially, intersectionality describes that inequalities between, e.g., women versus men, and immigrants versus non-immigrants, may not simply add for an immigrant woman.With regard to the numerous combinations of person characteristics being determined by more than one social dimension, e.g., gender, race, socioeconomic status and region of residence, examining inequalities within a population seems to be methodologically challenging.Symbolic of this difficulty, numerous best practice recommendations for examining intersectionality have been published [7][8][9][10][11][12][13][14].
Building explainable and generalizable multivariable models is a common statistical task that requires subject-matter knowledge and statistical techniques for variable selection as well as finding the optimal functional form of continuous covariates.Omitting the second step may lead to false exclusion of continuous covariates [15].To date, there are no consented recommendations for combining these two tasks; some proposals still require further investigation [16].The intersectionality framework goes beyond the existing complexity.It is assumed that intersections are formed by combination of nominal, ordinal, but also continuous characteristics and their interaction.Disregarding relevant interaction terms can cause bias [17].With regard to statistical modelling, however, identifying relevant interaction terms adds further complexity as the number of all interaction terms (NIT)   = 2  −  − 1 increases exponentially with covariates p considered.Considering only 2 nd -order effects, their number is (  2 ) for p covariates and grows fast, e.g., for p being 2 to 7, it results in 1, 3, 6, 10, 15 and 21 two-way interaction effects.
Nonetheless, a recent simulation study [18] compared several methods to examine intersectionality in mid-size (N=2,000) to large data (N=200,000) comprising 192 intersections.Among other approaches, a multiple linear regression model comprising all possible interaction terms and categorized continuous covariates was found to accurately estimate intersections particularly in large data.A second approach of comparable complexity proposes the use of a stratification variable [19,20].This variable is defined as the combination of discrete covariates and arbitrarily categorized continuous covariates.For example, education (low, medium, high), biological sex (female, male), and chronological age (categorized to: <40, 40-59, >59) form a stratification variable with 18 (3 x 2 x 3) levels.In terms of model parametrization, both approaches result in complex regression models which may be necessary to detect intersections.However, overly complex regression models are known to overfit the data and perform poorly in independent data [21].With regard to intersectionality, it is unclear whether such complex models may not more often render into spurious findings.
This study simulates data for the frequently used multiple linear regression model and compares three approaches to detect intersections: (1) models comprising all possible interaction terms, (2) a stratification approach, and (3) an information criteria-based approach for the detection of intersections.

Objectives
The comparison of three methodological approaches under varying conditions of sample size and signalto-noise ratios for application in intersectionality analyses regarding: (1) the generalizability to independent data (2) the rate of correct model terms (3) and the rate of spurious findings.

Approach 2: stratification
In the stratified approach [19,20], only the categorized covariate age, sex, and education were used to create a stratification variable resulting in 32 (4 x 2 x 4) strata.This implies, that only education was considered as noise.Any further stratification such as for BMI caused multiple strata with no observations.
According to methodological recommendations [24], one of the most frequently observed strata was used as the reference (Age [48,55)|males|average education).In approach 2, the model formula is specified as: where  ⋅  ⋅  corresponds to a multivariable stratification variable; 32 effects including the intercept had to be estimated.

Approach 3: backward elimination with BIC
We chose a commonly used information criteria  = −2 (ℒ) +  (), where the first term including the likelihood ℒ rewards goodness-of-fit and the second term penalizes model complexity with k being the number of parameters in the model.BIC is combined with a backward elimination (BE) strategy for selecting model terms [25,26].There are more advanced strategies [27][28][29], yet, BE-BIC is known to select descriptive covariates in multivariable regression [16,25,30] and implemented in most statistical software [31].The upper scope from which BE-BIC started was the same model as specified for approach 1b.
Additionally, we varied between linear (approach 3a) and functional forms of continuous covariates (age, BMI) using restricted cubic splines (RCS) [32].These spline functions fit piecewise polynomials across the range of  that is divided by  knots  1 , . . .,   into  + 1 intervals.Due to restrictions in the basis-functions, fits by RCS are linear at the lower/upper tail of a covariate (the two outer knots for  ≤  1 and  ≥   ).As recommended [32], we varied  by the specifying the degrees of freedom of the spline function with  ∈ (2; 3; 4) in approaches 3b to 3d; in the first case with  = 2 the function fits a cubic polynomial between the outer knots.The location of knots was not prespecified [32,33].With varying degrees of freedom for continuous covariates, the number of parameters to estimate was 26 (BE-BIC, linear) and 42/59/78 for  ∈ (2; 3; 4) respectively, including the intercept.

Model fit and generalizability
Within each of the 1000 simulation runs, bootstrap with replacement was applied to fit the models to internal learning data (iLD, bootstrap: in-bag-data) and to predict the outcomes in the internal validation data (iVD, bootstrap: out-of-bag-data) using the fitted model, i.e. the fitted model is evaluated in data generated under the same data generating process but these data were not used for model training.The mean squared error: was calculated in iLD, to describe model fit, and in iVD, to estimate of prediction accuracy and generalizability.

True vs. spurious effects
Anderson et al. [34] defined a spurious effect as: "finding effects that seem supported by the data but are actually spurious".They arise in exploratory studies in which many hypotheses are tested [34].In simulation setting 1 (intercept model), we will count the instances in which any term (except the intercept) of approaches 1a, 1b and 2 has p-value <0.05 and, in approaches 3a-3d, any term other than the intercept is selected.In simulation setting 2, we proceeded similarly, but considered the true effects of age + sex + age:sex.To note, for approach 2 the number of correctly chosen strata is not possible.However, we calculated the number of strata with p-value <0.05.

Sensitivity analysis
We repeated analysis in data with more complex structure of association between covariates and with additional interaction terms.More details are provided in the supplement.

Computations
We used R (Version 4.3.1,[35]; RStudio version 2023.06.1.)to conduct to parallelized computations [36] at a high performance cluster computer of the University of Greifswald [37].Covariance structures were simulated using the R-package MASS [38], RCS are implemented in base R [35], and graphical illustration is done using ggplot2 and ggsci [39,40].

Simulation Setting 2
Figure 3 shows results for a SNR of 1 (R 2 = 0.5) in which an association between age, sex, as well as the interaction of age:sex was simulated.Regarding MSE in iVD, approaches 1a and 2 performed worse than the other methods.In larger data, approach 1b had similar MSE compared to approach 3a (MSEiVD = 0.937/0.922for 1b/ 3a).BE-BIC with RCS, fitted best to iLD and had highest prediction accuracy in iVD.
Similar results were found for SNR of 0.5/4 (supplementary figures SF1 and SF2).
In terms of correctly identifying the effects of age + sex + age:sex, approach 1a performed worst, particularly in smaller sample size and with low SNR.Only for N=3000/SNR of 4, the rate of correct terms was >80%.Approach 1b obtained highest rates of identifying the correct terms being statistically significant.However, approaches 1a/1b had high rates of spurious effects: in 75% of simulations, two or more irrelevant variables were shown to be statistically significant; in 50% of simulations four to seven.
The rate of additional covariates or interaction terms (other than the true effects) representing as statistically significant was considerable for approaches 1a and 1b.Irrelevant interaction terms such as age:education were shown as statistically significant in more than 50% of iterations (Table 3).In approach 1a, this frequency exceeded twice that of the true effect of sex.Note, we report a significant effect, if at least one of the levels of discrete covariates or of the respective interaction term had a significant effect (true effects in bold font).

Discussion
This simulation study compared two proposals for the analysis of intersectionality with a conventional and widely used approach of multivariable model building.We varied sample size, SNR, and studied a situation where no association of covariates with the response is present.Approaches 1 (a, b) which model all interactions terms [18] and approach 2 that uses multivariable stratification [19,20], impose very complex regression models resulting in frequent spurious findings.In simulation setting 1, in more than 90% of simulations spurious effects were generated, i.e. the model contained statistically significant effects even though the true model is a pure intercept model.In simulation setting 2, the number of spurious effects was frequently equal to or higher than the number of true effects.
The use of a stratification variable (approach 2) was the most inappropriate.Although Figure 1/2 suggest similar performance with approaches 1a/1b, the comparison is inadmissible.Not all covariates could be used to form a stratification and some noise was excluded.Predictions on independent data was frequently impossible; this effect is known for multiplicative interaction models due to lack of support in subgroups [41].Besides partial inapplicability, approach 2 allows almost no interpretability of results.
Discrete and arbitrarily categorized continuous variables are interwoven in a way making it difficult to recognize a relevant intersection.As shown in the supplementary figure SF 5, the intersection was formed by females above middle age alone.Nevertheless, all strata "above" the reference had frequently p-values <0.05.In large sample size also strata for males, which is an effect of age alone.Despite the unfavorable characteristics of multivariable stratification, the perceived simplicity already contributes to further studies using this approach [42].
Results for approach 1a and 1b are ambiguous.When using categorized continuous covariates, models were hardly generalizable due to overfitting of the complex model, particularly in smaller sample size; i.e.
the model closely fits the learning data but performs poorly in predicting new data.Coefficients in models of approach 1a were larger in smaller sample size (supplementary figure SF 3) and caused poor predictions in iVD.In approach 1b, when using continuous covariates and in larger sample size, the MSE in iVD was similar to BE-BIC.This result is in line with the study of Mahendran et al. [18].However, approach 1b had a critical rate of spurious findings that frequently exceeded the number of true effects.
Across all settings, sample sizes, and SNRs, BE-BIC generated least often spurious effects and had lowest MSE in iVD, particularly in simulation setting 2, when using RCS with three or four degrees of freedom.
This result is consistent with statistical literature from recent decades.Essentially, the upper scope of model complexity subject to BE-BIC using RCS was similar or even higher than in approaches 1a, 1b and 2.
However, the complexity is then subject to a penalty term that causes only a few terms to be retained in the final model.Noteworthy, the latter were most often the correct terms only.The superiority of BE-BIC persisted also in sensitivity analyses (please see supplementary file).
Considering the many methodological approaches to limit model complexity by means of penalization or shrinkage, it seemed unlikely that simpler approaches (1a, 1b, and 2) [18][19][20] will produce robust results at even higher levels of model complexity.Too complex models are known to generalize poorly in independent data [21,32] which is also illustrated by our study.In addition, poor generalizability goes hand in hand with high risk for spurious effects which is considered detrimental to studies of intersectionality.
Arbitrary categorization of continuous covariates amplifies the problem: the number of parameter estimates is almost twofold higher in approach 1a (48) compared to 1b (25).For decades, it is warned against categorization of continuous covariates [16,32,[43][44][45][46][47][48] as it assumes piecewise constant response within respective categories.In this study, this leads to underestimation of excess frailty in females above middle age (supplementary figure SF4, top) and is also true if the increase is linear instead of non-linear (supplementary figure SF4, bottom).Furthermore, a recent study also provides analytical proof that interaction terms with categorized continuous covariates inflates spurious effects [49].
Besides statistical properties, categorization of continuous covariates is often implausible from an intersectionality perspective.Considering, e.g., intersections arising from age and education, keeping the covariate age continuous allows for detection of different age-related offsets with regard to the educational level.In addition, none of the methodological best practice recommendations suggest categorization of continuous covariates.Neither Warner [7] nor McCall [50] were making a statement about categorization of continuous covariates.In fact, Warner [7] warned against considering too many intersections and encouraged to state why and not that intersections have been considered.

Strengths and limitations
This simulation study compared approaches to examine intersectionality in a small predictor space with limited covariance among variables.Compared to real-world analysis, this appears simplistic as usually more covariates have to be considered and association structures will be more complex.However, this simple setting shows considerable limitations of recent proposals to examine intersectionality which, according to our sensitivity analysis, also transfer to more complex settings.
We simulated a response that increased linearly with age, regardless of sex, but nonlinearly for females above a middle age.We consider such a trajectory more plausible than any discrete and piecewise constant change as ultimately assumed by an arbitrary categorization of continuous variables while nonlinearity is frequently found in public health studies [51][52][53][54].
The hesitation in using spline functions for continuous covariates may be related to non-interpretable tables of regression coefficients [55].Nevertheless, for illustration of results marginal means can be used [56][57][58].Their calculation is not restricted to linear models [56] and was applied in intersectionality studies using different modelling strategies [58][59][60][61].Restricted cubic splines (RCS) are one tool among others to model functional forms and can be used with R [62], SAS [63], SPSS [64], and STATA [65].Whereas delineation of RCS characteristics is complicated, the use in regression models is not as only the number of nodes must be specified [32,33].We summarized our approach to study intersectionality in the supplement (paragraph: BE-BIC and Resampling).

Conclusion
Categorization of continuous covariates remains a bad idea.Multivariable stratification, including categorized continuous measures, is the least appropriate approach for intersectionality research as disentangling and interpretation of effects is impossible and the probability of spurious effects high.
Approaches to study intersectionality should acknowledge the complexity of multivariable modeling and use methods that limit the chance of false-positive results.A common approach (BE-BIC) is being superior to recent proposals in terms of minimizing spurious effects and better generalizability in independent data.
For intersectionality research we consider it more important to describe relevant intersections rather than all possible intersections.

Figure 1 :
Figure 1: Example data for sample size of N=1000 and for two different simulation settings (Top left, setting 1: with no association of age, sex, and their interaction with the response; top right and bottom, setting 2: association of age, sex and their interaction with SNR (0.5, 1, 4).In all settings neither education, BMI, nor diabetes are associated with the response.Colored lines show loess-smoothed non-parametric estimates of the conditional mean of the response which is very close to the true simulated effect for  2 of 0.8; black lines (solid and dashed) show the true effect.

Figure 2 :
Figure 2: Mean squared error over 1000 simulations of setting 1 (no association between covariates and the response).

Figure 3 :
Figure 3: Mean squared error over 1000 simulations of setting 2 with SNR of 1 (moderate association between age, sex, age:sex and the response).
In setting 2, only age, sex, and the interaction of age and sex were associated with the response; BMI, education, and diabetes constitute noise only.Due to the small quadratic term, we simulate a non-linear increase in the response only for females above a middle age, i.e. this co-formation of age and female sex represents our intersection of interest.

Table 1 :
Results for simulation setting 1 on correctly identifying intercept only model (numbers in parentheses relate 25 th , 50 th 75 th percentile of erroneously chosen terms).

Table 2 :
Simulation results for setting 2 on correctly identifying age + sex + age:sex (numbers in parentheses relate 25th, 50th 75th percentile of erroneously chosen terms).

Table 3 :
For simulation setting 2 with SNR=1 and sample size N=2000, for each of the 1 st and 2nd order effects the rate is given in which the effect was statistically significant (approach 1a and 1b) or selected by BE-BIC (approaches 3a -3d).