The importance of missing data in estimating BMI trajectories

Body Mass Index (BMI) trajectories are important for understanding how BMI develops over time. Missing data is often stated as a limitation in studies that analyse BMI over time and there is limited research exploring how missing data influences BMI trajectories. This study explores the influence missing data has in estimating BMI trajectories and the impact on subsequent analysis. This study uses data from the English Longitudinal Study of Ageing. Distinct BMI trajectories are estimated for adults aged 50 years and over. Next, multiple methods accounting for missing data are implemented and compared. Estimated trajectories are then used to predict the risk of developing type 2 diabetes mellitus (T2DM). Four distinct trajectories are identified using each of the missing data methods: stable overweight, elevated BMI, increasing BMI, and decreasing BMI. However, the likelihoods of individuals following the different trajectories differ between the different methods. The influence of BMI trajectory on T2DM is reduced after accounting for missing data. More work is needed to understand which methods for missing data are most reliable. When estimating BMI trajectories, missing data should be considered. The extent to which accounting for missing data influences cost-effectiveness analyses should be investigated.

BMI trajectories, the probabilities assigned to each trajectory and any subsequent analysis, for example, what effect missing data has when these trajectories are used to predict subsequent health.
BMI trajectory has been shown to have a significant influence on the likelihood of type 2 diabetes mellitus (T2DM) 3,6 ; in particular, a consistently elevated BMI has been shown to be a leading cause of T2DM 4 .This highlights the importance of investigating BMI over time rather than at a single point in time.However, when investigating BMI over time, missing data can be a serious limitation, particularly in older adults where data is increasingly missing due to morbidity or mortality.
This study will build on previous research in three ways.First, it will estimate BMI trajectories over a 20 year period, extending the time period that previous studies have used.Next, it will explore a range of missing data methods to account for missing BMI values and discuss their benefits and limitations.Finally, it will explore the extent to which the different methods for analysing missing BMI data influence results when used to predict subsequent risk of T2DM.

Data
The English Longitudinal Study of Ageing (ELSA) 14 contains data on individuals sampled from the Health Survey for England (HSE) between the years 1998 and 2000 if they were 50 years or over on 1st March 2002.Of 18,813 eligible individuals, 11,415 are included as core members or core partners and included in wave 1 of ELSA.In this study, the HSE baseline data is referred to as wave 0. Subsequent waves are those from ELSA, which follow up these individuals.
As part of the ELSA, nurse visits are carried out during waves 2, 4, 6, 8 and 9 and offered to a subsample of participants.As well as collecting other information, nurses measured participants' height and weight, allowing their BMI to be calculated (kg/m 2 ).Each ELSA follow up is around two years after the previous wave, meaning that there are four years between nurse visits until wave eight when there is a following set of nurse visits two years later in wave 9.The data available on BMI spans the years 1998 to 2019.Self-reported T2DM is available in every wave.In accordance with previous literature 2,3,15 , baseline characteristics include age (years), sex (male/ female), smoking status (yes/no) and marital status (married or cohabiting/unmarried or separated).Ethnicity is not included because the ELSA sample are predominantly white (> 97%) and so any influence of ethnicity is unlikely to be identified 3 .

Statistical analysis
This study uses a two-step process combining a GMM with a discrete-time survival analysis (DTSA) in a structural equation modeling framework.This is done in the same way as a previous study 3 , but extends that study by including additional waves of data and considering different methods to account for missing data.The GMM, DTSA and the Bolck, Croon, and Hagenaars (BCH) method used to link the two are described briefly here.Then, the methods for accounting for missing data are subsequently outlined.

Standard growth mixture model (GMM)
A 'standard' GMM 16 is used to estimate distinct BMI trajectories in adults over the age of 50.GMMs can facilitate the identification of differences in longitudinal change in BMI among unobserved groups.This approach uses a multinomial modeling approach to determine whether heterogeneity in the population can be explained by a finite number of distinct subgroups or trajectories, often referred to as latent components.
A dependent variable, y representing BMI for patient i at time t and in component c is defined as follows.
The random coefficients η sic = η sc + ξ is for the intercept (s = 0) , slope (s = 1) and quadratic (s = 2) terms ( η 0ic , η 1ic , η 2ic , respectively) have full covariance matrix.Variances of slope and quadratic terms are set to zero ( ξ i1 = 0, ξ i2 = 0 ).Variance of the intercept term is freely estimated by the model.x t are the time scores, which impose the trend.ε it is normally distributed with zero mean and variance, σ 2 εt .Variance of BMI is restricted to be time-invariant.
The intercept is influenced by baseline covariates z (age, sex, smoking status and marital status) so that where ζ 0i is normally distributed with a full covariance matrix.Probabilities of component memberships are estimated using a multinomial logit model so that where z is the same set of baseline covariates that influence the intercept and β is a set of estimated coefficients.

Bolck, Croon, and Hagenaars (BCH) two-step approach
A BCH approach 17,18 can be used to assess the relationship between the latent trajectories and observed variables.
In this case, it is used to assess the relationship between estimated BMI trajectories and the risk of T2DM.The BCH approach uses a weighted ANOVA to provide weights inversely proportional to the error classification probabilities; observations with less classification error are given more weight in the analysis.These weights are then used in the analysis.This approach does not directly assign individual participants to a specific trajectory and so does not allow the outcomes to directly influence the BMI trajectories, minimising the risk of reverse 4 k=1 e z i β k causality.This approach has been shown to be superior to other approaches that do assign individuals to specific trajectories 19 .Within the two-step process, step one is the estimation of the BMI trajectories (using GMM) and step two is the estimation of the risks of T2DM using BMI trajectory as a predictor (using DTSA).Previous literature describes the benefits of this approach over alternative approaches 3,19 .

Discrete time survival analysis
The hazard function assumes proportional hazards and varies by individual i and component c so that where κ is a set of coefficients and v i is a set of time invariant covariates for individual i .u itc is a set of binary variables where, for individual i , in component c , at time t , it is 0 if no T2DM is present, 1 if an individual is diagnosed at time t and missing if previously diagnoses.
The time invariant covariates v i include age at baseline and sex.Only these two variables were included in order to give a basic overview of the differences between trajectory groups, whilst also keeping the model parsimonious, allowing the results to be compared to previous research 3 .Note that, because this is a two-step model, the coefficients v i in the DTSA (step two) are not directly related to the latent component membership as they are in the GMM (step one).

Missing data methods
Although individuals were not removed from the analysis as long as they had at least one valid BMI value, the standard model, described above, assumes that missing data is MAR.The assumption that missing data is MAR is untestable and there is good reason to believe that this assumption may be violated.If BMI measures are MNAR, then the shape of the estimated BMI trajectories may be misleading.This study estimates BMI trajectories using GMMs under different missing data assumptions and compares the trajectories and their influence on T2DM diagnoses.

Complete case analysis
First, complete case analysis was performed.This involved removing any individuals which had missing data for BMI at any time point in the analysis.This method has the most restrictive assumptions, assuming that data are MCAR.

Multiple imputation
Multiple imputation was used to impute any missing BMI values.This approach assumes that data is missing at random (MAR), that is, the missingness depends on observed variables available in the dataset.BMI at each wave is imputed used available BMI values at other time points, as well as the independent variables using in the model (sex, age, smoking status and marital status) allowing the relationship between these variables to be preserved 20 .The imputed values are random draws from the posterior distributions of the missing values 21 with 20 imputations.Imputed datasets are analysed using Rubin's rule 21 .

Diggle-Kenward selection model
The Diggle-Kenward selection method 22 uses repeated measures to predict the probability of missing data at a particular time point.It is a discrete-time survival model for the dropout indicators and is estimated simultaneously alongside the BMI trajectories estimated in the GMM.This approach assumes that data is missing not at random (MNAR) and explicitly models the missingness itself.
Formally, there are T − 1 (in this case 5) binary missing data indicators, d 1 ,…, d 5 where d t is 0 before dropout, 1 at the first observation after dropout and missing thereafter (i.e.all subsequent observations after dropout is first observed.
Dropout due to attrition is influenced by the latent component and by BMI measures from times t and t − 1 so that.log The logistic regression slopes β are allowed to vary by component and by time.Here, BMI is assumed to be MNAR because dropout is related to both past and current BMI.Both y it and y it−1 are observed variables before dropout has occurred and latent variables represented expected BMI after dropout has occurred 23 .
This method for estimating the missingness is estimated jointly with the standard GMM, in effect adjusting the GMM estimates to account for missing data under the assumptions of the Diggle-Kenward method.From here, the BCH approach is used in the same way as for the standard GMM, to incorporate the latent component uncertainty into the DTSA estimation.

Roy pattern mixture model
Similar to the Diggle-Kenward model, the Roy pattern mixture model (PMM) 22 explicitly models missingness and is estimated simultaneously with the GMM.However, this approach assumes that the dropout influences the latent trajectory membership, rather than the latent trajectory influencing the dropout, as the Diggle-Kenward model assumes.Similar to a conventional pattern mixture model, dropout influences the repeated measure (in this case, BMI), but rather than influencing BMI through the random coefficients as a conventional PMM would, the Roy PMM influences them indirectly through component membership.The probability of component membership depends on baseline covariates, z i and dropout d it at every time point, so that replaces the probability equation (Eq. 3) in the standard GMM.
More details on the assumptions made using selection models and pattern mixture models can be found in previous literature along with details on how the missing data is coded for use in a structural equation modelling framework 22,24 .
It is worth noting that additional covariates could be included to predict missingness in the imputation model and in the Diggle-Kenward selection model, but in order not to over complicate the model additional covariates are not included in this example.The different methods of handling missing data, outlined above, are compared to each other, as well as the standard GMM case.The trajectories that are estimated from each analysis are compared, as well as the probabilities of following different trajectories.The differences in predicting the onset of T2DM using BMI trajectory are then investigated.

Results
From the 11,415 individuals sampled from the HSE who were considered core members or core partners 27  Approximately 85% of eligible individuals (those who were interviewed in person) participated in the nurse visits in each wave 14 , although this is not always the same 85% across waves.
Table 1 shows means and standard deviations as well as the number of non-missing observations for each variable within the sample.Mean BMI steadily increased over time and the majority of individuals fell within the range for individuals living with overweight (25 ≤ BMI < 30) at every wave.Around 54% of individuals were female.
The complete case analysis resulted in a similar sample to that of the original data.The complete case sample were younger, with fewer males, fewer smokers and more married participants.The mean BMI values were not that different in the complete case analysis compared to the standard GMM, however the standard deviation of BMI values is smaller meaning that participants at both extremes of the distribution may be less likely to remain in the study.
Over 97% of individuals had at least two valid values for BMI during the study period.Table 2 displays the patterns of missing data, which appear for at least 1% of individuals.The largest proportion of individuals (22%) have valid BMI values available for all time points in the models.Interestingly, there is also a large proportion (11%) which have valid BMI values only at baseline and wave 4 (approximately 8 years post-baseline).

BMI trajectories
The standard GMM model which does not account for missing data gives similar results to a previous study which uses the same methods and data 3 .The present study extends the previous study by including additional waves, namely waves 8 and 9.The trajectories show a similar pattern using the additional waves and the same number of latent trajectories is preferred.Table 3 shows the AIC, BIC and log-likelihoods of standard GMMs estimating BMI trajectories with one to five latent components.The addition of a fifth component made the model struggle to converge; many of the parameters needed to be fixed in order to enable convergence.The additional component also had a very low probability (almost 0%).The BIC and AIC were lowest for the 4 component model.Therefore the 4 component model is chosen to be the preferred model.
Similar comparisons were made across the different methods, with the exception of the complete case analysis where one component in the four component model had a probability of 0.7%.In the multiple imputation model, the 4-component model was selected based on the model fit statistics for each of the 20 imputations.In order to aid comparisons across models, four components are used for each of the different methods and results from this point onwards refer to four component models.Figure 1 shows the estimated trajectories using each of the different methods to account for missing data.
In line with previous literature, the four distinct trajectories are considered to be "stable overweight" (mean BMI consistently around 27); "elevated BMI" (mean BMI consistently high); "increasing BMI" (mean BMI over 30 at baseline and increasing throughout the study period); and "decreasing BMI" (mean BMI around 32 at baseline and reducing to around 25). Within each method, every individual has a positive probability of being in each trajectory.
The multiple imputation and Diggle-Kenward methods produce very similar BMI trajectories; including a flatter elevated BMI trajectory.The multiple imputation, Diggle-Kenward and Roy PMM each have stable overweight, decreasing BMI and increasing BMI trajectories that start at similar baseline BMIs (close to 30).In the complete case analysis, the elevated BMI trajectory is more concave; this is probably due to the small probability www.nature.com/scientificreports/ of following this trajectory in the complete case analysis.Nevertheless, all methods estimate four trajectories that show the same general pattern.Table 4 shows the mean probabilities of component membership, for each method, as well as the odds ratios (ORs) for each component by baseline characteristics, again, for each method.ORs are relative to the stable overweight trajectory.
The stable overweight trajectory has the highest probability, meaning that individuals are most likely to be on this trajectory.This is the case of each of the methods, but the probabilities do vary substantially across the models.Compared to the standard GMM, all other methods provide a smaller probability of following the stable overweight trajectory, although the complete case method is almost identical (77.8% vs. 76.7%).The multiple imputation method produces a probability of following the stable overweight trajectory of 57.0% compared to 77.7% in the standard GMM.The probabilities for all other trajectories is higher for the multiple imputation method compared to the standard GMM, particularly for the decreasing BMI trajectory, which is 21.0% compared to 8.9% in the standard GMM model.In the complete case analysis, the mean probability of following the stable trajectory is similar to that of the standard GMM, however, the mean probability of following the elevated trajectory is 0.7%, compared to 3.7% in the standard model.In the Diggle-Kenward model, the mean probability of following the stable overweight trajectory is substantially lower than the other models at 39.3%.In the Roy PMM, the mean probability of following the stable overweight trajectory is 57.7%, similar to the multiple imputation model.Both the Diggle-Kenward and the Roy PMM have mean probabilities for the decreasing BMI trajectory that are higher than the other models (31.2% and 31.9%,respective).
In general, the ORs show a similar story across the different methods, and also when comparing to the previous study using data to wave 7 3 .All statistically significant ORs are in the same direction across models, www.nature.com/scientificreports/with the exception of the Diggle-Kenward method.Here, increasing age increases the odds of following the elevated trajectory compared to the stable trajectory.A similar result was found using the Roy PMM, but this was statistically insignificant.Being female, unmarried or a smoker, increases the odds of following the elevated BMI trajectory compared to the stable overweight trajectory.Older age, being female and smoking increases the odds of following the decreasing BMI trajectory.Being younger, female, unmarried or a smoker increases the odds of following the increasing BMI trajectory.Females have a significantly lower probability of following the stable overweight trajectory compared to all other trajectories.
All ORs for the elevated BMI trajectory are statistically insignificant in the complete case analysis where they are statistically significant in other models.The point estimates are generally as expected but they are imprecisely estimated.This is likely to be due to the low probability of following this trajectory when estimated using complete cases.

Risk of type 2 diabetes mellitus (T2DM)
Table 5 shows the hazard ratios (HRs) for T2DM for each trajectory compared to the stable overweight trajectory, using the BCH weights from the GMMs with different missing data assumptions.Results from the standard GMM are not meaningfully different from those in the previous study, which used data only until wave 7.
Using the complete case analysis, the HRs for T2DM compared to the stable overweight category are all smaller than they are using the standard GMM analysis but the patterns show the same story.This is also true for the multiple imputation analysis, which provides very similar HRs to that of the complete case analysis.
When using the Diggle-Kenward model or the Roy PMM, the HRs for T2DM for the elevated BMI trajectory compared to the stable overweight trajectory are lower than in the standard GMM, particularly in the Roy PMM.The HR for the increasing BMI trajectory compared to the stable overweight group remains similar in the Roy PMM compared to the standard GMM.It is slightly higher for the Diggle-Kenward model, but this HR is statistically insignificant with very wide confidence intervals.
The HR for the decreasing BMI trajectory compared to stable overweight is substantially lower in both the Diggle-Kenward and Roy PMM.In particular, the Diggle-Kenward model estimates a HR < 0.001 due to  estimating a hazard of T2DM of < 0.001 for the decreasing BMI trajectory.This HR is in the opposite direction for the Diggle-Kenward and Roy PMM compared to the other methods, indicating a lower risk for the decreasing BMI trajectory compared to stable overweight, rather than an increased risk based on the other models.

Discussion
This study estimated five different GMMs, each treating missing data in a different way.The standard GMM ignored any missing data, one used only complete cases, one use multiple imputation to impute missing BMI values and the Diggle-Kenward and Roy PMM explicitly modelled the missingness.The BMI trajectories estimated in these GMMs were then used as predictors in DTSAs to predict HRs for T2DM.
Although the same four trajectories (stable overweight, elevated BMI, decreasing BMI and increasing BMI) were identified using each of the different models, the probabilities of following these trajectories differed substantially between models.This is important because it means that estimated trajectories for an individual could be very different when using the different models.
The complete case GMM has a very similar mean probability for the stable overweight trajectory as the standard GMM, however, the mean probability for the elevated BMI group is much lower in the complete case analysis.This suggests that individuals with missing data had a higher probability of following the elevated BMI trajectory and that when these individuals were removed from the analysis, this trajectory was the most effected.
When BMI values were imputed using multiple imputation, these tended to be BMI values in later waves, more often than earlier waves.Since the mean probability of the stable overweight trajectory is lower and all other trajectories higher in the multiply imputed model than in the standard GMM, this suggests that individuals with more missing values are likely to be those with a higher probability of following the other trajectories, in particular the decreasing BMI trajectory.
Both the Diggle-Kenward Selection model and the Roy PMM directly model the missingness, rather than impute data for missing values.Both provide a smaller probability of following the stable overweight trajectory, compared to the standard GMM, particularly the Diggle-Kenward model.Both estimate mean probabilities for the decreasing BMI trajectory substantially higher than the other models (> 31%).After accounting for dropout, the probability of following a decreasing BMI is substantially increased, again suggesting that individuals with a decreasing BMI may be more likely to dropout.
The multiple imputation model produce HRs for T2DM that are much lower than the standard GMM in all trajectories when comparing to the stable overweight trajectory.This suggests that accounting for missing www.nature.com/scientificreports/preferred missing data method when estimating BMI trajectories, given that different assumptions might be appropriate in different circumstances.The appropriate method of accounting for missing data, or indeed whether it is necessary, will depend on both the data used in estimating the trajectories and the scenario in which the trajectories are used.The difference in T2DM HRs found in this study also suggests that these assumptions will be important for subsequent CEA that incorporate BMI trajectories in their estimation.Further research is required to determine the extent to which CEAs are influenced by the different missing data assumptions.The results from this study could also apply to estimated trajectories of other health measures; many health trajectories are often included in CEA and missing data is often a problem in their estimation.Results from this study could have substantial implications for a wide range of estimated trajectories used in CEAs.

Limitations
The main limitation of this study is that it is not known if dropout is due to loss of follow up or death.In a sample of older adults, this is likely to be a greater problem than it would be for the general population.This means that the multiple imputation could be quite misleading, since it will impute BMI values regardless of whether or not a person has died.There is however, no mortality data after wave 6 and so it is impossible to tell whether the participant is missing due to loss of follow up or death.A previous study that investigated the effect of BMI trajectory on mortality using the same data, albeit with fewer waves, found no influence of BMI trajectory on overall mortality 3 .This suggests that the level of mortality does not differ significantly between the distinct trajectories and therefore the influence of any missing data due to mortality is assumed to have minimal effect.
The MNAR models (Diggle-Kenward and Roy PMM) only account for final dropout and not for any intermittent missing data where participants return in a later wave.This could influence the trajectories estimated by these models is the intermittent missing data is not MAR.Future research could investigate this further, potentially by extending the Roy PMM to account for missing data at each time-point.
The ELSA contains information on an overwhelmingly (> 97%) white sample and research using a more diverse sample could shed light on any differences which occur between individuals of different ethnicities.

Conclusion
Missing data can substantially influence estimations of BMI trajectories.Therefore, when using BMI trajectories to inform subsequent analysis or policymaking, missing data should be considered and accounted for.More research is needed to examine the extent to which accounting for missing data might influence the costeffectiveness of policies, for example, weight management interventions.

Figure 1 .
Figure 1.BMI trajectories estimated using standard GMM, multiple imputation, complete case, Diggle-Kenward selection model and Roy pattern mixture model (from top left to bottom right).
In this approach, there are T − 1 (in this case 5) binary missing data indicators, d 1 , …, d 5 where d t is 1 for the first observation after dropout and d t is 0 at all other time points.
1234567890) Scientific Reports | (2024) 14:17740 | https://doi.org/10.1038/s41598-024-68764-2www.nature.com/scientificreports/ at wave 1 of ELSA, 973 individuals were removed because they were younger than 50 years.Of those remaining, 43 had missing values for one or more of the baseline covariates (age, sex, marital status and smoking status) and a further 103 individuals were removed because they had values of BMI considered underweight (BMI < 18.5) in at least one wave, leaving 10,296 individuals.Participants were included in the final sample only if they have at least one valid BMI measure in at least one of the ELSA nurse visits resulting in a further 1077 individuals being removed and a final sample of 9219.

Table 1 .
Summary statistics for sample in original data, imputed data and complete case.

Table 2 .
Patterns of missing BMI data.*All other patterns appear for less than 1% of observations and account for less than 9% of all observations collectively.O-represents observed/non-missing data, X-represents missing data.

Table 3 .
Model fit statistics for the standard GMM of BMI trajectories with 1-5 latent components.