Validating risk factor and chronic disease projections in the Future Adult Model

Over the past several decades, the United States has experienced a dramatic rise in obesity rates, due to both a rightward shift of the body mass index (BMI) distribution and a pushing out of the right tail. This shift has led to increases in obesity-related chronic diseases, particularly diabetes, as well as impacts on longevity, medical expenditures, and quality of life. Microsimulation modeling is a potentially useful tool for assessing the impacts of policies targeting this epidemic, but reliably assessing policies requires a model that performs well in projecting health risk factors and disease outcomes. This research assesses the out-of-sample and external validity of a microsimulation model of the U.S. adult population.

There are two research questions addressed in this analysis: 1. How well does the Future Adult Model (FAM) perform in projecting BMI and diabetes over a ten-year horizon compared to the host data? 2. How well do the microsimulation model's predictions compare to external surveillance data of BMI and diabetes?

FAM is an economic-demographic microsimulation model of the United States population over the age of 25. For this validation exercise, all Markov transition models are estimated using the 1999-2007 waves of the Panel Study of Income Dynamics (PSID). The simulation is then run from 2007 to 2017. For internal consistency, simulated outcomes in 2017 are compared to actual PSID outcomes. Population means and selected quantiles are compared between the simulation and the host data. Receiver operating characteristic (ROC) curves are used to assess model performance for binary outcomes using the area under the curve (AUC) statistic. For external validation, simulated outcomes for 2007-2017 are compared to the Behavioral Risk Factor Surveillance System (BRFSS), a large, nationally representative survey of the United States population.

After ten years of simulation, FAM BMI projections for men and women compare well to both PSID and BRFSS data throughout much of the distribution. The 99th percentile differs significantly, with FAM underestimating the right tail of the BMI distribution. Individual assignment of obesity and severe obesity performs well using AUC as a criterion. Initial differences in diabetes prevalence between PSID and BRFSS data are preserved in FAM projections. FAM is initially 1.9 percentage points below BRFSS for women 25 and older and is 1.6 percentage points below BRFSS for women 35 and older after ten years of simulation. Men 25 and older are 1.2 percentage points lower initially and are 0.8 percentage points lower after ten years of simulation. Individual assignment of diabetes incidence does not perform as well as clinical models with richer predictors. Researchers using FAM should be cognizant of these strengths and limitations of the microsimulation model.

JEL classification: C6, I1, J1
DOI: https://doi.org/10.34196/ijm.00225

conditions, functional limitations, and mortality), increased medical expenditures and utilization, other economic impacts (changes in employment, disability), and declines in subjective well-being.
Many strategies have been suggested to tackle this challenge, including lifestyle interventions that target diet and exercise, medical procedures like gastric bypass surgery, pharmaceuticals that lead to weight loss, taxes on particular foods, menu labeling to improve food choices, and more. These strategies, in turn, are often implemented on a small scale or within a randomized controlled trial. To translate the effects to a larger scale, analysts increasingly turn to microsimulation models, as these models can account for heterogeneous individuals and treatment effects. Doing this credibly requires a microsimulation model that performs well, but also one whose limitations are clearly conveyed. With respect to body mass index, a model that performs well needs to project not just the mean BMI well, but also capture the distribution, as policies often target those in the right tail, such as the obese or severely obese. This is not an uncommon requirement for microsimulation models, as the distribution of continuous variables is often relevant to policy questions. Though microsimulation models are often assumed to be "black boxes" or "crystal balls" for projecting the future, validation of past performance and transparency about limitations can avoid overselling the capabilities of these models while better informing policy decisions.
There are two purposes for this paper. The first is to describe an approach to validation used with the Future Adult Model (FAM), a microsimulation model of the United States population over age 25. This approach is potentially applicable to other microsimulation models with similar data availability. The second goal is the validation itself, highlighting both where the model performs well and areas where further research can improve FAM's projections.
Section 2 briefly describes the model and its source data. A few approaches to out-of-sample validation are presented in Section 3. External validation is described in Section 4. Section 5 concludes.

Data and methods
FAM is described in a detailed technical appendix (https://healthpolicy.box.com/v/FAM-appendix-2018), so only the core functions of the model are summarized here. The host data for the simulation is the Panel Study of Income Dynamics (PSID) (McGonagle et al., 2012). While the PSID is designed to be nationally representative, it faces the challenges common to all longitudinal surveys, such as imperfect response rates and sample attrition. We typically use PSID data from 1999-2017 as the basis for FAM. These data are used to estimate transition functions and as the initial data for the simulation. The first-order Markov transition models are the engine used in "aging" the simulants. The transitions are a mixture of continuous, binary, and categorical outcomes, with a time scale that mimics the two-year structure of the data. FAM simulates dozens of variables for individuals, including health risk factors, chronic conditions, functional limitations, mortality, life events, economic outcomes, medical cost and use, and government transfers.

The causal pathway of FAM is that health risk factors (such as body mass index and smoking) impact health outcomes (such as diabetes incidence or the number of functional limitations), which in turn impact a set of economic outcomes (such as medical expenditures and retirement). BMI and diabetes are two critical outcomes, but they also enter as predictors in many other models. These pathways are summarized in Table 1. Since both BMI and diabetes affect so many outcomes, it is critical to understand the quality of their projections, as errors will propagate. The distribution of BMI matters for these outcomes, as high BMI increases the risk of chronic illness, so ideally the simulation will capture the distribution of BMI, not just the mean.
Note that both of these measures are reported by survey respondents, not based on clinical measurements or administrative records. Self-report data are known to suffer from biases in reporting. Individuals often under-report weight and over-report height, leading to BMI values that are too low compared to their measured values. Similar challenges face measures of chronic conditions, though chronic conditions are more challenging to assess in an interview setting. Administrative medical claims are not a panacea, as they have their own challenges (Clair et al., 2017). With that said, the measures used in FAM are commonly used in surveys such as PSID, the Behavioral Risk Factor Surveillance System (BRFSS), the Medical Expenditure Panel Survey (MEPS), the Medicare Current Beneficiary Survey (MCBS), the National Health Interview Survey, and others. Though they do not provide the true population prevalence of chronic conditions or BMI distribution, they are still useful in many ways. Within FAM, medical expenditures are estimated using MEPS and MCBS, which then allows the translation of these self-reported measures into predicted medical expenditures.
FAM estimates the two-year transition of BMI. BMI is defined as an individual's mass in kilograms divided by the square of their height in meters. Clinically, it is often used as a predictor of subsequent clinical outcomes, such as risk of diabetes or mortality. The transition model for BMI is estimated in natural logs, allowing the interpretation as a percent change in BMI. A Box-Cox analysis of BMI transitions in the PSID data suggests a λ parameter between 0 and −0.5, depending on specification. The transition model is a regression of log BMI on two-year lagged and static predictors.

The transition model estimates for BMI, estimated separately for men and women, are shown in Table 2. These are reduced-form models that include both time-varying (age, log BMI with several knots, marital status) and static (race/ethnicity, education, characteristics from early age) predictors. Time-varying predictors enter as two-year lagged variables, so the model can be interpreted as a percent change in BMI over a two-year period. Within the simulation, a random draw from the root-mean-square error term of this model adds a stochastic element. BMI appears in other transitions either in logs or in clinically relevant BMI categories (under 25, 25 to 30, 30 to 35, 35 to 40, or over 40).
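The BMI transition step described above can be sketched as follows. This is a minimal illustration, assuming a linear model in log BMI with a normally distributed error whose standard deviation equals the model's RMSE. The coefficient values, the RMSE, the function names, and the treatment of category boundaries are hypothetical placeholders and are not the estimates reported in Table 2; the spline knots in lagged log BMI are omitted for brevity.

```python
import math
import random

# Hypothetical coefficients for illustration only; not the Table 2 estimates.
INTERCEPT = 0.25
COEF_LOG_BMI = 0.93   # persistence of log BMI across a two-year step
COEF_AGE = -0.0004    # small age effect
RMSE = 0.07           # assumed root-mean-square error of the log-BMI model


def next_bmi(bmi, age, rng):
    """Advance BMI by one two-year step, working in the log scale."""
    mean_log = INTERCEPT + COEF_LOG_BMI * math.log(bmi) + COEF_AGE * age
    shock = rng.gauss(0.0, RMSE)  # stochastic element drawn from the error term
    return math.exp(mean_log + shock)


def bmi_category(bmi):
    """Map BMI to the categories used as predictors in other transitions.

    Assumption: a value exactly on a boundary falls in the higher category.
    """
    cutoffs = [(25, "under 25"), (30, "25 to 30"), (35, "30 to 35"), (40, "35 to 40")]
    for upper, label in cutoffs:
        if bmi < upper:
            return label
    return "over 40"


rng = random.Random(2007)
bmi = 31.0
path = [bmi]
for step in range(5):  # five two-year steps span 2007 to 2017
    bmi = next_bmi(bmi, age=45 + 2 * step, rng=rng)
    path.append(bmi)
```

Because each step multiplies BMI by the exponential of a symmetric shock, a single error variance governs the whole distribution, which is one reason a model of this shape may struggle to reproduce an expanding right tail.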
FAM's diabetes model is a probit model of two-year diabetes incidence. As an incidence model, only individuals who did not have diabetes in the previous period are included in the estimation.
Here, we adjust for time-varying covariates (age, smoking status, exercise, and the log of the previous BMI in splines with several knots) and static characteristics (race/ethnicity, education, gender, childhood SES, childhood health). Marginal effects of this model are shown in Table 3. Consistent with the wording of the question in the PSID ("Has a doctor or health professional ever told you that you had diabetes or high blood sugar?"), diabetes is treated as an absorbing state variable for the remainder of a simulant's life.
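The incidence logic described above (a probit for two-year incidence, applied only to those without diabetes, with diabetes treated as absorbing) can be sketched as follows. The coefficients and function names are hypothetical and chosen only to give plausible magnitudes; they are not the marginal effects in Table 3, and most of the covariates listed above are left out.

```python
import math
import random
from statistics import NormalDist

# Hypothetical probit coefficients for illustration; not the Table 3 estimates.
INTERCEPT = -6.5
COEF_LOG_BMI = 1.2
COEF_AGE = 0.01


def incidence_probability(bmi, age):
    """Two-year probability of incident diabetes from a probit index."""
    index = INTERCEPT + COEF_LOG_BMI * math.log(bmi) + COEF_AGE * age
    return NormalDist().cdf(index)  # standard normal CDF


def update_diabetes(has_diabetes, bmi, age, rng):
    """Apply one two-year step; once diabetes occurs it never reverts."""
    if has_diabetes:
        return True  # absorbing state
    return rng.random() < incidence_probability(bmi, age)


rng = random.Random(1999)
status = False
for step in range(5):  # five two-year steps span a ten-year horizon
    status = update_diabetes(status, bmi=33.0, age=50 + 2 * step, rng=rng)
```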
In typical FAM use, all recent waves of PSID data are pooled for estimation. However, for the analyses presented here we restrict the data used for estimation to PSID respondents from 1999 to 2007.

Out-of-sample validity
We explore FAM's out-of-sample validity in two ways. The first focuses on population-level statistics for the two outcomes of interest. The second uses Receiver Operating Characteristic (ROC) curves to assess how well FAM classifies individuals compared to their actual outcomes on a 10-year horizon.

Population-level statistics are shown in Table 4. For women, mean BMI is approximately the same between FAM projections (27.7) and PSID respondents (27.6) for 2017. For men, projected mean BMI is 28.5 in FAM, compared to 28.3 in the PSID. Most percentiles also compare well, with overlapping confidence intervals. In the right tail, such as at the 95th and 99th percentiles, FAM and the PSID begin to deviate. For women, the FAM projection at the 99th percentile is 3.6 BMI points lower than the PSID value. For men, we see a similar story, though the discrepancy is around 4.2 BMI points at the 99th percentile. This suggests that the 1999-2007 data used for estimating the transition models did not accurately capture the continual expansion of the very high BMI population in the US.
Diabetes prevalence is lower in FAM than in the PSID for both women and men. FAM projects 12.4% prevalence for women 35 and older, compared to the 14.6% observed in the PSID. Similarly, FAM projects 14.9% diabetes prevalence for men compared to 16.7% in the 2017 PSID. This is possibly a consequence of under-projecting the right tail of the BMI distribution. Alternatively, it could reflect changing diagnostic practices in the US between 2007 and 2017. A time trend would not be captured with FAM's approach to transition model estimation.
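The population-level comparison above amounts to differencing means and selected percentiles between simulated and observed samples. A minimal sketch follows, assuming unweighted data and a nearest-rank percentile definition; the actual analysis presumably uses survey weights and bootstrapped confidence intervals, which are not reproduced here. The data are toy values, not FAM or PSID output.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def compare_distributions(simulated, observed, percentiles=(5, 25, 50, 75, 95, 99)):
    """Return simulated-minus-observed gaps for the mean and chosen percentiles."""
    gaps = {"mean": sum(simulated) / len(simulated) - sum(observed) / len(observed)}
    for p in percentiles:
        gaps[f"p{p}"] = percentile(simulated, p) - percentile(observed, p)
    return gaps


# Toy BMI values in which only the right tail differs between the samples.
simulated = [22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 33.5, 36.0, 40.0, 46.0]
observed = [22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 33.5, 36.0, 41.0, 50.0]
gaps = compare_distributions(simulated, observed)
```

In this toy case the median gap is zero while the 99th percentile gap is negative, the same qualitative pattern reported for FAM: close agreement through most of the distribution with a shortfall in the right tail.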

Individual assignment
In order to assess FAM's performance in classifying individuals, we use Receiver Operating Characteristic curves, as proposed in a validation of a cardiovascular disease microsimulation model (Pandya et al., 2017). In this analysis, each 2007 PSID respondent is simulated 5,000 times (50 sets of bootstrapped transition models, each with 100 Monte Carlo replications). For the outcome of interest, the fraction of simulations that resulted in that outcome is calculated. All simulants are ordered, smallest to largest, by these fractions. At each level of the fraction, the ROC analysis compares the prediction to the observed outcome for the individuals in the PSID survey data, assessing the true positive rate and false positive rate. The ROC analysis then shows the trade-off between "true positives" and "false positives" for different thresholds of predicted prevalence after ten years. We assess four outcomes: having a BMI under 25, having a BMI over 30, having a BMI over 40, and 10-year incident diabetes for those who did not have diabetes in 2007. A summary performance measure is the "area under the curve" (AUC in the figures). Interpretation of the value of the AUC statistic varies by context, but an AUC of 1.0 indicates perfect classification and 0.5 is no better than random chance.

For females, the AUC for predicting a BMI under 25 is 0.91 (Figure 1), while the AUC for predicting a BMI over 30 is 0.92 (Figure 2). For predicting a BMI over 40, the AUC is 0.95 (Figure 3).

Predicting ten-year incidence of diabetes does not perform as well as predicting obesity or extreme obesity. The AUC for females is 0.71 (Figure 7) and the AUC for males is 0.73 (Figure 8). In addition to giving a sense of FAM's classification ability, this validation method is also useful for assessing the predictive power gained by incorporating additional predictors.
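The individual-assignment procedure above can be sketched as follows: each person's score is the fraction of replications in which the outcome occurred, and the AUC summarizes how well those scores rank people who actually had the outcome above those who did not. This sketch computes the AUC with the pairwise (Mann-Whitney) formulation, which gives the same value as the area under the ROC curve traced across thresholds. The function names and toy data are illustrative assumptions, not the software used in the paper.

```python
def simulated_fraction(replications):
    """Share of replications (a list of 0/1 values) in which the outcome occurred."""
    return sum(replications) / len(replications)


def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: simulated fraction with the outcome for each person.
    outcomes: observed 0/1 outcome for the same people.
    A tie between a positive and a negative counts as half a concordant pair.
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Toy example: four people with four hypothetical replications each.
replications = ([1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0])
fractions = [simulated_fraction(r) for r in replications]
observed = [1, 0, 1, 0]
toy_auc = auc(fractions, observed)
```

The pairwise loop is quadratic in sample size, which is fine for illustration; for a sample the size of the PSID, a rank-based computation would be a more practical choice.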

External validity
The Behavioral Risk Factor Surveillance System is a survey designed to collect state-specific data on health risk behaviors, chronic diseases, and other health-related outcomes related to the leading causes of death and disability in the United States (Centers for Disease Control and Prevention and others, 2018). The sample size is large, with over 450,000 observations in 2017, and designed to be nationally representative. Like the PSID, BRFSS faces challenges with response rate. The self-reported measures in BRFSS are comparable to PSID. Diabetes status is asked in a similar manner ("Has a doctor, nurse, or other health professional EVER told you that you had diabetes?"). BMI is calculated from self-reported height and weight.
Before comparing FAM projections with BRFSS, it is valuable to confirm that PSID and BRFSS showed comparable BMI and diabetes summary statistics for those twenty-five and older in 2007. The twenty-five and older PSID respondents are the initial cohort in the validation exercise. Selected sample characteristics are presented in Table 5 by gender. Mean BMI is higher in BRFSS than in PSID by 0.3 units for females and 0.2 units for males. At the 95th percentile, females in BRFSS are within 0.1 units of women in PSID, with men in BRFSS 0.6 units higher. At the 99th percentile the two samples are a bit further apart, with women in PSID 0.5 BMI units higher than BRFSS and men in PSID 1.2 units lower. Female PSID respondents report slightly lower BMI at the fifth through twenty-fifth percentiles. Diabetes rates also differ, with lower rates in PSID by 1.9 percentage points for women and 1.2 percentage points for men.
The BMI distributions for females 35 and older in FAM and BRFSS are shown in Figure 9. Examining the two BRFSS distributions, one sees the curve shifting to the right over the decade and the right tail pushing out. Qualitatively, FAM seems to capture this behavior. Figure 10 illustrates the relative cumulative distribution between FAM and BRFSS in 2017 (left panel). The relative density functions are shown in the right panels. The Kullback-Leibler entropy is 0.018 [0.016, 0.020] and the median relative polarization index is 0.042 [0.035, 0.049]. Table 6 compares these distributions in greater detail for 2017. Mean BMI for women is 0.5 BMI units lower in FAM projections than in 2017 BRFSS respondents. At the 95th percentile, FAM projections are 0.7 BMI units lower than BRFSS. At the 99th percentile, the distributions are further apart, with FAM 2.0 BMI units lower.

The BMI distributions for males 35 and older are shown in Figure 11. After 10 years of FAM projections, the FAM distribution appears to have more dispersion than BRFSS. Figure 12 shows the relative cumulative distribution and relative density functions. The male distributions are compared in greater detail in Table 6. Here, we see that mean BMI in FAM is 0.3 BMI units below BRFSS. At the 75th percentile FAM is 0.1 BMI units below BRFSS, at the 95th percentile FAM is 1.3 BMI units lower, and at the 99th percentile the distributions are 4.2 BMI units apart.
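To make the entropy measure above concrete, the following sketch computes a discrete Kullback-Leibler divergence between two binned BMI distributions. It is an approximation under stated assumptions: the paper's statistic is presumably computed from the relative density with dedicated relative-distribution software, whereas this version uses fixed histogram bins, ignores survey weights, and uses hypothetical bin edges and toy data.

```python
import math


def histogram_shares(values, edges):
    """Share of values in each bin [edges[i], edges[i + 1]).

    Values outside the outermost edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence sum(p * log(p / q)), natural log.

    Bins with p == 0 contribute nothing; q is floored at eps to avoid
    division by zero when the reference distribution has an empty bin.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical bin edges and toy samples, not FAM or BRFSS data.
edges = [0, 25, 30, 35, 40, 100]
fam_shares = histogram_shares([22, 24, 27, 28, 31, 33, 36, 41], edges)
brfss_shares = histogram_shares([22, 26, 27, 29, 31, 34, 38, 44], edges)
divergence = kl_divergence(fam_shares, brfss_shares)
```

A value of zero means the binned distributions are identical, and larger values indicate greater divergence, so small values such as the 0.018 reported for women indicate distributions that are close overall even when the extreme tail differs.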
Diabetes prevalence for the 35 and older population is also shown in Table 6. FAM is 1.6 percentage points lower than BRFSS for women and 0.8 percentage points lower for men. This is consistent with the initial differences observed in 2007 for the twenty-five and older PSID compared to BRFSS.

Discussion
In estimating transition models from 1999-2007, then applying them to simulate 2007-2017, the analysis shown here is analogous to a period life table approach. The key assumption is that the 1999-2007 observed transitions will continue to hold through the forecasting period. For the outcomes of interest, one can imagine situations in which this assumption is violated, such as changes in diagnostic procedures for diabetes or the adoption of public health campaigns targeting obesity. Consequently, deviations from what was observed can be driven by both simulation limitations and real-world changes.

In projecting BMI, the 2017 mean BMI in FAM is slightly higher than in the 2017 PSID (0.1 BMI units for women, 0.2 for men) and slightly lower than in the 2017 BRFSS (0.5 BMI units for women, 0.3 for men). For the right tail of the distribution, 2017 FAM BMI values are lower than both the PSID and BRFSS at the 95th percentile and much lower at the 99th. This suggests that FAM performed well for much of the distribution, but not for extreme cases. The ROC analyses for BMI are likely driven by the persistence of BMI. For example, those with a BMI over 40 are likely to have a BMI over 40 ten years later.
For consumers of the model interested in BMI, this validation exercise suggests that projections of many groups of interest are plausible, such as the fraction overweight, obese, or severely obese. However, the distribution of BMI within the severely obese category in FAM under-predicted the BMI values observed in both PSID and BRFSS by two to four BMI units. If the actual BMI values are of interest for those cases, then the BMI modeling approach would merit attention, perhaps including calibrating to an external source.
Diabetes rates projected with FAM are lower than both 2017 PSID and 2017 BRFSS. The lower rates in 2007 PSID compared to 2007 BRFSS are preserved in FAM. Individual assignment of incident diabetes does not perform as well as models with more information, as shown below. The FAM diabetes model controls for age, race, education, childhood characteristics, smoking, exercise, and BMI. This yields AUC statistics of 0.71 for twenty-five and older women, 0.72 for twenty-five and older men. Notably, the transition model does not currently incorporate family history, pharmaceutical use, biometrics (blood pressure, waist measurements), biomarkers (fasting glucose, cholesterol), or diet. In a UK sample of 40- to 79-year-olds, a risk model based on self-reported data on medication, BMI, family history, age, nutrition, and exercise yielded an AUC of 0.763 for predicting incident diabetes over 4.6 years (Simmons et al., 2007). In a US-based study of 25- to 64-year-olds, the San Antonio Heart Study, incidence of diabetes was assessed in several different models incorporating a variety of clinical tests. These models had AUC statistics ranging from 0.775 to 0.859. The clinical multivariate models incorporated age, sex, ethnicity, fasting glucose level, systolic blood pressure, HDL cholesterol, BMI, and family history of diabetes (Stern et al., 2002). In a Finnish study of 10-year incidence of drug-treated diabetes amongst 35- to 64-year-old individuals, a model adjusting for age, BMI, waist circumference, use of blood pressure medication, a history of high blood glucose (including those with diabetes, but not taking medication), physical activity, and diet yielded an AUC of 0.860 (Lindström and Tuomilehto, 2003).
For consumers of the model interested in diabetes, this validation exercise suggests that projections of diabetes follow the same trend as in BRFSS. The initially lower rates in PSID are maintained over the decade of projection. If possible, additional information on family history, medications, diet, or other clinical measures would likely improve the specificity and sensitivity of FAM. The AUC statistics for models that include additional information give a sense of how well FAM might perform with more information.
Overall, these validation exercises are reassuring. The PSID host data compare well with BRFSS. Trends, such as the shift right in BMI and the expansion of the right tail of the BMI distribution, are captured with fairly simple transition models that do not assume a temporal trend. Individual-level predictions are concordant for various classifications of BMI, likely due to the persistence of BMI for individuals, and diabetes incidence classification is reasonable given the lack of clinical information.