Development and validation of a prediction equation for body fat percentage from measured BMI: a supervised machine learning approach

Body mass index (BMI) is widely used but is a poor predictor of adiposity in populations with excessive fat-free mass. Rigorous predictive models that are validated specifically in a nationally representative sample of the US population and that could be used for calibration purposes are needed. The objective of this study was to develop and validate prediction equations for body fat percentage (BF%) obtained from dual-energy X-ray absorptiometry (DXA) using BMI and socio-demographics. We used National Health and Nutrition Examination Survey (NHANES) data from 5931 and 2340 adults aged 20 to 69 in 1999–2002 and 2003–2006, respectively. A supervised machine learning approach using ordinary least squares and a validation set was used to develop the models and select the best ones based on R2 and root mean square error. We compared our findings with other published models and used our best models to assess the amount of bias in the association between predicted body fat and elevated low-density lipoprotein (LDL). Three models included BMI, BMI squared, age, gender, education, income, and interaction terms; they produced R-squared values of 0.87 and yielded the smallest standard errors of estimation. The amount of bias in the association between predicted BF% and elevated LDL from our best model was −0.005. Our models provided strong predictive abilities and low bias compared to most published models. Their strengths lie in their simplicity and ease of use in low-resource settings.

www.nature.com/scientificreports/ are obese compared to BIA [15][16][17]. In fact, BIA generally tends to underestimate fat mass and percentage body fat when compared to DXA [15][16][17]. However, the DXA method is costlier, more invasive (it uses low-dose X-rays) and often requires more technical expertise than BIA [15][16][17]. Therefore, given that DXA is highly accurate in assessing fat mass and percentage body fat but costlier than BIA, it is imperative to find ways to predict DXA percentage body fat without having to use DXA. One way to predict adiposity better and more cost-effectively is to create an equation for body fat percentage (BF%) obtained via DXA that uses a cost-effective but imperfect measure such as BMI as its main predictor, along with other covariates. Although BMI has limited predictive abilities (due to measurement error), using BMI in the prediction equation is ideal since it is still the most widely used measure of adiposity, is easily calculated and is cost-effective. In other words, this would be akin to correcting an imperfect measurement of adiposity using a predictive model. The choice of other covariates in the model is also important, as it can help improve the predictive abilities of the equation. However, some potential covariates, although they can improve prediction accuracy, may not be readily available in low-resource settings and secondary datasets. Several equations relating BF% and BMI and using various covariates, including age, sex, handgrip and waist circumference, have been published (Tables 1 and S1) [18][19][20][21]. These models did not use rigorous predictive methodologies and have not been developed and validated specifically in a nationally representative sample of the US population. More recently, Stevens et al. published several similar models using the National Health and Nutrition Examination Survey (NHANES) data and rigorous statistical learning methodologies but used BIA in their prediction model 22, an adiposity measure that may not be readily available in low-resource settings. Examples of variables that can be difficult to obtain and are not always available in publicly available datasets include handgrip, triceps skinfold, subscapular skinfold, and bioelectrical impedance (BIA) measures. The inclusion of such variables in equations can hinder their use, especially in low-resource settings. Parsimonious prediction models for DXA-BF% that (1) are calibrated to a U.S. general population, (2) include only variables (e.g., BMI and socio-demographics) that are easily accessible in low-resource settings, (3) lead to minimal or low bias when used in association studies in place of the measured DXA BF%, and (4) publish their equations and coefficients for wide use are lacking.
We designed this study to fill this gap by developing and validating a parsimonious model of DXA-measured BF% that uses only BMI and socio-demographics, built within a supervised machine learning framework in a nationally representative sample of the US population. Additionally, this model will be available online for widespread use by scientists and clinicians.

Methods
Study population. Study participants were derived from the National Health and Nutrition Examination Survey (NHANES) 1999-2006 23. Briefly, NHANES is a nationally representative survey designed to assess the health and nutritional status of adults and children in the United States. The interviews, which collect demographic, socioeconomic, dietary, and health-related information, and the physical examination are conducted on a representative sample of about 5000 individuals each year. In the current study, our sample included all male and non-pregnant female participants aged 20-69 years of Hispanic, Caucasian or African descent. Only observations with complete data on all variables studied were included in the analytical sample. The selection yielded 5931 subjects in NHANES 1999-2002 and 2340 subjects in NHANES 2003-2006 23 (Fig. 1).
"Other race" means races that do not include black, white, or Hispanic. Incomplete data means having missing data in either BMI, DXA, Age, Gender, Race, Education, or Income data. Data with missing LDL values were kept in NHANES 1999-2002 but excluded in the NHANES 2003-2006 dataset.
Measurements and socio-demographics. Standing height was measured to the nearest 0.1 cm using a stadiometer. Weight was measured to the nearest 0.1 kg using a digital weight scale 23. BMI was calculated using the following equation: BMI (kg/m^2) = weight (kg) / height^2 (m^2). BF% was estimated by DXA using the Hologic 23. In NHANES 2003-2006, the BF% was measured at the android area, which was defined as the area between the waist and the mid-point of the lumbar spine and the top of the pelvis 23. In NHANES 1999-2002, there were two measurements for BF% available: total BF% and subtotal BF% (subtotal = total minus head). We used total BF% instead of subtotal BF% so that our results would be comparable to the equation of Liu et al. 21. In NHANES 2003-2006, the measurements changed from total/subtotal BF% to android/gynoid BF%. We used android BF% data since android fat is associated with an increased risk of obesity-associated diseases such as diabetes mellitus, whereas gynoid fat is associated with a decreased risk of diabetes mellitus 23. To evaluate the amount of bias that would result from using the predicted BF%, we assessed the association between predicted BF% and low-density lipoprotein (LDL). We chose this relationship because it is well established in the literature 24,25. The measurement techniques for LDL (mg/dL) have been described elsewhere 23. We used the Adult Treatment Panel III classification of LDL levels and defined LDL ≥ 130 mg/dL as elevated 26.
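The two derived quantities in this paragraph, BMI and the elevated-LDL indicator, can be illustrated with a minimal Python sketch. The function names and the example values are hypothetical and are not taken from the study's SAS code; only the formula and the 130 mg/dL cut-off come from the text.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Return BMI in kg/m^2 from weight (kg) and standing height (cm)."""
    height_m = height_cm / 100.0  # height is recorded in cm, the formula needs m
    return weight_kg / height_m ** 2


def is_elevated_ldl(ldl_mg_dl: float, threshold: float = 130.0) -> bool:
    """Flag LDL at or above the cut-off used in the text (>= 130 mg/dL)."""
    return ldl_mg_dl >= threshold


# Hypothetical participant, for illustration only.
example_bmi = bmi(weight_kg=70.0, height_cm=175.0)
example_flag = is_elevated_ldl(142.0)
```

The threshold is kept as a parameter so that a different LDL category could be examined without changing the function.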
The following socio-demographics were considered for our model: age (year), gender (male vs. female), race (Hispanic vs. White vs. Black), education (high vs. low), and income (high vs. low). The income variable was obtained by categorizing the poverty-income-ratio (PIR) variable in NHANES as either "under the poverty threshold" or "at or above the poverty threshold". Therefore, low income represented an income under the poverty line and low education represented an education less than high school.
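As an illustration of how these variables could be coded for a regression model, the sketch below builds one row of predictors. It is an assumption-laden example, not the study's code: the class and field names are hypothetical, a poverty-income ratio below 1 is taken to mean "under the poverty threshold", and White, male, high education and high income are arbitrarily chosen as reference categories.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Participant:
    bmi: float                    # kg/m^2
    age: float                    # years
    female: bool
    race: str                     # "white", "black" or "hispanic"
    pir: float                    # poverty-income ratio
    completed_high_school: bool


def encode(p: Participant) -> Dict[str, float]:
    """Return dummy-coded predictors plus BMI-by-SES interaction terms."""
    if p.race not in ("white", "black", "hispanic"):
        raise ValueError("race outside the three groups studied")
    low_income = 1.0 if p.pir < 1.0 else 0.0            # assumed PIR cut-off
    low_education = 0.0 if p.completed_high_school else 1.0
    bmi_sq = p.bmi ** 2
    return {
        "bmi": p.bmi,
        "bmi_sq": bmi_sq,
        "age": p.age,
        "female": 1.0 if p.female else 0.0,
        "black": 1.0 if p.race == "black" else 0.0,      # reference: white
        "hispanic": 1.0 if p.race == "hispanic" else 0.0,
        "low_education": low_education,
        "low_income": low_income,
        "bmi_x_low_education": p.bmi * low_education,
        "bmi_x_low_income": p.bmi * low_income,
        "bmi_sq_x_low_education": bmi_sq * low_education,
        "bmi_sq_x_low_income": bmi_sq * low_income,
    }


# Hypothetical participant, for illustration only.
row = encode(Participant(30.0, 45.0, True, "black", 0.8, True))
```

Which interaction terms enter a given model is decided by the selection procedure, so a fitted equation would use only a subset of these columns.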

Statistical analysis.
To develop the best parsimonious prediction equation for BF%, we undertook the following steps:
1. Data preprocessing: To develop the best prediction model, we used a validation set approach and randomly divided the 1999-2002 NHANES data (n = 5931) into a training set and a validation set, each representing 50% of the total dataset. Because we were interested in developing a model that could be readily used in low-income settings, we prioritized variables that can easily be obtained at the point of care, such as socio-demographics (e.g., age, sex, education, income, race/ethnicity) and BMI. We additionally considered as potential variables the squared transformation of BMI as well as interaction terms between BMI and SES variables and between BMI squared and SES variables. Continuous variables were also visually inspected for normality and were generally found to be normally distributed. We also dichotomized (1) the income variable as high (i.e., at or above the poverty line) and low (i.e., below the poverty line) and (2) the education variable as high (i.e., equal to or above high school level) and low (i.e., less than high school level). In the candidate prediction models, we therefore included dummy variables for "low education" and "low income".
2. Model training: In the training set (i.e., the first random half of the 1999-2002 NHANES dataset, n = 2965), we generated multiple linear models predicting DXA-measured BF% using ordinary least squares and selected the variables using forward, backward, and stepwise selection procedures. In the selection process, we forced a number of variables into the model to ensure that they would be included in the final model: BMI (kg/m^2), BMI squared (kg^2/m^4), age, gender, race, income, and education. The variables that were candidates for selection included interaction terms between SES variables (i.e., education, income) and BMI, and between SES variables and BMI squared. The significance levels for entering and exiting the model were set at 0.2. We used adjusted R2 as the initial criterion to select the best models. If several models had the same adjusted R2, we considered the one with the lowest Akaike's Information Criterion (AIC) to be the better model 27. In general, models with a higher adjusted R squared and lower AIC and BIC were considered better.
3. Model validation: In the validation set (i.e., the second random half of the 1999-2002 NHANES dataset, n = 2966), we first evaluated the performance of the prediction equations developed in the training stage and then compared the models against other published models. To do so, the prediction accuracy of the models was compared using the following metrics: standard error of estimation (SEE), paired t-test, mean of each predicted BF%, and percentage change of means from the measured values. SEE was calculated as [equation not shown]. The percentage change of means from the measured values was calculated as [(mean of predicted BF%) − (mean of measured BF%)] / (mean of measured BF%) × 100%. Models with the lowest SEE and the smallest percentage change of means from the measured values were considered better.
4. Bias assessment: We assessed the amount of bias, or lack thereof, that would occur when using the predicted BF% instead of the measured BF%. To do so, we used the predicted BF% obtained from our equations and the measured BF% and assessed their associations with elevated LDL. The coefficients of the measured and predicted BF% were then compared. The bias was calculated as the difference between the coefficient obtained using the predicted BF% and the coefficient obtained using the measured BF%. We used different samples for this analysis, NHANES 2001-2002 (n = 1383) and NHANES 2003-2006 (n = 2340), to ensure the robustness and reproducibility of our findings (Fig. S1).
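The validation-set workflow in the steps above (random 50/50 split, ordinary least squares fit, then SEE and percentage change of means) can be sketched in plain Python. This is an illustrative sketch, not the study's SAS procedure: all function names are hypothetical, the toy design row contains only an intercept, BMI and BMI squared, and because the SEE formula is not preserved in the text, SEE is assumed here to be the root of the mean squared residual.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_half(rows: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Randomly split rows into a training half and a validation half."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def design_row(bmi: float) -> List[float]:
    """Toy design row: intercept, BMI, BMI squared (covariates omitted)."""
    return [1.0, bmi, bmi ** 2]


def fit_ols(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coefs: Sequence[float], row: Sequence[float]) -> float:
    return sum(c * v for c, v in zip(coefs, row))


def see(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """ASSUMED definition: square root of the mean squared residual."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def pct_change_of_means(measured: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """(mean predicted - mean measured) / mean measured * 100."""
    mean_m = sum(measured) / len(measured)
    mean_p = sum(predicted) / len(predicted)
    return (mean_p - mean_m) / mean_m * 100.0
```

In use, the coefficients would be estimated on the training half only, and `see` and `pct_change_of_means` would be computed on predictions for the validation half. Other SEE conventions divide by the residual degrees of freedom instead of n, which gives slightly larger values.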
All analyses were conducted using SAS version 9.4 software (Cary, NC) (See "Online Appendix" Sect. 1 for procedure details).

Results
Sample characteristics.
Model training. We first generated 20 models in the training set and selected the top three models on the basis of adjusted R2, AIC and BIC. The three best models are presented in Table 3. Models 1, 2 and 3 all had an adjusted R squared of 0.87. In addition, models 1, 2 and 3 had AIC values of 7100, 7101 and 7102, respectively, and BIC values of 7102, 7103 and 7104, respectively. Detailed information on the coefficients and P-values of the variables in the three selected models is presented in Table S2. Because the significance levels for entering and exiting the model were set at 0.2, some variables included in the selected models had a P-value greater than 0.05.
Model validation. As shown in Table 4, our three best models all yielded the smallest SEE (3.29). Models 1, 2 and 3 predicted a mean BF% of 33.74, 33.74 and 33.73, respectively. The predicted means of the developed equations were closest to the measured value (33.75) compared to other models (Model 1 P = 0.90, Model 2 P = 0.93, Model 3 P = 0.85). Likewise, the model by Gomez-Ambrosi produced a predicted mean that was not significantly different from the measured BF% (P = 0.17). Moreover, our models 1, 2 and 3 produced the smallest percent change in means from the measured BF% (−0.02%, −0.02% and −0.03%, respectively).
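The ranking rule applied during model training (highest adjusted R squared first, lowest AIC as the tie-breaker) can be sketched as follows, using the adjusted R squared and AIC values reported above. The `Candidate` type and `rank_models` function are hypothetical names, and treating adjusted R squared values as tied when they agree to two decimals is an assumption made for this illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    adj_r2: float
    aic: float


def rank_models(candidates: List[Candidate], ndigits: int = 2) -> List[Candidate]:
    """Sort by adjusted R^2 (descending, rounded), then by AIC (ascending)."""
    return sorted(candidates, key=lambda c: (-round(c.adj_r2, ndigits), c.aic))


# Values reported in the text for the three selected models.
reported = [
    Candidate("model 3", 0.87, 7102.0),
    Candidate("model 1", 0.87, 7100.0),
    Candidate("model 2", 0.87, 7101.0),
]
best = rank_models(reported)[0]
```

With identical adjusted R squared values, the ordering is decided entirely by AIC, which places model 1 first.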
Bias assessment. We ran multivariate linear and logistic regressions to assess the risk of elevated LDL  Table 6).
Ethics approval and consent to participate. Not applicable as this study used public de-identified secondary data. However, NHANES was approved by the CDC ethics review board, and participants provided written informed consent prior to participation. Hence, initial data collection involving humans was conducted in accordance with relevant institutional ethical guidelines.

Discussion
The purpose of our study was to develop and validate a parsimonious model of BF% using only BMI and socio-demographic factors in a nationally representative sample of the US population. Our best parsimonious model yielded a high adjusted R2 of 0.87 and a small standard error of estimation of 3.29 and included only BMI, BMI squared, age, gender, race, income, education, and interaction variables. Additionally, our model produced a competitively high adjusted R-squared and low bias in the estimation of the association between BF% and LDL compared to published prediction equations. Our model can be considered to have strong predictive abilities. As recommended by Heyward 28, a good prediction equation needs several characteristics: use of acceptable reference methods; use of large, randomly selected samples (N > 100); high correlation between the reference measures and predicted scores (R2 > 0.8); small SEE; and cross-validation of the equation in samples from an independent population. Our model predicted BF% as measured by DXA, a gold standard reference for measuring adiposity. In addition, we used a large sample of 5931 NHANES participants that was randomly divided into training and validation sets. The adjusted R-squared was > 0.8. The SEE of the model was 3.29 and is thus considered "a very good estimation", since an SEE between 3 and 3.5 indicates a very good estimation while an SEE larger than 5 indicates a poor estimation 29.
Furthermore, we performed rigorous supervised statistical learning with a validation set approach and conducted a bias assessment in two separate datasets. The bias assessments using linear and logistic regression in NHANES 2001-2002 showed that our models yielded a minimally biased association between predicted BF% and LDL and risk of high LDL, respectively (lowest bias). Importantly, our models tended to slightly underestimate mean BF%, by 0.02% to 0.03%. This departure between predicted BF% and measured BF% is negligible and could be considered not clinically relevant 30. Assuming that this underestimation is consistent for the mean BF% in a population, the measured BF% can be back-calculated from the predicted BF%.
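The bias quantities discussed here follow directly from their definitions: for linear regression, the difference between the coefficient obtained with predicted BF% and the one obtained with measured BF%; for logistic regression, the difference between the natural logs of the two odds ratios. The sketch below uses hypothetical function names and made-up input values; none of the numbers are results from the study.

```python
import math


def coefficient_bias(coef_predicted: float, coef_measured: float) -> float:
    """Bias = coefficient from predicted BF% minus coefficient from measured BF%."""
    return coef_predicted - coef_measured


def log_or_bias(or_predicted: float, or_measured: float) -> float:
    """Bias = ln(OR from predicted BF%) - ln(OR from measured BF%)."""
    return math.log(or_predicted) - math.log(or_measured)


# Made-up inputs, for illustration only.
example_linear = coefficient_bias(coef_predicted=0.045, coef_measured=0.050)
example_logistic = log_or_bias(or_predicted=1.05, or_measured=1.06)
```

A bias of zero means the predicted BF% reproduces the association seen with measured BF%; a negative value means the association is attenuated when the prediction is used.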
When our models were compared to previously published models by Gallagher (2000), Gomez-Ambrosi (2012) and Liu (2015), the validation results suggested that our models lowered the SEE by roughly 0.9 to 1.8 BF% units (SEE = 3.29 vs. 5.06, 4.22 and 4.37). The improvement in SEE might be due to several factors. First, unlike Liu's model, which was developed in an Asian population, and Gomez-Ambrosi's model, which was developed in a European population, our model was developed and validated in a representative sample of the US population. Second, our training dataset included a larger number of subjects (n = 2965), which provided more opportunity to detect associations. For example, our model contained the interaction between BMI squared and race, which was not included in any of the other models. Third, Gomez-Ambrosi et al. used BF% estimated from bone density and [...]
Table 5. Association between percent body fat (measured and predicted) and high low-density lipoprotein (LDL) obtained from logistic regression and bias assessment, NHANES 2001-2006. The metrics for the top-performing model(s) are in bold. High LDL is defined as LDL ≥ 130 mg/dL. High LDL was modeled as a function of BF%, age, race, gender, education, and income. a Bias = ln(OR from predicted BF%) − ln(OR from measured BF%).
[...] 22. Nevertheless, there are differences in study design. First and most importantly, Stevens et al. used bioelectrical impedance (BIA) as the main predictor of BF% rather than BMI as in our models. BIA can be difficult to obtain, while BMI is commonly used in health facilities and is less costly than BIA. Second, Stevens et al. reported that their models that included age, ethnicity, height, weight, BMI and BIA produced an R2 of 0.831 in men and 0.864 in women. This was lower than the R2 obtained in our models (0.868). Adding triceps skinfold and waist circumference, however, increased the R2 of their models to 0.905 in males and 0.883 in females. This comes, however, at the cost of adding measures that can be difficult to obtain in low-resource settings. Third, unlike our models, the 2017 Stevens et al. prediction equation is not easily accessible, as Stevens et al. did not publish the coefficients in their article and the link provided for their online BF% calculator did not seem to work at the time we tried accessing it (accessed in September 2017). Fourth, Stevens et al. did not conduct a bias assessment as we did in our study. In fact, one important goal in this endeavor was to be able to predict BF% in order to correct for the measurement error due to the use of BMI. We could not compare the models of Stevens et al. against our model or assess the amount of bias, or lack thereof, since we did not have the regression coefficients. Lastly, the age range of the population in the study by Stevens et al. was 8-49 years 22, while our study population covered a much broader adult age range (20-69 years).
There are several implications of our BF% equation. First, clinicians could readily calculate each patient's approximate BF%, which would otherwise have to be obtained using expensive equipment. Doing so will help better guide decision-making for patient care. Additionally, clinicians can use the model developed here to obtain a better picture of the metabolic health of their patients, especially those with a history of hyperlipidemia, diabetes mellitus, or hypertension who present with a low or normal BMI. Second, our prediction model will allow scientists and clinicians to conduct better weight-related research by correcting for the inherent measurement error in BMI. In addition, our model is particularly attractive because it uses easily accessible variables and as such can be used by researchers who wish to have a reliable measurement of adiposity but do not have enough resources to obtain it via DXA. As a result, our prediction model could potentially save time and medical expenditure for researchers, doctors, and patients. To facilitate the use of our models and eliminate the need to calculate BF% by hand, we provide an online Percent Body Fat Calculator and an Excel calculator for offline use (Supplemental Excel document).
Our study was limited in several respects. First, our model cannot be generalized to race/ethnicity groups other than Black, White, or Hispanic, because other groups made up a very small percentage of the total NHANES population. In addition, since underweight (BMI < 18.5) and severely obese (BMI > 40) subjects comprised only small percentages of the training dataset (1.6% and 5.8%, respectively), it is uncertain whether our model would be accurate for populations with extreme body compositions. Second, there was a change in BF% measurements from total/subtotal BF% in NHANES 1999-2002 to android/gynoid BF% in NHANES 2003-2006, which reduced the accuracy of our bias assessment conducted in NHANES 2003-2006. Third, we focused our prediction model on body fat percentage (BF%) obtained via DXA, given its wide use and relatively high accuracy in measuring adiposity. We did not build predictive models for other measures of adiposity (e.g., bioelectrical impedance analysis [BIA], body adiposity index [BAI]), and as such our model may not accurately predict adiposity as would be obtained from other measures. Future studies should investigate the relative performance of an adiposity prediction model using several other measurements of adiposity and the same parsimonious model across measurements.
In sum, we developed and validated BF% prediction models with high predictive properties and low bias that are tailored to American adults aged 20-69 and can be easily accessed by clinicians and researchers.

Data availability
The data came from the National Health and Nutrition Examination Survey (NHANES) 1999-2006. It is a publicly available dataset that can be accessed and freely downloaded (with no prior registration needed) here: https://wwwn.cdc.gov/nchs/nhanes/Default.aspx.