Study Design and Population
NHANES is an ongoing study conducted by the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC). It uses a complex, multistage probability sampling design, with oversampling of minority groups, to represent the non-institutionalized U.S. population [23]. Information on sociodemographic characteristics, lifestyle characteristics, diet, and medical conditions is collected through an in-person interview, and a physical examination is conducted in a mobile examination center (MEC). NHANES data are released publicly every two years. The study was approved by the NCHS Research Ethics Review Board.
For this study, we used data from NHANES 2003–2004 because it provided the most recent measurements of serum PCBs for each participant. We limited the analysis to non-pregnant adults aged ≥ 20 years who had data available on serum PCBs and diabetes status (n = 1,258). We further excluded individuals with missing body mass index (BMI) data (n = 30) or missing covariate information (n = 4), leaving 1,224 adult participants in the study.
Exposure Assessment
Serum PCBs were measured by high-resolution gas chromatography/isotope-dilution high-resolution mass spectrometry (HRGC/ID-HRMS) in a randomly selected one-third subsample of participants aged 12 years or older. Briefly, approximately 2–10 mL of serum spiked with 13C-labeled internal standards was extracted using a C18 solid-phase extraction (SPE) procedure with hexane [24]. Each congener had its own limit of detection (LOD). Following NHANES analytic guidance, values below the LOD were assigned the LOD divided by the square root of 2.
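The substitution rule for values below the limit of detection can be illustrated with a minimal Python sketch. The function name, the below-LOD flag, and the example numbers are hypothetical and are not taken from the NHANES files.

```python
import math

def impute_below_lod(value, lod, below_lod):
    """Return LOD / sqrt(2) when a measurement is flagged as below the LOD."""
    return lod / math.sqrt(2) if below_lod else value

# Hypothetical records for a single congener (units are illustrative only)
records = [
    {"value": 12.4, "lod": 2.0, "below_lod": False},
    {"value": None, "lod": 2.0, "below_lod": True},
]

imputed = [impute_below_lod(r["value"], r["lod"], r["below_lod"]) for r in records]
```

Because each congener has its own limit of detection, the substituted value differs by congener.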
A total of 40 PCB congeners were quantified: PCB 28, 44, 49, 52, 66, 74, 81, 87, 99, 101, 105, 110, 118, 126, 128, 138 + 158, 146, 149, 151, 153, 156, 157, 167, 169, 170, 172, 177, 178, 180, 183, 187, 189, 194, 195, 196 + 203, 199, 206, and 209. Because PCB 138 coeluted with PCB 158 and PCB 196 coeluted with PCB 203, the 40 PCB congeners were included in the analyses as 38 variables. Serum PCB concentrations were analyzed in lipid-adjusted form because PCBs are lipophilic.
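As a quick check of the bookkeeping above, the sketch below lists the analysis variables and confirms that 40 congeners collapse to 38 variables once the coeluting pairs are merged. It also shows one common convention for lipid adjustment, dividing the whole-weight concentration by the total serum lipid content. That formula and its units are an assumption made for illustration; the text does not state how the lipid-adjusted values were derived.

```python
# Analysis variables as listed in the text; "+" marks coeluting congeners
PCB_VARIABLES = [
    "28", "44", "49", "52", "66", "74", "81", "87", "99", "101",
    "105", "110", "118", "126", "128", "138+158", "146", "149", "151", "153",
    "156", "157", "167", "169", "170", "172", "177", "178", "180", "183",
    "187", "189", "194", "195", "196+203", "199", "206", "209",
]

n_variables = len(PCB_VARIABLES)
n_congeners = sum(len(name.split("+")) for name in PCB_VARIABLES)

def lipid_adjust(whole_weight_conc, total_lipid_fraction):
    """Assumed convention: concentration per gram of serum divided by
    grams of lipid per gram of serum gives concentration per gram of lipid."""
    return whole_weight_conc / total_lipid_fraction

# Hypothetical example: 0.12 ng/g serum with 0.006 g lipid per g serum
example_adjusted = lipid_adjust(0.12, 0.006)
```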
Diabetes Ascertainment
Diabetes status was ascertained from a questionnaire administered by trained interviewers and from laboratory tests. Specifically, participants were defined as having diabetes if they reported a previous physician diagnosis of diabetes, or if they had no prior diagnosis but had glycohemoglobin (A1C) ≥ 6.5% or fasting plasma glucose ≥ 126 mg/dL [25,26]. This method of diabetes ascertainment was found to be 63.2% sensitive and 97.4% specific for diabetes in a previous NHANES validation study [27].
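The case definition above can be written as a small decision rule. This is a minimal sketch; the function and argument names are hypothetical, and missing laboratory values are assumed to be passed as None.

```python
def has_diabetes(self_reported_diagnosis, a1c_percent=None, fasting_glucose_mg_dl=None):
    """Apply the case definition: physician diagnosis, or A1C >= 6.5%,
    or fasting plasma glucose >= 126 mg/dL."""
    if self_reported_diagnosis:
        return True
    if a1c_percent is not None and a1c_percent >= 6.5:
        return True
    if fasting_glucose_mg_dl is not None and fasting_glucose_mg_dl >= 126:
        return True
    return False

# Hypothetical participants
diagnosed = has_diabetes(True)
undiagnosed_by_a1c = has_diabetes(False, a1c_percent=6.8, fasting_glucose_mg_dl=110)
no_diabetes = has_diabetes(False, a1c_percent=5.4, fasting_glucose_mg_dl=95)
```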
Sociodemographic and Lifestyle Characteristics Assessment
Information on age, sex (male/female), race/ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, and other), education (less than high school, high school, and more than high school), family history of diabetes (yes/no), family income, smoking status, alcohol consumption, and physical activity was collected by self-report during the in-person interview. Family income-to-poverty ratio (PIR) was categorized as ≤ 1.30, 1.31–3.50, and > 3.50 [28]. Smoking status was categorized as never smoker (smoked fewer than 100 cigarettes in their lifetime), ever smoker (had smoked at least 100 cigarettes but did not smoke at the time of the survey), and current smoker (smoked at the time of the survey) [29]. Physical activity was categorized as < 600, 600–1200, and > 1200 metabolic equivalent of task (MET) min per week [30]. Weight and height were measured following a standardized protocol during the physical examination, and BMI was calculated as weight in kilograms divided by height in meters squared. BMI categories were defined as underweight (< 18.5 kg/m²), normal (18.5–24.9 kg/m²), overweight (25.0–29.9 kg/m²), and obese (≥ 30.0 kg/m²). Sixteen underweight participants were combined with normal-weight participants for statistical analyses.
Dietary information was obtained through 24-h dietary recall. Total energy intake (kcal/day) and alcohol intake were calculated using the USDA food composition database. Alcohol intake was then categorized as non-drinker (0 g/day), moderate drinker (0.1–28 g/day for men and 0.1–14 g/day for women), and heavy drinker (> 28 g/day for men and > 14 g/day for women) [31]. Diet quality was represented by the Healthy Eating Index-2010 (HEI-2010); higher diet quality has been associated with a decreased risk of diabetes [32]. A higher HEI-2010 score indicates higher diet quality based on 12 components: total fruit, whole fruit, total vegetables, greens and beans, whole grains, dairy, total protein foods, seafood and plant proteins, fatty acids, refined grains, sodium, and empty calories (e.g., added sugars) [31].
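The covariate categories described in this section can be sketched as simple Python functions. All function names and category labels are hypothetical. Two boundary choices are assumptions: a BMI between 24.9 and 25.0 is treated as normal, and an intake of exactly 28 g/day (men) or 14 g/day (women) is treated as moderate.

```python
def bmi_category(weight_kg, height_m):
    """BMI = weight (kg) / height (m) squared, then the four stated categories."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:  # assumption: values between 24.9 and 25.0 count as normal
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"

def pir_category(pir):
    """Family income-to-poverty ratio in three groups."""
    if pir <= 1.30:
        return "<=1.30"
    if pir <= 3.50:
        return "1.31-3.50"
    return ">3.50"

def activity_category(met_min_per_week):
    """Physical activity in MET min per week."""
    if met_min_per_week < 600:
        return "<600"
    if met_min_per_week <= 1200:
        return "600-1200"
    return ">1200"

def alcohol_category(grams_per_day, sex):
    """Sex-specific alcohol categories; the limit itself is assumed moderate."""
    limit = 28 if sex == "male" else 14
    if grams_per_day == 0:
        return "non-drinker"
    if grams_per_day <= limit:
        return "moderate"
    return "heavy"

# Hypothetical participant: 70 kg, 1.75 m, PIR 2.0, 20 g alcohol per day
example = (
    bmi_category(70, 1.75),
    pir_category(2.0),
    alcohol_category(20, "male"),
    alcohol_category(20, "female"),
)
```

In the analyses described above, the underweight group would then be merged with the normal-weight group.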
Statistical Analysis
For descriptive statistical analyses, we accounted for the complex, multistage design of NHANES by using the appropriate sample weights, strata, and primary sampling units. We compared population characteristics by quintile of the lipid-adjusted serum concentration of the sum of the 40 PCBs (∑40-PCBs) using the t-test for continuous variables and the chi-square test for categorical variables. Then, we examined the potential combined effects of the 40 PCB congeners on diabetes in two steps.
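To illustrate the role of sample weights in the descriptive statistics, the sketch below computes a weighted mean, in which each participant contributes in proportion to the number of people they represent. It covers the point estimate only. Design-based standard errors also require the strata and primary sampling units and are not shown. The values and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Survey-weighted mean: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical lipid-adjusted concentrations and sample weights
values = [10.0, 20.0, 30.0]
weights = [1000, 1000, 2000]

mean_weighted = weighted_mean(values, weights)
mean_unweighted = sum(values) / len(values)
```

Here the third participant carries twice the weight of the others, so the weighted mean lies above the unweighted mean.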
In the first step, we used a classification tree (decision tree) model to identify serum PCB profiles related to diabetes, along with the corresponding concentration thresholds. The classification tree, a non-parametric supervised learning method, was chosen for several reasons. First, it performs dimensionality reduction and classification simultaneously, which is helpful for analyzing serum PCBs, a complex mixture of congeners. Second, it can identify potential interactions among the PCBs in the mixture. Third, it can identify threshold values for each PCB congener. Last, it is robust to outlying PCB values and makes no assumptions about the data distribution. Participants were classified as having diabetes or not based on all 40 measured PCB congeners. The dataset was randomly split into a training set (70%, n = 858) and a test set (30%, n = 366). Ten-fold cross-validation was used to tune the parameters and prune the tree to avoid overfitting. We evaluated the tree's performance on the test set using the confusion matrix and accuracy. This analysis was performed using the rpart package in R version 4.1.2.
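The way a classification tree arrives at a threshold can be illustrated with a single-split tree (a stump) in plain Python. This is a simplified sketch, not the rpart procedure used in the study: it searches one exposure variable for the cut point that minimizes weighted Gini impurity, then reports a confusion matrix and accuracy on a 30% hold-out set. Cross-validated pruning is omitted, and the data are synthetic.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(x, y):
    """Find the threshold t minimizing weighted Gini for 'x <= t' vs 'x > t'."""
    values = sorted(set(x))
    best_t, best_score = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between adjacent observed values
        left = [lab for v, lab in zip(x, y) if v <= t]
        right = [lab for v, lab in zip(x, y) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
            left_class = int(2 * sum(left) >= len(left))     # majority label
            right_class = int(2 * sum(right) >= len(right))  # majority label
    if best_t is None:
        raise ValueError("need at least two distinct exposure values")
    return best_t, left_class, right_class

def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives."""
    cm = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        cm[key] += 1
    return cm

# Synthetic data: low exposure without the outcome, high exposure with it
x = [float(v) for v in range(1, 11)] + [float(v) for v in range(21, 31)]
y = [0] * 10 + [1] * 10

# Random 70% / 30% split
indices = list(range(len(x)))
random.Random(42).shuffle(indices)
n_train = (7 * len(indices)) // 10
train_idx, test_idx = indices[:n_train], indices[n_train:]

threshold, left_class, right_class = fit_stump(
    [x[i] for i in train_idx], [y[i] for i in train_idx]
)
y_test = [y[i] for i in test_idx]
y_pred = [left_class if x[i] <= threshold else right_class for i in test_idx]

cm = confusion_matrix(y_test, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_test)
```

A full tree repeats this search recursively within each branch and across all exposure variables, which is how interactions among congeners can appear as nested splits.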
In the second step, logistic regression was used to estimate odds ratios (ORs) and 95% confidence intervals (CIs) of diabetes associated with the identified serum PCB profiles. We followed NHANES analytic guidelines to account for the sample weights and sample design. In the basic models, we adjusted only for demographic variables: age, sex, and race/ethnicity. In the full models, we additionally adjusted for potential confounders: BMI, education level, family income-to-poverty ratio, smoking status, alcohol intake, physical activity level, HEI-2010 score, and family history of diabetes.
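The link between a logistic regression coefficient and the reported OR and 95% CI can be sketched as follows. The coefficient and standard error are hypothetical, and the multiplier of 1.96 assumes a normal approximation. Survey procedures may instead use design-based degrees of freedom, so this is an illustration rather than a reproduction of the SAS output.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient for a PCB profile indicator: log(2) with SE 0.25
or_est, ci_low, ci_high = odds_ratio_ci(math.log(2), 0.25)
```

An interval that excludes 1 corresponds to a coefficient that differs from 0 at the matching significance level.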
Although NHANES does not explicitly collect information on the type of diabetes, we classified participants as having possible type 1 diabetes if they started insulin within one year of diabetes diagnosis, or were currently using insulin, or were diagnosed with diabetes before age 30 [62]. To explore the influence of diabetes type, we performed a sensitivity analysis excluding these possible type 1 diabetes cases, so that the vast majority of the remaining cases would be type 2 diabetes. This second step was performed using survey procedures in SAS software (version 9.4; SAS Institute Inc., Cary, NC, USA).
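The rule for flagging possible type 1 diabetes can be written out as below, following the three criteria exactly as stated in the text (joined by "or"). The function and argument names are hypothetical. Measuring "within one year" as a difference of at most 1 between whole-year ages is an assumption.

```python
def possible_type1(age_at_diagnosis, age_started_insulin=None, currently_on_insulin=False):
    """Flag possible type 1 diabetes if any of the three stated criteria holds."""
    started_within_year = (
        age_started_insulin is not None
        and age_started_insulin - age_at_diagnosis <= 1  # assumption: ages in whole years
    )
    return started_within_year or currently_on_insulin or age_at_diagnosis < 30

# Hypothetical cases; flagged cases would be dropped in the sensitivity analysis
cases = [
    {"age_at_diagnosis": 25},
    {"age_at_diagnosis": 50, "age_started_insulin": 58},
    {"age_at_diagnosis": 50, "age_started_insulin": 50},
]
flags = [possible_type1(**case) for case in cases]
remaining = [case for case, flagged in zip(cases, flags) if not flagged]
```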