Smoking-diseases correlation database: comprehensive analysis of the correlation between smoking and 422 diseases based on NHANES 2013–2018

Background: Smoking is a risk factor for a wide range of diseases. Previous research has confirmed over 30 Smoking-Associated Diseases in diverse systems. Research exploring the correlation between smoking and multiple diseases is limited, and comprehensive investigations are absent. Few studies concentrate on diseases exhibiting a negative correlation with smoking, wherein smokers demonstrate a lower prevalence.
Objective: This study aimed to detect the correlation between smoking and other diseases using data from the National Health and Nutrition Examination Survey (NHANES) and to construct a Smoking-Diseases Correlation Database (SDCD). The second aim was to obtain an extensive screening test for diseases that may be linked to smoking.
Methods: Data from 39,126 subjects were extracted from the NHANES 2013–2018 dataset. The baseline information, the differences in blood routine and blood chemistry indicators between smokers and non-smokers, and the correlation of diseases with smoking in four different models were analyzed in R. The data and statistics were aggregated into an online SDCD.
Results: Our study reported 46 Smoking-Associated Diseases (SAD), including 29 Smoking Positively Associated Diseases (SPAD) and 17 Smoking Negatively Associated Diseases (SNAD). The SDCD of 422 diseases was constructed and can be accessed at https://chatgptmodel.shinyapps.io/sdcd/.
Conclusion: Our findings revealed 46 SADs, including 29 SPADs and 17 SNADs. We aggregated the statistics and developed the online SDCD, advancing our understanding of the correlation between smoking and diseases.


Introduction
Smoking is the main cause of preventable early death around the world. Approximately one person dies every 6 s due to smoking, accounting for one in five deaths in the age group between 35 and 69 (1, 2). According to the WHO, there were 1.30 billion tobacco users in 2020, a prevalence of 22.3% of the global population. Over 8 million people died from tobacco-related diseases in 2019 (3). In the 1950s, Ernst Wynder et al. hypothesized that lung cancer is associated with smoking. As the evidence accumulated, a 1964 report from the United States Department of Health confirmed the causal relationship between smoking and lung cancer (4). Since then, the evidence has been regularly updated and more Smoking-Associated Diseases (SAD) have been found. Studies have established that smoking can affect different organs and systems, such as the cardiovascular (5), respiratory (6), brain (7), reproductive (8), and immune (9) systems. The number of SADs has gradually increased to at least 30, but some SADs remain undetected or lack sufficient evidence (10).
A great deal of research focuses on Smoking Positively Associated Diseases (SPAD). However, studies have also found that some diseases are Smoking Negatively Associated Diseases (SNAD). For example, smokers have a lower prevalence of Parkinson's disease than non-smokers, as consistently shown by previous epidemiological research. Because combustible tobacco smoke is regarded as a cause of adverse health outcomes, this inverse correlation is counterintuitive and fascinating (11). Furthermore, studies suggest that non-smokers have a relatively higher risk of ulcerative colitis and osteoarthritis than smokers (12). Documenting favorable effects of smoking on different diseases might provide an opportunity to identify constituents of combustible tobacco smoke that have favorable effects on disease and to test these as potential treatments (12).
Most existing studies are scattered and focus on a single disease; comprehensive research covering multiple diseases in one study is lacking. Such a comprehensive study is important for understanding tobacco's effect on health. Therefore, we evaluated the correlation between smoking and 422 different diseases using data of 39,126 people from the National Health and Nutrition Examination Survey dataset of the United States (NHANES) and paid attention to both SPAD and SNAD. By extracting and analyzing a large collection of data, this database could provide a new landscape of smoking and disease, laying a foundation for further research on SAD.

Study population
For this investigation, we used data extracted from the National Health and Nutrition Examination Survey dataset of the United States (NHANES) spanning 2013 to 2018. All the data we used can be accessed at https://wwwn.cdc.gov/nchs/nhanes/Default.aspx. A total of 12,226 individuals were excluded from the study due to missing or incomplete data on body mass index (BMI), education level, or smoking records, or other uncertain or incomplete information. Additionally, 24 individuals who lacked ICD-10 diagnosis information were excluded. A total of 39,126 subjects remained for the final analysis. NHANES 2013-2018 was performed according to the guidelines approved by the Research Ethics Review Board of the National Center for Health Statistics (Protocol #2011-17, #2018-01). Moreover, written informed consent was obtained from all participants involved in the NHANES project.

Study variables
NHANES variable data are collected through a combination of interviews, physical examinations, and laboratory tests. Participants are interviewed to gather information on their medical history, lifestyle factors, and dietary habits. Additionally, a comprehensive physical examination is conducted, which includes measuring vital signs and collecting blood samples. The blood samples are then analyzed in a laboratory, where tests are performed to measure the various parameters of interest. These data contribute to the rich dataset available in the NHANES database, which allows researchers and public health professionals to study and monitor the health and nutritional status of the United States population.
NHANES employs a weighted sampling approach to ensure its findings are representative of the United States population. Participants are assigned base weights reflecting their probability of selection. Nonresponse adjustments account for those who choose not to participate, and post-stratification aligns the sample with known population demographics.
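To illustrate how such weights enter an estimate, the following Python sketch computes a weighted prevalence. It is not the authors' R code. The function names are hypothetical, the toy weights are invented, and dividing each 2-year weight by the number of pooled cycles is an assumption based on general NHANES analytic guidance for combining survey cycles, not on anything stated in this article.

```python
from typing import Sequence


def pooled_weight(two_year_weight: float, n_cycles: int = 3) -> float:
    """Rescale a 2-year survey weight for an analysis pooling several cycles.

    Assumption: three 2-year cycles (2013-2018) are pooled, so each
    2-year weight is divided by 3.
    """
    if n_cycles < 1:
        raise ValueError("n_cycles must be at least 1")
    return two_year_weight / n_cycles


def weighted_prevalence(flags: Sequence[bool], weights: Sequence[float]) -> float:
    """Share of the total weight carried by respondents whose flag is True."""
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return sum(w for f, w in zip(flags, weights) if f) / total


# Toy data (invented): four respondents, two of whom report the condition.
flags = [True, False, True, False]
weights = [pooled_weight(w) for w in (3000.0, 9000.0, 6000.0, 12000.0)]

unweighted = sum(flags) / len(flags)            # 0.5
weighted = weighted_prevalence(flags, weights)  # 0.3
```

In this toy example the respondents with the condition carry smaller weights, so the weighted prevalence (0.3) is lower than the raw sample share (0.5).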
The population variables involved in this study include age, BMI, sex, race, and education level. Ethnicity and race are determined by a series of questions in interviews. Hispanic or Latino refers to anyone who says they were born in or had ancestors from Spain or one of the western hemisphere territories or countries where Spanish is the primary language. The blood routine and blood chemistry indicators involved in this study include white blood cell count (WBC), neutrophils (NEU), lymphocytes (LYM), monocytes (MO), platelets (PLT), neutrophil-to-lymphocyte ratio (NLR), monocyte-to-lymphocyte ratio (MLR), systemic immune inflammation index (SII), systemic immune response index (SIRI), alkaline phosphatase (ALP), total bilirubin (TBIL), albumin (ALB), gamma-glutamyl transferase (GGT), iron (Fe), and alpha-klotho (α-klotho).
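The four composite indices in this list are derived from the cell counts. The article does not spell out the formulas, so the Python sketch below assumes the definitions commonly used in the literature: NLR = NEU/LYM, MLR = MO/LYM, SII = PLT × NEU/LYM, and SIRI = NEU × MO/LYM. The class and function names are hypothetical and the input values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BloodCounts:
    """Cell counts for one participant, all in the same unit."""
    neu: float  # neutrophils
    lym: float  # lymphocytes
    mo: float   # monocytes
    plt: float  # platelets


def inflammation_indices(c: BloodCounts) -> dict[str, float]:
    """Derive NLR, MLR, SII, and SIRI under the assumed definitions."""
    if c.lym <= 0:
        raise ValueError("lymphocyte count must be positive")
    return {
        "NLR": c.neu / c.lym,           # neutrophil-to-lymphocyte ratio
        "MLR": c.mo / c.lym,            # monocyte-to-lymphocyte ratio
        "SII": c.plt * c.neu / c.lym,   # systemic immune inflammation index
        "SIRI": c.neu * c.mo / c.lym,   # systemic immune response index
    }


# Invented example values, not taken from NHANES.
example = inflammation_indices(BloodCounts(neu=4.0, lym=2.0, mo=0.5, plt=250.0))
```

All four indices share the lymphocyte count as denominator, so a lower lymphocyte count raises all of them at once.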
The dependent variables of this research were 422 diseases with ICD-10 coding, reported through each individual's response to the following questionnaire item: "Has a doctor ever diagnosed you with…." The 422 diseases involve multiple organs and systems and cover smoking's influence on health as comprehensively as possible.
In this study, the smoking population is defined as individuals who have smoked at least 100 cigarettes in their lifetime. This definition is consistent with the reporting method used for diseases in the questionnaire, which allows a better correlation analysis.
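This definition can be written as a small classification rule. The Python sketch below is illustrative only. It assumes the status comes from a single questionnaire item on having smoked at least 100 cigarettes in life, coded 1 for yes and 2 for no, with any other value treated as missing. The item coding and the function name are assumptions, not details given in this article.

```python
from typing import Optional


def is_smoker(response_code: Optional[int]) -> Optional[bool]:
    """Classify a participant from the assumed '100 cigarettes in life' item.

    Assumed coding: 1 = yes, 2 = no. Any other value (refused, don't know,
    not asked) is returned as None so the participant can be excluded.
    """
    if response_code == 1:
        return True
    if response_code == 2:
        return False
    return None


# Invented responses: yes, no, refused, missing.
statuses = [is_smoker(code) for code in (1, 2, 7, None)]
```

Returning `None` for unclear answers, rather than defaulting to non-smoker, mirrors the exclusion of participants with missing smoking records described under Study population.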

Statistical analyses
We performed all statistical analyses with the R software (Version 4.2.2) according to the CDC guidelines. Sample weights were taken into account in all of the estimates to produce data representative of the civilian noninstitutionalized United States population. The correlation between smoking and each disease was assessed through four logistic regression models: one unadjusted and three adjusted models. Model 1 did not include any covariate adjustments. In Model 2, adjustments were made for age, gender, and race. Model 3 accounted for age, gender, race, education level, and poverty-to-income ratio. Finally, Model 4 incorporated adjustments for age, gender, race, education level, poverty-to-income ratio, smoking status, body mass index, and bone mineral density.
We constructed a Smoking-Diseases Correlation Database (SDCD), an online repository of smoking-disease correlations, using the R software package Shiny and the statistics. The database can be accessed at the following web address: https://chatgptmodel.shinyapps.io/sdcd/. The website provides the characteristics of the participants, the association between smoking and selected diseases, and subgroup analysis data. Subgroup analysis refers to a separate analysis of the OR values of diseases among smokers relative to non-smokers in different gender and ethnic groups.
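For the unadjusted case, the odds ratio from a logistic regression with smoking as the only predictor equals the cross-product ratio of the weighted 2 × 2 table. The Python sketch below illustrates that point estimate only. It is not the authors' R code, the function name and toy records are hypothetical, and it gives no confidence interval, which would require design-based variance estimation.

```python
from typing import Iterable, Tuple

Record = Tuple[bool, bool, float]  # (smoker, has_disease, sample_weight)


def weighted_odds_ratio(records: Iterable[Record]) -> float:
    """Cross-product odds ratio of disease for smokers vs. non-smokers."""
    # Weighted cell totals of the 2 x 2 table, keyed by (smoker, has_disease).
    cells = {(s, d): 0.0 for s in (True, False) for d in (True, False)}
    for smoker, disease, weight in records:
        cells[(smoker, disease)] += weight
    a, b = cells[(True, True)], cells[(True, False)]    # smokers: cases, non-cases
    c, d = cells[(False, True)], cells[(False, False)]  # non-smokers: cases, non-cases
    if b == 0 or c == 0:
        raise ZeroDivisionError("odds ratio undefined when a reference cell is empty")
    return (a * d) / (b * c)


# Invented toy data: weighted totals of 30/70 among smokers, 10/90 among non-smokers.
toy = [
    (True, True, 30.0),
    (True, False, 70.0),
    (False, True, 10.0),
    (False, False, 90.0),
]
odds_ratio = weighted_odds_ratio(toy)  # (30 * 90) / (70 * 10)
```

An OR above 1 corresponds to a positive correlation with smoking and an OR below 1 to a negative one. The adjusted models cannot be reduced to a single table and need a regression fit.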

Baseline information of study participants
A total of 39,126 participants (mean age 53.90 ± 0.35, 46% males) were included in this study. Of the participants, 48.18% had a history of smoking. The weighted distribution of the characteristics according to smoking is shown in Table 1. The average age and BMI did not differ significantly between the two groups. Regarding sex and race, males and non-Hispanic white people were more likely to smoke. There were more smokers than non-smokers in the College or above and the High School or equivalent education level groups, and fewer smokers than non-smokers in the College graduate or above and the Less than high school groups.

Correlations between smoking and blood routine and blood biochemistry variables
We examined the disparities in blood routine and blood biochemistry variables between smokers and non-smokers (Table 2). To assess the difference between smokers and non-smokers, a t-test was computed. Compared to the non-smoking population, the majority of immunological inflammatory markers were significantly elevated in the smoking population, including white blood cell count (WBC), neutrophil count (NEU), lymphocyte count (LYM), monocyte count (MO), neutrophil-to-lymphocyte ratio (NLR), monocyte-to-lymphocyte ratio (MLR), systemic immune-inflammation index (SII), and systemic immune response index (SIRI). Additionally, there were notable increases in certain oxidative stress indicators, such as alkaline phosphatase (ALP), albumin (ALB), and gamma-glutamyl transferase (GGT), and decreases in α-klotho and total bilirubin (TBIL) in the smoking population. Among non-smokers, for example, the WBC was 7.26, while this rose 9% to 7.91 in former smokers. The NEU indicator rose 11% in smokers, and MO rose 9%. Smokers' GGT (34.50) was 27% higher than that of non-smokers (27.21), the largest percentage increase among all elevated blood indicators. Interestingly, TBIL and α-klotho decreased by 5% and 4%, respectively, and they were the only two indicators that decreased with statistical significance.
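The percentages quoted here are relative differences with the non-smoker value as the baseline. The short Python check below recomputes two of them from the reported means. The function name is hypothetical.

```python
def percent_change(baseline: float, value: float) -> float:
    """Relative difference of value from baseline, in percent."""
    if baseline == 0:
        raise ZeroDivisionError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# Means reported in the text: non-smokers first, then smokers.
wbc_change = percent_change(7.26, 7.91)    # about 8.95, reported as 9%
ggt_change = percent_change(27.21, 34.50)  # about 26.79, reported as 27%
```

Both reported figures match the recomputed values after rounding to the nearest whole percent.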

Positive correlation between smoking and diseases
In this investigation, we examined the correlation between smoking and various diseases. We examined a comprehensive collection of 422 prevalent illnesses categorized by the ICD-10 classification system and calculated the disparities in disease prevalence between smokers and non-smokers. Ultimately, we identified 29 diseases that exhibited a significant positive correlation with smoking (Table 3). The SPADs listed in Table 3 primarily comprise COPD, mental health disorders, pain, and gastrointestinal diseases. They also include muscle spasm.

Negative correlation between smoking and diseases
Notably, we also identified 17 diseases that displayed a significant negative correlation with smoking (Table 4). The SNADs listed in Table 4 cover a broader range of diseases across various medical domains, including infectious diseases, endocrine disorders (diabetes), neurological conditions (Parkinson's disease), eye disorders, respiratory conditions (pulmonary embolism, otitis media), musculoskeletal conditions (psoriasis, osteoporosis), and genitourinary disorders (kidney failure, menstrual irregularities).

Discussion
Twelve blood routine and blood biochemistry indicators differed between smokers and non-smokers, with the majority showing higher levels in smokers. The elevated indicators suggest an increased inflammatory load and oxidative stress in smokers. One protein, α-klotho, caught our attention. The data showed that the α-klotho level in smokers is 4% lower than in non-smokers. The Klotho gene is one of the most important genes related to the aging process and controls many metabolic pathways. Klotho protein has anti-aging, anti-inflammatory, anti-oxidant, and cardioprotective effects. One study suggested that oxidative stress from smoking can disrupt the anti-oxidant defense and decrease α-klotho protein synthesis (13).
Notably, our study identified 29 diseases with a possible positive correlation with smoking. The most prominent of these 29 SPADs is COPD. Many of the others can be grouped by their features into categories such as mental disorders, pain, and gastrointestinal disorders, the mechanisms of which require further research.
Chronic obstructive pulmonary disease has the highest OR value among the 29 SPADs in all four models (7.77, 7.33, 6.19, and 6.94 in models 1 to 4, respectively). COPD is commonly associated with smoking; in addition, second-hand smoke, occupational exposures, air pollutants, and a history of previous lung infections are risk factors (14). One possible mechanism is that cigarette smoke can reprogram the epithelium of the trachea and cause mitochondrial dysfunction (14, 15). One perspective suggests that COPD is a polygenic disease, with complex environmental risk factors accumulating over a lifetime and interacting with susceptibility genes (16).
Many mental disorders are SPADs. The rate of smoking among schizophrenia patients is 4.65 times higher than in the general population, and other research indicates rates ranging from 2 to 5 times higher (17, 18). Several hypotheses have been proposed to explain this correlation, including (a) the causal relationship hypothesis, (b) the self-medication hypothesis, and (c) the shared genetic background hypothesis (17-20). In addition to schizophrenia, nervousness, anxiety, and depression have respective OR values of 5.27, 1.34, and 1.41 in model 1.
Pain constitutes another category of SPADs that we have reported. Individuals experiencing chronic pain are approximately twice as likely to be current smokers compared to those without chronic pain (21). Firstly, nicotine exhibits an acute analgesic effect, intensifying pain in nicotine-deprived smokers and thereby making smoking ...
The OR value for Parkinson's disease (PD) is 0.36 in model 4. Numerous epidemiological studies have consistently reported the same negative result. This correlation exhibits dose-dependent and time-dependent characteristics: the RRs of PD were 0.8, 0.6, 0.5, and 0.4 for 1-9, 10-24, 25-44, and 45+ pack-years, respectively, relative to never-smokers (11). Notably, there is compelling evidence for the relationship between smoking and the declining prevalence of PD. Arguments against this correlation, such as confounding by unknown factors, shared genes, and personality characteristics of PD patients, remain unlikely (24). The neuroprotective role shown by smoking might be related to the biological effect of nicotine, which has demonstrated neuroprotective effects in animal models of PD (25).
Allergic rhinitis showed a significant negative correlation in models 1 and 2 but did not reach significance in models 3 and 4. The OR values were comparable between models 1 (0.62) and 2 (0.64), as well as between models 3 (0.75) and 4 (0.76). A two-sample Mendelian randomization study also revealed that smoking can decrease the risk of allergic rhinitis and increase the risk of vasomotor rhinitis. One hypothesis suggested that individuals with allergic rhinitis may have a preference for smoking. Another hypothesis suggested that smoking might inhibit the immune system, thereby reducing the risk of allergic rhinitis (26).
Long-term use of hormonal contraceptives has a low prevalence (OR = 0.44) in smokers after gender was adjusted in model 2. Currently, there is limited understanding of the correlation between the use of oral contraceptives (OCs) and smoking, with conflicting data reported in various studies. The most common OCs, which contain estrogen and progesterone components, can alter the body's progesterone and estradiol levels. Previous studies have explored the potential impact of OCs on nicotine metabolism, cardiovascular reactivity, mood, and withdrawal (27). Some clinical trials have investigated the effect of exogenous progesterone on smoking cessation. One study found that exogenous progesterone increased the odds of quitting in females nearly 3-fold and extended the time before relapse by an average of 6 days (28). Exogenous progesterone has been proposed as either an adjunctive or a stand-alone therapy for smoking cessation (29). Further epidemiological studies are required to better understand the correlation between OCs and tobacco use.
Frontiers in Public Health, frontiersin.org

Conclusion
Based on NHANES 2013-2018, we made a comprehensive analysis of the correlation between smoking and 422 diseases and found 46 SADs, including 29 SPADs. Interestingly, we revealed 17 SNADs, such as Parkinson's disease, allergic rhinitis, and long-term use of hormonal contraceptives. We also put forward some Smoking-Associated Diseases that were previously not well known and that require further research. We therefore constructed the online SDCD and summarized the correlation data between 422 diseases and smoking, making them easy to find. Our SDCD demonstrates the importance of a large database in smoking research. These results improve our understanding of the correlation between smoking and diseases and provide a basis for studying it further.

TABLE 1
Baseline information according to smoking or not smoking of NHANES data from 2013 to 2018.
Figures in parentheses correspond to the weighted column percentage.

TABLE 2
The correlation between blood routine and blood biochemistry indicators and smoking.

TABLE 3
SPADs' names, ICD-10 numbers, and OR values among smokers relative to non-smokers in four models after adjusting for variables of interest.
a Model 1 adjusted for no covariates. b Model 2 adjusted for age, gender, and race. c Model 3 adjusted for age, gender, race, education level, and poverty to income ratio. d Model 4 adjusted for age, gender, race, education level, poverty to income ratio, body mass index, and bone mineral density.

TABLE 4
SNADs' names, ICD-10 numbers, and OR values among smokers relative to non-smokers in four models after adjusting for variables of interest.
a Model 1 adjusted for no covariates. b Model 2 adjusted for age, gender, and race. c Model 3 adjusted for age, gender, race, education level, and poverty to income ratio. d Model 4 adjusted for age, gender, race, education level, poverty to income ratio, body mass index, and bone mineral density.