Large registry-based analysis of genetic predisposition to tuberculosis identifies genetic risk factors at HLA

Abstract Tuberculosis is a significant public health concern resulting in the death of over 1 million individuals each year worldwide. While treatment options and vaccines exist, a substantial number of infections still remain untreated or are caused by treatment-resistant strains. Therefore, it is important to identify mechanisms that contribute to the risk and prognosis of tuberculosis, as this may help to explain disease mechanisms and provide novel treatment options for those with severe infection. Our goal was to identify genetic risk factors that contribute to the risk of tuberculosis and to understand the biological mechanisms and causality behind that risk. A total of 1895 individuals in the FinnGen study had an International Classification of Diseases-based tuberculosis diagnosis. Genome-wide association study analysis identified genetic variants with statistically significant association with tuberculosis at the human leukocyte antigen (HLA) region (P < 5e−8). Fine mapping of the HLA association provided evidence for one protective haplotype tagged by HLA DQB1*05:01 (P = 1.82E−06, OR = 0.81 [CI 95% 0.74–0.88]), and predisposing alleles tagged by HLA DRB1*13:02 (P = 0.00011, OR = 1.35 [CI 95% 1.16–1.57]). Furthermore, genetic correlation analysis showed association with earlier reported risk factors including smoking (P < 0.05). Mendelian randomization supported smoking as a risk factor for tuberculosis (inverse-variance weighted P < 0.05, OR = 1.83 [CI 95% 1.15–2.93]) with no significant evidence of pleiotropy. Our findings indicate that specific HLA alleles associate with the risk of tuberculosis. In addition, lifestyle risk factors such as smoking contribute to the risk of developing tuberculosis.


Introduction
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis. The bacteria are airborne and mainly affect the lungs (pulmonary TB), but they can also affect other organs (extrapulmonary TB) (1,2). TB can manifest as a latent non-transmissible form or as an active transmissible form in which patients experience symptoms such as high fever, persistent cough, fatigue and weight loss (1). Most individuals infected by M. tuberculosis will not develop TB; only 5-10% of infected individuals develop the disease. Individuals with a compromised immune system, such as those infected with HIV (human immunodeficiency virus), have a higher likelihood of developing TB (2).
Incidence rates of TB in most of Europe and Northern America are less than 10 per 100 000 per year, and mortality in HIV-negative people is less than 5 per 100 000 per year (2). After the development of the BCG (Bacillus Calmette-Guérin) vaccine in 1921, people were widely vaccinated in Europe, which led to declining TB incidence rates. The decline was further assisted by the development of antibiotic treatments in the 1940s and 1950s, after which latent TB infections could be treated (3,4). However, Europe has the highest rate of new reported cases of multidrug-resistant TB (2).
The risk for TB is also highly correlated with comorbid diseases. While HIV infection is the strongest risk factor for TB disease, TB infection also exacerbates the course of HIV infection and is associated with 4-fold higher all-cause mortality among HIV patients (5). Similarly, malnutrition as well as deficiency of vitamins C or D has been associated with TB disease, although this association is also related to overall frailty: very young and very old people have elevated TB risk (6)(7)(8). Finally, smoking increases the risk for latent TB infection and TB disease more than 2-fold (9), and public health campaigns against smoking have been applied in high-risk areas for TB.
On a global scale, WHO estimated that 10 million people developed TB disease in 2019 and that, of these, an estimated 1.2 million HIV-negative individuals died of TB in the same year (2). Although there are treatments for TB, long-term consequences such as pulmonary dysfunction and prolonged respiratory symptoms may expose survivors to other respiratory disorders, for instance chronic obstructive pulmonary disease (COPD) (10)(11)(12)(13). In addition, treatment-resistant strains and comorbidity burden create challenges in managing TB at the local and global scale.
Therefore, it is crucial to understand the biological mechanisms that contribute to the onset of TB. Understanding these mechanisms is likely to result in better disease management and potentially novel treatment options as well as nonpharmaceutical measures to prevent and to treat TB infections. By using genetic tools, we are able to study the biological mechanisms underlying TB along with causal relationships between TB and different comorbidities. Host genetic factors in TB have been studied in various populations (14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24). In a European ancestry M. tuberculosis or pulmonary TB genome-wide association study (GWAS), three variants, rs557011, rs9271378 and rs9272785, in the human leukocyte antigen (HLA) region reached genome-wide significance (P < 5E−08) (18). Additionally, in a GWAS of self-reported positive TB test results, variant rs2894257 (also from the HLA region) was found to be genome-wide significant in European ancestry individuals (19).
[Table caption: The number of tuberculosis cases (all diagnosis subcodes) in FinnGen cohort R7, number of controls and the total number of individuals in FinnGen R7 are reported. The mean age of first onset of tuberculosis, number of males in cases, controls and in FinnGen R7, and mean BMI are also reported.]
In this study, we used data from the FinnGen study data release 7 (R7) to explore the biological mechanisms underlying TB in 310 000 individuals. We were especially interested in assessing host genetic components using GWAS, and exploring comorbidity burden using genetic correlations, epidemiological tools, and causality with Mendelian randomization (MR).

HLA fine mapping
Traditionally, alleles at HLA class II are related to transplantation outcomes, autoimmune and infectious diseases. Therefore, we used imputed HLA allele information to assess whether HLA alleles, in addition to individual variants, associate with TB, using multivariate logistic regression adjusting for age at death or at end of follow-up (December 31, 2019), sex and the first 20 genetic principal components. We found three associations of protective effect estimate with HLA alleles: HLA DQB1*05:01 (P = [...] Table S2). Unlike HLA DQB1*05:01, the risk alleles were not in high LD with risk SNP rs9391858 (r² = 0.2) or with HLA DQB1*05:01. After stepwise logistic regression adjusting the HLA associations for the lead allele HLA DQB1*05:01, these three positive effect alleles remained statistically significant (P = 0.0007, 0.002, 0.002, respectively) (Supplementary Material, Table S11).

Epidemiological and genetic correlates
To study the association between known risk factors and TB, we used multivariate logistic regression and adjusted for age at death or end of follow-up (December 31, 2019), sex, body mass index (BMI) and the first 10 genetic principal components. A positive association with TB was observed for current smoking status (P = 2.0E−16 [...]), but TB did not (P = 0.12). Furthermore, we did not see a significant interaction between smoking and TB, COPD and TB, alcohol dependence and TB or AUD and TB on survival (interaction P-values 0.85, 0.42, 0.47 and 0.59, respectively).
In addition to association studies, we estimated the genetic overlap between TB and the risk factors identified from epidemiological analysis using LD score regression. In agreement with the epidemiological associations, smoking measured as number of cigarettes previously smoked daily (rg = 0.4377, P = 0.003) and current tobacco smoking (rg = 0.3476, P = 0.0048) were positively associated with TB (Supplementary Material, Tables S10 and S12).
To explore causality between TB and different traits, we performed MR analysis. MR suggested smoking as a risk factor for TB (inverse-variance weighted P = 0.01, OR = 1.83 [CI 95% 1.15-2.93]) with no significant pleiotropy found (Egger intercept pleiotropy test P = 0.46) (Fig. 4B). In addition, we tested the known risk factors BMI, vitamin D deficiency, COPD and AUD against TB but found no evidence of causality between these traits and TB in the FinnGen R7 cohort.

Discussion
In this paper, we identified genetic variants from the HLA region that protect against TB. In addition, we identified associations with comorbidities, where smoking and alcohol dependence, in particular, associated with TB. Our results indicate a unique interplay between host genetic components, primarily HLA DQB1*05:01 as a protective factor and HLA DRB1*13:02 as a susceptibility factor for TB. In addition, our findings highlight the environmental contribution of lifestyle and comorbid factors, including smoking and alcohol dependence, to TB susceptibility and survival.
Other genes shown to be associated with TB susceptibility in previous studies, such as ASAP1 (17), did not reach genome-wide significance in our study. In addition to HLA class II associations in TB, a recent integrative genomic analysis combining different data sets identified overall 26 candidate genes associated with TB susceptibility (36). This earlier evidence and our findings indicate that host and pathogen genetic factors affect disease susceptibility and severity.
There is clear heterogeneity in HLA class II allele associations between populations, and different lead variants associate with TB in different countries or ethnic groups. The reason behind the heterogeneity is unknown but may be due to selection, a bottleneck effect, pathogen-driven diversity or differing virulence of different lineages of M. tuberculosis bacteria (37)(38)(39)(40)(41). Furthermore, HLA alleles show diversity and individual allele frequencies differ across populations, which affects the power to observe associations for less common alleles in different populations. In functional studies, the role of HLA class II genes in TB has started to spark interest. Kust [...]
In addition to the HLA class II association with TB, we report the contribution of smoking and alcohol dependence to TB. Smoking is already a well-established risk factor for TB (45). Our results not only show a significant epidemiological association between TB, smoking habits and smoking-related disease (COPD), but also show genetic correlation and causality between these traits. Through MR we identified a causal relationship in which an increase in habitual smoking increases the risk for TB. In the FinnGen cohort, TB patients were enriched for smoking throughout different decades starting from the 1970s (Supplementary Material, Table S9). Taken together, these results indicate that smoking is a major risk factor for TB, alongside severe alcohol usage.
Our study does have some limitations. The lead SNP (rs9391858) identified by our GWAS was located nearest to the gene HLA DRA among the HLA genes. However, HLA DRA was not among the imputed HLA genes in FinnGen available for the HLA fine mapping analysis. We used LD score regression to estimate the genetic correlation between different traits and TB. Our LD score regression showed low heritability within traits (1-5%), which affects the reliability of those results; they should therefore be interpreted with caution. Low heritability was most likely due to the fact that the HLA region was removed from the analysis and our results were mainly from that specific chromosomal region. Our survival analysis for epidemiological traits and TB was conducted using a Cox proportional hazards model, which assumes that the effect of each included trait on the hazard is constant over time. We validated our Cox model by evaluating the proportionality of the predictors against time. The results showed slight statistical significance (P = 0.04), which indicates that not all predictors met the proportional hazards assumption of the Cox model, with smoking status being one of the most evident among the included predictors (smoking P = 0.039) (Supplementary Material, Table S8, global P-value). Furthermore, fitting a Cox proportional hazards model only within TB patients can introduce collider bias, and the results should be interpreted with that in mind. Nevertheless, smoking was associated with TB in our causality estimates and risk factor correlations, which highlights smoking as a significant risk factor for TB. Our endpoints in FinnGen for TB were defined using ICD-code based diagnoses (TB: ICD-10 codes A15-A19; respiratory TB: A15-A16; TB of other organs: A17-A19). Unfortunately, we did not have information on individuals among controls who may have been infected by M. tuberculosis and might have an undiagnosed latent TB infection. Additionally, we assumed in our analysis that BMI and smoking information remained unchanged throughout the studied period due to the longitudinal nature of the data used; this information does not necessarily represent BMI and smoking status at the time of the TB diagnosis.
[Figure caption, beginning missing: ... Table S5), RA (rheumatoid arthritis), IBD (inflammatory bowel disease), ever smoker, current smoker, Crohn's disease, COPD (chronic obstructive pulmonary disease), CHD (major coronary heart disease event), biological medication for rheumatoid arthritis (Bio.med. RA), AUD (alcohol use disorder), asthma and alcohol dependence had statistically significant and positive association with tuberculosis. (B) MR suggests habitual smoking (instrumented by cigarettes per day) as a risk factor for tuberculosis (inverse-variance weighted P = 0.01). The increase of habitual smoking increases the risk for tuberculosis.]
In Finland, cases of TB have gradually decreased from the mid-20th century to the present day (46). Almost half of the TB cases in Finland (year 2018) occur among immigrants, and transmission of TB is rare between immigrants and Finnish-born individuals (46)(47)(48). Most of the Finnish-born cases encountered today are elderly individuals with reactivation of a latent TB infection originally acquired during their childhood (46). In our study, FinnGen participants were matched against a Finnish reference panel, which means that our genetic findings are specific to individuals of Finnish ancestry; this can also be regarded as a limiting factor in our study. Furthermore, it has been previously shown that individuals with RA in Finland have a higher incidence of TB compared to the general Finnish population (49). This was also observed in the FinnGen cohort alongside other comorbidities and risk factors that are known risk factors for TB in other populations as well (8).
Our results highlight the importance of host genetic factors in TB alongside environmental risk factors. These additional results may benefit research in and, ultimately, clinical interventions for TB and other infectious diseases. However, further epidemiological and functional studies are needed to reveal the biological mechanisms underlying the individual reaction we have as humans to different infections.

Study cohorts
The FinnGen study (https://www.finngen.fi/en), launched in 2017, is a public-private partnership including Finnish universities, biobanks and hospital districts together with several pharmaceutical companies. The aim is to collect both National Health Record and genetic data from 500 000 Finns. The study participants include patients with acute and chronic diseases as well as healthy volunteer and population collections. R7 includes ∼310 000 individuals (∼175 000 females and ∼135 000 males).
The UK Biobank (UKB) is a prospective study containing over 500 000 individuals of mainly European ancestry (50). Invited participants were aged between 37 and 73 years upon entry to the study between 2006 and 2010 and were residents of the UK. The UKB combines medical health record data, lifestyle measures, questionnaire data, genotypes, blood count data and biochemistry measures, among other data. The electronic health records of UKB are a combination of Hospital Episode Statistics in-patient (HES; max. N = 440 512) and primary care (GP; max. N = 231 364) data. These data are updated frequently in order to capture the health trajectories of the participants.
The BioBank Japan (BBJ) project is a prospective cohort based on Japanese hospital records, which was launched in 2003 (51). Cohort data consist of DNA samples, serum samples and clinical data from 200 000 participants gathered from 66 hospitals nationwide. Participants were registered between 2003 and 2008, and the data were updated annually until 2013 with interviews and medical record reviews.

FinnGen ethics statement
Patients and control subjects in FinnGen provided informed consent for biobank research, based on the Finnish Biobank Act. Alternatively, separate research cohorts, collected before the Finnish Biobank Act came into effect (in September 2013) and before the start of FinnGen (August 2017), were collected based on study-specific consents and later transferred to the Finnish biobanks after approval by Fimea (Finnish Medicines Agency)

Genotyping and quality control
Genotyping in the FinnGen cohort was performed using Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA), and the genotypes were lifted over to build version 38 (GRCh38/hg38) (52). In sample-level quality control, individuals with high genotype missingness (>5%), ambiguous sex or excess heterozygosity (±4 standard deviations) were excluded from the data (52). In variant-level quality control, variants with high missingness (>2%), low minor allele count (<3) or deviation from Hardy-Weinberg equilibrium (HWE) (P < 1e−06) were removed (52). A more detailed description of the genotyping, quality control and genotype imputation with the SiSu v3 reference panel is given in Kurki et al. (preprint) (52). All individuals in the cohort were Finns and matched against the SiSu v3 reference panel (http://www.sisuproject.fi/).
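The sample-level and variant-level thresholds described above can be written as two simple filters. The following is a minimal sketch under stated assumptions, not the FinnGen pipeline itself (which is described in Kurki et al. (52)): the class and field names (`Sample`, `Variant`, `missing_rate`, `het_z`, `mac`, `hwe_p`) are hypothetical, and heterozygosity is assumed to be already expressed in standard deviations from the cohort mean.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    missing_rate: float  # fraction of genotypes missing for this individual
    sex_ambiguous: bool  # True if sex could not be determined unambiguously
    het_z: float         # heterozygosity, in SDs from the cohort mean (assumed input)


@dataclass
class Variant:
    variant_id: str
    missing_rate: float  # fraction of individuals with a missing call
    mac: int             # minor allele count
    hwe_p: float         # P-value of the Hardy-Weinberg equilibrium test


def sample_passes(s: Sample) -> bool:
    """Keep samples with <=5% missingness, unambiguous sex and |het| <= 4 SD."""
    return s.missing_rate <= 0.05 and not s.sex_ambiguous and abs(s.het_z) <= 4.0


def variant_passes(v: Variant) -> bool:
    """Keep variants with <=2% missingness, MAC >= 3 and HWE P >= 1e-6."""
    return v.missing_rate <= 0.02 and v.mac >= 3 and v.hwe_p >= 1e-6
```

In practice such filters would be applied to the full genotype matrix with dedicated tools; the sketch only makes the exclusion logic explicit.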

Phenotype definition
In the FinnGen study, the main phenotype used in our study was TB of all organs, defined using ICD-10 based diagnosis codes A15-A19. Individuals defined as a case in the TB of all organs endpoint had to have at least one of the following ICD-10 codes or their subcodes: A15, A16, A17, A18 or A19. Other phenotypes used were respiratory TB and TB of other organs. An individual was defined as a case in respiratory TB when that person had at least one of the following ICD-10 codes or their subcodes: A15 or A16. Controls with other TB-related ICD-10 codes were excluded from the respiratory TB analyses. An individual was defined as a case in TB of other organs when that person had the ICD-10 code A18 or one of its subcodes. Controls with other TB-related ICD-10 codes were excluded from the TB of other organs analyses.
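The case, control and exclusion rules in this section can be illustrated with a short classification routine. This is a hedged sketch rather than the FinnGen endpoint code: the function and constant names are hypothetical, ICD-10 codes are assumed to arrive as strings such as "A15.0", and the code sets follow the definitions given in this section (A18 for TB of other organs).

```python
# ICD-10 prefixes per endpoint, as stated in the phenotype definition above.
TB_ALL = ("A15", "A16", "A17", "A18", "A19")
TB_RESPIRATORY = ("A15", "A16")
TB_OTHER_ORGANS = ("A18",)


def has_code(codes, prefixes):
    """True if any ICD-10 code equals one of the prefixes or is a subcode of it."""
    normalized = [c.strip().upper().replace(".", "") for c in codes]
    return any(c.startswith(prefixes) for c in normalized)


def classify(codes, endpoint_prefixes):
    """Return 'case', 'control' or 'excluded' for one individual's code list."""
    if has_code(codes, endpoint_prefixes):
        return "case"
    # Non-cases carrying any other TB-related code are removed from the controls.
    if has_code(codes, TB_ALL):
        return "excluded"
    return "control"
```

For example, `classify(["A18.1"], TB_RESPIRATORY)` gives "excluded", because the individual is not a respiratory TB case but carries another TB-related code.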

UKB phenotype definition
From the UKB data, we obtained both self-reported and electronic health record data for disease definitions. To define the phenotypes, we used data from the self-report non-cancer illness codes (data field 20002), which were assessed during the baseline interview, hospital inpatient records (HES; data field 41234) and primary care diagnosis records (data field 42040). For TB of all organs, code 1440 was used from the self-reported data. From the hospital inpatient data, we included individuals as a case for the phenotype if they had at least one of the ICD-10 diagnosis codes used for FinnGen (see above) and, as with FinnGen, included participants with subcodes in the endpoint. Additionally, in the UKB, ICD-9 diagnosis codes 0130, 01199, 01789, 0172, 0160 and 015 were used. In the primary care data, diagnoses are coded using the NHS-specific Read v2 or CTV3 codes instead of the ICD coding. We used the following Read codes to define the respective phenotype: [...] With this definition for TB of all organs, we ended up with 3431 cases and 442 492 controls of European ancestry. Most of the cases for TB of all organs came from the self-reported data (N = 1985 (57.9%)) and primary care data (N = 1298 (37.8%)).
For respiratory TB, the same ICD-10 codes were used as in FinnGen for the equivalent phenotype (see above). From the primary care data, the following Read codes were used for respiratory TB: [...] In addition, we excluded the following codes from controls due to other TB diagnosis, TB exposure, history of personal TB or TB contact: [...]

BBJ phenotype definition
In the publicly available BBJ summary statistics, only respiratory TB (defined by BBJ as pulmonary TB) was available in the predefined endpoints of BBJ (24). Endpoints were defined using clinical data and disease records (24). For controls, samples of the cohort without the given diagnosis for respiratory TB or related codes were used (24). With this definition, the BBJ respiratory TB phenotype consisted of 7800 cases and 170 871 controls.
Association testing between individual HLA alleles and TB was conducted with multivariate logistic regression using R (version 4.0.3, packages: data.table, dplyr and tidyverse). The multivariate logistic regression model was adjusted for age at death or end of follow-up (December 31, 2019), sex and the first 20 genetic principal components (adjusting for principal components accounts for population structure within the cohort). Stepwise logistic regression was conducted by adding the most strongly associated HLA allele as a covariate to the multivariate regression analysis. This was repeated until no significant alleles remained in the analysis.
To assess genetic correlation between TB and different traits, we performed LD score regression analyses with LD HUB provided by the Broad Institute of Massachusetts Institute of Technology (MIT) and Harvard and MRC Integrative Epidemiology Unit, University of Bristol (56)(57)(58). Additionally, we tested AUD using LD score regression separately since the trait was not available in the LD HUB tool (56). For the LD score regression, we used the HapMap 3 SNP list and European LD score files provided with the software. Summary statistics for LD score regression were obtained from FinnGen R7 GWAS of TB of all organs and from Sanchez-Roige et al. for AUD (59). LD between the lead variants in the HLA region was estimated using LDpair Tool from LDlink (60).
We obtained the lead SNPs associated with smoking ('Cigarettes per day' and 'Age of initiation of regular smoking') from a recent large-scale GWAS (61) and used them as exposure instruments against the FinnGen R7 TB GWAS summary statistics. The lead SNPs associated with vitamin D deficiency were obtained from the study by Revez et al. (62). The BMI associated SNPs and COPD associated SNPs were obtained from the Integrative Epidemiology Unit open GWAS project ID: ukb-b-19953. AUD associated SNPs were obtained from the study by Sanchez-Roige et al. (59). The MR was performed using the TwoSampleMR R package (63,64). Furthermore, we tested for potential pleiotropic effects using the Egger intercept method as part of the TwoSampleMR package and the MR-PRESSO package (65).
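To make the inverse-variance weighted (IVW) estimate concrete, the sketch below computes a fixed-effect IVW estimate from per-SNP summary statistics and converts it to an odds ratio with a 95% confidence interval. It is an illustration of the general technique only: the analyses in this study were run with the TwoSampleMR R package, whose IVW implementation may differ in detail (for example in how the standard error is scaled), and the function and argument names here are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect (log odds ratio) and its SE.

    Each SNP contributes a ratio estimate beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2.
    """
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx ** 2 for w, bx in zip(weights, beta_exposure))
    return numerator / denominator, math.sqrt(1.0 / denominator)


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log odds ratio and its SE to (OR, lower, upper) for a 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)
```

With toy inputs in which every SNP has an outcome effect equal to half its exposure effect, `ivw_estimate` recovers a causal estimate of 0.5 on the log odds scale, which `odds_ratio_with_ci` then maps to the odds ratio scale used in the Results.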
A Kaplan-Meier estimator (66) was used to create survival curves representing the effect of selected comorbidities on survival among TB patients. A Cox proportional hazards model (Cox regression) was used to estimate the effect of selected risk factors on survival among TB patients (67). Cox regression was stratified by sex and cohort (cohort representing, for example, a biobank or study included within the FinnGen study) and adjusted for BMI and the first 10 genetic principal components. The survival function for the Kaplan-Meier estimator and Cox regression was constructed using age at death or end of follow-up (December 31, 2019) as the time variable and death (0 or 1) as the event variable. Additionally, the proportional hazards assumption of the Cox regression model was tested (68). Analyses were conducted using R version 4.0.3 (packages: survival, survminer, survMisc, ggsurvplot and ggplot2).
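The Kaplan-Meier estimator used for the survival curves can be sketched in a few lines. This is an illustrative implementation, not the code used in the study (which relied on the R survival package); the function name is hypothetical, `times` stands for the time variable (here, age at death or end of follow-up) and `events` for the event variable (1 = death, 0 = censored).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time at which a death occurred.

    S(t) is the product over event times t_i <= t of (1 - d_i / n_i), where
    d_i is the number of deaths at t_i and n_i the number still at risk.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t (deaths and censored).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve
```

Individuals censored at a given time are counted as still at risk for deaths occurring at that same time, which is the usual convention.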

Supplementary Material
Supplementary Material is available at HMG online.

Data and Code Availability
Data and code used in this study are available upon reasonable request. The FinnGen individual level data may be accessed through applications to the Finnish Biobanks' FinnBB portal, Fingenious (www.finbb.fi). Summary data can be accessed through the FinnGen site https://www.finngen.fi/en/access_results.

Conflict of Interest statement
None declared.