LncRNA and predictive model to improve the diagnosis of clinically diagnosed pulmonary tuberculosis

Background Clinically diagnosed pulmonary tuberculosis (PTB) patients lack Mycobacterium tuberculosis (MTB) microbiologic evidence, and misdiagnosis or delayed diagnosis often occurs as a consequence. We investigated the potential of lncRNAs and corresponding predictive models to diagnose these patients. Methods We enrolled 1372 subjects, including clinically diagnosed PTB patients, non-TB disease controls and healthy controls, in three cohorts (Screening, Selection and Validation). Candidate lncRNAs differentially expressed in blood samples of the PTB and healthy control groups were identified by microarray and qRT-PCR in the Screening Cohort. Logistic regression models were developed using lncRNAs and/or electronic health records (EHRs) from clinically diagnosed PTB patients and non-TB disease controls in the Selection Cohort. These models were evaluated by AUC and decision curve analysis, and the optimal model was presented as a Web-based nomogram, which was evaluated in the Validation Cohort. The biological function of lncRNAs was interrogated using ELISA, lactate dehydrogenase release analysis and flow cytometry. Results Three differentially expressed lncRNAs (ENST00000497872, n333737, n335265) were identified. The optimal model (i.e., nomogram) incorporated these three lncRNAs and six EHR variables (age, hemoglobin, weight loss, low-grade fever, CT calcification and TB-IGRA). The nomogram showed an AUC of 0.89, sensitivity of 0.86 and specificity of 0.82 in the Validation Cohort, which demonstrated better discrimination and clinical net benefit than the EHR model. ENST00000497872 may regulate inflammatory cytokine production, cell death and apoptosis during MTB infection. Conclusions LncRNAs and the user-friendly nomogram could facilitate the early identification of PTB cases among suspected patients with negative MTB microbiologic evidence.


INTRODUCTION
Tuberculosis (TB) is the leading cause of death from an infectious agent [1], but only 56% of pulmonary tuberculosis (PTB) cases reported to WHO in 2017 were bacteriologically confirmed. Thus, approximately half of all PTB cases are clinically diagnosed worldwide, and this proportion can reach 68% in China [1]. Clinically diagnosed PTB cases are symptomatic but lack evidence of Mycobacterium tuberculosis (MTB) infection by smear microscopy, culture or nucleic acid amplification test [1-3]. The diagnostic procedure for clinically diagnosed PTB is inadequate and time-consuming and often results in misdiagnosis or delayed diagnosis [3], leading to an increased risk of morbidity and mortality [4], or overtreatment [5]. There is thus an urgent need to develop rapid and accurate strategies to diagnose PTB cases without MTB microbiologic evidence [6,7]. The exploration of effective host immune-response signatures represents an attractive approach for this type of assay.

Long noncoding RNAs (lncRNAs) can function as critical regulators of inflammatory responses to infection, especially for T-cell responses [8,9]. Increasing evidence indicates that blood lncRNA expression profiles are closely associated with TB disease [10][11][12], suggesting that lncRNAs could function as potential noninvasive biomarkers for TB detection. However, previous studies have suffered from small sample sizes (ranging from 66 to 510) and a lack of independent validation.

Recent efforts have focused on establishing clinical prediction rules or predictive models for TB diagnosis based on patients' electronic health record (EHR) information [13][14][15][16] [...] HIV-infected PTB patients [13]. However, despite these advances, current EHR models remain insufficient for precise TB diagnosis. Compelling studies have proposed that models incorporating biomarkers and EHR information attain better performance for prediction of sepsis [17] and abdominal aortic aneurysm [18].
We previously reported that combining exosomal microRNAs and EHRs in the diagnosis of tuberculous meningitis (TBM) achieved AUCs of up to 0.97 versus an AUC of 0.67 obtained using EHR alone [19].

Study design
We performed this study through a four-stage approach. LncRNAs that were differentially expressed (DE) between clinically diagnosed PTB patients and healthy subjects were profiled by microarray in the Screening Step. The expression of the top five lncRNAs was then analyzed in a large prospective cohort in the Selection Step, which reduced the number of candidate lncRNAs from five to three based on expression differences among groups. In the Model Training Step, lncRNAs and EHRs were used to develop predictive models for clinically diagnosed [...] e-Figure 1B

Diagnostic modeling. Multivariable logistic regression was used to develop predictive models to distinguish clinically diagnosed PTB among patients with suspected PTB in the Selection Cohort. Feature subsets were selected and compared using the best subset selection procedure [23] and 10-fold cross-validation. The "EHR+lncRNA", "lncRNA only" and "EHR only" models were each developed from their respective best feature subset in the Selection Cohort. The cutoff for each model was determined by combining Youden's index with the requirement that sensitivity in the training dataset be equal to or greater than 0.85. The models, including their cutoffs, were then evaluated in the Validation Cohort.

Nomogram presentation and evaluation. We further adopted a nomogram to visualize the optimal model with the best AUC [24,25]. Nomogram calibration was assessed with the calibration curve and the Hosmer-Lemeshow test (p-value > 0.05 suggested no departure from perfect fit). The performance of the nomogram was tested in the independent Validation Cohort, with total points calculated for each patient. Decision curve analysis (DCA) [25] was performed to evaluate the clinical net benefit of the nomogram and the "EHR only" model across the overall datasets. Clinical value was assessed by comparing the nomogram and the "EHR only" model using 500 bootstrap resamples.
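To make the quantity compared in decision curve analysis concrete, the following minimal Python sketch computes net benefit at a threshold probability p_t using the standard definition, NB = TP/n - (FP/n) x p_t/(1 - p_t), together with the "treat all" reference strategy. It is an illustration only, not the code used in this study; the function names and the toy labels and probabilities are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    # True positives: diseased patients flagged by the model at this threshold.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    # False positives: non-diseased patients flagged by the model.
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are down-weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def treat_all_net_benefit(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy data (1 = PTB, 0 = non-TB disease control); not study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

for pt in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = treat_all_net_benefit(y_true, pt)
    print(f"threshold={pt}: model={model_nb:.3f}, treat-all={all_nb:.3f}")
```

A decision curve plots these values over a range of thresholds; a model is considered clinically useful where its curve lies above both the "treat all" and "treat none" (net benefit of zero) strategies.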
The nomogram was implemented as a Web-based app using R Shiny.
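The cutoff rule described under Diagnostic modeling can be read in more than one way. The Python sketch below shows one possible interpretation, stated here as an assumption: among candidate cutoffs whose training-set sensitivity is at least 0.85, choose the one that maximizes Youden's index J = sensitivity + specificity - 1. The function name `choose_cutoff` and the toy data are hypothetical and are not taken from the study.

```python
def choose_cutoff(y_true, y_prob, min_sensitivity=0.85):
    """Return (cutoff, sensitivity, specificity, J) for the cutoff that
    maximizes Youden's J among cutoffs meeting the sensitivity floor,
    or None if no cutoff meets it. Assumed reading of the rule, not the
    authors' exact procedure."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required")
    best = None
    # Every distinct predicted probability is a candidate cutoff;
    # a patient is called positive when probability >= cutoff.
    for cutoff in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
        sens = tp / positives
        spec = tn / negatives
        if sens < min_sensitivity:
            continue  # violates the sensitivity requirement
        j = sens + spec - 1
        if best is None or j > best[3]:
            best = (cutoff, sens, spec, j)
    return best


# Hypothetical toy training data (1 = PTB, 0 = control); not study data.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.35, 0.6, 0.1, 0.2, 0.3, 0.5, 0.65]

result = choose_cutoff(y_true, y_prob)
print(result)
```

Under this reading the sensitivity floor takes priority: a cutoff with a comparable Youden's index but sensitivity below 0.85 is never selected, which favors ruling in fewer missed PTB cases at the cost of specificity.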

Characteristics of prospectively enrolled participants
The demographic and clinical characteristics of participants in the Selection and Validation Cohorts are provided in Table 1. PTB patients were younger and had greater IGRA positivity rates than non-TB disease controls (DC) (p-value < 0.0001 for both the Selection and Validation Cohorts), but these groups did not differ by gender, BMI, or smoking status. Healthy subjects were age-, gender-, and BMI-matched with PTB patients but had significantly different blood test results (Table 1).

Clinically diagnosed PTB patients accounted for 29.82% (238/798) of all active PTB patients (see e-Appendix 1). This rate is markedly lower than the nationwide estimate of 68% based on primary public health institutions [1], but represents the clinically diagnosed PTB rate in a referral hospital with experienced specialists.

LncRNAs microarray profiles and candidate selection
In the Screening Step, microarray profiling identified a total of 325 lncRNAs that were differentially expressed (287 upregulated and 38 downregulated) in clinically diagnosed PTB patients versus healthy subjects. Hierarchical clustering and a volcano plot revealed clearly distinguishable lncRNA expression profiles (e-Figure 2). The top five lncRNA candidates were chosen based on a set of combined criteria: fold-change > 2 between groups, p-value < 0.05, signal intensity > 25 [27], and no previous report in the TB literature [28]. Three of these five lncRNAs were upregulated (n335265, ENST00000518552 and TCONS_00013664) and two were downregulated (n333737 and ENST00000497872) in PTB versus control subjects (e-Table 3).

Differentially expressed lncRNAs in clinically diagnosed PTB

The expression levels of these five candidate lncRNAs were measured by qRT-PCR in the Selection Cohort, which consisted of 141 clinically diagnosed PTB patients, 159 non-TB DC, and 578 healthy subjects. Two lncRNAs (ENST00000518552 and TCONS_00013664) were excluded from further analysis due to their low abundance (Cq > 35) in this cohort. Of the three remaining lncRNAs, ENST00000497872 and n333737 were downregulated and n335265 was upregulated in PTB patients versus healthy subjects (e-Table 4). Comparison between clinically diagnosed PTB cases and non-TB DC patients revealed decreased expression of ENST00000497872 and n333737 in PTB patients (e-Figure 3A; age-adjusted p-values both < 0.0001).

Short-term stability, an essential prerequisite for a potential lncRNA biomarker, was assessed in PBMC samples. Incubation for up to 24 h had minimal effect on the expression of ENST00000497872, n333737, and n335265 (e-Table 5), in accordance with a previous report of lncRNA stability in blood [29].
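The two numeric filtering steps described above (the microarray criteria of fold-change > 2, p-value < 0.05 and signal intensity > 25, followed by exclusion of candidates with Cq > 35 in qRT-PCR) can be sketched in Python as follows. This is an illustrative sketch only: the `Candidate` record, its field names and all values are hypothetical, fold-change is assumed to be expressed as an absolute ratio for both up- and downregulated lncRNAs, and the literature-novelty criterion and the ranking that yields the "top five" are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    fold_change: float  # assumed absolute fold-change between groups (microarray)
    p_value: float      # microarray differential-expression p-value
    signal: float       # microarray signal intensity
    mean_cq: float      # mean qRT-PCR quantification cycle in the cohort


def passes_screening(c):
    """Microarray criteria stated in the text (thresholds from the paper)."""
    return c.fold_change > 2 and c.p_value < 0.05 and c.signal > 25


def passes_abundance(c, max_cq=35.0):
    """Candidates with Cq > 35 are treated as low abundance and excluded."""
    return c.mean_cq <= max_cq


# Hypothetical records invented for illustration; not data from this study.
candidates = [
    Candidate("lnc_A", 3.1, 0.001, 120.0, 28.4),
    Candidate("lnc_B", 2.6, 0.010, 40.0, 36.2),   # fails the Cq filter
    Candidate("lnc_C", 1.5, 0.030, 300.0, 25.0),  # fails the fold-change filter
    Candidate("lnc_D", 4.0, 0.200, 80.0, 27.0),   # fails the p-value filter
]

screened = [c for c in candidates if passes_screening(c)]
kept = [c for c in screened if passes_abundance(c)]
print([c.name for c in screened], [c.name for c in kept])
```

Keeping the two filters as separate functions mirrors the study design, in which the microarray criteria were applied in the Screening Step and the abundance check was applied later in the Selection Cohort.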

Diagnostic modeling and nomogram visualization
Three logistic regression models, "EHR+lncRNA", "EHR only", and "lncRNA only", were evaluated as part of the training step in the Selection Cohort (see e-Appendix 4). The variance inflation factors of the features ranged from 1.02 to 1.29, indicating no collinearity within models. The "EHR+lncRNA" model yielded the highest AUC (0.92) for distinguishing clinically diagnosed PTB from suspected PTB patients, compared to AUCs of 0.87 and 0.82 for the "EHR only" and "lncRNA only" models, respectively (Figure 2A). The "EHR+lncRNA" model also had the best sensitivity, specificity, accuracy, positive predictive value, and negative predictive value (Table 2).

The optimal "EHR+lncRNA" model was displayed as a nomogram (Figure 3A [...] Table 4, e-Figure 3B). All three models were applied to the Validation Cohort, and as reported in Table 2 [...] with published lncRNA data [8][9][10][11][12][31], these data provide new evidence that lncRNAs could participate in TB immunoregulation and serve as promising biomarkers for TB diagnosis.

In addition to the three lncRNAs, we identified six EHR predictors (age, CT calcification, positive TB-IGRA, low-grade fever, elevated hemoglobin, and weight loss) that were essential in TB case finding, as proposed in prior studies [15,16]. Age was an important negative predictor for clinically diagnosed PTB, which appears to conflict with the consensus that advanced age correlates with higher TB susceptibility [32]. This may be explained by differences in the enrollment of PTB patients and control subjects. Previous studies included healthy and/or vulnerable subjects as controls, whereas we enrolled older inpatients with a wide range of pulmonary diseases as disease controls.

This work serves as a first proof-of-concept study showing that integrating lncRNA signatures and EHR data could be a more promising diagnostic approach for PTB patients with negative MTB pathogenic evidence. The "EHR+lncRNA" model had good discrimination (assessed by AUC and diagnostic parameters), reliable calibration (assessed by calibration curve and Hosmer-Lemeshow test), and potential clinical utility for decision-making (assessed by DCA). The "EHR+lncRNA" model avoided some common problems associated with sputum-based features, such as poor sputum quality or problematic sampling [33], which improves its reliability and clinical utility. Nomograms have been shown to remarkably promote early diagnosis of intestinal tuberculosis [24] and prognosis prediction in PTB [34] [...]

LncRNA expression before (blue) and after (red) a 2-month intensive anti-TB treatment regimen. Altered lncRNA expression was calculated as log2(post-treatment expression / pre-treatment expression), and the Wilcoxon matched-pairs signed-rank test was used for comparisons among 22 paired samples.
The median and interquartile range of log2 lncRNA were as follows: ENST00000497872 (before: -