Long Noncoding RNA and Predictive Model To Improve Diagnosis of Clinically Diagnosed Pulmonary Tuberculosis

Clinically diagnosed pulmonary tuberculosis (PTB) patients lack microbiological evidence of Mycobacterium tuberculosis, and misdiagnosis or delayed diagnosis often occurs as a consequence. We investigated the potential of long noncoding RNAs (lncRNAs) and corresponding predictive models to diagnose these patients. We enrolled 1,764 subjects, including clinically diagnosed PTB patients, microbiologically confirmed PTB cases, non-TB disease controls, and healthy controls, in three cohorts (screening, selection, and validation).

infection by smear microscopy, culture, or nucleic acid amplification tests (1-3). The diagnostic procedure for clinically diagnosed PTB is inadequate and time-consuming and often results in misdiagnosis or delayed diagnosis (3), leading to an increased risk of morbidity and mortality (4) or overtreatment (5). There is thus an urgent need to develop rapid and accurate strategies to diagnose PTB cases without microbiological evidence of M. tuberculosis (6,7). The exploration of effective host immune response signatures represents an attractive approach for this type of assay.
Long noncoding RNAs (lncRNAs) can function as critical regulators of inflammatory responses to infection, especially for T-cell responses (8,9). Increasing evidence indicates that blood lncRNA expression profiles are closely associated with TB disease (10-12), suggesting that lncRNAs could function as potential noninvasive biomarkers for TB detection. However, previous studies have suffered from small sample sizes (ranging from 66 to 510) and a lack of independent validation.
Recent efforts have focused on establishing clinical prediction rules or predictive models for TB diagnosis based on patients' electronic health record (EHR) information (13-16). Such models can cost-effectively facilitate PTB diagnosis with a limited number of clinical-radiological predictors. For example, a 6-signature model described previously by Griesel et al. (a cough lasting ≥14 days, the inability to walk unaided, a temperature of >39°C, chest radiograph assessment, hemoglobin level, and white cell count) attained an area under the receiver operating characteristic curve (AUC) of 0.81 (95% confidence interval [CI], 0.80 to 0.82) in seriously ill HIV-infected PTB patients (13). However, despite these advances, current EHR models remain insufficient for precise TB diagnosis. Several studies have reported that models incorporating both biomarkers and EHR information attain better performance for the prediction of sepsis (17) and abdominal aortic aneurysm (18). We previously reported that combining exosomal microRNAs and EHRs in the diagnosis of tuberculous meningitis (TBM) achieved AUCs of up to 0.97, versus an AUC of 0.67 obtained using EHRs alone (19). Based on these studies, we hypothesized that combining lncRNAs with well-defined EHR predictors could be used to develop improved predictive models to identify PTB cases that lack microbiological evidence of M. tuberculosis infection.
This study was therefore performed to investigate the diagnostic potential of lncRNAs and predictive models incorporating lncRNA and EHR data for the identification of PTB cases without microbiological evidence of M. tuberculosis. This study also explored the diagnostic potential of lncRNA candidates and the optimal model for microbiologically confirmed PTB.

MATERIALS AND METHODS
Study design. We performed this study through a four-stage approach. lncRNAs that were differentially expressed (DE) between clinically diagnosed PTB patients and healthy subjects were profiled by microarray in the screening step. The expression levels of the top five lncRNAs were then analyzed in a large prospective cohort in the selection step of the study, which reduced the number of lncRNAs from 5 to 3 based on expression differences among groups. In the model training step, lncRNAs and EHRs were used to develop predictive models for clinically diagnosed PTB patients and nontuberculosis disease control (non-TB DC) patients, and the optimal model was visualized as a nomogram. Finally, we validated lncRNAs and the nomogram in a prospective cohort, including both clinically diagnosed PTB and microbiologically confirmed PTB cases. The study strategy is shown in Fig. 1.
Subject enrollment. (i) Screening cohort. We retrospectively collected 7 age- and gender-matched PTB cases and 5 healthy controls as the screening cohort. The cohort comprised 6 males and 6 females aged 22 to 59 years. PTB cases were clinically confirmed PTB patients with positive TB symptoms, negative microbiological evidence of TB, and a good response to anti-TB therapy. Healthy subjects had a normal physical examination and no history of TB.
(ii) Selection and validation cohorts. Inpatients with clinical-radiological suspicion of PTB but lacking microbiological evidence of TB were prospectively enrolled from West China Hospital between December 2014 and May 2017. The inclusion criteria for highly suspected patients were new patients with (i) high clinical-radiological suspicion of PTB, (ii) anti-TB therapy for <7 days on admission, (iii) negative microbiological evidence of TB (i.e., at least two consecutive negative smears, one negative M. tuberculosis DNA PCR result, and one negative culture result), (iv) an age of ≥15 years, and (v) no severe immunosuppressive disease, HIV infection, or cardiac or renal failure. Two experienced pulmonologists reviewed and diagnosed all presumptive PTB patients. According to the Chinese diagnostic criteria for PTB, final diagnoses for all cases were based on the combination of clinical assessment, radiological and laboratory results, and response to treatment (1, 2) (see Appendix S1 in the supplemental material). A 12-month follow-up through telephone or WeChat was used to confirm the classification of clinically diagnosed PTB patients and non-TB patients. Detailed descriptions of patients' symptoms and recruitment, inclusion and exclusion criteria, laboratory examinations, diagnostic criteria and procedures, treatment, and sample size estimates are provided in Appendices S1 and S2 in the supplemental material. We also enrolled microbiologically confirmed PTB cases in the validation cohort. Healthy subjects were simultaneously recruited from a pool of healthy individuals with a normal physical examination and no history of TB.
The selection cohort comprised 878 participants (141 clinically diagnosed PTB cases, 159 non-TB DCs, and 578 healthy subjects), and the validation cohort comprised 874 participants (97 clinically diagnosed PTB cases, 392 microbiologically confirmed PTB cases, 140 non-TB DCs, and 245 healthy subjects). Details of the non-TB DCs are listed in Table S1 in the supplemental material. Ethics approval was obtained from the Clinical Trials and Biomedical Ethics Committee of West China [approval no. 2014(198)]. Informed consent was obtained from every participant.
lncRNA detection. (i) RNA isolation and cDNA preparation. Peripheral blood mononuclear cell (PBMC) samples were isolated from 3-ml fresh blood samples from each participant using a human lymphocyte separation tube kit (Dakewe Biotech Company Limited, China). Total RNA was extracted from PBMC isolates using TRIzol reagent (Invitrogen, USA). RNA concentration and purity were evaluated spectrophotometrically, and RNA integrity was determined using agarose gel electrophoresis (Fig. S1A). The PrimeScript RT reagent kit with gDNA Eraser (TaKaRa, Japan) was used to remove contaminating genomic DNA and synthesize cDNA. All the kits were used according to the manufacturers' instructions.
(ii) lncRNA microarray profiling. lncRNA profiles were detected using Affymetrix human transcriptome array 2.0 chips based on a standard protocol (20). Raw data were normalized using the robust multiarray average expression measure algorithm. DE lncRNAs with P values of <0.05 and fold changes of >2 were identified using empirical Bayes-moderated t statistics and presented by hierarchical clustering and a volcano plot (21).
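The thresholding step can be illustrated with a minimal Python sketch. It assumes that the moderated P values and log2 fold changes have already been computed elsewhere, and it treats "fold change of >2" as applying in either direction (|log2 fold change| > 1), which is an assumption on our part. The function name `select_de` and the probe records are hypothetical and are not taken from the study's code.

```python
import math

def select_de(records, p_cutoff=0.05, fc_cutoff=2.0):
    """Split probes passing both thresholds into up- and downregulated lists.

    records: iterable of (probe_id, log2_fold_change, p_value) tuples.
    Assumption: the fold-change cutoff applies in both directions.
    """
    log_cut = math.log2(fc_cutoff)
    up, down = [], []
    for probe_id, log2_fc, p_value in records:
        # Discard probes failing either the P value or the fold-change filter.
        if p_value >= p_cutoff or abs(log2_fc) <= log_cut:
            continue
        (up if log2_fc > 0 else down).append(probe_id)
    return up, down

# Hypothetical values for illustration only (not study data).
records = [
    ("lnc_A", 1.8, 0.001),   # passes, upregulated
    ("lnc_B", -1.4, 0.02),   # passes, downregulated
    ("lnc_C", 0.6, 0.001),   # fold change too small
    ("lnc_D", 2.5, 0.20),    # P value too large
]
up, down = select_de(records)
```

In this toy input only `lnc_A` and `lnc_B` survive both filters; the moderated t statistics themselves are outside the scope of the sketch.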
(iii) qRT-PCR for lncRNAs. Three lncRNAs were amplified using the following primers: 5′-TTCCTCACCCTCTTCCTGCT-3′ (forward) and 5′-AAGGCATGTGAGTAAGGGCG-3′ (reverse) for ENST00000497872, 5′-GCAGAAAGCAAGGACCAA-3′ (forward) and 5′-GGATGAGCAGCGATGAAG-3′ (reverse) for n333737, and 5′-CGCAGAAGTAAGTAGCCGGG-3′ (forward) and 5′-ACTGGATGAGCGTGAAGTGG-3′ (reverse) for n335265 (Table S2). A final 10-μl-volume reaction mixture for reverse transcription-quantitative PCR (qRT-PCR) included 5 μl of SYBR Premix Ex Taq II (TaKaRa, China), 0.5 μl each of 10 μM forward and reverse primers, 3 μl of double-distilled water (ddH2O) (PCR grade), and 1 μl of template cDNA. The cycling program consisted of 95°C for 1 min, followed by 35 cycles at 95°C for 10 s, 56 to 62°C for 30 s, and 72°C for 60 s. The samples were denatured at 95°C for 30 s and then heated to 65°C for 30 s at a rate of 0.2°C/s. The ddH2O negative control and blank control in each reaction showed no detectable signals, ensuring the lack of contamination or nonspecific products. lncRNA expression was measured in a blind fashion, normalized to the endogenous control glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene, and calculated according to the 2^(-ΔΔCq) method, where a quantification cycle (Cq) value of <35 was considered acceptable (22). More details of qRT-PCR detection (PCR amplification curves and standard curve, quality control, product sequencing verification, and stability test) are listed in Appendix S3 and Fig. S1B and C in the supplemental material.
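As a worked illustration of the 2^(-ΔΔCq) calculation, the following Python sketch computes relative expression for one sample against a calibrator. The choice of calibrator (for example, a reference sample or the mean of a control group) is not specified in the text above and is an assumption here, as are the function name and the example Cq values. Treating any Cq of 35 or more as unacceptable follows the "<35" criterion stated above.

```python
def relative_expression(cq_target, cq_ref, cq_target_cal, cq_ref_cal, max_cq=35.0):
    """Relative expression by the 2^(-ddCq) method.

    cq_target / cq_ref: Cq of the lncRNA and of GAPDH in the sample.
    cq_target_cal / cq_ref_cal: the same two values in the calibrator
    (hypothetical; the calibrator definition is an assumption).
    Returns None when any Cq fails the acceptance criterion (Cq < max_cq).
    """
    if max(cq_target, cq_ref, cq_target_cal, cq_ref_cal) >= max_cq:
        return None
    delta_sample = cq_target - cq_ref          # dCq of the sample
    delta_cal = cq_target_cal - cq_ref_cal     # dCq of the calibrator
    return 2.0 ** -(delta_sample - delta_cal)  # 2^(-ddCq)

# Hypothetical Cq values: the sample's dCq is one cycle higher than the
# calibrator's, which corresponds to half the relative expression.
fold = relative_expression(28.0, 20.0, 27.0, 20.0)
rejected = relative_expression(36.0, 20.0, 27.0, 20.0)
```

Here `fold` evaluates to 0.5, and `rejected` is `None` because one Cq exceeds the acceptance limit.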
Modeling. (i) Data used for modeling. A total of 41 EHRs, including demographic, clinical, laboratory, and radiological findings, were collected (Appendix S4), and a 20% missing value threshold was applied to remove incomplete features. Features with P values of <0.05 in the univariate analysis or with definite clinical significance were included for modeling. A total of 14 of the 44 original variables (41 EHRs and 3 lncRNAs) remained after filtering, including 11 EHRs and 3 lncRNAs (Appendix S4).
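The missing-value filter can be sketched in a few lines of Python. This is an illustration only: it assumes that a feature is dropped when more than 20% of its values are missing (the text does not state whether the boundary is inclusive), and the feature names and records below are hypothetical.

```python
def drop_sparse_features(rows, max_missing=0.20):
    """Return the feature names whose missing fraction is at most max_missing.

    rows: list of dicts mapping feature name -> value (None means missing).
    Assumption: a feature at exactly the threshold is kept.
    """
    features = sorted({name for row in rows for name in row})
    kept = []
    for name in features:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing:
            kept.append(name)
    return kept

# Hypothetical records for illustration only (not study data).
rows = [
    {"age": 34, "esr": None, "hb": 120},
    {"age": 51, "esr": None, "hb": None},
    {"age": 47, "esr": 30, "hb": 131},
    {"age": 29, "esr": 18, "hb": 140},
    {"age": 62, "esr": 42, "hb": 115},
]
kept = drop_sparse_features(rows)
```

With these five records, `esr` is missing in 40% of rows and is dropped, whereas `hb` (20% missing) and `age` (complete) are retained.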
(ii) Diagnostic modeling. Multivariable logistic regression was used to develop predictive models to distinguish clinically diagnosed PTB cases from patients with suspected PTB in the selection cohort. Feature subsets were selected and compared using the best-subset selection procedure (23) and 10-fold cross-validation. The "EHR-plus-lncRNA" (EHR+lncRNA), "lncRNA-only," and "EHR-only" models were developed according to their respective best-feature subset in the selection cohort. A cutoff for each model was determined by combining Youden's index and a sensitivity for the samples in the training data set of ≥0.85. The models, including their cutoffs, were used for evaluation of the validation cohort.
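One plausible reading of the cutoff rule is sketched below in Python: among candidate cutoffs whose training sensitivity is at least 0.85, pick the one that maximizes Youden's index (sensitivity + specificity - 1). How the two criteria were actually combined is not detailed in the text, so this ordering is an assumption, and the predicted probabilities and labels are hypothetical.

```python
def choose_cutoff(probs, labels, min_sensitivity=0.85):
    """Pick the cutoff maximizing Youden's index subject to a sensitivity floor.

    probs: predicted probabilities; labels: 1 = PTB, 0 = non-TB.
    A sample is called positive when its probability is >= cutoff.
    Returns (cutoff, youden, sensitivity, specificity) or None if no
    cutoff reaches the sensitivity floor.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for cutoff in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
        sens = tp / positives
        spec = tn / negatives
        if sens < min_sensitivity:
            continue  # violates the sensitivity constraint
        youden = sens + spec - 1
        if best is None or youden > best[1]:
            best = (cutoff, youden, sens, spec)
    return best

# Hypothetical training predictions for illustration only.
probs = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
cutoff, youden, sens, spec = choose_cutoff(probs, labels)
```

In this toy example the unconstrained Youden optimum would be a cutoff of 0.6 (sensitivity 0.83, specificity 1.0), but the sensitivity floor moves the chosen cutoff to 0.35 (sensitivity 1.0, specificity 0.75), which shows how the constraint trades specificity for sensitivity.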
(iii) Nomogram presentation and evaluation. We further adopted the nomogram to visualize the optimal model with the best AUC (24,25). Nomogram calibration was assessed with the calibration curve and the Hosmer-Lemeshow test (a P value of >0.05 suggested no departure from perfect fit). The variance inflation factors (VIFs) quantified the severity of multicollinearity (a VIF of >10 indicated multicollinearity among the features in the model). Feature importance was calculated with the "varImp" function in R. The performance of the nomogram was tested in the validation cohort, with total points for each patient calculated. Decision curve analysis (DCA) (25) was performed by evaluating the clinical net benefit of the nomogram and the EHR-only model across the overall data sets. Clinical value was assessed by comparing the nomogram and the EHR-only model using 500 bootstrap resamples. The nomogram was implemented as a Web-based app using R Shiny.
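The net benefit evaluated in decision curve analysis can be illustrated with the standard formula, net benefit = TP/n - (FP/n) × pt/(1 - pt), where pt is the threshold probability. The Python sketch below is a minimal illustration with hypothetical predictions; it is not the study's R code, it omits the bootstrap comparison, and the threshold must be strictly less than 1 to avoid division by zero.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of treating patients with predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient regardless of the model."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical predictions for illustration only (not study data).
probs = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
nb_model = net_benefit(probs, labels, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
```

At a threshold of 0.5 the toy model reaches a net benefit of 0.5, compared with 0.2 for the treat-all strategy; a decision curve repeats this comparison over a range of thresholds.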
Statistical analysis. Categorical variables were analyzed by univariate analysis with a chi-square test, and continuous variables were analyzed using Mann-Whitney U tests or Student's t tests. All tests were two-sided, and P values of <0.05 were considered statistically significant. Models were constructed and validated by individuals who were blind to diagnostic categorizations.
Data availability. lncRNA microarray data have been deposited in the Gene Expression Omnibus under accession no. GSE119143. Sequencing data for the quantitative PCR (qPCR) products of three lncRNAs, the R code, and data for modeling are available at https://github.com/xuejiaohu123/TBdiagnosisModel.

RESULTS
Characteristics of prospectively enrolled participants. The demographic and clinical characteristics of suspected clinically diagnosed PTB participants in the selection and validation cohorts are provided in Table 1. Clinically diagnosed PTB patients were younger and had higher interferon gamma release assay (IGRA) positivity rates than did the non-TB DCs (P value of <0.0001 for both the selection and validation cohorts), but these groups did not differ by gender, body mass index (BMI), or smoking status. Healthy subjects were matched with PTB patients by age, gender, and BMI but had significantly different blood test results from those of the PTB patients (Table 1).
Clinically diagnosed PTB patients accounted for 29.82% (238/798) of all PTB patients (238 clinically diagnosed PTB cases and 560 microbiologically confirmed PTB cases [see Appendix S1 in the supplemental material]). This rate is markedly lower than a nationwide estimate of 68% based on primary public health institutions (1) but represents the clinically diagnosed PTB rate in a referral hospital with experienced specialists.
lncRNA microarray profiles and candidate selection. In the screening step, microarray profiling identified a total of 325 lncRNAs that were differentially expressed (287 upregulated and 38 downregulated) in the clinically diagnosed PTB patients versus healthy subjects. Hierarchical clustering and a volcano plot revealed clearly distinguishable lncRNA expression profiles (Fig. S2). The top five lncRNA candidates were chosen based on a set of combined criteria, including a fold change of >2 between groups, a P value of <0.05, a signal intensity of >25 (26), and unreported lncRNAs in the TB literature (27). Three of these five lncRNAs were upregulated (n335265, ENST00000518552, and TCONS_00013664) and two were downregulated (n333737 and ENST00000497872) in PTB versus control subjects (Table S3).
Table 1 footnotes. Radiological pathology refers to abnormal chest imaging results, including at least one of the following signs: polymorphic abnormality, calcification, cavity, bronchus sign, and pleural effusion. Abbreviations: Alb, albumin; IQR, interquartile range; P1, P value for the comparison of clinically diagnosed PTB cases and non-TB DCs (nontuberculosis disease control patients) in the selection cohort; P2, P value for the comparison of clinically diagnosed PTB patients and healthy subjects (HSs) in the selection cohort; P3, P value for the comparison of clinically diagnosed PTB cases and non-TB DCs in the validation cohort.
Differentially expressed lncRNAs in clinically diagnosed PTB cases. The expression levels of these five candidate lncRNAs were measured by qRT-PCR in the selection cohort, which consisted of 141 clinically diagnosed PTB cases, 159 non-TB DCs, and 578 healthy subjects. Two lncRNAs (ENST00000518552 and TCONS_00013664) were excluded from further analysis due to their low expression levels (Cq of >35) in this cohort. Of the three remaining lncRNAs, ENST00000497872 and n333737 were downregulated and n335265 was upregulated in PTB patients versus healthy subjects (Fig. S3). Comparison of clinically diagnosed PTB cases and non-TB DC patients revealed decreased expression levels of ENST00000497872 and n333737 in PTB patients (Fig. S3) (age-adjusted P values of <0.0001 for both).
Short-term stability, an essential prerequisite of a potential lncRNA biomarker, was assessed in PBMC samples. This study found that incubation for up to 24 h had a minimal effect on the expression of ENST00000497872, n333737, and n335265 (Table S4), in accordance with a previous report of lncRNA stability in blood (28).
Diagnostic modeling and nomogram visualization. Fourteen features for eligible suspected patients were used for modeling, including 11 EHR features and 3 lncRNAs. Three logistic regression models, EHR+lncRNA, EHR only, and lncRNA only, were evaluated as part of the training step in the selection cohort (Appendix S4). The VIF between the features ranged from 1.02 to 1.29, indicating no collinearity within models. The EHR+lncRNA model included six EHR features and three lncRNAs. The EHR+lncRNA model yielded the highest AUC (0.92) for distinguishing clinically diagnosed PTB from suspected PTB patients, compared to AUCs of 0.87 and 0.82 for the EHR-only and lncRNA-only models, respectively (Fig. 2A). The EHR+lncRNA model also had the best performance in sensitivity, specificity, accuracy, positive predictive value, and negative predictive value (Table 2). The optimal EHR+lncRNA model with nine features was displayed as a nomogram (Fig. 3A), and the top five features of the nomogram were ENST00000497872, age, n333737, calcification detected by computed tomography (CT calcification), and TB-IGRA results (Table S5). The sensitivity and specificity of the nomogram for the prediction of clinically diagnosed PTB were 0.89 (95% CI, 0.82 to 0.93) and 0.80 (95% CI, 0.73 to 0.85), respectively, at a cutoff of 0.37 (Table 2). A calibration curve in the selection cohort (Fig. 3B) indicated good agreement between the nomogram prediction and actual PTB cases, which was confirmed by a nonsignificant Hosmer-Lemeshow test (P value of 0.957). This nine-feature nomogram was generated as a free online app (available at https://xuejiao.shinyapps.io/shiny/) to facilitate its access for other studies. This app allows the user to insert the values of specific predictors and provides the risk prediction as a whole-number percentage.
Validation for lncRNAs and the nomogram. In the validation step, the three candidate lncRNAs were analyzed in 97 clinically diagnosed PTB cases, 140 non-TB DCs, and 245 healthy subjects. This analysis showed an lncRNA expression pattern similar to that observed in the selection cohort (Fig. S3). All three models were applied to clinically diagnosed PTB patients and non-TB DCs of the validation cohort, and as reported in Table 2 and Fig. 2, it was found that the nomogram achieved superior discrimination (AUC, 0.89 [range, 0.84 to 0.93]) and good calibration (Fig. 3B) (P value of 0.668 by the Hosmer-Lemeshow test) for clinically diagnosed PTB prediction. The sensitivity and specificity of the nomogram at a cutoff of 0.37 in the validation cohort were 0.86 (range, 0.77 to 0.90) and 0.82 (range, 0.75 to 0.87), respectively. DCA indicated that the nomogram outperformed the conventional EHR-only model, with a higher clinical net benefit within a threshold probability range from 0.2 to 1 (Fig. 3C).
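How a logistic model of this kind turns predictor values into a whole-number risk percentage and a classification at the 0.37 cutoff can be sketched as follows. The intercept, coefficients, and feature names in this Python example are entirely hypothetical placeholders and are NOT the published model; only the logistic transformation and the 0.37 cutoff correspond to the description above.

```python
import math

# Hypothetical placeholder parameters, NOT the coefficients of the study's model.
INTERCEPT = -1.0
COEFS = {"feature_a": 1.2, "feature_b": -0.8}

def predict_risk(values, cutoff=0.37):
    """Return (risk as a whole-number percentage, classified-as-PTB flag)."""
    linear = INTERCEPT + sum(COEFS[name] * values[name] for name in COEFS)
    prob = 1.0 / (1.0 + math.exp(-linear))  # logistic transformation
    return round(prob * 100), prob >= cutoff

# Two hypothetical patients for illustration only.
high = predict_risk({"feature_a": 1.0, "feature_b": 0.0})
low = predict_risk({"feature_a": 0.0, "feature_b": 1.0})
```

With these placeholder parameters the first patient has a linear predictor of 0.2 (risk of about 55%, above the cutoff) and the second a linear predictor of -1.8 (risk of about 14%, below the cutoff).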
We also validated the nomogram in microbiologically confirmed PTB and smear-negative PTB patients. A total of 392 microbiologically confirmed PTB patients were enrolled in the validation cohort, and 48.47% of these confirmed PTB patients were smear-negative PTB cases (n = 190). The nomogram had good discriminative power for both microbiologically confirmed PTB (AUC of 0.90) and smear-negative PTB (AUC of 0.91) patients, similar to that observed for the prediction of clinically diagnosed PTB patients (Table 2, Fig. S5 and S6, and Table S6).
lncRNA response to anti-TB treatment. lncRNAs were next analyzed for the ability to predict anti-TB treatment response. Paired samples were collected from 22 clinically diagnosed PTB patients before and after 2 months of intensive therapy (29), and the expression levels of ENST00000497872, n333737, and n335265 were measured by qRT-PCR. All these patients had a good response to therapy based on the clinical and radiological findings, and ENST00000497872 and n333737 levels were significantly increased posttreatment (P values of 0.005 and 0.0005, respectively) (Fig. 4), suggesting that lncRNA expression increased in response to therapy.

DISCUSSION
The present work focused on the challenge of accurately diagnosing PTB patients without microbiological evidence of M. tuberculosis infection. To our knowledge, few studies have examined the epidemiology of, or diagnostic models for, this subtype of PTB. We developed and validated a novel nomogram incorporating lncRNA signatures and conventional EHR features, which can effectively discriminate clinically diagnosed PTB patients from patients with suspected PTB.
In addition to the three lncRNAs, we identified six EHR predictors (age, CT calcification, positive TB-IGRA, low-grade fever, elevated hemoglobin, and weight loss) that were essential in TB case finding, as proposed by previous studies (15,16). Age was an important negative predictor for clinically diagnosed PTB, which appears to conflict with the consensus that advanced age correlates with higher TB susceptibility (31). This may be explained by differences in the enrollment of the PTB patients and control subjects. Previous studies included healthy and/or vulnerable subjects as controls, while we enrolled inpatients with a wide range of pulmonary diseases and of older ages as disease controls.
This study serves as a first proof-of-concept study to show that integrating lncRNA signatures and EHR data could be a more promising diagnostic approach for PTB patients with negative microbiological evidence of TB. The EHR+lncRNA model had good discrimination (through AUC and diagnostic parameters), reliable calibration (via a calibration curve and a Hosmer-Lemeshow test), and potential clinical utility for decision-making (using DCA). Compared with the EHR-only model, the EHR+lncRNA model showed a similar sensitivity and a significantly higher specificity in both clinically diagnosed PTB and microbiologically confirmed PTB patients; it may therefore perform better as a "rule-in" test (32) and offer clinicians confidence in a TB diagnosis and anti-TB treatment plan. In addition, the EHR+lncRNA model avoided some common problems associated with sputum-based features, such as poor sputum quality or problematic sampling (33), which improves its reliability and clinical utility.
Nomograms have been shown to remarkably promote the early diagnosis of intestinal tuberculosis (24) and prognosis prediction in PTB (34) and TBM (35). The EHR+lncRNA model here was visualized as a nomogram and further implemented in an app. The online nomogram uses readily obtainable predictors and automatically outputs a personalized quantitative risk estimate for PTB. The utilization of this user-friendly tool may speed up confirmation of a TB diagnosis, especially in resource-constrained areas with a high TB prevalence.
Our study has several limitations. Modeling in this study was conducted based on data from a single large hospital, and multicenter validation studies are needed. Furthermore, this nomogram relies on tests, including lncRNA detection and TB-IGRA, that may not be available in most community hospitals; however, the TB-IGRA and the lncRNA assay are both blood tests and can therefore be sent to a centralized facility for testing, reducing the need for specialized laboratory testing in community hospitals.
In summary, the novel nomogram developed and validated in this study, which incorporates three lncRNAs and six EHR fields, may be a useful predictive tool for identifying PTB patients with negative microbiological evidence of TB and merits further investigation.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 0.8 MB. SUPPLEMENTAL FILE 2, PDF file, 4.2 MB. SUPPLEMENTAL FILE 3, PDF file, 0.5 MB.