Genome-wide transcriptional profiling of pulmonary functional sequelae in ARDS- secondary to SARS-CoV-2 infection

Background Up to 80% of patients surviving acute respiratory distress syndrome (ARDS) secondary to SARS-CoV-2 infection present persistent anomalies in pulmonary function after hospital discharge. There is a limited understanding of the mechanistic pathways linked to post-acute pulmonary sequelae. Aim To identify the molecular underpinnings associated with severe lung diffusion involvement in survivors of SARS-CoV-2-induced ARDS. Methods Survivors attended to a complete pulmonary evaluation 3 months after hospital discharge. RNA sequencing (RNA-seq) was performed using Illumina technology in whole-blood samples from 50 patients with moderate to severe diffusion impairment (DLCO<60%) and age- and sex-matched individuals with mild-normal lung function (DLCO≥60%). A transcriptomic signature for optimal classification was constructed using random forest. Transcriptomic data were analyzed for biological pathway enrichment, cellular deconvolution, cell/tissue-specific gene expression and candidate drugs. Results RNA-seq identified 1357 differentially expressed transcripts. A model composed of 14 mRNAs allowed the optimal discrimination of survivors with severe diffusion impairment (AUC=0.979). Hallmarks of lung sequelae involved cell death signaling, cytoskeleton reorganization, cell growth and differentiation and the immune response. Resting natural killer (NK) cells were the most important immune cell subtype for the prediction of severe diffusion impairment. Components of the signature correlated with neutrophil, lymphocyte and monocyte counts. A variable expression profile of the transcripts was observed in lung cell subtypes and bodily tissues. One upregulated gene, TUBB4A, constitutes a target for FDA-approved drugs. Conclusions This work defines the transcriptional programme associated with post-acute pulmonary sequelae and provides novel insights for targeted interventions and biomarker development.


Introduction
Genome-wide transcriptional profiling constitutes a powerful tool to analyze gene expression with a huge potential for identifying novel drug targets, improving current therapies and discovering biomarkers [1,2]. In particular, whole-blood transcriptomic analysis has gained importance in clinical research [3]. It is performed in highly accessible and homogeneous samples, avoiding the bias caused by tissue sampling, provides a comprehensive picture of the host response to pathological insults and offers information on the transcriptional programs activated in bodily tissues [4,5]. Currently, RNA sequencing (RNA-seq) is considered the gold standard among transcriptome profiling tools [6]. Indeed, the implementation of RNA-seq allows to decipher pathological processes underlying multiple conditions, including respiratory disease [7].
Acute respiratory distress syndrome (ARDS) is a life-threatening respiratory complication characterized by an uncontrolled inflammatory process that leads to the damage of the alveolar endothelialepithelial barrier and, consequently, to the accumulation of proteinrich edema and the migration of immune cells [8,9]. Although the mortality rate has significantly decreased in the last decade, a great percentage of survivors of ARDS are affected by abnormalities in pulmonary structure and function [10]. Reduction of lung diffusion capacity for carbon monoxide (D LCO ) is the most common alteration, being described in 33-82% of patients several months after ARDS resolution [11,12]. In survivors of ARDS secondary to SARS-CoV-2 infection, our group reported an abnormal D LCO in up to 80% of the patients at a 3-month follow-up [13]. These results have been corroborated in different cohorts [14,15].
Gaining a better understanding of the molecular underpinnings that mediate persistent lung dysfunction could aid the development of therapeutics and biomarkers for post-COVID sequelae [16]. In this context, whole-blood transcriptomic analysis will reveal unexpected or unknown mediators in the pathogenesis of post-acute lung sequelae. To address this, we used unbiased next-generation sequencing to investigate the transcriptional programs associated with severe pulmonary diffusion abnormalities in survivors of SARS-CoV-2-induced ARDS.

Ethics statements
The study was approved by the medical ethics committee of Hospital Universitari Arnau de Vilanova (registry number CEIC/2273). The patients received written information about the nature of the study and provided informed consent. The protocol was performed in full compliance with the Declaration of Helsinki. Demographic, clinical, pharmacological and laboratory data were extracted from the electronic medical records and introduced into a REDCap database with restricted access.

Study population
This is a substudy of the ongoing multicenter study CIBER-ESUCICOVID, registered at www.clinicaltrials.gov with the identification NCT04457505 [17]. Patients with severe COVID-19 admitted to the Hospital Universitari Arnau de Vilanova-Santa María (Lleida, Spain) were enrolled if they fulfilled the following criteria: aged over 18, developed ARDS according to the Berlin definition during the hospital stay [18] and attended a "Post-COVID" evaluation 3 months after hospital discharge (between June and December 2020, median [P 25 ; P 75 ] of 92.5 [83.2; 106] days) (Supplemental Fig. S1). The standard pulmonary evaluation was performed as previously described by our group (González et al., 2021, Chest). In brief, D LCO was evaluated using a flow spirometer (MasterScreen, Jaeger, Germany) in agreement with the guidelines of the American Thoracic Society (Celli et al., 2004, Eur. Respir. J.). D LCO was expressed as a percentage of the predicted D LCO value according to the European Community Lung Health Survey (Roca et al., 1998, Eur. Respir. J.). Based on the ATS/ERS task force: standardization of lung function testing [19], patients were classified into (i) patients with normal or mild diffusion impairment (D LCO ≥60%) and (ii) patients with moderate to severe diffusion impairment (D LCO <60%). Patients with incomplete pulmonary function evaluation were excluded from further analysis. Among 100 survivors with complete pulmonary function evaluation and blood samples available, a subpopulation of 50 patients was selected for the genome-wide transcriptional profiling study. To minimize potential confounding, we restricted our analysis to survivors with D LCO < 60% and D LCO ≥ 60% (ratio 1:2) who matched according to age and sex. Matching was performed using the nearest neighbor method, which is based on measuring the distances between the propensity score of each sample and its pair [20].

Sample collection
Samples were processed in standardized conditions with support by IRBLleida Biobank (B.0000682) and "Plataforma Biobancos PT20/ 00021". Venous blood samples were collected in Tempus TM blood RNA tubes (Thermo Fisher Scientific, MA, USA) by venipuncture after a night of fasting and before beginning pulmonary evaluation, according to the manufacturer's instructions. Blood samples were stored at − 80 • C for future analysis.

Next-generation sequencing and transcriptomic analysis
RNA purification and sequencing were performed by experienced researchers blinded to patient data using standardized and automated protocols. Total RNA was extracted from 200 μL of preprocessed whole blood with the Maxwell® RSC simply RNA Blood Kit (Promega, WI, USA) in combination with the Maxwell® RSC Instruments in agreement with the manufacturer's instructions. The nucleic acid content was quantified with a NanoDrop™ 2000/2000c (Thermo Fisher Scientific, MA, USA) by independent staff. Transcriptome next-generation sequencing was performed following Illumina's protocols (San Diego, CA, USA), as previously detailed [21,22]. Briefly, the Fragment Analyzer system (Agilent, CA, USA) was used to assess the quality and integrity of total RNA. RNA was isolated from 1 μg of total RNA using Illumina´s Ribo-zero plus rRNA depletion kit Transcriptomic analyses were performed with paired-end raw read data obtained from Illumina sequencing workflows. Raw sequence data in fastq format were processed using a sequential workflow: (i) data were cleaned of adapters and low-quality sequences using TrimGalore v0.6.6 (https://www.bioinformatics.babraham.ac.uk/projects/trim_ga lore/ accessed date April 1st, 2021); (ii) reads were mapped to the reference genome (GRCh38.p12) by the aligner STAR (https://github. com/alexdobin/STAR); (iii) mapped reads were counted using Featur-eCounts (http://subread.sourceforge.net/ accessed date April 1st, 2021); (iv) global distance analysis and normalization of reads were performed through DESeq2 [23]; (v) differential expression analysis were performed with the R package [24]. Zeros were excluded when less than 25% of samples had at least one count.

Bioinformatic analyses
The Reactome database was used for the analysis of biological processes/molecular pathways [25]. Significant enrichment was considered if the p-value< 0.05. Cellular deconvolution was performed using the CIBERSORTx algorithm [26]. Raw gene counts were normalized by DESeq2 and processed with the CIBERSORTx gene signature LM22 and 100 permutations. Additionally, a selection process based on random forest [27] was used to assess the variable importance of immune cell fractions. Lung cell and tissue expression was obtained using data from the Genotype-Tissue Expression (GTEx) Portal (https://www.gtexportal. org/home/). The Drug-Gene Interaction database (DGIdb v4.2.0) was used to predict interactions between FDA-approved drugs and genes [28].

Statistical analysis, data mining and feature selection
Descriptive statistics were used to summarize the characteristics of the study population. Data are presented as the median [25th; 75th percentile] or mean (SD) for continuous variables and as frequencies (percentage) for categorical variables. The normality of the data was analyzed using the Kolmogorov-Smirnov test. Continuous variables were compared using the Mann-Whitney U test or Student's t test, and categorical variables were compared using Fisher's exact test. Spearman's rho coefficient was used to assess the correlation between continuous variables. Correlations between continuous and dichotomous variables were calculated using point-biserial correlation. All statistical analyses were performed using R software, version 4.1.2.
The classification model (boosted random forest) associated with severe diffusion impairment (D LCO <60%) was constructed using opensource data mining tools (available from Weka Machine Learning 3, https://www.cs.waikato.ac.nz/ml/weka accessed date April 1st 2021) [29]. First, an attribute evaluation algorithm was used to reduce the gene sets used for classification modeling. That process was performed in two steps: a model entropy minimization score for each attribute and a correlation-based feature selection algorithm. Filtering of useful probes from transcriptomics data was performed by using the 'Info-GainAttributeEval' feature selection method (attribute evaluator based on the reduction of entropy) followed by the "Ranker" search method to sort the results, from the WEKA suite. This process reduced the set of useful probes for classification 100-fold. The resulting set of probes were further filtered: i) to exclude transcripts expressed less than 5 counts; and ii) transcripts not associated with known genes, in order to simplify the model and to increase the reproducibility of the results in other cohorts. The resulting set fulfills the condition of being the minimal attribute set without compromising model classification performance. Model classification performance was evaluated using the area under the curve (AUC) of the receiver operating characteristic (ROC) to measure model classification performance. The AUC was also used as a performance parameter in the data mining step. The classification model was independently tested using 10-fold cross-validation.

Clinical characteristics of the study population
The most relevant demographic and clinical characteristics of the study population are displayed in Table 1. The mean (SD) age was 57.7 (11.2), and females represented 30% of the population. No differences were observed in sociodemographic characteristics or previous comorbidities, including smoking history or previous pulmonary disease. Except for the duration of prone positioning, similar findings were observed for clinical and ventilatory parameters recorded during the hospital stay. The study population showed a mean (SD) D LCO of 63.9 (14.8). Thirty-six percent of patients showed moderate-severe diffusion impairment (D LCO <60%) at 3 months after discharge. The comparison between those patients included in the study and the whole population is shown in Supplemental Table S1. The study population is a representative sample.

Whole-blood transcriptome profile associated with severe lung diffusion impairment
Overall, 1357 significant differentially expressed genes were found between the study groups ( Fig. 1 A & Supplemental Table S2). To explore the biological relevance of the unfractionated blood transcriptome, the Weka suite was used to implement a multivariable classification model. A model including the 14 transcripts that provided the best discriminative power was constructed using the random forest classifier (Fig. 1B). As shown in the heatmap and PCA (Fig. 1C & D), the classification model was able to segregate patients based on the severity of lung diffusion impairment. The model showed optimal discrimination (AUC=0.979) (Fig. 1E). Among the selected genes, BAG2, ETV7, SER-PINB6, TMEM161B and TUBB4A were upregulated in patients with the most severe pulmonary outcomes (Fig. 1F). Decreased expression of BCL2L11, GAS2, GRK2, PPP1R10 and TBC1D10B was observed in patients with severe diffusion impairment (Fig. 1F). Specific components of the classification model correlated (rho>0.3) with indicators of disease severity and/or patient management during the acute phase, i.e., use and duration of prone positioning, ICU stay, invasive mechanical ventilation (IMV) duration and non-IMV duration (Supplemental Fig. S2).

Bioinformatic analyses
The blood transcriptomic classification model was then considered for pathway analysis using the Reactome database ( Fig. 2A). The analysis identified a number of processes implicated in cell death/apoptosis: RUNX3 regulates BCL2L11 (BIM) transcription, FOXO-mediated transcription of cell death genes, Defective intrinsic pathway for apoptosis, Diseases of programmed cell death, BH3-only proteins associate with and inactivate anti-apoptotic BCL-2 members, Caspase-mediated cleavage of cytoskeletal proteins, Apoptosis and programmed cell death. A relevant proportion of the processes were also associated with pathways implicated in cytoskeleton processing: Activation of BIM and translocation to the mitochondria, Caspase-mediated cleavage of cytoskeletal proteins, Activation of SMO, Microtubule-dependent trafficking of connexons from the Golgi to the plasma membrane, Transport of connexons to the plasma membrane and Membrane trafficking. In addition, pathways related to cell differentiation, proliferation and survival (FLT3 signaling, FOXO-mediated transcription and Signaling by Hedgehog) and inflammation and regulation of immune function (Transcriptional regulation by RUNX3) were identified.
Next, using the transcriptome data, we estimated the immune cell subtypes and their proportions in whole blood by CIBERSORTx (Fig. 2B & Supplemental Table S3). No significant differences were observed between the study groups. The random forest analysis found that resting natural killer (NK) cells were the most important fraction for the   prediction of severe diffusion impairment (Fig. 2C). To further explore the association between the transcriptomic signature and immune cell composition, we assessed the correlation between the expression levels of the transcripts in our signature and analytical leukocyte counts (Fig. 2D). Neutrophil counts were directly correlated with BCL2L11, PPP1R10, GRK2 and TBC1D10B and inversely correlated with FCMR, BAG2, SYPL1 and TUBB4A. In contrast, lymphocyte counts were directly correlated with TMEM161B, ENPP5 and SERPINB6 and inversely correlated with PPP1R10, GRK2 and TBC1D10B. A positive correlation between monocyte counts and SERPINB6 was also observed. It has been proposed that the peripheral blood transcriptome shares more than 80% of gene expression with different tissues, including the lung [30]. Using available data from the GTEx project, we analyzed the expression of the genes that composed the model in the different tissues, in addition to lung cells (Fig. 2E & F). The results showed a variable expression pattern in tissues and cells, including epithelial, immune and endothelial lung cells.
A protein-drug interaction analysis was performed using the five upregulated mRNAs. We found six FDA-approved drugs that targeted TUBB4A (Supplemental Fig. S3). No interactions were observed for other candidates.

Discussion
Here, we use whole blood to comprehensively explore the transcriptomic landscape of post-acute pulmonary sequelae in survivors of COVID-19-driven ARDS. Our analysis revealed a distinct transcriptional program in patients with moderate to severe diffusion impairment 3 months after hospital discharge. Using advanced statistical methods, we constructed a classification model comprised of 14 transcripts that accurately identifies survivors with severe alterations in pulmonary function 3 months after hospital discharge. After implementing integrative pathway enrichment, we identified multiple biological mechanisms implicated in cell death but also in cytoskeleton reorganization, cell survival, growth and differentiation and the regulation of the immune response. To the best of our knowledge, our study is the first to apply blood RNA-seq technology to survivors of ARDS secondary to SARS-CoV-2 infection.
Blood cells play a major role in lung recovery after damage due to infections/pathologic insults [8]. Therefore, an altered immune cell function in survivors of SARS-CoV-2-induced ARDS with severe pulmonary damage is expected. The impaired elimination of harmful/damaged cells in this group of patients should also be considered. Supporting these hypotheses, our results suggest a profound deregulation of the immune cell phenotype in survivors with moderate to severe abnormalities in lung diffusion capacity. Indeed, severe post-acute sequelae were associated with a complex alteration of cell death signaling, a key mechanism in immune response resolution. In consonance with recently published data from our group [31], enrichment analysis revealed an overrepresentation of apoptotic pathways, including processes regulated and/or mediated by RUNX3 and FOXO and the downstream BIM apoptotic signaling pathway. The individual genes have been described in processes related to immune cell survival and depletion. BCL2L11 or its coding protein BIM, which is downregulated in survivors with moderate to severe diffusion impairment, is an inducer of apoptosis in activated immune cells [32]. Through the activation of proapoptotic Bax/Bak, BIM is critical for the control of immune system homeostasis, i.e., termination of the immune response and prevention of neoplastic and autoimmune phenomena [33]. In line with this finding, FCMR is an inhibitor of the apoptosis regulator Fas that, together with BIM, participates in the shutdown of chronic immune responses and prevention of autoimmunity [34]. BIM has also been described as a lifespan-limiting molecule in immune cells [35]. Cytosolic serpins, such as SERPINB6, in addition to preventing an excessive inflammatory response against microbial agents, contribute to myeloid cell survival by modulating programmed necrosis [36]. The upregulation of BAG2 is compatible with an inhibition of endoplasmic reticulum stress-induced cell apoptosis in macrophages [37] and a promotion in the survival of infected macrophages by inactivating the p53-ROS pathway [38]. Furthermore, GAS2 is a dual regulator of damage-induced apoptosis in several cell types with a variety of mechanisms of action [39].
Although features of anti-apoptotic signaling seem to be predominant, the impact of the transcriptomic profile on the balance between pro-and anti-apoptotic mechanisms remains to be determined. For example, except for resting NK cells, no clear associations were observed when the fractions of innate and adaptive immunity were inferred. The leukocyte counts support these findings. Nevertheless, the results suggest a close association between moderate to severe lung diffusion impairment and the alteration in immune cell death programs. This may impact the termination of the immune response but may also be an indicator of senescence [40]. Since previous results suggest that immunosenescence could negatively affect tissue healing [41], our results warrant further research. Specific targeting of immunosenescence-related genes through senolytic drugs could be an interesting approach to therapeutic strategies to re-establish homeostasis and facilitate the resolution of damage [42].
Both previous data on individual genes and current analysis suggest a significant enrichment in processes implicated in cytoskeletal architecture. Pathways related to the maintenance and integrity of the cytoskeleton, microtubule-dependent trafficking, activation of transmembrane proteins and transport of cell components were identified. TUBB4A is part of the tubulin family, which constitutes the major constituent of microtubules. This gene, upregulated in moderate to severe diffusion impairment and inversely correlated with neutrophil counts, has been previously described as a critical element for the stabilization of microtubule polymerization and cell migration [43]. It is worth noting that the drug-gene interaction analysis also identified several FDA-approved drugs that can target TUBB4A. GAS2 and GRK2 are important modulators of cytoskeletal remodeling. Both mediators induce a modification of microfilaments structure and subsequently cell membrane ruffling and contraction which, supporting the findings discussed above, are normally coupled to an apoptotic cellular phenotype [44,45]. In this sense, GAS2 also regulates the interaction between microtubules and actin, which is essential for cell morphology and cytoskeleton reorganization [39,46]. Of note, GRK2 antagonists have been proposed as a potential immunosuppressant strategy since this gene regulates immune cell migration from blood into tissue [47]. Likewise, TBC1D10C is an important modulator of cytoskeletal dynamics in peripheral macrophages and, consequently, might be fundamental for their phagocytosis and spreading capacity [48]. The expression profile could be linked to the infiltration of immune cells from the blood into the lung, which may be relevant for tissue healing [49,50]. Furthermore, the analysis identified relevant genes/pathways implicated in cell survival, growth and differentiation, such as FLT3, FOXO and Hedgehog signaling. In fact, the downregulation of PPP1R10 and SNRPG has been previously linked to reduced cellular proliferation and cell cycle arrest, respectively [51,52]. These results are in accordance with previous data that link proteomic mediators of cell proliferation and differentiation with pulmonary sequelae in severe COVID-19 [16]. Several genes in the classification model, such as TMEM161B, ENPP5, and BAG2, mediate a wide range of cellular processes, including protein synthesis. In particular, BAG2 is a co-chaperone that promotes Hsp70 chaperone activity and inhibits the degradation of misfolded proteins mediated by the ubiquitin-proteasome system and MAPK [53]. Interestingly, ETV7, which is upregulated in survivors with lung diffusion impairment, has been previously described as a negative regulator of the type I IFN response and therefore as a therapeutic target to improve innate antiviral responses and enhance IFN-based antiviral therapies [54]. Whether ETV7 expression could be a possible indicator of the long-lasting effect of the infection beyond hospitalization deserves further analysis. Overall, functional studies are needed to explore the association between these mechanistic pathways and post-acute pulmonary sequelae in survivors of SARS-CoV-2-driven ARDS.
COVID-19 still constitutes an important public health challenge due to the significant burden of persistent post-acute sequelae. Survivors of severe clinical courses of the disease suffer a huge decrease in healthrelated quality of life and organic dysfunction. The use of a significant amount of health care resources should also be highlighted. In this scenario, it is of vital importance to develop tools aimed at assisting in the follow-up of survivors of ARDS derived from COVID-19. Delineating the transcriptional programs of blood cells provides the opportunity to identify novel tools for clinical decision-making. We have shown that survivors with severe pulmonary abnormalities can be distinguished at the transcriptomic level using RNA-seq technology. We present a 14gene signature that precisely classifies patients according to their pulmonary outcomes 3 months after hospital discharge. Since RNA-based biomarker in vitro assays are commercially available [55], our results are clinically relevant. Profiling blood transcript abundance on a genome-wide scale emerges as a promising tool for the development of useful biomarkers for the management of "post-COVID" patients.

Strengths and limitations
The strengths of the study reside in the use of next-generation sequencing, without a priori bias, in a relevant number of samples (n = 50), which is far larger than similar studies, the use of advanced statistical approaches and the rigorous quality control of clinical data. Nevertheless, our results should be considered in the context of several limitations. First, this was a single-center study. The results should be validated in independent cohorts (beyond the scope of current manuscript). Second, we considered patients with at least moderate to severe diffusion impairment attending to the great prevalence of pulmonary sequelae. Third, potential confounding due to previous comorbidities, clinical management and inpatient complications should not be discarded. Although survivors were matched for age and sex, it is possible that the transcriptomic programs observed reflect unrecognized factors. Fourth, the lack of a non-COVID-19 arm limits the generalization of the findings in the context of ARDS. Fifth, we could not infer whether gene expression disturbances in the blood are due to specific tissue damage or repair mechanisms. The conclusions and relationship derived from our analyses show associations. Additional studies based on in vitro and in vivo models are fundamental to analyze causality. Finally, the identification of the molecular mechanisms associated with post-acute pulmonary sequelae in epithelial, immune and endothelial lung cells would provide valuable information for the development of therapeutic approaches. Although blood cells have been proposed as sentinels reflecting physiological and pathological events occurring in different tissues, whether whole blood can be used as surrogate biopsy material to analyze pathophysiological processes in the local lung environment should be further explored.

Conclusions
In summary, post-infection pulmonary dysfunction is associated with a distinct whole-blood transcriptomic pattern in survivors of SARS-CoV-2-driven ARDS after hospital discharge. The transcriptional programs identified provide a valuable opportunity to garner insights into the mechanisms that mediate pulmonary damage and recovery and thus to develop targeted therapeutic interventions and tools for clinical decision-making.

Funding
MCGH is the recipient of a predoctoral fellowship from the University of Lleida. MM is the recipient of a predoctoral fellowship (PFIS: FI21/00187) from Instituto de Salud Carlos III. AC is supported by Instituto de Salud Carlos III (Sara Borrell 2021: CD21/00087). DdGC has received financial support from Instituto de Salud Carlos III (Miguel Servet 2020: CP20/00041), co-funded by the European Social Fund (ESF) "Investing in your future". IML is supported by a Miguel Servet contract (CPII20/00029) from the Instituto de Salud Carlos III, cofunded by the European Social Fund (ESF) "Investing in your future". CIBERES is an initiative of the Instituto de Salud Carlos III. This work is supported by the Instituto de Salud Carlos III (COV20/00110), cofunded by the European Regional Development Fund (ERDF) "A way to make Europe". Supported by: Programa de donaciones "estar preparados"; UNESPA (Madrid, Spain) and Fundación Francisco Soria Melguizo (Madrid, Spain). Funded by: La Fundació La Marató de TV3, project with code 202108-30/− 31. COVIDPONENT is funded by the Institut Català de la Salut and Gestió de Serveis Sanitaris. This research was funded in part by a grant (PI19/01805) from the Instituto de Salud Carlos III, co-funded by the European Regional Development Fund (ERDF) "A way to build Europe" and by the Fundación Rioja Salud.