Causal Path of COPD Progression-Associated Genes in Different Biological Samples

Abstract Chronic obstructive pulmonary disease (COPD) is a progressive inflammatory disease with pulmonary and extra-pulmonary complications. Due to the disease’s systemic nature, many investigations investigated the genetic alterations in various biological samples. We aimed to infer causal genes in COPD’s pathogenesis in different biological samples using elastic-net logistic regression and the Structural Equation Model. Samples of small airway epithelial cells, bronchoalveolar lavage macrophages, lung tissue biopsy, sputum, and blood samples were selected (135, 70, 235, 143, and 226 samples, respectively). Elastic-net Logistic Regression analysis was implemented to identify the most important genes involved in COPD progression. Thirty-three candidate genes were identified as essential factors in the pathogenesis of COPD and regulation of lung function. Recognized candidate genes in small airway epithelial (SAE) cells have the highest area under the ROC curve (AUC = 97%, SD = 3.9%). Our analysis indicates that macrophages and epithelial cells are more influential in COPD progression at the transcriptome level.


Introduction
Chronic obstructive pulmonary disease (COPD) is a progressive inflammatory disease characterized by airway obstruction that is among the first three causes of death worldwide [1,2]. Clinical presentations of COPD include emphysema, small airway obstructions, and chronic bronchitis.
Although the precise underlying processes of COPD pathogenesis are still unknown, the range of long-term exposure to many types of xenobiotics, such as smoking, occupational hazardous chemichals, and hereditary background, is stated [3], with no effective therapy or medicine [4,5]. Cigarette smoking is the most significant risk factor for the development of COPD [3,6]. Nonetheless, more than 20% of COPD patients are not smokers or ex-smokers [7,8]. Furthermore, additional research has shown that smokes emitted by solid fuels used for cooking or heating are not connected with COPD ethiology [9].
As a systemic multi-organ disease, multiple causes could alter the nature and consequences of COPD at the genomic level. Despite various transcriptomics studies on COPD, the critical role of hidden gene signatures is not clarified in different clinical samples [10][11][12][13]. It seems that a comprehensive analysis of heterogeneous biological samples is mandatory to find novel candidate genes in systemic pathogenesis and progression of COPD. The progression of COPD severity depends on the interaction of different cells present in the lung microenvironment. Various techniques have been utilized to interpret the disease complexity and pathogenesis elements. Many studies have been conducted based on top-down and/or bottom-up strategies to identify key players and regulators of the pathogenesis process, but nearly none investigated the level of direct/indirect impact of different cells on the staging and progression process of COPD. Specific statistical classification and prediction approaches can satisfy the need for high throughput high-dimensional transcriptome datasets [14]. Microarray analysis studies permit statistical and mathematical approaches to understand the association between thousands of genes in a disease's specific stages. One of the primary challenges in microarray analysis is identifying genes, or groups of genes, that are differentially expressed in a disease or at different stages. More recently, machine-based learning algorithms have increasingly gained attention in bioinformatics and biology research [15,16]. In contrast, (regularization) based regression models (e.g. elastic-net logistic regression) have been widely used in microarray analysis [17].
We hypothesized that the different biological samples are not equally involved in the progression of COPD at the transcriptomics level. Also, we questioned if it would be possible to investigate the direct/indirect interaction among them. Therefore, this project was designed to identify novel important genes within different biological samples to provide novel therapeutic targets in COPD. Here, we fitted an elastic-net logistic regression model on transcriptome datasets to overcome overfitting and multi-collinearity issues, as common problems arise in the high throughput data analysis while representing a sparse and interpretable model in high dimensional datasets [12,18]. Following the identification of novel effective genes, structural equation modeling (SEM) was utilized to assess simultaneous direct and indirect effects of candidate genes and biological samples on COPD progression.

Datasets and data processing
A systematic search was performed in the PubMed database with the search strategy of COPD OR chronic obstructive pulmonary disease) AND (epitheli* OR macrophage OR basophil OR biopsy OR sputum OR BAL OR Bronchoalveolar lavage OR blood OR PBMC OR Lung tissue) AND (microarray OR transcriptomics OR RNA-seq). Then, omics data repository databases were searched to enrich the archived studies. After removing the duplicates, the studies with more than fifty samples and different stages of COPD with publicized raw data were included. Finally, studies were selected based on mostly matching transcriptomics platforms to minimize the cross-platform biases. The demographic and historical data of the selected datasets are provided in supplementary file 1.
Raw transcriptomics data from small airway epithelial cells (SAE) (GSE20257) [19], alveolar macrophages (GSE13896) [20], lung tissue (GSE47460) [21][22][23], sputum (GSE22148) [24], and blood samples (GSE54837) [25] were downloaded from GEO database. Initially, raw expression data were combined by the R package of "merging", as the combined dataset for each of biological samples included healthy controls and the patients in the GOLD stages of COPD [1][2][3][4]. After normalization of the datasets, the batch effect was removed and statistical comparison was performed by "Affy", "SVA", and "Limma" R packages, respectively [26,27]. The false discovery rate was corrected using the Benjamini-Hochberg correction method. The cutoff of the adjusted p-value (<0.0001) or absolute fold-change >2 was applied for the selection of differentially expressed genes. Since there were no healthy controls in sputum and blood datasets, the average normalized/standardized expression value of healthy controls from other datasets was used. Then Sensitivity analysis was performed to assess the robustness of the results obtained by differential expression analysis.

Elastic-net penalized logistic regression
Elastic-net logistic regression as a regularization method can be used for gene selection [28]. This statistical learning method was introduced as a compromise between ridge and lasso penalties. Elastic-net penalty combines the strengths of both ridge and lasso [29]. Based on the simulation and experimental studies on the genomic selection using regularized linear regression models, the elastic net regularization was found to outperform ridge and extensions of LASSO in high dimensional data [12,30]. This penalty selects groups of correlated genes. Besides, it possesses optimized predictive performance compared with LASSO and ridge in the transcriptomics data [12]. Elastic-net logistic regression methods were performed by the "glmnet" R package [31]. Candidate genes were identified based on the below equation: β is a vector of gene expression parameters, and lambda is tuning parameters for LASSO and ridge penalties, respectively. Beta parameters were estimated by coordinate descent as an optimization algorithm [31]. The best estimates for lambda parameters were derived by K-fold cross-validation. Alpha as a hyper-parameter, controls the distribution between the LASSO and ridge penalties. In this study, alpha was considered as a fixed value of 0.5.
Also, the statistical comparison was performed by the "pROC" R package between areas under the Roc curves (AUCs) for identified candidate genes in all biological samples [32]. The identification of novel candidate genes was performed by the elastic-net logistic regression method based on the trend of gene expression data over gold stages of COPD, and then, enrichment was performed using the ClueGO plugin of Cytoscape software.

Structural equation modeling
Structural equation modeling (SEM) includes causal modeling, analysis of covariance structures, and latent variable models. This modeling has many advantages as a generalization of multivariate multiple regression compared with conventional regression. SEM can assess multiple regression equations simultaneously, allowing identifying the strength and sign of direct and indirect effects for complex causal diagrams [33,34].
Path standardized coefficients (β) were calculated as the effect size that is free of scale and is robust against heterogeneity in the integration of genomics datasets. Goodness of fit (GOF) indices (e.g. The Root Mean Square Error of Approximation (RMSEA)<0.08, Standardized Root Mean Square Residual (SRMR)<0.08) were applied for assessing the fitness of the model. SEM was performed using the "lavaan" R package [35]. Numerical estimation of path coefficients (β) was derived by an iterative maximum likelihood algorithm. Multiple regression equations in SEM are represented below: b blood samples ®°°°°°° Y in each equation is the number of stages in the biological sample. Beta coefficient is the strength and sign of candidate genes involved in the progression of COPD (Y).

Cross-validation, stability, and accuracy
The repeated k-fold cross-validation by bootstrapping is a good strategy to reduce the high variability of cross-validation [36]. Sensitivity analysis was performed to check the robustness of the results against changes in the training sets [37]. In the present study, the algorithm split the data set using repeated random 100 times sub-sampling in 5-fold cross-validation, permuting the sample labels every time. Cross-validated performances were summarized by observed sensitivities and specificities with standard deviation (SD). Furthermore, the area under the Receiver Operator Characteristic (ROC) curve (AUC) was used to calculate the precision of performance of the classifiers [38,39]. We used the bootstrap technique, which enables us to predict the fit of a model to a hypothetical testing set when an independent dataset is not available. The bootstrap approach allows us to use a computer to mimic the process of obtaining new datasets so that we can estimate the variability of our estimate without generating additional samples. Rather than repeatedly obtaining an independent dataset, we instead obtain distinct datasets by repeatedly sampling observations from the original data sets with replacement. To validate the selected genes with previous studies, literature mining was performed in PubMed. Interactive cluster heatmap was applied by the "heatmaply" R package [40].

Differential analysis of genes expression data
Removing batch effects and normalizing data, according to the differential expression analysis of COPD vs. healthy samples, 918 probes from SAE, 1942 probes from lung tissue, 134 probes from blood, 1074 probes from alveolar macrophages, and 5768 probes from sputum samples were identified as the differentially expressed genes (adjusted p-value < 0.0001 or absolute fold-change >2) ( Table 1).

Gene selection and prediction
After adjustment of the effects of confounding variables (e.g. age and smoking status) by using Elastic-net penalized logistic regression, the total number of 33 genes was selected as associated factors with COPD progression with AUCs, sensitivities, and specificities in each biological sample ( Table 2). According to statistical comparisons of AUCs of selected genes in different biological samples, genes identified in SAE cells and macrophages performed significantly better to predict the disease progression/stage. However, the AUC of the candidate genes in SAE samples was not significantly different (p-value = 0.478) compared to the AUC in macrophages to predict the disease stage ( Figure 1). Functional enrichment classified the novel genes into five groups, including "Regulation of CoA-transferase activity", "Vacuole organization", "dendritic spine organization", and "Cell adhesion molecules" (supplementary file 2). The expression level of candidate genes was measured and graphed among all biological samples (Lung Tissue, SAE, Blood, Macrophage, and Sputum) in each healthy and all COPD stages (Figure 2).
For plotting co-expression patterns of selected genes among the patients, heatmap with agglomerative hierarchical clustering were plotted ( Figure 3). Co-expression pattern of the selected genes resulted in four major clusters in COPD patients, including (OXNAD1, CCR4, ITK, Table 1. the summarized data indicating the primary, qualified, and differentially expressed probes in each biological sample.  (Table 3). THSD4, PPP4R4, CDKN2A, CADM1, and NRG1, which has previously been detected in GWAS studies to determine single nucleotide polymorphisms (SNPs) in COPD and asthma, were among the mentioned 24 genes (https://www.ebi.ac.uk/gwas/ home) [41][42][43]. However, we identified nine genes that have not been previously reported in COPD and other lung diseases, including RPUSD2, RAB11B, BTBD19, DNM3, SH3RF2, MTHFSD, ATOH8, SRPX, and HSPA4 (Table 3). These genes may represent novel potential biomarkers in the diagnosis and prognosis of COPD. The functional protein interaction network for the selected genes is illustrated in Figure 3, based on the STRING database ( Figure 4).

Discussion
COPD patients can currently be classified based on clinical tests, including spirometry or St George's Respiratory Questionnaire (SGRQ), but the need for further prognosis mandates identifying novel biomarkers that highly correlate with the disease progression. Nowadays, the application of high-throughput techniques as well as systems biology and machine learning approaches deeply alter the vision of researchers toward identification of novel diagnostic and prognostic biomarkers. Different studies have been conducted on the transcriptome of various cell lines involved in COPD's pathogenicity. In this study, previously reported influencing cell types were studied to identify the most impacting genes in the development of COPD staging. Heterogeneity of samples make the algorithms vulnerable to identify higher number, more pure, and less noisy differentially expressed genes (DEGs), which in case narrows the discovery route of novel predicting genes.
The penalized logistic regression model and SEM are used to identify novel genes and to assess the connection network of the selected genes, respectively. The studies show that the elastic net often performs better than ridge and LASSO for the model selection consistency and prediction accuracy in microarray data [17]. The strengths of this study include the application of modern computational methods such as the statistical learning, SEM, validation of all of the results by literature review, and repeated cross-validation method (repeated 5-CV). However, some of the limitations of this study were the small sample size of disease stages, the lack of external validation dataset. On the other hand, applied novel methodologies in this study are robust against  outliers and heterogeneity in gene expression data according to the bootstrap and cross-validation, standardized path coefficient in SEM and sensitivity analysis.
Analysis determined that identified genes in SAE cells, and macrophages had a higher accuracy rate than other biological samples in prediction of the disease progression. According to our results, twenty-four out of 33 genes were previously reported to associate with COPD, lung function (FVC, FEV 1 or the FEV 1 /FVC ratio), and other lung related complications. Nine novel genes were identified with no background in COPD studies, including RPUSD2, RAB11B, BTBD19, DNM3, SH3RF2, MTHFSD, ATOH8, SRPX, and HSPA4. Amongst all, RPUSD2, RAB11B, BTBD19, DNM3, and MTHFSD were the most important novel genes in the analyses that can be nominated as novel important genes involved in the prognosis of COPD staging progression.
Contributing with the highest number of novel genes, epithelial cells appear to be most influential in pathogenesis of the COPD. The second rank in number of predicting genes belongs to macrophages. Macrophages have long been reported for their pillar role in inflammatory diseases and tissue remodeling processes. Enrichment analysis of novel candidate genes of epithelial cells demonstrated high contribution in cancer progression processes, initiation of inflammation and healing processes, and immune cell recruitment. On the other hand, some candidate genes impact hydrolyzing enzymes that leads to accumulation of mucus in bronchioles.
Thereupon, epithelial cells play a central direct role in COPD pathogenesis. However, macrophages are more indirectly in charge of inflammatory circumstances of lung microenvironment. Overall, it must be considered that macrophages have the upper hand compared to epithelial cells when it comes to COPD progression. Both M1 and M2 subclasses of macrophages are critical for healing processes. M1 cells remove damaged cells and M2 cells prepare an environment for proliferation and remodeling.
Macrophage candidate genes have high co-express with TWIST1 and PRRX1 transcription factors, which can be classified as oncogenes. These genes are mostly involved in genome replication and repair, along with autophagy and mitophagy processes. Comparing up and down-regulated candidate genes in macrophages reveal that downregulated genes mostly belong to antioxidant or proliferation initiation, while upregulated genes mostly belong to oxidant or tension responses.
Most candidate genes in tissue samples mainly contribute to chemokine signaling pathways, immune systems activation, cell-cell adhesion, and different G-protein related signaling cascades.
The co-expression network of candidate genes reveals that mostly tightly correlate with the immune system signaling mediators. We previously conducted research on COPD that highlighted the importance of epithelial cells in the progression of the disease. Also, 17 novel genes were introduced to be associated with the pathogenesis of COPD. PRKAR2B, GAD1, LINC00930, and SLITRK6 were the most important genes, which are consistent with the current study [12].

Conclusions
These novel genes may provide the basis for the development of therapeutics in COPD and its associated morbidities in the future. It is hoped that further studies on this issue would identify novel genes as biomarkers to help diagnosis and prognosis in COPD.
SM participated in the design of the study and wrote the manuscript. HB and JS performed the data gathering. AE and MA performed the statistical analysis. AA and SAJ conceived of the study, and participated in its design and coordination, and helped to draft the manuscript. MS edited the manuscript. All authors read and approved the final manuscript.

Disclosure statement
The authors declare that they have no competing interests.

Funding
The author(s) reported there is no funding associated with the work featured in this article.

Data availability statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.