IFI44L and C1QTNF5 as promising biomarkers of proliferative diabetic retinopathy

Proliferative diabetic retinopathy (PDR) is a world-wide leading cause of blindness among adults and may be associated with the influence of genetic factors. It is significant to search for genetic biomarkers of PDR. In our study, we collected genomic data about PDR from gene expression omnibus (GEO) database. Differentially expressed gene (DEG) analysis and weighted gene co-expression network analysis (WGCNA) were carried out. The gene module with the highest gene significance (GS) was defined as the key module. Hub genes were identified by Venn diagram. Then we verified the expression of hub genes in validation data sets and built a diagnostic model by least absolute shrinkage and selection operator (LASSO) regression. Enrichment analysis, including gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), gene set enrichment analysis (GSEA) and construction of a protein–protein interaction (PPI) network were conducted. In GSE60436, we identified 466 DEGs. WGCNA established 14 gene modules, and the blue module (GS = 0.64), was the key module. Interferon (IFN)-induced protein 44-like (IFI44L) and complement C1q tumor necrosis factor-related protein 5 (C1QTNF5) were identified as hub genes. The expression of hub genes in GEO datasets was verified and a diagnostic model was constructed by LASSO as follows: index = IFI44L * 0.0432 + C1QTNF5 * 0.11246. IFI44L and C1QTNF5 might affect the disease progression of PDR by regulating metabolism-related and inflammatory pathways. IFI44L and C1QTNF5 may play important roles in the disease process of PDR, and a LASSO regression model suggested that the 2 genes could serve as promising biomarkers of PDR.


Introduction
Diabetes retinopathy (DR) is a major complication of diabetes and the leading cause of blindness in adults. [1,2] As of 2019, there were about 463 million people with diabetes worldwide, and up to 35% of them have DR. [3] Usually, DR is classified into 2 stages: non-proliferative diabetic retinopathy (NPDR) and proliferative diabetic retinopathy (PDR). [4] Some known risk factors for DR include the duration of diabetes, hyperglycemia, hypertension, and hyperlipidemia. By controlling these risk factors, many patients with DR could be effectively prevented from developing into the sight-threatening type of PDR or neovascular glaucoma (NVG). [5] However, there is still a high proportion of patients with DR whose disease progression is difficult to control. The pathogenesis of PDR is complex and involves vascular, inflammatory, and neuronal mechanisms. [6] These may be closely associated with the influence of genetic factors. [7] With the rapid development of genomics, many genes have been found to be related to PDR, such as vascular endothelial growth factor (VEGF), [8] hypoxia-inducible factor 1-α (HIF1A), [9] erythropoietin (EPO), [10] receptor for advanced glycation end product (RAGE), [11] among others. Neovascularization is an important feature of PDR, thus anti-VEGF agents have become a promising therapy for PDR. [12] So far, anti-VEGF therapy has shown poor efficacy for diabetic macular edema (DME), another major factor affecting the prognosis of PDR, [13] and this phenomenon is likely to be related to individual genetic differences. Therefore, it is of great significance to search for genetic biomarkers of PDR. Medicine The development of bioinformatics plays a powerful role in promoting genomics research. Gene expression omnibus (GEO, https:// www.ncbi.nlm.nih.gov/geo/) is a public database that contains microarray and high-throughput genomic datasets. It also includes genomic data from human patients with PDR. [14][15][16][17] Weighted gene co-expression network analysis (WGCNA) is a well-known systems biology method to identify gene association patterns. [18] Co-expressed genes are clustered into various modules named after colors. These modules are correlated with phenotypic traits, and key genes within the networks should be found. [19] Least absolute shrinkage and selection operator (LASSO) is an effective approach for the regression of high-dimensional parameters and has been broadly applied for diagnostic and prognostic analysis. [20,21] The objective of our research was to identify genomic biomarkers for PDR. Firstly, we searched the GEO database for genomic data of human patients with PDR. Comprehensive bioinformatics approaches, including WGCNA, were utilized to construct a gene co-expression network (GSE60436), find the key gene module and the hub genes, verify the expression of hub genes in validation datasets (GSE94019, GSE160306 and GSE178721) and analyze the enriched pathways of genes in the key module. Finally, the LASSO regression model was adopted to investigate the diagnostic value of hub genes for PDR.

Screening for differentially expressed genes (DEGs)
We utilized the R package "limma" to identify the DEGs, [24] by setting the statistical threshold to P < .05 and the absolute value of fold change (FC) > 2. Volcano plots and heatmaps were drawn by the "ggplot 2" and "ggpubr" packages in R.

Weighted gene co-expression network construction
The WGCNA package in R was used to build a gene co-expression network for GSE64036. [18] We picked out 5000 genes with the largest median values at the beginning. And then we selected a soft threshold using the "pickSoftThreshold" function of the package. Afterwards, genes were classified into multiple gene modules. The gene module with the highest gene significance (GS) value was defined as the key module. Genes in the key module with module membership (MM) > 0.9 and GS > 0.3 were considered interesting genes.

Identification and verification of hub genes
Hub genes were identified by the overlapping of DEGs in GSE60436 and interesting genes in the key module. And we further verified the expression of hub genes in GSE94019, GSE160306 and GSE178721. We generated Venn diagrams using the Venny tool (https://bioinfogp.cnb.csic.es/tools/venny/ index.html) and generated violin plots and box plots using the R packages "ggstatsplot" and "ggpubr."

Functional enrichment analysis of key module
Gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of genes in the key module were performed using the R package "ClusterProfiler." [25] Gene set enrichment analysis (GSEA) and gene set variation analysis (GSVA) are popular tools for analyzing pathways enriched in gene expression profiles. We ran GSVA analysis of GSE60436 by the "GSVA" package in R. [26] Additionally, samples from GSE94019 were divided into high-expression groups and low-expression groups according to the expression levels of single hub genes, and GSEA analysis was performed by the "ClusterProfiler" package.

Construction of a protein-protein interaction (PPI) network
Genes in the key module were imported into the STRING database (https://string-db.org/). [27] A PPI network was constructed in the Cytoscape software. [28] And we calculated and visualized the Maximum Neighborhood Component (MCC) value of each node in the PPI network by the "Cytohubba" plugin. [29]

Construction of LASSO model and receiver operating characteristic (ROC) curve analysis
Firstly, GSE94019 and GSE160306 were combined and then randomly assigned to the training set (70%) and the test set (30%). Then we constructed the LASSO model by using the "glmnet" package in R. [30] Using the regression coefficients from LASSO, a model index to weight the expression of hub genes was established with the formula: index = Gene1 * Coefficient1 + Gene2 * Coefficient2. R package "pROC" was utilized to generate ROC curve diagrams in the training set and the test set. [31] 3. Results

Identification of the key gene module by WGCNA
Firstly, we screened the top 5000 genes with the highest median to run WGCNA. Then, sample data was clustered, and no outliers were found in our study ( Fig. 2A). The soft threshold was set to 6 according to the "pickSoftThreshold" function ( Fig. 2B). Then a total of 14 gene modules were generated based on WGCNA (Fig. 2C, and see Table S1, Supplemental Digital Content, http:// links.lww.com/MD/H995, which illustrates the statistic of gene modules constructed by WGCNA). The modules were named after different colors. The largest turquoise module was composed of 2568 genes and the smallest gray module was composed of 19 genes. The blue module scored the highest among the gene modules (GS = 0.64, (Figs. 2D, 3A and 3B), so it was identified as the key module. In the blue module, 46 interesting genes whose module membership (MM) > 0.9 and GS > 0.3 were picked out (Fig. 3B). The eigengene dendrogram and adjacency heatmap displayed the correlation between PDR and the modules (Fig. 3C).

Identification and verification of hub genes
By taking the intersection of DEGs in GSE60436 and interesting genes in the blue module, interferon (IFN)-induced protein 44-like (IFI44L) and complement C1q tumor necrosis factor-related protein 5 (C1QTNF5) were identified as hub genes (Fig. 4A). We verified the expression of IFI44L and C1QTNF5 in GSE94019, GSE160306, and GSE178721 (Fig. 4B-F). IFI44L was found highly expressed in GSE94019, GSE160306, and GSE178721 but not in GSE60436. C1QTNF5 was highly expressed in all 4 datasets but was not significantly different in GSE160306 and GSE178721. What's more, we have discovered that high IFI44L expression seemed to have strong associations with the disease status of diabetic retinopathy according to its expression in GSE160306. More importantly, similar trends in IFI44L expression were shown in the tissue samples (GSE94019 and GSE160306) and serum samples (GSE178721).

Enriched biological processes and pathways in the key module and PPI network construction
Genes in the blue module were mainly enriched in protein transport to the vacuole involved in ubiquitin-dependent protein catabolic process via the multivesicular body sorting pathway (GO:0043328) in GO biological process, long chain enoyl-CoA hydratase activity (GO:0016508) in GO molecular function, and 6-phosphofructokinase complex (GO:0005945) in GO cellular component (Fig. 5A), and see Table S2, Supplemental Digital Content, http://links.lww.com/MD/H996, which illustrates GO terms enrichment analysis of the blue module). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis showed that pathways of neurodegeneration-multiple diseases (hsa05022) and endocytosis (hsa04144) were regulated by the blue module genes (Fig. 5B). Moreover, a PPI network was constructed for interesting genes in the blue module, which consisted of 38 gene nodes (Fig. 5C). GSVA of GSE60436 suggested the importance of the TNF-α signaling pathway and interferon α response for PDR (Fig. 5D). According to GSEA results of GSE94019, IFI44L might act on the insulin signaling pathway and mitogen-activated protein kinase (MAPK) signaling pathway (Fig. 6A) and C1QTNF5 might affect the retinol metabolism pathway and MAPK signaling pathway (Fig. 6B).

LASSO model of IFI44L and C1QTNF5
We utilized the LASSO model to identify IFI44L and C1QTNF5 with non-zero regression coefficients. The minimum value of lambda was 0.001245652. Two gene-based model indexes were created as follows: index = IFI44L * 0.0432 + C1QTNF5 * 0.11246 (Fig. 7A). ROC curve analysis indicated that the area under the curve (AUC) of the two-gene-based model was 0.94 in the training set (Fig. 7B) and 0.80 in the test set (Fig. 7C). This suggested that 2 hub genes could serve as potential diagnostic biomarkers of PDR.

Various
proteins, cytokines, deoxyribonucleic acid (DNA) methylation, messenger ribonucleic acid (mRNA), and non-coding RNAs are considered as biomarkers for PDR, for instance, CCDC144NL and ZNF80, [32] MMP-14, [33] extracellular matrix metalloproteinase inducer (EMMPRIN), [34] miR-296, [35] miR-20b, miR-17-3p, HOTAIR, and MALAT1, [36] hsa_circ_0001953. [37] The relationships between IFI44L and C1QTNF5 with PDR have not been reported yet. In our study, GSE60436 was divided into 14 modules by WGCNA, among which the enriched GO functions of the key module involved a variety of glucose and lipid metabolic pathways similar to those reported. [38,39] We then identified IFI44L and C1QTNF5 as hub genes and verified their expression in validation data sets. Finally, a diagnostic model was established by LASSO regression, and the diagnostic significance of the 2 genes for PDR was further analyzed.
The IFI44L gene, is an important paralog of the IFI44 gene. It has been reported that IFI44L is closely associated with tuberculosis, [40] Zika virus infection, [41] hepatitis C [42] and hepatocellular carcinoma. [43] In a study of human retinal pigmented epithelial cells (RPE) in an age-related macular degeneration (AMD) disease model, IFI44L was significantly differentially expressed and may play a role in regulating inflammatory pathways. [44] Late-onset retinal degeneration has been linked to mutations in the C1QTNF5 gene. [45,46] It belongs to the C1q/TNF-related protein (CTRP) family, a family of genes that link immunity to metabolism by exerting anti-inflammatory and insulin-sensitizing effects. [47] Surprisingly, C1q/TNF-related protein 3 (CTRP3) in the same family might serve as a novel biomarker for DR severity. [48] The pathogenesis of PDR is related to inflammatory mechanisms and retinol and insulin metabolism. [6,49,50] GSEA suggested that IFI44L could affect the insulin signaling pathway and the MAPK signaling pathway, and that C1QTNF5 could affect the retinol metabolism pathway and the MAPK signaling pathway. The MAPK signaling pathway is among the well-known cellular inflammatory pathways. [51] Therefore, we might hypothesize set (GSE94019 and GSE160306 were combined and then 70% of the combined dataset were randomly assigned to the training set). (C) ROC curves analysis of test set (the rest 30% of the combined dataset were assigned to the test set). PDR = proliferative diabetic retinopathy. Medicine that IFI44L and C1QTNF5 might affect the disease progression of PDR by regulating metabolism-related pathways and inflammatory pathways.
However, there were several shortcomings in our study. Our study was based on genomic data analysis, but the sample size of public datasets is limited. And 2 hub genes were not statistically significant in all the validation datasets. Moreover, our study was merely a theoretical analysis without experimental verification. In the follow-up study, we will expand the sample size of clinical patients to verify the difference in hub genes in peripheral blood samples and explore the possible mechanism of hub genes in PDR through experiments.

Conclusion
IFI44L and C1QTNF5 may play important roles in the disease process of PDR, and a LASSO regression model suggested that the 2 genes could serve as promising biomarkers of PDR.