Data Set Sources:
scRNA dataset (GSE167295)
RNA-seq database (GSE57148)
Data Acquisition:
Exposure Data: eQTL data comes from the eQTLGen consortium (https://www.eqtlgen.org) database. The eQTLGen consortium aims to study the genetic structure of blood gene expression and understand the genetic basis of complex traits. The large-scale eQTLGen project is currently in its second phase, focusing on large-scale genome-wide meta-analyses in blood.
Outcome Data: The participants of the outcome-related GWAS studies selected in this research are mainly of European ancestry. Outcome summary data all come from the UK Biobank database (ukb-b-16751), which provides all the details of the genome-wide association study (GWAS) pipeline developed by MRC-IEU for the full UK Biobank genetic data. Among them, there are 3,871 cases and 459,139 controls for Chronic Obstructive Pulmonary Disease.
RNA-seq Data: The GEO database (https://www.ncbi.nlm.nih.gov/geo/), officially called the Gene Expression Omnibus, is a gene expression database created and maintained by the US National Center for Biotechnology Information (NCBI). The Series Matrix File data file for GSE57148 was downloaded from the NCBI GEO public database, with the annotation file being GPL11154, incorporating expression profile data from 189 patients, including 91 controls and 98 patients with Chronic Obstructive Pulmonary Disease. Data related to Chronic Obstructive Pulmonary Disease from GSE167295 was downloaded from the NCBI GEO public database for single-cell related analysis, totaling 3 samples.
Related Gene Set: The immune-related gene set (immune) used in this analysis was obtained through the GeneCard database (https://www.genecards.org). For this analysis, 153 miRNA sets related to Chronic Obstructive Pulmonary Disease were obtained from the HMDD database (http://www.cuilab.cn/hmdd).
Gene Functional Enrichment Analysis:
Through the Metascape database (www.metascape.org), key gene sets are functionally annotated to comprehensively explore the functional relevance of the gene set. Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis are conducted for specific genes. A Min overlap≥3 & p≤0.01 is considered statistically significant.
Mendelian Randomization Analysis:
The MR Base database (http://app.mrbase.org/) contains a large amount of summary statistical data from hundreds of GWAS studies. The outcome id filtered from the MR Base database is extracted in GWAS summary data (https://gwas.mrcieu.ac.uk/) for the causal relationship in eQTL. SNPs related to each gene at a genome-wide significance threshold (P < 1.0×10-5) are selected as potential IVs (Instrumental variables). The LD (Linkage Disequilibrium) between SNPs is calculated. Among the SNPs with R2 < 0.001 (clumping window size=10,000kb), only those with p2 < 5e-8 are retained. Sequentially, the reliability of the causal relationship is assessed using four statistical methods: Inverse variance weighted (IVW, using the meta-analysis method combining the Wald estimate of each SNP), MR Egger (based on the assumption that instrument strength is independent of direct effects (InSIDE)), Weighted median (allows correct estimation of the causal relationship even if up to 50% of the IVs are invalid), and Weighted mode (estimates have greater power to detect causal effects, less bias, and lower type I error rate than MR-Egger regression). If there is only one SNP in the causal relationship, only the Wald ratio method is used. Finally, the screened causal relationships are verified through heterogeneity (Cochran's IVW Q test) and pleiotropy tests.
Immune Cell Infiltration Analysis:
The CIBERSORT algorithm is employed to analyze the RNA-seq data of different subgroups of COPD patients to infer the relative proportions of 22 types of immune infiltrating cells. Pearson correlation analysis is conducted on gene expression levels and immune cell content, with P<0.05 considered statistically significant.
GSVA Analysis (Gene Set Variation Analysis):
Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method to evaluate the enrichment status of transcriptomic gene sets. GSVA scores the sets of interest, converting gene-level changes into pathway-level changes, thereby determining the biological functions of the samples. In this study, gene sets will be downloaded from the Molecular Signatures Database (version 7.0), and the GSVA algorithm will be used to score each gene set, evaluating potential biological function changes in different samples.
GSEA Enrichment Analysis:
GSEA uses predefined gene sets, ordering genes by their degree of differential expression between two types of samples. It then tests whether the predetermined gene set is enriched at the top or bottom of this ordered list. In this research, GSEA will be used to compare the signaling pathway differences between high-expression and low-expression groups, exploring the molecular mechanisms of core genes in the two patient groups. The number of permutations is set at 1000, and the permutation type is set to phenotype.
Regulatory Network Analysis of Key Genes:
In this study, the "RcisTarget" R package is used to predict transcription factors. All calculations executed by RcisTarget are based on motifs. The normalized enrichment score (NES) of a motif depends on the total number of motifs in the database. In addition to the motifs annotated by the source data, further annotation files are inferred based on motif similarity and gene sequences. The first step to estimate the overexpression of each motif in the gene set is to calculate the area under the curve (AUC) for each motif-gene set pair. This is done based on the recovery curve of genes sorted by the gene set. The NES of each motif is calculated based on the AUC distribution of all motifs in the gene set. We use the rcistarget. hg19. motifdb. cisbpont.500bp for the Gene-motif rankings database.
Single-cell Analysis:
Initially, expression profiles are loaded using the Seurat package, and low-expression genes are filtered out (nFeature_RNA > 100 & percent.mt < 5). The data are then normalized, scaled, and processed with PCA. The optimal number of PCs (11) is observed using the ElbowPlot, and the TSNE analysis determines the spatial relationship between each cluster. The celldex package is employed to annotate the clusters, which are subsequently linked to cells significantly associated with disease onset.
Drug Target Prediction Analysis:
The Connectivity Map (CMap) is a gene expression database based on gene perturbation developed by the Broad Institute. It aims to reveal functional links between small molecules, genes, and disease states. It contains gene chip data from five human tumor cell lines treated with 1309 small molecule drugs under various conditions, including different drugs, concentrations, and treatment durations. This study uses the differentially expressed genes of chronic obstructive pulmonary disease to predict targeted therapeutic drugs for the condition.
Statistical Analysis:
Reliable MR analysis is based on three core assumptions: (1) the relevance assumption (the instrumental variable is closely related to the exposure but not directly to the outcome), (2) the independence assumption (the instrumental variable cannot be related to confounders), and (3) the exclusion restriction assumption (the instrumental variable can only affect the outcome through the exposure, and if the IV can affect the outcome through other means, it's considered to have pleiotropic effects). All analyses are conducted in the R language (version 4.0). All statistical tests are two-sided, and p<0.05 is considered statistically significant.