Protocol to analyze dysregulation of the eIF4F complex in human cancers using R software and large public datasets

Summary Understanding dysregulation of the eukaryotic initiation factor 4F (eIF4F) complex across tumor types is critical to cancer treatment development. We present a protocol and accompanying R package “eIF4F.analysis”. We describe analysis of copy number status, gene abundance and stoichiometry, survival probability, expression covariation, correlating genes, mRNA/protein correlation, and protein co-expression. Using publicly available large multi-omics data, eIF4F.analysis permits computationally derived and statistically powerful inferences regarding initiation factor regulation in human cancers and clinical relevance of protein interactions within the eIF4F complex. For complete details on the use and execution of this protocol, please refer to Wu and Wagner (2021).1


SUMMARY
Understanding dysregulation of the eukaryotic initiation factor 4F (eIF4F) complex across tumor types is critical to cancer treatment development. We present a protocol and accompanying R package ''eIF4F.analysis''. We describe analysis of copy number status, gene abundance and stoichiometry, survival probability, expression covariation, correlating genes, mRNA/protein correlation, and protein co-expression. Using publicly available large multi-omics data, eIF4F.analysis permits computationally derived and statistically powerful inferences regarding initiation factor regulation in human cancers and clinical relevance of protein interactions within the eIF4F complex. For complete details on the use and execution of this protocol, please refer to Wu and Wagner (2021). 1

BEFORE YOU BEGIN Overview
In complex biological systems, biological processes rely on participation of multiple proteins through their physical and specific functional interactions. Clinical relevance of protein-protein interactions (PPIs) has been traditionally investigated through wet-lab approaches using tissue cultures and animal models, which are often complicated by easily-perturbed cell culture conditions or poor recapitulation of human diseases. 2 To mitigate these complications, we developed a computational method to infer PPIs for eukaryotic translation initiation complexes and their clinical relevance in cancers from human data. We produced an R package that employs a series of computational analyses including copy number comparison, survival regression modeling, quantification of subunit abundance and stoichiometry, differential gene expression with unsupervised machine learning, differential gene correlation and pathway activity, and mRNA/protein correlations.
The validity of our method derives from the biological requirement for interacting proteins to be simultaneously present within a cell, which implies that synthesis and degradation of interacting proteins ought also to coincide. 3 The statistical association between in vivo gene co-expression and protein interactions has been reported after analyzing large-scale data from multiple species including human. [4][5][6] When using expression data from clinical samples of The Cancer Genome Atlas (TCGA), Genotype-Tissue Expression (GTEx), Cancer Cell Line Encyclopedia (CCLE), and Clinical Proteomic Tumor Analysis Consortium (CPTAC) databases, our method allows to assess functional and clinical relevance of PPIs by associating protein co-expression with malignant phenotype or disease progression. Because our approach leverages large sample groups, it benefits from high statistical power.
As we previously published, 1 we have applied this approach to assess PPIs of eIF4F subunits in healthy and cancerous tissues. It yielded high-value, cost-efficient biochemical insights into functional and regulatory effects driven by eIF4F subunit interactions across tumor types, in addition to those previously published from isolated cell lines cultivated in wet labs. Figure 1 summarizes the major analyses accomplished by functions in eIF4F.analysis package, and a step-by-step guide to infer the clinical relevance of EIF4F genes and their protein interactions from analysis results. Our innovative computational method can be extended for broader application to reveal the clinical relevance of PPIs in assorted disease conditions, given additional disease-related datasets, custom lists of candidate protein complexes with subunits, and implementation of more association analyses. However, the protocol presented here is narrowly targeted to the production of our eIF4F results.
Preparation one: Installing RStudio/R eIF4F.analysis package was developed in RStudio and implemented in the R programming language.  Preparation four: Clone the repository files We provide two R scripts in our GitHub repository for users to download datasets and to run eIF4F.analysis package. Users need to clone the package files from GitHub, and set up the directories for input and output files.
6. Open the terminal and run the following command line to clone our GitHub repository for eIF4F.analysis package. This step will store all package files under home directory as /eIF4F.analysis (Figure 2A). Note: Installing eIF4F.analysis package in the Linux system is highly recommended.
7. Under the /eIF4F.analysis/Script folder ( Figure 2B), there are two R scripts Download.R and Analysis.R for users to acquire datasets and perform analyses. Open Download.R file in RStudio, verify the directories for input and output files as following.
CRITICAL: By default, Download.R generates root directory paths /eIF4F.analysis/eIF4F_data and /eIF4F.analysis/eIF4F_output. These two directory paths are also set inside the package (in Load.R), and the values must agree in both places. The descriptions of the steps that follow assume the use of the default eIF4F_output location.

Preparation five: Download datasets
Timing: 1 h Execution of Download.R script creates two folders: /eIF4F.analysis/eIF4F_data and /eIF4F.analysis/eIF4F_output ( Figure 2C). Download.R then downloads all needed datasets (TGCA, GTEx, CPTAC, CCLE, etc.) from URLs to /eIF4F.analysis/eIF4F_data and decompresses them. After the completion of the download and unzip steps, the eIF4F_data folder contains 16 data files ( Figure 2D), with a collective size of 15 GB. /eIF4F.analysis/eIF4F_ output is to store all the analyzed results in the following steps.
8. Run the Download.R file in RStudio with the following command line.
Note: Download times were observed with a 100 Mbps internet connection. Remote data sources mentioned here may offer limited and lower bandwidth.
CRITICAL: By default, user will be able to clone all the scripts and R codes from Github repository, to download datasets, and to store analysis results (by default) under a single folder /eIF4F.analysis.

STEP-BY-STEP METHOD DETAILS
The package includes one initiation step, and seven major analysis steps. User can execute each analysis step with one exported function that calls a group of internal functions for data process, data analysis and plotting to achieve a set of analyses. users can access their source code from R scripts and modify the input parameter outside this package for their own usages. The following section explains the definition and organization of each function. The detailed documentation of all (exported and internal) functions within the package is available at the following weblink: https://a3609640.github.io/eIF4F.analysis/reference/index.html.
Analysis.R contains ten exported functions in the package to initialize package and to execute all analyses presented in Wu and Wagner (2021). 1 Users can simply execute the following command in RStudio to get all analyses performed and results stored under /eIF4F.analysis/eIF4F_output.

Timing: < 5 min
Initialization processes rely on three exported functions to define subdirectories for output data, to define the graphic formats (font size and style), and to load data from downloaded data files. The definitions of three initialization functions were stored in /eIF4F.analysis/R/Load.R.
1. Run the following command line to create the output directories to store the output files.
Note: initialize_dir() creates sub-directories under /eIF4F.analysis/eIF4F_ output ( Figure 2F) to store the output results for each analysis step.
2. Run the following command line to create variables that define font type, size, and orientation for plotting.
3. Run the following command line to acquire omics datasets from the download data files.
Note: initialize_data() is a wrapper of five data initialization functions that import, trim, clean up and merge datasets from the download data files. These data initialization functions have side effects that populate twelve global variables that process and merge omics and annotation data by sample IDs for the following analysis steps. The data contained by these global variables are available in the form of data frames, as input for our analysis functions, but do not show on the user's workspace. For users to access them, the data initialization step saves twelve global variables as csv files under /eIF4F.analysis/eIF4F_ output/ProcessedData folder ( Figure 2G).
Alternatives: Five individual data initialization functions are also accessible to users for execution. Instead of running the wrapper function initialize_data() to generate all twelve global variables at once, users can generate global variables related to each analysis function with the following individual data initialization functions. i. TCGA_CNV_value contains the unthresholded CNV value data of tumors from all TCGA cancer types combined, and comes from the dataset, "Gistic2_CopyNumber_Gistic2_ all_data_by_genes". ii. TCGA_CNV_sampletype contains the threshold CNV dataset of tumors from all combined TCGA cancer types. It comes from two datasets: "Gistic2_CopyNumber_ Gistic2_all_thresholded.by_genes", and the annotation dataset "TCGA_phenotype_ denseDataOnlyDownload.tsv". iii. TCGA_CNVratio_sampletype contains the data of CNV ratios in tumors vs. adjacent normal tissues from individual TCGA cancer types. It comes from two datasets: "broad. mit.edu_PANCAN_Genome_Wide_SNP_6_whitelisted.gene.xena", and the annotation dataset, "TCGA_phenotype_denseDataOnlyDownload.tsv". b. initialize_RNAseq_data() reads the recomputed RNAseq data from both TCGA and GTEx. The implementation details of each operation are within the /eIF4F.analysis/ R/DEG.R file. initialize_RNAseq_data() sets one global variable TCGA_GTEX_ RNAseq_sampletype for the gene expression analysis (step-3), PCA (step-5) and correlating gene analysis (step-6), and stores TCGA_GTEX_RNAseq_sampletype as ''TCGA_GTEX_RNAseq_sampletype.csv'' in the ProcessedData folder. i. TCGA_GTEX_RNAseq_sampletype comes from the recomputed RNAseq dataset from both TCGA and GTEx, "TcgaTargetGtex_RSEM_Hugo_norm_count", and the annotation dataset "TcgaTargetGTEX_phenotype.txt". c. initialize_survival_data() reads the RNAseq and patient survival data from TCGA. The implementation details of this operation are within the /eIF4F.analysis/ R/Survival.R file. This function sets the global variable TCGA_RNAseq_OS_sampletype for survival analysis (step-4), and store TCGA_RNAseq_OS_sampletype as ''TCGA_ RNAseq_OS_sampletype.csv'' inside the ProcessedData folder. i. TCGA_RNAseq_OS_sampletype comes from three datasets: the RNAseq dataset "EB++ AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena", the survival dataset "Survival_ SupplementalTable_S1_20171025_xena_sp" and the annotation dataset "TCGA_phenotype_ denseDataOnlyDownload.tsv". d. initialize_proteomics_data() reads the proteomics related data from CCLE and CPTAC LUAD, including: proteomics data, annotation data for cancer types, RNAseq data for the correlation analysis on protein RNA levels (step-7 and step-8). The implementation details of this operation are within the /eIF4F.analysis/R/RNAProCorr.R file.
CRITICAL: The first run of data initialization functions creates and stores the processed data from download files. Following runs of the initialization functions will check the existence of processed data files and read them, which take a much shorter time to complete than the first run. Figure 3A summarizes the internal code structure for five data initialization functions, and shows twelve global variables in dark gray boxes.
Step-2: Analyze the copy number variation (CNV) status of EIF4F genes

Timing: < 1 min
This step performs three types of analyses on CNV statuses of EIF4F genes across TCGA tumors and creates the analysis results both on screen and as pdf files stored in /eIF4F.analysis/ eIF4F_output/CNV folder.

Run the following command line in RStudio.
Note: EIF4F_CNV_analysis() is a wrapper function of three internal composite functions that take input data frames and call internal functions for analysis. Figure 3B summarizes the internal code structure for this step. The detailed definitions of all internal functions are in /eIF4F.analysis/R/CNV.R.
Note: .plot_bargraph_CNV_TCGA() takes the data frame TCGA_CNV_sampletype and calculates the frequency of each CNV status for EIF4F genes in all tumors combined from 33 TCGA cancer types. Its output is a stacked bar plot that ranks the EIF4F gene by the frequencies of copy number gain ( Figure 5A). .plot_bargraph_CNV_TCGA() also calculates the frequency of CNV status from individual TCGA cancer types. Its output is a number of stacked bar plots, in which cancer types are listed in the alphabetical order (e.g., Figure 5B).  Note: .plot_boxgraph_CNVratio_TCGA() takes the data frame TCGA_CNVratio_ sampletype and generates boxplots for CNV ratios in tumors vs. normal adjacent tissues (NATs) from individual TCGA cancer types. This function produces a CNV ratio boxplot for each gene in individual TCGA cancer types (e.g., Figure 5D).
Step-3: Compare the gene expression and ratio of EIF4F genes

Timing: < 1 min
This step compares the EIF4F RNA abundances and ratios across TCGA tumors, and saves the results in the /eIF4F.analysis/eIF4F_output/DEG folder. In this step, means of log-transformed gene expression values or RNA ratios are calculated in different samples, and compared by t-test.

Run the following command line in RStudio.
Note: EIF4F_DEG_analysis() is a wrapper function of two internal composite functions that take input data frames and call internal functions for analysis. Figure 3C summarizes the internal code structure for this step. The detailed definitions of all internal functions are in /eIF4F.analysis/R/DEG.R.
Note: .plot_boxgraph_RNAseq_TCGA() takes the data frame TCGA_GTEX_RNAseq_ sampletype and performs three analyses on RNAseq data. It compares the RNA abundance of all EIF4F genes in tumors from 33 TCGA cancer types in a box plot ( Figure 6A). Then it compares the expression of each EIF4F gene in tumors vs. NATs (e.g., Figure 6B). Finally, it compares the RNA expression in primary, metastatic tumors vs. NATs from all combined TCGA cancer types to produce violin plots ( Figure 6C).
Note: .plot_boxgraph_RNAratio_TCGA() takes the data frame TCGA_GTEX_ RNAseq_sampletype to calculate the RNA ratios of input genes within each TCGA sample, and performs two analyses on the RNA ratios. It first compares RNA ratio in tumors vs. NATs from individual TCGA cancer types by boxplots (e.g., Figure 7A). It then compares RNA ratios in primary, metastatic tumors vs. NATs from all combined TCGA cancer types to produce violin plots ( Figure 7B).
Step-4: Correlate EIF4F gene expression to patient survival probability

Timing: < 1 min
This step performs two types of survival analyses on EIF4F gene expression in TCGA tumors, and output results on screen and to the /eIF4F.analysis/eIF4F_output/Survival folder.
6. Run the following command line in RStudio.  Note: EIF4F_Survival_analysis() is a wrapper function to call two internal composite functions that take input data frames and call internal functions for analysis. Figure 3D summarizes the internal code structure for this step. The detailed definitions of all internal functions are in /eIF4F.analysis/R/Survival.R.
Note: .plot_KM_RNAseq_TCGA() takes the data frame TCGA_RNAseq_OS_sampletype and performs Kaplan-Meier (KM) analysis to associate survival probabilities with gene expression. This function takes arbitrary gene expression cutoff, 0.2 for 20% or 0.3 for 30%, to stratify the patient groups based on the top or bottom precents of gene expression within their tumors. This function performs KM analysis on all combined TCGA cancer types (e.g., Figure 8A) or individual cancer type such as ''lung adenocarcinoma'' (e.g., Figure 8B). This function imports the survfit() and survdiff() functions from the ''survival'' package to analyze the survival probabilities of patient groups, and produces the KM curve plots. Note: .plot_CoxPH_RNAseq_TCGA() takes the data frame TCGA_RNAseq_OS_sampletype, and quantitatively relates patient survival and gene expression in tumors by the Cox proportional-hazards (PH) regression method. This function can perform survival analyses on all combined TCGA cancer types or individual cancer type such as ''lung adenocarcinoma'' as an argument. The function performs both univariable Cox-PH analysis using a single gene expression as the dependent variable (e.g., Figure 8C), and multivariable Cox-PH analysis to Step-5: Principal component analysis on co-variation of EIF4F expression

Timing: < 1 min
This step performs the principal component analyses (PCA) on EIF4F expression in GTEx healthy tissues, or/and TCGA tumors, and output results to the /eIF4F.analysis/eIF4F_output/PCA folder.

Run the following command line in RStudio.
Note: EIF4F_PCA() is a wrapper function of two internal composite functions that take input data frames and call internal functions for analysis. Figure 4A summarizes the internal code structure for this step. The detailed definitions of all internal functions are in /eIF4F. analysis/R/PCA.R.
Note: .plot_PCA_TCGA_GTEX_tumor() takes the data frame TCGA_GTEX_RNAseq_ sampletype and selects RNAseq data from one specific TCGA cancer type and its matched healthy tissue from GTEx. This function performs PCA on EIF4F expression within the selected sample types and generates biplots (e.g., Figure 9F), scree and matrix plots for PCA results.
Note: .plot_PCA_CPTAC_LUAD() takes the data frame CPTAC_LUAD_Proteomics and selects proteomics data of input gene list. Due to the missing proteomics data for some inquired initiation proteins in the dataset, this function performs imputed PCA with the estim_ncpPCA() function from the ''missMDA'' package.

Timing: < 30 min
This step analyzes the correlating genes of each EIF4F subunits from tumor or healthy samples, and outputs results to the /eIF4F.analysis/eIF4F_output/COR folder.  Note: EIF4F_Corrgene_analysis() is a wrapper function to call one internal composite function that takes input data frames and calls internal functions for analysis. Figure 4B summarizes the internal code structure for this step. The detailed definitions of all internal functions are in /eIF4F.analysis/R/COR.R.
Note: .plot_Corr_RNAseq_TCGA_GTEX() takes the data frameTCGA_GTEX_RNAseq_ sampletype and selects RNAseq data from two sample types: TCGA tumors, or GTEx healthy tissues. This function separately identifies correlating genes (CORs) of EIF4E, EIF4A1, EIF4G1, and EIF4EBP1 from tumor samples or from healthy tissues. The significant CORs for each EIF4F subunit are identified and classified as positive or negative CORs (posCORs and negCORs). This function analyzes the overlap posCORs or negCORs of four EIF4F subunits from tumor samples or healthy tissues as Venn plots (e.g., Figure 10A). This function counts the numbers of posCORs and negCORs, and plots them in bar graphs for comparison (e.g., Figure 10B).
Note: .plot_Corr_RNAseq_TCGA_GTEX() merges tumor and healthy correlation data, and clusters correlation strengths by similarity in heatmaps (e.g., Figure 10C), using the CRITICAL: .plot_Corr_RNAseq_TCGA_GTEX() function only applies to EIF4F and related genes.
Step-7: Analyze the correlation between RNA and protein expression Timing: < 1 min This step examines the RNA and protein expression correlation in CCLE and CPTAC lung adenocarcinoma (LUAD) datasets, and outputs results to the /eIF4F.analysis/eIF4F_output/ RNApro folder.
9. Run the following command line in RStudio.
Note: EIF4F_RNA_pro_correlation() is a wrapper function to call two internal composite functions that take input data frames and call internal functions for analysis. Figure 4C summarizes the internal code structure for this step. Pearson correlation analysis between proteomics and RNAseq levels, and produces the results as scatter plots.
Note: .plot_scatter_RNApro_LUAD() takes the data frames, CPTAC_LUAD_RNAseq and CPTAC_LUAD_Proteomics, and selects RNAseq and proteomics data for EIF4F genes. It performs the correlation analysis between the proteomics and RNAseq data for EIF4F genes and produces scatter plots (e.g., Figure 11A).
Step-8: Analyze the protein co-expression and (phospho)protein expression across tumor stages Timing: < 5 min  10. Run the following command line in RStudio.
Note: EIF4F_Proteomics_analysis() is a wrapper function to call two internal composite functions that take input data frames and call internal functions for analysis. Figure 4D summarizes the internal code structure for this step. The detailed definitions of all internal functions are in /eIF4F.analysis/R/proteomics.R.
Note: .plot_scatterplot_protein_LUAD() takes the data frame CPTAC_LUAD_Proteomics and selects the proteomics data of tumor samples. It analyzes the correlation between two input proteins across the LUAD tumor samples (e.g., Figure 11B).
Note: .plot_boxgraph_protein_CPTAC() takes the data frames CPTAC_LUAD_Proteomics, CPTAC_LUAD_Phos and CPTAC_LUAD_Clinic_Sampletype, and selects the proteomics, phosphoproteomics and clinical stage data. It generates boxplots to compare the protein (e.g., Figure 11C) or phophosphorylation (e.g., Figure 11D) levels in tumors of different clinical stages.

EXPECTED OUTCOMES
All analysis results are produced as plots on screen and as pdf files in output file directories. Examples of results from each analysis step are shown in Figures 5, 6, 7, 8, 9, 10, and 11. The complete analysis results will be saved as 338 files in 17 directories ( Figure 2F).
The execution of this package relies on specific exported functions and does not allow parameter input within the package. However, for each analysis step, the internal composite functions, the required input data frames, and the dependent functions are stored within one R file ( Figure 2E). Most internal composite functions allow changes of gene inputs. To perform any analyses with its own gene input, users can open the R file for each analysis step in /eIF4F.analysis/R and find the relevant internal functions. Figures 3 and 4 explain the internal code structure, which employs internal functions as individual building blocks. Users can combine the internal functions provided by this R package to assemble their own pipeline in a new R project. The desired input data frames for each internal composite function can be accessed from ProcessedData folder in Figure 2G.

LIMITATIONS
eIF4F.analysis package relies on many dependent packages for data analysis and plotting. Because those dependent packages are widely used for individual applications, their combined usage provides great convenience for users to achieve a thorough understand of eIF4F functions. However, a major limitation of this approach is that incompatibility may occur with future version changes, since those packages are maintained independently of each other.
This package aims to provide an easy method to reliably reproduce the results from Wu and Wagner (2021), 1 and to allow users to access and understand the associated code (which has been refactored for clarity since the original publication). Due to the large number dependencies and analyses this package performs, the exported functions are not designed to contain input parameters for users. Instead, input parameters such as gene names or sample types have been hard coded in the internal functions.

Problem 1
Error messages during Preparation three: installing and loading eIF4F.analysis.

Potential solution
It means that you have objects (functions, usually) present in your global environment with the same name as (exported) things in your package. Clean the environment and reload the package.

Problem 2
While running EIF4F_Corrgene_analysis() at step-6: analyze the correlating genes (CORs) of EIF4F subunits, you run into the following error: Note: The object org.Hs.egPFAM has not been not used for any operation in the script, thus this error message does not affect the results of the analyses.

Potential solution
Run the following command to reload the ''org.Hs.eg.db'' package. If the same error message of defunct org.Hs.egPFAM appears, please reinstall the "org.Hs.eg.db" package to the latest version (3.15.0).

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Gerhard Wagner (gerhard_wagner@hms.harvard.edu).

Materials availability
This study did not generate new unique reagents.
Data and code availability This paper analyzes existing, publicly available data. The source datasets are listed in the key resources Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.