Analytic pipelines to assess the relationship between immune response and germline genetics in human tumors

Summary Germline genetic variants modulate human immune response. We present analytical pipelines for assessing the contribution of hosts’ genetic background to the immune landscape of solid tumors using harmonized data from more than 9,000 patients in The Cancer Genome Atlas (TCGA). These include protocols for heritability, genome-wide association studies (GWAS), colocalization, and rare variant analyses. These workflows are developed around the structure of TCGA but can be adapted to explore other repositories or in the context of cancer immunotherapy. For complete details on the use and execution of this protocol, please refer to Sayaman et al. (2021).


SUMMARY
Germline genetic variants modulate human immune response. We present analytical pipelines for assessing the contribution of hosts' genetic background to the immune landscape of solid tumors using harmonized data from more than 9,000 patients in The Cancer Genome Atlas (TCGA). These include protocols for heritability, genome-wide association studies (GWAS), colocalization, and rare variant analyses. These workflows are developed around the structure of TCGA but can be adapted to explore other repositories or in the context of cancer immunotherapy. For complete details on the use and execution of this protocol, please refer to Sayaman et al. (2021).
Note: Indicated software and package versions were used as described in (Sayaman et al., 2021). If using other or future versions of the software and packages, please check for compatibility and version-specific default parameters.
Additional notes regarding the selection and usage of the indicated software are provided below.
PLINK: Used for GWAS Analysis. PLINK is a commonly used software for large-scale genetic analyses such as GWAS. It consists in a fast implementation of linear and logistic regression and should produce same results as any implementation of those regression methods.
bcftools: BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
GCTA: Used for heritability analysis. GCTA is one of first and well-established software packages for estimation of the proportion of phenotypic variance explained by all genome-wide SNPs for a complex trait (Yang et al., 2011). However, this is an area of active of research and newer methods are being developed.
R/Bioconductor: Used for TCGA eQTL, rare variant analyses, annotation, immune trait correlation and other analyses as indicated in the manuscript (Sayaman et al., 2021). R/Bioconductor packages offer standard toolsets that offer genomic annotation and genomic mapping functionality. New packages are released and updated quickly by an active community. Alternate software packages that provide similar functionality could be used but may vary based on curation and version of annotation databases.
LocusZoom: Used to visualize SNP location in the context of GWAS results. LocusZoom is commonly used and user friendly. It has online and standalone versions. Standalone version (used here) allows us to integrate various sources of Linkage Disequilibrium (LD).
eCAVIAR: Used for colocalization analysis. eCAVIAR is a commonly used method for colocalization analyses, and, as compared with other methods, has the advantage of modeling the LD (Hormozdiari et al., 2016). ''all.mnemonics.bedFiles.tgz''.

Download GitHub repository
Timing: 5 min The scripts and code descriptions used in the entirety of this protocol are available at: https://github.com/rwsayaman/TCGA_PanCancer_Immune_Genetics.
Optional: This protocol is designed to work with pre-processed and quality-controlled genotyping data. If users start from raw genotyping data from SNP arrays, please see the companion protocol for quality-control analysis of germline data, stranding and genotype imputation from (Chambwe et al., 2022) and the associated scripts and code descriptions at: https:// github.com/rwsayaman/TCGA_PanCancer_Genotyping_Imputation.
Note: The example code provided were designed to run on specific high-performance compute environment. Users should adjust the code to match the capabilities of their compute environment and queuing system.
For reference, analyses carried out using the software listed above are described in the following sections in the manuscript were performed on multiple severs including (1) the University of California, San Francisco (UCSF) Wynton high-performance (HPC) cluster, which currently contains 449 nodes with over 12572 Intel CPU cores and 42 nodes containing a total of 148 NVIDIA GPUs, (2) the original UCSF TIPCC HPC cluster (now C4), which had 8 communal compute nodes and 1 dedicated node, each with 12-64 cores (each node had from 64 to 512 GiB of RAM and at least 1.8 TiB of fast local disk space), and (3) two additional severs with 32 and 48 CPUs (Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00 GHz), respectively. In the optimization phase, analyses have been performed by different operators and at multiple times to ensure accuracies and reproducibility. As these are shared servers, time might considerably vary in function of the number of nodes available. From a computational point of view, the most time-consuming step is the GWAS (3-4 h for each trait, as average). However, file preparation, as well as annotation and curation of the output files is particularly time consuming. An estimated time, considering these manual steps and factoring in some troubleshooting time (see also troubleshooting section) is indicated for each component in the following sections. In terms of file preparation, the GTEx colocalization might be the most time-consuming component.

Pre-processing of imputed data
Timing: 1-2 days This section describes the pre-processing steps necessary to use the HRC Imputed Genotyping Data we generated as a general resource for this specific analysis (Sayaman et al., 2021). Note: This only affects the SNP ID name and not the encoding of the A1 and A2 alleles in PLINK which is maintained.
Note: Minor allele frequency (MAF) filtering can be performed at this step to reduce the HRC imputed genotyping data input file size. Alternatively, GWAS analysis can be performed using the R 2 filtered-HRC imputed genotyping data from ''before you begin: Pre-processing of Imputed Data'' step 20 -e.g., as in (Sayaman et al., 2021), while MAF filtering can be performed on the summary statistics using the recalculated MAF from included subjects (see ''step-by-step method details: genome-wide association study (GWAS)'' steps 16, 23b).

MATERIALS AND EQUIPMENT
The overall disk space needed for the project is 2.67 TB. Breakdown for different data types and analyses are provided below (Table 1).

Immune traits
Timing: 1 day 139 immune traits used in the analyses were curated from (Thorsson et al., 2018) by selecting nonredundant traits. Immune trait categories were defined based on the methodological source of the data. Immune trait functional modules were defined based on Pearson correlation analysis.
1. Prepare curated set immune traits genes. a. Curated set of 139 immune traits in TCGA can be downloaded from (Sayaman et al., 2021) '' Table S2  Direct link: Immune Traits.

Heritability analysis
Timing: 2 weeks Heritability analysis on 139 traits is conducted using a mixed-model approach implemented in the genome-wide complex trait analysis (GCTA) software package with the genomic-relatedness-based restricted maximum-likelihood (GREML) method (Yang et al., 2010) (Yang et al., 2011). This calculates the proportion of immune trait variation that is attributable to common genetic variants (% Heritability). Refer to the GCTA website for details (https://cnsgenomics.com/software/gcta).
6. Identify genetic ancestry assignment of each individual and create a filtered sample list for each genetic ancestry cluster: a. Ancestry assignments for each TCGA individual are provided in '' Table S1. TCGA Sample List Used in the Analysis'' from (Sayaman et al., 2021). b. Formatted input file of TCGA patient barcodes assigned to each genetic ancestry cluster is provided in the (Sayaman et al., 2021)  Note: Analysis is automatically performed on samples with complete data. A subset of the defined samples within each genetic ancestry cluster are automatically skipped due to missing data (e.g., immune trait or covariate values).
8. Estimate the genetic relatedness matrix (GRM) from all the autosomal SNPs with MAF > 0.01 within each ancestry group using GCTA: CRITICAL: Run heritability analysis unconstrained. This will produce heritability estimates (Vg/Vp) and standard deviations outside the 0-1 range.
ll OPEN ACCESS 11. From GCTA GREML ''.hsq'' result file, extract the ratio of genetic variance to phenotypic variance (Vg/Vp), estimate and SE; the LRT p-value and sample size (n) for each immune trait. 12. Concatenate heritability analysis results across all immune traits tested.
a. Annotate each result file with the corresponding immune trait, immune category and immune module (see '' Table S2'', (Sayaman et al., 2021)). b. Append annotated result files from each immune trait. 13. Correct for multiple-hypothesis testing ancestry group by calculating the FDR p-value using the Benjamini-Hochberg adjustment method. 14. Optional: Visualize % heritability (Vg/Vp * 100) across all immune traits per ancestry group for exploratory data analysis.
Direct link: Heritability Analysis. The example code in this section were optimized for the high-performance compute environment at UCSF HPC employing Portable Batch System (PBS) job scheduling; consult your system administrator to adapt the provided code to your system.

Genome-Wide Association Study (GWAS)
Timing: 4 Weeks GWAS were performed on traits that we found to have significant heritability since these would be most likely driven by common genetic variants.

Recalculate allele frequencies in PLINK for the subset of individuals used in the analysis in PLINK:
See script: ''qsub_plink_freq_GWAS.IBD.ALL.txt''. 20. Run logistic regression on dichotomized discrete immune traits in PLINK by modifying the code in step 19 using the logistic command (--logistic).
Note: ''.bed'', ''.bim'' and ''.fam'' input files are provided separately in the sample code because SNP ID names from the original HRC imputed dataset were renamed with alleles listed in alphabetical order to assist matching with other datasets. This only affects the SNP ID name and not the encoding of the A1 and A2 alleles in PLINK which is maintained.
21. Filter resulting summary statistics from PLINK based on test p-values (P in PLINK). Genome-wide significance was defined at p < 5 3 10 À8 and suggestive significance at p < 1 3 10 À6 in our study.

22.
Optional: Visualize results for exploratory data analysis: a. Manhattan plot, plotting GWAS -log10 p-value against the base pair position per chromosome. b. Quantile-quantile plot (Q-Q plot), plotting the quantile distribution of observed p-values for each SNP against expected values from a theoretical c2-distribution; calculate the genomic inflation factor (lamba), the median of the c2 test statistics divided by the expected median of the c2 distribution.
23. Optional: Each SNP can be further annotated (see '' Table S4'', (Sayaman et al., 2021)) with: a. The Minimac3 HRC imputation information for each SNP (extracted from the filtered chr*.info.rsq0.5.gz), including whether SNP was genotyped or imputed (Genotyped), the estimated value of the squared correlation between imputed genotypes and true, unobserved genotypes (Rsq), the average probability of observing the most likely allele for each haplotype  Table S4'', (Sayaman et al., 2021). b. Append annotated summary stat files from each immune trait GWAS. 25. Identify genome-wide significant loci as SNPs within 50 KB region with at least one genomewide significant SNP (p < 5 3 10 À8 ). This excludes the HLA locus on chr 6 which spans $3.5 MB.
Note: Scripts and code description used in this section are available at: https://github.com/ rwsayaman/TCGA_PanCancer_Immune_Genetics.
Direct link: Genome-Wide Association Study (GWAS). The example code in this section were optimized for the high-performance compute environment at UCSF HPC employing Portable Batch System (PBS) job scheduling; consult your system administrator to adapt the provided code to your system.
Note: Scripts and code description used in this section are available at: https://github.com/ rwsayaman/TCGA_PanCancer_Immune_Genetics.  Table S1'', (Sayaman et al., 2021)). This can be achieved using the script ''eQTLanalysis.R''. This script also runs eQTL analysis for the 30 cancers separately. v. Summarize outcome: results will have the following: chromosome, position, gene name, distance from gene, pan-cancer sample size, pan-cancer effect size, pan-cancer p-value, and then the same information repeated for each cancer type. b. sQTL.
Use ''script PrepareData.sh''. This script keeps the genes that mapped to the most significant SNPs (i.e., Suggestive and GW). Example: grep -f ListGenes.txt splice3prime > Data_3prime. It creates two more files that contain TCGA subject IDs, and splicing event IDs and types. It finally runs an R script ''Analyze.R'' that performs association analysis between SNP genotypes and splicing events using linear regression model accounting for genetic ancestry PC1-7, sex, age, and cancer type (for pan-cancer analysis only). iii. Summarize outcome, an output file containing association results as follows: chromosome, position, ensemble ID, gene name, splicing event ID. 32. GTEx Analysis.
a. Convert all GWAS suggestive and genome-wide significant SNP chromosome and base pair positions from build GRCh37 to build GRCh38 to match GTEx annotation. i. Import the GRanges object of SNP chromosome and base pair positions from 13. ii. Load the chain file ''AH14150'' for Homo sapiens rRNA hg19 to hg38 from the ''Annota-tionHub'' package in R. iii. Convert SNP chromosome and base pair positions from build GRCh37 to build GRCh38 using the liftOver function in the ''rtracklayer'' package in R. i. From each tissue type, extract only GTEx sQTL SNP results that match the GWAS SNP GRCh38 chromosome and base pair position. The output is an R object of GTEx eQTL for suggestively significant variants. ii. Run the Linux bash script ''run_split_sqtl.sh''. Each GTEx sQTL file is very large. This step separates the large GTEx sQTL file into a number of small files. The input of this script is a list of file names for GTEx sQTL. Each line of this input file is a file name for GTEx sQTL. The script generates a number of small files for each original GTEx sQTL file. iii. Run the R script ''r_extract.txt''. This script takes 2 input files. One is an R object for GWAS suggestively significant variants. The other input file is the GTEx sQTL file generated from the previous step. The output is an R object of GTEx sQTL for suggestively significant variants. iv. Concatenate filtered GTEx sQTL files from each tissue corresponding to the GWAS suggestively significant variants. Iteratively load, annotate with tissue source, extract GRCh38 chromosome and base pair position from variant ID, and append each file into a single data frame in R. v. For sQTL, limit analysis to +/-500 KB region. Filter the resulting concatenated GTEx sQTL file using the absolute value of the ''tss_distance'', set a threshold % 500,000 bp. vi. Calculate the false discovery rate (FDR) per variant across all genes and all tissues. vii. Using the GTEx Ensembl IDs, map to gene symbol and Entrez IDs using the ''EnsDb.Hsapiens.v86'' package in R. viii. Merge with Immune-Germline SNP annotation. ix. Extract GTEx sQTL results for variant-gene pairs with an FDR < 0.05 in at least one tissue.
Note: Scripts and code description used in this section are available at: https://github.com/ rwsayaman/TCGA_PanCancer_Immune_Genetics.
Direct link: Expression and splicing quantitative trait locus analysis (eQTLs and sQTL). The example code in this section were optimized for the high-performance compute environment at UCSF HPC employing Portable Batch System (PBS) job scheduling; consult your system administrator to adapt the provided code to your system.

Timing: 1 Week
This section describes colocalization analysis performed with eCAVIAR. This analysis was performed using TCGA and GTEx gene expression data. This analysis requires four input files: (1) GWAS summary statistics, (2) eQTL summary statistics, (3) LD matrix computed with GWAS data, and (4) LD matrix computed with genotype data used for eQTL analysis.
33. TCGA Analysis. a. Create a file to determine SNP-gene-trait to be tested for colocalization. Run R script ''Deter-mineRegions.R''. This script reads eQTL results and keeps SNP-gene eQTL FDR p < 0.1. Output of this script will be a file that contains 5 columns: chromosome, position, gene name, trait, and SNP significance (GW or suggestive). b. For each SNP-gene pair, run eQTL analysis between the SNP and the gene, and also between the 200 extra SNPs centered at the selected SNP and the gene. Use R script ''RunEQTL.sh''. This script creates a folder for each SNP-gene pair. It extracts the list of 201 SNPs, calculates LD matrix (plink -bfile EXTRACTED -r square -out EXTRACTED), and performs GWAS and eQTL association analysis using the ''eQTL.R'' R script. ''eQTL.R'' script outputs two files: one for GWAS and one for eQTL analysis containing the z-score, beta, and p-value. c. Run eCAVIAR using ''RunECAVIAR.sh'' script. It calls eCAVIAR as follows: The ''GWAS.z'' and ''eQTL.z'' are the z-score produced in the previous step. ''EXTRACTED.ld'' is a 1-line file that contains the LD between the selected SNP and the 200 SNPs surrounding it (100 SNP to the left and 100 SNPs to the right). The -c flag indicates the number of causal variants assumed in the model (Model was run twice assuming 1 and 2 causal variants). The output that contains the colocalization posterior probability (CLPP) for each SNP (total of 201 SNPs) is ''coloc_c1.out_col''.  vi. The list of SNPs and effect alleles in TCGA GWAS result. d. Output files iii-vi will be used to generate LD matrix for eCAVIAR. Output files i and ii will be used directly for eCAVIAR. e. Run PLINK commands to generate numeric genotype data for GTEx and TCGA for SNPs that were extracted as output files iii and iv from the previous step. i. Run the python script ''make_plink_command_gtex.py'' and ''make_plink_command_tcga.py'' to generate PLINK commands for GTEx and TCGA separately. The output is a Linux bash file that contains a number of PLINK commands. ii. Run the bash file from steps 34e-i. Example for GTEx: Example for TCGA: f. Run the R scripts ''r_cor_gtex.txt'' and ''r_cor_tcga.txt'' to calculate Pearson correlation coefficients for each pair of SNPs in GTEx and TCGA genotype data. There are 2 input files for each R script. The 1st one is the genotype file that was generated from the previous step. The 2nd one is the file with SNP ID and alternate allele. This file was generated as the output file 5 or 6 in the step for running ''r_match_tcga_gtex.txt''. g. Run eCAVIAR assuming 2 causal SNPs.
a. Repeat steps 44a-b, without including cancer type as covariate. 46. Generate outcome summary: exome files related to samples for which all the covariates and at least one immune trait was available should result in a master file of N = 9,138 samples. There will be 832 pathogenic/likely pathogenic SNPs/Indels events with at least one copy of rare allele in the whole exome sequencing data, corresponding to 586 distinct pathogenic SNPs/Indels mapping to 99 genes. The regression analysis provides p-values and beta coefficients of the association with immune traits.
Note: Scripts and code description used in this section are available at: https://github.com/ rwsayaman/TCGA_PanCancer_Immune_Genetics.
Direct link: Rare Variant Analysis.

EXPECTED OUTCOMES
The analysis protocols described above each yield data output files, as described above. Expected output files are summarized here. Data visualization tools enable researchers to explore these results interactively.

Heritability analysis
Output files from GCTA GREML is an .hsq file. For complete description of output variables, see: https://cnsgenomics.com/software/gcta/#GREMLanalysis. The combined results table include the ratio of genetic variance to phenotypic variance (Vg/Vp) estimate and SE; the likelihood-ratio test (LRT) p-value and sample size (n) for each immune trait; and the FDR p-value across all immune traits.
After conducting heritability analysis across 139 immune traits, we identified 10 immune traits with significant heritability (FDR p < 0.05), and 23 other traits with nominally significant heritability (p < 0.05) in at least one ancestry group. Within the European ancestry group, 28 traits had at least nominally significant heritability (see '' Table S3'', (Sayaman et al., 2021)).

Epigenomic analysis
The output table includes all genome-wide and suggestively significant SNPs annotated with the mapped epigenetic state from the Roadmap Epigenomics Project Expanded 18-state model. Each epigenome ID/epigenome represent a column with entries designating the epigenetic state (numeric and descriptive value) at the SNP chromosome and base pair position (see '' Table S4'', (Sayaman et al., 2021)).

Expression and splicing quantitative trait locus analysis (eQTLs and sQTL)
Output files for eQTL will have the following: chromosome, position, gene name, distance from gene, pan-cancer sample size, pan-cancer effect size, pan-cancer p-value, and then the same information repeated for each cancer type.
Output file for sQTL will contain association results as follows: chromosome, position, ensemble ID, gene name, splicing event ID.
Colocalization with eCaviar and manual curation of the expanded region Colocalization output table includes all genome-wide and suggestively significant SNPs and associated germline-immune GWAS summary stats from our study, the GTEx or TCGA QTL summary stats, and the eCaviar statistics which include: CLPP value for 1 causal SNP; regional CLPP value for 1 causal SNP; CLPP value for 2 causal SNP; regional CLPP value for 2 causal SNP; and the maximum CLPP value from 1 or 2 causal SNPs analysis or regional CLPP analysis. Finally, statistics from counter-evidence SNP colocalization analysis are included that gives the counter SNP's ID, rsID, GTEx or TCGA QTL p-value, germline-immune GWAS p-value, and distance to the QTL TSS; the -log 10 QTL p-value of the index SNP and counter SNP, and difference between these values (delta Counter SNP-Index SNP); and the curated expanded range colocalization evidence assessment (see '' Table S5'', (Sayaman et al., 2021)).

Rare variant analysis
Exome files related to samples for which all the covariates and at least one immune trait was available should result in a master file of N = 9,138 samples. There will be 832 pathogenic/likely pathogenic SNPs/Indels events with at least one copy of rare allele in the whole exome sequencing data, corresponding to 586 distinct pathogenic SNPs/Indels mapping to 99 genes. The regression analysis provides p-values and beta coefficients of the association with immune traits (see '' Table S6'', (Sayaman et al., 2021)).

Interactive visualization of results
Results of the analysis can be explored interactively with web-based interactive visualizations. Results of heritability, GWAS, colocalization, and rare variant analyses can be visualized interactively in CRI iAtlas (https://www.cri-iatlas.org/), in the "Germline Analysis" module (Figures 1, 2, and 3, Methods video S1). CRI iAtlas is a web portal for data exploration of immuno-oncology research (Eddy et al., 2020). A summary of figures from the (Sayaman et al., 2021) manuscript that can be reproduced in CRI iAtlas is available in (Table 2). A PheWeb tool (https://pheweb-tcga.qcri.org/) was setup to visualize GWAS summary statistics of all tested immune traits ( Figure 2B, Methods videos S2 and S3). Additional scripts for PheWeb GWAS initialization (Methods video S3) is provided at https:// github.com/rwsayaman/TCGA_PanCancer_Immune_Genetics. Direct link: Interactive Visualization of Results.
'' Rare variant analysis boxplots in iAtlas are generated from summary statistics computed from the germline data.
The PheWeb tool (https://pheweb-tcga.qcri.org/) visualizes GWAS summary statistics of all tested immune traits (Sayaman et al., 2021). The online tool displays an interactive Manhattan plot for In PheWeb, gene tracks represent the physical start and end of a gene and does not imply any ''causal'' relationship with the SNP in question. Ancestry used to compute LD can be specified by the user (European, African, South Asian, etc.). SNP significance can be also shown for a specific SNP across all tested traits. External resources for SNPs can be accessed from PheWeb (e.g., GWAS Catalog, dbSNP, etc).

LIMITATIONS
This protocol and related scripts are tailored for the analyses of matched genomic and immune trait datasets in TCGA but can be applied to other datasets with similar structures. The majority of analyzed immune traits are derived from gene expression data and can be adapted to other studies. Code used for the generation of immune traits from gene expression data are provided, and have been previously applied to RNA-sequencing (Thorsson et al., 2018) and microarray (Wolf et al., 2014) (Amara et al., 2017) data. Supplemental code for quality-control analysis, stranding and imputation of raw genotyping data from SNP arrays (Chambwe et al., 2022) are also made available.
Below we reported specific limitations related to each step of the protocol.
Immune Traits: Immune signatures lack cancer-specific cell type resolution. The majority of immune traits were calculated based on specific gene sets from expression data (RNA-sequencing) from bulk tissue to generate estimates of immune cell activation or abundance using different enrichment or deconvolution techniques. Caution should be exercised when interpreting results in the context of specific tumors or cell types. However, many of these signatures were validated in specific tissues and cancers via FACS sorting or immunohistochemistry/immunofluorescence imaging of immune populations. Heritability Analysis: For heritability estimates run via GCTA GREML, the GCTA FAQ (https:// cnsgenomics.com/software/gcta/#FAQ) states that at least 3,160 samples from unrelated individuals are needed to get estimates with standard errors (SEs) down to 0.1 for common SNPs. Only the European ancestry group meets this criterion. Nonetheless, heritability estimates were run in the smaller sized ancestry groups with expectation of large SEs to provide preliminary analyses of immune traits in ancestry groups that are not well studied or sampled.
Heritability analysis takes into account only common variants. In this protocol, we used MAF > 1% as cut-off. Contribution of rare variants are not accounted for and may explain ''missing" heritability.
GWAS: Linear regression assumes the residuals are normally distributed. Immune traits with skewed distributions were first log 10 transformed, those assessed to have close to normal distribution were used as continuous variables. However, some immune traits remained with very skewed distributions due to a

OPEN ACCESS
high fraction of 0 values, these traits were converted to binary 0 and 1 values based on the median value and logistic regression was performed instead (see '' Table S2'' (Sayaman et al., 2021)).
GWAS was run pan-cancer on the non-hematologic cancers in TCGA which vary in cohort size from 36 (CHOL) to 999 (BRCA). Results may be representative of the most common cancers in TCGA. Post hoc analysis evaluation of associations per cancer (forest plots) can provide insight in the directionality of the betas per cancer and identify potential outliers.
For GWAS, we used an imputation R 2 R 0.5 and MAF R 0.5% as cut-offs for inclusion. The HRC panel (version 1) consists of 64,976 haplotypes at >39M SNPs constructed from 20 whole genome sequencing studies (McCarthy et al., 2016). HRC has been shown to provide accurate genotype imputation at MAF as low as 0.1%; however, HRC was constructed from studies of predominantly European ancestries and thus, may be limited for identifying rare variants, particularly in non-European ancestry populations. Moreover, the first version of the HRC panel did not to include small insertions and deletions (indels). We have made the TCGA QC unimputed dataset available as a resource which can be imputed to other reference panels such as the 1000 Genomes Project (1000Genomes Project Consortium et al., 2015, which better capture indels, or the newest and largest reference dataset, NHLBI TOPMed (Taliun et al., 2021), which includes >50% non-European sequenced participants and may improve detection of rare and ancestry-specific genetic variants.
Epigenomic Analysis: Epigenomes are tissue-or cell line-specific. Due to cellular heterogeneity of tissue composition, epigenomes represent chromatin states of the bulk population. Interpretation of results should take into account the epigenomes of interest. A specific focus was given to immune-associated epigenomes in the analysis.
eQTLs and sQTL: GTEx data are derived from normal, tissue-specific samples; QTLs may capture features of normal tissue and those associated with certain tissue types. TCGA data are cancer-specific; QTLs may capture features associated only with specific cancers. Both GTEx and TCGA include expression from bulk tissues and so may miss eQTLs from cells that represent a small fraction of the tissue.
Our analyses used both TCGA germline and matched gene expression and alternative splicing data; and GTEx germline and matched gene expression and alternative splicing data. The rationale for using TCGA is that some loci may exert their effects via changes in expression in cancer tissue or in cells in the tumor microenvironment. The rationale for using GTEx is that some loci may exert their effects via changes in expression in normal tissue such as normal immune cells. However, care should be taken in comparing the tumor to normal eQTL results since the sample sizes are different and since the TCGA includes some surrounding normal tissue. Single cell RNA-seq is likely more robust for these types of analyses. Single cell RNA-seq is currently not available from TCGA or GTEx but in the future, as it becomes available, these eQTL analyses could be repeated to identify more variants colocalizing with certain genes and/or to determine cell-specific eQTLs. Thus, future studies with single cell analyses are likely needed to definitively characterize which cells these tissues are acting in. As eQTL and sQTL and colocalization in TCGA have been performed pan-cancer (GWAS was performed pan-cancer because of the sample size requirement for this analysis), per-cancer analysis might reveal additional cancer-specific hits, which might be compared with the ones in the related tissue. Because of the relatively limited number of samples available for each cancer type, per-cancer GWAS and colocalization should be preferentially performed by combining additional cancer-specific sources containing both phenotypic and genotypic data beyond TCGA.
Colocalization: Our protocol performed colocalization within a 200 SNP window (+/-100 from the index SNP). In some cases, we observed that there were SNPs outside of that window that had better association with gene expression but weaker or no association with the immune trait. Therefore, we ll OPEN ACCESS STAR Protocols 3, 101809, December 16, 2022 also performed a manual inspection of the entire locus (1 MB for eQTLs, 500 KB for sQTLs). We only performed this expanded region analysis of colocalization if there was plausible evidence of colocalization (eCAVIAR CLPP > 0.01) for the 200 SNP window. This expanded region analysis was intended to provide a more stringent criterion.
To perform these, we plotted the negative log 10 p-value of association with the immune trait for each SNP in the region on the X-axis and the negative log 10 p-value for association with the relevant gene expression (for eQTL analysis) or splicing event (for sQTL analysis). If we found additional SNPs outside of the 200 BP window and they demonstrated a stronger effect for association with gene expression or the sQTL and weaker association with the index SNP, we developed additional criteria for colocalization. If we identified one or more SNPs outside of the window that had a -log 10 p-value with expression or splicing that was > 1.5 than the index SNP, we considered that as negative evidence for colocalization. If the SNP(s) with stronger evidence for eQTL or sQTL association had a -log 10 p-value that was %1.5 compared with the index SNP, we considered the evidence for colocalization as ''intermediate.'' Finally, if we found no other SNP in the entire region with strong eQTL or sQTL that had a better p-value than the index SNP and the eCAVIAR results gave a posterior probability of colocalization of > 0.01, we considered the evidence to be ''strong.'' While we used this manual second step in addition to eCAVIAR, an alternative approach would have been to use COLOC (Giambartolomei et al., 2014). This package has the disadvantage that linkage disequilibrium is not explicitly modeled (unlike eCAVIAR.).
Within the TCGA dataset, colocalization was performed on a pan-cancer level. This analysis can be conducted on each cancer type separately, assuming that GWAS and eQTL analyses are also performed for each cancer type separately. Low sample size can be a limiting factor for the per-cancer analysis.
Colocalization is based on gene-expression data from TCGA and GTEx but can also be performed in different datasets that might be available.

Rare Variant Analysis:
In this analysis, we focused on variants occurring in cancer-predisposition genes, as previously annotated by (Huang et al., 2018), aggregated in categories summarizing different biological processes or function as described in (Sayaman et al., 2021) (except for BRCA1 and BRCA2, for which sufficient number of events existed to treat these genes individually). Different aggregation in functional categories might be defined by the users. Heterogeneity of germline calls and batch effect prevented us to run a comprehensive exome-wide analysis (Rasnic et al., 2019). For variables with heavily skewed distribution, we dichotomized them based on median values. Alternatively, ordinal regression might be used.

TROUBLESHOOTING Problem 1
Cannot load software or run scripts on the high-performance compute server. Implementation of provided GitHub code produces error (see ''before you begin: software installation'', steps 4-9, and scripts referenced in the ''step-by-step method details'').

Potential solution
Consult your institution's IT or compute cluster administrator for proper installation of necessary software including all needed libraries based on the high-performance compute environment. Ensure that the proper software versions, including all libraries and dependencies, are installed.

OPEN ACCESS
Software implementation may be version specific, the versions used in the protocol are provided to ensure reproducibility.
Provided code should be considered as a guide. Adjust parameters based on cluster capabilities and specifications. Job submission scripts are dependent on the resource allocation management system. E.g., the provided GitHub codes for heritability analysis and GWAS were optimized for the high-performance compute environment at University of California, San Francisco employing Portable Batch System (PBS) job scheduling; consult your system administrator to adapt the provided code to your system.

Problem 2
Cannot access the controlled access TCGA data from the GDC Portal, or the QC and HRC Imputed genotyping data from GDC publication page associated with (Carrot-Zhang et al., 2020): https:// gdc.cancer.gov/about-data/publications/CCG-AIM-2020 (see ''before you begin: download datasets'', step 12).

Potential solution
All TCGA germline data are controlled access. Ensure that you have proper authentication credentials and dbGaP authorization (see ''before you begin: institutional permissions'', steps 1-3). Make sure to follow all steps required by the GDC to obtain controlled access data including having an institution account, authentication through eRA Commons and dbGaP authorization.

Problem 3
In performing heritability analysis, filtering of individuals based on the genetic relatedness matrix is confounded by sample stratification based on genetic ancestry (see "step-by-step method details: heritability analysis", steps 8 and 9).

Potential solution
Heritability analysis should be done within each genetic ancestry cluster as genetic ancestry creates population stratification.

Potential solution
Heritability estimates do not converge when sample sizes are insufficient to build predictors; exclude groups with small samples sizes from analysis.

Problem 5
The genomic coordinates were different in the original TCGA and GTEx data. TCGA data were in Genome Reference Consortium Human Build 37 (GRCh37) but the GTEx data were in Build 38 (GRCh38) (see "step-by-step method details: expression and splicing quantitative trait locus analysis (eQTLs and sQTL)", step 32, and "colocalization with eCaviar and manual curation of the expanded region", step 34).

Potential solution
We converted the genomic coordinates in TCGA from GRCh37 to GRCh38 using liftOver (https:// genome-store.ucsc.edu/) so these two data sets can be compared.

Problem 6
In addition, genomic coordinates of some pathogenic variants in '' Table S2'' of (Huang et al., 2018) did not exactly match the WES vcf file positions (see "step-by-step method details: rare variant analysis", steps 36-37).

Potential solution
To match the pathogenic variants with the WES vcf file, the start position of each variant (''Column D'' in '' Table S2'' of Huang et al. (2018)) will be used with some changes as follows: subtract 1 for deletions, and use the start position as is for substitutions and insertions.

Problem 7
During visualization, LD colors between SNPs are shown as gray (see "expected outcomes: interactive visualization of results").

Potential solution
This means that LD information is not available, because of two reasons: (1) the SNP in question was not present in the 1000 Genomes Project data (1000Genomes Project Consortium et al., 2015 which was used to compute LD, and (2) the reference and alternative alleles are swapped. For the first problem, no solution exists within the visualization tool. For the second problem, make sure reference and alternative alleles are concordant with the reference genome alleles.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Rosalyn Sayaman, rwsayaman@gmail.com.

Materials availability
This study did not generate new unique reagents.

Data and code availability
The TCGA Genome Wide SNP 6.0 birdseed genotyping data and clinical data can be found at the legacy archive of the GDC (https://portal.gdc.cancer.gov/legacy-archive). The quality-controlled and HRC imputed genotyping data were contributed towards ancestry analyses in (Carrot-Zhang et al., 2020) and are accessible at the associated GDC publication page (https://gdc.cancer.gov/ about-data/publications/CCG-AIM-2020, see the Supplemental Data Files section: ''TCGA QC HRC Imputed Genotyping Data used by the AIM AWG from (Sayaman et al., 2021)''). Please cite (Sayaman et al., 2021) and (Chambwe et al., 2022) when using the quality-controlled and HRC imputed genotyping data. Note, access to the TCGA original birdseed files and the pre-processed, quality-controlled and HRC imputed genotyping data generated for (Sayaman et al., 2021) requires GDC controlled access permission approval.
Details for software availability are in the ''key resources table''. The code generated during this study and the required supplemental tables from (Sayaman et al., 2021) have been deposited to GitHub (https://github.com/rwsayaman/TCGA_PanCancer_Immune_Genetics). Supplemental code for quality-control analysis, stranding and imputation of raw genotyping data from SNP arrays from (Chambwe et al., 2022) are also available at GitHub (https://github.com/rwsayaman/ TCGA_PanCancer_Genotyping_Imputation