Integrated bioinformatic pipeline using whole-exome and RNAseq data to identify germline variants correlated with cancer

Summary
Germline variants (GVs) are effective in predicting cancer risk and may be relevant in predicting patient outcomes. Here we provide a bioinformatic pipeline to identify GVs from the TCGA lower-grade glioma cohort in the Genomic Data Commons. We integrate paired whole-exome sequences from normal and tumor samples and RNA sequences from tumor samples to determine a patient's GV status. We then identify the subset of GVs that are predictive of patient outcomes by Cox regression. For complete details on the use and execution of this protocol, please refer to Chatrath et al. (2019) and Chatrath et al. (2020).


Download datasets
Timing: 20 min

This protocol requires whole-exome sequences from the normal samples (wxs-normal), whole-exome sequences from the tumor samples (wxs-tumor), and RNA sequences from the tumor samples (rnaseq-tumor) for The Cancer Genome Atlas (TCGA) patients, available in the Genomic Data Commons (GDC) data portal (Grossman et al., 2016). The aligned reads in Binary Alignment Map (BAM) format should be downloaded from the database of Genotypes and Phenotypes (dbGaP), the target genes' genomic regions in Browser Extensible Data (BED) format should be downloaded from the GENCODE v37 database (Frankish et al., 2019), and the reference human genome sequence (hg38) in FASTA format should be downloaded from the GDC. We demonstrate the use of the pipeline on 5 wxs-normal samples, 7 wxs-tumor samples, and 7 rnaseq-tumor samples. The BAMs used in this protocol were from the TCGA lower-grade glioma (LGG) cohort (Brat et al., 2015). The names of the test datasets are provided in the GitHub repository above.
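The steps below assume a working directory with a separate folder per data type. A minimal sketch of that layout (the directory names are taken from later steps in this protocol; adjust as needed):

```shell
# Sketch of a working-directory layout consistent with later steps
# (the VCFs_from_VarDict subdirectories are named in Part 1, step e).
mkdir -p BAM/wxs-normal BAM/wxs-tumor BAM/rnaseq-tumor
mkdir -p VCFs_from_VarDict/wxs-normal VCFs_from_VarDict/wxs-tumor VCFs_from_VarDict/rnaseq-tumor
```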
b. SAMtools should be loaded in the system PATH. We used the command below to load SAMtools on our online Linux cluster.
c. Index the FASTA sequence.
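The commands for steps b and c were not reproduced in this excerpt; on a modules-based cluster they typically look like the following (the module name and FASTA filename are placeholders that depend on the user's cluster and download):

```shell
# b. Load SAMtools as an environment module (module name varies by cluster).
module load samtools
# c. Index the reference FASTA downloaded from the GDC (filename is a placeholder).
samtools faidx hg38.fa
```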
6. Navigate to the 'BAM' directory and download the BAM files for the wxs-normal, wxs-tumor, and rnaseq-tumor samples from the provided manifest files, placing them in their respective directories. A GDC token is required to download BAMs from the GDC database.
a. Download BAM files using the following command.
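The download command itself was not reproduced in this excerpt; a typical invocation of the GDC Data Transfer Tool (gdc-client) looks like the sketch below, where the manifest filename and token path are placeholders:

```shell
# Hypothetical gdc-client invocation: -m manifest, -t token, -d output directory.
gdc-client download -m gdc_manifest.wxs-normal.txt \
    -t gdc-user-token.txt \
    -d BAM/wxs-normal
```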

Create input list
Timing: 1 min

7. Variant calling with VarDictJava requires an input.list that contains the paths to the folders of BAM files. Prepare an input list for the wxs-normal, wxs-tumor, and rnaseq-tumor data types separately.
a. Run the following command from a directory that contains only folders of BAM files and no other files except previous logs. For example, let us create an input.list for the wxs-normal BAMs.
b. Rename the input.list as you wish.
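A minimal sketch of steps 7a-7b (the two mkdir'd sample folders below are stand-ins for real downloaded BAM folders):

```shell
# Stand-in sample folders; real folders come from the gdc-client download.
mkdir -p BAM/wxs-normal/SAMPLE_A BAM/wxs-normal/SAMPLE_B
# Step 7a: list each sample folder's full path into input.list;
# step 7b: rename the list as you wish.
( cd BAM/wxs-normal && ls -d "$PWD"/*/ > input.list \
  && mv input.list wxs-normal_input.list )
```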

KEY RESOURCES
Operating system: GNU/Linux
Memory: 100 GB (memory requirement depends on the size of the dataset)
Processors: 1 required, 5 recommended

The scripts in this protocol follow a Simple Linux Utility for Resource Management (SLURM)-based schema and require submission of jobs to an online Linux cluster. The steps discussed below loop over each BAM file or the downstream output files that were generated. Therefore, it is easier to submit similar jobs as job arrays, which will create a number of independent jobs (corresponding to the defined number of tasks) and execute them simultaneously. It is common to load pre-installed software as environment modules available on the Linux cluster; users need to load the relevant modules on their cluster. This protocol expects that R, VarDictJava, SAMtools, and BCFtools are installed on your local or online Linux cluster and loaded in the system PATH.

STEP-BY-STEP METHOD DETAILS
A typical workflow to call germline variants from the wxs-normal, wxs-tumor, and rnaseq-tumor samples (the three 'data types' in this protocol) requires six parts. The first part is variant calling using VarDictJava. The second part is the preprocessing of the VCFs and extraction of the genotype status of each variant for each sample; we then take the union of all the unique variants from all three data types. The third part is to calculate the sequencing coverage at the union of unique variants. The fourth part is to merge the genotype status file and the sequencing coverage file of each sample to determine the status of each variant after correction for low sequencing coverage; the variant status at positions with fewer than ten reads for a given sample is changed to unknown. The fifth part is to combine the variant status files from all samples to create a large multi-sample VCF file. Finally, in the sixth part, at positions at which the variant status is listed as unknown in the wxs-normal samples (because of low sequencing coverage), we insert the variant calls made from the corresponding wxs-tumor samples. If the variant status is still unknown in a normal sample but is called in the corresponding rnaseq-tumor sample, the rnaseq-derived variant is inserted to create the final combined wxs-rnaseq variant call set.
Note: Run parts 1-5 separately for the wxs-normal, wxs-tumor, and rnaseq-tumor samples.
b. As the scripts in this protocol follow a SLURM-based schema, the structure of the job script is described below.
Note: Set the number of nodes, number of CPUs, requested memory, and time in the job script according to the number of samples to be processed and the resources available on the user's online Linux cluster. The numbers designating the array jobs correspond to the numbers of the jobs in the input list.
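The job-script structure was not reproduced in this excerpt; a sketch of a SLURM job-array header consistent with the note above follows. All resource values and the script name are examples, not the protocol's actual settings:

```shell
#!/bin/bash
# Example SLURM job-array header; adjust every value to your cluster and dataset.
#SBATCH --job-name=vardict_calling
#SBATCH --nodes=1
#SBATCH --cpus-per-task=5
#SBATCH --mem=100G
#SBATCH --time=24:00:00
#SBATCH --array=1-5            # one task per line of input.list

# Each array task picks its own line (i.e., its own BAM folder) from input.list.
BAM_DIR=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input.list)
echo "Processing ${BAM_DIR}"
```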
c. As different data types are processed, set the path in the script to the input list that corresponds to the data type (see Figure 1). For example, to set the path of the input.list for wxs-normal BAMs:
d. The default output for this step is stored in the following directory.
e. The user should create the directories 'wxs-normal', 'wxs-tumor', and 'rnaseq-tumor' in the 'VCFs_from_VarDict' directory to store the VCFs for the respective data types.
f. Set the output path in the script accordingly (see Figure 1). For example, the output path for the wxs-normal VCFs can be set as shown below.
g. Run the script using the following command.
h. If the user's online Linux cluster supports the Terascale Open-source Resource and QUEue Manager (TORQUE) system:
i. Change the structure of the job script in VariantCallingFrom_VarDict.sh as described below.
ii. Run the script using the following command.

Part 2: Preprocessing of VCF files
Timing: 30 min

2. Index each VCF file and extract the variants that received a PASS (shown in the 'FILTER' column) from the indexed VCF file.
a. The default output for this step is stored in the following directory.
b. The user should create a directory for each data type in the 'PASS_variants' directory to store the outputs for the respective data types.
c. Set the output path in the extract_pass_variants.sh script for each data type accordingly. For example, let's set the output path to store the indexed PASS variants for the wxs-normal samples:
d. Run the script from the directory where the VCFs from VarDictJava are located.
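The body of extract_pass_variants.sh is not reproduced in this excerpt. One common way to index a VCF and extract PASS variants uses bgzip, tabix, and bcftools (all assumed to be loaded on the cluster; filenames are placeholders, not the protocol's actual paths):

```shell
# Compress and index the VCF, then keep only records whose FILTER is PASS.
bgzip -c sample.vcf > sample.vcf.gz
tabix -p vcf sample.vcf.gz
bcftools view -f PASS sample.vcf.gz -O z -o PASS_variants/wxs-normal/sample.pass.vcf.gz
```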
3. Remove the header section from the VCFs. Run the script from the directory where the indexed PASS variants are located.

Part 4: Determine variant status
a. In the input_genotype_samdepth.txt file, set the columns 'genotype_file' (the location of the genotype file, with filename), 'samdepth_file' (the location of the union-variants sequencing coverage file, with filename), and 'out_filename' (the location of the output file, with filename) for each sample. For the output file's filename, add the _GT_samdepth_merged suffix at the end.
b. The default output for this step is in the following directory.
c. The user should create a separate directory for each data type in the 'variant_status' directory to save the output for each data type.
d. Set the path in the out_filename column of the input_genotype_samdepth.txt file to save the results for each data type. For example, set the output path for the wxs-normal samples:
e. Place the input_genotype_samdepth.txt file in the analysis directory.
f. Run the script using the following commands from the analysis folder.
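A hypothetical layout of input_genotype_samdepth.txt may help; the sketch below builds a tab-separated file with one row per sample. All paths and the sample barcode are placeholders (the real paths come from the user's own directory layout):

```shell
# Build a three-column, tab-separated input file for the merge step.
mkdir -p variant_status/wxs-normal
printf 'genotype_file\tsamdepth_file\tout_filename\n' > input_genotype_samdepth.txt
printf 'GT/wxs-normal/TCGA-XX-0001_GT.txt\tsamdepth/wxs-normal/TCGA-XX-0001_samdepth.txt\tvariant_status/wxs-normal/TCGA-XX-0001_GT_samdepth_merged.txt\n' \
  >> input_genotype_samdepth.txt
```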
Note: The bash script determineVariantStatus.sh includes the R script determineVariantStatus.R. Set the number of array jobs to the number of jobs in the input list.
c. Run the script using the following commands.
CRITICAL: The bash script combineAllSamplesVariantStatus.sh includes the R script combineAllSamplesVariantStatus.R. In the R script, set the full path for the input files (i.e., the variant status files obtained from the previous step) and the output path where the combined variants from all samples are to be saved. The memory and time may vary based on the number of samples to be processed.
10. The script above outputs a large table that contains the variant status from all the samples. However, some additional processing is needed to get the data into a usable shape. The combined variant status file contains millions of variants, and it would take too much memory and time to pre-process at once. Therefore, we first split the large combined variant status file into chunks of small files, keeping the same number of columns but splitting on a fixed number of rows. In this file, each line corresponds to one variant.
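The splitting commands were not reproduced in this excerpt; one way to split a large table into fixed-row chunks while keeping the header line in every chunk is sketched below. The tiny stand-in table, the chunk size of 2, and the chunk_ prefix are illustrative only:

```shell
# Stand-in for the large combined variant status table (header + 5 variants).
printf 'variant\ts1\ts2\n' > CombinedVariantsFromAllSamples.txt
seq 1 5 | awk '{print "v"$1"\tHT\tHR"}' >> CombinedVariantsFromAllSamples.txt
# Split the data rows (not the header) into 2-row chunks with numeric suffixes.
tail -n +2 CombinedVariantsFromAllSamples.txt | split -l 2 -d - chunk_
# Prepend the header to each chunk so every chunk is a valid table.
for f in chunk_??; do
  { head -n 1 CombinedVariantsFromAllSamples.txt; cat "$f"; } > "${f}.txt"
  rm "$f"
done
```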
13. Once the preprocessing of each split file is finished, the next step is to merge them. Run the command below from the command-line interface.
14. Remove variants that are either unknown or homozygous reference across all samples from the processed_CombinedVariantsFromAllSamples.txt file. Run the script below from the directory where the processed_CombinedVariantsFromAllSamples.txt file is located. The script will output the processed_CombinedPotentialSNPs.txt file.
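The filtering logic of step 14 can be sketched with awk. The status codes (HR for homozygous reference, UK for unknown) and the three-variant input table are assumptions for illustration, not the protocol's actual encoding:

```shell
# Stand-in merged table: v1 is HR everywhere, v3 is unknown everywhere.
printf 'variant\ts1\ts2\nv1\tHR\tHR\nv2\tHT\tUK\nv3\tUK\tUK\n' \
  > processed_CombinedVariantsFromAllSamples.txt
# Keep the header plus any variant with at least one non-HR, non-UK status.
awk 'NR == 1 { print; next }
     { keep = 0
       for (i = 2; i <= NF; i++) if ($i != "HR" && $i != "UK") keep = 1
       if (keep) print }' processed_CombinedVariantsFromAllSamples.txt \
  > processed_CombinedPotentialSNPs.txt
```

Only v2 survives: it is the sole variant carried by at least one sample.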

Part 7: Examples of downstream analyses
In the previous steps, we identified germline variants from the whole-exome sequences and RNA sequences of the TCGA-LGG cohort. Several downstream analyses can be performed to evaluate the quality and clinical relevance of the identified germline variants.
16. Use the ANNOVAR software to perform functional annotation of the germline variants, such as the gene in which each variant is located and whether the variant is in the exonic, intronic, UTR, promoter, or splicing region of a gene or in an intergenic region.
17. Use ANNOVAR to determine the allele frequencies of the germline variants in whole-genome sequencing data from various ethnic populations, such as those listed in the gnomAD database.
18. Calculate the allele frequency of each germline variant in this study using the following formula:

allele frequency = (#HT + 2 × #HA) / (2 × (#HR + #HT + #HA))

where #HR is the total number of patients with the homozygous reference genotype, #HT is the total number of patients with a heterozygous genotype, and #HA is the total number of patients with the homozygous alternate genotype.
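Step 18 can be computed with a one-line awk program; the genotype counts below are illustrative, not from the study:

```shell
# Alternate-allele frequency from genotype counts (illustrative counts).
HR=400; HT=90; HA=10
awk -v hr="$HR" -v ht="$HT" -v ha="$HA" \
    'BEGIN { printf "%.3f\n", (ht + 2 * ha) / (2 * (hr + ht + ha)) }'
# prints 0.110
```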
19. Calculate the correlation of the allele frequencies in the four variant call sets (the final merged variant call set and the wxs-normal, wxs-tumor, and rnaseq-tumor datasets) with each other and with the allele frequency from gnomAD using the GGally R package. In our study, the allele frequencies in the combined dataset from ~500 patients gave a correlation coefficient of >0.9 with the allele frequencies in gnomAD.
20. Determine whether the germline variant calls separate patients based on self-reported race using the PLINK software.
21. Determine the genetic linkage between germline variants at different loci by performing linkage disequilibrium analysis.
22. Determine for a given germline variant whether there is a significant difference in survival outcome for the minor allele compared to the reference major allele (Chatrath et al., 2019; Chatrath et al., 2020). Here, we have compared the overall survival outcome between LGG patients (n = 507) with different rs1131397 genotypes.
23. The output of the multivariate Cox regression analysis includes the coefficient, hazard ratio, z-score, and p value for each variant. As multiple variants are tested, it is important to correct the p values after Cox regression. In our study, 196,022 variants were tested, and multiple-testing correction was performed using the Bonferroni correction.
24. Determine whether a given germline variant is predictive of an increase or decrease in tumor mutation burden and of responsiveness to immune checkpoint therapy (Chatrath et al., 2021a).
25. For germline variants that are significantly associated with a phenotype (increase or decrease of survival, tumor mutation burden, responsiveness to a specific therapy), determine whether the variant is associated with differences in the expression of nearby genes by performing expression quantitative trait loci (eQTL) analysis. For more details on other downstream analyses that can be performed with germline variants, please refer to Chatrath et al. (2021b).

EXPECTED OUTCOMES
This protocol presents instructions for integrating the whole-exome sequencing datasets from normal and tumor samples and the RNA sequencing dataset from the tumor samples of the TCGA-LGG cohort to determine germline variants.
Germline variants from the TCGA datasets are typically called from wxs-normal blood samples. This approach captures the exonic region of the genome, allowing one to find the variants within genes. However, sequence (and variants) outside the exonic region, such as promoters, introns, intron-exon boundaries, and UTRs, are also captured as a byproduct of whole-exome capture and are present in wxs data (Guo et al., 2012; Samuels et al., 2013). In this protocol we combine the information from the wxs-normal, wxs-tumor, and rnaseq-tumor datasets for a given patient to get additional sequence coverage at a base and thus call germline variants that may not have had sufficient sequence depth in the wxs-normal data. In addition, this approach can be used to call variants in patient sets where wxs-normal data are not available. Tumor-specific mutations and RNA editing changes are also called as variants in the wxs-tumor and rnaseq-tumor data, but these are filtered out at a later stage by focusing only on variants that are known to exist in the general population with a minor allele frequency ≥ 0.05. Lower thresholds for minor allele frequency can be used when more patient datasets are available.
There were some choices we made in our protocol that may be changed by a user. We use a minimum sequencing coverage threshold of 10 before a variant status is called, because that gives us enough confidence that if a variant were present at that locus, it would have been sampled by then (and sampled at least 3 times). This is an arbitrary threshold and other users can set other thresholds if they so desire, but one has to balance the need to find sufficient patients with a minor allele against the need to be 100% confident that the variant call is correct. In addition, we require that a minimum of three reads support the reported variant, and this too can be changed by the user. Note that we also removed reads that may have come from PCR duplicates, so that only one read with the identical start position is kept among all the duplicates. Finally, we turned off the calling of structural variants in this analysis so that no insertions, deletions, duplications, inversions, or large copy number variations were called.
Our method has the benefit of determining the genotype status of a variant with low sequencing coverage in the wxs-normal sample by taking the genotype status of that variant from the corresponding wxs-tumor sample, where the variant at that position may be supported by higher sequencing coverage. However, if we are still unable to determine the genotype status, we insert the genotype status from the rnaseq-tumor counterpart. This method did not significantly affect the accuracy of the variant calls in our study, because the allele frequencies calculated from the rnaseq-tumor datasets were significantly positively correlated (r > 0.98) with the allele frequencies from the wxs-normal, wxs-tumor, and combined-wxs-rnaseq variant call sets. Importantly, the allele frequencies in our study were significantly positively correlated with the allele frequencies of germline variants listed in the population-scale gnomAD database, supporting the reliability of our calls. Specifically, the allele frequency in gnomAD correlated (Pearson's correlation coefficient) with the wxs-normal variant set (r > 0.963), the wxs-tumor variant set (r > 0.964), the rnaseq-tumor variant set (r > 0.937), and the combined-wxs-rnaseq variant set (r > 0.947) (Chatrath et al., 2019).
Once the genotype status of each variant in each patient is identified, the user can check the survival outcome of the minor allele compared to the major allele for each variant. The Kaplan-Meier survival analysis for the three genotypes (Figure 3) of variant rs1131397, at position 154965759 on chromosome 1, reveals that patients with the homozygous alternate genotype (CC) and the heterozygous genotype (GC) have significantly (p value = 7.48e-04, log rank test) worse prognosis than those with the reference genotype (GG). In addition, the multivariate Cox regression analysis in Figure 4 reveals that the minor allele of rs1131397 is an independent factor for predicting LGG patient overall survival after adjusting for confounding clinical risk factors.

LIMITATIONS Computational resources
This protocol performs variant calling, preprocessing of VCFs, calculation of sequencing coverage, determination of variant status, merging of variant status across samples, and determination of the genotype of variants with low sequencing coverage from the respective wxs-normal, wxs-tumor, and rnaseq-tumor BAM files. As these calculations are computationally intensive, we recommend running the protocol on a high-performance cluster. The memory and number of CPUs required to run each step (see Step-by-step method details) may vary with the number of samples, the size of the dataset, the sequencing depth, and the size of the capture area. Smaller datasets have lower memory requirements, so this method can be used on them with limited computational resources. This protocol requires the user to modify the memory and CPUs requested in the job script for each step accordingly.

Data access
A token is required for downloading the controlled test dataset (see Before you begin) from the GDC. Obtaining access to a controlled dataset requires an NIH eRA Commons account and then dbGaP authorization. Instructions on how to apply for GDC controlled dataset access can be found at https://gdc.cancer.gov/access-data/obtaining-access-controlled-data.

Computational time
Downloading GDC datasets and germline variant calling consume a considerable amount of time and computational resources. We recommend processing a small number of wxs and rnaseq BAM files first to estimate the computational resource requirements for future analyses. This protocol requires the user to modify the time requested in the job script for each step accordingly.

Basic programming knowledge
This protocol follows a SLURM-based schema and requires knowledge of the bash and R scripting languages. Newer users may need to learn some basic Linux commands used in this protocol, such as ls, echo, cd, rm, mv, mkdir, find, nano, awk, sed, grep, pwd, bash, and sbatch, to execute it fully.

TROUBLESHOOTING
Problem 1
A common warning may be displayed for the BAM index: BAM index file is older than BAM file (corresponding protocol step: Part 1: Variant calling using VarDictJava; step 1g).

Potential solution
This is a warning message about the BAM index file. It can be ignored if you are sure that the index file is up to date. You can create a more recent index of the BAM file using the command below, and the warning message should go away.
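The indexing command was not reproduced in this excerpt; it is typically the following (samtools assumed on PATH; the filename is a placeholder):

```shell
# Regenerate the index so it is newer than the BAM file.
samtools index sample.bam
```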

Problem 2
A common error may be displayed: These module(s) or extension(s) exist but cannot be loaded as requested (corresponding protocol step: Part 1: Variant calling using VarDictJava; step 1g).

Potential solution
The module system may vary between Linux clusters. The user needs to know how to load and unload modules on their online Linux cluster. For example, to search for available R modules on our cluster, use the command below.
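The search command was not reproduced in this excerpt; on an Lmod-based cluster it typically looks like the following (module names and versions vary by site and are placeholders here):

```shell
module spider R        # search for available R modules
module load R/4.1.1    # then load a specific version
```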
Note: This protocol expects that R, VarDictJava, SAMtools and BCFtools are installed on your local or online Linux cluster and loaded in the system PATH.

Problem 3
Fatal error: cannot open file 'extract_uniqueVariants.R': No such file or directory (corresponding protocol step: Part 2: Preprocessing of VCF files; step 4).

Potential solution
Try to set the full path of the R script in the extract_uniqueVariants.sh bash script.

Problem 4
A job running on the Linux cluster is canceled before it completes.

Potential solution
This error will appear in the error output file. It occurs when a job running on the Linux cluster reaches the maximum time limit requested in the job script and is therefore canceled. Try increasing the time limit in the job script with the --time= sbatch flag.

Problem 5
Error in names(x) <- value: 'names' attribute must be the same length as the vector (corresponding protocol step: Part 6: Insert variant call for unknown variant status in normal samples; step 15c).

Potential solution
This error message suggests that the Sample_Barcode in the patient_list does not match the sample names in the processed_CombinedPotentialSNPs.txt file. The Sample_Barcode must match the sample names in the processed_CombinedPotentialSNPs.txt file exactly for each data type.

RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the technical and lead contacts, Divya Sahu (dsahu@uab.edu) and Anindya Dutta (duttaa@uab.edu).

Materials availability
This study did not generate new unique reagents.

Data and code availability
The data/analyses presented in the current protocol have been deposited in and are available from the dbGaP database under dbGaP accession phs000178.