Protocol for unbiased, consolidated variant calling from whole exome sequencing data

Summary

Whole Exome Sequencing (WES) is used to query DNA variants in the protein-coding parts of genomes (exomes). However, WES analysis can be challenging because of the complexity of the data. Here, we describe a consolidated protocol for unbiased WES analysis. The protocol uses three variant callers (HaplotypeCaller, FreeBayes, and DeepVariant), which have different underlying models. We provide detailed execution steps, as well as basic variant filtering, annotation, visualization, and consolidation steps.

b. Further variant annotations, along with variant impacts, computational pathogenicity scores, conservation scores and additional clinical information: dbNSFP 2.9.3. The latter requires a preprocessing step before it can be used for variant annotation.
c. Variant frequency files across populations, which are required in a clinical setting to assess whether a variant has pathogenic potential according to its frequency in major population cohorts (the higher the frequency, the lower the pathogenic potential): gnomAD 2.1.1 and its index.
The following script template can be used to perform steps 4a-4c:

Prerequisite software installation - quality control

Timing: 15 min

The goal of this section is to download and install the software required for raw data quality control. The following command line operations can be executed as provided in most Linux distributions; we are using Ubuntu 20.04 LTS. At the end of each code snippet, we include a final command which exports to the shell environment the path to the tool just installed. In this way, the tool becomes available across all subsequent steps. The command has the following format, where [TOOL] is the software tool just installed. Furthermore, it should be noted that certain tools and resources are versioned, meaning that the version of the file to be downloaded is subject to change. Usually, this change is evident even for relatively inexperienced users. At the end of this section, the software required for quality control should be in the proper place for the execution of the protocol.
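As a hypothetical illustration of this export convention (the variable name and installation path below are assumptions, not part of the original templates):

```shell
# Hypothetical example of the export command format described above:
# after installing FastQC under $HOME/tools, make its location available
# to all subsequent steps through an environment variable.
export FASTQC_PATH=$HOME/tools/FastQC
```

Later steps can then invoke the tool as $FASTQC_PATH/fastqc without re-stating its full path.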
5. Set the directory where the tools will be installed in the user's home directory. Note that some tools, especially those depending heavily on the Python language, may not follow this convention.
7. Download and install MultiQC. The existence of the Python package manager pip is assumed; it is usually bundled with most current Linux distributions.
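A minimal sketch of such an installation, assuming pip is available (the --user flag and the resulting path are assumptions made here to avoid administrative access):

```shell
# Install MultiQC for the current user via pip (no root access needed);
# pip typically places user-installed scripts under $HOME/.local/bin.
pip install --user multiqc
export MULTIQC_PATH=$HOME/.local/bin
```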
9. Download and install Trim Galore.

Prerequisite software installation - alignment

In this section, the software required for aligning the raw read data to the reference genome is retrieved and installed. At the end of the process, the software should be in the proper place for the continuation of the protocol.

Download and install bwa.
Prerequisite software installation - variant calling

Timing: 10 min

In this section, the software required for variant calling is retrieved and installed. At the end of the process, the software should be in the proper place for the continuation of the protocol.
11. Download and install GATK.

13. Download and install Docker to be able to run DeepVariant. The box below follows the official Docker installation instructions.
CRITICAL: This is the only protocol step where the aid of a system administrator or a trained bioinformatician may be required for ensuring proper installation, as it requires system-level access.

Data collection
Timing: 1.5 h

In this section, the raw data for the demonstration of the protocol are retrieved. At the end of the process, the data should be in the proper place for the continuation of the protocol. We demonstrate the protocol using six randomly selected, male-female balanced samples from the British in England and Scotland (GBR) population of the 1000 Genomes Project. The samples are also listed in the key resources table.

22. Set the directory where the raw data will be placed.

MATERIALS AND EQUIPMENT
Hardware: while setting up the computational protocol, the steps were performed on a system with 64 physical cores, 512 GB of RAM and Ubuntu 20.04 LTS. We used 32 cores where parallelization was available. Generally, the process can be completed with adequate performance on a system with 16 cores and 128 GB of RAM. If less RAM is available, parallelization can be partly avoided by restricting the number of jobs executed asynchronously in the background (remove the '&' character where it is found in several commands).

STEP-BY-STEP METHOD DETAILS
In all the subsequent steps, the paths to the required software tools are the same as the ''exported'' paths in the respective command boxes under the ''before you begin'' section.

Quality control and filtering
Timing: 2 h 15 min

Quality control of the generated data is a crucial step in every Next Generation Sequencing protocol, especially for processes also related to the clinic, such as exome sequencing and variant calling. Quality control in exomes becomes even more critical because, when detecting variants on a large scale, it is not straightforward to distinguish between sequencing errors and actual variations in the human genome. Therefore, quality control procedures are often lenient, and total quality assessment is a combination of various factors. In this section we outline a typical pre-alignment quality control procedure for whole exome sequencing data. At the end, quality-controlled FASTQ files ready for alignment will be acquired.
1. Quality control with FastQC and MultiQC.
a. Pre-alignment QC using FastQC to determine whether any raw data corrective actions need to be taken. Default FastQC reports are not interactive and not aggregated.
b. Use MultiQC to create a more user-friendly and complete report.
The following bash script can be used as a template: The results of MultiQC can be viewed by opening the file $FASTQC_OUTPUT/multiqc_report.html in a web browser.
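The full template is not reproduced here; a minimal sketch of the two QC commands could look as follows (the $FASTQ_DIR, $FASTQC_OUTPUT and $CORES variables are assumptions):

```shell
# Run FastQC on all compressed FASTQ files, then aggregate the
# individual reports into a single interactive page with MultiQC.
mkdir -p $FASTQC_OUTPUT
fastqc --outdir $FASTQC_OUTPUT --threads $CORES $FASTQ_DIR/*.fastq.gz
multiqc --outdir $FASTQC_OUTPUT $FASTQC_OUTPUT
```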
Note: The results of FastQC and MultiQC may reveal a lot of useful information. Some examples include the presence of adapters, the presence of bias at the 3′/5′ ends of reads, poor quality at the 3′/5′ ends of reads, poor quality for certain samples, and sequence over-representation other than the one expected from adapter contamination. After a first round of inspection, we may have to improve the quality of the overall dataset prior to continuing with alignment to the reference genome and the subsequent variant calling. Trim Galore is a good option for this, as it automates many processes, including standard automated adapter removal and maintaining paired-end read integrity. In the case of the data presented in this protocol, the quality of the dataset is acceptable and none of the above points apply; no further action is needed. Therefore, the following section is not required and is only mentioned here for reference purposes and protocol completeness.
2. Adapter and poor-quality base trimming (optional). A template bash script wrapping Trim Galore follows, with a stricter alternative filtering approach included as comments below the main command. For paired-end reads, Trim Galore produces four outputs: mate 1 reads passing QC, mate 2 reads passing QC (and matched to mate 1), mate 1 failed reads (optional, not chosen above), and mate 2 failed reads (optional, not chosen above).
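A hedged sketch of such a wrapping command (sample names, output directory and the stricter thresholds are placeholders, not the protocol's exact values):

```shell
# Paired-end trimming with Trim Galore; adapters are auto-detected.
# The commented alternative applies stricter quality/stringency cutoffs.
trim_galore --paired --output_dir $TRIM_OUTPUT \
  $FASTQ_DIR/${SAMPLE}_1.fastq.gz $FASTQ_DIR/${SAMPLE}_2.fastq.gz
#trim_galore --paired --quality 30 --stringency 5 --output_dir $TRIM_OUTPUT \
#  $FASTQ_DIR/${SAMPLE}_1.fastq.gz $FASTQ_DIR/${SAMPLE}_2.fastq.gz
```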
3. Inspection of the outcome. Trim Galore also runs FastQC again. From its output we may be able to see that: a. The problematic points identified above are remedied and brought to acceptable states and error rates. b. The number of filtered reads remains at acceptable amounts.

Alignment to the reference genome

Timing: 4 h 30 min

This section describes the process of aligning the FASTQ pairs to the reference genome and collecting alignment statistics for quality control purposes. At the end of the step, BAM files and a report of read alignment statistics are generated.

Index the reference genome.
This step is needed only once and does not have to be repeated when applying the protocol to new data, unless the index and/or the reference genome are deleted by the user. When this process is completed, we need to create an additional file called hs37d5.dict, expected by GATK tools for variant calling and other processing. We use samtools for this. After the index building is finished, the alignment process can be initiated for each FASTQ pair.
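A minimal sketch of the indexing commands, assuming the reference FASTA is named hs37d5.fa and lives under $GENOME_DIR (both assumptions):

```shell
# Build the bwa index (run once), the sequence dictionary expected by
# GATK, and the FASTA index used by samtools and other tools.
bwa index $GENOME_DIR/hs37d5.fa
samtools dict $GENOME_DIR/hs37d5.fa > $GENOME_DIR/hs37d5.dict
samtools faidx $GENOME_DIR/hs37d5.fa
```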
CRITICAL: The downstream variant calling analysis requires read group information. Read groups are added to each alignment in the resulting BAM file in order to separate different individuals, as well as samples resulting from different lanes and libraries. Read groups are required because variant callers pool samples to estimate the models behind variant discovery. Read groups (the RG tag) can be added during alignment with bwa using the -R option. The following shell script can be used to accomplish the alignment and read group addition procedure. Furthermore, as the BAM files need further processing, the file extension of the aligned files is .uns.
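A sketch of such an alignment command, using the read group string defined in the protocol (directory variables and file naming are assumptions):

```shell
# Construct the read group string and align with bwa mem, piping the
# output directly into an unsorted (.uns) BAM file via samtools.
RG="@RG\tID:"$BASE"\tSM:"$BASE"\tLB:WES\tPL:ILLUMINA"
bwa mem -t $CORES -R $RG $GENOME_DIR/hs37d5.fa \
  $FASTQ_DIR/${BASE}_1.fastq.gz $FASTQ_DIR/${BASE}_2.fastq.gz | \
  samtools view -b -o $BAM_DIR/$BASE.uns.bam -
```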

The read group string used is the following:

RG="@RG\tID:"$BASE"\tSM:"$BASE"\tLB:WES\tPL:ILLUMINA"

Preparation of BAM files

In this section we describe the steps taken to prepare the BAM files for the subsequent variant calling and discovery. The output of this part comprises BAM files suitable for the subsequent variant calling. The vast majority of variant callers require these preparation steps, and the major steps taken (in slightly different flavors according to the tools used) are the following:
6. Merging of BAM files from different lanes. This is an optional step according to the instrument and sequencing protocol used (the files used in this protocol do not require this step).
7. Then, if the sequencing is paired-end:
a. Sort the reads in the BAM file according to their names so that pairs are placed one below the other.
b. Fix mates so that they both have the same sets of attributes for the subsequent preprocessing.
c. Re-sort the reads, this time according to their genomic coordinates.
d. Mark the duplicate reads, as variant callers take this information into account.
8. If the sequencing is single-end:
a. Sort the reads according to their genomic coordinates.
b. Mark the duplicate reads, as variant callers take this information into account.
In our case, we have paired-end sequencing, so we are following the first set of steps above.
9. Sort the reads in the BAM file according to their names so that pairs are placed one below the other and fix read-mates so that they both have the same sets of attributes for the subsequent preprocessing.
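A sketch of step 9 using samtools (file naming is an assumption):

```shell
# Sort by read name, then fix mate information; -m adds the mate score
# tags that samtools markdup requires later on.
samtools sort -n -@ $CORES -o $BAM_DIR/$BASE.nsort.bam $BAM_DIR/$BASE.uns.bam
samtools fixmate -m $BAM_DIR/$BASE.nsort.bam $BAM_DIR/$BASE.fixmate.bam
```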
10. Re-sort the reads according to their genomic coordinates and mark the duplicate reads, as variant callers take this information into account.

The following script fragment illustrates how the alignment statistics report is initialized and populated:

printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "name" \
  "total reads" "total read pairs" "aligned reads" \
  "properly paired aligned pairs" "uniquely aligned reads (q>20)" \
  "properly paired uniquely aligned reads" "chimeric reads" \
  "reads overlapping targets" "total bases" "aligned bases" \
  "uniquely aligned bases" "bases overlapping targets" > $REPORT
printf "%s\t" $SAMPLE >> $REPORT
echo " total reads"
printf "%d\t" $($SAMTOOLS_PATH/samtools view -c -F2048 $BAM) >> $REPORT
echo " total read pairs"

Note: This section describes the steps taken to collect some useful alignment statistics and prepare the BAM files for the subsequent variant calling and discovery. The former may further help identify poor quality samples that should not be used for variant calling. While such samples may have passed the QC process applied on the raw data, it is possible that they present low alignment rates or low coverage over the target areas (exome capture kit), for example as a result of possible contamination.
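A sketch of step 10 with samtools (file naming is an assumption):

```shell
# Re-sort by genomic coordinates, mark duplicates and index the final
# BAM file that will be used for variant calling.
samtools sort -@ $CORES -o $BAM_DIR/$BASE.sort.bam $BAM_DIR/$BASE.fixmate.bam
samtools markdup -@ $CORES $BAM_DIR/$BASE.sort.bam $BAM_DIR/$BASE.bam
samtools index $BAM_DIR/$BASE.bam
```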

Signal track creation

Timing: 30 min
Another level of quality control, as well as supporting evidence for discovered variants, is the actual inspection of the sequencing signal or coverage (i.e., the histogram created by the short read pileup at a specific locus). This can be accomplished by uploading, opening or linking signal files created from the BAM files to a genome browser such as the UCSC Genome Browser or IGV (Robinson et al., 2011). Signal tracks in BigWig format can be created using the following shell script as a template. In this case, we note the addition of the short string ''chr'' before the chromosome names. This is required for viewing in the UCSC Genome Browser. For other browsers such as IGV, this addition depends on the reference genome loaded. The latter can be controlled in IGV but not in the UCSC Genome Browser. The output of this part is BigWig files suitable for visualization in a genome browser. The produced BigWig files must then either be placed in a directory served by a web server such as Apache in order to be viewed by a web-based genome browser (such as the UCSC Genome Browser) or be opened directly in a local genome browser such as IGV.
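One possible sketch for creating such a track, assuming bedtools and the UCSC bedGraphToBigWig utility are available (the tool choice is an assumption, not necessarily the template used in the protocol):

```shell
# Derive chromosome sizes from the BAM header, prepend ''chr'' for the
# UCSC Genome Browser, compute coverage and convert to BigWig.
samtools view -H $BAM_DIR/$BASE.bam | awk -v OFS='\t' \
  '/^@SQ/ {sub("SN:","",$2); sub("LN:","",$3); print "chr"$2, $3}' > chrom.sizes
bedtools genomecov -ibam $BAM_DIR/$BASE.bam -bg | \
  sed 's/^/chr/' | sort -k1,1 -k2,2n > $BASE.bedGraph
bedGraphToBigWig $BASE.bedGraph chrom.sizes $BASE.bw
```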

Variant calling with GATK HaplotypeCaller
Timing: 6 h

This section describes the variant calling procedure using GATK HaplotypeCaller; its output is a VCF file with filtered variants after the application of basic filters. Each caller accepts the BAM files as main inputs but, in order to be as efficient as possible, different pre-calling procedures are required. Examples of such procedures are:
Example 1: GATK HaplotypeCaller requires a procedure called Base Quality Score Recalibration (BQSR) in order for its underlying model to work as well as possible.
Example 2: FreeBayes does not require BQSR, but expects the capture kit intervals to be reformatted to the BED format and all samples to be processed jointly.
Example 3: DeepVariant, on the other hand, performs the splitting of the capture kit regions automatically.
In addition, there is nowadays some debate on whether BQSR is needed prior to variant calling or not, as this process was initially developed for older sequencers that produced poorer results than modern ones. We choose to apply BQSR for protocol completeness purposes. More info on the debate can be found in the official GATK community forums and other bioinformatics communities such as Biostars.
The calling process with GATK HaplotypeCaller has several steps and substeps. Below we outline the process and provide template scripts.
Note: BQSR is a relatively lengthy process and can be executed in parallel if we split the capture kit genomic regions. The GATK toolkit has tools for this. The main inputs for BQSR in exome sequencing are the exome capture kit, the reference genome, and a list of known variant locations (e.g., dbSNP, gnomAD), which provides the algorithm with ground truth sites used to recalibrate scores.
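A sketch of BQSR for one sample and one interval list (resource file names and directory variables are assumptions):

```shell
# Step 1: model base quality errors against known variant sites.
gatk BaseRecalibrator -I $BAM_DIR/$BASE.bam -R $GENOME_DIR/hs37d5.fa \
  -L $CAPTURE_KIT --known-sites $RESOURCES/dbsnp.vcf.gz \
  -O $BASE.recal.table
# Step 2: apply the recalibration model to produce a BQSR-ed BAM.
gatk ApplyBQSR -I $BAM_DIR/$BASE.bam -R $GENOME_DIR/hs37d5.fa \
  --bqsr-recal-file $BASE.recal.table -O $BAM_DIR/$BASE.bqsr.bam
```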
After the BQSR process, we are ready to proceed with variant calling for each sample separately.
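A per-sample, per-interval calling sketch (interval and file names are assumptions); the per-interval VCFs can later be merged per sample:

```shell
# Call variants for one sample over one capture kit interval; running
# several intervals in the background parallelizes the process.
gatk HaplotypeCaller -I $BAM_DIR/$BASE.bqsr.bam -R $GENOME_DIR/hs37d5.fa \
  -L $INTERVAL_DIR/$INTERVAL --dbsnp $RESOURCES/dbsnp.vcf.gz \
  -O $VCF_DIR/$BASE.$INTERVAL.vcf.gz &
wait
```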
Although there are many alternatives for running exome analysis with HaplotypeCaller in an efficient way (e.g., parallelization over capture intervals, running each sample in the background, or even using GNU parallel), we propose the following sub-protocol (''intervals'' are the capture kit genomic intervals created during the BQSR process).
CRITICAL: At this point, and with respect to step 2e above, it should be noted that the best filtering practices suggested by the GATK community comprise only basic variant filters in order to reduce noise. As with the rest of the variant callers, more elaborate filtering should follow, which, unlike these generic filters, is application dependent. For example, a user investigating rare disease should look for damaging variants (e.g., frameshift, splicing, missense) after variant annotation, while a user conducting a population study with many samples should focus on filtering out variants with low frequencies, as those would not characterize a population cohort. Finally, under different clinical settings, a user would possibly combine various filters, for example restricting damaging variants to certain virtual gene panels of interest.

Variant calling with FreeBayes

This section presents the variant calling and filtering steps with FreeBayes. Its output is a VCF file with filtered (basic filters) variants called with FreeBayes.
In comparison with GATK HaplotypeCaller, the model behind FreeBayes does not require BQSR (so it is faster), requires all samples to be processed together in a single command (using read groups and the RG tag to distinguish them), and does not operate directly on the HaplotypeCaller genomic intervals. These have to be recalculated and reformatted to the BED format for FreeBayes parallelization.
Although there are many alternatives for running exome analysis with FreeBayes in an efficient way (e.g., parallelization over exome capture kit intervals, running each sample in the background, or even using GNU parallel), we propose the following protocol (''intervals'' are the capture kit genomic intervals recreated with GATK SplitIntervals for FreeBayes):
14. Rerun GATK SplitIntervals to create FreeBayes-specific intervals for parallelization.
15. Create a list file with the individual interval filenames.
16. Create a list file with the individual BAM filenames.
17. For each interval, run FreeBayes jointly for all samples to create a VCF file for that interval.
18. Merge the produced multi-sample VCFs for each interval into one multi-sample VCF file using vcflib.
19. Using bcftools and R, determine upper quality (QUAL) and depth (DP) cutoffs based on the respective distributions (assuming an initial QUAL>20).
20. Using bcftools, apply the filters of (19).
21. Using vcflib, decompose the complex variants.
22. Using bcftools, normalize INDELs and produce the final VCF.
23. Clean up the computation environment.
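A sketch of the joint, per-interval FreeBayes call (steps 16-17; file names are assumptions):

```shell
# Run FreeBayes jointly on all samples (via a BAM list file) for one
# BED interval; each interval produces its own multi-sample VCF.
freebayes --fasta-reference $GENOME_DIR/hs37d5.fa \
  --bam-list bam_files.txt --targets $INTERVAL_DIR/$INTERVAL.bed \
  > $VCF_DIR/freebayes.$INTERVAL.vcf
```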
Variant calling with DeepVariant

The DeepVariant pipeline is well defined and largely automated, leaving few steps for the user, which essentially come down to variant filtering (which again is not complex). Summaries of all steps (including background processes) are recorded in a ''report'' file for general supervision. The following shell script template can be used to run the above protocol:
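A sketch of the Docker invocation, following the general run_deepvariant interface (paths, file names and the image version are assumptions):

```shell
# Run DeepVariant in WES mode through Docker; input/output paths are
# mapped into the container under /data.
docker run -v $DATA_DIR:/data google/deepvariant:1.4.0 \
  /opt/deepvariant/bin/run_deepvariant --model_type=WES \
  --ref=/data/hs37d5.fa --reads=/data/$BASE.bam \
  --regions=/data/capture_kit.bed \
  --output_vcf=/data/$BASE.deepvariant.vcf.gz \
  --num_shards=$CORES
```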

Variant annotation
Timing: 9 h 30 min

In this section the output of each variant caller is annotated with additional elements, such as variant impacts and frequencies of known variants in major population studies. The output of this part is one annotated VCF file for each variant caller.
28. Using SnpEff and SnpSift, annotate the findings with basic information including: a. Genomic location (gene, exon, etc.). b. Impact prediction based on the Sequence Ontology and the Sequence Variant Nomenclature. c. Known variant IDs from dbSNP. d. Various pathogenicity prediction scores and other SNP metrics from dbNSFP. e. Population study variant frequencies from gnomAD.
CRITICAL: It is assumed that the required resources for SnpEff and SnpSift are in place (see also the ''before you begin'' section). Prior to using SnpEff and SnpSift, a SnpEff database for our genome of interest must be downloaded (see script below).
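A sketch of the database download and a possible annotation chain (jar locations, database version and resource file names are assumptions):

```shell
# Download the SnpEff database once, then annotate impacts (SnpEff),
# known IDs (dbSNP), pathogenicity scores (dbNSFP) and frequencies
# (gnomAD) in a single pipe; "-" reads the VCF from stdin.
java -jar $SNPEFF_PATH/snpEff.jar download GRCh37.75
java -jar $SNPEFF_PATH/snpEff.jar GRCh37.75 $VCF_DIR/variants.vcf.gz | \
  java -jar $SNPEFF_PATH/SnpSift.jar annotate $RESOURCES/dbsnp.vcf.gz - | \
  java -jar $SNPEFF_PATH/SnpSift.jar dbnsfp -db $RESOURCES/dbNSFP.txt.gz - | \
  java -jar $SNPEFF_PATH/SnpSift.jar annotate $RESOURCES/gnomad.vcf.gz - \
  > $VCF_DIR/variants.annotated.vcf
```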
The following shell script template can be used for annotation of the final (filtered) outputs from each variant caller:

Note: A challenging issue when using multiple variant callers is how to summarize and consolidate the different DNA variant callsets into one summarized result. Major challenges for consolidation include deciding which of the variant call metrics reported in the VCF file(s) of each caller will be included in the final callset (e.g., which QUAL, which DP, etc.), and the level at which variants from different callers should be considered identical or nearly identical. Regarding the latter, two common questions are: should variants be considered identical if they share the same genomic coordinates or start position, or should they be considered identical only if they share both positions and alleles?
Fortunately, bcftools offers functions to experiment with the many options that exist for consolidating the callsets. We have chosen to intersect the callsets and consider the overlapping variants identical if they share both genomic position and alleles. We perform the various intersections using bcftools and, for each intersection, we perform three operations in order to retain all the metrics of each caller but on the intersected (shared) variants. From the produced callsets, the most interesting one to begin the exploration with should be #7, which corresponds to the variants common to all three callers we have used.
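A sketch of such an intersection using bcftools isec (file names are assumptions); matching on both position and alleles is enforced with -c none:

```shell
# Intersect the three callsets; -n=3 keeps records present in all
# three files and -c none requires exact allele matching.
bcftools isec -p isec_results -n=3 -c none \
  haplotypecaller.final.vcf.gz freebayes.final.vcf.gz deepvariant.final.vcf.gz
```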

EXPECTED OUTCOMES
WES comprises a well-defined and highly promising NGS technique which has been successfully deployed during the past few years with many applications in research and the clinic. The output of WES is typically a (long) list of DNA variants detected in an input DNA sample, such as one from a patient, when compared to a reference genome. As with every major high-throughput technique, the output is prone to noise and potential errors which require special handling in order to be filtered out and to reduce false positives. The proposed protocol may help achieve this through careful data filtering and preparation, followed by variant calling with three established variant callers and subsequent (clinical) annotation and consolidation of the results.
Regarding the actual potential reduction in false positives, an estimation can be provided based on recent studies where multiple variant callers are evaluated (Barbitoff et al., 2022; Lin et al., 2018) and combined (Zhao et al., 2020). In (Zhao et al., 2020), the authors benchmarked GATK, the Illumina DRAGEN-based caller and DeepVariant using human genome data, and it was shown that the average F1-score for SNP detection across 4 datasets is 0.990 for GATK without Variant Quality Score Recalibration and 0.969 for GATK with Variant Quality Score Recalibration. Furthermore, in the same study it was shown that the combination of GATK with DeepVariant leads to higher F1-scores. The F1-score for the combined methodology was 0.993 on average, suggesting that the combination of methods is expected to lead to more accurate SNP calling results. In addition, in (Lin et al., 2018), the comparison of GATK with DeepVariant on the analysis of trios showed that DeepVariant made fewer calls, but with a lower false positive rate. Moreover, in (Barbitoff et al., 2022), the F1-scores calculated for the three methods when applied to Whole Exome Sequencing were 0.996 for DeepVariant, 0.985 for GATK and 0.987 for FreeBayes. Based on these results and the aforementioned results regarding algorithm combination, we expect the overall F1-score to be >0.996. Last but not least, in our experience GATK tends to return more variants than the other two methodologies (Figure 1B), even after the application of the best-practice filters, suggesting a potentially higher rate of false discoveries.
The proposed protocol produces outputs at various processing steps and at various levels. Specifically, the main outputs are quality-controlled raw data in FASTQ format, alignments to the reference genome in BAM format, variant callsets from each caller in VCF format, and annotated and consolidated variant callsets in VCF format.

LIMITATIONS
Despite the detailed description of the protocol steps, the installation of prerequisite software, and script templates that can be used almost out of the box, there are cases where the input of a computer expert or a trained bioinformatician may be needed. Such cases could include the installation of tools requiring system-level access, such as Docker, or navigating the available GATK tools and commands. In addition, the described protocol assumes a Linux-based system and some basic skills in using the command line. Although the latter skills are not extensive and the protocol steps are very detailed, some users may find them difficult to follow. Another limitation, and also the reason why command-line skills are required, is the fact that most of the required tools behave well mostly in Linux environments. Executing them on other operating systems (such as Windows) is not prohibitive but requires substantial skills and software prerequisites, as most of them would have to be re-compiled from source code. On the other hand, most of them are available out of the box for Linux environments. Additional limitations may have to do with the available computational power and storage resources. While most tools are flexible, and running them with a few or even one core is possible (albeit much slower), the required annotation resources need available storage. However, most laboratories engaged in WES should have storage resources available. Finally, although the usage of multiple variant callers depending on different underlying statistical models may reduce related biases and thus reduce false positives, visualization of the end result is also required to derive final conclusions, especially in clinical settings.
Such visualization is possible through dedicated genome browsers such as IGV, which operate on local systems and can simultaneously load WES signal (BigWig files), read alignment files (BAM files) and the called variants (VCF files). In this way, the analyst can verify (for some representative cases at least) the validity of the presence of a variant in all three callsets and whether there are false calls, based on the supporting aligned reads. All this information is available within IGV.

Problem 1
The hardware I have at my disposal to run the protocol is not adequate to guarantee performance.

Potential solution
Generally, the vast majority of the tools used in the protocol can run on single-core, lower-end systems such as a medium to high-end laptop. The user should try to drastically reduce the number of cores used (denoted by the ''CORES'' environment variable where applicable) and also reduce the number of compute jobs executed in the background, that is, remove the ampersands (& symbol) at the end of certain commands throughout the template scripts. The protocol will still complete, but the timings will increase by 50%-1000%.

Problem 2
The protocol describes the variant calling procedure with paired-end sequencing data. I want to execute the protocol with single-end sequencing data.

Potential solution
The only steps that change slightly are those regarding basic quality control, alignment to the reference genome, and BAM file preprocessing, which becomes shorter. We provide additional template scripts for this process in the GitHub repository accompanying this article.
Problem 3

I cannot find the coordinate files for the exome capture kit, or I am not sure about the kit used in my experiment.

Potential solution
In this unlikely event, the user may use a list of all the reference genome exons from a public resource such as RefSeq or Ensembl. Some biases are expected. The user must make sure that the downloaded exon coordinates correspond to the same reference genome version used in the alignment process (e.g., hg19).

Problem 4
Variant calling with DeepVariant crashes.

Potential solution
Try changing the following lines in bold.

Problem 5

Some tools require administrative access, or as a user I have limitations in installing tools, or I do not have enough allocated space.

Potential solution
None of the tools and commands in this protocol assume administrative access, except for the installation of Docker, which, however, is available for most modern Linux systems. In the unlikely event of limited user access, advice from a system administrator should be sought. The same applies to additional space requirements.

Problem 6
No variants are left after the filters applied to the FreeBayes result.

Potential solution
Depending on the particularities of each dataset, it is possible that such a case may arise. In this unlikely event, the filtering thresholds should be lowered. The user should change the following lines from the code in step 7.
with the following. It is possible that the user may have to experiment with the quantile values, for example lowering them from 0.90 to 0.75.
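A pure-shell illustration of such a lowered, quantile-based cutoff (variable and file names are assumptions; the original code derives the same cutoff with R):

```shell
# Extract QUAL values from the merged FreeBayes VCF and take the 75th
# (instead of 90th) percentile as the upper cutoff, using sort/awk.
bcftools query -f '%QUAL\n' $VCF_DIR/freebayes.merged.vcf.gz > quals.txt
QUAL_CUT=$(sort -n quals.txt | awk '{a[NR]=$1} END {print a[int(NR*0.75)]}')
echo "Upper QUAL cutoff: $QUAL_CUT"
```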

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Panagiotis Moulos (moulos@fleming.gr).

Materials availability
This study did not generate new unique reagents.

Data and code availability
This protocol did not generate any new datasets. The sample data analyzed in this protocol can be found at SRA using the links in the data retrieval box in the respective section, as well as in the key resources table. The code templates outlined throughout the article are available at https://github.com/moulos-lab/star_protocols_wes3x (https://doi.org/10.5281/zenodo.6491376).