ShallowHRD: detection of homologous recombination deficiency from shallow whole genome sequencing

Abstract Summary We introduce shallowHRD, a software tool to evaluate tumor homologous recombination deficiency (HRD) based on whole genome sequencing (WGS) at low coverage (shallow WGS or sWGS; ∼1X coverage). The tool, based on mining copy number alterations profile, implements a fast and straightforward procedure that shows 87.5% sensitivity and 90.5% specificity for HRD detection. shallowHRD could be instrumental in predicting response to poly(ADP-ribose) polymerase inhibitors, to which HRD tumors are selectively sensitive. shallowHRD displays efficiency comparable to most state-of-art approaches, is cost-effective, generates low-storable outputs and is also suitable for fixed-formalin paraffin embedded tissues. Availability and implementation shallowHRD R script and documentation are available at https://github.com/aeeckhou/shallowHRD. Supplementary information Supplementary data are available at Bioinformatics online.


Quality control
GC-corrected and normalized read counts profiles of sWGS and their sensitive segmentations (number of segments 300-1600), were annotated manually as "good" (n=55), "average" (n=6) or "bad" (n=8). Based on this annotation quality thresholds were defined.
Bad quality cases represented mainly sequencing failure independent of coverage, with frequent (n=5) poorly detectable local minimum in M, separating fluctuations of segments with equal copy numbers from one copy difference. Average quality was mainly due to a low coverage (0.06-0.3X) displaying high fluctuations in the number of reads per window characterized by cMAD. cMAD>0.14 and M>0.45 indicate low quality samples (Fig. S2). Coverage 0.3X is a low limit for sWGS to ensure prominent CNA profile.
After evaluating 108 down-sampled WGS of normal samples, a lower boundary for CNA cut-off was set to 0.025, to avoid CNA detection in normal and over-segmentation in low tumor content samples.

HRD annotation
In-house cases: In-house tumor cases were partially tested on the Institute Curie platform.
TCGA cohort: HRD annotation of the TCGA cohort was previously described (Manie, et al., 2016). Briefly, mutations in BRCA1/2, RAD51C and PALB2 genes were searched in whole exome sequencing (WES) data; gene inactivation was considered proven when deleterious mutation and LOH (Loss Of Heterozygosity) were observed at the gene locus or two deleterious mutations found in the gene; missense mutations annotated as pathogenic in COSMIC database were considered deleterious. BRCA1 and RAD51C promoter methylation was checked using the gene expression; cases with outlier low expression were annotated as HRD due to promoter methylation.

Specificity of HRD calls in SNP-array LST and scarHRD:
LST was validated on the TCGA cohort, which at the time of publication (Manie, et al., 2016) was not completely available for direct search and verification of the reported mutations. This explains relatively low specificity of LST method shown in Fig.1B. In the current validation set of the TCGA down-sampled WGS, specificity of LST method was very close to LGA in sWGS (predictions of SNP-array based method are indicated by colors in Fig.1A).
For scarHRD (Sztupinszki, et al., 2018), the methylation of RAD51C promoter was not assessed, which might led to missing HRD cases.

Soft and stringent HRD cut-offs and borderline HRD
Two cut-offs, soft and stringent, were introduced on the LGA number to call HRD or nonHRD. The reason for this is the appearance of HRD in breast and ovarian tumors: while the majority of cases with BRCAness (HRD) have LGA number far higher than 20, small proportion of mainly BRCA2 mutated tumors display near-diploid genome with ~15 large-scale chromosomal breaks. From the other hand, nonHRD tumors with near-tetraploid genomes can display 15-20 large-scale chromosomal breaks. When the tumor ploidy is known, there is no problem to distinguish these two situations.
For sWGS, ploidy estimation is problematic and could introduce additional uncertainty. To overcome this issue and to bring additional attention to the low confidence of the call we introduced borderline HRD.
The Supplementary Table S1 recapitulate all the TCGA down-sampled WGS cases processed including their ID, HRD diagnostic with shallowHRD and SNP-arrays, automatic quality detection and correspondence of large segment between sWGS and SNP-array. Cases with contradictory calls are commented.

Estimation of tumor content in WGS
We used estimation of tumor content inferred from the SNP-arrays by GAP method (Popova, et al., 2009) and ichorCNA (Adalsteinsson, et al., 2017) to directly estimate tumor content in sWGS and in the dilution series using window of 50kb on all autosomal chromosomes.

In silico dilution series based on sWGS
To obtain tumor content limitation for shallowHRD we performed in silico dilution of 7 in-house sWGS by 1 sWGS with quasi-normal genome (Supplementary Figure S8A). These 8 cases were sequenced in the same batch. The dilution series was done using picardTools MergeSam (http://broadinstitute.github.io/picard/), recursively merging seven times the BAM file of the tumor with "quasi-normal" profile with the BAM files of other cases. The effect of the dilution is shown in Supplementary Figs. S8 B, C and D.
Three chromosomes which carried some CNA in the "quasi-normal" profile (chromosomes 3, 5 and 17) after controlFREEC processing were masked for CNA cut-off determination and LGA counting. The number of LGAs according to those dilutions is represented in Supplementary Figure S9B. shallowHRD presents relatively stable results with mild variation in LGA counts even for high number of sequential dilutions in good quality cases.
The tumor content estimation was based on the initial tumor content from SNP-array and calculated as proportion of mapped reads in the undiluted sWGS and the diluter sWGS. Estimations of tumor content inferred with ichorCNA (designed for cfDNA) were taken for comparison.
Even though sWGS show stable results around very low tumor content (~0.1), 0.3 could be considered as a good limit for the method application. Tumor cellularity is not directly assessed in shallowHRD, but rather taken into account to some extent in the automatic quality control procedure.

Figure S1. Distribution of pairwise differences between segment medians in CNA profile
An example of density plot of pairwise differences between segment medians is shown (only segments > 3Mb were considered). The first (grey) pick corresponds to fluctuations in segment medians related to the same copy number; the second (yellow) pick corresponds to fluctuations around the one copy difference; the third (light-blue) pick corresponds to the difference in two copies, etc. The first minimum M is detected (yellow vertical line). Here M corresponds to CNA cutoff used to optimize copy number segmentation and define genomic alterations. A prominent M evidences high signal to noise ratio in CNA profile and pure copy number states (without subclones).

Figure S2. Two parameters characterizing quality of CNA profile
In-house sWGS CNA profiles (69 cases) manually annotated as of "bad", "average" or "good" quality were characterized by 2 parameters: M defining CNA cut-off, and cMAD characterizing intra-segmental variation. These two parameters could be considered as sWGS quality markers. Two thresholds were defined: cMAD=0.14 and M=0.45 for automatic attribution of sample quality.
Bad quality cases represented mainly sequencing failure independent of coverage, with frequent (n=5) poorly detectable local minimum in M, separating fluctuations of segments with equal copy numbers from one copy difference. Average quality was mainly due to a low coverage (0.06-0.3X) displaying high fluctuations in the number of reads per window characterized by cMAD.
FFPE samples were among "good" (n=3) and "average" (n=1) quality regarding the thresholds, while two cases were actually annotated manually as "average" and "bad" (the latter due to low tumor content) (see Fig.S5 for details).

I.
II. Figure S3. Reports of shallowHRD I. PDX of breast cancer with BRCA1 germline mutation; II. Primary ovarian tumor with unknown status of BRCA1/2 shallowHDR report contains the following items: A. Tumor genomic profile with LGAs indicated in green.
B. Density plot of pairwise differences between large segments used to define the CNA cut-off.
C. Visual representation of the final segmentation, where segment medians were ordered and represented by the dots. Clear stepwise profile evidences high signal to noise ratio and proper segmentation (good quality, panel I); fuzzy profile with blurred steps evidences high unspecific variation in CNA medians with ambiguous copy number levels (average or poor quality, panel II).

D. Quality and Homologous Recombination Deficiency diagnostics including M, cMAD and
LGA number with HRD status.

A.
B. Figure S4. Consistency in the large CNA segments in sWGS and in SNP-arrays A. Segmented CNA profiles on SNP-array (upper panel) and sWGS (lower panel) of the in-house tumor sample. Segmentation for SNP-array was optimized to absolute copy numbers using GAP method (Popova, et al., 2009) and sWGS profile was optimized by shallowHRD using CNA cut-off. Segments were considered consistent if they were both ≥10Mb in size and their boundaries were within 3Mb. sWGS CNA profile reproduced 86% of the large segments detected by SNP-arrays. B. Overall large segments consistency (estimated as described in Figure S4A) in 8 in-house cases and 79 TCGA down-sampled WGS processed by shallowHRD and SNP-arrays. Red dots are cases automatically detected of average quality by shallowHRD. Figure S5. FFPE profiles from sWGS analyzed by shallowHRD sWGS profiles of four FFPE cases analyzed by shallowHRD are shown. Segments in green correspond to LGA. The entire segmentation is indicated in red for the profile A because no LGA was detected. Profiles A, C and D are detected as "good" quality while the profile B is detected as "average" quality. Manual annotation classified sWGS of profile A as "bad" because of a low tumor content and profile B as "average. Samples B and C were correctly predicted as nonHRD and HRD (BRCA2-/-), respectively. Sample D with unknown status was predicted as nonHRD.
Overall, the limited number of cases does not allow us drive definitive conclusion but support that sWGS and therefore shallowHRD is applicable for FFPE cases. Moreover, several studies, including a pilot study for the 100,000 Genomes Project, investigated the use of FFPE samples for sWGS and presented good results for FFPE with WGS and CNAs interpretation (Chin, et al., 2018 ;Robbe, et al., 2018;Scheinin, et al., 2014 ).

Number of
LGAs in down-sampled WGS versus the number of LSTs in SNP-arrays is shown for 79 TCGA cases. The most discordant case, circled in red, is characterized by a high number of copyneutral Loss Of Heterozygosity (see Fig. S9). Detailed information is summarized in Supplementary table S1.

A.
B.

Figure S7. Tumor with high number of copy-neutral LOH
A. SNP-array copy number profile mined by GAP (Popova, et al., 2009) with numerous large-scale breakpoints detected due to copy-neutral Loss Of Heterozygosity (LOH) (ID: TCGA-EW-A1J5-01A). Top panel represents B-Allele Frequency; bottom panel represents Log ratio related to copy number alteration profile and absolute copy numbers detected by GAP software in the middle (red segments correspond to LOH). B. Down-sampled WGS profile of the same tumor analyzed by shallowHRD with a few copy number breakpoints recognized, which leaded to nonHRD prediction.

Figure S8. Example of in silico dilution series
Tumor with almost flat CNA profile (A), tumor with HRD (B) and in silico dilutions (C, D) of the tumor with HRD by the tumor with flat profile used as a "quasi-normal" counterpart.
A. In-house tumor used to make serial dilutions considered as a "quasi-normal" case. The chromosomes 3, 5 and 17 bear CNAs in this tumor and were masked from the analysis by shallowHRD, as detailed in Supplementary Notes. No LGA was found in this case and all segments of the final segmentation after shallowHRD processing are represented with red lines. Figure S6B) with SNA-array estimated tumor content 0.7.

B. The tumor with HRD (shown in Supplementary
LGAs are indicated with green lines. C. The tumor with HRD (B) diluted twice with "quasi-normal" case (A) and estimated tumor content 0.28. Tumor content was estimated regarding the initial tumor content and the proportion of reads of the tumor and "quasi-normal" sWGS.

D.
The tumor with HRD (B) diluted seven times with "quasi-normal" case (A). Tumor content in this case is estimated to be 0.11.

A.
B.

Figure S9. Tumor content and performance of shallowHRD
A. Tumor content and sample quality were shown for down-sampled WGS TCGA cases (n=79). Tumor content was taken from the corresponding SNP-arrays as estimated by GAP method (Popova, et al., 2009) and the quality assessment was automatically produced by shallowHRD. One case with tumor content of ~0.3 was of average quality with a low concordance between sWGS and SNP-array. Four cases of good quality had a tumor content of 0.4 and worked nicely.

Supplementary
1 : Low quality WGS due to high unspecific variation failed to be detected automatically.

2
: BRCA2-/-can have in some rare cases low number of intra-chromosomal breaks.

3
: Highly altered genome with no HRD evidence found (still may be HRD).

4
: Ploidy of 4 for this case, accessible with SNParray, helping classifying the case as nonHRD.