Pros and cons of HaloPlex enrichment in cancer predisposition genetic diagnosis

Abstract Panel sequencing is a practical option in genetic diagnosis. Enrichment and library preparation steps are critical in the diagnostic setting. In order to test the value of HaloPlex technology in diagnosis, we designed a custom oncogenetic panel including 62 genes. The procedure was tested on a training set of 71 controls and then blindly validated on 48 consecutive hereditary breast/ovarian cancer (HBOC) patients tested negative for BRCA1/2 mutation. Libraries were sequenced on HiSeq2500 and data were analysed with our academic bioinformatics pipeline. Point mutations were detected using Varscan2, median size indels were detected using Pindel and large genomic rearrangements (LGR) were detected by DESeq. Proper coverage was obtained. However, highly variable read depth was observed within genes. Excluding pseudogene analysis, all point mutations were detected on the training set. All indels were also detected using Pindel. On the other hand, DESeq allowed LGR detection but with poor specificity, preventing its use in diagnostics. Mutations were detected in 8% of BRCA1/2-negative HBOC cases. HaloPlex technology appears to be an efficient and promising solution for gene panel diagnostics. Data analysis remains a major challenge and geneticists should enhance their bioinformatics knowledge in order to ensure good quality diagnostic results.


Introduction
Individuals with hereditary predispositions to cancer are at an increased risk of developing specific cancers compared to the general population. Patients are usually evaluated on the basis of family history and/or individual criteria (age at diagnosis, tumor histology) followed by cascade testing for the most likely genes. However, in negative cases, with a complex family history and genetically heterogeneous diseases, cascade testing by Sanger sequencing appears to be time-consuming and expensive. With constant progress in next-generation sequencing (NGS) technologies and corresponding decreased costs, many diagnostic laboratories have been shifting from Sanger sequencing platforms to higher throughput NGS platforms [1][2][3]. NGS technologies now allow simultaneous analysis of multiple susceptibility genes for series of patients by gene panel sequencing. Typically, hereditary cancer panels include highly penetrant as well as moderately penetrant genes with known clinical utility and for which clinical guidelines concerning prevention or early detection have been established [4]. These genes are called -actionable genes‖. The gene panel approach has obvious advantages compared to Sanger sequencing: increased throughput, decreased delays, optimized molecular diagnosis in patients with a family history suggestive of an inherited susceptibility to cancer.
The need for high-throughput technologies is also increased by the development of personalized medicine, which tries to use targeted therapies with improved selectivity and efficacy in preselected patient cohorts based on gene analysis. One example of molecularly targeted therapy is inhibition of poly (ADP-ribose) polymerase (PARP) enzyme by small molecule inhibitors in tumors harboring BRCA1 or BRCA2 mutations. Treatments such as olaparib (which has recently been approved for ovarian cancer therapy by the FDA and European commission in patients with platinum-sensitive, recurrent, high-grade serous ovarian cancer with BRCA1 or BRCA2 mutations) [5,6] forces diagnostic laboratories to provide even more rapid diagnostic delivery.
Up until recently, library preparation time really constituted the rate-limiting step in this approach, with the exception of multiplex PCR which is fast but associated with other issues that increase with target size (number of tubes, large genomic rearrangement analysis, and coverage). It is critical to find a method which is sufficiently rapid to allow a suitable waiting time for patients, while at the same time allowing large gene panel enrichment with high diagnostic quality criteria in terms of sensitivity and specificity.
We have consequently designed and tested a HaloPlex (Agilent, Santa Clara, USA) custom gene panel including all 62 genes studied in our laboratory that are involved or suspected to be involved in several diseases: Hereditary Breast and Ovarian Cancer (HBOC), digestive cancer, retinoblastoma, DICER1 syndrome, ataxia-telangectasia, Fanconi anemia and Bloom syndrome. This panel was composed of actionable genes, moderately penetrant genes but also -research‖ genes i.e., with no known clinical validity. Libraries were then sequenced on a HiSeq (Illumina, San Diego, USA) sequencer and data were analysed with our academic bioinformatics pipeline and, in some cases, with the NextGENe software (SoftGenetics, State College, USA). The procedure was first tested on a training set of 71 challenging controls samples and then blindly validated on 48 samples. HaloPlex technology was found to be compatible with diagnostic requirements

Patients
All patients attended an interview with a geneticist and a genetic counsellor in a family cancer clinic in Institut Curie, Paris, France. Genetic testing was proposed on the basis of the patient's personal history and/or family history in relation to various clinical presentations: breast and ovarian cancer, digestive cancer, retinoblastoma, DICER1 syndrome, ataxia-telangectasia, Fanconi anemia and Bloom syndrome. Informed consent was obtained from all patients or their legal guardians. DNA was extracted from leucocytes using the Quickgene 610-L automated system (FujiFilm, Tokyo, Japan) according to the manufacturer's instructions. A series of 119 patients was studied: 71 as a training set and 48 as a diagnostic set.
The training set was composed of 71 patients previously Sanger sequenced (or by multiplex ligation-dependent probe amplification analysed-MLPA) and harboring 67 representative variations and 98 polymorphisms were used as controls. To adequately address diagnostic issues, this training set was composed of difficult cases (indels, large rearrangements) and at least, one mutation for each gene concerned by our clinical diagnostic activity.
The diagnostic set included 48 consecutive cases from our family cancer clinics who had been previously tested negative for BRCA1/2 mutations and at high risk of cancer genetic predisposition based on personal or family cancer history.

Library preparation and sequencing
A custom 62-genes panel was created using Suredesign software (Agilent, Santa Clara, USA) ( Table 1). Region of interest (ROI) was defined as coding sequences and flanking splice consensus sequences from genes of interest (padding: −20/+10 bp) with 13,266 different amplicons [7]. Three successive designs were necessary to obtain satisfactory coverage of the complete ROI, especially for BRCA1 and BRCA2 coding sequences. Additional genes were introduced in the course of these successive designs (Table 1). Target enrichment was performed according to the manufacturer's instructions. Briefly, DNAs were fragmented using a cocktail of 8 restriction enzymes, and denatured. A probe library was added and hybridized to targeted fragments. Each probe was an oligonucleotide designed to hybridize to both ends of a targeted DNA restriction fragment, thereby guiding the targeted fragments to form circular DNA molecules. The probe also contained a method-specific sequencing motif and a sample barcode, both incorporated during circularization. HaloPlex probes are biotinylated and targeted fragments can therefore be retrieved with magnetic streptavidin beads. The circular molecules were then closed by ligation, ensuring that only perfectly hybridized fragments were circularized. Only circular DNA targets are amplified, providing an enriched and bar-coded amplification product that is ready for sequencing. All libraries of target-enriched pooled DNA were analysed on LabChip (Caliper, PerkinElmer, Waltham, MA, USA) to assess successful enrichment, demonstrating a smear of amplicons ranging from 50 bp to 500-600 bp with a mean at 200 bp. Following enrichment, samples were sequenced on a HiSeq2500 (Illumina) with the fast module using the 150 paired-end chemistry according to the manufacturer's instructions.

Bioinformatics analysis
Both commercial (NextGENe, SoftGenetics) and academic solutions were used as described below. However, NextGENe® v.2.3.1 (SoftGenetics, State College, PA, USA) was available to analyse 2/3 of the training set, making comparison between academic and commercial solutions difficult. Default settings for paired-end Illumina data were used to filter and trim the raw data. The adapter sequence was trimmed at the same time according to a text file. Parameters for alignment and mutation detection were: paired read analysis, -Allowable Mismatched bases‖ option set to 0, -Allowable Ambiguous Alignments set to 50, 35 bp seed, 7 bases move step, 85% matching base percentage, -detect large indels‖ option on, 15% mutation percentage.
In addition, we designed our own academic bioinformatics pipeline and applied it on all patients. Index demultiplexing and generation of raw data were performed with Illumina Consensus Assessment of Sequence and Variation (CASAVA) software (v1.8.2, Illumina, San Diego, CA). Adapter trimming was performed with SeqPrep software (v1.0) (https://github.com/jstjohn/SeqPrep). Mapping was performed with Bowtie2 (v2.1.0) [8] on human hg19 reference genome using the sensitive mode. Insert size for valid paired-end alignments was set between 0 and 600 bp. Data were then processed using MPileup (Samtools, v0.1.18) [9] targeted on our region of interest with the following parameters: Do not perform Genotype Likelihood Computation, Do not skip anomalous read pairs in variant calling (-A), Disable probabilistic realignment for the computation of base alignment quality (-B), no mapping quality adjustment for reads containing excessive mismatches (-C 0), Max per-BAM depth was set to 10,000 (-d 10000) to avoid read downsampling. Minimum mapping quality for an alignment to be used was set to 0 and minimum base quality for a base to be considered was set to 12. Point mutation calling was performed using VarScan2 [10] with the following parameters: only bases covered by at least 30 reads are considered, 2 of which must carry the alternative variant. The minimum allelic ratio for a variant to be reported was set to 15%. Variants supported by more than 90% of reads on the same strand were included in the analysis. Variant annotation was performed using Annovar (25/10/2013) [11]. Statistics on the ROI coverage were established using two metrics: the percentage of bases covered by at least 30 reads specified by isoform for each gene and the number and localization of bases covered by less than 30 reads.
Indels of intermediate size (5 bp and larger) were called using Pindel (v0.2.5) [12] with already aligned data. Pindel first extracts reads indicating a potential or already existing indel and uses this position as an anchor, it then splits reads (as well as their potential unmapped counterpart) into several pieces to try to find a better alignment including an indel. Pindel was run with default parameters apart from the following two parameters: in this setting of very deep sequencing, the minimum number of reads supporting an event to be reported was set to 10 and the -insert-size‖ (defined as the length of sequence between the paired-end adapters) was set to 500bp. Only variants with an allelic ratio greater than 5% were reported.
To manage large genomic rearrangements (LGR) detection, the R package DESeq [13] was used to normalize read counts and estimate fold change per sample and window, fitting a generalized linear model to these normalized counts. In order to distinguish a gene deletion on the X chromosome in a woman from a man with only one copy, the patient's gender was taken into account for genes located on the X chromosome, dividing the analysis into two sub-analyses. Fold changes thresholds were then estimated on the basis of validated data and were used to highlight potential events.
Analysis filters and pipeline were established and tested using the training set and were then used to call variants in the diagnostic set. All mutations are reported according to the Human Genome Variation Society (HGVS) guidelines. References of coding sequences used for each gene are reported in supplementary Table 1.

Sanger sequencing confirmation
All point mutations passing these filters in the diagnostic set were confirmed by Sanger sequencing using Big Dye Terminator on the ABI 3500XL (Life Technologies, Carlsbad, USA). No LGR needed to be confirmed in the diagnostic set (none were detected).

Data availability
All detected mutations were registered to relevant data bases such as UMD data base for breast or colon cancers or LOVD for Fanconi anemia. Sequencing data and reads have already been provided to a national NGS consortium (INCa DGOS/NGS network). At the end of the project, data sould be made available for global community.

Results
We studied DNA samples from 119 different patients: 71 previously characterized samples were used to evaluate library preparation and define the appropriate settings for the bioinformatics analysis. Forty-eight clinical samples referred for testing were analysed. Three separate runs containing either 44 or 43 samples were performed on the HiSeq 2500 Sequencing System with the 150 bp paired end sequencing module. Each HiSeq run produced an average of 550 million reads. Run time was 40 hours. The average on-target ratio was 82%. An average of 200 variants were called for each patient i.e. 3 variants per gene.

Read depth and coverage
No significant decrease in coverage was observed between the three design versions (Table 1) and 99.4% of the target was covered by at least 100X. The average read depth was 3720X with marked heterogeneity between the different genes of our panel with a seven-fold difference between -poor performers‖ (e.g. MDM2 and XRCC2) and -good performers‖ (e.g. FANCA and MUTYH) ( Figure 1). Read depth also varies considerably in different parts of the same gene. In BRCA1, for example, read depth can easily vary from less than 100X (minimum value: 1X) to more than 6000X (maximum value: 10013X) ( Figure 2).

Training set
The 165 variants present in the training set had been previously identified by Sanger sequencing or MLPA. Sixty-seven mutations had a potential or demonstrated biological effect ( Table 2) and 98 were polymorphisms (Supplementary Table S1). Library preparation using HaloPlex technology was successful for all but 1 of the 71 DNA samples of the training set. All causative point mutations and polymorphisms were identified using the Varscan2 bioinformatics pipeline (Supplementary Table S1).
On a large set of indels ranging from 1 to 50 bp, rearrangements longer than 20 bp could not be detected by either Varscan or NextGENe. The first missed indel was a 23 bp duplication in the RB1 gene (g.2113_2135dup, p.Pro26Argfs*47) and the second missed indel was a 50 bp insertion in the BRCA1 gene (c.3729_3730ins50). Pindel with read alignment was therefore used and all indels were detected.
The allelic ratios usually observed with constitutive heterozygous mutations are close to 0.50, but, in this training set, several true constitutive heterozygous mutations were detected with allelic ratios ranging from 0.2 to 0.67 (Table 2). Similarly, several real constitutive heterozygous indels were detected with allelic ratios ranging from 0.18 to 0.8 (Table 2).
A high rate of false recurrent SNVs was observed in our results (recurrence ranging from 25 to 100% of samples and allelic ratios ranging from 1 to 20%). They were always located at the read extremities (in either the forward or reverse reads). Read trimming was then used to avoid these recurrent false positives (see Discussion).
Varscan2, Pindel and NextGENe were unable to detect LGRs. The addition of DESeq in our pipeline allowed the detection of all LGRs, but with poor specificity, preventing its use in diagnostics. For example, in BRCA1 LGR analysis, we observed 9 false duplications and 10 false deletions in the first training run of 44 samples in addition to the only true BRCA1 LGR present in this set (exons 3 to 8 duplication).

Diagnostic set
Gene panel analysis was performed for 48 HBOC patients without BRCA1/2 mutation. Mutations were detected in 4 (8%) patients in ATM, BRCA2, FANCA, FANCM and PALB2 genes; the FANCM and PALB2 mutations were present in the same patient. Likely deleterious variants were also detected in 4 (8%) patients in BLM, BRIP1, CHEK2, FANCG and NBN; the BRIP1 and CHEK2 variants were present in the same patient (Table 3).
Two RB1 mutations in the 131 patients were missed by NextGENe, but this problem was resolved by modifying the -allowable mismatch base‖ from 0 to 2.

Discussion
We describe and validate a HaloPlex-based diagnostic pipeline applied to a gene panel. Some technical comments and guidelines for implementation and use based on our experience are discussed below.
The wet lab part of the protocol is easy to implement, as it consists of kits that are easy to handle by any molecular biology laboratory. According to the manufacturer, major protocol steps can be automated. The HaloPlex target enrichment system was able to very rapidly capture and sequence the genomic regions of interest of 62 genes (4 days for a 44-sample library preparation).
DNA capture is performed by DNA fragment circularization with custom probes after enzymatic fragmentation (eight different restriction enzymes). Enrichment by circularization considerably facilitates library preparation for NGS, as sequencing primers can be added to the circularization probe, thereby eliminating the need for any further library preparation steps [14]. This technology achieves high specificity and output for library preparation from small DNA quantities. As a result it could be relevant for PARP inhibitor therapies as BRCA sequencing is also performed with DNA extracted from FFPE materials. Nevertheless, optimization would be needed.
However, using Haloplex technology also introduces several biases, especially concerning the homogeneity of gene coverage. These two aspects will be discussed below.
HaloPlex appears to be a specific library preparation technology: 82% of reads were mapped on target. This probably results from the combination of restriction enzyme and circularization by probe hybridization. Three different designs were necessary to achieve satisfactory gene coverage, especially by the addition of probes in BRCA1 and BRCA2 genes. Of note, no detrimental impact of the additional probes on the previous design was observed.
However, HaloPlex technology induces considerable read depth heterogeneity between the various sequences of interest due to the use of restriction enzymes for DNA fragmentation with the constraints of corresponding restriction maps. Gaps in coverage will therefore be observed when the distance between two restriction sites is longer than the read length. Inserts designed to be > 300 bases are underrepresented in the sequencing data and this under-representation results in coverage gaps when there is insufficient redundancy over the region [15]. GC content did not impact coverage in this gene panel design. Mutations previously identified by sequencing are reported with their corresponding RefSeq. 1 Nomenclature was numbered on the basis of the following transcripts, 2 Mutation nomenclature according to HGVS recommendations, nucleotide position was numbered with +1 corresponding to the A of the ATG of the translation initiation codon , 3 Expected consequence on the protein level, d Allelic ratios are defined as the ratio of the non-reference allele to the sum of the non-reference allele and the reference allele. Allelic ratio for heterozygous variants should be centered around 0.5. Abbreviations: LGR, large genomic rearrangement; NA, not applicable; AR, allelic ratio   Moreover, probe hybridization regions are limited to a few bases so that, when mismatch occurs, inserts are dropped out and read depth heterogeneity increases at the corresponding target sequences. The HaloPlex design includes redundancy, so targets should be covered even when inserts are dropped out, but marked read depth heterogeneity implies a greater sequencing capacity. A technology that typically has high specificity and uniformity will require less sequencing to generate adequate coverage of sequence data for the downstream analysis, making sequencing more economical. Although HaloPlex provided good specificity in terms of library preparation, coverage heterogeneity constitutes a weak point.
These insert drop outs explain why real constitutive mutations were called at a ratio different from the expected allelic ratio of 0.5, which can range from 0.18 to 0.8, resulting in a marked bias. Moreover, HaloPlex probes hybridize on one strand generating large insert sizes. Consequently, strand biases cannot be used with 150 bp paired-end sequencing to distinguish true and false positive variants [2]. We therefore decided to set the detection threshold in our bioinformatics pipeline to 0.15 for constitutive mutations and 0.05 for mosaicism detection for three specific genes in routine diagnostics (RB1, DICER1 and APC) to avoid missing any real mutations. As HaloPlex technology hybridizes only one DNA strand, recurrence criteria were then applied to filter the variants detected with these thresholds.
Our design also included FANCD2 and PMS2, but sequencing quality was too low for diagnostic procedures. Two polymorphisms were indeed missed in FANCD2 and it can be explained by a well known weakness of bioinformatic analysis i.e., the presence of 2 FANCD2 pseudogenes [16]. The same holds true for PMS2. The reason is that pseudogenes are amplified and sequenced simultaneously and corresponding reads are mixed at the mapping steps. Consequently, variants are called at a low allelic ratio. To avoid this specificity issue, we usually use long-range PCR (7 superamplicons from 2.5 kb to 8.9 kb designed with specific primers) followed by Sanger sequencing to analyse FANCD2.
We also tried to exclude pseudogenes from the mapping bed file in order to restrict mapping to FANCD2. However, if true variants were correctly found, specificity was too poor, preventing its use in diagnosis. Comparison of the data obtained with this specific approach and FANCD2 long-range PCR results could be used to define a list of pseudogene variants. Then these variants could be systematically subtracted from NGS analysis in order to focus on specific FANCD2 variants. Consequently, althought FANCD2 and PMS2 are included in the panel design, results were not reported due to low reliability.
Recurrent false-positives were observed at the read extremities (on both forward and reverse reads) and appear to be related to adapter remains, even after using adapter removal software. To improve the quality of the trimmed read, Gré en et al. proposed further trimming of sequence reads by five bases at the 3' end [17]. We tried another approach consisting of a two-step trimming procedure: adapters were first trimmed, followed by trimming of each single extremity nucleotide of sequence reads. The number of recurrent false-positives was considerably reduced by this procedure with no impact on coverage. Recent optimization of the SureCall software version 1.1.0.15 (Agilent technologies) should also overcome this issue [18].
All point mutations were detected on the training set. However, the same does not apply to larger indels. Small variants can be detected in NGS data by VarScan2 by allowing mismatches and gap opening during read alignment on the reference genome. Each alignment is defined by a score that is calculated on the basis of the number of bases that match correctly on the genome, but also on the number of mismatches and gaps, as well as their size. Inserting a gap in an alignment decreases the alignment score and it is accentuated with its length. If the alignment score decreases below a defined threshold, alignment is not reported. It is therefore very difficult to detect insertions or deletions larger than 20 bp with a tool dedicated to call small variants on certain specific alignment data without dedicated processing steps. In fact, reads that differ excessively (by more than 20% of the sequence) from the reference are usually trimmed. Indels were therefore often partially observed at read extremities, but in insufficient proportions to be detected, which is why we chose to use Pindel in addition to the -VarScan2‖ bioinformatics pipeline to detect all indels [12]. LGR detection required the use of another bioinformatics tool, which is why we used DESeq to improve our pipeline. Read counts can be compared between samples in order to detect large rearrangements in this particular design, based on the hypothesis that a rearrangement event is unlikely to be recurrent inside a run and that, as this constitutional level, the expected copy number is 2. To increase sensitivity and in order to detect all of the training set LGR, read counts were calculated per exon and large exons were split into 300 bp windows using BEDtools (V2.21), but real variants were then mixed with many false LGR. Poor performers i.e. samples with low coverage below 30X or with low DNA quality, were subsequently removed from analysis, but without resulting in any major improvement. To further enhance the throughput of this technology, the detection capacity for large rearrangements must be developed in the future.
Overall, our study shows that the combination of Varscan2 and Pindel analysis on HaloPlex library is efficient for the detection of small and medium-sized mutations (100% sensitivity on the training set). Actually, -real life‖ sensitivity is probably a bit lower as 13 genes didn't reach 100% coverage (i.e., 95.5% to 99.9%, see Table 1).
LGR detection remains an issue in clinical practice: althought DESEq correctly identified all LGR, lack of specificity would mandate confirmatory MLPA in all patients, prohibiting its use in routine practice.
This process is available in the Galaxy framework [19][20][21]. It must be stressed that important bioinformatics resources are needed to process and store data compatible with diagnostic practice i.e. with fast turn-around time and secure storage.
As previously described, we found that distinct bioinformatics parameters had a marked impact on the results [22]. Validation of the results must be based on a good understanding and better control of data analysis pipelines. Biologists must adopt a critical approach and mistrust black box solutions.

Conclusion
We show that the HaloPlex technology is compatible with oncogenetic diagnostic activity. One single process to sequence all of our genes of interest at the same time represents a major improvement in the laboratory organization by being less time-and personnel-consuming. This process can also be more efficient by proposing actionable gene analysis in addition to analysis of the principal genes analysed in the clinical setting.
Using the whole gene panel in unexplained family cancer histories also represents a major improvement, as we were able to detect 4 new mutations in BRCA1/2-negative cases. This study also demonstrates that data analysis remains a major issue. Geneticists must be actively involved in data analysis based on a good understanding of bioinformatics pipelines to avoid reporting poor quality results.