Introduction

Application of array-based genome-wide copy number investigation to patients with idiopathic mental retardation or developmental delay (MR/DD), with or without multiple congenital abnormalities (MCA), has identified pathogenic segmental aneuploidies in up to 19% of consecutive referrals.1, 2, 3 De novo deletions in sporadic patients are generally believed to be pathogenic. In contrast, genes within an inherited deletion may not provoke a phenotypic effect by mere haploinsufficiency, as these genes were also found in a single copy in a healthy parent. However, such inherited deletions may still contribute to disease when combined with rare de novo or inherited damaging alleles on the other chromosome. Therefore, screening for variants in regions corresponding to inherited hemizygous deletions may allow us to identify genes that exert a phenotypic effect by independent loss of both alleles.

Assuming this recessive mode of inheritance, genes within a deletion from a healthy parent may be pathogenic if a second mutation within the deleted region has been transmitted by the other, equally healthy parent or has arisen de novo. This recessive mechanism was initially proposed by Hatchwell.4 A literature survey shows that in multiple cases pathogenic single-nucleotide variants (SNVs) combined with a de novo deletion result in aberrant phenotypes (Supplementary Table 1). However, only few cases demonstrated unmasking by an inherited deletion (Supplementary Table 2). Whether ‘unmasking heterozygosity’ is a frequent mechanism for abnormality’ is being debated.5, 6 The recent flurry of publications regarding losses of a single or several exons in selected genes by using targeted arrays underscore that current routine diagnostic methods may miss potentially pathogenic copy number variations (CNVs).7, 8, 9, 10, 11, 12, 13, 14, 15

In this study, we investigated whether unmasking of recessive alleles may contribute to the phenotype of patients with idiopathic MR/MCA by analyzing all coding sequences in hemizygous deletion regions of patients who inherited a novel deletion from a healthy parent. To do so, we performed multiplexed genomic enrichment and next-generation sequencing of the entire coding sequence of all genes in the deletion regions of 20 patients. We identified eight SNVs that were located exclusively within the corresponding deletions of patients. In one of the patients the inherited deletion unmasked another small-inherited deletion on the second allele, which comprised the heat shock factor binding protein 1 (HSBP1) gene. Taken together, this approach provides a sensitive method to detect mutations on the second chromosome within hemizygous deletions.

Materials and methods

The procedures for selection of patients with an inherited deletion and the molecular investigations performed16, 17 are described in Supplementary Note 1 and Supplementary Tables 3 and 4.

Enrichment array design

For each patient, the maximum deletion was defined as the interval between the non-deleted array-probes flanking the deletion. When a gene was interrupted by a breakpoint, all exons of that gene were included. Using the strategy developed by Mokry et al18 we designed arrays of 60-mer probes tiled with an off-set of 10 nt covering the entire coding sequence of all genes in the deletion regions according to the Ensembl database (www.ensembl.org) using the GRCh37/hg19 build of the human reference genome. One capture array was designed for the deletion intervals of patients 1, 9, 10, 12, 13, 14, 18, 19, 20 and a second capture array was designed for patients 2, 3, 4, 5, 6, 7, 8, 11, 15, 16 and 17. Probes were synthesized on 1 M custom Agilent arrays.

Library preparation, enrichment and massively parallel DNA sequencing

For each patient, 1 μg genomic DNA was used as input for DNA shearing to fragments of 100–120 nt in length. Library preparation was performed as described by Mokry et al.18 After ligation of short adaptors (adaptor 1 and adaptor 2, Supplementary Table 4), SOLiD library barcodes (barcodes 1–20, SOLiDv3.5 manual, Life Technologies, Carlsbad, CA, USA) were introduced by 5–8 PCR cycles (100 ng adaptor-ligated DNA, 1 ml Platinum PCR Supermix (Invitrogen, Carlsbad, CA, USA), 1 μl Pfu Turbo DNA polymerase (Promega, Madison, WI, USA), 6 μl of short-P1-primer (50 μ M) and 6 μl of barcode primer 1-20 (50 μ M). Libraries were pooled after size selection (one pool for each array) and 3 μg pooled DNA was mixed with 15 μg C0T-1 DNA (1 mg/ml) and 1.5 μl of each barcode blocker oligo (barcode blocker 1 and 2, 10 μg/μl each) and hybridized on Agilent capture arrays as described by Mokry et al.18 Post enrichment PCR was performed using 1 ml Platinum PCR Supermix, 1 μl Pfu Turbo DNA polymerase, 6 μl of P1-primer and 6 μl of P2-primer for 12 cycles. Enriched libraries were sequenced on a single slide of a SOLiDv3.5 sequencer following standard procedures (Supplementary Table 5).

DNA sequence analysis and mutation identification

Sequencing reads were mapped against the reference genome (GRCh37, ensembl59) using the Burrows-Wheeler Alignment Tool19 and all SNVs and short indels up to 7 bp were compiled and annotated using in-house developed single-nucleotide polymorphism-calling (SNP) algorithms using the ensemble API release 59.20 In this compilation, SNVs mapping within the deletion region of each patient were selected for confirmation by capillary sequencing, following standard protocols. First, we compiled for each DNA sample all SNVs with a non-reference allele call >90%. Second, we compared SNVs between DNA samples, and third, we retained those SNVs, which occurred only once in the samples studied and only within the cognate deletion region of each patient. Confirmed SNVs were subsequently traced in the DNA of the parents. In addition, by capillary sequencing loss of exon 3 of the HSBP1 gene was confirmed in the affected child. The primer pairs used for confirmation by capillary sequencing are listed in Supplementary Table 6. All confirmed SNVs were evaluated for their phenotypic impact using the Grantham matrix score,21 Genomic Evolutionary Rate Profile (GERP),22, 23 PANTHER24 and PolyPhen2.25 European population frequencies were given for variants with data other than 1000 genome pilot study. We used bioinformatics tools to identify copy number changes based on read coverage deviations from normalized control levels by normalizing the coverage level of each nucleotide for the haploid patient DNA versus the mean coverage level of the other 8 or 10 diploid DNA samples in the same pool by a custom script written in PERL. The detected regions with significant (more than two standard deviations from the mean) coverage abnormalities were supported by DWAC-Seq (http://dwacseq.hubrecht.eu).

Results

Detection of SNVs in hemizygous deletions

From a cohort of 1500 patients with MR/MCA we selected 19 patients with a novel hemizygous deletion inherited from a healthy parent and 1 patient who had a de novo hemizygous deletion (Supplementary Table 3). All 20 DNA samples were, in two pools of 9 and 11 samples, respectively, subjected to multiplexed enrichment and SOLiDv3.5 sequencing to an average coverage depth of 256 × (Supplementary Table 5). Sequence reads were mapped to the human reference genome (GRCh37/hg19) and 703 SNVs with a predicted effect at the protein level were identified in the individual samples with in-house developed SNV-calling software.20 In addition we found 36 indels of maximally 7 bp, none of which confirmed to the criteria of being potentially pathogenic as described above (Materials and Methods). For three samples the SOLiD-based SNVs were compared with SNP data generated with Illumina Infinium HumanHap300 Genotyping BeadChips. For 240 (98.4%) out of the 244 SNPs covered by both platforms full concordance was found, while for 4 SNPs (1.6%) the platforms disagreed. For these four SNPs, both platforms disagreed for these being either homozygous non-reference or heterozygous. In total, the SOLiD-based data contained 675 non-synonymous SNVs calls, 15 stops gained and 13 stops lost (GRCH37/hg19); 642 SNVs were listed in dbSNP build 131 and 61 SNVs were novel. All stops gained and lost were listed as known polymorphisms in dbSNP and none of the variants were listed in databases of known pathogenic variation (eg, HGMD).

In seven patients we found eight non-synonymous SNVs mapping exclusively to the cognate deletion regions, which we subsequently confirmed by capillary sequencing (Table 1). These SNVs showed clearly homozygous calls and seven were known SNPs (according to dbSNP build 131). To assess whether the known SNPs were likely to be compatible with healthy life we determined the frequencies of homozygous calls of this particular allele in the European population as recorded in dbSNP. For three SNPs (rs5743820, rs2227983 and rs4149056) such data were available; they indicate that these SNPs have been found at frequencies ranging from 0.0016 to 0.2649 for the homozygous state in the healthy European population. These limited data are consistent with these SNPs being in Hardy–Weinberg equilibrium. Of note is that none of the genes, in which these SNPs were found, has previously been found mutated in patients with MR/MCA (see Supplementary Table 4). Interestingly, a novel SNP in the FAM178B gene in a severely dyslexic patient (patient no. 13) with a deletion of this gene inherited from his mother, resulted in the same genotype in the patient at this particular position in the FAM178B gene as that of his language proficient mother. This rules out a pathogenic contribution of this hemizygous C allele in the causation of his dyslexia.

Table 1 Genotypes, inheritance, Grantham matrix score, GERP, PANTHER and PolyPhen2 evaluation of all SNVs located exclusively to the cognate deletion regions of the patients

To assess whether these SNVs may contribute to a clinical phenotype, we determined their putative impact upon protein structure and function as reflected by the Grantham matrix scores.21 All SNVs had Grantham matrix scores in the low range, suggesting an only minor impact. To assess the degree of evolutionary conservation of a SNV we determined the GERP score.22, 23 Negative GERP scores reflect no evolutionary conservation, whereas positive scores are in agreement with a pathogenic impact of a SNV.22, 23 All SNVs, except those in the ZNF800 and the OSGIN1 genes showed low-positive GERP scores, suggesting little likelihood of pathogenic impact. In contrast, PolyPhen2,25 which combines the impact on protein structure and function with the level of evolutionary conservation, predicted a highly damaging effect from the T to C transition of rs4149056 in the SLCOB1 gene, while PANTHER24 returned no score and was thus non-informative. Thus, this SNV with a Grantham Matrix Score of 64, a GERP score of −1.19 and occurring in the homozygous state in the European population at 10.66% was judged ‘damaging’ with a likelihood of 0.993 by PolyPhen2 (Table 1). In summary, currently available algorithms predicting impact on protein structure and function, determining evolutionary conservation and combining multiple parameters did not return consistent predictions regarding pathogenicity of the eight non-synonymous SNVs in our data set.

Detection of hemizygous deletions by coverage depth analysis

Subsequently, we used bioinformatics tools to identify copy number changes based on read coverage deviations from normalized control levels.26 Thus, plots of the deletion regions flanked by diploid regions were obtained (Supplementary Figure 1). Although 11 DNA samples showed approximately half the coverage depth relative to the other samples from the same pool for the deletion region, 9 samples showed erratic patterns. Coverage signals approaching the diploid level within deletion areas corresponded with segmental duplications or indels (results not shown). Segmental duplications and indels were not detected by BAC-, oligonucleotide and SNP arrays, as these regions were not covered by these platforms. As these signals locate in regions of sequence homology with a locus outside the targeted deletion region, they may reflect more complex haplotypes in our patients than represented in the human reference genome.26

Unmasking of an inherited deletion of the HSBP1 gene

The deletion region of patient 15, with a normalized coverage level almost entirely half of that of the flanking regions, showed a small region with close to zero coverage (Figure 1a). This region corresponded to three probes with lowered signals on the oligonucleotide array-CGH (Figure 1b) and contained the entire HSBP1 gene (Figure 1c). PCR-based resequencing of exon 3 of HSBP1 showed complete absence in the affected child, with presence of the normal sized PCR product, indicative of at least one intact copy, in both parents (results not shown). Comparing the oligonucleotide array and the SOLiD coverage depth data the proximal breakpoint of this deletion maps to nucleotide position 83 841 341 (SOLiD) and the distal breakpoint between 83 857 382 and 83 873 030 (oligonucleotide array data). The latter breakpoint could not be mapped more precisely, because of the absence of resequenced exonic sequences in this region. This combined data indicate a bi-allelic deletion of minimally 16 041 and maximally 31 689 bp, which encompasses HSBP1, but no other gene (Figure 1c). We conclude that both alleles of the entire HSBP1 gene have been lost in the patient. As both heterozygous parents were healthy, the absence of both gene copies in the child may be pathogenic when a recessive mode of gene action is assumed.

Figure 1
figure 1

Oligonucleotide array and SOLiD sequencing data for the deletion region between nucleotide position 83 841 341 and 83 857 382 in chromosome band 16q23.3 in patient no. 15 (Supplementary Table 3). Upper panel: Normalized coverage depth for the deletion region. Relative read depths are in arbitrary units as produced as output by DWAC-seq.26 Note the lower read depth for the entire deletion region and the lowered signal in the left part of this region. Middle panel: Oligonucleotide array data. Note the lowered signal for three probes in the left half of the deletion region. Bottom panel: The genome region for which both alleles have been lost in the patient (UCSC browser GRCh37/hg19). Dashed lines demarcate the corresponding deletion boundaries.

Discussion

Array-based segmental aneuploidy detection has become the first-tier genetic test in MR/MCA patients.1, 27 Thus, familial deletions are increasingly being identified in healthy probands, thereby identifying genes that seem refractory to haploinsufficiency. Several mechanisms have been put forward to explain the phenotypic differences between healthy carriers of deletions and patients within a family.28, 29, 30 To investigate unmasking of recessive alleles in MR/MCA we analyzed all coding sequences in hemizygous deletion regions of 20 patients using array-based multiplexed enrichment followed by next-generation sequencing. This focused approach allows the simultaneous identification of SNVs and CNVs in a DNA sample at nucleotide resolution. Furthermore, the multiplexed design of this experiment automatically generates proper controls to efficiently filter out noise and coverage level differences generated by next-generation sequencing.

After evaluation of their inheritance patterns, impact on protein structure and function and evolutionary conservation the eight SNVs exclusively found in the corresponding deletion regions of the patients did not suggest consistent and clear pathogenic effects (Table 1). This is in line with results of the sequence analysis in inherited deletions in other studies, including the 1q21.1 deletion associated with TAR (thrombocytopenia-absent radius) syndrome,31, 32 the 16p13.11 deletion, which is associated with mental retardation and autism,33 the 16p11.2 deletion associated with autism34 and the 16p12.1 deletion associated with childhood DD,35 which all failed to detect pathogenic SNVs. In contrast to our approach, these studies did not include all coding regions in the deletion, but only selected candidate genes.

The single unmasked structural variation identified in our study, is much more likely to contribute to the patient's phenotype, including mental retardation without speech, optic nerve hypoplasia, nasolacrimal duct obstruction, cheilognathopalatoschisis, microcephaly and bilateral hearing loss. In DECIPHER36 a single patient with a de novo hemizygous deletion encompassing, HSBP1 (DECIPHER patient no. 2801) presented with a less complex phenotype. Interestingly, the hemizygous parents of our patient appeared normal. This constellation of homozygous losses in a patient with healthy parents carrying heterozygous losses is similar to that of two patients reported in the literature.37

By interaction with HSF1 (heat shock factor 1, the major transcriptional activator of heat shock genes), HSBP1 has been shown to negatively regulate the heat shock response in C. elegans, in Xenopus tadpoles and in cultures of mammalian cells.38, 39 Knockdown of Hsbp1, which is a part of the AKT1–DNA transcription network, in C57Bl/6 mice diminished neuroblast migration.40 Involvement of HSBP1 in brain development in mice is consistent with a phenotypic impact of nullizygosity for HSBP1 entailing a form of mental retardation and expressive speech disorder as found in our patient. Since to this date, HSBP1 has not been associated with the genetic pathways governing classical speech and language disorders,41 further investigations into its function appear needed.

Population analyses of large CNVs revealed a frequency of 5–10% in healthy individuals for CNVs of 500 kb and larger, which is in the same range as in patients with DD.43 As size and gene content of CNVs are negatively correlated with their population frequency,42 smaller structural variants may occur at even higher frequencies in the healthy population. A preliminary assessment indicates that CNVs of 500 bp and larger occur at a median frequency of more than 1000 CNVs per healthy individual.42, 44 Intragenic deletions, encompassing one to several exons, account for 2–5% of the spectrum of deleterious mutations in Mendelian disorders.7, 8, 9, 10, 11, 12, 13, 14, 15 In case an inherited or de novo intragenic deletion is combined with a deletion of the entire gene transmitted by a healthy parent no intact alleles of a particular gene will be found in the patient, which amounts to a recessive mode of gene action. In agreement with this inference is the finding that numerous patients within the MR/MCA spectrum carry novel or rare CNVs at two or even more loci.16, 17, 36 The latter may further increase the phenotypic variability among patients with disorders resulting from structural genome variations.36 Of note is that in contrast to the patients selected for this study, most reports of unmasking hemizygosity in the literature describe patients who combine two distinct, recognizable syndromes or diseases, such as Prader Willi Syndrome + albinism,45 Angelman Syndrome + albinism,46 Smith Magenis Syndrome + sensorineural hearing loss,47 or Wolf–Hirschhorn syndrome + Wolfram syndrome48 (see also Supplementary Tables 1 and 2), which may relate to the frequency of the mechanism under study here.

Our study of patients with idiopathic MR/MCA did not reveal any compelling evidence, except for the single-gene deletion, for unmasking of obvious gene mutations by inherited hemizygous deletions. This may be due to either the incorrect assessment of candidate genes and/or evaluation of the deleterious impact of the mutations identified in these genes by current prediction algorithms. Furthermore, we assessed only the protein-coding regions and the beginning and ends of introns in the deletion regions and may thus have missed variants that affect functions in the UTRs of the genes, non-coding RNAs, regulatory elements, or splicing efficiency.49 In addition, biases introduced by the enrichment technique as well as mapping of short (50 nt) next-generation sequencing reads result in an uneven coverage of the target regions. Therefore, not every coding base is assayed equally efficiently and some coding variants may thus have been missed. However, our data show that this involves less than 5% of the coding sequences (results not shown). Finally, uniparental disomies and segmental homozygosity of chromosomes in genetically isolated populations,50, 51, 52, 53 mutations at an unrelated locus (see patient no. 18; Materials and Methods) and differences in imprinting levels may also come into play. These possible mechanisms may represent targets for future studies.

Although our analysis of unmasking of inherited mutations by inherited deletions represents a step toward uncovering missed heritability in MR/MCA,54 a recent study has highlighted possible involvement of de novo mutations.55 Such de novo mutations may occur either within or outside a region with a hemizygous deletion and contribute to phenotypes by a recessive or a dominant mechanism, respectively. In cases in which one parent transmits a hemizygous deletion and the other parent carries two homozygous reference alleles, a non-reference allele in the child may be interpreted as a sequencing error, whereas this may reflect a de novo mutation on the allele from the other parent. Similarly, transmission of a deletion from one parent and a heterozygous allele from the other parent in the deletion region will appear as homozygous in the child. Without taking into account copy number status, such a situation will also likely be filtered out as a false positive. Therefore, current bioinformatics tools need to be adapted to allow for non-Mendelian constellations56, 57 or require a combination of experimental techniques that allow for detection of both single-nucleotide variation and copy number status. The results of such studies will also present us with novel challenges during genetic counselling of families.58, 59, 60, 61, 62, 63

In conclusion, this study suggests that CNVs may contribute to clinical phenotypes by a recessive mode of gene action. We conclude that array-based enrichment of inherited deletions may be a targeted and highly sensitive method to detect SNVs and CNVs, simultaneously. Our study extends the scope of genome-wide CNV profiling beyond de novo CNVs in sporadic patients, and may aid in uncovering missing heritability in genome-wide screening studies.