CNV analysis in Chinese children of mental retardation highlights a sex differentiation in parental contribution to de novo and inherited mutational burdens

Rare copy number variations (CNVs) are a known genetic etiology in neurodevelopmental disorders (NDD). Comprehensive CNV analysis was performed in 287 Chinese children with mental retardation and/or development delay (MR/DD) and their unaffected parents. When compared with 5,866 ancestry-matched controls, 11~12% more MR/DD children carried rare and large CNVs. The increased CNV burden in MR/DD was predominantly due to de novo CNVs, the majority of which (62%) arose in the paternal germline. We observed a 2~3 fold increase of large CNV burden in the mothers of affected children. By implementing an evidence-based review approach, pathogenic structural variants were identified in 14.3% patients and 2.4% parents, respectively. Pathogenic CNVs in parents were all carried by mothers. The maternal transmission bias of deleterious CNVs was further replicated in a published dataset. Our study confirms the pathogenic role of rare CNVs in MR/DD, and provides additional evidence to evaluate the dosage sensitivity of some candidate genes. It also supports a population model of MR/DD that spontaneous mutations in males’ germline are major contributor to the de novo mutational burden in offspring, with higher penetrance in male than female; unaffected carriers of causative mutations, mostly females, then contribute to the inherited mutational burden.

can be aligned to HapMap Phase 3 panel (Release 2) were retained, and thinned to be in approximate linkage equilibrium using PLINK option "--independent-pairwise 100 25 0.25". MHC regions were excluded to prevent the confounding effects of long range LD. Only unrelated patients were used in PCA.
When combined with three major continental populations of HapMap (CEU, YRI, and CHB+JPT), we found two outliers in the first two principle component axes: one was an admixture of European and Asian like Uyghur, the other might have African ancestry ( Figure S2A). After removing two outliers. The patient cohort mixed well with Chinese population (CHB) but separated from Japanese (JPT) ( Figure S2B).
Constituent uniparental disomy (UPD) in MR/DD children was detected as chromosome outlier in Mendel errors. We first tabulated the number of Mendel errors on each chromosome for each trio, and estimated the expected proportion of errors for each chromosome using trimmed mean across trios. The unusually large number of errors on one particular chromosome for each patient was then evaluated by exact binomial test. P-values were corrected for 218×23 tests. We also screened for the runs of homozygosity (ROH) using plink's "--homozyg" option with default parameter. The total genetic and physical length of ROHs were summed up for each individual. UPD and ROH analyses were done after zeroing out individual genotypes within CNVs.

CNV Calling
LogR ratios (LRRs) and B-allele frequencies (BAF) signals of all subjects passed SNP QC were exported from GenomeStudio ® . We applied PennCNV to calculate quality control statistics. Obviously bad samples were spotted as outliers in the distribution and removed from CNV calling. These included samples that had LRR standard deviation>0.35, BAF median>0.55 or <0.45, BAF drift >0.02, or |waviness factor|>0.15. CNVs were called using PennCNV package (Wang et al. 2007) based on the standard hidden Markov model for Illumina arrays, and customized population B allele frequencies calculated from unrelated samples included in CNV calling. The GC content-based wave adjustment was applied (Diskin et al. 2008). We only kept CNV calls spanning at least 5 markers and having confidence score >15.
The raw CNV calls were removed if they were mapped to the following artifact regions: 500kb within start or end of chromosome, within centromeres, and immunoglobulin loci (Need et al. 2009). To determine an appropriate quality control procedure, we first manually inspected the LRR/BAF signals of all large CNV calls (>800kb) and all smaller CNVs that were absent in population controls and appeared only once in each pedigree (private CNVs). We observed that although smaller CNVs in noisy samples were fraught with false positives, large CNVs could still be reliably distinguished. To identify potential samples with poor CNV calling performance, we also calculated two other QC measures adopted by Sanders et al. (2011): (1) Excessively wide LRR band, defined as the number of probes with |LRR|>0.5; (2) The number of aberrant behaving probes, defined by highly negative (extreme) LRR values (LRR<-1.0). The relationship between the total number CNVs calls and different QC metrics are shown in Figure S3. Noisy samples were empirically defined as those had LRR standard deviation>0.3, |waviness factor|>0.10, number of markers with wide or extreme LRR values fall above 90% quantile across all samples. CNV calls <500kb in noisy samples were excluded from subsequent analysis. We also noted that PennCNV tended to generate smaller fragmented calls for a large CNV, possibly due to uneven probe coverage on the array. We therefore iteratively merged neighboring CNV calls of the same copy number, if the sum of original lengths was at least 70% of the merged CNV length. The excessive number of CNV calls on a particular chromosome may also be indicative of chromosome aneuploidy. Before CNV merging, we estimated the proportion of CNV calls made on each chromosome across the sample, and detected chromosomes with significantly more than expected CNV calls using a goodnessof-fit test. One mother with chromosome X trisomy was found and CNVs on that chromosome were not included for subsequent analysis.

CNV Filtering and QC
To enrich for pathogenic CNVs, we applied a series of filters to the cleaned CNV calls. We first excluded CNVs that have 50% of its region overlapping with a known copy number polymorphism (CNP) reported in HapMap samples (McCarroll et al. 2008, Bailey, Kidd, and Eichler 2008, Conrad et al. 2010, Park et al. 2010, International HapMap et al. 2010). Then we excluded CNVs whose 50% region overlap with CNV segments of the same type in at least 5 unrelated subjects from population-based controls, or at least 2 unrelated parents. The number of CNV calls after each step of filtering is shown in Table S2. After automated filtering, we found 345 CNVs in 185(64.5%) patients and 612 CNVs in 280(54.9%) their parents. The inheritance status of CNVs found in patients was initially determined by comparing to the unfiltered CNV segments in parents. All the resulting CNVs were subject to manual inspection to check the carrier status in related samples, split incorrectly merged nearby CNVs, resolve incorrect boundaries, and flag the ambiguous cases for validation.
We initially selected 64 representative CNVs from the curated set for quantitative real-time polymerase chain reaction (qRT-PCR) validation. These include 16 CNVs with size 250~600kb that overlap candidate genes, and 48 small CNVs <250kb (22 duplications, 26 deletions) that overlap at least one gene. For CNVs in patients, validations were also performed in parents to confirm inheritance status. To evaluate the CNV calls in noisy samples, we also applied the above filtering procedures on all samples, and randomly select 20 ultra-rare CNVs overlapping at least one gene in noisy samples for validation. All 16 CNVs >250kb in cleaned samples were validated, as compared with only 2 out of 20 CNVs in noisy sample ( Figure S4B). It demonstrated the accuracy of manual curation on large CNVs on clean samples, which will be used in CNV burden analysis. Given the low validate rate, no further validation was attempted on noisy samples. Validation rate was 62% for small CNVs in cleaned sample. To derive an operational filter, we compared the LRR of each CNV to the markers in the neighboring regions at the right and left sides. The neighboring region was selected to have similar size as the CNV with a minimum of 10 markers. We defined ΔLRRright(left) as the difference in mean LRR of markers in the CNV segment and the marker in the right (left) neighboring region. If not enough markers were found within 1Mb surrounding region of the CNV or there were CNVs nearby, then ΔLRR was set to the mean LRR of markers in the CNV. We empirically determined the QC filter ΔLRR >= 0.15 for duplications and <= -0.25 for deletions, which achieved >90% sensitivity and >85% specificity on CNVs selected for validation ( Figure S4A). Most false positives were caused by the markers with extreme LRR in deletion calls, and could be identified by manual inspection. When applied the filter to the CNV calls, we observed an increase in concordance rate of CNV calls in duplicated samples without much loss of sensitivity (Table S3).
The QC filter was then applied to the ultra-rare CNVs <250kb passed automated filtering, with additional 51 ambiguous cases resolved by further validation (33 of them are validated and included in the final set). The QC of ultra-rare CNVs >=250kb were only subject to manual curation. We performed additional qRT-PCR experiments to validate all pathogenic CNVs <600kb reported in main text and Table S10. Seven de novo CNVs <600kb after curation were also validated to be absent from parents. The final set of ultra-rare CNVs in patients and parents are given in Table S8 and Table S9. The numbers of CNVs after validation, QC, and curation step are shown in Table S2. The iterative process of validation, QC and curation not only reduced the likely false positives; they also reduced the false negatives as compared with automated filtering.
To assess the frequency of CNVs in non-Chinese populations, we also obtained published CNV calls obtained from three population controls: WashU (Itsara et al. 2009), CHOP (Shaikh et al. 2009), and OPGP (Uddin et al. 2014), totaling up to 5000 samples of different ancestries (Table S1C). Among ultra-rare CNVs identified in our clinical cohort, none of them had 50% of its region covered by >=5 CNVs of them type.

Control Cohorts Analysis
For population-based controls, genotypes were called by GenomeStudio ® using the default genotype cluster file. Samples with less than 98% call rate were excluded. Duplicated samples and related samples up to 2 nd degree relatives (kinship coefficient >0.1875) were identified; and only one with highest SNP call rate was kept. Sample genders were verified based on chromosome X inbreeding coefficient. A total of 6,268 samples passed SNP genotype QC (Table S1B). The Chinese ancestry was verified by PCA (not shown).
All subjects that passed SNP QC procedures were entered into CNV calling. The CNV calls were generated using the same method (PennCNV) as the clinical cohort, with customized population B allele frequencies and wave adjustment. To reduce the labor for manual curation, only CNVs found in samples passed stringent QC criteria were retained. The QC step excluded samples that had LRR standard deviation>0.275, BAF drift>0.002, WF>0.06 or WF<-0.07. Then we further removed samples having >100 total number of CNV calls. A total of 5866 control samples passed QC, including 2780 males, 3086 females (Table S1B). We only kept CNVs that were supported by at least 5 probes and having PennCNV confidence score>10. The CNV calls located within genomic regions of known artifacts were excluded. Two chromosome X trisomy females were identified; CNVs on these two chromosomes were not included for further analysis. Neighboring CNVs of the same type were merged if the sum of original lengths were at least 80% of the merged length.

Targeted Rare CNV Genotyping
We adopted SNP-Conditional OUTlier detection (SCOUT) algorithm (Mefford et al. 2009) implemented in SCIMMkit package (Zerr et al. 2010) for targeted rare CNV genotyping in our clinical cohort. Only 22 autosomes were analyzed.
To select probes informative for CNV genotyping, we followed the recommendation of the software developer. After QC on SNP genotypes, the following more stringent criteria were used to select SNP probes: MAF>0.1, missing call rate<0.03, and HWE test p-value>10E-4. The resulting 186,996 autosomal probes were selected.
To define targeted regions, we first compiled a list of 2928 known and candidate genes for MR/DD. Genomic regions spanned by 5 consecutive markers and less than 100kb were enumerated. If a region overlapped coding exons of the candidate gene and was not contained within deletions or duplications found in >=5 population-based controls, then it was nominated as a target region. A total of 14,139 target regions were defined.
We noted that inclusion of noisy samples would reduce the sensitivity of this approach, because SCOUT algorithm uses the probe signal intensities across the samples to detect outliers. So the samples passed QC for small CNV calling were further filtered to exclude those whose wide and extreme LRR markers (see CNV Calling, Filtering, and QC section) were above 85% quantile across all samples. This resulted in 233 patients and 410 parents left for SCOUT analysis (Table S1A).
The genotyping was performed using the default parameters adopted by SCIMMkit; CNV calls with absolute per-site score >6 were generated. The overlapping CNV segments of the same type were subsequently merged. We tested the sensitivities of rare CNV genotyping on the curated ultra-rare CNVs. Among CNVs in the final list (Table S8, Table S9), we found 68 CNVs (31 duplications, 37 deletions) that cover the targeted regions and were carried by samples passed QC for SCOUT analysis. SCOUT detected 58 of them (27 duplications, 31 deletions) achieving an 85.3% recovery rate. Further increase the score threshold to 6.5 and 7.0 would reduce the recovery rate to 80.9% and 69.8% respectively.
To discover additional gene disrupting CNVs, the resulting CNV calls were filtered to exclude those that were already identified by PennCNV or also called in at least 2 unrelated parents. Then, further sample level QCs were performed to exclude those harboring more than 4 rare CNV calls after filtering, as manually inspecting the signal plots suggest that they were most likely noisy samples. This procedure resulted in 57 candidate CNVs (38 duplications and 19 deletions) in 37 samples. None of the duplications encompassed an entire gene transcript. As the functional consequence of duplications that partially overlap the coding sequence of a gene was hard to interpret, we focused on deletions (details given in Table S14). QRT-PCT validation was attempted at coding exons encompassed by 11 CNVs. We validated one deletion in the exon 1 of HIRA of MR_3590, which was not inherited from father ( Figure S10). High rate of false positives was also suggested by a large number of apparent de novo calls. Manually inspecting the signal plots suggested two sources failed validations: aberrant behaving probes in noisy samples, or the real deletion was smaller than the region spanned by five probes. In two cases where CNV calls were also made on mothers (BMP6 and GPC6), validations targeting at the intronic region defined by the probes with strongest signals were successful. Together, we identified three additional small CNVs using the targeted genotyping approach.

Mosacism Calling
To identify mosaic structural changes, we first applied MAD package (Gonzalez et al. 2011) to detect consistent skews in mirrored B-allele frequency signals. The following parameters were used for segmentation: aAlpha=0.8, T=9, MinSegLen =100. A total of 70 candidate segments of mosaic events were called. We first excluded segments that overlapped CNVs called by PennCNVs, and those that overlapped constituent ROHs identified by PLINK. The resulting segments were then subject to manual inspection to exclude remaining error classifications caused by noisy or waviness signals, artifacts regions on chromosome X, etc. Figure S1 The computational workflow for CNV filtering and interpretation: an overview.

Supplementary Figures
After SNP genotype-based QC, we performed further QC on samples based on array intensities before CNV calling. High confidence CNV calls were subject to a series of filters to enrich causal variants. The steps included the exclusion of known CNPs, CNVs of the same type that occurred at least five times in population controls, and CNVs detected in at least two other unrelated parents. The resulting ultra-rare CNV calls were subject to iterative QC, manual curation, and experimental validation to generate the final CNV list. To evaluate the pathogenicity, we first check if a CNV overlaps the critical region of known genomic disorders (GD) matched for the copy numbers (C1). Then, we check if the CNV deletes or disrupts the coding exon of haplo-insufficient genes or duplicate triplosensitive genes (C2). If no overlap with such region or gene was found, we subjectively judged the CNV as pathogenic if it affects a large number of genes (C3). To opportunistically identify smaller exonic CNVs that might be missed by standard calling algorithm, we also started with a list of known/candidate disease genes, and defined target regions that cover the coding exons and contain enough informative SNP probes. Then, targeted rare CNV genotyping were performed to search for additional small CNVs. The results were subject to validation; and their pathogenicity was assessed based on C2. The scatterplot shows the relationships between different sample-level QC metrics and total number of CNV calls. LRR_SD: standard deviation of log-R ratio (LRR) signals; WF: waviness factor, a summary measure of the LRR signal fluctuation explained by local GC content (Diskin et al. 2008); Log.Extreme: log2 of the number of probes with LRR less than -1.0; Log.Wide: log2 of the number of probes whose LRR deviate at least 0.5 from 0. Different QC metrics are correlated and capture different aspects of data quality. Noisy samples (defined in CNV Calling, Filtering, and QC section) are shown in blue. They generally have higher number of total CNV calls suggesting higher false positives.   Gene counts are defined as the number of refSeq coding genes whose coding exon sequences overlap with the CNV segment.  Segmental UPD of 11p15.3 to p-terminus was found in the DNA from MR_2646's father. The figure displays LRR and BAF signals of patient MR_2646 and her parents. All samples had normal LRR; father's BAF shows a split consistent with loss of heterozygosity in about 30% of cells. The elevation of LRR signals at the 11p terminus was likely caused by the assay artifacts. Figure S9 The case of ultra-rare small exonic deletion identified from SCOUT analysis.
(A) Genome browser tracks show the genomic position of refSeq genes, targeted interval, all SNPs markers on CytoSNP12 array, and markers selected for rare CNV genotyping (red). (B) Allele-specific fluorescent intensities at the five selected markers. The sample carrying the deletion (MR_3590) is shown in red, and exhibited consistent weaker intensities. (C) qRT-PCR relative quantification of the first coding exon of HIRA gene. The deletion is not inherited from father (mother not available for testing).
Figure S10 Deletions at 15q11.2-13 locus identified in two patients with Angelman syndrome. Three loci, overlapping segmental duplications and devoid of SNP probes, are commonly referred to as BP1-BP3. Figure S11 Overlapping ultra-rare CNVs in patients.
We imposed a minimal 5% reciprocal overlap to exclude cases of chance overlapping of several small CNVs with a very large one. Recurrent CNVs in known genomic disorders mediated by NAHR were also excluded. The following cases were identified: (A) Two large de novo pathogenic deletions intersect at 18q21 region, including NEDD4L gene. (B,C) Two cases of a small deletion embedded within a large pathogenic deletion. Genes disrupted by the small deletions, however, have not been implicated in MR. (D,E) Two cases of overlapping duplications whose common region duplicates a single gene. Both duplications have unknown clinical significance. (F,G) Two cases of small recurrent deletions with almost identical boundary. (F) Deletions shared by MR_557 and MR_144 omit exon 13 of CTNNA3 gene. The same deletion also appeared in 3 unrelated controls; deletions that disrupt this gene are common in controls. (G) Deletions shared by MR_1234 and MR_1963 omit exon 8 of ARSF gene. No deletion is found to disrupt this gene in controls or in public databases.