Identification of novel driver risk genes in CNV loci associated with neurodevelopmental disorders

Summary
Copy-number variants (CNVs) are genome-wide structural variations involving the duplication or deletion of large nucleotide sequences. While these types of variations are commonly found in humans, large and rare CNVs are known to contribute to the development of various neurodevelopmental disorders (NDDs), including autism spectrum disorder (ASD). However, because these NDD-risk CNVs cover broad regions of the genome, it is particularly challenging to pinpoint the critical gene(s) responsible for the manifestation of the phenotype. In this study, we performed a meta-analysis of CNV data from 11,614 affected individuals with NDDs and 4,031 control individuals from the SFARI database and identified 41 NDD-risk CNV loci, including 24 novel regions. We also found evidence that dosage-sensitive genes within these regions are significantly enriched for known NDD-risk genes and pathways. In addition, a significant proportion of these genes was found to (1) converge in protein-protein interaction networks, (2) be among the most highly expressed genes in the brain across all developmental stages, and (3) be hit by deletions that are significantly over-transmitted to individuals with ASD within multiplex ASD families from the iHART cohort. Finally, we conducted a burden analysis using 4,281 NDD cases from the Decipher and iHART cohorts and 2,504 neurotypical control individuals from 1000 Genomes and iHART, which validated the association of 162 dosage-sensitive genes driving risk for NDDs, including 22 novel NDD-risk genes. Importantly, most NDD-risk CNV loci contain multiple NDD-risk genes, in agreement with a polygenic model associated with the majority of NDD cases.


Meta-analysis methodology
The SFARI Gene Copy Number Variants (CNVs) database (https://gene.SFARI.org/database/cnv/) is a collection of annotated genomic regions that includes information such as the number of reported patients with NDDs and controls carrying deletions and/or duplications, ASD-associated genes within these regions, genomic coordinates, and detection platforms. Currently, there are two versions of this database: a more extensive archive version, which includes CNVs reported throughout the entire genome that do not necessarily have significant relevance to autism, and an updated version, which includes only those regions that are highly significant for the autistic phenotype.
Data contained in the SFARI website were extracted using a customized web scraper. Through this procedure, we obtained information on cases and controls at the single-patient level, as well as the characteristics of the cohorts in which these probands were included.
Supplemental information table 1. Overview of data obtained from the SFARI collection that were used in the statistical meta-analysis.

Individual dataset
• Total initial number of carriers included in SFARI before the standardization: 23,907

Standardization procedure
Bringing the formats and characteristics of different data fields under a common framework is a critical step in the study of large-scale cohorts. Medical data in particular can be highly heterogeneous, owing to the intrinsic complexity of biological processes and to differences between techniques, years of study, or even the healthcare facilities where the analyses were undertaken. Once the SFARI data were obtained, we detected several issues that affected downstream processing (e.g., duplicate probands, inconsistencies in the reported genomic coordinates, or positions reported in different genome builds). To address these data quality issues, we implemented a data processing and standardization step, detailed below by category of data intervention.

Patient Identifiers standardization
Standardizing individual patient identifiers was crucial for a reliable statistical analysis, since the SFARI database includes duplicate entries in both case and control samples. This is because numerous studies performed on the same region, and based on the same case/control cohorts, are stored in the SFARI database as separate entries. This is reflected in the substantial reduction, especially in cases, in the number of carriers after the standardization of the individual identifiers (Supplemental table 1).
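The effect of repeated reporting on carrier counts can be illustrated with a minimal sketch. It is not the procedure used in this study; the record schema (`CarrierEntry`), the field names, and the choice of key (cohort, patient identifier, locus, CNV type) are assumptions made only for illustration.

```python
from typing import NamedTuple


class CarrierEntry(NamedTuple):
    """One carrier record after scraping (hypothetical schema, not the SFARI one)."""
    study_id: str    # publication in which the entry was reported
    cohort: str      # underlying cohort or consortium
    patient_id: str  # identifier of the individual within the cohort
    locus: str       # cytoband of the CNV, e.g. "16p11.2"
    cnv_type: str    # "deletion" or "duplication"


def deduplicate_carriers(entries):
    """Keep the first entry for each (cohort, patient, locus, CNV type) key.

    Assumption: two entries describe the same carrier when they share the
    cohort, the patient identifier, the locus, and the CNV type, regardless
    of which study reported them.
    """
    seen = set()
    unique = []
    for entry in entries:
        key = (entry.cohort, entry.patient_id, entry.locus, entry.cnv_type)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Toy data: the same individual is reported by two studies of one cohort.
entries = [
    CarrierEntry("study_A", "cohort_X", "P001", "16p11.2", "deletion"),
    CarrierEntry("study_B", "cohort_X", "P001", "16p11.2", "deletion"),
    CarrierEntry("study_B", "cohort_X", "P002", "16p11.2", "duplication"),
]
unique_entries = deduplicate_carriers(entries)
print(len(entries), "entries before,", len(unique_entries), "after")
```

In this toy example the second record is dropped because it repeats a carrier already reported by another study of the same cohort. Real identifiers would first have to be brought to a common format before such a key could be compared.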
To resolve duplicates, we individually evaluated each of the papers from which the patients were obtained. Because of the high number of studies included (565 case studies and 106 control studies), we first filtered them to retain only those with more than 100 probands. This returned a total of 34 case studies and 13 control studies. We then reviewed all of them to identify the cohort used in each study. To avoid overlapping cohorts, we selected those that did not share any patients, either because they differed in geographic location or because they came from different consortia.

Supplemental Information:

Shared etiology of NDDs [*1]
Several NDD-risk CNV loci have been reported to be associated with risk of more than one NDD. For instance, both deletion and duplication carriers at 16p11.2 have been found to exhibit ADHD, ID, ASD or epilepsy with significantly greater frequency than controls [1][2]. Likewise, it is estimated that individuals carrying monosomy in the 22q11.2 region are ~40% more likely to develop schizophrenic-spectrum disorders in adulthood [3], although it has also been observed that 10%-50% of patients with this same deletion [4], also known as Velocardiofacial/DiGeorge syndrome, report autism. Other behavioral disorders observed in affected individuals include attention deficit hyperactivity disorder and mood and anxiety disorders [4][5][6]. Other CNV loci associated with risk for both ASD and SCZ include deletions in 3q29 and 17q12, as well as duplications in 7q11.23 or 16p13.1, in agreement with the previously reported shared etiology for these disorders [7][8][9]. This finding also holds true for the novel regions that we have detected, most of which also overlap with known risk regions. Among the newly identified regions we found some overlapping with previously known regions, such as 21q11.

Details about other validated NDD-risk genes [*2]
In the 3q29 CNV locus we found PAK2, a candidate gene for which a high-confidence association with NDD risk had not yet been established. However, it had been shown that this serine/threonine kinase is essential in the regulation of cytoskeletal dynamics, and that knockout animal models recapitulated the disturbed neurological synaptic patterns seen in patients with ASD [19]. In our analysis, PAK2, a validated candidate gene, is closely associated with the MAPK stress-activated cascade, an essential pathway in brain function, learning and memory. In iHART, a single variant deleting this gene was found in four different families; in two of them, all children diagnosed with ASD carried this variant (Supplemental Table 3).
For CHM3, in the 1q43-44 region, cases with autism and ID have been previously reported [20]. In our analyses we not only validated its association with NDD risk, but also found a set of clinical signs and symptoms in patients from Decipher with a statistically significant association with this NDD-risk gene: depressed nasal bridge, upslanted palpebral fissure, short neck, downturned corners of the mouth, sparse hair, micrognathia, epicanthus and hypoplasia of corpus callosum (Supplemental Table 3).