Schizophrenia-associated somatic copy-number variants from 12,834 cases reveal recurrent NRXN1 and ABCB11 disruptions

Summary While germline copy-number variants (CNVs) contribute to schizophrenia (SCZ) risk, the contribution of somatic CNVs (sCNVs)—present in some but not all cells—remains unknown. We identified sCNVs using blood-derived genotype arrays from 12,834 SCZ cases and 11,648 controls, filtering sCNVs at loci recurrently mutated in clonal blood disorders. Likely early-developmental sCNVs were more common in cases (0.91%) than controls (0.51%, p = 2.68e−4), with recurrent somatic deletions of exons 1–5 of the NRXN1 gene in five SCZ cases. Hi-C maps revealed ectopic, allele-specific loops forming between a potential cryptic promoter and non-coding cis-regulatory elements upon 5′ deletions in NRXN1. We also observed recurrent intragenic deletions of ABCB11, encoding a transporter implicated in anti-psychotic response, in five treatment-resistant SCZ cases and showed that ABCB11 is specifically enriched in neurons forming mesocortical and mesolimbic dopaminergic projections. Our results indicate potential roles of sCNVs in SCZ risk.

Correspondence christopher.walsh@childrens.harvard. edu In brief Maury et al. leveraged blood-derived SNP-array data across 12,834 schizophrenia cases and 11,648 controls to explore somatic copy-number variants (sCNVs). They found higher earlydevelopmental sCNV incidence in cases compared with controls, along with specific intragenic events in NRXN1 and ABCB11 that could potentially contribute to SCZ disease.

INTRODUCTION
De novo and rare germline copy-number variants (gCNVs) contribute to up to 5.1%-5.5% of schizophrenia (SCZ) cases, with relatively large effect sizes. 1 These gCNVs are usually inherited or represent de novo events thought to arise during gametogenesis. Most gCNVs involve several genes, making it difficult to pinpoint specific causative genes. A notable excep-tion is deletion of NRXN1, which encodes a presynaptic adhesion protein and has been suggested to have a role in SCZ along with other synaptic genes. 2 Somatic copy-number variants (sCNVs) present in only a fraction of cells in the body, are increasingly implicated in neuropsychiatric disease. [3][4][5][6][7][8] For example, a recurrent, large sCNV of chromosome 1q has been repeatedly observed in focal epileptic brain malformations, [9][10][11] while blood samples from autism spectrum disorder (ASD) 3 showed enrichment of large (>4 Mb) sCNVs, with sCNV size positively correlated with phenotypic severity. The overlap in the genetic architecture of ASD and SCZ 12 suggests the hypothesis that sCNVs may have similar roles in SCZ liability. Since sCNVs are less common than germline gCNVs, large datasets must be analyzed to assess their contribution to disease, but such large genotyping datasets are generally only available from blood-derived single-nucleotide polymorphism (SNP)-array data created for genome-wide association studies (GWASs), which creates two challenges. The first challenge is that these arrays only capture the earliest developmental events, present in a relatively large fraction of cells 13,14 and hence also likely to be shared in brain cells and other tissues. Prior studies have shown that non-oncological somatic variants present in more than $1%-3% of cells in a tissue are typically shared in all developmental lineages in a mosaic fashion. [14][15][16] The mosaic fraction of variants in blood exhibited a linear relationship with the mosaic fraction in other tissues, 17 suggesting that studying highly mosaic variants in blood might reflect, to an extent, somatic variation in other tissues such as brain.
The second challenge in assessing sCNVs in blood is the increasing recognition that aging and environmental exposures are correlated with sCNVs that are restricted to blood, which are associated with leukemia or pre-cancerous conditions such as clonal hematopoiesis of indeterminate potential (CHIP). 5,18,19 However, CHIP-related sCNVs have now been extensively characterized in dozens of studies in terms of size and mosaic fraction and found to occur at recurrent chromosomal locations that disrupt specific driver genes, [18][19][20][21][22] allowing sCNVs at these loci to be filtered to identify non-CHIP, earlydevelopmental sCNVs that may be associated with SCZ.
In this study, we analyzed SNP-array data from 12,834 cases and 11,648 controls from the Psychiatric Genomic Consortium (PGC) SCZ cohort using a widely utilized, highly sensitive algorithm that leverages haplotype information to detect sCNVs in blood. 3,18, 19 We additionally used recent knowledge of the genomic loci of blood events 22 to rigorously filter candidate variants that likely originated from CHIP. We observed an excess of non-CHIP-related sCNVs in SCZ compared with controls and discovered recurrent sCNVs, including recurrent NRXN1 somatic deletions of exons 1-5 and recurrent intragenic events at ABCB11 gene as well. Taken together, these data suggest that potential roles of sCNVs in the genetic architecture of SCZ merit further study.

RESULTS
Potential enrichment of non-CHIP sCNV in SCZ cases sCNVs were identified using the MoChA 18,19 software on 26,186 blood-derived SNP arrays from the PGC2 SCZ cohort 23 (Figure 1A). We removed gCNVs previously identified in subjects of this cohort. 23 Samples that showed signs of contamination, or sCNVs whose copy-number state was not confidently determined, were excluded (STAR Methods). This quality control (QC) led to the identification of 1,341 candidate sCNV, including many presumably related to CHIP, and a subset that may potentially be associated with SCZ.
We identified 1,143 events likely to have arisen from CHIP, based on their chromosomal location at recurrent CHIP regions and resemblance to known CHIP events. We used these CHIP events, which typically have low cell fraction, 18,22 to compare the performance of MoChA in our dataset with prior studies. The events identified as CHIP in our initial call set followed a similar distribution of cell fraction (CF) and length compared with well-known CHIP events from the UK Biobank 19 Figure 1C). Performing meta-analysis to account for potential batch heterogeneity, similarly, revealed no significant decrease in CHIP events across cases and controls (one-sided Fisher's exact test, Liptak's combined p value = 0.9; Figure S1A). The mean number of CHIP events across cohorts for SCZ samples was 0.046 (SE = 0.009), similar for controls with 0.041 (SE = 0.008). Since we do not have age information on all samples, it is possible that any difference in CHIP burden in SCZ and controls might be masked by differential age distribution or other environmental factors. Nevertheless, this result suggests similar sCNV detection sensitivity in cases compared with controls in our dataset.
We next filtered likely CHIP-related events and identified a subset of early-developmental sCNVs, most present in a high CF. Specifically, we removed all copy-neutral loss of heterozygosity (CN-LOH), loci commonly altered in the immune system (e.g., major histocompatibility locus [MHC]) and other known common   19,21 and filtered outlier samples with multiple events (>5 sCNVs) ( Figure 1A) (STAR Methods). Our non-CHIP events show a different distribution of CF and length compared with events identified as CHIP ( Figure 1B, middle panel). sCNVs that occur early in development are clonally shared across multiple tissues and are thus expected to be present at larger CF than those occurring through CHIP alone. Reassuringly, variants filtered as potential CHIP exhibited significantly lower CF than non-CHIP events (Wilcoxon Rank-Sum test p = 6.4eÀ11) ( Figure 1B). This difference suggests that our filtering reliably removes likely CHIP events, although some bona fide early-developmental sCNVs may be filtered out as well, especially those coming from CN-LOH events. While we found equivalent rates of CHIP in SCZ samples and controls ( Figure 1C), stepwise removal of likely CHIP variants showed increasing enrichment of the remaining sCNVs in SCZ compared with controls, with the highest effect size once all the CHIP-related events were removed ( Figure 1C).
While non-recurrent or small events may be difficult to detect, Loh et al. 18 demonstrated detection sensitivity for events as small as 100 kb. There is an inverse-square relationship between event size and CF, such that, at a CF of $10%, events > 1 Mb are detectable, while sCNVs > 100 Mb need to be present in $1% of cells to be detected. 18 Since CHIP events tend to occur at lower CF, 18,19,21,22 consequently MoChA would have sensitivity to detect only larger events. On the other hand, it is expected from the modeling of MoChA 18 and the inverse-square relationship that early-developmental events, such as those predicted to have occurred in some individuals in our study, have a biological higher CF, and hence that MoChA should have sensitivity to detect them even if they are smaller.
Whole-genome sequencing supports presence of sCNV We confirmed several likely non-CHIP sCNV using 40-603 whole-genome sequencing (WGS) in five individuals. We detected a probable 650-kb somatic deletion in one individual with a predicted CF of 52%, which was supported by simple inspection of the Integrative Genomics Viewer (IGV) read pileup ( Figure S2). The CF estimated from WGS was $50%, closely matching the MoChA estimate. We also called a large 26-Mb somatic deletion in 9q21.11-9q22.2 (hg19 coordinates, chr9:71033538-97246817) with an estimated CF of 43% in another individual. Running MoChA on WGS also supported a similar somatic deletion overlapping the original event in the same individual (hg19 coordinates chr9, 38767760-97259994) with a CF of 46% and with a size larger than seen on the SNP array, attributable to the event extending into the centromeric region that is not well represented on SNP arrays. 24 WGS also confirmed both CFs and breakpoints for three of three much smaller NRXN1 deletions presented in detail below. Since the data for this study was generated across different countries, precluding access to DNA for more widespread validation of MoChA calls, we applied MoChA conservatively, only calling variants (for both cases and controls) in a range of size and CFs that have been shown in prior papers with this algorithm to have a negligible false-discovery rate. 3,[18][19][20]25 However, further studies will be required to provide more precise estimates of prevalence of sCNV in SCZ, and rate comparisons in our study should be interpreted with caution.
Analysis of putative early sCNVs in SCZ and controls sCNVs not related to CHIP occurred in a small but significant fraction of SCZ cases. From the initial 13,464 SCZ cases and 12,722 controls, a total of 12,834 cases and 11,648 controls remained after QC. The final non-CHIP sCNV call set consisted of 198 events in 178 individuals, made up of 127 losses and 70 gains (Tables S1and S2; and Figure 1D). These events ranged in CF from 1.10% to 63.8% (median, 21.1%), and ranged in size from 10.7 kb to 95.3 Mb (median, 686.0 kb). The high CF of events in samples without a blood cancer diagnosis suggests that these somatic variants might have arisen during early-developmental stages. 13,14,17 The percentage of individuals with at least one sCNV was 0.91% in SCZ and 0.51% in controls (OR, 1.78; 95% CI, 1.29-2.47; two-sided Fisher's exact test, p = 2.68eÀ4) ( Figure 1E). The sCNV incidence in controls was comparable with unaffected siblings in an earlier study (0.51% vs. 0.54%), 3 while our estimates in SCZ were higher compared with the ASD cases from the same study (0.91% vs. 0.58%). 3 This higher rate most likely reflects sensitivity improvement in the pipeline since the earlier study, although we cannot rule out differential sources of artifact or biological effect (STAR Methods). Prior analyses with MoChA have estimated burdens of sCNV (with CF > 10%) in blood samples of individuals with no history of hematologic cancer of 3.2% in the UK Biobank, 5.2% in the Mass General Brigham Biobank, 5.9% in FinnGen, and 1.3% in Biobank Japan. 20 These numbers are all larger than what we observe in our SCZ cohort and likely reflect our filtering of CHIP variants and/or differential environmental exposures.
To rule out potential residual CHIP events in our call set contributing to the difference in prevalence of sCNVs, we performed the burden test using different minimum CF cutoffs, with higher CF cutoff being less likely to be CHIP events and more likely to be early-developmental events. There remained a statistically significant enrichment in SCZ through several ranges, even when events were split into losses and gains (Figure 1E). We further accounted for potential batch heterogeneity ( Figure S1A) using meta-analysis across each study batch containing both cases and controls, obtaining a Liptak's combined p value of 0.032 using a one-sided Fisher's exact test. To further confirm that the enrichment observed was not driven by CHIP events, we removed all variants from samples with suspected CHIP variants (30 variants removed). With this smaller call set, we still obtained a significant enrichment in SCZ cases for sCNV compared with controls (Fisher's exact test, OR, 1.64; 95% CI [1.16:2.33]; p = 0.0041).
In contrast with previous findings in ASD, 3 sCNVs in SCZ cases were of similar size compared with control after accounting for different arrays/cohorts by mixed-effect modeling (p = 0.26) ( Figure 1F). These events were also present at similar CF in cases compared with controls (p = 0.986; Figure 1G). There was also no detectable difference in gene density (p = 0.08; Figure 1H). These trends were observed across the different batches as well (Figure S1B-D). In contrast to gCNV, 1,23 sCNV did not show overall gene-set enrichment for the top 20% expressed brain genes (p = 0.14), synaptic genes (p = 0.12), or haploinsufficient genes as measured by a probability of of being loss-of-function intolerant (pLI) score >0.90 26 (p = 0.54). We did not detect events in the top 10 genes related to SCZ by the SCHEMA consortium 27 or the presence of two-hit events (germline + sCNVs) in our dataset.
Some sCNV overlapped cytobands previously implicated in SCZ but showed distinctive features. While one SCZ case had a 4.1-Mb somatic deletion in cytoband 16p11.2, it was not only significantly larger than the canonical germline 16p11.2 deletions (<600 kb) observed in SCZ and ASD 23,28 but also the mosaic deletion did not overlap the canonical proximal or distal events ( Figure S3A). We also observed one SCZ case with a somatic deletion in the 22q11.21 locus that was significantly smaller (686 kb) than the recurrent germline 22q11.21 deletions observed in SCZ (2.35 Mb) ( Figure S3B). The mosaic 22q11 deletion we observed, however, overlapped the genes TBX1 and COMT, which have been suggested as key genes driving some of the phenotypic effects and SCZ risk of germline 22q11 deletion. 29,30 Predicted sCNV are larger and affect more gene-dense regions compared with gCNVs Comparison of the genomic features of sCNVs with rare (minor population allele frequency <0.5%) gCNVs calls of SCZ cases from the arrays used in our current study 23 showed that sCNVs were larger (fold change, 4.57; 95% CI, 3.76-5.48; mixed-effect log-normal regression p < 2eÀ16) and involved more genes (fold change, 1.27; 95% CI, 1.03-2.56; mixed-effect negative binomial regression p = 0.027) (Figures 2A and 2B). We observed that genomic regions affected by rare gCNVs present in at least five SCZ cases overlapped 43.6% of all the gCNVs, whereas these same regions overlapped only 4.48% of SCZ sCNVs (Fig-ure 2C). This difference in genomic regions persisted throughout for rare gCNVs present at different minimum recurrence cutoffs ( Figure 2C). These findings suggest that, with sufficient statistical power, mosaic events might offer additional new insights into different risk regions of the genome.

Recurrent, intragenic deletions in NRXN1 observed in SCZ
Six individuals showed somatic deletions in cytoband 2p16.3 affecting only the NRXN1 gene, at remarkably stereotyped and distinctive regions of the gene. The size of these events ranged from 105 to 534 kb, with CF ranging from 13.8% to 43.1%, suggesting that they occurred early in development. One deletion was limited to intron 5 ( Figure 3A) and is of uncertain disease significance since multiple germline deletions of this intron have been reported in control individuals. 23 In contrast, the remaining five 2p16.3 deletions consistently removed exons 1-5 of NRXN1a while leaving exon 6 and the rest of the gene intact. This stereotyped five-exon deletion contrasts with germline deletions in NRXN1, previously implicated in SCZ, 23,31 which show highly variable breakpoints and relationships to NRXN1 exons. 23,32,33 Therefore, the recurrent, mosaic deletion of the same exons 1-5 in all five exonic deletions would seem to demand a specific mechanistic explanation. To further assess the prevalence of somatic NRXN1 deletions, we re-ran MoChA with a more lenient threshold and checked whether NRXN1 copy-number variants (CNVs) identified in the original PGC study 23 as germline might in fact be somatic. This strategy revealed an NRXN1 deletion previously identified as germline, Comparing the burden of NRXN1 somatic deletions in SCZ cases vs. controls revealed significant enrichment in cases (two-sided Fisher's exact test p = 0.032, exonic only; p = 0.016, exonic + intronic; Figure 3B). Previously generated sCNV calls from the UK Biobank 18,19 identified two persons without history of psychiatric disorder (of $500,000 individuals) with similar sCNV breakpoints affecting exons 1-5 in NRXN1. Although the arrays used in the UK Biobank have different sensitivity compared with arrays used in this study, they should have comparable sensitivity to detect these large events at CF > 10%. 18 Consequently, while we cannot fully rule out batch effect bias, combining our results with the UK Biobank suggest an enrichment of exon 1-5 NRXN1 deletions in the somatic state in SCZ samples (OR, 117.08; 95% CI, 20.91-1165.84; Fisher's exact test p = 6.57eÀ9; Figure 3B). To further assess whether we could have observed five overlaps of exons 1-5 by chance, we performed a bootstrap test by randomly shuffling regions of length equal to the seven NRXN1 sCNV that we discovered across the NRXN1 locus and computed the number of random overlaps with exons 1-5. We observed that five random overlaps of exactly exons 1-5 was a highly unlikely event (p < 0.0001; see STAR Methods; Figure 3C). A similar study with a similar pipeline and dataset to this study 3 on ASD and control samples did not detect somatic deletions in NRXN1 overlapping exons 1-5, suggesting relative specificity of this event to SCZ.
We were able to obtain 403 WGS from three NRXN1a deletion cases processed at the Broad Institute, confirming that each event removed exons 1-5 of the gene with estimated CFs of 42.4%, 33.3%, and 32.4%, as expected ( Figure 3D), and defining their breakpoints at base-pair resolution. WGS analysis showed that none of the exact NRXN1 sCNVs breakpoints were recurrent or overlapped known interspersed repeats or lowcomplexity DNA sequences, which could have predisposed this genomic region to genomic instability.
Further breakpoint analysis of these NRXN1 sCNVs using previously established classification criteria 35,36 ( Figure 3E) suggested diverse mechanisms of formation. One event had only 1 bp of microhomology (MH), suggesting that this event arose via non-homologous end-joining repair (NHEJ). Another event had a 3-bp MH, implicating an alternative end-joining repair mechanism (alt-EJ). The last event had no MH but revealed 8 bp inserted at the breakpoint, small enough to have occurred by non-template directed repair associated with NHEJ, although it is also possible that a fork-stalling template switching mechanism might have occurred, 37 but this mechanism tends to produce insertions >10 bp and usually occurs where some microhomology exists. 35 Taken together, these results suggest that the somatic deletions of NRXN1 that we observed do not disrupt recurrent exons due to instability of the genomic region around the events.

NRXN1 deletions suggest a potential cryptic promoter in human induced neurons
The absence of a genomic mechanism for the recurrent somatic deletions in NRXN1a suggests an alternative hypothesis, that the recurrence reflects an unknown but specific effect of these deletions on NRXN1 gene function. These sCNVs overlap the NRXN1a promoter and the first in-frame ATG transcription start site, which would be expected to disrupt transcription of the full alpha isoform ( Figure 3A) while leaving downstream beta and gamma isoforms intact. Intriguingly, the somatic deletions leave intact H3K4Me1 histone marks that lie just 5 0 from exon 6, which contains an in-frame ATG ( Figure 3A). These features might indicate a cryptic promoter or enhancer adjacent to the in-frame ATG in exon 6, potentially producing an N-terminal truncated NRXN1a. This truncated protein would lack the signal peptide required for shuttling to the cell surface, potentially causing abnormal trafficking. Similar germline NRXN1 deletions have been shown to cause accumulation of the NRXN1 intracellular binding protein CASK in human induced pluripotent cells (iPSCs) from SCZ patients. 38 To further explore functional effects of somatic deletions in the 5 0 end of NRXN1, we generated Hi-C data from neurons differentiated from human iPSCs (hiPSCs) containing heterozygous germline deletions in the 5 0 end (exons 1-2) and compared them with an iPSC line that had no germline deletion in NRXN1 (STAR Methods). Unphased Hi-C heatmaps in iPSC neurons showed that somatic deletions affecting exons 1-5 all fully overlap the topologically associating domain (TAD) boundary colocalized with the alpha promoter ( Figure 3F). Recently, disruption of TAD boundaries by germline structural variants have been associated with developmental disorders as well as SCZ. 39,40 These observations together suggest that 5 0 NRXN1 deletions might disrupt the structural integrity of the TAD boundary in SCZ and could result in ectopic enhancer-promoter miswiring and dysregulated gene expression.
To investigate possible 3D genome miswiring due to NRXN1 deletions, we generated allele-specific, phased Hi-C maps in both control as well as deletion-carrying SCZ iPSC-neurons (STAR Methods). Surprisingly, we observed the de novo formation of an ectopic looping interaction ( Figure 3G, green circle) between exon 6 of NRXN1 ( Figure 3G, blue star) and a putative non-coding cis-regulatory element upstream of the NRXN1 alpha promoter ( Figure 3G, purple star). This ectopic loop appeared to be specific to the deletion-harboring allele of the sample bearing a heterozygous deletion spanning the alpha promoter at the 5 0 end of NRXN1 (973FB) and was not observed on either allele in samples that lacked the deletion (2607FB). Because the interaction spans the deleted region, we hypothesize that the deleted region contains an element with some degree of boundary function preventing this loop from forming normally. Consistent with our hypothesis, the frequency of non-specific interactions increased across the boundary only on the NRXN1-deleted allele, suggesting allele-specific compromise of TAD structural integrity in SCZ (Figure 3G). Together, a working model is that de novo looping interaction in 5 0 NRXN1 deletions in SCZ connecting exon 6 to a putatively regulatory element could promote spurious pathological transcripts initiating at exon 6, although other alternative explanations remain as well.
Recurrent sCNVs in the ABCB11 gene observed in treatment-resistant SCZ cases We identified six SCZ cases with focal sCNVs within the ABCB11 gene (five deletions and one gain; Figure 4A), which has previously been associated with anti-psychotic response. 41,42 These sCNVs were all smaller than average, from 10.5 to 35.4 kb, but also with high CFs (18.3%-26.8%), suggesting that they also occurred early in development. ABCB11 encodes a member of the ATP-binding cassette (ABC) transporter superfamily and has a key role in transporting bile acids across the cell membrane 42 in hepatocytes, the cells involved in a wide range antipsychotic metabolism. Biallelic loss-of-function variants in ABCB11 result in severe pediatric-onset liver disease, with many patients developing malignancies or pathological complications within the first decade of life. [43][44][45][46] All the ABCB11 sCNVs overlapped the ABC transporter 1 domain and the domain responsible for interaction with the HAX1 protein ( Figure 4B), the latter facilitating internalization of ABCB11 via clathrin-mediated endocytosis. 47,48 Consequently, deletions might not only alter the protein's function by altering the transporter domains but also prevent removal of ABCB11 from the cell surface, perhaps leading to a dominant-negative loss of function. Since the sCNVs in ABCB11 do not overlap the gene's promoter and there are in-frame ATG sites in downstream exons 19 and 20, a truncated protein could be produced. The consequences of the somatic duplication event are less clear. We also note that four out of five deletions and the duplication overlap one of the transmembrane domains, further supporting the idea that these sCNVs might have a detrimental effect on ABCB11 function. The case-control enrichment of ABCB11 sCNVs was statistically significant (two-sided Fisher's exact test, p = 0.03; Figure 4C).
All six cases with ABCB11 sCNV came from batches of CLOZUK, 49 a treatment-resistant SCZ (TRS) cohort. These Article ll OPEN ACCESS samples were obtained from individuals with TRS, taking clozapine, and thus subject to standard blood monitoring for this drug. 50 Even though the CLOZUK samples constituted a significant portion of our study, observing six cases from only this cohort represents a statistically significant enrichment (twosided Fisher's exact test, p = 0.00079, and p = 0.015 for SCZ only; Figure 4D). ABCB11 sCNVs were not found in any previous analyses of healthy individuals from the UK Biobank and Biobank Japan. 19,21 Thus, these variants might plausibly regulate either SCZ liability or treatment response. Of the samples with ABCB11 sCNV, only two (one gain and one loss) were available for WGS. The predicted breakpoints fall in repetitive regions (short interspersed nuclear elements [SINE]) ( Figure S4), making it difficult to identify exact breakpoints, although the presence of these repetitive sequences suggests a potential mechanism of somatic deletion through microhomology. It is also possible that this part of the genome is unstable, since the ABCB11 gene has a significant burden of Alu-family members flanking exons as quantified by the AluAluCNV predictor score of 0.46, potentially implicating Alu-Alu-mediated rearrangement (AAMR). 51 Combining the ABCB11 somatic deletions we observed in our SCZ cases with germline deletions identified as part of the phase 2 PGC gCNV dataset revealed robust overlap between the mosaic deletions we detected and those present in separate SCZ cases in the germline state. There were five SCZ cases with gCNVs at the ABCB11 locus, with three of them coming from the CLOZUK cohort ( Figure S5). We were not able to obtain clinical data to determine whether the remaining two cases had TRS. Although six controls showed germline ABCB11 deletions, these events tended to cluster downstream of the SCZ gCNV and sCNV variants ( Figure S4). SCZ risk association analyses combining germline and somatic deletions of ABCB11 revealed a nominally statistically significant association of sCNV at the HAX1 interaction site and ABC transporter 1 site (peak association, p = 1.4eÀ4), although this did not meet the threshold (p = 8.3eÀ8) for genome-wide significance.
ABCB11 is enriched in human dopaminergic neurons residing in the dorsal tier of the substantia nigra pars compacta While ABCB11 has been primarily studied in hepatocytes, we explored whether it might show expression in human brain. In publicly available single-nuclei RNA-sequencing data from three brain regions-cortex, caudate nucleus, and substantia nigra pars compacta (SNpc) 52,53 -across the 151 cell types surveyed from these regions, we found that two dopaminergic (DA) populations in the SNpc showed the strongest expression of ABCB11, along with expression in subpopulations of layer 5 excitatory neurons in motor cortex ( Figure 5A). The two ABCB11-expressing substantia nigra (SN) DA populations also showed strong expression of CALB1 in a recent survey of human midbrain DA neurons 52 ( Figure 5B). Interestingly, calbindin-positive DA neurons reside in the dorsal tier of the SNpc, which projects to the ventral striatum, amygdala, as well as to cortical areas through the mesolimbic and, more preferentially, the mesocortical pathways ( Figure 5C). 52,54,55 These projections have been repeatedly implicated in SCZ pathology and treatment response. 56 We validated the expression of ABCB11 in human DA neurons of the SNpc with single-molecule fluorescence in situ hybridization (smFISH) across the midbrain of a postmortem neurotypical control. In support of the small nuclear RNA sequencing (snRNAseq) data, we found multiple TH+ (tyrosine hydroxylase, the gene encoding the rate-limiting enzyme for dopamine production)/ CALB1+/ABCB11+ cells residing in the midbrain pars compacta region ( Figure 5D). We also noted ABCB11 expression in EXC L5 FEZF2 CSN1S1 and EXC L3-5 FEZF2 ASGR2 cells ( Figure 5B), which correspond to Betz cells in the primary motor cortex 53 ; however, the relationship of this cell type to SCZ is less clear and we cannot rule out that ABCB11 might be present in other related excitatory layer 5 neurons.

DISCUSSION
We show that somatic CNVs may contribute a small but significant part of the genetic architecture of SCZ, mirroring previous findings of rare germline and de novo CNVs, 1,23 but involving a much more modest proportion of cases. The estimated excess burden of sCNV in SCZ would be 0.4%, which represents a preliminary estimate; we are limited to detecting events with large enough CFs to be present as mosaics in different tissues such as blood and are not able to assess events that might be restricted to brain, in addition to limitations of sequencing validation to better characterize potential sources of artifact. Future studies with additional orthogonal validation, accounting for hereditary stratification and germline background risks, might provide more accurate estimates of the risk carried by sCNVs in SCZ.
In this study, we also report the discovery of five SCZ cases with mosaic deletions of exons 1-5 that also cover the promoter of NRXN1a. Deletions of these exons were present in only two out $500,000 individuals in the UK Biobank, which has an ascertainment bias for healthy individuals, and were absent from our control cohort. This high prevalence in our SCZ cohort of relatively large $100-500-kb deletions, and the known involvement of germline NRXN1 mutations in SCZ and other neurodevelopmental disorders, suggests that mosaic deletions of exons 1-5 might also contribute to SCZ risk.
A study characterizing germline NRXN1 deletions from 19,263 clinical arrays in individuals with neurodevelopmental disease found that most of these events were present in the 5 0 end of NRXN1 and covered exons 1-5. 32 In a case series of germline deletions in NRXN1 in individuals with more severe developmental disorders, 57 two subjects with severe developmental delay had inherited deletions of exons 1-5. In contrast, germline deletions of NRXN1 in SCZ are widely distributed throughout the gene 23 rather than being concentrated in the first few exons as in neurodevelopmental disorders. 32,33 This contrast might indicate that germline deletions of exons 1-5 result in more severe developmental phenotypes but, if present in only a fraction of cells, could result in a milder phenotype resembling SCZ.
While the most parsimonious model of pathogenicity of somatic deletions in NRXN1 exons 1-5 is simple loss of function through deletion of the alpha promoter, the vast diversity of NRXN1 isoforms warrants further exploration of alternative mechanisms. Our analysis of Hi-C data using hiPSC neurons suggests a potential formation of a cryptic promoter once the NRXN1 alpha promoter is deleted, potentially forming an N-terminal truncated form of NRXN1, leading to a novel dominantnegative mechanism by trapping NRXN1a in the cytoplasm. This mechanism is consistent with higher intracellular protein levels of a NRXN1-binding protein CASK in hiPSC lines from SCZ patients with 5 0 NRXN1 deletions. 38 However, further transcriptional and functional experiments could better validate the presence and role of this putative cryptic promoter in NRXN1 and SCZ biology.
In this study, we also found five early-developmental recurrent somatic deletions in the ABCB11 transporter gene. These deletions were present only in the SCZ cases diagnosed with TRS, which is defined as nonresponse to at least two anti-psychotic medications 58 and affects $30% of individuals with SCZ. 59 Genes in this transporter family, including ABCB11, have previously been associated with differential response to anti-psychotics. 41 However, the exact mechanism by which mutations in these genes might lead to poor response to anti-psychotics remains unknown.
We show that ABCB11 is strongly expressed in human DA neurons, specifically within the dorsal stream of the SN. Most anti-psychotic medications used to treat SCZ target DA signaling in the brain, but how DA pathways become abnormal in SCZ remains unclear. Disruption of ABCB11 could alter the function of this key neuronal circuitry in a relatively cell-type-specific manner. While the exact role of ABCB11 on DA neuron physiology or excitatory layer 5 neurons is yet unknown, our results suggest this as an area for further inquiry with potential disease relevance.

Limitations of the study
This study has several limitations that represent further areas of research. While we were able to validate most of the variants from DNA we were able to procure, further orthogonal validation is warranted to provide more accurate estimates of the burden of sCNV in SCZ and further characterize sources of potential artifact. The main limitation to validation was obtaining DNA from samples processed at institutions across the world with diverse data-sharing protocols. Our study was also limited to studying variants present in blood at a high cellular fraction, restricting variants characterized to those that might have arisen during early development, which are predicted to also be mosaic in the brain. While this experimental setup provided large sample sizes to test whether SCZ-associated sCNVs were present, future studies using brain-derived tissue might allow further characterization of the potential risk of sCNVs in SCZ. We were also limited in detecting chimeric fusion genes since the SNP density of the array platforms used in this study ($1/3 of SNPs being heterozygous) prevents enough resolution to call these events confidently without a more dedicated method. In addition, the sparsity of clinical and environmental information in our dataset limited our ability to measure interactions of these factors with the burden of sCNV in SCZ, which suggests a potential area of future research. Finally, studying the functional role of the sCNVs in NRXN1 and ABCB11, and somatic variants in general, will require novel mosaic models such as organoids, or animal models, where specific fractions of cells carry the desired events. The development of these models was outside the scope of the current study but presents an exciting future direction. The data presented here represent an initial and preliminary study that is potentially of interest to the field as the role of somatic mutations in general, and sCNV specifically, in disease comes into focus.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

RESOURCE AVAILABILITY
Lead contact Further information requests for resources and reagents should be directed to and will be fulfilled by lead contact, Christopher A. Walsh (christopher.walsh@childrens.harvard.edu).

Materials availability
All unique/stable reagents generated in this study are available from the lead contact with a completed materials transfer agreement.
Data and code availability d Individual level SNP-array data is part of the Psychiatric Genomic Consortium with the corresponding privacy agreement. Access can be provided by applying through this website (https://www.med.unc.edu/pgc/shared-methods/how-to/). Whole genome sequncing data for validation experiments will be uploaded to the NIMH Data Archive after publication NDA: (https://nda.nih.gov/). d Filtered sCNV callset is inTable S2. d Scripts used to generate the main figures and analyses are available in a frozen Zenodo repository Zenodo: https://doi.org/10. 5281/zenodo.7778664. d PyMOL was used for ABCB11 schematic in Figure 4 using PBID: 6LR0. d Any additional information required to reanalyze the data reported in this paper is available.

EXPERIMENTAL MODEL AND SUBJECT DETAILS
SNP array data acquisition Allelic intensity data for cases and controls were obtained from the Psychiatric Genomic Consortium (PGC) CNV working group. The exact details of the data generation were previously described, 23 removing samples derived from cell lines. SNP array data consisting of 13,464 SCZ cases and 12,722 controls was obtained. These data were profiled with the Illumina OmniExpress, OmniExpress plus exome chip, Illum610K, and Affymetrix SNP6.0 arrays. For each determined position the B allele frequency (BAF; proportion of B allele), Log-R ratio (LRR; total genotyping intensity of A and B alleles), and genotype calls, were calculated.

Data processing
The genotypes from the SNPs from the arrays were phased using the Eagle2 64 software. Then, the BCFtools plug-in MoChA (2021-01-20 release) was used to confidently call mosaic CNVs, by taking advantage of long-range haplotype phasing of heterozygous SNP sites and BAF estimates of genotype array data. Genotyping and intensity data from Illumina platforms were distributed by the PGC in the Illumina GenomeStudio Final Report format, with the genomic positions genotyped using the hg18 human reference genome. To convert the Final Report format to VCF format, the rsID numbers were used to liftover coordinates to hg19, discarding positions without rsID, similar to Sherman et al. 3 Custom scripts were used to transform Final Reports to binary VCF format, and Illumina's TOP-BOT format was converted to dbSNP REF-ALT format using a modified version of BCFtools plug-in fix-ref. MoChA calculates cell fraction from BAF as follows: where CN is the copy number and DBAF is the deviation of B allele fraction compared to 0.50. This equation is valid for gains and losses.

Variant level quality control
In accordance with the suggestions of the MoChA processing pipeline, the following variants were filtered out: more than 2% genotypes missing, evidence of excess heterozygosity (p < 1e-6, Hardy-Weinberg equilibrium test), correlation of autosomal genotypes with sex (Fisher exact test comparing number of 0/0 genotypes vs. number of 1/1 genotypes in males and females), variants falling within segmental duplications with low divergence (<2%). This variant-level QC was performed on each separate batch.

Sample-level quality control
In order to filter out samples with contamination from another individual two statistics were calculated: BAF concordance and BAF autocorrelation. Briefly, BAF concordance calculates the probability that an adjacent heterozygous SNP has a deviation from a BAF of 0.5 given that the previous heterozygous site had the same deviation from 0.5. 65 BAF autocorrelation is the correlation of the BAF statistic at consecutive heterozygous sites once adjusted for the genotype phase. Samples with contamination with DNA from another individual would be expected to have a BAF concordance >0.5 and BAF autocorrelation >0 because of allelic intensities correlated at variants within haplotypes shared between sample DNA and contaminated DNA. Samples with BAF concordance >0.51 or BAF autocorrelation >0.03 were removed.

Event type classification
An Expectation Maximization algorithm was applied to classify events as either a Gain, Loss, or CN-LOH. The algorithm determines the slopes that characterizes the relationship between the deviation of the LRR from 0 jDLRRj, and the BAF deviation from 0.5, jDBAFj. In other words, the events are classified based on the optimization of linear regression parameters described by jDLRRj = jDBAFjb c + e, where c˛fGain; Loss; CN À LOHg, b c is the slope for each event type, e $ Nð0; s 2 c Þ is the error for each event-type clustering.
To further enhance the robustness of the classification method, we used the fact that CN-LOH events are expected to be less common within the chromosomes compared to events that extend to the telomeres. Since CN-LOH events are thought to arise during mitotic recombination, for them to occur within a chromosome would require a double crossover, which is highly unlikely. To incorporate this information into the classification model, we estimated the frequency using the UK Biobank sCNV calls 18,19 for of each event type occurring on telomeres and interstitially. These frequencies were used as priors to multiply the likelihoods for each event type, resulting in posterior probabilities. The computation for each event S i is as follows: Let X = jDBAFj and Y = jDLRRj, then PrðS i = c j L i ; X i ; Y i ÞfPrðL i Þ e À ðYi À Xi b c Þ 2 =2 s 2 c , where L i is an indicator of whether the event involves a telomere, and c is defined as above. This estimation is calculated for each event type and then normalized to sum to one.

Filtration of mosaic CNV calls
Filtration was focused on removing potential germline events and events likely to arise due to age-related clonal hematopoiesis, as well as artifacts. We required events to have a log10-odds >10 for the model based on BAF and phase, which measures how much more likely the data for a given segment of DNA is consistent with a non-diploid model than a diploid model. Events that were classified as copy number polymorphism (known CNV polymorphisms in 1000 Genomes Project) by MoChA were filtered out as possible germline events. We further excluded events that had a reciprocal overlap with events from control samples or with any CNVs reported in the 1000 Genomes project by >50%. Events that overlapped >50% with germline events previously identified in the same sample by the PGC 23 were also removed for duplications, since small duplications with high BAF deviations can be mistakenly identified as somatic variants. Copy number state was taken into consideration when calculating overlaps, i.e. overlap between gains and losses were not considered. Calls with an estimated cell fraction of 1 were also removed. For gains, we further removed any events with a deviation in BAF greater than 0.10 to have a conservative assurance that germline gains were not misclassified as mosaic, as germline gains tend to be small and produce large deviations from the a BAF of about 1/6. 18 Finally, since most of our datasets did not include age information for individuals besides the broad estimate of being younger than 40, we used a conservative approach to remove events that could have risen from clonal hematopoiesis. CN-LOH events were fully excluded from any downstream analysis as these events have been shown to be largely enriched in clonal hematopoiesis events. 18 We also removed sCNVs that contained loci commonly altered within the immune system, specifically IGH (chr14:105,000,000-108,000,000) and IGL (chr22:22,000,000-40,000,000). We also excluded CNVs within the extended MHC region (chr6:19,000,000-40,000,000). In addition, we removed deletion involving the following loci that are frequently affected by clonal hematopoiesis: 20q11, DNMT3A, TET2, 13q14, 17p, 5q14, ATM. We removed duplications in 15q. We also removed any sCNVs in 7q34 and 14q11.2, as well as trisomy 12 events. We also removed events whose copy-number state could not be determined.

Statistical analysis
Overall burden analysis To test the hypothesis of whether more individuals with at least one sCNV of cell fraction greater than a given cell fraction cut-off in cases vs. controls, the two-sided Fisher's Exact test was used. 3 The 95% confidence intervals were calculated using Wilson's score interval. For the meta-analysis using each batch separately we used a one-sided Fisher's Exact test. The p values were combined using the Tippet's (minimum p value), and the Liptk's (weighted sum of p values) approaches.
Cell fraction, gene-set, length, and gene number burden analysis To calculate the contribution of the features of gene, length, and gene number burden, we fit a mixed effect logistic regression on the case-control phenotype as the outcome variable. Let y i˛f 0; 1g be an indicator of whether the subject is diagnosed with SCZ or a control respectively. We modeled the burden as follows: where X LENGTH and X #genes are the sum of the length and number of genes overlapped by events of individual i, and X meanCF is the mean cell fraction of the events of individual i. Inference was not altered by the sufficient statistic used to summarize cell fraction (i.e. min, max, median). In the models above we were interested on testing whether bs0 for the feature of interest. The models were fit using a generalized mixed-effect model as implemented by the R package lme4 62 to account for the sample collection batches of the PGC. Statistical significance was assessed using the Satterwhite approximation to the t-test as implemented in the package lmerTest. 63 Gene set enrichment analysis We used a similar approach as recommended by Raychaudhri et al. 66 to control for event length and rate, which might result in false positive associations with neuronal genes. Namely, we fit the following model where the parameters are as defined the section above, but with X #sCNVs is the number of sCNVs in that individual, and X geneset is the number of genes in an event that intersect a gene-set of interest. We then used the likelihood ratio test to test whether b geneset s 0. We used 3 gene-sets: (1) Brain expressed genes: defined as the top 20% of brain expressed genes from the GTEx GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz (https://www.gtexportal.org/home/datasets). (2) Synaptic genes obtained from SynaptomeDB (http://metamoodics.org/SynaptomeDB/index.php). (3) High pLI genes, i.e. pLI >0.90, obtained from ExAC (file: fordist_cleaned_nonpsych_z_pli_rec_null_data.txt) (https://gnomad.broadinstitute.org/downloads).
Permutation test for enrichment of sCNV overlapping exons 1-5 of NRXN1 We used the R package regioneR 60 to randomly shuffle the 7 sCNV that overlapped NRXN1 across the NRXN1 locus using the ran-domizeRegions function, to generate a null distribution of overlaps to perform a boostrap test. We added a padding of 1Mb to the 5 0 and 3 0 ends of the NRXN1 locus. After randomly shuffling the sCNV we counted how many segments overlapped exons 1-5. We repeated this procedure 10,000 times. The p value was calculated empirically by the fraction of overlaps greater than the observed 5. Since we performed 10,000 iterations our smaller possible p value was 0.0001.

Germline CNV analyses
We obtained gCNV final calls from the SCZ Phase 2 study by the PGC CNV working group. 23 We narrowed down the gCNV calls to those that were identified in the same genotype arrays that were analyzed for sCNVs. To further control for sensitivity between the methods used to call sCNVs and gCNVs we focused on gCNV events with size >100Kb. Length analysis were performed using a lognormal mixed effect model framework using sample batch as the random effect. Gene burden analysis was done with a negative binomial mixed effect model using batch as a random effect, and log(event length) as a covariate.

Breakpoint microhomology analysis
For the NRXN1 somatic deletions, we identified the breakpoints at the single base resolution by looking for clipped reads with IGV 61 in the vicinity of discordant paired reads mapping to genomic locations that implied a larger insert size than expected. Microhomology was identified by looking at the surrounding bases of the clipped reads covering the breakpoint and looking for corresponding identical basepairs.
Characterization of the mechanism of origin was identified using the strategy described in Yang et al. 35 In brief, if there was no microhomology nor insertions >10 bp, the event was predicted to be created by non-homologous end-joining repair (NHEJ). If there was a microhomology >2 bp but <100 bp, the event was classified as alternative end joining (alt-EJ). If the microhomology was >100bp, which was not observed in this study, the event was classified as non-allelic homologous repair (NHAR).