Microarray Analysis of Copy Number Variants on the Human Y Chromosome Reveals Novel and Frequent Duplications Overrepresented in Specific Haplogroups

Background The human Y chromosome is almost always excluded from genome-wide investigations of copy number variants (CNVs) due to its highly repetitive structure. This chromosome should not be forgotten, not only for its well-known relevance in male fertility, but also for its involvement in clinical phenotypes such as cancers, heart failure and sex specific effects on brain and behaviour. Results We analysed Y chromosome data from Affymetrix 6.0 SNP arrays and found that the signal intensities for most of 8179 SNP/CN probes in the male specific region (MSY) discriminated between a male, background signals in a female and an isodicentric male containing a large deletion of the q-arm and a duplication of the p-arm of the Y chromosome. Therefore, this SNP/CN platform is suitable for identification of gain and loss of Y chromosome sequences. In a set of 1718 males, we found 25 different CNV patterns, many of which are novel. We confirmed some of these variants by PCR or qPCR. The total frequency of individuals with CNVs was 14.7%, including 9.5% with duplications, 4.5% with deletions and 0.7% exhibiting both. Hence, a novel observation is that the frequency of duplications was more than twice the frequency of deletions. Another striking result was that 10 of the 25 detected variants were significantly overrepresented in one or more haplogroups, demonstrating the importance to control for haplogroups in genome-wide investigations to avoid stratification. NO-M214(xM175) individuals presented the highest percentage (95%) of CNVs. If they were not counted, 12.4% of the rest included CNVs, and the difference between duplications (8.9%) and deletions (2.8%) was even larger. Conclusions Our results demonstrate that currently available genome-wide SNP platforms can be used to identify duplications and deletions in the human Y chromosome. Future association studies of the full spectrum of Y chromosome variants will demonstrate the potential involvement of gain or loss of Y chromosome sequence in different human phenotypes.


Introduction
Multiple genome-wide analyses of copy number variants (CNVs) have been fruitful in the search of genes containing genomic variations causing human disease [1]. Although these tests are supposed to search throughout the genome, the Y chromosome has been excluded from most investigations [2]. Studies of CNVs on the Y chromosome are not only of interest for fertility research [2,3], but are also relevant for cancer [4,5], genealogy [6,7], forensic medicine [8,9], phylogeography and population genetics [10,11], and the determination of sex differences in brain and behaviour [12,13]. The exclusion of the Y chromosome has been due to the idea that the highly repetitive sequence structure and haploid state would hamper analysis with current genome-wide CNV methods [14]. The Y chromosome is the shortest and least gene rich of all human chromosomes. It has lost many genes during evolution, and includes 78 genes in modern humans. The sequence is divided into three specific regions named X transposed, degenerative, and ampliconic [15,16]. The X transposed region has high homology to the X chromosome, and contains two single copy genes. The degenerative region exhibits higher divergence to its X homologous part and contains 16 genes. The ampliconic region shows the least similarity to the X chromosome and differs distinctively from the previously mentioned regions by the presence of eight palindromes, each varying in size and gene content [17].
The high homology of a segment of the p-arm on the Y chromosome (p11.2) to the corresponding part on the q-arm on the X chromosome (q21), together with the enrichment in repetitive and palindromic sequences elsewhere on the Y chromosome, make studies of CNVs on the Y chromosome a challenging task. Traditionally, methods such as PCR, qPCR and FISH have been used to dissect CNVs in specific regions of the Y chromosome [18][19][20]. These methods range from wide detection range (FISH), to discrimination of short segments (PCR and qPCR), and the main drawback is that they only analyse the presence or absence of CNVs in relatively short genomic regions, while the rest of the Y chromosome remains unexplored [21]. In this study, we used intensity signals after hybridisation to a Genome-wide human SNP array 6.0 from Affymetrix (Affymetrix 6.0 array). The fact that these arrays were not frequently used for the analysis of Y chromosome variants is surprising, since the male specific region (MSY) of the Y chromosome is well represented by 8179 probes, of which 288 are single nucleotide polymorphisms (SNP) probes and 7891 are copy number (CN) probes. The distribution of the probes covers most of the Y chromosome sequence, including palindromic areas. The pseudo autosomal regions (PAR) contain genes which have been found to have implication in pulmonary aveolar proteinosis [22], bipolar affective disorder, asthma [23] and short statue phenotype [24]. Recent evidence shows that duplications of XG and GYG2 genes on PAR1 can be traced within at least two Y-chromosomal subhaplogroups, namely subhaplogroups I2a-P37.2 and R1b-P312 Ã [25]. The PAR1 is covered by 426 SNP/CN probes with an average distance between the probes of 5.9kb. PAR2 has a lower coverage by only 42 SNP/CN probes with an average distance of 6.4kb. Copy number variant detection within PAR regions is indeed possible but several genes such as CRLF2, CSF2RA and VAMP7 are not covered by any probes while the IL3RA and IL9R are represented only by one and two probes respectively. The probability of CNV detection within IL3RA and IL9R is minimal unless the CNV occurs over a larger region that will also include probes in these genes as well. Some small regions with highly repetitive sequence are not well represented by SNP/CN probes, including the TSPY multicopy gene region on the p-arm, and some amplicons in the AZFc region such as: u2, b1, b2, g1, r1, r2, r3 and g3.We here show that these arrays can be used to detect most of the previously known as well as many novel deletions and duplications in the Y chromosome.

Samples
We first analysed 271 Norwegian males previously included in a larger sample set used to investigate genome-wide associations with brain morphology, schizophrenia and bipolar disorder [26][27][28]. Affymetrix Genome Wide Human SNP Array 6.0 CEL files were available for reanalysis from their previous investigations that did not include the analysis of the Y chromosome. The 271 Norwegian males included 113 healthy controls and 158 cases, of which 57 were diagnosed with bipolar disorder, 65 with schizophrenia or schizoaffective disorder, and 36 had other diagnoses including depression and non-specified psychosis. Genomic DNA was available for all the 271 males. All the procedures performed in the current study were in accordance with the Declaration of Helsinki and all human DNA samples from the Norwegian population were collected after signed informed consent was obtained. Ethical permission for this work was from the Norwegian Ethics Committee (IRB number 2009/2485).

Expansion of data
We also expanded our dataset by including Affymetrix Genome-Wide Human SNP Array 6.0 CEL files obtained from different populations available at the NCBI GEO Datasets [29]. These samples are referred to as "extended dataset". The accession numbers for each dataset are as follows: GSE21661. It includes data from 31 Tibetan samples from a study of high-altitude adaptation in Tibet [30]. 11 males were included in our study.
GSE29851. This is a genetic analysis of ancestry in Bolivian and Totonac populations of the New World [31]. It includes 24 Mesoamerican Totonac and 23 South American Bolivian males, all of which were included in our analysis. GSE30481. A population study of copy number variations in China that did not include Y chromosome analysis [32]. From the total of 155 individuals, 79 males were included in our study.
GSE18333. An association study in which normal prostate tissue was compared to cancer prostate tissue. Samples were from Western countries and China [33]. 44 samples were included in our study.
GSE23201. Genome-wide CNV association study for schizophrenia in which the Y chromosome data was not included in the original analysis [34]. From the total of 735 individuals, 415 males (151 cases and 264 controls) were included in our study.
GSE23636. A genome-wide SNP investigation of Ashkenazi Jewish population structure [35]. From 471 individuals, 236 males were included in our study.

GSE15826.
A study investigating sporadic motor neuron disease in patients and controls. Of 164 samples, 77 males were included in our study.
GSE13429. A study searching for CNVs in individuals with early onset colorectal cancer [36]. From 10 patients that were screened by Affymetrix 6.0 array, only 5 were males and therefore included in our study.
In addition to the NCBI GEO Datasets, we included data from two additional websites: HapMap 3. (http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html#t_ download). This set included samples from 683 males and 714 females [37]. Females, individuals for which sex could not be determined and samples from data sets GIGAS, SCALE and SHELF were not included. Therefore, a total of 510 male samples were included in our study.
CG-SER-426. (http://www.cangem.org/CG-EXP-175/). Genome-wide association study of patients with developmental disorders. The Y chromosome was excluded from the analysis. From a total of 35 cases, 23 males were included in our study [38].
In total, 1718 CEL files from males were included in our study (S1 Table).

Copy number variation analysis
Re-analysis of CEL files for CNV detection on the complete data set (1718 individuals) was done using Affymetrix Genotyping Console Software 4.1.3.840 (GTC). Copy Number/LOH analysis was performed using the Regional GC Correction setting and samples were normalized against GenomeWideSNP_6.hapmap270.na33.r1.a5.ref reference samples. The Genome-WideSNP_6.na33.annot.dn was used as annotation reference model. The segment report analysis was done with the following settings: minimum number of markers per segment set to 3, minimum genomic size of a segment (kbp) set to 5 and including 100% of the segments that overlap with known CNV regions. Since low stringency settings were used for maximum detection of possible CNVs, many false positive segments were marked in regions of low probe content. These false positives were curated manually after visual inspection of the plots produced by GTC. The results were exported as tab delimited text files containing the Log 2 ratio values for the Y chromosome, chromosomal position and the dbSNP RS ID values for graphical representation and additional analysis.

SNP/CN probe assignment to ampliconic and palindromic regions and CNV type determination
To assign the SNP/CN probes included in the array to different palindromic and ampliconic sequences in the Y chromosome, we first selected from the literature sequence-tagged site (STS) markers previously used to define the limits of palindromic and ampliconic sequences. Markers sY1312 and sY1304 delimited P7, markers sY1275 and sY1276 for P6, sY1264 and sY1283 for P5, sY1283 and sY1277 for P4, sY1315 and sY1259 for IR2, sY142 for distal end of U1, sY1259 and sY1191 for P3, sY1191 and sY1192 for U3, sY1291 and sY1125 for gr1, sY1125 and sY1054 for b3, sY1054 and sY1206 for Y1, sY1206 and position of BPY2 for g2, sY254 and the position of DAZ3,4 genes for r3 and r4, sY1206 and sY1054 for Y2, sY1054 and sY1125 for b4, sY1054 and 1201 for gr2 [16,21,[39][40][41][42]. Then, we extracted position information for all the above listed markers from the UCSC genome browser (hg19). Finally, All SNP/CN probes within the start and stop positions for the limiting STSs were assigned to each particular palindrome/amplicon. For all 235 individuals with detected CNVs, we plotted the intensity signals for all MSY probes using colour codes to represent different palindromic and ampliconic sequences according to Repping et al. [41]. Copy number state of palindromic or ampliconic regions were used to assign each CNV pattern to different subtypes following the nomenclature by Repping et al [41]. One example for each type of CNV is shown in (S1 Fig).

PCR verification of CNVs
Samples from individuals in which SNP array analysis detected CNVs were subjected to molecular verification by PCR. Primers that amplify sequences included in Region specific Sequence Tagged Sites were selected from the NCBI Probe database (http://www.ncbi.nlm.nih.gov/ probe) as follows: primers for sY1191 and sY1192 amplify a segment included in the U3 region and sY1291 primers map to the boundary between r2 and the gr1 segment. The reaction mix contained: 2.5μl PCR Buffer (Thermo Scientific), 0.5μM of each of the primers, 0.2μM for each of the dNTPs, 0.5 units of Taq DNA Polymerase (Thermo Scientific) and 50ng of genomic DNA in a final volume of 25μl. PCR amplification was carried out on a PTC-100 thermocycler (MJ Research) with an initial heating at 94°C for 90 seconds, followed by 36 cycles of 94°C for 30 seconds, annealing at 60°C (sY1191), 63°C (sY1192) or 62°C (sY1291) for 35 seconds, 72°C for 60 seconds and final extension at 72°C for 5 min. Products were separated by electrophoresis on 1.75% agarose gels using 7μl or product mixed with 3 μl Gel Orange. Gels were stained using GelRed (1μg/ml of 0.5M TAE Buffer) and photographed under UV light using Gel DocTM XR (Bio-Rad, Hercules, CA, USA). Female samples we amplified as controls for each set of primers.

Real-time PCR analysis
Genomic DNA for individuals with P6 duplication, male controls without CNVs and samples from females were diluted to a final concentration of 5ng/μl. All samples were run in triplicates. Two STS markers, sY1081 and sY933 were selected because of their location within the P6 duplicated region, while RH38681 was chosen because of its location upstream from this CNV variant. Primer sequences for these markers were selected from the NCBI Probe database (http://www.ncbi.nlm.nih.gov/probe), and were controlled for appropriate product size and annealing temperature for qPCR experiments. The reaction mix included 10 μl Power SYBR Universal PCR Master mix (ABI), 1.25 μl for each of the primers (5 μM), 5 μl DNA (25 ng) and RNase-free water to a final volume of 25 μl. The real-time PCR amplification was carried out on a 7500 Real-time PCR system (Applied Biosystems) with initial heating at 94°C for 90 seconds, followed by 40 cycles of 94°C for 30 seconds, annealing at 61°C for 35 seconds, 72°C for 60 seconds and a final extension at 72°C for 5 min. A melting curve from 94°C to 55°C for 20min was run to control for the absence of primer dimers.

Determination of Y chromosome haplogroups
The phylogenetic position of each Y chromosome, based on the 91 SNPs which are in both the Affymetrix 6.0 array and the most up-to-date Y-chromosomal tree [43] was determined with the AMY-tree algorithm [44,45], which takes into account the missing Y-SNPs in order to retrieve the most accurate sub-haplogroup possible.

Statistical analysis
Principal Component Analysis was performed using JMP, Version 11. SAS Institute Inc., Cary, NC, 1989-2007. Data from 246 MSY specific SNPs together with the haplogroup assignment for each individual were used for the analysis using the standard (correlation) settings. SNP information was translated to numbers 1 to 4 for G, C, T and A. Missing values were given value of zero. Data from first and second eigenvectors were used to prepare the figures in the manuscript.
Pearson Chi-Square, Likelihood ratio and Fisher´s exact test were performed using the Statistical Package for Social Sciences Statistical Software release 21 (SPSS Windows Version 21, SPSS, Inc.).

Results
Affymetrix 6.0 arrays can be used for CNV detection on the Y chromosome To validate the use of Affymetrix 6.0 arrays for the detection of Y chromosome specific variants, we compared signal intensity ratios for all Y chromosome SNP/CN probes between a female, a male, and an additional male known to have an isodicentric Y chromosome [20]. In all cases, the average intensities of males were higher than for females. The total average for all 8179 Y chromosome SNP/CN probes in ten randomly selected males was -0.47 (Log 2 ratio), while the corresponding number for ten females was -2.02. Similarly, when we calculated the average intensity for each set of three consecutive SNP/CN probes in 61 females versus 25 males, most sets of probes presented large differences between females and males. However, three regions presented high background levels in females, with average intensities higher than -0.47 (S2 Table). The individual with an isodicentric Y chromosome has a deletion spanning most of the q-arm and a duplication of the remaining part of the chromosome (20). As expected, this individual had signal intensities similar to female in the deleted part of the Y chromosome, and about twice the intensities observed in males in the duplicated part, resulting in an average Log 2 ratio of 0.04 in this region. These results show that almost all (99.9%) Y chromosome SNP/CN probes can be used to detect deletions and duplications throughout the Y chromosome.
In addition, the last variant was named Palindrome 6 duplication (P6 dupl), started at position 18.279.605 (UCSC genome browser hg19) and ended at position 18.843.147. This duplication has not been previously described (Fig 2J). In the Norwegian population, five individuals displayed this variant, two of which also had a CNV in the AZFc region, including one b2/b3 del and one gr/gr dupl. We confirmed detected variants using STS markers for PCR analysis. sY1191 and sY1192 did not amplify in individuals with b2/b3 del and blue-grey dupl while they produced amplification bands in individuals with gr/gr del. Marker sY1291 amplified positively in individuals with b2/b3 del and blue-grey dupl while it was negative for gr/gr del (data not shown). For confirmation experiments of the newly discovered P6 dupl. we used a qPCR approach including markers RH38681, sY1081 and sY933 (Fig 3).
The figure shows that markers sY1081 and sY933 had larger amplification signals in an individual with P6 dupl. compared to a control.
To increase the chance to detect as many CNVs as possible, we included in the screenings both healthy individuals as well as individuals with psychiatric diagnosis as described in the Methods.

CNV discovery in an extended dataset
When we analysed the signal intensities of all Y chromosome SNP/CN probes in the extended data set obtained from 1447 individuals from different populations (described in Methods), we found 17 additional types of variants, resulting in a total of 25 different CNVs in the complete data set (S1 Fig). Of these, several have not been described previously. For example, we named one of the novel variants "blue-grey like duplication" based on its resemblance to the previously described blue-grey dupl c449. In both cases the gr amplicons are duplicated and the U3 region is deleted (compare Fig 2G with S1P Fig). The difference between them is that the Y1, Y2 amplicons are deleted and the r amplicons are duplicated in the newly discovered variant, while these changes are absent in the regular blue-grey dupl. Furthermore, the g amplicons are deleted to a greater extent in the new variant (S1P Fig). The second newly found CNV was named "P5 duplication" and includes a large group (n = 19) of similar but not identical duplications in the P5 region (S1E Fig). Interestingly, another novel pattern includes two duplications appearing in tandem. The first duplication is spanning the region upstream of P5, and the second duplication encompassing the region immediately after P4 (S1F Fig). In general, we have found several additional novel duplications that cannot easily be discovered with traditional PCR analysis of STS markers (S1A, S1B, S1D, S1E, S1F, S1H, S1K, S1U, S1W and S1X Fig). In addition, we found three rare deletions in one individual each (S1B, S1C and S1J Fig).Although, it should be noted that we did not have access to the genomic DNA for the samples in the extended dataset and the therein observed CNVs could therefore not be verified by molecular methods.

Determination of Y chromosome Haplogroups
Only 91 out of the 246 Y-SNPs present on the Affymetrix chip were informative in this study as the phylogenetic positions. The other 155 Y-SNPs were indeed unknown based on the used updated tree and other already published phylogenies. Extensive effort was put down to increase phylogenetic information in our study by exploring all open data sources and publications with full Y-chromosomal sequences in order to include any of the 155 remaining Y-SNPs into the phylogeny. As such, we explored all partial Y-chromosomal sequences from Hallast et al. [10], all full WGS samples mentioned in Van Geystelen et al. [43] and even all community-driven/experimental phylogenies on the ISOGG-website (www.isogg.org). Nevertheless, the phylogenetic position of the 155 Y-SNPs remained unknown or could not be resolved reliably and thus be informative in our study.  To determine haplogroups (HGs) in the complete collection of 1718 Y chromosomes, we applied the AMY-tree algorithm [44,45] and the most up-to date Y chromosomal tree [43]. Individuals assigned to specific sub-haplogroups were backmerged to their corresponding haplogroup according to the information available at the minimal Y tree [38] (http://www. phylotree.org/Y/). S2 Fig shows that we found 9 main haplogroups in our dataset. In addition, three groups of individuals were assigned to internal nodes because the arrays did not contain SNP information for a particular haplogroup. For example, the arrays did not contain diagnostic SNPs for haplogroup N-M231. Forty individuals were not mutant for the SNPs of haplogroup O-M175 and they were mutant for the SNPs that define NO-M214. Therefore, they were assigned to node NO-M214 named NO-M214(xM175) in all analysis herein (S2 Fig). In other words, forty samples most likely belong to haplogroup N-M231, but we could not assign them to this haplogroup as there were no Y-SNPs describing N-M231 or one of its subhaplogroups included in the Affymetrix 6.0 arrays. Similarly, ten individuals were included in haplogroup F-M89(xF1329) as they were mutant for the SNPs of F-M89 but not for those of GHIJKLT-F1329, and the arrays did not include diagnostic SNPs for F2-M247 or its subhaplogroups. A third group of 29 individuals was included in haplogroup KLT-M9(xM526) as they were mutant for the SNPs of KLT-M9 but not for those of K-M526 and its subhaplogroups (S2 Fig). For 38 individuals the determination of the haplogroup was ambiguous, in the sense that the algorithm suggested more than one possible HG. Additional 17 individuals did not have any sequence difference from the root sequence, thus could not be assigned to any HG. All these 55 individuals were excluded from further statistical analysis. As it can be observed in S2 Fig, three additional groups of individuals were assigned to internal nodes. These groups consisted of 16 individuals assigned to internal node DE-M145, 30 individuals in node P-P295 and 111 individuals in node IJ-M429. There was not enough supporting data to assign a more specific haplogroup, although this should be possible based on the SNPs included in Affymetrix 6.0 arrays. This suggests that information for SNPs important for haplotype determination was missing for these samples, and we removed these 157 individuals from PCA and CNV frequency analysis.
Principal component analysis was performed as described in Methods with data from 1506 individuals belonging to 12 haplogroups. We found that the first and second eigenvectors explained 24.1% and 13.4% of the variability between groups. In general, the HGs are clearly segregated by this analysis, particularly R-M207, E-M96 and I-M170. Haplogroups NO-M214 (xM175), KLT-M9(xM526), F-M89(xM1329), G-M201, J-M304 and O-M175 exhibit larger similarity between each other, but they can still be distinguished (Fig 4A).

CNV distribution among haplogroups
When we linked all the discovered Y chromosome variants with the corresponding HG we observed uneven distributions. Table 1 shows ten variants for which the distribution was significantly associated with one or more haplogroups as determined by Pearson Chi-Square, Likelihood ratio and Fisher´s Exact Tests.
The gr/gr deletion was distributed among many haplogroups but it was significantly overrepresented within haplogroup D-M174, with five of six individuals containing this variant (p = 7E-10, Fig 5C). P5 dupl, P4 dupl. and b1/b3 del. were all overrepresented in more than one haplogroup (Table 1 and S3 Fig). The first CNV variant listed in Table 1 corresponds to a single occurrence of a q-arm duplication together with an U3 deletion observed in one male belonging to haplogroup C-M130 (S1B Fig). Despite the low frequency of this CNV, the small amount of Y chromosomes belonging to haplogroup C-M130 (24 individuals) makes this observation significant (p = 0.027 by two sided Fisher´s exact test). The q-arm duplication spanned exons 34-44 including the 3´UTR in the USP9Y gene (positions 14939249-14974460 according to hg 19, size 35kb, including 21 SNP/CN probes). The USP9X region did not exhibit any variations in this individual.
The remaining 14 variants were not significantly overrepresented in any haplogroup (S3 Table). However, it can be noted that the newly discovered P6 duplications were present in A. The figure shows the graphical representation of the first two eigenvectors after PCA analysis. Y-axis corresponds to the first vector explaining 24.1% of the variation and X-axis explains 13.4% of the remaining variation. Each dot represents the results from one individual and the colour represent each HG as denoted by letters in the figure. The plus symbols in black denote individuals for which HG determination was ambiguous. The black triangles denote individuals for which no HG could be assigned. B. The results from the individuals carrying the blue-grey dupl are represented in blue, while the results from individuals carrying "blue-grey like dupl." are in red. All cases are included within the NO-M214(xM175) haplogroup. In total 11 individuals carry these variants (Table 1) (Table 2). In the data set including 1506 Y chromosomes for which HG information was available, 14.7% exhibited some kind of CNV but the distribution was very uneven between haplogroups. Most notably, 97.5% of individuals from haplogroup NO-M214 (xM175) carried a CNV. Haplogroup D-M174 indicated high prevalence of gr/gr del (83.3%) but it should be noted that this haplogroup had the lowest amount of samples in this study, only 6 individuals.

Assessment of CNV presence in patient vs control samples
To investigate if the males with psychiatric diagnosis in the Norwegian population differed from the controls in CNV frequencies, we calculated the total frequency of variants in each group and found that 10.6% of healthy individuals included CNVs, while 13.8% of patients with schizophrenia and 14% of bipolar subjects included variants. These differences were not significant according to Fisher's test. In males from the GSE23201 study, 13 out of the 139 included schizophrenia cases exhibited a CNV (8.6%) while 21 out of 132 control males carried a CNV (15.9%) resulting in no significant difference between these groups (p = 0.14).  psychosis merged). Since most of the CNVs occur in the AZFc region which contains fertility genes, then the probability that those CNVs might be involved in contribution to psychiatric disease becomes highly unlikely. In general, the CNVs on the p-arm and the upper q-arm would be more probable candidates for association to psychiatric diseases since those regions hosts PCDH11Y and NLGN4Y. Both of these genes expressed in the central nervous system and are involved in cell adhesion and synapse formation respectively.

Discussion
We show here that Affymetrix 6.0 arrays can be used for high resolution detection of Y chromosome specific deletions and duplications, and therefore genome-wide analysis of copy number variants using this type of arrays should include the analysis of Y chromosome. In addition, MSY specific SNPs included in these arrays can be used to determine with certainty most major Y chromosome haplogroups. An analysis of a large collection of 1718 individuals allowed detection of many previously known, as well as many novel Y chromosome copy number variants. Most important, we found that several of these CNVs are significantly overrepresented in specific haplogroups.

Y chromosome CNV detection
The Affymetrix 6.0 arrays are suitable for studies of Y chromosome variants not only because of the high coverage and wide distribution of 8179 SNP/CN probes throughout the Y chromosome, but also because we show that most of these probes (excluding those listed in S1 Table) could distinguish the signal intensities between a male, a female, and an individual with isodicentric Y chromosome. Regions not covered by any probes include the centromere and some of the amplicons in the AZFc region, including r1, r2, g1 and g4. However, the analysis of the remaining amplicons can be used to evaluate copy number state of the missing amplicons, thereby allowing determination with high resolution of the subtype of CNV according to the nomenclature by Repping et al [41]. Unfortunately, the region containing an array of TSPY genes is represented by only three probes and therefore cannot be used for CNV detection. Copy number variations in this region have been previously associated with male infertility or prostate cancer [50,51].

Discovery of Y chromosome CNVs
We detected 25 distinguishable CNV patterns in 242 out of 1718 surveyed Y chromosomes. Therefore, 14.1% of males carried one or more CNVs. Of these 9.0% corresponded to individuals that had duplications, 4.5% individuals had deletions and 0.6% had both types of variants. A large population analysis combining deletions and duplications in the Y chromosome has not been done previously, but a large survey of deletions has been done in 20.000 males [52]. This study found that 3.7% of the individuals carried any of the six deletions analysed. When we added the percentages of the corresponding deletions in our study the result was 4.1%. This shows that our SNP array analysis has similar power to detect deletions. A novel observation in our study is that duplication events occur at more than twice the frequency of deletions. Most of the duplications we discovered are novel, although duplication of the AZFc region and the Palindrome 5, which are the most commonly found in our study, have been detected previously (but not assigned to specific variants) by the first generation of comparative genomic hybridization arrays [53,54]. In our study, both deletions and duplication of the AZFc region were common. In contrast, in the palindromic regions P6, P5, P4 and IR2 we found almost exclusively duplications. One possibility for why we observe only duplications in P6 and P4 palindromes, could be the fitness effect. A loss of the genes contained by these palindromes might have severe impact on the carrier and thus not being propagated or seen very often. Although, the low gene content in P6 contradicts the fitness argument, while the P4 gene content (HSFY2 and TTTY9A, NCRNA00185, CD24 and TTTY14) on the other hand supports it. Especially, as the TTTY14 is known to be expressed in the male brain during early foetal development [55]. In the work of Hallast et al. [40], the authors notes that there is a ncRNA (107bp) residing in the P6 structure. Upon closer inspection that snRNA (RNU6-109P1) is classified as a pseudogene. Also, AC053516.1 can be found within P6 and this 111bp sequence is classified as miRNA (UCSC genome browser hg 19). After all, the main function of a palindromic structure is to protect the sequence from degradation by gene conversion [56]. High conservation of the sequence in P6 between human, chimp and gorilla speaks in favor of functionality of the region [40]. Yet another possible explanation to the overrepresentation of duplications in the palindromic regions could be that the surrounding sequences of the palindrome borders might be enriched for VNTRs or different breakpoint sequence motifs which are mainly associated with duplications as described by Conrad et al. [57]. Two other mechanisms that might play a role in the skewed distribution could be Non-homologous end joining and Microhomology-mediated break-induced repair. Nevertheless, in the light of the data generated by Conrad et al. [57], the probability for the latter mechanisms being the key players in generation of duplication overrepresentation is low.
Another aspect that might influence the observed overrepresentation of duplications in the entire dataset might be the composition of HGs included. For example the HG E-M96 exhibits an exaggerated ratio of dupl to del of 3.8:1. Finally, our study unlike most others, investigates almost the entire MSY with probes spanning both intronic and exonic regions. The possibility of detecting CNVs in the intronic regions might lead to skewed ratio compared to other studies which focus on CNV detection within exonic sequences.

CNVs overrepresented in specific haplogroups
Despite the relatively low amount of phylogenetically informative Y chromosome SNPs (n = 91) in the arrays, we were able to distinguish between most known major haplogroups, as recently described by a next generation sequencing approach [10] and also in [58]. When we studied the haplogroup distribution of the 24 detected CNV patterns, we found that ten of them were significantly overrepresented in one or more haplogroup. Of these, the most prominent result was the high frequency (67.5%) of b2/b3 del (c35) in haplogroup NO-M214 (xM175). As explained in the results section, this haplogroup is most probably equivalent to N-M231. Therefore, we confirm the high frequency of b2/b3 del in haplogroup N-M231 observed previously [41,46,52]. Interestingly, two additional CNV patterns were detected only within this haplogroup. These include the blue-grey dupl. (c449) and a novel pattern that we denoted "blue-grey like dupl" not previously described in the CNV predictions by Repping et al [41]. These variants were found in several of our population data sets, including the Norwegian, Chinese, Tibetan and HapMap3 populations. The fact that the blue-grey dupl was found only in NO-M214(xM175) individuals supports the model according to which this variant originates as a rearrangement of b2/b3 deletion [46]. The high similarity between c449 and the newly discovered pattern suggest that this novel variant originated from an individual with c449 pattern in which deletion of Y1/Y2 and G amplicons, and a duplication of gr1/gr2 and r1/r2 occurred. Taken together the three CNVs described above, 95% of NO-M214(xM175) assigned individuals contained one of these variants, rendering NO-M214(xM175) to be the most CNV rich haplogroup. The relatively large proportion of individuals containing Y chromosome CNV (13.9%) found when all 1718 individuals were considered, is not largely affected by the results in NO-M214(xM175) individuals, since 12.3% of individuals with CNVs remain upon removal of NO-M214(xM175) individuals from 1506 individuals with assigned HG. Indeed, in this case the difference between duplications (9.4%) and deletions (2.9%) becomes even larger, with more than three times duplications than deletions in this dataset.
Of the remaining seven CNVs significantly overrepresented in specific haplogroups, two of them were almost exclusively present in individuals of one haplogroup each. Indeed, the P3 duplication was found in 18 individuals of haplogroup E-M96 and only once in haplogroup R-M207. Similarly, the "gr/gr dupl + distal dupl" was found in 13 individuals, all of them within haplogroup J-M304. The frequency of these two patterns has not been described previously. The same is true for the remaining overrepresented variants except for the gr/gr del. As in our data set, this variant was previously found significantly enriched in haplogroup D-M174 [48,59].
Unfortunately, due to relatively low amount of informative Y-SNPs on Affymetrix 6.0 array, we limited our phylogenetic analysis to the main haplogroups only. In this case, it´s difficult to make reliable inferences about mutation rates of CNVs on the Y chromosome. Nevertheless, we hope that our observations will work as an implication of which variants might be common within haplogroups included in this study.
Although, the validity of the detected CNVs in the extended data set is not as precise as in the Norwegian population samples, the observation that the same CNV patterns (with minor start and stop region variations) are reoccurring in several males, strengthens the probability that these are in fact real variants (exception patterns which occur only once such as: p-arm dupl (S1A Fig), q-arm del+U3 del (S1B Fig), IR2 del (S1J Fig), Y1Y2 dupl (S1W Fig) and distal gr dupl (S1X Fig). Also, based on the concordance of the in silico detected CNVs in the Norwegian population and the results from the PCR (deletion) and qPCR (duplication) validation experiments, we are convinced that the observed CNVs in the extended data set are to large extent true. Yet we can never be 100% sure unless they all will be confirmed by molecular methods. Deciphering Developmental Disorders study (2015) has estimated the false-positive rates to about 5% and false-negative rates to <20% for technical replicates and custom designed arrays [60]. Based on these rates we estimate that the false-positive rate in our extended dataset should be approximately the similar. To decrease the rates of false discovery, we manually curated all the CNV calls which by some is regarded as "a gold standard", we also manually inspected the remaining Y chromosomes that did not have any CNV calls. We can therefore conclude that this tedious approach helped us to keep the rate of false positive and false negative observations to an absolute minimum.

Conclusions
For many years the analysis of the Y chromosome has been largely neglected in genome-wide studies of multiple phenotypes, because of the idea that this chromosome is poor in gene content and functionally relevant only for sex determination and testicular functions. However, renewed attention in this chromosome is due to possible association between the MSY region and multiple traits or diseases such as several types of cancer, graft versus host disease, gender differences in heart failure, sex reversal, spermatogenesis, male infertility and sex specific effects on the brain and behaviour including for example autism, and speech delay [61]. The high frequency of Y chromosome variants found in our study underlines the importance of including the Y chromosome in genome-wide studies. Not only deletions, but also duplications are of interest, since they can potentially modify the dosage of genes with important functions. Our results also show that haplogroup distribution between cases and controls has to be carefully controlled in order to avoid spurious results due to population stratification. To our knowledge, only one genome-wide study up to date has included the MSY region. This study evaluates the contribution of autosomal and both sex chromosomes in human spermatogenic failure [62]. Although they included large amounts of cases and controls from several populations, only a subset of them were analysed by Affymetrix 6.0 array, while the rest of the samples were hybridized to Illumina 370K or Illumina OmiExpress arrays, none of which contains good coverage of the MSY region (1412 or 1409 probes respectively). Interestingly, even when only a reduced population analysed with Affymetrix arrays was considered, and only individuals of R-M207 haplogroup were included, they detected higher abundance of rare duplications in cases than controls. The specific type of duplication(s) in cases and controls was not discussed in detail. Future genome-wide association studies using SNP/CN platforms with good coverage of Y chromosome have the potential to find the possible functional significance of Y chromosome variants for different phenotypes. Recent efforts have been done to generate arrays with high coverage specifically in the Y chromosome [63,64]. While these arrays will most certainly have impact for Y chromosome related studies, including fertility, genetic anthropology and population genetics, our results shows that even the SNP/CN platforms currently used for genome-wide CNV studies would benefit from the inclusion of the Y chromosome in the analysis.  Figure. Three additional groups of individuals were assigned to internal nodes DE-M145, P-P295 and IJ-M429, even when the arrays contain additional SNPs for determination of haplogroups connected to these nodes. Since not enough SNP information was available for these particular individuals, they were removed from additional analysis.  Table. All datasets included in the study and CNV frequencies within each dataset. The table shows all datasets that are included in this study. All male samples available as well as the samples that could be assigned to a HG and included into the total set of 1506 individuals are displayed in the two first columns. Number of deletions, duplications and both deletion and duplication patterns are displayed for included samples and the all samples together with the corresponding percentage frequencies. Last column displays the overall percentages of CNVs for each study, given for both included and all samples.  Table. Distribution of CNV patterns and their frequency within haplogroups. The table shows the distribution of CNV patterns among haplogroups for 15 variants that did not show overrepresentation in any group. Ambiguous individuals, for which haplotype determination was not possible, are shown in the table for completeness, but they were not included in the statistical analysis. Two individuals exhibited CNV patterns that were not possible to classify into any variant category, and they are therefore listed as "unspecified". (DOCX)