Genome-wide copy number variant data for inflammatory bowel disease in a Caucasian population

Genome-wide copy-number association studies offer new opportunities to identify the mechanisms underlying complex diseases, including chronic inflammatory and psychiatric disorders. We used genotyping microarrays to analyse the copy-number variants (CNVs) of 243 Caucasian individuals with inflammatory bowel disease (IBD). The CNV data were obtained by applying multiple quality control measures and merging the results of three different CNV detection algorithms: PennCNV, iPattern, and QuantiSNP. The final dataset contains 4,402 high-confidence CNVs, each detected independently by two or three algorithms. This paper provides a detailed description of the data generation and quality control steps. For further interpretation of the data presented in this article, please see the research article entitled ‘Copy number variation-based gene set analysis reveals cytokine signalling pathways associated with psychiatric comorbidity in patients with inflammatory bowel disease’.

© 2019 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Value of the data
The IBD CNV data set provides a valuable resource for identifying potential causal genes and drug targets for IBD. It can be used as a baseline to compare and analyze the CNVs identified in other populations. These data will be useful to researchers investigating the contribution of CNVs to IBD and its subtypes.

Data
This report describes the CNVs identified in 243 IBD patients of Caucasian ethnicity enrolled in the Manitoba IBD Cohort Study [1]. We genotyped 269 individuals with IBD using the Illumina Omni2.5-8 microarray. After sample quality control and population stratification analysis, we initially selected 246 IBD patients of Caucasian ethnicity. Three different CNV detection algorithms were applied to analyze the data: PennCNV [2], iPattern [3], and QuantiSNP [4]. The detected CNVs were filtered under stringent quality control criteria for size, probe content, and algorithm-specific quality score. The quality control workflow is presented in Fig. 1. The quality control criteria and the corresponding numbers of disqualified samples are presented in Table 1. To obtain high-confidence calls, we removed the CNVs detected by only one of the three algorithms, while the CNVs detected by two or three algorithms were merged by retaining the outer boundary [5]. The numbers of CNVs detected by the different algorithms are presented in Fig. 2. Examples of merging the results obtained by the three algorithms are presented in Fig. 3. Three IBD samples with an extremely large number of detected CNVs were removed, which left 243 IBD samples for further analysis. Of the remaining data, CNVs with significant overlap with repeat-rich regions, such as centromeres and telomeres, segmental duplications, and immunoglobulin regions, were excluded [2,6]. Table 2 presents the numbers of CNVs detected by each algorithm, the corresponding numbers of disqualified CNVs, the CNVs qualified for merging, and the CNVs removed due to overlap with the repeat-rich regions.
After quality control and filtering, 4,402 stringent CNVs remained for analysis; of those, 2,872 were deletions and 1,530 were duplications. The chromosomal distribution of the stringent CNVs is presented in Table 3; the same data are shown graphically in Fig. 4.

Study population
Individuals were enrolled in the Manitoba IBD Cohort Study, a population-based longitudinal study of patients with IBD [7,8]. At enrolment in the Cohort Study, participants were at least 18 years of age, with a median disease duration of 4.3 years and a maximal disease duration of 7 years. Participants were identified and recruited from a population-based registry, the University of Manitoba IBD Research Registry. The diagnosis of IBD was determined based on surgical, endoscopic, and histologic data. At the time of the cohort study recruitment, there were 3192 participants in the research registry. The Manitoba IBD Cohort Study was approved by the University of Manitoba Health Research Ethics Board, and participants provided written informed consent. Blood samples for genotyping were obtained from a total of 269 IBD patients.

Genotyping
Blood samples acquired from the 269 IBD patients in the cohort were genotyped using the Illumina Infinium Omni2.5-8 microarray at The Centre for Applied Genomics (TCAG) in Toronto. Rigorous quality control (QC) procedures were performed on the resulting data. The Illumina Infinium Omni2.5-8 microarray contains a total of 2,372,784 markers for SNP and CNV analyses. Samples were processed using the manufacturer's recommended protocol; BeadChips were scanned on the Illumina BeadArray Reader using default settings. Analysis and intra-chip normalization were performed using Illumina's GenomeStudio software v.2011.1. Probe reclustering was conducted in GenomeStudio using the project-specific samples to produce a custom cluster file, which was applied to generate Log R ratios (LRR) and B allele frequencies (BAF) for CNV detection.

Intensity quality control for CNV detection
Quality control (QC) for SNPs was performed at the individual SNP level (Table 1). Samples were excluded from the analysis if they had: i) an array call rate <95%; or ii) a standard deviation (SD) of LRR or BAF values outside the mean ± 3 SD of the SDs within an analysis batch. Closely related samples (by identity-by-descent distance for each pair of individuals), duplicates, and samples with gender mismatches (by X chromosome homozygosity rate) or a Mendelian error rate >1% were also excluded from the analysis batch.
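The two sample-level intensity criteria above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used to produce the data: the function name `flag_samples`, the dictionary keys `call_rate`, `lrr_sd` and `baf_sd`, and the assumption that all samples passed in belong to one analysis batch are hypothetical choices made for the example.

```python
from statistics import mean, stdev

def flag_samples(samples, min_call_rate=0.95, n_sd=3.0):
    """Return the IDs of samples that fail the intensity QC criteria.

    `samples` maps a sample ID to a dict with the (hypothetical) keys
    'call_rate', 'lrr_sd' and 'baf_sd'; all samples are assumed to come
    from a single analysis batch.
    """
    failed = set()

    # Criterion ii: per-sample SD of LRR or BAF outside the batch
    # mean +/- n_sd * SD of those per-sample SDs.
    for metric in ("lrr_sd", "baf_sd"):
        values = [s[metric] for s in samples.values()]
        mu, sd = mean(values), stdev(values)
        lower, upper = mu - n_sd * sd, mu + n_sd * sd
        for sample_id, s in samples.items():
            if not lower <= s[metric] <= upper:
                failed.add(sample_id)

    # Criterion i: array call rate below the threshold (95% in the text).
    for sample_id, s in samples.items():
        if s["call_rate"] < min_call_rate:
            failed.add(sample_id)

    return failed
```

Note that the batch mean and SD here include the outlying samples themselves, so in very small batches a single noisy sample cannot exceed the 3 SD limit; whether the original analysis recomputed the limits iteratively is not stated in the text.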

Population stratification analysis
The reference population for population stratification analysis using multidimensional scaling (MDS) was obtained from Phase 3 data of 1000 Genomes Project [9]. Population stratification was

CNV calling algorithms
For comprehensive detection of CNVs in the IBD patients, we ran three CNV calling algorithms: PennCNV [2], iPattern [3], and QuantiSNP [4]. The data required for CNV analysis, namely within-sample normalized fluorescence (X and Y normalized values), between-sample normalized fluorescence (Log R ratio (LRR) and B allele frequency (BAF) values), and genotypes for each sample, were exported directly from Illumina's GenomeStudio software. Only autosomal probes were used in the CNV analysis. In total, 10539, 13789 and 29607 CNVs were detected by PennCNV, iPattern and QuantiSNP, respectively (Table 2). We excluded CNVs that failed the following quality control criteria: <5 probes, <5000 bp in length, and low algorithm-specific confidence score (PennCNV confidence score < 15, QuantiSNP Log Bayes Factor < 10 or iPattern score < 1). After this filtering, 6174, 7190 and 8245 CNVs were retained as high-quality CNV calls for PennCNV, iPattern and QuantiSNP, respectively. The algorithms performed differently in calling CNVs of different sizes: PennCNV was the most conservative in detecting CNVs, while QuantiSNP was the least conservative except for large (>200 kbp) CNVs (Fig. 2).
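The per-call quality filter can be illustrated with a short Python sketch. The thresholds are those given in the text; everything else is an assumption made for the example: the `Call` record and its field names are hypothetical, coordinates are treated as inclusive, and a call is dropped as soon as it fails any one of the three criteria, which the text does not state explicitly.

```python
from dataclasses import dataclass

# Minimum algorithm-specific confidence scores, as given in the text.
MIN_SCORE = {"PennCNV": 15.0, "QuantiSNP": 10.0, "iPattern": 1.0}

@dataclass(frozen=True)
class Call:
    """One CNV call (hypothetical record layout for this example)."""
    sample: str
    chrom: str
    start: int       # first base, inclusive
    end: int         # last base, inclusive
    cn_type: str     # "del" or "dup"
    algorithm: str   # key of MIN_SCORE
    n_probes: int
    score: float     # algorithm-specific confidence score

def passes_qc(call, min_probes=5, min_length=5000):
    """True if the call meets the probe, length and score thresholds."""
    length = call.end - call.start + 1
    return (
        call.n_probes >= min_probes
        and length >= min_length
        and call.score >= MIN_SCORE[call.algorithm]
    )

def filter_calls(calls):
    """Keep only the calls that pass all three criteria."""
    return [c for c in calls if passes_qc(c)]
```

Because each algorithm reports its confidence on a different scale, the score threshold is looked up per algorithm rather than applied as a single cut-off.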

CNV merging
To obtain stringent CNV calls, we merged high-quality CNVs detected by at least two of the three algorithms using outer probe boundaries (Fig. 3). All CNVs detected by only one algorithm were excluded from further analysis. As an additional sample QC step, we excluded three samples with an excessive number of stringent CNVs, defined as more than 145 CNVs (the mean number of CNVs plus 3 SD). After CNV merging, 5826 CNVs were considered stringent. Of these, 2173 were duplications and 3653 were deletions (see Table 2).
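A minimal Python sketch of consensus merging with outer boundaries is given below. It rests on assumptions not specified in the text: calls are grouped by sample, chromosome and CNV type; any overlap of at least one base is enough to join two calls; and chains of overlapping calls are merged transitively. The tuple layout and the function name `merge_calls` are hypothetical.

```python
from collections import defaultdict

def merge_calls(calls, min_algorithms=2):
    """Merge overlapping calls and keep regions seen by >= min_algorithms.

    `calls` is an iterable of tuples
    (sample, chrom, cn_type, start, end, algorithm) with inclusive
    coordinates. Returns tuples
    (sample, chrom, cn_type, start, end, n_algorithms), where start and
    end are the outermost boundaries of the merged calls.
    """
    groups = defaultdict(list)
    for sample, chrom, cn_type, start, end, algorithm in calls:
        groups[(sample, chrom, cn_type)].append((start, end, algorithm))

    merged = []
    for key, intervals in groups.items():
        intervals.sort()
        cur_start, cur_end, algos = None, None, set()
        for start, end, algorithm in intervals:
            if cur_start is not None and start <= cur_end:
                # Overlaps the current cluster: extend the outer boundary.
                cur_end = max(cur_end, end)
                algos.add(algorithm)
            else:
                # Close the previous cluster and start a new one.
                if cur_start is not None and len(algos) >= min_algorithms:
                    merged.append(key + (cur_start, cur_end, len(algos)))
                cur_start, cur_end, algos = start, end, {algorithm}
        if cur_start is not None and len(algos) >= min_algorithms:
            merged.append(key + (cur_start, cur_end, len(algos)))
    return merged
```

Counting distinct algorithm names, rather than the number of calls in a cluster, ensures that a region fragmented into several calls by a single algorithm is not mistaken for a consensus call.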

CNV filtering
We further excluded CNVs that: 1) overlapped a centromere (including the 100 kbp regions before and after it) or a telomere (100 kbp from each end of the chromosome); 2) had >70% of their length overlapping a segmental duplication, using the entire segmental duplication dataset downloaded from the University of California, Santa Cruz (UCSC) Genome Browser website [11]; or 3) had >70% overlap with an immunoglobulin region (susceptible to somatic changes) [2,6].
The final CNV data set includes a total of 4402 CNVs, of which 2872 were deletions and 1530 were duplications (Fig. 4). The final list of stringent CNVs is available in the dbVar database at NCBI: nstd157. Of all CNVs, 58.3% were smaller than 20 kbp, while 4.2% covered more than 200 kbp. The chromosomal distribution of the stringent deletions and duplications in three size categories (less than 20 kbp, 20–200 kbp and more than 200 kbp) is presented in Table 3.
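The 70% overlap criterion can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool used by the authors: the function names are hypothetical, coordinates are half-open, all regions are assumed to lie on the same chromosome as the CNV, and overlapping annotation regions are counted only once.

```python
def covered_fraction(start, end, regions):
    """Fraction of the CNV [start, end) covered by the union of `regions`.

    `regions` is an iterable of (region_start, region_end) half-open
    intervals on the same chromosome as the CNV.
    """
    length = end - start
    if length <= 0:
        raise ValueError("CNV must have positive length")
    covered = 0
    cursor = start  # everything left of cursor has already been counted
    for region_start, region_end in sorted(regions):
        lo = max(region_start, cursor)
        hi = min(region_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / length

def keep_cnv(start, end, regions, max_fraction=0.70):
    """False if more than max_fraction of the CNV lies in `regions`."""
    return covered_fraction(start, end, regions) <= max_fraction
```

For example, a CNV spanning positions 1000 to 2000 that overlaps annotated regions (900, 1500) and (1400, 1800) is 80% covered and would be excluded under the >70% rule.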
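A tally by chromosome, CNV type and size category can be sketched in Python as below. The three size categories come from the text; the handling of the exact boundaries (20 kbp and 200 kbp fall in the middle category), the inclusive coordinates, and the function names are assumptions made for this example.

```python
from collections import Counter

def size_category(length_bp):
    """Assign a CNV length in bp to one of the three size categories."""
    if length_bp < 20_000:
        return "<20 kbp"
    if length_bp <= 200_000:
        return "20-200 kbp"
    return ">200 kbp"

def tally(cnvs):
    """Count CNVs per (chromosome, type, size category).

    `cnvs` is an iterable of (chrom, cn_type, start, end) tuples with
    inclusive coordinates.
    """
    counts = Counter()
    for chrom, cn_type, start, end in cnvs:
        counts[(chrom, cn_type, size_category(end - start + 1))] += 1
    return counts
```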