Genome-wide characterization of copy number variations in the host genome in genetic resistance to Marek’s disease using next generation sequencing

Marek’s disease (MD) is a highly neoplastic disease primarily affecting chickens, and remains as a chronic infectious disease that threatens the poultry industry. Copy number variation (CNV) has been examined in many species and is recognized as a major source of genetic variation that directly contributes to phenotypic variation such as resistance to infectious diseases. Two highly inbred chicken lines, 63 (MD-resistant) and 72 (MD-susceptible), as well as their F1 generation and six recombinant congenic strains (RCSs) with varied susceptibility to MD, are considered as ideal models to identify the complex mechanisms of genetic and molecular resistance to MD. In the present study, to unravel the potential genetic mechanisms underlying resistance to MD, we performed a genome-wide CNV detection using next generation sequencing on the inbred chicken lines with the assistance of CNVnator. As a result, a total of 1649 CNV regions (CNVRs) were successfully identified after merging all the nine datasets, of which 90 CNVRs were overlapped across all the chicken lines. Within these shared regions, 1360 harbored genes were identified. In addition, 55 and 44 CNVRs with 62 and 57 harbored genes were specifically identified in line 63 and 72, respectively. Bioinformatics analysis showed that the nearby genes were significantly enriched in 36 GO terms and 6 KEGG pathways including JAK/STAT signaling pathway. Ten CNVRs (nine deletions and one duplication) involved in 10 disease-related genes were selected for validation by using quantitative real-time PCR (qPCR), all of which were successfully confirmed. Finally, qPCR was also used to validate two deletion events in line 72 that were definitely normal in line 63. One high-confidence gene, IRF2 was identified as the most promising candidate gene underlying resistance and susceptibility to MD in view of its function and overlaps with data from previous study. Our findings provide valuable insights for understanding the genetic mechanism of resistance to MD and the identified gene and pathway could be considered as the subject of further functional characterization.


Background
Marek's disease (MD) is a lymphoproliferative disease of chickens caused by a highly oncogenic Gallid alphaherpesvirus 2, a naturally occurring alphaherpesvirus [1], which goes through a complex life cycle of four main phases [2,3]: an early cytolytic phase at 2-7 days post infection (dpi), a latent phase around 7-10 dpi, a late cytolytic phase and finally, a proliferation phase after 28 dpi. MD currently remains a neoplastic disease of chickens of serious concerns to the global poultry industry. The control of MD mainly relies on vaccination; however, the vaccination efficacy is being reduced due to newly emerging strains of Marek's disease virus (MDV) with escalated virulence. Enhancing genetic resistance to MD in poultry is an important long-term goal to better control MD. Therefore, understanding of genetic basis of resistance to MD and improving MD resistance in chickens are of great interest for the poultry industry and animal welfare. In order to optimally implement this control strategy through marker assisted selection (MAS) and to understand the etiology and mechanisms of MD, it is necessary to identify more specific genes and variants with respect to MD resistance.
Genetic variations play crucial roles in phenotypic diversity [4], some of which may underlie major mechanisms that account for variations in disease resistance. The identification of structural variations and potential genetic markers is very important for better understanding of disease resistance, as well as genomic prediction and genetic improvement by selection. There are several types of genetic variations, where copy number variation (CNV) is one of the important types. According to current knowledge, CNV is a type of important genomic structural variation which includes intermediately sized DNA segments that have undergone submicroscopic insertion, deletion, segmental duplication, and complex changes of greater than 1 kilobases (Kb) to several megabases (Mb) in size [5]. It is also a major source of genetic variation underlying phenotypic diversity [6]. Following the first two genome-wide scans for CNVs in human genome [7,8], a large number of CNV detection studies have been performed, which revealed that CNVs are ubiquitously distributed in genome and can influence the phenotype via regulations of gene expression and gene dosage [9][10][11]. Besides, numerous studies in other species have also shown that CNVs contributed to phenotypic variation of complex diseases and traits [12][13][14][15][16][17][18][19], including MD in chicken [20][21][22]. Two major traditional platforms employed in CNV detection are based on SNP chips, one is known as comparative genomic hybridization (CGH) array, and the other is SNP genotyping array. However, due to the limitation in resolution and sensitivity, it is difficult to exhaust small CNVs shorter than 10 kb in length in detection and to identify the precise boundaries of CNVs.
In recent years, a variety of CNV detection approaches based on next-generation sequencing (NGS) were proposed, which offer promising alternatives as they have higher effective resolution and sensitivity to discover CNVs with more types and wider size ranges. Advances in NGS have enhanced the new platform for more detailed characterization of CNVs in genomes [23][24][25].
The primary aim of the present study is to perform a genome-wide CNV analysis by next generation sequencing in the effort to identify CNVs that may confer the difference in genetic resistance to MD between genetic lines. We applied deep sequencing on samples from nine different genetic chicken lines that significantly vary in genetic resistance to MD, including two chicken inbred lines, line 6 3 and line 7 2 , their F1 reciprocal cross, and the recombinant congenic strains (RCS). RCSs were developed using line 6 3 as the progenitor background line and line 7 2 as the progenitor donor line. Eventually, a series of 19 RCSs were generated and each contains a random sample of 87.5% line 6 3 and 12.5% line 7 2 genome. All of the chicken lines shared a common major histocompatibility complex B*2 haplotype, but the MD resistance/susceptibility differs among the RCS [1]. Furthermore, CNVnator [26] was employed to generate a comprehensive map of CNVRs and genes. qPCR was used to validate the detected CNVRs. Our finding provides valuable insights for understanding the genetic mechanism of resistance to MD, and the identified genes and pathways could greatly facilitate further functional characterization studies.

Mapping statistics and CNV detection
In this study, we performed whole genome sequencing in nine different chicken lines ( Fig. 1) Table S1). After quality control, an average of 30.42 to 226.30 million reads of each chicken line were successfully aligned to the reference genome (galGal4) with the mapping levels ranging from 90.04 to 98.10% for all the individuals. The sequencing effective depth varied from an average of 5.95× for six RCSs to an average of 19.84× for lines 6 3 , 7 2 and their F 1 hybrid, and the average coverage with respect to the reference genome was 88.25%. These high quality alignments provide confident for the subsequent analysis with a minimum of false positives.
We then applied the CNVnator software for CNV detection and the average number of CNVs per line, that passed our stringent filtering criteria, was 1888, ranging from 1368 in line RCS-A to 2476 in line RCS-J. The size of these CNVs ranges from 1 Kb to 9.56 Mb, with an average of 95.56 Kb. Detailed statistics of CNV calls are shown in Additional file 2: Table S2. A total of 1649 CNV regions (CNVRs) ( Table 1) allowing for CNV overlaps of 1 bp or greater were obtained across all the chicken lines after merging, covering autosomes 1-28, and sex chromosomes Z and W. The chicken CNV map across the genome is shown in Fig. 2. The length of CNVRs ranged from 1 Kb to 18.19 Mb with an average of 0.36 Mb. In total, 1200 (72.8%) out of all CNVRs had sizes varying from 1 to 200 Kb (Fig. 3a). The count of CNVRs on each chromosome was directly proportional to the chromosome length, and five macrochromosomes (chr1-5) possessed a large proportion (874, 53.0%) of all putative CNVRs. The number of CNVRs in different chicken lines varied greatly, ranging from 536 in RCS-L to 852 in RCS-A. Among all CNVRs, 495 (30.02%) were present in only one chicken line and 90 (5.46%) CNVRs are shared in all the nine chicken lines (Fig. 3b). In addition, the CNVRs belonging to gain and loss account

Gene annotation and functional analysis
The genes harbored in the inferred CNVRs were extracted using custom Perl scripts. As a result, a total of 2588 RefSeq genes within the regions of the 1649 CNVRs were obtained, where a majority of these genes were involved in immunity, tumors and diseases. The identified genes were submitted to DAVID (version 6.8) for GO and pathway enrichment analyses. Using functional annotation clustering, at the highest classification stringency, 145 clusters were formed, where only 9 clusters were chosen after using an enrichment cutoff of > 1.0 (Additional file 3: Table S3). GO terms and KEGG pathways analyses invoked in DAVID yielded 36 significant enriched functional terms (28 terms of biological process, 2 terms of cellular component, and 6 terms of  Fig. 4). In addition, it yielded 6 significant enriched pathways (P < 0.05, Table 2), including the JAK/STAT signaling pathway (gga04630, Additional file 4: Figure S1). The detailed information of all the GO terms and pathways are shown in Additional file 5: Table S4.

PCA analysis and cluster
To investigate genetic structure in nine inbred chicken lines, we performed a principal component analysis (PCA) using the CNV segments by custom R scripts. Nine principal components were calculated and the first two PCs were used in the plot (Fig. 5a). The nine lines were clustered to four approximate groups with the similar patterns, as indicated by dashed circles (Fig. 5a), which were consistent with their susceptibility to MD (Fig. 5b [27]). Lines RCS-A, M and 7 2 were well clustered together with high MD susceptibility. Lines RCS-D, J, L and X were clustered together with high resistance to MD. Interestingly enough, as expected, F 1 individuals with a medium MD resistance were in an intermediary position between line 7 2 and line 6 3 , which provided the theoretical basis to identify imprinting genes for disease resistance.

Shared versus line-specific CNVRs
To investigate how frequently CNVRs were shared or lineage-specific across different lines, we calculated the CNVR numbers among the nine inbred chicken lines (Table 1). In total, 90 CNVRs were detected across all the individuals, which represented the commonly shared CNVRs. A total of 55, 44, 82, 15, 14, 72, 190, 18, and 5 CNVRs were detected as line-specific CNVRs in line 6 3 , 7 2 , F 1 , RCS-A, D, J, L, M, and X, respectively, as compared to other lines (Table 1, Additional file 6: Table S5).
Importantly, the line 6 3 and 7 2 lineage-specific CNVRs could potentially offer certain clues to explore the genetic mechanisms of MD resistance or susceptibility. So, a total of 62 and 57 harbored genes were identified in line 6 3 and 7 2 , respectively, including several immune-, tumor-and disease-related genes, such as interferon regulatory factor 2 (IRF2), suggesting that the CNV in the IRF2 gene is specific to line 7 2 in this study (Additional file 7: Table S6). Interestingly, our lab also found a MD-resistance associated differentially methylated region (DMR, chr4: 38,999,001-39,000,000), which was hypermethylated in line 6 3 compared with line 7 2 , in our previous DNA methylation study. The harbored region also included IRF2, which is involved in immune response IFN alpha/beta signaling pathway. This gene could be a candidate gene associated with MD susceptibility.

CNVRs validation
To confirm the identified CNVRs, 10 CNVRs containing gains (duplications) and losses (deletions) detected here were validated by qPCR using two reference genes  (THRSP and β-actin). We found that all of the 10 CNVRs were confirmed in agreement with the CNVnator results (Fig. 6a) MXD4). For CNVR6, a total deletion was detected in line 7 2 , while line 6 3 had a normal status. For CNVR7, line 7 2 had two third of the normal copy numbers, while line 6 3 also had a normal status (Fig. 6b). Therefore, the copy numbers of these two loci were found significantly lower in line 7 2 as compared to line 6 3 , again supporting our CNV calls and suggesting that they are potentially linked to MD susceptibility.

Comparison with other studies on CNV in chickens
Considering that most of the previous CNV detection studies were based on the galGal2 and galGal3 genome assembly, coordinates of the CNVRs were converted using the UCSC liftOver tool (http://genome.ucsc.edu/ cgi-bin/hgLiftOver). We migrated all chromosomal CNVRs from galGal2 and galGal3 (used in previous studies) to galGal4. We eventually obtained 585 CNVRs in the present study for comparison. Our results were then compared to 12 previous reports on chicken genomic CNV (  [22] results that also involved in MD were validated in our study. Taken together, 42.2% of our CNVRs overlapped with these three MD studies. The detailed information of CNVRs identified in this study and previous studies is provided in Additional file 8: Table S7.

Discussion
MD, a complicated tumor disease, has been used as a model for human tumor study [36]. The genetic mechanism underlying MD is likely to be very complex and remains incompletely understood. Thus, it is important to understand the genetic basis of MD-resistance or MD-susceptibility poultry, which can provide crucial clues for human diseases. In the present study, based on the high throughput sequencing platform, some bioinformatics analyses were conducted to identify CNVs, genes and enriched pathways, taking full advantage of identical genetic background in nine inbred chicken lines. Copy number variations in the chicken genome have been explored by many research groups in the past decade. However, most of the previous studies focused on CNV discoveries using low-density SNP arrays [30,31]. With the development of high-throughput genotyping technology, the NGS data have been used to detect the complex diseases and traits-related CNVs. CNV detections based on NGS data, which has much higher density compared to SNP chip data, have been developed and implemented in different tools [37]. There are four main methods for detecting CNVs with NGS data: Read-Pair (RP), Split Read (SR), Read Depth (RD), and assembly (AS) based methods, including CNVnator [26] used here, Pindel [38], ReadDepth [39], PEMer [40], and some other useful methods. However, each of the methods have different advantages and limitations in their applicability and suitability for NGS data. CNVnator based on RD method was the only software employed in this study. It uses the established mean-shift approach with additional corrections for multiple-bandwidth partitioning and GC correction, for Fig. 6 qPCR validation. a Normalized ratio (NR) obtained by qPCR for 10 CNVRs. The y-axis shows the NR values, and the x-axis shows the CNVR ID. b qPCR validation on two line 7 2 lineage-specific deletion CNVRs. The y-axis shows the NR values, and the x-axis shows the CNVR ID and chicken lines. THRSP and β-actin served as reference genes with no variation. Samples with NR value of approximately 1 denote normal status, samples with NR value of less than 1 denote copy-number-loss status, and samples with value of about 1.5 or more denote copy-number-gain status more accurate CNV detection. Previous approaches using RD were limited to only unique regions of the genome and discovered only large CNVs with poor breakpoint resolution, or could not perform genotyping. CNVnator is able to discover CNVs in a vast range of sizes, from a few hundred bp to several Mb in length, in the whole genome. Therefore, our results here could reveal additional novel genetic variations underlying MD than those revealed by SNP arrays alone.
In the present study, we performed comparisons with the previous CNV studies, especially three researches also involved in MD. We found 247 CNVRs covering 93.9 Mb in length that overlapped with these three MD studies. It is interesting to note that only 7 (1.2%) and 15 (2.6%) CNVRs were shared with Luo et al.'s [20] and Xu et al.'s [22] results using SNP chip data, which may be, in part, related to limited sample sizes, different platforms, different analysis methods, and different chicken genome references (although we converted the genome positions from two previous genome assemblies (galGal2 and galGal3) to a newer one (galGal4) with the help of LiftOver based on UCSC, some information may still be missing). More importantly, we also used different chicken lines, especially the selected RCSs from a total of 19 RCSs (Fig. 1). We also compared with the CNV identified in lines 6 3 and 7 2 using NGS data for MD study [21], and found that 228 (39.0%) CNVRs overlapped, which provided more effective information for our study. Moreover, our study explored the genetic structure based on CNV in different inbred chicken lines. The PCA showed clearly that the first two PCs can divide all chickens into four unique groups, which is similar to the results of Xu et al. [22]. Therefore, our study further confirms that CNV markers can be used to study the genetic variability in diverse chicken lines, which could possibly contribute to lineage-specific phenotypes.
The genetic mechanism underlying MD is likely to be very complex and is not clear yet. It may be determined by some specific structural variations but not a single gene or a SNP mutation, though several candidate genes have been reported in previous studies described above. In the current study, we investigated CNVs among diverse inbred lines and found 55 and 44 unique CNVRs in lines 6 3 and 7 2 , respectively, which could be associated with MD. Notably, we successfully identified a CNVR, which was a deletion and a normal copy number in all individuals from line 7 2 and line 6 3 , respectively, including a nearby gene IRF2. Fortunately, the IRF2 gene was also highlighted in our previous DNA methylation study involved in a critical DMR identified by methyl-CpG binding domain protein enriched genome sequencing (MBD-seq) with a false discovery rate (FDR) < 0.1 and validated by bisulfite cloning sequencing, which was hypermethylated in line 6 3 compared with line 7 2 . The region of the DMR identified in previous study and the CNVR identified here was almost completely the same, with the same start site and a 1000 bp overlaps. The nearby gene IRF2 is a disease-and virus-related gene involved in the interferon gamma signaling pathway and the immune response IFN alpha/beta signaling pathway [41]. This gene is conserved in human and some other species like chimpanzee, Rhesus monkey, dog, cow, mouse, rat, zebrafish, and frog. Thus, some mutations or structural variations of this gene could be key factors related to disorders or diseases. It was reported that IRF2 gene was associated with several diseases in chickens like necrotic enteritis [42], pancreatic cancer [43], and atopic dermatitis and eczema herpeticum [44]. More interestingly, this gene can specifically bind to the upstream regulatory region of type I IFN and IFN-inducible MHC class I genes, which could be an important clue to explore the genetic mechanisms of MD resistance because, to our knowledge, MHC plays an important role in the determination of resistance to MD [1]. Therefore, IRF2 may be a very important gene related to MD according to the structural variation identified here, its known functions and our former studies. Another useful information obtained in this study is the JAK/STAT signaling pathway, which was also considered as a potential pathway responding to MDV infection reported by Perumbakkam et al. [45]. The JAK/STAT signaling pathway is one of a handful of pleiotropic cascades used to transduce a multitude of signals for development and homeostasis in animals [46]. JAK activation stimulates cell proliferation, differentiation, cell migration and apoptosis. These cellular events are critical to immune development and some other processes. Importantly, mutations that constitutively activate or fail to regulate JAK signaling properly cause inflammatory disease, including several chicken diseases [47,48]. Additionally, a previous study reported that IRF2 can regulate macrophage apoptosis through a STAT1/3 [49], which provides valuable and potential interaction of IRF2 and JAK/STAT pathway and might jointly contribute to the genetic resistance to MD. Therefore, we hypothesize a probable mechanism of complex disease: the deletions in CNV could be associated with different epigenetic effects, which further regulate an interacting pathway leading to occurrence of diseases.

Conclusions
In summary, we investigated copy number variations in inbred chicken lines using next generation sequencing.
We have successfully identified a number of line-specific CNVRs, as well as revealed genes and pathways that may be involved in genetic resistance to MD. Combined with our previous study and due to the complexity of MD, we ultimately found a high-confidence candidate gene IRF2, and an immune-or disease-related pathway, JAK/STAT signaling pathway, which could jointly play potentially important roles in response to MD resistance. Overall, our findings in the present study will provide valuable insights for understanding the genetic mechanism of resistance to MD and will be worthy of further functional characterization.

Experimental population
A total of 26 female chickens without treatments were used for blood collection in this study, including three chickens from each of the line 6 3 (MD-resistant), line 7 2 (MD-susceptible) and six recombinant congenic strains (RCSs, RCS-A, D, J, L, M, and X), and two chickens from reciprocal cross F 1 hybrid 6 3 × 7 2 (USDA-ARS ADOL, East Lansing, Michigan, USA) [50]. RCSs were developed using line 6 3 as the parental strain mated to line 7 2 and then backcrossed to line 6 3 twice followed by full-sib mating for more than 20 generations (Fig. 1). Eventually, diverse RCSs were generated and they contain 87.5% of line 6 3 and 12.5% of line 7 2 in the genetic background but differ in MD resistance/susceptibility [51]. All the experimental chickens were anesthetized with sodium pentobarbital (intraperitoneal injection: 150 mg/kg) at the end of this study.

Library construction and sequencing
Blood samples were collected from the brachial vein by venipuncture. Genomic DNA (gDNA) from blood samples was extracted using the DNeasy Blood & Tissue Mini Kit (Qiagen, USA) according to the manufacturer's instructions. The purity and concentration of the gDNA samples were measured by NanoDrop ND-1000 spectrophotometer (Thermo Scientific, USA) and by agarose gel electrophoresis. After the examinations, paired-end libraries were generated for each eligible sample using standard procedures (Illumina, USA). The average insert size was 500 bp, and the average read length was 100 bp for line 6 3 and line 7 2 , and 150 bp for the remaining chicken lines. All libraries were sequenced on an Illu-mina® HiSeq 2000 sequencing platform to an average raw read sequence coverage of × 20 for lines 6 3 , 7 2 and their F 1 hybrid, and × 6 for the six RCSs, respectively. The depth ensured the accuracy and reduced the falsenegative rate of CNV calling for downstream analysis. Library preparation and all Illumina runs were performed as the standard manufacturer's protocols.

Read alignment and CNV calling
Chicken genome assembly (galGal4) was retrieved from the UCSC Genome Browser website (http://hgdownload. soe.ucsc.edu/goldenPath/galGal4/bigZips/) [52]. In order to minimize the mapping errors, quality control was performed by FastQC [53] and low quality reads were removed with the help of FastX  [56] to enhance the alignments in regions of indel polymorphisms, which can greatly improve the sensitivity and specificity in CNV calling [57]. After mapping, CNV calling was performed using CNVnator (v0.3.3) software [26] based on read depth (RD) method to predict genomic CNVs between the nine chicken lines and the reference. CNVnator firstly calculated the counts of mapped reads within user specified non-overlapping bins of equal size as the RD signal, and then adjusted the signal in consideration of the potential correlation between RD signal and GC content of the underlying genomic sequence. The mean-shift algorithm was employed to segment the signal with presumably different underlying CN. Then CNVs were predicted by applying statistical significance tests to the segments. We then ran CNVnator with a bin size of 100 bp for our data. CNV calls were filtered using stringent criteria including P-value < 0.05 and minimum size > 1 Kb, and calls with > 50% of q0 (zero mapping quality) reads within the CNV regions were removed (q0 filter). All CNV calls overlapping with gaps in the reference genome were excluded from consideration. CNVs located on random contigs (chrN_random), unlocalized chromosomes (chrUn), or in overlapping gaps were discarded for further analysis due to the shorter length of the chrUn contigs and mapping ambiguity of chrUn sequence reads. In order to compare our results with previous studies, we converted all chromosomal CNVRs from galGal2 and galGal3 (used in previous studies) to galGal4 with the assistance of LiftOver based on UCSC (http://genome. ucsc.edu/cgi-bin/hgLiftOver).

Gene detection and functional analysis
Results from CNVnator were combined to obtain a collective set of unique CNVs with different start or end coordinates. These CNVs were then merged into nonoverlapping CNV regions (CNVRs) by aggregating CNVs that overlap by at least 1 bp across all samples of each chicken line. The Ensembl genes (release 85 Database) were obtained using BioMart software based on the chicken gene sequence assembly (galGal4) and the genes harbored in the inferred CNVRs were extracted using custom Perl scripts. The Database for Annotation, Visualization, and Integrated Discovery (DAVID, version 6.8) (https://david.ncifcrf.gov/) [58] was used to perform the gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis.

Validation of CNVRs by quantitative real-time PCR (qPCR)
To experimentally validate the detected CNV calls by CNVnator, we performed qPCR confirmation of ten CNVRs randomly selected from line 6 3 and line 7 2 , respectively, using gDNA samples from different chicken lines. All the primers were designed based on the GenBank reference sequences using the Primer 3.0 webtool (http://frodo.wi. mit.edu/primer3/) (Additional file 9: Table S8). The β-actin gene and thyroid hormone responsive (THRSP) gene served as reference genes. For each chicken line, at least three individuals were used to do the validation. qPCR using SYBR Green PCR Kit was performed in triplicate based on iCycler iQ PCR System (Bio-Rad). qPCR program was run as follows: pre-incubation (95°C for 10 min), 40 cycles of amplification (95°C for 10 s, 60°C for 10 s, and 72°C for 10 s), melting curves using a heat ramp and cool down. Cycle threshold values (Ct values) were obtained from iCycler iQ PCR software. The 2 -ΔΔCT method was used to calculate the copy number [59][60][61]. The corresponding equation was: where CT is the threshold cycle, sample A is the tested individual, and sample B is the control individual with single copy or no variation in copy number. Samples with Normal Ratio (NR) about 1 denote normal individuals (two copies), samples with NR of less than 1 denote one copy loss individuals, and samples with NR about 1.5 or more denote copy number gain individuals [32,33].