Accurate characterization of the IFITM locus using MiSeq and PacBio sequencing shows genetic variation in Galliformes

Interferon inducible transmembrane (IFITM) proteins are effectors of the immune system widely characterized for their role in restricting infection by diverse enveloped and non-enveloped viruses. The chicken IFITM (chIFITM) genes are clustered on chromosome 5 and, to date, four genes have been annotated, namely chIFITM1, chIFITM3, chIFITM5 and chIFITM10. However, due to poor assembly of this locus in the Gallus gallus v4 genome, accurate characterization has so far proven problematic. Recently, a new chicken reference genome assembly, Gallus gallus v5, was generated using Sanger, 454, Illumina and PacBio sequencing technologies, identifying considerable differences in the chIFITM locus relative to the previous genome releases. We re-sequenced the locus using both Illumina MiSeq and PacBio RS II sequencing technologies and we mapped RNA-seq data from the European Nucleotide Archive (ENA) to this finalized chIFITM locus. Using SureSelect capture probes designed to the finalized chIFITM locus, we sequenced the locus of a different chicken breed, namely a White Leghorn, and a turkey. We confirmed the Gallus gallus v5 consensus except for two insertions of 5 and 1 base pair within the chIFITM3 and B4GALNT4 genes, respectively, and a single base pair deletion within the B4GALNT4 gene. The pull down revealed a single amino acid substitution of A63V in the CIL domain of IFITM2 compared to Red Jungle Fowl and 13, 13 and 11 differences between IFITM1, 2 and 3 of chickens and turkeys, respectively. RNA-seq shows chIFITM2 and chIFITM3 expression in numerous tissue types of different chicken breeds and avian cell lines, while the expression of the putative chIFITM1 is limited to the testis, caecum and ileum tissues. Locus resequencing using these capture probes and RNA-seq based expression analysis will allow the further characterization of genetic diversity within Galliformes.


Background
Poultry accounts for almost half of all meat consumed in the UK, with 875 million chickens, 17 million turkeys, 16 million ducks and 250,000 geese a year supplied by over 2500 poultry farms [1]. Their production can be adversely affected by infection with avian specific viruses such as infectious bursal disease virus (IBDV), infectious bronchitis virus (IBV) and Newcastle disease virus (NDV) [2][3][4][5][6][7]. Poultry can also serve as the source of zoonotic, or potentially zoonotic, infections with viruses such as H5N1 and H7N9, transmitted to humans through contact with poultry. To reduce the threat to the global food supply and to minimize the risk of zoonotic events, there is an ongoing need to better understand the biology of avian viral infections, the mechanisms of natural resistance (viral intrinsic and innate immunity) and the biological factors that might be involved.
As IFN-stimulated genes (ISGs), IFITM abundance within a cell increases following activation of the type I IFN signaling pathway in response to the detection of pathogen-associated molecular patterns (PAMPs) such as viral nucleic acid in the cytoplasm of the infected cell. In addition, binding of IFNα/β to their cell surface receptors induces translocation of the transcription factor complex IFN-stimulated gene factor 3 (ISGF3) into the nucleus [24]. This induces the transcription of several ISGs, among which are the IFITM genes. The IFITM proteins target the final stages of viral entry by preventing fusion of the viral and cellular membranes [25]. This mechanism also reflects the localization of human IFITM2 and IFITM3, which are found predominantly in intracellular membrane compartments such as late endosomes and lysosomes [21]. It is suggested that the membrane-defined site of fusion, namely plasma membrane and endosomes, is critical for the antiviral activity of these proteins [15].
While the genetics and cell biology of the human IFITMs have been extensively characterized, the lack of an accurate and complete reference genome sequence has hampered progress in characterizing the locus in diverse vertebrates, including birds. The genetic structure of the chicken locus was proposed based on the human locus; however, critical differences suggest that the current chIFITM nomenclature might be incorrect [23]. Indeed, the relative intracellular localizations of chIFITM1 and 2 as defined by genome synteny are the opposite of their human counterparts. This prompted Smith et al. to suggest that an inversion might have occurred within the locus [23]. Subsequently, it was shown that duck IFITM1 localizes to the plasma membrane, like human IFITM1, highlighting further classification difficulty in avian IFITMs [16]. In addition, in an effort to explain the conserved antiviral activities of the different human IFITM genes, Compton et al. have recently suggested that these differences, which reflect their localization and abundance in a cell, are a sign of duplication and mutational events of the IFITM genes that arose millions of years ago [26]. Although their study focused on the evolution of the IFITM genes in various non-human primates, it underlines the need to reconsider how avian IFITM genes are named, as their nomenclature does not necessarily reflect their human orthologues. In this scenario, while ancestral IFITM3 is clearly syntenic with hIFITM3, more studies are required to elucidate the relationship between the other two IFITM proteins.
The most recent version of the chicken genome (v5) has incorporated long PacBio sequencing reads. This new sequencing has improved the chicken genome, including the IFITM locus. However, when sequencing an entire genome and performing whole genome assembly, minor assembly errors can occur, often due to lack of coverage or because paralogous sequences at other loci compromise accurate assembly. The IFITM gene family is one of the most paralogous families known with multiple copies of both IFITM genes and pseudogenes. For this reason, we sequenced just a small region of chromosome 5 containing the IFITM locus at high coverage with PacBio and with Illumina MiSeq.
The average PacBio read length is >10 kb, depending only on the activity of the polymerase [27][28][29], and although PacBio raw reads have a higher error rate than other technologies (14% versus 0.1 to 1% for Illumina), a high quality consensus sequence can be obtained from overlapping reads. To complement the new Gallus gallus reference, we have focused solely on the chIFITM locus and better elucidated its genetic structure by sequencing a bacterial artificial chromosome (BAC) from the BAC library used to generate the original Gallus gallus genome. The 203 kb BAC (CH261-109H20 [30]), containing the chIFITM locus, does not include chIFITM10. Given that the current literature focuses mainly on the antiviral activity of chIFITM1, 2, 3 and 5, we present a high confidence, high coverage sequence of this locus and the expression of these 4 genes, obtained by mapping publicly available RNA-seq data, to define each of the chIFITM proteins at the transcriptional level. Further, we describe the design of hybrid capture (SureSelect) probes and their use in genome capture and sequencing of other Galliform IFITM loci.
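The statement that error-prone reads can still yield a high quality consensus can be illustrated with a small calculation. The sketch below is not the consensus procedure implemented in HGAP; it is a deliberately simple majority-vote model that assumes errors are independent between reads and, pessimistically, that all erroneous reads agree on the same wrong base. The 14% figure is the raw-read error rate quoted above; the depths are arbitrary illustrative values, chosen odd to avoid ties.

```python
from math import comb


def majority_error(per_read_error, depth):
    """Probability that a strict majority of `depth` reads is wrong at one base.

    Assumptions (illustrative, not taken from the paper): read errors are
    independent, and every erroneous read reports the same wrong base.
    """
    if not 0.0 <= per_read_error <= 1.0:
        raise ValueError("per_read_error must lie between 0 and 1")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    threshold = depth // 2 + 1  # smallest number of wrong reads forming a majority
    return sum(
        comb(depth, k) * per_read_error**k * (1.0 - per_read_error) ** (depth - k)
        for k in range(threshold, depth + 1)
    )


if __name__ == "__main__":
    # Odd depths only, so that a strict majority always exists.
    for depth in (1, 5, 15, 31):
        print(f"depth {depth:>2}: consensus error ~ {majority_error(0.14, depth):.2e}")
```

Under these assumptions the per-base consensus error falls quickly as depth increases, which is the intuition behind sequencing a small region at high coverage.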

De novo assembly of PacBio and MiSeq sequencing reads
In order to obtain a consensus reference sequence from the raw sequencing data, PacBio reads derived from the BAC clone sequencing were quality filtered and de novo assembled with HGAP using the protocols available on the SMRT portal (Additional file 1). Summaries of assembly and mapping statistics for PacBio (and also Illumina, see below) reads are shown in Table 1. Because of the length of the PacBio reads, the PacBio de novo assembly consisted of 6 assembled fragments (compared to 13 with Illumina). Of these, one contig (number 2, Table 2) contained the chicken genome sequence; the others represented genomic sequences from the E. coli BAC vector (Additional file 2). Contig 2, containing chicken sequences, had the highest base coverage and its length suggested it represented the full-length BAC clone. Therefore, to confirm the identity of this de novo assembled fragment, we utilized ACT and sequence similarity plots to compare contig 2 with the chromosome 5 reference sequence from both Gallus gallus v4 and Gallus gallus v5 (Fig. 1a-c). Contig 2 contained the full chIFITM locus and highlights the substantial deficiencies in the Gallus gallus v4 genome assembly (Fig. 1a and b). This contrasts with the Gallus gallus v5 genome assembly, where fewer large gaps are observed, but with the presence of a small INDEL (Fig. 1c-d). Inspection of the similarity plot shows that these differences observed at the nucleotide level fall in the genomic region of the chIFITM3 gene, within the intronic region (Fig. 1c, bottom dot plot, and sequence alignment in Fig. 1d). To further analyse this region we have also screened the full locus for repeats and low complexity DNA sequences, as shown in Additional file 3. We attempted de novo assembly of Illumina MiSeq paired-end reads using three software packages (namely IVA, SGA and HGAP), resulting in only partial consensus sequence covering between 50 and 70% of the full chIFITM locus (including the flanking genes ATHL1 and B4GALNT4) (data not shown). The best assembly was generated using IVA, which produced the fewest contigs (13). In order to identify Illumina contigs that contained the BAC, and specifically the chIFITM locus, sequence similarity was used to compare the Illumina MiSeq contigs with the PacBio contig 2 (Additional file 4). All of the Illumina MiSeq contigs covered either portions of the PacBio contig 2, or just the chIFITM locus. These results suggest that while the longer PacBio reads map well to the reference genome (Additional file 5), Illumina MiSeq raw reads on their own are not sufficient to assemble this region de novo, although they do map accurately to the de novo PacBio reference.
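A figure such as "between 50 and 70% of the locus" is the fraction of reference positions covered by at least one contig. The following is a minimal sketch of that calculation, assuming the contig placements are already known as half-open (start, end) intervals in reference coordinates. The function name and the toy coordinates are hypothetical; the text does not state how the percentage was derived.

```python
def covered_fraction(intervals, locus_start, locus_end):
    """Fraction of [locus_start, locus_end) covered by at least one interval.

    `intervals` holds half-open (start, end) contig placements on the
    reference. Overlapping contigs are merged so no base is counted twice.
    """
    if locus_end <= locus_start:
        raise ValueError("locus_end must be greater than locus_start")
    # Clip every contig to the locus and drop those lying entirely outside it.
    clipped = sorted(
        (max(start, locus_start), min(end, locus_end))
        for start, end in intervals
        if end > locus_start and start < locus_end
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous merged block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the current merged block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (locus_end - locus_start)


if __name__ == "__main__":
    # Toy placements on a 50-base locus (hypothetical values).
    toy_contigs = [(0, 10), (5, 20), (30, 40)]
    print(covered_fraction(toy_contigs, 0, 50))
```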
Fig. 1 (partial caption) The region enlarged in the right dot plot shows a stretch of the genomic region within the intronic region of the chIFITM3 gene which shows differences with chicken genome assembly v5. d Clustal Omega alignment of the PacBio contig 2 consensus sequences and the chicken genome v5 (portion of the IFITM3 gene corresponding to the gap seen in 2C). The gap is highlighted in yellow

Organization of the chIFITM locus in PacBio contig 2 and Gallus gallus v4 and v5 reference sequences
To study in more detail the gene order of the v4 and v5 assembled locus relative to our assembly we used Artemis. Concentrating on the chIFITM genes, we show that combined reads from both sequencing technologies mapped well to v4 or v5 assemblies, covering the locus to significant depths and aligning to all the regions of interest (Fig. 2 and Additional file 5A-D). The deep and accurate sequence of the chIFITM locus allows us to be confident that the chIFITM1 and 2 genes as named and annotated in the v5 genome are indeed inverted in comparison to the human locus, with chIFITM1, 2, 3 and 5 genes having their transcriptional units in the same direction (Table 3) [23].

SureSelect probes design and pull down of the IFITM locus from turkey breast tissue and DF1 cells
The consensus sequence we have generated was used to design Agilent SureSelect probes covering the 40 kb region encompassing the IFITM locus. Our primary purpose is to use these probes to study possible IFITM variants in different chicken breeds and further into the phylogeny of Galliformes. We were able to successfully pull down the IFITM locus in DF1 cells (chicken embryonic fibroblasts) as well as turkey breast tissue (Fig. 3), showing we are able to use chicken (Phasianinae, sub-family of Galliformes) IFITM probes to pull down and sequence the locus in a different Galliform sub-family, namely the Meleagridinae, to which the turkey belongs. The BAC clone, like Gallus gallus v5 of the chicken genome, is from a Red Jungle Fowl, inbred line UCD001 (Inbred 256, female), while the DF1 cells are derived from a White Leghorn (East Lansing line-0, 10-day old eggs). Mapping of PacBio reads from DF1 cells against either v5 of the chicken genome sequence or our PacBio contig 2 gives good coverage but with low coverage gaps detected in IFITM3 and B4GALNT4 (Fig. 3a- [...] genome (Fig. 3c), suggesting the current turkey genome is in need of improvement with long read PacBio sequences, as achieved for the chicken genome. We were, however, able to identify all four IFITM genes in the turkey locus. We constructed multiple sequence alignments for the two chicken and turkey genome IFITMs (Fig. 4). Amino acid sequence alignment of the IFITM proteins of DF1, turkey and Gallus gallus v5 shows substantial differences (Fig. 4). For the known antiviral IFITMs, one amino acid change was found between Red Jungle Fowl and White Leghorn, namely A63V in the CIL domain of IFITM2. More amino acid substitutions were seen for turkey compared to chicken IFITMs, with 13, 13 and 11 differences between IFITM1, 2 and 3, respectively. Variation in one of the chicken IFITMs is maintained in the turkey gene, namely amino acid 63 A (Red Jungle Fowl) or V (White Leghorn) and 63 V (turkey) in IFITM2.
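Substitutions such as A63V are read off a pairwise alignment by walking both sequences in parallel and numbering positions on the reference protein. The sketch below shows one way this could be done, assuming two already aligned, equal-length strings with "-" for gaps. The function name and the toy sequences are hypothetical and are not real IFITM sequences; the alignments in the study were produced with Clustal Omega.

```python
def list_substitutions(ref_aln, alt_aln):
    """Return substitutions (e.g. 'A63V') between two aligned protein strings.

    Positions are numbered on the ungapped reference sequence. Columns with
    a gap ('-') in either sequence are skipped, so indels are not reported.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have the same length")
    substitutions = []
    ref_pos = 0
    for ref_aa, alt_aa in zip(ref_aln, alt_aln):
        if ref_aa != "-":
            ref_pos += 1  # advance only on real reference residues
        if ref_aa == "-" or alt_aa == "-":
            continue
        if ref_aa != alt_aa:
            substitutions.append(f"{ref_aa}{ref_pos}{alt_aa}")
    return substitutions


if __name__ == "__main__":
    # Toy aligned fragments, invented for illustration only.
    reference = "MK-ATLP"
    variant = "MKQVTLP"
    print(list_substitutions(reference, variant))
```

Counting the entries of the returned list gives the number of amino acid differences between two orthologues, excluding insertions and deletions.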
Fig. 3 a Artemis coverage and stack view of the IFITM locus in DF1 cells following pull down of the IFITM locus using SureSelect probes and sequencing with PacBio. The figure shows an intact locus and successful mapping of the IFITM locus against the Gallus gallus sequence reference, despite two gaps observed within the B4GALNT4 and IFITM3 genes. b Artemis coverage and stack view of the IFITM locus in DF1 cells following pull down of the IFITM locus using SureSelect probes and sequencing with PacBio. These reads were instead mapped against the new PacBio contig 2 sequence reference. As for the mapping above, two gaps (one partial) are observed within the B4GALNT4 and IFITM3 genes, although more reads cross the gaps, allowing full coverage. c Artemis coverage and stack view of the IFITM locus in turkey breast tissue following pull down of the IFITM locus using SureSelect probes and sequencing with Illumina MiSeq. The graph shows successful mapping of MiSeq reads despite using chicken probes to pull down the locus in turkey tissue. The white bars represent actual gaps in the turkey reference as published on both Ensembl and NCBI, to which the probes will not map, as gaps are shown in the reference as "NNN"

Mapping RNA-seq data to the PacBio contig 2 reference containing the chIFITM locus
The generation of a high quality de novo assembly of the IFITM locus sequence allows accurate mapping of RNA-seq data from previously published studies for qualitative and quantitative analysis. To validate which chIFITM transcripts were expressed, and to assess their level of expression, we first used RNA-seq reads from 293T cells engineered to express only chicken IFITM proteins constitutively. Reads from the control cells (wild type 293T) do not map to the chIFITM locus (Table 4). The number of mapped reads, and by implication the expression level, for chIFITM3 was higher than that of chIFITM2, and in turn both were higher than that of chIFITM1 (Additional file 6). We analysed 26 RNA-seq studies totaling 293 sequenced chicken tissues and avian cell lines that were identified in the ENA database. The samples were examined for constitutive expression levels of the chIFITMs in a subset of each study covering at least one immune relevant tissue type (Table 5). To analyze constitutive expression, RNA-seq data from liver, spleen, lung and trachea samples taken from the studies listed in Table 6 were mapped against the PacBio contig 2. To these, we added expression data from commonly used laboratory cell lines (DF1, CEF, HD11, DT40). ChIFITM3 is constitutively expressed (both exons) in all tissues and cell lines analysed, at levels higher than the putative chIFITM1 and chIFITM2. Indeed, putative chIFITM1 is barely detectable in most of the tissues, and much lower compared to the other IFITM transcripts, as also shown by the RPKM values in Table 5. Further, when cells are infected or subject to cellular stress, chIFITM2 and chIFITM3 are abundantly expressed, again with little IFITM1 expression. Indeed, it is not possible to detect convincing levels of IFITM1 expression at any time except for caecal and ileum tissue infected by influenza A H5N2 or H5N11 (Figs. 5 and 6, Additional file 7 and Table 5). In addition, the coverage graphs confirm that the typical genetic structure of the chIFITM genes is maintained, with two exons separated by a single intron in all cases, although reads were observed to map beyond the boundaries of the annotated genes, particularly in the stretch of genomic region between IFITM2 and 5 (Figs. 5 and 6).
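The RPKM values referred to above normalize a raw read count by both transcript length and sequencing depth, which is what makes expression comparable between genes and between studies. The sketch below implements the standard definition (reads per kilobase of transcript per million mapped reads). The text does not state which software produced the values in Table 5, and the numbers in the example are invented for illustration.

```python
def rpkm(reads_in_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads_in_gene * 1e9 / (gene_length_bp * total_mapped_reads)
    `gene_length_bp` is normally the summed exon length of the transcript.
    """
    if gene_length_bp <= 0:
        raise ValueError("gene_length_bp must be positive")
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_gene * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Invented example: 500 reads on a 1 kb transcript in a 10 million read library.
    print(rpkm(reads_in_gene=500, gene_length_bp=1000, total_mapped_reads=10_000_000))
```

Because the library size is in the denominator, the same raw count gives a lower RPKM in a deeper library, so raw counts alone should not be compared across the 26 studies.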

Discussion
In this study we have sequenced a BAC clone containing the complete chIFITM locus using both PacBio and Illumina MiSeq sequencing technologies producing an accurate assembly of the locus. We analysed expression levels of the chIFITM genes using publicly available RNA-seq data from different chicken lines and tissues, and produced hybrid capture probes for 'pull-down' sequencing of another chicken line and the more distant turkey IFITM locus.
The chIFITM locus showed several gaps in version 4 of the chicken genome release (Gallus gallus 4). It was improved in version 5 by sequencing the same DNA reference source (female Red Jungle Fowl, UCD001 inbred line) with PacBio technology. Comparison of the two public versions of the chIFITM locus with the one generated in our study (PacBio contig 2) still demonstrated differences, despite all being derived from the same inbred line. We believe these discrepancies in the public genome assemblies might be a consequence of the genome-wide assembly required for the full chicken genome, suggesting that our BAC sequence (203 kb) is likely to be more accurate, particularly in GC-rich regions. In addition, quality control analysis and the type of assembler used will influence the final consensus sequence generated for any region of the chicken genome, leading to the differences observed in the sequences. To produce our sequence, we employed both PacBio RS II and Illumina MiSeq technologies because they have complementary properties that met our requirements for covering gaps and maintaining sequence integrity. Sequencing within Gallus gallus domesticus lines, more outbred chickens and more divergent Galliformes is now possible using hybrid capture genome sequencing. Indeed, we have been able to document many amino acid sequence changes between chickens and turkeys in the antiviral IFITMs in regions of the proteins known to be important for their antiviral activity (Fig. 4).
Obtaining an accurate sequence is vital to understand the genetic structure and confirm the identity of the IFITM locus, and thus to correctly annotate the genes. Hypothetical structures of the chIFITM locus have been suggested, based on the human locus, but inconsistencies remain between alignments for the putative chIFITM1 and chIFITM2 [16,17,23]. Based on the literature and current annotation, the four genes are clustered on chromosome 5, which also contains the chIFITM10 gene (the function of which remains to be elucidated). Following the discovery of chIFITM2, Smith et al. [23] proposed an organizational structure for the locus, based on features such as membrane localization and lack of an N-terminal extension (both characteristic of the IFITM2 and IFITM3 proteins), suggesting that chIFITM2 is actually analogous to human IFITM1 [23]. Our immunofluorescence analysis of the localization of the chicken proteins expressed in human (293T) stable cell lines is in agreement with Smith et al. (data not shown, Bassano et al. in preparation) [23]. Indeed, chIFITM2 is membrane-bound, while chIFITM1 localizes to the early endosomes.
Here our RNA-seq analysis of the ENA dataset shows that chIFITM1 basal expression levels are very low compared to chIFITM2 and chIFITM3. The analysis of the samples in the presence of IFNα, H5N2, H5N1, H5N3, IBDV, IRF7, ALV or lipopolysaccharide, or in heat-stress induced conditions, also shows that higher expression levels can be observed for chIFITM3 and chIFITM2, suggesting a key role for these two proteins as antiviral IFITMs compared to chIFITM1, which is expressed only in the intestinal tract and in the testis. Although immunofluorescence staining seems to suggest that chIFITM2 is analogous to hIFITM1 (they are both plasma membrane-bound), the genome organisation supported by long read PacBio sequences now unambiguously confirms that the chIFITM2 and chIFITM1 locus is inverted compared to the human locus. We therefore propose, based on gene expression, genome architecture and published functional data, that the gene order in the chicken locus on chromosome 5q should be renamed: centromeric -

Conclusions
In this report we have produced an updated genomic map of the chIFITM locus that includes the two flanking genes ATHL1 and B4GALNT4, by combining and analyzing sequencing data derived from PacBio RS II and Illumina MiSeq sequencing technologies. The only difference detected in our assembled locus sequence relative to Gallus gallus (v5) is a 5 bp insertion in the intronic region of chIFITM3. This change in sequence may not have any influence on the function and expression of the chIFITM3 gene. RNA-seq analysis shows expression of all IFITMs from this locus, but chIFITM1 has a different pattern of expression from the other antiviral IFITMs. Initial analysis shows IFITM amino acid variation between different chicken breeds and turkeys.

Bacterial Artificial Chromosome (BAC) construct recovery
The BAC clone (CHORI-261) from Red Jungle Fowl strain UCD001 covering the predicted IFITM locus was purchased from BACPAC Resources Centre. The BAC clone, delivered as a stab culture, was streaked directly on Luria Broth (LB) agar (chloramphenicol 12.5 μg/mL) to isolate single colonies and incubated overnight at the designated growth temperature. Single colonies were picked and cultured in LB media. Plasmid DNA was then extracted and purified according to the Qiagen Plasmid DNA kit manufacturer's protocol.

Sequencing, assembly and alignment
A total of 3 μg isolated plasmid DNA was sequenced across two platforms, the Illumina MiSeq and PacBio RS II. Library preparation and quality control were undertaken by The Wellcome Trust Sanger Institute's core sequencing facility. Assembly of PacBio sequencing reads was performed using protocols available on the SMRT® Portal. Briefly, sequencing fragments were first filtered to remove reads that did not meet read quality and length thresholds, then de novo assembled using HGAP [31]. Errors in the re-circularization of the BAC, as well as in sequence consensus generation for the DF1 cell line, were corrected using iCORN v2, Iterative Correction of Reference Nucleotides [32]. MiSeq reads were first assessed for quality with FastQC [33] and low quality reads were trimmed using Trimmomatic [34]. De novo assembly of MiSeq reads was attempted using IVA [35], SGA [36] and HGAP from the SMRT® Analysis package [31]. SMALT (http://www.sanger.ac.uk/science/tools/smalt-0), a pairwise sequence alignment program, was used to map MiSeq reads onto genomic reference sequences, either chromosome 5 of Gallus gallus (v4 and v5) or the consensus sequence generated from de novo assembly of PacBio sequencing reads. The SAM files generated were converted into indexed BAM files using Samtools 0.1.19 [37]. Artemis (v13.0) and ACT (Artemis Comparison Tool) [38] were used to analyse locus coverage and accuracy of the alignment. Comparison files required to run ACT were generated with megablast [39]. Dot plots were generated by calling dotter from the command line [40]. Annotation for the PacBio consensus sequence was generated by RATT, Rapid Annotation Transfer Tool [41], using the annotation from Gallus gallus version 4 as scaffold. All sequences produced in this manuscript are deposited in the ENA under the accession numbers ERS556272, ERS565108, ERS1276179, PRJNA361311.
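Locus coverage was inspected visually in Artemis. As a complement, the underlying idea can be expressed in a few lines: compute per-base read depth from the alignment intervals, then report stretches below a chosen depth. The sketch below is an illustration only and not the procedure used in this study. It assumes alignments have already been reduced to half-open, 0-based (start, end) intervals on the reference; the function names, toy intervals and threshold are hypothetical.

```python
def depth_profile(alignments, ref_length):
    """Per-base read depth from half-open, 0-based (start, end) intervals.

    Uses a difference array: +1 where a read starts, -1 where it ends,
    followed by a running sum across the reference.
    """
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, ref_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth = []
    running = 0
    for pos in range(ref_length):
        running += diff[pos]
        depth.append(running)
    return depth


def low_coverage_regions(depth, min_depth):
    """Half-open (start, end) regions where depth is below `min_depth`."""
    regions = []
    region_start = None
    for pos, value in enumerate(depth):
        if value < min_depth:
            if region_start is None:
                region_start = pos  # open a new low-coverage region
        elif region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, len(depth)))
    return regions


if __name__ == "__main__":
    # Two toy reads on a 10-base reference (hypothetical values).
    profile = depth_profile([(0, 5), (3, 8)], ref_length=10)
    print(profile)
    print(low_coverage_regions(profile, min_depth=1))
```

In practice the intervals would be taken from the indexed BAM files described above, and the depth threshold would depend on the sequencing platform and the expected coverage.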

SureSelect pull down of the IFITM locus
SureSelect probes covering the chicken IFITM locus (40 kb region) were purchased from Agilent, and samples were processed for targeted pull down according to the Illumina and PacBio protocols.