The Gallus gallus RJF reference genome reveals an MHCY haplotype organized in gene blocks that contain 107 loci including 45 specialized, polymorphic MHC class I loci, 41 C-type lectin-like loci, and other loci amid hundreds of transposable elements

Abstract MHCY is a second major histocompatibility complex-like gene region in chickens originally identified by the presence of major histocompatibility complex class I-like and class II-like gene sequences. Up to now, the MHCY gene region has been poorly represented in genomic sequence data. A high density of repetitive sequence and multiple members of several gene families prevented the accurate assembly of short-read sequence data for MHCY. Identified here by single-molecule real-time sequencing sequencing of BAC clones for the Gallus gallus Red Jungle Fowl reference genome are 107 MHCY region genes (45 major histocompatibility complex class I-like, 41 c-type-lectin-like, 8 major histocompatibility complex class IIβ, 8 LENG9-like, 4 zinc finger protein loci, and a single only zinc finger-like locus) located amid hundreds of retroelements within 4 contigs representing the region. Sequences obtained for nearby ribosomal RNA genes have allowed MHCY to be precisely mapped with respect to the nucleolar organizer region. Gene sequences provide insights into the unusual structure of the MHCY class I molecules. The MHCY class I loci are polymorphic and group into 22 types based on predicted amino acid sequences. Some MHCY class I loci are full-length major histocompatibility complex class I genes. Others with altered gene structure are considered gene candidates. The amino acid side chains at many of the polymorphic positions in MHCY class I are directed away rather than into the antigen-binding groove as is typical of peptide-binding major histocompatibility complex class I molecules. Identical and nearly identical blocks of genomic sequence contribute to the observed multiplicity of identical MHCY genes and the large size (>639 kb) of the Red Jungle Fowl MHCY haplotype. Multiple points of hybridization observed in fluorescence in situ hybridization suggest that the Red Jungle Fowl MHCY haplotype is made up of linked, but physically separated genomic segments. The unusual gene content, the evidence of highly similar duplicated segments, and additional evidence of variation in haplotype size distinguish polymorphic MHCY from classical polymorphic major histocompatibility complex regions.

common homozygous variants and made no additional changes in the MHCY region. The polished 34j16 sequence was identical to the previous Sanger assembly. No changes were made to 34j16. The 173o1 sequence was compared to J_AA173O01 in AC275299.1 and to an earlier Sanger sequence assembly. Three changes were made to 173o1: T→TC at 31505, C→CG at 45669, C→CG at 124293 (the numbers indicate the positions prior to correction). Overall, there was little evidence for changes within the MHCY region of the assembly based on variant calls. More details can be found at: https://github.com/cwarden45/Miller_Red_Jungle_Fowl_MHCY/tree/main/Part1_Assembly.

Assembly of Contig 1
Following a "working draft" assembly, 190m7 and 19d16 sequence were combined (with the internal, common EcoRI site overlapped). The five clones (173o1, 102b15, 1o23, 190m7, and 19d16) were then compared to this initial sequence representing the entire length of Contig 1. BLAST ((Altschul et al. 1990), blastn v2.2.27+, parameters "-perc_identity 95 -evalue 1e-20") was used to confirm the relationships between the contigs and individual clones without identifying sequence to change within Contig 1. In addition, the five separate clones were aligned to the draft Contig 1 sequence using BWA-MEM v 0.7.15-r1140 (Li 2013), with the "-M" parameter. BWA-MEM can split sequences to create supplemental reads. Initial inspection of supplemental BWA-MEM alignments for 1o23 and 102b15 (the two clones other than 190m7 that contained NOR sequence), showed some noticeable divergence suggesting that these two sequences might be incorrectly aligned. Those alignments were manually removed so the resulting BWA-MEM alignment was in relatively good concordance to Contig 1 for each clone. Variations from the draft reference sequence in the BWA-MEM alignment were summarized by creating a .pileup file (using samtools v1.4 (Li et al. 2009) with default parameters).
Seven changes were made to the 190m7 sequence where the consensus of overlapping sequences suggested there were errors: C→ CG at 95439, GC→G at 101375, AC→A at 103583, A → AT at 103685, A → AC at 103997, A → AC at 104654, and T → TC at 104976 (the numbers indicate the positions prior to correction). In the area where only 19d16 and 173o1 overlapped, nucleotide mismatches could not be resolved by majority calls. For mismatches in this region, polished calls were used to determine which call to select. Two additional changes were made to produce the Contig 1 sequence: C → CA and C → CG at previous positions 146229 and 166681, respectively. An additional modification to the Contig 1 consensus sequence was made later by adding a five bp repeat (AAGGG) at position 339,990. The code associated with this assembly is available: https://github.com/cwarden45/Miller_Red_Jungle_Fowl_MHCY/tree/main/Part1_Assembly/Contig1

Definition of Contig 2 and Contig 3
Contig 2 and Contig 3 are the sequences of single clones, 58f18 and 34j16 respectively. While they have sequences to each other and to Contigs 1 and 4, the sequence could not be assembled into a single sequence.

Assembly of Contig 4 from fosmid clones
Also included in this study are GenBank sequences for two fosmid clones: AC270441.1 (clone J_AD-484G5) and AC270418.1 (clone J_AE-174A8) from UCD001 #256 fosmid libraries. These were deposited in GenBank by others (Bellott et al. 2017), but not assigned to a chromosome. They were found in searches of GenBank to have high-quality matches to the MHCY BAC sequences. They form Contig 4 (45,013 bp) in which there is a 27,060 bp 100% identity overlap between the two clones. Only two modifications were made to produce Contig 4. These were the removal of one mismatching nucleotide from the end of AC270418.1 in the region of overlap and the removal of 25 bp of vector sequence at the opposite end (outside the region of overlap).

Alignment of Illumina RNA-Seq data and NCBI EST data
To prepare a sequence set suitable for screening RNA-Seq data for evidence of gene expression and to evaluate intron/exon boundaries, a specialized sequence was prepared starting with the galGal5 reference genome. This galGal5 reference sequence was downloaded from the UCSC Genome Browser (Kent et al. 2002) on 5/12/2016. Sequence for GGA16 was modified to remove sequence containing the MHCY region loci identified in galGal5 (all sequence after 350,000 was removed), and the four contig sequences were added in. The goal was to have a genomic sequence data set against which expression data could be compared so that matches within the four contigs would be distinguished from background matches to other regions of the genome.

Assessment of gene annotations using Illumina RNA-Seq data
We use normalized coverage and absolute coverage for all splice junctions in a gene model to assess the gene candidate annotations. For this, the total unique and multi-mapped reads were parsed from the STAR summary files (with extension _Log.final.out). These values were used to calculate Count-Per-Million (CPM) normalized coverage for each junction. The raw unique and multi-mapped junction counts were collected from STAR (from files with extension _SJ.out.tab). The multi-mapped coverage was adjusted based upon the total number of known copies present in Contigs 1-4. In Table S2, the separately adjusted total CPM values were summed across copies of 100% identical genes for each unique gene sequence.
If all junctions in the gene had at least one hundred reads in at least ten libraries, the gene was assigned to evidence category "Good". If all junctions in the gene had at least ten reads in at least three libraries, the gene was assigned to evidence category "Intermediate". Genes with at least one read in at least one sample for all junctions were assigned to evidence level "Low". If there was not read evidence for all junctions with the main set of STAR alignment parameters used, then we inspected the alignments and decided if the gene should be kept. Rather than using "pseudogene" to describe sequences with very low (one block) scores, an evidence level of "Poor" was assigned. Intermediate tables and the code to define gene annotation confidence using the STAR splice junctions from Illumina RNA-Seq data can be found here: https://github.com/cwarden45/Miller_Red_Jungle_Fowl_MHCY/tree/main/Part2_Annotation/ST AR_Splice_Junction_Evidence. There is also an additional strategy to quantify the evenness of expression that complements the assignments for all "Good" evidence genes as well as a selected number of "Intermediate" evidence genes (https://github.com/cwarden45/Miller_Red_Jungle_Fowl_MHCY/discussions/3).

dN/dS Divergence Methods
All MHCY sequences are provided from this paper, and analysis uses MHCY sequences shown in Figure S3. For the HLA comparison, a list of 12 from HLA-A and 13 from HLA-B alleles was used to find representative sequences in IMGT/HLA (Robinson et al. 2020). A representative sequence from GenBank was selected based upon certain criteria, and it was confirmed that the nucleotide sequence for that specific GenBank entry was 100% match to what is provided by IMGT/HLA (and provided allele information in the GenBank deposit was compatible with IMGT/HLA). If needed, Biopython (Cock et al. 2009)was used to extract coding (CDS) sequences from GenBank deposits. Likewise, as needed, the "Translate tool" from ExPASy was used for manual sequence inspection (Gasteiger et al. 2003).
A neighbor joining tree (Saitou and Nei 1987) of peptide alignments for the α1and α2 domains were created using Clustal Omega Goujon et al. 2010). Phylogenetic trees in text format were exported from Clustal Omega and a scaled tree for amino acid divergence was created using ggtree (Yu et al. 2017). Divergence values (dN and dS) were calculated using codeml in PAML (v4.8, (Yang 2007)), using the Nei and Gojobori (Nei and Gojobori 1986) estimates. The configuration file was based upon that provided by Bitarello (Bitarello et al. 2016) for HLA sequences. A modified version of the configuration file and all other code necessary to reproduce results are available at https://github.com/cwarden45/Miller_Red_Jungle_Fowl_MHCY/tree/main/Part2_Annotation/dN_dS_Analysis.