Trans-species polymorphism at antimicrobial innate immunity cathelicidin genes of Atlantic cod and related species

Natural selection, the most important force in evolution, comes in three forms. Negative purifying selection removes deleterious variation and maintains adaptations. Positive directional selection fixes beneficial variants, producing new adaptations. Balancing selection maintains variation in a population. Important mechanisms of balancing selection include heterozygote advantage, frequency-dependent advantage of rarity, and local and fluctuating episodic selection. A rare pathogen gains an advantage because host defenses are predominantly effective against prevalent types. Similarly, a rare immune variant gives its host an advantage because the prevalent pathogens cannot escape the host’s apostatic defense. Due to the stochastic nature of evolution, neutral variation may accumulate on genealogical branches, but trans-species polymorphisms are rare under neutrality and are strong evidence for balancing selection. Balanced polymorphism maintains diversity at the major histocompatibility complex (MHC) in vertebrates. The Atlantic cod is missing genes for both MHC-II and CD4, vital parts of the adaptive immune system. Nevertheless, cod are healthy in their ecological niche, maintaining large populations that support major commercial fisheries. Innate immunity is of interest from an evolutionary perspective, particularly in taxa lacking adaptive immunity. Here, we analyze extensive amino acid and nucleotide polymorphisms of the cathelicidin gene family in Atlantic cod and closely related taxa. There are three major clusters, Cath1, Cath2, and Cath3, that we consider to be paralogous genes. There is extensive nucleotide and amino acid allelic variation between and within clusters. The major feature of the results is that the variation clusters by alleles and not by species in phylogenetic trees and discriminant analysis of principal components. Variation within the three groups shows trans-species polymorphism that is older than speciation and that is suggestive of balancing selection maintaining the variation. Using Bayesian and likelihood methods positive and negative selection is evident at sites in the conserved part of the genes and, to a larger extent, in the active part which also shows episodic diversifying selection, further supporting the argument for balancing selection.


INTRODUCTION
Vertebrates fight microbial infections using both innate immunity and adaptive responses. 13 MHC molecules, cell surface molecules with broad (MHC-I) and specialized (MHC-II) 14 pathogen recognition features (Murphy et al., 2007), show trans-species polymorphisms, 15 variation indicative of adaptive balancing selection. For example, certain MHC-II alleles 16 of humans are more closely related to certain alleles of chimpanzee than to other human 17 alleles (Fan et al., 1989;Nei and Hughes, 1991). An ancient balanced polymorphism 18 will generate long genealogical branches. Neutral variation will accumulate at sites 19 close to the balanced polymorphic sites (Charlesworth, 2006). However, depending on 20 recombination, the size of the genomic region can be quite short, making trans-species 21 polymorphism hard to detect. Obvious and pervasive trans-species polymorphism, in 22 contrast, is most likely due either to multiple sites under balancing selection or to 23 suppression of recombination or to both (Wiuf et al., 2004). The models that have been 24 proposed for detecting balancing selection in molecular data frequently assume that there host is crucial for its own survival as well as the survival of the parasite. The innate 42 immune system is at the forefront of this battle. It is of special interest to investigate 43 evolution and variation of the innate immunity genes responsible for host defense. 44 Various families of antimicrobial peptides are an essential part of innate immunity. 45 The cathelicidin family has been extensively studied in many organisms, i.e. primates 46 (Zelezetsky et al., 2006) and fish (Maier et al., 2008;Kapralova et al., 2013) but it 47 was first described in mammals (Zanetti et al., 1995). The number of genes coding 48 for this protein varies among species. For example, there is a single gene in human 49 (Gudmundsson et al., 1996) whereas there are ten in pig (Dawson et al., 2013). The 50 protein is characterized by an N-terminus, a signal sequence, a conserved cathelin-like 51 domain (exons 1, 2 and 3) and a C-terminal domain with antimicrobial activity (exon 52 4). The N-terminus of the protein has certain conserved features that characterize all 53 cathelicidins, i.e., four cysteine residues forming two disulfide bridges (Tomasinsig 54 and Zanetti, 2005) ( Figure S1). This evolutionarily conserved part is, nevertheless, 55 targeted by positive selection (Zhu, 2008) ( Figure S1). The C-terminus is highly variable within multigene families and among species, most likely due to diversifying balancing 57 selection (Tomasinsig and Zanetti, 2005). Many innate immune molecules have been 58 described in Atlantic cod, e.g., piscidin (Fernandes et al., 2010), beta-defensin (Ruangsri 59 et al., 2013) and the expanded toll-like receptor family (Sundaram et al., 2012), showing 60 novel forms and patterns indicating importance of antimicrobial peptides and their genes 61 for the immunity of these fish. 62 Several hypotheses have been proposed for the selective maintenance of high di-63 versity at the MHC-II loci in vertebrates. These hypotheses include the heterozygote 64 advantage hypothesis, the frequency-dependent rare-allele advantage hypothesis, and  Most genome-wide studies, scanning for variation, show high-frequency polymor-81 phisms in genes related to immunity (Nielsen et al., 2007;Leffler et al., 2013). In this 82 study, we examine the Cathelicidin family of innate immunity genes in Atlantic cod in 83 individuals from throughout the distributional range (Figure 1), and in closely related 84 species. We report large variation within and among species. We report a distinctive 85 data set discovered when we attempted to amplify a particular Cathelicidin gene with 86 a pair of primers designed from Atlantic cod sequences. Our initial aim was to study 87 population variation at the single codCath1 locus previously described (Maier et al.,88 2008) and also found in the Atlantic cod genome sequence (Star et al., 2011). With only 89 a single pair of primers we found extreme variation in 97 clones from 27 individuals.

90
The amount and patterns of variation both within and among species cannot be explained 91 as single locus variation. We discuss the orthologous and paralogous variation in terms 92 of trans-species polymorphism.

94
Sampling 95 We used 97 clones from 27 individuals in the study. There were 19 individuals of 96 Atlantic cod (mnemonic: Gmo) from throughout the distributional range of the species: 97 two each from Greenland (Gre), Barents Sea (Bar), Celtic Sea (Cel), Baltic Sea (Bal), 98 Norway (Nor), Faroe Islands (Far), and Canada (Can) and five from around Iceland (Ice). 99 We also included two individuals of each of the closely related species (Figure 1) 133 We extracted DNA using a Chelex/proteinase K extraction method (Walsh et al., 1991 to a strategy that we discuss below and in Árnason and Halldórsdóttir (2015). The 149 amplified fragment contained the whole gene, four exons and three introns with part of 150 the 5 and 3 UTR ( Figure S2). We sequenced the gene and the 3 UTR. EcoR1 digest of 151 the clones run on agarose gels showed different sizes of the fragments in clones from 152 some individuals. The size differences were confirmed upon sequencing. We, therefore, 153 added and sequenced more clones from chosen individuals to further study the different 154 sized fragments (see Table 1).

156
Errors occur during PCR amplification and inevitably will be found, mostly as singletons, showed that sequences of clones from some individuals were very different from each 161 other, too divergent to be variation due to PCR errors. In some instances they belonged 162 on the amino acid level to already described paralogous genes (Maier et al., 2008). therefore, only sites found in Cath2 that were common to the three genes were analyzed.

213
The significance level for the SLAC, FEL and MEME p-values was 0.2. The REL Bayes

214
Factor was 50, and the FUBAR Posterior Probability was 0.9.

215
Due to the trans-species nature of variation some analysis, that are developed for 216 intraspecific variation were made on the trans-species variation. The assumption here 217 is that trans-species variation is representative of intraspecific variation that could be 218 found with larger sample sizes.    Figure 6). The DAPC cleanly separated groups defined 303 by alleles but groups based on species were largely overlapping. We thus conclude that 304 there are three paralogous genes, Cath1, Cath2, and Cath3, and that the variation within 305 each cluster represents allelic variation of each gene.

306
In some individuals we found representatives of only one gene or even of only 307 a single allele. In some instances we looked more closely at several clones of such 308 individuals without detecting more alleles. This may be a chance event or it may be 309 due to variation in primer binding sites. In that case our data would have ascertainment 310 bias from using only a single primer pair for PCR amplification. If that was the case we   (Figures 2, 3, 4, 5, 6, S1, S2, S3, S4).  Figure 4) has the nine aa insertion that previously had been described as a paralogous 367 gene Cath3 (Maier et al., 2008). According to our data it is an allelic variant of Cath1. 368 We therefore drop the Cath3 label for this variant and reserve that for the major cluster PrePrints (Figure 2). Interestingly a shorter insertion of six aa (similar but not identical) was also 370 found in individual 114718-4.Gmo.Far, an Atlantic cod from the Faeroe Islands.

371
Population genetic statistics 372 We estimated the nucleotide diversity π, the scaled mutation rate θ and Tajima's D in a 373 sliding window of 100 bp over the genes coding for Cath1 and Cath3, noncoding regions 374 and both synonymous and non-synonymous sites in coding regions. For Cath1, θ was 375 higher than π, giving a negative D over the whole gene ( Figures S5 and S6) (Table 2 and Table 3). We screened alignments for recombination with GARD  We analyzed exons 1-3, the conserved part, separately from exon 4, which constitutes 405 the active peptide. Sites containing gaps were excluded from this analysis. Therefore,  (Table 2 and Table 3).

412
The SLAC (Single Likelihood Ancestor Counting) program, the most conservative    Halldórsdóttir, 2015), would make this even more significant. 506 We show that the polymorphism is older than speciation given that divergent alleles 507 of different genes can be found in different species. The balancing selection hypothesis 508 is a plausible explanation because a scenario of concerted evolution between paralogous 509 genes would otherwise be expected (Liao, 1999).

10
.     PrePrints Table 3. Codon-based maximum likelihood and Bayesian analysis for negatively selected sites in exon 4 and in exons 1, 2, and 3 combined. Underlined codon is non-neutral according to the given method at the specified significance level. Consensus column summarizes methods which found the codon negatively selected. Analysis was made using the Datamonkey server www.datamonkey.org (Delport et