Genomic architecture and evolutionary conflict drive allele-specific expression in the social supergene of the red fire ant

Supergenes are genomic regions of suppressed recombination that underlie complex polymorphisms. Despite the importance of such regions, our empirical understanding of their early evolution is limited. The young “social” supergene of the fire ant Solenopsis invicta provides a powerful system for disentangling the roles of evolutionary conflict and the implications of suppressed recombination. We used population genomics to identify genetic differences between supergene variants and gene expression analyses across different populations, castes and body parts to characterize allelic expression differences for the hundreds of genes in the supergene. We find that the expression of most genes is independent of social form or supergene variant, in line with the young age of this system. Many of the genes with allelic expression differences, however, show a pattern consistent with gene degeneration due to suppressed recombination. In contrast, a small portion of the genes has the signature of evolutionary conflict between social forms.


Introduction
Selection for multiple phenotypic optima within a single species can result in intra-specific evolutionary conflict. Such conflict can lead to selection for reduced recombination between co-adapted alleles encoding the different phenotypes (1). In turn, this selection pressure can favor the formation of supergene regions containing tightly linked alleles of up to hundreds of genes. Such regions enable the maintenance of genetic interactions over evolutionary time (2,3). We now know that supergenes controlling ecologically important traits are widespread. These include flower heterostyly in Primula (4), mating type in Mycrobotryum fungi (5), Batesian mimicry in butterflies (6,7), mating behavior in white-throated sparrows (8,9) or male sexual morphs in ruffs (10,11).
The non-recombining portions of sex chromosomes are a special type of supergene because the two sexes are interdependent (12). The large body of research on the evolution of sex chromosomes constitutes a good starting point to understand supergene evolution. Evolutionary conflict over sexual phenotypes is believed to cause an enrichment of sexually-biased genes in the sex chromosomes, where the Y (or Z) variant is masculinized, and the X (or W) variant is feminized (13,14). This pattern occurs in many established sex chromosome systems (15)(16)(17)(18). Studies focusing on young sex chromosome systems, however, suggest that the role of evolutionary conflict in driving sex chromosome evolution has been overestimated (19)(20)(21)(22)(23)(24). Instead, sex chromosome evolution could be driven mostly by processes unrelated to phenotypic differences between sexes (5,25,26). For example, gene-specific dosage compensation can be favored in response to mutations that change coding sequences (27) or expression levels (28,29) in the non-recombining variant. In sum, processes derived from gene degeneration and selection for alternate genomic optima to overcome evolutionary conflict can play a role during the different stages of sex chromosome evolution (30).
We aim to understand the relative importance of evolutionary conflict and the consequences of recombination suppression in shaping the early evolution of a supergene. For this, we focus on the social supergene of the red fire ant Solenopsis invicta. Two social forms coexist in this species: colonies either have one or multiple queens. This social polymorphism is associated with multiple behavioral and physiological traits, all of which are controlled by a pair of "social chromosomes", SB and Sb, that carry distinct supergene variants (31). In single-queen colonies, all workers and queens are SB/SB homozygotes. In multiple-queen colonies, all egg-laying queens are SB/Sb heterozygotes, but workers can either be homozygous or heterozygous. Sb/Sb queens are rare in the wild and their offspring are unviable (32,33). Recombination is severely repressed between SB and Sb (31,34) due to the presence of at least 3 inversions (35,36). Similarly to a Y chromosome, Sb thus lacks recombination opportunities. The lack of recombination reduces the efficacy of selection in the Sb variant and is likely responsible for its ongoing degeneration through the accumulation of deleterious mutations (34,35).
The extent to which a lack of recombination leads to gene depletion is variable. For example in Y chromosomes of Poeciliid fish, variation among taxa in gene depletion correlates with the intensity of sexual conflict (37). In the fire ant, the vast majority of genes are intact in both variants of the social chromosome supergene (31,38). This lack of gene depletion is likely due to two main factors. First, the variants of the social chromosome only diverged recently (approximately 1M years ago (31,38)). Additionally, male ants are haploid and therefore a haplotype with a deleted gene would be fully exposed to selection in males. The resulting selection against deletion would be much higher than in diploid species which would still carry an unaffected gene copy (35,39).
Evolutionary conflict and the suppression of recombination should leave distinct signatures on patterns of gene expression. On one hand, conflict between single-and multiple-queen phenotypes should favor the accumulation of alleles adapted to the multiple-queen environment on the Sb haplotype. Under this scenario, we would expect that genes expressing the Sb more than the SB haplotype would tend to show an overall pattern of higher expression in multiple-rather than single-queen colonies. On the other hand, gene degeneration in the Sb haplotype should lead to lower expression of Sb alleles and perhaps genespecific dosage compensation of corresponding SB alleles in dosage-sensitive genes. Here, we test these ideas by generating detailed genomic and transcriptomic data: we sequenced genomes of fire ants from their South American native range and combined these with existing genomes from the invasive North American range to identify genes with fixed differences between the SB and Sb variants of the social chromosome. To detect differences in expression between SB and Sb alleles, we performed RNA sequencing (RNA-seq) from SB/Sb individuals from the South American range of the species. We used three body parts of queens (head, thorax and abdomen) and whole bodies of workers. We then compared these expression patterns with those obtained from comparisons between social forms. Our results are consistent with evolutionary conflict shaping the evolution of the fire ant supergene in combination with a strong influence of the consequences of reduced recombination.

Hundreds of genes have fixed allelic differences between supergene variants
To identify differences between supergene variants we obtained a total of 408⨉ coverage of genome sequence from 20 haploid SB males and 20 haploid Sb males. Thirteen within each group (65%) are from the native South American range whereas the rest are from an invasive North American population (31). By comparing the two groups of males we identified 2,877 single nucleotide polymorphisms (SNPs) with one allele in all SB individuals and a different allele in all Sb individuals, affecting 352 genes (Supporting Information, Table S1). Among the 3.4% of SNPs affecting coding sequence, almost half change the aminoacid sequence (non-synonymous), with one change to a premature stop codon. The remaining SNPs were in intergenic (36.1%), intronic (58.0%) or untranslated regions (2.5%).
Because the invasive North American population went through a severe bottleneck in the 20th century (40), we repeated the analysis after separating populations. We found 252 additional SNPs with fixed differences between SB and Sb individuals in South America, and an additional 23,022 fixed differences between SB and Sb in North America. The latter number is 4-fold higher than expected due to differences in sampling size alone and is in line with lower genetic diversity of both supergene variants in North America due to the invasion bottleneck (40).

Seven genes have consistent variant-specific expression patterns in all populations
Because most coding sequences are intact in both variants of the supergene, we asked whether variantspecific allelic expression biases may occur. We generated individual-specific RNA-seq data from whole bodies of SB/Sb workers and from abdomens, thoraces, and heads of SB/Sb virgin queens collected in South America. To add additional robustness to our analysis, we additionally incorporated existing RNA-seq gene expression data from pools of whole bodies of SB/Sb queens collected in the USA and Taiwan (41,42).
We additionally identified differences between expression of SB and Sb-linked alleles using a populationspecific approach for South American and also independently in North American and Taiwanese populations. We found 124 expressed genes containing fixed differences between variants in the native South American range, 343 and 374 such genes in the invasive populations of North America and Taiwan respectively. Most of the genes with sufficient expression data and fixed differences between variants in South America (122 out of 124, 98.4%) could also be examined in the North American and Taiwanese datasets ( Figure 1-Figure supplement 1).
We applied DESeq2 (v1.14.1 (43)) to each population to detect allelic expression differences. Within the South American data we generated, this approach detected 12 genes with significant allele-specific expression, 6 being more highly expressed in SB and 6 in Sb (  Table S3a). Of these, three had been detected by the linear mixed effects model using all populations ("carbohydrate sulfotransferase 11-like", "ejaculatory bulb-specific protein 3" and "uncharacterized LOC105193135"). We found no general bias in expression towards either variant (median of log2 expression ratios=-0.07; Wilcoxon sum rank test p=0.27).
We repeated this analysis using gene expression data from pools of SB/Sb queens from North American populations of S. invicta (41) (Figure 1-Figure supplement 3, Table S3b). Twenty nine of the 343 genes showed significant allele-specific expression. Of these, six had been detected by the linear mixed effects model using all populations. There was no directional bias, with 16 of the genes being more highly expressed in SB, and 13 in Sb (binomial test p=0.7; median of log2 expression ratios=6⨉10 -17 ; Wilcoxon sum rank test p=0.40).
Finally, we also performed this analysis using pools of SB/Sb queens from Taiwanese populations (42) (Figure 1-Figure supplement 4, Table S3c). Out of the 374 genes analyzed, 19 showed significant allelespecific expression. Of these, three had been detected in the joint linear mixed effects model. Additionally, 12 of the genes with significant allele-expression differences between variants found in Taiwanese populations were independently found using North American populations too. Again, we found no directional bias in allele-specific expression across the supergene (8 genes more highly expressed in Sb, and 11 in SB; binomial test p=0.64; median of log2 expression ratios=-0.02, Wilcoxon sum rank test p=0.11).
These two additional analyses are thus consistent with the findings from the overall model reported above, and also uncovered population-specific expression. Figure 1. Allele-specific expression for genes in the fire ant social supergene in heterozygous SB/Sb individuals. Differences in expression (y axis) between the variants of the social chromosome in whole bodies of South American workers (purple), heads (blue), thoraces (green) and abdomens (red) of South American queens, whole bodies of North American queens (light blue) and whole bodies of Taiwanese queens (yellow). We show relative expression levels for the 16 genes from the social chromosome supergene region which have significant allele-specific expression differences between variants in at least one population. Significance for each gene for each comparison is indicated by an asterisk at the bottom of each plot, and non-significant comparisons are semi-transparent (Benjamini-Hochberg (BH) adjusted p<0.05 from the joint linear mixed-effects for Taiwanese, South and North American populations; BH adjusted p<0.05 from DESeq2 Wald tests for analyses restricted to South America or Taiwan). Each box shows the distribution of logarithm 2 expression ratios between SB and Sb per caste/body part/population. Genes with values above the solid line (Log2 expression ratios=0) have higher SB expression; those with values below the line have higher Sb expression. A log2 expression ratio greater than 1 or smaller than -1 represents a two-fold gene expression difference in either direction. Genes are in chromosomal order.

Supergene allele expression differs based on population ancestry but not body parts
If allele-specific expression patterns of genes within the supergene are mostly driven by selection for social form-specific functions, we would expect allelic expression biases to differ between tissues. The South American RNA-seq data we generated provided the opportunity to test this idea because it includes multiple body parts and castes (head, thorax and abdomen in queens, and whole body in workers). We found no significant effect of body part or caste on allelic expression for any gene (DESeq2's Logarithmic Ratio Test and all pairwise Wald comparisons between interaction terms; all BH adjusted p>0.05). This finding suggests that allelic expression within the supergene is not fine-tuned for specific functions in different tissues. We found additional support for this finding by detecting allelic expression differences between body parts for genes in normally recombining parts of the genome, despite having less power to do so than in the supergene region (see Methods).

SB bias
Log2 expression ratios ****** ****** ****** **** * **** **** ****** ****** **** ****** **** **** **** ****** **** ****  The progressive degeneration of the Sb variant of the supergene through the accumulation of deleterious mutations (35) could affect gene expression. Under this scenario, we expect allelic expression patterns within the supergene to differ more between populations that are more distantly related. Our inclusion of three populations with different levels of relatedness provided the opportunity to test this hypothesis. For each pair of populations, we calculated correlations of the log2 ratios between SB-and Sb-allelic expression ( Figure 1-Figure supplement 5). The log2 expression ratios between North American and Taiwanese populations were strongly correlated (Spearman's r 2 =0.67), likely because the North American and Taiwanese populations are highly related (44). In contrast, the correlation between either invasive and the South American population was weak (Spearman's r 2 =0.21 and 0.18 for the South America-North America and South America-Taiwan correlations respectively). This result was further supported by an additional linear mixed-effects model incorporating all data. This model shows that ancestry of the populations explains more of the variance observed in allelic expression than the continent on which they are (interaction between North American-Taiwanese populations and gene effect in predicting log2 expression ratios: F=3.42, p<10 -15 ; interaction South-North American populations and gene effect in predicting log2 expression ratios: F=0.94, p=0.65).
These results support the idea that genomic architecture plays a major role in defining supergene allelic expression.

Overrepresentation of socially biased genes in the social supergene
Differential selection for single-queen and multiple-queen colonies leads to the prediction that genes with socially biased expression should be overrepresented in the supergene region. To test whether this occurs, we compared gene expression between egg-laying queens from single-queen and from multiple-queen colonies. We identified 293 such socially biased genes for which chromosomal locations are known (Supporting Information, Table S4). Such genes were indeed overrepresented in the supergene region ( Figure 2a, 33 out of 293, 12 expected by chance, χ 2 =29.7, p<10 -7 ). Additionally, the vast majority of socially biased genes (274 out of 293, i.e., 94%) were more highly expressed in multiple-queen colonies than in single-queen colonies (binomial test, p<10 -7 , Figure 2b). We independently confirmed this result by comparing expression profiles of queens from multiple-queen colonies from another RNA-seq dataset (41) to those of queens from single queen colonies (45). For this, the samples were normalized together using the default DESeq2 method. Because these samples belong to different datasets, there are too many confounding factors to estimate significance levels of differential expression for each gene. We thus used a different approach. We first performed a simple DESeq2 analysis to obtain the log2 expression ratios of this comparison. We then compared the log2 expression ratios using a linear regression to those obtained from the comparison using only the Morandin et al. (45) data to detect patterns shared by the two datasets. The strong trend we found (49% of the variance in log2 expression ratios explained by social form ± a standard deviation of 2.9%, p<10 -16 )) confirmed previous results.

Figure 2.
Distribution of socially biased genes in the genome of the red fire ant within (left bars) and outside (right bars) the supergene region a) for genes with significant expression differences between social forms and b) for socially biased genes that are more highly expressed in multiple-queen colonies than in single-queen colonies. The vast majority of genes with differential expression between social forms are more highly expressed in multiple-queen colonies. The proportion of genes that are more highly expressed in multiple-queen colonies is similar within and outside the supergene, which shows that expression bias towards multiple-queen colonies is not a unique feature of the supergene.

Genes with higher expression of Sb than SB alleles are socially biased towards higher expression in multiple-queen colonies
Similarly to the Y chromosomes in males, because Sb is present only in multiple-queen colonies, it should accumulate alleles beneficial for multiple-queen colonies. When a gene has sexually-biased gene expression, this is generally interpreted as a sign that this expression is beneficial to that sex (47). If we assume that socially biased expression similarly indicates a benefit to the particular social form, genes with higher expression of Sb than SB alleles should be more highly expressed in multiple-queen than in singlequeen colonies.
Information regarding gene expression differences between social forms exists only for North American populations. To match genes with differences in expression between social forms with genes showing allele-specific expression differences between supergene variants, we analyzed data from North American populations only. Among the genes in the supergene region, 294 had sufficient data for both comparisons ( Figure 3 and Figure 3-Figure supplement 1). Most of these genes (256; 87%) had no significant differential expression patterns ( Figure 3a) whereas eight genes (3%) had both allele-biased and socially-biased expression. Their expression was strongly directionally biased, with the majority of these genes having higher expression in Sb and in multiple-queen colonies (5 out 8 Sb biased genes were more highly expressed in multiple-queen colonies ( Figure 3d); compared to 1 out of 15 for SB biased genes, χ 2 =5.8, p=0.02). We additionally tested whether this was a general pattern for the supergene region. To do this, we extended the analysis to search for the same trends using all 294 genes in the supergene region. With this data we fitted a linear model estimating changes in allele-specific expression based on gene expression differences between social forms (Fig 4). The model showed that higher expression of the Sb allele results in higher expression in individuals from multiple-queen than single-queen colonies. This approach additionally identified a trend that went undetected when considering individual genes: higher expression of the SB alleles in heterozygous individuals generally corresponds to lower expression in multiple-queen than in single-queen colonies. We find that compared with the standard linear regression (straight black line in Figure 4), there is a slightly better fit for all values for the linear regression of allelic expression bias (PB) on the gene expression bias towards multiple-queen colonies (PMQ; whereby PB=(1 -PMQ) / PMQ; curved red line in Figure 4). For both relationships, the significant increase in PB for genes with relatively low expression in multiple-queen colonies (PMQ<0.5; p<10 -5 ) suggests that that expression of the Sb allelic variant is depleted at these loci. We additionally show that this pattern is very unlikely to be explained solely due to Sb degeneration, consistent with an adaptive role of some of the genes with Sb biased expression. A model explaining the observed expression patterns based on Sb degeneration alone (i.e., based on allelic bias alone: PMQ = 1 / (2PB + 1); purple line in Figure 4) poorly predicts the data and is significantly different from the best fit model (p<10 -5 ).

Genes with no social bias tend to have allele-specific expression biased towards the SB variant
If purifying selection is relaxed for the majority of genes in the Sb variant of the supergene, they should be accumulating deleterious mutations. In consequence, the expression of Sb alleles for such genes would decrease or be downregulated. If stoichiometric ratios require such genes to be expressed at specific levels, we would expect their "healthy" SB alleles to compensate for "faulty" Sb variants. In such cases of gene-specific dosage compensation we would thus expect allelic bias towards higher expression of SB alleles, but no gene expression differences between social forms.
Fifteen of the 294 genes in the previous section had allele-specific expression differences but no differences in gene expression between social forms (5%; Figure 3c). Most of these genes (12) had higher expression in SB, with only 3 being more highly expressed in Sb (binomial test, p=0.03). In line with this, the median SB-Sb expression ratio across all 15 genes was 5.9 (Wilcoxon signed-rank test p=0.0008 for differences against a ratio of 1). The bias in expression towards the SB alleles of genes with no expression differences between social forms is consistent with gene-specific dosage compensation taking place in the supergene region.

Discussion
In the fire ant, a supergene system with two variants (SB and Sb) controls whether colonies have one or multiple queens. We compared gene expression patterns between the SB and Sb variants of the social supergene within heterozygote SB/Sb individuals which exist only in multiple-queen colonies, and between queens from single-queen and multiple-queen colonies. Our results show patterns of expression consistent with evolutionary conflict and with degradation of Sb. Furthermore, our work highlights candidate genes potentially responsible for differences between social forms.

A balance between Sb degeneration, dosage compensation, and selection for alternate social forms
We found that in each population, 5% to 10% of the genes in the supergene region showed allele-specific expression differences between the SB and Sb variants. Seven of these genes show consistent SB-Sb allelic expression biases across all the populations. It is tempting to conclude that gene expression differences arose through selection, as a consequence of evolutionary conflict between the single-queen and multiple-queen phenotypes. Our other results, however, suggest that first level interpretations of expression patterns can be too simplistic, as they ignore the impacts of supergene degeneration on gene expression.
We found enrichment of genes with expression differences between single-queen and multiple-queen colonies in the supergene region ( Figure 2). Two alternative hypotheses can explain this enrichment. The first hypothesis is that evolutionary conflict drives differences in gene expression between social forms in a similar way to the generally accepted model of sex chromosome evolution (30). Assuming that high expression levels reflect a beneficial effect for the bearer (14), evolutionary conflict between social forms should lead to increased expression of genes favoring each of the social forms in each of the supergene variants (1,48). The alternative hypothesis is that the differences in expression between social forms are non-adaptive, instead reflecting genetic differentiation between SB and Sb at gene regulatory sites in the supergene region. Although some of this differentiation could be neutral, most is expected to be deleterious, as a result of the low efficacy of purifying selection in Sb caused by absence of recombination (i.e., Hill-Robertson interference) (34). Indeed, point mutations (49,50) and insertions of transposable elements (23) can cause changes in gene expression, and Sb is enriched both types of mutations (34,51).
We performed a joint analysis of gene expression differences between social forms and between supergene variants. For this analysis, we used only allele-specific information from North American colonies because this was the only population for which we had information for social form differences. We found that 23 genes in the supergene had fixed differences between variants and differential expression between the social forms. For fifteen of these genes, SB and Sb alleles had similar expression levels. Five of the eight genes with differences in expression between SB and Sb alleles followed a single pattern: higher expression in the multiple-queen social form, and higher expression of Sb than SB alleles in SB/Sb heterozygotes. Assuming that higher expression in one of the social forms indicates adaptive value, we conclude that the Sb alleles of these genes likely evolved as a response to selection for adaptation to the multiple-queen social phenotype.
We additionally observed a depletion of multiple-queen biased loci in the SB variant. Similarly to what we observed for Sb, this pattern could be interpreted as selection against the expression of single-queen biased loci in SB. However, the pattern of enrichment in Sb is stronger than the depletion in SB, and is stronger than what would be expected due to Sb degeneration alone. The different strengths of these patterns may reflect an important characteristic of the social chromosome system: that the Sb supergene variant only occurs in multiple-queen colonies, whereas the SB variant occurs both in single-queen colonies and in multiple-queen colonies. This characteristic of the system raises the expectation that selection is effective at accumulating alleles benefitting multiple-queen colonies in Sb, but less effective at accumulating alleles benefitting single-queen colonies in SB, which is consistent with the expression patterns we observed. Similarly, in some XY systems, the X chromosome lacks female-biased alleles (52).
Interestingly, the supergene also included 15 genes with allelic expression biases but no difference between social forms (Figure 3c, Fig S5). For the majority of genes with this pattern, the allele carried by SB had higher expression than its Sb counterpart. This result is consistent with a gene-specific dosage compensation mechanism to ensure stable overall expression despite decreased expression of the Sb variant. Two processes could explain why dosage compensation might have arisen for these genes. First, selection could favor lower expression of the Sb allele if random mutations lead to mis-expression, or to the production of non-functional proteins. Alternatively, mutations in regulatory elements in the Sb variant could have directly decreased expression levels of Sb alleles. In both cases, higher expression of the "healthy" SB variants would then have been selected for to offset the lower expression of Sb. Regardless of what is driving dosage compensation in this system, the overall pattern is similar to reports in other young supergene systems (20,21,53), where compensation occurs for individual genes rather than globally, as seen in, for instance, some mammalian X chromosomes (54). Furthermore, these two hypotheses underlying potential causes for dosage compensation could act simultaneously. Indeed, most of the genes with low Sb expression (92%, 11 out of 12) had more than one mutation in potentially regulatory upstream regions and slightly more than half (58%, 7 out of 12) were affected by nonsynonymous coding substitutions. This proportion of nonsynonymous mutations is similar to that observed across the supergene, where roughly half of the fixed exonic mutations between SB and Sb impact the protein sequence (47.7% nonsynonymous, 52.3% synonymous). This finding is consistent with ongoing gene degeneration, in line with previous results in this (34) and other young supergenes (9).
Gene degeneration could also contribute to the relatively small number of genes with SB and multiplequeen biased expression. One possibility is that gene degeneration leads to low expression of the Sb allele, which has two effects: a relatively low expression of the affected gene, combined with bias towards the expression of the SB allele (rather than Sb). A second interpretation is that the low expression of a gene in multiple-queen colonies is indicative of relaxed selection on that gene. Relaxed selection would in turn facilitate degeneration of the Sb allele, and its consequent reduction in expression. These explanations are not mutually exclusive.
Remarkably, the degree of differential allelic expression of SB and Sb alleles was consistent across body parts for all the loci where allelic bias was detected ( Figure 1). This pattern contrasts with human data, in which tissue-specific selection resulted in allele-specific expression variation across tissues (43). Similarly, in the fire ant we might have expected that selection would aim to fine-tune gene expression independently in each tissue to accommodate for tissue-specific functions in each social form. The young age of the supergene (31) in combination with low selection efficacy may have precluded such fine-tuning. As a result, strong selection for a particular level of allele-specific expression in one body part (e.g., in antennae), could result in a consistent allele-specific expression pattern across tissues, even if this has mildly deleterious effects (e.g., in the gut).
The absence of allelic differences between body parts and castes contrasted with our finding strong effects of population of origin on allelic expression: gene expression differences between SB and Sb were strongly correlated between North American and Taiwanese populations, despite the data from the two populations being from different studies. The North American population derived from the South American population over the last century, and the Taiwanese population derived from the North American population within the last 30 years (44). Both populations thus have lower genetic diversity overall, and in the supergene region in particular (34,44). In line with this finding, the correlation in gene expression patterns within the supergene was weak between South American and Taiwanese or North American populations. This strong effect of ancestry suggests an important role of genomic architecture in defining expression patterns within the supergene.
Overall, our results show that gene degeneration plays a major role in shaping the expression patterns within the supergene. Indeed, the fixed differences in allele-specific expression across populations that we report may be consequences of recombination suppression rather than adaptations arising from evolutionary conflict. To distinguish between these two processes, it is relevant to put allele-specific biased expression in the context of gene expression differences between social forms. By doing so, we highlight several candidate genes that could play key roles in defining social forms.

Candidate genes for differences between social forms
The approaches underpinning our analyses are unable to detect allelic differences in genes absent from the reference genome such as OBP-Z5, a putative Odorant Binding Protein exclusive to Sb (38). However, our analysis does single out candidate genes potentially contributing to the social polymorphism of the fire ant (lists of all genetic differences between SB and Sb are in Table S1; lists of genes that showed expression differences in any comparison are in Tables S2, S3 and S4). Three genes in particular stood out because they were differentially expressed between social forms and also had variant-specific allele expression in all populations. For the first gene, "Pheromone-binding protein Gp-9" (LOC105194481), also known as OBP-3, the Sb allele was more highly expressed. For decades, this gene has been a candidate effector for social form differences (34,55), yet its linkage to hundreds of other genes in the supergene led to doubts that its association to social form is any more than coincidental. We found five fixed differences between SB and Sb for this gene, four of which could have an impact on protein efficiency (consistent with findings from Krieger and Ross (56)). For the second gene, "Ejaculatory bulb-specific protein 3" (LOC105199531), which Page of 9 18   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 also contains an insect odorant binding protein domain (InterPro IPR005055), the SB allele was more highly expressed. Orthologs of this gene are associated with mating (57) in Drosophila melanogaster, sexual behavior in a moth (58), subcaste differences in bumblebees (59), venom production in social hornets (60) and caste differences in the termite Reticulitermes flavipes (61). Finally, LOC105199327 is similar in sequence to Pinta retinol-binding proteins, which are linked to pigment transport and vision in D. melanogaster and the butterfly Papilio xuthus (62). In sum, all these candidate genes have putative functions related to environmental perception, in line with the complex social phenotype requiring subtle changes in environmental perception or signaling.

Conclusions
By analyzing allele-specific expression patterns between the SB and Sb variants of the fire ant supergene we found consistent differences across three populations. Such strong patterns can naively be assumed to be indicative of adaptive processes emerging from evolutionary conflict between social forms. We show, however, that the evolutionary forces shaping expression patterns in the supergene are complex and must be interpreted with care. In particular, genes with higher expression of the SB than the Sb allele tend to either lack expression differences between social forms or have lower expression in multiple-queen than in single-queen colonies. Both patterns are consistent with the idea that recombination suppression leads to degeneration in Sb and thus lower expression of Sb alleles for some genes. In some cases, a dosage compensation mechanism through higher expression of the healthy SB allele leads to similar expression levels in both social forms. In cases where no dosage compensation occurs, overall expression is lower in multiple-queen colonies than in single-queen colonies.
Conversely, we also show that genes with higher expression of the Sb than the SB allele also tend to have expression biased towards multiple-queen colonies. This pattern is consistent with evolutionary conflict favoring the accumulation of beneficial alleles for multiple-queen individuals in Sb, the supergene variant found only in this social form.
Our study shows that multiple complex evolutionary forces can simultaneously act on a young supergene system. In particular, it highlights that allele-specific expression patterns alone are insufficient for inferring whether they are adaptive, deleterious or phenotypically neutral. Instead, putting such expression differences into broader contexts is needed to draw reasonable conclusions. By applying this idea to our data we highlight genes which have molecular roles that could affect perception or signalling of the social environment.

RNA sequencing of fire ants
We used three previously published and generated one novel fire ant RNA-seq datasets. Wurm et al (41) obtained whole-body RNA-seq data from six pools of 4 egg-laying SB/Sb queens, each from a multiplequeen colony from Georgia, USA. Morandin et al (45) obtained whole-body RNA-seq data from 6 samples, each being a pool of 3 queens from a single-queen or a multiple-queen colony (3 replicates per social form) from Texas, USA. Due to low mapping quality, we eliminated one queen sample from one of the multiplequeen colonies. All the queens were mature and egg-laying (C. Morandin, personal communication), thus queens from multiple-queen colonies carried the SB/Sb genotype (Keller and Ross 1998). Additionally, we manually checked the resulting RNA reads for heterozygous positions at key supergene markers using the IGV gene browser v2.4.19 (63). Fontana et al (42) generated RNA-seq data from 4 samples of SB/Sb queens from multiple-queen colonies from Taiwan. Each sample is a pool of whole bodies from two virgin queens (more details in Supporting Information, Table S5a).
All three published datasets are from pools of whole bodies from the invasive ranges of S. invicta. The Taiwanese invasive population of red fire ants is derived from that of North America (44). Both invasive populations will therefore be considered as a single distinct from that of South America, for the purposes of this study. Because comparisons of whole bodies can be confounded by allometric differences (64) and genetic diversity is reduced among Sb haplotypes in the invasive populations of North America (34), we generated a new gene expression dataset. From the native South American range of this species, we collected 6 multiple-queen colonies (collection and exportation permit numbers 007/15, 282/2016, 433/02101-0014449-4 and 25253/16). We confirmed the social form of each colony using the Krieger and Ross assay (56) on a pool of DNA from 10 randomly chosen workers. Colonies were kept for 6 weeks under semi-controlled conditions (natural light, room temperature, cricket, mealworm and honey water diet) before sampling. From each colony, we snap-froze (between 12:00 and 15:00 local time) one worker and one unmated queen for gene expression analysis. To partly control for allometric differences between genotypes, we separated each queen into head, thorax and abdomen. This was done in petri dishes over dry ice using bleached tweezers. In total, we had 24 samples for RNA extraction: six whole bodies of workers and six replicates of three body parts from queens (more details in Supporting Information, Table  S5b).
We extracted RNA and DNA from each sample using a dual DNA/RNA Tri Reagent based protocol (Supplementary file 1). We applied the Krieger and Ross assay (56)  For all datasets, we assessed read quality using fastQC (v0.11.5; http:// www.bioinformatics.babraham.ac.uk/projects/fastqc/). Raw reads for all samples were of sufficient quality to be used in subsequent analysis. We removed low quality bases using fqtrim with default parameters (v0.9.5; http://ccb.jhu.edu/software/fqtrim/), and Illumina adapters using Cutadapt v1.13 (65). We then generated a STAR v2.5.3a (66) index of the S. invicta reference genome (version gnG; RefSeq GCF_000188075.1 (41) while providing geneset v000188075.1 in GFF format through the "sjdbGTFtagExonParentTranscript=Parent" option. As recommended by the developers of STAR, we aligned each sample to the reference twice, using the "out.tab" file for the second run, and set "sjdbOverhang" to the maximum trimmed read length minus one, i.e., 74 for the Wurm et al. data (41) and Morandin et al. data (45), 125 for the Fontana et al. data (42) and 149 for the South American data we generated here. Alignments were run using GNU Parallel v20150922 (67). All steps and downstream analyses were performed on the Queen Mary University of London's Apocrita High Performance Computing Cluster (68).
We further assessed aligned reads (i.e., BAM files) using MultiQC v1.5 (69) and the BodyGene_coverage.py script of the RSeQC toolkit v2.6.4 (70). We removed one sample from multiple-queen colonies in the Morandin et al. data from subsequent analyses due to poor alignment quality. None of the other BAM files showed markers of technical artefacts that could bias our results.

Identifying SNPs with fixed differences between SB and Sb males
To detect allele-specific differences between SB and Sb we first identified SNPs with fixed differences between the SB and Sb variants. Because the patterns of genetic diversity differ between the invasive and South American S. invicta populations (71, 72), we estimated allele specific expression differences in the social chromosome independently for each population. For this we used haploid male ants because they can provide unambiguous genotypes. For the invasive populations, we identified fixed allelic differences between a group of 7 SB males and a group of 7 Sb males from North America (NCBI SRP017317) (31).
For the South American population, we sequenced the genomes of 13 SB males and 13 Sb males sampled from across Argentina. For each individual, we extracted 1 µg of genomic DNA using a phenol-chloroform protocol. The extracted material was sheared to 350 bp fragments using a Covaris (M220). We constructed individually barcoded libraries using the Illumina TruSeq PCR-free kit. The libraries were quantified through qPCR (NEB library quant kit). An equimolar pool of the 46 libraries was sequenced on a HiSeq4000 at 150 bp paired reads. This produced an average of 17,790,416 pairs of reads per sample, with a maximum of 38,823,285 and a minimum of 7,910,042. (Table S6, genomic reads of all samples deposited on NCBI SRA (PRJNA542606)). For each dataset, we identified fixed allelic differences between the group of SB males and the group of Sb males. We first aligned the reads of each sample to the S. invicta reference genome ((41); gnG assembly; RefSeq GCF_000188075.1) using Bowtie2 v2.3.4 (73). We then used Freebayes v1.1.0 (74) to call variants across all individuals (parameters: ploidy = 1, min-alternate-count = 1, min-alternatefraction = 0.2). We used BCFtools (75) and VariantAnnotation (76) to only retain variant sites with single nucleotide polymorphisms (SNPs), with quality value Q greater than or equal to 25, and where all individuals had a minimum coverage of 1. To avoid considering SNPs erroneously called from repetitive regions that are collapsed in the reference genome, we discarded any SNP with mean coverage greater than 16 for the North American samples or 12 for the South American samples or where any individuals had less than 60% reads supporting the called allele. This last filtering step also acts to remove SNPs called from reads with sequencing errors. We then extracted only the SNPs located within the supergene (based on the genomic locations from Pracana et al. (34)) and with fixed differences between SB and Sb. This step was performed independently for each population. The two resulting variant call files were inspected using VCFtools v0.1.15 (77) and we manually ensured that all variants had the SB allele as reference and Sb allele as alternative. To test the effect of sample size differences between populations we downsampled the South American dataset to 7 pairs of SB and Sb males, matching the sample size in the North American dataset.
We extracted SNPs shared between South and North American populations using BCFtools isec v1.9 (75). We then used SNPeff (78) to characterize the effects of individual SNPs.

Estimating read counts from alternate social chromosome variants in heterozygous individuals
Because the reference genome for S. invicta is based on an SB individual, read mapping could be biased towards the SB variant in heterozygous individuals, resulting in apparent overall expression of the reference variant (79). To overcome this potential artifact, we performed two alignments using STAR (with the same parameters as described in the "Generation of RNA-seq data" section): one with the regular SB reference genome and another with a modified reference genome in which we had replaced the fixed positions between variants in the supergene by those found in Sb. We then used the last steps from the WASP pipeline (80) to generate reference-bias free alignment files. We obtained allele-specific counts using GATK's "ASEReadCounter" v 3.6-0-g89b7209 (81) using the fixed SNP differences between Sb and SB. To generate allele-specific read counts in a manner that eliminates potential reference bias, we first generated modified red fire ant reference genomes in which all the positions in the supergene had been changed to match the Sb alleles. We called BCFtools consensus v1.9 (75) once using North American Sb males and once using South American Sb males. We then aligned the RNAseq reads from each sample to the regular reference genome (version gnG; RefSeq GCF_000188075.1; (2)) and also, independently, to the modified references. For the reads from the Taiwanese population, we used the Sb reference using North American SNPs. For the alignment we used STAR with the same parameters as described above. We merged the two resulting BAM files from each sample using SAMtools v1.9 (75). The vast majority of reads would have been mapped to both references, and thus appear twice in the merged aligned file. To remove these duplicates in an unbiased manner, we used the "rmdup" script from the WASP pipeline (80). The resulting BAM files can be considered reference bias free alignments.
We added a reading group ID to each reference-bias free BAM file using the "AddOrReplaceReadGroups" tool from Picard (v 2.7.0-SNAPSHOT; http://broadinstitute.github.io/picard/). We then ran all BAM files through GATK's "ASEReadCounter" v 3.6-0-g89b7209 (17) with default options to obtain read counts for each allele. This step was carried out once on all 6 North American samples, once on all 4 SB/Sb Taiwanese samples and once on all 24 South American samples, using the fixed differences between variants generated previously for each population. In both cases we used the original genome of S. invicta (version gnG; RefSeq GCF_000188075.1) as a reference.
We then imported the resulting allele-specific SNP read counts per sample generated by GATK into R v3.4.4 (46). We used the R packages "GenomicRanges" v1. 26.4 (82) and "GenomicFeatures" v1. 26.3 (82) along with the NCBI protein-coding gene annotation for S. invicta (gnG assembly, release 100) to identify which SNPs are in which genes. This is because allele-specific expression of long genes with several fixed SNPs between variants could be overestimated if the reads per SNP are counted individually. To avoid this, we estimated the total expression level for a particular allele (i.e., the SB or Sb variant for any given gene) as the median of all SNP-specific read counts per gene and per variant. For instance, consider a gene with 3 fixed SNPs between SB and Sb for which the SB variants have support from 12, 15 and 18 reads, and the Sb variants from 5, 8 and 6 reads. In this particular case, we would report that the SB variant for this particular gene has an expression level of 15 reads and the Sb variant, 6 reads. If instead of this approach, we randomly select one of the possible SNPs for every gene, we find qualitatively similar results to those reported.Additionally, to test whether we would be able to detect allele-specific expression changes across body parts and castes in the South American data, we calculated allele-specific expression in the whole genome as a positive control. We used the VCF file containing all SNPs in the 26 males collected from South America. We retained only SNPs with expression data in all samples and a median of at least 1-x RNA coverage in each allele across all samples. After filtering, 1096 SNPs remained on which we were able to test for allele-specific expression. We then performed an allele specific expression analysis throughout the whole genome using body part and caste information from South American populations. Unlike the analysis of genes in the supergene region, in the whole genome analysis we cannot ensure that every individual sequenced is heterozygous for all SNPs. Indeed, the average frequency in the population for all the alleles analyzed was 0.41 with a standard deviation of ± 0.2. This implies that both alleles were not necessarily present in all samples. We therefore had far less power to detect allele-specific expression across body parts using data from the whole genome than using SNPs from the supergene region only. Despite this lack of power, we were able to detect significant (Wald test BH adjusted p<0.05) allele-specific expression changes across body parts of queens and workers in 15 SNPs. These significant SNPs were distributed across 9 genomic scaffolds. The significant differences in allele-specific expression were between a queen body part and whole bodies of workers.

Identifying expression differences between the SB and Sb variants
We imported the estimated read counts generated by Kallisto into R using Tximport v1.2.0 (83) and DESeq2 v1.14.1 (43). For every sample, read counts for the SB alleles and for the Sb alleles come from a single sequencing library, thus standard normalization methods (84) are not applicable. As recommended by the developers of DESeq2 (85), we thus deactivated normalization by setting SizeFactors=1. For the North American and Taiwanese datasets (41), we only considered genes expressed in all samples for downstream analyses, whereas for the South American populations RNA dataset, we only analyzed genes expressed in all replicates of at least one body part.
To have the strongest possible analysis of expression between the SB and Sb variants of the supergene region, we performed a joint analysis of RNAseq data from Taiwanese, South and North American populations. The South American dataset includes body part information, which is absent in the North American dataset. If both datasets were analyzed together with any the standard tools for gene expression analysis, the effect in expression levels arising from different body parts would be confounded with that arising from differences between the two datasets. To overcome this issue, we applied a linear mixed effects model on the log2 of the expression ratios between SB and Sb across populations and body parts, using the R packages lme4 v1. 1-18.1 (86) and lmerTest v3.1-0 (87). The aim of this model was to identify the expression differences between SB and Sb across the different datasets, accounting for the proportion of variance explained by body part and population. We fitted the log2 expression ratios using a 0 intercept with gene, population and their interaction as fixed effects, and the interaction between gene and body part as random effects (formula: log2 expression ratio ~ 0 + gene * population + (1|body part:gene)).
For both models, the log2 expression ratios were weighed by a function of the total read counts per gene to reduce the impacts of genes with low expression which have extremely high variance. Here we only report the results of the fixed effects per gene after adjustment of p values for multiple testing following the Benjamini-Hochberg approach (88). For this joint analysis we only used genes that had fixed differences between SB and Sb in all three populations.
We additionally analyzed the allele-specific expression patterns between the SB and Sb variants of the supergene in each population independently using DESeq2 (43) as suggested by Castel and collaborators (79). The model formula used for the South American RNA-seq data used "body part" and "colony of origin" as blocking factors, and allele-specific expression, i.e. "variant effect", as variable of interest. This analysis allowed us to detect differences in expression between variants specific to body part. Preliminary analyses showed that the interaction between "variant effect" and "body part" had no significant effect in any of the genes, and consequently, only the main "variant effect" was considered as the factor of interest for this analysis. The model formula for both the Wurm et al. (41) and the Fontana et al. (42) RNA-seq datasets included only whole bodies of queens. We thus used "sample" as a blocking factor and "variant effect" as variable of interest.
In all analyses, we report gene differences between variants as log2 expression ratios between the SB and the Sb counts. That is, genes with expression biased towards SB will produce positive log2 expression ratios whereas those biased towards Sb will produce a negative value. To check whether there was an overall bias towards either variant, we tested the significance of the deviation from 0 for the median log2 expression ratios between SB and Sb via a Wilcoxon sum rank test. Significant differences between alleles for genes common in both populations are reported in Figure 1 Table S3c for Taiwanese populations.

Expression differences between social forms
We determined the expression levels for all samples from the North American populations (45) by using the count mode in Kallisto v0.44.0 (89) using S. invicta coding sequences (version gnG; RefSeq GCF_000188075.1, release 100) as a reference. We imported the estimated counts into DESeq2 v1.14.1 (43) using Tximport v1.2.0 (83). We compared the DESeq2 normalized expression levels between social forms, determining significance of differential expression using the default Wald test for pairwise comparisons between genes. For each caste we estimated the proportion of significantly differentially to non-differentially expressed genes within and outside the supergene region based on supergene region coordinates from Pracana et al. (34). We then used the R packages GenomicRanges and GenomicFeatures (82) along with the annotations of S. invicta coding sequences to locate each gene with expression information in the genome. Our analyses are restricted to the 10,481 known S. invicta genes that can be reliably placed within or outside the supergene region; other genes are on scaffolds which lack chromosomal locations.