Fire ant social chromosomes: Differences in number, sequence and expression of odorant binding proteins

Abstract Variation in social behavior is common yet our knowledge of the mechanisms underpinning its evolution is limited. The fire ant Solenopsis invicta provides a textbook example of a Mendelian element controlling social organization: alternate alleles of a genetic element first identified as encoding an odorant binding protein (OBP) named Gp‐9 determine whether a colony accepts one or multiple queens. The potential roles of such a protein in perceiving olfactory cues and evidence of positive selection on its amino acid sequence made it an appealing candidate gene. However, we recently showed that recombination is suppressed between Gp‐9 and hundreds of other genes as part of a >19 Mb supergene‐like region carried by a pair of social chromosomes. This finding raises the need to reassess the potential role of Gp‐9. We identify 23 OBPs in the fire ant genome assembly, including nine located in the region of suppressed recombination with Gp‐9. For six of these, the alleles carried by the two variants of the supergene‐like region differ in protein‐coding sequence and thus likely in function, with Gp‐9 showing the strongest evidence of positive selection. We identify an additional OBP specific to the Sb variant of the region. Finally, we find that 14 OBPs are differentially expressed between single‐ and multiple‐queen colonies. These results are consistent with multiple OBPs playing a role in determining social structure.


Supplementary Methods
OBP discovery and manual gene model curation We used MAKER2 (v2.31; Cantarel et al. 2008) to generate consensus gene models for the S. invicta genome assembly ) from TopHat2 (v2.0.11; Kim et al. 2013) alignments of RNASeq reads (SRA accessions SRX757226-SRX757228) to the reference genome, an assembly of fire ant Expressed Sequence Tag (EST) libraries, protein sequences from SwissProt (downloaded June, 2014), A. mellifera (amel_OGSv3.2_pep.fa) and N. vitripennis (Nvit_OGSv1.2_pep.fa) genome projects, and de novo predictions from SNAP (Korf 2004) and Augustus (Stanke et al. 2006) using HMM models that had been generated during the fire ant genome project  . To identify regions of the genome putatively containing OBPs, we performed blastn and tblastn (Camacho et al. 2009;Priyam et al. 2015) searches of the fire ant genome on antgenomes.org (Wurm et al. 2009) using as queries previously published fire ant OBP sequences (Gotzek et al. 2011) and Uniprot sequences that are part of the Pfam family 'PBP_GOBP' (Finn et al. 2014;UniProt Consortium 2015) . We integrated all aforementioned data using the genomic annotation editor Afra (Priyam unpublished) , the genome browser JBrowse (Skinner et al. 2009) , the tool GeneValidator (which assesses the quality of annotations by comparing them to public databases; Drăgan et al. 2016) and custom scripts. Manual curation followed a the standard approach based on Web Apollo (Lee et al. 2013) , including the inspection and adjustment of exon boundaries to ensure that the exon-intron structure of gene models was consistent with mappings of RNA sequence reads, and that the gene models had canonical splice sites, translation start and stop sites, and appropriate open reading frames. We also identified alternative spliced transcripts by visualising the alignments of the RNA sequence reads. After producing high quality gene models through this method, we used these sequences for further blastn and tblastn searches against the reference genome to identify further putative OBPs. These were curated as above; this process was repeated iteratively until no new putative OBP loci were discovered.
Our pipeline identified seventeen out of the eighteen OBP genes that had been previously reported, although eight had differences in sequence and/or length relative to the published sequences (Table S1). We found no genomic region with more than 80% identity to the remainder gene, SiOBP18 (Table S1), which had been identified from a single Sanger-sequenced EST (Wang et al. 2007;Gotzek et al. 2011), suggesting that this gene is either missing from the reference genome assembly, or that the sequence of the original EST was an artefact. We found evidence of alternative splicing for SiOBP17 (four splice forms) and for the newly discovered SiOBPZ7 (two splice forms). We found no support for the suggestion that SiOBP12 and SiOBP13 share an exon (Zhang et al. 2016) . All the genomic and transcriptomic sequences analysed in the OBP discovery pipeline included an insertion relative to the reference assembly affecting SiOBP14 (insertion of a T in position NW_011802221.1:1,287,729), suggesting an error in the assembly. The sequence reported for this gene includes this insertion.
We used a genetic map (Pracana et al. 2017) to assign OBPs to linkage groups. We were able to position 20 of the OBPs in linkage groups, including all novel OBPs ( Fig. 1). Four OBPs were in unmapped scaffolds. Of these, SiOBP9 is in a scaffold that we previously classed as putatively in the supergene (in Pracana et al. 2017) given its high SB-Sb differentiation. SiOBP2 was in a scaffold without any divergence, thus classified as outside the supergene. It was not possible to confidently determine the positions of the remaining two OBPs ( SiOBP5 and SiOBP7 ), as each had exons in multiple small unmapped scaffolds.
Phylogenetic analysis S. invicta OBPs are a highly divergent gene family (Gotzek et al. 2011). We aligned the coding sequences of the 24 S. invicta OBPs using MAFFT-linsi (v6.903b; Katoh & Toh 2008). We removed ambiguous sections from this alignment using trimAL (v1.4.1; Capella-Gutiérrez et al. 2009) with the -gappyout option and built a "guide" tree using RaxML (v8.2.9; Stamatakis 2006) with the GTRGAMMAI model. We then used PRANK (v120626; Löytynoja & Goldman 2005) to generate a codon-level alignment of the original sequences, guided by the tree obtained above. Using the same parameters as above, we removed ambiguous sections from this alignment using trimAl and built a final tree using RaXML (10,000 bootstraps).

Read filtering of S. invicta whole-genome sequences
We used whole-genome sequences from one SB and one Sb male from each of seven colonies that had been sequenced at low coverage (Illumina 2*100bp paired-end genome shotgun sequences; ~6x-8x coverage) in 2012 (NCBI SRP017317) ) . Each of these samples is a haploid male (ants have a haplo-diploid sex determination system), and the sequencing coverage is sufficiently homogeneous (Pracana et al. 2017) for the analysis reported here, including high confidence genotype calling. We used seqtk (v1.0-r31; https://github.com/lh3/seqtk) to trim 2bp from the start and 5bp from the ends of the reads. We removed any read where more than 25% of the bases had a quality score smaller than 25 using fastq_quality_filter in the fastx toolkit (0.0.14; http://hannonlab.cshl.edu/fastx_toolkit). We used GNU parallel to parallelise this pipeline (Tange 2011) .

Detection of copy number and structural variation in OBPs
We used bowtie2 (v2.1.0; Langmead & Salzberg 2012) to align the cleaned reads to the reference genome assembly. We visually inspected the alignments of each of our curated gene predictions, searching for regions with no coverage to identify deletions and high coverage to identify duplications.
The genomic region that includes three exons of SiOBP15 (scaffold NW_011801067.1:293,460-296,015) had no reads in any Sb individual, consistent with a deletion of this region (it is impossible to determine the exact size of the deletion as the region is directly upstream of a non-assembled portion of the scaffold). We observed no other such pattern of deletion.
SiOBP12 (which is within the supergene region) had approximately two times higher coverage in Sb individuals relative to SB individuals. Approximately half of the reads from Sb individuals had a small number of consistent sequence differences to the other reads. This suggested that a recent duplication of the gene occurred. To obtain consensus sequences for Sb individuals for both duplicates, we extracted all pairs of reads from Sb individuals for which at least one pair mapped to the contig containing the transcribed region of SiOBP12 and performed de novo assembly using MIRA (v4.0.2; Chevreux et al. 1999) . This resulted in assemblies on separate contigs of two genes: SiOBP12 and the Sb-specific duplicate we named SiOBPZ5 .
Visual inspection of SiOBPZ6 (outside the supergene region) revealed that this gene had a much higher number of mapped reads than other genes. To estimate the number of copies of SiOBPZ6 , we measured the median coverage per base pair of the seven SB individuals for this gene and for 1000 additional randomly sampled genes using bedtools coverage (with argument -d; v2.25.0; Quinlan & Hall 2010) . For each individual, we calculated the ratio between the coverage of SiOBPZ6 and the mean coverage of the 1000 random genes. For a single-copy gene, we expect these ratios to be one; we used a one-sample t-test to determine if the distribution of these ratios had a mean different from one. We did not produce individual sequences for each SiOBPZ6 copy because there was an insufficient number of variable sites to differentiate the copies. The sample used for genome assembly (NCBI SAMN00014755; Wurm et al. 2011) was not included in this test because it was sequenced using an earlier (noisier) Illumina technology.
Orthology in other species Using a reciprocal blast approach, we searched for the closest orthologous sequence of each OBP gene. First, we ran a tblastn search of all S. invicta OBPs against all non-S. invicta arthropod sequences available on NCBI nr on 2017-03-21, accepting hits where e-value < 10 -3 . We then ran a blastx search of these hits against the S. invicta gene predictions (including our newly curated OBP set). We report the hits with the lowest e-value (Table S7). We repeated this analysis by searching non-ant arthropods (not Formicidae).

Variant Calling in S. invicta OBPs
We added the contig with the Sb -specific SiOBPZ5 to the reference assembly. Using bowtie2 (v2.1.0; Langmead & Salzberg 2012) , we aligned the cleaned reads of the seven Sb individuals (see above) to the revised assembly and the seven SB individuals to the original reference assembly. We called single nucleotide polymorphisms (SNPs) in the protein coding regions of the supergene OBPs using samtools mpileup (Li et al. 2009) and bcftools call (with arguments --ploidy 1 and -m; v1.3.1; https://samtools.github.io/bcftools/bcftools.html ). We manually inspected the read alignments at each SNP position using the genome viewer IGV (Thorvaldsdóttir et al. 2013) .
Sequencing and variant calling of the OBPs of an outgroup species We produced whole-genome sequencing reads of the outgroup species Solenopsis geminata . DNA was extracted from a pool of ten workers (sampled in Thailand by Dr Adam Devenish, University College London, United Kingdom) using the Phenol-Chloroform method in Hunt and Page (1994) and sequenced using Illumina HiSeq 4000 (x11 coverage). We filtered the reads using skewer (v0.  Table S2) . These included Illumina and Roche 454 sequences. Read quality was assessed using fastQC (v0.11.5; http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Low quality bases were removed using the default options in fqtrim (v0.9.5; http://ccb.jhu.edu/software/fqtrim/ ).
We determined the expression levels of S. invicta transcripts using count mode in Kallisto (v0.43.0; Bray et al. 2016) . As a reference, we modified the S. invicta protein-coding gene annotation release 100 (NCBI) by removing all automatically annotated OBPs and instead adding the OBP sequences we manually curated above. We masked regions of SiOBP12 and SiOBPZ5 that were identical between these recent duplicates to prevent reads from one gene to be misassigned to the other. SiOBP15 lacks three exons in its Sb variant, so to prevent misalignment, we treated each variant of SiOBP15 as a different transcript. The total read count for SiOBP15 is the sum of its two variants. To control for the potential effects of sequence differences between SB and Sb, we repeated the analysis twice: first using the SB alleles of the OBPs, then using the Sb alleles. We only show the analysis done using the Sb alleles of the OBPs because both analyses produced qualitatively identical results.
For paired-end reads, we used the default counting options of Kallisto. For single-end reads, we provided Kallisto the average fragment length of each sample (as indicated on NCBI SRA) and we set the estimated standard deviation to 20 bp. To be able to analyse at least >50% of low-coverage Roche 454 reads with Kallisto, we set the average and the standard deviation of fragment length to 1.
We used Tximport (v1.2.0; Soneson et al. 2015) to import the estimated counts produced by Kallisto into the R (R Core Team 2016) implementation of DESeq2 (v1.14.1; Love et al. 2014) . Each sample was independently normalised using the DESeq2 method. Additionally, we performed genome-wide analysis of differential expression on data from (Morandin et al. 2016) using a standard DESeq2 approach to identify expression differences between social forms in queens and in workers. Queens and workers were analysed separately because they were sampled using different collection methods, which resulted in different variance patterns in each dataset (Morandin et al. 2016). We included the Sb-specific SiOBPZ5 in the analysis as a control. As expected, this gene is significantly differentially expressed between singleand multiple-queen colonies in both workers and queens. We performed a standard Chi 2 test to determine whether the supergene region is enriched in differentially expressed loci relative to the rest of the genome.

Differential expression of gene co-expression modules across social forms
We created gene co-expression modules from two microarray sets comparing single-queen with multiple-queen colonies, one with queen samples (GSE42062; Nipitwattanaphon et al. 2013) and the other with worker samples (E-GEOD-11694; Wang et al. 2008) . We did not use the RNAseq data because it does not include a sufficient number of samples of each social form to create gene co-expression modules. Both microarray sets use the same microarray platform (Platform GPL6930), which includes 25,344 probes (Wang et al. 2007) . To determine the number of genes that these probes represented, we aligned the sequences of all probes against the gnG assembly for S. invicta using the 'est2genome' mode with a minimum 95% identity in Exonerate (v2.2.0; Slater & Birney 2005) . The positions of the probes in the genome were then intersected against the annotation release 100 for S. invicta used in the rest of analyses with the R package 'GenomicRanges' (Lawrence et al. 2013) . The probe sequences intersected with 3,673 unique genes.
We downloaded the normalised expression values of each dataset from NCBI GEO. For the queen set, we removed the 16 samples with reproductive age class because SB/SB reproductive samples had very low variance in gene expression relative to individuals of other age classes. The remaining set included 31 SB/SB samples and 31 SB/Sb samples (all virgin queens originating from multiple-queen colonies). The worker set included 20 samples from single-queen colonies and 40 from multiple-queen colonies (20 SB/SB and 20 SB/Sb ; we removed two Sb/Sb samples). For each set, we removed any probe that had "null" expression in more than five individuals. For the remaining probes, individuals with "null" expression were imputed to the median expression of the probe. After filtering, there were 18,291 probes in common between the two datasets, representing 3,046 genes. We used the ComBat function in the sva R library (v3.18.0; Leek et al. 2012) to adjust both sets for the year in which the microarrays were produced. We used Weighted Gene Co-expression Network Analysis (v1.49; Langfelder & Horvath 2008) to create signed modules for each set. We used a soft-thresholding power of 5 for both sets. Modules were detected using the Dynamic Tree Cut method and merged using an eigengene dissimilarity threshold of 0.3. We used t-tests to determine whether any module eigengene is correlated with genotype or social form. In queens, we compared SB/SB to SB/Sb samples because all originate in multiple-queen colonies. In workers, we separated the effect of genotype from the effect of social form following the approach in Wang et al. (Wang et al. 2008) : we compared genotypes ( SB/SB versus SB/Sb ) using samples from multiple-queen colonies, and we compared across social forms (single-queen versus multiple-queen) using SB/SB samples only, and corrected with the p-values Bonferroni correction. In the worker dataset, SB/Sb samples originate from both single-and multiple-queen colonies, so we also tested whether any module eigengene is correlated with social form in this dataset.
Gene Ontology (GO) term annotation of the Solenopsis invicta genome We used the modified S. invicta annotations created for the RNAseq alignments (above) as a query for blastp against the nr database of NCBI. We limited the hits to the 20 best matches, with a minimum e-value of 10 -5 . The results were used with Blast2GO (v4.1.9; Conesa et al. 2005) to obtain the GO terms for each protein coding gene in S.invicta . We tested whether any GO terms were overrepresented in any co-expression modules that were significantly correlated with social form or genotype using TopGO (v2. 26 Evidence for selection based on nucleotide diversity Genomic regions that underwent recent selective sweeps are characterised by low nucleotide diversity (π) (Smith & Haigh 1974; Nei 1987; Nachman 2001). We used measurements of π along a sliding window of the genome, originally produced by Pracana et al. (2017), to identify selection pressure acting on S. invicta OBPs. These measurements were produced from SNPs identified de novo from the 7 SB samples mentioned above and an additional SB sample (NCBI SAMN00014755, ~33x coverage) using Cortex (v1.0.5.20; Iqbal et al. 2012). Measurements of π were taken from non-overlapping 10kb windows. Sb samples were excluded to avoid measuring diversity across sibling pairs, and because of the very low diversity in the Sb supergene variant (π ≈ 0), which may be the result of low recombination in Sb and a putative recent fixation of this variant in the sampled population (Pracana et al. 2017).