Cis-regulatory evolution in prokaryotes revealed by interspecific archaeal hybrids

The study of allele-specific expression (ASE) in interspecific hybrids has played a central role in our understanding of a wide range of phenomena, including genomic imprinting, X-chromosome inactivation, and cis-regulatory evolution. However across the hundreds of studies of hybrid ASE, all have been restricted to sexually reproducing eukaryotes, leaving a major gap in our understanding of the genomic patterns of cis-regulatory evolution in prokaryotes. Here we introduce a method to generate stable hybrids between two species of halophilic archaea, and measure genome-wide ASE in these hybrids with RNA-seq. We found that over half of all genes have significant ASE, and that genes encoding kinases show evidence of lineage-specific selection on their cis-regulation. This pattern of polygenic selection suggested species-specific adaptation to low phosphate conditions, which we confirmed with growth experiments. Altogether, our work extends the study of ASE to archaea, and suggests that cis-regulation can evolve under polygenic lineage-specific selection in prokaryotes.

this obstacle by maintaining two different selection markers at the same genetic locus in the two parental species. In such a condition any homologous recombination event will result in swapping one selection marker for the other, and as long as one selects for both markers, only heterozygous cells will survive, assuming no ectopic recombination occurs.
We have applied this unique system to explore cis-regulatory evolution in the genus Haloferax. The two species we studied were Haloferax volcanii, isolated from the Dead Sea in Jordan 15 , and Haloferax mediterranei, isolated from a saltern in Alicante, Spain 16 . These two species have ~13.4% sequence divergence in the protein-coding regions of their ~4 Mbp genomes, which is composed of a ~3 Mbp chromosome and three large plasmids. While both species' isolation sites were characterized by high salt concentrations, they likely differed greatly in other respects, such as concentrations of magnesium and phosphate ions, raising the possibility of lineage-specific adaptations of these species to their respective environments.

Results
We have previously shown that H. volcanii and H. mediterranei are able to efficiently mate and generate interspecies recombinants 14 . In order to generate a stable H. volcanii × H. mediterranei hybrid, we needed to prevent the possibility of recombination between chromosomes, thus forcing the hybrid to retain both parental chromosomes. For that we needed to create mutants that carry two different selectable markers at the same genomic location, since the two strains are syntenic 17 (Fig. 1A). We used the H. mediterranei strain WR646 (ΔtrpA hdrB + ), an auxotroph for tryptophan and prototroph for thymidine 14 , and the H. volcanii strain H133 (ΔtrpA ΔhdrB), an auxotroph for tryptophan and thymidine 18 . H133 was then modified by inserting the trpA selectable marker into the hdrB locus to generate UG241 (trpA + ΔhdrB). This was done by transforming H133 with pTA160-trpA and selecting on media lacking thymidine, thus selecting for a double crossover event copying the trpA selectable marker into the hdrB locus. To create the stable hybrid, WR646 and UG241 were mated and colonies were selected on media lacking thymidine and tryptophan ( Fig. 1B and Methods).
We performed both RNA-and DNA-seq on two independently derived H. volcanii × H. mediterranei hybrid cultures, each derived from a single colony (hereafter replicates 1 and 2). Reads were mapped to a reference containing both parental genomes, and species-specific gene-level expression was calculated in reads per kilobase per million mapped reads (RPKM). The DNA-seq data showed nearly equal representation of both parental genomes ( Supplementary Fig. 1), confirming that our approach resulted in true hybrids, as opposed to maintenance of both markers via ectopic recombination. Integrating ortholog and operon predictions 19, 20 resulted in 1,954 orthologous transcriptional units, hereafter referred to as 'orthologs' , corresponding to 1,507 individual genes and 447 operons (Supplementary File 1; see Methods).
As Haloferax species are highly tolerant of both intra-and inter-chromosomal and plasmid copy number variation 21 , we used the DNA-seq data to identify large-scale amplifications (see Methods). As expected, the ratio of plasmid coverage to chromosomal coverage varied between the two alleles in each replicate ( Supplementary  Fig. 1). Consequently, we restricted our analysis to orthologs found outside of amplified regions and on the main chromosomes of the two parental species, resulting in 1,526 orthologs for analysis (Supplementary Table 1). We observed similar patterns of expression levels and ASE ratios in the two biological replicates ( Supplementary  Fig. 2). The genomic organization of the selectable markers involved in the study. (B) Generation of a stable hybrid. H133 was transformed with pTA160 trpA, and upon selection on media lacking thymidine the trpA marker was integrated in the hdrB locus, generating UG241. UG241 was mated with WR646, which are autotrophs for thymidine and tryptophan, respectively. The mated colonies were selected on a media lacking thymidine and tryptophan. Small circles indicate the plasmids and the rectangle represents the chromosome.
Differential expression of the two species' alleles within a common trans cellular background, known as allele-specific expression (ASE), indicates divergence of cis-regulation between orthologs 8,9 . This inference holds regardless of whatever trans-acting changes also impact gene expression. In order to detect significant ASE, we employed a method that takes into account both gene length and base-compositional differences between parental alleles 22,23 (see Methods). 929 orthologs showed significant ASE at a 5% false-discovery rate (FDR), indicating the presence of substantial cis-regulatory differences between the two parental species ( Fig. 2A). We found no significant difference in the number of genes favoring either species' allele (453 vs. 476 favoring the H. mediteranei vs. H. volcanii allele, χ 2 = 0.569, 1 degree of freedom, p = 0.451), suggesting that ASE was about equally likely to favor either allele.
We also tested the accuracy of our classification of genes into orthologous operons by testing whether adjacent genes within operons showed greater similarity in ASE ratios than adjacent, independently transcribed genes. Indeed, genes within operons had a significantly smaller median absolute log 2 differences in ASE values than those outside of operons in both biological replicates ( Fig. 2B; Kruskal-Wallis test p = 2.2 × 10 −185 and 3.2 × 10 −191 for replicates 1 and 2, respectively). These differences may be conservative, since any errors in the operon predictions 19 would lead us to underestimate their magnitudes.
Although ASE data reveal genome-wide patterns of cis-regulatory divergence, these might mostly reflect random changes due to genetic drift of neutral alleles. To identify those changes driven by lineage-specific natural selection, we and others have developed a "sign test" that detects selection acting on the regulation of entire groups of functionally related genes 24 . This test has been successfully applied to fungi, plants, and metazoans 22-31 , but not to prokaryotes, due to the previous lack of ASE data from interspecific hybrids.
We applied the sign test to Gene Ontology gene sets from H. volcanii 32 to search for gene sets with ASE directionality biased towards one parental species, which represents a robust signature of lineage-specific selection (see Methods). We found that genes with a known role in phosphorylation (GO:0016310) showed a significant bias in ASE directionality (ASE for 16/21 alleles favoring H. mediterranei in each biological replicate; permutation-based p < 0.001). These phosphorylation-related genes were predominantly kinases, and the "kinase activity" subset (GO:0016301) showed a similar ASE bias (ASE for 15/19 alleles favoring H. mediterranei in each biological replicate; permutation-based p < 0.001; Fig. 3A, Supplementary Table 1). We further confirmed this result using the arCOG database annotations 33 , which showed a similar trend (16/22 kinases favoring H. mediterranei). These gene sets showed the strongest sign test results of any GO gene set, suggesting that genes related to phosphorylation-particularly kinases-have evolved under lineage-specific selective pressures leading to increased expression in H. mediterranei, or decreased expression in H. volcanii.
The results of the sign test led us to hypothesize that the higher expression of kinases in H. mediterranei may be the result of selection in conditions where phosphate is limiting, since this could allow more efficient utilization of the scarce phosphate. If this was the case, we would also predict that phosphate transporters should show a similar up-regulation in H. mediterranei. Indeed, the Pst operon-containing four high-affinity phosphate transporters that are the major regulators of phosphate uptake in related halophiles 34 -was 3.8-fold more highly expressed from the H. mediterranei alleles in our hybrids, making it one of the most strongly biased operons in the genome. Considering the concordant directionality of both kinases and phosphate transporters, we predicted that selection for optimal growth in low phosphate should be reflected by an increased fitness of H. mediterranei in low phosphate. To test this, we grew both parental strains in low (0.1 mM) phosphate for 30 hours. Consistent with our prediction, H. mediterranei showed robust growth in this condition, in contrast to H. volcanii, whose growth was highly impaired (Fig. 3B and Supplementary Fig. 3).

Discussion
In this work we have introduced a method to create stable interspecific hybrids of Haloferax, and used these hybrids to investigate the extent and phenotypic impacts of cis-regulatory evolution. Our application of the sign test revealed lineage-specific selection acting on the cis-regulation of kinases, which led to our prediction-and confirmation-of H. mediterranei's superior fitness in phosphate-limiting conditions, as well as its up-regulation of high-affinity phosphate transporters.
Although we do not know the phosphate concentrations of the specific sites where these two species were isolated, it is well established that phosphate is the main limiting nutrient in the Mediterranean 35 . In contrast, the Dead Sea contains higher phosphate levels, particularly in the sediments where H. volcanii was once most abundant 36 . Therefore it is plausible that H. mediterranei may have adapted to the low phosphate levels with increased expression levels of kinases and phosphate transporters, compared to H. volcanii. Although it is consistent with our prediction, further experiments will be required to prove whether this fitness difference is caused by the cis-regulatory divergence that we observed. In addition, two important caveats are that 1) Additional phosphate-related genes may also have been subject to lineage-specific selection that could not be detected by our sign test, e.g. due to lack of comprehensive functional annotations for these genomes; and 2) Our ASE-focused approach would not detect any protein-coding changes that could also affect fitness in low phosphate.
Over half (929/1526) of the orthologs we studied showed significant ASE, and this fraction would likely increase with greater sequencing depth. Based on this lower bound estimate, we conclude that cis-regulatory divergence is likely to be a major source of evolutionary novelty in Haloferax, though of course this does not preclude a role for other sources of variation, such as in protein-coding regions. We note that although the ASE we have observed can only be explained by cis-regulatory divergence (since archaea lack any other known source of ASE, such as X-chromosome inactivation or genomic imprinting), the molecular mechanism of this divergence could involve a combination of both transcriptional and post-transcriptional regulation. Given the extensive sequence divergence between these species, and the small fraction of these changes expected to impact cis-regulation, simple correlations of sequence divergence vs. ASE cannot reveal the locations of causal changes; targeted experiments of individual candidate cis-regulatory variants would be required to establish their mechanisms.
In sum, our results suggest that selection can act on the cis-regulation of groups of functionally related genes in prokaryotes, similar to patterns of polygenic adaptation that have been discovered with the sign test across a wide range of eukaryotes. An exciting direction for future work will be to compare finer-scale patterns of evolution between eukaryotes and prokaryotes, in order to better understand to what extent these vastly different organisms adapt to their environments in a fundamentally similar fashion.  18 ), and pTA298 for making deletions in ∆trpA background 37 .
Strains were routinely grown in rich medium (Hv-YPC). When selection was needed we used Casamino Acids medium (Hv-Ca). When required, 50 µg/ml of thymidine, uracil or tryptophan were added. Following mating all strains were grown on enhanced Casamino broth. All media were made as described (http://www.haloarchaea. com/resources/halohandbook/version 7.2). All growth was at 45 °C unless otherwise noted.
To introduce trpA at the hdrB locus of H. volcanii, we first inserted the trpA gene into the plasmid pTA160, originally designed to delete hdrB. The trpA gene, under the ferredoxin (fdx) promoter of H. salinarium, was amplified using primers AP389 (aaagctagcgctcggtacccggggatcc) and AP390 (tttgctagccgttatgtgcgttccggat), from pTA298. Using NheI, the PCR product was inserted into pTA160 between the hdrB flanking regions. Transformation of H. volcanii was carried out using the PEG method as described 38 .
Prior to hybridization, each culture was grown to an OD 600 of 1-1.1, and 2 ml samples were taken from both strains and applied to a 0.2 mm filter connected to a vacuum to eliminate excess media. The filter was then placed on a Petri dish containing a rich medium (HY medium + thymidine) for 48 hr at 42 °C. The cells were washed and resuspended in Casamino broth, washed twice more in the same media, and plated on selective media.
RNA-seq and DNA-seq libraries were prepared using Illumina TruSeq v3 kits, following manufacturer protocols. All libraries were multiplexed in one lane of an Illumina HiSeq 2000 and sequenced as single-end 101 bp reads. Sequencing data have been deposited in the NCBI SRA (http://www.ncbi.nlm.nih.gov/sra), BioProject accession PRJNA327107, and are summarized in Table 1.

Genome annotation and read mapping. We obtained the genome assemblies and annotations for
H. volcanii (strain DS2) and H. mediterranei (strain ATCC 33500) from NCBI RefSeq (accession numbers: GCF_000025685.1 and GCF_000337295.1, respectively). In order to determine which bases in each genome would be unambiguously mappable in the hybrids, in each parental genome, we employed a sliding window of 75 bp (our mapping read length; see below) and a step of one bp to create simulated NGS reads. These reads were mapped to a reference consisting of both parent's genomes using Bowtie 0.12.8, with default parameters, retaining only uniquely mapping reads. Any base overlapped by reads that could not be mapped uniquely were masked from further analysis (corresponding to 3.9% and 1.3% of the H. volcanii and H. mediterranei genomes, respectively).
We identified orthologous genes between the two species using the RoundUp database 20 . Genes were then grouped into operons based on the MicrobesOnline operon predictions in H. volcanii 19 (http://meta.microbesonline.org/operons/gnc309800.html). Corresponding H. mediterranei operons were inferred from the presence of co-linearity of orthologs between the parental species.
All DNA-seq and RNA-seq reads were trimmed to 75 bp in length and mapped to a reference consisting of the concatenation of both parental genomes using Bowtie, version 0.12.8, with default parameters and retaining only uniquely mapping reads. As the number of genomic equivalents used during library construction vastly exceeded the base-level coverage, it was unlikely that any given RNA molecule was sequenced multiple times, thus all mapped reads were retained. DNA-seq RPKM was calculated using the number of unambiguously mappable bases as the gene length (although RPKM is typically used for RNA-seq data, it is equally appropriate for measuring read density in DNA-seq data).
DNA-seq results indicated that all genes were present from both parents in the hybrids, though not always with equal copy number. We detected local copy number variants among orthologs on the main chromosomes (defined as having DNA-seq RPKM greater or less than 2 standard deviations from the mean RPKM across all orthologs on the main chromosome), indicated by the grey points in Supplementary Fig. 1. These orthologs were removed from further analysis in order to prevent spurious detection of ASE. In addition, all genes on the plasmids were removed due to their greater variation in copy number (Supplementary Fig. 1).

Detecting significant ASE.
We determined base-level coverage of gene coding regions of both species for all uniquely mappable positions for both hybrid replicates for main chromosome located orthologs with at least 100 reads mapping per gene (summed over both alleles) in both biological replicates, to ensure robust ASE estimates. As the DNA-seq data indicated that parental chromosomal abundance was not necessarily equal in both  Table 1. Summary of sequencing reads generated for each sample. A large proportion of reads generated in the RNA-seq libraries originate from ribosomal RNA, which were not included in the mapping reference.
replicates, the base-level coverage of the main chromosome of the parent with the higher coverage was linearly scaled down such that the total coverage was equal to that of the lower coverage parent: where scaled i is the scaled coverage at position i on the main chromosome, high i is the coverage at position i in the higher-coverage parent, and low i is the coverage at position i in the lower-coverage parent. The RNA-seq RPKMs were calculated as the base level coverage/(the number of uniquely mappable bases × the total base level coverage for all orthologs × the mapped read length [75 bp]). Although RPKM values are influenced by the distribution of expression levels across all genes, this effect will have no impact on the ASE ratios-our metric of interest-since it will affect both alleles equally, thus canceling out.
To test for significant ASE, we applied the resampling test of Bullard et al. 22 (Supplementary Fig. 4): the base-level read coverage of each parental allele was resampled with replacement 10,000 times, under two conditions: either 1) using the H. volcanii marginal nucleotide frequencies (  cis-ratio from each biological replicate in order to obtain a two-tailed p-value based on how often the observed ratio was outside of the bounds of the null distribution. In cases where both biological replicates agreed in the direction of parental bias, the least significant (i.e. largest) p-value among the four comparisons (two null distributions compared to each of two replicates) was retained as a measure of the significance of differential expression. All p-values for genes in which the biological replicates agreed in the direction of bias were adjusted such that we retained only those comparisons significant at an FDR 39 of 5% for further analysis.
To determine whether ASE measurements between genes within predicted operons were more similar than those outside of operons, we performed 10,000 random samples of two categories of pairs of adjacent genes: either within predicted operons or outside of any predicted operon. For each sampled pair of genes we calculated the difference in the absolute values of log 2 (ASE ratios). Finally, we asked whether the distribution of these differences from genes sampled within operons was significantly lower than that sampled outside of operons.
All statistical analyses were performed using R version 3.13 40 . Kruskal-Wallis tests were performed using 10,000 permutations of the data as implemented in the 'coin' package 41 .
Detecting selection on cis-regulatory divergence. Gene Ontology (GO) categories for H. volcanii genes were obtained from the EBI Quick-GO database 32 (accessed on 18 Feb. 2014). In the case of multi-gene operons, the operon was annotated as the union of the GO terms associated with its respective genes. For the purpose of interspecific comparisons, H. mediterranei orthologs were assigned to the same GO categories as H. volcanii.
Orthologs with significant cis-regulatory divergence at either level were divided into two categories based on the upregulating parental allele and ranked based on the magnitude of their absolute cis ratio (from largest to smallest). We searched for lineage-specific bias among GO biological process, GO molecular function, and GO cellular component. In order to detect lineage-specific bias within a gene set, we identified all functional categories containing at least 10 members in the set and determined whether significant bias existed in the direction of one or the other lineage using a χ 2 'goodness of fit' test. Because many different categories were being tested, we determined the probability of observing a particular enrichment by permuting ortholog assignments and repeating the test 10,000 times, retaining the most significant p-value observed in each functional dataset. We obtained a permutation-based p-value by asking how often a χ 2 value of equal or greater significance would be observed in the permuted data (which is equivalent to a GO category-specific FDR 23 ). The sign test was performed at two thresholds, using either the top 50% most biased orthologs, or analyzing all biased orthologs. The sign test differs from typical applications of gene set enrichment because each gene/operon with ASE is affected Scientific RepoRts | 7: 3986 | DOI:10.1038/s41598-017-04278-4 by independent cis-regulatory changes; in contrast, in most applications of gene set enrichment (e.g. to genes differentially expressed between different conditions, cell types, individuals, etc.) the genes could be responding to a single upstream factor, such as a transcription factor, and thus are not independent. The independence inferred from ASE allows us to test a rigorous null model of neutral evolution, which when rejected (as in the case of kinases here) indicates the presence of lineage-specific natural selection 24 . Growth in low phosphate. The low phosphate media was Hv-Min medium 18 , supplemented with potassium phosphate buffer (pH 7.5), the only phosphate source, to a final concentration of 0.1 mM phosphate. To compare the growth rates each strain was grown in low phosphate minimal broth medium at 42 °C in shaking incubator for three days to reach OD 600 > 0.4, then both strains diluted to be at the same OD (<0.15) to start the growth analysis. The growth curves were done using a Biotek ELX808IU-PC in 96-well plates at 42 °C with continuous shaking, measuring OD 595 every 30 minutes for 30 hours. Three technical replicates were performed for each growth curve.