Natural variation in stochastic photoreceptor specification and color preference in Drosophila

Each individual perceives the world in a unique way, but little is known about the genetic basis of variation in sensory perception. In the fly eye, the random mosaic of color-detecting R7 photoreceptor subtypes is determined by stochastic on/off expression of the transcription factor Spineless (Ss). In a genome-wide association study, we identified a naturally occurring insertion in a regulatory DNA element in ss that lowers the ratio of SsON to SsOFF cells. This change in photoreceptor fates shifts the innate color preference of flies from green to blue. The genetic variant increases the binding affinity for Klumpfuss (Klu), a zinc finger transcriptional repressor that regulates ss expression. Klu is expressed at intermediate levels to determine the normal ratio of SsON to SsOFF cells. Thus, binding site affinity and transcription factor levels are finely tuned to regulate stochastic expression, setting the ratio of alternative fates and ultimately determining color preference.


Introduction
Organisms require a diverse repertoire of sensory receptor neurons to perceive a range of stimuli in their environments. Differentiation of sensory neurons often requires stochastic mechanisms whereby individual neurons randomly choose between different fates. Stochastic fate specification diversifies sensory neuron subtypes in a wide array of species including worms, flies, mice, and humans (Ressler et al., 1993;Roorda and Williams, 1999;Troemel et al., 1999;Hofer et al., 2005;Johnston and Desplan, 2010;Magklara and Lomvardas, 2013;Alqadah et al., 2016;Viets et al., 2016). How naturally occurring changes in the genome affect stochastic mechanisms to alter sensory system development and perception is poorly understood. To address this question, we investigated natural variation in stochastic color photoreceptor specification in the Drosophila retina.
The stochastic decision to express Ss is made cell-autonomously at the level of the ss gene locus via a random repression mechanism. The R7/R8 enhancer induces ss expression in all R7s, whereas two silencer regions (silencer 1 and 2) repress expression in a random subset of R7s (Figure 1B) (Johnston and Desplan, 2014).
Though the stochastic expression of Ss is binary (i.e. on or off) in individual R7s, it does not result in a simple 50:50 on/off ratio across the population of R7s in a given retina. In most lab stocks, Ss is on in ~65% of R7s and off in ~35% (Figure 1C) (Wernet et al., 2006; Johnston and Desplan, 2014). Here, we find that the proportion of Ss ON to Ss OFF R7s varies greatly among fly lines derived from the wild. We performed a genome-wide association study (GWAS) and identified a single base pair insertion that increases the affinity of a DNA binding site for a transcriptional repressor, significantly reducing the Ss ON/Ss OFF ratio. This genetic variant changes the proportion of photoreceptor subtypes and alters the innate color preference of flies.

Results
sin decreases the ratio of Ss ON to Ss OFF R7s
To determine the mechanism controlling the ratio of stochastic on/off Ss expression, we analyzed the variation in 203 naturally-derived lines collected from Raleigh, North Carolina (Drosophila Genetic Reference Panel (DGRP)) (Mackay et al., 2012). We evaluated Rh4 and Rh3 expression, as they faithfully report Ss expression in R7s (i.e. Ss ON = Rh4; Ss OFF = Rh3) (Figure 1A) (Thanawala et al., 2013; Johnston and Desplan, 2014). To facilitate scoring, we generated a semi-automated counting system to determine the Rh4:Rh3 ratio for each genotype (Figure 1C).
To assess the variation in the DGRP lines attributable to the ss locus and limit the phenotypic contribution of recessive variants at other loci, we crossed each DGRP line to a line containing a ~200 kb deficiency covering the ss locus and analyzed Rh3 and Rh4 expression in the F1 male progeny (Figure 1D). This genetic strategy generated flies hemizygous (i.e. single copy) for the wild-derived ss gene locus, heterozygous wild-derived/lab stock for the second, third, and fourth chromosomes, and hemizygous lab stock for the X chromosome (Figure 1D). While the lab stock expressed Ss (Rh4) in 62% of R7s under these conditions, expression among the DGRP lines varied significantly, ranging from 19% to 83% Ss ON (Rh4) (Figure 1E-F; Figure 1-source data 1).
To identify the genetic basis of this variation, we performed a genome-wide association study (GWAS) using the Ss ON (Rh4) phenotype data and inferred full genome sequences of the progeny of each DGRP line crossed with the ss deficiency line. We performed an association analysis and identified a single base pair insertion within the ss locus ('ss insertion' or 'sin') that was significant ($p < 10^{-13}$) after Bonferroni correction (Figure 1G). sin was enriched in DGRP lines with a low ratio of Ss ON to Ss OFF R7s (Figure 1F and H).
We next confirmed the regulatory role of sin. Naturally derived lines from Africa that are homozygous for sin displayed a decrease in the proportion of Ss ON (Rh4) R7s compared to lines from Africa lacking sin (Figure 1I) (Lack et al., 2015). We identified sin on a balancer chromosome (TM6B) in a lab stock that similarly displayed a decrease in the proportion of Ss ON (Rh4) R7s when ss was hemizygous (Figure 1J). To definitively test the role of sin, we used CRISPR to insert sin into a lab stock. Flies hemizygous for CRISPR sin alleles displayed a significant decrease in the proportion of Ss ON (Rh4) R7s (Figure 1K). Using a Ss antibody, we examined Ss expression directly and found that flies homozygous for CRISPR sin alleles displayed a significant decrease in the proportion of Ss ON R7s.

sin shifts innate color preference from green to blue
As sin alters the proportion of color-detecting photoreceptors, we hypothesized that it would also change color detection and preference. When presented with two light stimuli in a T-maze (Tully and Quinn, 1985), flies will phototax toward the light source that they perceive as more intense (Figure 2A) (McEwen, 1918; Heisenberg and Wolf, 1984; Choe and Clandinin, 2005). The absorption spectra of Rh3 and Rh4 significantly overlap in the UV range (Feiler et al., 1992), complicating behavioral assessment of color preference caused by differences in R7 photoreceptor ratios. Instead, we focused on the perception of blue light by Rh5 and green light by Rh6 in the R8 photoreceptors, as these Rhodopsins have more distinct absorption spectra (Salcedo et al., 1999). Because R8 fate is coupled to R7 fate (Chou et al., 1996) (Figure 1A), we predicted that flies with sin would have a low ratio of Rh6- to Rh5-expressing R8s and would consequently prefer blue light, while flies without sin would have a higher ratio of Rh6- to Rh5-expressing R8s and would instead prefer green light.
Indeed, DGRP lines containing sin preferred blue light, while DGRP lines lacking sin preferred green light ( Jukam et al., 2013;Viets et al., 2016) or in the neural circuit downstream of R8 signaling likely caused the green light preference of some lines with sin.
sin increases the binding affinity for the Klumpfuss transcription factor
sin is a single base pair insertion within a previously uncharacterized non-coding region of the ss locus located ~7 kb upstream of the transcriptional start (Figure 1B and Figure 3A). To identify trans factors whose binding might be affected by sin, we searched for binding motifs affected by sin in SELEX-seq (Nitta et al., 2015) and bacterial one-hybrid (B1H) datasets (Zhu et al., 2011; Enuameh et al., 2013). sin lies in a predicted binding site for the zinc finger transcription factor Klumpfuss (Klu), the fly homolog of Wilms' Tumor Suppressor Protein 1 (WT1).
To evaluate the effect of sin on Klu binding, we analyzed available SELEX-seq binding data (Nitta et al., 2015), focusing on the core 10-mer. The number of reads containing the Klu binding site with sin (CGCCCACACC) was significantly higher than without sin (CGCCCACACA) (Figure 3D), and thus, Klu binds the endogenous ss sequence with sin better than without it. Considering the frequency of 10-mers as a measure of site preference, we found that 506 10-mers (0.10%) have frequencies greater than the Klu site without sin, whereas only 366 10-mers (0.07%) have frequencies greater than the Klu site with sin. Together, these results show that sin increases the binding affinity of the Klu site in vitro.
We further analyzed SELEX-seq data to understand the differences between Klu binding affinities for the predicted optimal site and endogenous site in ss. The endogenous 10-mer core sequence in ss (CGCCCACACA) deviates from the optimal site (CGCCCACGCA) at position 8, causing a dramatic reduction in the affinity for the endogenous site. A C instead of an A at position 10 significantly decreases binding affinity for the optimal site (compare CGCCCACGCA to CGCCCACGCC), whereas it increases affinity for the endogenous Klu site in ss (compare CGCCCACACA to CGCCCACACC) (Figure 3E). A PWM is a good representation of the sequence preferences of a DNA-binding protein, but it assumes independent contributions of individual bases. In this case, we observe dependence between positions within the motif that the PWM disregards. Our analysis indicates that Klu binding affinity depends on the relationship between the bases in positions 8 and 10. This dependence reveals that Klu preferentially interacts with the site with sin (C in position 10) over the site without sin (A in position 10) in the endogenous spineless locus (A in position 8), in contrast to the general predictions of the PWM (preferred G in position 8 and A in position 10). Dependence between positions suggests that binding of transcription factors like Klu is determined not only by sequence but also by DNA shape, as has been described previously (Zhou et al., 2015; Chiu et al., 2017). These data suggest that the Klu site in the endogenous locus is a low-affinity site and that sin increases its affinity.
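The limitation of the PWM described above can be made explicit with a short derivation. The notation here is introduced only for illustration and is not taken from the original analysis: $w_i(b)$ denotes the PWM weight of base $b$ at position $i$, and $S$ is the resulting additive score of a 10-mer.

```latex
% Additive (PWM) score of a 10-mer s = s_1 s_2 \dots s_{10}
S(s) = \sum_{i=1}^{10} w_i(s_i)

% Effect of replacing A with C at position 10, all other positions held fixed
\Delta S = S(s_1 \dots s_9\,\mathrm{C}) - S(s_1 \dots s_9\,\mathrm{A})
         = w_{10}(\mathrm{C}) - w_{10}(\mathrm{A})

% \Delta S does not depend on s_8, so an additive model predicts the same
% sign of change in the optimal context (G at position 8) and in the
% endogenous context (A at position 8).
```

Because the SELEX-seq data show a decrease in one context and an increase in the other, no choice of independent per-position weights can reproduce both observations, which is why the comparison points to dependence between positions 8 and 10.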
Since sin is predicted to increase the binding affinity for Klu and sin caused a reduction in the on/off ratio of Ss expression, we hypothesized that mutating the Klu site to an optimized high-affinity site would also cause a decrease in the proportion of Ss ON R7s. We used CRISPR to mutate the endogenous low-affinity Klu site (ACGCCCACACAC) to the predicted optimized high-affinity site (ACGCCCACGCAC) and observed a decrease in the proportion of Ss ON R7s similar to that in flies with sin (Figure 3F). The observation that an optimized high-affinity Klu site causes a phenotype similar to that of sin is consistent with the conclusion that sin increases the binding affinity for Klu.

Klu lowers the Ss ON /Ss OFF ratio in R7s
Klu/WT1 has been shown to be a transcriptional repressor in other systems (Drummond et al., 1992; McDonald et al., 2003; Kaspar et al., 2008). As sin decreases Ss expression frequency and is predicted to increase Klu binding affinity, we hypothesized that Klu also represses stochastic ss expression in R7s. We found that Klu was expressed in R7s in larval eye imaginal discs in a Gaussian distribution (Figure 4A). Conversely, klu loss-of-function mutants displayed increases in the proportion of Ss ON (Rh4) R7s (Figure 4E-F). We examined Ss expression directly and found that the proportion of Ss ON R7s increased in klu null mutants (Figure 1-figure supplement 2D-E). Moreover, we found that the proportion of Ss ON R7s increased in klu mutant clones compared to wild type clones (Figure 4G-I).
As the proportion of Ss ON R7s increases specifically in klu mutant clones and decreases upon ectopic expression of Klu in R7s, we conclude that Klu is endogenously expressed at intermediate levels and acts cell-autonomously to determine Ss expression state.
Our data suggest that the ratio of Ss on/off gene expression is controlled by both the level of Klu protein and the binding affinity of the Klu site. To test this idea, we altered Klu levels in flies with the higher affinity Klu site (i.e. with sin). Because the proportion of Ss ON R7s is reduced in flies with  increased Klu levels (high repressor levels) or in flies with the sin variant (high binding affinity), we predicted a further reduction in flies with both high Klu and sin (high repressor levels, high binding affinity). We generated flies with increased levels of Klu in a sin genetic background and observed a significant additional reduction in the proportion of Ss ON R7s ( Figure 4D).
To further test the relationship between Klu levels and binding site affinity, we reduced klu gene dosage in flies with sin and found that the sin phenotype was suppressed in klu mutant heterozygotes ( Figure 4J). We conclude that sin increases Klu binding affinity and that the binding affinity of the Klu site and levels of Klu protein determine the proportion of Ss ON R7s.

Discussion
Our studies of wild-derived flies revealed significant variation in stochastic Ss expression. We identified sin, a single base pair insertion in the~60 kb ss locus that dramatically lowers the Ss ON /Ss OFF ratio by increasing the binding affinity for the transcriptional repressor Klu. This decrease in Ss expression frequency changes the proportion of color-detecting photoreceptors and alters innate color preference in flies.
sin appears to be a relatively new mutation in D. melanogaster populations. sin is absent among diverse drosophilid species spanning millions of years of divergence (  (Bergland et al., 2016). The recent rise in the frequency of sin suggests that it could be the target of natural selection, perhaps via modulation of innate color preference. We tested this model by assessing patterns of allele frequency differentiation among populations sampled worldwide and by examining haplotype homozygosity surrounding sin. We compared these statistics at sin to the distribution of statistics calculated from several thousand randomly selected 1-2 bp indel polymorphisms that segregate at~25% in the DGRP. Curiously, sin did not deviate from genome-wide patterns (Figure 1-figure supplement 3G-J) suggesting that it might be selectively neutral in contemporary D. melanogaster populations.
It is interesting that Rhodopsin expression varies so significantly in the wild, given the nearly invariant hexagonal lattice of ommatidia in the fly eye. Rhodopsins are G-protein coupled receptors (GPCRs), a class of proteins identified as a source of natural behavioral variation in worms, mice, and voles (Young et al., 1999;Yalcin et al., 2004;Bendesky et al., 2011). Dramatic differences in Rhodopsin expression patterns across insect species (Hilbrant et al., 2014;Wernet et al., 2015)  suggest that variation in the expression of GPCRs, rather than retinal morphology, may allow rapid evolution in response to environmental changes.
sin increases the binding affinity of a conserved Klu site, suggesting that the site is suboptimal or low-affinity for Klu binding. Low-affinity sites ensure the timing and specificity of gene expression (Jiang and Levine, 1993;Gaudet and Mango, 2002;Scardigli et al., 2003;Rowan et al., 2010;Ramos and Barolo, 2013;Crocker et al., 2015;Farley et al., 2015;Crocker et al., 2016). Our studies reveal a critical role for a low-affinity binding site in the regulation of a stochastically expressed gene. The suboptimal Klu site, bound by endogenous levels of Klu, yields the normal 65:35 Ss ON /Ss OFF ratio. Changing the affinity of the site or the level of Klu alters the ratio of Ss ON / Ss OFF cells. We conclude that stochastic on/off gene expression is controlled by threshold levels of trans factors binding to low-affinity sites.
The level of Klu (analog input) determines the binary on/off ratio of Ss expression (digital output). In contrast, gene regulation is best understood in cases where levels of transcription factors (analog input) regulate the levels of target gene expression (analog output). Interestingly, sin or genetic perturbation of klu affected the frequency of Ss expression ( Figures 1F, I-K and 4C-J, Figure 1-figure supplement 2) but not levels (Figure 1-figure supplement 4).
The on/off nature of Ss expression suggests a cooperative mechanism whereby Klu acts with other factors to regulate ss. Conservation of additional base pairs surrounding the Klu site (Figure 3C) is consistent with cooperative binding of Klu and other factors, possibly through dimerization or multimerization. These additional conserved base pairs could also enable binding of activating transcription factors to sites that overlap with the Klu site. These activating transcription factors may compete with the repressor Klu for binding to determine the stochastic on/off expression state of ss.
The expression state of ss could be determined by the intrinsic variation in Klu levels (Figure 4-figure supplement 1). In this model, if Klu levels exceed a threshold, ss is off, and if Klu levels are below the threshold, ss is on. Alternatively, Klu levels could set the threshold for a different gene regulatory mechanism, such as DNA looping or heterochromatin spreading. The regions encompassing and neighboring the Klu binding site drive gene expression in the eye (Figure 1-figure supplement 1), suggesting that complex interactions between this regulatory DNA element, the R7/R8 enhancer, and the two silencers (Figure 1B) ultimately control the ss on/off decision.
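The first model above (ss is off when Klu exceeds a threshold) can be sketched numerically. This is only an illustration under stated assumptions: Klu levels are treated as exactly Gaussian across R7s, the mean and standard deviation are arbitrary units chosen for the example rather than measured values, and a higher-affinity site is represented simply as a lower threshold. The function names are hypothetical.

```python
from statistics import NormalDist


def fraction_on(mean_klu: float, sd_klu: float, threshold: float) -> float:
    """Fraction of R7s whose Klu level is below the threshold (ss ON),
    assuming Klu levels are normally distributed across R7s."""
    return NormalDist(mean_klu, sd_klu).cdf(threshold)


def threshold_for_fraction_on(mean_klu: float, sd_klu: float,
                              target_on: float) -> float:
    """Threshold that yields a given ss ON fraction under the same model."""
    return NormalDist(mean_klu, sd_klu).inv_cdf(target_on)


# Illustrative, arbitrary units (assumed, not measured).
MEAN_KLU, SD_KLU = 1.0, 0.2

# Threshold consistent with the ~65% Ss ON ratio of most lab stocks.
t = threshold_for_fraction_on(MEAN_KLU, SD_KLU, 0.65)

# More repressor (higher mean Klu) lowers the ON fraction.
on_high_klu = fraction_on(1.2, SD_KLU, t)

# A higher-affinity site, modeled here as a lower threshold, also lowers it.
on_high_affinity = fraction_on(MEAN_KLU, SD_KLU, 0.9 * t)
```

In this sketch both perturbations move the ON fraction in the same direction, which matches the qualitative behavior reported for increased Klu levels and for sin; it does not distinguish this model from the alternative mechanisms mentioned above.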
Cell fate specification is commonly thought of as a reproducible process whereby cell types uniformly express specific batteries of genes. This reproducibility is often the result of high levels of transcription factors binding to high-affinity sites, far exceeding a regulatory threshold, yielding expression of target genes in all cells of a given type. In contrast, the stochastic on/off expression of Ss requires finely tuned levels of regulators binding to low-affinity sites. We predict that fine tuning of binding site affinities and transcription factor levels will emerge as a common mechanistic feature that determines the ratio of alternative fates in stochastic systems.

Drosophila genotypes and stocks
Flies were raised on standard cornmeal-molasses-agar medium and grown at 25˚C.

Antibody staining
Adult, mid-pupal, and larval retinas were dissected as described (Hsiao et al., 2012) and fixed for 15 min with 4% formaldehyde at room temperature. Retinas were rinsed three times in PBS plus 0.3% Triton X-100 (PBX) and washed in PBX for >2 hr. Retinas were incubated with primary antibodies diluted in PBX overnight at room temperature and then rinsed three times in PBX and washed in PBX for >4 hr. Retinas were incubated with secondary antibodies diluted in PBX overnight at room temperature and then rinsed three times in PBX and washed in PBX for >2 hr. Retinas were mounted in SlowFade Gold Antifade Reagent (Invitrogen). Images were acquired using a Zeiss LSM 700 confocal microscope.

Quantification of expression
Frequency of Rh3 (Ss OFF ) and Rh4 (Ss ON ) expression in R7s was scored in adults. Six or more retinas were scored for each genotype (N). 100 or more R7s were scored for each retina (n). Frequency was assessed using custom semi-automated software (see below) or manually. Frequency of Ss expression in R7s was assessed with a Ss antibody in mid-pupal animals. Four or more retinas were scored for each genotype (N). 70 or more R7s were scored for each retina (n). Frequency was assessed manually.
Levels of Ss expression in Ss ON R7s were assessed with a Ss antibody in mid-pupal animals. Three retinas were scored (N). 40 or more Ss ON R7s were scored for each retina (n). We used ImageJ software to quantify Ss levels in Ss ON R7s (Figure 1-figure supplement 4A-C). A circular 'region of interest' was manually placed at the center of each Ss ON R7 (identified by expression of Ss and the R7 marker Prospero) to avoid signal from neighboring photoreceptors. ImageJ software assessed the mean pixel intensity for each region of interest for each Ss ON R7.
Levels of Klu expression in R7s were assessed with a Klu antibody in third instar larval animals. Five retinas were scored (N). 65 or more R7s were scored for each retina (n). We used ImageJ software to quantify Klu levels in all R7s (Figure 4-figure supplement 1). A circular 'region of interest' was manually placed at the center of each R7 (identified by pm181>GAL4; UAS>GFP reporter expression) to avoid signal from neighboring photoreceptors. ImageJ software assessed the mean pixel intensity for each region of interest for each R7.
To determine the number of rows from the equator to the dorsal third region of the adult retina, we first used phalloidin to stain actin (marking rhabdomeres of ommatidia) to locate the equator of each retina. We then counted the number of rows from the equator to the first R7 cell with coexpression of Rh3 and Rh4.

Image processing
We employed a custom algorithm to identify the positions of individual R7 photoreceptors within an image of the fly retina. First, individual fluorescence images from each wavelength channel were denoised using a homomorphic filter (Oppenheim et al., 1968) and Gaussian blur. Next, R7 boundaries were located using the Canny edge detection method (Canny, 1986). Cells were then roughly segmented using the convex hull algorithm (Barber et al., 1996). Active contouring (Chan and Vese, 2001) was used to refine the segments to fit the R7s more closely. Finally, a watershed transform was applied to the image, dividing it into regions that each contain a single R7. Regions were excluded by size or distance from the center to prevent artifacts due to the curvature of the fly retina. For the remaining regions, normalized intensities from the Rh3 and Rh4 channels were compared in order to assign each region a label indicating whether its R7 is stained for Rh3 or Rh4. A MATLAB (The MathWorks, Inc.) script that implements our algorithm is available at https://app.assembla.com/spaces/roberts-lab-public/wiki/Fly_Retina_Analysis.
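The final labeling step can be illustrated with a minimal Python sketch. It is not the MATLAB script referenced above. It assumes that segmentation has already produced one mean intensity per region in each channel, and that "normalized" means dividing each channel by its maximum over the retained regions, which is an assumption made here for illustration. The function names are hypothetical.

```python
from typing import List, Sequence


def label_regions(rh3: Sequence[float], rh4: Sequence[float]) -> List[str]:
    """Label each region 'Rh3' or 'Rh4' by comparing channel intensities.

    Each channel is normalized by its own maximum (an assumed scheme) so
    that differences in overall channel brightness do not bias the call.
    """
    if len(rh3) != len(rh4) or not rh3:
        raise ValueError("need one Rh3 and one Rh4 intensity per region")
    max3, max4 = max(rh3), max(rh4)
    if max3 <= 0 or max4 <= 0:
        raise ValueError("each channel needs at least one positive intensity")
    return ["Rh4" if b / max4 > a / max3 else "Rh3" for a, b in zip(rh3, rh4)]


def percent_rh4(labels: Sequence[str]) -> float:
    """Percentage of labeled R7s that are Rh4 (Ss ON)."""
    return 100.0 * sum(1 for x in labels if x == "Rh4") / len(labels)


# Hypothetical per-region mean intensities for four R7s.
labels = label_regions([10, 200, 20, 180], [150, 5, 160, 10])
ss_on_percent = percent_rh4(labels)
```

Regions removed by the size and distance filters would simply be left out of the two input lists before labeling.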

Genome-wide association studies
Genotype data from the DGRP Freeze 2 lifted to the dm6/BDGP6 release of the D. melanogaster genome were obtained from ftp://ftp.hgsc.bcm.edu/DGRP/. Phenotypes were calculated for the progeny of crosses of DGRP lines and Df(3R)Exel6269 flies. To estimate genotypes of these flies from the DGRP data, we simulated each cross. For each SNP or indel variant in the DGRP genotype data, we assigned a new genotype: (1) homozygous reference remains homozygous reference, (2) homozygous alternate maps to homozygous alternate if in the deficiency region, otherwise heterozygous, and (3) all other genotypes were mapped to missing or unknown and not included in subsequent analyses. We performed quantitative trait association analysis using plink2 --linear (version 1.90 beta 25 Mar 2016; PMID:25722852). To reduce the impact of population structure, we included the first 20 principal components of the standardized genetic relationship matrix as covariates (calculated using plink2 --pca). To empirically correct p-values for each site, we performed a max(T) permutation test with 10,000 permutations (--mperm option to plink2).
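The three genotype-mapping rules used to simulate each cross can be written as a small function. This is a sketch of the stated rules only; the string encodings ('ref', 'alt', 'het') and the function name are hypothetical and do not reflect the file formats actually used.

```python
from typing import Optional


def simulate_cross_genotype(dgrp_genotype: str,
                            in_deficiency: bool) -> Optional[str]:
    """Infer the F1 genotype of a DGRP line crossed to the deficiency line.

    dgrp_genotype: 'ref' (homozygous reference) or 'alt' (homozygous
        alternate) in the DGRP line; any other value is treated as unknown.
    in_deficiency: True if the variant lies within the deficiency region.
    Returns 'ref', 'alt', 'het', or None for missing/unknown.
    """
    if dgrp_genotype == "ref":
        # Rule 1: homozygous reference remains homozygous reference.
        return "ref"
    if dgrp_genotype == "alt":
        # Rule 2: inside the deficiency only the DGRP copy is present, so
        # the call stays homozygous alternate; elsewhere it is heterozygous.
        return "alt" if in_deficiency else "het"
    # Rule 3: everything else is missing and excluded downstream.
    return None
```

For example, a variant that is homozygous alternate in a DGRP line is scored as 'alt' when it falls inside the deficiency and as 'het' when it falls outside it.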

CRISPR-mediated mutagenesis
sin was inserted into a lab stock line using CRISPR (Gratz et al., 2013; Port et al., 2014). Sense and antisense DNA oligos for the forward and reverse strands of the gRNA were designed to generate BbsI restriction site overhangs. The oligos were annealed and cloned into the pCFD3 cloning vector (Addgene, Cambridge, MA). A single stranded DNA homology bridge was generated with 60 bp homologous regions flanking each side of the predicted cleavage site. The gRNA construct (500 ng/µl) and homology bridge oligo (100 ng/µl) were injected into Drosophila embryos (BestGene, Inc.). Single males were crossed with a balancer stock (yw; +; TM2/TM6B), and F1 female progeny were screened for the insertion via PCR and sequencing. Single F1 males whose siblings were sin-positive were crossed to the balancer stock (yw; +; TM2/TM6B) and the F2 progeny were screened for the insertion via PCR and sequencing. sin-negative flies from a single founder were used to establish a stable stock (CL-1) and sin-positive flies from three founders were used to establish independent stable stocks (CL-2, CL-3, CL-4).

Genotype R GTCAGCCACTACATGGTTTCG
The Klu optimal site was generated in a lab stock line using CRISPR (Gratz et al., 2013;Port et al., 2014). CRISPR was performed with the same gRNA and genotyping primers as described above, but with a new homologous bridge donor.

T-maze behavioral assays
Adult flies were raised on standard medium on a 14 hr/10 hr light/dark cycle at 25°C. The behavioral assay room was illuminated by a 630 nm red LED bulb (superbrightleds.com; PAR30IP-x8-90) whose emitted light lies outside of the sensitivity spectrum for fly photodetection. For each trial, 100 female flies were starved for 8 hr and then inserted into the elevator of the T-maze (Robert Eifert, Bayshore, NY). The elevator was lowered to a junction that, on each side, held an unused plastic tube (Falcon 352017). The T-maze was covered in black chalkboard tape and the plastic tubes were painted black. The T-maze and lights were kept a constant distance apart by a custom 3D-printed holder. A blue LED light (450 nm) and a green LED light (525 nm) on opposite sides were simultaneously turned on. Blue and green LED lights were obtained from superbrightleds.com (E12-B5, E12G5). The blue light was covered with three layers of 3x neutral density (ND) filters, while the green light was covered with one layer. After 20 s, the lights were turned off and the tubes were removed and capped. Flies from each tube were counted and the preference index (PI) was calculated using the formula $\mathrm{PI} = (N_G - N_b) / (N_G + N_b)$, where $N_G$ equals the number of flies in the tube illuminated with green light and $N_b$ equals the number of flies in the tube illuminated with blue light. PI ranged from −1 to 1, with negative values indicating a blue preference and positive values indicating a green preference. Five or more trials were conducted for each genotype (N). 100 or more flies were scored for each trial (n).
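As a worked example of the preference index, the following minimal Python sketch applies the formula to fly counts from one trial. The function name and the example counts are hypothetical.

```python
def preference_index(n_green: int, n_blue: int) -> float:
    """PI = (N_G - N_b) / (N_G + N_b).

    Negative values indicate a blue preference and positive values a
    green preference; the result always lies between -1 and 1.
    """
    total = n_green + n_blue
    if total == 0:
        raise ValueError("no flies were counted in either tube")
    return (n_green - n_blue) / total


# Hypothetical trial: 75 flies in the green tube, 25 in the blue tube.
pi_example = preference_index(75, 25)
```

With these hypothetical counts the index is 0.5, a green preference; swapping the two counts gives −0.5.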

Consensus sequence
For the B1H data sets, WebLogo3 was used to generate position weight matrices (PWMs) (Zhu et al., 2011; Enuameh et al., 2013) (Figure 3-figure supplement 1A). For the SELEX-seq data sets, MEME-ChIP version 4.11.2 was used to generate PWMs (Machanick and Bailey, 2011; Nitta et al., 2015) (ENA: ERX606541-ERX606544). Motif discovery and enrichment mode was set to normal, and the 1st order model of sequences was used as the background model. Expected motif site distribution was set at zero or one occurrence per sequence. Minimum width of motifs to be found by MEME was set to 12, and the maximum was set to 20.

Conservation analysis
The Klu site and neighboring sequences for 21 Drosophila species were obtained from the UCSC genome browser. TOMTOM version 4.11.2 was used to generate the conservation PWM (Gupta et al., 2007).

SELEX-seq analysis
SELEX-seq datasets from Nitta et al. (2015) were obtained from ENA (ERX606541-ERX606544). For read-level analysis, we counted the number of reads containing the Klu binding site with sin, without sin, and neither site (there were no reads with both sites). We performed McNemar's test to assess significance. We computed the frequency of each 10-mer within each dataset using Jellyfish version 2.2.6 (Marçais and Kingsford, 2011). Using these counts, we determined the number of 10-mers with frequency greater than that of the Klu binding site with and without sin. Frequencies reported are for the combination of all four SELEX datasets. Jellyfish 2.2.6 was used to canonically count k-mers of length 10 with an initial hash of size 100M from FASTA files generated from the first and fourth rounds of selection. K-mers were reverse complemented as necessary to minimize the Hamming distance from the consensus sequence. Reported counts come from the fourth round.
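The 10-mer ranking described above can be sketched in pure Python. This is not the Jellyfish pipeline. It assumes reads are already in memory as strings, and it uses the lexicographically smaller of a k-mer and its reverse complement as the canonical form, which differs from the Hamming-distance orientation step in the text. The function names and toy reads are hypothetical.

```python
from collections import Counter
from typing import Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(reads: Iterable[str], k: int = 10) -> Counter:
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def n_kmers_more_frequent(counts: Counter, site: str) -> int:
    """Number of distinct k-mers counted more often than the given site."""
    site_count = counts[canonical(site)]
    return sum(1 for c in counts.values() if c > site_count)


# Toy reads (hypothetical): two forward copies of the site with sin, one
# reverse-complement copy, and one copy of the site without sin.
toy_reads = ["CGCCCACACC", "CGCCCACACC", "GGTGTGGGCG", "CGCCCACACA"]
toy_counts = count_canonical_kmers(toy_reads)
```

Applied to the real fourth-round reads, `n_kmers_more_frequent` corresponds to the quantity reported in the Results as the number of 10-mers with frequencies greater than each Klu site.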

Population genetic analyses
We estimated allele frequencies from populations sampled worldwide at sin and at other 1-2 bp indel polymorphisms. Allele frequency estimates based on pooled resequencing of populations sampled in North America and Europe were obtained from (Bergland et al., 2014) and (Kapun et al., 2016). Allele frequencies based on haplotypes (Lack et al., 2016) were also obtained from populations sampled in North America, the Caribbean, Europe, and Africa.
For pooled samples, we mapped raw sequence reads to Release 6 of the Drosophila melanogaster genome, removed PCR duplicates, performed indel-realignment using GATK version, and called allele frequencies using VarScan. For haplotype data, we relied on published indel VCF files obtained from the Drosophila Genome Nexus (DGN; http://www.johnpool.net/genomes.html). Regions of admixture in African genomes were identified based on analyses by Lack et al. (2016). DGN data were mapped to Release 5 of the Drosophila genome and we converted those coordinates to Release 6 using the lift-over file available from the UCSC Genome browser (http://hgdownload.soe.ucsc.edu/goldenPath/dm3/liftOver/dm3ToDm6.over.chain.gz).
We sought to assess whether the distribution of sin among populations sampled worldwide was significantly different than expected by chance based on other comparable indel polymorphisms. sin was originally identified in the Drosophila Genetic Reference Panel (DGRP), derived from a population in Raleigh, NC, where it segregates at ~25%. We observed that sin segregates at ~10-25% in other North American and European populations but is rare/absent in ancestral African populations (Figure 1-figure supplement 3A-F). Such changes in allele frequency among continents could indicate the action of positive selection. To test this model, we identified ~1500 other 1-2 bp autosomal indel polymorphisms that segregate at 25 ± 5% in the DGRP (hereafter, 'control set'). We estimated $F_{ST}$ among continents as well as within North America at sin and our control set. $F_{ST}$ values were rank normalized and converted to a Z-score through an inverse normal CDF with mean zero and standard deviation one. sin did not show elevated levels of $F_{ST}$ within or between continents relative to the control set, suggesting that sin does not contribute to local adaptation amongst sampled populations (Figure 1-figure supplement 3G-H).
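The rank normalization and Z-score conversion can be illustrated with the Python standard library. This is a sketch, not the original analysis code. Ranks are mapped to quantiles as rank/(n + 1) to keep them strictly between 0 and 1, and ties are broken by input order; both choices are assumptions made here. The function name and example values are hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def rank_normalize(values: Sequence[float]) -> List[float]:
    """Convert values to Z-scores via their ranks and the inverse
    standard normal CDF (mean 0, standard deviation 1)."""
    n = len(values)
    std_normal = NormalDist(0.0, 1.0)
    # Indices sorted by value; ties keep their input order (stable sort).
    order = sorted(range(n), key=lambda i: values[i])
    z = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        z[idx] = std_normal.inv_cdf(rank / (n + 1))
    return z


# Hypothetical F_ST values for three variants.
z_scores = rank_normalize([0.3, 0.1, 0.2])
```

In use, the statistic for sin would be normalized together with the control set so that its Z-score reflects its rank among comparable indels.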
Next, we tested whether haplotype patterns surrounding sin are indicative of a partial selective sweep. We calculated the extended haplotype homozygosity (EHH) score and integrated EHH (iEHH) score for haplotypes with and without sin (derived and ancestral haplotypes, respectively) in the DGRP data where sin was originally identified. EHH scores were also calculated at the control set, as described above. EHH scores were calculated using the R package rehh (Gautier and Vitalis, 2012). The derived sin allele shows an elevated iEHH score compared to the ancestral allele, suggestive of a partial selective sweep (Figure 1-figure supplement 3I-J). To test this model, we calculated the integrated haplotype statistic (IHS) for sin as well as the control set. IHS for sin is not significantly different than expected by chance relative to other comparable indel polymorphisms (Figure 1-figure supplement 3I-J).
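For readers unfamiliar with EHH, the quantity can be illustrated with a short Python sketch. It is not the rehh implementation. It uses the standard definition of EHH as the probability that two haplotypes drawn at random from those carrying the same core allele are identical across all markers from the core to a given marker. Haplotypes are encoded as strings of '0'/'1' alleles; the encoding and function name are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Sequence


def ehh(haplotypes: Sequence[str], core_index: int, end_index: int) -> float:
    """EHH among haplotypes that share the same core allele.

    Haplotypes are grouped by their sequence over the markers between
    core_index and end_index (inclusive); EHH is the fraction of
    haplotype pairs that fall in the same group.
    """
    n = len(haplotypes)
    if n < 2:
        return 0.0
    lo, hi = sorted((core_index, end_index))
    groups = Counter(h[lo:hi + 1] for h in haplotypes)
    return sum(comb(c, 2) for c in groups.values()) / comb(n, 2)


# Toy data (hypothetical): marker 0 is the core; '1' marks the derived allele.
toy_haps = ["1100", "1101", "1111", "0000"]
derived = [h for h in toy_haps if h[0] == "1"]
ehh_by_marker = [ehh(derived, 0, j) for j in range(4)]
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up shared haplotypes; integrating this decay curve over distance gives the iEHH score compared between derived and ancestral alleles.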