Evolution of salivary glue genes in Drosophila species

Background At the very end of the larval stage Drosophila expectorate a glue secreted by their salivary glands to attach themselves to a substrate while pupariating. The glue is a mixture of apparently unrelated proteins, some of which are highly glycosylated and possess internal repeats. Because species adhere to distinct substrates (i.e. leaves, wood, rotten fruits), glue genes are expected to evolve rapidly. Results We used available genome sequences and PCR-sequencing of regions of interest to investigate the glue genes in 20 Drosophila species. We discovered a new gene in addition to the seven glue genes annotated in D. melanogaster. We also identified a phase 1 intron at a conserved position present in five of the eight glue genes of D. melanogaster, suggesting a common origin for those glue genes. A slightly significant rate of gene turnover was inferred. Both the number of repeats and the repeat sequence were found to diverge rapidly, even between closely related species. We also detected high repeat number variation at the intrapopulation level in D. melanogaster. Conclusion Most conspicuous signs of accelerated evolution are found in the repeat regions of several glue genes. Electronic supplementary material The online version of this article (10.1186/s12862-019-1364-9) contains supplementary material, which is available to authorized users.


4
Drosophila species, such as D. sechellia, D. simulans, and the invasive D. suzukii, appear to pupariate directly within the wet rotten part of fruits (J. David, personal communication, [12]).
Given the diversity of pupation sites, we hypothesized that they would require distinct types of glue and therefore that Sgs genes might evolve rapidly among the Drosophila genus.
A sixth electrophoretic band migrating slightly slower than the Sgs3 protein was also detected in a few D. melanogaster lines [17,31,32]. The nucleotide sequence of the corresponding gene, Sgs6, remains unknown but cytogenetic and genetic mapping indicates that Sgs6 is located in region 71C3-4 and differs from Eig71Ee [14,28,32].
The three genes Sgs3, Sgs7 and Sgs8 form a tightly linked cluster on the 3L chromosomal arm at position 68C [33,34]. All glue genes were found to start with a signal peptide. The largest glue genes, Sgs1, Sgs3 and Sgs4 and Eig71Ee were shown to harbor numerous internal repeats of amino acid motifs, rich in proline, threonine and serine [19,25,29,35]. Molecular studies showed that the number of internal repeats was variable between strains in Sgs3 [36], and Sgs4 [35]. Additionally, consistent with missing protein bands, a few laboratory strains were inferred to carry loss-of-function mutations in Sgs4 [6 , 16, 35, 37], Sgs5 [27] and Sgs6 [17,31,32].
In the present study, we characterize the diversity and evolution of the Sgs genes within the Drosophila genus. We inferred loss and gain of glue genes and we investigated repeat number variation and sequence repeat diversity across 19 species and across paralogs.

Results
We used the six Sgs genes and Eig71Ee annotated in D. melanogaster as BLAST queries to identify their putative homologs in 19 other Drosophila species (Table 1). The homologs are summarized in Figure 1 and Table 2. In D. melanogaster, the glue genes are "extremely highly" or "very highly" expressed in late larval salivary glands according to the RNAseq data in Flybase. But transcript data that would be useful for annotating the genes were not available for all species, probably because the expression window of the glue genes (late third larval instar and only in salivary glands) is narrow [6]. The organization of the Sgs genes was found to be generally conserved across the Drosophila species we investigated (Fig. 1).
Proper identification of each ortholog was based on sequence similarity and, when possible, synteny. We describe below our findings for each category of Sgs genes.

Gains and losses of Sgs5 genes
We found that Sgs5 had a tandem paralog in D. melanogaster, located ca. 300 bp upstream of Sgs5 (CG7587, hereafter named Sgs5bis), sharing 46,3 % identity and 66,9 % similarity at the protein level. It is co-expressed with Sgs5 during late third larval instar in dissected salivary pseudoobscura lineages, then the ancestral gene before the duplication was probably Sgs5bis.
Gains and losses of Sgs3, Sgs7, and Sgs8 genes The genes Sgs3, Sgs7 and Sgs8 form a tight cluster, 4.5 kb long, on the 3L arm in D.
melanogaster [33] and they share sequence similarities [19] in their N-terminal and Cterminal parts. Sgs3 contains internal repeats whereas Sgs7 and Sgs8 do not. When the internal repeats of Sgs3 are excluded, the amino acid identity in D. melanogaster is 51.3 % between Sgs3 and Sgs7, 48.7 % between Sgs3 and Sgs8, and 46.7 % between Sgs7 and Sgs8.
Additionally Sgs3, Sgs7 and Sgs8 share a phase 1 intron position, interrupting the signal peptide sequence [19]. In the clade D. yakuba / santomea / erecta, Sgs7 and Sgs8 are inverted with respect to the D. melanogaster arrangement (Fig. 1). In addition, Sgs7 is duplicated in D.
yakuba (Dyak\GE20214 and Dyak\GE21218) and D. santomea (Fig. 1). The two copies, inverted relative to each other, have only one, nonsynonymous, nucleotide difference. Sgs8 lies between the two Sgs7 copies, and has the same orientation as Sgs3. In species outside the D. melanogaster subgroup, all the Sgs3, Sgs7 and Sgs8 sequences also have the same intron, with slightly different positions depending on codon indels before the intron. Notably, D.
suzukii is the only species in our study that has lost Sgs3. D. suzukii retained Sgs8 and underwent an amplification of Sgs7, three copies of which are identical. In a number of species, Sgs7 and Sgs8 could not be identified (Sgs7 and Sgs8 are small proteins, about 75 amino acids in length). However, when a BLAST search was performed using the Sgs7 or Sgs8 sequences of D. melanogaster, we retrieved the same target hits as with Sgs3 (Table 2).
In those species, several Sgs3-like genes were found instead, i.e. long proteins with internal repeats showing N-terminal and C-terminal parts similar to Sgs3. In species with no Sgs7, no Sgs8 and several Sgs3-like genes occupying the physical location of Sgs7 and Sgs8 (D. pseudoobscura, D. ficusphila, D. rhopaloa, see Fig. 1), it is tempting to infer that the ancestral Sgs7 and Sgs8 have gained internal repeats. According to such a hypothesis, at least in some cases, the non-repeated parts of those Sgs3-like protein sequences are expected to cluster with Sgs7/8. To disentangle the relationships among Sgs3-7-8 paralogs, we constructed a phylogeny using an alignment of the non-repeated parts of the protein sequences. The tree ( Fig. 3), which does not fit well to the species phylogeny, shows a clear separation between Sgs3/Sgs3-like and Sgs7/Sgs8, except for D. bipectinata and D. willistoni, whose Sgs7/Sgs8 sequences are linked to the Sgs3 branch, with low support. This would rather suggest that those Sgs7/Sgs8 sequences are old Sgs3-like sequences which have lost their internal repeats. However, the sequence length is far too short to get a reliable tree and we cannot confirm this hypothesis. While it is more parsimonious to infer that there were two ancestral Sgs3 and that subsequent losses occurred, the tree topology is not accurate enough to confirm this hypothesis.

Gains and losses of Sgs1 genes
Sgs1 was found only in the melanogaster subgroup and in the Oriental subgroups, which suggests that it originated in the ancestor of this clade. No Sgs1 gene was detected in D. erecta, providing evidence for a loss of Sgs1. The Sgs1 sequence identified by our BLAST search in the D. suzukii genome database (see Materials and Methods) showed many stop codons in the second half of the repeat region and was not annotated as a coding sequence.
Based upon the surrounding repeat sequences, we found that inserting a C at position 1829 (from start) would restore the reading frame, translating into a putative 2245 amino acid protein. Our analysis of another genome sequence of D. suzukii [38] (contig CAKG01017146) showed that in this second strain there is a C at position 1829 and that Sgs1 is 2245 amino acid long. Since position 1829 lies in the middle of a long repeat-containing region which prevents PCR amplification, we did not try to check experimentally for the missing C in the first D. suzukii genome sequence. In all the Sgs1 genes identified, except in D. elegans, an intron was found at the same position and phase as in Sgs3, Sgs7 and Sgs8. There is also a loose similarity in the N-terminal and C-terminal parts of Sgs1 and Sgs3 (in D. melanogaster about 14% identity between Sgs3 and Sgs1 excluding the repeats). This suggests that Sgs1 belongs to the same family as Sgs3/Sgs7/Sgs8 genes.
The origin of Sgs4 is unknown. We found no similarity with any other sequence in any genome. Some sequence similarity between Eig71Ee and Sgs4 had been reported [29], but is not convincing since it was in the repeat parts, which are of low complexity. Eig71Ee was found in all the D. melanogaster subgroup species and in some of the so-called Oriental species, where it has been annotated as mucin2, or extensin in D. takahashii, or even, erroneously, Sgs3 in D. suzukii. We also detected the N-terminal parts of it in the D. ananassae group; thus making unclear the phylogenetic distribution of the gene (Table 2).
More interestingly, we noticed that Eig71Ee harbors an intron at the same position as the one 8 of Sgs3, Sgs7, Sgs8 and Sgs1. This result argues for a certain relatedness among those genes.
However, using Eig71Ee as a TBLASTN query did not retrieve any hits from any Sgs genes and the Eig71Ee amino acid sequence does not align with the Sgs sequences.

Rate of gene gains and losses in the glue gene families
Our analysis reveals that the seven annotated genes that code for glue proteins can be grouped into three gene families. Sgs1, Sgs3, Sgs7, Sgs8, and Eig71Ee comprise one of the three families since all of them share a phase 1 intron at the same position, interrupting the signal peptide sequence. Sgs4 then forms its own family and the Sgs5 and 5bis comprise the third family. We used CAFE [39] to reconstruct ancestral copy numbers throughout the Drosophila phylogeny and to test whether these three gene families evolve at an accelerated rate along any Drosophila lineage. For the CAFE analysis Eig71Ee was not included due to uncertainties about its presence in some species. We find that the Sgs4 and Sgs5-5bis families do not evolve faster compared to other gene families present in the Drosophila genomes (p=0.58 and p=0.107, respectively; Table S1), however the Sgs1-3-7-8 family was found to evolve rapidly (p=0.005; Table S1). Overall, this family seems to be prone to duplication and loss (Fig. S1) and we find that this signal for rapid evolution is driven mostly by small changes on many lineages (i.e. a gain or loss of 1 gene) rather than large changes on one or a few particular lineage. Table 3 summarizes the characteristics of the repeated sequences present within Sgs genes. Sgs1, Sgs3 and to a lesser extent, Sgs4 and Eig71Ee, are characterized, besides a signal peptide and a conserved C-terminal part, by long repeats often rich in threonine and prone to O-glycosylations. Although D. melanogaster Sgs5 protein is devoid of internal repeats (we checked that it is the case in all populations of the PopFly database), in most other species, even in close relatives, repeats are present, mostly pairs Pro-(Glu/Asp). Sgs5 protein length is highly variable across species. In D. kikkawai, there is a long additional stretch (127 amino acids) containing 60 % of acidic residues. The paralog Sgs5bis never has repeats. Sgs7 and Sgs8 are much smaller proteins, without any repeats, and are rich in cysteine (12-14 %). The conserved C-terminal parts are about 120 amino acids long in Sgs1, 50 amino acids in Sgs3, 120 amino acids in Sgs4, 115 amino acids in Sgs5/5bis and 135 amino acids in Eig71Ee. The repeats are quite variable in motif, length and number, even between closely related species, so that, most often, glue proteins may be retrieved only based on their conserved C-terminal 9 part. The longest Sgs protein is Sgs1 of D. suzukii (2245 aa), which harbors ca. 63 repeats of a 29 amino acid, threonine-rich motif so that the total content of threonine is 40% ; in D.

Characterization of the repeats in glue proteins
melanogaster, Sgs1 is also very long (1286 aa) due to 86 repeats of a motif of 10 amino acids, also threonine-rich (46%). The shortest Sgs1 protein is the one of D. sechellia (492 aa). In all the species where it exists, Sgs1 is also rich in proline (12-18%). Sgs3 has the same kind of amino acid composition. Repeats can also be quite different between paralogs. For example, in D. eugracilis, while the two genes are physically neighbors, Sgs3 has about seven repeats of CAP(T)9, whereas Sgs3bis has ca. 65 KPT repeats. In D. elegans, the three Sgs3-like proteins also have quite different repeats (Table 3). Sgs4 is richer in proline than in threonine (18% vs. 16% in D. melanogaster) and contains 10% cysteine residues.

Interspecific variation in number and sequence of repeats
Between closely related species the number of repeats varied enormously and the repeated sequence diverged sometimes rapidly (Table 3) (Table   3). Sgs4 genes show 91% identity at the protein level with the same 23 repeats; Sgs5 97% identity. Another pair of species worth of interest is D. suzukii/D. biarmipes, considered to have diverged ca. 7.3 mya [43]. As mentioned above, only Sgs1 and Sgs5 can be compared because D. suzukii has lost Sgs3, and Sgs4 is limited to the melanogaster subgroup. Despite a longer divergence time than for the previous comparisons, the Sgs1 29 amino acid repeats are similar in the two species but D. suzukii has many more repeat units. In the non repeat parts, identity is 69.3 %; Sgs5 is well conserved even in the repeat region, with an overall identity of 76.4 % in amino acids, and 84.8 % in the non-repeat parts. A last pair of related species (despite their belonging to different subgroups) is D. elegans/D. rhopaloa. Their divergent time is unknown. We found that their Sgs proteins are very similar overall, including the repeat parts. This is less striking for the repeats in Sgs3, which exists as four gene copies in D. rhopaloa. Their Sgs5 shared a high overall identity (75%), with repeats (Glu-Pro)n. In the non-repeat parts, identity rose to 82%. Indeed we often found more divergence among paralogs within a genome than across orthologous proteins. indicate that the repeat regions of Sgs1, Sgs3, Sgs4, Sgs5 and Eig71Ee are intrinsically disordered (Fig. 4). Only IUPred and PrDOS indicate Sgs4 repeats to be ordered, in disagreement with the other predictors.

Intraspecific variation in number of repeats
Owing to the difficulty of short-read sequencing methods to deal with the repeated sequences found in glue genes, we could not get a species-wide insight of repeat number variation (RNV) in D. melanogaster. Therefore, we resequenced Sgs3 and Sgs4 in strains from various geographic locations using classical Sanger sequencing (Table 4). We found striking interand intrapopulation variation in the number of repeats : for Sgs3 ( Fig. S2 and S3, Table 4), there was at least 9 repeat difference between the shortest and the longest allele (22 to 31); for Sgs4, 18 to more than 26 repeats ( Fig. S4 and S5, Table 4). Regarding the Sgs4 data from the Drosophila Genome Nexus study (Cairo population), we observed that the repeat region was erroneously reconstituted, often underestimating the repeat number, compared to our Sanger sequencing. We also sequenced the Sgs3 and Sgs4 genes in wild-caught D. mauritiana individuals. For Sgs3 we found variation in the number of stretched threonines (10 or 12) and in the number of repeats ( Fig. S6A and Table 4). For Sgs4, we found that the actual sequences were much longer than the sequence available online, and variable in length, even at the intrapopulation level, ranging from 25 to 35 repeats of the 7 amino acid motif ( Fig. S6B and Table   4).

Nonsense mutations in the Sgs genes
Despite the rather low quality of sequences in the Drosophila Genome Nexus data set, we searched for putative premature termination codons (PTC) in Sgs genes of D. melanogaster, which could lead to non-functional proteins. The search was limited to non-repeat regions.
We found PTC in Sgs4 of several lines, that truncated the protein at the beginning of its conserved C-terminal part. We confirmed experimentally the presence of this PTC in 10 lines of the Cairo population EG (K165stop) ( Fig. S5 and Table 4). We also found putative PTC for Sgs5 in a few lines (W161stop, that is sub-terminal, and maybe not detrimental), and experimental verification confirmed it in one Ethiopian line (EF66N); in Sgs5bis, we found a putative PTC (C33stop) in six African lines from Rwanda (RG population) and Uganda (UG population). We also found a putative PTC for Sgs1 in a few lines from USA and Cairo (P49stop), which was confirmed by resequencing the Egyptian line EG36N. This nonsense mutation required two substitutions from CCA to TAA in all cases. Interestingly, EG36N has also a truncated Sgs4. Therefore its glue should be investigated more carefully. In Sgs3, no PTC was found. Putative PTC were found for Eig71Ee in two lines, EA90N (S345stop) and RAL894 (W380stop), both in the C-terminal region. One putative PTC was found in Sgs7 (Q47stop, line USI33), but was not checked experimentally. No PTC was found in Sgs8 sequences. Stretches of Ns found in non-repeat regions could possibly, at least in some cases, turn out to be true deletions, which deserves further investigation. There is a possibility that some PTCs could experience stop codon readthrough [48] leading to a correct protein, for instance in Sgs4, because the nonsense mutation was not accompanied by other mutations, which would be expected in case of relaxed selection unless the nonsense mutation is very recent. Further studies of the protein content of the salivary glands in those strains will be needed to check whether Sgs4 is produced and if it is full-size.
Evolutionary rate of Sgs protein sequences Given than glue proteins harbor RNV and given our hypothesis that they could be putative targets for fast selection, we wanted to test whether glue gene coding sequences evolve quickly. To this end, we computed substitution rates of Sgs genes between D. melanogaster and D. simulans, the genomes of which are well annotated. We did not include Sgs3, because the internal repeats were very different and not alignable between the two species. This, at any rate, shows that this particular gene evolved rapidly. However, we were able to make an estimate for Sgs1, although it had the biggest size and the highest number of repeats, because the repeats were rather similar in D. melanogaster and D. simulans. We removed the unalignable parts before computation, therefore underestimating the real evolutionary rate.
We performed similarly for Eig71Ee, Sgs4 and Sgs5. The results are shown on Table 5. We contrasts with the one of its close relative Sgs7 (0.0475). We wondered if Sgs8 had also evolved faster than Sgs7 in other pairs of related species. Table 6 shows the results for other species pairs known to be close relatives : D. melanogaster/D. sechellia, D. simulans/D. sechellia ; D. yakuba/D. erecta ; D. biarmipes/D. suzukii. Whereas the latter two pairs showed no evolutionary rate difference between Sgs7 and Sgs8, comparing D. simulans and D.
sechellia showed a ten times higher dN for Sgs7 relative to Sgs8, a situation opposite to D. simulans vs. D. melanogaster. In fact, D. sechellia Sgs7 is more divergent than D. simulans from D. melanogaster Sgs7, whereas Sgs8 did not diverged further. Obviously, the small number of substitutions points to a high variance, and the difference may be not significant.
To test for adaptive evolution after the "out of Africa" event of D. melanogaster [50], we measured the nucleotide diversity  and divergence Dxy between one population from Zambia, (ZI) thought to be within the original geographical area of D. melanogaster, another African population (EF, Ethiopia) and two derived populations, from France (FR) and USA (Raleigh, RAL). This study was limited to the coding sequences of Sgs5 and Sgs5bis, due to the absence of internal repeats and to the gene size, not too short, (Sgs7 and Sgs8 were too short). Due to the numerous residual unidentified nucleotides in the Drosophila Genome Nexus data, the number of sites taken into account could be much smaller than the sequence size, e.g. for Sgs5bis, 278 sites left over 489 in RAL. We compared the overall  and Dxy between these populations [51]. The results are shown on Table 7. Roughly, for both genes  is higher in ZI than in EF, FR and RAL, as for the whole genome and as expected for the region of origin of this species, but divergences Dxy are less than expected from the whole genome, except for the ZI/EF comparison of Sgs5. Both genes gave similar results. Therefore, the glue genes Sgs5 and Sgs5bis do not show particular divergence across populations, which could have been related to a change in population environment.
We also searched for episodic diversifying selection (EDS) among species for the three genes entirely devoid of repeats, Sgs5bis, Sgs7 and Sgs8. The branch-site REL test of the HyPhy package was used. No accelerated evolution was detected for Sgs5bis, whereas one branch (D. santomea-D. yakuba clade) underwent EDS for Sgs7 (corrected p-value 0.012) and one branch (D. erecta-D. yakuba-D. santomea) underwent EDS for Sgs8 (corrected p-value 0.015) ( Fig 6). These results must be considered with caution given the small size of the data set, but anyway do not favor a specific selection regime, regarding single nucleotide (or amino acid) polymorphism.

Discussion and conclusion
We have investigated the presence and characteristics of Sgs genes and proteins in several Drosophila species belonging to the two main subgenera Sophophora and Drosophila, with particular emphasis on species closer to D. melanogaster. We have identified the various Sgs genes through sequence similarity with D. melanogaster. While this study is extensive, it is of course possible that we may have missed glue genes completely different from the ones of D.
melanogaster. In order to get the full collection of glue genes we require transcriptional evidence from late larval salivary gland RNA for each species studied. Interestingly, according to our census, the seven genes characterized for years in D. melanogaster are far from being always present in the other genomes, although the seven members are generally preserved in the D. melanogaster subgroup. Our results are in disagreement with the succinct interspecific study of Farkaš [15]. We also propose here an eighth glue gene, Sgs5bis. Based on its close sequence homology and its co-expression with Sgs5 we propose that these two genes are tandem paralogs. We notice that Sgs5bis never contains internal repeats whereas Sgs5 often harbors more or less developed repeat motifs, although not in D. melanogaster.
Given our data, and notwithstanding the unbalanced taxonomic sampling which may mislead us, we suggest that the ancestor of the species studied here had only Sgs3 and Sgs5bis (Fig 1).
It is likely that Sgs7, Sgs8, and perhaps also Sgs1 and Eig71Ee, originated from duplications of Sgs3. The important differences in repeat motifs between duplicate Sgs3 (e.g. in D. eugracilis) are striking and suggest a high rate of evolution, or independent acquisition of repeats from a repeatless or repeat-poor parental gene. A part of the sequence we named Sgs3like in D. willistoni is reported in FlyBase as GK28127, with transcription on the opposite strand, and without a homolog in D. melanogaster. Thus, it is possible that some duplicates of Sgs3 may have been actually recruited for other functions other than glue production. In this respect, it is also possible that Eig71Ee, which has been studied mostly for its immune functions, could be an ancient glue protein, which gained new functions.
The repeat-containing glue proteins are typical of secreted mucins. Mucins are highly glycosylated proteins found in animal mucus and they protect epithelia from physical damage and pathogens [52]. In D. melanogaster, more than 30 mucin-like proteins have been identified [53] but the precise function of most of them remain unknown. It would be interesting to compare the glue genes with the other mucin-like genes in terms of protein domains and sequence evolution. In D. melanogaster, repeats similar to those of Sgs3 (KPTT) are found in the mucin gene Muc12Ea. The high level of glycosylation is thought to favor solubility at high concentration while accumulating in salivary glands ([15]). The richness in cysteines suggests that, upon release in the environment through expectoration, disulfide bridges between glue proteins may be formed by cysteine oxidation by air, making a complex fibrous matrix. Intramolecular disulfide bonds can also be predicted ([15]). Examination of the amino acid composition of the glue proteins suggests that the numerous prolines may induce a zigzag-like shape while serine and threonine, which are very abundant, besides being prone to O-glycosylation, make them very hydrophilic and favor interaction with the solvent and then solubility while preventing folding. The presence of regularly scattered arginines or lysines (or sometimes aspartic and glutamic acids) would add charge repulsion, helping the thread structure to be maintained flat and extended. This is similar to linkers found between mobile domains in some proteins [54]. The shorter Sgs7/Sgs8 would, considering their richness in cysteine, bind the threads together through disulfide bonding.
In the frame of an intrinsically disordered structure (Fig. 4), it is not surprising to observe a high level of repeat number variation (RNV) even at the intra-population level. It has been reported ([55, 56]) that in proteins with internal domain or motif repeats, if these repeats form disordered regions and do not interact with the rest of the protein chain (for a cooperative folding for example), they are more prone to indels which are better tolerated, and favored by the genetic instability of repeated sequences. It is likely that, within a certain repeat number range, variations in repeat numbers might have little effect on the chemical and mechanical properties of the glue. In fact it is likely that the differences in repeat motif sequences rather than the number of repeats would change the mechanical and physical properties of the glue.
Accordingly, we measured rather fast rates of evolution, but found no clear indication of positive selection. One reason why the evolution of the repeats is fast (across related species or across paralogs) might be that the constraints to maintain disorder and the thread-like shape are rather loose ([55]) We do not know the respective roles of the different Sgs proteins in the final glue. Farkaš (2016) mentioned that Sgs1 could have chitin-binding properties, which is in line with the function of the glue. He also proposed roles of specific components before expectoration, inside salivary gland granules, related to packaging, solubility... The absence of some glue components may have consequences on its properties and may play a role in adaptation, as suggested by [15]. Gene loss, gene duplication, or repeat sequence change may modify the strength of the glue or its resistance to water or moisture, to acidity (of a fruit) and therefore might be linked to pupariation site preference. D. suzukii lacks both Sgs1 and Sgs3, and has duplications of Sgs7. D. suzukii pupae are found mostly in the soil just below the surface, and less rarely within ripe and wet fruits such as cherries or raspberries, the pupa half protruding 15 [57,58]. The extensive loss of Sgs genes in D. suzukii may be related to its pupariation in soil. Shivanna et al. ([59]) have related pupariation site preference to the quantity of glue and, counter-intuitively, have reported that species that prefer to pupariate on the food medium in the laboratory produce more glue than species that pupariate on the glass walls of the vials.
However, the chemical glue content was not investigated. Another study [60]  In this respect, collecting wandering larvae from various substrates, analyzing their glue composition and designing adhesion assays to compare adhesive properties between various glues will be valuable.
In conclusion, the pupal glue appears as a genetically and phenotypically simple model system for investigating the genetic basis of adaptation. The present work provides a first exploration of the evolution of glue genes across Drosophila species and paves the way for future studies on the functional and adaptive consequences of glue composition variation in relation to habitat and geographic and climatic origin.
The genome data used for each species is indicated in melanogaster Sgs sequences. Consequently, BLAST results were often limited to the Cterminal part of the targeted gene, which was the most conserved part of the proteins, and to a lesser extent to the N-terminal end. For each species, a nucleotide sequence containing large regions upstream and downstream of the BLAST hits was downloaded from InsectBase [66] or from species-specific websites when genome data was not present in InsectBase (Table 1).
We used Geneious (Biomatters Ltd.) to identify by eye the coding regions, the start of which was identified by the signal peptide sequence. Putative introns were also identified manually, guided by the intron-exon structure of the D. melanogaster orthologs. In cases of uncertainties or missing sequence data, we extracted DNA from single flies of the relevant species (Table   4) and the questionable gene regions were amplified with primers chosen in the reliable sequence parts (Table S2), and sequenced by the Sanger method using an ABI 3130 sequencer. For instance, we characterized the exact sequence corresponding to N stretches in the published sequence of D. mauritiana Sgs4 ; we found that the published premature termination codon (PTC) of D. biarmipes Sgs3 was an error and that three frameshifts found within 50 bp in D. sechellia Sgs1 were erroneous. Thus, dN and dS were computed using yn00 in the PAML package ([70]), removing the unalignable parts. We tested for episodic diversifying selection across species using the branch-site random effect likelihood (BS-REL) algorithm implemented in the HyPhy package [71,72] at the Datamonkey website (classic.datamonkey.org) [73]. We used only genes devoid of repeats to ensure reliable aligments, and we supplied species trees for the analysis.

Test for accelerated gene turnover
To infer ancestral gene counts in the three newly classified Sgs gene families and to determine whether the three newly classified Sgs gene families are evolving rapidly we first need to determine the average rate of gene gain and loss ( )   With the gene family data and ultrametric phylogeny as input, we estimated gene gain and loss rates (λ) with CAFE v3.0 [4]. This version of CAFE is able to estimate the amount of assembly and annotation error (ε) present in the input data using a distribution across the observed gene family counts and a pseudo-likelihood search. CAFE is then able to correct for this error and obtain a more accurate estimate of . We find an ε of about 0.04, which implies that 4% of gene families have observed counts that are not equal to their true counts. After correcting for this error rate, we find = 0.0034. This value for ε is on par with those previously reported for Drosophila (Table S3; [39]). However, this estimate is much higher than the previous reported from 12 Drosophila species (Table S3; [4,39]), indicating a much higher rate of error distributed in such a way that CAFE was unable to correct for it, or a much higher rate of gene family evolution across Drosophila than previously estimated. The 25 species Drosophila phylogeny was then manually pruned and modified to represent the 20 Drosophila species in which Sgs gene families have been annotated. Some Sgs gene families are not present in the ancestor of all 20 species, so additional pruning was done to the phylogeny for each family as necessary (see Table S1). The phylogeny, Sgs gene copy numbers, and the updated rate of gene gain/loss ( = 0.0034) were then used by CAFE to infer p-values in each lineage of each family (Table S4). Low p-values (< 0.01) may indicate a greater extent of gene family change along a lineage than is expected with the given value, and therefore may represent rapid evolution. and other species' genomes were obtained using NGS technologies, which yielded short reads. The data were often not accurate in repeat regions, likely because short reads may be not properly assembled when there are numerous short tandem repeats, and thus could not be used for counting RNV. Thus, experimentally, using single-fly DNAs, we amplified and sequenced the repeat-containing Sgs3 and Sgs4 from one or a few individual flies from several strains or natural populations available at the laboratory (French Guyana, Ethiopia, France, Benin, Ivory Coast, India, Comores, and the laboratory strain Canton S), and from a number of lines used in the Drosophila Genome Nexus study (Table 4). In addition, we investigated the occurrence of possible premature termination codons in gene alignments from

Declarations:
Ethics approval and consent to participate :

Availability of data and material :
Available upon request to the authors

Competing interests :
The authors declare that they have no competing interest

Funding :
The research leading to this paper has received funding from the regular annual funding of

Authors' contributions :
VCO and JLDL designed the study and analyzed data; JLDL and MB performed experimental work; GWCT performed CAFE analysis; JLDL, VCO and GWCT wrote the manuscript. All authors have read and approved the manuscript. Sgs3, dark blue is for Sgs7, light blue is for Sgs8, green is for Sgs4, orange is for Sgs5-5bis, purple is for Eig71Ee. Along with each species is a schematic representation of the organization of the glue gene cluster, with relative position and orientation for the species with confirmed synteny information. Gene sizes and distances are not to scale. "R" means that internal repeats are present. "R?" means that no clear repeats were identified. In D.
pseudoobscura, the relative orientation of the three clustered Sgs3-like sequences GA25425, GA23426, GA23878 suggested that GA23426 could be orthologous to Sgs3 (it is inside an intron of GA11155, homologue of Mob2, which is close to Sgs3 in D. melanogaster), GA23425 to Sgs7 and GA23878 to Sgs8. The last two had more similar sequences compared to GA23426, including the repeat region. Furthermore, the latter was neighbor to GA20420, a homologue of chrb-PC, a gene adjacent to Sgs8 in D. melanogaster.        3R : 1975190-1975883 3R : 1974195-1974756* 3L : 11539572-11539861 3L : 11536774-11536485 3L : 11537383-11537681 3L : 18202978-18201736 Figure S1: Ancestral states for the Sgs1-3-7-8 gene family inferred by CAFE. Species tips are labeled with the observed gene count and internal nodes are labeled with inferred gene counts.

Legends of Supplementary Materials
Orange branches represent gene losses, blue branches represent gene gains, while black branches represent lineages in which no change in gene copy number is observed. Branches marked with asterisks have marginally significant p-values (< 0.05).     Sgs4 mau has been corrected with our resequencing. Xs are undetermined amino acids.