Uncovering Genomic Regions Associated with Trypanosoma Infections in Wild Populations of the Tsetse Fly Glossina fuscipes

Vector-borne diseases are responsible for > 1 million deaths every year but genomic resources for most species responsible for their transmission are limited. This is true for neglected diseases such as sleeping sickness (Human African Trypanosomiasis), a disease caused by Trypanosoma parasites vectored by several species of tseste flies within the genus Glossina. We describe an integrative approach that identifies statistical associations between trypanosome infection status of Glossina fuscipes fuscipes (Gff) flies from Uganda, for which functional studies are complicated because the species cannot be easily maintained in laboratory colonies, and ∼73,000 polymorphic sites distributed across the genome. Then, we identify candidate genes involved in Gff trypanosome susceptibility by taking advantage of genomic resources from a closely related species, G. morsitans morsitans (Gmm). We compiled a comprehensive transcript library from 72 published and unpublished RNAseq experiments of trypanosome-infected and uninfected Gmm flies, and improved the current Gmm transcriptome assembly. This new assembly was then used to enhance the functional annotations on the Gff genome. As a consequence, we identified 56 candidate genes in the vicinity of the 18 regions associated with Trypanosoma infection status in Gff. Twenty-nine of these genes were differentially expressed (DE) among parasite-infected and uninfected Gmm, suggesting that their orthologs in Gff may correlate with disease transmission. These genes were involved in DNA regulation, neurophysiological functions, and immune responses. We highlight the power of integrating population and functional genomics from related species to enhance our understanding of the genetic basis of physiological traits, particularly in nonmodel organisms.

populations of Gff in Uganda. Three genetically distinct Gff population clusters can be identified: the northern, western, and southern regions (Beadell et al. 2010;Hyseni et al. 2012). These clusters are both spatially and temporally stable (Beadell et al. 2010;Echodu et al. 2011;Hyseni et al. 2012;Opiro et al. 2016Opiro et al. , 2017, even though hybrid zones among these units exist (Beadell et al. 2010;Opiro et al. 2017), which facilitate genetic admixing within a 100 km radius (Beadell et al. 2010). This information can be used to guide vector monitoring and control interventions. However, it does not provide insights on the genetic basis of traits shaping the interactions between vectors and parasites.
Understanding the interaction between vectors and their parasites is fundamental to inform disease control methods, but data from natural populations is scarce. This is a major knowledge gap with important epidemiological implications. To facilitate the study of genotype-phenotype interactions, we recently expanded the genetic marker toolset of Gff using whole-genome sequencing (WGS) coupled with double digestion RAD methods [ddRADSeq (Baird et al. 2008;Peterson et al. 2012)] to a set of 73,000 SNPs distributed across the Gff genome (Gloria-Soria et al. 2016). The Gff SNP data set obtained from this pilot study was generated from flies belonging to each of the three Ugandan genetic clusters (N = 53), with the same number of infected and uninfected flies per population. This data set was used to calculate the average linkage disequilibrium (LD) across the Gff genome and to look for signatures of selection via genomic scans (Gloria-Soria et al. 2016). Our results indicated that LD in Gff decays at a slower rate than in other insects, being an order of magnitude lower than in the fruit fly Drosophila melanogaster (Mackay et al. 2012). The average genomic LD also varied among the Gff populations (r 2 max /2 ranged between 1359 and 2429 bp) (Gloria-Soria et al. 2016). These high levels of LD likely increased our ability to subsequently identify loci under selection, despite the small sample size (Gloria-Soria et al. 2016). The results of this pilot study support the feasibility of conducting GWAS analyses in this species, using much lower sample sizes that those required by other species, including Anopheles or Aedes mosquitoes, whose LD levels are quite low (Harris et al. 2010;David et al. 2014;Marsden et al. 2014).
Although the pilot study yielded a list of candidate SNPs, mostly related to environmental adaptation (Gloria-Soria et al. 2016), only one SNP was identified as associated with the infection status of the flies using outlier detection methods (Fischer et al. 2011;Duforet-Frebourg et al. 2014). Furthermore, functional characterization of candidate genes was complicated by the poor annotation of the Gff genome and the lack of transcriptomics for this species, a common problem in nonmodel organisms. The present work extends the analyses on the initial ddRAD data set from Gloria-Soria et al. (2016) to gain additional insights on the nature of the genomic regions associated with the susceptibility to Trypanosoma infection of Gff populations by using statistical methods that maximize efficiency and power (Berg and Coop 2014), while accounting for spatial stratification of the samples as well as for species-and population-specific LD (Purcell et al. 2007). This work takes advantage of the functional data available from a closely related species, Gmm (Petersen et al. 2007), to improve the Gff genome annotation of candidate genes, thus enabling their functional characterization. Comparison of the genomes from six tsetse fly species, including Gmm and Gff, is ongoing, but our preliminary analysis indicates that . 90% of the predicted coding sequences from Gff align to the Gmm sequences. The remarkable similarity across the Gff and Gmm genomes highlights the validity of the present approach. As a result of this method, we identified candidate parasite resistance genes in Gff that could potentially be used to develop targeted control strategies for this and other Glossina species, paving the way for association studies on other phenotypes of epidemiological interest in this and other tsetse species (e.g., developmental time, life span, and susceptibility to insecticides, etc.).

Gff association analyses
Gff genomic data: The starting SNP data set for this study included all 73,297 loci identified by Gloria-Soria et al. (2016), using a combination of double digest RADseq (Peterson et al. 2012) and WGS (40· coverage/ individual). The raw data are available from NCBI bioproject 303153. The SNPs are distributed across the Gff genome at a density of 1 SNP for every 5 kb and were derived from four geographically distinct populations in Uganda-Otuboi, OT (North), Masindi, MS (West), and Namutumba, NB and Kalangala Island (South)-which belong to three genetic clusters. Sampling details can be found in Gloria-Soria et al. (2016). The use of genetically distinct populations maximized the natural genetic diversity represented in the data set, while also minimizing ascertainment bias, thus reducing the impact of false positives in our analysis (Schoville et al. 2012).
Genetic association analysis for susceptibility to trypanosome infection: The analysis was conducted on individuals from three of the four Gff populations from Uganda described in Gloria-Soria et al. (2016). These populations represent each of the three Gff genetic clusters in this country. The Kalangala Island population was excluded due to small sample size. Association analysis was performed using PLINK v1.9 (Chang et al. 2015) under a logistic regression model to identify SNPs associated with infection status. Infection status is treated as response and the predictor is the additive allele dosage effect of each marker. To minimize the confounding effects due to population structure, we used Faststructure v1.0 (Raj et al. 2014) to characterize the population structure by a vector (X1, X2, and X3), where the optimal dimensionality was determined by chooseK.py. Because (X1, X2, and X3) is compositional, log transformations were applied on X1/X3 and X2/X3 before including them in the model. We used this method because it can automatically select the complexity of the population structure (i.e., the dimensionality of covariates), and provides more efficient computation and better interpretations of global ancestry (Raj et al. 2014). In addition, log transformation is a common statistical practice to deal with composition variables (Aitchison 1982) and has been used to identify covariates in association studies in order to account for population structure (Mariette et al. 2016). We also carried out a similar association analysis accounting for population structure via principle components (PCs). Specifically, the top two PCs were extracted (-pca in PLINK v1.9) and included in the logistic regression model.
Statistical evidence of association between disease status and each marker was assessed through the Benjamini-Hochberg-adjusted p-values (Benjamini and Hochberg 1995). Subsequently, since the distribution of association p-values did not fit the expected uniform distribution in the tailed portion of the histogram (Supplemental Material, Figure S1A), we implemented an alternative strategy to simplify the underlying complexity of the statistical analysis; that is, we extracted representative SNPs based on different LD thresholds (r 2 = 0.01-0.001). This strategy aims to facilitate the interpretation of associations across regions.

Harnessing Gmm genomic data
Gmm genomic data: To functionally characterize the gene regions in Gff likely to be in association with Trypanosoma infection, we took advantage of the genomic and functional studies in a closely related species, Gmm, which is amenable to laboratory work (Petersen et al. 2007). We combined old and new transcriptome (step 1) and RNAseq (step 2) data to improve the published Gmm transcriptome assembly (step 3), then we generated a list of expressed Gmm transcripts expressed in various Gmm tissues (step 4), and ultimately identified those DE genes in relation to Gmm Trypanosoma infection (step 5); see Figure S2.
Step 1: Gmm transcriptome data: Methods from previously published transcriptomes can be found in the corresponding references listed in Table S1. Unpublished transcriptomes followed the same overall protocol. Briefly, Gmm were maintained at the Yale School of Public Health insectary and infected with T. brucei rhodensiense (Ytat 1.1), as described previously (Aksoy et al. 2016). Fourteen days postchallenge, tissues were microscopically scored for the presence or absence of trypanosome infections, and tissues of interest were flash-frozen in liquid nitrogen for RNA extraction (see Table S1 for tissues processed). Total RNA was extracted using TRIzol (Invitrogen) and subjected to DNase treatment (TURBO DNA-free kit AM1907; Ambion). RNA quantity and quality were determined using an Agilent Bioanalyzer 2100 RNA Nano chip. cDNA libraries were prepared using a NEBNext Ultra Directional RNA Library Prep Kit (E7420S; New England Biolabs), according to the manufacturer's protocol. The libraries were constructed and barcoded for Illumina HiSeq sequencing (unpaired 75 bases) at the Yale Center for Genome Analysis (New Haven, CT).
Step 2: Gmm RNAseq data: The RNAseq data analysis pipeline is shown in Figure S2. The raw Illumina RNAseq reads from 72 samples (Table S1) were first assessed for quality by FastQC (Andrews 2010). The FASTX-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) was then used to trim off Illumina adapter sequences and low-quality bases. Due to the prevalence of tsetse symbiont sequences in the reads, a specific quality control step was included to reduce bacterial sequence reads, using reference genome data sets from the three symbionts that are most commonly found in tsetse flies (Aksoy et al. 2014): Wigglesworthia (Rio et al. 2012), Wolbachia (Brelsfoard et al. 2014), and Sodalis (Toh et al. 2006). Parasite sequences were removed from parasitized samples by mapping the Illumina reads to the T. brucei genome strain 927 and T. congolense (www.tritrypdb.org) prior to analysis. Reads passing quality control, and symbionts and parasites separation, were aligned to a previously generated contig library ) by TopHat2 (Kim et al. 2013) and Bowtie (Langmead et al. 2009) alignment engines. Reads that mapped to multiple locations were discarded from further analysis. The percentage of reads aligned to bacteria, parasites, and the host genome is shown in Figure S3. The raw Illumina RNAseq reads can be found under the accession numbers listed in Table S1.
Step 3: updating Gmm transcriptome assembly: We constructed a more comprehensive Gmm transcriptome assembly than the one available through VectorBase [VectorBase (Giraldo-Calderón et al. 2015); GmorY1.4] by integrating the previous assembly with the 72 RNAseq samples described in Table S1. The reference annotation from VectorBase (Giraldo-Calderón et al. 2015) (GmorY1.4) was used to guide transcript assembly by Cufflinks v2.2.1 (Roberts et al. 2011a;Trapnell et al. 2012) and obtain the relative expression of a transcript (FPKM; fragments per kilobase of exon per million fragments mapped) for each annotated gene. The resulting 72 individual Cufflinks assemblies were then merged into a single unified assembly (Gmm Tx) using Cuffmerge, which can maximize assembly quality by removing transcripts that are artifacts and merging novel isoforms with known isoforms across all Cufflinks assemblies. This unified assembly was then integrated with the previous assembly (GmorY1.4), using Cufflinks to compare and determine a unique set of isoforms for each transcript locus.
Step 4: Gmm DE analysis: Two strategies, a count-based method and Cufflinks pipeline, were used to identify DE genes between Trypanosoma-infected and noninfected Gmm samples. In the countbased strategy, raw gene counts were first counted by HTSeq package (Anders et al. 2015) using the unified assembly generated as above. Batch effects and other unwanted variations were removed using the RUVseq package (Risso et al. 2014). Batch effect-corrected counts were then normalized using the TMM method (Robinson and Oshlack 2010) in the edgeR package Smyth 2007, 2008;Robinson et al. , 2012Anders et al. 2013), which accounts for RNA composition bias.
Step 5: DE genes between Gmm Trypanosoma-infected and noninfected flies: A negative binomial generalized linear model implemented in edgeR was used to determine DE genes between Trypanosomainfected and noninfected samples. In the Cufflinks strategy, the abundance of transcripts was quantified by FPKM and compared between Trypanosoma-infected and noninfected samples using Cuffdiff with fragment bias correction (Roberts et al. 2011b). This method corrects for sequence-specific bias and conducts a multihit read correction by dividing the value of a multimapped read between each map location based on a probabilistic model. In both strategies, genes with a foldchange . 2 and an FDR-controlled p-value , 0.05 were considered DE. For comparisons without biological replicates, only fold-change was used as a criterion.

Combining genomic data from Gff and Gmm
To gain insights into the functional role of Gff gene regions associated with Trypanosoma infection, we used the Gmm database, which included DE genes between infected and noninfected flies to narrow down the list of genes identified by the association analyses in Gff (see above) to only those genes displaying differential expression between infected and noninfected Gmm. First, we mapped the new Gmm transcriptome assembly to the Gff genome, producing a hybrid Gff annotation (step A). We then identified regions near the representative SNP of a region found to be associated with Trypanosoma infection in Gff, and DE in infected and noninfected Gmm flies (step B): Step A: cross-species transcript mapping: We used Blat (v35) (Kent 2002) to map the newly generated Gmm transcriptome assembly (Gmm Tx) to the Gff genome (GfusI1), with options set to exclude overoccurring 11mers from initial match seeds (-ooc 11). Blat hits (the Gmm transcripts that matched to some sequences of the Gff genome) were filtered and retained for further analysis if the hit was $ 100 bp long and had match coverage of $ 90%. Note that the purpose of using Gmm Tx was to supplement the current Gff annotation with genomic regions that may be missing, not to "fix" existing Gff annotations. For this reason, only Gmm Tx that did not overlap existing annotated features in Gff were added to the original Gff annotations to generate the final set of annotations used in our downstream analysis.
Step B: determination of DE genes in Gff linked to the SNPs in association with trypanosome infection: After identifying Gff regions that shared . 90% sequence identity with annotated Gmm genes and were longer than 120 bp, we added any novel Gff-annotated region to a hybrid gene annotation for Gff. We then looked in this hybrid annotation for genome features within 2500 bp in either direction of each representative SNP associated to trypanosome infection status. This distance was determined based on previous LD findings ( Gloria-Soria et al. 2016). We then cross-checked these gene regions in Gff with the Gmm data (above) to identify those DE in Gmm. The infectionassociated DE genes in Gmm were used to filter/annotate the Gff hybrid annotation for probable trypanosome infection-related Gmm homologs. This step helped to focus our efforts on candidates with a confirmed biological function ("real" genes as opposed to possible annotation artifacts) based on their DE expression in Gmm.

Functional characterization of candidate Gff genes
Drosophila and tsetse flies are both higher Dipterans that share several biological pathways. Previous studies investigating tsetse gene functions at the molecular level have shown a high conservation between tsetse flies and Drosophila, especially with regard to their interaction with microbes (Attardo et al. 2014;Benoit et al. 2017). Thus, to gain insight into the functions of our candidate genes, we looked for their homologs in these species. Transcripts from the Gff genes associated with trypanosome infection status were ascertained using the annotated VectorBase (www.vectorbase.org) Gff transcript database Gfus1.4. These resulting transcripts were then compared to the VectorBase Gmm transcript database GmorY1.5 using the VectorBase BLASTx tool to determine their corresponding homologous transcripts in Gmm. Similarly, the same subset of Gff transcripts were blasted (BLASTx) to FlyBase (www.flybase.org) to identify their corresponding D. melanogaster homologs.

Data availability
Original ddRAD data are available from NCBI bioproject 303153. Table 1 provides a list of the Gff genes associated with trypanosome infection that have homologs differentially expressed among Gmm-infected and noninfected flies. Table S1 provides information on the 72 Gmm transcriptomes used in this study and their corresponding SRA accession numbers. Table S2 and Table S3 provide the SNPs associated with trypanosome infection status in Gff using two alternative methods to control for population structure. Table S4 provides a comparison between the new and previous Gmm transcriptome assembly. Table S5 contains the genes DE between infected and uninfected Gmm tissues. Table S6 describes the multiple genome feature IDs used throughout the paper and association across the ID types. Table S7 lists the 56 candidate genes identified in the vicinity of representative SNPs associated with trypanosome infection status. Table S8 compares the DE genes in Gmm across tissues. Table S9 provides the results from comparing the candidates SNPs from Gloria-Soria et al. (2016) with those found in this study. Figure 1 shows the DE genes shared across tissues for Gmm and the number of genes expressed by tissue. Figure S1 shows the Manhattan plot before and after clumping the association results based on LD. Figure  S2 shows the analysis pipeline of the Gmm transcriptome. Figure S3 provides the percentage of transcriptome reads from Gmm aligned to the host, bacteria, and parasite genomes. Figure S4 shows the Manhattan plots and histograms of p-values from association for each population of G. fuscipes clumped at an LD threshold of 1%.

Associations between genetic variation and trypanosome infection status in Gff
We used the 73,297 SNPs previously developed for Gff (Gloria-Soria et al. 2016) to identify markers associated with trypanosome infection status (see Materials and Methods). Association results were grouped based on LD clumping to facilitate interpretation of the association signal. We explored several LD thresholds (r 2 = 0.01-0.001) and selected the data set with a null distribution of p-values closest to a uniform distribution ( Figure S1). An LD threshold of r 2 , 0.01 was chosen and yielded a set of 63 SNP groups centered around a representative SNP or "index variant" (PLINK; Chang et al. 2015) shown in Table S2. The resulting Manhattan plot and the corresponding histogram of p-values are displayed in Figure S1C. As shown across Figure  S1, accounting for LD facilitated detection of the association signal. From the 63 representative SNPs, 18 displayed statistically significant associations with trypanosome infection status, after correction for multiple testing (FDR-controlled p-value , 0.05; Table S2A) and accounting for population structure using FastStructure (Raj et al. 2014).
As LD levels can vary among populations, similar analyses were carried out separately for each of the three Gff populations-OT, NB, and MS-with the goal of identifying relevant population-specific markers. For each population, LD clumping yielded 60 representative SNPs (threshold of r 2 . 0.01; Table S2, B-D) and we identified 19, 19, and 16 SNPs in OT, NB, and MS, respectively, which were significantly associated with the phenotype (Table S2, B-D). The corresponding Manhattan plots and histograms of p-values for each population are shown in Figure S4. Table S3 shows the results of the association analysis using PCs as covariates, as an alternative approach to FastStructure (Raj et al. 2014) to control for population structure. Results were similar to those using FastStructure (above), but yielded 64 representative SNPs during the clumping procedure (previously 63). Additionally, one of the previously identified representative SNPs showed significant association values using the PC method (GFvariants_VB2014e_tvcf:KK351818.1:1387926, pval = 0.0046 after FDR).

Enhancing the Gff transcriptome assembly
In our effort to identify Gff genes relevant to the trypanosome infection status, we took advantage of the already existing Gmm transcriptome assembly [GmorY1.4 (Giraldo-Calderón et al. 2015)] and improved it with data from 72 additional RNAseq samples (Table S1). The new assembly (Gmm Tx) covers 41,174 genes with 85,849 transcripts, among which 12,053 are completely matched with previous assemblies and 38,632 are potentially novel isoforms. A comparison between the existing assembly and the one generated in the current study can be found in Table S4.

DE genes in Gmm
DE genes were identified between Trypanosoma-infected and uninfected Gmm samples in four tissues [midgut, salivary glands, cardia (proventriculus), and proboscis], using both count-based methods and the Cufflinks pipeline (Table S5). Both differential expression analysis strategies identified a substantial portion of genes (12-15% in the midgut, 40-50% in the salivary glands, 10-40% in the cardia, and 50% in the proboscis) that were only annotated in the new assembly (Table  S5A). One-third of the DE genes were identified with both edgeR and Cuffdiff methods (Table S5B). Figure 1A-B show the count of genes identified as being DE across tissues.

Cross-species transcript mapping
The newly assembled Gmm transcriptome was then used to enhance Gff existing genomic features. Mapping the new Gmm transcriptome assembly (Gmm Tx) to the Gff genome resulted in 20,448 Gmm transcripts that matched to sequences in the Gff genome. Among them, 15,954 were already annotated in Gff. Another 4494 were mapped to the Gff genome and represent novel annotations. The vast majority of the filtered Gmm Tx (99.11%) mapped to a unique location on the Gff genome. Of the novel annotations, the maximum number of Gmm Tx   mappings after filtering was 21, while the mean number of mappings for all Gmm Tx was 1.017 (median = 1). Details about the genome features used through this cross-species mapping can be found in Table S6.

Candidate Gff loci involved in susceptibility to trypanosome infection
We identified 56 Trypanosoma infection-associated Gff loci (Table S7) by screening for genes 2500 bp upstream and downstream of the representative SNPs displaying a significant association signal with the Trypanosoma infection phenotype (SNPs across all populations [N = 18] and SNPs for each individual population [N = 19, 19, and 16 for OT, NB, and MS, respectively] (see association section above). We also looked for genes within a 2500 bp window around the additional SNP identified across populations, using the PCA method to control for population structure, but found none.
The 56 identified loci can be divided into two categories: "official" and "novel" Gff loci. The first category includes 42 loci that fall within official Gff (GFusI1.3) gene annotations and bear official gene symbols, prefixed with "GFUI." The second category represent 14 Gff loci where no official annotation exists but where a transcript was inferred using cufflinks-deduced Gmm Tx transcripts longer than 100 bp (mapped with 100% identity over $ 90% of its length). These genes, added by virtue of the projection of homologous transcripts of Gmm onto the Gff genome, are referred to by a cufflinks-assigned ID and a "TCONS" prefix (see Table S6 for genome feature description). All official and novel loci reported here have two common characteristics; a SNP with a significant genetic association located within # 2500 bp of its coordinates, and an ortholog (for official loci) or corresponding TCONS Tx (for novel loci) in Gmm (Table S7). Thus, 42 official Gff genes and 14 novel Gff genes (a total of 56 Trypanosoma-associated Gff loci) were identified by this selection criterion and are listed in Table S7. One of these novel genes (XLOC_035548) was not annotated in either of the two genomes. Twenty-nine of all 56 genes identified were DE between Trypanosoma-infected and uninfected Gmm flies (Table 1).
To evaluate whether DE of 29 out of 56 loci in Gmm was significant (not random), we determined how common DE was across tissues in Gmm. We identified the total number of unique DE genes across the Gmm tissues tested and evaluated whether the disagreement in DE was due to a difference in the genes expressed across tissues. Figure 1, A and B show the DE genes shared by tissue, identified by methods cuffdiff, edgeR (Table S5A), and both (identified by both cuffdiff and edgeR, Table S5B), respectively. Between 2000 and 7000 unique DE genes were identified by these methods, with no DE genes shared by all six tissues and only one-tenth shared by at least two tissues (Figure 1, A and B and Table S8). This disagreement in DE could arise from tissues associated with infection differentially or simply because different genes are expressed in different tissues. To address this issue, we compared all expressed genes (genes with at least two reads aligned in at least half of the samples) across tissues. The number of genes expressed by tissue is shown in Figure 1C. A total of 19,110 unique genes were expressed in at least one tissue, with 8786 (46%) expressed across all six tissues. These two findings together indicate that a large difference exists across tissues in terms of the genes associated with infection status, and suggests that finding 29 out of 56 DE loci in Gff is not likely due to DE being common across tissues.

Functional characterization of candidate Gff genes found in association with Trypanosoma infection status
We functionally characterized the 29 candidate genes linked to the phenotype of interest (Trypanosoma-infected vs. noninfected flies) to which a DE Gmm Tx had been mapped. We used this criterion to reduce the possibility of annotation artifacts among our candidates, thus ensuring that only real genes were included. Furthermore, this specific data set of genes would more likely underlie differences in vector competency across both species, given that they are closely related. We searched for D. melanogaster homologs of these 29 Gff genes by BLASTx analysis of the ascertained Gff transcript sequences against the fruit fly genome database [FlyBase (Gramates et al. 2017)]. These 29 genes were assigned to 23 functional homologs in Drosophila, with five of the novel Gff genes being paralogs of Gff official genes (Table 1), and no functional homolog found for the XLOC_035548 gene. The XLOC_035548 (TCONS_00071717) sequence maps to Gmm scaffold "scf7180000652014" from position 87,226 to 87,796; the nearest gene is 20 kb upstream: CSP2 (GMOY010026). Further searching the PFAM-A database of protein domain models (Finn et al. 2016) yielded no known functional domains within the sequence. Table 1 summarizes these results, showing that the products of seven of the Gff genes identified are associated with insect neurophysiology, including putative roles in feeding behavior; seven gene products are associated with DNA regulation, involving mitosis and cell proliferation processes; three gene products are involved in fly immune responses; and six gene products do not fall within any discrete category.
Comparison with previously identified SNPs associated with trypanosome susceptibility The significant SNPs identified in the MS, NB, and OT populations, and the SNPs identified from the combined population set, were compared to the 360 SNPs previously reported as potentially associated with infection status using F st outlier methods implemented in BayeScan (Gloria-Soria et al. 2016). A total of 55 SNP pairs were identified between the two SNP lists, based on the scaffolds they were annotated to. These SNP-SNP pairs and the distance between each pair are reported in Table S9. SNP 709851, located in scaffold 27 and identified in the MS population, was the only SNP identified in both studies as a candidate SNP. The other SNP pairs were located . 34 kb away from each other.

DISCUSSION
Performing genomic association analysis of disease susceptibility in nonmodel organisms is challenging due to the inability to perform controlled laboratory experiments, small sample sizes, and the lack of well-annotated genomes, or the absence of a reference genome to guide the efforts. Since the annotated Gff genome (Gfus1.3) was incomplete, we took advantage of additional genomic resources available for Gmm, a tsetse species for which there is extensive laboratory data (Petersen et al. 2007). Using transcriptome data from Gmm, we complemented missing annotations in the Gff genome, which allowed us to go beyond SNP associations to functionally identify candidate genes. Furthermore, we narrowed down our gene candidate list by identifying homologs that were DE among infected or uninfected Gmm flies that had been experimentally challenged with Trypanosoma (Table 1 and Table S5).

Regions associated with trypanosome infection in Glossina spp
Tsetse flies of the genus Glossina are major vectors of human and nonhuman animal diseases caused by infection with African trypanosomes. Laboratory studies have identified genetic factors for resistance to trypanosomes in Gmm (Hamidou Soumana et al. 2017;Geiger et al. 2015), but no information is available on HAT vector species such as Gff (e.g., host-parasite strain combinations) and in general for any species from natural populations. Gff colonies are difficult to maintain in the laboratory, making experimental manipulations in this species extremely challenging. Here, we tested the hypothesis that natural variation for resistance or sensitivity to Trypanosoma infection in wild populations of tsetse flies has a genetic basis and could be used to identify the genetic components underlying this epidemiologically important trait.
Preliminary analysis of the Gff SNP data set used for this study failed to identify candidate genes involved in susceptibility to trypanosome infection, based on allele frequency differences between treatment groups (Gloria-Soria et al. 2016). Such F st outlier methods are powerful at detecting markers subjected to strong selection pressures, but have low power to detect markers under weak or balancing selection (Narum and Hess 2011;Jensen et al. 2016). Furthermore, the F st outlier approach used in the pilot study, as implemented by BayeScan (Fischer et al. 2011), assumes that the contrasting populations are evolutionary independent from each other and that gene frequencies under neutrality approximate a multinomial Dirichlet distribution (Beaumont 2005). This method is generally appropriate to identify loci involved in environmental adaptation, and Gloria-Soria et al. (2016) successfully identified several candidate loci under this framework. However, the assumption of evolutionary independence is violated in case-control studies, where samples are drawn from the same population (Lotterhos and Whitlock 2014), such as when trypanosome-infected and uninfected flies from the same population are compared. More appropriate for this type of analysis is the haplotype-based analysis also used by Gloria-Soria et al. (2016), hapFLK (Fariello et al. 2013(Fariello et al. , 2014, because it corrects for group coancestry. Unfortunately, this approach lacked the resolution to identify outliers in the Gff data (Gloria-Soria et al. 2016).
Here, we applied genetic association methods, commonly used in human genetics and model organisms, to search for genetic determinants of trypanosome susceptibility in Gff. Association methods assume that common variants have modest effects on disease frequency and can explain the variability of a common disease [CD/CV hypothesis (Reich and Lander 2001)]. Their success in identifying a disease susceptibility locus relies on detecting increases in disease allele frequency in the affected individuals (Reich and Lander 2001). We applied a logistic regression of the phenotype on allele dosage that allows for multiple covariates, thus can control for sample stratification and coancestry by incorporating population structure into the analysis (Hill et al. 2017). LD-based clumping of the SNPs provided a parsimonious representative genome that greatly reduced the complexity of model fitting. Furthermore, this process allowed us to increase the strength of the association signal, which was otherwise hindered by background noise ( Figure S1). We then performed a gene search that extended # 2500 bp away from the selected representative SNPs. This was a conservative search space based on the average LD reported for these populations in Gloria-Soria et al. (2016). Longer LD regions are likely to occur around genes under strong selection and some genes may have been missed in this quest. This approach yielded 56 Gff candidate loci associated with trypanosome infection (Table S7), providing a good starting point to begin the functional validation. The search can be extended, as needed, for future analyses.

Functional characterization of candidate genes
This study suggests that two main functional categories, cell proliferation and neurophysiology, may be important determinants of trypanosome infection outcomes in tsetse flies. Fifty-six genes were identified near SNPs statistically associated with trypanosome-parasitized individuals in Gff, and 29 of these genes were DE between Trypanosoma-infected and uninfected Gmm flies. Out of the 23 Gff genes with functional homologs in Drosophila, seven are involved in cell proliferation processes and DNA regulation (particularly in mitosis), and seven were associated with neurophysiology. These cell proliferation genes include wge, pole2, msps, cdc14, moira, apc, and wee-1. Midgut cell regeneration is an important process in the host defense mechanism against pathogens, and involves the delamination of damaged cells and their replacement with new ones derived from proliferating stem cells (Baton and Ranford-Cartwright 2007;Buchon et al. 2009a,b;Jiang et al. 2011). Differential expression of cell proliferation-associated genes has been observed between trypanosome-infected midgut and salivary gland epithelial tissues of tsetse flies (Aksoy et al. 2016;Telleria et al. 2014;Matetovici et al. 2016). The presence of the parasite within epithelial tissues likely disrupts cellular homeostasis, triggering the insect cell regeneration response. Since tissue integrity is essential for fighting invasive microbes (Buchon et al. 2009a), any mutation impairing cellular regeneration would have major consequences on the ability of tsetse flies to resist infectious trypanosomes. Among the seven genes assigned to the neurophysiology category, four gene products, homologous to Drosophila ptp99A, malvolio, SV2B, and an insulin-like peptide (Table 1), are DE in tsetse flies in response to infection. Little is known about the function of these genes in tsetse flies, but Drosophila Ptp99A is associated with microbiota-dependent nutriment functions (Dobson et al. 2015), and insulin-like protein belongs to a family of genes involved in feeding regulation (Pool and Scott 2014). Interestingly, malvolio, which is expressed in Drosophila's central nervous system, macrophages, and midgut (Folwell et al. 2006), is required for normal taste behavior (Rodrigues et al. 1995). Any mechanism that may lead to an increase in blood-feeding frequency would concurrently increase the exposure of tsetse flies to parasites. Finally, among the three genes involved in the tsetse immune response are the Drosophila homologs of the toll and dorsal genes, which encode the receptor and transcription factors, respectively, of the Toll pathway. This immune pathway is a major component of the insect innate immune system (Valanne et al. 2011). In tsetse, the Toll pathway is induced upon parasite infection (Lehane et al. 2003) and activates the production of antimicrobial peptides that kill invasive microbes, including African trypanosomes (Boulanger et al. 2002). The identification of SNPs associated with trypanosome infection in two major genes of this pathway supports the important role of the insect immune response in the control of the parasite upon its entrance into the tsetse fly's gut.
These findings are consistent with physiological outcomes associated with parasite infections in many insects, including tsetse flies, and encompass modifications in behavior and locomotion (Hurd 2003; van Houte et al. 2013;Caljon et al. 2016;Van Den Abbeele et al. 2010).

Conclusions
The methodological approach used here allowed us to identify 56 trypanosome infection-associated loci in Gff, 29 of which were DE between Trypanosoma-infected and uninfected flies in the closely related species Gmm. The gene products identified by this screen are putatively involved in two main functional categories, cell proliferation and neurophysiology, suggesting that these processes may be determinants of the outcome of trypanosome infection in tsetse flies. Identification of the genetic determinants of trypanosome infection in Gff, and in tsetse flies in general, can provide targets for pharmaceutical and genetic interventions aimed at aiding the fight against the disease in endemic regions. Once disease-specific alleles have been identified, disease susceptibility could be actively monitored in the different tsetse populations by determining their frequency. This information can guide control efforts focused on disease elimination by prioritizing areas of higher risk. Future work will involve the screening of a large number of populations to both increase the sample size and cover different genetic backgrounds to validate this finding, followed by functional work on the top candidate genes identified by this work.

ACKNOWLEDGMENTS
We thank N. P. Havill for providing helpful comments on this manuscript and the Ambrose Monell Foundation for support. Funding was provided by National Institutes of Health grant R01 A09620 awarded to S.A. and A.C., and by Fogarty grant D43TW007391.