A Multipurpose, High-Throughput Single-Nucleotide Polymorphism Chip for the Dengue and Yellow Fever Mosquito, Aedes aegypti

The dengue and yellow fever mosquito, Aedes aegypti, contributes significantly to global disease burden. Genetic study of Aedes aegypti is essential to understanding its evolutionary history, competence as a disease vector, and the effects and efficacy of vector control methods. The prevalence of repeats and transposable elements in the Aedes aegypti genome complicates marker development and makes genome-wide genetic study challenging. To overcome these challenges, we developed a high-throughput genotyping chip, Axiom_aegypti1. This chip screens for 50,000 single-nucleotide polymorphisms present in Aedes aegypti populations from around the world. The array currently used genotypes 96 samples simultaneously. To ensure that these markers satisfy assumptions commonly made in many genetic analyses, we tested for Mendelian inheritance and linkage disequilibrium in laboratory crosses and a wild population, respectively. We have validated more than 25,000 of these markers to date, and expect this number to increase with more sampling. We also present evidence of the chip’s efficacy in distinguishing populations throughout the world. The markers on this chip are ideal for applications ranging from population genetics to genome-wide association studies. This tool makes rapid, cost-effective, and comparable genotype data attainable to diverse sets of Aedes aegypti researchers, from those interested in potential range shifts due to climate change to those characterizing the genetic underpinnings of its competence to transmit disease.

species-specific challenges. As is the case for many eukaryotes, wholegenome sequencing for routine genetic screening is cost-prohibitive and unnecessary in Ae. aegypti. Moreover, the structure, size, and content of the Ae. aegypti genome makes the collection, interpretation, and analysis of whole-genome data difficult. At least two thirds of the nuclear genome is composed of repeats, duplications, or transposable elements (Nene et al. 2007), and there are several nearly complete nuclear copies of the mitochondrial genome (Hlaing et al. 2009). Microsatellites, sometimes referred to as simple sequence repeats, or SSRs, are highly informative and mutable markers; yet, they require significant expertise and investment to develop. Even with access to the entire Ae. aegypti genome the yield was~30 or fewer useable loci (Lovin et al. 2009;Brown et al. 2011). Microsatellites also are subject to human error in scoring and require special case-by-case attention in cross-study and cross-laboratory comparison, making their use as high-throughput markers difficult (Jones et al. 1997;Van Oosterhout et al. 2004;Pasqualotto et al. 2007). Single-nucleotide polymorphisms (SNPs) are information-poor when analyzed in small numbers, because they are usually chosen to be biallelic. Nevertheless, their simplicity and abundance allows one to compensate for per-marker information content by screening large numbers (Morin et al. 2004;Weller et al. 2006;Smouse 2010). Although bias attributed to the way the SNPs were attained does shift allele frequency spectra, with care this can also be corrected for (Nielsen et al. 2004). Restriction-site associated DNA sequencing, or RAD-seq (Hohenlohe et al. 2010;Peterson et al. 2012), recently has become a popular way to simultaneously identify and screen for SNPs across genomes, but data analysis is nontrivial, and there is still debate as to what is best practice for handling and interpreting missing data (Arnold et al. 2013;Huang and Knowles 2014).
Under these circumstances, and especially for this mosquito species, the development of a genotyping chip provides an excellent alternative for fast, high-throughput, and cost-effective screening of thousands to millions of SNPs. Ideally, markers included in a chip should have the following characteristics: 1) There should be as many markers as possible; 2) they should be present as a single copy in the genome; 3) they should be inherited in a Mendelian fashion, and 4) they should be distributed across known positions in the genome so that physical linkage can be distinguished from linkage disequilibrium due to other processes.
To this end, we developed a genotyping chip that contains probes for 50,000 SNP loci, distributed throughout the Ae. aegypti genome. Here, we validate a subset of the markers included on this chip, and show that these markers yield results consistent with previous studies of genetic variation in Ae. aegypti.

MATERIALS AND METHODS
Mosquitoes used in double digest RAD-seq (ddRAD-seq) Ae. aegypti were collected from 25 locations from around the world (Table 1) for SNP development and genotyping. Some of these population samples were used in a previous study (Brown et al. 2014), although the individuals are different. Mosquitoes were collected as eggs in the field and reared to adulthood, or were obtained as preserved larvae collected in the field. Genomic DNA was extracted for each individual mosquito using DNeasy extraction kits (QIAGEN).

RNA-seq marker development
Fastq data from Sequence Read Archive accession numbers SRP015697 and SRP035216 were aligned to the Ae. aegypti reference genome by TopHat (version 2.0.5). We removed reads generated from polymerase chain reaction duplicates with Picard v1.96, and reads with mapping quality lower than 30 were excluded for further analysis. Variants were called separately for each sample by GATK UnifiedGenotyper using default parameters only in regions from a list of genes compiled from literature associated with dengue competence and insecticide resistance (Supporting Information, Table S1). This decision was made partially due to computational restraints, but mostly because there was already an abundance of data to choose from in the analysis of the RAD-seq data. Variants were then filtered for depth of coverage (7·) and allele frequency (minor allele frequency $0.2). To avoid multiallelic sites, we calculated the percentage of reads with either reference allele or the called alternative allele in the total number of reads covering the corresponding site and those with less than 0.85 were left out for further analysis. These potential sites were further filtered for 35 bp of invariant sequence on each side to allow for probe design.
RAD sequencing and marker development RAD-seq DNA samples were sent to Floragenex in Eugene, Oregon, for library preparation. ddRAD DNA samples were prepared at Yale using the protocol described in the ddRAD manuscript (Peterson et al. 2012). Genomic DNA was digested with enzymes MluCI and NlaIII (New England BioLabs), barcoded with oligos described in Peterson et al. (2012), and fragments were size-selected to be~195 bp 6 20 bp of genomic DNA with a BluePippin (Sage Science). This size was chosen to maximize the markers recovered per-individual and increase the chances of finding SNPs with 35 bp of known flanking invariant sequence. Two 24-individual ddRAD libraries were sequenced on two lanes on an Illumina Hisequation 2000 at the Yale Center for Genome Analysis. All RAD sequence data were mapped to the Ae. aegypti reference genome (Nene et al. 2007) (AaegL1) using Bowtie2 (Langmead and Salzberg 2012 filtered for depth of coverage (minimum 7x), quality (minimum variant phred score 25), missing data (maximum missing 20%), and 35 bp of invariant sequence on each side to allow for probe design. These variants, combined with those identified in the RAD-seq data described in Brown et al. 2014), produced a set of 139,654 potential probes. These probes were then scored using the Affymetrix SNP chip design pipeline, taking into account which probe pairs will perform well based on thermodynamics, self-hybridization, and copy number present in the reference genome. The 50,000 best rated sets of probes were chosen for final printing of the chip, Axiom_aegypti1. For a list of probe sequences and genome positions, see Table S2.

Positional bias
We measured Pearson's r, using the scipy.stats v 0.14 function pearsonr, between pairwise SNP distance and supercontig length to test whether there is significant positional bias between SNPs across the genome.
SNP genotyping and quality control Genomic DNA was sent to The Functional Genomics Core at University of North Carolina, Chapel Hill in batches of 95 samples with one negative control and genotyped using manufacturer protocols. We analyzed Axiom_aegypti1 data using Genotyping Console v4.2 using default parameters outlined as best practice for non-human samples, and the R package SNPolisher v1.4 (Affymetrix, Santa Clara, CA) with default parameters, except with the call threshold set to 95%, to generate and post-process genotype calls. We genotyped a total of 160 wild and 101 lab-reared mosquitoes for this study. Full probe-by-probe results can be found in Table S3.

Mendelian inheritance
We generated single pair matings from 5 laboratory colonies (Table S4), which generated 27 cohorts of F 1 offspring. All parents were genotyped using Axiom_aegypti1. We chose 32, 31, and 32 individual F1 offspring from crosses 5, 16, and 25, respectively. These crosses were chosen for genotyping so that they maximized the number of SNPs where the parental genotypes were heterozygous in one parent and homozygous in the other. This allowed for maximal power in testing of Mendelian inheritance across markers. We performed x 2 tests comparing expected and observed genotype frequencies in each of the three cohorts separately, combining P values where possible using Fisher's method. We were able to test a total of 16,111 SNP loci in at least one cohort. These P values were then sequential Bonferroni-Holm corrected.
n  (132) The numbers in parentheses are those used for principal component analysis ( Figure 2).

Linkage disequilibrium
We carried out pairwise genotypic r 2 tests on SNPs not eliminated in quality control or Mendelian tests. We measured linkage disequilibrium as implemented in plink v1.07 (Purcell et al. 2007), with no distance or r 2 filters, on 40 individuals collected from Jacobina, Bahia, Brazil. We generated a log-scaled two-dimensional histogram of r 2 values vs. distance using the python package matplotlib v1.4.0. Locally weighted scatterplot smoothing curves were generated with python package statsmodels v0.5.0 for r 2 greater than or equal to 0.2 and all r 2 values.
Principal components analysis (PCA) and geographic assignment We used a scaled matrix of genotype frequencies for wild mosquito samples for a PCA as implemented in the python package scikit-learn (Pedregosa and Varoquaux 2011). To avoid results biased by oversampling, we randomly subsampled the population of 40 individuals down to 12 (Table 2).

RESULTS
To develop the SNP markers for this chip, we mined previous RADseq generated SNP data ( (Table 1). In the RNA-seq data we focused specifically on genes identified in a literature search as correlated or functionally related to dengue competence and insecticide resistance (Table S1). The rest of the SNPs were randomly distributed throughout the genome. After quality filtering the SNP data, we used the highest scoring 50,000 SNP markers as ranked for expected performance on an Affymetrix Axiom MyDesign array ( Figure 1A), which we named Axiom_aegypti1. There was little overlap between RAD-seq datasets, as expected when different digestion enzymes are used. The average genomic distance between SNPs on the chip is 24,741 bp, with at least one SNP on 1950 supercontigs. This set of supercontigs represents 96.95% of the total published genomic sequence (Nene et al. 2007) and constitutes 116 of the 120 supercontigs placed on the 3 Ae. aegypti chromosomes in a recent florescence in situ hybridization study (Timoshevskiy et al. 2013) (see Table S2 for full description of probes). There is no significant correlation between supercontig length and inter-SNP distance, which suggests that there is little if any bias in SNP position relative to others across the genome (Pearson's r = 20.007, P = 0.133, Figure 2). The SNPs on the chip were further tested for array performance by genotyping 160 wild individuals from different populations (Table 2 and Figure 3A). A total of 10,183 SNPs showed low array performance, 7969 SNPs had evidence of a third allele or possible duplication event, and 4808 SNPs were monomorphic in populations genotyped ( Figure 1B, Table 3, and Table S3). We then tested the remaining 27,040 SNPs for a Mendelian pattern of inheritance by using the chip to genotype three F1 full-sibling cohorts and their parents. We found 1451 SNPs to be significantly deviant from what is expected under a Mendelian genetic model of inheritance ( Figure 1C, Table S4 and Table S5). This could either be due to true departure from a model of Mendelian inheritance, or due to a genotyping error in the parents. These positions were excluded, leaving 25,589 SNPs for further analysis. We then examined pair-wise linkage disequilibrium, as measured by r 2 , in a South American population for which we had the highest sample size (n = 40). The global mean r 2 value is 0.0348. The averages for SNP pairs on the same and different supercontigs are 0.1797 and 0.0345, respectively. When only SNP pairs with r 2 values of 1 on the same supercontig are examined by themselves, 50% of these pairs are within a distance of~148kb of one another ( Figure 4C). We measured relatedness (as measured by the unadjusted Ajk statistic (Yang et al. 2010)) and tested for departures from Hardy Weinberg equilibrium in these SNPs. We found 3 individuals that may be related to one another ( Figure S1 and Table S6), and less than 1% (204 SNP loci) that are out of Hardy Weinberg equilibrium after multiple test correction, a = 0.05 (Table S7).
Next we performed principal components analysis on the wild individuals to examine variation in natural populations. Strong geographic signal is evident in the first two principal components when examining all populations together ( Figure 2B), as found in previous studies (Brown et al. 2014;Ra si c et al. 2014). When analyzed separately (Figure 2, C2E), populations from Africa, Asia and Pacific Islands, and the Americas are clearly diagnosable at resolution not seen previously (Brown et al. 2011(Brown et al. , 2014.

DISCUSSION
Here we describe the development of a SNP chip for Ae. aegypti and test a set of SNP markers useful in genotypic analysis of natural populations. This robust set of SNPs do not suffer from the difficulties inherent to the development and analysis of microsatellites (Chambers et al. 2007;Brown et al. 2011) or genotype-by-sequencing markers (Ra si c et al. 2014). There are at least~25,000 useful markers on the chip, named Axiom_aegypti1, that are Mendelian in nature and distributed throughout the Ae. aegypti genome.
With additional geographic sampling and more substantial population study, this preliminary and conservatively filtered set of SNPs will grow. Future results will benefit from the Bayesian base-calling algorithms employed in the genotyping pipeline, increasing the number of high-confidence SNPs per sample as well as the number of SNPs that pass quality control. Furthermore, although 4800 probes were monomorphic and therefore uninformative in our samples, some of these may prove polymorphic in future populations samples.
As more results are generated, we encourage their archival in public databases so they may be mirrored in Vectorbase (Megy et al. 2012). We are hopeful that this database will grow rapidly as data generation using this platform is fast, on the order of a few days. Using such an open framework will yield data that are easily combined, shared across research groups, and visualized, further lowering the barrier to synthesis and analysis. Researchers now have the capacity to rapidly generate and synthesize results to address questions that were previously recalcitrant or unanswerable in this system.
A database of genetic variation for Ae. aegypti is a valuable tool, especially in the face of the spread and increase in incidence of dengue and Chikungunya viruses. Cataloging genetic variation with Axiom_aegypti1 will be crucial to tracking and better understanding of the patterns of movement of Ae. aegypti at fine geographic and temporal scales. Refining current hypotheses on the evolutionary origins of Ae. aegypti (Moore et al. 2013;Powell and Tabachnick 2013;Brown et al. 2014), will improve understanding of current patterns of invasion and other questions in invasion biology and domestication. In addition to basic science, this database will be central to better understanding of the genetic underpinnings of genotypic factors in Ae. aegypti relevant to human health.
A more immediate evolutionary application of these markers is exploring how population structure and genetic differentiation affect the interface between wild populations and genetically modified strains being released to fight Ae. aegypti borne illness. This can, in turn, inform control strategies e.g., timing of releases, measurement of program efficacy, etc. During the release of genetically modified organisms, detecting any level of introgression can be valuable to assess the spread of undesired traits into wild populations, to reduce public concerns, or to evaluate the effectiveness of different release strategies. Screens for single locus diagnostic alleles are likely to miss incipient introgression or hybridization where recombination has eliminated the gene that is n  being screened. Therefore, the ability to measure these genetic parameters at the scale of the genome provides a great asset to release programs employing genetically modified organisms. Various genetic control strategies, both currently being deployed and under development, are reviewed by McGraw and O'Neill (McGraw and O'Neill 2013). Genetic data in the face of these strategies can also help to assess the effects of a release in a more holistic evolutionary and ecological framework (David et al. 2013). Ultimately, the field application of these still nascent technologies demands rigorous and independent analysis of their effects and efficacy. The identification of genetic backgrounds underlying various factors immediately relevant to human health will also benefit from using this tool. Laboratory study of recently caught wild populations with varying levels of competence to transmit virus or feeding behavior can be screened for genotypes that are especially susceptible or refractory to such phenotypes using GWAS as has been done in other systems (Welter et al. 2014). Given the average distance between SNPs on the Axiom_aegypti1 chip, the observed linkage disequilibrium ( Figure 4A), and the fact that the Ae. aegypti genome constitutes 200 cM (Severson et al. 2002;Timoshevskiy et al. 2013), this chip will also prove useful in quantitative trait locus mapping and genomewide association studies; there are at least~127 markers per centimorgan. In contrast to Anopheles gambiae, where linkage disequilibrium breaks down at distances as small as 200 bp (Harris et al. 2010;Weetman et al. 2010), these markers should prove informative to traits of interest at greater genomic distances in the larger genome of Ae. aegypti (Black Iv et al. 2008). Although the average r 2 we measured in one South American population is globally low, there are good indications of many useful genome-wide association study markers (Figure 4). Larger-scale study is needed to further characterize linkage disequilibrium these markers in multiple populations from around the world. Overall, this tool shows promise to advance the genetic study of Ae. aegypti significantly while allowing those without extensive genomics backgrounds to participate and drive progress.

Data availability
The Axiom_aegypti1 chip is available from Affymetrix in 96 sample Axiom array format. Individual-level data on the samples included in this study are available in European Molecular Biology Laboratory2 European-Bioinformatics Institute BioSamples group SAMEG188691, and their genotyping information is mirrored in VectorBase (http:// www.vectorbase.org).