Exploring the origin and degree of genetic isolation of Anopheles gambiae from the islands of São Tomé and Príncipe, potential sites for testing transgenic-based vector control

The evolutionary processes at play between island and mainland populations of the malaria mosquito vector Anopheles gambiae sensu stricto are of great interest as islands may be suitable sites for preliminary application of transgenic-based vector control strategies. São Tomé and Príncipe, located off the West African coast, have received such attention in recent years. This study investigates the degree of isolation of An. gambiae s.s. populations between these islands and the mainland based on mitochondrial and ribosomal DNA molecular data. We identify possible continental localities from which these island populations derived. For these purposes, we used FST values, haplotype networks, and nested clade analysis to estimate migration rates and patterns. Haplotypes from both markers are geographically widespread across the African continent. Results indicate that the populations from São Tomé and Príncipe are relatively isolated from continental African populations, suggesting they are promising sites for test releases of transgenic individuals. These island populations are possibly derived from two separate continental migrations. This result is discussed in the context of the history of the African slave trade with respect to São Tomé and Príncipe.


Introduction
Despite many years of research and control efforts, malaria remains a major health problem. It kills more than a million people a year and costs Sub-Saharan Africa an estimated US$12 billion in gross domestic product annually (Mocumbi 2004). The development and application of transgenic mosquitoes refractory to transmission into natural populations is one vector control strategy currently under consideration . However, before such strategies are adopted researchers must have a firm knowledge of the vector's ecology and its genetic basis. Understanding current and past evolutionary processes in this complex may help in the battle against malaria by directing vector control programs in a way that will maximize their efficiency and minimize their potential risks. Further, before any releases of transgenic organisms are undertaken on a continental scale, preliminary studies in relatively controlled environments need to be conducted.
The Anopheles gambiae mosquito complex is the major African malaria vector and has served as a model in the study of speciation. Genetic studies mainly within Anopheles gambiae sensu stricto have revealed complex patterns of population structure (Lehmann et al. 2003;della Torre et al. 2005). Patterns of paracentric inversions in chromosome 2 suggested that this species could be divided into distinct chromosomal forms (Coluzzi et al. 1979(Coluzzi et al. , 1985Bryan et al. 1982;Coluzzi 1982). However, subsequent studies found little concordance between chromosome forms and molecular markers (Besansky et al. 1995;Mukabayire et al.1996Mukabayire et al. , 2001Gentile et al. 2001). More recently, two distinct molecular forms, termed M and S, were described based on fixed differences in the intergenic spacer gene region (IGS) of the ribosomal DNA (rDNA) (Favia et al. 1997;della Torre et al. 2001). Notably, heterozygote individuals of the molecular forms are rarely observed even in sympatric populations (della Torre et al. 2005;Tripet et al. 2001;Wondji et al. 2002;Diabate et al. 2003; Barnes et al. 2005). Although several other nuclear markers show either near complete linkage disequilibrium, significant genetic distance, or large frequency differences between the molecular forms (Weill et al. 2000;Gentile et al. 2001Gentile et al. , 2002Gentile et al. , 2004Diabate et al. 2003; Barnes et al. 2005;della Torre et al. 2005), many other loci show little differentiation between the molecular forms (Gentile et al. 2001;Lehmann et al. 2003). Turner et al. (2005) hybridized population samples of genomic DNA from each molecular form to microarrays and found that three regions (<2.8 Mb) are the only locations where these forms are significantly differentiated. Therefore, neither recent reproductive isolation nor locus specific selection alone seems to be able to explain the genetic patterns and population structure between the M and S forms.
The archipelago of São Tomé and Príncipe (STP), in the Gulf of Guinea off the west coast of Africa may be a suitable location for the initial release of transgenic mosquitoes (Pinto et al. 2002(Pinto et al. , 2003. At the time of their discovery in 1471 ad, the islands were uninhabited (Tenreiro 1961). Anthropophilic mosquitoes are likely to have reached the islands in the late 15th century when the islands were first colonized, as malaria was a major health problem from the earliest days of human settlement (Baptista 1996). In 1946, five species of anopheline mosquitoes were recorded from the islands, two of which, An. gambiae and Anopheles funestus, are malaria vectors (Mesquita 1946). However, following an intensive intradomiciliary DDT spraying campaign from 1980 to 1982, a survey conducted in 1986 found only An. gambiae and Anopheles coustani (Ribeiro et al. 1990). More recently, an extensive survey utilizing both cytogenetics and PCR techniques found the FOREST chromosomal form and M molecular form of An. gambiae s.s. to be the only malaria vector present on the islands (Pinto et al. 2000). Pinto et al. (2003) found no evidence of an impact from the DDT campaign on the effective population size of An. gambiae s.s. on the island of São Tomé. This study also found significant differentiation between the southernmost and the northern localities on São Tomé. These island populations are also significantly differentiated from An. gambiae S molecular form populations from mainland Africa (Gabon), but no comparisons with the continental M forms were carried out in that study (Pinto et al. 2002).
In this study, we generated mitochondrial and nuclear rDNA sequences for a large number of An. gambiae s.s. from mainland Africa (including both S and M molecular forms) and STP island populations [673 mitochondrial DNA (mtDNA) and 634 rDNA]. We then combined these sequences with sequences already generated in previous studies (160 mtDNA/147 rDNA) to compare them with sequence data generated from STP (Table 1, Fig. 1). The aims of this study were to: (i) determine the likely mainland origin of the mosquito populations of STP and how this relates to human migration patterns and (ii) assess the degree of isolation and restrictions to gene flow between the islands and the continent to determine STP's suitability as a potential release site for genetically modified mosquitoes.
Our markers of choice for this study were based on several important factors. To examine the origins of mosquito introduction to the islands and its relationship to human history, we needed several markers that differed in their rate of evolution but evolved sufficiently slow to capture historical gene flow. The use of mtDNA coupled with the fast evolving internal transcribed spacer -ITS (rDNA) seemed to satisfy this need and at the same time avoid focusing exclusively on current gene flow as is estimated by frequency-based approaches associated with microsatellite marker use (Sunnucks 2000). Also, inherent difficulties with using such fast evolving markers like microsatellites in a situation like ours, where evidence indicates a recent population expansion for the continental populations, and the fact that the island populations are likely not in mutation-drift-equilibria makes unbiased gene flow estimation challenging. However, even with these limitations, such work is currently underway (Pinto, personal communication).
Additionally, ITS has been used extensively in determining differences at and below the species level in anophelines (Hackett et al. 2000;Torres et al. 2000;Beebe et al. 2001;Manonmani et al. 2001). Within interbreeding populations these tandem arrays go through rapid homogenization through concerted evolution and give rise to differences between noninterbreeding populations. Both rDNA markers ITS and IGS segregate into distinct molecular forms (M and S) and seem to have the level of variation appropriate to track recent but historic gene flow (Lehmann et al. 2003) in contrast to other regions that retain the ancestral signal of a rapid range expansion across Africa coincident with human agricultural expansion (Coluzzi et al. 1979;Coluzzi et al. 1985). This region is one of the most used markers for work on intraspecific differentiation in the An. gambiae complex. As such, we were able to compile ample data from other studies and analyze them with our newly generated data, something that is in practice nearly impossible with microsatellite data generated in different laboratories. In addition, the 'universality' of the use of ITS marker in Anopheles studies makes it easy for other researchers to relate directly their work with our results.
Lastly, a previous study (Gentile et al. 2001) has described a novel type of ITS-rDNA (type III) in São Tomé that had not been found on the mainland. This could indicate strong isolation but deserves further attention via a much more extensive sampling of the continental populations. Without shortcomings for all of the above mentioned reasons we believe that the combination of the use of ITS-rDNA and mtDNA can provide novel insights into the origin and genetic structure of An. gambiae s.s. populations on the islands of San Tomé and Príncipe.

Materials and methods
Samples and data collection Anopheles gambiae s.s. samples were collected and pooled into 37 sites from across Africa. DNA sequence data were also obtained from published studies (ITS; Gentile et al. 2001Gentile et al. , 2002; NADH dehydrogenase subunit 5 -ND5; Donnelly et al. 2001Donnelly et al. , 2004. DNA extractions were performed using DNeasy Tissue Kit (Qiagen Sciences, Germantown, MD, USA). Specimens were assigned to species and molecular form using the PCR-RFLP method   Fanello et al. (2002). Table 1 lists localities and sample sizes and indicates which DNA sequences were produced de novo and which were gathered from GenBank/previous studies. A 570 base-pair (bp) segment of the mitochondrial ND5 gene was amplified following the protocols of Besansky et al. (1997) with minor changes. We used the primers pair 7013/Dmp3A and altered the thermocycling conditions to an initial 10-min denaturation at 94°C, followed by 35 cycles of a 45-s denaturation at 94°C, 45-s annealing at 47°C, and a 1-min extension at 72°C, followed by a final 5-min extension at 72°C. A 840-bp fragment of rDNA including the internal transcribed spacers (ITS) and the intervening 5.8S rDNA was amplified using the protocols and primers sets of Gentile et al. (2001). PCR products were cleaned using QIAquick PCR Purification Kit (Qiagen Sciences). Sequencing was carried out at the DNA Analysis Facility on Science Hill (Yale University, CT, USA).

Data analysis
All DNA sequences were edited using sequencher 4.1 (Gene Codes Corp., Inc., Ann Arbor, MI, USA). All sequences (ND5 and ITS) were aligned using the high throughput multiple sequence alignment program MUS-CLE (Edgar 2004). The ITS gene is an X-linked region and therefore diploid only in females. In addition, it shows an extremely low frequency of heterozygotes and has a very complex evolutionary history (Gentile et al. 2001(Gentile et al. , 2002. For these reasons, we treated the ITS data as haploid markers. We estimated genealogic relationships between both ND5 and ITS haplotypes by constructing a parsimony network using the tcs software version 1.21 (Clement et al. 2000). At the population level, this method is preferred over traditional phylogenetic tree reconstruction as it does not require bifurcating relationships and does not assume that the ancestral sequence/s are missing. We also used the tcs software (Clement et al. 2000) to identify the most probable ancestral haplotypes.

ND5 sequence data
We used the ND5 sequence data to test the degree of separation between island and continental populations through the estimation of frequency-based F ST values. Although shared ancestral polymorphisms, which are very likely to occur in this data set, often confound the effects of population history with current gene flow, we determined these values and used them to obtain 'gross' estimates of separation to guide further analyses. The F ST values were calculated using the program DnaSP (Rozas et al. 2003). F ST matrices were imported into paup* (Swofford 2002) to estimate a neighbor-joining (NJ) tree to visualize the degree of separation of the populations. To run statistical permutation tests of population subdivision between the STP populations and the continental populations with the smallest F ST values, we used DnaSP (Rozas et al. 2003) to estimate three statistics: C 2 , a haplotype frequency-based statistic recommended for nonrecombining data (Hudson et al. 1992); K ST *, a sequence-based statistic that utilizes information on the number of differences between haplotypes and not just the frequencies of haplotypes, giving less weight to large number of differences than other sequence-based statistics (Hudson et al. 1992); and S nn , the nearest-neighbor statistic (Hudson 2000).

ITS sequence data
The ITS marker is one of the few markers shown to be divergent between the An. gambiae s.s. molecular forms and thus most likely to reflect current rather than historical population structure (della Torre et al. 2005). As such, we used our ITS sequence data to assess current patterns of genetic variation between island and mainland populations.
We used a two-step analytical strategy. We first identified the continental populations that were the most likely source for the STP populations by estimating traditional frequency-based F ST values, using the program DnaSP (Rozas et al. 2003) and visualized their relationship by constructing a NJ tree (paup*, Swofford 2002). Locality samples containing both molecular forms were treated as distinct subsets. The lowest island-continental pairwise F ST values identified continental populations of potential origin.
Second, we used nested clade analysis (NCA) to assess these patterns within our total ITS data set. When assessing patterns of genetic variation at the intraspecific level it is often difficult to distinguish current population structure from population history using traditional population genetic estimates such as F ST (Templeton et al. 1995). For instance, two populations sharing similar alleles at similar frequencies could be the product of ongoing gene flow (current population structure) or a past range expansion of the organism (population history). NCA uses haplotype frequencies in conjunction with the genealogic relationships and geographic distribution of the haplotypes to distinguish current structure from historical events (Templeton et al. 1995). Although NCA has received some recent criticism (Knowles and Maddison 2002), it provides valuable information when inferences are supported by independent biologic knowledge of the system and caution is used in assessing the results. In the present situation, we were particularly interested in inferences of significant fragmentation Origins of An. gambiae in Sao Tomé and Principe Marshall et al. or range expansion between island and continental populations.
To implement the NCA, an ITS network was first constructed using the program tcs (Clement et al. 2000). Haplotypes were connected using a 95% parsimony limit. A number of ambiguous connections or loops in the resulting haplotype networks were broken using the criteria in Crandall and Templeton (1993). The total network was then nested (Templeton 1998) and input into geodis version 2.4  together with sample location information. We then performed 1000 permutations tests to test for association between phylogeny and geographic distribution. Clade distances (D c ) and nested clade distances (D n ) were measured and interior and tip clade differences estimated. Templeton (2004) revised inference key (Modified 11 November 2005) was applied to the clades with significant results from geodis to determine the outcome of the NCA.

ITS sequence analysis
We combined our ITS sequence data with the published ones (Gentile et al. 2001(Gentile et al. , 2002 to form a data set consisting of 781 samples of 840 bp fragments. This resulted in 24 different haplotypes, 12 of which have not been previously described (haplotypes 6,7,8,11,12,13,15,16,17,18,19,and 22,Appendix 3). Figure 3 shows the phylogenetic relationships between ITS haplotypes using a tcs network (Templeton et al. 1995). The topology of this network is similar to the one described by Gentile et al. (2002) with the exception of a new loop formed through In3, which connects the Type I and Type II haplotypes more directly than in the previously published network. In S-form An. gambiae, haplotypes IA and ID were found through western and eastern Africa (Fig. 4). In the M-form, haplotype IIA was found across western Africa (Fig. 4). Notably haplotype IIIB, previously found only on São Tomé (Gentile et al. 2002) was found for the first time in the continental populations from Angola. The samples Ang01M and Ang02M (Table 1) both contain this haplotype. The existence of haplotype IIIB in Angola suggests potential for current gene flow and/or the origin for the STP populations. This is also supported by the topology of the NJ tree in Fig. 2B. Figure 3 shows the nested haplotype network used for NCA analysis (Templeton et al. 1995). NCA requires conversion of a haplotype network into a hierarchical set of nested clades following the set of rules given in Templeton et al. (1987) and Templeton and Sing (1993). We followed these rules but nested all STP haplotypes into their own group (clade 2-1, Fig. 3) before nesting them into the rest of the network. We also allowed all M molecular forms and all S molecular forms to nest into their own group (clades 3-1 and 3-2 in Fig. 3) before joining the two groups into the total network. This enabled comparisons between STP and mainland populations and M and S forms to be made. The final nesting (Fig. 3) deviates little from a strict application of the Templeton and Sing (1993) rules. The general pattern and conclusion from NCA for both the continental M and S molecular forms (clades 2-1 and 2-3 respectively) were very similar. Both clades showed widespread ancestral haplotypes with tip haplotypes restricted to single localities within the range of the widespread ancestral haplotypes. Using Templeton's modified inference key, restricted gene flow with isolation by distance (IBD) was inferred in both of these clades. Inferences for clade 1-5 indicated restricted gene flow/dispersal with some long distance dispersal for haplotype IC. Haplotype IC was restricted to the island of Madagascar. In Madagascar, like STP, we find unique or nearly unique haplotypes supporting the notion that oceanic barriers may be hindering gene flow from the continent. NCA results for significant clades are given in Table 3. These results indicate a range expansion for clades 2-1, from Angola to the island of São Tomé and Príncipe, and 2-2, across most mainland populations of Western Africa.

Discussion
Significant levels of genetic differentiation were determined between populations of An. gambiae s.s. on the Haplotypes labeled I are S forms. Small empty ovals represent missing haplotypes. The square haplotype (IIIB), representing the inferred ancestral haplotype, was found in populations St20 (frequency = 0.21), St21 (0.34), Pr22 (0.21), Ang01 (1.00), and Ang02 (0.97). Size of haplotype roughly corresponds to frequency. Lines separating symbols represent mutational steps or single base-pair substitutions. Dotted lines show where loops were broken. Haplotypes are nested into hierarchical clades in preparation for the NCA analysis. Thin lines group haplotypes clustered into one-step clades, dashed lines into two-step clades, and thick lines into three-step clades. Clades are labeled by clade level and hyplotype number, for example, 1-2 is the second haplotype in the one-step clade.
Origins of An. gambiae in Sao Tomé and Principe Marshall et al. islands of São Tomé and Príncipe and mainland Africa indicating that gene flow between them is restricted. We report large F ST values obtained by mtDNA ND5 sequence data between island and continental populations, as well as between São Tomé and Príncipe. In addition, statistical tests confirmed that the STP island populations were significantly differentiated from those continental populations for which F ST estimates had been the lowest (Table 2). However, a few nonsignificant results were obtained for comparisons with the M-form continental population of Ivory Coast (IC16-M). While this might suggest a closer relationship between STP and Ivory Coast, the low sample size (N = 3) of the latter may have influenced these nonsignificant results.
Results of the rDNA ITS sequence data also supported limited gene flow between STP populations and the continent. With the exception of haplotype IIIB that was also found in two mainland populations in Angola, all of the ITS haplotypes found on São Tomé and Príncipe are unique to the islands. Although a single mutational step separates Haplotype IIIB from two common widespread mainland haplotypes (IIA, IA), the fact that the only other place it is found is Angola suggests that An. gambiae s.s. from Angola may be a possible source from which the STP populations were derived. Cuamba et al. (2006) sampled 403 An. gambiae from across Angola and found that 93.5% of them were M form, the same form that is found on STP. This pattern was also recently confirmed by Calzetta et al. (2008). In addition, recent findings by Slotman et al. (2007) based on microsatellite markers on chromosome 3 from samples in Cameroon suggest that the M molecular form may be further divided into M-forest chromosome type and M-mopti chromosome type. STP and Angola M-form population are both forest chromosome type. Figure 2B based on the ITS sequence data shows these localities clustering together, however, the two additional M-forest localities Cam05 and Cam07 cluster instead with other M-form localities. This suggests that further work is needed to understand the extent of further subdivision within the M molecular form.
The F ST values estimated from the ND5 data were the smallest between STP and Ghana, Senegal, and the Ivory Coast but not Angola. Therefore, we cannot exclude that the west coast of Africa, in particular the Gulf of Guinea, may be another possible source of the STP colonizing populations. ND5 haplotypes were also shared between molecular forms and, where both molecular forms were in close geographic proximity, did not cluster on the NJ tree ( Fig. 2A).
The discordance between patterns based on ITS and ND5 sequences is similar to the conflicting pattern between the rDNA markers and most other genomic markers (Lehmann et al. 2003;della Torre et al. 2005). Sperling (1994) suggests that X-linked differences tend to be associated with species boundaries and it appears that such is the case between molecular forms in An. gambiae (Stump et al. 2005). This discrepancy could be due to the different molecular processes at play and rates of B A Figure 4 Geographical distribution of ITS haplotypes that were found in multiple localities. Where distributions of two haplotypes overlap one haplotype distribution is represented by shaded area. Haplotypes are widespread across large distances. Haplotype numbers correspond to those in Fig. 3. A single haplotype IIIB is found in both island (Sã o Tomé and Príncipe) and continental populations in Angola. evolution between the ND5 and ITS rDNA sequences, and/or may track two separate colonization events. The repetitive structure of rDNA is known to give rise to concerted evolution (Arnheim 1983) and therefore can spread new alleles much faster through a population than other markers (Arnheim 1983;Collins and Paskewitz 1996). The process by which rDNA variants arise and convert other copies into themselves would reduce the total genetic variation of the marker and hasten the fixation of alternative alleles. This is what we observed in our study. The ITS haplotype IIIB was found at very high frequencies in the Angola populations (Ang01-M = 100%,  Within and nested clade distances are given, in miles, for both interior (I) and tip (T) clades. PSS, P-value for a significantly small value; PSL, P-value for a significantly large value. I-T distance is the interior clade distance minus the tip clade distance. Inferences based on Templeton key are given for each clade.

Origins of An. gambiae in Sao Tomé and Principe
Marshall et al.
Ang02-M = 97%) but only at intermediate levels in the STP populations (St20M = 21%, St21-M = 34%, and Pr22-M = 21%) as expected when considering a more recent introduction of this haplotype into these islands from Angola migrants. It is therefore possible that the ND5 sequences represent the original colonization of An. gambiae into STP and the ITS sequences reflect a later colonization followed by on-going rapid gene conversion. These results are interesting as they fit loosely with the history of early human migration to STP. Two major waves of human migration have occurred in the history of these islands. The first occurred between 1500 and 1521 ad during which 2500-5400 slaves a year were imported into São Tomé to provide labor for Portuguese efforts at sugar cane production (Lovejoy 2000). The northern arrow in Fig. 1 shows the general direction of this migration.
The islands' sugar production was short-lived by the end of the 16th century. When the demand for cheap labor ended, no major human migration occurred for more than 200 years. This lull was broken when cocoa and coffee plantations were introduced on the islands. These crops, like sugar cane, needed a cheap labor force to be economically viable. The Portuguese brought laborers mainly from Angola and Cape Verde. By the 1860s over 1000 laborers a year were being transported to the islands, resulting in a second major human migration to the islands (Lovejoy 2000). The southern arrow in Fig. 1 shows the general direction of this migration.
Altogether, our results suggest that gene flow between the islands of STP and mainland Africa is likely to be restricted and that there have been only few events of migration of An. gambiae into these islands coincident with episodes of intense human migration. This is not surprising given the anthropophilic nature of this species and man-mediated transportation of insect vectors is a welldocumented phenomenon (Lounibos 2002). These introductions have had important epidemiologic consequences, such as the occurrence of a malaria epidemic in northeast Brazil associated with the introduction of An. gambiae in the 1930s (Killeen et al. 2002;Parmakelis et al. 2008).
By performing NCA, we were able to confirm the range expansion between Angola the islands of São Tomé and Príncipe and also another expansion across central and northwest Africa (della Torre et al. 2005). Anopheles gambiae is thought to have only recently expanded across Africa around 4000-10 000 years ago coincident with human agricultural expansion (Coluzzi et al. 1979;Coluzzi 1982) and thus the historical evidence fits well with these inferences. NCA has provided a statistical framework for testing these events.
This study provides evidence supporting the notion that populations of An. gambiae from the islands of STP are genetically isolated from continental populations. This is in agreement with previous results based on microsatellite data (Pinto et al. 2002). Here, the inclusion of a more extensive data collection from 22 African countries and the use of two distinct genetic markers, mtDNA and rDNA, allowed us not only to confirm a significant degree of genetic isolation but also to infer that probably only a few major colonization events of An. gambiae occurred and that these may have been man-mediated. Gene flow is thus likely to be a rare event due to restrictions imposed by the ocean as a physical barrier preventing migration of insects, as previously observed in this and other mosquito vector species (Simard et al. 1999;Reimer et al. 2005;Turner et al. 2005;Ayala et al. 2006;Foley and Torres 2006). It appears contemporary man-mediated gene flow is negligible even given the fact that traffic between the mainland and the islands is substantial. However, given the potential for man-mediated introduction of vectors highlighted by this and other studies, particular care should be taken in implementing effective vector monitoring schemes in strategic points such as airports and harbors.
Advances in analytical methods and increased resolution of molecular markers are helping scientists to better understand this complicated vector system (Beidler et al. 2003;Barnes et al. 2005;della Torre et al. 2005). The high levels of genetic isolation found coupled with the strong evidence of restricted gene flow suggest that STP islands could be suitable sites for controlled experiments with genetically modified mosquitoes (Benedict and Robinson 2003). The recent upsurge of malaria in many parts of Africa is probably caused by a multitude of factors (climate change, drug resistance, etc.) that will require novel methods of vector control (Nchinda 1998). Genetic control methods offer new hope and the use of genetically modified mosquitoes to control the disease continues to gain support (Christophides 2005). It is important to couple such efforts with studies like ours that use population and phylogenetic analyses in selecting potential first release sites. also supported by the following grants: NIH RO1 AI046018, Studies on mosquito vectors in West Africa, aimed at malaria epidemiology and control -INCO-DC/ EU (IC18CT960030), Malaria vectors in islands, studies on genetic isolation -UNICEF/UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases (TDR) (ID no. A50239) to JP, and UNICEF/ UNDP/World Bank/WHO Special Program for Research and Training in Tropical Diseases (TDR) (ID no. A10435) to AdT.