Genetic Markers of Adaptation of Plasmodium falciparum to Transmission by American Vectors Identified in the Genomes of Parasites from Haiti and South America

Historical data suggest that millions of P. falciparum parasite lineages were introduced into the Americas during the trans-Atlantic slave trade, which would suggest a paraphyletic origin of the extant isolates in the Western Hemisphere. Our analyses of whole-genome variants show that the American parasites belong to a well-supported monophyletic clade. We hypothesize that the required adaptation to American vectors created a severe bottleneck, reducing the effective introduction to a few lineages. In support of this hypothesis, we discovered genes expressed in the mosquito stages of the life cycle that have alleles with multiple, high-frequency or fixed, nonsynonymous mutations in the American populations which are rarely found in African isolates. These alleles appear to be in gene products critical for transmission through the anopheline vector. Thus, these results may inform efforts to develop novel transmission-blocking vaccines by identifying parasite proteins functionally interacting with the vector that are important for successful transmission. Further, to the best of our knowledge, these are the first whole-genome data available from Haitian P. falciparum isolates. Defining the genome of these parasites provides genetic markers useful for mapping parasite populations and monitoring parasite movements/introductions.

Variant calling for population genetic analyses. Two variant calling iterations were performed, one using the 21 Haitian isolates only (Haitian data set) and one using these data plus whole-genome sequencing data for 149 isolates from numerous malarious regions worldwide retrieved from the MalariaGEN database, for a total of 170 isolates (WW data set). Genome sequence data from a total of 27 samples were available from South America, 16 of which originated from Colombia and 11 from Peru. Based on historical data regarding the trans-Atlantic slave trade (3,21), we downloaded genome sequence data from isolates recovered in West Africa: Gambia, Ghana, and Cameroon (10 samples from each country). We also downloaded data from 10 samples from Central Africa (Democratic Republic of Congo) and 10 from each Kenya, Malawi, and Tanzania to represent East Africa. Data on 22 genomes from Papua New Guinea were downloaded as representative of Oceania and 10 each from Cambodia, Myanmar, and Thailand to represent Southeast Asia. Data sets chosen were from paired-end Illumina libraries, with a minimum 100-nucleotide read length. The largest data sets from each country were chosen and whenever possible were larger than 2 Gb. An exception to the read length rule was made for South American data sets, due to the scarcity of samples available; thus, these libraries have 54-to 100-nucleotide paired-end reads. Details regarding the downloaded data are in the supplemental material (Table S2).
When variant calling was done on Haitian isolates versus the coding regions of the 3D7 genome, 447,339 variants were obtained. A filtering pipeline based on the one published by Manske et al. (19) was applied, and after filtering, 22,044 variant loci were retained as reliable for further analyses. These were located in 3,189 different genes. The P. falciparum genome is particularly difficult to map; thus, reliably mapped reads come from just 60% of its ϳ5,300 annotated genes (15). About 90% (19,901) of the variants were SNPs, while the rest were indels, most of which were in frame. Almost three-quarters of the SNPs (14,148) were nonsynonymous mutations resulting in an amino acid change in the translated gene product. This is a known phenomenon, and it has been attributed to continuous positive selection exerted by the host (22). The complexity of infection was assessed using THE REAL McCOIL (23), and all Haitian isolates represented single infections with comparatively few heterozygous alleles consistent with the low-transmission setting.
For phylogenetic purposes, variant calling was repeated on the WW data set. Variant calling produced almost 1,400,000 variants, which were reduced to about 139,000 variants by the filtering process. In order to determine the ancestral relationship of the Haitian P. falciparum to isolates from Africa, Asia, and South America, sites with indels, conserved mutations (those found in all isolates), and singletons were removed, since they provide no phylogenetic information, leaving 50,469 sites in 3,106 genes representing almost 60% of the genes in the nuclear genome.
Population genetics using whole-genome SNPs. Principal-component analysis was used to study the clustering of the samples without assuming an evolutionary model. The PCA shows a clear separation between isolates from different continents (Fig. 2). Haitian isolates clustered separately from South American isolates, while the Peruvian and Colombian clusters partially overlapped. The Haitian isolates grouped into two different clusters, one of which consisted of nine virtually identical isolates, evidence of an apparent epidemic expansion ( Fig. 2A). Discriminant analysis of the principal components (DAPC) was also performed, which minimizes within-group variance, while maximizing between-group variance (24). The results are similar to those of the PCA, yielding evidence for seven different population clusters (Fig. S1).
The sample set was reduced from 170 to 149 sequences by removing samples resulting from apparent clonal (epidemic) expansion (Fig. S2) or having complexity of infection (COI). The Haitian data set was reduced from 21 to nine isolates by selecting the six independent isolates and one representing each of the three epidemic expansions (Table S1). Three isolates from Peru were also removed from two epidemic expansions. When only one isolate representing each epidemic expansion among Haitian and Peruvian samples was utilized in the PCA analysis, similar results were obtained ( Fig. 2B and C).
Phylogenetic relationship of the Haitian parasite to the African, Asian, and South American strains using whole-genome SNPs. Based on historical data, we would expect the P. falciparum strains found in the Americas to have a paraphyletic origin. This scenario was affirmed by Joy et al., using mtDNA sequences (25), and by Yalcindag et al. (26), using genomic SNPs consisting of a mix of coding and noncoding loci. The first data set we used to investigate the ancestry of American parasite populations was the whole-genome SNP alignment (50,469 loci). From these we removed SNPs under strong positive selection in multiple ways. The first alignment subset included putatively neutral SNPs only, as determined by Bayescan v.2.1 (27)(28)(29). This program identifies candidate loci under selection using differences in allele frequencies between populations. After removal of these loci, 48,194 remained. A subset consisting of 15,020 sites was generated by keeping synonymous mutations only. Multinucleotide variants (MNVs) were also eliminated, as different combinations of SNPs in the same codon might result in a nonsynonymous mutation in part of the samples. It is recognized that this approach does not ensure the exclusive selection of neutral mutations, since different synonymous codons could have an impact on gene expression regulation (30). Recognizing that one of the main drivers of selection on P. falciparum upon migration to new areas is the change of vector species (31)(32)(33)(34), we also removed from the data sets for analysis the data from genes which are Ն10-fold upregulated in the late gametocyte (gametocyte V), ookinete, and sporozoite stages (rather than the asexual blood stages) (35,36). The phylogenetic signal of these alignments was verified as shown in Fig. S3.
The neighbor-joining (NJ) tree from the synonymous SNPs minus genes upregulated in the sexual and mosquito stages (13,597 loci) is depicted in Fig. 3. To calculate the tree, we used the log-det model, which has been shown to be robust to biased base composition (37). The other trees are reported in the supplemental material and have similar topologies (Fig. S4), with the separation of populations between continents having strong bootstrap support. The low support for African clades matches the parasite's known high diversity and high transmission rates on that continent (19,38). The monophyly of the American clade is always well supported, and the Haitian P. falciparum population is isolated. Looking at the different trees and bootstrap support values, the population structure of the Colombian and Peruvian parasites is not fully resolved and is probably partially mixed. The isolation of the Haitian population has important epidemiological implications, should a larger sample size, covering other areas of the Western Hemisphere, confirm these findings.
Selective pressure of American mosquito vectors on P. falciparum genes. The results of the phylogenetic analyses, showing a monophyletic American clade, are counterintuitive, because of the scenario involving continuous introduction from different parts of the African continent through the slave trade. Considering the extent to which we went to remove loci potentially under selection, these results likely derive from the use of coding SNPs and reflect a genuine bottleneck which the parasite went through after it was introduced into the Americas. The most dramatic change to which P. falciparum had to adapt was transmission by novel definitive hosts, as the American Anopheles (Nyssorhynchus) spp. diverged from the African Anopheles (Cellia) spp. ϳ100

FIG 3
Neighbor-joining tree calculated using synonymous SNPs. MNVs and SNPs in genes upregulated in the mosquito stages of the parasite life cycle were removed (13,627 sites were retained). The tree was calculated using the log-det model. million years ago (17). Evidence has recently emerged regarding the impact that novel vectors have had on the Pfs47 gene (31-34, 39, 40). Allelic changes in Pfs47 partially control infectivity for different vectors, suggesting that this and perhaps additional genes are under selection during adaptation to novel vector species.
A preliminary test was performed on our alignment as a way to identify potential genes with mutations necessary for adaptation to the American vectors. For these analyses, the Haitian data set was reduced from 21 to the nine isolates representative of the Haitian subclades (Fig. 3) by selecting the six independent isolates and one representing each of three apparent epidemic expansions (Table S1). Three isolates from Peru were also removed from two apparent epidemic expansions. P. falciparum expression data (35,36) were downloaded from PlasmoDB26 database (41), and we identified those genes that are upregulated in the gametocyte, ookinete, and sporozoite stages as opposed to the blood stages. We then split the alignment in two, depending on SNPs belonging to genes upregulated in the mosquito stages compared to the blood stages. We compared the ratio of nonsynonymous to synonymous substitutions (dN/dS); dN/dS is statistically higher in the subset data from genes upregulated in the mosquito stages (1.01 versus 0.53; P Ͻ 0.01), further suggestive of the impact that the vector might have had on the parasite population.
As evolutionary rates and substitution patterns may vary between genes, we tried to narrow our data set to a few more likely candidate genes for further, in-depth analyses. Genes under strong selection were preliminarily identified by filtering to identify nonsynonymous mutations having a frequency of Ն0.7 in the 33 American isolates (9 Haitian, 16 Colombian, and 8 Peruvian) and Յ0.3 in the 70 African isolates; 68 variant genes were retained of the 3,106-gene data set. In this data set, 397 genes (13.1%) were upregulated in one or more of these sexual/mosquito stages of the parasite life cycle, but among the retained 68 genes, the frequency of sexual/mosquitostage genes was almost 2-fold higher, with 17 (25%) being upregulated in one or more of these stages ( Table 1). Twelve of these genes contained a single variant codon with differential frequency; one had two such mutations, three had five, and one had seven. The four genes with the most mutations were TRAP (PF3D7_1335900), CTRP (PF3D7_ 0315200), PSOP26 (PF3D7_1244500), and Pfs47 (PF3D7_1346800).
The consensus sequences of these genes were compiled for a representative subset of the isolates, as described in Text S1. No sign of recombination was found by RDP4  Table S3.

DISCUSSION
Whole-genome analysis for P. falciparum is particularly complicated, due to the repetitive nature and high AT content of the genome, requiring an intense effort to eliminate analysis artifacts. Further, P. falciparum also regularly undergoes sexual recombination, which constitutes an additional obstacle to phylogenetic analyses. Despite these challenges and potential limitations, the resulting data can help answer a multiplicity of questions, including shedding light on the variety of evolutionary drivers acting on the parasite (19,20,45). The whole-genome sequence data reported here are the first available from Haitian P. falciparum isolates. These data were obtained from isolates obtained in Grand'Anse plus one in Sud-Est, two regions with the highest rates of transmission of malaria in Haiti. Comparisons of the genotypes of these isolates to those obtained by Charles et al. (14) indicate that data obtained from the isolates analyzed reasonably represent the entire Haitian P. falciparum population.
A large proportion of the coding SNPs common to the Haitian P. falciparum parasite population are now known, and there are many SNPs that are unique and private to the Haiti population, based on our analysis. This knowledge will aid in the elimination of malaria from the island, by offering the ability to discern imported infections from indigenous ones. Reassessing this scenario with data from additional strains from Central America and the eastern areas of South America will be essential to monitor movements of the parasite between these regions and consequent possible reintroductions.
The shape of the NJ tree (Fig. 3), with long terminal branches, could derive from bottlenecks resulting from multiple strategies to fight malaria but are more likely due to less recent ones resulting from the necessity of adapting to new definitive hosts following migration to Asia and the Americas, followed by genetic drift. The presence   (19,20,46,47), and these data show that the Haitian population is clearly distinct from Colombian and Peruvian populations. Based upon historical data, we would expect the American parasite population to be paraphyletic; however, this is not seen. The monophyly of the American clade as seen in our analyses likely derives from the use of coding SNPs. Our synonymous SNPs may be largely neutral (30,48), but they have probably been subjected to repeated selective sweeps and were fixed along with mutations that are advantageous in Haiti and other regions of the Americas. Since the American isolates appear as a single clade despite millions of introductions in widely separated geographical regions, this suggests a common selective bottleneck. Branching within this clade into distinct subclades is then created by local bottlenecks and geographical isolation followed by genetic drift.
The greatest challenge to the parasite in the transition to the Americas was the sudden change in the definitive host, resulting from the transoceanic migration of the intermediate host. Finding the proportion of genes under strong selection expressed in the sexual/mosquito stages to be twice that which was expected provided further impetus to evaluate the hypothesis that the common bottleneck experienced was the required adaptation to transmission by Anopheles (Nyssorhynchus) spp. in all locales. This was followed by further adaptation to additional species-specific selective pressures presented by local vectors of this subgenus. Genes which are upregulated in the mosquito stages of the life cycle of the parasite are thus potentially under selective drive when the parasite shifts from one vector species to another, and they are scattered across the P. falciparum genome on different chromosomes, which would exacerbate the genetic bottleneck acting upon the parasite population.
It is estimated that the African and American anopheline vectors evolved independently for nearly 100 million years (17), offering an evolutionary basis for major differences in the genus. Further, evidence has recently emerged regarding the impact that the immune systems of novel vectors have had on the parasite genome. Key mutations in the Pfs47 gene allow P. falciparum to escape the complement-like immune system of its definitive host, where different optimal combinations of the amino acid substitutions are necessary for the successful infection of vector species in different  regions of the world (31-34, 39, 40). The crucial role of this gene product was first identified through linkage analysis of the progeny of a cross between GB4 (an African isolate) and 7G8 (a Brazilian isolate). While multiple selective pressures have shaped the low-diversity American parasite populations, our research correctly identified Pfs47 and the four codons (codons 236, 242, 247, and 248) for which there is in vivo evidence of being under selection by the vector (31) The fifth codon identified here (codon 178) might play an additional role in the successful infection of Anopheles albimanus, the dominant vector on Hispaniola, since presumably progenitor strains giving rise to 7G8 were adapted for successful transmission via Anopheles darlingi, the dominant vector in Brazil.
The other three genes we identified have not been previously reported to contribute to adaptation to specific vectors. TRAP (PF3D7_1335900) is a protein with two adhesive domains (the A and TSR domains) and is essential for trafficking to the salivary glands of the mosquito (49). Five of the seven SNPs under selection in American isolates (three in the A domain and two in TSR) are found in these two regions of the protein. CTRP (PF3D7_ 0315200) is a conserved protein essential to ookinete motility and invasion of mosquito midgut epithelium (50). This protein has a COOH-terminal transmembrane domain and a short cytoplasmic domain with a possible rhomboid protease cleavage site adjacent to the external face of the transmembrane domain (50). The function of the fourth gene, PSOP26 (PF3D7_1244500), is unknown; however, our results suggest an important role for this protein, which is expressed in the ookinete stage, in the interface with the vector. Interestingly, both the Honduran isolate HB3 (NCBI accession no. GCA_900631985.1) and the Salvadoran isolate Santa Lucia (NCBI accession no. GCA_000150455.3) present all 24 of the mutations located in these four genes, which we found under selection in the American isolates evaluated here. Conversely, the isolate 7G8, originally from Brazil (NCBI accession no. GCA_000150435. The present results constitute a starting point for investigating gene products which have a high likelihood of having a crucial interaction with the mosquito vector. The next step would be to assess the effect of such mutations in vivo on transmission efficiency in different vectors. This would help not only define the effect of specific alleles but also illustrate the functions of protein products that are poorly characterized thus far. This knowledge of relevant protein variants also might help in developing transmission blocking vaccines by identifying genes products critical for transmission. Such vaccines may target either a crucial parasite protein, as recently demonstrated for Pfs47 (51), or the functional contact between that parasite protein and the vector (52,53).
Other genes, expressed in the human host, might also be under mosquito selective pressure, indirectly. As an example, A. albimanus, the vector on Hispaniola, has a strong (20:1) preference for livestock versus humans and is exophilic and exophagic (54), greatly reducing the intensity of transmission and severity of infection. Parasite alleles promoting enhanced transmission through the Haitian vector and persistent human infections with long-term production of gametocytes were probably under immediate selection. Further, these selective pressures would be expected to limit the productive gene flow from P. falciparum from other hemispheres, including drug resistance alleles, as these strains would not be competitive with the adapted local strains.
Our comparatively small sample and the small number of published sequences from American parasites limit the current phylogenetic reconstruction. Recent technologies, such as selective whole-genome amplification (sWGA) and single-cell sequencing, have been successfully implemented with Plasmodium spp. (55)(56)(57), and blood spots collected on filter paper are finally usable for whole-genome analysis. This will facilitate expanding access to isolates in the future, adding greatly to the data set of whole-genome sequences available from P. falciparum populations in the Americas.

MATERIALS AND METHODS
Sample collection and processing. This study was conducted in accordance with institutional review board guidelines and requirements of the University of Florida and the ethical review board of the Haitian Ministry of Health, after all permits and approvals had been obtained (IRB201400225; MSPP reference no. 1314-62). Blood samples were collected with informed consent from patients who were positive for malaria by a rapid diagnostic test during the period of September 2014 through February 2015 and subsequently deidentified. Aliquots of some of the samples were also placed in in vitro culture (58). Leukocytes were removed using either CF11 cellulose columns (59) or Plasmodipur filters (catalog no. 8011Filter25u; EuroProxima BV). Following DNA extraction (genomic DNA midikit; Zymo, Inc.) and in preparation for constructing a sequencing library, the amount of parasite and human DNA recovered from each sample was estimated by TaqMan-based qPCR as described in Text S1. DNA from each primary isolate was utilized for microsatellite analysis and for sequencing library preparation, as described in Text S1.
Genomic sequencing and data quality analysis. Sequencing was performed on an Illumina MiSeq system using the Illumina MiSeq reagent kit v3 reagents according to the manufacturer's instructions to generate 300-nucleotide paired-end reads.
The general quality of the sequence data was assessed using Fastqc v. 0.11.4 (60). Primer dimers and leftover insert sequences were removed from the Haitian sample sequences with Trimmomatic v.0.36 (61), and Trim Galore (62) was used for downloaded data sets.
Variant calling and filtering. Variant calling was performed by following directions from the SAMtools 1.3.1 pipeline (63) as described at http://www.htslib.org/workflow/#mapping_to_variant and in reference 64. Results were annotated using SnpEff v.4.2 (65). Since the P. falciparum genome has a high AT bias (ϳ82% AT content) and is rich in repetitive segments, it was necessary to further refine these results, using a protocol defined by Manske et al. (19), which we implemented with custom scripts in R language v.3.3.1 (66) through the RStudio shell (67) and Shustring (68). Details of this pipeline and the impact of filtering steps on the two data sets are reported at https://drive.google.com/file/d/1dA _TPvuJGEiz41w2fU8Q82bOCJNU4JIJ/view?uspϭsharing.
Principal-component analysis (PCA) and spatial principal-component analysis (sPCA) were performed on the resulting data as described in Text S1. Genetic distance was calculated with the general time-reversible model versus transitions and transversions and the Xia test (69) was performed using DAMBE v. 6.4.81 (70). The final alignment (13,627 sites) was scanned for recombination using GARD (71) as implemented in HyPhy (72).
Data availability. The sequence data obtained from the 21 Haitian isolates used in this study were filtered to remove sequences not mapping to the P. falciparum genome. Read pairs for which at least one read mapped to the P. falciparum genome were uploaded to the SRA database with project number PRJNA603776.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. TEXT S1, DOCX file, 0.04 MB.

ACKNOWLEDGMENTS
Research was supported by a grant from the Emerging Pathogens Institute of the University of Florida .
This publication uses data from the MalariaGEN Plasmodium falciparum Community Project as described in reference 47. We thank E. Yalcindag for the helpful discussion of his data early in this study.