Pf7: an open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples

We describe the MalariaGEN Pf7 data resource, the seventh release of Plasmodium falciparum genome variation data from the MalariaGEN network. It comprises over 20,000 samples from 82 partner studies in 33 countries, including several malaria endemic regions that were previously underrepresented. For the first time we include dried blood spot samples that were sequenced after selective whole genome amplification, necessitating new methods to genotype copy number variations. We identify a large number of newly emerging crt mutations in parts of Southeast Asia, and show examples of heterogeneities in patterns of drug resistance within Africa and within the Indian subcontinent. We describe the profile of variations in the C-terminal of the csp gene and relate this to the sequence used in the RTS,S and R21 malaria vaccines. Pf7 provides high-quality data on genotype calls for 6 million SNPs and short indels, analysis of large deletions that cause failure of rapid diagnostic tests, and systematic characterisation of six major drug resistance loci, all of which can be freely downloaded from the MalariaGEN website.


Introduction
Despite global malaria eradication efforts in the mid-20th century and more recent advances in malaria control, Plasmodium falciparum remains endemic throughout Africa, Asia, South America and Oceania. According to the most recent World Malaria Report, each year over 200 million people suffer from malaria due to P. falciparum and over 600,000 die as a result 1 . Most of the disease burden falls on Africa, and particularly African children. There is international commitment to control malaria more effectively and many countries are working towards the long-term goal of malaria elimination. However the parasites are continually evolving to resist antimalarial drugs and to evade host immunity, and this is a major challenge to sustainable malaria control and elimination.
Our understanding of the evolutionary biology and population genomics of malaria parasites has advanced considerably over the past decade. There is now a substantial body of literature on the genomic diversity and global population structure of malaria parasites, on within-host genetic variation and what this tells us about superinfection and cotransmission, on the identification and monitoring of parasite drug resistance loci, on genetic variation in malaria vaccine antigens, and on methods of analysing genetic relatedness to understand patterns of malaria transmission. A useful summary of the current state of the field can be found in the recent review by Neafsey et al. 2 .
This rapidly growing area of research is underpinned by open data on genome sequence variation in natural parasite populations. Here we report a new release of curated open data on Plasmodium falciparum genome variation from the MalariaGEN network 3 . It includes samples featured in previous MalariaGEN data releases including the Pf3k Project 4 , the Plasmodium falciparum Community Project 5 and the GenRe Mekong Project 6 . To avoid confusion between different MalariaGEN datasets we now identify each by a version number. In this new nomenclature, the previous version 5 is called Pf6, and the version described here is called Pf7.
Whole genome sequencing of all the samples in the Pf7 dataset was performed at the Wellcome Sanger Institute and a standardised analysis pipeline was used for variant discovery and genotyping. The Pf7 analysis pipeline was broadly similar to that used for the Pf6 dataset with some improvements that are described in more detail below. Sequence data and genotype calls were returned to partners for use in their own analyses and publications in line with MalariaGEN's guiding principles on equitable data sharing 3  The most significant technical advance in the Pf7 dataset is that most of the new samples (12,891/13,752, 94%) came from the MalariaGEN SpotMalaria Project 7 . SpotMalaria was designed to simplify and standardise the process of collecting dried blood spot (DBS) samples for parasite genetic analysis. The SpotMalaria protocol ensures that the vast majority of DBS samples are suitable for targeted genotyping of drug resistance loci, e.g. by amplicon sequencing, and that a significant proportion of samples are also suitable for whole genome sequencing. This requires an intermediate step known as selective whole genome amplification (sWGA), which makes it possible to obtain parasite genome sequence data from a very small sample, but at the cost of introducing considerable variability in sequencing coverage across the genome. This is the first study to analyse a large number of sWGA P. falciparum genomes and therefore it was important to establish that sWGA was not introducing significant biases, and to adapt our methods for calling structural variations which are particularly sensitive to artefactual variation in sequencing coverage.
We have performed a number of analyses to make the Pf7 dataset as useful as possible for a broad range of users. These include descriptions of global population structure, geographic patterns of drug resistance, haplotypic analysis of drug resistance loci, hrp2 and hrp3 deletions that can cause failure of rapid diagnostic tests, and variation in the C-terminal of the csp antigen used in the most advanced malaria vaccines. These analyses are not intended to be comprehensive and technical users of the dataset can download the analysis-ready dataset for more specialised or detailed investigations.
A high level view of the Pf7 dataset can be obtained from the data exploration tool at the MalariaGEN website. This shows the locations and years where the samples were collected, and the genotype-inferred drug resistance status of each sample. Most importantly it names the investigators who led the studies that contributed the samples at each location and thus made this global dataset possible.

Variant discovery and genotyping
We used the Illumina platform to produce whole genome sequencing data on all samples and mapped sequence reads against the P. falciparum 3D7 v3 reference genome. The median depth of coverage was 107 sequence reads averaged across the whole genome and across all samples. We used an analysis pipeline for variant discovery and genotyping analogous to that used in Pf6, as outlined in the Methods section.
In the first stage of analysis we discovered genomic variations in nearly half of the 23Mb P. falciparum positions (10,145,661 in total, Supplementary Table 3), including 4,397,801 single nucleotide polymorphisms (SNPs).
For the analysis reported here, we excluded all variants in subtelomeric and internal hypervariable regions, mitochondrial and apicoplast genomes and applied stringent quality filters to the remaining variants as described in the Methods section. A total of 3,125,721 SNPs (of which 2,513,888 were biallelic) and 2,742,938 non-SNPs, i.e. short indels or SNP-indel combinations, passed all these filters. Some of the variant positions that were classified as SNPs in Pf6 are now classified as non-SNPs because they additionally include indel alleles.
We performed quality control checks to remove samples with: (i) unverified or incomplete sample collection information; (ii) evidence of co-infection with other Plasmodium species; (iii) more than one technical replicate or time course sampling; (iv) low coverage; (v) a higher than expected number of singleton SNPs. In total, we retained 16,203 high-quality samples ( Table 1).
This analysis-ready dataset with details of all participating partner studies and a python package providing convenience methods for accessing is available here.
Effects of selective whole genome amplification (sWGA) Unlike the previous version, nearly all samples that are new to this release (12,891/13,752, 94%) have been sequenced after undergoing selective whole genome amplification (sWGA). This process allows us to sequence samples collected as dried blood spots, which greatly simplifies many of the operational challenges in collecting venous blood 8 .
An artefact introduced by sWGA is high variability in coverage across the genome 8 . This impacts on the use of local variation in genomic coverage as a way to identify large structural variations such as tandem duplications. We therefore developed a novel method based on GATK GermlineCNVCaller (gCNV) for typing duplications around mdr1 and plasmepsin 2-3 (associated with resistance to mefloquine and piperaquine, respectively) and deletions of hrp2 and hrp3 (associated with rapid diagnostic test failures). We started by compiling a list of observed breakpoints in and around the loci of interest. We then leveraged on the fact that the amplification bias introduced by sWGA, and the consequent variation in coverage, is relatively systematic and can be used for a cross-sample normalisation. Finally, we complemented the results with an analysis to detect presence of face-away reads around the known breakpoints and obtained a final set of calls. For plasmepsin 2-3 duplications, concordance between gCNV and the face-away reads methods was high, with 99% of samples called as duplication by gCNV also being called as duplication by the face-away method, and the remaining 1% all called as missing. For mdr1, concordance was significantly lower, with 19% of samples called as duplication by gCNV being called as no duplication by the face-away method. This could be explained by the fact that the set of breakpoints used is likely not exhaustive, and also by some duplications not being tandem duplications 5 . For samples called as no duplication by gCNV, the vast majority were also called as no duplication by the face-away method (83%) or else missing (17%). For hrp2 and hrp3 deletions, we manually validated the results and identified evidence of breakpoints for all the deletion calls.
To ensure that sWGA is not introducing biases in population structure, we analysed four sets of samples from the same location and time periods for which we had a substantial number of both sWGA and non-sWGA samples, and could detect no apparent stratification (Supplementary Figure 3).

Global population structure
We grouped samples by location using the classification scheme known as first-level administrative division: we refer to these as sampling locations. Based on principal coordinate and neighbour-joining tree analyses of all samples, we identified ten major divisions of population structure: we refer to these as major sub-populations. We then determined the geographical range of each major sub-population by examining the sampling locations that it contained ( This geographical assignment of ten major sub-populations is a somewhat crude approximation of the underlying population structure, and it does not reflect international conventions for grouping countries or regions. However it provides a framework that allows a broad comparison of population genetic features between different parts of the world, such as the rate of decay of linkage disequilibrium (Supplementary Figure 5

Geographic patterns of drug resistance
We classified samples as resistant or sensitive to major antimalarials and combinations based on genotyping of known drug resistance alleles (Table 2 -see here for details of the heuristics used). At a regional level, the frequency of samples classified as resistant to each drug is broadly consistent with known and previously reported geographical patterns, with the highest prevalence of multidrug resistance observed in Southeast Asia. Interestingly, in South Asia, we find that the frequency of resistance to chloroquine, sulfadoxine and pyrimethamine appears to be much higher in the far-eastern sub-population (Bangladesh and Tripura) than in the eastern sub-population (Odisha and West Bengal).
Care is required when interpreting these findings, as most of the major sub-populations spanned a large geographic region, within which there could be considerable epidemiological diversity, and also because we aggregate samples that were collected over relatively long periods of time during which patterns of resistance may have changed (Supplementary Figure 8). To take West Africa as an example, if we consider samples collected between 2013 and 2016 ( Figure 2), we observed levels of chloroquine resistance varying from 0% in Volta, Ghana to 100% in Atlantique, Benin. Here there is an approximate mapping between the principal components and the geographic location (latitude and longitude). Table 2. Frequency of different sets of polymorphisms associated with drug resistance in samples from different geographical regions. All samples were classified into different types of drug resistance based on published genetic markers, and represent best attempt based on the available data. Each type of resistance was considered to be either present, absent or unknown for a given sample. For each resistance type, the table reports: the genetic markers considered; the drug they are associated with; the proportion of samples in each major subpopulation classified as resistant out of the samples where the type was not unknown. The number of samples classified as either resistant or not resistant varies for each type of resistance considered (e.g. due to different levels of genomic accessibility); numbers in brackets report the minimum and maximum number analysed while the exact numbers considered are reported in Supplementary Amplifications of the genes mdr1 and plasmepsin 2-3 are markers of resistance to mefloquine and piperaquine, respectively. Interestingly, two samples collected in 1993 in Cambodia have tandem duplications of both genes, an event which is relatively rare in more recent samples (only 21 samples in total out of 1,959 that have evidence of amplification of either gene). In addition to presence/absence of the amplification, we also provide details of the associated breakpoints for all samples which shows these two samples having two distinct breakpoints in plasmepsin, one of which is identical to that most commonly found in contemporary samples.
Haplotype analysis of kelch13 and crt drug resistance loci Previous reports have shown that the current wave of multidrug-resistant P. falciparum in Southeast Asia is driven by the KEL1 lineage of the kelch13 artemisinin resistance locus [9][10][11][12] and is associated with multiple new mutations in the crt resistance locus 9 . This dataset confirms the dramatic increase of KEL1 in Cambodia, Eastern Thailand, Laos and Vietnam that has occurred over the past ten years (Supplementary Figure 9). Analysis of the crt locus in samples with the K76T resistance variant reveals a major cluster of haplotypes on a common genetic background, the one observed in a widely used lab strain isolated in Asia in 1980 and commonly referred to as Dd2 ( Figure 3A, Supplementary Table 5). In addition to the original Dd2 haplotype, we observe 31 additional mutations.
These are essentially restricted to eastern SE Asia with only four samples from outside this region, and they include mutations that have previously been associated with piperaquine resistance 9,13 . They are seen across all countries of eastern SE Asia and have risen rapidly in frequency leading us to consider them newly emerging Dd2-background haplotypes ( Figure 3B). Most have a single mutation on a Dd2 background, but we observe 13 haplotypes with two mutations and one haplotype (in a single sample) with three mutations (Supplementary Table 5). These findings highlight the value in retrospective analysis of drug resistance mutations, as most of these samples were collected and sequenced before the relevance of crt mutations to piperaquine resistance was appreciated.
Variation in the c-terminal region of csp In addition to the selective pressures due to antimalarial drugs highlighted in the previous section, another area of interest is selective pressure due to vaccines. Having a baseline understanding of genetic variation in vaccine genes is likely to be valuable.
The WHO has recently recommended the RTS,S vaccine for use in regions with moderate to high transmission which includes much of sub-Saharan Africa 14 . The vaccine targets the gene csp and has a construct based on the 3D7 reference sequence of part of the central NANP repeat region where  antibodies bind and the c-terminal region which contains T cell epitopes 15 . Another vaccine based on the same region and sequence, R21, is also showing promise in early stage clinical trials 16 . Vaccine efficacy is likely to depend on a number of factors, both host and parasite, and clinical trials show some variability between different locations in Africa 17 . How similar the parasite is in the targeted region to the 3D7 sequence used in vaccine design could be a contributing factor to this variability. Genetic diversity in the construct region may or may not affect vaccine efficacy, and in order to understand this it will be important to monitor efficacy against diversity going forwards. Here we begin a systematic catalogue of population-level diversity in csp. While it is challenging if not impossible to genotype the central repeat region using short read data, we start here by looking at variation leading to amino acid changes in the c-terminal region of the protein.
We identified all non-synonymous mutations in the csp c-terminal region and analysed the frequency of these in different populations. Interestingly, the vast majority of the samples across the globe carry non-reference alleles, i.e. different from the 3D7 sequence used in the vaccine design, at amino acids 301, 317, 318, 321 and 361 (Supplementary Figure 10). We found a total of 248 unique amino acid haplotypes of csp 277-397 out of 11,254 samples with no ambiguous calls. Amino acid haplotype sequences for the c-terminal region of csp for all samples can be found at the MalariaGEN website.
Surprisingly, the most common haplotype in the dataset and the one with the second lowest number of differences from all other unique haplotypes is the one observed in lab strain Dd2, being found in 2,760/11,254 (25%) of the samples and having a mean of 4.7 differences from other haplotypes (Figure 4a). In contrast, the 3D7 haplotype used for both csp-based vaccines is only found in 3% of samples and has on average 6.9 differences from all other haplotypes.
Importantly, this striking difference also holds when examining each population separately, including in West Africa from where 3D7 is thought to have originated (Figure 4b).
Taken together, these results show perhaps surprising differences between the target haplotype used in the design of RTS,S and R21 and those circulating in natural parasite populations, and provide a systematic catalogue that can be used in future studies to elucidate any possible clinical significance of sequence diversity.
Genetic origins of hrp2 and hrp3 deletions Most widely used rapid diagnostic tests (RDTs) rely on detection of the products of the hrp2 and hrp3 genes, and deletions in these genes is known to lead to RDT failure 18 . We used gCNV to call presence/absence of hrp2 and hrp3 deletions in 68% of QC passed samples. Frequencies of deletions vary greatly across countries, and deletions of hrp3 (1.9%) are more common than those of hrp2 (0.14%) (Supplementary Table 6). The only countries where we see deletions in both hrp2 and hrp3 that would cause HRP RDTs to fail are Peru (6/20 samples), Indonesia (2/115 samples) and Sudan (1/7 samples).
There have been numerous reports of such deletions in recent years, but to date there has been little detail on the mechanisms causing such deletions. We manually inspected reads around the apparent breakpoints in order to classify the types of events driving these deletions. For hrp2, all deletion events can be explained by a process of telomere healing whereby the end of a chromosome is deleted and a telomere repeat sequence attached to the breakpoint 19,20 ( Figure 5). Telomere healing events can be determined with breakpoint precision and in almost all cases samples with the same breakpoints are from the same country (Supplementary Table 7). For hrp3, we also identified a number of telomere healing events, but also two other quite different types of event causing deletion of the gene ( Figure 5, Supplementary Table 7). In many cases a new hybrid chromosome appears to have been created by a recombination between chromosome 13 and 11 at a cluster of rRNA genes that have orthologous copies on both chromosomes. In other cases a recombination between chromosome 13 and an inverted section from within chromosome 5 containing the gene mdr1 can be identified. This remarkable event results in both deletion of hrp3 and duplication of mdr1.
Other types of genetic variation: allelic forms of eba175 As with previous releases, in Pf7 we have created genotypes at SNPs genome-wide and CNVs in specific locations, but we intend to continue to expand the resource to consider other types of genetic variation. Various surface antigens, including vaccine candidate genes, have two distinctly different allelic forms 21 . Often the two forms are so divergent that reads from a non-reference form will not map to the 3D7 reference genome 22 , hence necessitating an alternative to the mapping-based genotype calling approach described above. As an example, the gene eba175 has two different allelic forms, known as the F-and C-types 23,24 . As a proof of concept for such a dimorphic gene, we used a novel kmer-based method to call these two types in each sample. We see that 7,380 (69%) samples have the F allele exclusively and 3,364 (31%) have the C allele. Although frequencies of each type vary between populations, we see >10% frequency of each type in all populations (Supplementary Figure 11). These results give weight to the argument that eba175 is under balancing selection, most likely negative frequency-dependent selection driven by interactions with the human immune system. Analyses of other such dimorphic genes will be left to future work, as will more detailed analysis of variation within these different allelic types.

Discussion
The Pf7 dataset increases the amount of curated open data for population genomic analysis of P. falciparum by almost threefold, and greatly increases the number of samples collected within the last five years. With denser geographical coverage it is possible to undertake higher resolution analysis of epidemiological variation within a region, e.g. we observe considerable heterogeneity of inferred chloroquine resistance in West Africa (Figure 3), and it also allows us to identify new sub-populations with distinctive epidemiological features, e.g. we find two sub-populations in south Asia that have contrasting drug resistance profiles. There is useful historical information to be obtained from older samples that are included in this new data release, e.g. some samples collected in Cambodia in the early 1990s appear to be resistant to both piperaquine and mefloquine, which is highly relevant to the ongoing evolution of multidrug resistance to artemisinin combination therapy in Southeast Asia.
An important technical advance is that Pf7 contains a large number of samples that were collected as dried blood spots in the Figure 5. HRP deletion breakpoints. We see five different breakpoints resulting in the deletion of hrp2. Four of these are within exon 2 of the gene whereas the fifth is found between hrp2 and the pseudogene PF3D7_0831750. For all five events we see evidence of telomeric healing from reads that contain part Pf3D7_08_v3 sequence and part telomeric repeat sequence (GGGTTCA/GGGTTTA). We see 16 different breakpoints resulting in the deletion of hrp3. For fourteen of these we see evidence of telomeric healing. Note that many of these events result in the deletion of other genes in addition to hrp3. For twenty samples from Cambodia and a single sample from Vietnam we see evidence of a recombination with chromosome 5 which results in a hybrid chromosome comprising mostly chromosome 13 sequence but a small inverted section of an internal portion of chromosome 5 containing the gene mdr1. We also see evidence of a recombination with chromosome 11 which results in a hybrid chromosome comprising mostly chromosome 13 sequence but also a section of the 3' end of chromosome 11. This is the most common deletion type, being seen in 151 samples from 14 different countries. Because the recombination occurs between highly similar sequences of a set of three orthologous ribosomal RNA genes found on both chromosomes, it is not possible to identify the exact breakpoints.
field. We and others have previously described successful whole genome sequencing of P. falciparum from DBS after selective whole-genome amplification but it was unclear how well this methodology would perform at scale. Here we show that the SpotMalaria protocol for sWGA of DBS samples allows us to generate whole genome sequence data of sufficient quality to genotype the vast majority of SNPs with sufficient accuracy and reliability for large-scale population genomic analysis. We have introduced improvements to our pipelines for calling copy number variants, necessitated by the greatly increased heterogeneity of sequencing coverage following sWGA. There remain hypervariable gene families and other regions of the parasite genome that cannot be accurately genotyped in field samples using current methods, and these difficulties are compounded by sWGA, but by working on sequencing and analysis methods we aim to continually improve genome coverage in future releases.
The knowledge that DBS samples can be used for whole genome analysis in large-scale studies is of practical importance, as it empowers field researchers and national malaria control programs to integrate population genomic information with other forms of epidemiological and public health data, and it paves the way to a global infrastructure for genomic surveillance of P. falciparum. Information about the processes and methods of the SpotMalaria Project can be obtained at the MalariaGEN website The Pf7 dataset includes a range of analyses and sample annotations that are intended to increase the utility of the data for researchers working on practical problems in malaria control. Compared to the Pf6 dataset, we have made improvements to methods for calling CNVs at the mdr1 and pm2 drug resistance loci and for calling hrp2 and hrp3 deletions that can affect rapid diagnostic tests. Other new analyses included in Pf7 include more detailed descriptions of: (a) hrp2 and hrp3 deletion breakpoints; (b) drug resistance locus haplotypes and in particular newly emerging crt haplotypes; (c) profiles of variation in the csp vaccine antigen and the vaccine candidate eba175. In future releases we aim to improve and expand analyses that are relevant to malaria control programmes and policymakers.
The Pf7 dataset focuses entirely on genome sequencing data, but there is a growing body of data from amplicon sequencing and targeted genotyping approaches that is highly informative about multiple aspects of P. falciparum population genomics. For example, the GenRe-Mekong Project has used the SpotMalaria platform combined with amplicon sequencing to enable malaria control programmes in the Greater Mekong Region to conduct national genomic surveillance of multidrug resistance 6 . In future data releases we aim to integrate data from these different sources to greatly increase sample size and geographical coverage, and thus improve the resolution of population genomic analysis.

Methods
All samples in this study were derived from blood samples obtained from patients with P. falciparum malaria, collected with informed consent from the patient or a parent or guardian. At each location, sample collection was approved by the appropriate local and institutional ethics committees. Here we summarise the bioinformatics methods used to produce and analyse the data; full details are available here.
Reads mapping to the human reference genome were discarded before all analyses, and the remaining reads were mapped to the P. falciparum 3D7 v3 reference genome using bwa mem 26 . "Improved" BAMs were created using the Samtools FixMateInformation, Picard MarkDuplicates, and GATK base quality score recalibration. All lanes for each sample were merged to create sample-level BAM files.
Putative variants were called in each sample independently using GATK (v4.1.4.0) HaplotypeCaller, then all samples were combined to jointly genotype the entire cohort using GATK GenotypeGVCFs 27 .
SNPs and indels were filtered using GATK's Variant Quality Score Recalibration (VQSR). Variants with a VQSLOD score ≤ 2 were filtered out. Functional annotations were applied using snpEff 28 version 4.3. Genome regions were annotated using BCFtools v1.10.2 (http://www.htslib.org/doc/bcftools.html) and masked if they were outside the core genome. Unless otherwise specified, we used biallelic SNPs that pass all quality filters for all the analysis.
We identified species using nucleotide sequence from reads mapping to six different loci in the mitochondrial genome, using custom java code. The loci were located within the cox3 gene (PF3D7_MIT01400), as described in a previously published species detection method 29 . Alleles at various mitochondrial positions within the six loci were genotyped and used for classification.
We created a final analysis set of 16,203 samples after removing samples with unverified identity, mixed species, replicate and low coverage samples, and samples with excessive numbers of singleton SNPs.
We calculate genetic distance between samples using biallelic coding SNPs that pass filters using a method previously described 9 .
The matrix of genetic distances was used to generate neighbour-joining trees and principal coordinates. Based on these observations we grouped the samples into ten major sub-populations: South America, West Africa, Central Africa, Northeast Africa, East Africa, an eastern part of South Asia, a far-eastern part of South Asia, the western part of Southeast Asia, the eastern part of Southeast Asia and Oceania, with samples assigned to region based on the geographic location of the sampling site.
F WS was calculated using custom python scripts using the method previously described 7 . Nucleotide diversity (π) was calculated in non-overlapping 25kbp genomic windows, only considering coding SNPs to reduce the ascertainment bias caused by poor accessibility of non-coding regions. LD decay (r 2 ) was calculated using the method of Rogers and Huff and biallelic SNPs with low missingness and regional allele frequency >10%. Mean F ST between populations was calculated using Hudson's method.
To call duplication genotypes around mdr1 and plasmepsin 2-3 from binned coverage, we adapted the GATK GermlineCNVCaller (gCNV) pipeline, which features a probabilistic Bayesian model that jointly infers both copy-number activity and a model for denoising sequencing systematics. After breakpoint genotypes were called, we performed an initial, permissive round of annotation-based call filtering, using hard cuts to identify failing samples and demarcate duplication and reference genotypes. This was followed by a final round of curation, based on manual inspection of the denoised copy ratios, to discard spurious duplication calls. The resulting filtered gCNV call set was integrated with an analogous call set based on consideration of face-away read-pair evidence, in which we set the breakpoint to be that with the highest proportion of face-away reads.
Deletions in hrp2 and hrp3, genes which are located in subtelomeric regions of the genome with very high levels of natural variation, were identified using the same breakpointgenotyping framework introduced above. As before, an initial round of permissive, annotation-based filtering was performed, followed by a final round of curation to discard spurious deletion calls. We identified deletion breakpoints by manual inspection of custom plots.
The procedure used to map genetic markers to inferred resistance status classification is described in detail for each drug in the accompanying data release. In brief, we called amino acids at selected loci by first determining the reference amino acids and then, for each sample, applying all variations using the GT field of the VCF file. This same approach was used to identify haplotypes of csp 277-397 . The amino acid and copy number calls generated in drug resistance genes were used to classify all samples into different types of drug resistance. Our methods of classification were heuristic and based on the available data and current knowledge of the molecular mechanisms. Each type of resistance was considered to be either present, absent or unknown for a given sample. eba175 F-and C-type calls are made by identifying samples that have 19bp kmers present that are unique to the C and F haplotypes. • Study information: Details of the 82 contributing partner studies, including description, contact information and key people.

Data availability
• Sample provenance and sequencing metadata: sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 20,864 samples from 33 countries.
• Measure of complexity of infections: characterisation of within-host diversity (F WS ) for 16,203 QC pass samples.
• Inferred resistance status classification: classification of 16,203 QC pass samples into different types of resistance to 10 drugs or combinations of drugs and to RDT detection: chloroquine, pyrimethamine, sulfadoxine, mefloquine, artemisinin, piperaquine, sulfadoxinepyrimethamine for treatment of uncomplicated malaria, sulfadoxine-pyrimethamine for intermittent preventive treatment in pregnancy, artesunate-mefloquine, dihydroartemisinin-piperaquine, hrp2 and hrp3 gene deletions.
• Drug resistance markers to inferred resistance status: details of the heuristics utilised to map genetic markers to resistance status classification.
• CRT haplotypes: Full crt gene haplotypes for 16,203 QC pass samples.
• CSP C-terminal haplotypes: Full csp C-terminal haplotypes for 16,203 QC pass samples plus 6 lab strains.
• Reference genome: the version of the 3D7 reference genome fasta file used for mapping.
• Annotation file: the version of the 3D7 reference annotation gff file used for genome annotations. •

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes Figure 2: Heterogeneity of chloroquine-resistant P. falciparum across west Africa (as determined using the pfcrt K76T mutation that is a well validated genetic marker). This finding likely reflects different chloroquine and potentially amodiaquine usage across the region.
○ Figure 3: The finding of multiple new pfcrt mutations in Southeast Asia, arising on the chloroquine-resistant Dd2 background, which in a number of cases reflects a key role for these mutations in contributing to piperaquine resistance. A number of these mutations are also likely to compensate for fitness costs caused by pfcrt mutations. ○ Figure 4: Evidence of a high rate of non-synonymous mutations in CSP, which is important given that a segment of this protein forms the core parasite antigen used in the WHOapproved RTS,S and the clinically trialed R21 vaccines. That figure would benefit from additional annotation to show the location of the segments used in these vaccines. ○ Figure 5: Evidence of breakpoints that cause deletions of HRP2 and HRP3, whose detection is an essential component of the rapid diagnostic tests commonly used to diagnose malarial infection. Here, it would be good to point out the absence of samples from the Horn of Africa, where this is a major problem. That description should cite the relevant literature, e.g. PMID 34580442 1 .
○ Table 1 lists the new samples by region.
○ Table 2 is a very useful list of the prevalence of genetic markers of resistance to multiple drugs. When citing amplification of plasmepsins 2 and 3 as a marker of piperaquine resistance, the authors should cite the two initial reports of their association: PMID 27818095 2 and 27818097 3 . This lists, by necessity, excludes pfcrt mutations as, unlike for chloroquine, there is no single mutation that is essential and instead there are several that can contribute to resistance. The authors cite a review but could include a recent study that provides more insight into mechanism: PMID 31776516 4 .

○
The real power of this study can be found through their searchable website: https://www.malariagen.net/resource/34. This resource will provide a superb tool for the malaria research community. Now that this article is published, the following link can be updated to reflect that data release should no longer be forthcoming: https://www.malariagen.net/apps/pf7/. It will be important for readers to be able to interrogate the database in an easy way to collate information on variation in a particular gene. I was not able yet to figure out how to generate this search and more instruction on this point would be very helpful.

Fabián Sáenz
Centro de Investigación para la Salud en América Latina, Facultad de Ciencias Exactas y Naturales, Pontificia Universidad Católica del Ecuador, Quito, Ecuador This manuscript by MalariaGEN and collaborators describes a new release set of genomic data called Pf7. The data includes curated genomic data of more than 20,000 Plasmodium falciparum samples, out of which 13750 were new for the Pf7 set. Samples were obtained from 33 countries in 4 continents collected between 1984 and 2018. Most of the new samples came from dried blood spots for which SWGA was used (and used a GATK based method to normalize data around certain genes). Variants were called and incomplete samples discarded. The authors divided the samples in 10 sub populations that matched quite well sample locations based in principal coordinate analysis and neighbor joining trees (four in Africa, four in Asia, one in Oceania and one in South America). In order to show relevance of the data, some interesting analysis specific to drug resistance, vaccine target diversity, rapid diagnosis test target deletion and invasion ligands was presented. Of particular interest, samples were classified according to the resistance level to some of the main drugs using well known markers and show how drug resistance has moved across SE Asia in unique genetic backgrounds. In addition, the variation in the c-terminal region and background mutations was analyzed and shown to be quite divergent from the reference 3D7 (used in the RTS,S vaccine). In addition to map the deletions of hrp2 and hrp3 in the samples around the world. the origin of these deletions is explored and the authors conclude that telomere healing plays an important role in the process. The available data also helps confirm that eba175 is an invasion ligand gene under balancing selection.
This work is a very important addition to the genomic Plasmodium datasets available for the scientific community. The data will help the development of new worldwide and regional malaria studies helping elimination in several areas of the world and ultimately helping in the ultimate goal that is eradication.
Here are some minor observations to this work: Introduction: The analysis pipeline should be referenced through the text.

Results:
Variant discovery and genotyping: When referring to 3D7 v3 reference genome a reference should be included. When referring to the variant discovery pipeline for Pf6, a reference should be included.

○
Effects of selective whole genome amplification (sWGA): The authors are well aware that variability in coverage across the genome can arise from sWGA and develop a method based on gCNV. It can be seen that for some genes such as mdr1, concordance was low. Would this method be usable for other markers in the genome? Limitations in this sense should be further discussed.

○
To show that sWGA of DBS samples is not introducing a bias in population structure, the authors compare clustering of the same samples from whole blood and DBS. They do not see stratification by sample type (supl fig 3). From this figure, it is not clear to me that clustering is similar in both cases since samples of both collection types are analysed together. In addition, it is not clear how sample type clustering would be sufficient to demonstrate no bias in population structure due to sWGA. It would be important to have further insight from the authors in this sense. Global population structure: In figure 1B and supplementary figure 4 it is of particular interest that some African samples and South American isolates appear to be more related than African samples between themselves, but this is not commented by the authors. Could the authors comment in this sense.
○ Genetic origins of hrp2 and hrp3 deletions: The first sentence in this section reads: "Most widely used rapid diagnostic tests rely on detection of the products of the hrp2 and hrp3 genes". In my opinion this should be clarified or written differently since RDTs are designed to detect the product of hrp2 gene but no hrp3. The deletion of hrp3 may affect the outcome of RDTs if hrp2 is deleted because of the similarity between both of the products.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes