Single-cell transcriptomics of malaria parasites

Single-cell RNA-sequencing is revolutionising our understanding of seemingly homogeneous cell populations, but has not yet been applied to single cell organisms. Here, we established a method to successfully investigate transcriptional variation across individual malaria parasites. We discover an unexpected, discontinuous program of transcription during asexual growth previously masked by bulk analyses, and uncover novel variation among sexual stage parasites in their expression of gene families important in host-parasite interactions.


148
To further explore transcriptional variation in sexual stage biology, we carried out scRNA-seq on P. we found an enrichment for variation in the well-studied var gene family (14 of 60 genes, 155 hypergeometric test, p = 0.0006). Expression of noncoding var transcripts is common and is involved 156 in maintaining mutually exclusive expression of a single var gene 17 . Therefore, to assess the 157 presence of true var mRNAs, we identified reads that confirmed intronic splicing. We discovered that 158 although there were reads mapping to multiple vars within each cell, only a single var gene ever had 159 reads that spanned the spliced intron. This suggests that mutually exclusive expression of var genes 160 occurs in sexual stages, as it does for asexual parasites 17 . The var genes variably expressed across 161 females were always from internal var gene clusters, consisting primarily of upsC type vars (Fig. 3b) 162 but whose role is largely unexplored. In asexual stages, var gene products (PfEMP1) allow the 163 parasites to adhere to the host vasculature and avoid clearance in the spleen and are important in 164 evading the adaptive immune system 17 . Variable and mutually exclusive transcription of upsC-rich 165 internal var genes in female gametocytes suggest that this variable gene family, like the pir genes In this study, we have demonstrated a highly successful approach to single cell RNA-sequencing in 169 the study of a unicellular eukaryotic organism. We have used scRNA-seq to successfully resolve 170 individual parasites within a mixed population, to classify them, and to identify previously hidden 171 transcriptional variation. Our data suggest that the transcriptional cascade during development in 172 red blood cells is not continuous as previously described, but instead it occurs in discrete stages. This 173 has important implications for our understanding of gene regulation in the most pathogenic stage of 174 the parasite. We also show that variable expression of genes involved in host-parasite interactions is 175 a feature of multiple life stages of Plasmodium parasites. We hope that our optimisation framework 176 will assist in extending scRNA-seq to a much wider range of diverse eukaryotic cell types.

Mapping reads and calculating read counts 287
All sequencing experiments were processed in the following way. CRAM files of reads were acquired 288 from the WTSI core pipeline, converted to BAM using samtools-1.2 view -b, sorted using samtools 289 sort -n, converted to fastq using samtools-1.2 bam2fq and then deinterleaved 31 . Nextera adaptor 290 sequences were trimmed using trim_galore -q 20 -a CTGTCTCTTATACACATCT --paired --stringency 3 -291 -length 50 -e 0.1 (v0.4.1). HISAT2 (v2.0.0-beta) 32 indexes were produced for the P. falciparum v3 292 (http://www.genedb.org/Homepage/Pfalciparum) or P. berghei v3 33 genome sequences, 293 downloaded from GeneDB 34 , using default parameters (October 2016). Trimmed, paired reads were 294 mapped to either genome sequence using hisat2 --max-intronlen 5000 -p 12 . For the dual sort 295 experiment we mapped against a combined reference, allowing us to exclude reads that map to 296 both genomes. SAM files were converted to BAM using samtools-1.2 view -b and sorted with 297 samtools-1.2 sort. GFF files were downloaded from GeneDB (October 2016) and converted to GTF 298 files using an in-house script. All feature types (mRNA, rRNA, tRNA, snRNA, snoRNA, 299 pseudogenic_transcript and ncRNA) were conserved, with their individual "coding" regions labelled 300 as CDS in every case for convenience. Where multiple transcripts were annotated for an individual counted as ambiguous. 318

319
We compared the library complexity of different iterations of our protocol in order to determine 320 whether more reads resulted in more complexity, or simply more reads from the same genes, 321 perhaps due to large numbers of PCR cycles. Different sequencing runs had very different library 322 sizes and so we downsampled the data. To maximise the number of cells included, while also 323 allowing a reasonable number of reads per cell, we chose to downsample to 50000 reads per cell. To 324 do this, 50000 counts from HTSeq were randomly sampled for each cell. Counts associated with 325 protein coding genes were enumerated and genes were called as detected if there were at least 10 326 reads mapping to them. 327 328

Assessing bias in single-cell sequencing libraries 329
Different library preparation and sequencing protocols exhibit different biases in representation of 330 GC/AT-rich sequences and 5` or 3` transcript ends. In order to assess such biases we took an 331 approach of using the mapped RNA-seq data to identify fragments of genes which were expressed 332 and examined the coverage of genes by these fragments. The reason for doing this, rather than 333 looking at coverage depth was that we had noticed that genes often did not have full coverage, 334 particularly when very long or expressed at a low level. This suggests that, although we would expect not be determined. Read counts were converted to FPKMs and transcripts with an FPKM >= 10 were 352 counted as expressed. We used these data to show that no well contained more than one cell, i.e.  contaminants were also more highly expressed in their cells of origin (Supplementary Figures 2d,e). We therefore normalised the scran values by taking the exponent (2 x ), multiplying by 1000 and 394 dividing by the cDNA length, determined from the GeneDB cDNA FASTA file (coding sequence only, 395 no UTRs). This is similar to the FPKM calculation, except the library size normalisation is already 396 accomplished. We refer to these values as l-scran, for length-normalised scran values. 397 398

Determining parasite life cycle stages using bulk reference data and clustering 399
We used several bulk RNA-seq data sets to assign a life cycle stage to each cell. For P. berghei 400 asexual stages we used microarray data 11 that captures the 24-hour asexual development cycle at Single cell gene expression values were converted to length normalised scran values (l-scran), as 413 described above, in order to produce more accurate rank expression levels for our scRNA-seq data. 414 We compared each single-cell expression profile against each reference data set. To reduce noise, 415 genes that do not vary greatly between conditions in the reference data were removed. For the P. 416 berghei 24h intraerythrocytic developmental cycle reference data 11 , genes were only included if 417 their expression profile had a mean rank of greater than 30 and less than 70 and standard deviation 418 in rank across samples of greater than 3. Genes from the query dataset with l-scran < 3 were also 419 removed. A minimum of 100 remaining genes common to both the reference and query profiles 420 were required to calculate a correlation between them. The Spearman rank correlation was used in 421 order make the microarray and RNA-seq datasets more comparable. The best correlation of a single-422 cell expression profile with a reference expression profile was taken as the consensus stage 423 prediction for that single-cell. As new data (e.g. single-cell analysis of timepoints across the full, 424 synchronised erythrocytic development cycle) become available, benchmarking staging algorithms 425 will become feasible. Bulk RNA-seq data to classify P. berghei males and females directly was not 426 available. Therefore, we used bulk RNA-seq data 12 that includes mixed-sex gametocyte samples, 427 after converting the profiles to v3 using previous id annotation from PlasmoDB 41 . 2612 genes for the Hoo data 11 . There were 651 shared genes, which were used to compare the two 497 datasets. 498

499
The P. falciparum asexual cells were ordered differently. We used the Otto P. falciparum asexual 500 development cycle time course data 24 as a reference. These data were processed as described 501 above. We used the Fourier transform approach described above, with a normalised S/N ratio of 0.5 502 to identify 4517 genes from the Otto et al. dataset 24 . We then identified 336 genes common to this 503 list and the list of 361 differentially expressed genes identified across the 155 single cells. This 504 approach was taken because the window of time captured by our single-cells was too narrow to <0.05). We then generated heatmaps for the two datasets, with the genes ordered by their peak 507 time in the Otto et al. dataset 24 in both cases. 508 509

Determining gene expression variability within different cell types 510
To examine gene expression variation within life stages, we used the filtered datasets and 511 considered male, female, trophozoite and schizont cells separately. For P. berghei, M3Drop 15 was 512 used to determine gene expression variability amongst cells with FDR <= 0.05. We identified 115 513 variable genes in P. berghei females, 73 in males, 27 in trophozoites and four in schizonts. Twenty 514 variable genes were shared between males and females. One variable gene from trophozoites and 515 schizonts was shared: an L1 type pir gene (PBANKA_0943900). Ten were shared between females 516 and trophozoites, none between females and schizonts, four between males and trophozoites, none 517 between males and schizonts. To examine functional classes enriched amongst variable genes we 518 used topGO with the weight01 algorithm, the Fisher statistic, node size = 5 and False Discovery Rate 519 >= 0.05 48 . Gene ontology terms for P. berghei and P. falciparum genes were extracted from GeneDB 520 EMBL files. Multigene families in P. berghei do not have associated Gene Ontology (GO) terms and so 521 we used ad hoc hypergeometric tests to look at their enrichment. We found that pir genes were 522 enriched amongst variable genes in trophozoites (2 of 135, hypergeometric test, p=0.04), schizonts 523 (1 of 135, hypergeometric test p=0.005) and males (5 of 135, hypergeometric test, p=0.02), but there 524 were none in females. Those in asexual stages (trophozoites and schizonts) were from the L1 525 subfamily, but those in males were from the S1 and S4 subfamilies. 526 527 Many more variable genes were present in P. falciparum females and so we reduced the FDR cut off 528 to 0.01 to improve specificity. We found 448 variable genes in P. falciparum females. We were not 529 able to analyse variability in P. falciparum males because we found only five of them. Amongst 530 several enriched GO terms were modulation by symbiont of host erythrocyte and cytoadherence to 531 microvasculature, mediated by symbiont protein. These terms refer to 14 var genes found amongst 532 the variable genes. 533 534

Identifying putative functional var transcripts 535
It is known that antisense transcripts are expressed from a bidirectional promoter in the intron of 536 each var gene 17 . Our protocol does not preserve information about which strand is transcribed. 537 Therefore finding that reads map to either exon of a var gene does not provide evidence that it is 538 functionally expressed. In fact it may indicate that the gene is silenced. In order to identify sense 539 transcripts, we looked for reads mapping over the single intron of each var gene. These reads that include both exons must originate from transcripts including the intron, and thus indicate that the 541 gene was transcribed on the sense strand, not from antisense transcripts beginning within the 542 intron. Initially we identified reads from the HISAT2 mappings which overlapped annotated var 543 genes using bedtools intersect. From the resulting BAM file we selected those reads which included 544 an N in the CIGAR string, indicating a split read. We then looked for which var gene each read 545 overlapped and whether it was split exactly over the intron. We called expression for a var gene 546 where there were at least two reads mapping over the intron. 547 548

Code availability 549
Perl, R and C++ code for various analyses is available on request.  berghei transcriptomes, although no cell had more than 37 contaminating transcripts and no pair of 747 cells shared more than 17. There was a total of 258 P. berghei transcripts contaminating P. 748 falciparum transcriptomes, although no cell had more than 32 of these and no pair of cells shared 749 more than 10. The data suggest that P. falciparum schizonts cause more contamination than P. Different combinations of the protocol were tested by sequencing. Initial trials were performed with 853 2 µl of lysis buffer, this was increased to 4µl to augment capture efficiency. Permutations of the 854 protocol that were tested were a terminal anchoring base (A,G,C; V) or not (T), 2 reverse 855 transcriptase enzymes (Smartscribe (SmSc); Superscript II (SII)) and 25 or 30 cycles of 856 preamplification. Both sexual and asexual cells of P. berghei and P. falciparum were tested. For each 857 sequenced dataset, we calculated the mean percentages of rRNA, mRNA and other reads across the 858 cells. For some samples we also downsampled the data to 50,000 reads per cell to allow comparison 859 of the number of genes detected. This was done to determine differences in the complexity of each 860 library. For the three larger datasets produced (P. falciparum gametocytes, P. berghei mixed blood 861 stages, and P. falciparum asexual stages), we provide the numbers of pre-and post-filtered cells and 862 median number of genes in those filtered cells. 863 864