A machine learning approach to define antimalarial drug action from heterogeneous cell-based screens

Machine learning is applied to high-throughput microscopy images of malaria parasites to define antimalarial drug mode of action.


Supplementary Note 1: Human expert agreement with majority consensus
To calculate agreement among the six human experts, each labelled image patch was compared to the majority consensus (the category with the most votes). There was total agreement (5 votes) on lifecycle stage for 23.7 % of parasite images, with 4 or 3 out of 5 votes accounting for 24.4 % and 32.7 % of parasite image classifications, respectively. Labellers' answers were compared to the majority consensus for 448 images (each expert labelled between 143 and 448 images). Agreement between each labeller's answers and the majority consensus (in descending order of number of images labelled) was 66.1 %, 78.4 %, 73.1 %, 60.9 %, 71.8 % and 68.8 %.
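The comparison described above can be illustrated with a minimal sketch. The data layout (a dictionary mapping each image to its labellers' votes), the function names, and the handling of ties (images with a tied top vote are skipped) are assumptions made for illustration; the source does not state how ties were treated.

```python
from collections import Counter

def majority_consensus(votes):
    """Return the label with the most votes, or None if the top vote is tied or absent."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: tied images have no consensus
    return counts[0][0]

def labeller_agreement(labels):
    """labels: {image_id: {labeller_id: stage}} -> {labeller_id: fraction matching consensus}."""
    agree, total = Counter(), Counter()
    for votes in labels.values():
        consensus = majority_consensus(list(votes.values()))
        if consensus is None:
            continue  # skip images without a consensus (assumption)
        for labeller, stage in votes.items():
            total[labeller] += 1
            agree[labeller] += stage == consensus
    return {labeller: agree[labeller] / total[labeller] for labeller in total}

# Hypothetical example data, not taken from the study.
example = {
    "img1": {"A": "ring", "B": "ring", "C": "trophozoite"},
    "img2": {"A": "schizont", "B": "trophozoite", "C": "schizont"},
    "img3": {"A": "ring", "B": "trophozoite"},  # tied, so skipped
}
agreement = labeller_agreement(example)
```

Because each labeller only contributes to the images they actually labelled, the denominator differs per labeller, which matches the note that experts labelled between 143 and 448 images.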

Supplementary Note 2: Human experts provide noisy initial labels to train supervised models
With typical mammalian cells, almost every normal cell should look approximately the same. Dividing cells look different but are sparse and generally dropped from analysis. In contrast, P. falciparum shows dramatic morphological changes throughout its lifecycle. To apply ML successfully, especially when investigating drug-induced changes, 'normal' needed to be defined throughout the lifecycle. One solution would be to create synchronized cultures at each stage and use these as the ground truth. In reality, even synchronized cultures show parasite-to-parasite variability (a 'trophozoite-heavy' culture is still a mix of some late rings, some trophozoites and some early schizonts). Instead, asynchronous cultures were used and ground truth labels were collected from human experts. This presents a segregation challenge for the ML when channel intensities range greatly (e.g. DAPI brightness between ring and schizont stages [Figure 1c]). Human labels enable training of a standard supervised random forest model to bin parasites into ring / trophozoite / schizont stages. However, these labels are noisy, especially away from canonical images; for example, experts disagree about whether a parasite is a late ring or an early trophozoite. A random forest trained on these labels also disagrees with the held-out test dataset. It is unclear whether this disagreement arises because the human labels are noisy or because the random forest is poorly trained.

Supplementary Note 3: Validating model-derived lifecycle stage ordering
To test the lifecycle continuum defined by the model, six human labellers were asked to order pairs of parasite images by labelling the first image as 'earlier' or 'later' in the lifecycle compared to the second image. This tested whether humans and the model correctly ordered parasite stages, and to what level of information granularity. Majority consensus was met when 4 of the 6 human labellers agreed on the developmental order of the images. There was also a category for when developmental order was unclear ('too close to call'). This definition was met when labellers' votes were split evenly between 'earlier' and 'later' or when no answer had 3 votes.
Of 295 pairs of images, the majority vote classified 75 as 'too close to call', with 220 classified as 'earlier' or 'later'. Individual human labellers gave a vote of 'earlier' or 'later' between 164 and 219 times, and between 89.5 % and 95.8 % of these labels aligned with the consensus.
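The consensus rule for image pairs can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes that any pair in which neither 'earlier' nor 'later' reaches the 4-of-6 threshold is treated as 'too close to call', whereas the text gives additional tie conditions that are not fully specified. The names `pair_consensus` and `min_votes` are hypothetical.

```python
from collections import Counter

TOO_CLOSE = "too close to call"

def pair_consensus(votes, min_votes=4):
    """Return 'earlier' or 'later' if that answer reaches min_votes, else 'too close to call'."""
    counts = Counter(votes)
    for answer in ("earlier", "later"):
        if counts[answer] >= min_votes:
            return answer
    return TOO_CLOSE  # assumption: everything below the threshold is undecided

# Hypothetical votes from six labellers for three image pairs.
clear_pair = pair_consensus(["earlier"] * 4 + ["later"] * 2)
split_pair = pair_consensus(["earlier"] * 3 + ["later"] * 3)
mixed_pair = pair_consensus(["later"] * 5 + [TOO_CLOSE])
```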

Supplementary Note 4: Precision-Recall curve calculations
Precision and recall were calculated against the consensus human paired answers. Precision was calculated as the fraction of before/after images with a definitive answer (i.e. not 'too close to call') that were called correctly. Recall was calculated as the fraction of pairs not called incorrectly that were also given a definitive answer. For the evaluation of the machine learning model, the difference in angle (predicted lifecycle stage) between the two parasite patches was calculated and compared to a 'too close to call' threshold. If the angle difference was less than this threshold, then the ML answer was judged to be 'too close to call'. The curve of precision/recall for the machine learning algorithm was calculated by stepping through this threshold from zero until all pairs were considered too close to call.
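The threshold sweep described above can be sketched in a few lines. Several points here are assumptions rather than details from the source: only pairs with a definitive human consensus are evaluated, a larger angle is taken to mean a later lifecycle stage, angles are treated as plain numbers with no wrap-around handling, and the threshold is advanced in fixed steps. All function and parameter names are hypothetical.

```python
TOO_CLOSE = "too close to call"

def ml_call(angle_first, angle_second, threshold):
    """Call the first image 'earlier' or 'later' than the second, or 'too close to call'."""
    diff = angle_first - angle_second
    if abs(diff) < threshold:
        return TOO_CLOSE
    # Assumption: a larger angle means a later stage; a zero difference falls to 'earlier'.
    return "later" if diff > 0 else "earlier"

def precision_recall(pairs, threshold):
    """pairs: list of (angle_first, angle_second, human_consensus) with definitive consensus."""
    correct = incorrect = undecided = 0
    for angle_first, angle_second, truth in pairs:
        call = ml_call(angle_first, angle_second, threshold)
        if call == TOO_CLOSE:
            undecided += 1
        elif call == truth:
            correct += 1
        else:
            incorrect += 1
    definitive = correct + incorrect
    not_incorrect = correct + undecided
    # Precision: fraction of definitive calls that are correct.
    precision = correct / definitive if definitive else None
    # Recall: fraction of pairs not called incorrectly that received a definitive call.
    recall = correct / not_incorrect if not_incorrect else None
    return precision, recall

def pr_curve(pairs, step):
    """Step the threshold up from zero until every pair is too close to call."""
    max_diff = max(abs(a - b) for a, b, _ in pairs)
    curve, threshold = [], 0.0
    while True:
        curve.append((threshold, *precision_recall(pairs, threshold)))
        if threshold > max_diff:
            break
        threshold += step
    return curve

# Hypothetical angles (in degrees) and consensus answers, not taken from the study.
pairs = [(10, 50, "earlier"), (80, 20, "later"), (30, 35, "later")]
curve = pr_curve(pairs, step=20)
```

Raising the threshold trades recall for precision: near-ties such as the third pair stop being called incorrectly, but fewer pairs receive a definitive answer. Precision is undefined (returned as None here) once every pair is too close to call.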