Machine learning enhances prediction of plants as potential sources of antimalarials

Plants are a rich source of bioactive compounds and a number of plant-derived antiplasmodial compounds have been developed into pharmaceutical drugs for the prevention and treatment of malaria, a major public health challenge. However, identifying plants with antiplasmodial potential can be time-consuming and costly. One approach for selecting plants to investigate is based on ethnobotanical knowledge which, though having provided some major successes, is restricted to a relatively small group of plant species. Machine learning, incorporating ethnobotanical and plant trait data, provides a promising approach to improve the identification of antiplasmodial plants and accelerate the search for new plant-derived antiplasmodial compounds. In this paper we present a novel dataset on antiplasmodial activity for three flowering plant families – Apocynaceae, Loganiaceae and Rubiaceae (together comprising c. 21,100 species) – and demonstrate the ability of machine learning algorithms to predict the antiplasmodial potential of plant species. We evaluate the predictive capability of a variety of algorithms – Support Vector Machines, Logistic Regression, Gradient Boosted Trees and Bayesian Neural Networks – and compare these to two ethnobotanical selection approaches – based on usage as an antimalarial and general usage as a medicine. We evaluate the approaches using the given data and when the given samples are reweighted to correct for sampling biases. In both evaluation settings each of the machine learning models has a higher precision than the ethnobotanical approaches. In the bias-corrected scenario, the Support Vector classifier performs best – attaining a mean precision of 0.67 compared to the best performing ethnobotanical approach with a mean precision of 0.46. We also use the bias correction method and the Support Vector classifier to estimate the potential of plants to provide novel antiplasmodial compounds. 
We estimate that 7677 species in Apocynaceae, Loganiaceae and Rubiaceae warrant further investigation and that at least 1300 active antiplasmodial species are highly unlikely to be investigated by conventional approaches. While traditional and Indigenous knowledge remains vital to our understanding of people-plant relationships and an invaluable source of information, these results indicate a vast and relatively untapped source in the search for new plant-derived antiplasmodial compounds.


COMPILED DATA
Collected traits for Apocynaceae, Rubiaceae and Loganiaceae are given in separate repositories. 12 Where we have relied on literature reviews to collect data, appropriate references can be found in the manually collected data sections of the trait repositories for each family. All finalised trait data and analyses are given in the trait modelling repository. 3 The following traits were collected and used in the analysis (* denotes traits not used to train the machine learning models):

TOOLS
To resolve names in data sources to accepted names in the World Checklist of Vascular Plants (WCVP) (Govaerts et al., 2021), we developed the automatchnames Python library 4 and used automatchnames v0.1 to resolve names to the WCVP V7. The methods used to collect and compile data are gathered in the miningtraitdata v0.1 repository. 5

Literature
Where we have relied on published literature for data, we searched Scopus, PubMed and Google Scholar. When searching for a particular property, we searched for the property term together with every accepted genus in the three families of interest, e.g. 'antiplasmodial aspidosperma'. We also searched for the term together with each of the study family names.
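The search strategy above can be sketched as a simple query generator. This is only an illustration: the function name is hypothetical, the genus list is a two-item placeholder rather than the full set of accepted genera, and lower-casing follows the 'antiplasmodial aspidosperma' example.

```python
from itertools import product


def build_queries(terms, genera, families):
    """Pair every property term with every genus and family name."""
    names = list(genera) + list(families)
    return [f"{term} {name.lower()}" for term, name in product(terms, names)]


# Placeholder inputs: a real run would use every accepted genus in the three families.
queries = build_queries(
    terms=["antiplasmodial"],
    genera=["Aspidosperma", "Strychnos"],
    families=["Apocynaceae", "Loganiaceae", "Rubiaceae"],
)
```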

Geographic Regions with Malaria
Data on global malarial incidence and transmission were generated from (WHO, 2022; The World Bank, 2022; Centers for Disease Control and Prevention, 2022; Bryan et al., 1996; Girod et al., 1995; Lounibos and Conn, 2000; Snow et al., 2012; Chadee et al., 1993). The regions indicated in these sources have then been mapped onto the World geographical scheme for recording plant distributions (level 3) (Brummitt et al., 2001) and are indicated in Figure S1.

Occurrence Records
The GBIF Occurrence records used to generate the environmental data were gathered from the following

Classifying Antiplasmodial Activity
Here we describe in detail the process for classifying plants as active or inactive from reported bioassays. First, when conducting the literature review on antiplasmodial activity, we documented the authors' decisions regarding the degree of activity. Though authors use differing terminology and categories to label activity values, they generally split activity into two or three categories: Active, Weak and Inactive. As we are using binary labels, Active and Inactive, in the following we consider Weak cases to be Inactive.
The main ambiguity that arose when parsing authors' decisions related to the linguistic modifiers used. Any neutral modifier (e.g. Acceptable/Moderate activity) was denoted as Weak/Inactive. Any positive modifier (e.g. Strong, Good etc.) was denoted as Active. Next, we documented the given values for activity and the type of test used, including the strain of malaria, plant part and preparation method.
To provide a binary classification of the given activity values, we separated the tests into three main cases: Crude Extractions, Fractions and Isolations. Each case is subdivided into in vitro and in vivo tests, and these are further divided by measurement unit.
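As a rough illustration of this mapping from authors' wording to binary labels, the sketch below uses only the example terms quoted above. The function name and the simple word-matching rule are assumptions; in practice such decisions are made by reading each paper, not by keyword lookup.

```python
import re

# Terms taken from the examples in the text; Weak and neutral modifiers count as Inactive.
NEUTRAL_OR_NEGATIVE = {"inactive", "weak", "acceptable", "moderate"}
POSITIVE = {"active", "strong", "good"}


def binary_label_from_author(description):
    """Return 'Active', 'Inactive', or None when the wording is not recognised."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    if words & NEUTRAL_OR_NEGATIVE:
        return "Inactive"
    if words & POSITIVE:
        return "Active"
    return None  # unrecognised wording: needs manual review
```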

Crude Extractions
In in vitro cases where authors provide definite IC50 values in µg/ml, we follow a generalised version of the definitions given in (Rasoanaivo et al., 2004). 11 Extracts with activities of < 10 µg/ml are considered to warrant further investigation, so we label samples as follows: This corresponds well to most authors' interpretations.
Sometimes authors provide degrees of inhibition at differing doses rather than IC50 values. In clear active cases (e.g. 80% inhibition at 9 µg/ml) and inactive cases (e.g. 22.35% inhibition at 100 µg/ml), we use the above schema. When there is ambiguity, we use the authors' decisions.
In in vivo contexts, there is no standardised dosing across studies, and cytotoxicity often influences authors' judgements. Rasoanaivo et al. (2004) provide suggested classifications for inhibition rates at 250 mg/kg/day, but it is not clear how to translate these to studies in the literature. We therefore use the authors' decisions here.
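One way to read "clear" cases is that the reported inhibition bounds the IC50 on one side of the 10 µg/ml threshold. The sketch below encodes that reading. It is an assumption, not the rule stated in the text: it presumes a monotonic dose-response curve, and the function name and the `None` return for ambiguous cases are illustrative.

```python
CRUDE_IC50_THRESHOLD = 10.0  # µg/ml, crude extract threshold from the text


def label_from_inhibition(percent_inhibition, dose_ug_per_ml,
                          threshold=CRUDE_IC50_THRESHOLD):
    """Label a single inhibition measurement, or return None if ambiguous."""
    if percent_inhibition >= 50 and dose_ug_per_ml < threshold:
        # At least half inhibited below the threshold dose, so IC50 < threshold.
        return "Active"
    if percent_inhibition < 50 and dose_ug_per_ml >= threshold:
        # Less than half inhibited at or above the threshold dose, so IC50 > threshold.
        return "Inactive"
    return None  # ambiguous: fall back to the authors' decision
```

Under this reading, the two examples in the text (80% at 9 µg/ml and 22.35% at 100 µg/ml) come out as Active and Inactive respectively, while a result such as 60% inhibition at 50 µg/ml stays ambiguous.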

Fractions
Tests using fractions were relatively rare. For a plant containing an active compound, one would expect a fraction containing this compound to be more active than the crude extracts. To reflect this, we use an IC50 threshold of 5 µg/ml. This threshold in general corresponds to authors' decisions.

Isolated Compounds
According to the Medicines for Malaria Venture 12, compounds with IC50 values under 1 µM are of interest for further investigation. Where compounds have been isolated from plants and subsequently tested, we use this threshold in our data for in vitro studies.
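The three in vitro thresholds described above can be collected into one small lookup, sketched here. The names are hypothetical, units are written in ASCII (`ug/ml`, `uM`) for convenience, and treating every threshold as a strict inequality with everything else labelled Inactive is an assumption: the text states "< 10 µg/ml" and "under 1 µM" but only gives "a threshold of 5 µg/ml" for fractions.

```python
# (threshold, unit) for each in vitro case described in the text.
IN_VITRO_THRESHOLDS = {
    "crude": (10.0, "ug/ml"),
    "fraction": (5.0, "ug/ml"),
    "compound": (1.0, "uM"),
}


def classify_in_vitro(sample_type, ic50, unit):
    """Return 'Active' if the IC50 is below the threshold for this sample type."""
    threshold, expected_unit = IN_VITRO_THRESHOLDS[sample_type]
    if unit != expected_unit:
        # No unit conversion is attempted: it would need molecular weights.
        raise ValueError(f"expected {expected_unit} for {sample_type}, got {unit}")
    return "Active" if ic50 < threshold else "Inactive"
```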

Known Antiplasmodial Compounds
Where isolated compounds have been found to be active, we have cross-referenced these compounds with the presence of compounds in other species using KNApSAcK (Afendi et al., 2012) and labelled those species containing these compounds as active. All such species had in fact also been tested for their antiplasmodial activity in bioassays, so the inclusion of data on antiplasmodial compounds has not affected knowledge of the sampling biases.

Final Decision
For each plant we obtain a list of tests for antiplasmodial activity and their associated activity labels. In cases of plants with multiple tests, we assign the label Active if any of the tests are active.
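This aggregation rule amounts to a logical "any" over each plant's test labels. A minimal sketch, with a hypothetical function name and placeholder species names:

```python
from collections import defaultdict


def final_labels(tests):
    """Collapse (species, label) test records into one label per species.

    A species is Active if any of its tests is Active, otherwise Inactive.
    """
    by_species = defaultdict(list)
    for species, label in tests:
        by_species[species].append(label)
    return {
        species: "Active" if "Active" in labels else "Inactive"
        for species, labels in by_species.items()
    }


# Placeholder records for illustration only.
example = final_labels([
    ("Species A", "Inactive"),
    ("Species A", "Active"),
    ("Species B", "Inactive"),
])
```

One consequence of this choice is that a single active test outweighs any number of inactive ones, so labels for well-studied species lean towards Active.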