Improving disaggregation models of malaria incidence by ensembling non-linear models of prevalence

Maps of disease burden are a core tool for the control and elimination of malaria. Reliable routine surveillance data on malaria incidence, typically aggregated to administrative units, are becoming more widely available. Disaggregation regression is an important model framework for estimating high-resolution risk maps from such aggregated data. However, the aggregation of incidence over large, heterogeneous areas means that these data are underpowered for estimating complex, non-linear models. In contrast, prevalence point-surveys are directly linked to local environmental conditions but are uncommon in many areas of the world. Here, we train multiple non-linear, machine learning models on Plasmodium falciparum prevalence point-surveys. We then ensemble the predictions from these machine learning models with a disaggregation regression model that uses aggregated malaria incidence as response data. We find that using a disaggregation regression model to combine predictions from machine learning models improves model accuracy relative to a baseline model.


Extended methods for machine learning models
We trained five machine learning models on each prevalence dataset: elastic net, projection pursuit regression, neural networks, random forest and boosted regression trees. All models were trained on untransformed prevalence data, with any predictions below zero set to zero. For tree-based models, transforming the data makes almost no difference. For the other models, training on empirical-logit-transformed data was considered, but as the final performance metric (correlation between predicted and observed polygon incidence data) is on the absolute scale, it was deemed better to train the models on that scale as well. For all models, the sample size of each survey was used as a weight.
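The shared preprocessing above can be sketched as follows (illustrative Python with made-up survey numbers; the paper's own pipeline used R packages):

```python
import numpy as np

# Hypothetical prevalence surveys: positives / examined gives the prevalence
# response, and the number examined is used as the model weight, as in the text.
examined = np.array([50, 200, 120, 80])
positive = np.array([5, 60, 12, 0])

prevalence = positive / examined          # untransformed response in [0, 1]
weights = examined.astype(float)          # sample size used as the weight

# Models trained on the absolute scale can predict below zero;
# following the text, such predictions are truncated at zero.
raw_predictions = np.array([0.08, 0.31, -0.02, 0.01])
predictions = np.clip(raw_predictions, 0.0, None)
```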
Elastic net models are penalised linear regression models. They find the regression parameters

\[\hat{\beta} = \operatorname*{arg\,min}_{\beta} \sum_i w_i \left(y_i - X_i^{\top}\beta\right)^2 + \lambda\left[\alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\right],\]

where β is the vector of regression parameters, y_i is the prevalence response data, X_i is the vector of environmental covariates for survey i and w_i is the weight given by the sample size. λ is the overall penalty term and α controls the relative contribution of the l1 and l2 penalties: with α = 0 the model collapses to ridge regression, while with α = 1 it is equivalent to the LASSO. α and λ were selected by cross-validation, with candidate values given by a grid search. We examined 15 values for α, spaced evenly between 0.05 and 1, and 15 values for λ: 14 values spread evenly on a log10 scale between 10^-4 and 10^-1, plus λ = 0, which gives the least-squares estimate.
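As an illustration, a weighted elastic net of this form can be fitted by coordinate descent. The sketch below (plain NumPy with toy data, not the R implementation used in the analysis) also builds the α and λ grids described above:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net(X, y, w, lam, alpha, n_iter=200):
    """Weighted elastic net via coordinate descent (illustrative sketch)."""
    W = w.sum()
    # Weighted centring handles the (unpenalised) intercept implicitly.
    xm = (w[:, None] * X).sum(0) / W
    ym = (w * y).sum() / W
    Xc, yc = X - xm, y - ym
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Partial residual excluding coordinate j.
            r = yc - Xc @ beta + Xc[:, j] * beta[j]
            z = (w * Xc[:, j] * r).sum() / W
            denom = (w * Xc[:, j] ** 2).sum() / W + lam * (1 - alpha)
            beta[j] = soft_threshold(z, lam * alpha) / denom
    intercept = ym - xm @ beta
    return intercept, beta

# Hyperparameter grids matching the description: 15 alphas in [0.05, 1],
# and 14 lambdas on a log10 scale from 1e-4 to 1e-1 plus lambda = 0.
alphas = np.linspace(0.05, 1.0, 15)
lambdas = np.concatenate([[0.0], np.logspace(-4, -1, 14)])

# Toy data where y depends only on the first covariate; with lambda = 0
# the fit reduces to weighted least squares.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = 2.0 * X[:, 0] + 0.5
w = np.ones(4)
intercept, beta = elastic_net(X, y, w, lam=0.0, alpha=1.0)
```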
Projection pursuit regression fits a number of smooth functions to linear combinations of the covariates. It involves finding regression parameters β_j in the model

\[y_i = \beta_0 + \sum_{j=1}^{r} f_j\!\left(\beta_j^{\top} X_i\right),\]

where the f_j's are smoothing functions, X_i is the vector of covariates and r is the number of smooth terms to be fitted (a hyperparameter selected by cross-validation). The model is fitted iteratively r times, with each successive term fitted to the residuals of the previous ones. We considered r in the range 1 to 15.
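A minimal sketch of this iterative fitting scheme, using a random search over projection directions and a cubic polynomial as a stand-in smoother (hypothetical demo data; the analysis itself used an R implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stage(X, resid, n_dir=200, deg=3):
    """One projection pursuit stage (sketch): try random unit directions,
    smooth each projection with a cubic polynomial, keep the best fit."""
    best = None
    for _ in range(n_dir):
        b = rng.normal(size=X.shape[1])
        b /= np.linalg.norm(b)
        t = X @ b
        coefs = np.polyfit(t, resid, deg)
        sse = ((resid - np.polyval(coefs, t)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, b, coefs)
    return best

def ppr(X, y, r=2):
    """Fit r stages iteratively, each to the residuals of the previous ones."""
    intercept, resid = y.mean(), y - y.mean()
    stages = []
    for _ in range(r):
        _, b, coefs = fit_stage(X, resid)
        resid = resid - np.polyval(coefs, X @ b)
        stages.append((b, coefs))
    return intercept, stages

def predict(intercept, stages, X):
    return intercept + sum(np.polyval(c, X @ b) for b, c in stages)

# Hypothetical demo: the response depends on one projection of X.
X = rng.normal(size=(100, 3))
y = (X @ np.array([0.6, 0.8, 0.0])) ** 2 + 0.1 * rng.normal(size=100)
intercept, stages = ppr(X, y, r=2)
pred = predict(intercept, stages, X)
```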
We fitted neural networks with a single hidden layer, logistic activation functions for the hidden units and a linear activation function for the output. This architecture, with h hidden units and c covariates, can be written as

\[\hat{y} = \beta_0 + \sum_{h} \beta_{ho}\,\sigma\!\Big(\beta_{0h} + \sum_{i} \beta_{ih} X_i\Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},\]

with β_0 being the intercept, β_{0h} being the intercept for each hidden unit, β_{ih} being the weights between the inputs and hidden units and β_{ho} being the weights between the hidden units and the output. The nnet package optimises this model using back-propagation and includes an additional hyperparameter, 'decay', which is a penalty on the sum of the squares of the weights. We considered a grid of hyperparameters: for the number of hidden units, 15 equally spaced values between 1 and 29; for decay, 14 values spread evenly on a log10 scale between 10^-4 and 10^-1, plus a decay of 0.
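The forward pass and decay-penalised loss can be written out directly. The sketch below is illustrative NumPy with hypothetical parameter values, not the nnet optimiser itself (which also penalises the bias terms, as assumed here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W_in, b_in, w_out, b_out):
    """Single hidden layer: logistic hidden units, linear output unit."""
    H = sigmoid(X @ W_in + b_in)   # hidden activations
    return H @ w_out + b_out       # linear output

def penalised_loss(X, y, w, params, decay):
    """Weighted squared error plus an nnet-style 'decay' penalty on the
    sum of squared weights (sketch; nnet optimises by back-propagation)."""
    W_in, b_in, w_out, b_out = params
    pred = forward(X, W_in, b_in, w_out, b_out)
    sq = (w * (y - pred) ** 2).sum()
    pen = decay * ((W_in ** 2).sum() + (b_in ** 2).sum()
                   + (w_out ** 2).sum() + b_out ** 2)
    return sq + pen

# Hypothetical toy configuration: 2 covariates, 3 hidden units, zero input
# weights, so every hidden unit outputs sigmoid(0) = 0.5.
X = np.zeros((4, 2))
params = (np.zeros((2, 3)), np.zeros(3), np.full(3, 2.0), 1.0)
pred = forward(X, *params)                    # 3 * 0.5 * 2 + 1 = 4 everywhere
loss = penalised_loss(X, np.full(4, 4.0), np.ones(4), params, decay=0.1)
```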
The Random Forest algorithm creates an ensemble of regression trees. To control overfitting, each tree is fitted to a bootstrap sample of the data. To further control overfitting, at each node of each tree, mtry randomly selected covariates are considered instead of the full set of covariates. For each node in a tree, the covariate and split point are found that minimise the sum of the variances in the resulting nodes. However, we consider two further variants of this rule.
We consider "extremely randomised trees", in which the split point for each covariate is drawn uniformly at random. We also consider "conditional inference trees", in which, after a split point is chosen, a significance test is performed and the split point is only used if the data on each side of the split are significantly different (with a significance threshold of 0.5). The trees are grown until all nodes reach a minimum node size. The split rule and the parameters mtry and minimum node size were chosen by cross-validation, using a random search over 15 parameter sets.
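A single node split under the "extremely randomised trees" variant can be sketched as follows (illustrative NumPy with made-up data; the analysis itself used the ranger package):

```python
import numpy as np

rng = np.random.default_rng(1)

def variance_sum(y_left, y_right):
    """Split criterion from the text: sum of the variances in the two nodes."""
    return y_left.var() + y_right.var()

def best_random_split(X, y, mtry):
    """One node split, extremely-randomised style: consider mtry randomly
    selected covariates, draw one uniformly random split point per covariate,
    and keep the split minimising the criterion."""
    features = rng.choice(X.shape[1], size=mtry, replace=False)
    best = None
    for j in features:
        lo, hi = X[:, j].min(), X[:, j].max()
        s = rng.uniform(lo, hi)                      # random split point
        left, right = y[X[:, j] <= s], y[X[:, j] > s]
        if len(left) == 0 or len(right) == 0:
            continue
        crit = variance_sum(left, right)
        if best is None or crit < best[0]:
            best = (crit, j, s)
    return best

# Made-up node data: the response is driven by the first covariate.
X = rng.normal(size=(50, 4))
y = (X[:, 0] > 0).astype(float)
crit, j, s = best_random_split(X, y, mtry=4)
```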
Finally, boosted regression trees also build an ensemble of trees. The trees are built sequentially and, for each successive tree, the data are weighted by the size of the residual between the observed data and the prediction from all previous trees. The XGBoost implementation of boosted regression trees contains a number of hyperparameters: nrounds is the number of additive trees; max_depth is the maximum depth of each tree; eta reduces the additional weight given to poorly predicted data points; gamma sets the minimum loss reduction needed for a split to be allowed; colsample_bytree is the proportion of covariates considered for each tree; min_child_weight causes the growing of a tree to stop when a node has this many data points; and subsample is the proportion of the data randomly sampled for training each tree. Again, we used a random search and considered 15 parameter sets.
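The sequential residual-fitting scheme can be illustrated with depth-one trees (stumps) and a shrinkage factor eta. This is a toy sketch of gradient boosting on toy data, not XGBoost itself, which adds gamma, column subsampling and the other hyperparameters listed above:

```python
import numpy as np

def fit_stump(X, resid):
    """Best depth-1 regression tree on the current residuals: exhaustive
    search over features and split points, squared-error criterion."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:
            left, right = resid[X[:, j] <= s], resid[X[:, j] > s]
            sse = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, s, left.mean(), right.mean())
    return best[1:]

def boost(X, y, nrounds=50, eta=0.3):
    """Boosting in miniature: each tree is fitted to the residuals of the
    running prediction, and eta shrinks each tree's contribution."""
    pred = np.full(len(y), y.mean())
    trees = []
    for _ in range(nrounds):
        j, s, lv, rv = fit_stump(X, y - pred)
        pred += eta * np.where(X[:, j] <= s, lv, rv)
        trees.append((j, s, lv, rv))
    return y.mean(), trees, pred

# Toy step-function data that a sequence of stumps fits well.
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = (X[:, 0] > 0.5).astype(float)
base, trees, pred = boost(X, y)
```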

Hyperparameter optimisation
As ranger and GBM were tuned with a random hyperparameter search, the optimisation plots are difficult to interpret and are not included.


Figure S34: Scatter plot of predictions and held-out observed data for the GBM trained on the Madagascar dataset.

Figure S41: Scatter plot of predictions and held-out observed data for the PPR trained on the Senegal dataset.
Figure S42: Scatter plot of predictions and held-out observed data for the Random Forest trained on the Senegal dataset.
Figure S43: Scatter plot of predictions and held-out observed data for the GBM trained on the Senegal dataset.
Figure S44: Optimisation for elastic net hyperparameters trained on the Senegal dataset.