Towards a Predictive Analytics-Based Intelligent Malaria Outbreak Warning System †

Malaria is one of the most serious infectious diseases causing public health problems in the world. It affects about two-thirds of the world population, with estimated resultant deaths close to a million annually. The effects of this disease are much more profound in third world countries, which have very limited medical resources. When an intense outbreak occurs, most of these countries cannot cope with the high number of patients due to the lack of medicine, equipment and hospital facilities. The prevention or reduction of the risk factors of this disease is very challenging, especially in third world countries, due to poverty and economic instability. Technology can offer alternative solutions by providing early detection mechanisms that help to control the spread of the disease and allow treatment facilities to be managed in advance to ensure a more timely health service, which can save thousands of lives. In this study, we have deployed an intelligent malaria outbreak early warning system, which is a mobile application that predicts malaria outbreaks based on climatic factors using machine learning algorithms. The system will help hospitals, healthcare providers, and health organizations take precautions in time and utilize their resources in case of emergency. To the best of our knowledge, the system developed in this paper is the first publicly available application of its kind. Since confounding effects of climatic factors have a greater influence on the incidence of malaria, we have also conducted extensive research on exploring a new ecosystem model for the assessment of hidden ecological factors and identified three confounding factors that significantly influence malaria incidence. In addition to deploying a smart healthcare application, this paper therefore also makes a significant contribution by identifying hidden ecological factors of malaria.



Introduction
Malaria, as one of the most serious infectious diseases causing public health problems in the world, affects about two-thirds of the world population, with estimated resultant deaths close to a million annually [1]. Its prevalence can be significantly attributed to climate factors, usually worsened by human factors through poor sanitation, overwhelmed sewage and deforestation. These climatic factors were found to contribute to the incidence of malaria [2], which apparently imposes a greater challenge to human life today.
The effects of malaria are much more profound in third world countries due to very limited medical resources. When an intense outbreak occurs, most of these countries cannot cope with the high number of patients due to the lack of medicine, equipment and hospital facilities. The prevention or reduction of the risk factors of this disease is very challenging, especially in these countries, due to poverty and economic instability. Technology can offer alternative solutions by providing early detection mechanisms that help to control the spread of the disease and allow treatment facilities to be managed in advance to ensure a more timely health service, which can save thousands of lives. The availability of an early detection system will not only prevent or decrease the large spread of malaria by creating quarantine zones, but also help healthcare providers deliver the necessary medical care on time by managing resources and calling for international aid and support, if needed.
In this study, we aim to design and deploy an intelligent malaria outbreak early warning system, in the form of a mobile application, that predicts malaria outbreaks based on climatic factors using machine learning algorithms. The system will help hospitals, healthcare providers, and health organizations take precautions in time and utilize their resources in case of emergency. To the best of our knowledge, the system developed in this paper is the first publicly available application of its kind.
As well as deploying a smart healthcare application, this paper also makes a significant contribution by identifying hidden ecological factors of malaria (e.g., temperature, humidity, wind, location, drought, floods, etc.). Since confounding effects of climatic factors have a greater influence on the incidence of malaria, we have also conducted extensive research on exploring a new ecosystem model for the assessment of hidden ecological factors and identified three confounding factors that significantly contribute to the outbreak of malaria.
In this paper, we use an efficient methodology, comprising four stages. In the first stage, we have collected data from some repositories. Unfortunately, most of this data was incomplete in terms of climate factors. We have completed the dataset with the climate variables using satellite-based meteorological data obtained from CFSR (Climate Forecast System Reanalysis).
In the second stage, we have identified hidden ecological factors of malaria. The fundamental concept behind this emanated from the fact that a causal relationship exists among the climatic factors [3]. Some recent studies [4,5] combined meteorological variables together with malaria incidence data and established time series models for predicting malaria incidence. Regression and correlation analysis modelling was applied and using meteorological variables the trend of malaria incidence was determined [6]. Also, one of the most recent studies presented in this direction [7] uses a hybrid approach for time-series modelling and lagged-regression analysis of climate data combined with reported malaria incidence cases. Their result showed that malaria incidence in the area studied has a significant association with relative humidity, whereas temperature and precipitation were found to have negligible effects. This finding might particularly reveal that malaria incidence can be strongly influenced by relative humidity alone. However, this methodology suffers weaknesses due to its inability to capture the pre-determined existing causal relationship among the climate factors.
In this study, we use the partial least squares path modelling (PLS-PM) [8] methodology to analyse the causal relationships among meteorological variables, e.g., minimum average temperature, maximum average temperature, relative humidity, wind speed, precipitation and solar radiation, and explore their impact on the outbreak of malaria. In doing so, we develop an integrated model that provides insight into which confounding effects, not determined in advance, can be identified as hidden ecological factors.
In the third stage, we have used machine learning algorithms to identify a pattern/model that will be used to make an accurate prediction of malaria outbreak. We have evaluated the predictions of the machine learning algorithms and obtained a very high accuracy rate. Machine learning has been used for the prediction and diagnosis of several diseases, e.g., Parkinson's [9], cancer [10] and heart disease [11]. Among machine learning methods, Support Vector Machines (SVM) [12] have been used in malaria incidence prediction [13], but that study has several shortcomings: (i) the dataset used was extremely small (only 33 samples), which makes the accuracy of prediction questionable; (ii) the dataset was used without analysing ecological factors, which could result in the inclusion of statistically insignificant variables in the prediction model, and hence could cause overfitting; (iii) there is no systematic methodology to transform this predictor into a smart healthcare system.
In the fourth stage, we have developed a mobile application by embedding the best predictor generated in the previous stage. The application reads climatic information, i.e., temperature, relative humidity, wind speed, solar radiation and precipitation, from free weather and geographical Application Programming Interfaces (APIs). It then predicts the possibility of a malaria outbreak several days in advance (based on available forecasting data).
The subsequent sections of this paper are presented as follows: In Section 2, we present the complete analysis of identification of hidden ecological factors for the incidence of malaria transmission and its health implications to the change in biodiversity. Section 3 presents the intelligent malaria outbreak warning system, comprising data pre-processing, generating a prediction model using a machine learning algorithm and deployment of an intelligent mobile application. Section 4 concludes the paper by providing the summary of our results and our future work.

Assessment of Hidden Ecological Factors
Climate factors are the drivers of malaria transmission [14]; however, a study analysing the causal ecological relationship among the climatic factors that affect the incidence of malaria is still lacking, particularly in Sub-Saharan African countries.
The malaria ecosystem comprises four main components: the human host, the mosquito vector, the parasites and the environmental conditions (see Figure 1). These components are very dynamic in nature due to the inherent characteristics of ecology and the anticipated change to biodiversity because of global warming. The works in [15][16][17] reported that ecological changes would adversely affect human health in ways that are both obvious and obscure. Growing evidence also suggests that, with the rise in temperature expected from global warming, some regions previously unexposed to malaria transmission would have a 50% chance of experiencing it, owing to the link between malaria incidence and ecological factors [18]. The relationship between environmental changes and human health cannot be overemphasized because of the inherent variability and complexity of human nature. In many circumstances, grasslands and forest are converted for agriculture or to reduce communicable disease, including wetland drainage for the prevention and control of malaria [17]. These activities can either succeed in their designed purpose or lead to unintended negative health effects. Also, transforming forest to augment food production may, in the long run, create a suitable environment for disease-carrying agents such as the mosquitoes that transmit malaria [19].

Study Site and Population
Ejisu-Juaben Municipal has a population of 143,762 [20], lies within latitudes 1°15′ N and 1°45′ N and longitudes 6°15′ W and 7°00′ W, and occupies a land area of 582.5 km² [21]. The vegetation of the municipality is typical semi-deciduous forest (see Figure 2), with undulating topography and a low altitude of about 240-300 m above sea level [21]. The rainfall pattern of the area is bi-modal (i.e., two distinct rainy seasons in a year), characterized by major and minor rainfall. The major rainfall runs from March to July with average annual rainfall between 1200 mm and 1500 mm, while the minor rainfall begins in September and tapers off in November with annual average rainfall of 900-1120 mm. Usually, December through February is hot, dry and dusty with a mean annual temperature of 25-32 °C, and the relative humidity is moderately high during the rainy seasons [21]. Figure 2 presents the map of Ejisu-Juaben Municipal, which lies within the red-squared portion labelled Kumasi, the capital city of the Ashanti Region, in southern Ghana.

Data Collection and Source
A total of 85,627 confirmed diagnosed cases of malaria for the five-year period from 2009 to 2013 were retrieved from [22]. The distributional pattern of malaria cases reported in the study area indicates a high malaria incidence. We sought data on climate factors from the designated weather station of the study area [20]; unfortunately, very few data are available and many values are missing, perhaps because the station records were not kept up to date. Since the available data are not sufficient for the analysis, we overcame this challenge by using satellite-based meteorological data obtained via [23]. We used the boundary dimensions of [22], at latitude 6.7989° N to 6.6823° S and longitude −1.5656° W to −1.4186° E, to demarcate the location of the study area on the satellite globe map. Within the demarcated area, we identified a weather station and then generated the data for the climate variables of interest. The Ghana malaria incidence data are sufficient for the application of PLS-PM because of its suitability for handling small samples, non-normality, multiple dimensions and multicollinearity [24,25]. However, the sample is not large enough to obtain high accuracy when applying machine learning algorithms, and a small dataset might also cause overfitting. For that reason, we combined the malaria incidence data used in [26] with those of [22] and proceeded with the analysis.

Factor Analysis
Exploratory factor analysis (EFA) is one of the techniques for factor analysis (FA). It is primarily used in statistics to describe variance among observed correlated variables in terms of potentially a smaller number of unobserved variables, usually referred to as factors [27]. In this work, EFA was employed to search for confounding ecological factors that are latent [8,27] from the set of observed meteorological variables.
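We illustrate the FA technique with a simple mathematical sketch in which the observed variables are expressed as linear combinations of the potential factors plus residual terms. Consider $M$ observed variables $Y_1, Y_2, \cdots, Y_M$, and assume they are linearly related to a small number of unobservable (latent) factors $F_1, F_2, \cdots, F_N$, with $N \ll M$, such that:

$$
\begin{aligned}
Y_1 &= \beta_{10} + \beta_{11} F_1 + \cdots + \beta_{1N} F_N + e_1 \\
Y_2 &= \beta_{20} + \beta_{21} F_1 + \cdots + \beta_{2N} F_N + e_2 \\
&\;\;\vdots \\
Y_M &= \beta_{M0} + \beta_{M1} F_1 + \cdots + \beta_{MN} F_N + e_M
\end{aligned}
\tag{1}
$$

where $\beta_{ik}$ denotes the loading of $Y_i$ on factor $F_k$ and $e_1, \cdots, e_M$ are the residual terms, assuming that $E(e_i) = 0$ and $\operatorname{Var}(e_i) = \delta_i^2$. The unobservable factors $F_j$ are independent of each other, with $E(F_j) = 0$ and $\operatorname{Var}(F_j) = 1$. These two assumptions stand as the robust pre-conditions for the application of structural equation modelling (SEM). The loading scores can be obtained from the covariance and variance of any two observed variables using Equation (2):

$$
\operatorname{Cov}(Y_i, Y_j) = \sum_{k=1}^{N} \beta_{ik}\,\beta_{jk} \quad (i \neq j), \qquad
\operatorname{Var}(Y_i) = \sum_{k=1}^{N} \beta_{ik}^{2} + \delta_i^{2}
\tag{2}
$$

where the summation in the variance term of Equation (2) denotes the communality of the variable, i.e., the part of its variance that is explained by the common factors $F_1, \cdots, F_N$.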

Structural Equation Modelling
SEM is a very popular technique with multidisciplinary applications that combines the measurement and structural models [28][29][30]. In Figure 3a, we present a complex hypothetical SEM showing the causal relationship between malaria incidence and latent ecological factors together with their observed variables. We use ellipses to represent latent factors and rectangles to represent the observed variables.
The following system of Equation (3) describes the SEM technique in which the observed variables can be expressed as a linear combination of the potential factors plus residual terms. We therefore present SEM mathematical representations from Figure 3b as follows:

Estimation of PLS-PM
The technique called PLS-PM or PLS-SEM was developed by [31] and was chosen here because it copes with small sample sizes, non-normality, multiple dimensions and multicollinearity [23,24]. We have identified three hidden factors using EFA, and subsequently applied SEM for construction of the model (see Figure 3b). PLS-PM is basically divided into three components: estimation of the LVs, estimation of the inner and outer models, and estimation of the structural relations. The PLS algorithm is essentially a sequence of regressions in terms of weight vectors [32] and estimates the values of the LVs (factor scores) iteratively until convergence is achieved. The fundamental PLS algorithm follows the one suggested by [30] (see Appendixes A.1-A.3 for detailed procedural descriptions). PLS-PM is a component-based estimation technique that uses an iterative algorithm, separately analyzes the blocks of the measurement model and estimates the path coefficients in the structural model [33]. We used the semPLS package in R for the estimation of the PLS-SEM parameters, including the analysis presented in Section 2.6.
For estimating parameters of SEM, we invoked the PLS technique, and further used 10,000 samples for the bootstrapping analysis instead of the default number of samples set to 500 selections [33]. Also, the PLS-PM latent variable scores were expressed as a linear combination of their observed variables and treated as an error-free substitute for the observed variables [33].
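The bootstrapping step can be illustrated with a minimal sketch. It computes a percentile bootstrap confidence interval from 10,000 resamples for a plain Pearson correlation, not for the PLS-PM loadings or path coefficients themselves, and the exact procedure inside semPLS may differ. The data are synthetic, and the names `pearson`, `bootstrap_ci`, `humidity` and `cases` are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def bootstrap_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for pearson(x, y)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        # Resample observation indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(alpha / 2 * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic, hypothetical data standing in for 60 monthly observations.
rng = random.Random(1)
humidity = [rng.gauss(75, 5) for _ in range(60)]
cases = [20 * h + rng.gauss(0, 150) for h in humidity]

lower, upper = bootstrap_ci(humidity, cases)
# An estimate is read as significant when its interval excludes zero.
significant = not (lower <= 0.0 <= upper)
print(f"95% bootstrap CI: [{lower:.3f}, {upper:.3f}], significant: {significant}")
```

The same reading applies to any bootstrapped estimate: an interval that contains zero means the estimate is not significant at the chosen level.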

Measurement Model
The model, presented in Figure 3b, shows how the observed Measurement Variables (MVs) are related to their Latent Variables (LVs). Without loss of generality, for a good representation of the inner model, the following assumptions must hold:
• The matrix of MVs, Y, is scaled to have zero mean and unit variance.
• Each block of MVs, Y_g, is already transformed to be positively correlated for all LVs.

The measurement model is broadly classified as either reflective (Mode A) or formative (Mode B) [26], depending on the relation between the LVs and their MVs.

Mode A
In this form, each block of MVs reflects its LV and can be represented in multivariate regression form, in which the weights w_g can be estimated using the least squares method.

Mode B
In this form, the LV is considered to be formed by its MVs and is represented by a multiple regression; the estimate of w_g can again be obtained by the method of least squares.
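For reference, the least squares estimates of the outer weights in the two modes are commonly written as below in the PLS path modelling literature. The notation is an assumption made for this sketch rather than the paper's own: $Y_g$ is the $n \times K_g$ block of scaled MVs belonging to LV $g$, and $\tilde{z}_g$ is the $n$-vector holding the current inner estimate of that LV.

```latex
% Mode A (reflective): each MV in the block is regressed on the LV estimate
w_g^{\top} = \left( \tilde{z}_g^{\top} \tilde{z}_g \right)^{-1} \tilde{z}_g^{\top} Y_g

% Mode B (formative): the LV estimate is regressed on all MVs of its block
w_g = \left( Y_g^{\top} Y_g \right)^{-1} Y_g^{\top} \tilde{z}_g
```

In Mode A the weights are simple regression coefficients computed one MV at a time, which is why that mode is less affected by collinearity within a block than Mode B.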

Presentation of Results
In the application of PLS-SEM, three weighting schemes (centroid, factorial and path weighting) are used for model specification and estimation. The conceptual SEM presented in Figure 3a shows the hypothetical causal relationship between the latent (hidden) variables, the observed meteorological (manifest) variables and the occurrence of malaria. For the identification of confounding hidden variables, we performed exploratory factor analysis (EFA) [34]. From the results, three hidden factors were identified: Factor I (related to the minimum temperature and relative humidity), Factor II (related to the maximum temperature and solar radiation) and Factor III (related to precipitation and wind speed). The identified factors accounted for 64% of the total variance, with χ² = 13.91, df = 8 and p-value = 0.0841 at the α = 5% level of significance. Because the p-value exceeds α, the hypothesis that three factors are sufficient is not rejected; this result provides sufficient evidence to explain malaria incidence in the study area.
We also used the Guttman-Kaiser criterion [35] and the Cattell scree plot [36] to determine the number of factors to extract, and the result reconfirmed the existence of three hidden ecological factors in the incidence of malaria. For the Guttman-Kaiser criterion, the eigenvalues computed from the correlation matrix are 2.71, 1.53, 1.02, 0.82, 0.57, 0.29 and 0.05 (see Table 1), and the rule is to extract only those factors whose eigenvalues are greater than unity. Discarding the factors with eigenvalues below unity leaves three eigenvalues, which indicates the number of factors to be considered. Similarly, the Cattell scree plot presented in Figure 4 facilitates decisions regarding the number of factors to retain.
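The Guttman-Kaiser rule can be checked directly on the eigenvalues quoted above. The sketch below is only an illustration of the rule; the variable names are hypothetical. The share it prints is the proportion of the eigenvalue total carried by the retained components, which need not equal the 64% reported for the fitted factor model.

```python
# Eigenvalues of the correlation matrix as reported in the text.
eigenvalues = [2.71, 1.53, 1.02, 0.82, 0.57, 0.29, 0.05]

# Guttman-Kaiser criterion: retain components with eigenvalue above unity.
retained = [ev for ev in eigenvalues if ev > 1.0]
n_factors = len(retained)

# Proportion of the total carried by each component (scree plot heights).
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]
retained_share = sum(retained) / total

print(f"factors retained: {n_factors}")
print(f"share of eigenvalue total in retained components: {retained_share:.2f}")
```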
By analysing Table 1, we obtained the scree plot shown in Figure 4, which represents the relative proportion of variance accounted for by the components. In the scree plot, the first three components, whose eigenvalues are greater than unity, lie above the parallel indicator line, while the subsequent components, with eigenvalues below unity, line up beneath it. It is nevertheless important to evaluate the variance accounted for by the few eigenvalues regarded as sufficient, so that we can focus on them and discard the remaining factors as noise.
In Table 2, we present Pearson's cross-correlation between the meteorological variables and malaria incidence at lags from 0 to 3 months. Lag 0, Lag 1 and Lag 2 (i.e., 0, 1 and 2 months) in Table 2 indicate the lagged correlation between the climate variables and the incidence of malaria in the study area. We observed that, at a lag of 1 month, the minimum temperature, maximum temperature and relative humidity have a positive association with malaria incidence (0.321, 0.215 and 0.254, respectively), while precipitation is negatively correlated with malaria incidence (−0.292). This suggests that a lag of 1 month gives the mosquitoes enough time to reproduce and complete the extrinsic incubation period (EIP), after which they become fully active in transmitting malaria infection. This result is consistent with other relevant studies on the influence of meteorological variables on malaria incidence [37]. The 1-month lag is sufficient to capture the pattern of malaria transmission in the study area for various strains of Plasmodium parasites with definite lengths of EIP. This period usually takes about 10-15 days [38] and varies with location, parasite species and climatic resolution. At Lag 0 and Lag 2, the minimum temperature, precipitation and relative humidity are negatively correlated with malaria incidence, whereas the maximum temperature has correlations of 0.284 and 0.092, respectively. These results indicate that malaria transmission in the study area suffered a negative effect at Lag 0 and Lag 2, which might be attributed to the bi-annual rainfall pattern, low relative humidity (say, less than 50%) and the inability of mosquitoes to complete the EIP cycle.
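The lagged cross-correlation described above pairs the climate value of month t − lag with the case count of month t. A minimal sketch follows, using synthetic data constructed so that cases respond to the previous month's minimum temperature; the names `lagged_correlation`, `min_temp` and `cases` are hypothetical and the values are not the study's data.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def lagged_correlation(climate, cases, lag):
    """Correlate the climate value in month t - lag with cases in month t."""
    if lag == 0:
        return pearson(climate, cases)
    return pearson(climate[:-lag], cases[lag:])

# Synthetic series: 60 months, cases driven by last month's temperature.
rng = random.Random(42)
n_months = 60
min_temp = [rng.gauss(22, 1.5) for _ in range(n_months)]
cases = [1500.0] + [60 * min_temp[t - 1] + rng.gauss(0, 10)
                    for t in range(1, n_months)]

correlations = {lag: lagged_correlation(min_temp, cases, lag) for lag in range(4)}
best_lag = max(correlations, key=lambda lag: abs(correlations[lag]))

for lag, r in correlations.items():
    print(f"lag {lag}: r = {r:+.3f}")
print(f"strongest association at lag {best_lag}")
```

Each additional month of lag drops one observation from the series, so correlations at longer lags rest on slightly fewer pairs.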
In general, the results showed that maximum temperature, minimum temperature and relative humidity were positively related to malaria incidence at a lag of 1 month (i.e., a month in advance), whereas precipitation had a negative association in the study area.
Some important summary statistics are presented in Table 2, which describe the distributional pattern of the climate indicators of malaria incidence and their variance inflation factor (VIF). In factor analysis, multicollinearity can be used as a diagnostic check prior to the application of regression analysis, since variables with high factor loadings are typically multicollinear. We computed the VIF of the climate variables to measure the degree of multicollinearity and used its magnitude to identify which variables are independent. In Table 2, the minimum temperature and relative humidity have VIFs of 8.7919 and 9.0065, which indicate a high degree of multicollinearity. The results thus show which variables act as largely independent predictors of malaria incidence in the study area, and the degree to which they are independent gives evidence for determining the major factors. The values of kurtosis (see Table 2) are positive for all indicators, indicating peaked distributions of the climate variables, except for wind speed, which has a flat distribution. The positive values listed in Table 2 indicate that the peakedness of the distribution of the climate variables particularly influences malaria incidence. Also, the standard error estimates provide information on the statistical accuracy of the climate variables; the larger the standard error, the wider the confidence interval of the statistic, and vice versa.
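As an illustration of the VIF diagnostic, the sketch below uses the identity that the VIF of variable j equals the j-th diagonal element of the inverse of the predictors' correlation matrix. The data are synthetic, with humidity deliberately constructed to be collinear with minimum temperature; the function and variable names are hypothetical, and this is not the computation used to produce Table 2.

```python
import random

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    centred = [[v - sum(c) / n for v in c] for c in columns]
    norms = [sum(v * v for v in c) ** 0.5 for c in centred]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centred[i], centred[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    p = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

def vif(columns):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Synthetic predictors: humidity is built to be collinear with min_temp.
rng = random.Random(0)
min_temp = [rng.gauss(22, 1.5) for _ in range(60)]
humidity = [3 * t + rng.gauss(0, 1) for t in min_temp]
wind_speed = [rng.gauss(5, 1) for _ in range(60)]

names = ["min_temp", "humidity", "wind_speed"]
vifs = dict(zip(names, vif([min_temp, humidity, wind_speed])))
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.2f}")
```

A VIF close to 1 indicates a predictor that is nearly uncorrelated with the others, while large values flag the collinear pair.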
Non-normality of the dataset is one motivation for adopting PLS-SEM, which is very robust when used on extremely non-normal data [39]. We examined the degree to which the data on malaria incidence are non-normal using the Shapiro-Wilk test in R (version 3.4.1, University of Auckland, New Zealand). The null hypothesis (H0) of this test is that the data come from a normal distribution, and the test is particularly suited to small sample sizes, i.e., fewer than 2000 observations [40]. With W = 0.9486 and p-value = 0.0134 at α = 0.05, H0 is rejected, indicating that the malaria incidence dataset is non-normal. We also tested for normality graphically using a quantile-quantile (Q-Q) plot [41]. This approach plots the ranked samples of the dataset against the same number of ranked theoretical samples from a normal distribution. The plot shown in Figure 5 clearly indicates that the data points for malaria incidence deviate from the straight line. Hence, the Q-Q plot also shows that the malaria incidence dataset is not normally distributed.
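The construction of a Q-Q plot can be sketched as follows. The code pairs the ordered sample with standard normal quantiles and uses the correlation of those pairs as a rough measure of how straight the plot is. This is not the Shapiro-Wilk test, only a related illustration; the plotting positions (i − 0.5)/n, the two deterministic synthetic samples and all names are assumptions of the sketch.

```python
import math
from statistics import NormalDist

def qq_points(sample):
    """Pair ordered sample values with standard normal quantiles."""
    n = len(sample)
    normal = NormalDist()
    theoretical = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sorted(sample)

def straightness(theoretical, ordered):
    """Pearson correlation of the Q-Q points; 1.0 means a straight line."""
    n = len(ordered)
    mt, mo = sum(theoretical) / n, sum(ordered) / n
    sto = sum((t - mt) * (o - mo) for t, o in zip(theoretical, ordered))
    stt = sum((t - mt) ** 2 for t in theoretical)
    soo = sum((o - mo) ** 2 for o in ordered)
    return sto / (stt * soo) ** 0.5

# Two deterministic synthetic samples of 60 values each:
# one exactly normal-shaped, one right-skewed (log-normal-shaped).
quantiles = [NormalDist().inv_cdf((i - 0.5) / 60) for i in range(1, 61)]
normal_like = [50 + 10 * q for q in quantiles]
skewed = [math.exp(q) for q in quantiles]

r_normal = straightness(*qq_points(normal_like))
r_skewed = straightness(*qq_points(skewed))
print(f"normal-shaped sample: r = {r_normal:.4f}")
print(f"skewed sample:        r = {r_skewed:.4f}")
```

Points from the skewed sample bend away from the line in the upper tail, which is the kind of deviation the text describes for Figure 5.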
In Table 3, we show the factor score estimates for the path coefficients of the SEM, estimated using PLS path modelling under three different structural model weighting schemes. We observed that the centroid scheme (A) converges fastest, after 12 iterations, while the factorial (B) and path weighting (C) schemes both converge after 15 iterations. The comparison of weighting schemes takes into account the number of iterations used for calculating the PLS results; the algorithm runs until the stop criterion is met or the maximum number of iterations is reached. From Table 3, we can observe that the B and C weighting schemes converge after the same number of iterations in estimating the parameters of the SEM. The path weighting scheme provides the highest R² value for endogenous latent variables in the PLS path model specification and estimation. This result shows that the C weighting scheme is better than A and B, as suggested by [42], in terms of robustness and also when the path model includes higher-order constructs.

Table 3. Factor scores for path coefficients in the PLS-PM using three weighting schemes.

Measurement/Structural Model | Parameter Estimate: Centroid (A) | Factorial (B) | Path Weighting (C)
Minimum temperature ←− Factor I

Table 4 presents the results of bootstrap sampling for the outer loadings of the observed variables and the path coefficients of the latent variables estimated using PLS-PM. The results show that all outer loadings and path coefficients are significant at α = 5%, except for solar radiation with Factor II and wind speed with Factor III, whose bootstrap confidence intervals contain zero. Furthermore, the interaction effects between Factors I and II and between Factors II and III were also investigated, and the results revealed that none of the factor combinations is significant for the incidence of malaria in the study area. This result provides sufficient evidence that the high malaria incidence in the study area is attributable to minimum temperature and relative humidity, which are identified as Factor I.
The decision to select the most influential hidden ecological factor for the incidence of malaria is based on the communality and Dillon-Goldstein's indices. Table 5 summarizes the results, showing the indices used for selecting the hidden ecological factors behind the high incidence of malaria in the study area. Among the three factors identified by EFA, we find that Factor I, indicated by minimum temperature and relative humidity, influences malaria transmission with a communality index of 0.94 and a Dillon-Goldstein's ρ of 0.97. This result is also consistent with the finding in [37], where a positive association exists between temperature and occurrence of dengue. Factor II and Factor III appear to have less influence on malaria incidence.

Intelligent Malaria Outbreak Warning System
In the previous section, we identified the hidden ecological factors of malaria using partial least squares path modelling. In this section, we discuss, in detail, the implementation of the malaria outbreak warning system, based on the identified hidden ecological factors. The deployment comprises three stages: data pre-processing, generating the predictive model using machine learning and deployment of a mobile application.

Data Preprocessing
It is common practice, prior to the application of machine learning algorithms, to pre-process datasets to enable a faster and more accurate learning process. One heuristic approach is discretization, a technique often used in data mining that transforms continuous-valued data into discrete data by creating a set of contiguous intervals [43]. In this paper, the climate variables are the input and the malaria incidence data are the output variable. Our concern is the development of a predictive model, using supervised machine learning algorithms, that will predict the likelihood of malaria incidence. The output variable has a high magnitude in terms of the reported number of malaria cases; hence, using it directly may cause the predictive model to overfit. Therefore, it is pertinent to discretize the dataset to enable us to build efficient models.
We have therefore discretized the output variable to form a target variable using the k-means clustering algorithm [44]. This methodology is chosen over equal width (EW) and equal frequency (EF) because it is less sensitive to outliers and also the number of clusters (partitions) can be optimized by analysis rather than pre-determination. In general, the choice of discretization method and choice of k can be guided by the objectives of the discretization task. By invoking R software, we can determine the optimum number of clusters to enable us to partition the output variable.
From the analysis, the optimum number of clusters obtained is k = 4, for which the algorithm converges after nine iterations and explains 89.9% of the variation. For k = 5, the number of iterations exceeded the maximum allowed for convergence, so the algorithm failed to converge even though the percentage of variation explained is still good at 93%. For k = 2 and k = 3, the algorithm converges after three and four iterations, explaining 66.4% and 82% of the variation, respectively. This gives sufficient evidence to choose k = 4 as the optimum number of clusters. We also tried the "NbClust" package in R for determining the optimal number of clusters. Given values of k ranging from two to five, the algorithm selected k = 4 as the optimum number of clusters with which to partition the output variable. Hence, both methodologies give the same optimum number of clusters and are consistent with each other. We then partitioned the output variable into four classes according to the results of the k-means algorithm and re-labelled them as low, medium, high and very high incidence status of malaria. We present the summary of the k-means clustering in Table 6.

Table 6. Summary of data discretization using the k-means algorithm. SSB: The sum of squares of errors between the clusters; SST: The total sum of squares of the entire clusters.
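The discretization step can be sketched with a small one-dimensional k-means routine that also reports the share of variation explained (SSB/SST) for each k. This is an illustration under stated assumptions, not the R analysis behind Table 6: the monthly case counts are synthetic, the initial centres are chosen at random, and the names `kmeans_1d` and `explained_variation` are hypothetical.

```python
import random

def kmeans_1d(values, k, max_iter=100, seed=0):
    """Lloyd's algorithm on scalar values; returns centres, clusters, iterations."""
    rng = random.Random(seed)
    centres = sorted(rng.sample(sorted(set(values)), k))
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters, iteration

def explained_variation(values, centres, clusters):
    """Between-cluster sum of squares divided by total sum of squares."""
    grand_mean = sum(values) / len(values)
    sst = sum((v - grand_mean) ** 2 for v in values)
    ssb = sum(len(c) * (m - grand_mean) ** 2
              for c, m in zip(clusters, centres) if c)
    return ssb / sst

# Synthetic monthly case counts drawn around four hypothetical levels.
rng = random.Random(7)
counts = [rng.gauss(level, 80) for level in (600, 1200, 1800, 2600)
          for _ in range(15)]

for k in range(2, 6):
    c, cl, n_iter = kmeans_1d(counts, k)
    print(f"k={k}: iterations={n_iter}, SSB/SST={explained_variation(counts, c, cl):.3f}")

centres, clusters, iterations = kmeans_1d(counts, 4)
ratio = explained_variation(counts, centres, clusters)
# Map clusters, ordered by centre, to the four incidence classes.
order = sorted(range(4), key=lambda i: centres[i])
labels = dict(zip(order, ["low", "medium", "high", "very high"]))
```

Because k-means depends on its starting centres, a different seed can give a different partition; running several starts and keeping the best SSB/SST is a common safeguard.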

Machine Learning
The next stage is to identify a pattern/model from the data processed in Section 3.1 that will be used to make an accurate prediction of malaria incidence. Having evolved from traditional pattern recognition approaches, machine learning methods explore algorithms that can learn from data and perform prediction tasks by building a mathematical model from sample input data. A learning algorithm is given malaria epidemic data samples, each marked with a category; after being trained on this training dataset, it builds a model that predicts which category a forthcoming data sample falls into.
We have applied several machine learning algorithms, including Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Naive Bayes and Decision Trees, to find the best predicting algorithm from the scikit framework [45] in Python.
To evaluate the predictions of the machine learning algorithms, we have used the 10-fold cross-validation technique, in which the training set and test set selected for each fold are mutually independent. Table 7 compares the prediction results of seven different machine learning methods.
• Linear Regression (LiR) gives good prediction results overall, but the method failed to produce any medium predictions.
• Logistic Regression (LoR) predicts the probability of occurrence of an event by fitting the dataset, as a set of independent variables, to a logistic function. For a correlated data set, LoR may therefore not be able to find the intrinsic relationships between events.
• Decision Tree (DT) works very well for both categorical and continuous dependent variables; however, this dataset cannot be separated into distinct groups since the boundaries between the samples are fuzzy. Therefore, DT gave a poor prediction.
• Support Vector Machine (SVM) is one of the most efficient supervised machine learning algorithms and is mainly used for solving classification and regression problems. Its main strength is that each training and testing sample can be plotted as a point in an n-dimensional space, with the value of each feature being the value of a particular coordinate. Without optimisation of the parameters, SVM gave an 80.56% prediction result. After parameter optimisation, especially of the penalty parameter and the gamma coefficient, SVM (o) gave a 99.0% prediction result.
• Naive Bayes (NB) is a well-known classification method based on Bayes' Theorem with an oversimplified assumption of independence between features. Moreover, NB is a conditional probability model, which means that the method needs to be assigned a series of certain events. For this data set, NB did not produce a good prediction overall.
• K-Nearest Neighbours (KNN) is able to deal with both classification and regression problems. Compared with KNN5 (k = 5) and KNN10 (k = 10), KNN1 (k = 1) failed to make a good prediction. This suggests that, in theory, the data may need more pre-processing and/or noise removal; however, most real-world data are incomplete, which is why KNN5 and KNN10 make better predictions.
• K-Means (K-M) is an unsupervised clustering method. In this case, three clusters were set at the beginning; however, the algorithm did not converge cleanly, and therefore it cannot give a good overall prediction.
The results, presented in Table 7, show that the best performing algorithm is SVM. We therefore integrate the SVM model into our system.
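The 10-fold cross-validation used for this comparison can be sketched as follows. This is a minimal illustration in plain Python, not the scikit-learn code used in the study: the helper names `knn_predict` and `cross_val_accuracy`, the shuffling seed and the way folds are assigned are assumptions, and KNN is chosen here only because it is short to write out in full.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(X, y, k_neighbours, n_folds=10, seed=0):
    """Mean accuracy over n_folds disjoint train/test splits."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Every sample lands in exactly one test fold.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Score only the held-out samples, which the model has never seen.
        correct += sum(
            knn_predict(train_X, train_y, X[i], k_neighbours) == y[i]
            for i in test_idx
        )
    return correct / len(X)
```

Calling `cross_val_accuracy(X, y, 1)`, `cross_val_accuracy(X, y, 5)` and `cross_val_accuracy(X, y, 10)` on the same data gives the kind of KNN1, KNN5 and KNN10 comparison reported in Table 7.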

Mobile Application
We have developed a mobile application, Malaria Outbreak Warning System, with a built-in SVM model, and published it on Google Play. The tool can be accessed via [46].
The application builds on theoretical study of, and practical experiments with, the SVM algorithm and model, which has been tested for developing systematic and effective strategies to predict the outbreak of a Malaria epidemic. The optimised kernel parameters of the model are built into the application.
The application consists of three processes: pre-processing the weather forecasting data, processing the data by applying them to the model through the model's interface, and post-processing the prediction data by presenting the results on the app's UI front layer. The implementation is well suited to location detection. Figure 6 shows a screenshot of the tool. The application supports not only the automatic gathering of weather forecasting data, but also manual data input. The application reads climatic information, i.e., temperature, relative humidity, wind speed, solar radiation and precipitation, from the weather and geographical APIs. When the units of the weather measurements differ from those of the data set used to construct the predictor, we carry out the required normalisation, feature scaling or similar pre-processing. The tool then predicts the Malaria outbreak a couple of days in advance based on the forecast information acquired from the APIs. The user can slide the screen to see the available outbreak predictions for the current and future days. The additional button at the bottom of the screen lets the user manually enter a set of weather measurements to make a prediction for customised parameters.
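The pre-processing step can be illustrated with a short sketch of unit conversion followed by min-max feature scaling. The feature names, the value ranges in `TRAINING_RANGES` and the choice of min-max scaling are hypothetical placeholders; the ranges and scaling actually used to train the predictor are not stated here.

```python
# Order in which the (hypothetical) model expects its inputs.
FEATURE_ORDER = [
    "temperature_c",
    "relative_humidity",
    "wind_speed",
    "solar_radiation",
    "precipitation",
]

# Hypothetical minimum and maximum of each feature in the training data.
TRAINING_RANGES = {
    "temperature_c": (10.0, 40.0),
    "relative_humidity": (0.0, 100.0),
    "wind_speed": (0.0, 20.0),
    "solar_radiation": (0.0, 30.0),
    "precipitation": (0.0, 50.0),
}


def min_max_scale(value, low, high):
    """Map value from [low, high] to [0, 1]."""
    if high == low:
        raise ValueError("feature range must not be empty")
    return (value - low) / (high - low)


def prepare_features(reading):
    """Convert units where needed and scale a reading into a feature vector."""
    converted = dict(reading)
    # Example of a unit mismatch: the source reports Kelvin, the model expects Celsius.
    if "temperature_k" in converted:
        converted["temperature_c"] = converted.pop("temperature_k") - 273.15
    return [
        min_max_scale(converted[name], *TRAINING_RANGES[name])
        for name in FEATURE_ORDER
    ]
```

Applying the same ranges at prediction time as at training time keeps forecast inputs on the scale the model was fitted on, whether they come from the weather API or from manual entry.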
The trained SVM model has been implemented in Java using LIBSVM (2.88, University of California, Berkeley, CA, USA) [47], an integrated software package for support vector classification, regression and distribution estimation. The mobile application has been developed for Android using Android Studio. The weather forecasting data is provided by the OpenWeatherMap API (3.0, Riga, Latvia) [48], an online service provider for weather data. OpenWeatherMap provides an API for searching forecasting data for up to 5 days by coordinates, and serves the responses in JavaScript Object Notation (JSON), Extensible Markup Language (XML) and HyperText Markup Language (HTML) formats. All of the data provided is under the CC BY-SA 4.0 license.
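As an illustration of how a JSON forecast response might be reduced to daily model inputs, the sketch below groups forecast entries by date and aggregates them. The field names (`list`, `dt_txt`, `main.temp`, `main.humidity`, `wind.speed`, `rain.3h`) follow our reading of the OpenWeatherMap 5-day forecast format and should be checked against the current API documentation. The function name and the aggregation choices are assumptions, and temperatures are assumed to be already in the units expected by the model.

```python
import json
from collections import defaultdict


def daily_features(payload_text):
    """Aggregate a JSON forecast payload into one feature record per day."""
    payload = json.loads(payload_text)
    by_day = defaultdict(list)
    for entry in payload["list"]:
        day = entry["dt_txt"].split(" ")[0]  # "YYYY-MM-DD hh:mm:ss" -> date part
        by_day[day].append(entry)

    result = {}
    for day, entries in sorted(by_day.items()):
        temps = [e["main"]["temp"] for e in entries]
        result[day] = {
            "min_temperature": min(temps),
            "max_temperature": max(temps),
            "relative_humidity": sum(e["main"]["humidity"] for e in entries)
            / len(entries),
            "wind_speed": sum(e["wind"]["speed"] for e in entries) / len(entries),
            # Entries without rain are assumed to carry no "rain" field.
            "precipitation": sum(
                e.get("rain", {}).get("3h", 0.0) for e in entries
            ),
        }
    return result
```

Solar radiation is not covered by this sketch and would have to come from a separate source.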

Discussion
The current prototype of the intelligent malaria outbreak warning system relies on a batch machine learning process. That is, the learning algorithm is trained and tested offline using the available dataset, and the prediction model is embedded within the tool. Hence, the prediction process relies on a prediction model that was trained once, offline.
A more effective approach is to make the learning process online. That is, whenever new data is available, the dataset is automatically updated, and the learning process is run again to incorporate the new data. This will not only allow an automatic and dynamic learning process, but also increase the accuracy of the prediction by adapting to new patterns in the data.
The online learning approach requires an automated data collection mechanism, which is very challenging to set up as hospitals and health service providers do not make the relevant data available online. Even acquiring permission to access the available data is a long and bureaucratic process. Moreover, as discussed in Section 2.2, most available data cannot be used directly in this system as they are incomplete and/or not processed.
To alleviate these issues and to support the online learning process, the Malaria outbreak warning application can be extended to collect online data from its users. Namely, the users, e.g., hospitals, healthcare providers and individuals, report a Malaria case to the system. Using the geographical location of the incident, the application will acquire all the necessary information for the ecological factors. In this way, new data will be collected at run time, and the learning process will be run again each time new data is available. We are currently working on the development of this approach.
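The retrain-on-report idea can be sketched as below. Everything in this example is hypothetical: `RetrainingStore`, `report_case` and the nearest-centroid stand-in model are illustrative names and choices, whereas the deployed system embeds an SVM. The sketch only shows how a newly reported case could trigger a fresh training run.

```python
def train_nearest_centroid(samples):
    """Stand-in learner: returns a predict function based on class centroids."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [s / counts[label] for s in acc] for label, acc in sums.items()
    }

    def predict(features):
        return min(
            centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(centroids[label], features)
            ),
        )

    return predict


class RetrainingStore:
    """Keeps all labelled samples and retrains whenever a new case is reported."""

    def __init__(self, train_fn, initial_samples=()):
        self.train_fn = train_fn
        self.samples = list(initial_samples)
        self.model = train_fn(self.samples) if self.samples else None

    def report_case(self, features, label):
        # New data arrives at run time; rerun the learning process on all data.
        self.samples.append((features, label))
        self.model = self.train_fn(self.samples)
        return self.model
```

Retraining from scratch on every report is the simplest policy; batching reports or using an incremental learner would reduce the cost when reports are frequent.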

Conclusions
In this study, we have deployed an intelligent malaria outbreak early warning system, which predicts malaria outbreaks based on climatic factors using machine learning algorithms. The system will help hospitals, healthcare providers, and health organizations take precautions in time and utilize their resources in case of emergency. To the best of our knowledge, the system developed in this paper is the first publicly available application.
We have also provided an ecosystem overview for malaria modelling and proposed a new framework for the study of a malaria transmission ecosystem to prevent and control its effects. We have assessed and identified hidden factors that lead to a high malaria outbreak. Our data analysis results have shown that the minimum temperature and relative humidity, which are related to Factor I, have a positive association with the incidence of malaria in the study area. The other observed variables such as maximum temperature, solar radiation, precipitation and wind speed, which are related to hidden Factor II and Factor III, appear to have mildly influenced malaria incidence.
The primary results obtained in this study have demonstrated the power of the proposed predictive analytics-based malaria outbreak warning system. Further development of the system will incorporate automatic data gathering from a variety of sources. We are currently extending our system and methodology to support automatic data collection at run time and the online learning process. This will not only allow an automatic and dynamic learning process, but also increase the accuracy of the prediction by adapting to new patterns in the data.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

resulting in the outer estimation: $X = (x_1, \dots, x_G)$.

Step 5 Iteration
If the relative changes of all the outer weights from one iteration to the next are smaller than a predefined tolerance $\epsilon$,

$$\left| \frac{\hat{w}_{kg}^{\text{old}} - \hat{w}_{kg}^{\text{new}}}{\hat{w}_{kg}^{\text{new}}} \right| < \epsilon, \quad \forall\, k = 1, \dots, K \wedge g = 1, \dots, G,$$

the estimation of factor scores done in (A5) is taken to be final. Otherwise, go back to (A2).
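The stopping rule can be expressed as a short check. This is a sketch under assumptions: the weights are stored as a K × G nested list, the relative change is taken in absolute value, entries whose new weight is zero (indicators outside a block) are skipped, and the default tolerance is an arbitrary placeholder.

```python
def weights_converged(old_weights, new_weights, tolerance=1e-5):
    """True if every outer weight changed by less than tolerance, relatively."""
    for old_row, new_row in zip(old_weights, new_weights):
        for old, new in zip(old_row, new_row):
            if new == 0:
                # Assumed convention: zero entries are structural, not estimated.
                continue
            if abs((old - new) / new) >= tolerance:
                return False
    return True
```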

Total Effects
We can calculate the matrix of the total effects $\hat{T}$ as the sum of the 1 to G step transition matrices:

$$\hat{T} = \sum_{g=1}^{G} \hat{B}^{g}.$$

Note that $\hat{B}^{g}$ expands to the g-fold product $\hat{B} \cdot \hat{B} \cdots \hat{B}$, e.g., $\hat{B}^{2}$ contains all the indirect effects mediated by only one LV.
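A small worked example may help to see how the sum of the 1 to G step transition matrices accumulates direct and indirect effects. The sketch assumes that the path coefficient matrix is stored as a G × G nested list in which entry `[i][j]` holds the direct effect of LV i on LV j; the function names and the numbers in the usage note are illustrative only.

```python
def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    size = len(a)
    return [
        [sum(a[i][m] * b[m][j] for m in range(size)) for j in range(size)]
        for i in range(size)
    ]


def total_effects(path_matrix):
    """Sum of the 1- to G-step transition matrices of the path matrix."""
    size = len(path_matrix)
    power = [row[:] for row in path_matrix]  # current power, starting at step 1
    total = [row[:] for row in path_matrix]
    for _ in range(2, size + 1):
        power = mat_mul(power, path_matrix)  # effects passing through one more LV
        total = [
            [t + p for t, p in zip(total_row, power_row)]
            for total_row, power_row in zip(total, power)
        ]
    return total
```

For three LVs with direct effects 0.5 (LV 1 on LV 2), 0.4 (LV 2 on LV 3) and 0.2 (LV 1 on LV 3), the two-step matrix contributes the indirect effect 0.5 × 0.4 = 0.2 of LV 1 on LV 3, so the total effect of LV 1 on LV 3 is 0.2 + 0.2 = 0.4.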