Spatial distribution estimation of malaria in northern China and its scenarios in 2020, 2030, 2040 and 2050

Malaria is one of the most severe parasitic diseases in the world. Spatial distribution estimation of malaria and its future scenarios are important issues for malaria control and elimination. Furthermore, sophisticated nonlinear relationships for prediction between malaria incidence and potential variables have not been well constructed in previous research. This study aims to estimate these nonlinear relationships and predict future malaria scenarios in northern China. Nonlinear relationships between malaria incidence and predictor variables were constructed using a genetic programming (GP) method, to predict the spatial distributions of malaria under climate change scenarios. For this, the examples of monthly average malaria incidence were used in each county of northern China from 2004 to 2010. Among the five variables at county level, precipitation rate and temperature are used for projections, while elevation, water density index, and gross domestic product are held at their present-day values. Average malaria incidence was 0.107 ‰ per annum in northern China, with incidence characteristics in significant spatial clustering. A GP-based model fit the relationships with average relative error (ARE) = 8.127 % for training data (R2 = 0.825) and 17.102 % for test data (R2 = 0.532). The fitness of GP results are significantly improved compared with those by generalized additive models (GAM) and linear regressions. With the future precipitation rate and temperature conditions in Special Report on Emission Scenarios (SRES) family B1, A1B and A2 scenarios, spatial distributions and changes in malaria incidences in 2020, 2030, 2040 and 2050 were predicted and mapped. The GP method increases the precision of predicting the spatial distribution of malaria incidence. With the assumption of varied precipitation rate and temperature, and other variables controlled, the relationships between incidence and the varied variables appear sophisticated nonlinearity and spatially differentiation. Using the future fluctuated precipitation and the increased temperature, median malaria incidence in 2020, 2030, 2040 and 2050 would significantly increase that it might increase 19 to 29 % in 2020, but currently China is in the malaria elimination phase, indicating that the effective strategies and actions had been taken. While the mean incidences will not increase even reduce due to the incidence reduction in high-risk regions but the simultaneous expansion of the high-risk areas.


Background
With a massive population at risk and widely threatened areas, malaria is a serious parasitic disease worldwide. In developing nations of tropical and subtropical area, malaria has become one of the largest obstacles to socioeconomic advancement [1]. Approximately 3.2 billion people (43.8 % of the world's population) living in more than 100 countries are threatened by malaria to varying degrees [2]. Great achievements have been made in fighting and eliminating malaria over the past few decades, for instance, insecticide-treated nets are the most widespread intervention and are responsible for malaria reduction in many endemic countries [3]. Malaria risk areas, however, have not varied significantly compared with those a half century ago. Research has indicated that 1.13 and 1.44 billion people globally are at risk for unstable and stable Plasmodium falciparum malaria, respectively [4], and 2.5 billion people worldwide are at risk for Plasmodium vivax malaria [5].
Climate change and corresponding environmental alterations have significantly influenced the variation and transmission of malaria [6][7][8][9][10]. With a reliably predicted future malaria scenario, the malaria incidence at different locations could be depicted, especially in high-risk and new outbreak areas, to propose malaria elimination strategies and develop health policies [11]. Global and regional studies on malaria prediction have shown that the effects of climate change on malaria vary spatially [9]. Thus, improving understanding of both the temporal and spatial dynamic effects of climate on malaria transmission is of great importance for reducing the disease burden and risks to human health [12,13].
Predictor variables include various environmental and socioeconomic variables that contribute to the appearance and transmission of malaria, such as precipitation, temperature, elevation, water density index (WDI), and gross domestic product (GDP) [14][15][16][17][18][19]. Remote sensing techniques and products can be used to predict malaria incidence because the propagation processes of malaria, namely, the source of infection, route of transmission, and susceptible individuals, are affected by atmospheric and environmental conditions [16,[20][21][22]. These conditions not only affect the growth of parasites inside malaria vectors, but also directly affect the habitat conditions and transmission activities of the Anopheles vector [23][24][25]. Remote sensing has advantages over realtime monitoring of these conditions, such as its features of timeliness, wide monitoring range, and easier data acquisition compared with ground monitoring stations [21,[26][27][28]. Precipitation and land surface temperature (LST), combined with epidemiological data, are commonly used to model and predict conditions of malaria prevalence [20,23,[29][30][31][32][33]. These alternative, remotelysensed ecological indicators could directly reflect the relationships between malaria transmission and atmospheric and environmental variables [21,25]. Research conducted in Kenya, Africa, has successfully predicted local seasonal malaria prevalence and transmission intensity [20]. In addition, research in the Horn of Africa and Eastern Africa suggests that the prediction accuracy of P. falciparum malaria transmission intensity reaches 75 % using these techniques [17]. With respect to predicting distribution and quantity of the Anopheles vector, remote sensing techniques could be used to determine mosquito breeding sites and predict malaria risk distribution, to assist in malaria control efforts [29,34].
Precipitation and temperature, especially remotelysensed precipitation rate (PR) and LST data, are particularly effective predictor variables because they have significant relationships with malaria incidence and their temporal delayed effects [35,36]; they both are also important products of future climate change scenarios. Research in Huang-Huai River in China demonstrated the malaria re-emergence was significantly related to the change of local precipitation [37,38]. A study in Guangdong Province, China, compared the median temperature with 30 °C and showed that temperature has an important role in malaria incidence with delayed effects lasting for 4 weeks (maximum relative risk (RR) of 1.57, 95 % confidence interval (CI) 1.06-2.33) [39]. Moreover, temperature has significant localized effects on malaria transmission [40], and the relationship between temperature and malaria incidence is affected by the various environmental conditions in a certain area [41].
While the majority of current research between malaria and environmental/socioeconomic variables focuses on linear modelling [24,42], a few studies have revealed a nonlinear relationship in certain settings [18,23]. Exploring the significance of malaria at various intervals using remote sensing data is a common experimental goal [22,43,44]. Therefore, exploring nonlinear relationships between malaria and predictor variables using nonlinear methods is important. Genetic programming (GP) is an optimization method that explores the ability to construct complex nonlinear relationships between certain problems and express them mathematically. GP is therefore effective in addressing sophisticated nonlinear issues, eliminating nonfunctional variables, and modelling a proper function structure closest to the truth [45,46].
Northern China (Henan and Anhui provinces) is a typical mid-latitude, high-risk area of locally prevalent P. vivax malaria, which presents a great threat to the population of 170 million (2010). The number of malaria cases in China has decreased since 1950. The reported average malaria incidence had decreased to 0.194 ‰ per annum (p.a.) by the year 2000, and the number of cases has decreased to 24,088 in 2000 from over 24 million cases by 1970. These numbers began to rebound in 2000, exceeding 64,000 cases in 2006 when the incidence reached 0.50 ‰ p.a. Malaria in China has a distinct regional distribution, with northern China one of the areas with the highest prevalence [47]. In the present study, the influence of explanatory variables on malaria incidence in northern China and its future spatial distributions under climate change scenarios were predicted using the GP method, accompanied by geographic information system (GIS) methods for advanced spatial analysis and expression [48]. Predictor variables were remote sensing data of PR and LST, together with elevation, WDI, and GDP. In addition, China would achieve malaria elimination by 2020, Asia-Pacific region would achieve malaria-free by 2030 [2], and perhaps by 2050, as an ambitious goal, human malaria was expected to be eventually eliminated [49]. While if no strategies were implemented, population exposed to the primary malaria vectors would continuously increase in 2030s and 2050s in China [10]. Therefore, 2020, 2030, 2040 and 2050 were the years for projections owning to their significance regionally and globally. Assuming precipitation and temperature were the changed variable and other variables remained unchanged, future spatial distributions of malaria in the years 2020, 2030, 2040 and 2050 were predicted and mapped with the future temperature conditions in Special Report on Emission Scenarios (SRES) family B1, A1B and A2 scenarios [50].

Study area and malaria data
The study area consisted of 205 counties in two provinces (Henan and Anhui) of northern China (Fig. 1). This area is located within 110.35-119.64°E and 29.40-36.37°N. About 170 million people live within 306,000 km 2 , in one of the most densely populated areas globally. Plains make up the predominant terrain in this area, with a few mountains located in western Henan Province and some hilly regions in Anhui Province. Three main rivers flow through this area, the Yellow, Huaihe, and Yangtze rivers.
The Chinese Center for Disease Control and Prevention (Chinese CDC) has summarized monthly malaria cases in each county from 2004 to 2010. In two provinces (Henan and Anhui) of northern China, the total number of malaria cases was 127,448, and average malaria incidence was 0.107 ‰ p.a. The total cases in Henan and Anhui provinces during the seven-year period were 19,182 and 108,266, respectively. With populations of 103 million in Henan and 66 million in Anhui, the annual average malaria incidence is 0.026 and 0.232 ‰ p.a., respectively. Areas with malaria incidence greater than 0.1 ‰ p.a. are considered stable risk areas, whereas those with incidence lower than 0.1 ‰ p.a. are considered areas of unstable risk [51,52]. Among the seven-year average incidences in each county, the highest incidence reached 2.191 ‰ p.a. and the incidences in 22 counties were larger than 0.1 ‰ p.a. The study area was, therefore, a malaria high-risk area during these 7 years.

Predictor variables and climate change scenarios data
With the advantages of remote sensing over real-time data, such as its wide spatial coverage and relatively easy acquisition, remotely sensed monitoring data have been applied to various malaria prediction problems [16,21,23,53]. The remote sensing monitoring indexes in this research include monthly PR (mm/h) from the Tropical Rainfall Measuring Mission (TRMM) 3B43 (version 7) product with the spatial resolution of 0.25° (~25 km) [54], and monthly LST (°C) from the terra moderate resolution imaging spectroradiometer (MODIS) product MOD11A2 with the spatial resolution of 1 km downloaded from the level 1 and atmosphere archive and distribution system (LAADS Web) at NASA website [55]. Monthly PR are resampled and calculated to the data with the unit of mm/day and the spatial resolution of 5 km. Both remote sensing products are pre-processed to county level, meaning that the spatial average values of each county are calculated such that these variables reflect the average atmospheric or environmental conditions at county level. In addition, remotely sensed precipitation and temperature match the monthly malaria incidence data in time, from 2004 to 2010. Given that the occurrence and spread of malaria are not only influenced by meteorological and Fig. 1 Study area consists of 205 counties in three provinces (Henan and Anhui) in northern China environmental variables, two kinds of easily obtained auxiliary data are used in this study, including geographical variables (elevation and WDI) and a social variable (GDP). Elevation data is derived from topographic dataset of Shuttle Radar Topography Mission (SRTM3) with a 90 m spatial resolution [56]. The percentage of the total area of rivers and lakes in a county is calculated as WDI, which is calculated with raster data from global land water regions dataset at 30 m spatial resolution [57]. GDP is sourced from 1 km Grid GDP Dataset of China (2010) [58]. They are all transformed to county level, corresponding to malaria data.
When averaging the variables over the counties, the impact of the variable variation within each county on the county-level transforming process is analysed by F-test. For instance, elevation varies within county k (k = 1, 2,…, 205), and its variation is depicted by 205 randomly selected spatial points of elevation values, which is the data A k . F-test is used to test the difference between A k and the county-level averaged elevation data B. As such, the percentage of counties with significant difference comparing data B is used to depict the impact of the variable variation within each county on the averaging. Table 1  The future precipitation and temperature are projected to change significantly under the SRES family B1, A1B and A2 scenarios. Future precipitation and temperature data of climate change scenarios were analysed by the a series of global climate models (GCMs), which were available by the CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS) [59] based on the World Climate Research Programme's (WCRP's) Coupled Model Intercomparison Project Phase 3 (CMIP3) multi-model dataset [50,60]. B1, A1B and A2 emission scenarios were included in CMIP3 for climate projection, each of which corresponds to a specific pathway to reach each target radiative forcing caused by long-lived and short-lived greenhouse gases [61]. The projected global average surface temperature changes at 2090-2099 relative to 1980-1999 are estimated to be 1.8, 2.8 and 3.4 °C under B1, A1B and A2 scenarios, respectively [62]. The future scenarios data with the spatial resolution of 5 km are also summarized to the county level in the study area.

GP-based malaria incidence prediction
The objective of GP-based prediction was to predict the spatial distributions of malaria incidence in northern China under climate change scenarios in the years 2020, 2030, 2040 and 2050, assuming that the PR and temperature variables were the varied ones among all variables for prediction. County was used as the spatial mapping unit corresponding to malaria case data; thus, both predictor variables and climate change scenario data were averaged for each county. The main steps of GP-based malaria incidence prediction are outlined in the schematic overview (Fig. 2). This process consisted of three steps: (1) spatiotemporal analysis for malaria incidence; (2) data pre-processing and selection of predictor variables; and (3) modelling and prediction with the GP method. In the first step, spatiotemporal scan analysis was performed to identify and quantify the spatiotemporal clustering scales of malaria incidence in the study area [63,64]. The basic theory of scan analysis used in epidemiology is as follows: building a moving scan window in space, calculating the total number of cases C and number of expected cases E both inside and outside a certain window, and estimating the difference between incidences in and out of the window through assessing log likelihood ratio r with the formula where c is the number of actual cases and I() is an indicator function. During the scan process, when the number of cases is larger than the expected value, I() is 1; otherwise, it is 0. By dynamically changing the size and location of the window and recalculating r when new cases appear until a maximum r is selected, the window at this time is the clustering window of high incidence. The window size is depicted with the ratio of population within the window to the total population, which ranges from 0 to the maximum risk population that is set based on research and is less than 50 % of the total population. The result of spatiotemporal scan statistics is the accurate high-risk areas of prevalence.
The second step before GP prediction was to preprocess data and select reliable predictor variables. As a small probability event, malaria incidence summarized in spatial cross sections was 0 in many of the spatial units. The focus of spatial analysis is spatial cross-section data as well as the differences in various regions. Therefore, to select the proper variables, Spearman correlation coefficients (given rank information) between dependence and independence were calculated [65]. The effect of a 0 value was thus reduced; the information provided by 0 incidence was used fully and loss of information was decreased. Then, the multicollinearities of these explanatory variables were analyzed and as a result, variables with strong collinearities were removed [66,67]. The variables and malaria incidence were significantly correlated, and the correlation coefficients of remote sensing indexes with 1-month lag effect reached maximum values ( Table 2). After testing for normality and data transformation, the pre-processed predictor variables were X1 (RP, lag = 1), X2 (LST, lag = 1), X3 (log-transformed elevation), X4 (log-transformed WDI), and X5 (log-transformed GDP), which are statistically summarized in Table 3.
Finally, the relationships between malaria incidence and the corresponding predictor variables were Fig. 2 Schematic overview of GP-based malaria incidence prediction constructed using the GP method. Based on this nonlinear relationship, the spatial distributions in 2020, 2030, 2040 and 2050 could be predicted and mapped under climate change scenarios. To ensure reliability, data in 70 % of the counties with monthly cases (144 counties) were randomly selected as training data, and data of the remaining 30 % (61 counties) were regarded as test data (Fig. 3). The fitted results would be affected by the quality of the parameter settings. General parameter settings for the GP framework are listed in Table 4. Terminal variables were X1, X2, X3, X4, and X5 and the function set was (+, −, × , /, power, log, exp, sqrt). At the beginning of the GP process, 200 equations were randomly generated with the terminal variables and the functions in the set to capture the relationship between malaria incidence and the predictors. These equations were the individuals in the initial population, and one of them with the best fitness would be selected. During GP process, because a better equation structure in the result required a lower fitness. The fitness function used in this experiment was the sum of the absolute difference (SAD), where N was the total number of observations, and y i and p i were observed values and GP-predicted values, respectively. With the calculation of fitness, "winner" individuals were probabilistically transformed with crossover and mutation for parts of equations, to replace the "loser" ones, so that the individuals would be renewed in each next generation. The above steps were repeated until a program was developed that could reasonably predict malaria incidence [46,48,68]. GP was performed with 1000 generations in each monthly malaria incidence prediction case. At the end of the experiment, uncertainties were analysed to detect the precision and reliability of the research method. GPLAB, a genetic programming toolbox written for MATLAB software, was adopted to generate the prediction solution of malaria incidence [69]. A series of engineering and scientific problems have previously been addressed    successfully using the GPLAB toolbox [68,[70][71][72]. To validate the performance of GP-based malaria incidence prediction model, the results of both generalized additive models (GAM) and linear regression are used for comparison. GAM is a common nonlinear model describing the nonlinear relationships via nonparametric smoothing functions [73], and it is performed by the mgcv package in the program R. For GAM and linear regression, the same predictors are used including the one-month lagged precipitation and temperature.

Results
Monthly average malaria incidence data were collected in 205 counties from 2004 to 2010, as shown in Fig. 4. This figure illustrates the spatial clustering and seasonality of malaria incidence in each county. The cluster regions detected by spatiotemporal scan statistics are mapped in Fig. 5 so it was necessary to consider the problem with monthly malaria incidence and the corresponding variables. A nonlinear relationship between county incidences and the five variables was constructed for the training data of each monthly case. Figure 6 presents the fitness of the best GP equation with the tree form composed by predictors and functions for each monthly case during the evolution process of 1000 generations. The relationships were then applied, to predict malaria incidence in the test counties with the five known variables. To compare the spatial distributions of GP-fitted malaria incidences in the training and test counties, these were summarized to annual average fitted values, mapped, and compared with the original dependent variable, the transformed incidence data. Figure 7 illustrates a map of the original transformed malaria incidence data (A), GP-fitted data (B), GAM-fitted data (C) and linear regression-fitted data (D). The patterns and trends of spatial distribution were rationally predicted with the GP method but were not predicted by GAM and linear regression. Table 5 presents prediction errors of the GP-based model, GAM-based model and linear regression approach for the monthly cases and the annual average cases. In the table, the average relative error (ARE) and mean sum squared error where O i and P i denote observation and prediction for ith data, respectively; N is the total number of data items in the dataset. R 2 describes the goodness-of-fit of the model which means the degree of association between the observed and model-simulated data. Among the three indexes evaluating prediction accuracy, ARE is more reliable owing to its focus on relative errors. Thus the ARE values of the monthly models are also validated as shown in Fig. 8. From the results evaluation table, it can be concluded that the GP method could more accurately predict malaria incidence, with ARE = 8.127 % for training data (R 2 = 0.825) and ARE = 17.102 % for test data (R 2 = 0.532), compared with GAM method ARE = 19.163 % for training data (R 2 = 0.445) and ARE = 30.155 % for test data (R 2 = 0.452), and linear regression ARE = 27.449 % for training data (R 2 = 0.159) and ARE = 31.031 % for test data (R 2 = 0.189). The ARE results in Fig. 8 also demonstrate that the fitness are significantly improved by the monthly GP model.
With the future PR and temperature conditions in the SRES family B1, A1B and A2 scenarios, the spatial distributions of malaria in 2020, 2030, 2040 and 2050 were predicted in each month of these four future years using a GP-based prediction model. Malaria incidence change maps are shown in Fig. 9, which depict changes in the predicted malaria incidence, especially the phenomenon

Discussion
The GP-based nonlinear model used in this study predicted the spatial distributions and changes of malaria incidence for the years 2020, 2030, 2040 and 2050 under SRES family B1, A1B and A2 climate scenarios with the assumption of a varied variables of precipitation and temperature, and the constant variables of elevation, WDI, and GDP. In the study area, to depict fluctuations in the nonlinear relationship between malaria incidence and varied precipitation and temperature with the other three variables remaining constant, monthly-predicted incidences were summarized to annual ones. The annual  average incidence in each county was calculated with the precipitation and temperature set from the minimum to the maximum in the climate change scenarios. The relationships in all counties of the study area were then summarized. Figure 10 illustrates the summary by running mean monthly malaria incidences for each 0.07 mm/ day of precipitation (Fig. 10a) and 0.19 °C of temperature (Fig. 10b), in which the incidences predicted by the GP-based model fluctuated with the increased lagged precipitation and temperature. There are four main peak values in the fluctuating relationship between incidence and precipitation with the precipitation of 2.2, 3.0, 5.7 and 9.8 mm/day, and a peak value in the nonlinear relationship between incidence and temperature with the temperature of 21.9 °C. In general, five primary highincidence areas of thresholds with the running mean monthly incidence larger than 0.0001 ‰ p.a. appeared with the varied two lagged variables as shown in Fig. 10c and d. They are Area I with precipitation ranging from 0 to 4 mm/day and temperature from 20 to 31.7 °C, Area II with precipitation from 7 to 10.6 mm/day and temperature from 23 to 31.7 °C, Area III with precipitation ranging from 2 to 8 mm/day and temperature from 13 to 18 °C, Area VI with precipitation near 0 mm/day and temperature near 12 °C, and Area V with precipitation from 1 to 2 mm/day and temperature near 0 °C. Annual future malaria incidences derived from the sum of monthly predicted incidences using GP-based models were summarized in Fig. 11, together with the future precipitation and temperature changes. This analysis indicated that with fluctuated precipitation and the increased temperature, the median incidences would significantly increase during the studied future periods. If no actions were taken, incidence in northern China would increase 19 to 29 % in 2020, 43 % to 73 % in 2030, 33 to 119 % in 2040 and 69 to 182 % in 2050. This trend was identical with the projections of malaria vectors distribution under climate change scenarios in China [10]. But the mean incidences would not increase even decrease under SRES family B1, A1B and A2 scenarios. The integration of this result and the malaria incidences changes across the space in the future that changes primarily appeared in counties along the Huaihe River and Yangtze River shown in Fig. 9 demonstrated that the incidences in the clustering highrisk regions would decrease, but those in their surrounding regions would significantly increase and the high-risk regions would be enlarged. Under SRES family scenarios, all spatial scales of the increased incidences were enlarged in 2020, 2030, 2040 and 2050, and the decreased incidences appeared in the central high-risk areas. The comparison between these predictions and China was on the malaria elimination phase in 2014 reported in World Malaria Report 2015 [1] showed that the strategies and actions of China on malaria elimination were effective.
There are some limitations to this research. Two variables derived from remote sensing data were used for Fig. 8 The ARE values of monthly fitting models for train data (a) and test data (b) malaria prediction, but a great many remote sensing products were not explored. In previous studies as well as in this research, variables derived from remote sensing data are primarily selected based on the general theory of malaria transmission processes, and computed with correlation analysis or regression methods. The effects of various variables, however, are spatially different. Therefore, a variable explaining the malaria incidence in one location might be not appropriate or significant for other locations. In future work, more variables stemming from remote sensing data will be explored and their effects at different locations taken into consideration. In addition, the performance of averaging raster variables at county level is tested in this research, which shows that the variation of variables within most of the countries has no negative impact on the averaging process. But there

Conclusions
Northern China is a typical mid-latitude high-risk malaria area. The combination of GP and GIS methods models well the nonlinear relationships between predictor variables and malaria incidence, to predict the future spatial distributions of malaria. The key benefit of GP is that no final solution form is assumed before constructing the relationships, unlike the forms of traditional linear regression and nonlinear model such as GAM, which are determined in advance. Thus, GP uses a proper function form instead of coefficients, as in linear regression and GAM. As a result, the GP method is able to more accurately predict malaria incidence, compared with a linear regression approach and GAM, for both training and test data. With the nonlinear relationships constructed by the GP-based prediction model, the malaria incidences in 2020, 2030, 2040 and 2050 under future climate change scenarios were predicted, mapped and analyzed. In northern China, with fluctuated precipitation and increased temperature that have one-month lagged effects on malaria incidence, the median incidence would significantly increase that it would increase 19 to 29 % in 2020, but by 2020, malaria would be eliminated in China, which indicated that the effective strategies Fig. 10 GP-predicted relationships between incidence and varied variables, precipitation and temperature, in northern China. Relationship between incidence and lagged precipitation (a), relationship between incidence and temperature (b), relationship between incidence and both variables (c), and its three-dimension expression (d) and actions had been taken. While, the mean incidences would not increase even reduce, since the incidences in high-risk regions would reduce while the areas of highrisk regions would be enlarged.