A new dataset for coffee rust detection in Colombian crops based on classifiers

Coffee production is the main agricultural activity in Colombia. More than 350,000 Colombian families depend on the coffee harvest. Since coffee rust disease was first reported in the country in 1983, these families have had to face severe consequences. Recently, machine learning approaches have built datasets for monitoring coffee rust incidence that involve weather conditions and physical crop properties. This background encouraged us to build a dataset for coffee rust detection in Colombian crops following a data mining process, the Cross Industry Standard Process for Data Mining (CRISP-DM). In this paper we define suitable data for generating accurate models; once the dataset is built, it is tested using the following classifiers: Support Vector Regression, Backpropagation Neural Networks and Regression Trees.


I. Introduction
Coffee production is the main agricultural activity in Colombia. More than 350,000 Colombian families depend on the coffee harvest as their sole income. Diseases, pests and even low prices have a large impact on the economic and social conditions of the main coffee-growing regions. Coffee rust, first reported in 1983 (Shieber & Zentmyer, 1984), is the most important and severe disease currently affecting the production of Colombian coffee. Coffee varieties resistant to this disease have been developed through improvement with genes of the Timor Hybrid (a plant that features natural resistance to the disease) as a solution to the rust problem (Zapata & Ruíz, 1988). However, more than 50 percent of the country's coffee crop in the productive phase is still susceptible. Studies on coffee rust have concluded that the spores carrying the infection are spread by climatic elements such as wind and rainfall (Becker, 1979): wind is the vector for long-distance spore transport, while precipitation droplets are responsible for vertical propagation from infected leaves or soil (Becker, 1979). Once spores make contact with a susceptible leaf, the infection process is favored by a high shadow index, high humidity (atmosphere and leaf), soil acidity, high coffee tree density and low soil fertility. The dataset proposed herein brings together the favorable conditions that coffee rust requires to infect the crop, so that prophylactic measures (biological, chemical and weather-related) can be taken to prevent the onset of the disease.
To date, no Colombian approaches to coffee rust detection based on machine learning techniques have been reported, and coffee producers lack technological systems to detect coffee rust incidence that would help them improve coffee quality and reduce investment costs. To address this, we follow the Cross Industry Standard Process for Data Mining (CRISP-DM), which proposes (Wirth, 2000) six phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment) to build a data mining solution. Our approach aims to define suitable data for generating accurate models (the dataset proposed herein brings together the favorable conditions that coffee rust requires to infect the crop, so that prophylactic measures can be taken); once the dataset is built, it is tested using the following classifiers: Support Vector Regression, Backpropagation Neural Networks and Regression Trees.
The rest of the paper is organized as follows: data collection and classifiers are introduced in Section 2. Section 3 presents results and discussion, while Section 4 reports the conclusions.

II. Material and Methods
This section describes the data collection process and the generation of datasets used in experiments, and introduces three classifiers for prediction: Support Vector Regression, Backpropagation Neural Network, and Regression Tree M5.
The dataset is composed of 21 attributes, which are divided into four categories: weather conditions (6 attributes), soil fertility properties (5 attributes), physical crop properties (6 attributes), and crop management (4 attributes). Table 1 describes the 21 attributes and their data types: Numerical (Nu) or Nominal (No).

Category: Weather conditions

No.  Attribute                                               Type
1    Relative humidity average in the last 2 months          Nu
2    Hours of relative humidity > 90% in the last months     Nu
3    Temperature variation average in the last month         Nu
4    Rainy days in the last month                            Nu
5    Accumulated precipitation in the last 2 months          Nu
6    Nightly accumulated precipitation in the last month     Nu

The class (variable to predict) was defined as the Incidence Rate of Rust (IRR). IRR is calculated following the methodology developed by Cenicafé (Rivillas-Osorio, Serna-Giraldo, Cristancho-Ardila, & Gaitán-Bustamante, 2011) for a plot with an area less than or equal to one hectare. The steps of the methodology are as follows:

1. Standing in the middle of the first furrow, the farmer chooses one coffee tree and picks the branch with the most foliage at each level (high, medium, low); the leaves on the selected branches are counted, as are those infected with rust.
2. The farmer repeats step 1 for the trees in the plot until 60 trees have been selected. The same number of trees must be selected in every furrow (e.g., if the plot has 30 furrows, the farmer selects two coffee trees in each furrow).

Once steps 1 and 2 are finished, the leaves of the 60 selected coffee trees are added up, as are the leaves infected with rust. The Incidence Rate of Rust (IRR) is then calculated through Equation 1.
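The IRR computation can be illustrated with a short sketch. It assumes that the rate is the percentage of infected leaves among all counted leaves, which is how we read the sampling procedure; the function name and the leaf counts are hypothetical and not taken from the dataset.

```python
def incidence_rate_of_rust(total_leaves, infected_leaves):
    """Return the IRR in percent from per-tree leaf counts.

    Assumption: IRR = 100 * (infected leaves) / (total leaves),
    summed over all sampled trees.
    """
    if len(total_leaves) != len(infected_leaves):
        raise ValueError("both lists must describe the same trees")
    total = sum(total_leaves)
    infected = sum(infected_leaves)
    if total == 0:
        raise ValueError("no leaves were counted")
    return 100.0 * infected / total


# Hypothetical counts for 60 sampled trees (three branches per tree, summed).
total_per_tree = [50] * 60
infected_per_tree = [5] * 30 + [10] * 30
irr = incidence_rate_of_rust(total_per_tree, infected_per_tree)
print(f"IRR = {irr:.1f}%")
```

With these made-up counts, 450 of 3000 leaves are infected, so the sketch reports an IRR of 15%.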

Support Vector Regression (SVR)
SVR is a supervised learning algorithm based on statistical learning theory and the structural risk minimization principle (Vapnik, 1999). It can be expressed as the following equation:

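\[ f(x) = w^{\top}\varphi(x) + b \tag{2} \]
where $\varphi(\cdot)$ is a non-linear mapping which takes the input data points into a higher-dimensional feature space, $w$ is a vector in the feature space and $b$ is a scalar threshold (Balasundaram & Gupta, 2014). The unknowns $w$ and $b$ are obtained as the solution of the constrained quadratic programming problem (Smola & Schölkopf, 2004) given below (Equation 3):

\[ \min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2} w^{\top} w + C \sum_{i=1}^{m} (\xi_i + \xi_i^{*}) \quad \text{subject to} \quad \begin{cases} y_i - w^{\top}\varphi(x_i) - b \le \varepsilon + \xi_i \\ w^{\top}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \dots, m \end{cases} \tag{3} \]
where $(x_i, y_i)$, $i = 1, \dots, m$, are the training examples, $\xi = (\xi_1, \dots, \xi_m)$ and $\xi^{*} = (\xi_1^{*}, \dots, \xi_m^{*})$ are vectors of slack variables, and $C > 0$ and $\varepsilon > 0$ are input parameters. Rather than solving the primal problem considered above, Lagrange multipliers $\alpha$ and $\alpha^{*}$ in $\mathbb{R}^{m}$ are introduced and the kernel function $K(x_i, x_j) = \varphi(x_i)^{\top}\varphi(x_j)$ is applied (Cristianini & Shawe-Taylor, 2000; Vapnik, 2000) (Equation 4):

\[ \min_{\alpha,\,\alpha^{*}} \; \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*}) K(x_i, x_j) + \varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^{*}) - \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^{*}) \quad \text{subject to} \quad \sum_{i=1}^{m} (\alpha_i - \alpha_i^{*}) = 0, \quad 0 \le \alpha_i,\ \alpha_i^{*} \le C \tag{4} \]
The Lagrange multipliers in Equation 4 satisfy the equality $\alpha_i \alpha_i^{*} = 0$ for $i = 1, \dots, m$. Once the Lagrange multipliers $\alpha$ and $\alpha^{*}$ have been calculated, the optimal weight vector of the regression hyperplane is given as follows:

\[ w = \sum_{i=1}^{m} (\alpha_i - \alpha_i^{*})\, \varphi(x_i) \tag{5} \]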
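To make the role of the multipliers and the kernel concrete, the following sketch evaluates a trained SVR in its kernel form, $f(x) = \sum_i (\alpha_i - \alpha_i^{*}) K(x_i, x) + b$, together with the $\varepsilon$-insensitive loss that the slack variables measure. It is only an illustration under stated assumptions: the RBF kernel, the value of `gamma`, the support vectors and the multipliers are all hypothetical, and no optimization is performed.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel exp(-gamma * ||u - v||^2); the kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, alpha, alpha_star, b, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b."""
    return sum(
        (a - a_s) * kernel(sv, x)
        for sv, a, a_s in zip(support_vectors, alpha, alpha_star)
    ) + b


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical trained model: the multipliers sum to zero and
# alpha_i * alpha_i* = 0 for every i.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
alpha = [0.5, 0.0]
alpha_star = [0.0, 0.5]
prediction = svr_predict((0.0, 0.0), support_vectors, alpha, alpha_star, b=2.0)
print(prediction)
```

Only training points with a non-zero difference $\alpha_i - \alpha_i^{*}$ contribute to the sum, which is why they are called support vectors.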

Backpropagation Neural Network (BPNN)
A backpropagation neural network is a feed-forward neural network used to capture the relationship between the inputs and outputs (Poh, 1991). The neural network is trained using the backpropagation algorithm (Haykin, 2003). In the backpropagation training algorithm, the error in the output neuron is given by Equation 6:

\[ \delta_q = (d_q - y_q)\, y_q (1 - y_q) \tag{6} \]
where $y_q$ and $d_q$ are the actual and desired outputs of neuron $q$ in the output layer, respectively. The weight from neuron $p$ in the hidden layer to neuron $q$ in the output layer is adjusted using Equation 7:

\[ w_{pq}^{\text{new}} = w_{pq}^{\text{old}} + \eta\, \delta_q\, y_p \tag{7} \]
where $\eta$ is the learning rate coefficient, $y_p$ is the output of neuron $p$ in the hidden layer, and $w_{pq}^{\text{old}}$ and $w_{pq}^{\text{new}}$ are the weights before and after adjustment, respectively (the parameter values are chosen by trial and error). The error in the output layer is propagated backwards to adjust the weights in the hidden layers. The error in neuron $p$ in the hidden layer is obtained using Equation 8:

\[ \delta_p = y_p (1 - y_p) \sum_{q} \delta_q\, w_{pq} \tag{8} \]
where the sum runs over the neurons $q$ of the output layer. The error $\delta_p$ is used to adjust the weights connecting to neuron $p$ in the hidden layer. This process is repeated for all the hidden layers. Applying all inputs once to the network and adjusting the weights is called an epoch. In the backpropagation training algorithm, the network weights are adjusted for a certain number of epochs to map the relationship between inputs and outputs (Suhasini, Palanivel, & Ramalingam, 2011).
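The training procedure can be sketched for a network with one hidden layer. The sketch assumes sigmoid activation units and omits bias terms for brevity; the initial weights, the learning rate `eta` and the single training example are hypothetical. It is not the WEKA implementation used in the experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return hidden-layer and output-layer activations (no bias terms)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output


def train_step(x, desired, w_hidden, w_out, eta=0.5):
    """One weight adjustment for one example; returns the squared error before it."""
    hidden, output = forward(x, w_hidden, w_out)
    # Output-layer error for sigmoid units: (d_q - y_q) * y_q * (1 - y_q).
    delta_out = [(d - y) * y * (1.0 - y) for d, y in zip(desired, output)]
    # Hidden-layer error, propagated backwards through the old output weights.
    delta_hidden = [
        h * (1.0 - h) * sum(delta_out[q] * w_out[q][p] for q in range(len(w_out)))
        for p, h in enumerate(hidden)
    ]
    # Adjust hidden-to-output weights, then input-to-hidden weights.
    for q, row in enumerate(w_out):
        for p in range(len(row)):
            row[p] += eta * delta_out[q] * hidden[p]
    for p, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[p] * x[i]
    return sum((d - y) ** 2 for d, y in zip(desired, output))


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.2, -0.1]]
x, desired = [0.5, 0.1], [0.9]
# With a single training example, each call corresponds to one epoch.
errors = [train_step(x, desired, w_hidden, w_out) for _ in range(200)]
print(errors[0], errors[-1])
```

The hidden-layer errors are computed before any weight is changed, so that the backward pass uses the same weights as the forward pass.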

Regression Tree (M5)
M5 is the most commonly used algorithm of the regression tree family. Structurally, a model tree takes the form of a decision tree with linear regression functions instead of terminal class values at its leaves (Bonakdar & Etemad-Shahidi, 2011). The M5 model tree is a numerical prediction algorithm, and the nodes of the tree are chosen over the attribute that maximizes the expected error reduction as a function of the standard deviation of the output parameter (Zhang & Tsai, 2007).
First, the M5 algorithm constructs a regression tree by recursively splitting the instance space. The splitting condition is used to minimize the intra-subset variability in the values down from the root through the branch to the node. The variability is measured by the standard deviation of the values that reach that node from the root through the branch, and the expected reduction in error is calculated as a result of testing each attribute at that node (Bonakdar & Etemad-Shahidi, 2011). In this way, the attribute that maximizes the expected error reduction is chosen. The splitting process stops if either the output values of all the instances that reach the node vary only slightly or only a few instances remain. The standard deviation reduction (SDR) is calculated as follows (Equation 9):

\[ SDR = sd(T) - \sum_{i} \frac{|T_i|}{|T|}\, sd(T_i) \tag{9} \]
where $T$ is the set of examples that reach the node, $T_i$ are the sets that result from splitting the node according to the chosen attribute, and $sd$ is the standard deviation (Wang, Witten, & Science, 1996). After the tree has been grown, a multiple linear regression model is built for every inner node, using the data associated with that node and all the attributes that participate in tests in the sub-tree rooted at that node. The linear regression models are then simplified by dropping attributes if doing so results in a lower expected error on future data. After this simplification, every sub-tree is considered for pruning. Pruning occurs if the estimated error for the linear model at the root of a sub-tree is smaller than or equal to the expected error for the sub-tree (Bonakdar & Etemad-Shahidi, 2011).
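The splitting criterion can be illustrated by computing the standard deviation reduction for two candidate splits of the same node. This is a minimal sketch: it assumes the population standard deviation (the source does not say which variant is used), and the target values and candidate splits are hypothetical.

```python
import statistics


def sdr(parent, subsets):
    """Standard deviation reduction: sd(T) - sum(|T_i| / |T| * sd(T_i))."""
    n = len(parent)
    weighted = sum(len(s) / n * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(parent) - weighted


# Hypothetical target values reaching a node, and two candidate splits.
node_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
split_a = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]   # separates low from high values
split_b = [[1.0, 3.0, 10.0], [2.0, 11.0, 12.0]]   # mixes low and high values

sdr_a = sdr(node_values, split_a)
sdr_b = sdr(node_values, split_b)
best = "A" if sdr_a > sdr_b else "B"
print(f"SDR(A) = {sdr_a:.3f}, SDR(B) = {sdr_b:.3f}, chosen split: {best}")
```

Split A yields the larger reduction because each of its subsets is nearly homogeneous, so the attribute producing it would be the one chosen at this node.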

III. Results and Discussion
This part of the paper reports the results obtained by the classifiers described in Section 2.2. First, the performance evaluation methods are presented (Section 3.1), followed by the results obtained by the classifiers (Section 3.2). In all cases, the scores presented in the tables were estimated using 10-fold cross-validation (Refaeilzadeh, Tang, & Liu, 2009).
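As a reminder of how k-fold cross-validation partitions the data, the sketch below generates the training and test indices for each fold. It is an illustration only: the number of instances is hypothetical, and the instances are not shuffled here, whereas a real evaluation would normally randomize their order first.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical dataset of 25 instances split into 10 folds.
folds = list(k_fold_indices(25, k=10))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(fold_number, len(train), test)
```

Every instance is used for testing exactly once, and the reported score is the average over the k test folds.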

Pearson correlation coefficient (PCC)
In statistics, the Pearson correlation coefficient is a measure of how well a linear equation describes the relation between two variables $X$ and $Y$ measured on the same object or organism. The result of the calculation of this coefficient is a numeric value that runs from $-1$ to $+1$ (Monedero et al., 2012). This coefficient is calculated by means of the following equation:

\[ r = \frac{\operatorname{cov}(X, Y)}{\sigma_X\, \sigma_Y} \tag{10} \]
where $\operatorname{cov}(X, Y)$ is the covariance between $X$ and $Y$, and $\sigma_X \sigma_Y$ is the product of the standard deviations of $X$ and $Y$.
A value of 1 indicates that a linear equation describes the relationship perfectly and positively, with all data points lying on the same line and with Y increasing with X. A score of -1 shows that all data points lie on a single line but Y increases as X decreases. Finally, a value of 0 shows that a linear model is inappropriate: there is no linear relationship between the variables (Huitema, 1980).
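The three reference cases (1, -1 and 0) can be checked numerically with a small sketch. The function below computes the coefficient as the covariance divided by the product of the standard deviations; the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (sd(X) * sd(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the deviations cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # Y increases with X
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # Y decreases as X increases
r_none = pearson([-1, 0, 1], [1, 0, 1])            # no linear relationship
print(r_positive, r_negative, r_none)
```

The last case shows that a coefficient of 0 rules out only a linear relationship: Y still depends on X there, but not linearly.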

Mean absolute error (MAE)
MAE measures the closeness between a prediction and the actual value in a data set (Hyndman & Koehler, 2006), and is defined by Equation 11:

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |p_i - a_i| \tag{11} \]
where $p_i$ is the prediction, $a_i$ the real value, and $n$ the number of observations.

Root mean squared error (RMSE)
RMSE represents the difference between the predicted value and the observed value by the mean square (Armstrong & Collopy, 1992), and is defined as (Equation 12):

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2} \tag{12} \]
where $p_i$ is the prediction, $a_i$ the real value, and $n$ the number of observations.

Relative absolute error (RAE)
RAE calculates the prediction error rate of a classifier (Armstrong & Collopy, 1992; Hall et al., 2009), using the following equation:

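\[ RAE = \frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|} \times 100\% \tag{13} \]
where $p_i$ is the prediction, $a_i$ the real value, $\bar{a}$ the mean of the real values, and the numerator $\sum_{i=1}^{n} |p_i - a_i|$ is the total absolute error.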
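The three error measures can be compared on a small example. The sketch below is illustrative: the predicted and real IRR values are hypothetical, and the RAE is assumed to follow the usual definition in which the total absolute error is divided by the total absolute error of always predicting the mean of the real values.

```python
import math


def mae(predicted, real):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(predicted, real)) / len(real)


def rmse(predicted, real):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, real)) / len(real))


def rae(predicted, real):
    """Relative absolute error in percent (assumed definition, see lead-in)."""
    mean_real = sum(real) / len(real)
    total_abs_error = sum(abs(p - a) for p, a in zip(predicted, real))
    baseline_error = sum(abs(a - mean_real) for a in real)
    return 100.0 * total_abs_error / baseline_error


# Hypothetical real and predicted IRR values (percent).
real_irr = [10.0, 12.0, 8.0, 15.0]
predicted_irr = [11.0, 11.0, 9.0, 13.0]
print(mae(predicted_irr, real_irr))
print(rmse(predicted_irr, real_irr))
print(rae(predicted_irr, real_irr))
```

Under this definition, an RAE close to 100% means the model is barely better than predicting the mean IRR for every instance, which helps in reading the percentages reported for the classifiers.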

b) Experimental Results
Experiments were carried out with SMOreg, Multilayer Perceptron (which uses backpropagation to classify instances), and M5P, which are the implementations of SVR, BPNN and M5 in the WEKA framework (Hall et al., 2009). Table 1 describes the general characteristics of the datasets used, presenting the number of attributes. The datasets were defined in order to avoid redundant attributes.
The dataset DS1 was tested with the performance evaluation methods Pearson Correlation Coefficient (PCC), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Relative Absolute Error (RAE), using the classifiers Support Vector Regression, Backpropagation Neural Network and M5 Regression Tree.
The dataset DS2 was tested in the same way as DS1 (Table 4), but DS2 excludes the attributes from soil fertility properties (7-11) and three of the physical crop properties (13-15), in order to remove redundant attributes.
The DS2 outcomes (Table 4) are mildly better than those for DS1. This is because of the positive correlation obtained by BPNN and SVR. The MAE and RMSE evaluations for DS2 are also better than those for DS1, which reflects a smaller difference between the predicted and real IRR values for the classifiers SVR (MAE = 2.2872, RMSE = 3.3897) and BPNN (MAE = 2.3499, RMSE = 3.3115). However, the minimum prediction error rate obtained by the classifiers on DS1 did not improve on DS2. The best results are 93.18% (SVR) and 95.73% (BPNN).
Finally, the dataset DS3 replaces attributes 7-11 and 13-15 with the plot number (plot_number), which clusters the values of the removed attributes. The evaluation of DS3 is presented in Table 5. When the attribute plot_number is added, the outcomes improve with respect to DS1 and DS2 (Table 5): the classifiers evaluated on DS3 present a positive correlation closer to 1 than the values obtained for DS1 and DS2, with SVR giving the best result and BPNN and M5 presenting similar results. Furthermore, the MAE and RMSE measures show that the predicted IRR is close to the real IRR, with 2.0512 (MAE) and 2.8 (RMSE) percentage points of IRR using SVR, while M5 obtains 2.1723 (MAE) and 2.8679 (RMSE). Similarly, the prediction error rate decreases for SVR (89.72%) with respect to the outcomes obtained by the classifiers on DS1 and DS2.

IV. Conclusions
We have built a novel dataset for coffee rust detection in Colombian crops given weather conditions, soil fertility properties, physical crop properties, and crop management. This approach required a significant effort to analyze and preprocess the data. We consider that understanding the nature of the problem is the first step towards solving it.
The proposed dataset was used to form three distinct subsets of selected attributes (DS1, DS2 and DS3), which were evaluated with three classifiers: Support Vector Regression, Backpropagation Neural Network, and M5 Regression Tree. Dataset DS3 presents the best outcomes compared with DS1 and DS2, and Support Vector Regression obtains the best performance evaluation for each dataset (DS1, DS2 and DS3). Nevertheless, having few instances to train a classifier limits its performance, since the classifier cannot make the right decision if the training dataset does not contain cases that support the expected decision.