The control of moldy risk during rice storage based on multivariate linear regression analysis and random forest algorithm

Public summary
■ A multivariate linear regression model relating spore number to ambient temperature, rice moisture content and storage days was developed from experimental data.
■ The random forest algorithm was introduced into fungal spore prediction during grain storage.
■ The established regression models can be used to predict the spore number under different ambient temperatures, rice moisture contents and storage days during the storage process.
Abstract: Clarifying the mechanism of fungal growth is of great significance for maintaining quality during grain storage. Among the factors that affect the growth of fungal spores, the most important are temperature, moisture content and storage time. In this study, a multivariate linear regression model relating spore number to ambient temperature, rice moisture content and storage days was therefore developed from experimental data. To build a more accurate model, we introduced a random forest algorithm into fungal spore prediction during grain storage. The established regression models can be used to predict the spore number under different ambient temperatures, rice moisture contents and storage days during the storage process. For 99% of the original data, the random forest model kept the predicted value within the same order of magnitude as the actual value, which indicates high accuracy in predicting the spore number during storage. Furthermore, we plotted the prediction surface to help practitioners keep the storage environment within the low-risk region.


Introduction
Grain mildew is an important cause of grain loss during storage. The condition of grain mildew can be quickly determined by detecting the number of fungal spores [1]. Therefore, clarifying the mechanism of fungal growth is of great significance for maintaining quality during grain storage. Rice, as one of the main crops, is widely cultivated in China. Its storage environment and conditions are especially important for ensuring edible quality. For example, rice with a high moisture content readily heats up and becomes moldy during storage. If this is not controlled appropriately, the mold can spread rapidly inside the grain heap, damaging the rice and even causing food safety accidents. Rice storage depends on many influencing factors, such as ambient temperature, rice moisture content, and storage days. Although these factors are commonly known to affect rice storage, quantitative analysis of the underlying mechanism, which would help maintain an appropriate storage environment, remains limited.
Many studies have built a solid basis for rice storage research, but they have mostly focused on one or two influencing factors and lack a general analysis covering the main influencing factors together. For example, Yin et al. [2] investigated the rice mycoflora in China and found that the main fungi on newly harvested grain are field fungi. As storage time increased, the field fungi decreased and were gradually replaced by storage fungi. Purushtham et al. [3] addressed the relationship between the change in storage quality and the growth of fungi. Laca et al. [4] indicated that mold is mainly distributed on the surface of the grain. Genkawa et al. [5] investigated the changes in fatty acid value, germination rate and mold activity during rice storage at different moisture contents. Soponronnarit et al. [6] found that the quality of rice with a high initial moisture content is more likely to change. Zhou et al. [7,8] studied the changes in rice fungal mycoflora under different storage conditions, revealing that ambient temperature and rice moisture content directly affect the growth of rice storage fungi during storage. Therefore, it is critical to address the mechanism underlying rice storage under the combined effects of the main influencing factors.
Biological methods have been widely used previously and have focused on revealing the relationship between the growth mechanism of fungi and environmental factors such as rice moisture content and temperature. However, the mechanism of fungal growth is often complicated by the interaction of multiple influencing factors. It is difficult to fully understand the inherent causal relationship between these factors and fungal growth and to establish an accurate mathematical model based on mechanism analysis. The usual approach is to collect a large amount of data and build a mathematical model based on statistical analysis of these data, but very few researchers work in this area because the experimental tests take a long time.
In this study, based on data from rice storage experiments carried out by the Tang Fang research group at the Academy of National Food and Strategic Reserves Administration, a multivariate linear regression model and a random forest model were developed between the spore number of rice and the main influencing factors, namely ambient temperature, rice moisture content, and storage days. First, regression analysis, a statistical method widely used to address the correlation between variables, was conducted to establish a linear model between the spore number of rice and the main influencing factors. After the model was built, significance tests (the F test and the t-test) and a residual analysis were performed to improve it. However, although the linear regression model yields an explicit function for the spore number of stored rice, it does not have sufficient precision for a prediction model.
Random forest, a relatively new algorithm in the AI field, is an ensemble of multiple decision trees and a nonparametric statistical technique for enhancing prediction accuracy, and it has been widely used to address engineering problems [9−11]. We therefore introduced the decision tree and random forest algorithms into fungal spore prediction during grain storage. For 99% of the original data, the random forest model kept the predicted value within the same order of magnitude as the actual value, which indicates high accuracy in predicting the spore number during storage. Furthermore, the prediction surface can be plotted to help practitioners keep the storage environment within the low-risk region. The results suggest that the random forest model has a higher predictive ability in classification and regression for this problem, which helps in understanding the growth pattern of fungal spores and assists managers in taking measures in advance.

Experimental details and treatments
Experimental studies on the mechanism of fungal growth have shown that three main factors affect the growth of fungal spores on rice: ambient temperature, rice moisture content, and storage time. Relevant experiments were therefore conducted in the laboratory to simulate the rice storage environment.
The rice sample was obtained from Heilongjiang province in the north of China. After the initial moisture content was determined, deionized water was added to the samples to adjust them to the target moisture, and the samples were then sealed and refrigerated at 4 ℃ for 30 to 60 days to equilibrate until the moisture content of the rice reached the target value. Then, rice at 10 moisture contents (13%, 13.5%, 14%, 14.5%, 15%, 15.5%, 16%, 16.5%, 17% and 18%) was sealed and placed at different storage temperatures (10, 15, 20, 25, 30 and 35 ℃) in biochemical incubators to simulate storage for 180 days. There were thus 10 × 6 = 60 experimental scenarios. Every 10 days, two samples were taken from each of the 60 scenarios to measure the moisture content and to count the spores under a microscope, and the average spore number of the two samples was used for further analysis. For example, the grain with a moisture content of 16% showed a spore number of 4650000 at 30 ℃ after 180 days of storage. After all the experiments were completed, a total of 18 × 60 = 1080 groups of data were obtained.
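The size of the dataset follows directly from the factorial design. The short sketch below enumerates the design grid; it assumes that sampling took place on days 10, 20, ..., 180 (18 time points, with no separate day-0 record), which is inferred from the stated total of 18 × 60 and is not spelled out in the text. The variable names are illustrative only.

```python
from itertools import product

# Factor levels as listed in the text.
moisture_levels = [13.0, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.0, 18.0]  # percent
temperatures = [10, 15, 20, 25, 30, 35]                                        # degrees Celsius

# Assumption: one record every 10 days, from day 10 up to and including day 180.
sampling_days = list(range(10, 181, 10))

# Each (moisture, temperature) pair is one storage scenario.
scenarios = list(product(moisture_levels, temperatures))

# Each scenario contributes one averaged spore count per sampling day.
records = [(m, t, d) for (m, t) in scenarios for d in sampling_days]

print(len(scenarios), len(sampling_days), len(records))
```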

Multivariate linear regression model
Theory The multivariate linear regression model, an important and widely adopted method for addressing the relationship among multiple influencing factors, was used in this study. The method is briefly introduced here; more details are presented by Chen [12]. The multivariate linear regression model assumes that the explained (dependent) variable y is affected by n explanatory (independent) variables $x_1, x_2, \ldots, x_n$ through a linear relationship. The model can be expressed as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon,$$
where $\beta_0, \beta_1, \ldots, \beta_n$ are the regression coefficients and $\varepsilon$ is the random error. The rationality of the linearity assumption is determined later by significance tests, which can be divided into an F test (the significance test for the regression model as a whole) and a t-test (the significance test for each independent variable) [12].
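For reference, the quantities used in the significance tests can be written out explicitly. The following is a sketch in standard textbook notation, with N the number of observations, n the number of explanatory variables, U the regression sum of squares, Q the residual sum of squares and T the total sum of squares. The symbols N and Q, and this reading of U and T, are assumptions made here rather than definitions stated in the text.

```latex
\begin{aligned}
U &= \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2, \qquad
Q = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad
T = \sum_{i=1}^{N} (y_i - \bar{y})^2 = U + Q, \\
F &= \frac{U / n}{Q / (N - n - 1)}, \qquad
R = \sqrt{U / T}.
\end{aligned}
```

Under the null hypothesis that all slope coefficients are zero, F follows an F distribution with (n, N − n − 1) degrees of freedom. With n = 3 explanatory variables and N = 1080 observations this gives (3, 1076), which agrees with the critical value F 0.05 (3,1076) quoted in the results.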
Residual analysis The multivariate linear regression method described above is one of the most widely adopted regression methods in engineering. In practical applications, abnormal data may arise from accidental interference during data acquisition, and passing the significance tests does not completely exclude them. The residual is the difference between the observed value of each data group in the dataset and the fitted value of the regression model, and residual analysis is used to eliminate abnormal data. If the confidence interval of a residual contains zero, the regression model matches that observation well; otherwise, the observation is regarded as an abnormal point. After the abnormal points are found and eliminated, the regression model is re-established with the remaining data to improve its accuracy.
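The fit, screen and refit cycle described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the procedure used in the study: it replaces the per-residual confidence interval with a simpler rule (a point is flagged when its absolute residual exceeds k times the residual standard deviation, with k = 2 as an assumed threshold), and all function names are hypothetical.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    Z = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(ZtZ, Zty)


def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def screen_and_refit(X, y, k=2.0):
    """Flag points with |residual| > k * s (assumed rule), drop them, and refit."""
    beta = fit_ols(X, y)
    residuals = [yi - predict(beta, row) for row, yi in zip(X, y)]
    dof = len(y) - len(beta)
    s = (sum(r * r for r in residuals) / dof) ** 0.5
    keep = [i for i, r in enumerate(residuals) if abs(r) <= k * s]
    removed = [i for i in range(len(y)) if i not in keep]
    beta_new = fit_ols([X[i] for i in keep], [y[i] for i in keep])
    return beta_new, removed


# Synthetic demonstration: y = 1 + 2x with one corrupted observation at index 5.
X = [[float(x)] for x in range(10)]
y = [1.0 + 2.0 * x for x in range(10)]
y[5] += 20.0
beta_new, removed = screen_and_refit(X, y)
print(removed, beta_new)
```

On this synthetic example only the corrupted observation is flagged, and the refit recovers the intercept and slope of the underlying line.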

Random forest algorithm
Theory The decision tree is a common method for various machine learning tasks, but it can suffer from low precision and over-fitting. Therefore, many researchers combine other models with decision trees to improve the accuracy of classification or regression models. The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners.
Random forests, proposed by Breiman [13], are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Since the algorithm was proposed, it has been widely used for classification and regression in various fields. Svetnik et al. [14] introduced a new method for predicting a compound's quantitative or categorical biological activity based on random forest. Oliveira et al. [15] investigated the potential applicability of the random forest method in the assessment of fire occurrence. Classification and regression trees (CART) is a term introduced by Breiman [16] for decision tree algorithms that can be used for classification or regression predictive modeling problems. Creating a CART model involves selecting input variables and split points on those variables until a suitable tree is constructed. The input space is divided by a greedy approach called recursive binary splitting [17]: at each step, the input variable and the specific split (cut) point are selected to minimize the cost function. For regression predictive modeling problems, the objective is to minimize the cost function of the chosen split points across all training samples that fall within each rectangle $R_j$, where $\hat{y}_{R_j}$ is the average prediction value of the training samples in the rectangle $R_j$ [18]. Tree construction ends at a predefined stopping criterion; for example, if splitting a node would produce a child node smaller than the user-specified minimum child node size, the node is not split [19]. In addition, many improved algorithms have been proposed to enhance the prediction accuracy [20−22].
Principle of random forest algorithm Given a training set $X = x_1, x_2, \ldots, x_n$ with responses $Y = y_1, y_2, \ldots, y_n$, bagging repeatedly (B times) selects a random sample of the training set with replacement. For b from 1 to B, the training examples drawn from X, Y are named $X_b$, $Y_b$, and the corresponding decision tree $f_b$ is trained on $X_b$, $Y_b$. The predictions of the decision trees are then combined, and the final prediction is obtained by voting (for classification) or by averaging (for regression). A large number of theoretical and empirical studies have shown that RF has high prediction accuracy and is not prone to overfitting.
A random forest is generated as follows:
(Ⅰ) Generate n samples from the original set by bootstrap sampling;
(Ⅱ) Assuming that the number of sample features is a, select k of the a features for the n samples, and obtain the optimal split point by building a decision tree;
(Ⅲ) Repeat m times to generate m decision trees;
(Ⅳ) Use the bagging strategy to make predictions, that is, combine the outputs of the m trees by voting or averaging.
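The four steps can be illustrated with a deliberately small sketch. To keep it short, each "tree" here is a one-split stump rather than a full decision tree, and the split is scored by the summed absolute deviation from the leaf means; both choices are simplifications assumed for illustration. The function names and the synthetic data are hypothetical.

```python
import random


def fit_stump(X, y, features):
    """One-split regression tree over the given feature indices."""
    best = None
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[f] <= thr]
            right = [yi for row, yi in zip(X, y) if row[f] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            cost = sum(abs(v - ml) for v in left) + sum(abs(v - mr) for v in right)
            if best is None or cost < best[0]:
                best = (cost, f, thr, ml, mr)
    if best is None:  # no valid split: predict the sample mean everywhere
        mean = sum(y) / len(y)
        return (features[0], float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, thr, ml, mr = stump
    return ml if row[f] <= thr else mr


def fit_forest(X, y, m=50, k=1, seed=0):
    rng = random.Random(seed)
    n, a = len(X), len(X[0])
    forest = []
    for _ in range(m):                               # (III) repeat m times
        idx = [rng.randrange(n) for _ in range(n)]   # (I) bootstrap sample of size n
        feats = rng.sample(range(a), k)              # (II) k of the a features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):                     # (IV) average over the trees
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Synthetic data: the response depends on feature 0 only; feature 1 is noise.
X = [[i, (i * 7) % 10] for i in range(20)]
y = [0.0 if i < 10 else 10.0 for i in range(20)]
forest = fit_forest(X, y, m=50, k=2, seed=0)
low = predict_forest(forest, [2, 4])
high = predict_forest(forest, [17, 9])
print(low, high)
```

Because each stump sees a different bootstrap sample, the individual thresholds vary slightly, and averaging smooths them into a single ensemble prediction.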

Results of multivariate linear regression model of the spore number of stored rice
During the data analysis, the experimental data were imported into Matlab to draw scatter plots of spore number against ambient temperature, rice moisture content, and storage days. The outputs are shown in Fig. 1. To simplify the calculation, y is defined as the "spore number" with $y = N \times 10^{-5}$, where N represents the number of spores. After the regression analysis, the regression coefficient vector is obtained, and the multivariate linear regression model between the spore number and ambient temperature, rice moisture content and storage days takes the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$, where y is the spore number; $x_1$ is the ambient temperature; $x_2$ is the moisture content of the rice samples; and $x_3$ is the storage days.

It is also known that F > $F_{0.05}(3, 1076)$, so it can be considered that there is a multivariate linear relationship between the spore number and ambient temperature, rice moisture content and storage days. However, the multiple correlation coefficient is $R = \sqrt{U/T} = 0.5036$ ($R^2 = 0.2536$), where U is the regression sum of squares and T is the total sum of squares, which indicates that this linear relationship is weak and that further analysis is needed.
From the scatter plots in Fig. 1, it can be seen that the spore number varies approximately exponentially with the ambient temperature, the rice moisture content, and the storage days. Therefore, the multivariate linear regression can be performed after taking the natural logarithm of the spore number. The scatter plots after taking the natural logarithm are shown in Fig. 2.
In the resulting regression model, y is the spore number after taking the natural logarithm.

It is also known that F > $F_{0.05}(3, 1076)$, and the multiple correlation coefficient is $R = \sqrt{U/T} = 0.8721$ ($R^2 = 0.7606$). The results indicate that, after taking the natural logarithm, there is a good multivariate linear relationship between the spore number and ambient temperature, rice moisture content and storage days.
Residual analysis The residual analysis is plotted with the residual as the ordinate and the case number of the observations as the abscissa; the output is shown in Fig. 3.
The green lines in Fig. 3 mark residuals whose confidence intervals contain zero. The red lines mark residuals whose confidence intervals do not contain zero, namely the abnormal points. The original data corresponding to these abnormal points were obtained with a different spore counting standard during the experiments, and they differ significantly from the other observations. After these abnormal points are eliminated, the multivariate linear regression is performed again on the remaining data, giving the new regression coefficient vector (−31.39, 0.22, 1.57, 0.03). The new multivariate linear regression model is
$$y = -31.39 + 0.22x_1 + 1.57x_2 + 0.03x_3.$$
From the regression coefficients, the order of influence of the three independent variables on the dependent variable is $x_2 > x_1 > x_3$. In other words, rice moisture content has the largest effect, ambient temperature the second largest, and storage days the least impact on the spore number of rice.

Conclusions of multivariate linear regression model of the spore number of stored rice
Based on a long-term experimental test on rice storage, a multivariate linear regression model was developed between the spore number and the main influencing factors, namely ambient temperature, rice moisture content, and storage days. Several conclusions can be drawn:
(Ⅰ) Multivariate linear regression on the natural logarithm of the spore number fits the data better than regression on the raw spore number; both the F statistic and the multiple correlation coefficient R are significantly improved;
(Ⅱ) The t-test shows that all three explanatory variables (ambient temperature, rice moisture content and storage days) play an important role in the spore number of rice during storage. The log-linear model indicates that the growth of fungal spores is exponentially related to ambient temperature, rice moisture content, and storage days;
(Ⅲ) Through the residual analysis, the improved multivariate linear regression model $y = -31.39 + 0.22x_1 + 1.57x_2 + 0.03x_3$ is obtained after eliminating the abnormal data, where y, $x_1$, $x_2$ and $x_3$ are the spore number after taking the natural logarithm, ambient temperature, rice moisture content and storage days, respectively. The F statistic is 1654.73, the multiple correlation coefficient is R = 0.9155 ($R^2$ = 0.8381), and the t statistics $t_1$, $t_2$, $t_3$ corresponding to ambient temperature, rice moisture content, and storage days are 39.07, 48.47 and 28.20;
(Ⅳ) The multivariate linear regression model obtained above can predict the spore number of rice under different ambient temperatures, rice moisture contents and storage days during storage, and thus help judge whether mildew could occur in the rice;
(Ⅴ) The regression coefficients show that the number of fungal spores is most closely related to the moisture content of rice, followed by ambient temperature, and then storage days.
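As an illustration of how the improved regression model in conclusion (Ⅲ) might be applied, the sketch below evaluates it for a given storage condition and converts coefficient differences into multiplicative growth factors. It assumes that temperature is entered in ℃ and moisture content in percent (for example 16 for 16%), which is inferred from the magnitude of the coefficients; the function names are hypothetical. Only the log-scale value and relative changes are computed, because these do not depend on how the spore number was scaled before the logarithm was taken.

```python
import math

# Coefficients of the improved log-linear model reported in the text.
B0, B_TEMP, B_MOIST, B_DAYS = -31.39, 0.22, 1.57, 0.03


def predict_log_spores(temp_c, moisture_pct, days):
    """Model value y (natural-log scale) for one storage condition."""
    return B0 + B_TEMP * temp_c + B_MOIST * moisture_pct + B_DAYS * days


def growth_factor(d_temp=0.0, d_moisture=0.0, d_days=0.0):
    """Multiplicative change in spore number implied by a change in conditions."""
    return math.exp(B_TEMP * d_temp + B_MOIST * d_moisture + B_DAYS * d_days)


y_example = predict_log_spores(temp_c=30, moisture_pct=16, days=180)
per_moisture_point = growth_factor(d_moisture=1.0)   # +1 percentage point of moisture
per_degree = growth_factor(d_temp=1.0)               # +1 degree Celsius
per_ten_days = growth_factor(d_days=10.0)            # +10 days of storage
print(y_example, per_moisture_point, per_degree, per_ten_days)
```

Read this way, the model implies that one additional percentage point of moisture multiplies the predicted spore number by roughly 4.8, compared with about 1.25 per degree of temperature and about 1.35 per ten days of storage.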

Data processing
The experimental data are the same as those used for the multivariate linear regression model and contain the number of spores and the three main influencing factors: ambient temperature, rice moisture content, and storage time. Because the number of spores (N) in the original data, plotted in Fig. 4a, shows no apparent distribution, we plot the logarithm of the spore number for easier analysis. Since many raw sample values are 0.01, only the part of the data with x > 0 after taking the logarithm is considered, and this part of the data approximately follows a normal distribution.

Model construction
For this problem, the goal of the training algorithm is to reduce the regression error. The model construction consists of the following procedures:
(Ⅰ) From the training data of n samples, draw a bootstrap sample.
(Ⅱ) For each bootstrap sample, grow a tree by choosing the split at each node that minimizes the cost function $\sum_{i \in R_j} |y_i - \hat{y}_{R_j}|$, where $\hat{y}_{R_j}$ is the average prediction value of the training samples in the chosen region $R_j$. We choose the mean absolute error rather than the mean square error because it is less sensitive to outliers.
(Ⅲ) A single node is formed by splitting on a continuous or discrete attribute (for example, temperature > 20), and a tree built from such nodes serves as a model that determines the final regression result. The key point of the algorithm is to decide whether a node should be split further, in other words, to judge the gain that a node brings to the entire model. A node stops splitting when the current tree depth or the node size reaches its limit, or when splitting the node does not reduce the value of the cost function.
(Ⅳ) Repeat the above steps until sufficient trees from different random training data are obtained.
(Ⅴ) For a given condition, the average of the predictions of the different trees is used as the final prediction result.
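The node-splitting and stopping rules in steps (Ⅱ) and (Ⅲ) can be sketched for a single tree as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the cost of a node is taken as the summed absolute deviation from the node average, the limits `max_depth` and `min_size` are hypothetical parameter names with arbitrary default values, and the toy data are invented.

```python
def node_cost(ys):
    """Sum of absolute deviations from the node average."""
    mean = sum(ys) / len(ys)
    return sum(abs(v - mean) for v in ys)


def best_split(X, y, min_size):
    """Return (cost, feature, threshold) of the cheapest admissible split, or None."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[f] <= thr]
            right = [yi for row, yi in zip(X, y) if row[f] > thr]
            if len(left) < min_size or len(right) < min_size:
                continue  # node-size limit: this split is not allowed
            cost = node_cost(left) + node_cost(right)
            if best is None or cost < best[0]:
                best = (cost, f, thr)
    return best


def grow(X, y, depth=0, max_depth=3, min_size=2):
    """Grow a tree; a leaf is a float, an internal node is (feature, thr, left, right)."""
    split = best_split(X, y, min_size) if depth < max_depth else None
    if split is None or split[0] >= node_cost(y):  # no reduction in cost: stop
        return sum(y) / len(y)
    _, f, thr = split
    left = [(r, v) for r, v in zip(X, y) if r[f] <= thr]
    right = [(r, v) for r, v in zip(X, y) if r[f] > thr]
    return (f, thr,
            grow([r for r, _ in left], [v for _, v in left], depth + 1, max_depth, min_size),
            grow([r for r, _ in right], [v for _, v in right], depth + 1, max_depth, min_size))


def predict(tree, row):
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree


# Toy data: one feature (temperature) and a response that jumps between 20 and 25.
X = [[10], [15], [20], [25], [30], [35]]
y = [1, 1, 1, 5, 5, 5]
tree = grow(X, y)
print(tree, predict(tree, [12]), predict(tree, [33]))
```

In the toy example the root is split once, at the midpoint between 20 and 25, and neither child is split further because any further split would violate the minimum node size.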

Out of bag (OOB) score and mean absolute error
The out-of-bag (OOB) score is a way of validating the random forest model. After the classifiers (S trees) have been created, for each $(x_i, y_i)$ in the original training set T, select all bootstrap sets $T_k$ that do not include $(x_i, y_i)$. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier aggregates the votes only over the $T_k$ that do not contain $(x_i, y_i)$. The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set, obtained by comparing its output with the known $y_i$ [23].
As shown in Fig. 5, the OOB error rate decreases dramatically as the number of trees increases from 1 to 6. After the number of trees reaches 10, the error rate does not decline significantly. We also calculate the mean absolute error, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$.
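A minimal sketch of out-of-bag evaluation for regression is given below. It assumes that the base learner is passed in as a pair of callables, `fit(X, y)` and `predict(model, row)`; these names, and the trivial mean predictor used in the demonstration, are stand-ins for illustration and not the trees used in the study. For each sample, only the models whose bootstrap sample did not contain it contribute to its OOB prediction, and the MAE is then computed on those predictions.

```python
import random


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def oob_predictions(X, y, fit, predict, n_trees=25, seed=0):
    """Bagging with out-of-bag aggregation for regression.

    Returns one OOB prediction per sample, or None if the sample
    happened to be drawn into every bootstrap sample.
    """
    rng = random.Random(seed)
    n = len(X)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]
        model = fit([X[i] for i in bag], [y[i] for i in bag])
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:  # this model never saw sample i
                sums[i] += predict(model, X[i])
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Stand-in learner: predicts the mean response of its bootstrap sample.
def mean_fit(X_bag, y_bag):
    return sum(y_bag) / len(y_bag)


def mean_predict(model, row):
    return model


X = [[t] for t in range(10)]
y = [float(t) for t in range(10)]
oob = oob_predictions(X, y, mean_fit, mean_predict)
scored = [(a, p) for a, p in zip(y, oob) if p is not None]
oob_mae = mae([a for a, _ in scored], [p for _, p in scored])
print(oob_mae)
```

Swapping the stand-in learner for a real tree learner leaves the OOB bookkeeping unchanged, which is why the two are kept separate here.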

R 2 score of the multiple linear regression model and random forest model
To compare the random forest model with the multiple linear regression method, we use the $R^2$ score to evaluate the models.
The $R^2$ score is defined as $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$, where $y_1, y_2, \ldots, y_n$ are the real values of the samples, $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ are the predicted values, and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. The $R^2$ scores of the two models are shown in Table 1.
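The R 2 score can be computed directly from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses invented numbers purely to show the calculation; the function name is chosen here for illustration.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(actual, actual)                    # predictions equal the data
baseline = r2_score(actual, [2.5, 2.5, 2.5, 2.5])     # always predict the mean
close = r2_score(actual, [1.1, 1.9, 3.2, 3.8])        # small prediction errors
print(perfect, baseline, close)
```

A score of 1 corresponds to perfect predictions and a score of 0 to a model that does no better than always predicting the mean, which gives a scale for reading the values reported for the two models.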
The $R^2$ score of the RF model on the test data is 0.95, which is much higher than that of the MLR model; this shows that the random forest prediction model performs better in predicting the number of mold spores grown on rice. Fig. 6 displays the variation of spore number with temperature, moisture and storage days. The number of spores grows as moisture increases and storage time lengthens. As shown in the figure, moisture contents and storage days in the purple region indicate that the grain would not mildew, so this is a safe region for grain storage. Therefore, it is important to keep the storage environment within the conditions of the low-risk region.

The prediction model
Fig. 7 was plotted with the predicted value as the abscissa and the actual value as the ordinate. The scatter points fall mostly between the two lines y = x − 0.5 and y = x + 0.5, which suggests that the prediction model is accurate. After conversion, it was found that for 99% of the original data, the predicted value was of the same order of magnitude as the actual value. All of the analysis above suggests that the random forest model has high prediction accuracy.

The feature importance
Random forests can calculate the importance of each feature, or its contribution to the decision tree. Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature. The importance values for moisture, temperature and storage time are calculated in this way and then normalized to values between 0 and 1 by dividing by the sum of all feature importance values.
The final feature importance at the random forest level is the average of the feature's importance over all trees. The importances of the three features, moisture, temperature and storage time, are calculated to be 0.539, 0.246 and 0.215, respectively. Therefore, for rice mildew, moisture is the most important factor, followed by temperature and then storage time, which is consistent with the result of the multivariate linear regression analysis.
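The verbal description above corresponds to the usual impurity-based definition of feature importance. One common way of writing it is sketched below; the notation is assumed here for illustration ($w_j$ is the fraction of samples reaching node $j$, $C_j$ its impurity, $\mathrm{left}(j)$ and $\mathrm{right}(j)$ its children, and $M$ the number of trees), and it may differ in detail from the formula used in the study.

```latex
\begin{aligned}
ni_j &= w_j C_j - w_{\mathrm{left}(j)} C_{\mathrm{left}(j)} - w_{\mathrm{right}(j)} C_{\mathrm{right}(j)}, \\
fi_k &= \frac{\sum_{j \,:\, \text{node } j \text{ splits on feature } k} ni_j}{\sum_{j \,\in\, \text{all nodes}} ni_j}, \qquad
\mathrm{norm}\,fi_k = \frac{fi_k}{\sum_{l} fi_l}, \\
RF\,fi_k &= \frac{1}{M} \sum_{t=1}^{M} \mathrm{norm}\,fi_k^{(t)}.
\end{aligned}
```

Because of the normalization, the importances of all features sum to one, which is consistent with the reported values: 0.539 + 0.246 + 0.215 = 1.000.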

Conclusions
In this work, based on data from rice storage experiments, a multivariate linear regression model and a random forest model were developed between the spore number of rice and the main influencing factors, namely ambient temperature, rice moisture content, and storage days.
For the multivariate linear regression analysis, an improved model was developed through residual analysis, and it further shows that moisture is the most important factor for rice mildew, followed by temperature and then storage time. For 99% of the original data, the random forest model kept the predicted value within the same order of magnitude as the actual value, which indicates high accuracy in predicting the spore number under different ambient temperatures, rice moisture contents and storage days during storage. Furthermore, we plotted the prediction surface to help practitioners keep the storage environment within the low-risk region. We therefore believe that our results can be applied in practice to reduce losses during grain storage.