DeepCropNet: a deep spatial-temporal learning framework for county-level corn yield estimation

Large-scale crop yield estimation is critical for understanding the dynamics of global food security. Understanding and quantifying the temporal cumulative effect of crop growth and spatial variances across different regions remains challenging for large-scale crop yield estimation. In this study, a deep spatial-temporal learning framework, named DeepCropNet (DCN), has been developed to hierarchically capture the features for county-level corn yield estimation. The temporal features are learned by an attention-based long short-term memory network and the spatial features are learned by the multi-task learning (MTL) output layers. The DCN model has been applied to quantify the relationship between meteorological factors and the county-level corn yield in the US Corn Belt from 1981 to 2016. Three meteorological factors, including growing degree days, killing degree days, and precipitation, are used as time-series inputs. The results show that DCN provides an improved estimation accuracy (RMSE = 0.82 Mg ha−1) as compared to that of conventional methods such as LASSO (RMSE = 1.14 Mg ha−1) and Random Forest (RMSE = 1.05 Mg ha−1). Temporally, the attention values computed from the temporal learning module indicate that DCN captures the temporal cumulative effect and this temporal pattern is consistent across all states. Spatially, the spatial learning module improves the estimation accuracy based on the regional specific features captured by the MTL mechanism. The study highlights that the DCN model provides a promising spatial-temporal learning framework for corn yield estimation under changing meteorological conditions across large spatial regions.


Introduction
Providing a sufficient food supply has become a critical global issue as the population grows. Food supply, however, has become more vulnerable under global warming, which increases the occurrence of extreme meteorological events (FAO 2018). Understanding the relationship between meteorological factors and crop growth is essential for estimating crop yield, which in turn provides decision support for national food production policy and local farm management (Liu et al 2016, Jin et al 2017, Jones et al 2017).
Process-based biophysical modeling (McCown et al 1996) and statistical modeling (Lobell and Burke 2010) are the two mainstream approaches to quantifying crop yield based on meteorological factors. Biophysical modeling represents the key processes of crop growth and yield formation with mechanistic equations developed through theoretical derivation and experimental data. However, validating the mechanistic equations is challenging, especially when scientific experiments and data are limited. Statistical modeling uses observed meteorological and crop yield data to quantify their associations by statistical regression, which is convenient for large-scale applications, but it has difficulty handling the complex interactions among meteorological variables and the nonlinear relationships between the dependent and independent variables. Biophysical modeling is more suitable for site-specific yield analysis, whereas statistical modeling is often adopted in large-scale spatial analysis. Some researchers have integrated process-based and statistical modeling in an effort to achieve higher accuracy (Michael et al 2017). With the rapid development of computing capabilities in recent years, artificial intelligence approaches, such as Random Forest (RF) (Saeed et al 2017), artificial neural networks (Alvarez 2009), Bayesian networks (Gandhi et al 2016), semiparametric neural networks (Crane-Droesch 2018), and convolutional neural networks (You et al 2017, Yang et al 2019), have gradually been applied to extract patterns for agricultural yield estimation. These studies, however, often simplify the temporal cumulative effect of crop growth and the spatial heterogeneity problem.
Crop growth is characterized by its temporal cumulative nature and its spatially explicit pattern. Temporally, crops grow and accumulate biomass over time, but how to express this physiological process when building a machine learning model is not well understood. Spatially, meteorological conditions and their influence on crop yield vary across regions (Huybers 2013, Zipper et al 2016). Individual local models for each region ignore the inherent correlations among regions, while most global models treat all regions as a whole and do not account for spatial differences in meteorological conditions and their relationships with crop yield (Van Wart et al 2013, Van Ittersum et al 2013). Developing a data-driven deep learning model that captures both the spatial and temporal features is critical for an improved understanding of the relationship between crop yield and meteorological factors.
The long short-term memory (LSTM) model is a special type of recurrent neural network whose gated structure facilitates time-series analysis and handles complexity and nonlinearity (Hochreiter and Schmidhuber 1997). Previous studies have demonstrated that LSTM performs well on long sequential data in natural language modeling (Sundermeyer et al 2012) and human trajectory prediction (Alahi et al 2016). In agriculture, researchers have also begun to study the feasibility of LSTM for tasks such as water table depth prediction (Zhang et al 2018), cropland classification (Rußwurm and Körner 2017), and yield prediction (Jiang et al 2019, Wang et al 2018).
The attention (AT) mechanism is inspired by human visual attention and shows how much attention the model pays to each part of the input sequence (Chaudhari et al 2019). It has shown state-of-the-art performance in concentrating on different parts of sequence data, both in natural language processing (Luong et al 2015, Wang et al 2016) and in image processing (Ba et al 2014). Multi-task learning (MTL) is an approach that builds one model for multiple tasks to improve generalization. MTL learns tasks in parallel using a shared representation, so what is learned for each task can help the other tasks be learned better (Caruana 1997). MTL can capture the commonalities and differences between tasks and has shown good performance in rainfall prediction (Qiu et al 2017), neuroimaging studies (Ma et al 2018) and railway track inspection (Gibert et al 2017).
Inspired by the successful applications of the deep learning technologies mentioned above, a deep spatial-temporal learning framework named DeepCropNet (DCN) is developed in this study to understand the interactions between crop yield and meteorological factors through the integration of temporal and spatial learning modules. For the temporal pattern, a temporal learning module that incorporates the attention mechanism into the LSTM model is applied to capture and visualize the temporal cumulative effect. To address the spatial heterogeneity problem, a spatial learning module uses MTL to treat yield estimation in different meteorological regions as different tasks, learning both the common features and the region-specific features. The US Corn Belt is selected as the region for a case study, and the tasks are defined based on the spatial distributions of the meteorological conditions during the corn growth period. This study aims to address the following three research questions:
(1) How does DCN perform for corn yield estimation compared with LASSO and RF?
(2) How does DCN learn the temporal patterns between time-series meteorological information and corn yields?
(3) How does DCN learn the region-specific features of corn yield estimation?

DeepCropNet
The DCN model proposed in this study is a deep learning framework that uses multiple layers to hierarchically learn the spatial and temporal features from the meteorological and yield data. The DCN consists of two parts: a temporal learning module and a spatial learning module (figure 1). The input of DCN is a time series of meteorological factors, including county-level weekly growing degree days (GDD), killing degree days (KDD) and precipitation (PRCP) from corn planting (week 1) to maturity (week 20). The output is the corn yield anomaly at the county level. In this study, an AT-LSTM network is trained with parameters shared across all regions to learn and visualize the general temporal pattern captured from the time series of weekly meteorological factors. To learn the spatial features in each region and output the corn yield anomaly, region-specific output layers based on MTL are developed as the spatial learning module.

Temporal learning module
The general temporal features are learned by an AT-LSTM network, which consists of an LSTM Net and an Attention Net (figure 1). The LSTM Net has five layers: an input layer (one layer), hidden layers (three layers), and an output layer (one layer). The input is a time sequence $\{x_1, x_2, \ldots, x_t\}$, where $x_t$ is a vector of the meteorological factors $[GDD_t, KDD_t, PRCP_t]$ in period $t$ of the corn growing season. The hidden layers are three LSTM layers composed of LSTM cells that transport and store information selectively. The feature dimension of each LSTM layer is set to 32, selected from a range of 16 to 256. The multiple LSTM layers enable the LSTM Net to generate hierarchical features layer by layer. Temporally, the feature vector and the cell state vector in each LSTM layer are transferred from the current period ($t$) to the next period ($t + 1$). After processing the whole time sequence, the LSTM Net outputs $h_t$, the hidden feature vector of the LSTM in period $t$. In this study, we define the vector of hidden features, $h_t$, as the reference for the effect of meteorological factors on corn yield. The calculation of $h_t$ follows the LSTM gate equations, where $t$ represents period $t$ of the corn growth, and $W_f$, $W_i$, $W_o$, $W_c$ and $b_f$, $b_i$, $b_o$, $b_c$ are the weight matrices and bias vectors of the forget gate, input gate, output gate and cell state update, respectively. $g_t^f$, $g_t^i$ and $g_t^o$ are the vectors generated by these gates. The three gate vectors contain values between zero and one and are used to filter information: zero removes the information completely and one retains it completely. $S_t$ is the updated LSTM cell state vector at period $t$, which stores information across the whole time sequence. $h_t$ is the hidden feature vector of the LSTM. $\sigma$ and tanh are activation functions.
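The gate equations themselves are not reproduced in this text. For reference, a standard LSTM formulation (Hochreiter and Schmidhuber 1997), written with the symbols defined above, would read as follows. The concatenation $[h_{t-1}, x_t]$, the element-wise product $\odot$ and the subscripts on $W$ and $b$ are assumptions, and the original equations may differ in detail.

```latex
\begin{aligned}
g_t^{f} &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
g_t^{i} &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
g_t^{o} &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
S_t &= g_t^{f} \odot S_{t-1} + g_t^{i} \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
h_t &= g_t^{o} \odot \tanh\left(S_t\right)
\end{aligned}
```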
The Attention Net generates an attention value $a_t$ from each $h_t$ to reflect the importance of the hidden feature vector to corn yield in the corresponding period. The Attention Net is designed as a single-layer fully connected neural network in this study. The final feature vector ($H$) captured by the AT-LSTM is the attention-weighted combination of the hidden feature vectors of the LSTM Net, where $a_t$ is the attention value, softmax is the activation function that constrains the attention values to sum to one, $W_{AT}$ is the weight matrix of the Attention Net, and $b_{AT}$ is the bias vector.
Because the attention values over the corn growth period sum to one, their distribution reflects the importance of the hidden features in each period. A higher attention value means that the hidden features in the corresponding period have a higher impact on the estimated yield. The attention distribution and its association with the corn growth process can provide insights into how the temporal module works when estimating yield.
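The attention pooling described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes that the Attention Net maps each $h_t$ to one scalar score, that softmax is taken over the periods, and that $H$ is the weighted sum of the $h_t$. The function name `attention_pool` and the toy numbers are hypothetical.

```python
import math

def attention_pool(hidden, w_at, b_at):
    """Softmax attention over periods (assumed form, not the paper's code).

    hidden: list of T hidden vectors h_t, each a list of floats.
    w_at, b_at: weights and bias of a single-layer scoring network that is
                assumed to produce one scalar score per period.
    Returns (a, H): attention values a_t and the pooled feature vector H.
    """
    # One scalar score per period: score_t = w_at . h_t + b_at
    scores = [sum(w * h for w, h in zip(w_at, h_t)) + b_at for h_t in hidden]
    # Numerically stable softmax over the time axis, so that sum(a) == 1
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    a = [e / total for e in exps]
    # H is the attention-weighted sum of the hidden vectors
    dim = len(hidden[0])
    H = [sum(a_t * h_t[k] for a_t, h_t in zip(a, hidden)) for k in range(dim)]
    return a, H

# Toy example: three periods, two hidden features (made-up numbers)
hidden = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.5]]
a, H = attention_pool(hidden, w_at=[1.0, 1.0], b_at=0.0)
```

Because the weights sum to one, plotting `a` against the week index gives the kind of attention distribution that the text interprets.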

Spatial learning module
MTL is applied to learn region-specific features in the relationships between corn yield and meteorological factors. Because the meteorological conditions (extreme heat in particular) vary across latitudes (figure S1 is available online at stacks.iop.org/ERL/15/034016/mmedia), the nine states are divided into three regions, and yield estimation in each region is defined as an individual task to adjust for the regional meteorological conditions. Across the three regions, the LSTM layers and the Attention Net layer of the temporal learning module are shared, while the final output layers of the spatial learning module are task-specific (or region-specific). The shared layers extract the common features across regions, and the task-specific output layers generate weights based on local characteristics for the final yield estimation in each region. The yield anomaly ($y$) is calculated by the output layer of the corresponding region (equation (8)), where $r$ is the index of the region, $W_r$ and $b_r$ are the weight matrix and bias vector of the region-specific output layer for region $r$, and $H$ is the vector of attention-weighted features captured by the AT-LSTM network.
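The region-specific output layers can be illustrated with the sketch below. It assumes a purely linear head, $y = W_r H + b_r$, which is consistent with the symbols defined above but is not confirmed by this text. The region names, the weights and the helper `region_output` are hypothetical.

```python
def region_output(H, heads, region):
    """Yield anomaly from the head of one region (assumed linear form)."""
    W_r, b_r = heads[region]
    return sum(w * h for w, h in zip(W_r, H)) + b_r

# One (weights, bias) pair per region; the shared AT-LSTM supplies H.
# Region names and numbers are illustrative only.
heads = {
    "north":   ([0.5, -0.2], 0.1),
    "central": ([0.4, -0.3], 0.0),
    "south":   ([0.3, -0.6], -0.1),
}

H = [1.0, 2.0]  # shared attention-weighted features for one county-year
y_north = region_output(H, heads, "north")
y_south = region_output(H, heads, "south")
```

The same shared feature vector yields different anomalies depending on the region, which is how the task-specific layers can encode local sensitivity to the meteorological inputs.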

Study area
This study focuses on county-level rainfed corn yield in nine states of the US Corn Belt from 1981 to 2016 (figure 2). The 743 counties studied accounted for 61% of total US corn production in 2016 (USDA-NASS 2017). Over the 36 years between 1981 and 2016, the annual county-level average yield in this area was 7.9 Mg ha−1. The high-yielding regions are concentrated in southern Minnesota, Iowa, and northern Illinois. All counties in this study show an increasing trend in corn yield, with an average trend of 0.12 Mg ha−1 yr−1.

Data and preprocessing
The county-level rainfed corn yield data from 1981 to 2016 are obtained from the USDA's National Agricultural Statistics Service (USDA-NASS 2017). To capture the impact of meteorological factors on corn yield, the yield anomaly (the yield after detrending) is calculated by linear regression (Lobell et al 2011). To ensure the reliability of the data in our analysis, we apply three criteria for data cleaning and processing:
(1) The county should have more than 12 years of yield statistics, which is one-third of the whole research period.
(2) The county should have an average corn harvest area larger than 476 ha, which is the 10th percentile of the county-level average corn harvest area in the whole sample set.
(3) The county should have a yield trend that passes the t-test, with a two-sided p-value less than 0.05.
There are 848 counties with 28,374 samples in the raw yield data. After data cleaning and processing, 743 counties (87.6%) with 26,015 samples (91.7%) remain in our analysis. The state-level corn phenology data is based on USDA-NASS (2017). To keep the periods consistent, the growing season is defined as the historical average length, 20 weeks, starting from the specific planting week recorded in each state.
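The detrending step and the three screening criteria can be sketched as follows. This is a minimal illustration that assumes an ordinary least-squares line fitted per county and takes the trend p-value as an input instead of computing the t-test. The function names are hypothetical; the thresholds are those stated above.

```python
def linear_trend(years, yields):
    """Ordinary least-squares slope and intercept of yield against year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, yields))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def yield_anomaly(years, yields):
    """Yield minus the fitted linear trend, one value per year."""
    slope, intercept = linear_trend(years, yields)
    return [y - (slope * x + intercept) for x, y in zip(years, yields)]

def keep_county(n_years, mean_harvest_area_ha, trend_p_value):
    """Apply the three screening criteria listed in the text."""
    return (n_years > 12
            and mean_harvest_area_ha > 476
            and trend_p_value < 0.05)

# Toy county whose yield rises by exactly 0.12 Mg/ha per year
years = list(range(1981, 1991))
yields = [5.0 + 0.12 * (y - 1981) for y in years]
slope, intercept = linear_trend(years, yields)
anomalies = yield_anomaly(years, yields)
```

For a perfectly linear series the anomalies are zero, so any non-zero anomaly in real data is the part of the yield that the trend does not explain.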
The daily county-level meteorological data is from the Applied Climate Information System Web Service (ACIS) (RCC-ACIS 2017). The meteorological data used in this study include maximum daily temperature, minimum daily temperature, and daily precipitation. From these data, three cumulative meteorological factors are calculated: cumulative PRCP as an index of water supply, GDD as an index of effective heat accumulation, and KDD as an index of extreme heat accumulation. The calculation of GDD and KDD is consistent with a previous study (Butler and Huybers 2015). During the corn growth period, the weekly meteorological factors are calculated and fed into the model as sequential vector inputs. To improve the model convergence speed, the inputs are standardized to zero mean and unit variance.
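The weekly inputs could be derived from the daily records roughly as sketched below. The text does not state the temperature thresholds or the exact formulas, so the bounds of 9 °C and 29 °C, the clipping of both daily temperatures and the population standard deviation used for standardization are all assumptions. The function names are hypothetical.

```python
import math

T_LOW = 9.0    # assumed lower temperature bound in deg C (not stated in the text)
T_HIGH = 29.0  # assumed upper temperature bound in deg C (not stated in the text)

def clip(value, low, high):
    return max(low, min(high, value))

def daily_gdd(tmax, tmin):
    """Growing degree days for one day, with both temperatures clipped."""
    return (clip(tmax, T_LOW, T_HIGH) + clip(tmin, T_LOW, T_HIGH)) / 2 - T_LOW

def daily_kdd(tmax):
    """Killing degree days for one day: heat above the upper bound."""
    return max(tmax - T_HIGH, 0.0)

def weekly_factors(tmax, tmin, prcp):
    """Sum daily values into [GDD, KDD, PRCP] vectors for complete 7-day weeks."""
    weeks = []
    for start in range(0, len(tmax) - 6, 7):
        days = range(start, start + 7)
        weeks.append([
            sum(daily_gdd(tmax[d], tmin[d]) for d in days),
            sum(daily_kdd(tmax[d]) for d in days),
            sum(prcp[d] for d in days),
        ])
    return weeks

def standardize(values):
    """Z-score standardization; returns zeros if the values are constant."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Two identical toy weeks: 31 deg C maxima, 19 deg C minima, 1 mm rain per day
weeks = weekly_factors([31.0] * 14, [19.0] * 14, [1.0] * 14)
z = standardize([1.0, 2.0, 3.0])
```

In practice each factor would be standardized with statistics taken from the training years only, so that the test years do not leak into the scaling.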

Baselines and performance evaluation
In this study, two approaches are chosen as baselines: LASSO regression and RF. For the LASSO regression model, the λ value is set to 0.0005 by 10-fold cross-validation on the training set. For RF, the number of trees is set to 2000, and 20 of the 60 input variables are randomly taken as candidates at each split. The maximum number of nodes for each tree is limited to 200. The glmnet and randomForest R packages are used to train LASSO and RF, respectively. The DCN is implemented in Python using the PyTorch library and run on a Linux workstation (Ubuntu 14.04 LTS) with 128 GB of RAM and an Nvidia GeForce GTX 1080 Ti graphics card with 11 GB of memory.
To evaluate the models' performance in yield estimation, all models are trained on the data from 1981 to 2014 and tested on the 2015 and 2016 data. The estimated yield anomaly is added to the yield trend to obtain the final estimated yield. The RMSE and R² between actual and estimated yield in the test set are used as indicators of the estimation accuracy of these models. To analyze the benefit of the spatial learning module, the temporal learning module is also applied on its own to estimate yield, as a comparison with DCN.
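The evaluation described above can be sketched as follows. The code assumes that R² is the coefficient of determination (the text does not say whether it is this or the squared correlation), and the record layout and function names are hypothetical.

```python
import math

def rmse(actual, estimated):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimated)) / n)

def r_squared(actual, estimated):
    """Coefficient of determination (assumed definition of R^2)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def split_by_year(records, last_train_year=2014):
    """Split (year, value) records into training and test sets by year."""
    train = [r for r in records if r[0] <= last_train_year]
    test = [r for r in records if r[0] > last_train_year]
    return train, test

def final_yield(trend_value, estimated_anomaly):
    """Add the estimated anomaly back to the linear trend value."""
    return trend_value + estimated_anomaly

# Toy check with made-up numbers
records = [(2013, 9.1), (2014, 9.4), (2015, 9.0), (2016, 9.8)]
train, test = split_by_year(records)
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Splitting by year rather than at random keeps the test years entirely unseen during training, which matches the 1981-2014 and 2015-2016 split described in the text.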
To evaluate the model performance in predicting the interannual variability of corn yield, the 36 years are randomly partitioned into six subsets (table S1) to conduct 6-fold cross-validation. In each fold, the DCN model is trained on five subsets and tested on the remaining subset. The RMSE and R² of each subset are used to evaluate the model stability.
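A year-wise 6-fold partition of this kind can be sketched as below. The actual assignment of years to subsets is given in table S1 and is not reproduced here, so the seed and the resulting folds are purely illustrative. The function names are hypothetical.

```python
import random

def year_folds(first_year=1981, last_year=2016, k=6, seed=0):
    """Randomly partition the years into k equally sized subsets."""
    years = list(range(first_year, last_year + 1))
    rng = random.Random(seed)  # illustrative seed, not from the paper
    rng.shuffle(years)
    return [sorted(years[i::k]) for i in range(k)]

def train_test_years(folds, test_index):
    """Return (training years, test years) for one cross-validation fold."""
    test = folds[test_index]
    train = sorted(y for i, fold in enumerate(folds) if i != test_index for y in fold)
    return train, test

folds = year_folds()
train_years, test_years = train_test_years(folds, test_index=0)
```

Holding out whole years, rather than individual county samples, tests whether the model can estimate yield under the weather of a year it has never seen.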

Estimation accuracy of DCN at county-level
At the county level, the DCN model achieves an RMSE of 0.83 Mg ha−1 in 2015 and 0.81 Mg ha−1 in 2016 (figure 3). The error maps indicate that DCN performs well in most counties, with an average absolute error lower than 0.8 Mg ha−1 (figure 3). In Missouri, however, DCN estimates less accurately, with an average absolute error of 0.88 Mg ha−1. This result can be explained by the high rate of missing data (35%) for this state in the test set. A similar pattern, in which the number of county samples affects model performance, has been reported in previous studies (Li et al 2019).
Although there are a few extreme estimation errors, the overall performance of DCN is satisfactory, which indicates that DCN captures the yield variability at the county level well.

Performance comparison among DCN, LASSO, and RF
DCN provides the best estimation results, with an RMSE of 0.82 Mg ha−1, compared with LASSO and RF (figure 4). LASSO provides the lowest estimation accuracy, with an RMSE of 1.14 Mg ha−1, in agreement with the finding of a previous study that linear methods are limited in simulating the relationship between crop yield and meteorological factors relative to nonlinear approaches (Cai et al 2018). RF provides a better result than LASSO, with an RMSE of 1.05 Mg ha−1. Compared with LASSO and RF, DCN reduces the estimation RMSE by 0.23-0.32 Mg ha−1, which demonstrates the effectiveness of the spatial and temporal learning process of the neural-network-based structure.
The results show a spatial pattern in which all models perform better in northern states than in southern states (figure 4). The estimation RMSEs are correlated with the state-level variances of the yield anomaly, with a Pearson's correlation coefficient of 0.79 (figure S2). Similar results have been reported in a previous study (Li et al 2019). The higher variability of the yield anomaly in southern states indicates that corn yield is more affected by meteorological stress in the south. For example, the historical meteorological data shows that the average cumulative KDD during the corn growth period in southern states (165 degree days) is 2.2 times that in northern states (76 degree days) over the past three decades. Therefore, it appears that the spatial distribution of meteorological stress is one of the driving factors for this spatial pattern.
The DCN model shows higher estimation accuracy than LASSO and RF in most states, particularly in the southern states. In states where corn yield is more stable, such as Michigan and Iowa, the models perform similarly and DCN may not be the best approach. In the southern states, where corn yield varies more because of higher heat stress, DCN performs best and shows large improvements relative to LASSO and RF, lowering the RMSE in these states by 13.1%-47.1%. The stable high performance of DCN indicates that it is a practical approach for yield estimation in regions with more complex and stressful corn growth conditions.

Temporal pattern learning module by AT-LSTM
The AT-LSTM captures a general temporal pattern across all states: DCN assigns higher attention weights to the features in the reproductive stage than to those in the vegetative stage (figure 5). The attention values represent the relative importance of the cumulative information captured by the LSTM Net from the previous input data when estimating corn yields. The consistent pattern of attention values implies that the model can capture the general temporal information during corn growth across geographical areas, which is essential for the model's generalization to different regions. The weekly average attention value remains at about 0.02 from week 1 to week 11 (vegetative stage), starts increasing at week 12 (silking stage), and reaches its highest value of 0.16 at week 20 (maturity stage). This result can be explained by the cumulative effect across time. From planting to maturity, the effective information accumulates and drives the model to generate increased weights at later stages. The temporal distribution of attention values shows that the DCN identifies the silking stage as the starting point for increased attention, and cumulative information becomes increasingly important to the yield estimation from the silking stage onward. This indicates that DCN can recognize the key growth phases during the corn growing season through the temporal learning module. Previous studies have demonstrated that corn yield is highly correlated with the rainfall before silking and the effective accumulated temperature after silking (Lu et al 2017). The variability of attention values starts increasing from week 15. The increasing variability during the later stage of corn growth possibly indicates that accumulated meteorological information has an increased impact on corn yield, but that this impact may vary among counties and years even within the same state.
The results illustrate that the temporal learning module provides a promising approach to capturing the temporal pattern that relates to the crop physiological process.
The temporal learning module (AT-LSTM) alone already achieves better estimation accuracy than LASSO and RF, with an RMSE of 0.89 Mg ha−1 and an R² of 0.72 (figure S3). This demonstrates the value of temporal feature learning in crop yield estimation. The results show relatively high estimation errors in Illinois, Indiana, and Ohio (figure 6). These overestimates may result from spatial variances in these states that are not captured by the temporal learning module.

Spatial pattern learning module by MTL
The spatial learning module improves the estimation performance relative to the temporal learning module alone: the overall RMSE in the test set is reduced by 0.07 Mg ha−1 (7.9%). All regions reach better estimation accuracy in both 2015 and 2016, with improvements ranging from 1.7% to 11.2% as the regional RMSEs are reduced by 0.02 to 0.10 Mg ha−1 (figure 7). The largest improvement is in the central region, where the RMSE is reduced by 9.1% in 2015 and 11.2% in 2016. These improvements demonstrate that incorporating the spatial learning module can improve the model performance in regions with various meteorological conditions.
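The overall relative reduction quoted above can be checked against the two test-set RMSE values reported earlier, 0.89 Mg ha−1 for the temporal learning module alone and 0.82 Mg ha−1 for the full DCN:

```latex
\frac{0.89 - 0.82}{0.89} = \frac{0.07}{0.89} \approx 0.079 = 7.9\%
```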
The spatial learning module brings improvements by reducing the relatively high estimation errors produced by the temporal learning module. The absolute estimation errors can be reduced by up to 1. These overestimations can be explained by a localized pattern: these counties suffered excessive rainfall during the 2015 corn growing season, with precipitation (669 mm) 1.5 times the historical average (460 mm). These localized improvements indicate that the spatial learning module could possibly capture local specific patterns (e.g. excessive rainfall, drought, extreme high temperature) to improve yield estimation.

Model robustness performance for interannual yield variability
The DCN model outperforms RF and LASSO at estimating interannual variability in corn yields, with the lowest RMSE and highest R² in every subset of the 6-fold cross-validation (table 1). Among these subsets, the RMSE of DCN ranges from 0.84 to 1.18 Mg ha−1 and the R² ranges from 0.71 to 0.86. The DCN model shows the highest performance stability, with the lowest standard deviation of RMSE (0.13 Mg ha−1) compared with RF (0.23 Mg ha−1) and LASSO (0.16 Mg ha−1). The DCN model shows its largest performance improvement over RF and LASSO in subset 5, which includes the year 2012. The US Corn Belt suffered a severe drought in 2012, with an average yield of 6.8 Mg ha−1. These results indicate that DCN produces robust corn yield estimation across a wide range of environmental conditions.

Discussion
The DCN model provides improved yield estimation compared to the existing approaches using both meteorological data and remote sensing data  (2016) proposed a machine learning approach using satellite images and climate data to estimate corn yield in Iowa with an RMSE of 0.76 Mg ha−1. It should be noted that different models vary in their spatial and temporal scopes and in the dimensions of their inputs. The DCN model in this study achieves satisfactory yield estimation accuracy, with an overall RMSE of 0.82 Mg ha−1 and state-level RMSEs ranging from 0.61 to 1.08 Mg ha−1 (figure 4). It is noteworthy that only three meteorological factors are used as inputs to the DCN model in this study. Additional data, such as vapor pressure deficit and vegetation indices derived from remote sensing reflectance, can be added to the model to potentially improve its performance.
The DCN model provides interpretable information, which can offer new insights into agricultural production through the combination of big data and deep learning. Low interpretability makes a deep learning model a 'black box', and improving it is a research focus (Montavon et al 2018). In this study, the attention mechanism is applied to visualize the temporal patterns between time-series meteorological information and corn yields. The temporal patterns illustrate that DCN is able to recognize key growth phases, such as the silking and kernel filling stages, which are essential to corn yield as reported in related studies (Edreira et al 2011, Sánchez et al 2014). This information helps us understand how a deep-learning-based crop yield model works, which in turn can improve its application.
Many possibilities remain to further refine the spatial and temporal learning framework, such as adjusting the temporal and spatial resolution, spatial clustering and model fusion, as follows:
(1) The temporal and spatial resolutions of DCN are scalable. The temporal resolution can be defined from daily to monthly, or by crop growth phase. The DCN can also be applied at different spatial scales to estimate crop yield. The association between the performance of DCN and its spatial and temporal scales remains to be further studied.
(2) An improved spatial clustering approach can be developed to replace the regional identification based on the state-level heat distribution used in this study (figure S1). For example, unsupervised approaches can be used to achieve a better clustering of spatial regions with more input dimensions, such as soil and solar radiation.
(3) Combining DCN with statistical and process-based crop models can possibly improve its performance. The knowledge obtained from statistical analysis and crop model definitions allows the calculation process of DCN to be refined, for example by improving the parameterization.

Conclusion
A deep spatial-temporal learning framework named DCN has been developed for corn yield estimation. The results highlight the potential of DCN in crop yield estimation because of its ability to capture the general temporal pattern and region-specific spatial features. The temporal learning module shows a consistent pattern in all states: DCN pays more attention to the reproductive stage than to the vegetative stage because of the cumulative effect. This result indicates that DCN can recognize the key growth phases during the corn growing season through the temporal learning module. The comparison between the temporal learning module alone and the full DCN shows that adding the spatial learning module improves corn yield estimation in various regions. The result implies that it is advantageous to consider spatial variances when estimating meteorological impacts on corn yield. The study demonstrates that DCN is a promising approach for crop yield estimation and identifies potential for improvement. Although the model proposed here is for rainfed corn in the US Corn Belt, it can possibly be applied to other crops and other regions.