Exploring the potential role of environmental and multi-source satellite data in crop yield prediction across Northeast China
Introduction
Understanding the spatiotemporal patterns of crop yield and accurately predicting those patterns is a challenging issue and a key research area in agricultural studies (Franz et al., 2020; Li et al., 2019b). Such estimates can facilitate better assessments of yield response to environmental stresses, help explain the gaps between actual and potential yields, and thus provide better information for farm resource management (Guan et al., 2017; Ma et al., 2021). Moreover, crop yield information at the regional and national scales can inform food security assessments and agricultural commodity markets and guide policy decision-making (Hoffman et al., 2015; Sherrick et al., 2014).
In agriculture, crop yield is strongly influenced by many variables, including environment (e.g. climate, soil properties), genetics, and management (Mathieu and Aires, 2018). All of these factors generally need to be considered when monitoring and forecasting crop yield with statistical or physical simulation models. Climate data and soil properties describe the environmental conditions that constrain crop growth, and they are extensively used in crop prediction systems. However, crop growth status is affected not only by abiotic factors but also by biotic factors (Cai et al., 2019; Lichtenthaler, 1996; Mahlein et al., 2012). Thus, environmental data alone may not be sufficient.
Satellite remote sensing (RS) data have been widely used for crop yield estimation across a wide range of scales and geographic locations (Guan et al., 2017; Sakamoto et al., 2013; Sibley et al., 2014). Previous studies have also shown that using satellite data, alone or combined with environmental information, gives better yield estimates than using climate data only (Cai et al., 2019; Li et al., 2019b). In particular, the Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI), derived from visible and near-infrared (NIR) satellite data, provide a general indicator of photosynthetic canopy cover or aboveground biomass. They have been the most commonly used predictors because of their long records and recognized value in monitoring crop conditions. However, these vegetation indices (VIs) use information from only a small portion of the electromagnetic spectrum within optical wavelengths, so the crop information they provide is also limited. In fact, current earth observation satellites can capture crop growing conditions across a diversity of spectral ranges (Guan et al., 2017), including visible, infrared, thermal, and microwave. Observations from these platforms provide unique information that can describe crop growth conditions under both biotic and abiotic stresses. Solar-induced chlorophyll fluorescence (SIF), derived from a specific narrow range of the near-infrared band, has emerged as a proxy of plant photosynthesis (Guanter et al., 2014; Porcar-Castell et al., 2014). Other VIs, such as the normalized difference phenology index (NDPI) and land surface water index (LSWI), use broader spectral wavelengths from visible to shortwave infrared (SWIR) and have been found to perform better in detecting crop biomass and canopy water content (Dong et al., 2015; Xu et al., 2021).
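As a point of reference, the band arithmetic behind three of the indices named above can be sketched as follows. The formulas are the commonly used definitions (the EVI coefficients are the standard MODIS values); the text above does not state which formulation was applied, so the function names, default coefficients, and reflectance values here are illustrative assumptions.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from surface reflectance."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, soil=1.0):
    """Enhanced Vegetation Index; defaults are the standard MODIS coefficients."""
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + soil)


def lswi(nir, swir):
    """Land Surface Water Index: NIR contrasted with shortwave infrared."""
    return (nir - swir) / (nir + swir)


# Hypothetical reflectances for a dense mid-season crop canopy (assumed values).
pixel = {"blue": 0.03, "red": 0.05, "nir": 0.45, "swir": 0.20}

indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "EVI": evi(pixel["nir"], pixel["red"], pixel["blue"]),
    "LSWI": lswi(pixel["nir"], pixel["swir"]),
}
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

NDVI and EVI both contrast NIR with red reflectance, whereas LSWI replaces red with SWIR, which is why it responds to canopy water content rather than greenness alone.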
Thermal RS data, a direct measurement of land-surface temperature (LST), can capture the impact of heat stress and drought on yield variations (Johnson, 2014; Khanal et al., 2017) and have also been considered an alternative to air temperature in data-limited regions (Heft-Neal et al., 2017). Microwave data with longer wavelength bands, either passive or active, are commonly expressed as vegetation optical depth (VOD), retrieved using microwave radiative transfer models (Jackson and Schmugge, 1991; Vreugdenhil et al., 2016). This indicator provides frequency-dependent information related to crop canopy density, biomass, and vegetation water content (Liu et al., 2015; Momen et al., 2017). VOD estimates from long wavelengths (e.g. C-, L-, or P-band) are generally more sensitive to deeper vegetation layers, while VOD estimates from short wavelengths (e.g. Ku-, X-band) are more sensitive to leaf moisture content (Chaparro et al., 2018; Konings et al., 2019; Tian et al., 2018). Satellite data from a single platform or from combinations of several platforms have been used extensively in crop monitoring; however, crop yield estimation using the whole set of available spectral bands has been comparatively less studied (Guan et al., 2017).
Generally, two types of yield prediction methods have been widely used: physical simulation models and statistical models. Physically based crop models estimate yield by dynamically simulating crop growth and yield formation processes (Jeong et al., 2022; Jones et al., 2017; Rosenzweig et al., 2014). Although powerful, these models require extensive local, crop-specific biotic and abiotic inputs, which limits their applicability in large-scale yield modeling (Kang and Özdoğan, 2019; Lecerf et al., 2019). Statistical models are widely used in operational large-scale crop yield forecasting systems because of their simplicity, smaller input requirements, and relatively high predictive power when sufficient training data are available (Chipanshi et al., 2015; Johnson, 2014; Li et al., 2019b). Among them, statistical machine learning (ML) models use more complex functions and can handle complicated relationships between the predictors and the target variable (Johnson et al., 2016; Ma et al., 2021); these approaches have therefore been increasingly employed in agricultural research in recent years (Cao et al., 2021; Schauberger et al., 2020).
Global environmental data and remote sensing products across a diverse spectral range are now publicly available, and each of them can provide unique information and offer new opportunities for agricultural monitoring. Several questions regarding the integration and use of these data remain: (1) How much information can environmental and satellite data provide for crop yield prediction, and which combinations of these data achieve the best performance? (2) How well do the various satellite data predict crop yield, and how should they be combined for crop yield prediction in northeast China? (3) How does within-season crop yield forecasting perform as the growing season progresses and more data become available? In this study, we used environmental data (climate, soil, geography, and topography) and the diverse set of satellite data introduced above to build two linear regression models (partial least-squares regression (PLSR) and least absolute shrinkage and selection operator (LASSO)), three machine learning models (stochastic gradient boosting (SGB), support vector regression (SVR), and random forest (RF)), and one ML ensemble model for yield prediction of three major crops (maize, rice, and soybean) across northeast China. End-of-season yield modeling with different methods and different combinations of inputs was conducted to quantify the contributions of the environmental and satellite data in determining crop yield. Within-season yield forecasting was conducted to analyze how model performance changes as crop growth progresses and more input information becomes available. The results of this study will facilitate the synergistic use and development of new-generation satellite RS products and provide valuable information for the development of crop yield forecasting systems.
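To make the idea of an ML ensemble concrete, the sketch below averages the predictions of several base models and scores the result with RMSE and R². How the ensemble in this study was actually built is not described in the text above, so simple unweighted averaging is an assumption, and every number is synthetic and chosen only for illustration.

```python
import math


def ensemble_mean(predictions):
    """Average per-sample predictions from several base models (unweighted)."""
    n_models = len(predictions)
    return [sum(values) / n_models for values in zip(*predictions)]


def rmse(observed, predicted):
    """Root mean square error between observed and predicted yields."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination (1 - SS_res / SS_tot)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Synthetic county-level yields (t/ha) and hypothetical base-model predictions.
observed = [6.0, 7.0, 8.0, 9.0]
base_predictions = {
    "SGB": [6.5, 7.5, 8.5, 9.5],
    "SVR": [5.5, 6.5, 7.5, 8.5],
    "RF": [6.0, 7.2, 7.8, 9.0],
}

ensemble = ensemble_mean(list(base_predictions.values()))
for name, pred in {**base_predictions, "Ensemble": ensemble}.items():
    print(f"{name}: RMSE={rmse(observed, pred):.3f}, R2={r_squared(observed, pred):.3f}")
```

In this toy case the base models err in opposite directions, so averaging cancels part of the error; real gains depend on how correlated the base-model errors are.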
Study region
Our study was conducted in Northeast China, which includes Heilongjiang, Jilin, and Liaoning provinces and covers a total area of about 0.79 million km² (Fig. 1). Northeast China is the leading grain production region in China, with a crop planting area of 0.26 million km², which accounts for more than 15% of the total crop planting area in China; about one-fifth of the national grain is produced here. The major crops are maize, soybean, and rice, the sum of the
Exploratory data analysis
Before model construction, an exploratory data analysis was conducted to reduce input dimensionality and select appropriate inputs. For this purpose, a simple correlation analysis was first performed between each variable and crop yield. Variables with an insignificant correlation (P > 0.01) with crop yield were discarded to avoid bias and exclude the impractical variables. For the remaining variables that have significant correlation with yield variations, the correlation coefficients among
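The first screening step (discarding variables whose correlation with yield has P > 0.01) can be sketched as below. The text does not say which significance test was used; this sketch assumes a Pearson correlation with a two-sided p-value from the Fisher z-transform, which is a normal approximation. The variable names and data are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (assumed test, needs n > 3)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def screen_variables(features, yields, alpha=0.01):
    """Keep variables whose correlation with yield has p <= alpha."""
    kept = {}
    for name, values in features.items():
        r = pearson_r(values, yields)
        if approx_p_value(r, len(yields)) <= alpha:
            kept[name] = r
    return kept


# Hypothetical yields (t/ha) and two candidate predictors.
yields = [5.1, 5.8, 6.2, 6.9, 7.4, 7.9, 8.3, 8.8, 9.1, 9.6]
features = {
    "evi_aug": [0.41, 0.45, 0.47, 0.52, 0.55, 0.58, 0.60, 0.64, 0.66, 0.70],
    "unrelated": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
}

kept = screen_variables(features, yields)
print(sorted(kept))
```

With a small sample, an exact t-test would be slightly more conservative than this approximation; the threshold `alpha=0.01` mirrors the cutoff quoted in the text.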
Correlation of crop yield with climate and remote sensing variables
Before the exploratory data analysis, the seasonal cycles of all the input covariates were examined (Fig. 2). Generally, all the covariates showed a similar pattern, with a mid-summer peak during the crop growing season (July–August), but differed in onset and peak timing. EVI, NDPI, LSWI, and VOD reached their peak in August. EVI, NDPI, and LSWI demonstrated similar seasonal cycles, whereas the seasonal cycle of VOD lagged behind the other covariates at both the beginning and the end of the growing season.
Combining environmental and satellite data achieves best yield prediction
Our results showed that the combination of environmental (climate, soil, geography, and topography) and satellite data achieved the best performance for predicting crop yield in northeast China. The combined use of satellite and climate data outperformed each individual input, indicating that all three types of data provided unique information to the final crop yield prediction. When individual data sources were used, satellite data achieved better predictive performance than
Conclusions
In this study, we investigated the relative performances of environmental and multi-source satellite data for yield forecasting of three major crops in northeast China. Two linear regression approaches, three ML approaches, and one ML ensemble method were used to build yield prediction models with different combinations of input variables. Our study showed that, overall, the ensemble model outperformed the regression and ML models, and better performance is achieved with more input data,
CRediT authorship contribution statement
Zhenwang Li: Conceptualization, Methodology, Writing – original draft. Lei Ding: Data curation, Writing – original draft. Dawei Xu: Data curation, Resources.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the National Key Research and Development Program of China (No. 2018YFE0107000) and the Key Research and Development Program of Shandong Province (No. 2019JZZY010713).
References (74)
- et al. (2017). Towards fine resolution global maps of crop yields: testing multiple methods and satellites in three countries. Remote Sens. Environ.
- et al. (2019). Integrating satellite and climate data to predict wheat yield in Australia using machine learning approaches. Agric. For. Meteorol.
- et al. (2021). Integrating multi-source data for rice yield prediction across China using machine learning and deep learning approaches. Agric. For. Meteorol.
- et al. (2018). L-band vegetation optical depth seasonal metrics for crop yield assessment. Remote Sens. Environ.
- et al. (2015). Evaluation of the integrated Canadian crop yield forecaster (ICCYF) model for in-season prediction of crop yield across the Canadian agricultural landscape. Agric. For. Meteorol.
- et al. (2015). Comparison of four EVI-based models for estimating gross primary production of maize and soybean croplands and tallgrass prairie under severe drought. Remote Sens. Environ.
- et al. (2021). Predicting the invasive trend of exotic plants in China based on the ensemble model under climate change: a case for three invasive plants of Asteraceae. Sci. Total Environ.
- et al. (2020). The role of topography, soil, and remotely sensed vegetation condition towards predicting crop yield. Field Crop Res.
- et al. (2017). Toward mapping crop progress at field scales through fusion of Landsat and MODIS imagery. Remote Sens. Environ.
- et al. (2016). Comparison of SMOS and AMSR-E vegetation optical depth to four MODIS-based vegetation indices. Remote Sens. Environ.