Exploring the potential role of environmental and multi-source satellite data in crop yield prediction across Northeast China

https://doi.org/10.1016/j.scitotenv.2021.152880

Highlights

  • Stacked ensemble model outperformed single machine learning models.

  • Environmental and satellite data are complementary and useful to estimate crop yield.

  • SIF, LST and VOD provided extra information beyond EVI for crop yield prediction.

  • Crop yield can be satisfactorily forecasted at two to three months before harvest.

  • Geography, DEM, VOD, EVI, soil hydraulic and nutrient properties are important predictors.

Abstract

Developing an accurate large-scale crop yield prediction system is of paramount importance for agricultural resource management and global food security. Earth observation provides a unique source of information for monitoring crops across a diversity of spectral ranges. However, the integrated use of these data and their value for crop yield prediction remain understudied. Here we combined environmental data (climate, soil, geography, and topography) with multiple satellite data (optical-based vegetation indices, solar-induced fluorescence (SIF), land surface temperature (LST), and microwave vegetation optical depth (VOD)) in a single framework to estimate crop yield for maize, rice, and soybean in northeast China, and assessed their unique value and relative influence on yield prediction. Two linear regression methods, three machine learning (ML) methods, and one ML ensemble model were adopted to build yield prediction models. Results showed that the individual ML methods outperformed the linear regression methods, and that the ML ensemble model further improved on the single ML models. Moreover, models with more inputs achieved better performance: the combination of satellite data with environmental data, which explained 72%, 69%, and 57% of maize, rice, and soybean yield variability, respectively, outperformed models built on individual inputs. While satellite data contributed to crop yield prediction mainly from the early to the peak growing season, climate data offered extra information mainly from the peak to the late season. We also found that the combined use of EVI, LST and SIF improved model accuracy compared to the benchmark EVI model. However, the optical-based vegetation indices shared similar information and did not provide much extra information beyond EVI. Within-season yield forecasting showed that crop yields can be satisfactorily forecasted two to three months prior to harvest. Geography, topography, VOD, EVI, and soil hydraulic and nutrient parameters were the more important predictors of crop yield.

Introduction

Understanding the spatiotemporal patterns of crop yield and accurately predicting those patterns are challenging issues and a key research area in agricultural studies (Franz et al., 2020; Li et al., 2019b). Such estimates can facilitate better assessments of yield response to environmental stresses, help to better understand the gaps between actual and potential yields, and thus provide better information for farm resource management (Guan et al., 2017; Ma et al., 2021). Moreover, information about crop yield at the regional and national scales can inform food security assessments and agricultural commodity markets and guide policy decision-making (Hoffman et al., 2015; Sherrick et al., 2014).

In agriculture, crop yield is strongly influenced by many variables, including environment (e.g. climate, soil properties), genetics, and management (Mathieu and Aires, 2018); all of these factors generally need to be considered when monitoring and forecasting crop yield with statistical or physical simulation models. Climate data and soil properties describe the environmental conditions that constrain crop growth, and they are extensively used in crop prediction systems. However, crop growth status is affected not only by abiotic factors but also by biotic factors (Cai et al., 2019; Lichtenthaler, 1996; Mahlein et al., 2012). Thus, environmental data alone may not be sufficient.

Satellite remote sensing (RS) data have been widely used for crop yield estimation across a wide range of scales and geographic locations (Guan et al., 2017; Sakamoto et al., 2013; Sibley et al., 2014). Previous studies have also shown that using satellite data, or combining satellite data with environmental information, gives better yield estimates than using climate data only (Cai et al., 2019; Li et al., 2019b). In particular, the Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI), derived from visible and near-infrared (NIR) satellite data, provide a general indicator of photosynthetic canopy cover or aboveground biomass and have been the most commonly used predictors because of their long records and recognized value in monitoring crop conditions. However, these vegetation indices (VIs) use only a small portion of the electromagnetic spectrum within optical wavelengths, so the crop information they provide is limited. In fact, current earth observation satellites can capture crop growing conditions across a diversity of spectral ranges (Guan et al., 2017), including visible, infrared, thermal, and microwave; observations from these platforms provide unique information that can describe crop growth conditions under both biotic and abiotic stresses. Solar-induced chlorophyll fluorescence (SIF), derived from a specific narrow range of the near-infrared band, has emerged as a proxy of plant photosynthesis (Guanter et al., 2014; Porcar-Castell et al., 2014). Other VIs, such as the normalized difference phenology index (NDPI) and the land surface water index (LSWI), use broader spectral wavelengths from visible to shortwave infrared (SWIR) and have been found to perform better in detecting crop biomass and canopy water content (Dong et al., 2015; Xu et al., 2021). Thermal RS data, which directly measure land surface temperature (LST), can capture heat stress and drought impacts on yield variations (Johnson, 2014; Khanal et al., 2017) and have also been considered an alternative to air temperature in data-limited regions (Heft-Neal et al., 2017). Microwave data with longer wavelength bands, whether passive or active, are commonly expressed as vegetation optical depth (VOD) retrieved with microwave radiative transfer models (Jackson and Schmugge, 1991; Vreugdenhil et al., 2016); this indicator provides frequency-dependent information related to crop canopy density, biomass, and vegetation water content (Liu et al., 2015; Momen et al., 2017). VOD estimates from long wavelengths (e.g. C-, L- or P-band) are generally more sensitive to deeper vegetation layers, while VOD estimates from short wavelengths (e.g. Ku-, X-band) are more sensitive to leaf moisture content (Chaparro et al., 2018; Konings et al., 2019; Tian et al., 2018). Satellite data from a single platform or combinations of several platforms have been used extensively in crop monitoring; however, crop yield estimation using the whole set of available spectral bands has been comparatively less studied (Guan et al., 2017).
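
As a rough illustration of how the optical indices named above are computed from band reflectances, a minimal Python sketch follows. It assumes the commonly published forms of NDVI, EVI (with the MODIS coefficients of Huete et al., 2002), LSWI, and NDPI (with the 0.74/0.26 red-SWIR weighting of Wang et al., 2017); these may differ from the exact satellite products used in this study. The function names and the sample reflectance values are hypothetical.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue):
    """Enhanced Vegetation Index, assuming the standard MODIS coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def lswi(nir, swir):
    """Land Surface Water Index from NIR and shortwave-infrared reflectance."""
    return (nir - swir) / (nir + swir)


def ndpi(nir, red, swir, alpha=0.74):
    """Normalized Difference Phenology Index.

    The red band is replaced by a weighted red-SWIR mix; alpha = 0.74 is
    the commonly cited weighting and is an assumption here.
    """
    mix = alpha * red + (1.0 - alpha) * swir
    return (nir - mix) / (nir + mix)


# Hypothetical surface reflectances for a dense mid-season crop canopy.
sample = {"nir": 0.45, "red": 0.05, "blue": 0.03, "swir": 0.20}
indices = {
    "NDVI": ndvi(sample["nir"], sample["red"]),
    "EVI": evi(sample["nir"], sample["red"], sample["blue"]),
    "LSWI": lswi(sample["nir"], sample["swir"]),
    "NDPI": ndpi(sample["nir"], sample["red"], sample["swir"]),
}
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

NDVI and EVI share the same NIR-red contrast, while LSWI and NDPI bring in the SWIR band, which is one way to see why the former pair may carry largely overlapping information.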

Generally, two kinds of yield prediction methods have been widely used: physical simulation models and statistical models. Physically based crop models estimate yield by dynamically simulating crop growth and yield formation processes (Jeong et al., 2022; Jones et al., 2017; Rosenzweig et al., 2014). Although powerful, these models require extensive local, crop-specific biotic and abiotic inputs, limiting their applicability in large-scale yield modeling (Kang and Özdoğan, 2019; Lecerf et al., 2019). Statistical models are widely used in operational large-scale crop yield forecasting systems due to their simplicity, fewer required inputs, and relatively high predictive power when sufficient training data are available (Chipanshi et al., 2015; Johnson, 2014; Li et al., 2019b). Comparatively, statistical machine learning (ML) models use more complex functions and can handle complicated relationships between the predictors and the target variable (Johnson et al., 2016; Ma et al., 2021); these approaches have therefore been increasingly employed in agricultural research in recent years (Cao et al., 2021; Schauberger et al., 2020).

Global environmental data and various remote sensing products across a diverse spectral range are now publicly available; each can provide unique information and offer new opportunities for agricultural monitoring. Several questions regarding the integration and use of these data remain: (1) How much information can environmental and satellite data provide for crop yield prediction, and what combinations of these data achieve the best performance? (2) How do the various satellite data perform in predicting crop yield, and how should they be combined for crop yield prediction in northeast China? (3) How does within-season crop yield forecasting perform as the growing season progresses and more data become available? In this study, we used environmental data (climate, soil, geography, and topography) and the diverse set of satellite data introduced above to build two linear regression models (partial least-square regression (PLSR) and least absolute shrinkage and selection operator (LASSO)), three machine learning models (stochastic gradient boosting (SGB), support vector regression (SVR), and random forest (RF)), and one ML ensemble model for yield prediction of three major crops (maize, rice, and soybean) across northeast China. End-of-season yield modeling using different methods and different combinations of inputs was conducted to quantify the contributions of the environmental and satellite data in determining crop yield, and within-season yield forecasting was conducted to analyze how model performance changes as crop growth progresses and more input information becomes available. The results of this study will facilitate the synergistic use and development of new generation satellite RS products, and provide valuable information for crop yield forecasting system development.
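
To make the idea of an ML ensemble built on top of single models more concrete, here is a minimal pure-Python sketch of stacking. It is not the configuration used in this study: two toy base learners (a one-feature least-squares line and a one-feature k-nearest-neighbour mean) stand in for the SGB, SVR, and RF models, the meta-learner is a plain least-squares weighting, and the data are synthetic. All function names and predictors are hypothetical.

```python
def fit_ols(xs, ys):
    """Least-squares line on one feature; returns a predict(x) function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + b * (x - mx)


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour mean on one feature; returns a predict(x) function."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def fit_base_learners(rows, ys):
    """Toy base learners: OLS on feature 0, k-NN on feature 1."""
    f0 = fit_ols([r[0] for r in rows], ys)
    f1 = fit_knn([r[1] for r in rows], ys)
    return [lambda r: f0(r[0]), lambda r: f1(r[1])]


def fit_stack(rows, ys, n_folds=5):
    """Fit a two-learner stack; returns a predict(row) function."""
    n = len(rows)
    oof = [[0.0, 0.0] for _ in range(n)]  # out-of-fold base predictions
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held = [i for i in range(n) if i % n_folds == fold]
        learners = fit_base_learners(
            [rows[i] for i in train], [ys[i] for i in train]
        )
        for i in held:
            oof[i] = [f(rows[i]) for f in learners]
    # Meta-learner: least-squares weights on the out-of-fold predictions
    # (2x2 normal equations, no intercept).
    a11 = sum(p[0] * p[0] for p in oof)
    a12 = sum(p[0] * p[1] for p in oof)
    a22 = sum(p[1] * p[1] for p in oof)
    b1 = sum(p[0] * y for p, y in zip(oof, ys))
    b2 = sum(p[1] * y for p, y in zip(oof, ys))
    det = a11 * a22 - a12 * a12
    w0 = (b1 * a22 - b2 * a12) / det
    w1 = (a11 * b2 - a12 * b1) / det
    final = fit_base_learners(rows, ys)  # refit base learners on all data
    return lambda r: w0 * final[0](r) + w1 * final[1](r)


# Synthetic illustration only: 30 made-up samples with two made-up predictors.
rows = [(i / 30, ((7 * i) % 30) / 30) for i in range(30)]
yields = [5.0 + 3.0 * x0 + 1.5 * x1 for x0, x1 in rows]
model = fit_stack(rows, yields)
predictions = [model(r) for r in rows]
```

The key design point is that the meta-learner is trained on out-of-fold predictions, so it weights each base learner by how well it generalizes rather than by how well it fits the training data.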

Section snippets

Study region

Our study was conducted in Northeast China, which comprises Heilongjiang, Jilin, and Liaoning provinces and has a total area of about 0.79 million km² (Fig. 1). Northeast China is the leading grain production region in China, with a crop planting area of 0.26 million km², which accounts for more than 15% of the total crop planting area in China; about one-fifth of the national grain is produced here. The major crops are maize, soybean, and rice, the sum of the

Exploratory data analysis

Before model construction, an exploratory data analysis was conducted to reduce input dimensionality and select appropriate inputs. For this purpose, a simple correlation analysis was first performed between each variable and crop yield. Variables with an insignificant correlation (P > 0.01) with crop yield were discarded to avoid bias and exclude the impractical variables. For the remaining variables that were significantly correlated with yield variations, the correlation coefficients among

Correlation of crop yield with climate and remote sensing variables

Before the exploratory data analysis, the seasonal cycles of all the input covariates were examined (Fig. 2). Generally, all the covariates showed a similar pattern, with a mid-summer peak during the crop growing season (July–August), but differed in onset and peak timing. EVI, NDPI, LSWI, and VOD reached their peak in August. While EVI, NDPI, and LSWI demonstrated similar seasonal cycles, the seasonal cycle of VOD lagged behind the other covariates at both the beginning and the end of the growing season.

Combining environmental and satellite data achieves best yield prediction

Our results showed that the combination of environmental (climate, soil, geography, and topography) and satellite data achieved the best performance for predicting crop yield in northeast China. The combined use of satellite and climate data outperformed the individual inputs, indicating that all three types of data provided unique information to the final crop yield prediction. When using individual data sources, satellite data can achieve better predictive performance than

Conclusions

In this study, we investigated the relative performances of environmental and multi-source satellite data for yield forecasting of three major crops in northeast China. Two linear regression approaches, three ML approaches, and one ML ensemble method were used to build yield prediction models with different combinations of input variables. Our study showed that, overall, the ensemble model outperformed the regression and ML models, and better performance is achieved with more input data,

CRediT authorship contribution statement

Zhenwang Li: Conceptualization, Methodology, Writing – original draft. Lei Ding: Data curation, Writing – original draft. Dawei Xu: Data curation, Resources.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2018YFE0107000) and Key Research and Development Program of Shandong Province (2019JZZY010713).

References (74)

  • K. Guan et al. (2017). The shared and unique values of optical, fluorescence, thermal and microwave satellite data for estimating large-scale crop yields. Remote Sens. Environ.
  • Q. Hu et al. (2021). Integrating coarse-resolution images and agricultural statistics to generate sub-pixel crop type maps and reconciled area estimates. Remote Sens. Environ.
  • A. Huete et al. (2002). Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ.
  • T.J. Jackson et al. (1991). Vegetation effects on the microwave emission of soils. Remote Sens. Environ.
  • S. Jeong et al. (2022). Predicting rice yield at pixel scale through synthetic use of crop and deep learning models with satellite data in South and North Korea. Sci. Total Environ.
  • D.M. Johnson (2014). An assessment of pre- and within-season remotely sensed variables for forecasting corn and soybean yields in the United States. Remote Sens. Environ.
  • M.D. Johnson et al. (2016). Crop yield forecasting on the Canadian prairies by remotely sensed vegetation indices and machine learning methods. Agric. For. Meteorol.
  • J.W. Jones et al. (2017). Brief history of agricultural systems modeling. Agric. Syst.
  • Y. Kang et al. (2019). Field-level crop yield mapping with Landsat using a hierarchical data assimilation approach. Remote Sens. Environ.
  • S. Khanal et al. (2017). An overview of current and potential applications of thermal remote sensing in precision agriculture. Comput. Electron. Agric.
  • R. Lecerf et al. (2019). Assessing the information in crop model and meteorological indicators to forecast crop yield over Europe. Agric. Syst.
  • Y. Li et al. (2019). Toward building a transparent statistical model for improving crop yield prediction: modeling rainfed corn in the U.S. Field Crop Res.
  • L. Li et al. (2021). Crop yield forecasting and associated optimum lead time analysis based on multi-source environmental data across China. Agric. For. Meteorol.
  • H.K. Lichtenthaler (1996). Vegetation stress: an introduction to the stress concept in plants. J. Plant Physiol.
  • D.B. Lobell (2013). The use of satellite data for crop yield gap analysis. Field Crop Res.
  • Y. Ma et al. (2021). Corn yield prediction and uncertainty analysis based on remotely sensed variables using a Bayesian neural network approach. Remote Sens. Environ.
  • A. Mateo-Sanchis et al. (2019). Synergistic integration of optical and microwave satellite data for crop yield estimation. Remote Sens. Environ.
  • J.A. Mathieu et al. (2018). Assessment of the agro-climatic indices to improve crop yield forecasting. Agric. For. Meteorol.
  • T. Pede et al. (2019). Improving corn yield prediction across the US Corn Belt by replacing air temperature with daily MODIS land surface temperature. Agric. For. Meteorol.
  • B. Peng et al. (2020). Assessing the benefit of satellite-based solar-induced chlorophyll fluorescence in crop yield prediction. Int. J. Appl. Earth Obs. Geoinf.
  • T. Sakamoto et al. (2013). MODIS-based corn grain yield estimation model incorporating crop phenology information. Remote Sens. Environ.
  • B. Schauberger et al. (2020). A systematic review of local to regional yield forecasting approaches and frequently used data resources. Eur. J. Agron.
  • C. Wang et al. (2017). A snow-free vegetation index for improved monitoring of vegetation spring green-up date in deciduous ecosystems. Remote Sens. Environ.
  • J. Wen et al. (2020). A framework for harmonizing multiple satellite instruments to generate a long-term global high spatial-resolution solar-induced chlorophyll fluorescence (SIF). Remote Sens. Environ.
  • X. Xiao et al. (2006). Mapping paddy rice agriculture in South and Southeast Asia using multi-temporal MODIS images. Remote Sens. Environ.
  • D. Xu et al. (2021). The superiority of the normalized difference phenology index (NDPI) for estimating grassland aboveground fresh biomass. Remote Sens. Environ.
  • J. Zeng et al. (2015). Evaluation of remotely sensed and reanalysis soil moisture products over the Tibetan Plateau using in-situ observations. Remote Sens. Environ.