Pine caterpillar occurrence modeling using satellite spring phenology and meteorological variables

Outbreaks of leaf-feeding Lepidopteran insects substantially weaken the quality of forest trees and strongly affect the ecosystem functions of plant photosynthesis and carbon uptake. The narrow phenological time window of leaf out about ten days, during which Lepidopteran larvae feed on high nutrient newly flushed leaves, may change the insect community and outbreak dynamics by determining the survival rate of larvae. The Chinese pine Caterpillar (Dendrolimus tabulaeformis Tsai et Liu) infestation of the northern Chinese pine (Pinus tabulaeformis) forest in China is a major concern, and accurately modeling the day of insect occurrence (DIO) in the spring remains challenging. With continuous in-situ observed insect activities of 20 plots and satellite and meteorological observations from 1983 to 2014, we found a strong synchronization (r = 0.54, p = 0.001) between the satellite-based vegetation spring phenology, i.e. the green-up day (GUD), and DIO of the pine caterpillar over time. We used partial least squares regression and ridge regression models, and identified that monthly preseason air temperature, wind speed, specific humidity, and downward radiation were key environmental cues that awakened the overwintering pine caterpillars. After removing the collinearity of multiple variables, we showed that the dimensionality reduction-based regression models substantially improved the accuracy of DIO modeling than commonly used models, such as interval and degree-day models. In particular, including GUD significantly enhanced the predictive strength of the models increasing the coefficient of determination (R 2) by 17.1% and consequently a decrease of 16.5% in the root mean square error. We further showed that evapotranspiration changed the environmental moisture content, which indirectly affected the activities of insects. Our results revealed a useful linkage between spring leaf development and insect occurrence, and therefore are of great importance for the large-scale monitoring of pest outbreaks with future remote sensing observations.


Introduction
Forest insects and related diseases reduce the quality and vitality of forest trees, increase ecological vulnerability, and are important drivers of global forest ecosystem change (Kurz et al 2008, Kautz et al 2018, Seidl et al 2018. Biotic disturbances cause extensive loss of trees of specific species within the landscape with serious consequences for tree biodiversity and carbon sink functions (Clark et al 2010, Bao et al 2020). Therefore, examining the mechanisms and conditions of forest insects and disease occurrence is necessary to conserve, manage, and develop forest resources.
The pine caterpillar (Dendrolimus tabulaeformis Tsai et Liu) has been the most serious destructive defoliator in northern China (Fan et al 2013, Zhang et al 2014. The infestation might occur each year and causes the trees to lose leaves, resulting in changes in the host trees' physiology (Seidl et al 2017, Bao et al 2019. Once an outbreak occurs, the number of larvae per tree might exceed 400, which would have a significant effect on the tree's growth and even result in tree mortality (Fan et al 2013, Zhang et al 2018. The day of insect occurrence (DIO) refers to the timing when the pine caterpillar larvae first appear in the spring and travel up to trees for feeding, which occurs after overwintering in the thin layer around the roots of the host, under the rocks, the roots of the grass, and the covered pits. As the defensive compounds of young leaves tend to increase gradually early in the season after 30 • days (Tikkanen andJulkunen-Tiitto 2003, Falk et al 2018), Lepidopteran larvae might benefit from feeding on the leaves during a narrow time window before the leaves develop enough defensive compounds (Ekholm et al 2020), suggesting that larvae feeding with leaf flush are more likely to develop into outbreaks (Thomson et al 1984, Forkner et al 2008. The ability to predict DIO dynamics has important consequences for the management of destructive forest pests (Chen et al 2003), as well as the deployment of phenologically appropriate, timed insecticide treatments (Waldstein and Reissig 2001).
Temperature and photoperiod are the two major environmental factors that influence folivore and host tree phenology in the spring (Cannell and Smith 1983, van Asch and Visser 2007, Chuine et al 2016. Traditional researches analyze insect activities and population dynamics with ground meteorological data using the temperature-based degree-day (DDC) models (Mussey and Potter 1997, Wang et al 2013, Frank 2021. These models often use a baseline temperature above a particular threshold, below which no insect development is expected, and accordingly could provide a decent general description of larval occurrence (Buse andGood 1996, Bergant et al 2005). Due to the assumption of a simple linear relationship between temperature and insect developmental rate in DDC models, these models often tended to overestimate the development rate at high temperatures. As the synchrony of larval feeding with leaf flush exists, host plant phenology can reflect insect activity to some extent, although the underlying mechanism may be different for larvae and their host plants (van Asch and Visser 2007, Forkner et al 2008, Ekholm et al 2020. Regionally, larval feeding is often accompanied by the expansion of Populus × Canadensis buds and Salix matsudana buds, and the opening of Salix babylonica buds. However, the regional long-term records of host plant phenology (i.e. budburst, leaf expansion) are rare and hard to obtain.
During the past few decades, the increased use of remote sensing has greatly expanded the observation of traditional plant phenology (Richardson et al 2013, Keenan et al 2014, Fu et al 2015, Senf et al 2017. Landscape-scale vegetation spring phenology, that is, green-up date (GUD), depends on detecting the timing of phenological events in the temporal profile of greenness-related vegetation indices, such as the normalized difference vegetation index (NDVI) and the enhanced vegetation index (Piao et al 2019). But only in a study area with strong homogeneity, satellite-derived phenology can reflect to some extent the response of ecosystems to climate change. And it has the potential for insect occurrence modeling.
Based on 32 years of continuous field work in-situ observation data from fixed monitoring stations of pine caterpillars in Qingsongling, China from 1983 to 2014, our objectives are (a) to fill the knowledge gap of main drivers of insect activities (Dendrolimus tabulaeformis Tsai et Liu) in early spring in the study area, (b) to explore the synchrony of larval feeding with satellite-derived phenology and improve the predictive power of DIO models combining remote sensing (NDVI) with biotic variables (vegetation phenology), and finally (c) to explain the underlying mechanism of the synchrony between GUD and DIO.

Study area
The study area, Qingsongling, is located in eastern Jianping County of Liaoning Province, China (41.72 • -41.86 • N, 119.80 • -119.99 • E), which has a temperate continental monsoon climate (figure 1). The mean annual air temperature is approximately 6.9 • C (ranging from −21.3 • C to 28.2 • C in a year) and the total precipitation sum reaches approximately 440 mm (He et al 2012). The State Forestry Administration of China reports that, Pinus tabuliformis appeared on 327.8 km 2 in Qingsongling, accounting for 93.1% of the total area covered by forest and the rest is mixed forest, scrub and other vegetation. Therefore, Pinus tabuliformis is the dominant tree species in the study area.

Insect activities data
We set up 20 plots (50 m × 40 m) in the study area, and selected 20 Pinus tabulaeformis trees each for observation according to the 'Z' method (figure 2(a)). We used the following strategies to arrange the plots: (a) as evenly and uniformly distributed and homogeneously as is physically possible; (b) there are no artificially constructed buildings or roads in the neighborhood; (c) there are trees that are similar in age, height, and planting distance. The plots were not at regular intervals, within about 2 km. Before the larvae traveled up the tree in early March, we used a special sticky plastic skirt to wrap around the tree's breast  height, and regularly checked and recorded the number of larvae in the plastic skirt every day until no larvae were detected on the tree for three consecutive days. The DIO within a plot was the date when the first larva has been observed on any tree. The site-wide DIO was defined when all plots have recorded at least one larva sighting. We calculated DIOs by setting different thresholds and the interannual variations between them were highly correlated (supplementary figure S7). DIO data was continuously recorded from 1983 to 2014.

Meteorological and land surface data
The China Meteorological Forcing Dataset (CMFD) is a gridded near-surface meteorological dataset specially developed for the study of land surface processes in China (http://data.tpdc.accn/en/data/8028b944daaa-4511-8769-965612652c49/), with a spatial resolution of 0.1 • (Yang and He 2019). We used 2 m air temperature (Temp), specific humidity (Shum), 10 m wind speed (Wind), downward shortwave radiation (Srad), and precipitation rate (Prec) as the preseason predictor variables ( As the larvae overwinter is under ground cover, land skin parameters should also be taken into consideration. We used the ECMWF Reanalysis v5 (ERA5)-Land reanalysis dataset that provides a consistent view of the evolution of land variables over several decades at an enhanced resolution compared to ERA5 (www.ecmwf.int/en/era5-land) (Muñoz-Sabater et al 2021). These data was gridded to 0.1 • × 0.1 • with monthly mean averages provided. We used skin reservoir content and skin temperature to reflect the land skin moisture and temperature from 1983 to 2014.
We also used the eight day moderate resolution imaging spectroradiometer (MODIS) Evapotranspiration/Latent Heat Flux product (MOD16A2, 500 m, version 6) from 2001 to 2014 (Running et al 2017, Li and Xiao 2019). To reduce study uncertainties, we evaluated the evapotranspiration (ET) from MODIS and ERA5-Land (supplementary figure S5) in the study area from 2001 to 2014, which signaled a strong positive correlation (r = 0.95, p < 0.001). As the ET from MODIS has a higher resolution and is independent of other predictors, we used the MODIS data to explore the underlying mechanism. We interpolated linearly it to a daily scale and evaluated the most optimal window size (WS) before and after GUD to explore the relationships between ET and DIO. We defined a WS ranging from 0 to 50 days, and the duration was WS days before and after GUD (GUD ± WS days) occurred.

Satellite NDVI dataset
NDVI is a proxy for vegetation greenness (Jeong et al 2017). To coincide with the years of the site-based DIO, we used the NDVI GIMMS third-generation data (https://data.tpdc.accn/en/data/9775f2b4-7370-4e5e-a537-3482c9a83d88/) (Tucker et al 2005) from 1983 to 2014. With a 1/12 • spatial resolution, the fortnightly NDVI dataset has long-term observational data since 1982. As the longest NDVI record, the GIMMS3g NDVI data have been extensively applied for spring phenological research (Jeong et al 2011). We used seven NDVI pixels in the study area to extract vegetation phenology after area-weighted averaging.

Determination of spring phenology
Based on the time range of NDVI3g (1982NDVI3g ( -2015 and DIO (1983DIO ( -2014, our study focused on the period of 1983-2014. After area averaging and data preprocessing, we first applied a modified Savitzky-Golay filter to exclude abnormal NDVI data and take the envelope ( (1) where t is time in days, f (t) is the NDVI value at time t, and α 1 ,…, α 7 are fitting parameters. Specifically, α 1 is the background NDVI, represents the mean values of NDVI in the non-growing season of vegetation; α 2 and α 5 represent the difference between the background and the amplitude of the late summer and autumn plateau NDVI units; α 3 and α 6 are the midpoints of the transitions for green-up and senescence/abscission, respectively; α 4 and α 7 are the transitions curvature parameters (normalized slope coefficients). All seven optimal parameters in the study area were presented in the supplementary material (supplementary table S4). GUD was calculated as the time when the change rate of fitted NDVI reached its first local maximum value in spring (Gonsamo et al 2013, Wang and Liu 2021).

Commonly used models to predict DIO
We applied two traditional models commonly used to predict DIO (Li 1959, Chai et al 2002, Li et al 2002, Zhou et al 2012. The first was the interval method, which predicts the DIO based on the average interval of a multi-year overwintering period: where F is the date of the current developmental stage starts, H i is the date of the previous developmental stage starts, X is the average interval obtained from field observations. In this study, F is DIO, and H i , acting as an input, is the specific date when the larvae begin overwintering in previous year. The second was the DDC model which follows the idea of accumulated forcing temperature (van Asch and Visser 2007, Liu et al 2018). Using the baseline temperature, accumulated forcing temperature, and the local daily temperature during growth of the pine caterpillar, the occurrence period was predicted as: where DD(d) is the growing degree on date d, T b is the baseline temperature, T mean (d) is the daily mean temperature on date d, DD threshold is the accumulated forcing temperature from d 0 to DIO and d 0 is the first day of accumulation, set as 1 January of the year. DIO was defined as the date that DD(d) first exceeded the multiyear mean DD threshold . When predicting DIO of overwintering larvae rather than egg hatching, the average value of the multi-year winter minimum temperature was used as the baseline in this study (van Asch and Visser 2007).

Statistical analysis strategy
To remove covariate effects of environmental variables on DIO, we applied a partial correlation analysis to determine the optimal length of the preseason for each variable in the study area (supplementary table  S3), as the most relevant period for DIO ranged several months before it, and the length differed among environmental cues. The optimal preseason for each variable was defined as the period (with one month steps) for which the partial correlation coefficient between DIO and variables was highest during 1983-2014 (Fu et al 2015). We used partial correlation analyses, excluding interference between variables, to examine the responses of several environmental cues to insect activities in terms of DIO (Guo et al 2021). We applied two regression models for key variables selection. We used the partial least squares regression (PLS-R) and ridge regression (Ridge-R) to solve the variable collinearity issue. For the PLS-R, the variable importance in projection (VIP) was applied (Tran et al 2014, Farrés et al 2015, with the absolute value representing the driver's influence on DIO. The PLS-R model has a 'greater than one' criteria (Guo et al 2021), which states that a VIP value bigger than one shows that the driver significantly affects DIO. Ridge-R works directly with the predictors and adds a penalty related to the size of the coefficients to remove the variable collinearity (Hoerl and Kennard 1970) which is being increasingly adopted for variable selection and sensitivity analysis due to its reliable model performance (Graham 2003, Chen et al 2020. For Ridge-R, the sign of the model coefficient (MC) showed the positive or negative impacts of the driving factors and the t-test was used to calculate the significance.
When using the models, all predictor variables have been standardized and centered to remove the influence of units. Leave-one-out and K-fold crossvalidation methods, built into the functions in R, were also used to ensure the credibility of the results. R packages 'pls' (Wehrens et al 2021) and 'ridge' (Cule and De Iorio 2013) were used for applying and testing the models (R Core Team 2020). All statistical significance was set at p < 0.05.
A detailed data analysis procedure was presented in figure 3. Before performing the analysis, we resampled CMFD, ERA5-Land, and MODIS datasets into 1/12 • to match GUD derived from NDVI3g data. And we use the same method to average them. We summarized all the datasets and models used in this study in the supplementary material (supplementary tables S1 and S2).
We used the VIP of PLSR and MCs of Ridge-R to represent the contribution of each predictor to DIO. GUD and Wind also showed a positive effect on DIO, opposite to Temp,Shum,and Srad (figures 4(b) and (c)), which supported the partial correlation results. We also applied the ordinary least squares linear regression and similar results were obtained (supplementary figure S4). Comprehensively considering the results of significance test and the 'greater than one' criterion, we selected Green up day (GUD), air temperature (Temp), specific humidity (Shum), 10 m wind speed (Wind), and downward shortwave radiation (Srad) as key predictor variables for the modelling of DIO. Figure 5(a) shows that DIO and GUD were synchronized with a significant positive correlation (r = 0.54, p = 0.001). The pixel-based GUD and plots-based DIO shows similar synchronization (supplementary figure S6). The difference (in days) between the DIO and the GUD of the larvae's specific host tree determined the degree of synchrony. The degree of synchrony can vary strongly between years with a mean value of 4.22 days ( figure 5(b)).

Predicting DIO by different models
Among the two commonly used prediction models, the DDC model outperformed the simple interval method (figures 6(a) and (b)), but the explanatory power was still low (R 2 = 0.20, root mean square error (RMSE) = 6.99 days). The multivariate regression methods based on dimensionality reduction, when only considering environmental variables (Temp, Wind, Shum, Srad), showed a significantly stronger predictive strength than commonly used methods (figures 6(c) and (d)). PLS-R (R 2 = 0.64, RMSE = 3.33 days) performed similar with Ridge-R (R 2 = 0.64, RMSE = 3.39 days) in the term of RMSE. After combining both biotic and abiotic factors, GUD was added to PLS-R (R 2 = 0.75, RMSE = 2.78 days) and Ridge-R (R 2 = 0.75, RMSE = 2.80 days), and the models performed better than only considering environmental variables (figures 6(e) and (f)). Overall, GUD enhanced the predictive strength of the models increasing R 2 by 17.1% and consequently a decrease of 16.5% in RMSE.

Underlying mechanism including spring GUD for DIO modeling
To better understand the reasons for a better performance of modeling strategy with GUD, we further explored the possible changes over the spring leaf developmental stage that may influence caterpillar activities. We found a significant positive correlation between average ET and DIO with the WS between 18 and 32. At the optimal WS of GUD ± 26 days, ET showed the strongest impacts on DIO with r = 0.55 (p < 0.05) (figures 7(a) and (b)).

Complex mechanism of larvae overwintering and occurrence
Previous studies have shown that meteorological conditions are the driving factors for the activity Figure 4. (a) Partial correlation coefficients (PCOR) between DIO and environmental variables. Ridge regression's model coefficients (b) and partial least squares regression's variable importance in projection (VIP) values (c) acted as the sensitivity of DIO to environmental variables. We used five near-surface environment variables and two land skin variables, including 2 m air temperature (Temp), specific humidity (Shum), 10 m wind speed (Wind), downward shortwave radiation (Srad), precipitation rate (Prec), skin reservoir content (Src) and skin temperature (Skt). Asterisks indicate level of significance (p < 0.05: ' * '; p < 0.01: ' * * '; p < 0.001: ' * * * ') of t-test. dynamics of forest insects (Frank 2021), especially photoperiod and temperature (van Asch and Visser 2007). However, reasons for the relationships between wind speed and air humidity and larval emergence are limited. Wind speed generally obstructs larvae awakening as it can change local short-term weather conditions, either by directly blowing down tree larvae or by destroying their overwintering environment (He et al 2012).
As a result of the sluggish development and low survival of pine caterpillars caused by dehydration, ET was found to be substantially linked with DIO (Han  et al 2008). The higher environmental moisture content is favorable to their survival and development (Wu et al 2021). Freeze-tolerant insects withstand internal ice formation and maintain a high supercooling point through the production of ice nucleators (Zeng et al 2008). The environmental moisture content can therefore impact the water balance of insects both directly and indirectly with implications on larvae's metabolic activities. ET is the sum of water evaporation and transpiration from a surface area (such as soil, canopy interception, water bodies, and plants) to the atmosphere (Wang and Dickinson 2012). In spring, plants need water for photosynthesis and transpiration pull promotes the absorption of soil water (Tang et al 2021). The stronger the ET during GUD, the more local cooling and reduction of environmental moisture content will be caused, which is not conducive to the survival of insects and leading to a much later DIO.

Synchronization of vegetation phenology and insect activities
The synchronization of host plant phenology and insect activities has long been noticed by predecessors, as vegetation and insect phenology both are regulated by regional environmental characteristics (van Asch and Visser 2007, Forkner et al 2008, Ekholm et al 2020, although the underlying mechanism maybe different for herbivores and their host plants. The pattern of synchrony may vary under climate change (Schwartzberg et al 2014), which may consequently poses challenges for our model development. Therefore, we suggest that combining this synchronization with environmental variables in the model will help reduce the uncertainty.
This synchronization also makes vegetation phenology, a more readily observable land surface variable, somewhat of a proxy for insect activity. Future vegetation phenology changes can now be accurately predicted with the development of numerous spring phenology models (Liu et al 2018), such as the onephase growing degree days, spring warming models and two-phase sequential, parallel models. Our analysis indicates that the spring phenology (GUD) will be a crucial variable for predicting DIO in addition to meteorological data. We therefore expect that future higher spatial-temporal resolution, and even 'realtime' GUD monitoring, will undoubtedly provide a finer description of the synchronization between plant phenology and insects activities, even limitations existed in the current phenology data and model developments. The predictive power of insect activity models will be substantially increased in future due to the advancements in mechanism interpretation and data quality control.

Scale issues in remote sensing
Numerous factors, such as the homogeneity of the study area landscape, which makes it more challenging to identify mechanisms than ground-based observation, hinder the proper use of remote sensing data in ecological studies (Wu and Li 2009). Because the homogeneity of the study area could amplify the characteristics of individual tree species, the phenology of Pinus tabuliformis can actually be detectable by satellites because of the seasonal needle drops (Wang et al 2017), i.e. evergreen needleleaf forests do regularly shed older needles as newer needles fill in (Vilhar et al 2013). The needle drop is mostly noticeable when several trees start to lose their needles at the same time (supplementary figure S3), which is a seasonal process and therefore is detectable from remote sensing observations. Spatial mismatch is another concern for the proper use of remote sensing data. The GIMMS3g NDVI dataset has been widely used to monitor terrestrial vegetation change over the Northern terrestrial ecosystems (Piao et al 2020). This dataset has the advantage of a very long-term observation from 1982 to 2015, and its biweekly observation can support the monitoring of vegetation cycle changes in a year (Piao et al 2019). Since the first Landsat satellite launched in 1972, Landsat data can also support long-term monitoring with a higher spatial resolution (Wulder et al 2019). However, the persistent presence of atmospheric contamination hinders the reconstruction of the vegetation index time series within one year. Therefore, the GIMMS3g NDVI dataset was a better choice to retrieve the land surface phenology of the study area after time series reconstruction. Future sensor performance with a higher accuracy to enable more accurate extraction of land surface phenology will be of great help to support modeling of various terrestrial variables including DIO. Figure 7. (a) Find the optimal window size (WS) before and after green-up day (GUD ± 26 days) by best estimation. The blue area represents a significant correlation (p < 0.05). (b) Correlation between day of insect occurrence (DIO) and evapotranspiration before and after GUD (GUD ± 26 days).

Conclusion
While previous works on insect activity models mostly considered abiotic factors such as temperature and photoperiod, we attempted to model the DIO in spring using GUD, a biotic factor of the host phenology based on NDVI. Combined with remote sensing and reanalysis data, we screened the key environmental cues that affect insect occurrence and built predictive models for DIO. The dimensionality reduction-based regression methods greatly enhanced the explanatory power compared with commonly used models. We also found that biotic and abiotic factors have a joint effect on DIO, and satellite-derived spring phenology has great potential in predicting DIO, which can significantly enhance the predictive strength of the models increasing the coefficient of determination (R 2 ) by 17.1% and consequently a decrease of 16.5% in the RMSE. Environmental moisture content changes induced from ET before and after GUD may be a key factor in awakening the overwintering larvae. Our results are therefore of great significance for studying insect activities from a landscape-scale perspective and contribute to forestry management, and insect prevention modeling.

Data availability statement
The data generated and/or analysed during the current study are not publicly available for legal/ethical reasons but are available from the corresponding author on reasonable request.