Cue identification in phenology: A case study of the predictive performance of current statistical tools

Abstract Changes in the timing of life‐history events (phenology) are a widespread consequence of climate change. Predicting population resilience requires knowledge of how phenology is likely to change over time, which can be gained by identifying the specific environmental cues that drive phenological events. Cue identification is often achieved with statistical testing of candidate cues. As the number of methods used to generate predictions increases, assessing the predictive accuracy of different approaches has become necessary. This study aims to (a) provide an empirical illustration of the predictive ability of five commonly applied statistical methods for cue identification (absolute and relative sliding time window analyses, penalized signal regression, climate sensitivity profiles and a growing degree‐day model) and (b) discuss approaches for implementing cue identification methods in different systems. Using a dataset of mean clutch initiation timing in wild great tits (Parus major), we explored how the days of the year identified as most important, and the aggregate statistic identified as a cue, differed between statistical methods and with respect to the time span of data used. Each method's predictive capacity was tested using cross‐validation and assessed for robustness to varying sample size. We show that the identified critical time window of cue sensitivity was consistent across four of the five methods. The accuracy and precision of predictions differed by method with penalized signal regression resulting in the most accurate and most precise predictions in our case. Accuracy was maximal for near‐future predictions and showed a relationship with time. The difference between predictions and observations systematically shifted across the study from preceding observations to lagging. This temporal trend in prediction error suggests that the current statistical tools either fail to capture a key component of the cue–phenology relationship, or the relationship itself is changing through time in our system. These two influences need to be teased apart if we are to generate realistic predictions of phenological change. We recommend future phenological studies to challenge the idea of a static cue–phenology relationship and should cross‐validate results across multiple time periods.


| INTRODUC TI ON
Rapid climate change is causing shifts in the timing of annual peak resource availability for many animal populations across the world (Cook et al., 2012;Post, Pedersen, Wilmers, & Forchhammer, 2008;Singer & Parmesan, 2010;Visser, Holleman, & Gienapp, 2006;Visser, Marvelde, & Lof, 2012). Species can respond to these changes through evolution or by phenotypic plasticity in the timing of lifehistory events. Phenotypic plasticity has often been indicated as the primary driver of interannual phenological change, for example in breeding and migration dates of birds, summarized in Charmantier and Gienapp (2014). Species are thought to either respond directly to environmental changes or to proxy cues which relate to optimal timing. When temperature sensitivity differs between species, interspecific interactions can be disrupted causing temporal mismatch between trophic levels. Such mismatches impact individual fitness, with poor matching reducing survival and reproductive success, potentially affecting population resilience (Lane, Kruuk, Charmantier, Murie, & Dobson, 2012;Plard et al., 2014;Reed, Jenouvrier, & Visser, 2013;Visser et al., 2006).
Predicting phenological change over time is an important element in assessing population resilience, since identifying potential trophic mismatches can highlight populations at risk. Prediction is becoming more common in ecological studies, particularly those including phenology (van Asch, Tienderen, Holleman, & Visser, 2007;Chuine & Beaubien, 2001;Cleland et al., 2007;Pau et al., 2011;Roy & Sparks, 2000). Before generating predictions, however, it is necessary to build an understanding of how species time their life-history events. Life-history events such as the onset of reproduction (Ardia, Cooper, & Dhondt, 2006;Charmantier et al., 2008), flowering (Cleland et al., 2007;Menzel et al., 2006), migration (Usui, Butchart, & Phillimore, 2017) and hibernation (Lane et al., 2012) have been linked to environmental variables (Parmesan, 2006). However, identifying the exact environmental cue or cues used is challenging. To establish causation between a proposed cue and phenology would require experimentation, preferably in the organism's natural environment. However, such experiments are logistically challenging, particularly for large and mobile species. As a result, for long-term phenological studies, the primary tool for identifying environmental cues has been statistical analysis. There are two overarching approaches: phenomenological and mechanistic (Roberts, Tansey, Smithers, & Phillimore, 2015). Phenomenological approaches are based on statistical associations between observed data (Roberts et al., 2015). Mechanistic approaches are based on assuming known biological processes as drivers of variation. Both methods identify correlative rather than causal relationships. This impedes the teasing apart of real cues from confounding variables, particularly as many weather variables are highly autocorrelated both temporally and spatially. Of the different methods used, the phenomenological sliding time window and smoothing function-based regressions, and the mechanistic growing degree-day model (GDD), have been used most frequently.
Sliding time window analyses-a regression-based approach (Hudson, 2010)-have been widely applied (Bailey & De Pol, 2016;Husby et al., 2010;Perrins & McCleery, 1989;Phillimore et al., 2013;van de Pol et al., 2016;van de Pol & Cockburn, 2011;Samplonius et al., 2018). This approach statistically identifies the critical time window in which an environmental variable explains the most variance in phenology. Typically, the explanatory variable is some measure of an abiotic weather variable across a temporal window; the response variable is the phenological variable of interest. The duration and temporal position of the window can be altered, and the explanatory power of each model compared. The temporal position is either tied to a reference calendar day (i.e. 20 May) or to the time event itself (i.e. days prior to event); these approaches are absolute sliding time window (SWA) analysis and relative sliding time window (SWR) analysis, respectively (van de Pol et al., 2016). SWAs vary in the annual lag between the window and the event, whereas SWRs are fixed relative to the phenological event, but calendar day can vary. The SWR has been proposed as an alternative to absolute methods to allow for individual or interannual variation in weather conditions experienced (van de Pol et al., 2016;van de Pol & Cockburn, 2011).
Absolute methods (SWA; climate sensitivity profiles-CSP; and Pspline signal regressions-PSR), which are tied to a calendar day, can identify good proxies for the actual cue, which can be useful for prediction, if the relationship between the cue and proxy is stable. However, they cannot identify a true biological cue. It is biologically implausible that the exact same period of calendar days, for example May temperature, influences the phenological event in all individuals or in all years. For example, in a particularly warm year the mean phenological event timing may fall prior to or within May.
Consequently, the timing decision could not be influenced by this cue and therefore it cannot be a true biological cue, even if it were to correlate strongly with it (therefore being a good proxy). Relative windows provide some alternative to this in an attempt to access a more biologically meaningful cue by having temporally variable windows, for example temperature in the month prior to laying would influence lay date each year. Instead of being fixed to a calendar day, SWR methods are fixed to the phenological event, covering different calendar days each year (or for each individual) but having a fixed lag time between the cue and event.
Smoothing function methods are a more recent approach (Roberts, 2008;Roberts et al., 2015;Thackeray et al., 2016). They are also regression-based analyses but employ smoothing functions and K E Y W O R D S climate sensitivity profile, cue identification, GAM, great tits, growing degree days, penalized signal regression, phenology, sliding windows penalties to generate sensitivity profiles (Roberts, 2010). These analyses consider the influence of the environment on all days during a year rather than bounded windows. The PSR method (Roberts, 2008) and CSP method (Thackeray et al., 2016) are both examples of smoothing function approaches. Using these methods, it is possible to identify the most influential days of the year, based on the effect size.
Growing degree-day (GDD) models were developed for plants (Chuine, 2000;Kramer, 1995;Rötzer, Grote, & Pretzsch, 2004), based on the assumption of linear relations between temperature and development due to enzyme activity (Bonhomme, 2000). Any temperatures exceeding a particular threshold are considered to contribute to development, and the influence of each degree over the threshold is accumulated until a second threshold is reached and the phenological event occurs. These models can be applied to animal systems although fewer direct developmental links to temperature might be expected for endothermic animals (Khaliq, Hof, Prinzinger, Böhning-Gaese, & Pfenninger, 2014;McNab, 2012).
A fundamental implicit assumption across all of these methods is that the environmental cues driving phenology remain consistent across time. Many studies of phenology in long-term systems continue to use the same cue identified previously to inform later analyses (Charmantier et al., 2008;Visser et al., 2006;Visser, Noordwijk, van Tinbergen, & Lessells, 1998). Alternatively, cues are defined using all data available or without consideration of the temporal distribution of the data (Husby et al., 2010;Perrins & McCleery, 1989;Phillimore et al., 2013;van de Pol et al., 2016;Thackeray et al., 2016).
Both approaches assume that the cue, or the relationship between a proxy and timing, does not change over the time period considered.
This assumption fails to account not only for the potential of changing cues over time but also for the influence of sample size on the cues identified.
Other methods exist in the cue identification toolkit, such as machine learning (Holloway, Kudenko, &Bell, 2018); however, the above are the most commonly applied and well developed. Previous studies have compared the performance of different methods in capturing environmental sensitivity (Phillimore et al., 2013;Roberts et al., 2015), finding largely congruent results across different methods. Similar cues were identified using GDD, SWA and PSR (Phillimore et al., 2013;Roberts et al., 2015). While predictive potential was inferred in these studies through R 2 values, the predictive capacity and accuracy of different methods were not compared. The aims of explanation and prediction are fundamentally different and models optimized for one will not necessarily perform well for the other. Therefore, it is timely to assess the predictive performance of cues identified for the aim of explanation as a predictive focus in ecology increases.
In this study, we address the following specific objectives, using a dataset of population mean annual laying date, collected over 55 years from a long-term study of the great tit, as our phenology measure to: 1. Compare and contrast temperature cues identified (allowing variation in both the critical window and aggregate statistic of the cue) by five commonly used methods (sliding window absolute analysis (SWA), sliding window relative analysis (SWR), climate sensitivity profiles (CSP), penalized signal regression (PSR) and a growing degree-day model (GDD)).
2. Assess how these identified cues vary depending on the length of the dataset and the precise time period used.
3. Use the temperature cues identified by the different methods and lengths of dataset to predict phenology for five-year test datasets of observed data.
4. Evaluate the predictive performance of the different methods and interpret their biological meaning.

| The study system
The Wytham Woods great tit nest box population study has been conducted in a standardized way since 1960 (Perrins, 1979). Each breeding season (April-June), weekly nest box checks are carried out to provide data on nest stage, number of eggs and the onset of incubation. Clutch initiation date is determined by assuming a laying rate of one egg per day; therefore, it is possible to count backwards from the number of eggs counted on the weekly check to determine the date that the clutch was initiated. Species identity (four species of tit use the study nest boxes) is initially confirmed by weighing eggs, when at least three are present; great tit eggs can be confirmed by average egg weight of >1.3 g per egg. Clutch initiation timing, hereafter "lay date," has been extensively studied in this population and strongly linked to spring temperature (Charmantier et al., 2008;Husby et al., 2010;van Noordwijk, McCleery, & Perrins, 1995;Perrins & McCleery, 1989). Therefore, we focus on testing temperature cues only in this study. We also exclude any lay dates more than 30 days after the first lay date of the year from analyses. This is to avoid inclusion of second or replacement clutches (Van Der Jeugd, Henk, & McCleery, 2002). All mean lay dates were rounded to the nearest whole day.
This study uses annual mean lay dates from 55 years ; these were calculated from 14,372 individual nest observations. The mean number of nests per year was 256.6 (range 114-473). The mean standard deviation of lay dates in a given year was 7.8 days (range 4.4-13.4 days).
Our biological dataset has the following format (column names): year, mean lay date (in days since 1 April), mean lay date (dd/mm/ yyyy), day of the month, month, day of the year.

| Environmental data collection
Temperature data were collected by the Meteorological Office, as part of the National Climate Information Centre gridded daily data (grid point 447500E 202500N) (Hollis & McCarthy, 2017;Met Office, 2009). These data are available at a 5 km by 5 km resolution across the United Kingdom (UK). Daily maximum and minimum temperatures were recorded in weather stations across the UK and the mean of these taken to produce a daily mean temperature measure. Spatial and temporal interpolation for missing data was conducted by the Meteorological Office (Met Office, 2009). Environmental data were available from 1960 to 2015.
Our environmental dataset has the following format (column names) with daily resolution; date (dd/mm/yyyy), day of month, month, day of year, temperature (°C).

| Cue identification methods
The response variable used in all analyses was the mean annual lay date for the Wytham Woods great tit population.

| Sliding time window analyses
The premise of sliding time window analyses is that there is an optimum time period in which the focal species is most strongly influenced by abiotic conditions. To identify this time period, environmental conditions, such as precipitation or temperature, are aggregated across a period of days (the time "window"). The duration and calendar position of the window are then altered to generate a series of candidate windows and associated environmental conditions, which are the explanatory variable. The number of candidates chosen can be exhaustive, testing all possible durations and positions within some bounds (van de Pol et al., 2016) or restricted to only a few a priori chosen candidates (Husby et al., 2010;Perrins, 1965). A regression-based analysis (typically a linear model) is then used to determine the explanatory power of each candidate window. Model selection, with a focus on explanatory performance, is then conducted to determine the preferred model from the candidates. The aggregate statistic used in this method is typically a sum, mean (e.g. of the daily mean, minimum, or maximum temperature), minimum (e.g. the minimum mean daily temperature reached in a focal window) or maximum (e.g. the maximum mean daily temperature reached in a focal window) environmental value, or the slope of environmental change across the window (the gradient of a linear model of daily mean temperature against date within the focal window).

| Absolute sliding time window (SWA)
In SWA, the position of the candidate window occurs at the same point in the calendar year every year, for example always from the 10 April to 20 May. In other words, it holds an absolute position not influenced by phenological timing.
In this study, we chose to implement the SWA as an exhaustive analysis. For this analysis, we used a reference day of the 20 May to bound exploration to windows that occur prior to that date, meaning all windows are characterized in "days prior to 20 May." Windows were allowed to vary in length from a single day up to 365 days.
The whole year was used for consistency with other methods in this study (CSP and PSR) and to reduce a priori restrictions. It also allows influence of autumn or winter temperatures to be detected if these are important, as has been previously reported for some species (e.g. Thackeray et al., 2016). Four aggregate temperature statistics were tested (mean, minimum and maximum temperatures, and the slope of temperature change across each window-calculated as the gradient from a linear model of temperature against day in window). All aggregate statistics were calculated from data of mean daily temperature. For instance, the minimum and maximum are the most extreme mean daily temperatures that occurred during a particular window.
Model selection was performed using the AICc (Akaike information criterion (Akaike, 1973) with a small sample size correction); the preferred model was identified by having the lowest AICc compared to the baseline model, an intercept-only linear model of annual mean lay dates. The SWA analysis was run using the R package "climwin" (Bailey & De Pol, 2016;van de Pol et al., 2016). This package was designed to standardize the process of fitting sliding time windows to phenological data. It allows for exhaustive exploratory analyses across a variety of aggregate statistics.

| Relative sliding time window (SWR)
SWR is similar in principle to SWA; however, the candidate windows are not tied to a calendar day, but to the phenological event itself. All windows are defined in "days prior to the phenological event" and occur on different calendar days each year.
In the same way as for SWA analyses, the time windows were allowed to vary in length from one to 365 days, and we tested aggregate temperature statistics of the mean, minimum and maximum temperature, and slope of temperature change across each window.
Model selection was again conducted using AICc (with small sample size correction) compared to the baseline model, an intercept-only linear model of annual lay date means. SWR analyses were also run using the "climwin" package in R.

| Climate sensitivity profile (CSP)
The CSP technique was first introduced by Thackeray et al. (2016), and we follow their general methodology here. This method takes a measure of an environmental variable on a single day, for example daily mean or maximum temperature, and regresses this against the phenological event of interest in turn for every day of a year. The first day taken is the day prior to a reference day. Here, we used 20 May as our reference. We began with mean temperature for the 20 May and regressed this against the annual mean lay dates (using the lm function in R), iterating backwards through time up to 365 days prior to 20 May. For each regression, we saved the coefficient value (the slope of the relationship between temperature and phenology) and the R 2 . These coefficient and R 2 values were each passed through a GAM using the R package "mgcv" (Wood, 2001;Wood & Wood, 2015) to smooth the values across time. These smoothed functions are used to identify the calendar days of greatest influence on phenology. This period is defined as a period of consecutive days on which the coefficient and the R 2 values exceed the lower and upper quantiles, defined as greater or equal to the lowest 2.5% and highest 97.5% of values following the method used in Thackeray et al. (2016).
To run this method, our environmental data were reformatted so that each row was a year and each column was a day prior to 20 May, with entries being daily mean temperature. While Thackeray et al. propose using the date on which 95% of individuals have initiated laying (DOY95) as a response, we use annual mean lay date for consistency with the other methods trialled here. We also test using DOY95 to compare using the method as intended by the authors (see Figure S4).

| P-spline signal regression (PSR)
P-spline signal regression for phenological cue identification was introduced by Roberts (2008). This method works on a similar principle to the CSP methodology, but instead of a two-step process there is a single smoothing and coefficient creating step. This method regresses all 365 days of temperature against the response, simultaneously creating partial coefficients (slope of the relationship between temperature and phenology). These partial coefficients are smoothed by penalizing for differences in consecutive days.
The inclusion of many explanatory variables in a single analysis is addressed through a data reduction phase in order to reduce the high dimensionality. This is achieved by creating a B-spline basis, creating a series of piecemeal polynomials (curves) joined at knots.
These knots must be specified and cannot exceed one less than the sample size. The combination of B-splines with a difference penalty results in P-splines (penalized B-splines), which penalizes differences between B-splines to prevent overfitting. The level of difference penalty is chosen through cross-validation. Here, we take the order of the B-spline basis and the difference penalty from that described in Roberts (2008) and Roberts et al. (2015), cubic and first order, respectively. The approach can be implemented directly using a GAM.
We ran the PSR using the "mgcv" R package. The GAM is run on the raw phenology data and climate data indexed to the reference date of 20 May. Our climate data were arranged in the same way as for the CSP method. The whole year of temperature can be used to subsequently predict phenology in given years. The most important days of temperature influence on phenology in the year may be identified by as those with partial coefficients greater than or less than zero by more than two times their standard error.

| Growing degree day (GDD)
The GDD model we implement here is a three-parameter thermaltime model (also used in Phillimore et al., 2013). The parameters used in this model are: start date, minimum threshold temperature and the cumulative GDD requirement. The environmental values begin being accumulated from the start date onwards; here, we use mean daily temperatures. Every degree above the minimum threshold temperature is cumulatively summed until the cumulative GDD requirement is reached, at which point the phenological event is predicted to occur. These parameters were optimized to minimize the sum of squared differences between the predicted annual mean lay date and the observed mean lay dates. The sum of squares was calculated using a linear model with the observed phenology as an explanatory variable and predicted phenology as the response. Optimization was performed using a generalized simulated annealing optimizer through the "GenSA" R package (Xiang, Gubian, Suomela, & Hoeng, 2013). A wide area of parameter space is searched with bounds for each parameter of start dates from 1 to 200 (year day on which temperature starts being counted); minimum temperature from 1°C to 10°C; and cumulative GDD requirement of 50°C to 1,000°C. Parameters were bootstrapped 1,000 times using the "boot" package in R (Canty & Ripley, 2017;Davidson & Hinkley, 1997), and percentile bootstrap confidence intervals for each parameter were produced to capture uncertainty in the parameter estimation. It should be noted that the aggregate statistic used for GDD models is cumulative sum of temperature in contrast to predominantly mean temperature in other methods.

| Identification of critical time windows and aggregate statistics
In this study, we identified cues using several different data subsets to address the question of whether the time period covered by the data alters the cue identified. To answer this question, the data were divided into three subsets: a 50-year training dataset , the first 25 years of data  and the last 25 years of data . Data from 2011 to 2015 were retained separately to use as a test dataset for predictive analyses. An optimal temperature cue was identified using each of the five methods detailed above, by running each method following its own protocol.
To assess the amount of variance explained by the identified cues, we ran linear models with annual mean lay date as a response variable and the cues identified as the explanatory variables. For the SWA, SWR and CSP methods, the cue identified is a temperature variable. For the GDD and PSR methods, the cue identified is the date on which the cumulative GDD requirement is reached or the fitted date from the PSR model. For each of the linear models, the adjusted R 2 value was calculated as an indication of the amount of variance in the annual mean lay date that has been explained by the focal cue.

| Prediction of lay dates using identified cues
Only four of the five methods were used predictively in this study because the SWR cue is defined based on the timing of the phenological event, and therefore, to identify the cue, the event must have already occurred. As a result, prediction using this method was not possible. For the SWA and CSP methods, predictions of annual mean lay date timing were generated from linear models of the identified cues against phenology, based on the observed temperature values from the test years. For PSR, predictions were generated from the P-spline GAM using all days of temperature in the 365 days preceding 20 May. For the GDD model, the temperature data from the test years were passed through the GDD equation using previously optimized parameter values to identify the date in each year on which the cumulative GDD requirement was reached. Prediction intervals were also generated for all methods using the results of the linear models or GAM for SWA, CSP and PSR. However, for the GDD method prediction intervals were generated by predicting using the lower and upper bounds of estimates of each parameter from bootstrapped confidence intervals.
We explored the influence of the time period of the training data relative to the test data and the length of the training data on predictive performance at several resolutions. In order to tease apart these different influences, three sets of predictions were generated to answer specific questions: 2. How does the temporal distance between the dataset and predicted years interact with data length to influence predictive performance?
To address this question, we used the same data subsets from above but instead of using the 5 years following the end of the training dataset as test years we used only 2011-2015. For datasets ending in 2010 (five of the nine), this was the same analysis as above. This allowed the influence of years of study to be distinguished from the influence of temporal lag between data and predictions.
3. How accurate and precise are predictions from the SWA, CSP, PSR and GDD methods across our dataset? To address this, we conducted K-fold (in this instance 5-fold) cross-validation to determine predictive performance when accounting for stochasticity in the test years of data. We split the original 55-year dataset  into five-year subsets and in turn predicted one five-year subset from the parameters generated by the remaining 50 years (e.g. 1961-1965 could be held as a test dataset and 1966-2015 data used to predict phenology in the test years).
To quantify the accuracy of the phenological predictions and compare across different methods and data subsets, the mean absolute error (MAE) was calculated for each set of five predicted years (2011)(2012)(2013)(2014)(2015).
The MAE is the mean of the positive value of all discrepancies between the predicted lay dates and the observed lay dates (error) across the five test years. The larger the MAE, the greater the discrepancy between predicted and observed phenology. In addition to the MAE, we also calculated the raw discrepancy (retaining sign of error) between the predicted and observed phenology during the K-fold cross-validation (the mean prediction error). This allowed an exploration of bias in the predictions, which would not be captured by MAE.
Precision of predictions was represented by prediction intervals.
The widths of prediction intervals were compared, with wider intervals indicating lower precision. We also explored how well the precision is captured by calculating the proportion of times the observed value fell within the prediction interval for each set of five test years (coverage of the prediction interval).

| Critical time windows of sensitivity
We found considerable variability in the exact days identified as the "critical window" of environmental sensitivity; the window length and position varied based on the method used and the time period covered by the dataset (Figure 1). The timing of the critical window was broadly similar between the SWA, PSR, CSP and GDD methods, for all time periods. However, the SWR method identified windows considerably earlier than all other methods, more than 200 days earlier at the most extreme. The SWR method showed large differences in window timing depending on the dataset used, with older data producing earlier windows. Time windows identified using the whole long-term training dataset  were typically midway between the early data and late data, with the exception of SWR where the window was different from both early and late. In addition, for all methods except SWR, the windows for the latter half of F I G U R E 1 Temporal critical windows for great tit egglaying phenology identified by different statistical methods and timeframes of data. Vertical dotted line shows 1 January, and solid vertical line indicates 20 May, the reference day for absolute methods; absolute sliding window analysis (SWA), climate sensitivity profile (CSP) and penalized signal regression (PSR). Growing degree-day model (GDD) is plotted relative to the mean lay date across years and the relative sliding window analysis (SWR) relative to the reference date of 20 May our data (1986-2010) began earlier and were longer in duration than the windows identified using older data.
Relationships between the identified cue and the mean annual lay date varied from explaining 41% (CSP) to 77% (PSR) of the variance in the response variable. Three of the five (SWA, SWR and PSR) explained approx. 70%, or greater, of the variation (Table 1). When selecting for a model which maximizes explanatory power, dissimilar environmental cues can still produce similar results and consequently be difficult to choose between. This is not surprising in highly explorative analyses such as this; consequently, R 2 should be used with caution.

| How accurate is predicted phenology?
Error between predicted and observed phenology was lowest for near-future predictions-those where test years occur directly after the end of the training dataset (Figure 2a,b). These predictions had MAE that did not exceed 7 days. The standard deviation in mean annual lay dates across the study period was 7.2 days; therefore, all near-future predictions had error which was lower than the standard deviation of the data as a whole.
Sample size did not appear to have a strong influence on predictive accuracy. Mean absolute error remained consistent across five different sample sizes from 10 to 50 years in Figure 2(a,b). There was a slight reduction of MAE with increasing sample size in Figure 2c.
However, this also coincided with decreasing temporal lag between data and predictions. The larger the temporal lag between data and predictions, the greater the error up to 12 days MAE. When data were collected relative to when it is predicting had a greater impact on accuracy than the total number of years of data. The PSR method had the lowest mean MAE and consequently the highest accuracy of all methods in K-fold cross-validation (Table 2). SWA, GDD and CSP also had mean MAE of <7 days.
While MAE was consistent across all sets of test years for nearfuture predictions, the mean prediction error (mean of the raw values of prediction minus observed) indicated systematic bias in accuracy ( Figure 3). Non-random structure was present in the error.
Predictions for the earliest half of the study period tended to be earlier than observations, and predictions for the latter half of the study period tended to be later than observations. This structure was described by using a fitted least-squares line with study year as an explanatory variable: 50% of the variation in the error was explained by year of study (quantified by multiple R 2 ). While the GDD method did demonstrate bias in the prediction error, in the same direction as all other methods, the error for this method did not appear to follow a linear trend. If only the regression-based analyses were included, the amount of variation explained by the linear term increased to 66%.
The fitted line in Figure 3 crossed 0 at approximately the mid-point of our study (1986).

| How precise are our predictions of phenology?
The amount of uncertainty (how precise the predictions are) in predicted phenology-the width of prediction intervals-varied substantially between methods (Table 2 and Figure 4) from 6.73 days for the PSR method to 66.13 for the GDD method. There was no clear association between width of prediction intervals and either number of years of data or distance from the data and predictions ( Table 2).
The range of the 95% prediction intervals of all methods except PSR was more than double the within (7.8 days)-and between (7.2 days)year standard deviation in lay dates for this population.
Within methods, there was no association between how accurate and how precise predictions were, that is sets of test years with higher accuracy in predictions did not consistently have tighter prediction intervals. In contrast, between methods those with higher accuracy (PSR and SWA) were also more precise (have tighter prediction intervals- Table 2). The PSR method was both the most accurate and the most precise method, based on prediction interval width. However, the width of a prediction interval alone does not give us a complete picture of how precise a method is. It is also important to know whether the estimated prediction interval does accurately capture the uncertainty in a particular method. To quantify this, we calculated the proportion of prediction intervals which contained the observed value (coverage). For 95% prediction intervals, we would expect on average 95% of the intervals to overlap the observed value. From observed value in this study ranged from 100% to 0% (Table S2), but was most interesting for the K-fold cross-validation because these results take account of stochasticity in individual sets of test years by having 11 sets of test years (55 predicted years in total). For the K-fold cross-validation, the SWA and CSP methods had approximately 95% of observations falling within the prediction intervals, as would be expected. The GDD method had 82% of observations within the prediction intervals, whereas the PSR method had a particularly low proportion of observations within prediction intervals, 58%. This is much lower than expected even with stochastic variation. Despite the seemingly high precision and absolute accuracy of this method, it appears that precision in the predictions was not being correctly quantified.

| Identified temperature cues differ between methods and over time but explanatory power of cues remains high
We have shown, using multiple analyses on the same dataset, that the identified temporal windows of environmental sensitivity vary  F I G U R E 3 Plot of the error from K-fold cross-validation (mean difference between predicted and observed mean annual lay date).
Error is plotted against the time period of the five-year test dataset it is generated for. Plotted solid line is a fitted least-squares line, dashed line indicates 0 F I G U R E 4 Predicted and observed mean annual lay dates. Predictions generated from different methods, using the whole long-term dataset, are plotted against the observed lay dates. Vertical lines represent 95% prediction intervals or the standard deviation for observed data in their position and duration dependent on the statistical method and the time period of data used (Figure 1). While there was some temporal overlap between the windows identified by SWA, CSP, PSR and GDD, the SWR method identified critical windows of temperature sensitivity that were 50-250 days earlier than other methods. The amount of temporal overlap in cues identified by different methods was highly influenced by the time period of data used to identify the cue. If only the most recent 25 years were used, the results of the SWA, CSP, PSR and GDD methods were largely congruent, as found in previous comparisons of statistical methods for cue identification (Phillimore et al., 2013;Roberts et al., 2015).
However, when the earliest 25 years of data were used, more marked between-method differences emerged. This appears to be driven by different susceptibility of the methods to the input data, with SWR having the greatest variation in cues identified. It should be noted that when CSP is employed using DOY95, as suggested in Thackeray et al. (2016), the windows identified by this method are more variable (see Figure S4). For all methods, excluding SWR, the windows identified by the most recent half of the data all began earlier in the year and had a longer duration than those identified by the earlier half. This is as would be expected as the birds are shifting their lay dates earlier in the year.
Despite the difference in cues identified by different methods, explanatory power of the different cues remained fairly high (excluding CSP). Using metrics based on the ability to explain variances such as AIC or R 2 may not be sufficient to distinguish between correlated cues. In the worst case, the R 2 values can be overly optimistic in highly explorative studies and should be interpreted with caution. As a result, we recommend, for prediction in particular, to assess how well the identified cues perform for prediction to determine whether the identified cues hold for future years or novel conditions.
Temporal movement of the cue identified across our study period data could indicate a shift in the actual temperature cue over time ( Figure 1). This could operate through evolution of the cuephenology relationship. The consistent (expect for SWR) shift to an earlier beginning of identified critical windows for more recent data could give support to evolution to use a different cue. However, an alternative possibility is that the statistical methods could be identifying proxies for the actual cue used by great tits and that under climatic change the relationship between these proxies and the actual cue are shifting, leading to a change in the proxy identified as the best predictor. It may also be the case that none of the statistical methods we use identify either the precise cue or a consistent proxy and that the cue identified is simply the best predictor of variance in lay dates for that particular time period and that this is altered depending on the exact years included.
We found (Figure 1) that the cue identified by SWR method was notably different to all other techniques, which identified very similar windows here and in previous comparison studies (Hudson, 2010;Phillimore et al., 2013;Roberts et al., 2015). This indicates that the relative approach can also be problematic in phenological cue identification.
The SWR approach has been used successfully to understand climate predictors of egg size in fairy wrens (Langmore, Bailey, Heinsohn, Russell, & Kilner, 2016). A variant on a relative approach was also successfully used to look at the timing of incubation onset relative to clutch completion (Simmonds, Sheldon, Coulson, & Cole, 2017); here, the temperature windows were tied to clutch completion rather than a calendar date but the lag between the cue and incubation onset remained variable. However, relative approaches should be used with caution. Due to the linear regression basis of these methods the strongest effect size, R 2 , and lowest AIC values will be produced when there is variability in the explanatory variable and this variability correlates with the response variable, that is high values of temperature correspond with either early or late lay dates and low temperatures with the opposite. When using relative windows, it is likely that the time period preceding the phenological event will have similar temperature values for all years. For instance, if laying commences soon after particular temperatures are reached, regardless of their exact yearly timing. In this case, the explanatory power of windows identified close to the lay date will be low, because there will be lower variability in the explanatory variable than in the response variable. Only when a difference between temperatures for early and late years occurs will the explanatory power increase. This is likely to occur at periods of seasonal transition, for example, the onset of spring or winter. At this point, years with early lay dates will have their relative window cross these transitions prior to later years, generating strong temperature differences, a linear relationship and potentially temperatures which can explain variance in lay date. However, what has been identified is unlikely to be a cue for laying; instead, it is a statistical artefact of the method being used. The location of the SWR windows identified here, around the autumn and winter onsets, suggests that the statistical artefact discussed above might be the cause of the erroneous cues identified in this study.
Being aware of the statistical limitations of any methods used is vital for conducting analyses based on these methods. If used in a threshold rather than regression-based format, relative windows might provide a suitable alternative to absolute methods, which can never identify the precise cues being used. Equally, a failure to identify a precise cue, if the proxy is good, may not be an impairment to many analyses. But relative windows used in a regression format to identify phenological cues will not statistically identify a true cue and cannot be used predictively, so risk being misleading in this context.

| Predictive accuracy is highest for near-future predictions
We have shown that the accuracy of predictions of phenological timing are influenced more by the temporal distance between the data used to identify the cue and the years being predicted, than by the number of years in the sample. When predicting the five years following the end of the data used to identify the cue, the prediction error remained consistent despite reductions in sample size ( Figure 2a,b). This trend was consistent across all methods trialled.
A reduction in MAE with increasing sample size was only seen in tandem with a reduction in the temporal distance between dataset end and predicted years (Figure 2c).
It is not surprising that near-future predictions have higher accuracy than those further from the data used to parameterize the models. Extrapolating statistical relationships beyond the data used to estimate them can always be problematic as the identified relationships may not hold under novel conditions. Our results suggest that this may be the case in the Wytham Woods great tit population. When cues and cue-phenology relationships identified by data early in our study years are used to predict years later in the study, predictive accuracy is reduced. Such a pattern indicates that either the cue itself, or the cue-phenology relationship, is changing over the period of our study; further work should address the consequences of such changes.
The suggestion that biological systems, in particular phenological relationships, are not static is further supported by Figure 3. Our results show temporal autocorrelation in prediction error from Kfold cross-validation. Predictions earlier in our study years tended to be earlier than observed phenology; in contrast, predictions later in our study years tended to be later than observed phenology, with a single linear term (including prediction error from regression-based methods only) explaining up to 66% of the variation observed. The mechanistic GDD method also showed the same bias in error but with a nonlinear trend. Therefore, a directional temporal component of the system has not been captured by our current analyses. One possibility is evolution of the cue-phenology relationship over the period of our study (c. 30 great tit generations). Previous work on another great tit population suggests that evolution of reaction norms may be hampered by low heritability and the sex-limited nature of laying date (Ramakers, Gienapp, & Visser, 2018). However, exploration of evolution to use different cues (e.g. shifting to be sensitive to cues earlier in the year) has not yet been explored. Another possibility is that our models have not identified the true cue used by great tits to time phenology. If the true cue and the identified cue respond differently to climate change, then predictions will diverge over time.
As a result, the impact of current climate change on phenology would not be correctly captured in our models and could lead to a temporal bias. A potential cause of this could be that all methods discussed here and the majority of studies (Charmantier et al., 2008;Husby et al., 2010;Perrins & McCleery, 1989;Visser et al., 1998) focus solely on the role of abiotic cues. It seems highly unlikely that biotic cues play no role in the phenology of other species. If in reality a biotic cue drives great tit phenology, this could create a pattern such as the one shown here. Teasing apart these two potential causes requires quantification of selection on the cue-phenology relationship, or validating cue identification experimentally (e.g. Lambrechts, Perret, Maistre, Perret, Maistre, & Blondel, 1999;Schaper et al., 2012;Schaper, Rueda, Sharp, Dawson, & Visser, 2011).

| How precise predictions are differs by method and shows a between-method association with accuracy
The predictive precision of methods varied from over 100 days (GDD method) to fewer than 7 days (PSR method). The very wide prediction intervals of the GDD method ( Figure 4 and Table S2), which were several times larger than those of the other methods, likely stem from the need to estimate uncertainty of three parameters for this method (start date, minimum threshold temperature and the cumulative GDD requirement). This is in contrast to two (intercept and slope) for the regression-based approaches. The PSR method was shown to be both the most precise and most accurate method. However, the precision in this method was not correctly quantified. This was demonstrated by the prediction intervals including the observed values only 58% of the time (Table 2). 58% is a much lower coverage than would be expected from a 95% prediction interval, even when allowing for the fact it will rarely cover exactly 95% (we would not expect coverage of <80% (Altman & Krzywinski, 2018)). This overestimation of precision in PSR could in part be influenced by the temporal trend in error as the error in PSR predictions for the middle years of our study was close to 0 ( Figure 3); if the temporal trend in PSR is corrected, estimation of how precise the method is could also improve. This highlights the importance of cross-validation and quantifying the accuracy and precision of predictions. Both the SWA and CSP methods had errors of approximately one week (across year standard deviation was 7 days).
They also had prediction intervals of roughly two (SWA) or three (CSP) weeks. These intervals seem to represent the precision well in these methods, when considered across the whole study period (Table 2).
These methods could provide rather uninformative predictions about phenology in this population, given that the precision of predictions is greater than twice the standard deviation of the data. A prediction interval of 15 days would imply that 95% of the observed mean lay dates were equally as plausible as the predicted estimate. As a result, little insight would be gained into the change in phenology over time.
When generating predictions, it is essential to consider which is most important for a particular question (accuracy or precision).
Achieving a balance between these different predictive measures is important. One step to achieving this, possibly the most important, is to quantify these metrics and to be aware of the limitations of the method being used to predict. The results of this study demonstrate that the accuracy and precision of predictions relating to seasonal phenology are influenced by the method used and the distance between predictions and the data used to generate them. For near-future predictions, all of the methods trialled here produced predictions with good accuracy but variable precision. If precision of predictions is not fully considered, this could create misleading predictions and misleading conclusions about the future of populations.
However, with the right consideration of how the underlying statistical methods operate and cross-validation of predictions, usable predictive outputs can be generated. In the case of our study system, accounting for the additional temporal trend that is not included in our models will also be essential to creating useable predictions.
All of the specific results here are generated from one study system and may not hold for other species or systems. However, the potential for variation in results generated by the different methods, shown here, is something that should be considered in all phenological analyses. This study is a step along a continual process of assessment and critique of the state of the art in ecological methods. As methodological approaches, computing power and data availability improve, we should remain aware of the limitations of the statistical tools we use, especially when applied to purposes beyond their original design. Here, we have provided an empirical illustration of how variable common cue identification methods can be in a predictive context. We hope that our recommendations below can act as best practice in the light of the fact that all populations will likely produce different results.
To create a theoretical understanding of the performance of these methods, simulation studies could be appealing. However, there are some difficulties with using small-scale simulation studies when looking at phenological cue identification. The main problem is that we do not yet know the underlying cue that drives phenology.
As a result, we have to assume a cue and a relationship between that cue and phenology in order to simulate data. As each cue identification method assumes a different underlying cue (e.g. temperature in a fixed window, temperature sum up to a threshold or temperature during the entire year), the cue chosen in order to simulate the data will inevitably have a strong influence on which method performs best. Extensive simulation studies would be a welcome next step to further investigating the performance of statistical cue identification tools. In the meantime, we recommend several key steps to follow when implementing cue identification methods on empirical data: • Previously identified cues from populations should be reassessed when new data become available. By doing this, the assumptions that cues are static across time and that the cues identified previously remain reliable can be avoided (Charmantier et al., 2008;Visser et al., 2006).
• The merits of different statistical cue identification methods should be considered when applying them. Our results suggest that-for this dataset-the PSR and SWA methods were the most accurate and precise, with SWA having the more accurately estimated precision.
• More robust evaluation techniques need to be implemented as standard. In particular, K-fold cross-validation of predictions from the method chosen should be conducted and accuracy and precision should be quantified and reported. This can be easily implemented in some R packages such as climwin (Bailey & De Pol, 2016;van de Pol et al., 2016).
• More flexible approaches allowing windows that vary in length and timing across years should be explored. These could act as more realistic alternatives to absolute and relative regressionbased approaches, which both have theoretical flaws. In order to improve our confidence in cue identification (including exploring biotic cues), research effort is required to teasing apart the causes of the temporal autocorrelation in prediction error.
Cue identification models are increasingly being used predictively (Morin et al., 2009;van de Pol et al., 2016;Roberts et al., 2015;Thackeray et al., 2016), and consequently, it is timely to assess of the accuracy of such predictions. Our results suggest some fundamental issues with our current toolkit, particularly the inadequacy of the relative sliding window method for identification of phenological cues.
They also demonstrate, through the temporal trend in predictive error, that our current tools either miss a key component of the cue-phenology relationship or the relationship is changing through time for some systems. Future phenological studies should challenge the idea of a static cue-phenology relationship and should cross-validate results across multiple time periods.

ACK N OWLED G EM ENTS
We are grateful to all of the Wytham fieldworkers who collected population census data on the Wytham great tits. This work was supported by NERC grant NE/K006274/1 to B.C.S. We would also like to thank Martijn van de Pol, Liam Bailey, Ally Phillimore and Adrian Roberts for their support in using their "climwin" package and the GDD and PSR methods, respectively, in addition to their helpful comments on this manuscript. All code will be available on GitHub, and we thank Ally Phillimore and Adrian Roberts for providing the base code for the GDD and PSR methods.

AUTH O R S ' CO NTR I B UTI O N S
E.G.S. conceived the ideas, designed the methodology, conducted the analyses and led the writing of this manuscript. E.G.S., E.F.C.
and B.C.S. have all contributed to data collection. E.F.C. and B.C.S.
provided detailed feedback on the methodological approaches and results throughout the analysis. All authors contributed critically to the drafts and gave final approval for publication.

DATA AVA I L A B I L I T Y S TAT E M E N T
The biological and environmental data, and the R code used for this study are available at the following GitHub repository: (https :// github.com/emily gsimm onds/Cue_Ident ifica tion) and archived in Zenodo: https ://doi.org/10.5281/zenodo.2838825 (Simmonds, 2019