Calibrating vegetation phenology from Sentinel-2 using eddy covariance, PhenoCam, and PEP725 networks across Europe

Vegetation phenology obtained from time series of remote sensing data is relevant for a range of ecological applications. The freely available Sentinel-2 imagery at a 10 m spatial resolution with a ~ 5-day repeat cycle provides an opportunity to map vegetation phenology at an unprecedented fine spatial scale. To facilitate the production of a Europe-wide Copernicus Land Monitoring Sentinel-2 based phenology dataset, we design and evaluate a framework based on a comprehensive set of ground observations, including eddy covariance gross primary production (GPP), PhenoCam green chromatic coordinate (GCC), and phenology phases from the PanEuropean Phenological database (PEP725). We test three vegetation indices (VI) — the normalized difference vegetation index (NDVI), the two-band enhanced vegetation index (EVI2), and the plant phenology index (PPI) — regarding their capability to track the seasonal trajectories of GPP and GCC and their performance in reflecting spatial variabilities of the corresponding GPP and GCC phenometrics, i.e., start of season (SOS) and end of season (EOS). We find that for GPP phenology, PPI performs the best, in particular for evergreen coniferous forest areas where the seasonal variations in leaf area are small and snow is prevalent during wintertime. Results are inconclusive for GCC phenology, for which no index is consistently better than the others. When comparing to PEP725 phenology phases, PPI and EVI2 perform better than NDVI regarding the spatial correlation and consistency (i.e., lower standard deviation). We also link VI phenometrics at various amplitude thresholds to the PEP725 phenophases and find that PPI SOS at 25% and PPI EOS at 15% provide the best matches with the ground-observed phenological stages. Finally, we demonstrate that applying bidirectional reflectance distribution function correction to Sentinel-2 reflectance is a step that can be excluded for phenology mapping in Europe.


A R T I C L E I N F O Editor: Jing M. Chen
Keywords: Sentinel-2 Vegetation phenology Plant phenology index (PPI) NDVI EVI2 Gross primary production (GPP) PhenoCam PEP725 Europe A B S T R A C T Vegetation phenology obtained from time series of remote sensing data is relevant for a range of ecological applications. The freely available Sentinel-2 imagery at a 10 m spatial resolution with a ~ 5-day repeat cycle provides an opportunity to map vegetation phenology at an unprecedented fine spatial scale. To facilitate the production of a Europe-wide Copernicus Land Monitoring Sentinel-2 based phenology dataset, we design and evaluate a framework based on a comprehensive set of ground observations, including eddy covariance gross primary production (GPP), PhenoCam green chromatic coordinate (GCC), and phenology phases from the Pan-European Phenological database (PEP725). We test three vegetation indices (VI) -the normalized difference vegetation index (NDVI), the two-band enhanced vegetation index (EVI2), and the plant phenology index (PPI) -regarding their capability to track the seasonal trajectories of GPP and GCC and their performance in reflecting spatial variabilities of the corresponding GPP and GCC phenometrics, i.e., start of season (SOS) and end of season (EOS). We find that for GPP phenology, PPI performs the best, in particular for evergreen coniferous forest areas where the seasonal variations in leaf area are small and snow is prevalent during wintertime. Results are inconclusive for GCC phenology, for which no index is consistently better than the others. When comparing to PEP725 phenology phases, PPI and EVI2 perform better than NDVI regarding the spatial correlation and consistency (i.e., lower standard deviation). We also link VI phenometrics at various amplitude thresholds to the PEP725 phenophases and find that PPI SOS at 25% and PPI EOS at 15% provide the best matches with the ground-observed phenological stages. Finally, we demonstrate that applying bidirectional reflectance distribution function correction to Sentinel-2 reflectance is a step that can be excluded for phenology mapping in Europe.

Introduction
Vegetation phenology is the timing of repeated biological events, e. g., leaf emergence, flowering, leaf coloration, and leaf fall (Lieth, 1974;Richardson et al., 2013). It is controlled by molecular mechanisms, related to the plant functioning and driven by cues like photoperiod and temperature (Singh et al., 2017), and recognized as a key indicator of terrestrial ecosystem functioning (Peñuelas and Filella, 2001). Time series of satellite imagery are capable of tracking seasonal foliage variations and phenological events across space and time (Buitenwerf et al., 2015;Cong et al., 2013;Malingreau, 1986;Melaas et al., 2018;Piao et al., 2019).
The mapping of vegetation phenology using satellite data started with the separation of near-infrared (NIR) and red reflectance bands in the Advanced Very High Resolution Radiometer (AVHRR) sensor in the early 1980s (Justice et al., 1985). The AVHRR sensor has since then enabled daily monitoring of the abundance of photosynthetically active vegetation (Tucker, 1979). The later MODerate resolution Imaging Spectroradiometer (MODIS) sensor improved vegetation monitoring capabilities with a spatial resolution of 250-500 m and a daily revisit period since the year 2000 (Justice et al., 2002). These two sensors have been widely used for regional to global scale phenology studies due to their long-term records, high temporal resolution, and ready-to-use vegetation index (VI) products, and they have greatly advanced our understanding of vegetation phenology and its responses to global change (Brown et al., 2010;de Beurs and Henebry, 2004;Jeong et al., 2011;Malingreau, 1986;Myneni et al., 1997;Piao et al., 2019;Tong et al., 2019;White et al., 1997;Wu et al., 2018;Zhang et al., 2004). Also other coarse-resolution sensors are being used for phenology mapping, e.g. the Visible Infrared Imaging Radiometer Suite (VIIRS), VEGETA-TION, and PROBA-V (Bórnez et al., 2020;Zhang et al., 2018b). To distinguish remotely sensed phenology from in-situ monitoring of specific plant phenophases (e.g., bud-burst, flowering, and leaf-fall), the term land surface phenology is often used (de Beurs and Henebry, 2004;Ganguly et al., 2010;Schwartz and Reed, 1999;White and Nemani, 2006).
With the coarse spatial resolution of these sensors (>250 m), pixels often cover plant communities with a mixture of species and individuals characterized by distinct seasonal patterns, leading to inconclusive biological interpretations for certain conditions (Archibald and Scholes, 2007;Helman, 2018;Zhang et al., 2017). Satellite data with high spatial and temporal resolution are therefore needed. With a 30 m resolution, Landsat has been used for mapping vegetation phenology in the US (Fisher et al., 2006;Melaas et al., 2018Melaas et al., , 2016. However, its 16-day revisit period along with the data gaps induced by cloud contamination results in often too sparse time series to obtain robust phenology metrics (Wulder et al., 2016). The Sentinel-2 sensors with a 10 m spatial and 5-day temporal resolution provide an opportunity for continentalscale high-resolution phenology mapping (Drusch et al., 2012), and have been successfully applied at local scales (Cai, 2019;d'Andrimont et al., 2020;Vrieling et al., 2018). A combined Landsat-8 and Sentinel-2 dataset (Claverie et al., 2018) was utilized in the development of the Multi-Source Land Imaging (MuSLI) 30-m land surface phenology dataset covering North America (Bolton et al., 2020). Recently, the European Environmental Agency (EEA) commissioned the development of an operational Sentinel-2 based High-Resolution Vegetation Phenology and Productivity (HR-VPP) dataset at 10 m resolution for Europe, as part of the Copernicus Land Monitoring Service (Copernicus, 2020). This study was conducted as part of that development.
Satellite observations of vegetation phenology need calibration and validation against ground observations. Plant growth stages collected by professional or volunteer observers in national and international phenological networks, such as the Pan European Phenology (PEP725) project (Menzel, 2000;Templ et al., 2018) are valuable for studying plant responses to climate variability and trends (Scheifinger et al., 2003). However, these data suffer from some limitations for remote sensing calibration/validation purposes, including uneven spatial distribution and coverage, often poor geolocation accuracy, and variability of prior experience of citizen scientists. In recent years, digital camera networks for phenology (PhenoCam) have developed, generating nearcontinuous data that are useful for satellite validation purposes (Richardson et al., , 2007. The Red-Green-Blue (RGB) imagery from these cameras is normally used for generating time-series of greenness indices, e.g. the green chromatic coordinate (GCC, Richardson et al., 2018). A third commonly used datatype is gross primary productivity (GPP) from eddy covariance measurements at flux tower sites. These measure the photosynthetic carbon uptake of the vegetation canopy, and have been used in a number of studies as proxy of vegetation phenology Melaas et al., 2013;Peng et al., 2017). Whereas manual phenology observations and PhenoCam data represent phenology as seen by the human eye, GPP relates more to the photosynthetic phenology that is controlled by both the plant canopy development and the light use efficiency (Medlyn, 1998). Thus, all these data reflect different aspects of vegetation phenological variations, and are useful and complementary for the calibration and validation of satellite-based phenology.
The first step in producing a Europe-wide phenology dataset is to select an optimal VI. The normalized difference vegetation index (NDVI, Tucker, 1979) has commonly been used as a proxy for vegetation productivity and for studying land surface phenology in the past decades. However, NDVI-derived phenology is uncertain for evergreen coniferous forest areas where the seasonal amplitudes are small, and for areas with snow cover where a strong NDVI increase is seen at the time of snowmelt (Delbart et al., 2005;Jin et al., 2017;Jönsson et al., 2010;Norris and Walker, 2020). The enhanced vegetation index (EVI) was found to track vegetation phenology better than NDVI in areas with both dense and sparse vegetation (Klosterman et al., 2014;Zhang et al., 2018a), but may still be problematic in snow covered areas (Huete et al., 2002;Jin and Eklundh, 2014). Bolton et al. (2020) showed the high capability of the two-band EVI (EVI2) in reflecting phenology of ecosystems with strong seasonality, but less so in evergreen ecosystems. As a solution to the problem of mapping phenology of evergreen coniferous forests and of snow-covered regions, Jin and Eklundh (2014) developed the plant phenology index (PPI). While PPI has potential for high-resolution phenology mapping across different ecosystems, it has primarily been tested in boreal regions based on MODIS data Karkauskaite et al., 2017). Thus, it is necessary to test the performance of PPI in more biomes and also with higher spatial resolution data. Besides the choice of VI, there are many other considerations in the derivation of land surface phenology from remotely sensed data, including methods for data smoothing, gap-filling, and derivation of the phenological metrics (Cai et al., 2017;de Beurs and Henebry, 2010;Liu et al., 2017;White et al., 2009).
The aim of this study is to assess the performance of NDVI, EVI2, and PPI, derived from Sentinel-2 data for Europe-wide phenology mapping across different land cover types, and link the satellite-derived phenometrics of start of season (SOS) and end of season (EOS) to groundobserved plant development stages (i.e., assign phenological meanings to the SOS and EOS metrics). Given the sensitivity of VIs to sun-sensor geometry (Jin and Eklundh, 2014;Tagesson et al., 2015) we also analyze the influence of bidirectional reflectance distribution function (BRDF) on the phenometrics. The overarching goal is to support the operational production of the Europe-wide Sentinel-2 phenology dataset through calibration with ground observation networks of eddy covariance, PhenoCam, and PEP725.

Methods
The overall framework of the calibration process includes seven steps ( Fig. 1): 1) Data collection and creation of time series of NDVI, EVI2, and PPI from April 2017 to March 2020. 2) Comparison of VI time series to GPP and GCC seasonality for 2018. 3) Fitting of regular functions to the noisy and irregular VI time series and extraction of SOS and EOS at various amplitude thresholds. 4) Evaluation of the spatial consistency between VI and GPP/GCC phenometrics. 5) Linking VI phenometrics to PEP725 phenophases to assign certain phenological meanings to the obtained SOS and EOS. 6) Analyzing variabilities in VI phenometrics along latitudinal and elevational transects. 7) Analyzing BRDF effects on VI phenometrics.

Sentinel-2 reflectance data processing
Sentinel-2 reflectance data were extracted from Google Earth Engine (GEE), which were Level-2A top of canopy reflectance (TOC) processed using the Sen2Cor tool by the Copernicus Scientific Data Hub (Doxani et al., 2018). These data include a scene classification (SCL) layer at 20 m resolution indicating the pixel status, based on which we filtered the Sentinel-2 data by removing pixels characterized by clouds, cloud shadows, cirrus, snow/ice, and water, that is, only retaining SCL values of 4 (soil) and 5 (vegetation). Topographic effects were corrected for in the Level-2A surface reflectance products ( Supplementary Fig. S1), whereas BRDF correction was not implemented. Despite the narrow field of view of Sentinel-2 (within ±10.3 • view zenith angle), the BRDF may need to be adjusted for to make the data consistent across space and time, especially for continental-scale studies covering large latitudinal gradients. To generate nadir BRDF-adjusted reflectance (NBAR) for Sentinel-2 reflectance, we used the c-factor BRDF normalization approach utilizing the MODIS MCD43A1 BRDF/Albedo products (Roy et al., 2008). The MODIS reflectance at a certain wavelength, location, and time ρ modis (θ v , θ s , φ) for viewing zenith angle θ v , solar zenith angle θ s , and relative azimuth angle φ was modeled as: where k vol (θ v , θ s , φ) and k geo (θ v , θ s , φ) are the volumetric scattering and geometric-optical model kernels, defined by the Ross-Thick (Ross, 1981) and Li-Sparse-reciprocal (Li and Strahler, 1992) equations, respectively; while f iso , f vol , and f geo are the isotropic, volumetric, and geometric model weighting parameters, respectively, provided in the MCD43A1 product. The band dependent c-factor for calculating Sentinel NBAR at the local noon solar zenith angle (θ s noon ) and the sensor nadir viewing direction (θ v = 0) were determined as: Then the Sentinel-2 NBAR for NIR and red bands were obtained by multiplying the Level-2A surface reflectance with their respective cfactor.
Based on the MODIS BRDF/Albedo parameters product, Roy et al. (2016) generalized a set of constant f iso , f vol , and f geo values to normalize Landsat reflectance using the c-factor approach and applied them to Sentinel-2A reflectance (Roy et al., 2017). These constant values (f iso = 0.1690, f vol = 0.0574, and f geo = 0.0227 for the red band, and f iso = 0.3093, f vol = 0.1535, and f geo = 0.0330 for the NIR band) were used to fill the MCD43A1 data gaps to generate spatiotemporally consistent Sentinel-2 NBAR data in this study.

Ground data processing and Sentinel-2 data extraction
Ground measurements of eddy covariance GPP, PhenoCam GCC, and PEP725 phenophases were distributed over Europe (Fig. 2) covering the period 2018-2019, representing the major land cover types of deciduous broad-leaved forest, evergreen coniferous forest, mixed forest, cropland, grassland, and wetland.
The GPP data were obtained from the Drought2018 project (https:// www.icos-cp.eu/data-products/YVR0-4898), which includes eddy covariance sites across Europe, providing half-hourly net ecosystem exchange measurements and GPP estimations for 1989-2018. The GPP database contained 52 sites (accessed in September 2020), of which 49 were located in the study area ( Fig. 2 and Supplementary Table S1) and only used data from 2018 for comparison with Sentinel-2 data. The land cover at each flux tower was determined based on the CORINE land cover map for 2018 (https://land.copernicus.eu/pan-european/corine -land-cover/clc2018) and was grouped into six categories: broadleaved forest (code ID 311), coniferous forest (code ID 312), mixed forest (code ID 313 and 324), agriculture (code ID 211, 212, 221, 243, and 244), grassland (code ID 231 and 321), and wetland (code ID 411 and 412). We aggregated the half-hourly GPP values to daily values by averaging, as has been done for the FLUXNET2015 dataset. To extract the Sentinel-2 pixels matching the eddy covariance data, we used a buffer of 100 m radius around the flux tower and computed the mean values of NIR and red reflectance. This is a reasonable size matching the eddy covariance surface source areas; in a recent study it was shown that due to the homogeneity of most flux tower sites, the effect of varying the size of the sampling area is limited when staying below 500 m .
The GCC data were calculated from the RGB camera images using the ratio of the green channel digital numbers to the total brightness of the image as described in Hufkens et al. (2018) and Richardson et al. (2018). The original half-hour GCC time series were aggregated to 3 days by calculating the 90th percentiles to reduce adverse illumination effects caused by atmospheric influences such as rain, snow, and fog. Due to varying camera settings, the quality of GCC time series at times varied widely and thus required a post-hoc assessment, through which we excluded 19 sites (7 with large gaps and 12 with no seasonal patterns) for the comparison analysis. Given the high resolution of Sentinel-2 data, we did not use the camera location but visually georeferenced the location of each region of interest (ROI, i.e., the parts in the image with studied vegetation canopies; there can be multiple ROIs in one image for heterogeneous regions). The spatial coverage of each ROI was only about one to a few tree crowns, so we used one Sentinel-2 pixel covering the ROI location for the comparison analysis. For sites with contrasting vegetation types (e.g., a sparse evergreen canopy with grass understory), uncertainties in the ROI georeferencing process would introduce a mismatch between the GCC and VI time series, and therefore we excluded 6 sites after visual checking. As a result, among all the 57 PhenoCam sites in Europe, 32 sites with 49 site-year observations during 2018-2019 were used in the analysis ( Fig. 2 Table  S2).

and Supplementary
The PEP725 database contains millions of plant phenological records across Europe (www.pep725.eu), including several plant species and growth stages, collected by both European professionals and volunteers (Templ et al., 2018). Currently, 1321 sites are available in the database for 2018-2019 (Fig. 2). Each site contains phenological records of tens to hundreds of individual plants, but without precise geographical locations for individual records. The coordinates provided for each site roughly mark the locations of phenological records for that site at a precision of 2-4 km. To link the Sentinel-2 SOS/EOS and the PEP725 records, we created a 4 km buffer around each site, and classified the corresponding Sentinel-2 pixels into three vegetation groups, i.e., deciduous broad-leaved tree, evergreen coniferous tree, and meadow, according to the plant genus information in the PEP725 database and the CORINE land cover map (Table 1). The average time series of Sentinel-2 reflectance bands were then extracted from GEE for each vegetation group. Pixels that did not belong to any of the selected CORINE land Fig. 2. Locations of the ground data, including the eddy covariance GPP, PhenoCam GCC, and PEP725 phenology phases in the calibration process, the randomly selected points for evaluating the impacts of the BRDF, and the two transects in northern and southern Europe used for evaluation purpose. The northern transect contains 3000 individual Sentinel-2 pixels at every 500 m step and the southern transect contains 1500 individual pixels at every 100 m step. The number of phenology records for each genus is shown in brackets.
cover types were excluded from the analysis. An example of buffer areas and the seasonal curves of different vegetation groups is illustrated in Fig. 3. Information on crops in PEP725 was either summer crop, winter crop, or different species during summer/winter, characterized by distinct seasonal patterns that cannot be separated from the CORINE land cover map, and crops were thus excluded from further analysis.

VI calculation
Based on the Sentinel-2 NIR and red reflectance data (both NBAR and TOC), we calculated the NDVI, EVI2, and PPI values using the following equations: where DVI is the difference between NIR and red reflectance: DVI max is the pixel-specific maximum DVI, estimated as the 95% upper quantile from the multi-year time series of DVI values, and DVI soil is an estimate of the DVI of soil background (an empirical value of 0.09 is used in this study and further discussed in Section 4.2). K is the canopy light extinction efficiency, including both light absorption and scattering, computed as: where d c is an instantaneous diffuse fraction of solar radiation at sun zenith angle θ s (obtained from the scene metadata), computed as:

Comparing VI and GPP/GCC seasonal trajectories
We assessed the capability of each VI for tracking vegetation seasonal dynamics in a range of land cover types by comparing the VI values to the GPP and GCC time series. We first used the locally estimated scatterplot smoothing (LOESS, span = 0.2) method to identify and remove residual outliers in each of the VI time series with >1.5 S.D. from the smoothed curve. This process was mainly for removing single unrealistic high PPI values in the growing season (see discussion Section 4.2) and was also useful for removing parts of the residual outliers missed by the SCL filtering. To eliminate the high-frequency day-to-day variations in GPP and GCC data, we used TIMESAT (Section 2.5) to obtain smoothed GPP and GCC seasonality. Then, we compared the irregular raw VI values with the smoothed GPP/GCC curves from site level to land cover level, and further to continental level.

Curve fitting and SOS/EOS extraction
To reconstruct continuous seasonal trajectories from the irregular and noisy Sentinel-2 VI time series, we used a newly developed method with box constrained fits to a superposition of double-logistic model functions (Jönsson et al., 2018). The double-logistic model function was chosen because of its proven robustness and ability to handle large data gaps, e.g., during winter season in the boreal regions (Cai et al., 2017;Fisher et al., 2006;Zhang et al., 2003). The fits are stabilized by a base value, and pixels identified as cloudy are given a weight of zero. Before the double-logistic fitting, the algorithm identifies single or double growing seasons following an initial spline fitting to the original data to enable modeling of irregular seasonality, e.g. shifts from single to double-cropping patterns. This method has been implemented in the new version 4.1 of the TIMESAT software package (Eklundh and Jönsson, 2015;Jönsson and Eklundh, 2004). There are many methods to identify SOS and EOS metrics, which could be broadly grouped into the threshold category and the inflection point category (de Beurs and Henebry, 2004;Jönsson and Eklundh, 2002;White et al., 2009;Zhang et al., 2003). Whereas the inflection point category uses a fixed definition for determining SOS/EOS dates (the mid-point of the fitted function, though other mathematical parameters could also be chosen), the threshold category allows choosing any fraction of the seasonal amplitude from the base level for defining the SOS/EOS dates, and is a method that has been widely used in vegetation phenology studies. The threshold values can be optimized to certain locations and/or land cover types to define SOS and EOS that can be linked with specific phenophases of interest. However, in a continental application, it is advantageous to use a common threshold for all areas. Here we tested a range of threshold values from 5% to 50% by a step of 5% with the primary aim to find single optimal values for SOS and EOS to use at the continental scale.

Assessing spatial consistency between VI and GPP/GCC phenometrics
We also calculated GPP and GCC phenometrics at varying thresholds from 5% to 50% by a step of 5%, which were used as references for analyzing how well the VI-derived SOS/EOS reflect the spatial variability in the GPP and GCC-derived SOS/EOS across all the sites. The performance of each VI and threshold was quantified by the mean absolute bias and Pearson correlation coefficient (R). More than one SOS/ EOS value were obtained for some site-years due to double growing seasons, drought impacts, and VI sensitivities to background disturbances not successfully removed by SCL filtering (Supplementary Figs. S2 and S3). For these site-years, we only used one pair of phenometrics with a minimum bias between VI and GPP/GCC for each site each year to facilitate the comparison analysis.

Assigning phenological meanings to the VI phenometrics
To explore the phenological meaning of the VI-derived SOS and EOS metrics, i.e. their translation into events that indicate certain development stages in the phenological cycle, we calculated the mean bias (and standard deviation) between the PEP725 phenophases and the VI phenometrics at varying thresholds. Many different phases were recorded in the PEP725 database (www.pep725.eu/pep725_phase.php), from which we selected the six most relevant ones for comparison with the Sentinel-2 SOS and EOS metrics (Table 2). For each site and each phase in the PEP725 database, we calculated the mean value of the different records that belong to the same vegetation group (Table) to obtain a unique value for comparison with the Sentinel-2 SOS/EOS. Given the various limitations of using PEP725 database for calibrating/validating satellite data, we also checked the spatial correlation between VI phenometrics and PEP725 phenophases as a way to verify the robustness of our method.

Time series of VI, GPP, and GCC
Seasonal patterns of the VIs show distinctly different temporal patterns, similarly to the differences between GPP and GCC (see example in Fig. 4 from the DE-HoH site located in central Germany; note that GPP and GCC observations are derived from different pixels), which can be attributed to their different functional meanings (GPP expressing productivity vs. GCC expressing canopy colors). The GPP curves show distinct seasonal patterns with a peak in the beginning of the growing season, and a continuous decrease thereafter, whereas the GCC curves are characterized by large changes in the beginning and the end of the growing season, but a plateau in-between (Fig. 4). Visually, PPI is closer to the GPP curve, NDVI is closer to the GCC curve, whereas EVI2 is inbetween (Fig. 4). It is noticeable that NDVI and EVI2 show some high values in the winter season for the GPP buffer area, whereas PPI is as flat as the GPP observations. Time series of all the flux and PhenoCam sites are plotted in Supplementary Figs. S2 and S3 respectively, and these data provide the basis for all the following comparisons and analyses.

Performance of VI in tracking GPP/GCC seasonal variations
PPI performs the best in tracking the GPP seasonal dynamics, with less variation across sites within each land cover type (Fig. 5a), and higher correlation coefficients (Fig. 5b-c), particularly for the evergreen coniferous forest sites. The performance of EVI2 is for some comparisons as high or higher than PPI, but the range between low and high correlations is generally larger and lacks consistency across land cover types. NDVI shows the lowest correlation coefficients for all the land cover types, particularly for the evergreen coniferous forests, where the NDVI and GPP correlation is negative. For the comparison with the PhenoCam GCC observations, none of the VIs shows a distinctly better performance than the others for various land cover types ( Fig. 5d and f), but EVI2 shows the highest correlation with GCC for all the sites together (R = 0.65), closely followed by NDVI (R = 0.63), and PPI performing the worst (R = 0.56, Fig. 5e).

Spatial consistency between VI and GPP/GCC phenometrics
Comparisons between VI and GPP/GCC phenometrics were carried out for all the nested pairs of thresholds, and is shown as a set of correlations and biases (supplementary Fig. S5). The GPP/GCC threshold with the minimum bias as compared to each VI threshold was selected to harmonize the different levels of shifts between seasonal curves of VI and GPP/GCC across land cover types (Fig. 6). As seen, PPI consistently Agriculture (14) Grassland (4) Wetland (5) Deciduous forest (6) Evergreen forest (15 (5) Grassland (8) Wetland (5) Deciduous forest (7) Evergreen forest (7) Mixed forest ( shows the lowest bias and highest R value when comparing to both GPP SOS and GPP EOS (Fig. 6a). As for the comparisons with GCC SOS, all the VIs have comparable performances regarding the spatial correlation coefficients at all tested amplitude thresholds (Fig. 6b), while PPI has lower bias at thresholds below 30% and EVI2 has lower bias above the 30% threshold. For GCC EOS, NDVI consistently shows the best performance with the highest R values and lowest bias (Fig. 6b). The plots are useful for selecting the optimal VI thresholds and for providing information about the sensitivity of the relationships to the choice of threshold values. In most cases the variation in bias and R is low when changing the VI threshold value, but in a few cases the choice of a wrong threshold value can have a negative impact on the accuracy, e.g., R values for GCC EOS, which drop clearly from a threshold of 10% to 15% for PPI and EVI2.
For further illustrating the relationships, we selected the VI thresholds with the highest and lowest accuracy for each comparison to GPP and GCC phenometrics (Fig. 7). The highest and lowest accuracy thresholds were defined based on the ratio between the bias and R values so that the bias value was minimized and the R value was maximized. The similarities between these pairwise plots show that relationships are quite robust to the selection of threshold values.

Comparison between VI phenometrics and PEP725 phenophases
Most of the paired phases show R values close to zero (Fig. 8a), which could be attributed to 1) the lack of accurate locations of the PEP725 observations, 2) the small spatial dynamic ranges of the PEP725 observations used in this study (mainly in Germany), and 3) uncertainties originated from the subjective interpretation of various observers. However, there are clear positive correlations for the leaf unfolded 50% phase, particularly for the EVI2 and PPI phenometrics, which have R values of 0.61 and 0.41-0.46 at VI threshold 50% for deciduous broadleaved trees and evergreen coniferous trees, respectively (Fig. 8b). The leaf unfolded phase for evergreen coniferous trees also have relatively high R values (0.52-0.63), although the number of observations is low (only 12, Fig. 8a). These results support the method we employed to compare PEP725 phenophases and Sentinel-2 VI phenometrics.
The optimal VI thresholds matching the best with PEP725 phenophases (both the mean bias and standard deviation should be as low as possible) are different among paired phases and VIs (Fig. 9). To find an overall spring phenology threshold value that is usable across all three vegetation types, we calculated the mean bias and standard deviation values weighted by the number of observations for each paired phase (column-wise in Fig. 9a). The weighted spring mean biases are smallest at threshold 40% for NDVI (5 ± 25 days) and EVI2 (6 ± 14 days) and at threshold 25% for PPI (6 ± 12 days). For these best overall thresholds we checked which of the spring phenophases had the smallest bias for each vegetation type: Deciduous broad-leaved trees matches the best with the leaf unfolded phase for all the three VIs (NDVI: − 1 ± 12 days, EVI2: − 1 ± 10 days, and PPI: − 1 ± 8 days); evergreen coniferous trees matches the best with the leaf unfolded 50% phase for NDVI (6 ± 29 days), and with the leaf emerged phase for EVI2 (8 ± 12 days) and PPI (11 ± 9 days); and the meadow fresh green 25% phase have a match with NDVI of − 5 ± 41 days, with EVI2 of − 10 ± 22 days, and with PPI of − 5 ± 20 days (Fig. 9a). The autumn phases only contain deciduous broad-leaved trees, with the best match at 50% threshold for NDVI for the leaf fallen 50% phase (− 12 ± 28 days), at 30% threshold for EVI2 comparing to the leaf fallen 50% phase (− 1 ± 20), and at 15% threshold for PPI for the leaf coloration 50% phase (− 4 ± 19 days) (Fig. 9b).

Latitudinal and elevational variability in SOS and EOS
For studying how the estimated SOS and EOS parameters vary across latitudinal and elevational gradients we used the VI SOS/EOS extracted at amplitude threshold that best matched with the PEP725 phases (Section 3.4). Two transects were selected, one located in northern Europe ranging from N58.4 • to N70.4 • , and one located in southern Europe depicting steeper elevation variations across a mountain region (Fig. 2). The latitudinal and elevational variabilities in SOS and EOS along these two transects are shown in Fig. 10. Overall, EVI2 and PPI are highly consistent with each other and follow the latitudinal and elevational fluctuations well, particularly for SOS. NDVI aligns with EVI2 and PPI for parts of the transects, e.g., SOS from N42.6 • to N43.1 • for the southern transect, but shows a different pattern as compared to EVI2 and PPI for other parts of the transects, e.g., SOS and EOS from N42.1 • to N42.6 • for the southern transect and EOS for the northern transect.

Impact of BRDF
In addition to using VIs calculated from Sentinel-2 NBAR NIR and red reflectance for the analyses (Sections 3.2-3.5), we also performed all the analyses based on VIs calculated from the non-BRDF corrected TOC data (Supplementary Figs. S4 and S6-S11). The TOC-based results are almost the same as the NBAR-based results for all the comparisons with GPP, GCC, and PEP725 metrics, except for GPP EOS where NDVI shows a worse performance and EVI2 shows a better performance for TOC-based results as compared to the NBAR-based results ( Fig. 6 and Supplementary Fig. S7). To assess the BRDF impacts on the phenological parameters, we plotted SOS and EOS for the 10,000 pixels across Europe calculated from NBAR and TOC reflectance (Fig. 2). The SOS/EOS derived from NBAR and TOC inputs for the three VIs show high consistency, with R and slope values both around 0.94 (Fig. 11). Overall, we found that using TOC rather than NBAR-corrected reflectance would lead to the same conclusions regarding choice of VI and threshold values.

Performance of VIs and thresholds for deriving phenometrics
In this study, we used the spatial correlation analysis primarily as a way of selecting the most suitable VI for Sentinel-2 phenology mapping. For the comparison to GPP and GCC data, we obtained R values >0.8 for the best VIs (Fig. 6), which are higher than those derived from MODIS VIs (<0.7) as reported by Liu and Wu (2020), although from different sites, years, and methods. The higher spatial resolution of Sentinel-2 imagery could contribute to these stronger relationships, although Cai et al. (2021) did not report stronger relationships with eddy covariancebased GPP for Sentinel-2 than for MODIS in Scandinavia. We attribute the stronger relationships with GPP seasonality and phenology parameters in our study to the use of PPI. The remaining unexplained variability between the calibration data and the remotely sensed data could be related to several factors, including the limited ability of satellite signals to pick up phenological variability at ground level, data processing methods, inaccuracy of ground observations, merging of different land cover classes, and the functional relationship (e.g. linearity) between the calibration and the satellite data.
The use of either GPP or GCC as a reference to evaluate the VI performances in tracking vegetation seasonal variations makes a difference due to their different functional meanings (GPP expressing photosynthetic productivity vs. GCC expressing canopy colors). PPI is best aligned with GPP seasonal variations for all the land cover classes studied ( Fig. 5c). This temporal alignment between PPI and GPP results in the highest spatial consistency between their phenometrics (smallest bias and largest R values for both SOS and EOS, Fig. 6). For the GCC trajectories, none of the three VIs shows the highest correlations across all the land cover classes, but EVI2 shows the overall best performance (Fig. 5). Regarding the comparison with GCC phenometrics, all three VIs show comparable correlations for GCC SOS, while biases are lowest for PPI and EVI2. For GCC EOS, NDVI outperforms PPI and EVI2 (Fig. 6). The strong relationship between NDVI and GCC EOS is attributable to their similar sensitivity to leaf pigmentation Luo et al., 2018). PPI is near-linearly related to green LAI by definition (Jin and Eklundh, 2014), thus showing more variability than the other indices during the growing season, and strong relationship with GPP.
The VIs' performances are determined not only by their responses to vegetation changes (e.g., leaf area and leaf color), but also by their sensitivity to background noise (e.g., snow cover and soil color). As seen in the supplementary Fig. S2 and S3, both NDVI and EVI2 showed outliers in the snowy seasons, particularly for the coniferous forest sites ( Fig. S2b and S3b), although we applied quality control based on the SCL information. The outliers in snowy seasons can sometimes introduce false identification of growing season, which in the current analysis was eliminated by retaining only one SOS/EOS for each year to facilitate the comparison with ground calibration data. However, these outliers can complicate the operational production of the Sentinel-2 phenology dataset, given the imperfect accuracy of the state-of-the-art masking algorithms for Sentinel-2 imagery (Zekoll et al., 2021). In contrast, PPI is insensitive to snow cover (Supplementary Fig. S2 and S3), which is an advantage for phenology mapping in the northern hemisphere.
The spatial correlations of SOS/EOS derived from VIs and GPP/GCC are very similar across thresholds varying from 5% to 50%, and are relatively high (~ 0.8,Figs. 6 and 7). Therefore, it is acceptable to use a uniform threshold to quantify SOS/EOS across different land cover types in Europe, which is important for operational production of a continental-scale dataset. It also indicates that as long as simple functions are used for modeling the growing seasons, the selection criteria for SOS and EOS (threshold, derivatives, inflection points etc.) generate relatively equal strength of relationships, but mainly affect the biases against phenological phases. Therefore, selecting a threshold that best matches with certain PEP725 phenophases is necessary for providing a phenological meaning to the mathematically defined VI phenometrics.
Although the PEP725 observations lack precise geographical locations and are mainly distributed in Germany, leading to generally weak correlations with ground data, we managed to link the threshold-based VI SOS/EOS with phenological events at the land cover type level. For certain phenophases (e.g., leaf unfolded 50%), we obtained significant spatial correlations between PEP725 and VI phenometrics with R values around 0.5 (Fig. 8), similar to those reported by Kowalski et al. (2020), thus verifying the feasibility of the methodology. Based on the statistics between paired PEP725 and VI phases, we found that the phenological biases were minimized for SOS at NDVI 40%, at EVI2 40%, and at PPI 25% and for EOS at NDVI 50%, at EVI2 40%, and at PPI 15%, with a generally smaller standard deviation (i.e., higher consistency) for PPI than for EVI2 and NDVI (Fig. 9). EVI2 and PPI phenometrics derived from these optimal thresholds are visually consistent with latitudinal and elevational variations along two transects (Fig. 10), suggesting their capability of reflecting spatial variabilities in vegetation phenology.

PPI calculation and BRDF correction
PPI is capable of clearly reflecting phenology cycles for dense canopies, like in boreal forests, due to the usage of a logarithmic design and sun-zenith-angle-related canopy light extinction efficiency (Jin and Eklundh, 2014). However, the logarithm comes with a cost of being highly sensitive to variability in the differences between NIR and red reflectance during the peak of the growing season, which can have an impact on the calculation of phenometrics via influencing the seasonal amplitude. This problem can be largely solved by outlier filtering. A disadvantage of PPI as compared to NDVI and EVI2 is its complexity of calculation, particularly the calculation of DVI max for each pixel, which can be time-consuming at the continental-scale. Moreover, the DVI max parameter is in principle a pixel-dependent theoretical value, which can be estimated from multi-year DVI observations (Jin and Eklundh, 2014). Given the currently short time series of Sentinel-2 data, the DVI max parameter needs to be updated when more years of observations are available. Abdi et al. (2019) found that PPI showed a better performance for estimating GPP than other indices for African semi-arid ecosystems. However, PPI values appeared to be less reliable in the sparsely vegetated areas in Tibetan Plateau, where there is a large proportion of bare soil (Huang et al., 2021). This issue is likely attributable to the use of a constant DVI soil value of 0.09, which is an empirical estimation from the soil reflectance spectral database (speclib.jpl.nasa.gov) and the field measurements by Huete et al. (1984). A promising solution to this issue is to introduce a scene-based soil reflectance parameter and is currently under investigation.
Our results show a small difference in the derived SOS/EOS values between Sentinel-2 NBAR and TOC reflectance. Also, it should be noted that the correction of Sentinel-2 BRDF effects based on MODIS  (Table 2) for different vegetation groups (Table 1) at varying VI amplitude thresholds from 5% to 50%. The mean bias and standard deviation values are labeled as black (upper) and grey (lower) digits respectively for each comparison. The spring phases with the best match for each vegetation group at the optimal threshold according to the weighted mean values is marked by a solid black dot. The number of sites containing each paired PEP725 and Sentinel-2 phase is labeled on the y-axis.
parameters is not free from uncertainties, particularly for heterogeneous areas due to the scale difference between them (10 m vs. 500 m resolutions). Furthermore, using an external dynamic data source for the operational production of Sentinel-2 phenology dataset is impractical, considering the possibility of future deprecation of MODIS and the extra computation load. Therefore, based on our results (Section 3.6), applying BRDF correction to Sentinel-2 is a step that can be excluded for phenology mapping in Europe.

Implications and perspectives
Landsat was often used in combination with Sentinel-2 to gain more time series observations for phenology mapping, such as the MuSLI dataset for North America (Bolton et al., 2020) and the study by Kowalski et al. (2020) in Germany. While the data combination is beneficial for Landsat-based phenology mapping with long historical records back to the 1980s, it sacrifices the higher spatial resolution (10 m) of Sentinel-2, which may be crucial for certain applications like forest stand species mapping in heterogeneous areas (Grabska et al., 2020). Moreover, synthetic aperture radar (SAR) data from Sentinel-1 was proposed to integrate with Sentinel-2 for crop phenology mapping (Mercier et al., 2020). SAR provides more cloud-free observations and information on vegetation structural changes, which may be particularly helpful for near-real-time crop monitoring. But our preliminary analyses (not shown here) indicate that the SAR time series from Sentinel-1 is highly noisy for evergreen coniferous forests providing little information on phenology. With a proper selection of VI and the use of robust time series fitting methods, our results suggest that Sentinel-2 alone is capable to map vegetation phenology across a range of land cover types in Europe. Finally, with the forthcoming release of the Sentinel-2 HR-VPP dataset it will be of great importance to carry out wall-to-wall comparisons with existing datasets, such as the MODIS phenology product (Friedl et al., 2019).

Conclusions
In summary, PPI is shown to be the index with strongest relationships to photosynthetic phenology as represented by GPP, and it has the highest robustness against background noise in snowy conditions. For greenness phenology (as represented by GCC), no clear conclusion can be drawn, since different VIs perform best in different parts of the analysis. However, when considering all the analyses performed in the study, if only one VI is going to be used for Sentinel-2 phenology mapping across Europe we recommend using PPI. The PPI amplitude thresholds with best overall relationships with ground-observed PEP725 phenophases are 25% for SOS and 15% for EOS. It should be noted that all the ground datasets have limitations in geographical extent, and for regional and local applications or for adaptations to specific vegetation types or species, fine-tuning of the thresholds is probably necessary. Based on the results of this study we conclude that in the era of full Sentinel-2 operation with two satellites, we can accurately monitor continental-scale phenology at a 10 m spatial resolution.  (Fig. 2). The amplitude thresholds used are 40% for NDVI/EVI2 SOS, 25% for PPI SOS, 50% for NDVI EOS, 45% for EVI2 EOS, and 15% for PPI EOS. The lines are the latitudinal trends of the per-pixel values using the LOESS smoothing method (span = 0.1) and the shaded areas are the standard errors.

Declaration of Competing Interest
The authors declare no competing interest.  Fig. 11. Comparison of (a) SOS and (b) EOS values at amplitude threshold of 25% between NBAR VI and TOC (without BRDF correction) VI for all the 10,000 randomly selected pixels during 2018-2019.