Evaluating the impact of land surface on medium‐range weather forecasts using screen‐level analyses

In this study screen‐level analyses of air temperature and humidity are used to objectively evaluate the impact of a new land‐surface package being considered for implementation in Environment and Climate Change Canada (ECCC)'s medium‐range global deterministic numerical weather prediction (NWP) system. Through its control of heat, moisture, and momentum fluxes to the atmosphere, the land surface has a substantial impact on near‐surface meteorology and on the atmospheric boundary layer. The approach examined in this study is based on the comparison between model forecasts and screen‐level analyses. It demonstrates the impact on medium‐range NWP of a new land‐surface package that includes (i) a new set of databases to specify soils and land‐cover characteristics, (ii) improved land‐surface initial conditions obtained by the assimilation of space‐based remote‐sensing observations, and (iii) a more sophisticated scheme for land‐surface modelling. The evaluation method is shown to provide useful information on the impact of the new land‐surface package, including lead‐time‐averaged difference maps as well as plots showing the evolution with lead time of the standard deviation of errors (STDE) and of the temporal correlation between forecasts and analyses. The new land‐surface package has a positive impact on near‐surface forecasts of air temperature and humidity for a summertime period, with smaller STDE and larger temporal correlation for both variables. The improvement is greater for humidity than for air temperature. The maximum impact is found around seven‐day lead time, with substantial gains in absolute and relative values for STDE and temporal correlation. The positive impact is also quantified in terms of prediction hours, with gains of about one day at the medium range. Details of the pros and cons for this objective evaluation approach are discussed.


INTRODUCTION
The crucial role and importance of land-surface processes in numerical weather prediction (NWP) systems is well recognized at meteorological and environmental prediction centres (e.g., Zheng et al., 2018;Crow et al., 2020;Boussetta et al., 2021). Land surface models (LSMs), along with the land data-assimilation systems (LDAS) that are used for their initialization, represent the time evolution and spatial variability of land-surface variables related to temperature, soil moisture, snow, and vegetation. In addition to having a direct impact on near-surface meteorology (e.g., screen-level air temperature, humidity, winds), LSMs also provide the lower-boundary conditions for atmospheric models' turbulent diffusion schemes (i.e., heat, moisture, and momentum fluxes) which to a great extent determine the evolution of the planetary boundary layer (PBL) and which have a certain influence on clouds and precipitation processes (e.g., Koster et al., 2003;Ek and Holtslag, 2004;Santanello et al., 2011). The impact of land-surface processes is felt through events closely experienced by humans, not only weather-related but also for other operational services with high impact on society, for example, for agricultural production, droughts, floods, air quality, and atmospheric dispersion (e.g., Ren et al., 2018;Campbell et al., 2019;He et al., 2020).
A vast literature exists concerning the modelling and data-assimilation components of the land surface. On the modelling side, a recent review by Blyth et al. (2021) discusses the evolution, recent progress, and future directions of LSMs. They review the representation of the main physical processes in LSMs, with emphasis on problems linked to internal exchanges between these processes and to external exchanges with other systems (e.g., atmosphere, subsurface, catchment, landscape). The physical aspects they discuss include surface and canopy processes, snow and soil physics, water bodies, vegetation physiology, soil biogeochemistry, and vegetation dynamics.
The scale of the effort and investment given by national centres to develop and improve their land-surface modelling component is well illustrated by Boussetta et al. (2021), with their description of ECLand, the land-surface modelling system at the European Centre for Medium-Range Weather Forecasts (ECMWF). Other recent studies to be noted include those of Druel et al. (2021), Li et al. (2021), and Takata and Hanasaki (2021).
On the data-assimilation side, Xia et al. (2019) and Kumar et al. (2022) provide recent reviews with extensive literature surveys. They highlight the current challenges related to LDAS inputs, that is, surface observations, retrievals from space-based observations, atmospheric forcing (most notably precipitation), and surface databases for soils and vegetation. The other significant challenges they present are related to land-surface modelling (including parameter estimation) necessary to provide a first guess in LDAS, and to assimilation methods, calling attention to spatial scale incompatibility between the several sources of information put together to build land-surface analyses, and to the coupling between LDAS and atmospheric data-assimilation systems. Kumar et al. (2022) is of particular interest since the agenda they lay out for LDAS considers gaps specific to NWP.
A few examples of recent innovations and developments on major LDAS include: the assimilation by Mucia et al. (2022) of passive microwave vegetation optical depth in Meteo-France's LDAS-Monde (Albergel et al., 2018) which led to improved evapotranspiration over the United States; the inclusion by Sassi et al. (2023) of land-surface temperature retrievals from radiances measured by the Spinning Enhanced Visible and InfraRed Imager (SEVIRI) in the AROME-France model LDAS which had a positive impact on short-range night-time forecasts at the surface and in the PBL; the development by Fairbairn et al. (2019) of a stand-alone version of ECMWF's LDAS (de Rosnay et al., 2014) which can be useful for development and testing; and the application by Kumar et al. (2019) of the US National Aeronautics and Space Administration (NASA)'s Land Information System (LIS, see Kumar et al., 2006) to provide information that could benefit the USA National Climate Assessment.
Similar efforts to improve land-surface aspects in its numerical environmental prediction systems have also recently intensified at ECCC. In the last few years, two different versions of the Canadian Land Data-Assimilation System (CaLDAS) have been operationally implemented at ECCC's Canadian Centre for Meteorological and Environmental Prediction (CCMEP). One is based on the assimilation of screen-level and surface observations for the national deterministic short-range km-scale NWP system (Milbrandt et al., 2016) and the other assimilates space-based remote-sensing data as part of the experimental national deterministic and ensemble surface and river prediction systems Garnaud et al., 2021). ECCC's recently developed Soil, Vegetation, and Snow (SVS) scheme Husain et al., 2016) has also been included in the deterministic and ensemble systems for hydrologic prediction.
In the present study, changes to the land surface have been tested in the context of ECCC's global deterministic NWP prediction system (GDPS). Emphasis here is on the objective evaluation of the impact these land-surface changes have on medium-range weather forecasts (i.e., 4-6 day forecasts in this study) which is the primary mandate of the GDPS. The land-surface modifications include more realistic two-dimensional fields to map soils and vegetation characteristics and related parameters, the SVS scheme for land-surface modelling, and initialization of land-surface variables with the satellite version of CaLDAS, that is, one which does not use screen-level observations but rather assimilates space-based L-band radiometric observations for soil moisture and surface-temperature retrievals from a series of polar orbiting thermal infrared sensors.
Due to the nature of the changes made to GDPS in this study, the overall project focuses on the meteorological impact near the land surface. The land-surface modifications also have an effect on the PBL and the upper atmosphere, but this will be presented in a separate study. At the screen level, objective evaluation versus surface observations is mostly used (e.g., Caron et al., 2015;Milbrandt et al., 2016). For a country like Canada, this approach is problematic because the spatial coverage of surface-observational networks is sparse and inhomogeneous, which leads to highly uncertain error estimates, heavily influenced by characteristics of the area being evaluated. With this approach, large areas of the country are in effect not represented in the evaluation. Objective evaluation at the screen level based on surface observations suffers from a few other problems, as discussed later in this article.
These challenges were the motivation for the development of a new tool at ECCC to evaluate NWP forecasts using land-surface, screen-level, and near-surface upper-air analyses to objectively evaluate the quality of predicted land-surface variables, screen-level meteorology, and PBL evolution and its vertical structure. This new verification package was first used to evaluate two different configurations of GDPS, one with substantial changes to the land surface, referred to in this article as the 'NEW' experiment, and another in which none of these changes are used, referred to as the 'OP' control to illustrate the fact that this configuration was used as the basis for the most recent GDPS operational implementation at CCMEP.
In order to properly present and discuss the benefits and difficulties associated with verifying NWP models' performance near the land surface, the focus in this study is entirely on the objective evaluation of NEW and OP for global medium-range forecasts versus screen-level analyses of air temperature and humidity. The objective verification presented here is a subcomponent of the recently developed package described above and more generally of the tools available to evaluate global NWP models. Recognizing that modifications to the land surface have the potential to impact more than just near-surface meteorology (e.g., Santanello et al., 2019;Benjamin et al., 2022;Liu et al., 2022), other aspects of the objective evaluation, including boundary layer and upper-air components, will be presented in follow-up articles.
The main objectives of the study are thus to (a) present and discuss an evaluation approach of deterministic medium-range NWP based on the comparison of model forecasts against screen-level analyses; and (b) evaluate the impact of the NEW land-surface package (including surface fields, SVS, and CaLDAS) on near-surface meteorological forecasts issued from ECCC's GDPS. This study provides a detailed discussion on the methods used to evaluate objectively NWP models near the land surface, with emphasis on aspects related to the use of analyses instead of surface observations, to the validity of biases when determined from comparison against analyses, to the role of variability, and to the independence of the verification datasets. Concerning the second objective, it should be noted that evaluation of the impact of individual components of the land-surface package is beyond what is possible with a single scientific article, and in most cases is actually not possible due to the intricate relationship between each of these components.
The article is structured this way: the next section on methods and data describes the modelling and analysis systems, the experimental setup, and the evaluation approach; the third section presents the results of the objective evaluation against screen-level analyses of air temperature and humidity; the fourth section provides a discussion on the benefits and disadvantages of this type of objective evaluation; and a brief summary of the study and of its conclusions is given in the last section.

NEW and OP configurations for ECCC's GDPS
The results presented in this study are obtained with a recent version of ECCC's GDPS system. The OP control configuration is similar to CCMEP's current operational system, based on McTaggart-Cowan et al. (2019) except that in this study the atmospheric model (described below) is not coupled with the ocean modelling system NEMO (for Nucleus for European Modelling of the Ocean, see Madec et al., 1998;Madec, 2008) as described in Smith et al. (2018). [Instead, a one-dimensional mixed-layer model based on Fairall et al., 1996 anddeveloped by Zeng andBeljaars, 2005 is used to predict the evolution of near-surface ocean layers]. Also, a problem in the GDPS version used in this study was identified and solved prior to its operational implementation at CCMEP, related to incorrect representation of background covariances for upper-air data assimilation in the stratosphere; this is TA B L E 1 NEW and OP configurations for global deterministic numerical weather prediction (NWP) prediction system (GDPS)

Component
NEW OP unlikely to have any significant impact on the results and conclusions from the present study, focused on model performance at the land surface. The differences between NEW and OP are listed in Table 1 and are described in details in the rest of this section, including explanations for all the acronyms. Practically all aspects of the representation of the land surface and of its interaction with the atmosphere have been modified in NEW. The list includes changes to land-surface modelling, to land-surface initial conditions, to the databases used to specify vegetation and soils characteristics, and to the physical connection between the land surface and the atmosphere.
The differences between the two configurations are substantial. This is often the case at ECCC due to the relatively long period associated with each implementation cycle (about two years). In that context, small incremental modifications to the operational forecasting systems are often not practical since it could take several cycles to completely implement a set of modifications that was in fact developed and calibrated as a whole (as is the case here). Furthermore, a sizeable set of changes was preferred for this specific study to make sure its impact on the forecasts is easily detectable, to illustrate the pros and cons related to the evaluation approach presented here. Finally, this article provides documentation and reference for a new land-surface package that is prepared for transfer to CCMEP's operational systems.

The GEM atmospheric model
The numerical simulations presented in this study were performed with the Global Environmental Multiscale (GEM) atmospheric model, developed and operated by ECCC. This model is used for short, medium, and long-range NWP at local, regional, and global scales, for deterministic and ensemble applications. The original version of GEM is described in Côté et al. (1998a, b), with recent improvements presented in Girard et al. (2014) and Husain and Girard (2017). In GEM, the non-hydrostatic primitive equations are solved with a two-time iterative implicit scheme with semi-Lagrangian treatment for the advective terms. Horizontal spatial discretization is based on finite differences with an Arakawa-C grid in the horizontal direction, while vertical discretization is from a staggered Charney-Phillips grid that can be nonuniform. The global version of GEM used for medium-range deterministic forecasts is described in McTaggart-Cowan et al. (2019), along with its representation of physical processes. In GEM's GDPS configuration, atmospheric radiation is based on Li and Barker (2005) using a correlated-k distribution for gaseous transmission, and vertical diffusion for atmospheric turbulence follows a turbulent kinetic energy approach described in Mailhot and Benoit (1982), Benoit et al. (1989), and Bélair et al. (1999). Several schemes are concurrently used to represent a wide range of clouds and precipitation types in GEM-GDPS, as described in Bélair et al. (2005) and in McTaggart-Cowan et al. (2019, 2020. For its application in the GDPS system, and this study, GEM version 5 is integrated globally for 10-day forecasts on a Yin-Yang grid (Qaddouri and Lee, 2011) with horizontal grid spacing on the order of 15 km and with 84 levels extending up to approximately 0.1 hPa. The lowest momentum level is approximately 20 m above the surface while the lowest thermodynamic level is at 10 m.

2.3
The ISBA (OP) and SVS (NEW) land-surface schemes A Canadian version of the land-surface scheme called Interactions between Soil, Biosphere and Atmosphere (ISBA), as presented in Bélair et al. (2003a, b), is used in ECCC's operational NWP systems and in this study's OP control. Developed in the late 1980s at Meteo-France by Noilhan and Planton (1989), this scheme is based on a two-layer Force-Restore (FR) approach for both soil moisture and surface-soil temperatures. The near-surface layer for soil moisture is 10 cm deep in this study, with a deeper layer for root-zone soil moisture. The FR coefficients for soil moisture are determined from the soil texture. Bare soil evaporation is calculated from relative humidity at the ground using a reference soil moisture at field capacity. Transpiration relies on the determination of a surface resistance with limiting factors related to photosynthetically active radiation, water stress, vapour pressure deficit, and air temperature. Surface temperature and mean surface temperature are also obtained from FR based on a single energy budget over land. These temperatures are thus representative of bare soil, vegetation, and snow. Vegetation is considered as a single layer, with no effect or impact on ground and possibly snow beneath vegetation.
It should be mentioned that several more advanced versions of ISBA have been developed and used at Meteo-France, such as the three-layer FR approach (Boone et al., 1999) and the multi-energy balance version (Boone et al., 2017). ISBA's original version, close to the one presented in Noilhan and Planton (1989) and Noilhan and Mahfouf (1996), and described in Bélair et al. (2003a, b), has been maintained without much modification at ECCC for its operational systems. Recent development and improvement related to land-surface modelling at ECCC have been done in the context of the SVS scheme.
The SVS code (part of the NEW experiment) originates from the Canadian version of ISBA. Several aspects were modified and new features were added in SVS in an effort to modernize land-surface modelling at ECCC. One of these new features is the inclusion of multiple energy budgets at the surface for the ground, vegetation, and snow-covered portions of land . Multiple water budgets are also part of SVS, as was already the case in ISBA (Bélair et al., 2003a). Spatial aggregation to obtain grid-scale heat and moisture turbulent fluxes over land depends on the fractional area coverage of low vegetation, high vegetation, and two snow packs, that is, one in open areas (over ground and low vegetation) and one in forested areas (under high vegetation) .
Another major difference between the two schemes is that ISBA's FR approach for soil moisture was replaced by a new hydrology computed on multiple soil layers with a soil water diffusion scheme combined with representation of surface, lateral, and base flows (see Alavi et al., 2016, for details). There are seven soil layers for this SVS application into the GDPS, with a first layer of 5-cm depth, and the interface of the layers below at 10, 20, 40, 100, 200, and 300 cm. Soil water vertical diffusion is obtained by numerically solving Richards equations for unsaturated flow in porous media (Verseghy, 1991;Soulis et al., 2000) with a Runge-Kutta 4th order numerical scheme. Lateral flows over tilted landscapes are derived from a parameterization of subgrid-scale interflows presented in Soulis et al. (2000Soulis et al. ( , 2011. Base flow occurs when soil moisture for the bottom layer exceeds the field capacity determined from an expression provided in Soulis et al. (2011). It should be mentioned that the FR approach is still used in that version of SVS for all surface and soil temperatures.
Bare soil evaporation is similar in SVS and ISBA, but vegetation transpiration is now calculated using a surface stomatal resistance based on photosynthesis processes. These processes are represented with a module that was adapted from the Canadian Terrestrial Ecosystem Model (CTEM) (Arora, 2003;Arora and Boer, 2005). The stomatal conductance is obtained in this study from Leuning (1995)-there is also the option of using the formulation developed in Ball et al. (1987). Soil water stress and transpiration rely on a vertical roots profile following Schenk and Jackson (2003), with the fraction of active roots given by a linear weighting function using as reference soil moisture values at wilting point and field capacity.

Land-atmosphere interactions
In addition to ISBA's replacement by SVS for land-surface processes, several aspects of the exchange processes between the land surface and the atmosphere have been modified with the NEW configuration: • Subgrid-scale orography. In OP, the subgrid-scale orographic component of the turbulent exchanges between the land surface and the atmosphere is entirely included as part of the roughness length for momentum (z 0m ) that is used to calculate the land-surface turbulent fluxes for heat, moisture, and momentum in ISBA.
In NEW, only the component associated with local roughness (i.e., land use/land cover) is considered in SVS, and the impact of subgrid-scale orography on lower-tropospheric winds is represented with the form drag formulation presented in Beljaars et al. (2004).
• Dynamic roughness length z 0h . In OP a fixed ratio z 0m /z 0h is applied to specify the roughness length for thermal turbulent exchanges at the surface (a ratio of five is used, and z 0h is the thermal roughness length). In NEW, this ratio dynamically evolves with time and depends on the Reynolds number, following the formulation presented in Zilitinkevich (1995).
• The value of the minimum Monin-Obukhov length (L min ) used to control possible decoupling between the land surface and the atmosphere in stable conditions (see McTaggart-Cowan et al., 2019) has been changed from L min = 20 m in OP to L min = 10 m in NEW.
• The surface layer stability functions under stable conditions from Beljaars and Holtslag (1991) (OP) have been replaced by Delage (1997) (NEW).
• Spatial filtering is applied in NEW to the aggregated turbulent exchange coefficients for temperature and humidity that are calculated by the surface schemes and provided as lower-boundary conditions to the one-dimensional vertical diffusion scheme (Benoit et al., 1989). This is performed by an explicit second-order filter developed by Shapiro (1971), based on a discrete operator with a three-point stencil.
• The determination in SVS of an effective land-surface state (surface temperature and soil humidity) required to calculate the surface turbulent fluxes over GEM's grid areas land portion, is based on spatial averaging influenced by the aerodynamic resistance of each SVS sub tiles (see Chehbouni et al., 1995). In ISBA (OP configuration) and previous versions of SVS, it is based on simple area averaging.

Land surface characteristics mapping
In OP, all vegetation characteristics are based on the USGS (United States Geological Survey) Global Land Cover Characterization (GLCC) database with an approximate 900-m resolution at the global scale (Loveland, 2000). Soil texture is provided by the FAO/UNESCO (Food and Agriculture Organization/United Nations Educational, Scientific, and Cultural Organization, 2003) Soil Map of the World, a one-layer database with 1/12 • resolution.
More recent and detailed databases are used in NEW to specify the characteristics associated with soil texture and vegetation. Vegetation types are obtained from the European Space Agency (ESA)'s Climate Change Initiative Land Cover (CCILC) 2015 release, with an approximate 300-m resolution over the globe. The vegetation type classification is further refined by comparing CCILC with terrestrial biome information from the Ecoregions 2017 dataset (Dinerstein et al., 2017). The soil texture is derived from a 2014 release of the Global Soil Dataset for use in Earth System Models (GSDE), also known as the Beijing Normal University (BNU) database, with eight soil layers and an approximate global 1-km resolution (Shangguan et al., 2014).
Look-up tables are used in both NEW and OP to specify land-surface characteristics related to vegetation (e.g., fractional coverage, leaf area index, root depth) based on the vegetation type classification. The exception is the momentum roughness length associated with local vegetation, which is determined in NEW from vegetation height. For high vegetation (i.e., trees), calculations are based on the Global Forest Canopy Height Database, derived from space-based measurements with the Geoscience Laser Altimeter System (GLAS, Simard et al., 2011). Look-up tables are used for low vegetation. Some of these look-up tables have been modified in NEW.

2.6
Upper-air data assimilation The upper-air initial conditions for the GDPS forecasts evaluated in this study are obtained from a 4D-EnVar data assimilation system, presented in Buehner et al. (2015). In the 4D-EnVar, the background-error covariances are entirely flow-dependent in the troposphere and lower stratosphere (Caron and Buehner, 2022) and are estimated from short-range forecasts from a local ensemble transform Kalman filter (LETKF), described in Buehner (2020). In the upper troposphere, outputs from LETKF are combined with static climatological estimates to specify the background-error covariances. Scale-dependent background-error covariance localization has been recently implemented in GDPS's 4D-EnVar (Caron and Buehner, 2022). Consistency between the deterministic and ensemble data assimilation systems is assured by using the same bias-corrected observations in both systems and by partially recentring the ensemble using a 4D-EnVar analysis. Data from a wide range of observing systems are assimilated in 4D-EnVar, including land-surface stations, ships and buoys, radiosondes, aircrafts, scatterometers, ground-based Global Positioning System (GPS) instruments, atmospheric motion vectors, satellite-based radio occultation, as well as infrared and microwave satellite data. These observations are ingested in six-hour assimilation windows, with 15-min time discretization for the calculations of the associated innovations. A four-dimensional Incremental Analysis Update (IAU, Bloom et al., 1996) technique is used with increments applied at hourly intervals.

Land data assimilation
Initial conditions over land for soil moisture and land-surface temperature for OP are obtained from the sequential optimal interpolation (OI) assimilation of screen-level observations, described in Bélair et al. (2003a) and introduced by Mahfouf (1991). This technique (or its variant based on nudging) has been used for more than two decades by several national environmental prediction centres (e.g., see Douville et al., 2000;Giard and Bazile, 2000;Drusch, 2007). With this approach, analysis increments for ISBA's soil moisture and surface temperature are calculated and applied to the control variables in order to minimize short-term forecast errors at the screen level. Snow depth, on the other hand, is obtained by assimilating surface observations with an OI approach described in Brasnett (1999).
In NEW, the land-surface initial conditions for SVS are provided by CaLDAS (Carrera et al., 2015. For soil moisture and surface temperature, a one-dimensional ensemble Kalman filter (1D-EnKF) is performed by means of a series of independent assimilation problems for each computational grid box. The 1D-EnKF is run with a forecast step of three hours, consisting of 24 ensemble members. The analysed variables include soil moisture in the first four layers of SVS (corresponding to a depth of 40 cm), as well as distinct surface temperatures for bare soil, vegetation, and snow pack.
Horizontal error correlations between state variables from different grid boxes are neglected for soil moisture and surface temperatures. The first guess is produced with offline two-dimensional land-surface runs including SVS (similar to Garnaud et al., 2016Garnaud et al., , 2017. The atmospheric forcing fields for the offline SVS are derived from short-range forecasts from the upper-air 4D-EnVar trial fields, except for precipitation which is provided by the Canadian Precipitation Analysis (CaPA, Fortin et al., 2015Fortin et al., , 2018. Stochastic perturbations are added to the atmospheric forcing variables (incoming radiation, precipitation, and near-surface air temperature) in order to generate spread among the ensemble members. Precipitation perturbations are generated by combining first-guess precipitation from the upper-air 4D-EnVar trial runs with observations using CaPA's methodology (Carrera et al., 2015). The perturbations for incoming radiation and precipitation are spatially coherent.
Swaths of geolocated brightness temperature from the SMOS (Soil Moisture and Ocean Salinity) and SMAP (Soil Moisture Active Passive) satellites are assimilated in CaL-DAS for its soil moisture analyses to a soil depth of 40 • cm. After quality control, and after processing of SMOS data to 40 • incidence angle, these observations are combined and regridded to the CaLDAS analysis grid using ordinary kriging with an assumed variogram. The SMOS brightness temperatures are restricted to the alias-free zone and the SMOS radio-frequency interference flags are applied to remove uncertain observations. Brightness temperatures with radiometric errors greater than 4 K are rejected. For SMAP, only brightness temperature measurements having acceptable quality as indicated with the provided data flags are included. The SMAP observations assimilated in CaLDAS are those with the water/land contamination correction. SMOS and SMAP observations over grid points with frozen soil or with snow coverage are excluded, based on information provided by the first guess. No bias correction is applied to the SMOS and SMAP brightness temperatures. [It should be mentioned that SMAP observations were not available between 20 June 2019 and 22 July 2019].
The forward model, called Community Microwave Emission Modelling Platform (CMEM), developed at ECMWF (de Rosnay et al., 2009), is used to calculate first-guess brightness temperatures from offline SVS to compare with SMOS and SMAP observations. The median of CaLDAS EnKF members is selected for the soil moisture analyses provided to the atmospheric forecast model GEM and to the 4D-EnVar upper-air assimilation system.
Land-surface temperature retrievals (Heilliette et al., 2017) from the polar orbiters AIRS (Atmospheric Infrared Sounder), CrIS (Cross-Track Infrared Sounder), and IASI (Infrared Atmospheric Sounding Interferometer) infrared sounders are assimilated in CaLDAS for the analysis of bare ground, vegetation, and snow temperatures. [Note: AIRS data were not ingested in CaLDAS after 8 December 2019]. Data from these sensors are also combined, quality controlled, and placed on the analysis grid using ordinary kriging. The bias correction for AIRS, CrIS, and IASI brightness temperatures is performed using a linear regression based on air-mass predictors similar to those used in Harris and Kelly (2001). Only clear-sky and bias-corrected radiances are used for the skin-temperature retrievals, after removal of atmospheric emission and absorption from observed radiances. Similar to soil moisture, the medians of the CaLDAS EnKF surface-temperature analyses are provided as initial conditions to GEM and to 4D-EnVar.
The snow-depth analysis in CaLDAS is not based upon an EnKF approach, but is rather similar to the analysis used in ECCC's current operational systems (and used in OP). A snow analysis is performed with an OI approach using in-situ snow-depth observations. In CaLDAS, the OI snow-depth analysis is done for all 24 members and it is the ensemble mean that serves as the initial snow-depth analysis for GEM and 4D-EnVar. This approach has been used in ECCC's high-resolution deterministic NWP system since CaLDAS's first implementation in 2014 (Milbrandt et al., 2016).
The merging of the 4D-EnVar and CaLDAS analyses is of a weak coupling nature (Bani Shahabadi et al., 2019) with direct insertion of the CaLDAS land-surface analyses at T-3h and T-0h within the IAU period of the upper-air 4D-EnVar. The inclusion of the T-0h update has the advantage of refreshing the snow analysis (which is performed every six hours in CaLDAS), as well as the surface temperatures and soil moisture analyses (which are produced every three hours).
Based on this description, and in agreement with a recent review from De Lannoy et al. (2022), CaLDAS places among the state-of-the-art LDAS systems currently used in major meteorological and environmental prediction centres. The weak coupling approach, in which the land-surface and atmospheric data assimilation systems run side by side with frequent bidirectional information exchange, has been used for several decades at ECCC (see Bélair et al., 2003a). Work is currently under way, along the lines discussed in de , to more closely combine CaLDAS with ECCC's atmospheric data-assimilation system and determine the level of coupling required to obtain an optimal impact on applications connected to both land-surface and atmospheric numerical prediction. Coupling of CaLDAS with ECCC's hydrological assimilation systems is also planned.

Experimental setup
In the overall project to evaluate NEW's performance versus OP, comparison of GEM's global medium-range forecasts against screen-level analyses was performed for two time periods, that is, from 16 June 2019 to 31 August 2019 (referred to as summer in this study) and from 16 December 2019 to 29 February 2020 (referred to as winter). Since the main objective of this study is to demonstrate the usefulness of objectively evaluating the impact of land-surface modifications on GEM's medium-range screen-level forecasts, only figures describing results for the summer 2019 period are presented and described in the main text. For the winter 2020 period, only the summary tables are provided.
For NEW, which includes SVS, CaLDAS, and other changes compared with ECCC's operational systems, an offline open-loop run was performed from October 2017 all the way to 15 May 2019 in order to spin up SVS prognostic variables (terrestrial snow, soil moisture, and surface temperatures). A one-month offline land-surface assimilation cycle with CaLDAS was then run from 15 May 2019 to 16 June 2019, the first day of the summertime evaluation period for which the fully coupled (CaLDAS, 4D-EnVar, GEM) system was integrated. An offline open loop was then performed from 1 September 2019 to 15 November 2019, followed by a one-month offline (uncoupled) CaL-DAS assimilation cycle all the way to 16 December 2019, the first day of the winter period for which the full system was integrated.
Series of 10-day global GEM deterministic forecasts were produced every twelve hours (initialized at 0000 UTC and 1200 UTC each day) for the two evaluation periods. In order to preserve the diurnal cycle of the error patterns, spatial and temporal, only the evaluation of 0000 UTC runs is presented and discussed in this study.

Datasets for evaluation
As briefly presented in Section 1, and as more thoroughly argued in Section 4 below, there are several problems associated with the use of surface observations to evaluate the quality of NWP forecasts. These problems are related to the sparse and inhomogeneous nature of surface-observational networks over several parts of the world, to errors associated with representativeness of observations, and to possible height inconsistency between observations and model forecasts. For these reasons, this study's objective evaluation is performed by comparing GEM forecasts against screen-level (1.5 m) analyses of air temperature and specific humidity. The screen-level analyses datasets are extracted from the 4D-EnVar assimilation cycles and taken at the central time (T-0h) of each six-hour assimilation window. It should be mentioned that there is no prognostic level in GEM at the screen level. In that context the first guess used for the screen-level analyses within the 4D-EnVar is rather the result of vertical interpolation between the lowest atmospheric level (i.e., ∼20 m above the model surface for air temperature and specific humidity) and the land-surface values as provided by the land-surface analyses. This vertical interpolation relies on Monin-Obukhov similarity theory, using stability functions specific to each experiment.
During its six-hour time window, the 4D-EnVar assimilates screen-level observations from the SYNOP, SHIP, and drifting-buoys datasets. SYNOP data represent the majority of these observations and are only available at the central time. Observations from land data assimilation also influence the quality of the 4D-EnVar screen-level product. In OP, initialization of soil moisture and surface temperature is performed at T-0h with an OI based on the assimilation of METAR and Surface Weather and marine OBservations (SWOB, distributed by ECCC) screen-level data, in addition to the SYNOP dataset previously mentioned. In NEW, the land-surface analyses from CaLDAS are directly inserted in 4D-EnVar at T-3h and T-0h, ensuring an impact of SMAP, SMOS, AIRS, CrIS, and IASI on the 4D-EnVar screen-level analyses.
An alternative to 4D-EnVar would have been to use screen-level analyses directly from the surface OI and CaL-DAS, since both of these approaches produce such analyses (OI as an input for the assimilation, CaLDAS as an end product). But the screen-level analyses from OI are produced on a legacy 33-km global grid whereas those from CaLDAS are generated on the GEM computational (and CaLDAS analysis) grid. Because the two screen-level analyses are quite different in terms of resolution and computational grids, it was deemed preferable to use the 4D-EnVar analyses for this study, as they are more consistent in terms of their production mode and their computational grids. Another alternative would of course be to evaluate the systems' performance directly against surface observations. As previously mentioned, the motivation behind the use of screen-level analyses is detailed in Section 4.
GEM forecasts from NEW and OP are verified against their 'own' analyses, that is, against screen-level analyses produced as part of the assimilation cycles performed for their initialization. Possible issues related to independence of the verification datasets is an important theme also argued later in Section 4.
A rigorous and objective estimation of the quality of the screen-level analyses is outside the range of this study, and will be presented in a follow-up article. Because screen-level analyses are based on very short-range forecasts and are influenced by a large amount of observations (both from space and at the surface, as described above in this section), and because as described later this study's emphasis is on the medium range (Day 4-6) for which forecast errors substantially exceed the analyses' uncertainty (as evidenced in Section 3 by plots of errors as a function of lead time), there is satisfactory confidence that the use of screen-level analyses is adequate for evaluation of near-surface forecasts.

Evaluation metrics
Emphasis in this study is on the random component of the forecast errors. Maps or plots of mean errors (biases) are not shown because of the uncertainty associated with their estimation when using near-surface analyses, as argued later in Section 4. An estimate of the relative importance of mean errors versus the standard deviation of errors (STDE) is provided in the form of tables. Evaluation of the GEM medium-range forecasts accuracy is assessed using the temporal STDE and correlation ( ) based on comparison with screen-level analyses. These two metrics, as well as a few more derived from them, are calculated at each model (forecasts and analyses) grid point for the two series of integrations (summertime and wintertime). No spatial interpolation (horizontal or vertical) or aggregation is required since the forecasts and analyses share the same computational grids and orography.
The STDE at each grid point (i) and for each lead time (t) is given by: in which d is an index for the validation dates, N d is the number of validation dates (equal to the number of integrations used for the evaluation), f d , a d and e d are forecasts, analyses, and forecasts errors at specific validity dates, while f , a, and e are mean values over the evaluation periods, that is, for all the validation dates. A relative measure of the performance difference between the two analysis and prediction systems can be simply written as: (2) in which the subscripts 'NEW' and 'OP' refer to numerical forecasts from NEW and OP. This estimate of the land-surface impact is expressed as a percentage (i.e., multiplied by 100%). Please note that the indices for area and lead time have been omitted in this equation since these differences could be computed for each grid point and lead time, or based on spatial and or lead-time averages.
The STDE differences are also normalized in this study with forecast-error growth in order to provide a more useful estimate of the predictability gain associated with a possible implementation of NEW versus what would be achieved with OP. This metric is expressed as: or more precisely as in which STDE growth is calculated from the OP experiment. This growth metric, dSTDE growth , informs on the predictability 'gain' or 'loss' in time units (hours in this study), and is defined as being positive when NEW performs better than OP.
In the summary tables, the evaluation is performed for ranges between 72 and 144 hr (prediction days 4, 5, and 6), together with the temporal means over the same forecast range for STDE, to reflect the fact that medium range is the main concern for ECCC's GDPS. It is important to mention that in some cases the error growth can saturate quickly with lead time and become relatively small at the medium range, leading to unrealistic values of dSTDE growth . This is certainly a limitation related to this metric. In the tables, the entire forecast range, from t 1 = 0 h to t 2 = 240 h is used to evaluate the reference (OP) error growth. Situations when error growth is very small have been visually identified from the plots of STDE as a function of lead time, and have been removed and identified as 'N/A' (not applicable) in the tables.
The other set of metrics used in this study is based on the temporal correlation between forecasts and analyses, also calculated for each grid point and for each lead time: In the same manner as what is done for STDE, relative measures of improvement are calculated for the correlation, following: for the relative improvement of the correlation (multiplied by 100%) and: for the gains or losses in predictability hours. As for STDE, positive values indicate an improved performance by NEW compared with OP. The lead times t 1 and t 2 are defined as previously described for STDE. For each land-surface grid point and for each lead time, pairs of values for model forecasts and for their own experiment analyses are thus constructed over the two selected time periods. Each item of these distributions corresponds to a specific validation date. The error statistics described above, based on STDE and temporal correlation, are then calculated as plots as a function of lead time for the specific period or season.
The evaluation metrics can be displayed as maps (i.e., with values for each grid point, sometimes averaged over a lead-time period), as plots function of lead time (i.e., showing means over a certain area), or numerically with tables (i.e., based on both spatial and lead-time averages). In this study, the maps and tables are produced by averaging the metrics at each grid point for the lead-time period between 72 and 144 hr, that is, from Day 4 to Day 6 forecasts, considered here as the 'medium range', considering forecasts valid at 0000, 0600, 1200, and 1800 UTC.
Plots for errors as a function of forecast lead time are obtained by spatially averaging over the land portion of a specific area (e.g., North America, Europe) for a specific lead time or for a specific period. This spatial averaging is performed on a global latitude-longitude grid and uses summations weighted by the cosinus of the latitude ( i ): where represents any of the error metrics described in this subsection, m is for the grid point, and area represents the region over which the spatial averaging is performed.

Global maps
The random errors spatial distribution for medium-range forecasts for summer 2019 from the OP control can be identified from the global maps shown in Figure 1. For both the upper and lower panels the STDEs for screen-level air temperature and specific humidity have been averaged over the 72-144 hr forecast range, that is, for Day 4 to Day 6. Screen-level air-temperature STDE, shown in Figure 1 upper panel, substantially varies from continent to continent. Errors are larger over North America and over the vast area extending from Eastern Europe to Eastern Asia, with maximum values at the northern edge of these regions. In the Southern Hemisphere STDEs are generally smaller, except for the southern portion of South America. Other notable features of the air-temperature STDE map include relatively large error values in central United States and in the northern part of Africa's equatorial climate zone. F I G U R E 1 Global maps of (a) air temperature ( • C) and (b) specific humidity (g kg −1 ) standard deviation of errors (STDE) for the OP control experiment versus own screen-level analyses for summer 2019, averaged over the 72-144 hr forecast range The STDE spatial patterns are different for screen-level forecasts of specific humidity (Figure 1 lower panel). Errors are smaller over northern areas compared to air temperature. The most notable features for specific humidity are the large errors over the eastern part of the United States and over the northern equatorial zone in Africa, with large values extending eastward to include the tip of the Arabian Peninsula and the northeastern edge of India.
Some of the spatial structures in Figure 1 for air temperature and specific humidity are easily understandable. For instance, larger STDE values are expected for both near-surface air temperature and humidity over summertime convective areas such as southeastern United States and the intertropical convergence zone. It is also logical that air-temperature STDEs are larger in northern and arctic regions, where land-surface and atmospheric initial conditions are more uncertain and where NWP models are not as well tested and calibrated compared with more-populated areas such as the United States, Europe, and Asia (Jung et al., 2016).
The medium-range temporal correlation between screen-level forecasts and analyses is shown in Figure 2.
The map for air temperature (upper panel) displays less in terms of spatial variability, with values that are relatively uniform spatially over land, except for the weaker correlations observed over northern land areas and over the southern edge of the Tibetan plateau. The correlation for specific humidity is generally weaker than for air temperature and has more spatial structures. Values are larger over tropical and subtropical areas, and smaller over drier and northern areas (e.g., Northern Africa, the Middle East, and parts of Australia).
Global maps of STDE differences between the two experiments (NEW minus OP) are given in Figure 3. In these maps, negative values (in red) indicate areas where NEW performs better than OP over the medium range (Day 4 to Day 6) for air temperature and specific humidity. (It should be noted that throughout this article, for plots comparing the two experiments, the colour red is associated with the NEW experiment while the colour blue represents the OP control). Results indicate that the impact of NEW versus OP is more positive for specific humidity (lower panel) than for air temperature (upper panel), at least for that evaluation period of summer 2019. Looking at North America, for example, the difference maps indicate an STDE decrease with NEW for northern areas and for the central United States, together with an STDE increase for the southeastern United States and part of central Canada, for both air temperature and specific humidity. Similar analyses can be made for all the other major areas and continents, with both positive and negative areas.
The substantial air-temperature STDE improvement with NEW in northern land areas of North America and Eurasia is certainly one of the most striking feature of these global difference maps. These improvements can be attributed to the more advanced specification of land-surface conditions with NEW, namely to better initial conditions caused by an increased number of observations used to initialize soil moisture and surface temperature in northern areas (associated with relatively frequent passes of the orbiting satellites compared with a sparse surface observation network), and to more realistic surface modelling, in agreement with previous studies (e.g., Køltzow et al., 2019;Ortega et al., 2022). Overall statistics informing on NEW versus OP performance for each continent will be presented later with a summary table.
Similar maps of differences were also produced for the temporal correlation ( Figure 4). In these maps positive values (in red) indicate greater (better) temporal correlation between screen-level forecasts and analyses for the NEW experiment. The differences for air temperature are in general small and mostly positive (i.e., red, for NEW), except for the region of the Himalayas and parts of the Tibetan plateau where temporal correlation is substantially larger for NEW. Results are not as good over parts of Siberia, which show degradations for air temperature and specific humidity, for both STDE and temporal correlation. Over North America, the largest increase in temporal correlation for the NEW experiment is over the northern portion of the continent. Central and western United States also exhibit increased temporal correlation for NEW, whereas it is the reverse for the southeastern United States (although the negative values are relatively small).
Similarly to what was described for STDE differences, the impact of the NEW experiment on temporal correlation is greater for specific humidity (lower panel) than for air temperature. The temporal correlation is in favour of NEW almost everywhere, except for a few areas including F I G U R E 3 Global maps of (a) air temperature ( • C) and (b) humidity (g kg −1 ) standard deviation of errors (STDE) differences between NEW and OP versus own screen-level analyses for summer 2019, averaged over the 72-144 hr forecast range southeastern United States and central Asia, and in southwest Africa. The largest increase of temporal correlation for specific humidity is over the very northern areas of North America and parts of Asia.
Because there are substantial changes made to the land surface between the two experiments, it is difficult to precisely attribute improvement or degradation over particular areas of the maps in Figures 3 and 4 to specific items of OP's or NEW's configurations. For instance, degradations with NEW could be caused by imprecise land-surface characteristics, inacurrate space-based observations or retrievals, problems with land-surface modelling, or incorrect representation of the land-atmosphere interaction in the GEM atmospheric model (e.g., with impact on PBL evolution, clouds, precipitation). While hypotheses are proposed in this section for NEW's gains or deteriorations over some areas, an advanced analysis is required to better understand the complex processes responsible for the differences observed in Figures 3 and 4 over specific areas.

Lead-time error plots over Canada
Figures 5-8 provide some focus on Canada itself, that is, the area of greatest interest for operational implementation of this particular medium-range prediction system by CCMEP. The figures show STDE and temporal correlation as a function of lead time spatially averaged of Canada's land mass. Figure 5 gives a different view of the positive impact of NEW for air-temperature STDE over Canada, with lower mean STDE for all ranges of the forecasts. The difference between the red (NEW) and blue (OP) lines gradually increases with lead time in Figure 5 to reach a maximum around 144-168 hr (i.e., ∼Day 7). The difference decreases afterwards to become practically neutral at the end of the forecasts, that is, 240-hr (10-day) forecast. The STDE exhibits a marked diurnal signal as a function of lead time with maximum values at 0000 UTC for both experiments, but the NEW minus OP differences are relatively similar for all primary synoptic hours. At its maximum, the F I G U R E 4 Same as Figure 3, but for the temporal correlation differences are on the order of 0.4 • C in absolute values, which correspond to an approximate relative reduction of approximately 15% and approximately 25 hr (i.e., one day) in predictability. Interestingly, the maximum impact over Canada occurs just after the 'medium-range' evaluation period (72-144 hr, highlighted in yellow in Figures 5-8) used for the maps and summary tables.
As depicted in Figure 6, the impact of NEW versus OP for screen-level air temperature is also positive over Canada when looking at the evolution of the temporal correlation. The difference between the two experiments smoothly grows with lead time to reach its peak around 144-168 hr (i.e., Day 7) and then decreases to become nearly neutral at 240-hr lead time. Interestingly, a diurnal signal develops with lead time for the temporal correlation difference, becoming more evident towards the end of the 10-day forecasts. The improvement with NEW remains large at night (0600 and 1200 UTC) late in the forecasts, but during daytime (1800 and 0000 UTC) the difference decreases more rapidly towards neutrality. At its maximum, the differences in temporal correlation are approximately 0.05 in absolute values, approximately 5% and approximately 30-40 hr in relative and predictability terms. As should be expected, the amplitude of NEW's impact versus OP is different for the temporal correlation and STDE. Smaller for the relative difference but greater in terms of predictability hours, this apparent mismatch for the NEW and OP comparison is related to the metrics formulation, but also to the statistical links between STDE and temporal correlation (this last aspect is further elaborated in Section 4 below).
Similar analysis and conclusions can be derived from Figures 7 and 8 for screen-level specific humidity. These figures confirm that the impact of the NEW experiment is greater for specific humidity than for air temperature. For STDE (Figure 7), the maximum distance between the two experiments still occurs around Day 7, with difference of approximately 0.3 g kg −1 , which corresponds to approximately 15% in relative terms and 25-30 hr in lead time. For the temporal correlation (Figure 8), the difference corresponds to gains of approximately 0.05 in absolute value, approximately 5-8% in relative term, and approximately 40-60 hr in predictability. For both STDE and temporal correlation, the improvement with NEW is larger at 0600 UTC (night) and smaller at 1800 UTC (daytime).

F I G U R E 5 (a) Standard deviation of errors (STDE) ( • C), (b) STDE difference (dSTDE, o C), (c) STDE relative difference (dSTDE_rel, %), and (d) gain/loss in predictability based on STDE (dSTDE_growth, h) as a function of lead time (h) over Canada for air temperature versus own
screen-level analyses during summer 2019. Data are shown every 6 hr. The yellow shadings indicate the medium-range period of interest, that is, between 72 and 144 hr forecasts, which is used for the calculations presented in Table 2. The blue shading in the second panel indicates the plus and minus one sigma (i.e., standard deviation) for the spatial distribution of STDE differences between the two experiments over Canada F I G U R E 6 Same as Figure 5, but for the temporal correlation (ρ) and a few measures of its difference between the NEW experiment and the OP control reference, that is, direct differences (drho), relative differences (drho_rel), as well as gain/loss in predictability (drho_growth, h) F I G U R E 7 Same as Figure 5 but for specific humidity (g kg −1 ) F I G U R E 8 Same as Figure 6 but for specific humidity (g kg −1 ) The reasons for the diurnal signals detected in the error differences shown in Figures 5-8 are unclear. With distinct characteristics for air temperature and humidity, this diurnal signal appears to be related to modelling aspects of GEM, since it develops or even appears during the model forecasts. A rigorous analysis would be required in order to document the causes behind this temporal evolution of the model errors.
The reasons why NEW's impact is greater for near-surface humidity than for air temperature are also complex. They are related to both dynamical and thermodynamical aspects of near-surface meteorology, including surface-energy budgets and its atmospheric forcing, evaporation, surface layer representation, and boundary-layer turbulent mixing. Although this study does not provide a detailed analysis of which NEW's specific components contribute the most to humidity improvement, it is expected that more realistic soil moisture initial conditions resulting from the assimilation of SMAP and SMOS (not explicitly shown in this study, but previously demonstrated in Carrera et al., 2019) together with a more physical representation of surface evaporation with the SVS land-surface scheme are the main contributors to this improvement. The same type of improvement, that is, better for near-surface humidity than for air temperature, was also found when evaluating SVS against the ISBA land-surface scheme in offline mode (see Husain et al., 2016).

Summary tables for global performance
A summary analysis is given in Table 2 with spatially and temporally averaged differences between the two experiments for each continent or major region of the world, valid for the medium-range lead-time period (72-144 hr). Differences for STDE and temporal correlation are provided for both air temperature and specific humidity in terms of absolute and relative values, as well as predictability in hours. As alluded to earlier, the estimation of predictability gains in hours is not applicable ('N/A' in the table) in certain regions, namely Africa and South America, where the error growth can be quite small.
The summary table indicates that the NEW experiment is better than the OP control for every region for both STDE and temporal correlation. For air temperature the largest STDE improvement is found over North and South America based on absolute and relative differences.
A similar signal is found for temporal correlation for these two regions. Large improvement for temporal correlation is also detected over Asia. The gains in hours of predictability vary widely from region to region, and are not the same for STDE and temporal correlation. Gains are about tens of hours, going from approximately 10 to approximately 45 hr.
The differences for specific humidity are approximately 0.1-0.2 g kg −1 , which corresponds to decreases ranging from approximately 10% to 16%. North and South America are still substantially impacted by NEW versus OP, but it should be noted that differences of about approximately 10-12% are also observed over all the other regions. Differences for specific humidity temporal correlation range from approximately 0.02 to 0.04, with NEW's best performance over North America. Gains in hours of predictability widely vary and again should be viewed with caution as they are meaningless for areas with small error growth. For specific humidity the gains in predictability are estimated between approximately 20 and 45 hr.
The impact of NEW versus OP is also substantial for the winter 2020 period, as demonstrated by the numbers in Table 3. As could be expected, there are some differences between the two seasons (compared with Table 2) in the Northern Hemisphere (for North America, Europe, and Asia) where the air-temperature STDE gains are actually larger in absolute values than in summer, but are similar in relative values and predictability hours. Europe is the region most improved, with a decrease of 0.31 • C in absolute STDE, which corresponds to 12.4% in relative value and 17.8 hr in predictability. A similar signal is observed for the air-temperature temporal correlation. For Europe, it is increased by 0.019, which corresponds to 2.3% and 7.6 hr in predictability. The results in the Southern Hemisphere are similar for that season, except for a substantial decrease in air-temperature STDE in Australia (of about 0.28 • C). This STDE gain is larger than in summer 2019 and corresponds to 12.9% in relative error and 34.5 hr in predictability. Improvement with NEW for the temporal correlation over that region is also larger than in the boreal summer season.
In the Northern Hemisphere, the specific humidity gains for both STDE and temporal correlation are smaller in winter 2020 than in summer 2019 (see Tables 2 and 3). This weaker impact of NEW during the boreal winter season is expected, since atmospheric water vapour is less abundant in the Northern Hemisphere during that season, with less land-surface evaporation (and less impact from the land-surface modelling and data assimilation). In the Southern Hemisphere, the results are similar for South America and Africa. But NEW's benefits over Australia are substantially larger in winter 2020 for specific humidity, with a decrease of STDE of about 0.27 g kg −1 , which corresponds to 13.5% in relative error and 43.3 hr in predictability. The temporal correlation over Australia also sees greater gains during that season, with a decrease of about 0.035 which corresponds to about 4.0%.
It should be mentioned that gains over Canada are smaller for the winter period compared to what is shown in Figures 5-8 for the summer period. The lead-time plots for winter (not shown here) indicate a smaller but still slightly positive impact for air-temperature STDE, peaking at around nine-day integration, instead of seven days as what is found during the summer season. The lead-time plots for the air-temperature temporal correlation, and for specific humidity STDE and temporal correlation, show a practically neutral impact for the winter season over Canada (not shown).

DISCUSSION
The evidence provided and described in the previous section indicates that overall the NEW configuration leads to better forecasts of near-surface air temperature and specific humidity, at least based on objective evaluation using screen-level analyses. But as is the case for every type of objective evaluation, using various observations types or 'truth' references, there are important questions that need to be addressed and discussed in order to clearly understand the validity of the approach and the significance of the results. The questions concern the use of screen-level analyses instead of surface observations, the problems associated with mean (bias) errors when using screen-level analyses (or surface observations for that matter), the role of forecasts and analyses' variability, and the independence of the datasets used for objective evaluation.

Evaluation versus screen-level analyses instead of surface observations
Results described in the previous section are organized in such a way to demonstrate the potential and usefulness of using screen-level analyses to evaluate the quality of medium-range meteorological forecasts for near-surface air temperature and humidity. Extending on this demonstration, it could be argued that for a country like Canada evaluation based on surface analyses is as important as one based on surface-observational networks. Objective evaluation relying on surface observations suffers from several problems.

Sparse and inhomogeneous coverage
In many countries, the coverage of surface-observational networks is unequal, sparse, and does not allow for an adequate evaluation of forecasts errors, either systematic (biases) or random (STDE, correlation). Figure 9 shows a typical spatial distribution of surface observations used at ECCC/CCMEP to evaluate near-surface forecasts from its NWP systems. This map indicates that some areas F I G U R E 9 Location of stations of surface observational networks typically used for operational objective evaluation at the Canadian Centre for Meteorological and Environmental Prediction. The areas specified by the red lines are used for the objective evaluation presented in Figure 10 of North America are relatively well covered with these networks, like the eastern part of the United States, the United States Great Lakes area, and the Alberta province in Canada, while other areas are practically void of observations, like most of Canada and the western United States. It is worth noting that northern land areas are the worst in that respect, in effect preventing that part of the continent to substantially impact the overall near-surface evaluation process. As a consequence this limits the development of specific solutions to improve NWP for these areas.
To illustrate the fact that objective evaluation from surface networks might not provide an entirely satisfactory estimate of the errors over specific spatial areas, Figure 10 compares STDE obtained over land for two areas (depicted in Figure 9 and labelled as 'North' and 'South') based on air-temperature evaluation versus screen-level analyses, using either values at locations of the surface stations (with dashed grey lines) or values for all grid points over the entire land mass for that region (with full black lines). In other words, the two sets of lines in Figure 10 are based on evaluation of the same forecasts versus the same reference (i.e., screen-level analyses) but they are done over a different set of points (i.e., stations versus entire land domain).
Results from Figure 10 reveal substantial differences between the two approaches for the sparsely covered 'North' domain (upper panel), with larger STDE at surface-station points (in grey) for the first five or six days of the forecasts. That is, objective evaluation using just these station data points does not truly represent the model-forecasts' errors over that domain, even when other types of errors (such as representativeness and height inconsistency, discussed below) are not considered.
In contrast, the same type of comparison over the more densely observed 'South' domain shows a better agreement. The similarity is not perfect, however, particularly during daytime when some differences are evident. These deviations at 1800 and 0000 UTC are likely related to increased near-surface variability caused by turbulence in well-developed convective boundary layers, or by small-scale features related to convective precipitation which tend to occur later in the day over that area. It should be noted that observation 'thinning' is sometimes used to partially mitigate the heterogeneity aspect of the problem. But this is done at the expense of spatial coverage density, making the overall observational coverage even more sparse.

Errors of representativeness
The previous demonstration does not consider errors associated with the representativeness of screen-level observations. Substantial subgrid-scale variability can be observed at or near the land surface for air temperature and humidity. This variability can be related to meteorological causes (horizontal gradients, small-scale cloudiness) or more importantly to heterogeneity of land-surface coverage (land versus water, urban areas versus natural covers, wet soils versus dry soils, high vegetation versus low vegetation, etc). This subgrid-scale variability could be relatively large, on the order of 1-2 • for both air temperature and dew-point temperature (see Rochoux et al., 2016, in which this was estimated with numerical experiments). Therefore, natural variability occurs when comparing local observations at stations located within a specific model-grid area with grid-averaged model forecasts. Downscaling model forecasts using subgrid-scale information to better represent values observed at a specific location within model-grid areas has been attempted, but this type of procedure adds uncertainty because of assumptions made. If subgrid-scale processes like horizontal mixing are not properly addressed, such downscaling could lead to overestimation of horizontal differences between different types of land types.

Height inconsistency
The grid-scale orography used in NWP models does not perfectly fit reality, and differences between the (real) height of observational stations and (estimated) height of model-grid point adds to the variance and biases when F I G U R E 10 Standard deviation of errors (STDE) as a function of lead time (hr) for NEW's screen-level air temperature ( • C), averaged over grid points corresponding to locations of surface stations (as shown in Figure 9) (dashed grey lines) and over all land grid points (full black lines). The comparison is performed for summer 2019 over two domains, (a) 'north' and (b) 'south' as depicted in Figure 9.
comparing the two sets of data. Height inconsistency could be viewed as a grid-scale error of representativeness, in contrast with the previous item's argument based on subgrid-scale variability.

Favouring systems which assimilates data used for evaluation
In LDAS such as the OI featured in the OP control (see Subsection 2.7), the quality of near-surface meteorological forecasts is expected to be optimal at locations with screen-level observations, due to the fact that land-surface initial conditions (for surface temperature and more importantly for soil moisture) are estimated in a way to minimize short-range prediction errors based on that specific set of observations. This is not so with the NEW experiment in which these observations are not present in the assimilation leading to surface-temperature and soil-moisture initial conditions. The process in NEW is rather based on the assimilation of space-based observations (SMAP, SMOS, AIRS, CRiS, and IASI), which is expected to produce land-surface analyses of a more spatially-uniformed quality (depending of course on the spatial availability of satellite observations). In that context, objective evaluation vs screen-level observations at surface stations might overestimate OP's performance when compared with NEW. Evaluation over the entire land mass should provide a more realistic (and fair) estimate of the relative quality of forecasts from these two configurations.
The main argument presented in this subsection is therefore that for some regions or countries objective evaluation of medium-range weather forecasts could be considered as being an appropriate alternative when performed against screen-level analyses as it provides more accurate forecast-error estimates over the entirety of the land mass. This approach does not suffer from representativeness errors or height inconsistency because the screen-level analyses are produced on the same computational grids as the forecasts and with the same orography and land-surface characteristics.
In opposition to this premise, it might be argued that analyses are not direct observations, but rather a combination of observations and background information from short-range numerical modelling. But this counter argument is weakened when considering the extensive amount of observations that are actually used to produce screen-level analyses (see Subsection 2.9), which not only directly include screen-level observations but are also indirectly affected by space-based remote-sensing information used to specify land-surface variables (temperature, soil moisture, snow), and by observations used to initialize the lower troposphere in the atmospheric data assimilation system (i.e., radiosondes as well as ground-based and space-based remote sensing).

Validity and relative importance of mean errors versus analyses
The mean differences between the two sets of analyses used for the objective evaluation could be non-negligible and affect the estimation of model biases, especially if the spatial coverage of observations used for LDAS is dissimilar between the experiments that are compared, as is the case for the present study. Caution should thus be exercised when using screen-level analyses to evaluate model biases. The same argument holds for evaluation versus upper-air analyses in the lower few kms of the troposphere, a fact not always considered by national centres in their determination of the quality of NWP forecasts.
Beyond this, it might be interesting to examine the difference of mean errors between two experiments in order to quantify its relative contribution to the total error. For NWP models, most of the near-surface and atmospheric errors at medium range are associated with random processes. But for some variables, over specific regions and at certain times of the year, systematic errors can play a substantial role. This impact of bias (B), as expressed here by its contribution to changes in root mean square of the errors (RMSE), is evaluated based on: Changes in RMSE can be expressed as: which by simple rules of differentiation can be developed as: where OP is used as the reference experiment and where absolute values are used for the bias term on the right in order to preserve the correct sign for its contribution to the total error (because bias can be positive or negative). The bias contribution to the total error (RMSE), compared with STDE's contribution, can be expressed by the ratio of the absolute values of the second and first terms on the right-hand side of the above equation, that is: This new quantity, ratio bias , provides information on the relative contribution to the total error (expressed here as RMSE) of systematic versus random errors. This type of error analysis based on derivatives is typically valid for small differences; in this study it is only used to provide a rough estimate of the bias relative contribution to the total error.
The results presented in Table 4 indicate that the contribution of the bias is always smaller than that of STDE (i.e., less than 1.0). For air temperature, this contribution is small for all regions (less or equal to ∼0.15), with values close to zero for North and South America. For specific humidity, this ratio is even smaller, with values smaller less than 0.10 for all regions. These results are comforting in the sense that even though biases might be difficult to evaluate using screen-level analyses (or any meteorological analyses close to the land surface), its relative contribution to the total error (in the RMSE sense) at the medium range is relatively small, that is, generally about 15% or less, relative to STDE's contribution.
Interestingly, the situation is different in winter as seen from the numbers in Table 5. Because ECCC's GDPS has a strong cold bias in the Northern Hemisphere during that period (close to −1 • C in North America and Europe), and because this bias is substantially reduced with NEW (by half over these two regions), the relative contribution of the bias to the total error (RMSE) relative to STDE's contribution is 0.76 for North America, 0.85 for Europe, and 0.75 for Asia (see Table 5). Even at the medium range, the impact of this bias reduction with NEW remains substantial, that is, more than two thirds of STDE's contribution in the Northern Hemisphere for that specific winter season.  (10) and (11). Bold values are for the bias ratio defined in Equation (12). Abbreviation: STDE, standard deviation of errors.

The role of variability: linking STDE and temporal correlation
Because the two-model configurations could lead to forecasts with different levels of spatial and temporal variability, it is useful to provide an assessment of the impact this could have on the evaluation metrics used in this study, namely STDE and temporal correlation. Visual inspection of model forecasts and analyses has revealed that the NEW configuration leads to smoother (with less spatial variability) forecasts at or near the surface compared with forecasts from the OP control, due mostly to its filtering of the lower-boundary conditions for the one-dimensional vertical diffusion (atmospheric turbulence, see Subsection 2.4). Although this does not automatically transfer into changes for temporal variability (from case to case) which is influenced by other factors, it is legitimate to examine what impact any increase or decrease of model-forecasts' variability has on the performance metrics, that is, how it can explain differences found in terms of NEW versus OP impact for STDE and temporal correlation.
As was shown in Figures 3 and 4, the positive impact of NEW versus OP appears to be more spatially uniform when looking at temporal correlation (Figure 4), compared with what is found for STDE (Figure 3). This is especially true for the African continent and parts of Asia (e.g., Tibetan plateau).
To shed light on this, the following equation for the standard deviation of an error distribution is used as a starting point: in which 'F' refers to forecasts and 'A' refers to the analyses. The STDE is the sum of the forecasts and analyses variance minus a correlation term. This equation is rewritten as: and then as It is worth noting that the multiplicative factor, , is minimum when F = A and increases in a symmetric manner if the forecast variability moves around that optimal value. Based on this fact alone, there is an equal advantage (or disadvantage) in terms of correlation to either increase or decrease the spatial or temporal variability of the forecasts.
For the optimal situation when model forecasts variability perfectly corresponds to that of the analyses, that is, F = A = , the correlation is simply expressed as: In that situation, decreasing the model and analysis variability has the negative effect of decreasing the correlation (if STDE FA is considered as being unaffected by the change in variability). Developing more on that, the difference between the temporal correlation for NEW and OP can be written this way (with the hypothesis that the temporal variance of the forecasts is similar to that of the analyses): in which the 'FA', 'F', and 'A' indices have been dropped, and 'NEW' and 'OP' are now used to indicate the experiments names. In order to improve the temporal correlation with the NEW experiment, the following condition must be met; that is, for Δ > 0, the following is required: In situations where the temporal variability is similar for the two experiments, that is, 2 OP ∼ 2 NEW , the conditions naturally reverts to STDE 2 NEW < STDE 2 OP . In situations where STDE 2 NEW ∼ STDE 2 OP , the condition becomes 2 NEW > 2 OP . In that context, Figure 11 provides maps of the temporal variance ratio, that is, 2 NEW ∕ 2 OP , of medium-range forecasts for both air temperature and specific humidity at screen level. In this figure, the red colour indicates areas where NEW's forecasts have greater temporal variability than those from OP (and the reverse for areas in blue). The results indicate that temporal variability for air temperature (upper panel) is substantially increased in NEW over most of the Northern Hemisphere, including the southeastern United States, Europe, and Asia. The variability is decreased over northern areas, in Canada and Eurasia. In Africa, the variability is increased, whereas South America and Australia show more mixed signals.
Comparing Figure 11 upper panel with those from both Figures 3 and 4, it is clear that this high ratio of temporal variability leads to a more positive (or less negative) impact of NEW on the air-temperature temporal correlation than on its STDE. Maybe the most notable area is the Tibetan plateau, which shows worse STDE for NEW, but better temporal correlation, which could be related to a substantial increase in temporal variability. A similar effect is also seen, to a lesser degree, in other areas (e.g., southeastern United States, Europe, and Africa).
There is also a clear link between the STDE decrease for air temperature with NEW for the northern land masses of North America and Eurasia and the fact that temporal variability has been reduced over these areas. But interestingly the temporal correlation has been improved in spite of the decrease in temporal variability, resulting in improvement to both temporal correlation and STDE for air temperature over northern land areas.
Results for specific humidity are shown in the lower panel of Figure 11. For that quantity, the temporal variability is increased for NEW over northern areas (Canada and Asia), western United States, the northern portion of Africa, and the Tibetan plateau. Overall, the impact on the ratio is more mixed for humidity, with large areas where in fact the temporal variability is decreased with the NEW configuration (e.g., central Canada, Europe, and Australia). The influence of model forecasts' variability can be deduced another way, by looking at the evolution of STDE and temporal correlation as a function of forecast lead time, as shown in Figures 5-8. In all these figures, STDE and temporal correlation from NEW and OP are quite similar in the early stages of the forecasts. Differences between the two experiments gradually and smoothly increase with lead time to reach maximum values around Day 7 of the forecasts. These differences then decrease in the last portion of the forecasts, after Day 7, to the point that errors are of the same magnitude at the end of the forecasts, that is, at 240 hr. This type of model errors evolution with lead time is reminiscent of earlier studies focused on documenting and understanding the error growth in atmospheric medium-range forecasts. This includes studies first by Lorenz (1982), followed by Simmons et al. (1995), Boer (1994), Dalcher and Kalnay (1987) and more recently by Bednaf et al. (2021). Of special interest in the context of the present study is the decrease of the error growth (see upper panels in Figures 5-8) towards the end of the 10-day atmospheric forecasts. As model errors get close to saturation evidenced by their asymptotic behaviour (for both STDE and temporal correlation), they are less influenced F I G U R E 11 Global maps of the ratio of NEW to OP temporal variance from (a) air temperature and (b) specific humidity forecasts. Results are valid for the summer 2019 period and are averaged over the 72-144 hr forecast range by transient and small-scale components of the total errors and tend to be more dominated by systematic or climatological structures of the forecasts.
In that last regime the asymptotic values are mostly determined by the variance of the forecasts and of the analyses, as presented for instance in the appendix of Simmons et al. (1995). The fact that the error metrics (STDE and temporal correlation) are trending towards the same asymptotic values suggests that the variance (temporal in that case) is in general similar for both experiments (NEW and OP), at least when spatially averaged over Canada.

On the independence of verification datasets
Independence from the model configurations and from the forecasts that are used to evaluate is an important feature for verification datasets. While this virtue is necessary, it might not be sufficient however. The questions relevant to the present study are thus the following: What is the degree of dependence between the screen-level analyses and medium-range forecasts from GEM? In addition, would alternative screen-level analyses, from other institutions or initiatives, be a better choice?
Assuredly, evaluation of GEM's forecasts versus screen-level analyses issued from the same assimilation cycle that provides its initial conditions implies a certain degree of dependence. This correspondence is linked to the first guess used to generate the screen-level analyses, which comes from the same model configurations used for the forecasts.
A few mitigating factors can be invoked to argue that this dependence on the model first guess is acceptable in the context of this study. First, the quality of the screen-level short-range forecasts used as first guess is substantially greater than the medium-range forecasts that are used to evaluate. Furthermore, the value of the screen-level analyses themselves is a step beyond the first guess, as they benefit from information provided by the assimilation of a large amount of observations, both in the upper-air 4D-EnVar (with contribution from its lowest atmospheric level) and in the LDAS (with contribution from its land-surface analyses).
This continuous correction to both the low-level atmosphere and land-surface state variables prevents the screen-level analyses from deviating far from reality, in contrast with medium-range forecasts, which exhibit larger errors. The impact of assimilated observations is expected to be greater in the NEW experiment, with the inclusion of space-based observations for soil moisture and surface temperature, with relatively frequent and global coverage.
Some might consider the use of screen-level analyses from other centres or initiatives as a preferable alternative, compared with verification versus own analyses as presented here. Accepting the fact that such products are not, a priori, of better quality than this study's screen-level analyses, they would have the benefit of being independent, since other approaches and models are involved.
But this virtue is offset, in a more serious manner, by the introduction of other problems. For one, the surface characteristics related to water, snow, ice, and vegetation coverage would be different (and not necessarily better represented) than what is used in the NEW and OP experiments. Given the strong dependency of the screen-level analyses to these surface characteristics, this discrepancy between external systems and GEM would inject uncertainty associated with the interpolation/aggregation process and more importantly with the different surface characteristics. As a reminder, this very specific advantage of using own analyses for the objective evaluation, that is, elimination of errors associated with grids differences and representativeness errors, was listed as a major advantage when using two-dimensional analyses versus surface observations. These considerations and concerns regarding the use of analyses for model objective evaluation are not new. Other studies and reviews have highlighted the points made above about the pros and cons of this approach. In Ebert et al. (2013)'s review, they refer to the World Meteorological Organization's Manual for Global Data-processing and Forecasting System (WMO no. 485, 2021) stating that evaluation versus analyses at each centre should be performed using the centre's own analyses, instead of using a 'truth' reference from another institution or national centre. Evaluation versus own analyses is requested by WMO because analyses may greatly differ between various national centres. One example provided in Ebert et al. (2013) points to the evaluation of air temperature at 850 hPa (as presented in Park et al., 2008) for which disparities between analyses from several centres could be as large as or even larger than the differences in forecasts performance for various prediction systems. As discussed in this study, this is particularly true close to the land surface and more generally in the lower troposphere over land. Comparison with independent analyses is more useful and meaningful for upper-atmosphere forecasts.

SUMMARY AND CONCLUSIONS
A method is presented in this study to evaluate NWP forecasts of near-surface air temperature and humidity based on comparison with screen-level analyses. Metrics such as STDE and temporal correlation are used for this evaluation and are presented in the form of global maps, plots as a function of lead time, and summary tables. The approach is applied to evaluate the performance of a new land-surface package prepared for implementation at ECCC. The new package includes changes to land-surface characteristics, LDAS, land-surface modelling, and interactions between land and atmosphere. The objective evaluation was performed for summer 2019 and winter 2020. Results from the summer season are presented and discussed in detail, whereas only summary tables are used to describe the winter results.
The findings indicate that the evaluation method provides useful information concerning the impact of the NEW system configuration, compared with the OP reference configuration. Maps clearly indicate the spatial areas where NEW has the advantage over OP, and where it has some difficulties. Other plots inform on the impact of NEW as a function of lead time and include comparative details such as STDE or temporal correlation differences, relative differences in percentage, and gain/loss in predictability hours.
The objective evaluation based on screen-level analyses reveals that for the summer season the more recent land-surface package (NEW) leads to marked improvement in medium-range forecasts of near-surface air temperature and humidity, compared with the operational configuration (OP). This positive impact is shown to be greater for humidity than for temperature, with maximum differences between NEW and OP at around seven-day lead time.
An important message from this study is that objective evaluation of near-surface forecasts from NWP systems is not as easy or as straightforward as what might be inferred from literature, or from the systematic evaluation procedures generally used at operational NWP centres. The main difficulty is not necessarily associated with the metrics that are computed, but rather with the validity of the reference or truth that serves for the evaluation, and how it relates to the NWP model outputs. Surface observations are typically used for assessing NWP systems' performance at the screen level, and this study's aim is not to discourage this practice but rather to advise on several challenges associated with their use, especially in areas or countries with sparse and inhomogeneous observational networks. Screen-level analyses are part of the solution proposed in this study. As discussed in details, this method has its limitations, but it appears to be more suitable for countries such as Canada.
In this study, several issues concerning objective evaluation versus screen-level analyses are discussed. An argument is presented for instance about biases, which for some areas or countries (e.g., Canada) might be difficult to estimate with sufficient accuracy for NWP medium-range forecasts of screen-level air temperature and humidity. Results indicate that for ECCC's GDPS the bias contribution to the total error (RMSE) is small compared to that from STDE during the summer season examined in this study, but could be relatively large during winter. Another issue debated concerns the advantages and disadvantages associated with the use of a configuration's own analyses, versus using analyses from other centres or institutions. Although no perfect solution exists, many advantages are presented in favour of using a configuration's own analyses to evaluate its forecasts performance at or near the land surface. Another important aspect discussed in this article is the relationship between STDE and temporal correlation, linked by the temporal variability of the analyses and forecasts. Maps of the NEW to OP forecast variance ratio exhibit complex spatial structures that must be considered in order to properly interpret the models' evaluation results.
Several streams of research and development could extend from or connect with the present study: • Objective evaluation versus analyses could be expanded to other variables (e.g., winds) and to other aspects of the forecasts (the boundary-layer structure). This type of evaluation could also be done for precipitation over areas where reliable analyses are available. This could be combined with evaluation of upper-air forecasts against own and independent analyses. Most of these aspects have been examined in the context of this project, but presentation and description of these results is beyond the scope of the present article.
• A categorical type of evaluation could be performed with screen-level analyses to determine the systems' ability to predict specific events. For instance, what is the models' ability to predict near-surface air temperature and humidity within certain thresholds above and below observed values? How are they able to predict extreme cold and warm events? Or meteorological events featuring strong winds, intense precipitation?
• Land-surface variables for global deterministic forecasts could be initialized from an open loop to isolate and better quantify the role of LDAS, compared with other changes related to land-surface characteristics and modelling.
• The quality of the screen-level analyses for different systems configurations could be evaluated. It is expected that these analyses will differ for OP and NEW due to the first guess. Furthermore, screen-level analyses directly extracted from the LDAS (OI from OP and CaLDAS from NEW) could be evaluated in the same manner.
• The physical and statistical links between errors for screen-level variables and those either in the atmosphere (e.g., related to the structure of the boundary layer) or in the soil (e.g., soil moisture) could be investigated.
• The objective evaluation could also be stratified by vegetation type, orographic bands, snow presence, atmospheric state, providing useful hints for process-based model errors and to further research investigations.
• The objective evaluation versus screen-level analyses could be weighted spatially based on the estimated uncertainty of these analyses, determined from either analytic or ensemble approaches.
Several of these studies are now under way at ECCC. (Dr Gregory Smith, Dr Hai Lin, and Dr Vincent Fortin) who provided internal review for this manuscript.

FUNDING INFORMATION
This work was funded by Environment and Climate Change Canada.