Dependence on initial conditions versus model formulations for medium‐range forecast error variations

Understanding the root causes of forecast errors and occasional very poor forecasts is essential but difficult. In this paper we investigate the relative importance of initial conditions and model formulation for medium‐range errors in 500 hPa geopotential height. The question is addressed by comparing forecasts produced with ECMWF‐IFS and NCEP‐GFS forecasting systems, and with the GFDL‐fvGFS model initialized with the ECMWF and NCEP initial conditions. This gives two pairs of configurations that use the same initial conditions but different models, and one pair with the same model but different initial conditions. The first conclusion is that the initial conditions play the major role in differences between the configurations in terms of the average root‐mean‐square error for both Northern and Southern Hemispheres as well as Europe and the contiguous US (CONUS), while the model dominates the systematic errors. A similar conclusion is also found by verifying precipitation over low latitudes and the CONUS. The day‐to‐day variations of 500 hPa geopotential height scores are exemplified by one case of a forecast bust over Europe, where the error is found to be dominated by initial errors. The results are generalized by calculating correlations between errors integrated over Europe, CONUS and a region in the southeastern Pacific from the different configurations. For Europe and southeast Pacific, the correlations in the medium range are highest between the pairs that use the same initial conditions, while over CONUS they are highest for the pair with the same model. This suggests different mechanisms behind the day‐to‐day variability of the score for these regions. Over CONUS the link is made to the propagation of troughs over the Rockies, and the result suggests that the large differences in parametrizations of orographic drag between the models play a role.

systems such as satellites, and the usage of these observations. The relative contributions to forecast improvement over the past decades from initial conditions and forecast model were investigated by Magnusson and Källén (2013). They found that the two aspects have contributed similarly to the improvements seen over the past decades. The fastest pace of improvement for 500 hPa geopotential height (z500) forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) was between 1995 and 2005, with the introduction of three-dimensional and subsequently four-dimensional variational data assimilation in 1997 (Rabier et al., 2000) and following improvements in the usage of satellite data. However, the quality of the initial conditions is also highly dependent on the model errors through the use of a short forecast as a "first guess" for the data assimilation and the use of the model within the data assimilation system.
The predictive skill of a forecast is highly flow-dependent in midlatitudes and shows large variations from day to day. Occasionally, forecasts experience very low scores and such episodes are often referred to as "forecast busts" or "dropouts". The mechanisms behind such busts in medium-range forecasts have been investigated by, for example, Rodwell et al. (2013) and Magnusson (2017). Even though it is clear that the atmospheric dynamics play the major role in the error amplification, it is still unclear if the major source of the error originates from the initial conditions or appears during the model integration because of errors in the model formulation and/or computational errors.
Several different techniques to track the origin of forecast errors are available. Magnusson (2017) discussed the usefulness of manual error tracking, ensemble sensitivities based on work from Torn and Hakim (2008) and nudging experiments. These techniques indicate the geographical origin of the error, but cannot disentangle errors in initial conditions from those arising during the forecast integration due to model/computational errors. One method to address this question is to use the same initial conditions in different models and compare with forecasts using different initial conditions in the same model. However, this method requires a global model which can be initialized from different analyses and usually needs major collaborations between different modelling centres, e.g. as in the ensemble evaluation by Richardson (2001) and Harrison et al. (1999). In Rodwell et al. (2012), the UK Met Office (UKMO) model was used and reproduced a case of a poor ECMWF forecast over Europe in April 2011 using the same ECMWF analysis as initial conditions.
In this study, based on a collaboration between ECMWF and National Oceanic and Atmospheric Administration Geophysical Fluid Dynamics Laboratory (NOAA/GFDL), the functionality to use ECMWF analyses as the model initial conditions was built into the GFDL finite-volume Global Forecasting System (fvGFS; Chen et al. 2018a) to evaluate the forecast difference from using the National Centers for Environmental Prediction (NCEP) GFS and ECMWF analyses as the initial conditions. The model fvGFS has been developed and maintained at GFDL. This experimental set-up gives two pairs of forecasts using the same initial conditions but different models, and one pair with the same model but different initial conditions. In this paper we will evaluate these forecasts in midlatitudes in order to address the question of the dependency of model versus initial errors with respect to the day-to-day variability in z500 scores. It should be noted that the initial conditions are partly a product of the model used in the data assimilation system, i.e. the initial conditions are a product of the observation usage, data assimilation method and the model. However, if a difference appears between forecasts starting from the same initial conditions but different models, it will highlight dependency on the model formulation. Even if we were not able to fully answer the question of the impact from the model formulation for the full forecasting system, the results for this study are important, for example to guide development of ensemble systems and to interpret differences between forecasts from different forecasting systems.
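The pairing logic of this set-up can be made explicit with a small sketch. The configuration names follow the paper (EC, GFS, FV3gfs, FV3ec); the dictionary layout and the function name `classify_pairs` are illustrative assumptions, not part of any forecasting system.

```python
from itertools import combinations

# (forecast model, source of initial conditions) for each configuration.
# The layout of this table is an assumption made for illustration.
CONFIGS = {
    "EC": ("IFS", "ECMWF"),
    "GFS": ("GFS", "NCEP"),
    "FV3gfs": ("fvGFS", "NCEP"),
    "FV3ec": ("fvGFS", "ECMWF"),
}


def classify_pairs(configs):
    """Label every pair of configurations by what the two members share."""
    labels = {}
    for a, b in combinations(sorted(configs), 2):
        same_model = configs[a][0] == configs[b][0]
        same_ic = configs[a][1] == configs[b][1]
        if same_model and not same_ic:
            labels[(a, b)] = "same model"
        elif same_ic and not same_model:
            labels[(a, b)] = "same initial conditions"
        else:
            labels[(a, b)] = "neither shared"
    return labels


pairs = classify_pairs(CONFIGS)
for pair, label in pairs.items():
    print(pair, label)
```

Of the six possible pairs, two share initial conditions, one shares the model and three share neither. The sketch treats GFS and fvGFS as different models, although they share many physical parametrizations.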
We are focusing here on z500 to understand the impact on the synoptic predictability. Medium-range forecast errors in the surface parameters such as 2 m temperature are often dependent on the ability to predict the large-scale flow, such as in the case of the European forecast bust presented in Grams et al. (2018).
The structure of the paper is as follows. In Section 2 the different models and data assimilation systems are briefly described. In Section 3 the results are presented for the scores for a sample covering a full year, for one case of a forecast bust in the ECMWF forecast and for the correlation of the day-to-day variations in the full sample. In Section 4 the results from the previous section are further explored, and finally the results are summarized in Section 5.

ECMWF IFS
The ECMWF operational forecasts are produced with the ECMWF Integrated Forecasting System (IFS). The section below gives a brief description of the model system and details can be found in ECMWF (2016). The dynamical core of the model is hydrostatic and spectral, and employs a two-time-level semi-implicit semi-Lagrangian scheme combined with a spectral transform technique for the horizontal discretization and a finite-element method for the vertical discretization (Untch and Hortal, 2004). The current resolution for the ECMWF high-resolution, deterministic forecast (HRES) is approximately 9 km in the horizontal, and with 137 vertical levels reaching up to 0.01 hPa (approximately 80 km). The radiation code is based on the Rapid Radiation Transfer Model and cloud-radiation interactions are taken into account using the McICA (Monte Carlo Independent Column Approximation) method (Morcrette et al., 2008). The parametrization of convection is based on the mass-flux approach (Bechtold et al., 2008) with a modified CAPE closure leading to an improved diurnal cycle of convection (Bechtold et al., 2014). The cloud and large-scale precipitation scheme is based on Tiedtke (1993), but has been substantially upgraded with separate prognostic variables for cloud water, cloud ice, rain, snow and cloud fraction, and improved parametrization of microphysical processes (Forbes and Ahlgrimm, 2014). The orographic gravity wave drag is parametrized following Lott and Miller (1997) and Beljaars et al. (2004), and a non-orographic gravity wave drag parametrization is described in Orr et al. (2010). Recent modifications to these schemes include reduced turbulent mixing in stable conditions and increased orographic drag. The surface module of IFS is described in Balsamo et al. (2014).
The data assimilation component of IFS consists of a four-dimensional variational data assimilation (4D-Var; Rabier et al. (2000)). The initial conditions for the operational forecasts use a 6 hr assimilation window (±3 hr of the initialization time), while the first-guess forecast is provided from an analysis based on a 12 hr window. To provide background-error statistics, a 25-member ensemble of 4D-Var assimilations (Ensemble Data Assimilation, or EDA) is run with a lower horizontal resolution (Bonavita et al., 2012).
The forecasts and initial conditions used in this study are based on model cycle 41r2, which became operational in March 2016, with 9 km resolution and 137 vertical levels. This study uses initial conditions and forecasts from the pre-operational testing up to 8 March 2016, when the cycle was implemented, and operational data thereafter. During the evaluation period studied in this article, the model (HRES) used persisted SST anomalies during the integration.

NCEP GFS
The atmospheric forecast model used in the NCEP GFS is a global spectral model (GSM) with spherical harmonic basis functions. Details about the GFS model system can be found at http://www.emc.ncep.noaa.gov/GFS/doc.php (accessed 18 April 2019). The current operational horizontal resolution is approximately 13 km at the Equator for forecast days 0-10. In the vertical there are 64 hybrid sigma-pressure (Sela, 2009) layers with the top layer centred around 0.27 hPa (approximately 55 km). The current operational dynamical core of the GFS is based on a two-time-level semi-implicit semi-Lagrangian discretization (Sela, 2010) with three-dimensional Hermite interpolation.
The radiation parametrizations are modified and optimized versions of the Rapid Radiative Transfer Models (Clough et al., 2005). A hybrid eddy-diffusivity mass-flux (EDMF) PBL parametrization has been used since January 2015. The orographic gravity wave drag and mountain blocking parametrization is scale-aware in the GFS and implemented across NCEP global and regional models (Chun and Baik, 1994; Kim and Arakawa, 1995; Kim and Doyle, 2005). For deep and shallow convection, the Simplified Arakawa-Schubert (SAS) parametrization (Han and Pan, 2011) is used. The Zhao-Carr grid-scale scheme is used for condensation and precipitation parametrization (Zhao and Carr, 1997). The land surface model (LSM) of GFS is the Noah model with four soil layers (10, 30, 60 and 100 cm thick; Ek et al., 2003).
The initial conditions for the global forecasts are obtained through the Global Data Assimilation System (GDAS) which uses a hybrid four-dimensional ensemble variational formulation (Hybrid 4DEnVar; Buehner et al. (2013)). The GDAS ingests all available observations within a ±3 hr window of the analysis time. A 9 hr GSM forecast (TL1534 interpolated to TL574) from the previous GDAS analysis is used as the first guess for the assimilation. The GDAS also runs with a late data cut-off to provide the first-guess forecast for the next 6-hourly cycle.

NOAA/GFDL fvGFS
The NOAA/GFDL fvGFS model combines the Finite-Volume Cubed-Sphere Dynamical Core (FV3) and the common GFS physics package, which was initially provided for the Next Generation Global Prediction System (NGGPS) phase II to test the robustness of the dynamical cores under a wide range of realistic atmospheric initial conditions. FV3 uses the "vertically Lagrangian" dynamics of Lin (2004) extended with a non-hydrostatic pressure-gradient computation of Lin (1997) and a semi-implicit solver for vertically propagating sound waves, discretized on the cubed-sphere grid of Putman and Lin (2007). An alternative vertically Lagrangian non-hydrostatic discretization is also demonstrated in Chen et al. (2013b). The GFS physical parametrizations described in the previous section are used in this study, except for the Zhao-Carr grid-scale condensation and precipitation parametrization, which is replaced with the GFDL single-moment six-category cloud microphysics. In the most recent version of fvGFS, the EDMF PBL scheme is replaced with the YSU (Yonsei University) PBL scheme (Hong et al., 2006), and a mixed-layer ocean model (Pollard et al., 1973) is also used with some modifications, e.g. an ocean current damping term, a relaxation of the ocean mixed-layer depth toward observational climatology and a relaxation of the SSTs toward observational climatology plus initial anomaly. Initialization of the atmosphere, land surface, and sea-surface temperatures (SSTs) is from NCEP operational analyses. A 13 km uniform-resolution version of fvGFS has been run in real time at GFDL since mid-2016. The forecast characteristics of the fvGFS, with a focus on tropical cyclone prediction, are described in Chen et al. (2018a) and Hazelton et al. (2018).
Besides the pre-processing tool developed during the NGGPS phase II to use the GFS analysis as the initial condition, a sophisticated interpolation tool was also developed in fvGFS to carefully use the IFS analysis data as the initial condition. The interpolation procedure from the Gaussian grid to the cubed-sphere grid is documented in Chen et al. (2018b). All upper-air variables are initialized from the IFS analysis, while the land surface variables are initialized from the GFS analysis for practical reasons. This discrepancy will affect land surface temperatures and also sub-seasonal predictions, but is believed to have a small impact on medium-range upper-air variability which is the scope of this study.

RESULTS
To compare different forecast configurations over a full year, forecasts are initialized/used every fifth day between 15 August 2015 and 9 August 2016. By running forecasts every fifth day, the temporal correlation of errors is expected to be small. EC is based on pre-operational runs before 8 March 2016 and operational runs of ECMWF HRES forecast thereafter. GFS is based on operational NCEP deterministic forecasts, while FV3gfs uses the GFDL fvGFS initialized from NCEP initial conditions and FV3ec uses fvGFS initialized from ECMWF initial conditions. The verification presented in this paper is based on z500 on a 1° regular lat-lon grid. The forecasts are verified against the UKMO analysis, provided by the TIGGE archive (Bougeault et al., 2010), so as not to favour any of the configurations. Mean error (ME), root-mean-square error (RMSE) and anomaly correlation coefficients (ACC) have been calculated and evaluated. The scores have been calculated for the Northern Hemisphere (N.Hem, 20°-90°N), Southern Hemisphere (S.Hem, 20° ...), Europe and CONUS. While the results presented in the next sections are based on RMSE, similar conclusions hold for ACC. As a complement to the z500 evaluation, we have also evaluated 6-hourly precipitation forecasts against TRMM 3B42 version 7, which merges satellite rainfall estimates with gauge data (Chen et al., 2013a). The temporal resolution of TRMM 3B42 is 3-hourly, while the horizontal resolution is 0.25° between 50°S and 50°N. The precipitation is evaluated for a zonal band spanning lower latitudes (40°N-40°S) and for CONUS.
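As an illustration of the three scores, the following minimal sketch computes ME, RMSE and ACC for a single field on a regular lat-lon grid. It assumes cos(latitude) area weighting and an uncentred anomaly correlation. These are common conventions, but the text does not state which were used here, and all function names are hypothetical.

```python
import math
from operator import mul, sub


def weighted_mean(field, lats_deg):
    """Area-weighted mean of a [lat][lon] field (cos-latitude weights, assumed)."""
    num = den = 0.0
    for lat, row in zip(lats_deg, field):
        w = math.cos(math.radians(lat))
        num += w * sum(row)
        den += w * len(row)
    return num / den


def _combine(f, g, op):
    """Apply a binary operator point by point to two fields of equal shape."""
    return [[op(x, y) for x, y in zip(rf, rg)] for rf, rg in zip(f, g)]


def mean_error(fc, an, lats_deg):
    """Mean error (bias) of forecast fc against analysis an."""
    return weighted_mean(_combine(fc, an, sub), lats_deg)


def rmse(fc, an, lats_deg):
    """Root-mean-square error of forecast fc against analysis an."""
    err = _combine(fc, an, sub)
    return math.sqrt(weighted_mean(_combine(err, err, mul), lats_deg))


def acc(fc, an, clim, lats_deg):
    """Uncentred anomaly correlation with respect to a climatology clim."""
    fa = _combine(fc, clim, sub)
    aa = _combine(an, clim, sub)
    cov = weighted_mean(_combine(fa, aa, mul), lats_deg)
    var_f = weighted_mean(_combine(fa, fa, mul), lats_deg)
    var_a = weighted_mean(_combine(aa, aa, mul), lats_deg)
    return cov / math.sqrt(var_f * var_a)


# Toy 2x2 example with made-up values (not data from this study).
lats = [0.0, 60.0]
forecast = [[2.0, 2.0], [4.0, 4.0]]
analysis = [[0.0, 0.0], [0.0, 0.0]]
print(mean_error(forecast, analysis, lats), rmse(forecast, analysis, lats))
```

Regional scores would follow from restricting the fields and latitude list to the box of interest before calling these functions.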

Average scores
RMSE (Figure 1) have been calculated for N.Hem, S.Hem, Europe and CONUS for the four forecast datasets based on all cases. Figure 1b,d,f,h show the differences in RMSE with respect to FV3ec. In these panels the statistical significance of the differences with respect to FV3ec has been calculated with a Student's t-test, and lead times where the difference is significant (95% confidence level) are indicated with dots. For lead times up to 2-3 days, the EC forecasts show the lowest errors for all evaluated regions, including being lower than FV3ec. One could speculate whether this difference is due to more similarities in the mean climate between the UKMO analysis and ECMWF forecasts. For the 5-7-day forecasts, the scores over N.Hem and S.Hem for EC and FV3ec are similar without any significant differences, and both are significantly better than FV3gfs and GFS. For N.Hem, FV3gfs has lower RMSE than GFS, signalling that the model difference plays a role for the forecast error, but such a difference is not prominent for S.Hem. For the smaller regions (Europe, CONUS), the scores are noisier but give a similar ordering to that for the full Northern Hemisphere.
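A significance test of this kind can be sketched as a paired t-test on the case-by-case score differences at one lead time. The pairing by forecast date and the critical value of about 1.99 (two-sided 95% level for roughly 70 degrees of freedom) are assumptions for illustration, since the text only states that a Student's t-test was used.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t-statistic for the mean of paired differences scores_a - scores_b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


def is_significant(t_value, t_crit=1.99):
    """Two-sided test; t_crit is an approximate, assumed critical value."""
    return abs(t_value) > t_crit


# Made-up RMSE values for four forecast dates (not data from this study).
rmse_other = [1.0, 2.0, 3.0, 4.0]
rmse_reference = [0.0, 0.0, 0.0, 0.0]
t_value = paired_t_statistic(rmse_other, rmse_reference)
print(t_value, is_significant(t_value))
```

The test assumes that the differences from separate forecast dates are independent, which is the motivation given above for sampling only every fifth day.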
For the mean error (ME) of z500 (Figure 2), all forecasts initially have a negative bias compared to the UKMO analysis for both hemispheres. This difference is due to a bias in the UKMO data assimilation which was corrected during spring 2016 and hence affected the first part of the sample. FV3ec has somewhat lower geopotential already after 6 hr, although initialized from the same analysis as EC. Later both FV3 systems drift to a significantly lower geopotential in N.Hem, while EC has a slight positive drift and GFS shows a slow negative drift.
To further evaluate the average differences between the simulations, Figures 3 and 4 show the RMSE and ME in precipitation, respectively, for 40°N-40°S and for the CONUS. Europe has not been included because a large part of it lies outside the TRMM region. For 40°N-40°S, all forecasts overestimate the precipitation compared to TRMM, with the largest biases for GFS. As FV3ec is initialized from an analysis from a different model, we can expect some initialization shock even if we apply the initialization procedure outlined in Chen et al. (2018b). For the first 6 hr we indeed find the highest bias in FV3ec, but later the bias quickly stabilizes and is similar to FV3gfs. For EC we find a strong diurnal cycle in the precipitation bias for both regions. This is not necessarily due to a real diurnal cycle bias, as the verification for 40°N-40°S includes all longitudes, but rather can be a sign of a geographical bias. One contributor here is an over-prediction of precipitation over tropical South America (not shown) in the EC, which contributes to the 1200-1800 UTC verification. Evaluating the biases over CONUS, the EC has a strong negative bias for 0000-0600 UTC due to underestimation of evening precipitation over the central US (not shown), but shows small biases for the other times of the day. The diurnal cycle in the ECMWF forecasts was greatly improved in 2013 with an update of the convection scheme discussed in Bechtold et al. (2014).
Regarding the RMSE for precipitation (Figure 3) for 40°N-40°S, we find the lowest errors for all lead times apart from the first day for EC, followed by FV3ec. For the first day, FV3ec obtains the lowest RMSE despite the spin-up issue mentioned in the previous paragraph. The results show that, also for RMSE in precipitation, the initial conditions play an important role. For CONUS, the results are noisier, but here too we find a lower RMSE for the forecasts initialized with ECMWF analyses for the first few days.
The results in this section show that the mid-tropospheric forecast errors in the range of 5-7 days have larger sensitivity to the initial conditions than to the forecast model, with some advantage also for the FV3 model compared to the GFS. For the mean error, the model dependence is much stronger but, despite very different bias structure between EC and FV3ec, their RMSE is similar and superior to that of the forecasts initialized from GFS analyses. In the coming sections we will investigate whether the sensitivity to the choice of initial conditions also holds for the day-to-day variability of the z500 errors.

Case-study
To exemplify the errors in the different forecasts, we have chosen one case of a forecast bust for Europe. The case has previously been evaluated in Magnusson (2017) and Grams et al. (2018) and we refer to these papers for details of the predictability and error sources. The bust was associated with a failure in predicting a blocking over Europe and resulted in large 2 m temperature errors for northwestern Europe. The case presented here was the most extreme case in terms of ECMWF forecast error during the evaluated period for day 6 over Europe. Figure 5 shows the time series of RMSE for 6-day z500 forecasts over Europe for EC, GFS, FV3gfs and FV3ec. To give an estimate of the uncertainty in the forecasts, the scores from individual ensemble members are also included from the ECMWF and NCEP operational ensembles, available through the TIGGE archive (Bougeault et al., 2010). The ensemble results are visualized with the median and the area covered by the 10th to 90th percentile of the ensemble members (80% of the members). The figure also includes the ensemble standard deviation (spread) for both ensembles. Note that the median value of the scores for individual members is not the same as the score of an ensemble mean, and that one should expect on average a lower RMSE from a deterministic unperturbed forecast than from individual ensemble members.
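The distinction between the median of the member scores and the score of the ensemble mean can be seen in a toy sketch. The three-member ensemble and its values are invented for illustration, and the helper names are hypothetical.

```python
import math
from statistics import median, quantiles


def rmse(forecast, truth):
    """Unweighted RMSE over a list of grid-point values."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth))


def member_score_summary(members, truth):
    """Median and 10th/90th percentiles of the RMSE of individual members."""
    scores = [rmse(m, truth) for m in members]
    deciles = quantiles(scores, n=10)  # nine cut points: 10th, ..., 90th
    return {"median": median(scores), "p10": deciles[0], "p90": deciles[-1]}


def ensemble_mean_rmse(members, truth):
    """RMSE of the ensemble-mean forecast."""
    mean_fc = [sum(vals) / len(members) for vals in zip(*members)]
    return rmse(mean_fc, truth)


# Invented toy ensemble: each member has an RMSE of 1, but the errors
# partly cancel when the members are averaged.
truth = [0.0, 0.0]
members = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
summary = member_score_summary(members, truth)
print(summary["median"], ensemble_mean_rmse(members, truth))
```

In this example the median member RMSE is 1, while the ensemble mean has an RMSE of 1/3, because averaging removes part of the unpredictable error.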
From the time series in Figure 5, variations in scores from day to day are evident due to periods of higher and lower atmospheric predictability. We also find variations of the ranking between the different forecast systems. Although EC and FV3ec on average have lower RMSE for both regions for the full period ( Figure 1) than GFS and FV3gfs, this does not hold true for every day, as demonstrated by the case below.
In the beginning of March 2016, both NCEP and ECMWF forecasts experienced increased errors over Europe; this is also apparent considering the errors in both ensembles and we also find an increased ensemble standard deviation. However, the NCEP forecast recovered from the error a day earlier than ECMWF, and for the 7 March 0000 UTC forecast the 50% (median) of the ECMWF ensemble members was worse than 90% of the NCEP ensemble members, and the EC deterministic forecast (red dot) was far worse than GFS (green dot). Comparing the two FV3 forecasts, the FV3ec performed worse than FV3gfs. Magnusson (2017) traced the ECMWF forecast error back to the development of a surface low over the western Atlantic. The error later amplified and propagated eastward in connection with a warm conveyor belt that was triggered ahead of the low (Grams et al., 2018). In the 4-6 day range, the error affected the ridge-building over the northeastern Atlantic and later the onset of a Scandinavian blocking regime.  Figure 6 illustrates the z500 error in 2-day and 6-day forecasts initialized at 0000 UTC on 7 March from the four different forecast configurations. In the 2-day forecasts, both forecasts initialized from ECMWF initial conditions (EC and FV3ec, a-d) show similar structures of the errors over the central Atlantic, with a too zonal flow compared to the corresponding analysis. Such error structure is not present in the two forecasts initialized with GFS initial conditions (GFS and FV3gfs). Comparing the 6-day error in the different forecasts, the error related to the ridge over northern Europe is much larger in the forecasts with ECMWF initial conditions than in the forecasts with GFS initial conditions. This result indicates that, for this case, errors in the initial conditions played the major role in creating the forecast bust over Europe. In the next section we will try to generalize the results by using the full experiment period.

Correlations of errors
To further investigate the relative importance of the initial conditions and model formulations for the day-to-day variations in scores, in this section we examine correlations of errors between the forecast configurations. Figure 7 shows the Pearson correlation of RMSE versus lead time from different pairs of configurations. A confidence interval (shaded) is included for FV3ec versus FV3gfs based on a bootstrap method where we have generated 1,000 series with the same length (i.e. with replacement) as the original series by randomly sampling from the original series. The confidence interval shows the 5th to 95th percentile of the distribution from the random series. To test the sensitivity for the correlation calculation, the Spearman correlation (not shown) has also been calculated and the results give a message similar to the Pearson correlation.
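The bootstrap described above can be sketched as follows. The sketch assumes that the two RMSE series are resampled jointly, so that the same randomly drawn forecast dates are used for both configurations. The function names, the fixed random seed and the percentile indexing are illustrative choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=1000, lower=5, upper=95, seed=0):
    """Percentile interval of the correlation from resampling with replacement."""
    rng = random.Random(seed)
    n = len(x)
    corrs = []
    while len(corrs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same dates for both series
        try:
            corrs.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero variance
    corrs.sort()
    lo_idx = int(lower / 100 * (n_boot - 1))
    hi_idx = int(upper / 100 * (n_boot - 1))
    return corrs[lo_idx], corrs[hi_idx]


# Invented, perfectly correlated series (not data from this study).
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 for v in x]
ci_low, ci_high = bootstrap_ci(x, y, n_boot=200)
print(pearson(x, y), ci_low, ci_high)
```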
The reason for showing the sub-regions (Europe, CONUS and SE Pacific in this case) instead of the full hemispheres is that we are interested in the flow-dependent variations of the errors, which are often smoothed out when averaging over the full hemisphere. The region over the SE Pacific is defined as 60°-25°S, 120°-75°W. The three regions are outlined in Figure 8.
First of all, we find relatively high correlations in all pairs of experiments for all regions. This correlation is due to the flow-dependence of the error, meaning that forecasts with different model and initial conditions still have correlated errors. For the shortest lead times, the correlations are highest for the combinations that use the same initial conditions, as expected. Already at 1-day lead time, differences appear between the three regions. For Europe and SE Pacific, the initial separation between the pairs persists and highest correlations are found for the pairs that share initial conditions (FV3gfs/GFS and FV3ec/EC) for all lead times well into the medium range. On the other hand, the correlation between FV3ec/FV3gfs error is lower and on a similar level to the combinations that do not share model or initial conditions (FV3ec/GFS, FV3gfs/EC and GFS/EC). This indicates that also for the full period, the initial conditions play the largest role for the error variations in z500 over Europe and SE Pacific for all time-scales, including the medium range.
As for Europe, for CONUS the highest correlation is found for FV3gfs/GFS. We need to keep in mind that FV3gfs and GFS share both the initial conditions and many of the physical parametrizations in the model. In contrast to the results for Europe, the correlations for FV3ec/EC are lower than for FV3gfs/FV3ec for days 1-3 and after that are on a similar level. The correlation for days 1-3 for FV3ec/EC is also slightly lower than for FV3ec/GFS, where the latter pair shares large parts of the model physics. This result indicates that the physical parametrizations might play a larger role in the day-to-day variation of the errors for North America than for Europe, especially for relatively short-range forecasts. In the next section we discuss these results.

DISCUSSION
To understand the regional differences in the error correlations found in the previous section, Figure 8 shows the RMS difference (RMSD) in z500 between (a, b) FV3gfs and GFS, (c, d) FV3ec and EC, and (e, f) FV3ec and FV3gfs after 6 hr and 30 hr into the forecasts. The panels include the outlines of the regional boxes used in Figure 7 plus a smaller box over central CONUS. The 6 hr RMSD should be close to the initial RMSD and is expected to be small for FV3gfs/GFS, and FV3ec/EC. However, the initialization procedure could create some initial differences, and also there may be rapidly developing differences due to model formulations. For the FV3ec/EC pair we find the largest 6 hr RMSD where we have the highest orography. The RMSD between FV3ec and FV3gfs should reflect the differences between the ECMWF and GFS analyses. Here the largest RMSD are found in the Southern Hemisphere, where the observation networks are more sparse and hence the analyses are less constrained. In the Northern Hemisphere the largest RMSD are found for a similar reason over the Arctic and over oceans. Regarding the RMSD between FV3ec and FV3gfs after 30 hr, we find the largest values over the southern oceans and also over the Atlantic. The region with large RMSD over the Atlantic partly falls into the Europe box.
As synoptic-scale errors (and forecast differences) in the first part of medium-range forecasts grow roughly exponentially (Lorenz, 1982;Magnusson and Källén, 2013), forecasts with large initial differences will separate faster than forecasts that start with very small initial differences. Such separation is faster in regions with strong baroclinicity. For the Southern Hemisphere, the relatively large difference in initial conditions is a likely explanation of the low error correlations between the forecasts with different initial conditions compared to the ones that share initial conditions. This is also likely to be the explanation for the dependency on the initial conditions for the day-to-day variability of score over Europe, as it is downstream of the Atlantic with relatively large initial differences and a dynamically active region with fast baroclinic growth of disturbances (not shown).
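The effect of the size of the initial difference under exponential growth can be illustrated with a one-line calculation. The doubling time and amplitudes below are arbitrary illustrative numbers, not values diagnosed in this study, and the function name is hypothetical.

```python
import math


def time_to_reach(initial_diff, threshold, doubling_time_days):
    """Days for an exponentially growing difference to reach a threshold."""
    return doubling_time_days * math.log2(threshold / initial_diff)


# Arbitrary illustrative values: same growth rate, different initial amplitude.
t_small = time_to_reach(initial_diff=1.0, threshold=8.0, doubling_time_days=1.5)
t_large = time_to_reach(initial_diff=4.0, threshold=8.0, doubling_time_days=1.5)
print(t_small, t_large)
```

With the same growth rate, a fourfold larger initial difference reaches the threshold two doubling times earlier, which is the sense in which forecasts with large initial differences separate faster.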
However, the argument above should also apply for CONUS which is downstream of the Pacific, while we here find higher error correlations for the forecast with the same model. As the shift in the ordering appears during the first  (Figure 7b), we can expect local processes to play a key role. It is also worth noting that the correlation after 6 hr is still largest for the pairs that share initial conditions, and is therefore not simply a product of the initialization process in FV3 from ECMWF initial conditions. After 30 hr, we find a local maximum of RMSD between EC and FV3ec over central North America (Figure 8) on the eastern side of the Rockies. The magnitude is similar to (and locally higher than) what we find between FV3ec and FV3gfs.
To understand the local maximum in the RMSD between EC and FV3ec over central CONUS, we make a composite of the seven cases (10% of the sample) with the largest growth of RMSD (dRMSD) between 6 and 30 hr. The selection is based on dRMSD in the region of 30°-50°N, 110°-90°W (outlined in Figure 9). To inspect the evolution of mean z500 for these cases, the composites of the analyses from 6 hr before, 18 hr after, and 42 hr after the forecast dates are plotted in Figure 9. The middle point ("18 hr after") is chosen to be in the middle of the window used for the difference growth. The difference from zero anomaly has been tested with a Student's t-test, and areas that pass the 95% confidence limit are plotted in bold colours.
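The case selection and compositing can be sketched as below. The sketch assumes that dRMSD is the plain difference between the 30 hr and 6 hr RMSD in the selection box, which the text does not state explicitly. The function names are hypothetical.

```python
def select_top_growth(rmsd_6h, rmsd_30h, fraction=0.1):
    """Indices of the cases with the largest RMSD growth between 6 and 30 hr."""
    growth = [late - early for early, late in zip(rmsd_6h, rmsd_30h)]
    n_select = max(1, round(fraction * len(growth)))
    ranked = sorted(range(len(growth)), key=lambda i: growth[i], reverse=True)
    return sorted(ranked[:n_select])


def composite(fields, indices):
    """Average of [lat][lon] fields over the selected cases."""
    n_rows = len(fields[indices[0]])
    n_cols = len(fields[indices[0]][0])
    return [
        [sum(fields[i][r][c] for i in indices) / len(indices) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Invented toy data: ten cases with steadily increasing growth.
rmsd_early = [1.0] * 10
rmsd_late = [1.0 + i for i in range(10)]
toy_fields = [[[float(i)]] for i in range(10)]
chosen = select_top_growth(rmsd_early, rmsd_late, fraction=0.2)
print(chosen, composite(toy_fields, chosen))
```

For a sample of 73 forecast dates (every fifth day over the evaluation period), a fraction of 0.1 selects seven cases, consistent with the number quoted above.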
In the composite of the analyses 6 hr before the initialization, a ridge was present over the eastern Pacific and a trough was centred over the west coast of North America. At 24 hr later, the trough was passing the Rockies and starting to amplify in the lee of the mountains. Another day later (42 hr after the initialization time of the forecasts), the trough over central US had further amplified and we also find a ridge further downstream forming part of a Rossby wave packet. The location of the trough agrees with the region of the large RMSD between FV3ec and EC after 30 hr.
These composites suggest that the cases of most rapid dRMSD between FV3ec and EC are related to the passage of troughs over the Rockies. As the formation of a trough in the lee of mountains is dependent on the conservation of potential vorticity (e.g. Holton 1993), differences in diabatic processes related to orography between FV3 and EC seem to be possible causes for the growth in z500 difference.
To investigate the difference between FV3ec and EC related to orography, Figure 10 shows diagnostics of surface drag from the parametrization of the sub-grid turbulence and gravity waves plus blocking (ECMWF, 2016). Comparing the two components, we find large differences in the partitioning between the surface drag processes in the two models, where the main part of the drag in the IFS model is from the sub-grid turbulence, while in the FV3 the main drag over orography is from the gravity-wave part. The sum of the components over the Rockies results in 5% more drag in the FV3 model. It is also worth noting that the spatial pattern of the drag is different in the two models, with more localized drag over high mountains in FV3. The difference could be due to the parametrizations of the orographic drag and/or the definition of the orography in the different models. As the orographic drag has a day-to-day variability due to the large-scale flow, it could therefore result in differences between the two models and at least partly explain the day-to-day variability in the scores.
As discussed in the previous section, there are also significant differences in the diurnal precipitation cycle between EC and both FV3 systems over CONUS. These could also contribute to the differences in z500 for the simulations starting from the same initial conditions, as the precipitation processes interact with the large-scale flow on slightly different time-scales. However, we did not find any strong correlation between convective precipitation and cases with fast dRMSD between FV3ec and EC (not shown).

SUMMARY AND CONCLUSIONS
In this paper we have investigated the relative importance of initial conditions and model formulation for errors in 500 hPa geopotential height (z500), both in terms of average root-mean-square error (RMSE) and mean error as well as the day-to-day variability, and average scores for precipitation. We have also diagnosed orographic drag in the models to further understand the results. The relative importance is addressed by comparing forecasts produced with the same initial conditions but different models and forecasts using different initial conditions but the same model. The models used for this study are the ECMWF-IFS, NCEP-GFS and GFDL-fvGFS with similar horizontal resolution (9 km for the first and 13 km for the others). While the first two models have a hydrostatic and spectral dynamical core, the fvGFS is non-hydrostatic with a cubed-sphere grid and a finite-volume discretization. At the same time, the physical parametrizations are similar for many of the processes in GFS and fvGFS. The GFDL-fvGFS has been initialized from both ECMWF and NCEP initial conditions. The caveat with this approach is that the model plays a significant role in the creation of initial conditions, and some of the difference in the initial conditions is therefore due to model differences.
The main conclusion is that the initial conditions play the major role for differences between the tested configurations for RMSE of z500 for at least 8 days into the forecasts, and for 5 days for precipitation. For the mean error (bias), the model formulation is the dominating factor for the investigated parameters. For the day-to-day variation of RMSE for z500, the initial conditions dominate for Europe and for the Southern Hemisphere, while for North America the model formulation has a larger influence.
For the RMSE the forecasts using ECMWF initial conditions yield the lowest RMSE in the medium range for both Northern and Southern Hemispheres, with similar scores for the ECMWF forecasts and the fvGFS model when using ECMWF initial conditions. We also find that the forecasts initialized from ECMWF analyses have an advantage for rainfall predictions. This result should not directly be interpreted to mean the scores are insensitive to the model formulation, but rather that the two different models have similar quality although they are very different in formulation.
Figure 7. The green sub-box over US is explained in the text
For the day-to-day variation of the errors, relatively large correlations are found for all pairs of experiments. This is not unexpected, as the flow conditions affect all forecasts and during unpredictable situations all forecasts are more likely to have large errors. Such situations vary with season and were recently reviewed by Lillo and Parsons (2016). On top of the correlations given by the flow situation, the initial conditions contribute most to the error correlations over Europe and the Southern Hemisphere. The relatively large difference over oceans between the two sets of initial conditions used is one contributing factor. Such differences will amplify quickly, especially over regions with strong baroclinicity such as the Atlantic and southern oceans. The experimentation here also confirmed that errors in the initial conditions played the major role for the March 2016 ECMWF forecast bust discussed in Magnusson (2017) and Grams et al. (2018).
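As an illustration of this kind of diagnostic, the sketch below computes Pearson correlations between regional error time series for every pair of configurations. It is a minimal example under stated assumptions, not the study's actual code: the function name `pairwise_error_correlations` is hypothetical, and the error series are synthetic, built from a shared flow-dependent part, a part tied to the initial conditions and a smaller part tied to the model.

```python
import numpy as np


def pairwise_error_correlations(errors):
    """Pearson correlation of error time series for every pair of experiments.

    errors : dict mapping experiment name -> 1-D array with one regionally
             integrated error value per forecast date (hypothetical input).
    Returns a dict keyed by (name_a, name_b) tuples.
    """
    names = list(errors)
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            result[(a, b)] = float(np.corrcoef(errors[a], errors[b])[0, 1])
    return result


# Synthetic errors (assumption): the initial-condition part is given a larger
# amplitude than the model part, purely for demonstration.
rng = np.random.default_rng(1)
n_dates = 500
flow = rng.normal(size=n_dates)  # flow-dependent part shared by all
ic = {k: rng.normal(size=n_dates) for k in ("ecmwf", "ncep")}
model = {k: 0.3 * rng.normal(size=n_dates) for k in ("ifs", "fv3", "gfs")}

errors = {
    "EC": flow + ic["ecmwf"] + model["ifs"],
    "FV3ec": flow + ic["ecmwf"] + model["fv3"],
    "FV3gfs": flow + ic["ncep"] + model["fv3"],
    "GFS": flow + ic["ncep"] + model["gfs"],
}

corr = pairwise_error_correlations(errors)
```

With these synthetic amplitudes, the pairs sharing initial conditions (EC and FV3ec, GFS and FV3gfs) correlate more strongly than the pair sharing a model (FV3ec and FV3gfs). Giving the model part the larger amplitude reverses the ordering.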
For North America the picture is different regarding the dominating factor for the day-to-day variability. The strongest correlations are still found between the errors from GFS and FV3gfs, but less between EC and FV3ec. The stronger correlation for the first pair might also have been influenced by the similarities in the physical parametrization packages used in the GFDL-fvGFS and GFS models. The results show that differences between EC and FV3ec start to appear in the lee of the Rockies, and we find the fastest separation during passages of troughs over the mountain ridge. As we find large differences between the way the two models handle orographic drag, this is a plausible explanation for the forecast differences. The difference in orographic drag between numerical weather prediction models is more widely discussed in Zadra (2015) and Sandu et al. (2016). This result suggests that special attention should be paid to perturbing the drag processes in ensemble forecasting systems to capture the flow-dependent forecast uncertainties. The current developments for model uncertainties at ECMWF are outlined in Leutbecher et al. (2017). From a forecast user perspective, being aware of situations where the differences in the model itself play a larger role is useful when interpreting results from different forecasting systems.
Figure 9. Composites of z500 (contours) and anomalies (shading) for the analyses for the top 10% of cases of dRMSD between FV3ec and EC and between +6 hr and +30 hr. The composites are for (a) 6 hr before, (b) 18 hr after, and (c) 42 hr after the initialization times for the forecasts
Although the initial conditions seem to be most important for medium-range error differences between the experiments, we have to stress that the model plays a very important role in the creation of the initial conditions by the use of a short forecast as "first guess" in the data assimilation, as well as being used for the nonlinear outer trajectory and the tangent-linear model in the 4D-Var minimization. We also believe that the model formulation is much more important for the errors in weather parameters, such as 2 m temperature, cloud cover and 10 m wind speed. Further work using this framework will investigate the error for such parameters.