How Well do Models Represent the Development of Extra-Tropical Cyclones? Evaluation of Two General Circulation Models Against NAWDEX IOP 6 Observations

The dynamical and microphysical properties of a well-observed cyclone from the North Atlantic Waveguide and Downstream impact Experiment (NAWDEX), called the Stalactite cyclone and corresponding to Intensive Observation Period 6, are examined using two atmospheric global circulation models: CNRM-CM6-1 and IPSL-CM6A. The hindcasts are performed in “weather forecast mode”, run at CMIP6 resolution (LR) and c. 0.5° (HR), and initialized during the initiation stage of the cyclone. Cyclogenesis results from the merging of two relative vorticity maxima at low levels: one is associated with a Diabatic Rossby Vortex (DRV) propagating from the subtropics and the other is initiated by baroclinic development and interaction with a pre-existing upper-level potential vorticity (PV) cut-off. All hindcasts produce (to some extent) a DRV. However, the second vorticity maximum is almost absent in LR hindcasts because of an underestimated upper-level PV cut-off. The evolution of the cyclone is examined via the quasi-geostrophic ω equation, which separates the diabatic heating component from the dynamical one at each given time. In contrast with some previous studies, there is no change in the relative importance of diabatic heating with increased resolution. The analysis also shows that IPSL-CM6A produces a more active cyclone than CNRM-CM6-1 due to stronger diabatism. To examine this further, hindcasts initialized during the mature stage of the cyclone are compared with airborne remote-sensing measurements. The ice water content in the models is generally underestimated compared to that retrieved from radar-lidar measurements, even when the liquid water content is added. Consistent with the increased diabatism in IPSL-CM6A compared to CNRM-CM6-1, the sum of liquid and ice water contents is higher in IPSL-CM6A than in CNRM-CM6-1 and, in that sense, IPSL-CM6A is closer to the observations.
However, IPSL-CM6A strongly overestimates the fraction of super-cooled liquid compared to the observations by a factor of approximately 50.

*Current Affiliation: Met Office, Exeter, UK

https://doi.org/10.5194/wcd-2020-43 Preprint. Discussion started: 30 September 2020 © Author(s) 2020. CC BY 4.0 License.

1. Initial cyclogenesis occurs as a result of the merger of a DRV and another near-surface cyclonic vortex located further north and associated with baroclinic interaction with an upper-level PV cut-off.
2. A main deepening phase associated with large-scale troughs is present.
3. A minimum pressure deepening rate of 24 hPa in 24 h during the secondary deepening phase.
If all of these criteria are met then the climate models are able to correctly represent the Stalactite cyclone. The climate models and experimental setup used are discussed in the following section.

In this section we discuss the model setup and experimental protocol of the T-AMIP experiments (Section 3.1), the observations used (Section 3.2), and the diagnostics considered (Section 3.3). We also compare our simulations against the European Centre for Medium-Range Weather Forecasts (ECMWF) analysis.

We use the atmospheric components of two climate models run in "weather forecast mode" to represent T-AMIP-style experiments [...] (Eyring et al., 2016), and here we make use of the same model versions and configurations.

The CNRM-CM6-1 atmospheric component is based on version 6.3 of the global atmospheric model ARPEGE-Climat (Roehrig et al., 2020). It is a spectral model derived from the ARPEGE/IFS (Integrated Forecast System) numerical weather prediction model developed jointly by Météo-France and the ECMWF. It is run with 91 vertical levels and at two different horizontal resolutions. The first of these resolutions corresponds to a T127 truncation (c. 150 km globally; hereafter denoted as ARPEGE-LR) and the second to a T359 truncation (c. 50 km; hereafter ARPEGE-HR). The CNRM-CM6-1 convection scheme is based on the work of Piriou et al. (2007) and Guérémy (2011). The longwave radiation scheme corresponds to the global circulation model (GCM) version of the Rapid Radiative Transfer Model (RRTM; Mlawer et al., 1997), whilst the ECMWF/IFS cycle 32 radiation scheme is used for the shortwave component (Fouquart and Bonnel, 1980; Morcrette et al., 2008). The microphysics scheme follows the work of Lopez (2002), and the turbulence scheme that of Cuxart et al. (2000). For the purposes of analysis, the model output is then converted onto a 1.4° and a 0.5° latitude/longitude grid for ARPEGE-LR and ARPEGE-HR, respectively.

Models and Experimental Setup
The IPSL-CM6A atmospheric component, known as LMDZ6A (Hourdin et al., 2020), is run with 79 vertical levels, and also at two different horizontal resolutions. The first is the CMIP6 resolution (2.5° × 1.2°; IPSL-CM6A-LR, hereafter denoted as LMDZ-LR) and the second is a higher resolution (IPSL-CM6A-HR, hereafter denoted as LMDZ-HR). The IPSL-CM6A-HR configuration utilises the zoom function of LMDZ, in which the resolution over part of the domain is increased compared to the rest in a variable resolution configuration. For our simulations the zoomed domain is centred at (40° E, 55° N) with a resolution equivalent to 0.33°, and the resolution decreases away from the centre. This results in a resolution of approximately 0.5° over the North Atlantic and approximately 1.1° in other parts of the domain. The convection scheme is based on Rochetin et al. (2014), the shortwave radiation is an extension of Fouquart and Bonnel (1980) to six bands, and the longwave radiation scheme is the GCM version of RRTM (Mlawer et al., 1997). The microphysics scheme follows Madeleine [...] (Hourdin et al., 2019) and the surface scheme follows Cheruy et al. (2020).
For both models hindcasts are initiated at 0000 UTC on 27-29 September, and 1-2 October 2016 from the ECMWF analysis.
For ARPEGE, microphysics state variables and turbulent kinetic energy are initialised to zero and aerosols are prescribed from a present-day climatology. In LMDZ, on the other hand, model state variables not defined in the analysis are set to zero, as are the aerosols. Sea surface temperature and sea ice cover are also taken from the ECMWF analysis. All hindcasts are performed out to a leadtime of T+10 d. For both models, and all hindcasts, output data is interpolated in the vertical onto a pressure grid, every 25 hPa from 1000 hPa to 100 hPa. Furthermore, output is also produced using the CFMIP (Cloud Feedback Model Intercomparison Project) Observation Simulator Package (COSP; Bodas-Salcedo et al., 2011) so that simulated CloudSat radar reflectivities can be compared with the aircraft-borne radar reflectivities observed during the NAWDEX field campaign. We restrict the number of hindcasts because the overall synoptic situation at the time was largely unpredictable (e.g. Schäfler et al., 2018). This is confirmed by the low-resolution hindcasts initiated on 27 and 28 September, which do not produce the Stalactite Cyclone (not shown).
As with all T-AMIP experiments, our simulations are initialised with non-native initial conditions, here the ECMWF analysis. Initialising from non-native initial conditions can lead to an adjustment period before the model returns to its natural trajectory and balance. This adjustment period is referred to as the "initial shock" (e.g. Klocke and Rodwell, 2014). The initial shock can be seen through the presence of gravity waves throughout the domain or by comparing with simulations started from the native analysis. Therefore, to ensure the same diagnosis of initial shock in both models, the presence (or lack) of numerical artefacts (such as small-scale disturbances and gravity waves) and their persistence was examined. In the first 12 h of each simulation no numerical gravity waves or small-scale disturbances were noted in any of the model runs (not shown), so the initial shock does not appear to be significant. As a precautionary measure, we do not analyse the hindcasts prior to T+18 h.
To consider the uncertainty associated with our results we examine the hindcasts at different leadtimes. We do not create an ensemble of different initial conditions (i.e. initial conditions from different forecasting centres) as this would produce similar results given the validity period of the T-AMIP protocol (e.g. Fig. 1 in Klocke and Rodwell, 2014). Furthermore, experiments initialised from the Météo-France analyses yield similar results to those initialised from the ECMWF analyses (not shown).
Given that ARPEGE and LMDZ have different dynamical cores and parametrizations, some amount of uncertainty in the results can be obtained by comparing the models. The results presented within this paper are consistent as long as a cyclone resembling the Stalactite cyclone is produced in the hindcasts, regardless of leadtime. Therefore, for consistency, all the plots produced when discussing the life cycle of the cyclone are from the 00 UTC 29 September initialisation, and observation comparisons (Section 4.4) are considered from the hindcast initialised at 00 UTC 1 October 2016, unless otherwise stated. The first time (00 UTC 29 September) has been chosen because it is just prior to cyclogenesis and the precursors already exist in the analysis for all resolutions. Hence, it is a well-suited initial time to study the whole life cycle of the cyclone. The second time (00 UTC 1 October) has been chosen for the comparisons with the observations because the cyclone is required to be in (or as close as possible to) the observed location for the different features of the cyclone to be comparable.

Observations
During the NAWDEX field campaign, four aircraft equipped with remote sensing and in-situ instruments were operated, among them the French SAFIRE Falcon aircraft from 1 October to 15 October (Schäfler et al., 2018). The SAFIRE Falcon made two flights to observe the Stalactite cyclone on 2 October 2016: F6 (towards Greenland) and F7 (south of Iceland; Fig. 1b). The second flight (F7) was directly into the cyclone in the ascending branch of the associated warm conveyor belt, as opposed to the first flight (F6), which sampled the warm conveyor belt outflow, so in the main manuscript we focus on F7. A comparison with F6 data can be found in the supplementary material and produces similar results to the F7 comparisons. The first leg of F7 (the most eastern one) was chosen because there was an overpass with the CloudSat-CALIPSO track at 14:09 UTC, which allows us to assess observation uncertainties by comparing airborne and spaceborne measurements. The payload onboard the SAFIRE Falcon included a 95-GHz Doppler cloud radar and a high-spectral resolution Doppler lidar capable of measuring at 355, 532 and 1064 nm (e.g. Delanoë et al., 2013). Measurements by these two instruments allow the retrieval of ice water content using the variational algorithm of Delanoë and Hogan (2008) updated by Cazenave et al. (2019). The combination of radar and lidar further allows the phase of the particles to be identified (e.g. super-cooled liquid, ice, liquid, etc.) using principles outlined in Delanoë and Hogan (2010). Furthermore, Doppler-derived wind speeds and radar reflectivities from these instruments are also used. The retrievals come from both a radar-only product (RASTA) and a combined radar and lidar product (RALI) to allow for uncertainty in the measurements to be taken into account.
To ensure a fair comparison between the observations and the model, the observations are first converted onto the model grid. This conversion assumes a linear relationship between the speed of the aircraft and distance travelled by the aircraft.
On the other hand, the model output is converted onto the flight path using the nearest gridpoint to the flight path to create a "virtual" flight path. Other interpolation methods were tested but had limited impacts on the conclusions (not shown).
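The nearest-gridpoint sampling described above can be sketched as follows. This is a minimal illustration rather than the processing actually used in the study: the function name `virtual_flight_track`, the regular one-dimensional latitude/longitude axes, and the toy field and flight positions are all assumptions made for the example.

```python
import numpy as np

def virtual_flight_track(field, grid_lat, grid_lon, track_lat, track_lon):
    """Sample a 2-D model field at the grid point nearest to each flight position."""
    # Index of the closest latitude and longitude on a regular 1-D lat/lon grid.
    # Longitudes are assumed to share one convention (e.g. all in -180..180).
    i = np.abs(grid_lat[:, None] - track_lat[None, :]).argmin(axis=0)
    j = np.abs(grid_lon[:, None] - track_lon[None, :]).argmin(axis=0)
    return field[i, j]

# Toy 0.5-degree grid and a made-up three-point flight leg (not real F7 positions).
grid_lat = np.arange(50.0, 70.5, 0.5)
grid_lon = np.arange(-40.0, 0.5, 0.5)
field = grid_lat[:, None] + 0.0 * grid_lon[None, :]  # field equal to latitude
track_lat = np.array([60.1, 61.3, 62.74])
track_lon = np.array([-20.2, -19.9, -19.6])
print(virtual_flight_track(field, grid_lat, grid_lon, track_lat, track_lon))
```

On a regular grid, picking the nearest latitude and the nearest longitude independently gives the nearest grid point in latitude/longitude space; a great-circle search would be needed for irregular grids.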

The modelled ice water content (IWC) is compared throughout the entire "virtual" flight track, and also by applying the observation mask (to help determine the impact of positional errors). Furthermore, the model IWC is defined in two ways: "potential" IWC (equivalent to cloud ice plus snow) and "maximum" IWC [equivalent to cloud ice plus snow plus liquid water content (LWC)]. The inclusion of LWC in the "maximum" IWC is to take into account the impact of mixed-phase clouds and super-cooled liquid. On the other hand, the radar reflectivities are compared without applying the mask because the output from the COSP simulator is provided in contour frequency altitude diagram form rather than as exact radar reflectivity values. Since the COSP radar simulator has been developed for comparison with CloudSat radar reflectivity, a comparison between the CloudSat and airborne radar reflectivities is also included in the supplementary material to assess observation uncertainties.
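As a small illustration of the two IWC definitions and of the observation mask, consider the sketch below. The function names, the use of NaN to mark points with no retrieval, and the numbers are assumptions made for the example, not details taken from the study.

```python
import numpy as np

def model_iwc(cloud_ice, snow, lwc):
    """Return the "potential" and "maximum" model IWC."""
    potential = cloud_ice + snow   # "potential" IWC: cloud ice plus snow
    maximum = potential + lwc      # "maximum" IWC: adds liquid water content
    return potential, maximum

def apply_observation_mask(model_field, obs_field):
    """Keep model values only where the observations are valid (assumed: not NaN)."""
    return np.where(np.isnan(obs_field), np.nan, model_field)

# Made-up values along a three-point track.
cloud_ice = np.array([0.10, 0.00, 0.20])
snow = np.array([0.05, 0.00, 0.10])
lwc = np.array([0.00, 0.30, 0.10])
obs_iwc = np.array([0.20, np.nan, 0.40])  # NaN marks a point with no retrieval

potential, maximum = model_iwc(cloud_ice, snow, lwc)
masked_maximum = apply_observation_mask(maximum, obs_iwc)
print(potential, maximum, masked_maximum)
```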
The comparisons are made at the most appropriate time with respect to the model cyclone rather than the observation time.
This framework takes into account any delay in the cyclone formation in the climate models (Sec. 4), and ensures that the modelled cyclone has a structure that is as similar as possible to the analysis. In practice, for the hindcast that is compared against the observations in this study (initiated at 00 UTC 1 October 2016), neither timing nor positioning adjustments are required.

Vertical Motion and Baroclinic Conversion Budgets
Extra-tropical cyclone evolution can be considered through many methods. For example, the surface pressure tendency equation [...] We also take this a step further and consider the energetics of the system through the baroclinic conversion (BC). All our results using these methods are considered in a cyclone-relative framework.
The QG ω-equation, including diabatic heating and the β term, can be written in terms of the so-called Q vector following Hoskins et al. (1978) and Hoskins and Pedder (1980). We use the formulation of Holton (2004), which also includes the diabatic heating:

$$\left(\sigma \nabla^2 + f_0^2 \frac{\partial^2}{\partial p^2}\right)\omega_{QG} = -2\,\nabla\cdot\mathbf{Q} + f_0\,\beta\,\frac{\partial v_g}{\partial p} - \frac{R}{c_p\,p}\,\nabla^2 J, \qquad (1)$$

for

$$\mathbf{Q} = \left(-\frac{R}{p}\,\frac{\partial \mathbf{u}_g}{\partial x}\cdot\nabla T,\; -\frac{R}{p}\,\frac{\partial \mathbf{u}_g}{\partial y}\cdot\nabla T\right),$$

where σ is the static stability (which is obtained by temporally averaging the temperature across the lifetime of the Stalactite cyclone), f_0 is a reference Coriolis parameter, β is the beta term in the Coriolis forcing, p is the pressure, R is the specific gas constant, c_p is the specific heat, J is the rate of heating per unit mass, u_g is the geostrophic wind vector (with meridional component v_g), T is the temperature, x and y are the positions in the zonal and meridional directions, respectively, and ω_QG is the vertical velocity obtained from the QG ω-equation (1). In the rest of the paper, the geostrophic wind is used to compute the Q vector, but we have checked that using the full horizontal wind components does not substantially change the results.
Equation (1) allows us to distinguish between the dynamical and diabatic contributions to the vertical motion in the cyclone.
Physically the Q vector and the β terms represent the dynamical components of the flow and the Laplacian of the rate of heating per unit mass represents the diabatic heating. A friction term can also be added to Eq. (1). However, as the friction term is (at least) an order of magnitude smaller than all of the other terms (not shown) it is neglected.

The third term on the right-hand side of Eq. (1) can be split into components from all of the different model physics parametrizations (e.g. convection, radiation, and large-scale heating from condensation) as follows: J = J_c + J_r + J_lscp + ..., where the subscripts denote the parametrization: c is convection, r is radiation, and lscp is large-scale heating due to condensation or large-scale cooling due to evaporation (i.e. latent heating).
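Because the operator acting on ω in the QG ω-equation is linear, splitting J in this way also splits the diabatic vertical motion. The relation below is a sketch under two assumptions that are not stated explicitly here: the heating term takes the Holton (2004) form, and every component is inverted with the same homogeneous boundary conditions. The operator symbol and the per-parametrization ω_k are notation introduced only for this illustration.

```latex
\mathcal{L} \equiv \sigma \nabla^2 + f_0^2 \frac{\partial^2}{\partial p^2},
\qquad
\mathcal{L}\,\omega_k = -\frac{R}{c_p\,p}\,\nabla^2 J_k
\quad (k = c,\; r,\; lscp,\; \dots),
\qquad
\sum_k \omega_k = \omega_{diab}.
```

Each ω_k can then be read as the part of the diabatic ascent attributable to one parametrization.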

To solve Eq. (1) the 3D Laplacian is inverted using Liebmann successive over-relaxation with boundary conditions such that ω is zero at 1000 hPa, 100 hPa, and all horizontal boundaries. The vertical motion is computed for every 25 hPa in the vertical.
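A minimal sketch of such a successive over-relaxation inversion is given below. It is not the solver used in the study: it assumes a constant static stability, a uniform Cartesian grid with axes ordered (pressure, y, x), and a hypothetical function name and default relaxation factor. It solves σ(ω_xx + ω_yy) + f_0² ω_pp = forcing with ω set to zero on every boundary.

```python
import numpy as np

def solve_omega_sor(forcing, sigma, f0, dx, dy, dp,
                    relax=1.5, tol=1e-8, max_iter=5000):
    """Invert sigma*(w_xx + w_yy) + f0**2 * w_pp = forcing, w = 0 on all boundaries."""
    w = np.zeros_like(forcing, dtype=float)  # boundary values stay at zero
    ax, ay, ap = sigma / dx**2, sigma / dy**2, f0**2 / dp**2
    diag = -2.0 * (ax + ay + ap)             # coefficient of the central point
    nz, ny, nx = forcing.shape
    for _ in range(max_iter):
        max_change = 0.0
        for k in range(1, nz - 1):
            for j in range(1, ny - 1):
                for i in range(1, nx - 1):
                    neighbours = (ax * (w[k, j, i + 1] + w[k, j, i - 1])
                                  + ay * (w[k, j + 1, i] + w[k, j - 1, i])
                                  + ap * (w[k + 1, j, i] + w[k - 1, j, i]))
                    # Gauss-Seidel value, then over-relax towards it.
                    gauss_seidel = (forcing[k, j, i] - neighbours) / diag
                    change = relax * (gauss_seidel - w[k, j, i])
                    w[k, j, i] += change
                    max_change = max(max_change, abs(change))
        if max_change < tol:                 # stop once the updates become tiny
            break
    return w
```

The explicit loops keep the update order of the Liebmann (Gauss-Seidel) sweep visible; a production version would normally vectorise the sweep, for example with red-black ordering, and allow σ to vary with pressure.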
The inversion is not fully accurate due to the QG assumption. However, on average most of the modelled vertical motion is recovered using this method. Furthermore, the timing of the development of the inverted ω (ω_QG) matches that of the model ω. The differences between ω_QG and the model ω are discussed further in Sec. 4.3. The ω_QG can be split into dynamic (ω_dyn) and diabatic (ω_diab) components as follows:

$$\left(\sigma \nabla^2 + f_0^2 \frac{\partial^2}{\partial p^2}\right)\omega_{dyn} = -2\,\nabla\cdot\mathbf{Q} + f_0\,\beta\,\frac{\partial v_g}{\partial p}, \qquad (2)$$

and

$$\left(\sigma \nabla^2 + f_0^2 \frac{\partial^2}{\partial p^2}\right)\omega_{diab} = -\frac{R}{c_p\,p}\,\nabla^2 J. \qquad (3)$$

Inversion of the two previous equations allows us to separate the contributions of dynamical and diabatic processes to the vertical motion. Such a decomposition provides further insights into the development of the cyclone by determining the balance between these processes in the evolution of the cyclone. Vertical velocity appears in several key terms of the classical equations for the development of extratropical cyclones, such as the stretching term of the relative vorticity equation or the baroclinic conversion term of the kinetic energy equation. In the present study, we adopt the energetic framework and compute the baroclinic conversion from eddy potential energy to eddy kinetic energy within the extra-tropical cyclone (e.g. Orlanski and Katzfey, 1991; Rivière and Joly, 2006). The baroclinic conversion is proportional to the vertical heat flux and can be written as

$$BC = -h\,\omega'\,\theta', \qquad (4)$$

where h = (R/p)(p/p_s)^{R/c_p}, p_s is the surface pressure and θ is the potential temperature. Primes denote the difference from the 5-day temporal average of that quantity centred over the life cycle of the Stalactite cyclone. Tests have shown that the results are rather insensitive to the definition of the temporal average as long as it is made over an interval equal to or longer than the life cycle of the cyclone, so as to suppress the signal associated with the cyclone.
The baroclinic conversion term is mainly positive in areas following the cyclone trajectory (Rivière and Joly, 2006; Rivière et al., 2015). It can be approximated by replacing the vertical velocity by its quasi-geostrophic formulation, Eq. (1). As ω_QG is split up into its dynamical and diabatic components (ω_dyn, ω_diab) following Eqs. (2) and (3), the approximated baroclinic conversion using the quasi-geostrophic equation can be written as:

$$-h\,\omega_{QG}\,\theta' = -h\,\omega_{dyn}\,\theta' - h\,\omega_{diab}\,\theta'. \qquad (5)$$

It is worth noting that θ′ remains identical in all of the components in Eq. (5). Therefore, the decomposition of the baroclinic conversion into dynamical and diabatic terms only results from that of the vertical velocity. Also, as the temporal means of ω and ω_QG are small, the primes can be dropped on those variables. This equation has been kept in this form for simplicity and so will naturally lead to some contamination of the diabatic and dynamic parts within the final results, because the same θ′ is used in both terms.
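The decomposition can be illustrated with a short sketch, assuming the baroclinic conversion takes the form −h ω θ′ with h = (R/p)(p/p_s)^{R/c_p} as defined in the text. The dry-air constants, the function name, the use of the mean over the supplied time axis as the temporal average, and the random test fields are assumptions made for the example.

```python
import numpy as np

R_DRY = 287.0    # assumed specific gas constant for dry air, J kg-1 K-1
CP_DRY = 1004.0  # assumed specific heat at constant pressure, J kg-1 K-1

def baroclinic_conversion(omega, theta, p, p_s):
    """Return -h * omega * theta', where theta' is the departure from the time mean.

    omega and theta have time as their first axis; p and p_s are in Pa.
    """
    h = (R_DRY / p) * (p / p_s) ** (R_DRY / CP_DRY)
    theta_prime = theta - theta.mean(axis=0)
    return -h * omega * theta_prime

# Made-up fields: 5 times, 3 points, on the 500 hPa surface.
rng = np.random.default_rng(0)
theta = 300.0 + rng.normal(size=(5, 3))
omega_dyn = rng.normal(scale=0.1, size=(5, 3))
omega_diab = rng.normal(scale=0.1, size=(5, 3))

bc_qg = baroclinic_conversion(omega_dyn + omega_diab, theta, 5.0e4, 1.0e5)
bc_dyn = baroclinic_conversion(omega_dyn, theta, 5.0e4, 1.0e5)
bc_diab = baroclinic_conversion(omega_diab, theta, 5.0e4, 1.0e5)
print(np.allclose(bc_qg, bc_dyn + bc_diab))
```

Since the same θ′ multiplies both vertical velocities, the dynamical and diabatic conversions add up to the quasi-geostrophic one by construction.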

Pressure Evolution and Track
The representation of the Stalactite Cyclone is first considered via an overview of the cyclone through its track and minimum sea level pressure evolution (Fig. 2). The LR hindcasts both produce a rapidly deepening cyclone: 24 hPa in 24 h in ARPEGE-LR and 38 hPa in 24 h in LMDZ-LR. However, this deepening is delayed by 24 h compared to the analysis, and the initial cyclogenesis is not as intense as in the analysis. This weaker cyclogenesis results in an initially weaker cyclone compared to the analysis in both models (Fig. 2a). However, the explosive deepening in LMDZ-LR compensates for the lack of initial deepening and results in a cyclone with the same intensity as the observations. On the other hand, ARPEGE-LR has the same secondary deepening strength as the analysis and so produces a weaker cyclone.

The cyclone track also differs from the analysis. The difference occurs 18 h into the hindcasts. The two LR hindcasts produce a track that is too far south and has a later re-curvature, so the cyclone track lies further east compared to the analysis (Fig. 2b). The eastward shift in the track agrees with the global weather forecasts prior to 29 September 2016 (e.g. Maddison et al., 2020). Given the rapid divergence of the forecast track from the analysis, differences in the cyclogenesis could be one aspect leading to the track lying too far east. The importance of cyclogenesis for the cyclone track is partially corroborated by the improved track representation (i.e. no eastward shift), regardless of resolution, once the cyclone appears in the initial conditions (not shown).
As expected, increasing the resolution improves the representation of the track, timing and intensity of the Stalactite Cyclone.
The LMDZ-HR hindcast, like the LMDZ-LR hindcast, produces a deepening rate of 38 hPa in 24 h. The initial deepening is well represented in LMDZ-HR whereas it is absent in LMDZ-LR. However, the latter run has a very rapid deepening during the second stage of development, which compensates for the absence of the first stage. This compensation results in a cyclone with a similar intensity to the analysis but with a delay of roughly 1 day in the deepening. The ARPEGE-HR and ARPEGE-LR hindcasts are also marked by rapid deepening phases, with deepening rates close to 24 hPa in 24 h. However, the cyclones deepen less in the ARPEGE runs than in the LMDZ runs (compare the blue and red curves). There is a significant delay in the deepening phase of the cyclone in ARPEGE-LR compared to ARPEGE-HR, similar to the difference between LMDZ-LR and LMDZ-HR, which is also due to the quasi-absence of the initiation stage in the lower-resolution run. Thus both models are able to produce an explosively-deepening cyclone at the two resolutions considered. On initial inspection, the main differences in the representation of the Stalactite cyclone compared to the analysis appear to be within the cyclogenesis phase of the cyclone and in the different deepening rate of LMDZ compared to ARPEGE. These two aspects are examined further within the following subsections.
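Deepening rates such as those quoted above can be obtained from a minimum sea level pressure time series as the largest pressure fall over any 24 h window. The sketch below illustrates this; the function name and the 6-hourly pressure values are made up for the example and are not taken from the hindcasts.

```python
import numpy as np

def max_deepening_rate(mslp_hpa, step_hours, window_hours=24):
    """Largest pressure fall (hPa) over any window_hours-long interval of the series."""
    lag = int(round(window_hours / step_hours))  # number of samples per window
    falls = mslp_hpa[:-lag] - mslp_hpa[lag:]     # positive values mean deepening
    return falls.max()

# Made-up 6-hourly minimum sea level pressure (hPa).
mslp = np.array([1005.0, 1003.0, 1000.0, 990.0, 978.0, 970.0, 966.0, 965.0])
rate = max_deepening_rate(mslp, step_hours=6)
print(rate, rate >= 24.0)  # second value tests the 24 hPa in 24 h criterion
```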

Cyclogenesis
The cyclogenesis of the Stalactite cyclone occurs on the mesoscale as the merging of two low-level vorticity precursors: a DRV coming from the subtropics and a vortex located further north baroclinically interacting with an upper-level PV cut-off

Formation of the northern precursor via baroclinic interaction with the PV cut-off
More important differences appear between LR and HR runs in the representation of the northern precursor. First, in the LR hindcasts, the vorticity of the northern precursor is much smaller than the vorticity of the DRV precursor (2.4 times smaller in ARPEGE-LR and 3.3 times in LMDZ-LR) whereas it is only slightly smaller in HR runs (ratio of 1.6 in ARPEGE-HR and 1.3 in LMDZ-HR), similarly to the analysis. Second, the two LR runs (Figs. 3a,c) have a more zonal PV cut-off than in the analysis (Fig. 1c) and in the two HR runs (Figs. 3a,c). Third, the low-level northern vorticity maximum moves to the east of the cut-off [...]

In contrast with the DRV, the northern precursor is a mixture of diabatic and dynamic processes, as shown by the baroclinic conversion rates of Figs. 4 and 5. The vertical cross sections of Fig. 6 show that the dynamical component is mainly centred at upper levels, but with an equivalent-barotropic structure. This suggests that the northern precursor is forced by the vertical velocity associated with the PV cut-off, which is characteristic of type-B cyclogenesis (Petterssen and Smebye, 1971). In LR hindcasts, the dynamical forcing has a smaller vertical extent and is more spread out than in the HR hindcasts. Also, the dynamical forcing in LR hindcasts is located further east than the diabatic forcing (Figs. 6b and c), while the two forcings overlap more in HR hindcasts (Figs. 6e and f; see also Fig. S2 for ARPEGE). Both forcings increase with resolution by a factor of more than five in the two models. However, the peak values of the diabatic baroclinic conversion increase more with resolution than those of the dynamical baroclinic conversion during the formation of the northern precursor (Figs. 5b,c,e,f and 6b,c,e,f), whereas, as seen in the next section, the reverse tends to occur during the rest of the life cycle.
To conclude, the northern precursor is poorly represented in LR compared to HR hindcasts because the less intense, and more spatially diluted, PV inside the cut-off induces a weaker, and more spread out, dynamical forcing. There is also a weaker spatial correlation between the dynamical and diabatic vertical forcing in the LR compared to HR hindcasts. An additional factor is the more active diabatic forcing in HR hindcasts in the vicinity of the northern precursor. So both the dynamical and diabatic terms, and their overlap, improve with resolution and it is difficult to determine which component matters most.
For the hindcasts shown here the merger of these two different precursors differs in timing from the analysis and between resolutions. The higher resolution configurations (although delayed by 6 h compared to the ECMWF analysis) merge the DRV and upper-level dynamical precursor 12-18 h earlier than the LR runs (not shown). For LMDZ-LR, the two precursors do not merge at all. The delay or absence of interaction between the two precursors has an impact on the track of the cyclone, which was systematically located too far east in the LR runs (Fig. 2b). Two factors explain the delayed or missed merging. One is the more rapid eastward propagation of the DRV in HR than LR runs (see e.g. the 2° more eastward shift of the DRV in HR compared to LR runs in Fig. 3), which is consistent with a stronger latent heating in the former runs. The second is that the low-level northern precursor and the upper-level cut-off move less rapidly eastward in HR runs (not shown).
This can be partly explained by the difference in longitude of the dynamical forcing between LR and HR hindcasts (compare Figs. 6b and e). The more rapid propagation of the DRV and less rapid motion of the northern precursor explain why the DRV is better able to catch up with the northern precursor in HR runs, as in the analysis.
To conclude on cyclogenesis, the LR hindcasts struggle to correctly represent the initiation of the cyclone because they miss the initial deepening of the northern small-scale low-level vortex and the roll-up of the merging two low-level vortices around the PV cut-off. However, the unexpected result is that the LR hindcasts are able to reproduce the behavior of the DRV rather well, albeit with a smaller propagation speed.

Main Deepening
The focus of this section is the main deepening stage of the Stalactite Cyclone. As for the cyclogenesis phase, the main deepening phase is considered by analysing the baroclinic conversion, either by averaging it over a 10° × 10° area centred on the maximum baroclinic conversion closest to the minimum pressure of the Stalactite Cyclone (Fig. 7) or by computing its local maximum (Fig. S3). The averaged quasi-geostrophic baroclinic conversion is reduced in magnitude by roughly two thirds compared to that directly calculated from the model ω, but is consistent across the models for the timing of the growth and decay of the cyclones. As previously discussed, there is a delay in the maximum deepening in LR runs compared to HR runs, of about 1 day in LMDZ and half a day in ARPEGE.
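A box average of this kind can be sketched as follows. The cosine-latitude weighting, the function names, and the toy grid are assumptions made for the illustration; the weighting actually used is not stated in the text. The sketch also takes the maximum over the whole field supplied, so the caller would first restrict the field to the neighbourhood of the pressure minimum.

```python
import numpy as np

def centre_on_maximum(field, lat, lon):
    """Latitude and longitude of the largest value of a 2-D (lat, lon) field."""
    i, j = np.unravel_index(np.argmax(field), field.shape)
    return lat[i], lon[j]

def box_mean(field, lat, lon, lat_c, lon_c, half_width=5.0):
    """Cosine-latitude-weighted mean over a box of +/- half_width degrees."""
    in_lat = np.abs(lat - lat_c) <= half_width
    in_lon = np.abs(lon - lon_c) <= half_width
    sub = field[np.ix_(in_lat, in_lon)]
    weights = np.broadcast_to(np.cos(np.deg2rad(lat[in_lat]))[:, None], sub.shape)
    return np.sum(sub * weights) / np.sum(weights)

# Toy 1-degree grid with a single made-up conversion maximum at 60N, 25W.
lat = np.arange(40.0, 71.0, 1.0)
lon = np.arange(-40.0, 1.0, 1.0)
bc = np.zeros((lat.size, lon.size))
bc[20, 15] = 1.0

lat_c, lon_c = centre_on_maximum(bc, lat, lon)
print(lat_c, lon_c, box_mean(bc, lat, lon, lat_c, lon_c))
```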
In the cyclone average values shown in Fig. 7, the two stages of cyclone development are well separated: (i) the initial cyclogenesis stage occurring on 29 September (Section 4.2), and (ii) the main development stage that is dominated by the presence of a large-scale trough and an explosively-developing cyclone. The initiation stage is clearly dominated by diabatic processes. On the other hand, during the main deepening stage, the dynamical processes begin to be more important, and more so in the HR hindcasts compared to the LR hindcasts. In the HR runs the dynamical term is even larger than the diabatic term during the whole main deepening stage. It is also clear in the LR hindcasts that there is a delay in the dynamical processes compared to the diabatic processes, suggesting a delayed forcing by the large-scale upper-level trough. Therefore, there is an increased importance of the dynamic term relative to the diabatic term with increased resolution. This ratio consistency holds for both the maximum (Fig. S3) and average values (Fig. 7), in both models and across leadtimes. This ratio consistency disagrees with [...] DRV. However, the PV injection coming from the large-scale region of high PV located to the northeast into the upper-level disturbance interacting with the surface cyclone is delayed. In the analysis, some PV injection has already occurred (white areas of 5 to 7 PVU in Fig. 8a) but is just starting in the HR runs (the blueish areas of 3 to 5 PVU in Fig. 8c). The situation in ARPEGE-HR at 12 UTC 1 October (Fig. 8f) more closely resembles that of the ECMWF analysis approximately 6 hours earlier (not shown), with the cyclonic wave breaking being more advanced in the ECMWF analysis (Fig. 8d) compared to ARPEGE-HR (Fig. 8f).

Several studies have shown that the PV of the upper-level trough baroclinically interacting with a surface extratropical cyclone tends to advect the cyclone polewards (Rivière et al., 2012; Oruba et al., 2013; Coronel et al., 2015). Therefore, the earlier nonlinear interaction of the cyclone with the large-scale upper-level PV reservoir, and the earlier roll-up of the two features around each other, explain the earlier deviation of the cyclone track to the north and the more westward position of the track in the analysis than in the hindcasts. For the HR hindcasts, the delay is a maximum of 6 hours and the eastward shift is minimal, while for LR hindcasts the delay is about 24 hours and the eastward shift is much more marked.

Interpretation of the difference between the models and comparison with Aircraft Observations
As noted previously, simulations initiated at 00 UTC 1 October 2016 are analysed in the present section so that the dynamical features of the cyclone are at roughly the same place in the models as in the observations and can therefore be compared. [...] Figures 2a and 7 show that LMDZ produces a more active cyclone than ARPEGE. This activity is hypothesised to be linked to the vertical structure of diabatic heating within the Stalactite Cyclone, given the similarities between the large-scale troughs in both models. To examine this hypothesis, distributions of ω_QG are considered over the cyclone (Fig. 9). Whilst Fig. 9 shows the hindcasts initiated at 00 UTC 1 October 2016, similar results occur for the hindcasts initiated at 00 UTC 29 September 2016 (not shown). Figure 9 shows that ω_QG increases with increased resolution (cf. Figs. 9a,c and b,d). Also, the larger ascents mainly arise from diabatic processes (cf. Figs. 9e-g to i-l), which is particularly obvious for LR hindcasts. Concerning the differences between models, stronger ascents occur in LMDZ than in ARPEGE when looking at ω_QG (compare Fig. 9a to c [...] Thus it is likely that observations of microphysical properties of the Stalactite Cyclone could be used to qualitatively determine which model has the better heating rates or structure. These comparisons are considered next.

To determine whether observations of microphysical properties from field campaign flights can provide information on the underlying diabatic heating, the Stalactite Cyclone hindcasts are compared with flights F6 and F7 (Fig. 1b). Figure 10 shows bi-variate histograms of the total IWC for F7 from two observation platforms: RASTA (Fig. 10a) and RALI (Fig. 10f). The IWC values are larger in RASTA than in RALI because the lidar information in RALI (the lidar being sensitive to smaller ice particles and smaller quantities of ice) leads to a reduction of IWC compared to RASTA. However, both platforms show the same shape, with IWC values increasing down to around 600 hPa and then a uniform distribution until around 800 hPa, below which the instruments no longer detect ice clouds. The difference between the two retrieved IWC histograms provides an indication of the observational uncertainty, which is useful when comparing with model outputs.
The model contribution to Fig. 10 consists of four rows, the first two rows showing "potential" IWC (cloud ice + snow) while the last two rows show "maximum" IWC (IWC + liquid water content  (Figs. 10m,n,q,r). The LMDZ distributions have changed to the extent that their shape now agrees much better with the observations than when the LWC was not taken into account. These changes in LMDZ are also apparent in Fig. 11, where the addition of LWC in LMDZ produces consistently larger values of total IWC at all heights compared with ARPEGE, although this model difference is reduced at increased resolution (Figs. 11c and d).
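The construction of a bi-variate histogram of IWC against pressure, and of a difference histogram between two hindcasts in the spirit of Fig. 11, can be illustrated with a minimal sketch. The function names, bin edges, sample values and the normalisation by the total number of points are all assumptions made for illustration; they are not taken from the processing actually applied to the hindcasts.

```python
import bisect


def bivariate_histogram(iwc_values, pressures, iwc_edges, pressure_edges):
    """Count (IWC, pressure) pairs on a 2-D grid; rows are pressure bins.

    Bins are half-open [edge_k, edge_k+1); points outside the edges are dropped.
    """
    n_iwc = len(iwc_edges) - 1
    n_p = len(pressure_edges) - 1
    counts = [[0] * n_iwc for _ in range(n_p)]
    for iwc, p in zip(iwc_values, pressures):
        i = bisect.bisect_right(iwc_edges, iwc) - 1
        j = bisect.bisect_right(pressure_edges, p) - 1
        if 0 <= i < n_iwc and 0 <= j < n_p:
            counts[j][i] += 1
    return counts


def normalise(counts):
    """Convert counts to relative frequencies (assumed normalisation)."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * len(row) for row in counts]
    return [[c / total for c in row] for row in counts]


def difference(hist_a, hist_b):
    """Element-wise hist_a - hist_b; positive where hist_a is larger."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(hist_a, hist_b)]


# Hypothetical sample points (IWC in g m-3, pressure in hPa), not real data.
iwc_edges = [0.0, 0.1, 0.2]
pressure_edges = [400.0, 600.0, 800.0]
model_a = bivariate_histogram([0.05, 0.15, 0.15], [500.0, 500.0, 700.0],
                              iwc_edges, pressure_edges)
model_b = bivariate_histogram([0.05, 0.05], [500.0, 700.0],
                              iwc_edges, pressure_edges)
diff = difference(normalise(model_a), normalise(model_b))
```

With this sign convention, positive entries of `diff` mark bins where the first hindcast is more populated, which mirrors the ARPEGE minus LMDZ convention stated in the caption of Fig. 11.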

The much larger IWC+LWC in LMDZ compared to ARPEGE over all the levels is consistent with the larger diabatic heating shown in Figs. 9e-h.
Given the change caused by including LWC in the definition of the IWC, it is useful to know the proportion of ice, mixed phase, and super-cooled liquid points that make up these distributions. Here, we arbitrarily define ice points in the model to be those where the LWC component of the total IWC is less than 1% and "pure" super-cooled liquid points to be those where the LWC component is greater than 99% of the total IWC; all other points are mixed phase. These results are compared with the points classified as super-cooled liquid, mixed phase and ice in the IWC retrieved from RALI measurements. To ensure a fair comparison between ice and super-cooled liquid water, the "pure" values are combined with the mixed phase values. Table 2 shows that whilst the combined ice points exceed those of the observations (particularly for ARPEGE), the values are not unreasonable.
However, when the combined super-cooled liquid water is considered, the models significantly over-estimate the number of super-cooled liquid points, with the smallest difference being a factor of 24 and the largest a factor of 47. Considering Table 2 alongside the earlier discussion of the impact of adding LWC shows that the super-cooled liquid water being added to ARPEGE is of a smaller magnitude than that of LMDZ, a result confirmed by changing the threshold and seeing large decreases in the LWC for ARPEGE (with thresholds up to 10% for the definition of ice) while the values remain constant in LMDZ. It is also worth noting that although the LR hindcasts underestimate the IWC more strongly than the HR hindcasts, they are closer to the observations (in shape) than the HR hindcasts in the percentage of super-cooled water.
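The point classification described above can be sketched as follows. This is a minimal illustration assuming each model point carries an IWC and an LWC value in the same units; the function names and sample values are hypothetical, and the default thresholds are the 1% and 99% limits quoted in the text.

```python
def classify_phase(iwc, lwc, ice_threshold=0.01, liquid_threshold=0.99):
    """Classify one point from the LWC share of the total (IWC + LWC).

    Returns "ice", "mixed", "liquid", or None for a point with no condensate.
    """
    total = iwc + lwc
    if total <= 0.0:
        return None
    lwc_fraction = lwc / total
    if lwc_fraction < ice_threshold:
        return "ice"
    if lwc_fraction > liquid_threshold:
        return "liquid"
    return "mixed"


def combined_fractions(points, ice_threshold=0.01, liquid_threshold=0.99):
    """Fractions of points that are ice+mixed and liquid+mixed.

    `points` is an iterable of (iwc, lwc) pairs; mixed-phase points are
    counted in both combined categories, as in the comparison in the text.
    """
    counts = {"ice": 0, "mixed": 0, "liquid": 0}
    for iwc, lwc in points:
        phase = classify_phase(iwc, lwc, ice_threshold, liquid_threshold)
        if phase is not None:
            counts[phase] += 1
    n = sum(counts.values())
    if n == 0:
        return {"combined_ice": 0.0, "combined_liquid": 0.0}
    return {
        "combined_ice": (counts["ice"] + counts["mixed"]) / n,
        "combined_liquid": (counts["liquid"] + counts["mixed"]) / n,
    }


# Hypothetical (iwc, lwc) pairs, not values from the hindcasts.
sample = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0), (0.0, 0.0)]
fractions = combined_fractions(sample)
```

Raising `ice_threshold` towards 0.10 reproduces the kind of sensitivity test mentioned above: points with a small LWC share move from the mixed category to ice, so the combined super-cooled liquid fraction falls where the added LWC is small.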
Radar reflectivities confirm the strong underestimation of IWC in the hindcasts (Fig. 12). For instance, below 5 km altitude, most of the reflectivity values are in the range 0 to 15 dBZ in the observations, -10 to 5 dBZ in ARPEGE, and -15 to 0 dBZ in LMDZ. The smaller values reached by LMDZ compared to ARPEGE are probably due to the larger percentage of hindcasts and ARPEGE is better than LMDZ in terms of shape of the IWC distribution. Despite a systematic underestimation of reflectivity at all levels, the ARPEGE-LR reflectivity exhibits the closest shape to the observations of the four hindcasts, with a majority of values between 0 and 5 dBZ below 5 km and then a rather constant decrease of reflectivity values with altitude above 5 km, as in the observations.
Finally, to be confident in the above results, additional figures are presented in the supplementary material. Figures S5 to S7 support the above findings by repeating the same analysis along flight F6. Also, a comparison between RALI and CloudSat-CALIPSO measurements has been made along the common path of flight F7 and the A-train. The CloudSat reflectivities have a similar structure and amplitude to the RALI reflectivities (Fig. S8c,d). The DARDAR and RALI target classifications agree in the broad picture, and the main discrepancies originate from the time shift, the higher noise in the CALIOP backscatter and the lower sensitivity of RASTA close to the surface. This explains why the detection of supercooled layers is consistent while the mixed-phase attribution differs slightly owing to the radar sensitivity (Fig. S8e,f). Despite these differences, regions of combined super-cooled liquid (supercooled plus mixed phase) are rather similar, which gives confidence in the above conclusions.
To conclude, LMDZ produces more IWC, which is associated with more intense latent heating than in ARPEGE. In that sense, it is closer to the observations. However, the ratio between liquid and solid species contributing to the IWC is less realistic in LMDZ than in ARPEGE. Hence, whilst the IWC can provide some information about the diabatic heating, caution is needed in interpreting the results, as the IWC does not provide enough information to determine which of the two models produces the better heating compared to reality. However, the microphysical observations from flights during field campaigns are still useful in helping to identify the deficiencies of each model, to determine which processes are linked in the models, and to explain why one of the models produces a more active cyclone than the other.
The protocol also gives us valuable insight into the formation of the Stalactite Cyclone. Figure 13 shows a schematic of the many stages of the Stalactite Cyclone: from initiation as a DRV initiated from a mesoscale convective system (point 0), through the merger of the DRV (point 1) and a dynamical forcing factor (point 2) at cyclogenesis (point 3), to its rapid deepening associated with strong diabatic forcing throughout the column and interaction with multiple, embedded, upper-level high PV regions (point 4), and comparisons with the observations (point 5), through to cyclolysis. There are differences between each of the models and with the analysis at each of these points, and these are summarised in the main results below. The points are numbered based on the schematic (Fig. 13).
1. All hindcasts produce a DRV to some degree of accuracy: LR hindcasts produce a qualitative DRV whereas HR hindcasts produce a quantitative DRV that meets the criteria of Boettcher and Wernli (2013) (Table 1).
2. All models produce an upper-level potential vorticity cut-off. However, due to its fine-scale structure, the cut-off is neither as intense nor as deep in the LR hindcasts as in the HR hindcasts and analysis. Therefore, the LR hindcasts produce a weaker northern precursor.
3. Due to the above, the initial deepening associated with the vortex roll-up between the two precursors at cyclogenesis is weaker in LR hindcasts. However, the initial deepening is better represented when the resolution increases. This reduced initial deepening implies that LR versions cannot fully (dynamically) represent the Stalactite Cyclone.
Although the present results only apply to this particular case study, they have important implications and show areas that warrant further investigation. Firstly, they show that the T-AMIP protocol is useful for considering the physical mechanisms that occur within cyclones and their interaction with dynamics. Secondly, they show that increasing resolution does help with the representation of cyclones, such that within the next few years, when climate models will be regularly run at c. 50 km, many synoptic-scale features of the atmosphere will be dynamically well represented. Finally, and arguably most critically, they warn that although climate models may produce similar cyclones, they can be doing so for very different reasons, and these reasons are likely to have an influence upon other areas of the climate system and the response of model cyclones to climate change. We recommend further research into the partition of super-cooled liquid water, mixed phase and ice water in models (and the influence this has on cyclone representation), and further comparisons with observations in all regions, as this will have a strong influence on the development of microphysical schemes in climate and weather prediction models. Therefore, whilst signs are encouraging for future versions of climate models, caution is still needed when considering current simulations of future climate scenarios and the impact of extra-tropical cyclones, particularly for regional impact-based studies.
Data availability. Data is available by contacting either D. Flack at david.flack1@metoffice.gov.uk or the corresponding author.
observations; b-e) hindcast output using "potential" ice water content (cloud ice + snow) without applying a mask to the observations; g-j) hindcast output of "potential" ice water content (cloud ice + snow) and with the observation mask applied; k-n) hindcast for "maximum" ice water content (ice water content + liquid water content) without the observation mask applied; and o-r) hindcast of "maximum" ice water content (ice water content + liquid water content) with the observation mask applied.
Figure 11. Difference bi-variate histograms for F7 of ice water content vs. pressure between ARPEGE and LMDZ for a) LR differences in "potential" ice water content (cloud ice + snow) only (Fig. 10b - Fig. 10d); b) HR differences in "potential" ice water content (cloud ice + snow) only (Fig. 10c - Fig. 10e); c) LR differences in "maximum" ice water content (ice water content + liquid water content) (Fig. 10k - Fig. 10m); and d) HR differences in "maximum" ice water content (ice water content + liquid water content) (Fig. 10l - Fig. 10n). Reds refer to ARPEGE having a larger quantity and blues for LMDZ. The colour scale applies to all panels. The hindcasts are initiated at 00 UTC .