Global Evaluation of the Fidelity of Clouds in the ECMWF Integrated Forecast System

Weather forecasting centers mainly assimilate infrared sounder data in clear‐conditions or in channels with their main sensitivity to the atmosphere well above the cloud tops. Sometimes channels with stronger cloud sensitivity are used in overcast conditions, but currently no cloud information is used from infrared sounders, and all‐sky assimilation approaches are still under development. However, cloudy radiances could already be used for validating the quality of clouds in forecasts. We illustrate this by comparing the brightness temperatures observed (obs) with AIRS (Atmospheric Infrared Sounder) to those calculated (cal) based on the clouds specified in the ECMWF (European Centre for Medium Range Weather Forecasting) Integrated Forecast System (IFS). Our analysis is based on a 12 hr ingest of AIRS data into the ECMWF assimilation system. We show that the standard deviation of (obs‐cal) using the 1,231 cm−1 atmospheric window channel is a metric of the fidelity of the clouds in the IFS. The global standard deviation of 5 K after accounting for likely space/time interpolation errors, appears to be dominated by clouds in the IFS which are not seen in the AIRS data, and vice versa. Our metric capitalizes on the unique sensitivity of infrared sounders to clouds for the routine monitoring of the fidelity of clouds in weather forecasts.

• We compare the clouds observed from space to the clouds in a medium range weather forecast system, where clouds are created from first principle physics alone • We propose a metric, which allows the numerical assessments of the cloud fidelity of the current forecast system to be compared with future upgrades • On average, the agreement between the forecast clouds and Atmospheric Infrared Sounder (AIRS) observations is very good, but, particularly in areas prone to deep convection, there are large cloudy areas which are not seen in the AIRS data, and vice versa cles. They are more sensitive to the larger frozen particles (along with water cloud and rain) and successful all-sky assimilation has been possible (e.g., Geer et al., 2017). Unlike the current assimilation of infrared data, this makes use of the cloud information itself. Although there has been much recent progress on the experimental applications of all-sky assimilation of infrared radiances (e.g., Geer et al., 2019;Li et al., 2021;Okamoto et al., 2014Okamoto et al., , 2019Otkin & Potthast, 2019;Sawada et al., 2019) this is not yet operationally done at any weather forecasting center.
The full use of cloudy infrared data, even if not assimilated, has historically been stymied by concerns about the computational cost and accuracy of cloud-capable Radiative Transfer Models (RTMs). A number of RTMs have now been developed to allow the calculation of infrared sounder radiances, given the vertical distribution of T, q, and clouds, for example, SARTA (Machado et al., 2017), CRTM (Ding et al., 2011), RTTOV (Vidot et al., 2015), and PCRTM (Chen et al., 2013;Liu et al., 2006Liu et al., , 2016Liu et al., , 2009). Aumann, Behrangi, and Wang (2018) and Aumann, Chen, et al. (2018) evaluated the degree to which the calculated brightness temperature (cal) agreed with the Atmospheric Infrared Sounder (AIRS) observation (obs) based on ECMWF IFS (Integrated Forecasting System) data from March 2009. Using the 1,231 cm −1 thermal infrared window channel, they found that the RTMs agreed with each other with little bias and 6-10 K Standard Deviation (SD), but the SD of (obs-cal) was as large as 22 K. This large disagreement was attributed to the fact that the AIRS data were interpolated to match the ECMWF data on a 3 hr and 25 km grid. This grid was too coarse to allow a credible space and time interpolation of clouds in the IFS to the AIRS observations. The present work makes use of the much better space and time interpolation provided by the IFS. This more accurate interpolation should make it possible to better attribute remaining discrepancies between the calculated brightness temperatures and the observations.
Even in the absence of all-sky data assimilation, the statistics of (obs-cal) are very useful for understanding the quality of model cloud fields. Early use of infrared and broadband radiances in the validation of model cloud fields includes Allan et al. (2007) and Chevallier and Kelly (2002). With more recent weather forecasting models and more advanced radiative transfer approaches, comparisons to all-sky infrared radiances have shown deficiencies in ice cloud representations in several forecast models (e.g., Okamoto et al., 2021;Otkin & Potthast, 2019). Systematic errors in the IFS cloud representation are significantly smaller but still present (Geer et al., 2019); these are discussed in more detail later. Infrared radiances are also simulated from climate models and compared to MODIS observations (MODerate resolution Imaging Spectroradiometer, e.g., Bodas-Salcedo et al., 2011;Masunaga et al., 2010). However, this can only be done in a climatological sense, whereas comparisons to weather forecasting models can be done at the level of individual weather systems. The objective of our study is to further highlight the ability of the daily and scene-level statistics of (obs-cal) from all-sky infrared radiances to quantify the fidelity of the clouds in weather models, thus capitalizing on the unique sensitivity of infrared data to clouds.
Along with much previous work, this is a case study, and the real benefits would come from routinely monitoring the all-sky infrared observations in operational data assimilation systems. In the following, we illustrate this using the ECMWF IFS matched to AIRS data as an example, but our analysis is relevant to any other weather forecasting model and all hyperspectral infrared sounders.

Observations
NASA's AIRS (Aumann et al., 2003) became operational in September 2002. Since 2005 NOAA has distributed AIRS data to the NWCs for assimilation in weather forecasting systems. AIRS data are distributed in 6 min granules. Each granule covers an area of about 1,500 × 2,000 km with 90 (cross-track) × 135 (along-track) observations. Each observation has a 1.1° field of view (15 km at nadir from 707 km altitude), and produces a 2,378 channel spectrum of calibrated radiances between 3.7 and 15.5 μm. NOAA distributes spatially and spectrally subsampled AIRS data. The data are spectrally subsampled by distributing only the 324 of 2,378 channels selected by the NWCs. The data are spatially subsampled by dividing the 90 × 135 spatial pixels into 30 × 45 "golf balls" of 3 × 3 pixels. From each golf ball NOAA selects the warmest pixel for distribution, with the assumption that the NWCs are mainly interested in clear data and that the warmest pixel in a golf ball would be the least cloudy pixel. Our comparisons are based on this subsampled data, since this is what is available to the NWCs. This selection potentially creates a warm bias, which will be discussed later.

Model Fields
The state of the atmosphere was obtained from a 12 hr background forecast from an experimental run of Cycle 46r1 of the IFS (ECMWF, 2019). The IFS represents the atmosphere and clouds as profiles of cloud cover (cc), cloud ice water content (cic), cloud liquid water content (clc), temperature, water vapor, and ozone at 137 pressure levels. Total cloud cover (tcc) is derived from the cloud cover profiles. The 12 hr forecast is initialized from an analysis that assimilates AIRS data and other hyperspectral infrared instruments only under clear sky conditions, and well above clouds. The IFS also assimilates radiances from MHS, MWHS2 (MicroWave Humidity Sounder 2), AMSR2 (Advanced Microwave Scanning Radiometer 2), GMI (GPM Microwave Imager), and SSMIS (Special Sensor Microwave Imaging Sounder) in atmospheric window channels over the oceans at 19, 22/24, 37, 89/92, 150/166 GHz and channels around 183 GHz, under all-sky conditions (Geer et al., 2017). Clouds in the analysis are also indirectly constrained by the assimilation of temperature and moisture sensitive observations from other sensors. However, especially since a 12 hr forecast is being used here, the strongest constraint on the cloud fields is the model physics, which includes a large-scale prognostic cloud scheme and a diagnostic mass-flux convection scheme. The model fields are represented in the horizontal using a combination of spectral and gridded fields in a configuration referred to as Tco1279, which provides around 8-9 km sampling, depending on latitude. The model timestep is 7.5 min but only every fourth time step was used in the matchup process, so the maximum time offset between the IFS grid and AIRS observations was about 15 min. The atmospheric profiles from the IFS internal 12 hr forecast grid were interpolated to the AIRS sample time and position using the operators for data assimilation in the IFS.
We used the ECMWF profiles for the 12 hr period between 31 October 2018 21:00 UTC and 1 November 2018 09:00 UTC. This time period was covered by 120 AIRS data granules. The AIRS data were merged with the IFS interpolated atmospheric states at the AIRS space/time locations. The model precipitation fields were not used in this work; this is a standard assumption when simulating all-sky infrared radiances.

Radiative Transfer Models
The IFS data were converted to brightness temperatures using SARTA, CRTM v2.3, and PCRTM 3.4. The PCRTM calculations used the Chen et al. (2013) model with 2, 4, and 50 columns with the Maximum Random Overlap (MRO) and Exponential Random overlap cloud overlap assumptions. In the following, we refer to PCRTM MRO_50col as PCRTM unless stated otherwise. CRTM was used in its default Advanced Doubling-Adding MRO 2col mode. SARTA was used in a TwoSlab mode (effectively 2col), with clouds placed at the median pressure altitude of the cloud ice and cloud liquid water profiles, with the Random Overlap assumption. Our choice of RTMs was intended to cover the range from the highest fidelity cloud calculation (PCRTM 50col) to RTMs which traded some cloud fidelity for a computational less stressing approach (SARTA and CRTM). For the surface emissivity, we used Masuda et al. (1988) for ocean, Zhou et al. (2011) for land.

Results
The computationally most economical way to evaluate the fidelity of the clouds in the IFS is to simulate one atmospheric window channel. We selected the 1231.3 cm −1 channel, and calculated its brightness temperature, bt1231. We then evaluated the difference between the observed bt1231, obs, and the calculated bt1231, cal. This window channel was chosen since (a) all the current generation hyperspectral sounders have a similar channel, (b) it is not sensitive to the distribution of CO 2 , Ozone, or other minor gases, (c) it is only weakly influenced by the water vapor continuum, about 2 K under clear conditions, (d) Non-unity surface emissivity causes only slight decreases in bt1231. For example, the sea surface emissivity of 0.98 decreases bt1231 by typically 1 K. Under ideal conditions, all effects are accounted for in the RTM calculation. Under clear ocean conditions the mean(obscal) at 1231.3 cm −1 is less than 0.1 K and the SD is typically 0.4 K (Aumann et al., 2021). The bias and the SD change drastically with the presence of clouds, as will be discussed subsequently. We first discuss results for three granules. This allows us to define various metrics, which are then used for the interpretation of global results.

Granule Analysis
We selected three of the 120 granules to define the parameters used for the analysis of the differences between the AIRS observation and the RTM calculations. Figure 1 shows the locations of the granules. Granule 215 (red) is from the night mid-latitude ocean, granule 55 (blue) represents the night tropical ocean, and granule 64 (green) is from the day tropical ocean warm pool.

Granule 215
The left panel of Figure 2 shows bt1231 for the NOAA subsampled AIRS observations. The surface temperatures in this granule range from 274 to 294 K. The coldest cloud tops were at 225 K. The warmest (darkest red) areas in term of the observed bt1231are relatively clear, although bt1231 is still 5 K colder than the underlying sea surface. Since water vapor accounts for about 2 K, emissivity for about 1 K, clouds in this relatively clear area account for 2 K. The clouds which create this 2 K effect are not necessary uniformly distributed over the footprint. If the 15 km diameter footprint were totally free of clouds above a 295 K surface, except for a 1 × 1 km thunderstorm with cloud top temperature of 225 K, bt1231 would drop by only 0.3 K. The 2 K cloud effect could thus be the result of seven thunderstorms scattered in the footprint, or 50% of the footprint covered with low stratus clouds 1 km above the surface. The colder (green and blue) areas are much more cloudy, with a band of high clouds stretching almost diagonally across the granule.
The center panel of Figure 2 shows the water cloud fraction, Σclw*cc/Σclw, the right panel shows the ice cloud fraction, Σciw*cc/Σciw, based on the IFS profiles, with the sum being over the vertical model levels. The units of the x and y axis in this and all subsequent granule images are the cross-track and along-track positions, separated by approximately 45 km. The most striking feature in the obs is the band of cold clouds, which extends from the mid left to the lower right corner in the granule, corresponds to a band of ice clouds.
In Figure 3, we show (obs-cal) from SARTA (left) and PCRTM MRO 50col (center). The results from two RTMs differ from obs in the area of mixed clouds by as much as ±40 K. Inspite of the fact that the differences between PCRTM and SARTA (right panel) include the inherent randomness of the cloud overlap assumptions in areas of broken clouds, the two RTMs agree with each other better than with the observations. The difference between the obs and cal may have two additional components: (a) NOAA only distributes the warmest bt1231 of each 3 × 3 "golf balls" and (b) spatial inhomogeneity and forecast errors. The spatial inhomogeneity in a golf ball can be quantified by the cx1231 parameter, which is the difference between the warmest and the coldest bt1231 in a golf ball, calculated from the full spatial resolution AIRS data. The following provides two numerical examples.
1. Assume that the center of the 3 × 3 is near the edge of a large cold cloud at 225 K over a 295 K clear surface.
Based on the IFS, cal should be close to 225 K. Now assume that one of the 3 × 3 footprints extends half a footprint diameter (7 km) into the clear area at 295 K. As a result, the observed bt1231 will be 270 K, and  NOAA will select the 270 K footprint for distribution. Now obs will be 45 K warmer than expected, and the footprint will be an extreme warm (obs-cal) outlier. The NOAA clearest of the 3 × 3 selection will produce only warm outliers. The cx1231 parameter for this case will be about 50 K. 2. A similar situation can be caused by a space/time interpolation error, or equivalently by position errors in the clouds generated in the IFS forecast. Assume the same situation as above, but the edge of the cloud has shifted 7 km, such that the footprint is now further away from the edge, with bt1231 = 220 K, or it has moved 7 km into the clear, and now bt1231 = 270 K. This scenario can also be reversed, a 295 K expected obs based on the IFS, can turn into a 270 K observed. The interpolation error thus produces positive and negative outliers, statistically in equal number. The cx1231 parameter for this case will be about 50 K.
Areas of high spatial nonuniformity are sensitive to the interpolation or forecast error. The left panel of Figure 4 shows cx1231, the right panel shows (obs-cal) for SARTA (same as Figure 3a). There is an excellent visual  We characterize the PDF of (obs-cal) using three methods.
1. The Gaussian mean and SD of all footprints in a granule. As discussed above, the (obs-cal) will include large outliers, which inflate the SD. 2. We can argue that (obs-cal) in areas of high spatial inhomogeneity are unreliable and exclude observations where cx1231 exceeds a threshold from the calculation of the mean and SD of (obs-cal). 3. Since the NOAA distribution does not include the cx1231 parameter, we could also use quartile statistics.
Quartile statistics effectively eliminate all positive and negative outliers, regardless of their origin. Let Q1, Q2, and Q3 be the first, the median, and the third quartile. The mean is replaced by Q2, and (Q3-Q1) becomes the effective SD. We define the quartile skew as (Q3-2*Q2 + Q1)/SD.
In Table 1, we summarize results (obs-cal) from SARTA and PCRTM MRO 50col in terms of mean and SD. Column #2 shows the Gaussian statistics, Columns #3 and #4 are the results with cx1231 < 10 K and cx1231 < 5 K filtering, and column#5 shows the quartile statistics. The first number is the mean, following the ± is the SD, the third number, relevant only for cx1231 filtering, is the number of cases from the granule used for the statistics.
In granule 215 88% of the data are from relatively uniform, cx1231 < 10 K, areas. For the quartile statistics only 625 of the possible 1,350 points are used. As expected, the SD of (obs-cal) decreases, when outliers due to spatial inhomogeneity are removed. The effect is stronger with quartile statistics. In all cases, there is little impact on the mean (obs-cal).
We limited Table 1 to results from SARTA and PCRTM MRO 50col, since they represent the range from the computationally least and most demanding RTMs. The results are typical of other RTMs. As an example, the Gaussian statistics, mean ± SD, from granule#215 for CRTM are +1.1 ± 8.2 K, for PCRTM MRO 2col they are +0.6 ± 6.8 K.

Granule 55
This granule is from a night overpass of the tropical Atlantic Ocean off the coast of Brazil. The surface temperature of the ocean ranged from 300 to 322 K. Figure 5 shows the NOAA distribution of bt1231 (left panel), the water cloud cover (center), and the ice cloud cover from the associated IFS data.
In Figure 6, we compare (obs-cal.SARTA), (obs-cal.PCRTM.MRO.4col), and (obs-cal.PCRTM.MRO.50col) in granule 55. The two red areas at the center of the picture are ice clouds in the IFS not seen in the AIRS data. All three RTMs are up to 40 K warmer than AIRS (blue) in the lower left corner of the granule for ice clouds above water clouds. The red and blue pixels are typically not isolated incidents, but are seen in cluster of 10 or more pixels. Since each pixel subtends a 45 × 45 km area, the discrepancies between observed and IFS clouds extend for 100 km or more. We come back to this later.
The left panel of Figure 7 shows cx1231, the right panels repeat (obs-cal.SARTA) from Figure 6. A visual correlation between cx1231 and obs-cal is seen in the lower left corner of granule 55, where (obs-cal) is negative, that is, the clouds in the IFS are optically too thin or too warm (low). In the center area, there is no correlation between cx1231 and the cold bias in cal (positive obs-cal). This indicates that in this granule some outliers are not due high spatial contrast, but are an indication of the limited accuracy in the IFS cloud field on a 100 km scale. The (obs-cal) mean and SD, the cx1231 < 10 K and cx1231 < 5 K filtered mean and SD, and the quartile statistics are summarized in Table 1. Only 12% of the data from granule 55 are rejected by the cx1231 < 10 K filter.

Granule 64
Granule 64 is from a daytime overpass of the tropical warm pool. The surface temperature of the ocean ranged from 300 to 304 K. Figure 8 shows the NOAA distribution of bt1231 (left panel), the Cloud Liquid Water fraction (center), and the Cloud Ice Water fraction (right) from the associated IFS data. Inspection of the AIRS bt1231 observations shows that the lower half of the granule is relatively clear (dark red). The IFS in this region has close to zero water cloud cover and low ice cloud cover. In the upper half of the granule AIRS obs show two areas of Deep Convective Clouds (DCCs), blue area where bt1231 < 220 K, but only the one on the right agrees loosely with an area of ice clouds in the IFS.
The left panel of Figure 9 shows the observed bt1231 (same as left panel Figure 8), the center shows (obs-cal.PCRTM) and the right panel shows cx1231. Only PCRTM is shown, as all RTMs essentially agree. The  lower half of granule 64 is relatively clear, as seen by the 300 K obs, and spatial uniformity (low cx1231). The small (obs-cal) shows that AIRS and the IFS agree. However, in the area centered on [col 7, row 15] in Figures 8  and 9, the IFS sees a 40 K warmer area than the cold clouds seen in obs. But the area centered on [7,7], which is very cold in the IFS, is fairly warm in the obs. Neither area is associated with a large cx1231. It appears that the cold clouds seen in the obs are seen in the IFS shifted about 8 pixels to the north. This is a displacement of 8 × 45 = 350 km. Spatial displacements of this magnitude largely cancel in the granule mean (obs-cal), but contribute to an enhanced SD. The granule statistics are summarized in Table 1.
An additional perspective into granule 64 is provide by two scatter diagrams. The left panel of Figure 10 shows (obs-cal) as function of cx1231. Here, 35% of the data have cx1231 > 10 K. The outliers in (obs-cal) are relatively consistent with SARTA and PCRTM. Figure 10 (right panel) shows (obs-cal) as function of the local  sea surface temperature (Canada Meteorological Center., 2012). The big outliers are almost exclusively at a surface temperature between 302.5 and 303.5 K. Sea surface temperatures warmer than 302 K are associated with the rapid onset of deep convection and DCCs (Aumann, Behrangi, & Wang, 2018;Aumann, Chen, et al., 2018). The observation that the presence of the DCCs results in almost symmetric large positive and negative outliers in (obs-cal) indicates that the IFS creates DCCs, but the space/time interpolation to the AIRS observations results in random hits and misses, which results in a large SD, with little impact on the granule mean.

Global Results
In Figure 11, we plot SARTA and PCRTM statistics for all granules as function of latitude. In all cases, the mean is only weakly latitude dependent, and relatively close to zero considering the large cloud displacement errors  seen in the individual granules. The SD increases steeply in the tropical zone (left panel). Averaged over all latitudes, the SD is 7.4 K for SARTA, 7.5 K for PCRTM. When the high scene inhomogeneity cases are removed with a cx1231 < 10 K filter (center), this latitude dependence is almost totally suppressed. With cx1231 < 10 K filtering, and averaged over all granules, the Gaussian mean is −0.61 K, SD = 6.4 K for SARTA, compared to mean = +0.30 K, SD = 5.2 K for PCRTM. The PDF of (obs-cal) is very symmetric. The quartile skew of the PDF is −0.0025 for SARTA, and +0.0046 for PCRTM. The quartile statistics (right panel) effectively suppresses the latitude dependence and any skew. There is no significant change in the bias, and SARTA and PCRTM have SD = 6.3 K with quartile statistics. Figure 2 is an example of how good the IFS can be away from the tropics, but even there the SD(obs-calc) after cx1231 filtering is about 5 K. Using the [x = cross-track, y = scanline] notation in Figure 2, the band of cold clouds seen by AIRS extending from [0, 15] on the center left to [35,40] in the lower righthand corner, matches the band of ice clouds. The band of ice clouds overlays a broader region of significant water clouds. This creates a complicated cloud overlap situation. This is seen by the fact that this band matches a band in (obs-cal.SARTA), where cal.SARTA is typically 10 K warmer than obs. These cases are better handled by PCRTM MRO 50col. Figure 4 (left panel) shows that this band is filled with many cx1231 > 10 K cases, that is, broken clouds. If these cases are eliminated from the granule statistics, the performance of SARTA and PCRTM are very close, but we also excluded spatially non-uniform cases were PCRTM 50 col would represent the clouds with higher fidelity.

Granule Statistics
Granule 55 provides more insights into differences between RTMs. Focus on the [27, 23] area in Figure 6. The yellow/red (obs-cal) indicates that PCRTM MRO 50col is only slightly colder than AIRS, but the red PCRTM MRO 4col generates colder, while the dark red SARTA (2 col MO) generates much colder brightness temperatures, up to 40 K colder than AIRS. This area shows quite low ice cloud amounts, but the cloud tops are high (about 200 hPa according to SARTA). It seems that SARTA and PCRTM MRO 4col give too much weight to the high clouds. Since the 50-col approaches agree best with AIRS, the IFS simulation appears to be not too far wrong for this cloud feature. The advantage of MRO 50col is in complicated high/low cloud overlap region.
The spatial scales of (obs-cal) discrepancies are virtually never limited to single 45 km golf balls. This indicates that they are not simple mismatch issues. In Figure 6, lower left corner [1:10, 32:35] the simulations are typically warmer than AIRS in all RTMs, that is, (obs-cal) is dark blue, and mean RTM differences reach 10-20 K. Yet, the IFS has high ice cloud tops (250 hPa or higher) and significant ice cloud amount (greater than 3 g/m 3 product of TCC and column cloud). We can only speculate that the IFS may need to put the clouds even higher, or give higher ice contents or cloud fractions. The bigger picture is that the IFS and reality (AIRS) are not too far apart, at least in terms of the shape of this convective system. At [8,22] the IFS has cloud tops near 250 hPa and with high ice cloud amount. This seems to be a clear case of the IFS generating deep convection where none exists in reality. Figure 11. Latitude dependence of the SD of (obs-cal) from SARTA and PCRTM MRO 50col. Left: all data, Center: with a cx1231 < 10 K filter. Right: quartile statistics. The latitude dependence of the SD is flattened out almost equally well by cx1231 < 10 K spatial coherence filtering and quartile statistics.

Global Statistics
The global (obs-cal) statistics shown in Figure 11 indicate little difference between the RTMs: The mean is close to ±1 K, with SD = 7.5 K, independent of the methodology. When spatial inhomogeneity cases are eliminated directly by cx1231 filtering, or indirectly using the quartile statistics, the global SD is still about 6 K. One could argue that the main reason why the IFS and AIRS do not match up better is that we used a 12 hr forecast of the clouds, not a post-facto analysis of the IFS clouds. There is very little predictability of the smallest scales of cloud and precipitation for example, (sub 100 km) beyond a few hours, so it is natural that cloud features in the 12 hr forecast may be displaced or mis-represented compared to the observations. Figure 10 makes this point with the onset of deep convection in the tropical warm pool. However, even if we were looking at the analysis instead of the forecast, the 4D-Var assimilation technique would not get the clouds, particularly deep convection, in exactly the right place in the analysis, because 4D-Var is usually constrained to follow a 12 hr forecast trajectory (all-sky assimilation systems with much shorter timescales, e.g., on the order of 10 min, do show promise at fitting clouds more exactly, e.g., Sawada et al., 2019). Hence, one may conclude that the displacement and misrepresentation of clouds on the smallest scales is just a natural feature of any forecast cloud data set. However, as illustrated in the granule images, the displacements between observed and missing cold clouds can be much larger than 100 km.
Compared to the bias(obs-cal), which is typically less than 1 K, the standard deviation of (obs-cal) and the outliers are revealing. The three main reasons for discrepancies between "obs" and "cal" are (a) the IFS clouds in the 12 hr forecast are not necessarily correct. (b) Even when they are correct, the RTMs may not be capable of accurately converting the IFS clouds into "obs." (c) In areas of high cloud inhomogeneity the AIRS "obs" may not represent a "truth" due to a spatial and/or temporal mismatch, which gives rise to positive and negative outliers. For a large ensemble, the resulting bias averages to zero, but the outliers leave a tell-tale enlarged SD (e.g., of the order of 5 K in Figure 11). Figure 11 shows that once the largest outliers are eliminated by removing spatially inhomogeneous scenes (either using the cx1231 filtering, or quartile statistics) the bias and SD are relatively independent of latitude and RTM, outside a few tropical locations. The concern that the NOAA selection of the warmest footprint in a 3 × 3 golf ball may cause the PDF of (obs-cal) to lean toward the warm side, that is, create a positive outlier skew, appears to be unfounded: Once the gross outliers are eliminated by cx1231 < 10 K filtering, the skew is very small.
Discrepancies between clouds in the IFS and in collocated observations have been reported previously, for example, comparing to AMSR-E 19 and 37 GHz data (Geer & Bauer, 2011) and to IASI (Infrared Atmospheric Sounding Interferometer, Geer et al., 2019). The latter used data as recent as February 2018, based on the immediate prior version of the IFS to the one used in our evaluation. The results indicated a lack of clouds in the IFS over the marine stratocumulus regions, which led to the calculated brightness temperatures being warmer than observed.
In the inter-tropical convergence zone over ocean, the IFS appeared to overestimate convection as observed in the infrared, which lead to the calculated brightness temperature being colder than observed, that is, a warm bias in (obs-cal). In the present study, we see a rough balance between warm and cold bias, which results in a global and zonal mean bias close to zero, but with an enlarged SD compared to what is seen in clear-sky comparisons. Bias maps would likely reveal small spatial variations similar to those seen by Geer et al. (2019) but it would require many days of averaging to compute them; this is not available from our case study.

Routine IFS Cloud Fidelity Evaluation
In the current IFS, the mean and SD of (obs-cal) of many channels from infrared sounders are evaluated under clear-sky screened conditions, and their time series are routinely monitored to assess the quality of both the observations and forecast. Monitoring the global mean and SD of (obs-cal) using a cloud-enabled RTM could similarly provide information on the fidelity of the model cloud field. Making the all-sky SD a routine diagnostic product would allow most of the area covered by infrared sounders to be utilized, not just the limited clear areas. The routine availability of the SD of (obs-cal) would help monitor improvements in the representation of clouds in the IFS and other weather forecasting systems. Future consideration could also be given to statistics with reduced sensitivity to outliers induced by cloud displacements, such as filtering using the spatial inhomogeneity measure (cx1231) or quartile statistics illustrated here. Other methods exist in the literature but can be substantially more complex to apply (e.g., Roberts & Lean, 2008).
For the case study presented in this paper neither data volumes nor computational complexity were a major issue. However, for the routine monitoring of the cloud fidelity in any model, resource requirements impose severe limitations. Here, in order to generate sufficiently accurate colocations, we have made use of the internal IFS-grid time/space interpolation, which has access to the model fields every 30 min, something which is nearly impossible to archive or distribute externally. Hence, the routine generation of these statistics would need to be done online, within the weather forecasting model. Observation simulators for climate models (e.g., Bodas-Salcedo et al., 2011) are run online for similar reasons. The need for the highest fidelity cloud enabled RTM also needs to be considered. The runtime of any cloud enabled RTM is proportional to the number of channels, the complexity of the cloud microphysics and associated scattering code, and the number of columns. More columns should increase the fidelity of the calculation, particularly for broken clouds, but they also will increase to computer resource requirements. In Figure 11, SARTA and PCRTM MRO 50col represent two extremes in terms of the computational resource requirements: SARTA uses a simple two-slab approach for simulating the IFS clouds, while the PCRTM 50-column MRO is intended for a more faithful simulation of the IFS clouds. However, we find that the decrease in bias and SD of using the highest fidelity RTM is marginal compared to the increase in computational complexity. This finding has a simple explanation: PCRTM MRO 50 col is much better at handling broken clouds, but the cal under these conditions are also extremely sensitive to space/time interpolation errors. Filtering out the most broken cloud cases to suppress interpolation or cloud displacement related outliers also removes cases where a computationally more demanding RTM could have had an advantage.

Summary
The objective of our study was to explore the use of simulated all-sky infrared radiances to quantify the fidelity of the clouds in model simulations, based on observation minus simulation (obs-cal). The all-sky approach capitalizes on the unique sensitivity of infrared observations to clouds. For our evaluation, we used AIRS observations and the ECMWF IFS. A main step forward was to use the internal IFS interpolation to achieve colocations within 15 min and 5 km, much better than achieved in our previous comparison (Aumann, Behrangi, & Wang, 2018;Aumann, Chen, et al., 2018) and difficult to achieve outside of a forecast model (i.e., in an "offline" study). Detailed examination of the differences on the granule scale showed that modeled and observed clouds are often displaced, likely due to forecast error and remaining interpolation error. While some of the discrepancies between AIRS and the IFS could be explained by spatial and temporal mismatch issues in spatially inhomogeneous areas, there are cases in spatially relatively uniform areas where the IFS claims deep convection where none exists in reality or the deep convection is displaced by hundreds of kilometers. This leads to large standard deviations of (obs-cal) on the granule scale. However, granule averages are almost unaffected by these displacements, given the small granule mean of (obs-cal). The effect of these displacements can be reduced by eliminating scenes using measures of spatial inhomogeneity or quartile statistics. This work has also evaluated different RTMs, illustrating the benefits of using large numbers of columns to represent cloud overlap in complex cloud profiles. However, when the most complex scenes are removed using the spatial inhomogeneity filter or quartile statistics, little is gained from RTMs with more than two columns.
A possible metric of the IFS cloud fidelity is the standard deviation of the all-sky (obs-cal) for a chosen window channel. We illustrated this with AIRS data, but the proposed metric applies to any hyperspectral infrared sounders. Even when the SD of all-sky (obs-cal) is filtered for the most spatially inhomogeneous cases, it still utilizes an order of magnitude more of the infrared sounder observations than in current clear-sky approaches. If calculated online within the processing chains of a weather forecasting system such as the IFS, the time series of all-sky (obs-cal) statistics like SD would provide routine feedback on improvement in the quality of the clouds.

Data Availability Statement
The IFS files and the associated AIRS matchup files are freely available from https://thunder.jpl.nasa.gov/ftp/ hha/ECMWF20181101/. The elat, elon, esolzen and ebt1231 in each of the 120 matchup files are the latitude, longitude, solar zenith angle, and bt1231 of the NOAA distribution.