MESMER-M: an Earth system model emulator for spatially resolved monthly temperature

. The degree of trust placed in climate model projections is commensurate with how well their uncertainty can be quantiﬁed, particularly at timescales relevant to climate policy makers. On inter-annual to decadal timescales, model projection uncertainty due to natural variability dominates at the local level and is imperative to describing near-term and seasonal climate events but difﬁcult to quantify owing to the computational constraints of producing large ensembles. To this extent, emulators are valuable tools for approximating climate model runs, allowing for the exploration of the uncertainty space surrounding selected climate variables at a substantially reduced computational cost. Most emulators, however, operate at annual to seasonal timescales, leaving out monthly information that may be essential to assessing climate impacts. This study extends the framework of an existing spatially resolved, annual-scale


Introduction
Climate model emulators are computationally inexpensive devices that derive simplified statistical relationships from existing climate model runs to then approximate model runs that have not been generated yet.By reproducing runs from deterministic, process-based climate models at a substantially reduced computational time, climate model emulators facilitate the exploration of the uncertainty space surrounding model representation of climate responses to specific forcings.A wide toolset of Earth system model (ESM) emulators exists with capabilities ranging from investigating the effects of greenhouse gas (GHG) emission scenarios on global and regional mean annual climate fields (Meinshausen et al., 2011;Tebaldi and Arblaster, 2014) to looking at regional-scale annual, seasonal, and monthly natural climate variabilities (McKinnon and Deser, 2018;Alexeeff et al., 2018;Castruccio et al., 2019;Link et al., 2019).
Recently, the Modular Earth System Model Emulator with spatially Resolved output (MESMER) (Beusch et al., 2020) has been developed with the ability to statistically represent ESM-specific forced local responses alongside projection uncertainty arising from natural climate variability.It does so using a combination of pattern scaling and a natural climate variability module, to generate grid point level, yearly temperature realizations that emulate the properties of ESM initial-condition ensembles.By training on different ESMs individually, MESMER is furthermore able to account for inter-ESM differences in forced local responses and natural climate variability, thus approximating a multi-model initialcondition ensemble, i.e. a "super-ensemble".The probability distributions of grid point level, yearly temperatures generated by MESMER could be especially relevant when used as input data for simulation of impacts that depend on this variable.MESMER thus offers the perspective of improving our description of the likelihood of future impacts under multiple future climate scenarios.
Considering the importance of monthly and seasonal information when assessing the impacts of climate change (Schlenker and Roberts, 2009;Wramneby et al., 2010;Stéfanon et al., 2012;Zhao et al., 2017), extending the MES-MER approach to grid point level monthly temperatures appears desirable.Such an approach holds additional value in assessing the evolving likelihoods of future impacts, as the temperature response at monthly timescales displays nonuniformities that are distributed onto monthly variabilities and therefore uncertainties, which are otherwise unapparent at annual timescales.In particular, winter months can warm disproportionately more than summer months (Fischer et al., 2011;Holmes et al., 2016;Loikith and Neelin, 2019), which in turn leads to non-stationarity in the amplitude of the seasonal cycle (i.e.intra-annual temperature variabilities) with evolving yearly temperatures (i.e. the intra-annual temperature response is heteroscedastic with regard to yearly temperature) (Fischer et al., 2012;Huntingford et al., 2013;Thomp-son et al., 2015;Osborn et al., 2016).Additionally, given that monthly temperature distributions have been observed to display non-Gaussianity, evolving yearly temperatures may cause disproportionate effects on their tail extremes, leading to changes in skewness (Wang et al., 2017;Sheridan and Lee, 2018;Tamarin-Brodsky et al., 2020).
This study focuses on extending MESMER's framework by a local monthly downscaling module (MESMER-M).This enables the estimation of projection uncertainty due to natural variability as propagated from annual to monthly timescales since MESMER-M builds upon MESMER, which has already been validated as yielding spatio-temporally accurate variabilities.In constructing MESMER-M, we furthermore place emphasis on representing heteroscedasticity of the intra-annual temperature response as well as changes in skewness of individual monthly distributions in a spatiotemporally accurate manner.The structure of this study is as follows: we first introduce the framework of the emulator in Sect.3.1 and the approach to verification of the emulator performance in Sect.3.2; we then provide the calibration results of the emulator in Sect. 4 and verification results in Sect.5, after which we proceed to the conclusion and outlook in Sect.6.

Data and terminology
In the analysis, 38 CMIP6 models (Eyring et al., 2016) are considered, using simulations for the SSP5-8.5 highemission scenario (O'Neill et al., 2016), so as to first explore the emulator's applicability to the extreme end of GHGinduced warming.Where an ESM's initial-condition ensemble set contains more than one member, it is split into a training set (used for emulator calibration) and a test set (used for emulator cross-validation).Systematic approaches in getting the best train-test split exist, such as that employed by Castruccio et al. (2019), which requires the presence of a large ensemble and compromises between stability in inference (represented for example by variability) of the emulator and the computational time for training it.Since the primary purpose of MESMER-M is to provide the best possible emulations based on available training material, such approaches are optional however and would only limit the demonstration of MESMER-M to CMIP6 ESMs with large ensembles.Here we implement a simple 70 %-30 % (and for models with more than 20 ensemble members a 50 %-50 %) train-test split so as to maintain a good balance between training time and emulator performance.A summary of the CMIP6 models used, their associated modelling groups, and the initialcondition ensemble members present within the training and test sets are given in Table A1.All ESM runs are obtained at a monthly resolution and are bilinearly interpolated to a spatial resolution of 2.5 • × 2.5 • (Brunner et al., 2020).The emulator is trained using yearly averaged temperature as predictor values.The term "temperature" here refers to anoma-lies of surface air temperature (standard name, "tas") relative to the annual climatological mean over the reference period of 1870-1899.

MESMER
MESMER is an ESM-specific emulator built to statistically produce spatially resolved, yearly temperature fields by considering both the local mean response and the local variability surrounding the mean response.Within MESMER, local temperatures T for a given grid point s and year y are emulated as follows (Beusch et al., 2020): where g s is the local mean response to global mean temperatures T glob and consists of a multiple, linear regression on the forced T glob,forced (capturing the smooth trend in T glob ) and variability T glob,var components of T glob , with coefficients β forced s and β var s respectively and intercept term β intercept s . More details on the extraction and representation of T glob,forced and T glob,var can be found in Beusch et al. (2020).η s,y represents the residual variability around the mean response.

MESMER-M
We divide MESMER-M into a mean response module and a residual variability module, each calibrated on ESM simulation output data at each grid point individually according to the procedure described in this section and summarized in Fig. 1.Such division of a modelling exercise has been applied to other climate model emulators (Tebaldi and Arblaster, 2014;Alexeeff et al., 2018;Link et al., 2019;Beusch et al., 2020) and comes with its underlying assumptions.The primary assumption in our case is that the ESM monthly temperatures are distinctly separable into a mean response component and a residual variability component.Traditionally the mean response module is designed to be dependent on a certain forcing (in this case local, yearly temperatures), while the variability module is space-time-dependent (i.e.varying with time and across the spatial domain).Given the expected changes in monthly skewness with evolving yearly temperature however, we furthermore propose both a space-timeand yearly-temperature-dependent variability module.Other forcings, such as land cover changes, are furthermore assumed to have considerably smaller impacts on the monthly temperature response and are thus not explicitly included.Given the modular approach taken, this assumption could be addressed in the future with the addition of separate modules which isolate the signals of such other forcings.

Mean response module
The mean response module was conceived to represent the grid point level mean response of both monthly temperature values and the amplitudes of the seasonal cycle (intra-annual variability) to changing yearly temperatures.We employ a harmonic model consisting of a Fourier series with amplitude terms fitted as linear combinations of yearly temperature, centred around a linear function of yearly temperature as shown in Eq. ( 2): where T is temperature, m, s, and y refer to month, space (i.e.grid point), and year indexes, % is modulo, and g i and h i are linear functions of T for the ith ordered term of the Fourier series.Since the monthly cycle revolves around its yearly temperature, fitting results for β 0 and β 1 coefficients had negligible effects (β 0 = 0 and β 1 = 1), and for simplicity's sake, we show the Fourier series as centred around yearly temperature values within the following sections.In choosing the order (n) of the harmonic model, we optimized between model complexity and accuracy using the Bayesian information criterion (BIC).For this, harmonic models with orders n = 1, . . ., 6 were fitted at each grid point and their BIC scores calculated, from which the order with the lowest score was chosen.It should be acknowledged that this approach may lead to a higher number of terms selected as temporal independence in model performance is assumed.

Residual variability module
The local residual variability (otherwise simply called "residuals"), i.e. the difference between actual local monthly temperature and its mean monthly response to variations in local yearly temperature (given by the harmonic model), is assumed to be solely a manifestation of intra-annual variability processes.It can thus be thought of as short-term spatio-temporally correlated patterns.Similar to previous MESMER developments, we represent the residual variability using an AR(1) process.Since the residual variability distributions display month-dependent skewness (see example Shapiro-Wilk test for January and July in Appendix B), which is non-stationary with regard to yearly temperatures, we first transform them into a stationary Gaussian space before fitting the AR(1) process.This strategy has already been pursued in problems concerning skewness in residual precipitation variability (Frei and Isotta, 2019).Precisely, we use the monotonic Yeo-Johnson transformation (Ŵ m,s ), which accounts for both positive and negative data values, to locally normalize monthly residuals (Yeo and Johnson, 2000).
(3) Ŵ m,s relies on a λ parameter to deduce the shape of a distribution and normalize accordingly.Non-stationarity in monthspecific residual skewness with regard to yearly temperature is taken into account by defining the λ parameter as a logistic function of yearly temperature, as shown in Eq. ( 4): where ξ 0,m,s and ξ 1,m,s are coefficients fitted using maximum likelihood.The performance improvement of having a λ m,s,y parameter that is dependent on yearly temperature compared to a case where it is invariant (λ m,s ) is demonstrated by additional tests shown in Appendix C (shown for January and July).Since the majority of ESMs profit from yearly temperature dependence in λ, we use λ m,s,y in all ESMs.
The AR(1) process is applied to the residual variability terms after they are locally normalized as according to Eq. (3).Following the approach of Beusch et al. (2020) where η m,s,y is the locally normalized residual variability at month, grid point, and year m, s, and y and η temp.
m,s,y and η spat.m,s,y are its respective temporally varying and spatially varying components.
The AR(1) process accounts for autocorrelations up to a time lag of 1 and is suited in representing the residual variability, which is assumed to have rapidly decaying covariation such that longer-term patterns (if any) covary with yearly temperature.η temp.
m,s,y (i.e. the noise component of the AR(1) process) needs to account for the spatial crosscorrelations between grid points.It is modelled through a localized monthly multivariate Gaussian process and thus dampens spatial covariations with increasing distance, as shown in Eq. ( 7): where N (0, ν m (r m )) is a multivariate Gaussian process with a mean of 0 and covariance matrix ν m .As the number of land grid points is much higher than the number of temperature field samples, ν m is rank deficient and is thus localized by point-wise multiplication with the smooth Gaspari-Cohn correlation function (Gaspari and Cohn, 1999), which has exponentially vanishing correlations with distance r m and was used in previous MESMER fittings (Beusch et al., 2020).r m is chosen per month in a similar cross-validation with a leave-one-out approach as previous MESMER fittings (Beusch et al., 2020) using distances of 1500 to 8000 km at 250 km intervals.The localized empirical covariance matrix, ν m , is derived analytically based on the mathematical expectations for the covariance of noise terms of an AR(1) process (Matalas, 1967;Richardson, 1981), as shown in Eq. ( 8): where η m is the empirical covariance matrix constructed across all grid points for a given month from the locally normalized empirical residuals.When generating emulated realizations from the AR(1) process, we apply the inverse Yeo-Johnson transformation to obtain the final residual variability terms.
3.3 Evaluating emulator performance

Mean response verification
To evaluate how well the monthly cycle's mean response, f s (T s,y , m), is captured, we calculate Pearson correlation coefficients between the emulated values obtained from the harmonic model and their training run values across the whole globe for each month, with each grid point weighted equally.
This not only gives an idea of how well the magnitudes of mean response changes correspond but also how in phase the emulated monthly cycles are with training run monthly cycles.Where test runs are available, their correlations to the harmonic-model results are also calculated to assess how well the harmonic model can represent data it has not been trained on.Ideally, the test run correlations should be equal to those of the training runs, with anything substantially lower indicating overfitting and anything substantially higher indicating a non-representative training set (i.e.further modifications in the train-to-test splitting would have to be considered).

Residual variability verification
In order to evaluate how well the emulator reproduces the deviations from the harmonic model simulated by the ESMs, 50 emulations are generated per training run.First, we check that short-term temporal features are sufficiently captured for each individual grid point.To distinguish such short-term temporal features, we isolate the high-frequency temporal patterns present within the residual variability: each residual variability sequence is decomposed into its continuous power spectra, from which we verify, by computing Pearson correlation coefficients, that the top 50 highest-frequency bands within the training run residual variabilities appear with similar power spectra in the corresponding emulated residual variabilities.Second, we verify how well the spatial covariance structure is preserved by calculating monthly spatial cross-correlations across the residuals produced by each individual emulation and, by using Pearson correlations, comparing them to those of their respective training runs.Where test runs are available, a similar verification between them and training runs is done, thus yielding an approximation of how actual ESM initial-condition ensemble members would relate to each other.

Regional-scale ensemble reliability verification
The full emulator, consisting of both the mean response (i.e.harmonic model) and residual variability modules, is evaluated for its representation of regional, area-weighted average monthly temperatures of all 26 SREX regions (Seneviratne et al reproduce the 5th, 50th, and 95th quantiles of the respective ESM initial-condition ensemble over the periods of 1870-2000 and 2000-2100, by means of quantile deviations as previously done by Beusch et al. (2020).A step-by-step process for calculating monthly quantile deviations of the ESM from the emulator is as follows.
1. Calculate the area-weighted regional average of a given SREX region for each ESM run and each of its respective emulations at a given month.
2. Extract the qth emulated quantile at each time step from the set of regionally averaged emulations.
3. Within a defined time period (e.g. , calculate the proportion of time steps for which the regionally averaged ESM value is less than the respective emulated qth quantile value.The resulting number is q ESM .4. The quantile deviation of the ESM from the emulator is then given as q ESM − q. By drawing comparisons between the quantile deviations obtained in the two time periods considered, we can evaluate whether inter-annual variations in monthly temperature distributions are sufficiently captured.Since the magnitude of global warming varies between both time periods, such a comparison will additionally help identify whether the emulator sufficiently captures the expected changes in temperature skewness under changing climatic conditions (Wang et al., 2017;Sheridan and Lee, 2018;Tamarin-Brodsky et al., 2020).

Benchmarking MESMER-M using a simple physically informed approach
Any variability in monthly temperatures that cannot be explained by variability in yearly temperature alone is stochastically accounted for in MESMER-M's local residual variability module (see Sect. 3.2.2),following existing downscaling theory (Berner et al., 2017;Arnold, 2001).Hence, month-and season-dependent variability linked to physical drivers such as atmosphere-ocean interactions (Neale et al., 2008;Deser et al., 2012), e.g. the El Niño-Southern Oscillation (ENSO), and biophysical feedbacks (Potopová et al., 2016;Xu and Dirmeyer, 2011;Jaeger and Seneviratne, 2011;King, 2019;Tamarin-Brodsky et al., 2020), e.g.snow-albedo feedbacks, is not explicitly modelled but instead represented by a stochastic process.Nevertheless, first-order changes in the characteristics of these variabilities across warming levels can be approximated within MESMER-M since the skewness of MESMER-M's residual variability emulations depends on the yearly temperature.In this section, we delineate a framework to verify that this statistical approach, based on a single input variable of yearly temperature, can sufficiently imitate properties of the monthly temperature response which otherwise result from intra-annual variabil-ity processes.We primarily verify for representation of secondary biophysical feedbacks as biophysical variables are obtainable as direct output from the ESMs, whereas accounting for modes of climate variability and atmospheric processes would require further data processing and analysis.Furthermore, some effects of atmospheric processes can follow from or manifest themselves in biophysical variables, e.g. as seen by Allen and Zender (2011), and hence are indirectly accounted for by using biophysical variables.
To isolate the contribution of secondary biophysical feedbacks to the monthly temperature response, we consider them as inducing the residual differences between the ESM and harmonic-model realizations.This follows from the harmonic model representing the expected direct mean response to evolving yearly temperatures, with any systematic departure from it being driven by secondary forcers.To rudimentarily represent these contributions, a simple, physically informed model consisting of a suite of gradient-boosting regressors (GBRs) (Hastie et al., 2009) is built for each ESM.Each GBR within the suite is calibrated over one grid point and is trained to predict the local residual differences using local biophysical variable values (see Table 1) as predictors.Predictors are chosen so as to best represent the intra-annual variation in radiative and thermal fluxes alongside their evolution under changing yearly temperatures.The list of predictors is complemented by local yearly temperature values and month values in their harmonic form (hence π(n%12+1) 6 , n = 1 for January, etc.) to account for month dependencies in the residual differences and yearly temperature influences (if any) left behind within the residuals.
To optimize the selection of the biophysical variables used as predictors, we first compare the performance of different physically informed models trained using different sets of biophysical variables for each ESM.The best globally performing model is selected as a benchmark to assess how well the residual variability module, described in Sect.3.2.2,statistically represents properties within the monthly response arising from secondary biophysical feedbacks.Pearson correlations, between ESM test runs and harmonic-model test results augmented by biophysical variable, T yr and monthbased physically informed model predictions are calculated.As a measure of performance, the aforementioned correlation values are given relative to those obtained when augmenting using only T yr and month-based physically informed model predictions.This additionally allows the determination of whether improvement in residual representation comes from the added biophysical variable information.As we are most interested in the representation of monthly temperature distributions and the influences of biophysical feedbacks therein, we consider the energy distances of the benchmark, "physically informed" emulations -constituting the mean response with GBR-predicted residuals added on top -from the actual ESM runs and compare them to the energy distances of the statistical emulations -constituting the mean response with residuals from the residual variability ) month module added on top (as described in Sect.3.2) -from the actual ESM runs.The energy distance is a non-parametric estimate of the distance between two cumulative distribution functions (cdfs), x and y, by considering all their independent pairs of variables, X i , X j (i.e.pairs of physically informed/statistical emulated values) and Y k , Y l (i.e.pairs of actual ESM values) respectively.
Time series of the biophysical variables are obtained from CMIP6 runs.For this analysis, we only focus on ESMs which provided data for all five biophysical variables under consideration, for both the test and training runs used during emulator calibration.

Calibration results
When calibrating the harmonic model constituting the mean response module, the highest orders of the Fourier series were found in tropical to sub-tropical regions where the seasonal cycle has a relatively low amplitude (first row, Fig. 2).The Arctic also displays relatively high orders chosen within the Fourier series, possibly due to higher variabilities in the response of the seasonal cycle shape with increasing yearly temperatures.In contrast, temperate regions which possess distinctly sinusoidal seasonal cycles with marked snow-driven summer to winter transitions display relatively lower orders.CanESM5 and MIROC6 show the overall highest orders, which can be tracked back to them having more training runs, and hence more information on which to train the emulator, allowing for more model complexity without compromising on accuracy (refer to Table A1).
The residual variability module calibration results are shown in Fig. 2 for January and July.The average Yeo-Johnson parameter ( λ m,s ) displays a shift of values greater than 1 to values close to 1 in the Northern Hemisphere (30-50 • ) between January and July.In general, λ m,s values greater (less) than 1 indicate a concave (convex) transformation function owing to negative (positive) skewness, while values equal to 1 suggest minimal skewness in the input distribution (Yeo and Johnson, 2000).This explains the seasonality in λ m,s as we expect a more negatively skewed residual distribution in the northern hemispheric winter when the snow-albedo feedback leads to a non-linear wintertime warming (Cohen and Rind, 1991;Hall, 2004;Colman, 2013;Thackeray et al., 2019) causing the harmonic model to overestimate the mean temperature response.July displays significantly high λ m,s values for polar latitudes (> 80 • ) explainable by the sudden increase in warming rates during ice-free summers (Blackport and Kushner, 2016).Around the Equator (−5 to 5 • ) we see λ m,s values consistently higher than 1 especially in the month of July, with INM-CM5-8 and INM-CM5-0 displaying significantly high values.The source of high equatorial λ m,s values varies model to model but mainly originates from the north-west South America and Sahel regions, alluding to the presence of some non-linear warming response in these regions.
The lag-1 autocorrelation coefficients (γ 1,m,s ) mostly exhibit positive values across all ESMs for January, with at least 70 % of grid points having values between 0 and 0.3, suggesting minimal month-to-month memory additional to the seasonal cycle.July shows similar behaviour for γ 1,m,s across most ESMs, albeit with a larger spread in values.ACCESS-CM2 and HadGEM3-GC31-LL present themselves as outliers here with the bulk of their γ 1,m,s coefficients centred around 0 for both January and July, indicating negligible autocorrelations.
Localization radii vary from model to model and are generally higher in January than July reflecting seasonal differences in residual behaviour possibly due to northern hemispheric winter snow cover yielding larger spatial patterns.CanESM5 and MIROC6 display notably higher localization radii, which can again be tracked back to them having more training runs: more time steps are available during the leaveone-out cross-validation, thus making it generally possible to robustly estimate spatial correlations up to higher distances, which in turn leads to selecting larger localization radii.It should however be stressed that the ESM itself is the main driver behind the calibration results (e.g. even with only one ensemble member, MCM-UA-1-0 has a high localization radius).

Regional behaviour for four selected ESMs
To illustrate the regional behaviour of the calibrated emulator, we focus on four select ESMs. Figure 3 visually demonstrates the harmonic model constituting the mean response module capturing the mean monthly temperature response for both January and July, at global and regional scales (here we show the SREX regions western North America, WNA, and West Africa, WAF), across all four ESMs.The remaining natural variability surrounding the mean response displays a month dependency across the four ESMs, such that January variabilities are up to double that of July both globally and in https://doi.org/10.5194/esd-13-851-2022 Earth Syst.Dynam., 13, 851-877, 2022 Figure 2. Calibration parameters obtained from emulator fittings for all CMIP6 models.For the mean response module, the latitudinally averaged order of the Fourier series considered in the harmonic model is plotted against latitude (row 1).For the monthly residual variability module, parameters are displayed for January (rows 2-4) and July (rows 5-7).λ m,s,y coefficients averaged over latitude and all years for the local Yeo-Johnson transformation are plotted against latitude (rows 2 and 4).The local lag-1 autocorrelation coefficients (γ 1,m,s ) are plotted as boxplots (rows 3 and 6) with whiskers covering the 0 to 1 quantile range, and the localization radii (r m ) are given as bar charts (rows 4 and 7).
the displayed regions.These month dependencies in variabilities are well accounted for within the full emulations comprising both the mean response and the residual variability module, highlighting the necessity of a month-specific residual variability module.
Figure 4 shows the trends in the variance of each year's monthly temperatures around the yearly mean (i.e.variance in intra-annual temperatures).The harmonic model is able to capture the general trends displayed by the ESMs, albeit not being able to fully account for non-linearities within them.For example, MPI-ESM1-2-LR displays a marked non-linear increase in intra-annual variance with increasing yearly tem-peratures for WAF, which is misrepresented by the harmonic model as a linear increase.Construction of the physically informed model outlined in Sect.3.4 elucidated albedo as the main covariant to monthly temperature variability in WAF for MPI-ESM1-2-LR (see Sect. 5.4), indicating changes in land surface properties (possibly due to the reduction in tree cover in this region) as driving intra-annual variance changes.This demonstrates the limitation of solely relying on yearly temperatures as input towards predicting monthly temperatures when other forcings (in this case changes in land surfaces) dominate.Other forcings rarely dominate over the re- sponse to yearly temperatures however, and the full emulations are able to capture the overall spread.

Mean response verification
We evaluate the ability of the harmonic model, constituting the mean response module, in capturing the monthly cycle's response to evolving yearly temperatures.Pearson correlations between the harmonic model and ESM training runs range from 0.7 up to almost 1 (Fig. 5).Summer months (e.g.June) exhibit the highest correlations while tran-sition months of spring (March, April) and autumn (October, November) have the lowest correlations.Such low correlations could result from the inter-annual spread in the timing of snow cover increase and decrease, such that the mean response extracted does not always match individual years.Winter month (e.g.January) correlations are generally higher than those of transition months but lower than those of summer months.This is possibly due to snow-albedo feedbacks, which induce non-linearities into the winter period mean response (Cohen and Rind, 1991;Hall, 2004;Colman, 2013;Thackeray et al., 2019) leading to lower correlations than those of summer months where the response is more linear. https://doi.org/10.5194/esd-13-851-2022 Earth Syst.Dynam., 13, 851-877, 2022

Residual variability verification
To establish if temporal patterns within the ESM residual variabilities are successfully emulated, the correspondence of their respective power spectra at a grid point level is considered.Results shown in Fig. 6 display the emulator's median correlations with the ESMs' training run power spectra lying between 0.9 and 1.This corresponds well to the correlations across the ESM test runs (crosses).Correlations between the ESM training runs and emulations for a given ESM display very little spread, which is in agreement with the near-identical correlations seen amongst ESM test runs.In the example 2D histogram plot (given for CESM2), we see that the emulator is most successful in capturing lower power-spectra-to-frequency ratios.This may be a consequence of the emulator design, as we restrict ourselves to considering only lag-1 autocorrelations such that lower frequencies with higher power spectra are not accounted for.
For verification of the residual variability's spatial component, we consider the spatial cross-correlations within four geographical bands centred around the grid point for which temperature is being emulated (Fig. 7) for the example months January and July.As the spatial covariance matrix within the emulator is localized (see Sect. 3.2.2),its spatial cross-correlations are by design expected to diminish with increasing distances.Hence, we see the emulator performing best at distances below 1500 km, with median correlations of 0.91-0.99,which are in line with correlations between ESM test and training runs (crosses).Beyond 1500 km, the emulator performs progressively worse with correlations dropping below 0.1 for distances between 3000 and 6000 km and staying there for distances larger than 6000 km, while those of test runs -where spatial cross-correlations are not localized -remain around 0.5-0.8.CanESM5 and MIROC6 are the two exceptions at distances between 3000 and 6000 km, with correlations of 0.33-0.5, which then again drop to below 0.1 at distances larger than 6000 km.This is due to their notably larger localization radii (see Fig. 2), which leads to a slower decline of spatial cross-correlations with increasing distances as compared to other ESMs.

Regional-scale ensemble reliability verification
Regionally aggregated 5 %, 50 %, and 95 % quantile deviations of the ESM training (and where available test) runs from the emulated quantiles (derived from the full emulator consisting of both the mean response and local variability module) are plotted over the periods of 1870-2000 and 2000-2100 for the example months January (Figs. 8 and 10) and July (Figs. 9 and 11).The 50 % quantile deviations over the period of 1870-2000 in January and July (Figs.8 and 9 respectively) generally show low magnitudes (−3 % to 3 %).A slight regional dependency for this period is visible, where tropical/sub-tropical regions of AMZ, NEB, SSA, WAF, EAF, and SAF have generally warmer (colder) emulated 50 % quantiles as compared to the ESM runs, while those of the remaining regions are colder (warmer) for January (July).While January 50 % quantile deviations over the period of 2000-2100 remain low with less (if any) distinguishable regional dependency, July 50 % quantile deviahttps://doi.org/10.5194/esd-13-851-2022 Earth Syst.Dynam., 13, 851-877, 2022 tions for this period increase (−10 % to 10 %) with an opposite pattern in regional dependency to that of 1870-2000.
The increase for July in deviations could be a combined result of non-linear warming and relatively lower variability in July temperature values as compared to those of January in the ESM simulations.This would indicate a limitation in the emulator's design, where delegating the representation of non-uniformities in the monthly temperature response to the residual variability module does not fully work in the presence of lower variabilities.Generally, emulated 5 % (95 %) quantiles are warmer (colder) than those of the ESM training and test runs.Such under-dispersivity for regional averages is linked to the localization of the spatial covariance matrix within the residual variability module, such that spatial correlations drop faster within the emulator than they do in the actual ESM.For January over the time period of 1870-2000, the lowest magnitudes in 5 % and 95 % quantile deviations are observed for southern hemispheric regions (e.g.AMZ, NEB, WSA, SSA), along with slight over-dispersivity (see the blue 5 % quantile and red 95% quantile values in their respective panels of Fig. 8).Over the period of 2000-2100, this behaviour for January switches to northern hemispheric regions (e.g.CEU, ALA, ENA, WNA, TIB) and is mostly apparent for the 95 % quantile, possibly due to a decrease (increase) in January variability in the Northern (Southern) Hemisphere with increasing yearly temperatures (Holmes et al., 2016).In contrast, over both the periods of 1870-2000 and 2000-2100, July consistently displays the lowest magnitudes of 5 % and 95 % quantile deviations (even with a slight over-dispersivity) in northern hemispheric regions (e.g.WNA, ENA, NAS, WAS).The observed small regional over-dispersivities hint at additional processes being at play in these regions which are not accounted for by the emulator and that counteract the expected regional-scale underdispersivity which is inherent in the emulator's residual variability module design.

Benchmarking MESMER-M using a simple physical approach
In-depth analysis of the benchmarking approach outlined in Sect.3.3 is conducted for four selected ESMs which exhibit diverse genealogies (Knutti et al., 2013;Brunner et al., 2020)   tial improvements in other regions, notably WAF, EAF, SAS and NAU regions.It is worth noting that for most ESMs, the biophysical predictor configuration of albedo, cloud cover, and snow cover (ACS) performs consistently worse than any configuration containing sensible and latent heat fluxes (H ), suggesting the presence of processes only explainable using sensible and latent heat fluxes.MIROC6 is an exception to this, with both the biophysical configurations of latent heat flux (Hl) and H yielding 0 or lower relative correlations in WNA, CNA, ENA, CEU, and EAS, while HC (biophysical configuration of sensible heat fluxes, latent heat fluxes, and cloud cover) displays no improvements for these regions.This could be due to colinearities between cloud cover and latent and sensible heat fluxes alongside overfitting of the physically informed model to latent and sensible heat fluxes due to confounding variabilities.
As HACS (biophysical configuration of sensible heat fluxes, latent heat fluxes, albedo, cloud cover, and snow cover) performs the best globally (appears as 1 in global land) across all four ESMs, we choose it as the benchmark physically informed model to compare the residual variabil-ity module to. Figure 13 shows the energy distances of the physically informed (harmonic model+HACS) and statistical (full emulator = harmonic model + residual variability module) emulated cdfs to the ESM cdfs for January and July, where 0 indicates identical and thus "perfect" emulated cdfs.Energy distances in July for both the physically informed and statistical models are close to 0 indicating near-perfect cdfs, with only MIROC6 and MPI-ESM1-LR showing larger distances for the full emulator in the Indo-Gangetic region and central West Africa respectively.In contrast, January shows higher distances for both the physically informed and statistical model cdfs, particularly in northern hemispheric regions with seasonal snowfall and most notably in the full emulator of MIROC6.Overall, the statistical model performs better than the physically informed model for CESM2 and UKESM1-0-LL and worse for MIROC6 and MPI-ESM1-2-LR.An explanation behind this could the combination of biophysical feedbacks being more pronounced in January's northern hemispheric variability and that MIROC6 and MPI-ESM1-2-LR have at least four more training runs than CESM2 and UKESM1-0-LL, providing the GBR model with more training material to extract such biophysical information from.This suggests a limit to when the statistical approach performs better than the physical approach, depending on how present biophysical feedbacks are within the overall variability and how much information is available to train on.Nevertheless, without the prerequisite of having more training runs -which can be seen as an added advantage -the statistical approach taken by the full emulator generally shows better performances across most ESMs for January and July than the physical approach (Fig. E2).Thus, the distributional properties of local monthly temperatures as seen within ESM initial-condition ensembles can be sufficiently represented using the statistical approach outlined in this paper, which takes only local yearly temperatures as input.

Conclusion and outlook
We extend MESMER's framework to include the monthly downscaling module, MESMER-M, trained for each ESM at each grid point individually, thus providing realistic, spatially explicit monthly temperature fields from yearly tem-perature fields in a matter of seconds.We assume a linear response of the seasonal temperature cycle to its yearly mean values and represent it using a harmonic model.Any remaining response patterns are expected to arise from regionalscale, physical, and intra-annual processes, such as snowalbedo feedbacks or the modulation of atmospheric circulation patterns due to changes in ENSO, and have asymmetric, non-uniform (i.e.non-linear, non-stationary, affecting variance and skewness) effects across months.To capture them, we build a month-specific residual variability module which samples spatio-temporally correlated terms, conserving lag-1 autocorrelations and spatial cross-correlations whilst accounting for specificities in the residual variability structure across months.By letting the skewness of the residual sampling space additionally covary with yearly temperatures, non-uniformities in secondary feedbacks are furthermore inferred through their manifestations within the monthly temperature distributions.
Verification results across all ESMs show the emulator altogether reproducing the mean monthly temperature response, as well as conserving temporal and spatial correlation patterns and regional-scale temperature distributions up https://doi.org/10.5194/esd-13-851-2022 Earth Syst.Dynam., 13, 851-877, 2022 to a degree sensible to its simplicity.To further assess how well the emulator is able to represent non-uniformities in the monthly temperature response arising from secondary biophysical feedbacks, we compare its performance to that of a simple physically informed model built on biophysical information.The emulator overall reproduces the cdfs of the actual ESM just as well as, and in most cases even better than, the physically informed model, evidencing the validity of such a statistical approach in inferring temperature distributions and thus the uncertainty due to natural variability within temperature realizations.Given that the uncertainty due to natural variability is a property intrinsic to climate models and largely irreducible (Deser et al., 2020), the emulator thus proves itself as a pragmatic alternative to otherwise having to generate large single-model, initial-condition ensembles.

Further emulator developments
In this study, MESMER-M was only trained on SSP5-8.5 climate scenario runs so as to demonstrate its performance over the extreme spectrum of climate response types.A further step would be to investigate the inter-scenario applicability of MESMER-M, and this has already been done for MES-MER with satisfactory results (Beusch et al., 2022).While we would expect the overall mean response of monthly to yearly temperatures to remain relatively stable between climate scenarios, non-uniformities arising in the local variability may be more scenario-specific (e.g.due to slowing down of the snow-albedo feedback under an equilibrated climate for low-emission scenarios).Bearing this in mind, we recommend training MESMER-M on all available climate scenarios before using it for inter-scenario exploration.Such would provide the local Yeo-Johnson transformation with enough information on the relationship between yearly temperatures and skewness of monthly temperatures.Additional adjustments such as considering the rate of yearly temperature change as a covariate to monthly temperature skewness could also be investigated.This study demonstrates the advantage of constructing modular emulators such that the emulator framework can be extended according to the area of application.Additional module developments which increase the impact relevance Figure 12.Global land (without Antarctica) and regional performances (columns) of the physically informed model trained using different predictor sets (rows) shown for four selected CMIP6 models, for each SREX region.Acronyms for the predictor sets (y axis tick labels) can be referred back to in Table 1.Pearson correlations calculated over all months between test runs and harmonic-model test results augmented by the physically informed model's predictions of residual variability are considered.Here we show changes in correlations relative to those obtained when augmenting using only T yr and month values as predictors.Numbers in the global land column indicate the ranking of each predictor set, where 1 is the best-performing.
of the emulations and improve the fidelity in global and regional representation of monthly temperatures under different climate scenarios should be given priority.A module that comes to mind would be one representing changes in land cover, such as de/afforestation, which has been historically assessed to have biophysical impacts of a similar magnitude on regional climate as the concomitant increase in GHGs (De Noblet-Ducoudré et al., 2012) and for which very distinct imprints on the seasonal cycle of temperatures as well as the tails of the temperature distributions have been identified (Pitman et al., 2012;Lejeune et al., 2017).Such a module would furthermore increase the emulator's relevance towards impact assessments, in light of the important land cover changes expected to happen in the 21st century (Popp et al., 2017;O'Neill et al., 2016) and the relevance of accounting for their regional climate impacts especially in high-mitigation scenarios such as those compatible with the 1.5 • C long-term temperature goal of the Paris Agreement (Seneviratne et al., 2018;Roe et al., 2019;Arneth et al., 2019)).One technical advantage of adding a land cover module would be that the effect of land cover changes can be expected to be sufficiently decoupled from the overall GHGinduced temperature response (De Noblet-Ducoudré et al., 2012;Lejeune et al., 2017).Hence, the direct local effect of such a module would not interfere with the mean temperature response as extracted within the rest of the emulator.
Another modular development could include an explicit representation of the main modes of climate variability, such as ENSO, so as to strengthen MESMER-M's inter-annual variability representation.Since these modes are potentially coupled to the overall GHG-induced temperature response however, such an inclusion would be more complicated.One possible approach could be to introduce soil moisture as an additional variable term and investigate its lag correlations to monthly temperature variabilities.Alternatively we could explore building upon existing approaches such as the one of McKinnon and Deser (2018).Bearing in mind that one key advantage of MESMER-M is that it only requires yearly temperatures as input, the added value of such a module should be critically assessed against the need for additional predictors.Another possibility could be to instead decompose the covariance matrix used in η spat.
m,s,y (see Fig. 1) so as to account for spatial cross-correlations affected by major modes of variability; however, in this case the added model complexity should again be weighed against gained skill in emulation. https://doi.org/10.5194/esd-13-851-2022 Earth Syst.Dynam., 13, 851-877, 2022 As a follow-up from the physically informed model-based benchmarking done within this study, we further propose that the residuals from the mean response module also contain ESM-specific, scale-dependent information, constituting the distinct representations and parameterizations of biophysical feedbacks within each ESM.For example, models with strong snow-albedo feedbacks and a large snow cover reduction with increasing global mean temperature will show stronger warming of cold months (Fischer et al., 2011) and thus more negatively skewed residuals for those months.
A step towards disentangling such process representation within the ESMs has already been made in this study, through the representation of biophysical contributions within residual variabilities using the GBR-based physically informed model.Further analysing the strength of the covariations in different biophysical variables with the residuals, as identified by the physically informed model, could then help isolate the exact contributions of these variables.While the key physical variables contributing to temperature variability within ESMs have already been studied (Schwingshackl et al., 2018), such an analysis would further provide information on the amount by which a selected number of biophysical variables contribute to residual variability within each ESM.Performing a similar analysis on observational datasets and comparing the results to those of the ESMs could then serve as a means to evaluate model representation of biophysical interactions under a changing climate.

Figure 1 .
Figure 1.Modular framework for the monthly emulator illustrated for an example ESM.The local mean response module is represented using a harmonic model (Fourier series).The local variability module starts with a Yeo-Johnson transformation before fitting the AR(1) process (distinguished into temporally varying η temp.m,s,y and spatially varying η spat.m,s,y components).An inverse Yeo-Johnson transformation is applied to emulated realizations drawn from the AR(1) process.A schematic illustrating the inverse Yeo-Johnson transformation's representation of the residual distribution skewness (grouped into the periods 1990-2020, 2020-2050, and 2050-2100) with evolving yearly temperatures (plotted below the distributions) is provided.Example data for the multivariate Gaussian process used within η space m,s,y are also given.Three emulation examples ("emu" in the figure) are shown as time series for one grid point, and one emulation example is also shown as a global map.

Figure 3 .
Figure 3. Regionally averaged temperature time series (rows) of January and July for four example ESMs (columns).Three ESM ensemble runs (coloured), their respective harmonic-model results (black), and 50 full emulations (emus) for each of the three patterns (grey colour scale) are plotted.Temperature values are given as anomalies with respect to the annual climatological mean over the reference period of 1870-1899.The regions are, from top to bottom, global land without Antarctica, western North America (WNA), and West Africa (WAF).

Figure 4 .
Figure 4. Regionally averaged variance of intra-annual temperatures (i.e.variance of each year's monthly temperatures around the yearly mean) scatterplotted against yearly temperature (rows) for four example ESMs (columns).Three ESM ensemble runs (coloured), their respective harmonic-model results (black), and 50 full emulations for each of the three patterns (grey colour scale) are plotted.Temperature values are taken as anomalies with respect to the annual climatological mean over the reference period of 1870-1899.Each dot represents the temperature variance calculated from the monthly values for 1 individual year.The regions are from top to bottom: global land without Antarctica, western North America (WNA), and West Africa (WAF).

Figure 5 .
Figure 5. Local mean monthly response verification for all CMIP6 models by means of Pearson correlation between the harmonic model and training runs (indicated by the colour of the lower triangle), over all global land grid points (without Antarctica) for each month.The correlations between the harmonic model and test runs are given in colour in the upper triangles to demonstrate how well the harmonic model performs for data it has not seen yet (a grey upper triangle means that no test run is available for this model).

Figure 6 .
Figure 6.Verification of the time component of the local variability module by means of Pearson correlations between the power spectra of the 50 highest-frequency bands present within the training runs (i.e.considering all months together) and the power spectra at which the same 50 frequency bands appear within the respective emulations (boxplots, whiskers indicating 0.05 and 0.95 quantiles) calculated per grid point.Fifty emulations are evaluated per training run.Where test runs are available, their correlations with training runs are also given (black crosses).The example 2D histogram shows the power-spectra-to-frequency ratio for CESM2 training runs versus the corresponding power-spectra-to-frequency ratio within its emulations.

Figure 7 .
Figure 7. Verification of the spatial representation within the local variability module.Pearson correlations between ESM training run and emulated spatial cross-correlations are considered for four geographical bands centred around the grid point for which temperature is being emulated (rows) at individual months of January (black boxplots) and July (red boxplots).Boxplot whiskers indicate 5th and 95th quantiles.Fifty emulations are evaluated per training run.Where ESM test runs are available, their correlations with training runs are also given for January (black crosses) and July (red crosses).Example 2D histograms of the January spatial cross-correlations for CESM2 training runs versus those of its emulations are given for each geographical band.

Figure 8 .
Figure 8. January 5 % (left), 50 % (middle), and 95 % (right) quantile deviations (colour) of the climate model runs from the emulated quantiles of the ESM training (top block) and test (bottom block) run values from their monthly emulated quantiles, over the period 1870-2000 for global land (without Antarctica) and SREX regions (columns) across all CMIP6 models (rows).The monthly emulated quantile is computed based on 50 emulations per ESM run, and quantile deviations are given as averages across the respective number of ESM training/test runs.The number of test runs averaged across is indicated in brackets next to the model names in the bottom block.Red means that the emulated quantile is warmer than the quantile of the ESM run and vice versa for blue.

Figure 13 .
Figure 13.Comparison of the performance of the harmonic model + physically informed HACS model to that of the full emulator for January and July of four selected CMIP6 models.The energy distance from the actual model test runs is considered, where 0 indicates the best performance.

Table 1 .
List of biophysical variables used in training the gradient boosting regressor.
(see Fig.E1for summarized results of all other ESMs).