Characterization of long period return values of extreme daily temperature and precipitation in the CMIP6 models: Part 1, model evaluation

Using a non-stationary Generalized Extreme Value statistical method, we calculate selected extreme daily tem- perature and precipitation indices and their 20 year return values from the CMIP5 and CMIP6 historically forced climate models. We evaluate model performance of these indices and their return values in replicating similar quantities calculated from gridded land based daily observations. We find that at their standard resolutions, there are no meaningful differences between the two generations of models in their quality of simulated extreme daily temperature and precipitation.


Introduction
Much attention has been paid over the last 30 years to incorporating additional processes relevant to the climate system and their changes into climate models. Significant resources have also been directed to improving the quality of model simulations as compared to available observations over the recent past (Flato et al., 2013). Model evaluation is enabled by the historical simulation protocols of the Coupled Model Intercomparison Project (CMIP), which is currently in its 6th incarnation (CMIP6; Eyring et al., 2016). The historical simulations typically span the mid-19th century to the recent past (2005 for CMIP5 and 2015 for CMIP6) and are forced by realistic greenhouse gas, sulphate aerosol, stratospheric ozone, volcanic aerosol and solar luminosity variations based on observations. This paper focuses on evaluating the average and long period return values of extreme daily precipitation and temperature produced by the fully coupled climate models in the CMIP5 and CMIP6 projects with freely varying atmosphere, ocean, land surface and sea ice submodels but not with the biogeochemistry submodels of Earth System Models. A companion paper utilizes the same extreme value statistical methods to compare projections of average and long period return values of extreme daily precipitation and temperature in the CMIP5 and CMIP6 models. While the bulk of the model intercomparison and evaluation literature focuses on mean quantities and dominant variability characteristics, extreme precipitation and temperature have been assessed and compared for the CMIP3 and CMIP5 model submissions (Sillmann et al., 2013). More recently, evaluation of extreme precipitation in the higher resolution CMIP6 models has also been performed (Bador et al., 2020). Many of these studies examined some or all of the indices of the Expert Team on Climate Change Detection Indices (ETCCDI). 1 While these 27 indices were devised to aid in climate change detection and attribution, they are convenient model evaluation variables as they were chosen based on available global observations .
We focus on 20-year return values of daily extreme temperature and precipitation as such rare events can have much higher impacts on human and natural systems than the annual or seasonal averages of the ETCCDI indices. In a stationary climate, 20 year return values may be experienced only 3 or 4 times in a normal human lifetime and would likely be remembered anecdotally in tall tales told to grandchildren. However, in a warming climate, hot and wet extremes of a specified magnitude generally become more common and cold extremes less common . As a consequence, a non-stationary 20-year return value at any given time is better thought of as an event that has a 5% chance of occurring during that particular year.

Data and methods
In retrospect, not all 27 indices are useful for global purposes. In particular, those based on hard thresholds may be extremely rare in some parts of the globe and very common in other parts. This illustrates the subjective nature of observed and simulated climate extreme value analyses and precludes a single definition of what constitutes "extreme". In this paper, we use 4 temperature and 1 precipitation indices of the ETCCDI set to evaluate and compare CMIP6 and CMIP5 model simulations under their respective "historical" forcing scenarios of the mid-19th to early 21st century periods.
We deliberately choose the ETCCDI indices that are "block" extrema. "Hot days" are represented by TXx, the annual maximum of the daily maximum temperature. "Warm nights" are represented by TNx, the annual maximum of the daily minimum temperature. "Cool days" are represented by TXn, the annual minimum of the daily maximum temperature. "Cold nights" are represented by TNn, the annual minimum of the daily minimum temperature. "Wet days" are represented by Rx1day, the annual seasonal maxima of daily total precipitation.
We evaluate these extreme variables and their return values by comparing CMIP5/6 model simulations to gridded observations on the observational products' grids. Annual maxima and minima of daily maximum and minimum temperature are obtained directly from the HadEx3 database at a resolution of 2.5 o x3.75 • (Dunn, 2020). Annual maxima of daily precipitation are obtained by extraction from the REGEN (Rainfall Estimates on a Gridded Network) gridded daily precipitation at a resolution of 1 o x1 o (Contractor et al., 2019). While both gridded observational products have values only over land, REGEN fills in unobserved land regions by ordinary block Kriging, whereas HadEx3 assigns missing values over areas without quality station data. Both datasets use station data from a wide variety of sources. REGEN provides daily precipitation data from which the block extrema can be extracted directly. HadEx3 on the other hand, calculates the extrema at the stations first followed by the gridding process, as some of the raw station data cannot be made available by agreement with the station data sources. This difference in the order of operations can have a noticeable effect on the gridded extreme values (Donat et al., 2013;Gervais et al., 2014), as it is not guaranteed that all stations within a grid cell will have extrema on the same day and would generally result in higher gridded extreme values than if daily gridded quantities are constructed first . However, this effect is likely larger for precipitation than for temperature due to the more complex spatial structure associated with precipitation extreme events. Gridding procedures, choice of station data sources and quality control methods also contribute to uncertainty in gridded land based observational products. Remote sensing via satellite can offer a spatially complete dataset, but their relationship to ground based estimates of extreme precipitation is not close (Timmermans et al., 2019) probably due either to retrieval algorithms or temporal sampling limitations. For the CMIP5 and CMIP6 models, block extrema are extracted from the daily variables and masked when the model's native land area fraction is less than 0.5.
Twenty year return values are calculated using a nonstationary Generalized Extreme Value (GEV) distribution using ln(CO 2 ) as a covariate in the location parameter. This choice of statistical model motivates the selection of the block extrema ETCCDI variables over those defined by percentile exceedances. While Peaks over Threshold (POT) extreme value methods may be a more efficient use of the limited extreme sample data in a stationary setting, non-stationary thresholds (Acero et al., 2010;Kyselý et al., 2010;Roth et al., 2012;Solari et al., 2017) are not as straightforward as covariate GEV methods (Coles, 2001;Katz, 2010). Uncertainty in both the average and the return values of extreme temperature and precipitation statistics is strongly dependent on available sample sizes . For this study, we aimed to use all available model output from all the individual realizations to reduce uncertainty. This necessitates a non-stationary statistical approach. Of these, the Maximum Likelihood Estimate technique of fitting GEV parameters is the most well exercised in the literature (Easterling et al., 2016) Other choices are possible and equally or even more valid. However, as will be seen, model errors are substantially larger than uncertainties in the methodology. In all model cases, the sample sizes are much larger than 20 years, the return period of interest in this study. Observational sample sizes are smaller but more than twice this return period.
ln (CO 2 ) is chosen as a physically based covariate as it has long been known to force global mean temperature changes (Arrhenius, 1896). While global or local temperature could also provide a useful and physically motivated non-stationary covariate, the annual average of ln (CO 2 ) was chosen as it isolates most of the anthropogenic components to climate change without any significant internal variability. The CMIP5/6 experimental protocols specify atmospheric CO 2 as an external forcing agent for the historical simulations and hence it is the same for each model, simplifying the analysis. Furthermore, given this physical connection to the anthropogenic variations of extreme temperature and precipitation, usage of much longer portions of the available datasets than in previous GEV-based analyses can be justified Risser and Wehner, 2017). The resulting GEV estimates of long period return values are a function of ln(CO 2 ) and can be easily estimated for any given year by using the appropriate value for that year. In a certain sense, this choice of non-stationary covariate then can be interpreted of as a (weakly) non-linear time covariate. Again, other choices of covariate are possible and can serve different purposes. For instance, choice of the aforementioned global or local temperature would implicitly incorporate other natural and external forcings in addition to ln(CO 2 ). However, the purpose of this paper is to quantify errors in extreme precipitation and temperature. Use of a temperature based covariate would incorporate model differences (and errors) into the statistical methodology and could tend to hide differences in the extreme indices. The choice of ln(CO 2 ) as the covariate for this study is deliberately made so that the statistical models are the same across climate models. Arguably, in the companion paper about projections of extremes at selected global warming levels (Wehner, 2020), global mean temperature would be appropriate as climate sensitivity would then be removed.
In this study, we have introduced a linear covariate into the GEV location parameter only, fitting the scale and shape parameters as constants. Due to the uncertainties in estimating the shape parameters, most previous studies also keep it stationary (Cooley et al., 2007). However, in some locations, the quality of the fitted distribution may or may not be improved by a non-stationary scale parameter and/or a nonlinear covariate dependence in the location parameter. However, in the interest of simplicity and consistent with the relevant detection and attribution studies of the human influence on extreme temperature (Brown et al., 2008;S.-K. Seung-Ki Min et al., 2013;Zwiers et al., 2011) and precipitation (Min et al., 2011;Westra et al., 2012;Zhang et al., 2013), we have not added these additional testing requirements.
A more complete discussion of non-stationary covariate choices and their implications for isolating both anthropogenic and natural variations in extreme values can be found in . We have not performed goodness of fit analyses for each model. However, in our previous studies, this particular choice of non-stationary statistical model performs adequately. For a detailed discussion of some of the statistical issues concerning sample size see Appendix C of .
Fitting GEV distribution parameters was done using a Maximum Likelihood Estimates (MLE) procedure in the climextRemes software package (Paciorek et al., 2018), a python and R library built upon the extRemes library Katz, 2016, 2011) and available at http s://cran.r-project.org/web/packages/climextRemes/index.html. The covariate appears linearly in the GEV location parameter as μ(t) = μ 0 + μ 1 ln(CO 2 ). Hence, there are four fitted parameters, the two components of the location parameter, μ 0 ,μ 1 , the scale parameter σ and the shape parameter ξ. In order to reduce the statistical uncertainty in fitting GEV distributions and in calculating the averages of the selected extreme variables, the entire time series of all available complete realizations of the historical simulations were used. With some exceptions CMIP5 models were run from 1851 to 2005 and CMIP6 models were run from 1851 to 2015. The two observational products do not span such a large time interval. Daily precipitation from REGEN is available from 1950 to 2013. Although the HadEx3 temperatures extrema dataset runs from 1901 to 2018, our starting date for fitting the GEV parameters was chosen as 1960 to reduce the amount of missing data. To compare estimates of both the 7 block extrema and their twenty year return values between these observational products and the CMIP5/6, we choose to average all performance metrics over the longest common period between the observations and two historical CMIP ensembles, 1961CMIP ensembles, -2004 It is important to note that while the data that goes into the GEV estimates of return values is not the same between observations and models, that additional data only serves to improve the fit of the GEV and reduce statistical uncertainty due to the length of the data records and to note that the comparison is over the same period. We also note that non-stationary return values calculated in this manner are temporally smoothed over both internal and externally forced variations and it would be appropriate to perform model evaluation at a midpoint of this period. We chose to average return values over this period simply for clarity in the comparing to the corresponding mean climatology and average extreme value indices.
Although changes in both temperature (Kim et al., 2015;Seung-Ki, Min et al., 2013) and precipitation (Min et al., 2011;Zhang et al., 2013) extremes have been attributed at the global scale and large regional scale (King et al., 2016;Wang et al., 2017;Zwiers et al., 2010) to anthropogenic changes to the composition of the atmosphere, we do not evaluate model performance of simulating trends due to the high natural variability at sub-continental scales (Kay et al., 2015).
As model errors in simulating extreme temperatures and precipitation rates may be affected by their errors in simulating the average temperature and precipitation climatology, we also present a brief comparison of mean state errors to extreme value errors. As the HadEx3 does not contain the underlying daily data nor an alternate method to calculate mean temperature, we use the GHCNCAMS gridded 2 m temperature dataset as a reference for observations of mean temperature (Fan and van den Dool, 2008). We note this and similar global products provide monthly averages of daily mean temperature, which is usually approximated as the average of the daily maximum and minimum temperatures rather than the separate averages of these individual daily extrema. However, annual average precipitation climatologies are calculated directly from the REGEN daily values providing a more direct comparison between mean and extreme precipitation errors.
For the multi-model averages, relationships between the error structures for each average extreme value and the appropriate mean climatology are discussed as well as the relationships between average extreme value errors and return value errors. Relationships between the error structures of the CMIP5 and CMIP6 multi-model averages are also discussed for each variable.
In Sections 3 and 4, we present performance metrics of average annual extreme values and 20-year return values in both tabular and graphic forms with the difference between observations and the multimodel average statistics shown as spatial maps. In addition to this multi-model analyses, results from individual models are being made publicly available as part of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) Simulation Summaries. 2 As new CMIP6 models are added to the CMIP database, results will be continually updated to this online resource where "performance portraits" (Gleckler et al., 2008) provide an interactive gateway to maps of individual models from which our evaluation statistics were derived. The analysis software used to produce our extreme value analysis are publicly available as part of the PCMDI Metrics Package (PMP; Gleckler et al., 2016).
We use Taylor Diagrams (Taylor, 2001) to provide statistical summaries for each model as well as the multi-model averages. Taylor diagrams use the pattern correlation between models and observational products as the angular dimension, and normalized (simulated/observed) spatial standard deviation as the radial dimension (Taylor, 2001). Exploiting a geometric relationship between the basic definitions of a centered root mean square error (RMSE), correlation and standard deviation, a byproduct of the Taylor Diagram is that the distance between a model result and the reference data on the abscissa is the centered RMSE. In the event that a model has identical spatial variance as the reference data, its data point will lie on a radial arc of unity. If a model is perfectly correlated with the reference data, its data point will lie on the abscissa. If both are true, the RMSE will also be zero and the model data will coincide with the reference data point on the abscissa. As a complement to the routine statistics shown on the Taylor diagram, Taylor's modified skill (Wehner, 2013) is presented in tables across models for each extreme variable and its return value. This measure scales the centered RMSE to reduce the possibility of models' skill being artificially inflated simply by data smoothing.

Temperature-hot days
In the mid and high latitudes, the hottest day of the year (TXx) generally occurs in the summer. The top row of Fig. 1 shows the 1961-2004 average of observed boreal summer temperatures from GHCNCAMS, (left), TXx (center)and the 20-year return value of TXx (right) from HadEx3 over land. In this and similar following figures, the middle row shows the difference (model minus observation) for the CMIP5 multi-model average while the bottom row shows the same for the CMIP6 multi-model average. Only models where a 20-year return value can be calculated over land are included (see Tables 1a and 1b). Models were generally excluded if they did not provide both daily data and a land/sea mask. All results use the same set of models for consistency of comparison. Missing data over land in the observed 20year return values, is usually a result of a lack of convergence in the Maximum Likelihood Estimates (MLE) used in the calculation of the GEV coefficients and is shown as white in the top row of Fig. 1. In the HadEx3 data, this usually is seen in regions where the data record is incomplete and hence shorter. While all post-1960 data from the HadeX3 product was used, in the less well-observed areas of South America, Africa and Northern Asia, records may not span the entire 1960-2018 period. HadEx3 data was used as is, with no effort made to impose a minimum non-missing data length. However, the presence of severe outliers cannot be ruled out. Indeed, certain models exhibited a lack of convergence in isolated individual cells primarily located in hot and dry regions despite sample sizes exceeding 150 years. Two CMIP6 models failed to converge over most of the Northern Hemisphere. This is likely an indication of unrealistic model behavior but further investigation is needed.
While the global range of errors in Fig. 1 is similar for warm mean and extreme temperatures, no clear relationship exists between them with errors of opposite sign often the case. The centered pattern correlation (Mitchell et al., 2001) between average boreal summer temperature and TXx errors is small for both generations of models with values of 0.07 for the CMIP5 and 0.03 for the CMIP6 multi-model averages. TXx errors and its return value errors exhibit similar behavior in some parts of the world but are of opposite sign in the well observed parts of the North America and Europe causing the centered pattern correlation between them to be low with values of 0.03 for the CMIP5 and 0.05 for the CMIP6 multi-model averages.
The spatial pattern of average boreal summer temperature and TXx errors between the CMIP5 and CMIP6 multi-model averages are very Table 1a Extreme temperature Taylor's modified skill scores. TXx, TNn, TNx, TXn and their 20-year return values for the CMIP6 models. Row labelled "cmip6" is calculated from the multi-model mean indices. Models are arranged from highest skill averaged over the 4 temperature return value indices to the lowest. Failure to converge in large regions are noted as "fail".  Fig. 1, this metric of pattern similarity is very near zero, probably due to the much larger difference in local errors. Indeed, the root mean square difference between CMIP5 and CMIP6 multi-model average return value errors is 10 times larger than it is for TXx errors. The Taylor plots of Fig. 2 show model performance in simulating TXx (left) and its return value (right) as measured by pattern correlation with

Table 1b
Extreme temperature Taylor's modified skill scores. TXx, TNn, TNx, TXn and their 20-year return values for the CMIP5 models. Row labelled "cmip5" is calculated from the multi-model mean indices. Models are arranged from highest skill averaged over the 4 temperature return value indices to the lowest. Models with inadequate data are marked "miss". The CMIP5 and CMIP6 models are tightly clustered in the TXx Taylor diagram with the spatial standard deviation (radial distance from origin) of all models being more than double that of the reference data. Nevertheless, the pattern correlation is high, between 0.8 and 0.92 for all models, although some of the CMIP5 models are at the lower end of this range. Thus, while the pattern of the simulated spatial distribution in TXx corresponds reasonably well with HadEX3, its variation in amplitude is excessive. The two generations of models are even more tightly clustered in the TXx 20-year return value Taylor diagram. Although centered RMSE of the return value cluster is not very different, pattern correlation is very much degraded for the return value at between 0.2 and 0.4. As a result, Taylor skill, which ranges from 0.6 to 0.7 for TXx for most models is degraded for its return values to a range of 0.45-0.5. The substantial degradation in the return values pattern correlation is partly due to a small comparison region (due to the omission of poorly sampled regions of HadEx3) and to somewhat larger errors than TXx.
Taylor's modified TXx skill ranges from 0.6 to 0.7 for most models. Despite the low pattern correlation between the CMIP5 and CMIP6, TXx 20-year return values errors shown in Fig. 1, Taylor's modified TXx return value skill is very similar between model generations for this variable with most models close to 0.5. Overall, the performance of the CMIP5 and CMIP6 multi-model average in simulating TXx and its return value are very similar.
This lack of a relationship between the errors in simulated warm temperatures across rarities likely reflects errors in the different responsible physical mechanisms. It is known that land surface moisture feedbacks influence temperature extremes (Lorenz et al., 2016), as do the duration of blocking events (Zschenderlein et al., 2019). The relative importance of errors in these and other relevant heatwave mechanisms  Uncertainty in long period return value estimates from the limited data of the tail of the available sample can be considerable. In , several methods of estimating this sampling uncertainty are compared. For large sample sizes, all methods considered produced similar uncertainty estimates. For small samples, the delta method, an asymptotic formula for the standard error of a function of maximum likelihood estimates (Coles, 2001) was found to be a more accurate estimation of the true uncertainty for simulated precipitation from a climate model of the CMIP5 generation as well as much computationally less intensive than bootstrapping approaches. The single realization of the observational datasets is a relatively small sample for calculating 20-year return values with periods of roughly 45-70 years. However, the non-stationary GEV approach permits using the much longer 150+ year simulation datasets. Furthermore, using the entire available ensemble increases the parent dataset size by several factors, especially for the more mature CMIP5 experiments. Fig. 3 shows this standard error of the 20-year TXx return value for the HadEX3 observations on the left averaged over 1961-2004. The middle and right maps show this standard error averaged over the CMIP5 and CMIP6 models respectively.
As a result of these large datasets, enabled by the non-stationary covariate, the uncertainty in the 20-year TXx return value estimates due to GEV fit uncertainty is much smaller than the model errors discussed above. A larger source of uncertainty in quantifying model error stems from observational uncertainties. However, a complete discussion of differences between observational products, either for temperature or precipitation, is outside the scope of this study but is an active area of research.

Temperature -cold nights
In the mid and high latitudes, the coldest day of the year (TNn) generally occurs in the winter. The top row of Fig. 4 shows the 1961-2004 average of observed boreal winter temperatures from GHCNCAMS, (left), TNn (center) and the 20 year return value of TNn (right) from HadEx3 over land while the middle and lower rows show the multi-model average errors. Note that the color scale on observations is 10C lower than Fig. 1 but remains the same for the mean model errors. The area of missing data, again reflecting a lack of MLE convergence from the HadEX3 data is significantly reduced from that of hot days. This is most easily visualized by comparing the white areas over land in the return value error maps of Figs. 1 and 3 as there is no comparison to be made in missing data regions. Lack of convergence was not an issue for any of the models.
In many of the well observed areas, simulated average TNn is colder than observed. Simulated 20-year TNn return values are everywhere too cold with much larger errors for both model generations than in average TNn. Seasonal mean cold temperature errors are weakly related to average extreme cold night temperature errors with pattern correlation values of 0.25 for CMIP5 and CMIP6. The pattern correlation between average TNn errors and return value errors is also low with values of about 0.4 for both generations of models.
As for hot seasons and TXx, the spatial pattern of errors for average boreal winter temperature and TNn errors are again very similar between CMIP5 and CMIP6 with pattern correlation values between them exceeding 0.95. TNn return value errors are weakly related between CMIP5 and CMIP6 with a pattern correlation value of 0.4. TNn return value uncertainty (not shown) is much smaller than the errors shown in Fig. 4. The Taylor diagrams in Fig. 5 show that model performance in simulating TNn (left) and its return value (right) is very tightly clustered. The 4th and 5th columns of Table 2 shows that the modified Taylor skill for TNn exceeds 0.95 and 0.85 for its return value for nearly every model. TNn centered RMSE is lower than 0.5 for all models and 0.75 for its return value for most models despite the large biases of Fig. 4. Normalized standard deviation is greater than one for each model for both measures of cold nights but not higher than 1.25 for TNn and 1.5 for its return value. Unlike the case for hot days (Fig. 2), the simulated patterns of 20-year return values for cold nights are highly correlated with the reference data. As for hot days, there is little difference between CMIP5 and CMIP6 multi-model average cold night performance metrics.
Mid-latitude winter cold snaps typically occur on clear nights favoring outgoing longwave radiative cooling. As the underlying radiative physics is well understood, the overly cool northern hemisphere TNN return values in Fig. 4 may be the result of errors in the frequency and duration of winter blocking which are known to be an important process (Sillmann et al., 2011). It is also been shown that block statistics can be influenced by horizontal resolution (Schiemann et al., 2017) which is essentially unchanged between CMIP5 and CMIP6. While a systematic . The radial axis is normalized standard deviation while the angular axis is the centered pattern correlation. The reference data set is HadEX3 (black square). The concentric circles show the models' centered RMSE. CMIP5 models are shown in red. CMIP6 models are shown in blue. Multi-model averages are denoted as "cmip5" and "cmip6" in the legend. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.) exploration of blocking statistics errors is outside the scope of this study, there is evidence of improvement in the CMIP6 models . The relationship of this process to errors in both hot and cold temperature extremes is worthy of further analyses.
Errors and Taylor diagrams for simulated warm nights (TNx) and cool days (TXn) are shown in the Appendix. Modified Taylor skill is shown in columns 6-9 of Tables 1a and 1b for these extreme temperature metrics.

Daily precipitation
The top row of Fig. 6 shows the 1961-2004 average of observed mean annual precipitation, (left), average Rx1day and the 20-year return value of Rx1day from the REGEN database. Model errors in the middle and bottom rows are shown as percent errors. This choice tends to emphasize dry regions, whereas portraying absolute errors would highlight wet areas. Lack of convergence in return calculations was minimal and confined to isolated cells, often in dry regions.
Errors in simulated annual mean precipitation are more closely related to errors in extreme daily precipitation than any of the temperature extreme indices. The pattern correlation between annual mean percent error and annual Rx1day percent error is 0.88 for the CMIP5 and CMIP6 multi-model averages. Furthermore, annual Rx1day errors are closely related to return value errors with pattern correlations between them of 0.81 for the CMIP5 and 0.83 for the CMIP6 multi-model averages.
The relationship between model generations is also much closer for extreme precipitation errors than it is for extreme temperature errors. The pattern correlation between the two multi-model averages is 0.99 for RX1day errors and 0.88 for RX1day return value errors. It is also high for annual mean precipitation at 0.98. Fig. 7 shows the 1961-2004 averaged standard error of the 20-year annual Rx1day return value for the REGEN observations calculated from the delta method. The standard error is normalized by the 1961-2004 average return value and is shown as a percentage. The middle and right maps show this standard error averaged over the CMIP5 and CMIP6 models respectively. This uncertainty is generally much less than the errors shown in Fig. 6.
Taylor diagrams for Rx1day and its return value are shown in Fig. 8. Table 2 shows Taylor's modified skill for average annual Rx1day and its return value. Models are ordered within their generation from return value skill to lowest. Both generations of models exhibit wider ranges of Taylor skill than they do for extreme temperatures and are not as tightly clustered in the Taylor diagrams. Centered RMSEs of annual Rx1day and its return value are between 0.5 and 1.0 for most models but a few CMIP5 models and a single CMIP6 models are outliers with centered RMSE values exceeding unity. However, extreme daily precipitation Taylor skill is generally not as degraded from the average annual extreme to the return value as it is for extreme daily temperatures and for some models is improved. Note the spread in the Rx1day Taylor diagram results is largely due to large inter-model differences in the amplitude of the spatial pattern, with some models substantially underestimating the pattern amplitude while for others it is overestimated. Unlike temperature extremes, both pattern correlation and Taylor skill for the multi-model averages are better than any single model for both Rx1day and its return value for both generations of climate models. Overall model performance in simulating wet days is similar across both generations of models from the best models to the poor performers.
While the main purpose of this paper is to compare the relative performance of the CMIP5 and CMIP6 generations of climate models, the actual performance of the models is highly dependent on the reference data set evaluated against. Fig. 9 shows the multi-model errors in average annual Rx1day as measured by REGEN, HadEx3 and a reanalysis product, ERA5 (Hersbach et al., 2020). While Rx1day from the REGEN and ERA5 products are constructed in the same manner as in the models (i.e. constructed as annual maximum of gridded daily precipitation totals), the HadEx3 product is not. Due to station data availability limitations, HadEx3 is constructed by calculating the annual maxima at the individual stations followed by the gridding procedure. Chen and Knutson (2008) and Gervais et al. (2014) argue that such products are inappropriate for model evaluation. Rather, products such as REGEN (e. g. gridded daily station precipitation totals) more closely resemble the conserved quantities that models actually produce. Indeed, Gervais et al. (2014) demonstrated that extreme precipitation indices constructed by gridding station indices produce larger estimates than from gridded station daily totals. This then explains why the models in Fig. 9 are assessed to be consistently drier when compared to HadEx3 than to REGEN. We note that this is particularly apparent in North America and Western Europe, even though the raw data entering into the two products are based on very similar dense station networks. While HadEx3 assigns poorly observed land areas as "missing data", REGEN fills in these regions with a statistical algorithm (Contractor et al., 2019).

Table 2
Extreme precipitation Taylor's modified skill scores relative to the REGEN gridded observations. Annual Rx1day and 20 year return values for the CMIP6 (left group) and CMIP5 (right group) models. Top rows labelled "cmip6" and "cmip5" are calculated from the multi-model mean and are shown in bold font. Within each group, models are arranged from highest averaged annual return value skill to the lowest. Reanalysis products have also been used in the evaluation of simulated temperature and precipitation with the added benefit of coverage over the oceans (Sillmann et al., 2013). Products such as ERA5 do not directly assimilate precipitation, although some such as the North American Regional Reanalysis (NARR) do (Mesinger et al., 2006). However, they do assimilate moisture and its transport, which are clearly important as noted in the discussion of Fig. 6. However, reanalyses are model products, however highly constrained, and exhibit their own parameterization errors. Hence, there are both similarities and differences in the model errors when compared to REGEN and ERA in Fig. 9, illustrating part of the effects of observational uncertainties in evaluating model performance.

Discussion
We have quantified simulated errors in extreme temperature and precipitation in the two most recent generations of climate models, CMIP5 and CMIP6. We analyzed the annual average and 20 year return value of 4 extreme daily temperature indices, TXx (hot days), TNn (cold nights), TNx (warm nights), TXn (cool days) and an extreme annual daily precipitation index, wet days (Rx1day). The annual average indices, also interpretable as the 1 year return values, are the input into a non-stationary Generalized Extreme Value (GEV) statistical model to produce estimates of the much rarer 20 year return values. The nonstationarity of the GEV model permits usage of longer input data sets than in previous quasi-stationary approaches, yielding uncertainties in long period return values that are much smaller than climate model errors themselves. Model performance in simulating the rarer extremes is generally substantially degraded from the simulation of less rare extremes.
For the temperature indices, the pattern of winter extremes (TNn, TXn) is subtantially better than the pattern of summer extremes (TXx, TNx) but the magnitudes of the errors are larger. Extreme temperature errors bear little resemblance to seasonal mean temperature errors for  either the CMIP5 or the CMIP6 ensemble. Furthermore, only a weak relationship between model errors in simulated annual temperature extreme metrics and model errors in corresponding 20-year return values is found. This implies that the causes of model mean temperature errors are different from some of the causes of model extreme temperature error. It also implies that mean state model errors do not influence the distribution of simulated extreme temperatures uniformly.
The percent errors of simulated annual mean precipitation, average Rx1day and return values are remarkably similar in pattern and magnitude suggesting that errors in water vapor transport are important to both the simulated mean state and to extremes. Errors in the annual average Rx1day and its 20 year return value are more closely related and return value skill is not so degraded as for extreme temperatures.
Although statistical uncertainty in the estimation of return values is low, uncertainty in the observational products may not be and has not been fully addressed here. Of particular concern is that algorithms used to infill poorly observed regions may not adequately treat the far tail of the distribution of daily data. This is not an issue for the HadEX3 datasets but could be for the REGEN precipitation dataset. However, the process of gridding station data introduces biases (Donat et al., 2014;Gervais et al., 2014; and products that are based on gridded extreme indices are not ideal for model evaluation. Among other usages, model output is often used for the projection of future climate change and/or the attribution of climate change that has already occurred. Bias correction is a popular but potentially misleading strategy to force model output to more closely resemble observations. In particular, as the errors between mean state and extreme temperatures are weakly anti-correlated, simply bias correcting the distribution of daily values by mean state bias would make corrected simulated extreme temperatures worse. Similarly, simply bias correcting the distribution of annual extrema by the average error may also degrade corrected 20 year return values when the correlation between errors is low. In these cases, quantile bias correction (Jeon et al., 2016) of the 20 year return value itself is appropriate. While not considered here, the possibility of time dependence in the errors is very real. This could come from anthropogenic and/or natural modes of variability and be highly localized. A more detailed analysis of the effect of relevant covariates on observed precipitation return values will be presented in Risser et al. (2020).
Before any decision to utilize model simulated quantities, whether bias corrected or not, an arbitrary value judgement must be made as to whether the model and/or experimental design is "fit for the purpose" intended. While the model error metrics presented here can provide some guidance to fitness decisions, other value judgements about the appropriateness of performance metrics and how to use them must be made. Also, the credibility of the observed databases used as reference standards must also be considered even in well observed regions (Gibson et al., 2019). Performance metrics can also form a basis for model skill Fig. 8. Taylor diagram measuring model performance of simulating annual Rx1day (left) and its 20 year return value (right). The radial axis is normalized standard deviation while the angular axis is the centered pattern correlation. The reference data set is REGEN (black square). The concentric circles show the models' centered RMSE. CMIP5 models are shown in red. CMIP6 models are shown in blue. Multi-model averages are denoted as "cmip5" and "cmip6" in the legend. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.) weighting but the process is inherently arbitrary (Sanderson et al., 2017).
The analysis presented here reveals that no single CMIP5 nor CMIP6 model stands out as distinctly superior across either temperature or precipitation extremes. The range of model performance in simulating temperature extremes is comparable between the two generations of climate models and little difference in the performance of multi-model average simulations of annual average temperature and precipitation extremes or their 20 year return values. While we have not fully explored the effect of observational uncertainty on the selected extreme temperature and precipitation metrics, these general conclusions about the similarity between the CMIP5 and CMIP6 simulations are robust.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Model performance metrics in simulating warm nights are similar to hot days. The Taylor diagrams in Figure A2 show that model performance in simulating TNx (left) and its return value (right) are very tightly clustered except for a few outliers in the return value diagram. As for TXx, pattern correlation is significantly degraded for the return value compared to the average annual value with all values below about 0.4. Columns 6 and 7 of Table 1a,b shows the individual models' Taylor skill for TNx and its return value. Performance of most of the models is tightly clustered although both generations have some poor performers as return value error outliers. Except for these outliers, Taylor skill and centered RMSE for warm nights span similar ranges for CMIP5 and CMIP6 ensembles and the multi-models average performance are about the same. Fig. A2. Taylor diagram measuring model performance of simulating TNx (left) and its 20-year return value (right). The radial axis is normalized standard deviation while the angular axis is the centered pattern correlation. The reference data set is HadEX3 (black square). The concentric circles show the models' centered RMSE. CMIP5 models are shown in red. CMIP6 models are shown in blue. Multi-model averages are denoted as "cmip5" and "cmip6" in the legend.
In the mid and high latitudes, the coolest day of the year (TXn) generally occurs in the winter (Fig. 3, left column). The top row of Fig. A2 and Fig. A3 shows TXn and its 20 year return value from the HadEX3 dataset and the middle and bottom rows show the CMIP5 and CMIP6 multi-model average errors. Relationships between these error maps are very similar to those of cold nights with weak centered pattern correlation (~0.25) between boreal winter average temperature errors and TXn errors. While the CMIP5 multi-model average TXn errors exhibits a moderate centered pattern correlation to its return values (0.55), the CMIP6 multi-model average does not. As for cold nights, the centered pattern correlation between the CMIP5 and CMIP6 multi-model averages is high for TXn errors (0.97) but not for return value errors. Model performance metrics in simulating cool days are similar to cold nights. The Taylor diagrams in Fig. A4 show that model performance in simulating TXn (left) and its return value (right) are the most tightly clustered of any of the performance metrics considered in this study despite a very cold bias. The range of both centered RMSE and Taylor skill is also the smallest and the difference between performance of the CMIP5 and CMIP6 multi-model average small. In general, model performance metrics are slightly better for cool days than for cold nights.  A4. Taylor diagram measuring model performance of simulating TXn (left) and its 20 year return value (right). The radial axis is normalized standard deviation while the angular axis is the centered pattern correlation. The reference data set is HadEX3 (black square). The concentric circles show the models' centered RMSE. CMIP5 models are shown in red. CMIP6 models are shown in blue. Multi-model averages are denoted as "cmip5" and "cmip6" in the legend.