Assessment of CMIP5 climate models and projected temperature changes over Northern Eurasia

Assessing the performance of climate models in simulating and projecting surface air temperature (SAT) has received increasing attention in recent decades. This paper assesses the performance of the Coupled Model Intercomparison Project phase 5 (CMIP5) models in simulating intra-annual, annual and decadal temperature over Northern Eurasia from 1901 to 2005. We evaluate the skill of different multi-model ensemble techniques and use the best technique to project future SAT changes under different emission scenarios. The results show that most of the general circulation models (GCMs) overestimate the annual mean SAT in Northern Eurasia, and that the difference between the observations and the simulations comes primarily from the winter season. Most of the GCMs can approximately capture the decadal SAT trend; however, the accuracy of the annual SAT simulation is relatively low. The correlation coefficient R between each GCM simulation and the annual observation is in the range of 0.20 to 0.56. The Taylor diagram shows that the ensemble results generated by the simple model averaging (SMA), reliability ensemble averaging (REA) and Bayesian model averaging (BMA) methods are superior to any single GCM output, and the decadal SAT changes generated by SMA, REA and BMA are almost identical during 1901–2005. Heuristically, the uncertainty of the BMA simulation is the smallest among the three multi-model ensemble simulations. The future SAT projection generated by the BMA shows that the SAT in Northern Eurasia will increase in the 21st century by around 1.03 °C/100 yr, 3.11 °C/100 yr and 7.14 °C/100 yr under the RCP 2.6, RCP 4.5 and RCP 8.5 scenarios, respectively, and that the warming accelerates with increasing latitude. In addition, the spring season contributes most to the decadal warming under the RCP 2.6 and RCP 4.5 scenarios, while the winter season contributes most under the RCP 8.5 scenario. Generally, the uncertainty of the SAT projections increases with time in the 21st century.

Global atmospheric concentrations of greenhouse gases have increased significantly since the pre-industrial era. With high confidence, this increase is an important cause of the global warming observed over the last century (Yang et al 2011). During the 20th century, the average surface air temperature (SAT) of the Northern Hemisphere rose by approximately 1°C (IPCC 2007, Polyakov et al 2012). As outlined in the fourth assessment report (AR4) of the Intergovernmental Panel on Climate Change (IPCC), even if greenhouse gas concentrations are stabilized at their 2000 level, the average global temperature will increase by approximately 0.1°C per decade (IPCC 2007).
Increases in temperature may seriously affect natural and social systems, including water availability (Piao et al 2007, Gosling and Arnell 2011, Miao et al 2009, Sheffield et al 2012), food security (Tubiello et al 2007, Lobell et al 2008, Piao et al 2010), the ecological environment (Allen et al 2010, Miao et al 2010, Tabari et al 2013, Yang et al 2013, Yang et al 2014), species biodiversity (Wake and Vredenbury 2008, Nowak 2010) and human health (Robine et al 2008, Gosling et al 2009). These influences have urged the scientific and social communities to improve their understanding of the causes and consequences of global warming (Sun et al 2014). Moreover, policymakers need the latest information on the likely future impacts of climate change to reconcile human society with natural systems.
Northern Eurasia accounts for about 20% of the Earth's land surface and 60% of the terrestrial land cover north of 40°N. It contains vast areas of wetlands, especially peatlands, which hold a large amount of organic carbon and are often underlain by continuous and discontinuous permafrost (Zhu et al 2011). Compared with low-latitude regions, Northern Eurasia, especially its northern areas, underwent more dramatic environmental changes in the 20th century, including increasing temperatures, melting permafrost, changing precipitation and prolonged growing seasons (Romanovsky et al 2007, IPCC 2007). According to observations, Northern Eurasia is the region with the largest and steadiest SAT increases, and the warming became most pronounced during the second half of the 20th century (Groisman et al 2007). During the period of widespread instrumental observations in Northern Eurasia (since 1881), the annual surface air temperature has increased by 1.5°C (and by 3°C in the winter season) (Groisman and Soja 2009). There is a statistically significant increase in the number of thaw days over Northern Eurasia (McBean et al 2005), which is due primarily to the reduction of days with frost, ice and remnant snow on the ground rather than to snow cover retreat (Groisman et al 2006). However, Bulygina et al (2011) found that most areas of Northern Eurasia have experienced an increase in both winter average and maximum snow depths in recent decades, which runs counter to the background of global temperature rise and sea ice reduction in the Northern Hemisphere.
Climate projections and their associated applications have become an important topic during recent decades. Several research teams around the world develop models to simulate the current climate and its future evolution under different greenhouse gas and aerosol scenarios (Buser et al 2009). Global coupled Atmosphere-Ocean General Circulation Models (coupled GCMs) are the modeling tools traditionally used in theoretical investigations of climatic change mechanisms (Covey et al 2003). With GCMs, we can not only simulate the present-day climate and project future climatic changes under different scenarios but also separate natural climate variability from anthropogenic effects.
The GCM simulations for the fifth assessment report (AR5) of the IPCC have recently become available (Taylor et al 2012). Compared with the IPCC AR4, the GCMs in AR5 include a more diverse set of model types (i.e., climate/Earth system models with more interactive components such as atmospheric chemistry, aerosols, dynamic vegetation, ice sheets and the carbon cycle) (Liu et al 2013). A number of improvements in the physics, numerical algorithms and configurations are implemented in the IPCC AR5 models, and a new set of scenarios called representative concentration pathways (RCPs) is used in the AR5 simulations (Moss et al 2010). The RCPs span a large range of stabilization, mitigation and non-mitigation pathways. Consequently, the range of the temperature estimates is larger than that of the scenarios in the AR4, which cover only non-mitigation scenarios (Rogelj et al 2012). It is expected that some of the scientific questions that arose during the preparation of the IPCC AR4 will be addressed in the AR5 (Taylor et al 2012).
Climate change in Northern Eurasia is a topic of great interest, and a considerable body of GCM-based research has developed. However, previous analyses have focused primarily on the earlier IPCC experiments. An evaluation and application of the updated generation of AR5 GCMs in Northern Eurasia is missing. In this study, we focus on the state-of-the-art models that have been made publicly available through the Coupled Model Intercomparison Project phase 5 (CMIP5). This study aims to answer the following questions: 1) how well do the AR5 GCMs reproduce the historical SAT patterns; 2) which type of multi-model ensemble technique provides the best skill in improving the simulation performance; and 3) what changes in climate means may be expected in the future. Our results potentially provide inputs for climate change impact assessments that explore the probability of climate-related threats in Northern Eurasia.
and the future periods. Here we focus on the SAT projection under three scenarios (i.e., RCP 2.6, RCP 4.5 and RCP 8.5). The three RCPs represent 'low' (RCP 2.6), 'medium' (RCP 4.5) and 'high' (RCP 8.5) scenarios characterized by radiative forcings of 2.6, 4.5 and 8.5 W m−2 by 2100, respectively. The CO2-equivalent concentrations in the year 2100 for RCP 2.6, RCP 4.5 and RCP 8.5 are 421 ppm, 538 ppm and 936 ppm, respectively (Meinshausen et al 2011). For comparison purposes, all GCM outputs are regridded to the same resolution as that of the observed data (0.5° × 0.5° grid).

The methodology of multi-model ensemble averaging
Because single models are overconfident (Weigel et al 2008) and multi-model ensembles contain information from all participating models (Pincus et al 2008), it is generally believed that multi-model ensembles are superior to single models (IPCC 2001, Duan and Phillips 2010, Miao et al 2013). In this study, three popular ensemble methods are used: simple model averaging (SMA), reliability ensemble averaging (REA) and Bayesian model averaging (BMA).
SMA is the simplest multi-model ensemble technique. Each model has the same weight (w_k = 1/K, where K is the number of models) in the multi-model forecast. When using the SMA, any knowledge about the performance of the individual models is neglected (Casanova and Ahrens 2009).
The REA method is a weighted average of the ensemble members based on the 'reliability' of each model (Giorgi and Mearns 2002). The reliability factor of the kth model (R_k) takes into account the ability of each ensemble member to simulate the observed climate (R_B) and the degree to which its projected climate change converges with that of the other models in the ensemble (R_D).
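The reliability factor, cited in the text as equation (1), presumably follows Giorgi and Mearns (2002), where it takes the form below. This is a reconstruction from that reference and from the symbols defined in the surrounding text, not a transcription of this paper's own equation; the subscript k on B and D is an assumption. In Giorgi and Mearns (2002), R_B and R_D are additionally set to 1 when the bias or the distance is smaller than ε.

```latex
R_k = \left[ (R_{B,k})^{m} \times (R_{D,k})^{n} \right]^{1/(m \times n)}
    = \left\{ \left[ \frac{\varepsilon}{\lvert B_k \rvert} \right]^{m}
              \left[ \frac{\varepsilon}{\lvert D_k \rvert} \right]^{n} \right\}^{1/(m \times n)}
```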
In equation (1), R_B is a factor that is inversely proportional to the absolute bias (B) in simulating the present-day variable, and R_D is a factor that measures the model reliability in terms of the distance (D) of the change calculated by a given model from the REA average change. The parameters m and n are the weights of the model performance criterion (R_B) and the model convergence criterion (R_D), respectively, and are typically equal to 1. The parameter ε in equation (1) is the natural variability of the climatic variable. More details of the REA process are provided in Giorgi and Mearns (2002) and Mote and Salathé (2010).
The BMA generates a probability density function (PDF) for a variable y that is a weighted average of the PDFs centered on the individual forecasts. The BMA weights reflect the relative contributions of the component models to the predictive skill over a training sample, and they are estimated by maximum likelihood (Raftery et al 2005). The likelihood function is the probability of the training data given the parameters, viewed as a function of those parameters (Yang et al 2011). The weights are chosen to maximize this function (i.e., they are the parameter values under which the observed data are most likely to have been observed). The algorithm used to calculate the BMA weights and variance is the expectation maximization (EM) algorithm (Dempster et al 1977). More details of the BMA process are provided in Raftery et al (2005) and Duan and Phillips (2010).
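The EM estimation of the BMA weights described above can be illustrated with a minimal sketch. It assumes Gaussian component PDFs centered on the raw forecasts with a single shared variance and no bias correction, which is one common simplification of Raftery et al (2005) and not necessarily the configuration used in this study. The function name `bma_em` and its arguments are hypothetical.

```python
import numpy as np

def bma_em(forecasts, obs, n_iter=200, tol=1e-8):
    """Estimate BMA weights and a shared variance with the EM algorithm.

    forecasts: array of shape (K, T), K model series over T training times.
    obs:       array of shape (T,), the observations for the same times.
    Assumption: Gaussian kernels with one common variance, no bias correction.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n_models, n_times = forecasts.shape
    weights = np.full(n_models, 1.0 / n_models)   # start from equal weights
    err2 = (obs[None, :] - forecasts) ** 2        # squared forecast errors
    sigma2 = err2.mean()                          # initial variance guess
    prev_loglik = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each model for each observation
        dens = np.exp(-0.5 * err2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
        mix = weights[:, None] * dens
        total = np.maximum(mix.sum(axis=0), np.finfo(float).tiny)
        z = mix / total
        # M-step: update the weights and the shared variance
        weights = z.mean(axis=1)
        sigma2 = (z * err2).sum() / n_times
        # Stop when the log-likelihood no longer improves
        loglik = np.log(total).sum()
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    return weights, sigma2
```

The returned weights sum to one, so a weighted ensemble mean is simply `weights @ forecasts`. A model whose errors are consistently small relative to the others receives most of the weight.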

Evaluation process
For the GCM performance assessment, the bias between the observations and the model simulations is compared. The SAT changes during the historical period and under the projected future scenarios are analyzed.
To evaluate the ensemble performance, the Taylor diagram technique is used. The Taylor diagram is quantified in terms of the correlation (R), the centered root-mean-square error (RMSE) and the amplitude of the standard deviations (Std). The diagram provides a way of graphically summarizing how closely a pattern matches observations (Taylor 2001). Moreover, the uncertainty of the different multi-model ensembles is also compared. Here, we calculate the weighted standard deviation of the changes, δ, from the model weights and the departures of the model outputs from the ensemble, where w_k is the weight of the kth model generated by the different multi-model ensemble techniques, T_k is the kth model output and En is the ensemble. The upper and lower uncertainty limits are then defined as the ensemble value plus and minus δ. If the changes are distributed as a Gaussian PDF, the ±δ range implies a 68.3% confidence interval.
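The evaluation statistics above can be sketched as follows. The Taylor statistics follow the standard definitions in Taylor (2001). The spread δ is written here as the weighted root-mean-square departure of the model outputs from the weighted ensemble mean, which is an assumed form consistent with the symbols w_k, T_k and En in the text. The function names are hypothetical.

```python
import numpy as np

def taylor_stats(sim, obs):
    """Return (R, centered RMSE, std of sim, std of obs) for two series."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    sim_anom = sim - sim.mean()                  # remove the means so the
    obs_anom = obs - obs.mean()                  # RMSE is "centered"
    std_sim = np.sqrt(np.mean(sim_anom ** 2))
    std_obs = np.sqrt(np.mean(obs_anom ** 2))
    r = np.mean(sim_anom * obs_anom) / (std_sim * std_obs)
    crmse = np.sqrt(np.mean((sim_anom - obs_anom) ** 2))
    return r, crmse, std_sim, std_obs

def weighted_spread(model_values, weights):
    """Return (ensemble, delta, lower limit, upper limit).

    Assumed form: delta = sqrt(sum_k w_k * (T_k - En)^2) with weights
    normalized to sum to one; the limits are En - delta and En + delta.
    """
    values = np.asarray(model_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize the weights
    ens = np.sum(w * values)                     # weighted ensemble mean
    delta = np.sqrt(np.sum(w * (values - ens) ** 2))
    return ens, delta, ens - delta, ens + delta
```

The three Taylor statistics are linked by the law-of-cosines relation RMSE² = Std_sim² + Std_obs² − 2 · Std_sim · Std_obs · R, which is what allows them to be shown on a single two-dimensional diagram.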

Temperature projection
It is generally accepted that agreement between models and observations is currently the only way to assign confidence to the quality of a model (Errasti 2011), and that the better a model performs in reproducing the present-day climate, the higher the reliability of its climate change simulation (Giorgi and Mearns 2002, Coquard et al 2004). Each GCM's weight can be obtained from the ensemble process during the period 1901-2005. By applying these weights, multi-model ensemble temperature projections for the 21st century scenarios can be generated.
Figure 1 shows the bias of 24 CMIP5 climate models, obtained by comparing the observed and simulated 105-year annual and seasonal mean temperatures. Most of the GCMs give reasonably accurate predictions of the mean temperature. Among the 24 GCMs, nine models underestimate the annual mean temperature, while the others overestimate it. The maximum bias in the SAT simulation comes from the FGOALS-g2 model, with a value of −4.31°C. The BCC-CSM 1.1 and GISS-E2-R models perform the best, with a minimum bias of 0.10°C. It should be noted that the 105-year mean of the observed temperature is about −4.50°C. It is also found that models with higher resolution do not always perform better than those with lower resolution (such as the FGOALS-g2 model). For the seasonal SAT simulation, model biases in March-April-May (MAM) and June-July-August (JJA) are relatively small, while the bias in December-January-February (DJF) is high. Compared with the BCC-CSM 1.1 model, the GISS-E2-R model has the smaller bias in the seasonal SAT simulation.
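As an illustration of how the annual and seasonal mean biases discussed above could be computed, the sketch below takes regional-mean monthly series arranged as (year, month) arrays. It assumes bias is defined as simulation minus observation and that all months are weighted equally; neither convention is stated in the text. The name `mean_bias` is hypothetical.

```python
import numpy as np

# Calendar months (1-12) belonging to each season
SEASONS = {
    "DJF": (12, 1, 2),
    "MAM": (3, 4, 5),
    "JJA": (6, 7, 8),
    "SON": (9, 10, 11),
}

def mean_bias(sim_monthly, obs_monthly):
    """Long-term annual and seasonal mean bias (simulation minus observation).

    Both inputs have shape (n_years, 12), with columns ordered Jan..Dec.
    Assumption: months are weighted equally, ignoring month length.
    """
    sim = np.asarray(sim_monthly, dtype=float)
    obs = np.asarray(obs_monthly, dtype=float)
    diff = sim - obs
    bias = {"annual": diff.mean()}
    for name, months in SEASONS.items():
        cols = [m - 1 for m in months]           # month number -> column index
        bias[name] = diff[:, cols].mean()
    return bias
```

Because only long-term means are compared, the December values are averaged together with the January and February values of the same calendar year; for a study of individual winters, December would instead be grouped with the following year.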

Model bias and warming trend
The observed SAT increased at a rate of about 1.1°C/100 yr during 1901-2005 (figure 2). Among the 24 GCMs, 12 models overestimate the warming trend, and the others underestimate it. However, the warming trend differences
And only one GCM (GISS-E2-R) estimates that the global SAT will not increase by 2°C within this century under RCP 4.5. For the remaining GCMs, the projected time ranges from 2034 to 2081. All GCMs indicate that the global SAT will rise by 2°C in this century under RCP 8.5, and most of the models project that this will occur between the 2030s and the 2050s. Under the different scenarios, most of the GCMs (except BCC-CSM1.1 (m), CSIRO-Mk3.6.0, GISS-E2-R and MPI-ESM-LR) indicate that the warming rate over Northern Eurasia is higher than the global average in this century. Some models (such as GFDL-CM3 under RCP 4.5 and MRI-CGCM3 under RCP 8.5) even show warming in Northern Eurasia that is more than twice as fast as the global average.
Figure 3 shows the performance of the annual SAT simulation over Northern Eurasia. The observed annual mean SAT increases during 1901-2005, and the warming has accelerated since the mid-20th century. Compared with the CMIP3 models, the CMIP5 models improve slightly in the annual SAT simulation, lying closer to the observation point in the Taylor diagram (figure 3(a)). Most of the GCMs can approximate the trend of SAT, but the accuracy of the annual SAT simulation is relatively low. The correlation coefficient R between each GCM simulation and the annual observation ranges from 0.20 to 0.56 (figure 3(a)). The ensemble results show that the ensemble techniques improve the temporal SAT simulation compared with a single GCM (figure 3(b)). For the multi-model ensemble simulation, the performances of the SMA, REA and BMA are similar, since their outputs nearly overlap in the Taylor diagram.
Comparing the weights generated by the BMA and REA techniques shows that the model weights are similar but not identical. Interestingly, SMA, REA and BMA generate very similar results even though the ensemble members receive different weights under the three ensemble techniques. Because a GCM cannot accurately reflect the actual annual SAT change (figure 3(b)), here we focus on the decadal SAT simulation (figure 4). The GCMs can capture the trend of the 10-year moving average SAT over Northern Eurasia. Compared with the annual scale, the correlation coefficients between the decadal SAT simulations and the observations are higher, lying primarily between 0.6 and 0.9, and the ensemble techniques improve the decadal SAT simulation further (figure 4(a)). Similar to figure 3(b), the decadal SAT changes simulated by the three ensemble mean methods are almost the same (figure 4(b)). Besides the ensemble average, uncertainty is another important skill score. Hence, the uncertainty of the multi-model ensembles during the period 1901-2005 is also compared. Figure 5 shows the ensemble uncertainty of the simulated results with a 10-year moving average. The uncertainty generated by the BMA is the smallest among the three multi-model ensemble results for simulating the annual and seasonal SAT. In addition, the results show that the uncertainty in DJF is the largest over Northern Eurasia.
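The decadal comparison described above rests on smoothing the annual series before correlating them. A minimal sketch follows, assuming a simple unweighted 10-year running mean that keeps only the years with a full window; the text does not state how the window is aligned or how the series ends are treated. The function names are hypothetical.

```python
import numpy as np

def moving_average(series, window=10):
    """Unweighted running mean, keeping only positions with a full window."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

def decadal_correlation(sim_annual, obs_annual, window=10):
    """Correlation between the smoothed simulated and observed series."""
    sim_smooth = moving_average(sim_annual, window)
    obs_smooth = moving_average(obs_annual, window)
    return np.corrcoef(sim_smooth, obs_smooth)[0, 1]
```

Smoothing removes much of the year-to-year variability, which free-running coupled simulations are not expected to reproduce in phase with the observations, so the correlation of the smoothed series mainly reflects the shared long-term trend.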

Projected SAT change in the 21st century
Because it has the smallest uncertainty, the BMA method is applied to project the SAT change in the 21st century under the three future emission scenarios (RCP 2.6, RCP 4.5, RCP 8.5) (figure 6). Only the decadal SAT is projected, owing to the poor performance on the annual scale. The BMA simulations show that the SAT of Northern Eurasia will increase remarkably over the 21st century. On average, the SAT over Northern Eurasia will rise by 1.03°C/100 yr, 3.11°C/100 yr and 7.14°C/100 yr under the RCP 2.6, RCP 4.5 and RCP 8.5 scenarios, respectively. Under the RCP 2.6 and RCP 4.5 scenarios, the greatest contribution to the decadal warming is from MAM, while DJF is the largest contributor under the RCP 8.5 scenario. The warming trend slows down or even declines after 2050 under RCP 2.6. The uncertainty of the SAT projections increases with time in the 21st century, and the uncertainty under RCP 8.5 is larger than that under RCP 2.6 and RCP 4.5. Figure 7 shows the projected annual mean SAT change for 2080-2099. Relative to 1986-2005, the annual mean SAT over Northern Eurasia in the last two decades of the 21st century will increase by 1.92°C, 3.25°C and 6.40°C under the RCP 2.6, RCP 4.5 and RCP 8.5 scenarios, respectively. The warming accelerates with increasing latitude. Under all three scenarios, the grid cells with the maximum SAT changes are concentrated in the Svalbard region of Norway; the corresponding changes are 3.61°C, 5.57°C and 10.24°C under RCP 2.6, RCP 4.5 and RCP 8.5, respectively.
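The two kinds of numbers reported above, a warming rate in °C/100 yr and a change between two 20-year periods, can be computed as in the sketch below. It assumes the rate is an ordinary least-squares linear trend, which the text does not state explicitly. The function names are hypothetical.

```python
import numpy as np

def trend_per_century(years, sat):
    """Least-squares linear trend of a SAT series, in degrees C per 100 yr."""
    years = np.asarray(years, dtype=float)
    sat = np.asarray(sat, dtype=float)
    # Centering the years improves conditioning and leaves the slope unchanged
    slope = np.polyfit(years - years.mean(), sat, 1)[0]
    return 100.0 * slope

def period_change(years, sat, future, baseline):
    """Mean SAT of the `future` period minus the mean of the `baseline` period.

    `future` and `baseline` are inclusive (first_year, last_year) tuples,
    for example (2080, 2099) and (1986, 2005).
    """
    years = np.asarray(years)
    sat = np.asarray(sat, dtype=float)
    in_future = (years >= future[0]) & (years <= future[1])
    in_base = (years >= baseline[0]) & (years <= baseline[1])
    return sat[in_future].mean() - sat[in_base].mean()
```

The two measures need not agree numerically: the trend describes the average slope over the fitted years, while the period change depends on the chosen baseline, which is why the rates and the 2080-2099 changes quoted for the same scenario differ.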

Discussion and conclusions
This study evaluates and compares the SAT change over Northern Eurasia using the outputs from the GCMs and from the SMA, REA and BMA ensembles, and projects the future SAT trend for different emission scenarios. The major findings of this study are summarized and discussed as follows: (1) Most of the GCMs overestimate the annual mean temperature over Northern Eurasia. Similar results are also found in the Arctic (Chylek et al 2011), in the Northern Hemisphere (Zhao et al 2013) and globally (Kim et al 2012). Both forced and internal variation might contribute to the overestimated warming in the SAT simulation. It is reported that some CMIP5 models overestimate the response to increasing greenhouse gases and other anthropogenic forcing (IPCC 2013). The stratospheric aerosol concentration has increased over the past decade owing to volcanic eruptions, and has cooled global lower-atmosphere temperatures to a statistically significant degree (Santer et al 2014). However, none of the CMIP5 simulations takes this into account (Solomon et al 2011, Santer et al 2014). Moreover, some researchers consider the inaccurate treatment of clouds and water vapor in the CMIP5 models to be the main reason for the overestimation. The CMIP5 models tend to underestimate cloud cover (Nam et al 2012) and stratospheric water vapor (Fyfe et al 2013a), and both underestimates allow more sunlight to enter and heat the planet in the simulations. Satellite observations suggest that climate models have ignored the negative feedbacks produced by clouds and water vapor (Christy et al 2010). These negative feedbacks, which are missing from the models, would diminish the warming effect of carbon dioxide (Fyfe et al 2013a,b). As with CMIP3 (Miao et al 2013), models with higher resolution do not always perform better than those with lower resolution. Generally, the error of the annual SAT simulation comes primarily from DJF, while the model biases in MAM and JJA are relatively small (figure 1(b), figure 3(b), figure 5).
Most of the GCMs can approximate the decadal SAT trend, but the accuracy of the annual SAT simulation is relatively low. The correlation coefficients R between each GCM simulation and the annual observations range from 0.20 to 0.56. Hence, direct use of the short-term output from a single GCM is not recommended. To model short-term dynamic series, effective techniques for improving the regional simulation accuracy should be considered in advance.
(2) The performance of the multi-model ensembles is superior to that of any single GCM. The Taylor
8.5 scenario, DJF becomes the greatest contributor. Compared with RCP 4.5 and RCP 8.5, the warming trend slows down or even starts declining after the 2050s under RCP 2.6. This is primarily because the RCP 2.6 scenario is designed to limit the increase of global mean temperature to 2°C (van Vuuren et al 2011); its radiative forcing peaks at approximately 3 W m−2 (approximately 400 ppm CO2) before 2100 and then declines to 2.6 W m−2 by the end of the 21st century (approximately 330 ppm CO2) (Sillmann et al 2013). In addition, the uncertainty of the SAT projections simulated by the BMA increases with time in the 21st century. This change can be explained by the composition of the uncertainty. The uncertainty in the SAT projection arises from three distinct sources: the internal variability of the climate system, the model uncertainty and the scenario uncertainty (Hawkins and Sutton 2009). In CMIP5, internal variability is roughly constant through time, and the other uncertainties grow with time, but at different rates (IPCC 2013). The spread between RCP scenarios is the dominant source of uncertainty by the end of the century (Hawkins and Sutton 2011). Overall, the uncertainty obtained using CMIP5 is not much changed from that obtained using CMIP3 (Knutti and Sedlacek 2013). Considering the mitigation and controllability of greenhouse gases, if we assume that the atmospheric concentrations of greenhouse gases decline quite rapidly under all RCPs in the next few decades, the scenario uncertainty may be smaller than that in the current projection. Accordingly, more attention should be paid to the internal variability and inter-model uncertainty in the near future.
GCMs serve as a primary tool for studying and understanding climate change. In response to the SAT projections under the different scenarios, it is important to develop corresponding adaptation and mitigation strategies in Northern Eurasia.