Köppen bioclimatic evaluation of CMIP historical climate simulations

Köppen bioclimatic classification relates generic vegetation types to characteristics of the interactive annual-cycles of continental temperature (T) and precipitation (P). In addition to predicting possible bioclimatic consequences of past or prospective climate change, a Köppen scheme can be used to pinpoint biases in model simulations of historical T and P. In this study a Köppen evaluation of Coupled Model Intercomparison Project (CMIP) simulations of historical climate is conducted for the period 1980–1999. Evaluation of an example CMIP5 model illustrates how errors in simulating Köppen vegetation types (relative to those derived from observational reference data) can be deconstructed and related to model-specific temperature and precipitation biases. Measures of CMIP model skill in simulating the reference Köppen vegetation types are also developed, allowing the bioclimatic performance of a CMIP5 simulation of T and P to be compared quantitatively with its CMIP3 antecedent. Although certain bioclimatic discrepancies persist across model generations, the CMIP5 models collectively display an improved rendering of historical T and P relative to their CMIP3 counterparts. In addition, the Köppen-based performance metrics are found to be quite insensitive to alternative choices of observational reference data or to differences in model horizontal resolution.


Introduction
Wladimir Köppen (1900) was the first to systematically quantify perceived relationships between the climatological annual cycles of continental temperature and precipitation (hereafter, T and P) and the associated generic vegetation types inhabiting different regions (e.g. tundra vegetation, evergreen or deciduous forests, grasslands, etc). Various modifications of Köppen's initial classification scheme were later introduced by Köppen and Geiger (1930), Trewartha (1968), and Lamb (1972), for example. Other researchers have attempted to improve on Köppen's methodology, for instance by considering plant physiological factors in defining the operative climatic variables (e.g. Holdridge 1947, Thornthwaite 1948, Prentice 1990, Fedema 2005, Jolly et al 2006), or by applying statistical clustering techniques to more precisely define the boundaries of ecoregions whose climate characteristics evolve in time (Hargrove and Hoffman 1999, Hoffman et al 2005). In common with Köppen classification, these alternative schemes implicitly assume that the geographical distribution of dominant vegetation types is both in equilibrium with, and largely determined by, the climate state. In contrast, probabilistic approaches such as that of Brovkin et al (1997) allow several vegetation types to coexist in the same region, each with a probability of occurrence that may vary continuously under changing climate conditions.
In spite of their shortcomings from an ecological perspective, Köppen and alternative deterministic schemes have proved very useful for identifying the first-order bioclimatic consequences of past climate change (Guetter and Kutzbach 1990, Wang and Overland 2004, Gerstengarbe and Werner 2009, Rubel and Kottek 2010, Chen and Chen 2013) or of prospective future climate change inferred from simulated projections of T and P under diverse greenhouse-gas emissions scenarios (Leemans et al 1996, de Castro et al 2007, Diaz and Eischeid 2007, Gao and Georgi 2008, Roderfeld et al 2008, Jylhä et al 2010, Feng et al 2012, 2014, Hanf et al 2012, Gallardo et al 2013, Mahlstein et al 2013, and Elguindi et al 2014). It is also noteworthy that Mahlstein et al (2013), Elguindi et al (2014), and Feng et al (2014) have conducted such studies using future-climate projections provided by the most recent generation of global coupled ocean-atmosphere climate model entries in phase 5 of the Coupled Model Intercomparison Project (CMIP5), described by Bony (2011) and Taylor et al (2012).
For projections of future climate to be credible, however, models must demonstrate an ability to accurately simulate T and P in the historical climate record. Early applications of Köppen schemes for evaluating the performance of particular climate models were reported by Manabe and Holloway (1975) and Lohmann et al (1993). More recently, Gnanadesikan and Stouffer (2006) (hereafter, 'G and S') used a Köppen scheme to evaluate selected simulations of late-20th century continental climate rendered by multiple coupled ocean-atmosphere global climate model (AOGCM) entries in Phase 3 of the Coupled Model Intercomparison Project (CMIP3, see Meehl et al 2007). G and S pointed out that a model bias in T or P in one region may not have the same biological consequences as in another. They emphasized that, for climate models to provide useful bioclimatic predictions, they must correctly simulate specific thresholds of T and P that determine the natural regional boundaries of different vegetation types.
The present study aims to extend the work of G and S by providing an updated Köppen evaluation of CMIP5 simulations of the recent historical climate. (Elguindi et al 2014 also evaluate the CMIP5 historical climate simulations, but instead employ a revised Thornthwaite bioclimatic scheme.) We conduct our evaluation by comparing the Köppen mappings of vegetation types derived from each CMIP5 simulation with those obtained from observational reference values of T and P. To illustrate typical bioclimatic strengths and weaknesses of the models, we discuss detailed results for an example CMIP5 model simulation and that of its CMIP3 antecedent. In addition, we show how model errors in rendering Köppen vegetation types may be deconstructed to reveal the character of corresponding biases in regional T and P.
A secondary goal of our study is to compare the bioclimatic performance of the CMIP5 models with that of their CMIP3 antecedents. This is an important exercise, since the collective improvement or deterioration of CMIP models affects periodic assessments of the Intergovernmental Panel on Climate Change (e.g. IPCC 2007, IPCC 2013), which rely heavily on the CMIP simulations of historical or future climate. To this end, we develop bioclimatic measures to quantify the performance of the CMIP3 and CMIP5 simulations of T and P relative to observational reference data. We also conduct a preliminary investigation of the effects of observational uncertainty and of model horizontal resolution on these simulation performance measures. Subsequent sections describe the methods and data employed (section 2) and the salient model-evaluation results (section 3). We offer concluding remarks in section 4, and discuss additional technical details in a supplementary material (SM) appendix available at stacks.iop.org/ERL/10/064005/mmedia.

Methodology and data
A Köppen classification scheme identifies generic vegetation types associated with regional climate zones defined by characteristics of the annual cycles of T and P (e.g. the value of maximum monthly temperature or the season of maximum precipitation). A Köppen vegetation type thus embodies the interplay of the amplitude and seasonal phase of the associated regional T and P annual cycles.
Because variants of Köppen classification define generic vegetation types somewhat differently, one should choose a scheme that is appropriate for evaluating the CMIP models. For example, the Köppen-Geiger classification differentiates some 30 vegetation types (Kottek et al 2006, Peel et al 2007), probably too many for practical application on the coarse horizontal grid (resolving only several degrees latitude/longitude) that is typical of a CMIP model.
We instead adopt the scheme that G and S employed, which distinguishes 14 regional climates and associated vegetation types, but still sets a challenging standard for model evaluation. Choosing the G and S scheme also permits their initial evaluation of selected CMIP3 models to be extended consistently to the current-generation CMIP5 models. The criteria defining the 14 generic vegetation types and their associated regional climates are listed in table 1. Because these vegetation types are rather ambiguously described (e.g. 'evergreen forest', 'evergreen broad-leaf forest', etc), hereafter we will refer to the vegetation types by their corresponding Köppen regional climate designations (Dc, Cw, Cs, etc). Further details of the G and S Köppen scheme are discussed in the SM appendix.
By applying the defining criteria of table 1, the 14 vegetation types can be mapped from observations of the climatological annual cycles of regional T and P. Because fully global, satellite-based estimates of T and P were not available prior to 1979, and because the CMIP3 historical climate simulations did not extend past the year 2000 (Meehl et al 2007), we focused on the 20 year climatological period 1980-1999. Observationally based estimates of climatological monthly continental T for this 20 year period were obtained from the National Center for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis surface air temperature field (Kalnay et al 1996), and climatological monthly continental P amounts from the Global Precipitation Climatology Project (GPCP) data set (Adler et al 2003).
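As a minimal sketch of this preprocessing step, the annual-cycle characteristics used by the table 1 criteria can be derived from a multi-year monthly series at a single grid cell. The function name `annual_cycle_stats`, the input layout (a flat list of monthly values starting in January), and the unweighted averaging of the 12 monthly means are assumptions made for illustration; they are not taken from the study.

```python
def annual_cycle_stats(monthly_t, monthly_p):
    """Reduce multi-year monthly series at one grid cell to annual-cycle statistics.

    monthly_t: monthly mean temperatures in degrees C, January first.
    monthly_p: monthly precipitation totals in cm, same ordering.
    Both inputs must cover a whole number of years (e.g. 240 values for 1980-1999).
    """
    if not monthly_t or len(monthly_t) != len(monthly_p) or len(monthly_t) % 12:
        raise ValueError("inputs must be equal-length series covering whole years")
    n_years = len(monthly_t) // 12

    # Climatological annual cycle: average each calendar month over all years.
    t_clim = [sum(monthly_t[m::12]) / n_years for m in range(12)]
    p_clim = [sum(monthly_p[m::12]) / n_years for m in range(12)]

    return {
        "T_min": min(t_clim),       # minimum monthly temperature
        "T_max": max(t_clim),       # maximum monthly temperature
        "T_avg": sum(t_clim) / 12,  # annual average (months equally weighted: an assumption)
        "P_min": min(p_clim),       # minimum monthly precipitation
        "P_max": max(p_clim),       # maximum monthly precipitation
        "P_year": sum(p_clim),      # annually integrated precipitation
    }


# Synthetic 20-year series, purely illustrative (not observational data).
stats = annual_cycle_stats(list(range(12)) * 20, [2.0] * 240)
```

Applied to every land grid cell, these six quantities (together with the seasonality index P off) would supply the inputs that the table 1 criteria require.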
The vegetation types of table 1, derived from the chosen observational estimates of T and P (hereafter referred to as the 'OBS' vegetation reference), are mapped on a 72 × 144 (2.5º × 2.5º) grid in figure 1(a). It is seen that the climate zones A to E in figure 1(a) align roughly according to latitude, but their respective classes/subclasses and associated vegetation types display substantial longitudinal variation. For example, tundra (Et) and boreal forest (Dc) dominate northern Eurasia, but also the Tibetan plateau. Desert (BW) and semi-arid (BS) vegetation types occupy subtropical Africa and central Asia, but also intrude in 'rain shadows' to the east of the Rockies and Andes, and to the north of the Himalayas. Where subsiding air dominates the continental interiors of southern Africa and Australia, desert (BW) and semi-arid (BS) vegetation coexist with temperate types that occupy wetter regions nearby. Tropical vegetation (Af, Am, Aw) populates southern and southeastern Asia, as well as Amazonia and equatorial Africa. Over North America, tundra (Et) and boreal forest (Dc) coexist with broadleaf forests (Dab, Cs, Cfa), while semi-arid (BS) and desert (BW) vegetation dominates the southwest US and Mexico. A patchwork of temperate forests (Cs, Cw, Cfa, Cfb, Cfc) also occupy portions of South America, Africa, Australia, Europe, and China, notwithstanding the absence of forests resulting from historical land-clearing practices in many of these locations.

Table 1. Köppen climate types (and associated vegetation-type description), with their corresponding continental temperature/precipitation (T/P) defining criteria, after Gnanadesikan and Stouffer (2006). (See further explanation in section S1 of the supplementary material.)
*Here T min,max,avg are, respectively, the minimum monthly, maximum monthly, and annual-average continental temperature T in degrees Celsius (C). P min,max,year are the minimum monthly, maximum monthly, and annually integrated continental precipitation P in centimeters (cm). The dimensionless precipitation seasonality index P off is set to a value of 0 if > 30% of P year falls in winter, to 7 if there is no distinctly wet season, and to 14 if > 30% of P year falls in summer.

Figure 1. Köppen vegetation types, after Gnanadesikan and Stouffer (2006), on a 72 × 144 (2.5º × 2.5º) grid (see table 1 for type definitions). In (a), the vegetation types (referred to as the OBS reference) are derived from observationally based estimates of the mean monthly annual-cycle climatologies of continental temperature T and precipitation P, for the period 1980-1999. In (b) the same vegetation types are mapped according to the 1980-1999 annual-cycle climatologies of T and P simulated by the NCAR CCSM4 climate model, an entry in the CMIP5 intercomparison.
The OBS vegetation mapping of figure 1(a) serves as a reference standard against which the G and S vegetation types derived from CMIP3/5 historical climate simulations of T and P can be evaluated. Our study considers 1980-1999 historical climate simulations of T and P from 27 CMIP5 models and their 18 CMIP3 antecedents (for the same modeling groups). In many cases, a CMIP model produced multiple realizations of historical climate that were usually distinguished by a different specification of the initial conditions of the model's coupled ocean-atmosphere state. In such cases, we arbitrarily chose only the first in a series of these realizations (denoted as 'run 1' of a CMIP3 model's 20th Century ('20c3m') experiment, or 'r1i1p1' of a CMIP5 model's 'historical' experiment; see web address http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax_v0-25_clean.pdf for notational details on the CMIP5 realizations). All chosen simulations were mapped to the same 72 × 144 grid as that of the OBS reference. Detailed information on technical features of individual CMIP3 models may be obtained at web address www-pcmdi.llnl.gov/ipcc/model_documentation/ipcc_model_documentation.php, and for individual CMIP5 models at http://esdoc.org. The CMIP3/5 model names, their native horizontal grid resolutions, and the associated modeling groups are listed in table 2.

Vegetation mappings
Köppen vegetation maps such as those shown in figure 1(a) were derived for each of the CMIP3/5 model historical simulations. As an illustrative example, we focus on the simulation of the widely used Community Climate System Model Version 4 (CCSM4), a CMIP5 entry. The CCSM4 vegetation mapping (figure 1(b)) displays strengths and weaknesses that are typical of many other CMIP models.
In general, the CCSM4 vegetation types replicate salient features of the OBS mapping (figure 1(a)), but with differences in position and areal extent. Ef and Et types are generally well-simulated, with the exception of Alaska. Marked discrepancies include the constriction of desert (BW) and semi-arid (BS) vegetation areas in the lee of the Rockies, Andes, and Himalayas, as well as over Mexico and in the southern African and Australian interiors, where they are displaced in many regions by temperate vegetation types (Cw, Cfa). In central and eastern Europe also, temperate broadleaf forests (Cfb) erroneously encroach on deciduous cold-winter forests (Dab). Some discrepancies in simulated vegetation types occur elsewhere, such as Amazonia and equatorial Africa, where the divisions between different types of tropical forests (Af, Am, Aw) are not well reproduced.

Deconstruction of simulation biases
Because Köppen vegetation maps are built up from characteristics of the annual-cycle climatologies of T and P (e.g. fields of T max/min, T avg, P max/min, P year, P off; see table 1), we can readily deconstruct the specific biases in modeled T and/or P that are responsible for distinctive regional vegetation errors. Here, the CCSM4 simulation of the vegetation map in figure 1(b) again provides an illustrative example. The more egregious CCSM4 vegetation-type errors (relative to the OBS standard in figure 1(a)) tend to occur in the drier Köppen climatic zones such as leeward of topography or over southern Africa and the Australian interior, where the annual accumulation of precipitation P year is excessive for these regions (not shown). Other CCSM4 dry-zone errors over central Europe and Mexico instead result from an interplay of the biases in T and P.
For example, we deconstruct the CCSM4 T and P biases over Mexico in figure 2. The modeled vegetation types (figure 2(a)) are quite different from those seen in the OBS reference (figure 2(b)). In CCSM4, there is a dearth of the semi-arid (BS) vegetation that appears in the OBS mapping over central Mexico; instead, temperate vegetation types (Cfa and Cfb, denoted by shades of green) occupy these model grid cells. In panels 2(c) and (d), the CCSM4-OBS differences in annual-average temperature (ΔT avg) and in annually integrated precipitation (ΔP year) are shown. In some locations, the CCSM4 simulation of P year exceeds that of the OBS reference by ∼90 cm, while the modeled T avg falls below that of the OBS by ∼4 °C. By examining monthly T and P, it is found that both the CCSM4 T deficits and P excesses over Mexico are present throughout the year, but are most extreme in May-July (not shown).
In addition, there are sizeable temporal phase errors in the modeled precipitation, as displayed by a field of the seasonality index P off (see table 1 and SM section S1) for the CCSM4 (figure 2(e)) that contrasts with the observed P off (figure 2(f)), notably over central Mexico. Here, instead of a pronounced summer peak in the observed P (P off = 14) coinciding with T max, CCSM4 precipitation is distributed more evenly over the entire year (P off = 7). Hence, the model's maximum and annual-average temperatures T max and T avg are both depressed relative to observations (figure 2(c)). In consequence, the simulated Köppen estimate of potential evaporation E p = T avg + P off over Mexico (see SM section S1) is substantially smaller than that for the OBS, and the defining criterion for semi-arid vegetation BS (E p < P year < 2 E p, from table 1) is not satisfied (since simulated P year > 2 E p). Instead, temperate broadleaf forest types Cfa and Cfb (figure 2(a)) erroneously displace much of the Mexican semi-arid (BS) vegetation (figure 2(b)) in locations where the model's depressed annual T max satisfies the corresponding temperate criteria (see table 1).
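The chain of reasoning above can be traced with a short worked example. The sketch below encodes only the two relations quoted in the text, E p = T avg + P off and the semi-arid criterion E p < P year < 2 E p. The function names and the input values are hypothetical; they are chosen to mimic the sign of the CCSM4 biases (cooler, wetter, weaker summer seasonality) and are not taken from figure 2.

```python
def potential_evaporation(t_avg_c, p_off):
    """Koppen estimate of potential evaporation, E_p = T_avg + P_off (as quoted in the text)."""
    return t_avg_c + p_off


def is_semi_arid(t_avg_c, p_off, p_year_cm):
    """Semi-arid (BS) criterion quoted from table 1: E_p < P_year < 2 * E_p."""
    e_p = potential_evaporation(t_avg_c, p_off)
    return e_p < p_year_cm < 2 * e_p


# Hypothetical grid cell resembling the observed regime:
# warm, summer-peaked rainfall (P_off = 14), moderate annual total.
obs_like = is_semi_arid(t_avg_c=18.0, p_off=14, p_year_cm=50.0)    # E_p = 32, 2 E_p = 64

# Hypothetical grid cell resembling the model regime:
# cooler, rainfall spread over the year (P_off = 7), larger annual total.
model_like = is_semi_arid(t_avg_c=14.0, p_off=7, p_year_cm=90.0)   # E_p = 21, 2 E_p = 42

print(obs_like, model_like)
```

In this illustration the cold bias and the seasonality error both lower E p, while the wet bias raises P year, so all three errors push the cell out of the BS range in the same direction.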

Bioclimatic performance metrics
For objective comparison of CMIP model simulations, it is necessary to develop quantitative measures of bioclimatic performance. For example, we define a vegetation 'hits' metric h(v i ) as an area-weighted measure of the percentage of one-to-one matches of each model-derived vegetation type with the corresponding OBS type v i (where i = 1 to 14) in each grid cell. Here h(v i ) is weighted by the grid-cell areas which (given the convergence of the meridians toward the pole) decrease as the cosine of the latitude.
In effect, the metric h constitutes the diagonal of a hits matrix H(v i , v j ) that relates occurrences of model vegetation type v j in grid cells where the observed (OBS) vegetation type is v i . In figure 3, hits matrices are shown for the CCSM4 historical climate simulation and that of its CMIP3 antecedent, the CCSM3 model. The color-coded model-OBS hits percentages h(v i ) for each OBS vegetation type v i are arrayed along the matrix diagonals. The off-diagonal patches represent the model 'misses' in vegetation type, where their vertical distances from the diagonal broadly indicate the degree of mismatch with the OBS vegetation types. Overall, the CCSM4 simulation displays fewer misses in vegetation type than the CCSM3.
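A minimal sketch of how such a hits matrix might be assembled is given below. It assumes that vegetation types are stored as integer indices on a latitude-longitude grid, that non-land cells are marked with `None`, and that each row of H is normalized by the area of the corresponding OBS type; the function name and these conventions are illustrative assumptions rather than the study's implementation.

```python
import math


def hits_matrix(obs, model, lat_deg, n_types=14):
    """Area-weighted hits matrix H, in percent.

    obs, model: nested lists [lat][lon] of type indices 0..n_types-1 (None = not land).
    lat_deg:    latitude of each grid row in degrees.
    H[i][j] is the percentage of the area with OBS type i where the model has type j,
    so the diagonal H[i][i] plays the role of the hits metric h(v_i).
    """
    area = [[0.0] * n_types for _ in range(n_types)]
    for row_obs, row_model, lat in zip(obs, model, lat_deg):
        weight = math.cos(math.radians(lat))  # grid-cell area shrinks as cos(latitude)
        for i, j in zip(row_obs, row_model):
            if i is None or j is None:
                continue  # skip ocean or undefined cells
            area[i][j] += weight

    matrix = []
    for row in area:
        total = sum(row)
        # OBS types that never occur get NaN rather than a misleading 0%.
        matrix.append([100.0 * a / total if total > 0 else float("nan") for a in row])
    return matrix


# Toy 2 x 2 grid with two vegetation types (illustrative values only).
H = hits_matrix(obs=[[0, 0], [1, 1]], model=[[0, 1], [1, 1]],
                lat_deg=[0.0, 60.0], n_types=2)
h = [H[k][k] for k in range(2)]  # per-type hits percentages
```

In the toy grid, half of the OBS type-0 area is matched by the model and all of the type-1 area is matched, so the off-diagonal entry H[0][1] records the single miss.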
The hits metric h is a stringent measure of a model's bioclimatic performance, since slight but consistent errors in the spatial locations of simulated vegetation types can have a sizeable negative impact. A more 'forgiving' performance metric instead compares the percentage of total land area a(v i ) occupied by each simulated vegetation type v i with that of the corresponding OBS reference v i , regardless of whether there is a one-to-one match of vegetation types in each grid box.
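The area metric can be sketched in the same hedged spirit. The code below assumes integer type indices on a latitude-longitude grid with `None` marking non-land cells, and cosine-of-latitude area weights; the function name `area_percentages` is hypothetical.

```python
import math


def area_percentages(types, lat_deg, n_types=14):
    """Percentage of total land area occupied by each vegetation type, a(v_i).

    types:   nested list [lat][lon] of type indices 0..n_types-1 (None = not land).
    lat_deg: latitude of each grid row in degrees.
    """
    area = [0.0] * n_types
    for row, lat in zip(types, lat_deg):
        weight = math.cos(math.radians(lat))  # relative grid-cell area
        for v in row:
            if v is not None:
                area[v] += weight
    total = sum(area)
    return [100.0 * a / total for a in area]


# Toy 2 x 2 grid with two vegetation types (illustrative values only).
a_obs = area_percentages([[0, 0], [1, 1]], lat_deg=[0.0, 60.0], n_types=2)
a_model = area_percentages([[0, 1], [1, 1]], lat_deg=[0.0, 60.0], n_types=2)
delta = [m - o for m, o in zip(a_model, a_obs)]  # model-OBS area differences
```

Because this metric ignores where each type occurs, two maps with entirely different spatial patterns can still yield identical a(v i) values, which is why it is the more forgiving of the two measures.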
Both these metrics are plotted for the CCSM3 and CCSM4 simulations in figure 4. For the two simulations, the hits percentages tend to be higher for vegetation in the Polar, Boreal, and Temperate (E, D, and C) zones than in the Arid (B) or Tropical (A) zones (figure 4(a)). Both CCSM3 and CCSM4 display particular weaknesses in matching the locations of the semiarid (BS) and the moist tropical (Am) vegetation types, with CCSM3 performing somewhat better for BS, and CCSM4 for Am. For the majority of vegetation types, however, the hits percentages are higher for CCSM4 than for CCSM3.

Table 2. Selected participating CMIP modeling groups (and home countries) listed with associated climate models (and their native horizontal grids, expressed as the number of latitudes × longitudes). Globally aggregated performance scores VH and VA (optimal values = 100%) also are listed for each model's simulation of historical climate evaluated relative to the OBS reference (see text for further details). VH and VA scores are shaded green where CMIP5 model scores improve on those from the CMIP3 antecedent model(s).
The aggregate percentage of total land area a(v i ) occupied by different OBS vegetation types v i (black line in figure 4(b)) ranges widely, with Evergreen Boreal Forest (type Dc) occupying the largest percentage area (∼17%), and Temperate Needle-Tree Forest (Cfc) the smallest (∼1%). Semiarid (BS), desert (BW), and tropical dry (Aw) vegetation types also cover comparatively large (∼11-16%) percentage areas. Both the CCSM3 and CCSM4 simulations display good agreement with the OBS tropical (Af, Am, Aw) vegetation areas. The close matching of the simulated moist tropical (Am) vegetation area with that of the OBS appears to result from compensating errors, with too little of this type being simulated over Amazonia, and too much over equatorial Africa. This fortuitous matching of total Am areas stands in contrast to a relatively low hits percentage (figure 4(a)), evidenced most clearly by the erroneous spatial displacement of the Am type over Amazonia (figure 1(b)).
Both CCSM models also under-predict desert vegetation areas (BW), as noted in section 2. In addition, CCSM4 under-predicts the area of the semiarid (BS) vegetation, in agreement with a markedly low hit percentage for this type (figure 4(a)). However, CCSM4 matches Polar, Tundra, and Boreal Forest (Ef, Et, Dc, and Dab) vegetation areas much better than CCSM3. Both models reproduce temperate vegetation areas comparatively well, but with CCSM4 tending to over-predict Cw and Cfa types (figure 4(b)). The CCSM4 simulation better reproduces the majority of OBS vegetation areas, however.
Figure 2. Deconstruction of CCSM4-OBS differences in Köppen vegetation types over Mexico. Regional vegetation types for the CCSM4 simulation are displayed in (a) and those for the OBS standard in (b). CCSM4 differences in annual average temperature ΔT avg from that of the OBS are shown in (c) and differences in accumulated annual precipitation ΔP year in (d). The field of the precipitation seasonality index P off is shown for the CCSM4 simulation in (e) and for the OBS reference data in (f). Blue areas denote where P off = 0 (predominantly wintertime precipitation), red areas where P off = 14 (predominantly summertime precipitation), and green areas where P off = 7 (no distinct seasonality in precipitation). See table 1 for notational definitions.

In order to assess the overall bioclimatic performance of each CMIP model, measures that are aggregated across all 14 vegetation types v i are also necessary. For example, aggregate performance scores VH and VA can be derived from the vegetation-specific indices h(v i ) and a(v i ): VH is the average of h(v i ) over the 14 vegetation types, and VA is computed from Δ i , the model-OBS differences in a(v i ) percentage. Note that both VH and VA have optimum values of 100%.
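To make the aggregation concrete, the sketch below averages per-type hits percentages into a VH-like score and, for contrast, computes the textbook unweighted Cohen's kappa from paired lists of vegetation types. The function names are hypothetical, the kappa shown is the standard formula rather than the implementation of any particular study, and no VA calculation is attempted because its exact formula is not reproduced here.

```python
def vh_score(h):
    """Aggregate score: plain average of the per-type hits percentages h(v_i)."""
    return sum(h) / len(h)


def cohen_kappa(obs, model, n_types=14):
    """Textbook Cohen's kappa for two equal-length lists of type indices (no area weights)."""
    n = len(obs)
    counts = [[0] * n_types for _ in range(n_types)]
    for i, j in zip(obs, model):
        counts[i][j] += 1

    # Observed agreement: fraction of cells where the two type labels coincide.
    p_observed = sum(counts[k][k] for k in range(n_types)) / n

    # Chance agreement: product of the marginal type frequencies, summed over types.
    row = [sum(counts[i]) / n for i in range(n_types)]
    col = [sum(counts[i][j] for i in range(n_types)) / n for j in range(n_types)]
    p_chance = sum(r * c for r, c in zip(row, col))

    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example with two types on four cells (illustrative values only).
vh = vh_score([50.0, 100.0])
kappa = cohen_kappa(obs=[0, 0, 1, 1], model=[0, 1, 1, 1], n_types=2)
print(vh, kappa)
```

In this toy case three of four cells agree, yet kappa is only 0.5 because half of that agreement would be expected by chance; the averaged hits score carries no such correction.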
The VH score measures an aggregate agreement in both vegetation type and location. In similar evaluative studies such as that of Elguindi et al (2014) this agreement is instead expressed by a kappa statistic (Cohen 1960, Monserud 1990, Prentice et al 1992). Although VH and kappa both attempt to measure overall agreement between two fields of vegetation types, we would point out several basic differences in these measures. Our study's objective is to measure the area-weighted, gridbox-scale agreement of each modeled vegetation type with that of the OBS reference vegetation field, expressed in the area-weighted hits metric values h(v i ). Thus, we implicitly treat the OBS reference vegetation field as a 'truth' target that a simulation of vegetation types matches according to its h(v i ) values. The VH aggregate score then is built up by averaging a simulation's h(v i ) over all vegetation types. The kappa statistic, as implemented by Elguindi et al (2014), does not implicitly treat the vegetation reference as a truth target. The kappa statistic also is strictly an aggregate measure of agreement, and is not built up from any vegetation-specific measure, nor are vegetation types area-weighted. Finally, unlike the VH score, the kappa statistic attempts to account for chance occurrences of agreement in the vegetation fields, and so kappa is considered a very conservative measure.

Table 2 lists the VH and VA metric values (rounded to the nearest percent) for the CMIP3 and CMIP5 simulations. For example, the overall performance of the CCSM4 model (VH = 68, VA = 84) is substantially better than that of CCSM3 (VH = 62 and VA = 73). Adjacent to the CCSM4 results in table 2, performance scores are listed for different versions of the successor Community Earth System Model (CESM1) which include more complex representations of atmospheric radiation and cloud-aerosol interactions than CCSM4 (see www.cesm.ucar.edu/models/cesm1.0/notable_improvements.html and linked pages). In addition, emissions of biogenic aerosols and their deposition on ice, snow, and vegetation are treated in the CESM1-FASTCHEM, while the CESM1-WACCM simulates the stratosphere at finer vertical resolution (although at only half the horizontal resolution of the other CESM1 model versions; see table 2). CESM1-BGC also predicts variations in the global carbon cycle, including flux exchanges between ocean, land, and atmosphere, as well as related variations in vegetation types and areas. Note, however, that in the CMIP5 historical climate experiments, all such carbon-cycle prognostics are 'switched off'. Nevertheless, for such an Earth System Model (ESM), the Köppen evaluation subjects the simulation of T and P to careful scrutiny, as a prerequisite for a realistic simulation of the land's carbon cycle and vegetation cover.
All the CESM1 historical simulations display biases in vegetation types (not shown) that are qualitatively similar to those of the CCSM4 (figure 1(b)). The performance scores VH and VA for the CESM1-BGC and CESM1-FASTCHEM, despite the greater complexity of their physical-process representations, also are similar to those of the CCSM4 (table 2). (The VH and VA scores for the CESM1-CAM5-1-FV2 are somewhat lower than those of CCSM4.) The scores of other ESMs (including NorESM1-M) are also close to those of the less complex CMIP5 counterpart(s) from the same modeling group.
Additional salient model performance results in table 2 are summarized as follows. For the CMIP3 simulations, the VH score ranges from 32 to 66, and for CMIP5, from 56 to 70. In most cases, the VH scores for a particular CMIP5 model simulation roughly equal, or in many instances exceed, those for its CMIP3 antecedent (see green-shaded cells in table 2). A collective improvement in the CMIP5 models relative to their CMIP3 counterparts is also evidenced in multi-model averages of the respective hits index h(v i ), shown in figure 5. However, the CMIP5 improvement mostly occurs in the vegetation types of the Boreal (D) and Temperate (C) climate zones. The collective VH scores of CMIP5 are close to those of CMIP3 in the Polar (E) and Arid (B) zones, and are only slightly higher for some Tropical (Am and Aw) vegetation classes. The persistent difficulties in replicating the dry vegetation (BS and BW) types may be related to an erroneous 'drizzle effect' exhibited by many climate models, in which simulated precipitation events occur more frequently than observed (e.g. Stephens et al 2010). From inspection of Köppen vegetation maps of individual CMIP models, the collective failure to reproduce the moist tropical vegetation type (Am) results mainly from erroneous simulation of precipitation amounts or patterns over Amazonia.
From table 2, it is seen that a model's VA score almost always surpasses its VH value. This is as expected, since VH measures the overall ability of a model to reproduce vegetation types that correctly match the OBS reference at each grid cell, while VA measures its ability to simulate only the aggregate percentage areas of vegetation types. In many CMIP5 simulations there are also improvements in the VA scores relative to their CMIP3 antecedents, but with a number of exceptions to this pattern (note gray-shaded cells in the rightmost column of table 2). These outcomes imply that VA is a less consistent measure of overall model performance than VH.

Sensitivity to observational uncertainty and model resolution
It is possible that our bioclimatic performance metrics are sensitive to observational uncertainties (e.g. depending on the choice of observed reference data) or to the grid resolution of the Köppen vegetation mappings. However, further investigation (see SM sections S2 and S3) implies that the bioclimatic performance measures are quite insensitive to both these factors.

Concluding remarks
Our study demonstrates the efficacy of a Köppen-based bioclimatic evaluation of the collection of CMIP model simulations of the annual-cycle climatologies T and P, as critical determinants of habitability for living organisms. In particular, a Köppen scheme pinpoints where simulated T and P deviate from values associated with observed regional bioclimatic zones. Because the Köppen vegetation types are derived from specific characteristics of the annual cycles of T and P, it is straightforward to deconstruct the particular simulation biases that produce distinctive regional vegetation errors. This approach also lends itself to developing metrics and related graphical depictions of model strengths and weaknesses in simulating T and P, and thus for tracking changes in model performance across development cycles.

Figure 4. Comparison of performance metrics for the NCAR CCSM3 and CCSM4 models. In (a), the area-weighted percentage hits metric h(v i ), computed with respect to the OBS reference, is plotted for the CCSM3 and CCSM4 historical simulations. The optimum value h(v i ) = 100% is shown for comparison. In (b), the vegetation area metric a(v i ) is plotted for the CCSM3 and CCSM4 simulations, in comparison with that of the OBS reference.
Our bioclimatic performance evaluation is moderately encouraging, in the sense that most CMIP5 models display improvements over their CMIP3 antecedents, especially in representing T and P in high- and mid-latitudes (i.e. Köppen climate zones E, D and C). It is also reassuring that the performance of the ESMs compares well with that of their less complex model counterparts. Moreover, these improvements appear robust to different choices of observational reference data and grid resolution.
On the other hand, the CMIP5 simulations agree with the observational reference vegetation types in, at most, about 70 percent of the grid cells (see VH values in table 2 and S1), and obvious deficiencies remain in simulating the Arid (B) and Tropical (A) zones. This harsh appraisal should perhaps be softened somewhat, however, in view of the generally more accurate CMIP5 simulations of aggregate areas of Köppen vegetation types.
These outcomes highlight the stringency of a Köppen-based regional evaluation of global climate simulations that are mostly implemented at rather coarse horizontal resolution. Our study suggests, however, that such regional discrepancies should not be attributed mainly to resolution deficiencies. Rather, the lackluster bioclimatic performance of today's climate models reflects a collective inability to predict essential physical characteristics of T and especially of P, whose realistic simulation requires accurate representations of frontal dynamics, convection, and topographic uplift (e.g. Qian et al 2009, Catto et al 2013, Hirota and Takayabu 2013). In addition, representations of the complex aerosol-cloud and biogeochemical interactions relevant to cloud microphysics and precipitation formation have only recently been introduced in some of today's climate models (e.g. Gettelman et al 2015).
With the advent of the ESMs, climate-biosphere interactions stand at the forefront of current modeling efforts. Hence, there is an acute need to more precisely simulate details of T and P on regional scales, as an essential prerequisite for the prediction of the global carbon cycle, vegetation cover, and related processes. We thus anticipate that Köppen-based evaluations of climate simulations will continue to prove their worth.