One simulation, different conclusions—the baseline period makes the difference!

The choice of the baseline period, intentionally chosen or not, as a reference for assessing future changes of any projected variable can play an important role for the resulting statement. In regional climate impact studies, well-established or arbitrarily chosen baselines are often used without being questioned. Here we investigated the effects of different baseline periods on the interpretation of discharge simulations from eight river basins in the period 1960–2099. The simulations were forced by four bias-adjusted and downscaled Global Climate Models under two radiative forcing scenarios (RCP 2.6 and RCP 8.5). To systematically evaluate how far the choice of different baselines impacts the simulation results, we developed a similarity index that compares two time series of projected changes. The results show that 25% of the analyzed simulations are sensitive to the choice of the baseline period under RCP 2.6 and 32% under RCP 8.5. In extreme cases, change signals of two time series show opposite trends. This has serious consequences for key messages drawn from a basin-scale climate impact study. To address this problem, an algorithm was developed to identify flexible baseline periods for each simulation individually, which better represent the statistical properties of a given historical period.


Introduction
In the context of climate change mitigation and adaptation, decision-makers generally call for information about impacts of projected changes in a specific region at different global warming levels or in certain future periods. They need answers to questions like: 'Can we expect an increase or decrease in water availability, extreme events, such as floods, droughts, storm surges or heatwaves, around the year 2030, 2050 or by the end of the 21st century? And what will be the consequences for, e.g. crop production, renewable electricity generation?' To answer such questions, regional climate impact modelers face a variety of challenges, which relate to technical and methodological issues and to the communication of simulation results [1] and corresponding recommendations under uncertainty in a comprehensible way.
Technical and methodological challenges include, e.g. the choice of climate scenarios, climate and impact models, the use of bias-adjustment methods, and model calibration and validation periods. The performance of a climate model is usually measured against its ability to represent spatial patterns and trends in the historical climate. Sometimes the performance is used to assign weights to individual models within a model ensemble [2][3][4][5][6][7]. The uncertainty cascade in impact modeling is mainly associated with model structure, model parameterization, and input data quality [8][9][10][11][12][13][14][15].
After the simulations have been carried out, the question arises of which baseline period the future simulation results should be compared to. Whereas future scenario periods are usually defined to reflect the decision maker's planning horizon, baseline periods are often chosen arbitrarily or are based on existing standards. However, choosing a baseline period is a sensitive issue and can be easily instrumentalized to support specific conclusions, whether intentionally or not.
The World Meteorological Organization (WMO) recommends using the 30-year period 1961-1990 as the climate normal when comparing with future periods and maintaining it as a reference for monitoring long-term climate variability and change [16,17]. Beyond that, a regularly updated 30-year baseline period, currently 1981-2010, should be employed to give people a more recent context for understanding weather and climate extremes and forecasts [17][18][19]. The Intergovernmental Panel on Climate Change (IPCC) used the 20-year period 1986-2005 as the baseline in many graphs in the Fifth Assessment Report [20] and will use the years 1995-2014 in its Sixth Assessment Report. So, what are climate impact modelers supposed to do? Which baseline should they select and does it actually matter?
At the global or continental scale, it is virtually impossible to choose a baseline period whose climate is represented realistically by all climate models. An arbitrary determination of global baselines is therefore justifiable. However, global and regional climate simulations are often not designed to synchronize with real year-to-year patterns and events [21], which creates a communication challenge, particularly in regional impact studies. For example, some climate models depict the mid-1980s as a period with above-normal rainfall, when in reality a drought hit West Africa. Others simulate the extraordinarily wet 1950s and 1960s as very dry. Nevertheless, well-established global baseline periods are often used unquestioningly in regional impact studies, although the real-life statistical properties of the specific historical period may not be adequately represented by climate model simulations and therefore also not in subsequent applications.
Even though the implications of the choice of baseline periods for the interpretation of simulation results are well known, little attention has been paid to them in the climate impact community. Ruokolainen and Räisänen (2007) [22] analyze the sensitivity of forecasts to the choice of different baselines in Southern Finland. Razavi et al (2015) [23] emphasize that different lengths of baseline periods may lead to different conclusions about stationarity/non-stationarity. Hawkins and Sutton (2016) [18] discuss the choice of climate reference periods when comparing global air temperature projected by climate models with observations. Huang et al (2018) [24] depict flood characteristics in future periods in four river basins based on different 30-year baseline periods. Snell et al (2018) [25] highlight the sensitivity to the choice of baseline climate in dynamic forest modeling in the Alps. Baker et al (2016) [26] assessed the impact of six different climate baselines on projections of African bird species' responses to future climate change. Although this issue has been addressed as a side effect in several other studies, it has generally not been considered important enough to form the focal point of systematic research.
The present study systematically investigates the effect of the choice of the baseline period on the interpretation of simulation results. It provides examples from eight river basins located in various climate zones, where changes in projected future discharge are estimated based on WMO and IPCC baselines using four bias-adjusted and downscaled Global Climate Models (GCMs) from the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) [27][28][29]. An index measuring the similarity between two time series is introduced and used to assess the sensitivity to the choice of different baseline periods. We developed an algorithm to overcome the problem in cases of substantial deviations. It identifies a baseline period that has basic statistical properties similar to those of the historical period and is flexible in terms of length and timing. Although the main focus of this study is the analysis of river discharge, the method is in principle applicable to any time series variable, such as meteorological data, crop yields, emissions of greenhouse gases, hydropower potentials and so on.

Study sites
The impact of different baseline periods was investigated by using simulated river discharge from eight exemplary river basins located in various climate zones from equatorial to polar (figure 1 and table 1). The simulations were carried out within the framework of various research projects (see references in table 1). What they have in common is the hydrological model, the same four forcing GCMs, and the simulation period they cover (1960-2099), which guarantees consistency across the study basins.

Data
The investigation was conducted using annual mean discharge MQ, derived from simulated daily discharge from eight river basins, based on climate model input from four GCMs in the period 1960-2099. The discharge was simulated with the semi-distributed, eco-hydrological Soil and Water Integrated Model (SWIM) [35,36]. The downscaled and bias-adjusted GCM climate simulation data were provided by ISIMIP [27][28][29][37] for the GFDL-ESM2M, HadGEM2-ES, IPSL-CM5A-LR, and MIROC5 models. ISIMIP aims to provide harmonized climate simulation input to impact modelers and thereby to support the intercomparison of global and regional impact studies.

Baseline periods
The impact of the baseline period on the interpretation of changes in simulated future river discharge was investigated by using two baselines established by the WMO [17] and the IPCC [20]. The WMO baseline covers the 30 years 1961-1990 and the IPCC baseline the 20 years 1986-2005. Other IPCC reports also use different and longer baseline periods. However, we chose the above-mentioned baseline here, as it is used in many graphs in the IPCC AR5 report [20] and is therefore likely to tempt impact modelers to use it as a standard in their studies. Interestingly, the central limit theorem is commonly taken to require at least 30 samples if a normal distribution is to be assumed and natural variability is to be sampled [38], as is the case for the WMO baseline [16,19]. From this perspective, the IPCC baseline is too short, especially if variables with a high degree of natural variability are considered, e.g. river discharge. However, one could argue that the sample size is sufficiently large if the number of years in the baseline period times the number of models exceeds a critical threshold, which is the case for the IPCC (20 years times 40+ GCMs). In addition, the selection of the baseline period should strike a balance between being statistically robust and being representative of the target conditions (e.g. 'present-day climate'). For rapidly changing variables, such as extreme temperatures, reference periods of 30 years or longer might be considered insufficiently representative of the target conditions.
In this study, we hypothesize that the baseline period is a subset that accurately represents some basic statistical properties of a historical period, here defined as 1960-2005. An algorithm was developed to identify for each simulation a baseline period of variable length within a given historical period. The algorithm searches for a baseline period whose mean, minimum, and maximum values correspond to those of the historical period. In line with common practice in hydro-climatic impact studies, the baseline period should cover at least 30 years. The statistical properties of the baseline period are allowed to deviate from those of the historical period by not more than a user-defined threshold, e.g. 5%. If the algorithm is not able to find an appropriate baseline with n = 30 years, n is incremented by 1. The resulting baseline period is therefore flexible in terms of its length and starting year and is hereafter called 'flexible baseline'. The corresponding function, implemented in R, is provided in appendix A. It works only for annual series but can be easily adapted for monthly or daily series.
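The search described above can be illustrated with a short sketch. It is written in Python and is not the R function from appendix A; it only follows the verbal description under stated assumptions. The names `find_flexible_baseline`, `min_len`, and `tol` are hypothetical, deviations are measured relative to the historical statistic, and, where several windows of the same length qualify, the window with the smallest worst-case deviation is returned (the text does not specify a rule for this case).

```python
from statistics import mean


def find_flexible_baseline(years, values, min_len=30, tol=0.05):
    """Return (first_year, last_year) of the shortest qualifying window.

    years, values: equally long sequences of consecutive years and annual
    values (e.g. MQ) covering the historical period.
    Assumes the historical mean, minimum, and maximum are non-zero.
    """
    hist = (mean(values), min(values), max(values))
    total = len(values)
    # Start with the minimum length and extend by one year at a time.
    for n in range(min_len, total + 1):
        best = None  # (worst relative deviation, start index)
        for start in range(total - n + 1):
            window = values[start:start + n]
            stats = (mean(window), min(window), max(window))
            # Relative deviation of each statistic from its historical value.
            worst = max(abs(w - h) / abs(h) for w, h in zip(stats, hist))
            if worst <= tol and (best is None or worst < best[0]):
                best = (worst, start)
        if best is not None:
            start = best[1]
            return years[start], years[start + n - 1]
    return None  # only reached if the series is shorter than min_len


# Hypothetical series for 1960-2005 with one dry and one wet year.
years = list(range(1960, 2006))
values = [100.0] * len(years)
values[5], values[20] = 50.0, 150.0
print(find_flexible_baseline(years, values))  # (1960, 1989)
```

In this made-up example a 30-year window already contains both extreme years, so no extension is needed. If the extremes lay at the two ends of the historical period, the search would extend the window until it covers the whole period.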
To account for the possibility of a linear climate change trend in the historical discharge, the algorithm was tested using a time series detrended using the first (linear) differencing method (appendix B). In general, the differences in the results were found to be minor and the identified baseline periods to be longer. To avoid accidentally removing or suppressing some of the extreme years by applying a linear operation, results shown below are all based on the original data.
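As a minimal illustration of first differencing (appendix B is not reproduced here, so the details are assumptions), each annual value is replaced by its difference from the previous year. The helper name `first_difference` is hypothetical. The resulting series is one element shorter than the input.

```python
def first_difference(values):
    """Return the series of year-to-year differences x[t] - x[t-1]."""
    return [b - a for a, b in zip(values, values[1:])]


# A purely linear series collapses to a constant after differencing.
trend = [5.0 + 2.0 * t for t in range(6)]
print(first_difference(trend))  # [2.0, 2.0, 2.0, 2.0, 2.0]
```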

Similarity index
The MQ time series was used to compute the relative change between a specific baseline and a corresponding future period as follows:

$$\Delta MQ_{i,j} = \frac{MQ_{\mathrm{future},i,j} - MQ_{\mathrm{base},j}}{MQ_{\mathrm{base},j}} \times 100\%$$

where $MQ_{\mathrm{base},j}$ is the average of the annual values of a specific baseline period and $MQ_{\mathrm{future},i,j}$ the average of a future period. The index i refers to different future periods with central years between 2020 and 2080, i.e. 61 time steps. The index j represents the different baselines (WMO, IPCC, and flexible baseline), where the length of the baseline determines the number of years around the central years in the corresponding future periods. The mean absolute deviation between two ∆MQ time series, e.g. $\Delta MQ_{i,\mathrm{WMO}}$ for the WMO and $\Delta MQ_{i,\mathrm{IPCC}}$ for the IPCC baseline, over all k = 61 time steps was then quantified as:

$$D = \frac{1}{k} \sum_{i=1}^{k} \left| \Delta MQ_{i,\mathrm{WMO}} - \Delta MQ_{i,\mathrm{IPCC}} \right|$$

The deviation was then re-scaled by a user-defined deviation threshold $D_{\max}$ to an agreement score value

$$AS = \begin{cases} 1 - D / D_{\max} & \text{if } D < D_{\max} \\ 0 & \text{otherwise} \end{cases}$$

This Agreement Score ranges between one (no deviation, perfect agreement) and zero (deviation larger than the threshold). In this study a threshold value of $D_{\max}$ = 25% was defined, because deviations in discharge projections > 25% that are solely based on different baselines were considered to be very large and indicative of a substantial difference. For other applications (e.g. greenhouse gas emissions, temperature, precipitation, wind speed), or when using absolute rather than relative changes for $\Delta MQ_{i,j}$, other threshold values might be more appropriate.
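The relative change, the mean absolute deviation, and the agreement score described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code. The function names are hypothetical, the linear re-scaling by the threshold is inferred from the stated range of the score, and placing each future window from `central_year - n // 2` onward is an assumption (the text only states that the baseline length determines the number of years around the central year). The discharge values are made up.

```python
from statistics import mean


def delta_mq(mq, base_start, base_end, central_years):
    """Relative change (%) of mean MQ in future windows versus a baseline.

    mq: dict mapping year -> annual mean discharge.
    Each future window has the baseline's length; its exact placement
    around the central year is an assumption of this sketch.
    """
    n = base_end - base_start + 1
    mq_base = mean([mq[y] for y in range(base_start, base_end + 1)])
    changes = []
    for c in central_years:
        first = c - n // 2
        mq_future = mean([mq[y] for y in range(first, first + n)])
        changes.append(100.0 * (mq_future - mq_base) / mq_base)
    return changes


def mean_abs_deviation(d1, d2):
    """Mean absolute difference between two series of relative changes."""
    return mean([abs(a - b) for a, b in zip(d1, d2)])


def agreement_score(d, d_max=25.0):
    """Re-scale a deviation to [0, 1]; 0 once d reaches the threshold."""
    return max(0.0, 1.0 - d / d_max)


# Hypothetical discharge: 100 until 2005, 120 afterwards.
mq = {y: (100.0 if y <= 2005 else 120.0) for y in range(1960, 2100)}
central = range(2020, 2081)  # 61 central years
wmo = delta_mq(mq, 1961, 1990, central)
ipcc = delta_mq(mq, 1986, 2005, central)
print(agreement_score(mean_abs_deviation(wmo, ipcc)))
```

Because both baselines have the same mean in this made-up series, the two change series almost coincide and the score is close to one.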
Apart from the deviation based on the choice of different baselines, we quantified the agreement between two baselines in the direction of the change signal CS for the future periods by setting

$$CS_i = \begin{cases} 1 & \text{if } \operatorname{sign}(\Delta MQ_{i,\mathrm{WMO}}) = \operatorname{sign}(\Delta MQ_{i,\mathrm{IPCC}}) \\ 0 & \text{otherwise} \end{cases}$$

to eventually derive the average agreement in the direction of the change signal by

$$\overline{CS} = \frac{1}{k} \sum_{i=1}^{k} CS_i$$

Finally, a Similarity Index was defined as

$$SI = AS \cdot \overline{CS}$$

A value of zero is derived if the selection of a baseline has a large impact on the interpretation of results of an impact study, while the optimum value of SI = 1 indicates that it has no impact. The computation of SI was also tested by integrating other factors, such as agreement in standard deviation or R², but the results achieved with a more complex indicator were not considered to be more meaningful than those achieved with the simplistic approach. The SI was also used to assess the sensitivity of the choice of the baseline depending on the GCM and the climate zone.
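A compact sketch of the full index is given below in Python. It is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, the agreement score is assumed to be a linear re-scaling of the mean absolute deviation, a change of exactly zero is treated as its own direction, and the two scores are assumed to be combined as a product. The input values are made up.

```python
from statistics import mean


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def similarity_index(d1, d2, d_max=25.0):
    """Combine agreement in magnitude and in direction of change.

    d1, d2: equally long series of relative changes (%) for two baselines.
    Taking the product of both scores is an assumption of this sketch.
    """
    deviation = mean([abs(a - b) for a, b in zip(d1, d2)])
    agreement = max(0.0, 1.0 - deviation / d_max)
    same_direction = mean([1.0 if sign(a) == sign(b) else 0.0
                           for a, b in zip(d1, d2)])
    return agreement * same_direction


wmo = [10.0, -5.0, 20.0, 5.0]   # hypothetical change signals (%)
ipcc = [5.0, 5.0, 20.0, 10.0]
si = similarity_index(wmo, ipcc)
print(round(si, 2), "sensitive" if si <= 0.5 else "not sensitive")
```

Here the mean absolute deviation is 5%, giving an agreement score of 0.8, and the signs agree in three of four periods, so the index is about 0.6.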

Results and discussion
This section shows to what extent the choice of the baseline alone can influence the interpretation of simulation results. Figure 2 shows future ∆MQ series for selected river basins relative to MQs in the WMO and IPCC baselines. Future change signals and magnitudes of change can be extremely different between the two ∆MQ series (figures 2(a) and (c)). Both examples are therefore characterized by low SI values of 0.24 and 0.19, respectively, which indicate large differences of MQ values in the respective baselines. They also demonstrate that neither baseline generally tends to suggest higher or lower future ∆MQ, a phenomenon [...] figure 2 is conclusive, where low SI values indicate a high sensitivity to the choice of the baseline period and high SI values a low sensitivity. As with model performance indicators (e.g. R², PBIAS), judging which value ranges actually indicate a good or poor fit, or in the case of the SI, which values represent high or low sensitivity, remains somewhat subjective. In the context of simulated river discharge, we propose SI values below 0.5 to indicate high sensitivity.
The choice of the baseline period has the highest impact on the interpretation of simulation results performed with the IPSL model and the lowest impact with the MIROC5 model. However, the average GCM SI value (table 2) does not imply that this ranking holds for all basins and all RCPs. The results for RCP 8.5 are slightly different: the highest SI value is also achieved with the MIROC5 model, but the lowest values are achieved with GFDL (table D1).
Assuming an SI threshold of 0.5, it mattered in 25% of the simulations under RCP 2.6 (table 2) and in 32% under RCP 8.5 (table D1), whether the one or the other baseline was used to assess future changes. There are basically two options to deal with simulations resulting in SI ≤ 0.5: (i) discuss the uncertainties and/or (ii) choose a different baseline that represents the basic statistical properties of the historical period more consistently, e.g. by using the proposed algorithm in appendix A.
To exemplify, the projected discharge changes with an additional flexible baseline are shown for the Volta River basin. Results from all river basins under both RCPs show that the projected ∆MQ series using flexible baselines lie either in between or outside WMO and IPCC ∆MQ. But, in all cases, they resemble the WMO more than the IPCC ∆MQ series (figures in appendix C), which is an indication that the length of the baseline period also matters. Table 3 shows relative differences of ensemble ∆MQ between WMO, IPCC, and flexible baselines for all river basins around the central years 2040, 2060, and 2080 for RCP 2.6 and table E2 for RCP 8.5. In the Northern Dvina River basin (DVI) in 2040 and 2060 and in the São Francisco River basin (SFC) in 2080, the ensemble mean projects opposing change signals between WMO and IPCC baselines, with absolute differences up to 13.9% under RCP 2.6 and almost 20% under RCP 8.5. Relative differences between WMO and IPCC baselines are lower if the ensemble mean is considered (appendix E), but can be very high for individual models, as was shown in figures 2(a) and (c). As with individual models, the ensemble mean ∆MQ series of the flexible baseline are always more similar to the WMO than to the IPCC ∆MQ series.
The sensitivity (SI ≤ 0.5) to the choice of the baseline period across different climate zones is inconclusive (table 2 and table E2). A larger sample size of catchments from various climate zones is required to make more robust statements. However, the lowest sensitivity was found in warm temperate climates (C) represented by the Rhine, Tagus, and Upper Blue Nile River basins.

Conclusions
This study demonstrates how the choice of a baseline period alone can influence the interpretation of discharge projections in eight river basins using climate input from four bias-adjusted GCMs. To evaluate whether the choice of either the well-established 30-year (1961-1990) WMO [17] or the more recent 20-year (1986-2005) IPCC [20] baseline matters, a similarity index SI was introduced as a measure to compare the two resulting time series of future change. In about 25% of the simulations under RCP 2.6 and in 32% under RCP 8.5, large quantitative differences and/or opposite signals of change were found, with at least one case of major discrepancies in each river basin. The deviations for selected future periods can be so large that they range from −5% to +45% for a given central year. These figures indicate that different recommendations for action could possibly be derived in at least every fourth case.
No systematic differences in the direction of change using either baseline period could be identified. Neither the results based on the WMO nor those based on the IPCC baseline tend to generally project higher or lower future river discharge.

Choosing baseline periods
Given that a baseline period is normally a subset of the historical period, it should represent its basic statistical properties. From a formal statistical perspective, a minimum length of 30 years is highly recommended for regional impact studies, particularly when using integrated variables, such as river discharge. We developed an algorithm, which identifies for each simulation a flexible baseline of variable length and variable start year representing the basic statistical properties of a given historical period. In about 20% of the 32 simulations, the flexible baselines were longer than 30 years, highlighting the importance of longer-term perspectives to more confidently quantify historic reference variability when developing adaptation strategies. The use of flexible baselines helps to reduce uncertainty in the interpretation of model simulations in cases where standard baseline periods do not capture the variability of the historical period. If multiple ranges of uncertainty, such as those implied by the impact modeling cascade and multi-criteria baseline selection, are combined, the central limit theorem implies that central tendencies are favored at the expense of extremes [39].

Regional context
At the local and regional scales, it is important to take region-specific characteristics into account, where other factors that are largely independent of past climate variability may also influence the choice of a representative baseline period, e.g. degree of human impact (land use/cover change, reservoirs, irrigation). In this context, it is reasonable to question whether the baseline period should represent rather natural conditions (far back in time with low human impact) or more recent conditions (with strong human impact). Another reason why the application of standard baseline periods is questionable is that they are often detached from reality. If a baseline period is chosen that, for example, was characterized by severe droughts in reality and future simulations project relatively drier conditions (even though the simulated baseline was above normal), stakeholders may interpret that the future will be drier than the driest period they have experienced in their lives. Using flexible baselines is a solution to better tailor information to the needs of decision makers while addressing the challenge of uncertainty transparently and efficiently.

Ensemble mean versus single model simulations
Generally, the interpretation of results based on model ensembles is less sensitive to the choice of baseline periods than that of single-model simulations. Nevertheless, in three cases, even the ensemble mean using the WMO and IPCC baselines projected opposite change signals in selected future periods.

Outlook
An analysis of results based on monthly or daily time series or a focus on extremes rather than the average might reveal an even higher sensitivity to chosen baseline periods than the annual time series used in this study. An improvement of the algorithm to identify flexible baseline periods, in terms of incorporating more sophisticated statistical parameters and tests, might be necessary if applied to monthly or daily time series.

Acknowledgment
This research was funded in the frame of the CIREG project (https://cireg.pik-potsdam.de/en/) by ERA-NET Co-fund action initiated by JPI Climate, funded by BMBF (DE), FORMAS (SE), BELSPO (BE), and IFD (DK) with co-funding by the European Union's Horizon 2020 Framework Program (Grant 690462).