Reconciling Precipitation with Runoff: Observed Hydrological Change in the Midlatitudes

Century-long observed gridded land precipitation datasets are a cornerstone of hydrometeorological research. But recent work has suggested that observed Northern Hemisphere midlatitude (NHML) land mean precipitation does not show evidence of an expected negative response to mid-twentieth-century aerosol forcing. Utilizing observed river discharges, the observed runoff is calculated and compared with observed land precipitation. The results showa near-zero twentieth-century trend inobservedNHML landmean runoff, in contrast to the significant positive trend in observed NHML land mean precipitation. However, precipitation and runoff share common interannual and decadal variability. An obvious split, or breakpoint, is found in the NHML land mean runoff–precipitation relationship in the 1930s. Using runoff simulated by six land surface models (LSMs), which are driven by the observed precipitation dataset, such breakpoints are absent. These findings support previous hypotheses that inhomogeneities exist in the early-twentieth-century NHML land mean precipitation record. Adjusting the observed precipitation record according to the observed runoff record largely accounts for the departure of the observed precipitation response from that predicted given the real-world aerosol forcing estimate, more than halving the discrepancy from about 6 to around 2Wm. Consideration of complementary observed runoff adds support to the suggestion that NHML-wide early-twentieth-century precipitation observations are unsuitable for climate change studies. The agreement between precipitation and runoff over Europe, however, is excellent, supporting the use of whole-twentieth-century observed precipitation datasets here. Denotes Open Access content. @@ Current affiliation: Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot, United Kingdom. Publisher’s Note: This article was revised on 30 November 2015 to include the CCBY reuse license, which was omitted when originally published. This article is licensed under a Creative Commons Attribution 4.0 license. Corresponding author address: Joe M. Osborne, Laver Building, University of Exeter, North Park Road, Exeter EX4 4QE, United Kingdom. E-mail: j.m.osborne@exeter.ac.uk DECEMBER 2015 O SBORNE ET AL . 2403 DOI: 10.1175/JHM-D-15-0055.1 2015 American Meteorological Society


Introduction
Precipitation and runoff are two intrinsically related components of the land hydrological cycle, but they are often studied in isolation (Xu et al. 2010). Consideration of observed runoff can complement observed precipitation and add extra weight to hypotheses for recent hydrological changes. Runoff observations can also be used to validate observed precipitation changes, allowing for the identification of inhomogeneities that cannot be explained by natural variability or anthropogenic influences on the hydrological cycle.
Recent work by Osborne and Lambert (2014, hereafter OL14) showed that observed Northern Hemisphere midlatitude (NHML) land mean precipitation changes contain no evidence of a response to sharply increasing aerosol emissions in the mid-twentieth century (Lamarque et al. 2010). The NHML region was the main source of these emissions, which led to a well understood decrease in observed precipitation over the remote Northern Hemisphere tropical land region. This pattern is robustly simulated by the latest general circulation models (GCMs; Hwang et al. 2013;Wu et al. 2013). Theory suggests that such a decrease should also be evident locally in observed NHML land mean precipitation. Indeed, GCMs typically simulate this decrease, with the magnitude of the decrease dependent on the strength of NHML aerosol forcing within the model. Conversely, observed NHML land mean precipitation increases rather than decreases. This could be explained by a major misunderstanding of aerosol radiative effects, contradicting a long-established scientific literature (Charlson et al. 1992). Alternatively, it is possible that precipitation in the NHML land region is poorly simulated by the latest models and that they fail to capture important physical processes, such as land-atmosphere coupling . A final possible explanation concerns the quality of existing precipitation observations. Inhomogeneities in widely used gridded observed land precipitation datasets have been extensively documented (Groisman et al. 1991;Legates 1995), with records from the early twentieth century particularly vulnerable to inhomogeneities.
In the absence of impacts of human activity and where natural changes in storage terms, such as soil infiltration and deep percolation, can be neglected-typically true on annual or longer time scales (Walter et al. 2004;Jung et al. 2013)-we can apply a simple water budget equation to a region consisting of one or more river catchments: where P, Q, and ET are precipitation, runoff, and evapotranspiration, respectively. Changes in runoff in many river catchments reflect changes in precipitation (Milliman et al. 2008). However, human activities can influence this water budget by changing the partitioning of precipitation into evapotranspiration and runoff and through changes in water storage, giving where DS is the net change in water storage due to human activities (Adam and Lettenmaier 2008). Dam and reservoir construction can affect seasonal water storage, but they appear to have little effect at annual time scales (Adam et al. 2007;Adam and Lettenmaier 2008). Melting ground ice in permafrost, due to human-induced warming (Bindoff et al. 2013), could lead to decreased water storage and has been invoked in explaining the disparity between decreasing precipitation and increasing runoff in northern Eurasia river catchments (Adam and Lettenmaier 2008). Humans can have a number of direct, but also poorly quantified, influences on the hydrological cycle, affecting the partitioning of precipitation into evapotranspiration and runoff. A major influence is water withdrawal for irrigation, which can have a big impact on runoff at the catchment scale (Milliman et al. 2008;Xu et al. 2010). A fraction of this withdrawal is not returned to runoff, defined as water consumption (Xu et al. 2010;Sterling et al. 2013). Although the significance of irrigation and other water withdrawals at the catchment scale is undisputed, there is less agreement over its influence on global mean runoff (e.g., Gerten et al. 2008;Sterling et al. 2013). Some research suggests that decreases in global runoff resulting from irrigation are cancelled out by increases resulting from changes in land cover or land use (predominantly deforestation; Gordon et al. 2005;Sterling et al. 2013).
Further to these direct influences of human activity are a number of indirect influences. Rising CO 2 can cause a transpiration-reducing plant physiological response and an accompanying increase in runoff (Gerten et al. 2008). Gedney et al. (2006) suggested this effect as the reason for increases in global runoff, as described by Labat et al. (2004), before contradictory subsequent work found no significant change in global runoff over the twentieth century (Milliman et al. 2008;Dai et al. 2009). Consideration of expanding vegetation is also required to quantify the overall CO 2 effect on global runoff in the twentieth century, the sign of which is, again, uncertain. Greenhouse gas-driven global warming (Bindoff et al. 2013) has caused a slight decrease in global runoff over the twentieth century, largely through higher summer evapotranspiration (Gerten et al. 2008), although this is dependent on sufficient moisture supply (Jung et al. 2010). Solar dimming associated with mid-twentieth-century aerosol emissions (Wild 2012) has also been proposed as an indirect human influence on runoff, through forcing a decrease in evapotranspiration (Oliveira et al. 2011;Gedney et al. 2014). The net aerosol impact is to increase runoff, with Gedney et al. (2014) finding a detectable increase in local river flow in response to increased NHML aerosol emissions in the 1960s and 1970s.
Despite the complicating mix of direct and indirect human influences on the hydrological cycle, the dominant driver of changes in runoff is changes in precipitation (Gerten et al. 2008;Milliman et al. 2008), which itself has been affected by human activities (Zhang et al. 2007;Wu et al. 2013;OL14). Gradual changes in the large-scale runoff-precipitation relationship can be expected because of the more transient human influences, such as the CO 2 physiological effect. For example, increases in evapotranspiration on the North American continent have been documented (Milly and Dunne 2001;Walter et al. 2004). But we still expect an approximately linear runoffprecipitation relationship, with sudden shifts in this relationship only at scales at, or close to, the catchment scale, perhaps due to deforestation or major dam construction. To the best of our knowledge, there is no physical explanation behind any such shifts at the scales we are investigating, which would require, for example, a temporally concentrated, continental-wide dam construction scheme. Fekete et al. (2002) suggested that river discharge data, used to calculate catchment runoff, are more accurate than other water cycle components, with errors thought to be within 10%-20%. This is an improvement on typical precipitation errors (Hagemann and Dümenil 1997), with underestimates common because of windinduced undercatch (Legates 1995). Snowfall undercatch can lead to biases in winter higher-midlatitude precipitation totals of 20%-40% (Adam and Lettenmaier 2003;Yang et al. 2005). Rainfall uncertainty is also dependent on the spatial scale being studied, but scale is not as important for river discharge uncertainty because of the integrated nature of measurement (McMillan et al. 2012).
At the large scale, such as the NHML land mean, different precipitation datasets share more common variability and trends (Polson et al. 2013;OL14). Therefore, any major inhomogeneities that still exist must be common across all gridded precipitation datasets. Such inhomogeneities can be a consequence of changes in measurement practice, instrument changes, and the implementation of wetting corrections (Groisman and Rankova 2001). Producing homogeneous records relies on quality observations of other meteorological variables (e.g., temperature, wind speed, and precipitation type) and metainformation about individual stations (New et al. 2000;Groisman and Rankova 2001). These are both difficult to obtain, particularly in the early twentieth century. As such, traditional approaches to correcting observations may prove fruitless. The framework presented in OL14 offers a basis for discovering inhomogeneities given the expected response to best estimates of real-world climate forcing. Yet there is still significant uncertainty in estimates of twentieth-century aerosol forcing (Myhre et al. 2013) and a still-developing physical understanding of the aerosol effect on precipitation, particularly with regards to complex aerosol-cloud interactions (Boucher et al. 2013). Here, we develop and strengthen this physical framework by also considering observed NHML mean runoff.
Using the conclusions of OL14, we hypothesize that inhomogeneities in observed NHML land mean precipitation make the early-twentieth-century data unsuitable for climate change studies. Observed NHML mean runoff, as far as we are aware, provides the only valid test of this hypothesis. This test relies on two key assumptions: 1) that runoff observations are immune to inhomogeneities of the type and magnitude common in precipitation observations (particularly those due to changes in spatial coverage), and 2) that in the absence of inhomogeneities the runoff-precipitation relationship is approximately linear. We use a statistical model to search for sudden shifts, which we define as breakpoints, in the NHML land mean runoff-precipitation relationship, also repeating this approach at the continental scale to further pinpoint where any inhomogeneities in observed precipitation may exist. We also analyze NHML land mean runoff in six land surface models (LSMs) that form part of the Trends in Net Land-Atmosphere Carbon Exchanges (TRENDY) land surface model intercomparison project (Sitch et al. 2015). These models were all driven by observed NHML land mean precipitation and so offer the opportunity to search for breakpoints in the runoffprecipitation relationship that have arisen through a simulated physical process, thus serving to test the second assumption above. Observed NHML land mean precipitation is then adjusted using its runoff counterpart and the implications for our understanding of the response of NHML land mean precipitation to midtwentieth-century aerosol forcing is discussed.
Section 2 details the data used; section 3 considers the observed and modeled runoff-precipitation relationship, including the approach to testing for breakpoints; and section 4 describes the adjustment applied to observed NHML land mean precipitation made on the basis of our findings. Section 5 discusses possible reasons for a breakpoint in the runoff-precipitation relationship, and in section 6 we discuss our conclusions for the effect of aerosols on NHML land mean precipitation.

a. Observed NHML land mean runoff
We use the Dai et al. (2009)  streamflow, is the temporally lagged, spatial integral of runoff over a river basin (Milly et al. 2005). It is possible to derive estimates of runoff for a catchment by dividing river discharge at a gauging station by the upstream catchment area (Fekete et al. 2002). Dai et al. (2009) produced a temporally complete streamflow dataset for 925 of the world's largest ocean-reaching rivers for 1948-2004, by infilling data gaps through considering nearby streamflow data, where possible. Any remaining data gaps were reconstructed by using a hydrological model and the correlation between precipitation and streamflow. Because we want to contrast mid-twentieth-century aerosoldriven hydrological changes with overall twentieth-century changes, we instead use the temporally incomplete, observations-only, dataset that extends back to 1900. We follow OL14 and use 1905-2004 as our twentiethcentury period, but instead use the water year, October-September, with the first month contributing toward our period being October 1904. This minimizes the lag effect associated with early winter snowfall being locked within a catchment until the spring melt of the following calendar year (Dai et al. 2009). Owing to this lag effect, we calculate mean water-year runoff for a given catchment only where all 12 months' worth of data are available. Further, the streamflow records are conditioned so that rivers must contain a minimum record length of 20 years in the 30-yr 1961-90 period, to which data are anomalized to throughout this work. In OL14, the NHML region is defined as the latitude band between 308 and 658N. We stipulate that .80% of the catchment area upstream of the farthest downstream gauging station must fall within this band for a river to contribute. This leaves us with a total of 148 rivers. Figure 1 shows the spatial coverage and record length of the 148 rivers.

b. Observed NHML land mean precipitation
In OL14, four gridded observational precipitation datasets were used in total. We ignored spatially interpolated versions of two of the datasets, which offer complete global terrestrial coverage. Here, such a product is necessary to allow for direct comparison with the spatial and temporal coverage of observed NHML land mean runoff. To ensure an accurate comparison between precipitation and runoff in each catchment, we produce high-resolution catchment masks on a 0.58 3 0.58 grid to match that of the precipitation dataset used. We select the latest Climatic Research Unit Time Series, version 3.21 (CRU TS3.21), high-resolution precipitation dataset (Harris et al. 2014), which has been modified for use in driving the TRENDY models. This modification merged the monthly climatology of CRU TS3.21 with the National Centers for Environmental Prediction (NCEP)-National Center for Atmospheric Research (NCAR) reanalyses (from 1948 to present) dataset (Kalnay et al. 1996), bilinearly interpolated to the CRU TS3.21 resolution, to generate the diurnal and daily variability required to drive the TRENDY models at 6-hourly time steps [see Fisher et al. (2013) for further details]. The merged dataset (CRU-NCEP, hereafter CRUNCEP) offers a finer temporal resolution for modeling purposes, but monthly precipitation totals remain the same. Figure 2 shows the twentieth-century NHML annual spatial coverage for both runoff and the precipitation (with and without the time-varying runoff mask applied) station network. During some later decades, runoff observations offer roughly a fivefold increase in coverage over precipitation. This is, of course, an overestimate for precipitation, since gauges only provide a point measurement and many grid boxes are poorly gauged, especially in mountainous regions where precipitation is highly variable (Adam et al. 2006).

c. TRENDY models
We use an ensemble of six LSMs, summarized in Table 1. The models were forced over the historical period with CRUNCEP precipitation, as well as other observed climate variables that were derived from merging CRU and NCEP data and changing CO 2 concentrations (Sitch et al. 2015;Le Quéré et al. 2014). These runs were also driven by historical land-use changes, calculated from the History Database of the Global Environment (HYDE; Klein Goldewijk and Verburg 2013). Although Yang et al. (2015) report some sensitivity of runoff trend in TRENDY models to landuse change at the catchment scale, our conclusions are insensitive to the inclusion or exclusion of the historical land-use dataset. Simulated runoff is available as output at a monthly frequency. Model output not already at 0.58 3 0.58 resolution is regridded using bilinear interpolation. Simulated runoff is then masked to the varying spatial and temporal coverage of the runoff (streamflow) observations.

d. CMIP5 models
We compare observed NHML land mean precipitation with that modeled by 15 models participating in phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al. 2012). In contrast to OL14, we have omitted GCMs with a historical (all forcings) experiment ensemble size of just one, as well as the two GISS GCMs, from our analyses. OL14 found the GISS models to have anomalously strong NHML surface longwave forcing given their NHML surface aerosol forcing. Therefore, shortwave-induced surface cooling is counteracted by significant longwave-induced surface warming. Also, it has been subsequently noted that GISS-E2-R does not produce the interhemispheric forcing asymmetry expected from its strongly negative NHML surface aerosol forcing because it overestimates negative Southern Hemisphere nitrate aerosol and ozone forcing (Shindell 2014). This clearly influenced the results presented by OL14. Simulated precipitation in the CMIP5 models is also regridded to 0.58 3 0.58 resolution and masked to the availability of runoff observations.
Using these data we calculate the precipitation response to aerosol forcing, which we call the offset (see section 2e), and plot this against the strength of NHML surface aerosol forcing-calculated using the Forster et al. (2013) linear forcing-feedback model and using shortwave forcing as a proxy for total aerosol forcing-for each of the 15 CMIP5 models, as well as the observations (an estimate of real-world-or IPCC-equivalent NHML surface aerosol forcing is used for the observations; see OL14). Figure 3a shows this updated version of Fig. 3c from OL14. Overall, there is little difference between the plots, with the stark contrast between the observed precipitation response and the modeled precipitation responses still evident. Removing this mask and instead considering all grid boxes in the CRUNCEP dataset between 308 and 658N, we see an even more robust relationship between the CMIP5 precipitation response to aerosol forcing and the CMIP5 NHML surface aerosol forcing (Fig. 3b). This suggests that masking the precipitation dataset to the runoff observations weakens this robust signal, but it remains significant.

e. NHML land mean precipitation offset
In OL14 it is shown that the NHML land mean precipitation response to aerosol is stronger than that to greenhouse gases in GCMs, but the global mean temperature response to both is similar. Therefore, with a step in aerosol emissions around the 1960s (Lamarque et al. 2010), a simple linear regression approach is used to model the NHML land mean precipitation-global mean temperature relationship as a greenhouse gasdriven straight line offset from pre-1960 data by a mid-twentieth-century increase in aerosol forcing. To calculate the size of the NHML land mean precipitation offset b 0 , we fit the following model to the data: where P 60204 are 5-yr mean NHML land mean precipitation anomalies for the 1960-2004 period with respect to the 1905-59 mean NHML land mean precipitation anomaly and T 60204 are 5-yr mean global mean temperature anomalies for the 1960-2004 period with respect to the 1905-59 mean global mean temperature anomaly. The intercept term b 0 is the NHML land mean precipitation offset, a measure of the precipitation response to aerosol forcing, and b 1 is the slope from the linear regression fit to 1960-2004 five-year means. The residuals « 60204 are assumed to be identically and independently distributed with zero mean. This model is fitted to the observations and each of the CMIP5 models in turn. Further details on this offset and physical mechanisms behind it are discussed in OL14 (and references therein).

Runoff-precipitation relationship
Ideally, we would be able to reproduce Fig. 3c of OL14 using just NHML land mean runoff, with an expectation that interannual, decadal, and multidecadal changes in runoff should match those in precipitation. However, we find the representation of runoff in CMIP5 models to be suboptimal, with the runoff-precipitation FIG. 3. NHML land mean precipitation offset against NHML surface aerosol forcing. Error bars represent the 5%-95% uncertainty range, with real-world-or IPCC-equivalent NHML surface aerosol forcing and the associated 5%-95% uncertainty range used for the observed data point. (a) CRUNCEP masked to the availability of runoff observations and (b) all CRUNCEP grid boxes between 308 and 658N. The value next to the error bar in (a) indicates where the uncertainty extends beyond the scale. There is a significant positive correlation between precipitation offset and aerosol forcing across the 15 CMIP5 GCMs in both (a) and (b) (r 5 0.54, p , 0.05 and r 5 0.87, p , 0.01, respectively). For further details on the methods used to calculate the offsets, aerosol forcings, and associated uncertainty ranges, refer to OL14. relationship frequently nonlinear. Bias in the simulated precipitation climatology leads to bias in aridity and, in turn, the runoff coefficient (runoff divided by precipitation), in line with the Budyko framework (Budyko 1974). Runoff outputs from GCMs have also been described as biased because of the representation of runoff processes (Hirabayashi et al. 2013). This may be a consequence of low resolution or insufficient subgrid parameterization (Sperna Weiland et al. 2012). In some GCMs, runoff may be simply advected straight to the ocean, meaning that runoff is poorly represented in larger river catchments because time lags between runoff generation and return to the ocean are not considered (Falloon et al. 2011;Sperna Weiland et al. 2012). Because of the coarse resolution and reduced complexity of land surface schemes in GCMs relative to offline LSMs, changes in runoff are often investigated using direct GCM runoff output integrated to river discharges using a river routing scheme (Milly et al. 2005;Hirabayashi et al. 2013), or through using observed precipitation or GCM precipitation output to drive LSMs (Gerten et al. 2008;Schewe et al. 2014), as we have done with the TRENDY models.
Although we consider direct runoff output, rather than routed discharge, the six TRENDY LSMs all simulate NHML land mean runoff that is strongly correlated with observed NHML land mean precipitation used to drive the experiments (Fig. 4). The twentiethcentury observed CRUNCEP precipitation trend is found to be positive and significant, consistent with the four precipitation datasets used in OL14 (Table 2), which use only actual observations (no spatial interpolation) at a coarser resolution and are not constrained to the availability of runoff observations. Trends in NHML land mean precipitation are robust to the choice of dataset, some of which favor fewer longterm homogenized records over more short-term records, infilling of data through interpolation and spatial coverage. Interestingly, the runoff trend is positive and significant in all six models, but in each case less than the trend in observed CRUNCEP precipitation (Table 3). With the increase in precipitation not being matched by an increase in runoff, it would appear that the LSMs are all able to capture the already mentioned, welldocumented increase in evapotranspiration (Milly and Dunne 2001;Walter et al. 2004).
Observed NHML land mean runoff shows a near-zero trend over the twentieth century (Table 2)  annual mean precipitation (r 5 0.62, p , 0.01). However, inspection of the NHML land mean runoff and precipitation time series appears to show a split in the relationship around the 1940s (Fig. 5). Simply adjusting the pre-1960 precipitation, as in OL14, to the pre-1960 runoff, using the linear regression fit of post-1960 fiveyear mean NHML land mean runoff against five-year NHML land mean precipitation, we find that the positive trend in precipitation is reduced and becomes nonsignificant. Arbitrarily selecting 1940 as a breakpoint in the runoff-precipitation relationship, we plot 5-yr mean NHML land mean runoff anomalies against 5-yr mean NHML land mean precipitation anomalies, split into a pre-1940 group and a post-1940 group (Fig. 6). On first impression there would appear to be a clear breakpoint, with two lines with similar slopes but different intercepts.
Determining whether a breakpoint is a real feature of this relationship requires a statistical test. Testing for breakpoints, also referred to as inhomogeneities (Peterson et al. 1998), change points (Reeves et al. 2007), and step changes (McCabe and Wolock 2002), in hydrometeorological variables is a long-established practice. Many different breakpoint detection methods are available and they vary in the number of breakpoints they can detect, the assumptions made on data distribution, and whether the timing of the breakpoint is known before testing [see Reeves et al. (2007) and Ferguson and Villarini (2012) for more in depth discussions]. Our approach differs from many of these studies in that we want to detect breakpoints in the relationship between two intrinsically linked hydrometeorological variables, rather than the time series of just one variable.
Therefore, we test for a breakpoint by fitting a model with four parameters, representing two lines with unique slopes and intercepts: where Q are the 5-yr mean NHML land mean runoff anomalies, P are the 5-yr mean NHML land mean precipitation anomalies, and z is a binary variable coded 0 for data points in the first time segment and coded 1 for data points in the second time segment. The b coefficients are parameters to be estimated. The residuals « are assumed to be identically and independently distributed with zero mean. By varying the length of the two time segments, it is possible to test for breakpoints at a number of 5-yr  intervals. The minimum length of each time segment, or group, is set at 25 years (a sample size of five). Therefore, we search for breakpoints between 1930 and 1980 inclusive. An analysis of variance (ANOVA) F test is used to determine whether Eq. (4), at any given division, offers a significantly better fit to the data than a simple one slope, one intercept model: However, because we are performing multiple tests-we perform 11 tests at the p 5 0.05 level-there is a probability greater than 0.05 of finding at least one breakpoint. Because there is a probability greater than 0.05 that evidence against the null hypothesis [that Eq. (5) is the best fit] appears by chance, the significance level at which each of the 11 tests is performed requires adjustment to reduce the probability of a type I error back to 0.05. To investigate this problem further, we simulate pseudo data, creating 10 000 datasets of runoff and precipitation anomalies. Random data for both precipitation and runoff are generated with a normal distribution with mean 0 and variance 1 (the variance is assumed to be constant between time segments). The probability of finding at least one breakpoint is 0.213. We find that performing each of the 11 tests at the p 5 0.0078 level gives a probability of finding at least one breakpoint of 0.05. Therefore, if the F statistic at any of the 11 intervals is significant at the 0.78% level, we proceed with fitting the more complex model [Eq. (4)]. If b 3 in Eq. (4), known as the interaction parameter, is not significant ( p . 0.05) we can assume b 3 5 0 and fit a more parsimonious, additive model used for separate regression lines with the same slopes. However, we might expect some change in slope due to the more transient, indirect human influences discussed that could be prevalent at such large scales (e.g., Walter et al. 2004;Gedney et al. 2014). To avoid including the effect of slope changes in the breakpoint estimate, we take b 2 from Eq. (6) to be the breakpoint magnitude. The difference in b 2 values from Eqs. (4) and (6) is due to slope changes between the two time segments. This residual is typically small and our conclusions are not dependent on the choice of either Eq. (4) or (6). Although there could be more than one inhomogeneity in the observed precipitation and runoff records, the assumption is made that only one major breakpoint exists in the runoff-precipitation relationship. Further, this approach requires the assumption that any major inhomogeneity occurs on a time scale of years (Groisman and Rankova 2001) rather than decades. Where possible breakpoints are identified at more than one time division, the breakpoint is chosen as the model fit that minimizes the residual sum of squares (RSS). A possible negative breakpoint in the observed NHML land mean runoff-precipitation relationship is identified at three time divisions-1930, 1935, and 1940-with the model fit to the 1930 breakpoint found to minimize the RSS (Table 4). Because this is the earliest interval at which a breakpoint can be detected using 5-yr means, the analysis is repeated using 3-yr means. Using 5-yr means is common when analyzing changes or trends in precipitation and in detection and attribution studies (Min et al. 2011;Wu et al. 2013; OL14) as it reduces the signal to noise and avoids problems with autocorrelation. The autocorrelation function for both the observed runoff and precipitation decays toward zero at a time lag of 3 yr (not shown). Repeating the analysis for 3-yr means-offering a detection range between 1920 and 1990-gives near-identical results with the breakpoint identified in 1932 (accounting for the multiple tests problem). This suggests the result is robust. Because of the highly variable nature of precipitation, at smaller  scales-continental scale and below-we only consider 5-yr means in further analyses to improve the signal to noise. With a breakpoint detected, the approach is extended to determine where geographically the breakpoint occurs. While downscaling to the catchment scale is limited by temporal coverage and poor signal to noise, downscaling to the continental scale should offer more insight (Fig. 7). The continental-scale analysis points toward North America as the likely source of the breakpoint (Fig. 7a). A North American breakpoint is detected in 1935. The agreement between observed Europe land mean runoff and precipitation is excellent and the relationship is free of a breakpoint (Fig. 7b). The Asia land mean runoff and precipitation time series, which include all Russian river catchments (Fig. 7c), suggest that a breakpoint might exist there in the first 30-40 years. Although our approach fails to detect one, reducing the minimum time segment length to 20 years, instead of 25 years, detects a breakpoint in 1925. The first ;25 years of the Asia time series are dominated by the Amur River, a catchment with particularly poor precipitation station spatial coverage at that time. This will undoubtedly contribute toward the 1930 breakpoint found in the NHML land mean runoff-precipitation relationship. We address the role of the Amur River and poor spatial coverage of the early-twentieth-century rain gauge network in section 4. Beyond about 1930, the agreement between precipitation and runoff in Asia is reasonable. However, we also note a contrast between decreasing precipitation and increasing runoff between the 1950s and 1980s, as found in other studies (Adam and Lettenmaier 2008;Milliman et al. 2008). The breakpoint model is also applied to the TRENDY LSM NHML land mean runoff-precipitation relationships. Breakpoints are detected in two models, CLM4.5 and JULES, both in 1950. However, the breakpoints are of opposite sign (positive) to the observed breakpoint (negative) and are about half the size. Also, using the multimodel mean NHML land mean runoff no breakpoint is detected. The CLM4.5 and JULES simulations here have the lowest resolution of the six TRENDY models. Previous work has shown that a model performs much better in simulating the hydrological balance of various European regions with increased resolution (Elguindi et al. 2011). Investigating the differences between observed and simulated runoff spatially, the North American continent again stands out (not shown).

Adjusting observed precipitation
Having detected a 1930 breakpoint in the observed NHML land mean runoff-precipitation relationship, we now investigate the effect of adjusting the precipitation time series to bring it in line with runoff. This gives a hypothesized time series of early-twentieth-century precipitation based on the assumptions laid out in section 1. It is not to be interpreted as a true precipitation time series-the adjustment is only a crude estimate. However, it is instructive to see the extent to which the adjusted series is consistent with the expected precipitation response to aerosol forcing. The adjustment technique preserves the more subtle changes in the runoffprecipitation relationship, so that the slopes of the two lines in Eq. (4) are unchanged. Instead we simply increase pre-1930 precipitation so that b 2 5 0 in Eq. (6). The effect of this is to reduce the positive trend in twentiethcentury precipitation from 44.8 6 12.6 to 7.2 6 13.6 mm yr 21 century 21 (Fig. 8). Reproducing Fig. 3 with adjusted observed precipitation, the observed NHML land mean precipitation offset of 0.002 6 0.021 mm day 21 is equivalent to an aerosol forcing of 20.1 6 2.0 W m 22 , calculated using the NHML land mean precipitation offset-aerosol forcing regression coefficients (Fig. 9). Prior to this precipitation adjustment, an offset of 0.044 6 0.021 mm day 21 gave an estimate of aerosol forcing of 4.0 6 2.0 W m 22 .
This adjusted precipitation gives an offset consistent with negative aerosol forcing. The difference between the best estimate of real-world NHML surface aerosol forcing of 22.0 (from 23.3 to 20.9) W m 22 and that suggested by the precipitation offset has reduced from 6.0 to 1.9 W m 22 . We find quantitatively similar results using the simple technique outlined in section 3 (Fig. 5), where we simply adjust the entire precipitation time series given the post-1960 runoff-precipitation relationship. This reduces the discrepancy from 6.0 to 1.7 W m 22 . These results still somewhat contradict similar analyses performed for observed Northern Hemisphere tropical land mean precipitation and observed temperature gradient (between the NHML region and Southern Hemisphere extratropical region), which both show an aerosol response in agreement with expectations (see OL14).

Explaining the breakpoint
Meteorological datasets will still contain inherent inhomogeneities, owing to their changing spatial coverage in time, in addition to the various systematic errors listed above. These inhomogeneities and biases may be shared between different gridded precipitation datasets, as different groups use many common records. From Figs. 1, 2, and 7, it can be seen that the annual NHML land area coverage increases around the time that the breakpoint is found. This is when the three great Arctic Ocean draining Eurasian rivers (Ob, Yenisey, and Lena) and the Mississippi River begin contributing to NHML mean runoff. A near-constant spatial mask between 1905 and 2004 can be considered by selecting only river catchments with discharge records at least 80 years or longer (catchments with black borders in Fig. 1). Any missing years for rivers with long discharge records are not noticeably clustered at the beginning of the twentieth century and are instead most obvious post-1990 (not shown). Because we are working with a spatially interpolated gridded precipitation dataset, the true spatial coverage of the underlying station network will still vary somewhat. This approach greatly reduces the spatial coverage but the breakpoint remains a robust feature, although 1935 is now the most likely timing (not shown). Changing spatial coverage in time could possibly lead to a breakpoint as different climatic regions are incorporated into the NHML mean. This is because a certain change in precipitation should not be expected to produce the same change in runoff in an arid region as in a humid region (Budyko 1974). However, the TRENDY LSMs realistically represent the range in the runoff coefficient across different climates, but do not contain breakpoints of the sign or magnitude of the observed breakpoint.
In the introduction, we dismissed the impact of dam and reservoir construction on the NHML mean runoff time series. At the catchment level, the influence of flow regulation (the ratio of total reservoir storage capacity to mean annual discharge) could be important. Indeed, Vörösmarty et al. (1997) showed that 20%-30% of global mean annual discharge was now stored in reservoirs constructed in the late twentieth century. Water impoundments can increase evaporative losses and therefore decrease runoff. Evaporative losses have been shown to have an effect on individual drainage basins (e.g., Vörösmarty and Moore 1991), but appear to be insignificant at continental scales (Vörösmarty et al. 1997;McClelland et al. 2004). Schwalm et al. (2014) recently found that including a representation of water management structures in terrestrial biosphere models made little difference to simulated runoff over the United States. Of course, the filling of a large reservoir following the construction of a dam causes an abrupt decrease in catchment runoff. This could lead to an anomalous annual mean runoff value. But following filling, streamflow should return to predam levels, so such impoundments are unlikely to cause breakpoints.
Dam and reservoir construction allows for easier access to water resources and typically leads to more water withdrawal, often for irrigation. Water withdrawn from reservoirs for irrigation can cause substantial decreases in runoff because of enhanced evapotranspiration. Irrigation is likely to account for a significant fraction of the stark runoff decreases observed in the Yellow River catchment in China since the 1960s (Xu et al. 2010). The Yellow River is one of the 148 included in our analyses. To determine what effect highly irrigated catchments have on our results, we take irrigation FIG. 9. NHML land mean precipitation offset against NHML surface aerosol forcing. This is an updated version of Fig. 3, but the adjusted precipitation record, given a 1930 breakpoint in the runoff-precipitation relationship, is used to calculate the observed NHML land mean precipitation offset.
index values [the ratio of area equipped for irrigation, from Siebert et al. (2005), to naturalized discharge (discharge prior to direct human influences)] from the Nilsson et al. (2005) study. Values are available for 42 out of 52 large rivers considered here with an upstream catchment area of greater than 50 000 km 2 . These 42 large rivers account for 88% of the total area of all 148 river catchments. As found by Milliman et al. (2008), irrigation and flow regulation indices [the latter are also taken from the Nilsson et al. (2005) study] are strongly correlated (not shown). We find a total of eight rivers (Sacramento, Dnepr, Yellow, Don, Ebro, Colorado, Tagus, and Liao) that exceed both an irrigation index of 200 and a flow regulation of 20, defined by Milliman et al. (2008) as thresholds above which deficit rivers (rivers where changes in runoff are considerably less than changes in precipitation) typically fall. Removing these rivers from the study and considering only the remaining 34 does not change our conclusions. We still find a significant breakpoint, although a 1935 breakpoint provides a marginally better fit to the data than a 1930 breakpoint.
A subject worthy of consideration is the quality of the interpolated CRUNCEP precipitation observations. While complete spatial coverage is desirable, it is important to consider the true spatial coverage of the underlying station network. The area represented per station within a catchment, defined as the area per station (APS), is much greater across the Asian continent than the North American and European continents (Table 5), particularly so in the 1905-29 period. However, as already discussed, this is also a time period when spatial coverage of runoff observations across the Asian continent was largely limited to the Amur River, the fifth-largest river of the 148 in terms of catchment area and ranking fifth in terms of 1905-2004 mean annual streamflow. There are no available streamflow observations for the three great Arctic Ocean-draining Eurasian rivers (Ob, Yenisey, andLena) between 1905 and1929, when their APS values were extremely high. As such, Asia mean APS values are noticeably different in this period depending on whether a fixed temporal mask of all rivers is used or whether precipitation is masked to the availability of runoff observations. When masked to the availability of runoff observations, the Amur catchment has the worst mean annual APS of all rivers. The poor APS of the Amur River means that Asia land mean precipitation is likely to be more vulnerable to inhomogeneities in the pre-1930 period.
Although the APS might explain the Asian continent contribution toward the 1930 NHML land mean runoffprecipitation breakpoint, it has less merit in explaining the more prominent 1935 North American breakpoint.
Further, the North American and European APS values are comparable, yet there is good agreement between observed precipitation and runoff for Europe. Up until 1935, the St. Lawrence and Columbia River catchments account for much of the spatial coverage of North American runoff observations. The St. Lawrence and Columbia are the fourth-and fifth-largest river catchments of the 72 in North America, respectively, with the former draining into the Atlantic and the latter draining into the Pacific. Combined aggregated Pacific and Atlantic draining river catchments contain a runoffprecipitation breakpoint in 1935, although the b 1 and b 2 coefficients in Eq. (4) are not significant ( p 5 0.25 and p 5 0.29, respectively). No breakpoint is found when considering the aggregated Gulf of Mexico draining rivers (dominated by the Mississippi catchment). While attributing the exact location and cause behind a breakpoint in the runoff-precipitation relationship is beyond the scope of this study, this does help narrow the focus of future work into precipitation inhomogeneities.
The 1930s saw a notable hydroclimatic event across North America in the form of the Dust Bowl, a period characterized by low precipitation amounts and high temperatures (e.g., Schubert et al. 2004). This precipitation deficit period is seemingly captured by the North America precipitation time series (Fig. 7a) and in a hypothetical time series of observed North America land mean precipitation (following the methodology of section 4; not shown). The LSMs-driven by observed precipitation-do not appear to simulate this breakpoint. A breakpoint is detected in one model (LPX) in 1940, which is of the same sign as the observed breakpoint but roughly half the magnitude. Interestingly, this is no longer detected when considering only river catchments with discharge records at least 80 years or longer (as above), implying some sensitivity to spatial coverage. Overall, the LSMs do not support a real change in the runoff-precipitation relationship over the North American continent around this time.

Summary and conclusions
In OL14, observed NHML land mean precipitation was found to be missing an expected response (Wu et al. 2013) to well-documented mid-twentieth-century aerosol forcing (Lamarque et al. 2010;Wild 2012). This is despite a robust signal across CMIP5 GCMs, with the magnitude of the response dependent on the strength of aerosol forcing. It was suggested that inhomogeneities in the precipitation record could be part of the reason for this disagreement, with some studies questioning the quality of precipitation observations, particularly in the first half of the twentieth century (Groisman and Rankova 2001). Here, we have attempted to reconcile these precipitation observations with complementary runoff observations, which represent a key land surface component of the hydrological cycle.
Variations in interannual, decadal, and multidecadal runoff are known to be predominantly forced by variations in precipitation (Dai et al. 2009). Occasionally, long-term trends in runoff are shown to differ from longterm trends in precipitation (Milliman et al. 2008). This is particularly prevalent at the catchment scale, where direct human influences, such as water withdrawals for irrigation (Milliman et al. 2008;Xu et al. 2010) and landcover or land-use change (Piao et al. 2007;Sterling et al. 2013) can have significant impacts. At larger scales, differences in runoff and precipitation trends can be attributed to changes in evapotranspiration (Milly and Dunne 2001;Walter et al. 2004), perhaps because of increasing temperatures due to rising greenhouse gas concentrations (Huntington 2008;Krakauer and Fung 2008) and augmented, at least locally, by aerosolinduced dimming (Gedney et al. 2014). Consideration of the plant physiological response to rising CO 2 levels is also necessary (Gedney et al. 2006). However, while these influences may have an effect on the runoffprecipitation relationship in the NHML land mean, they are not expected to cause breakpoints.
Testing for breakpoints in the NHML land mean runoff-precipitation relationship, three potential candidates at 5-yr intervals were found (1930, 1935, and 1940), with a negative breakpoint in 1930 found to provide the best fit to the data. Runoff anomalies pre-1940 are noticeably greater than precipitation anomalies. This lends support to the hypothesis suggested by OL14 that pre-1960 precipitation anomalies should be greater. They found that an increase of 2.9% in monthly total precipitation before 1960 in the CRU TS3.21 dataset would generate a negative NHML land mean precipitation offset consistent with estimated aerosol forcing. Our results do not support an increase in precipitation totals across the entire pre-1960 period or the magnitude of the increase suggested. However, adjusting the precipitation record before the 1930 breakpoint using the runoff record, we do account for some of the discrepancy between the (lack of) observed precipitation aerosol response and that suggested given our most recent estimates of real-world aerosol forcing (Myhre et al. 2013). Importantly, this adjustment does allow the precipitation aerosol offset to be negative and means that the range of aerosol forcing suggested by this response is within the real-world aerosol forcing range.
Using six LSMs as part of the TRENDY intercomparison project (Sitch et al. 2015), it is possible to show that this breakpoint is unlikely to be a consequence of known physical processes. Driven by observed precipitation, four of six models show no breakpoint, whereas two show small, positive breakpoints in 1950.
Although not nearly as apparent as the breakpoint in the observed runoff-precipitation relationship, and perhaps just a consequence of the choice of statistical model, this warrants further attention. However, we conclude that the modeled runoff-precipitation relationship is free of breakpoints of the type and magnitude found in observations. Perhaps the most obvious shortcoming in this technique of using observed precipitation to drive LSMs is that land surface feedbacks on the atmosphere are neglected, because of the absence of coupling (Harding et al. 2014;Schewe et al. 2014), limiting the potential accuracy of results. Future efforts to incorporate more complex land surface schemes into GCMs should improve our understanding of the importance of such interactions. The LSMs used here also do not include a representation of direct human influences, such as irrigation, with the exception of historical land-use changes, although we argue in section 1 that these influences are unlikely to cause breakpoints at continental scales.
A continental-scale search for breakpoints found that the North American continent was the likely source for the observed NHML-wide breakpoint. Changes in measurement practice and gauge type are just two of many explanations for such an inhomogeneity in precipitation records (Groisman and Rankova 2001). We have assumed that runoff records are less susceptible to inhomogeneities (Hagemann and Dümenil 1997), but some will undoubtedly exist and should also be investigated in much the same way as precipitation records. In addition, it would be unwise to wholly discount the influence of dams, with dam construction in the Columbia River catchment, for example, beginning in the 1930s, notably with the Bonneville Dam. Although most work suggests that the influence of dams on annual mean discharge at large scales is minimal (Vörösmarty et al. 1997;McClelland et al. 2004), the St. Lawrence and Columbia River catchments deserve particular attention as their runoff and precipitation observations contribute significantly to the pre-1935 North American land means. Although the LSMs do not simulate this North American breakpoint, there remains the possibility that the unique hydroclimatological precursors that lead to the 1930s Dust Bowl event are not well captured by other meteorological variables used to drive the models.
Further, we stress some shortcomings in our methodology. First, not all precipitation inhomogeneities can be expected to appear as an obvious discontinuity, with some biases likely to emerge slowly over time, perhaps because of changes in vegetation near to a rain gauge (Legates 1995). Also, we have used just one statistical model to test for breakpoints. It would be useful to see how robust our findings are to different techniques, while also considering the runoff and precipitation records independently. Problems associated with using spatially interpolated data have also been highlighted. The Amur River catchment was identified as a region with a sparse precipitation station network in the early twentieth century. Therefore, a direct comparison between runoff and precipitation here would contrast the complete spatial coverage of runoff (inherent in streamflow measurements) with just a few point measurements of precipitation.
We are also unable to provide a robust explanation for a 1930s breakpoint. One of the more widely documented systematic changes in precipitation measurement occurred in the former USSR. In 1966-67 a wetting correction was introduced that added 5%-15% to annual totals across the former USSR. In the far north of Russia in winter months, totals increased by up to 30% (Groisman et al. 1991;Groisman and Rankova 2001), but no such systematic changes are documented in the 1930s. Instead, it might be the case that the spatial coverage of early-twentieth-century observations, combined with the underrepresentation of, say, mountainous regions, leaves NHML land mean precipitation observations prone to inhomogeneities during this period. Without targeted studies, however, this remains speculative, but we believe we have here eliminated some possibilities behind such a breakpoint (section 5), including the influence of dams and reservoirs, as well as irrigation. This is to be expected since the 1930s preceded major direct human influences on the land surface component of the hydrological cycle.
Although changes in runoff closely track changes in precipitation, the relationship between these two variables is complicated. For example, recent research has shown that a precipitation shift from snow to rain may lead to a decrease in runoff (Berghuijs et al. 2014), and this may play a role in explaining the steady evapotranspiration increases seen over the NHML region and particularly the North American continent. Regardless, by incorporating runoff into the precipitationaerosol physical framework of OL14, we have found more evidence suggesting NHML land mean precipitation observations to be in error. Early-twentieth-century precipitation observations typically do not lend themselves well to robust climate change research, although the European continent is a notable exception. Here, the runoff-precipitation agreement is excellent. But even after incorporating runoff into this physical framework, the discrepancy between observed and modeled precipitation over the NHML land region cannot be wholly accounted for. So, while inhomogeneities are likely to exist in precipitation records, we conclude that modelsimulated precipitation response to aerosol forcing should still be questioned. thanked for reviews that helped improve the manuscript. We acknowledge the World Climate Research Programme's Working Group on Coupled Modeling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Table 1