Ranking of tree-ring based hydroclimate reconstructions of the past millennium

Abstract To place recent hydroclimate changes, including drought occurrences, in a long-term historical context, tree-ring records serve as an important natural archive. Here, we evaluate 46 millennium-long tree-ring based hydroclimate reconstructions for their Data Homogeneity, Sample Replication, Growth Coherence, Chronology Development, and Climate Signal based on criteria published by Esper et al. (2016) to assess tree-ring based temperature reconstructions. The compilation of 46 individually calibrated site reconstructions includes 37 different tree species and stems from North America (n = 29), Asia (n = 10), Europe (n = 5), northern Africa (n = 1) and southern South America (n = 1). For each criterion, the individual reconstructions were ranked in four groups, and results showed that no reconstruction scores highest or lowest for all analyzed parameters. We find no geographical differences in the overall ranking, but reconstructions from arid and semi-arid environments tend to score highest. A strong and stable hydroclimate signal is found to be of greater importance than a long calibration period. The most challenging trade-off identified is between high continuous sample replication and a well-mixed age class distribution over time on the one hand, and good internal growth coherence on the other. Unlike temperature reconstructions, a high proportion of the hydroclimate reconstructions are produced using individual series detrending methods that remove centennial-scale variability. By providing a quantitative and objective evaluation of all available tree-ring based hydroclimate reconstructions we hope to boost future improvements in the development of such records and provide practical guidance to secondary users of these reconstructions.


Introduction
Tree-ring chronologies built from living and dead trees offer a valuable source of information for understanding different aspects of natural and human history, ranging from archeological dating to past climate conditions. Tree-ring chronologies are both annually resolved and precisely dated (Douglass, 1909, 1920, 1928, 1941; Stokes and Smiley, 1968; Fritts, 1976; Schweingruber, 1988; Speer, 2010; Anchukaitis, 2017). Long chronologies can be developed in most temperate and subtropical areas of the world across almost all types of habitats (St George, 2014; St George and Ault, 2014). The availability of numerous tree-ring data sets from different sites and tree species, from diverse natural environments, allows for comprehensive statistical analyses (e.g., Björklund et al., 2017; Seftigen et al., 2018; Babst et al., 2019; Büntgen et al., 2019).
Depending on the dominant growth-limiting climate factor at a particular site, tree-ring data can be used to reconstruct either growing season temperature or hydroclimate variability (Fritts, 1976). Millennium-long temperature reconstructions, entirely or partly derived from tree-ring data, have gained the widest attention through their almost iconic status in the current global warming discourse (see, e.g., Frank et al., 2010; Masson-Delmotte et al., 2013; Smerdon and Pollack, 2016; Esper et al., 2018). Tree-ring based hydroclimate reconstructions are perhaps less widely known, but they play an equally important role in contributing to our understanding of climate variability over the past one to two millennia. The use of tree-ring data to understand past hydroclimate variability also has a considerably longer history than the use of tree-ring data to address temperature variability, as the science of dendrochronology was developed in the moisture-limited growth environment of the southwestern United States (Douglass, 1929, 1941). Notable earlier works in the field include Bogue (1905), Douglass (1917), Hawley and Clark (1940), Schulman (1956), and Fritts (1976). Some of the earliest examples of long calibrated precipitation, drought and streamflow reconstructions can be found in Schulman (1945), Meko et al. (1980), and Cook and Jacoby (1983).
Reconstructing hydroclimate is more challenging than reconstructing temperature as precipitation and drought are highly affected by topography and local features (Feng et al., 2013) and have greater spatial variability (Osborn and Hulme, 1997; Datta et al., 2003; Hofstra and New, 2009; Büntgen et al., 2010a, b; Wan et al., 2013). Precipitation shows significant spatial correlations of 500–700 km at decadal time-scales (Ljungqvist et al., 2016; Schneider et al., 2019) compared to up to several thousand kilometers for temperature (Jones et al., 1997; Christiansen and Ljungqvist, 2017).
Despite these challenges several large-scale gridded hydroclimate reconstructions, covering major portions of continents, have been produced using tree-ring data: e.g. the North American Drought Atlas, the Monsoon Asia Drought Atlas (Cook et al., 2010), the Old World Drought Atlas (Cook et al., 2015a, b), the Mexican Drought Atlas, the Eastern Australia and New Zealand Drought Atlas (Palmer et al., 2015) and recently the combined Global Drought Atlas (Marvel et al., 2019) covering large portions of the world back to 1400 CE and offering reasonable coverage for parts of the Northern Hemisphere back to 1000 CE. However, the majority of tree-ring chronologies included in these gridded reconstructions have not been published as individual quality-assessed hydroclimate reconstructions. Although the chronologies in the drought atlases, when used together, provide a skillful drought reconstruction over space and time, their strength lies in the representation of the general hydroclimatic condition in a region due to the applied aggregation, and thus interpolation, approach. Complementary to those drought atlases, however, it is important to use individual tree-ring based site reconstructions to understand the underlying data and investigate local hydroclimatic conditions. This is of paramount importance especially when the local hydroclimate–tree growth relationship deviates in season or in hydroclimatic metric from the one used in the drought atlases.
The network of millennium-long hydroclimate tree-ring based reconstructions is geographically confined to a few regions (Fig. 1) with the largest concentration in the southwestern United States, and a smaller cluster on the edge of the northeastern Tibetan Plateau. Considering the drought change difference between 1983–2016 and 1950–1982, one finds hydroclimate reconstructions distributed over both regions that tend to get wetter and regions that tend to get drier (Fig. 1). It is obvious that the present network of millennium-long reconstructions is woefully inadequate for capturing the spatially heterogeneous nature of hydroclimate variability.

Objectives
Future hydroclimate changes are arguably the largest uncertainty connected with global warming and, at the same time, are likely to have the largest environmental and societal impacts (Field et al., 2014; Schewe et al., 2014; Lehner et al., 2017; Trnka et al., 2018). State-of-the-art climate model simulations provide highly uncertain projections of hydroclimate changes at regional to continental scales (Stephens et al., 2010; Orlowsky and Seneviratne, 2013; Christensen et al., 2014; Nasrollahi et al., 2015). Climate model evaluation through paleoclimate reconstruction–simulation comparison studies is thus of utmost importance to improve the models' skill (e.g., Ault et al., 2013, 2014; Coats et al., 2015; Cook et al., 2015a, b, 2016; Smerdon et al., 2015; Ljungqvist et al., 2016, 2019; Xoplaki et al., 2016, 2018; Seftigen et al., 2017; Bothe et al., 2019). Hydroclimate reconstructions are therefore highly important for a deeper understanding of past, present and future hydroclimatic conditions and it is critically important to objectively assess and communicate the strengths and weaknesses of each individual record.
In this article, we evaluate and rank 46 millennium-long tree-ring based hydroclimate reconstructions by considering their Data Homogeneity, Sample Replication, Growth Coherence, Chronology Development, and Climate Signal using an ordinal scoring scheme set forth in Esper et al. (2016) for ranking tree-ring based temperature reconstructions. We discuss the implications of the ranking, provide recommendations for how to select hydroclimate reconstructions to use for different purposes, and make recommendations for the development of new hydroclimate reconstructions. In addition, we compare the results of the two rankings of hydroclimate and temperature reconstructions.

Reconstructed hydroclimatic metrics
Our compilation of tree-ring based hydroclimate reconstructions, extending back to 1000 CE, includes 24 reconstructions of precipitation, 11 reconstructions of streamflow, 6 reconstructions of the Palmer Drought Severity Index (PDSI; Palmer, 1965;van der Schrier et al., 2011), 3 reconstructions of moisture availability/balance, 1 reconstruction of the Standardized Precipitation Index (SPI; McKee et al., 1993), and 1 reconstruction of Palmer Hydrological Drought Index (PHDI) (Karl, 1986). Precipitation is the most easily available metric as it is directly derived from meteorological station data, although it does not fully reflect the complex hydrological systems. Furthermore, tree-ring hydroclimate sensitivity might vary depending on soil characteristics and evapotranspiration rates, making different drought metrics more or less suitable.
PDSI integrates precipitation and temperature to estimate relative dryness ranging from −10 (very dry) to +10 (very wet) (Palmer, 1965; Dai et al., 2004; Wells et al., 2004; van der Schrier et al., 2011). It tracks long-term changes in physiological drought, relative to the mean conditions in a given region, as it combines a physical water balance model with temperature and thus considers potential evapotranspiration (Hobbins et al., 2008). PHDI captures the slower impacts of drought and was developed to quantify long-term hydrological effects better than the PDSI (Jacobi et al., 2013).
SPI quantifies the observed precipitation as a standardized departure from the long-term mean (Keyantash and Dracup, 2002). One potential weakness with SPI is that it does not consider changes in evapotranspiration since it only reflects changes in water supply. The metric relates well to soil moisture on shorter timescales and to groundwater and reservoir storage on longer timescales (McKee et al., 1993). It is typically a more comparable metric across regions than PDSI, although this limitation of PDSI is greatly relieved in the self-calibrated PDSI variant (scPDSI; Wells et al., 2004; van der Schrier et al., 2011).
Streamflow can be reconstructed from tree-ring data, as both river discharge and tree growth can be modulated by common precipitation and evaporation patterns at a local to regional scale (Schulman, 1945; Stockton, 1975; Stockton and Jacoby, 1976; Woodhouse et al., 2006; Ho et al., 2016). However, streamflow has its own characteristics: after heavy precipitation, discharge typically reaches a peak, and then gradually subsides to base flow.

Tree-ring based hydroclimate reconstructions
A literature review (completed in February 2019) resulted in the identification of 48 tree-ring width based hydroclimate reconstructions extending back to at least 1000 CE, each with a minimum replication in any given year of at least three measurement series. Only 46 of these 48 reconstructions are included in this assessment since the raw data and sufficient information from two reconstructions, the Northeastern Tibetan Plateau precipitation reconstruction by Liu et al. (2006) and the Qaidam Basin moisture availability reconstruction by Yin et al. (2008), could not be obtained. All data used here were otherwise either accessible from public repositories or made available to us by the original authors. We did not include older reconstructions using mainly the same tree-ring material as in a newer version. Moreover, all tree-ring isotope based reconstructions (see e.g., Duffy et al., 2019) were excluded from this assessment as they either lack annual resolution (e.g., Edwards et al., 2008, 2017; Wang et al., 2013; Kress et al., 2014) or the reconstruction was derived from annually pooled samples (e.g., Treydte et al., 2006; Grießinger et al., 2017), precluding the calculation of key metrics used in this assessment.
Out of the 46 tree-ring width based hydroclimate reconstructions, 10 are from Asia, 5 from Europe, 1 from (northern) Africa, 29 from North America, and 1 from (southern) South America. The five reconstructions from Europe and the one from (northern) Africa are treated as one group (Fig. 1; Table 1). The 46 reconstructions are derived from 37 tree species representing 16 different genera, with Pinus (n = 21), Pseudotsuga (n = 14), and Juniperus (n = 11) being the most common. Most species (n = 22), however, occur only in one single reconstruction. The majority of the reconstructions (n = 29) are composed of one tree species, but 11 include two species, and six combine three or more species (Table 1). Only seven reconstructions are composed of ring width data solely from living trees, mainly from China, while 39 are composed of living trees in combination with relict material from archeological, historical, remnant, and/or sub-fossil samples. The season of the strongest tree-growth response to hydroclimate differs among the reconstructions (see column "Season" in Table 1).

Hydroclimate tree-ring chronology characteristics and metrics
Table 1. List of all the 46 tree-ring reconstructions, extending back at least to 1000 CE, published as calibrated hydroclimate reconstructions. The abbreviation code for tree species follows the standard used in the International Tree-Ring Data Bank (ITRDB; Grissino-Mayer and Fritts, 1997) as listed in Grissino-Mayer (1993).
The characteristics Data Homogeneity, Sample Replication, Growth Coherence, Chronology Development, and Climate Signal described in Esper et al. (2016) are here adapted for hydroclimate reconstructions [...] (Cook and Krusic, 2005). Each characteristic (see sections 2.3.1 to 2.3.5) is used to produce an ordinal scoring scheme to rank the 46 tree-ring hydroclimate reconstructions. The scores for each criterion and their combination are divided into four classes (from highest to lowest rank): class A, class B, class C, and class D. In the quantitative ranking of Sample Replication, Growth Coherence, Chronology Development, and Climate Signal, the 12 top-ranked hydroclimate reconstructions fall in class A, ranks 13–24 in class B, ranks 25–35 in class C, and ranks 36–46 in class D. In the mainly qualitative ranking of Data Homogeneity, an uneven number of reconstructions fall into the four hierarchical classes (11 reconstructions in class A, 14 in class B, 14 in class C, and 7 in class D). To produce an overall score, the individual ranking order for each characteristic (sections 2.3.1 to 2.3.5) is combined.
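The rank-based class assignment described above can be illustrated with a minimal Python sketch. The function name `assign_classes`, the `higher_is_better` flag, and the tie handling (ties keep their input order) are assumptions made for illustration; the text specifies only the class boundaries (ranks 1–12, 13–24, 25–35, and 36–46).

```python
def assign_classes(scores, higher_is_better=True, sizes=(12, 12, 11, 11)):
    """Rank reconstructions by score and split them into classes A-D.

    scores: dict mapping a reconstruction name to its score (hypothetical input).
    sizes:  number of reconstructions per class, in the order A, B, C, D.
    Ties keep their input order; this is an assumption, not a rule from the text.
    """
    if sum(sizes) != len(scores):
        raise ValueError("class sizes must add up to the number of reconstructions")
    # Best-ranked reconstruction first.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    classes = {}
    start = 0
    for label, size in zip("ABCD", sizes):
        for name in ordered[start:start + size]:
            classes[name] = label
        start += size
    return classes


# Toy example with 46 made-up scores (not data from the study).
toy_scores = {f"rec{i}": float(i) for i in range(46)}
toy_classes = assign_classes(toy_scores)
```

Whether a larger or a smaller score is preferable may differ between criteria, which is why the sort direction is exposed as a parameter rather than fixed.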

Data homogeneity
The category Data Homogeneity combines characteristics of the (i) "Source" of tree-ring samples, (ii) "Type of chronology", (iii) "Number" of tree species, (iv) "Temporal clustering" of tree-ring data, and (v) more general "Remarks" on the sampling site(s). Source includes information about the origin of tree-ring samples, the number of sampling sites, and their location in relation to each other. The Data Homogeneity score takes into account whether, and to what extent, the tree-ring samples originate from one or more sites. This information was obtained either from the original publication or via personal communication with the author(s)/data contributor(s). Chronology type differentiates between two types of tree-ring reconstructions: composite "C" reconstructions, composed of living in addition to relict (historical/remnant/subfossil) material, and living "L" reconstructions composed only of samples from living trees. Historic denotes samples from both archeological excavations and standing structures. Remnant denotes samples from dead wood found on the ground in different states of conservation. Sub-fossil denotes samples retrieved from sediments. Number of Species considers the number of different tree species contributing to a reconstruction. Temporal clustering refers to when the contribution of tree-ring data from distinct homogeneous sites and/or a specific tree species dominate specific periods of the past millennium. Such clustering can complicate the preservation of low-frequency climate information (sensu, Melvin et al., 2013). Remark summarizes particular features of the data in a particular reconstruction relevant to the Data Homogeneity score.

Sample replication
The availability of tree-ring series varies over time, resulting in an uneven temporal distribution over the past millennium with typically increasingly fewer series back in time. We consider how these temporal changes affect reconstruction skill in the Sample Replication metric by integrating information about (i) "Mean replication", (ii) "Maximum replication", (iii) "Minimum replication", and (iv) "11th/20th Century Ratio". Mean Replication denotes the average number of measurement series (either core samples or radii from disks) considering all years from 1000 CE to the most recent year of a reconstruction (meaning that the exact number of years can differ slightly due to the different end dates of the reconstructions). Maximum Replication and Minimum Replication refer to the maximum and minimum numbers of contributing measurements at any year in the reconstruction. The 11th/20th Century Ratio refers to the mean 11th century replication divided by the mean 20th century replication, multiplied by 100. This metric is particularly important since tree-ring based reconstructions are calibrated over the typically well-replicated recent period. We calculate the combined Sample Replication score by summing the first three values (i + ii + iii) and multiplying the result by (iv). As explained in Esper et al. (2016), these measures, as well as those for the other scores described below, are somewhat arbitrary but derived through dendroclimatological expert knowledge to produce an ordinal scoring system that permits the comparison and ranking of tree-ring based reconstructions. Sample Replication was calculated using the program ARSTAN (see footnote 2).
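As a hedged illustration of the arithmetic described above, the following Python sketch computes (i + ii + iii) × (iv) from a yearly sample-depth table. The function name and the input layout (a dict from calendar year to number of series) are hypothetical; the century windows (1001 to 1100 and 1901 to the last year) follow the footnote to this section. It is not a reimplementation of ARSTAN.

```python
def sample_replication_score(replication_by_year):
    """Combined Sample Replication score: (mean + max + min) * ratio.

    replication_by_year: dict {calendar year: number of measurement series},
    a hypothetical input format covering 1000 CE to the end of the record.
    """
    years = sorted(y for y in replication_by_year if y >= 1000)
    counts = [replication_by_year[y] for y in years]

    mean_rep = sum(counts) / len(counts)   # (i) mean replication
    max_rep = max(counts)                  # (ii) maximum replication
    min_rep = min(counts)                  # (iii) minimum replication

    # (iv) mean 11th century depth divided by mean 20th century depth, times 100.
    c11 = [replication_by_year[y] for y in years if 1001 <= y <= 1100]
    c20 = [replication_by_year[y] for y in years if y >= 1901]
    ratio = 100.0 * (sum(c11) / len(c11)) / (sum(c20) / len(c20))

    return (mean_rep + max_rep + min_rep) * ratio


# Toy sample depth: 10 series until 1500 CE, 20 series afterwards (made-up values).
toy_depth = {y: (10 if y <= 1500 else 20) for y in range(1000, 2001)}
toy_score = sample_replication_score(toy_depth)
```

In the toy record the early sample depth is half the modern one, so the ratio term is 50 and roughly halves the score relative to a record with constant replication.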

Growth coherence
Growth coherence is expressed by the correlation between the individual measurement series: the so-called inter-series correlation (Rbar) (Wigley et al., 1984). Growth Coherence is an important chronology characteristic when evaluating the temporal reliability of a tree-ring based climate reconstruction. Using the program ARSTAN, we calculated the running mean Rbar value for every 10 years of a chronology using a 100-year window with an overlap of 90 years from 1000 CE onwards. The final Growth Coherence score is obtained by summing the (i) mean Rbar, (ii) maximum Rbar, and (iii) minimum Rbar and multiplying the resulting sum by the (iv) 11th/20th century Rbar ratio (in %). The mean, as well as the minimum and maximum Rbar, were calculated in a similar manner from 1050 CE onwards. In order to avoid biased positive results from very high Rbar values in the 11th century compared to the 20th century, the maximum allowed Rbar ratio is capped at 150% in the calculation of the final Growth Coherence score. This 150% ceiling only affects three reconstructions, all from the United States: Potomac River (Maxwell et al., 2011), Southern Sierra Nevada (Graumlich, 1993), and Upper Arkansas River Basin.
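The windowing scheme and the capped score described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the running Rbar values are taken as given (for example, exported from ARSTAN), the two century-level Rbar values are passed in directly because the text does not spell out how they are averaged, and all function and parameter names are hypothetical.

```python
def window_starts(first_year, last_year, width=100, step=10):
    """Start years of running windows: 100-year windows moved in 10-year steps,
    i.e. an overlap of 90 years between consecutive windows."""
    return list(range(first_year, last_year - width + 2, step))


def growth_coherence_score(running_rbar, rbar_11th, rbar_20th, ratio_cap=150.0):
    """Growth Coherence score: (mean + max + min Rbar) * capped 11th/20th ratio.

    running_rbar: list of running-window Rbar values (assumed precomputed).
    rbar_11th, rbar_20th: representative Rbar for each century (assumed inputs).
    """
    mean_rbar = sum(running_rbar) / len(running_rbar)
    ratio = 100.0 * rbar_11th / rbar_20th   # ratio expressed in percent
    ratio = min(ratio, ratio_cap)           # ceiling of 150 % described in the text
    return (mean_rbar + max(running_rbar) + min(running_rbar)) * ratio


# Made-up values: the raw ratio of 200 % is reduced to the 150 % ceiling.
toy_gc = growth_coherence_score([0.3, 0.4, 0.5], rbar_11th=0.6, rbar_20th=0.3)
toy_starts = window_starts(1000, 1200)
```

Without the cap, a chronology with unusually coherent but sparsely replicated early material could outscore one that is uniformly coherent, which is the bias the ceiling is meant to limit.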

Chronology development
The Chronology Development score incorporates four metrics: (i) type of detrending ("1" for Regional Curve Standardization (RCS), and "2" for individual-series detrending method), (ii) the square root of the difference between the maximum and the minimum age, (iii) the slope of the linear regression in the age curve multiplied by 100, and (iv) the maximum retained low-frequency score ("1" for multi-centennial and "2" for decadal to centennial). The choice of detrending method to remove tree-age related growth trends from the raw measurement series can have a profound effect on the ability to preserve low-frequency variability and long-term trends in tree-ring reconstructions. Only certain detrending methods can overcome limitations induced by the segment length of individual tree-ring series (Cook et al., 1995). The RCS method (Briffa et al., 1992; Esper et al., 2003) is most commonly used to achieve trend preservation, and the maximum retained low-frequency score is "1" for RCS detrended reconstructions. Reconstructions produced by individual series detrending are by default assumed not to preserve low-frequency variability beyond their segment length and obtain the score "2". However, chronologies with tree-ring series exceeding 400 years on average are still assumed to retain some multi-centennial variability. We calculated the difference between the maximum and minimum age over the past millennium, and the slope of the linear regression fit to the age curve. In the ranking of temperature reconstructions by Esper et al. (2016), the maximum low-frequency information a reconstruction is arguably able to retain is divided into three categories: multi-centennial = "1", centennial = "2", and decadal = "3". Here, for our ranking, we only use two categories: multi-centennial = "1" and decadal to centennial = "2". The rationale for a two-category scale when working with hydroclimate reconstructions is that, compared to temperature, it is less certain what the deterministic and stochastic controls on hydroclimate low-frequency variability are (Hurst, 1951; Pelletier and Turcotte, 1997; Markonis and Koutsoyiannis, 2016). The final Chronology Development score is obtained by multiplying (i) the method score ("1" for RCS, "2" for individual detrending) with (ii) the square root of the maximum–minimum age difference, (iii) the absolute linear regression slope multiplied by 100, and (iv) the maximum retained low-frequency score.
Footnote 2: The 11th century sample depth is calculated over the period 1001 to 1100, and the 20th century sample depth is calculated from 1901 to the most recent year of a reconstruction.
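The product defining the Chronology Development score can be written out as a short Python sketch. The function and parameter names are hypothetical, and the age statistics (maximum and minimum of the age curve, and its regression slope) are assumed to have been computed beforehand; only the multiplication itself is taken from the description in this section.

```python
import math


def chronology_development_score(uses_rcs, max_age, min_age, age_slope,
                                 retains_multicentennial):
    """Product of the four Chronology Development components.

    uses_rcs: True for RCS detrending (score 1), False for individual-series
              detrending (score 2).
    max_age, min_age: extremes of the age curve over the past millennium.
    age_slope: slope of the linear regression fitted to the age curve.
    retains_multicentennial: True gives low-frequency score 1, False gives 2.
    """
    method_score = 1 if uses_rcs else 2
    lowfreq_score = 1 if retains_multicentennial else 2
    age_term = math.sqrt(max_age - min_age)   # (ii) square root of the age range
    slope_term = abs(age_slope) * 100         # (iii) absolute slope times 100
    return method_score * age_term * slope_term * lowfreq_score


# Made-up inputs: same age structure, different detrending choices.
toy_rcs = chronology_development_score(True, 500, 100, 0.02, True)
toy_individual = chronology_development_score(False, 500, 100, -0.02, False)
```

Because all four components multiply, the detrending and low-frequency codes act as scaling factors on the age-structure terms: with identical age range and slope, the second toy case comes out four times larger than the first.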

Climate signal
We acknowledge the limitations of the Climate Signal metric, considering that the assessment of hydroclimate signal strength to a large degree is dependent on the quality and length of the instrumental data. Moreover, in some cases, especially in regions with a short and sparse network of instrumental data, the hydroclimate signal in the trees may in fact be better than the instrumental data used for calibration. The Climate Signal score is derived by (i) calculating the square root of the number of years of overlap between the reconstruction and the instrumental target used for calibration, multiplied by the residual between (ii) the correlation coefficient between the tree-ring chronology and the instrumental climate data, and (iii) the difference between the correlation values of the calibration/verification periods. When the calibration/verification statistics are not reported, we estimate the difference based on our calculations using gridded instrumental data. In addition, we included another variable (iv) to account for a calibration period that was deliberately shortened to avoid "divergence", i.e., an anomalous offset between tree growth and climate sensitivity (sensu D'Arrigo et al., 2008). When such "divergence" is reported in the original publication, and the calibration period has been truncated, we use 0.5 as a multiplier instead of 1 as in all other cases. The final Climate Signal score is obtained by calculating √(i) × (ii − iii) × (iv).
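A minimal Python sketch of the Climate Signal arithmetic is given below. The names are hypothetical, and the sketch assumes that (ii) and (iii) are supplied as plain correlation values; it does not cover the fallback estimation from gridded instrumental data mentioned above.

```python
import math


def climate_signal_score(overlap_years, correlation, calib_verif_difference,
                         truncated_for_divergence=False):
    """Climate Signal score: sqrt(i) * (ii - iii) * iv.

    overlap_years: years of overlap with the instrumental target (i).
    correlation: correlation between chronology and instrumental data (ii).
    calib_verif_difference: calibration/verification correlation difference (iii).
    truncated_for_divergence: True if the calibration period was shortened
        because of reported divergence, which sets the multiplier (iv) to 0.5.
    """
    multiplier = 0.5 if truncated_for_divergence else 1.0
    return math.sqrt(overlap_years) * (correlation - calib_verif_difference) * multiplier


# Made-up inputs: identical statistics with and without a truncated calibration period.
toy_full = climate_signal_score(100, 0.8, 0.1)
toy_truncated = climate_signal_score(100, 0.8, 0.1, truncated_for_divergence=True)
```

Since the calibration length enters only through its square root, a change in correlation shifts the score more than a proportional change in the number of calibration years.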

Data homogeneity
The reconstructions scoring the highest (class A) by Data Homogeneity (Table 3), of which none are from Europe, are derived from only one site or, in the case of the Tavaputs Plateau (Knight et al., 2010), from two very nearby sites in one canyon. Moreover, when the reconstructions are only based on one tree species, and when the data are from only one site, it is not possible for temporal clustering to occur. The reconstructions scoring second highest (class B) are based on tree-ring material from either one, two, or several sites (e.g., Barranca de Amealco; Stahle et al., 2011, and Flowerpot; Buckley et al., 2004). In cases when they are based on only one site, this site includes less homogeneous material than those in class A. When the data are from two or more sites, these are typically homogeneous growth environments in close proximity and the reconstructions are composed of at most two species. There may exist inhomogeneities such as early chronology portions that are based on only one site (e.g., Atlas Mountains; Esper et al., 2007), substantial changes in mean ring width level (e.g., Barranca de Amealco; Stahle et al., 2011), data obtained from two different river systems (e.g., Choctawhatchee River; Stahle et al., 2012), or different microsite conditions (e.g., Flowerpot; Buckley et al., 2004).
Reconstructions scoring less well (class C) typically consist of rather inhomogeneous material, often collected across a large region. In some cases, the data are from a larger number of sites (e.g., 17 living tree sites and 5 archeological sites on the Northeastern Tibetan Plateau; Yang et al., 2014). Parts of the chronologies may also be derived from historical and/or archeological wood that does not necessarily originate from the same area or environment as the living or remnant samples in the same chronology (e.g., Central Europe; Büntgen et al., 2011, Dulan; Sheppard et al., 2004, East Anglia; Cooper et al., 2013, Southeastern England; Wilson et al., 2013, and Mesa Verde; Stahle et al., 2015). The reconstructions scoring lowest in Data Homogeneity (class D) do not necessarily consist of more sites than those in class C. However, the sites are geographically more dispersed as well as more diverse in their growth environments. All reconstructions in class D, except one, include between three and nine different tree species (see Table 2). All class D reconstructions are from North America, including many that consist of numerous sites, widely dispersed over several states, and separated by distances of up to several hundred kilometers. It is thus the number of sites, the distance between them, and the inhomogeneous growth environments that primarily impact Data Homogeneity. However, when a reconstruction includes three or more tree species the score decreases to the point where it contributes to placing the reconstruction in class D. Temporal clustering is present in most class C and D chronologies.

Sample replication
Reconstructions from Asia and Europe generally include more samples than reconstructions from North America (Table 4). Overall, mean replication is similar between Asia and Europe except for the sharp replication increase after c. 1850 in Europe (Fig. 2). Noteworthy is also the decreasing sample replication towards the present in Asia as well as the gradual post-1500 increase seen in many reconstructions from North America. The post-1850 replication increase in Europe biases the (20th century) calibration statistics, a feature absent in Asia and North America. Mean and maximum replication are highest in Europe and lowest in North America. The 11th/20th century ratio of the mean replication is highest, and with the largest spread, in Asia, and basically identical in Europe and North America (Fig. 5).
The reconstruction ranking highest in the category Sample Replication is the Northeastern Tibetan Plateau including 837 measurement series (Yang et al., 2014), followed by Central Europe (3124 series; Büntgen et al., 2011) and Colorado River (390 series; MacDonald et al., 2008). Reconstructions scoring well in Sample Replication are disproportionately often from Asia and Europe, whereas the majority of low scoring ones are from North America. The latter is even more apparent when considering the minimum replication: except for two, all reconstructions including periods during which replication falls below 10 samples are from North America (Table 4).

Growth coherence
Mean Rbar values are highest in North America (0.42) and lowest in Europe (0.25), with values in Asia (0.38) closer to those of North America (Fig. 3; Fig. 6). The low Rbar values in Europe likely result from the inclusion of tree-ring material that is less homogeneous over time, including material derived from historical construction timber harvested over a wide region in different growth environment conditions. Another possible explanation for the low Rbar values in Europe is a lower proportion of the tree-ring material that is derived from arid or semi-arid environments.
Reconstructions scoring well in the category Sample Replication perform in some cases less well in the category Growth Coherence and vice versa. This is presumably because many of the reconstructions with high replication include data from sites with various growth conditions, resulting in a weaker common signal. All reconstructions with the highest Growth Coherence (class A) come from North America. There is no consistent geographical pattern associated with those reconstructions with the lowest Growth Coherence (class D). Three reconstructions have negative Rbar values at some point during the past millennium (1000–2000 CE). Interestingly, these negative Rbar values do not necessarily appear in the early, generally most weakly replicated, part of the chronology.

Chronology development
Whereas reconstructions from Europe are overrepresented among those with the highest Chronology Development scores (class A), several reconstructions from China (n = 4) and North America (n = 7) appear in class D (Table 6). The low Chronology Development scores are related to a large age range and a steep age trend in combination with individual detrending instead of RCS detrending (Fig. 3). An uneven age distribution also introduces a climate signal age effect bias (e.g., Linderholm and Linderholm, 2004; Rossi et al., 2008; Rozas et al., 2009; Čermák et al., 2019). Asian chronologies have the largest age range and age trend (Fig. 4), as well as the largest spread in both parameters, whereas European chronologies have the smallest age range and age trend (Fig. 7). The smaller observed average age trend in Europe, compared to Asia and North America, is related to the relative absence of long-lived tree species in Europe and to the long European history of intensive land use. The European chronologies have a flat age trend until the late nineteenth century, whereas in Asia the increase is visible already by c. 1300, and by c. 1700 in North America (Fig. 3). In addition, the spread in the age trend between chronologies from North America increases after c. 1600. All three continents have a strong age trend increase during the twentieth century. It is more common for chronologies from Europe to retain centennial to multi-centennial variability than for chronologies from Asia or North America as RCS has been applied to composite datasets.

Climate signal
Table 2. Abbreviations of tree species included in this study (see Table 1), used in the International Tree-Ring Data Bank (ITRDB; Grissino-Mayer and Fritts, 1997), following Grissino-Mayer (1993).
Table 3. Data Homogeneity scores. Chronology type "C" refers to reconstructions derived from a composite of material from living trees, remnant, historical and/or sub-fossil wood. Type "L" refers to reconstructions derived from only living trees. Temporal clustering (Yes) indicates reconstructions composed of data from distinct sites or species concentrated in discrete periods over the past 1000 years. Other abbreviations: AM = archeological material; HM = historical material; RM = remnant material; SF = subfossil material (MacDonald and Case, 2005).
Table 4. Sample Replication scores. The number of measurement series included in the reconstructions. 11th/20th is the ratio of the mean replication during the 11th century relative to the mean replication during the 20th century.
All 12 reconstructions in the highest Climate Signal class A are from North America (Table 7). These reconstructions calibrate exceptionally well (mean 0.79 ± 0.07) against relatively long instrumental data (mean 96 ± 13 years) and in most cases the calibration/verification difference is very small (mean r = 0.08 ± 0.05) (Fig. 8). A very high correlation coefficient can compensate for a shorter calibration period and a larger calibration/verification difference. The reconstruction with the highest correlation to instrumental data (r = 0.90), the Bear River streamflow reconstruction (DeRose et al., 2015), has a calibration period of only 68 years and a calibration/verification difference as large as r = 0.18, but is still placed in class A. There is an obvious overrepresentation of humid sites among those reconstructions with the lowest Climate Signal scores (class D). The eleven reconstructions of the lowest Climate Signal class D are characterized by comparatively low correlation values to their instrumental targets (r = 0.63 ± 0.09), rather large calibration/verification differences (r = 0.14 ± 0.08), but highly variable calibration period lengths ranging from 34 to 115 years. The calibration period of all Climate Signal class D reconstructions has been truncated due to a "divergence" problem. In Asia, the short calibration periods stand out, but the correlation values are similar to those of North America. The reconstructions from Europe are typically calibrated over periods of similar length as those for North America but correlation values are lower (Fig. 8c).
It can be noted that the majority of the evaluated hydroclimate tree-ring records show a weak e mostly insignificant e negative correlation to local annual mean temperature over the twentieth century, with a mean of À0.12 and a range from À0.01 and À0.25 between the first and the third quartiles.
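The Climate Signal statistics discussed in this section (calibration length, correlation with the instrumental target, and calibration/verification difference) can be sketched as follows. This is a minimal illustration under stated assumptions: the overlap period is simply split into an early and a late half, which is only one possible way to define "different periods of overlap", and the function name and synthetic data are hypothetical.

```python
import numpy as np

def climate_signal_stats(proxy, target):
    """Calibration length, full-period Pearson r, and split-period r difference.

    Assumption: the calibration/verification difference is taken as the
    absolute difference between correlations over the early and late halves.
    """
    proxy = np.asarray(proxy, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(proxy)
    half = n // 2
    r_full = np.corrcoef(proxy, target)[0, 1]
    r_early = np.corrcoef(proxy[:half], target[:half])[0, 1]
    r_late = np.corrcoef(proxy[half:], target[half:])[0, 1]
    return {"length": n, "r": r_full, "cal_ver_diff": abs(r_early - r_late)}

# Synthetic 80-year overlap: proxy = instrumental target plus noise
rng = np.random.default_rng(1)
target = rng.normal(size=80)
proxy = target + 0.5 * rng.normal(size=80)
stats = climate_signal_stats(proxy, target)
print(stats)
```

A reconstruction with a high full-period correlation but a large split-period difference would, in the spirit of the criteria above, be treated with more caution than the correlation alone suggests.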

Overall tree-ring hydroclimate reconstruction ranking
The results from our assessment of Data Homogeneity, Sample Replication, Growth Coherence, Chronology Development, and Climate Signal of 46 millennium-long tree-ring based hydroclimate reconstructions are presented in Tables 3–7. Clear differences between reconstructions become apparent in the overall tree-ring chronology ranking shown in Table 8. Two reconstructions, Khorgo and Uurgat, score high (class A or class B) in all five categories. Nine reconstructions score high (class A or class B) in four out of five categories. Eleven reconstructions score less well (class C or class D) in at least four out of five categories. Some reconstructions score high in some parameters and low in others. The most notable example is the Central Europe precipitation reconstruction (Büntgen et al., 2011): it ranks #1 in Chronology Development and #2 in Sample Replication, but #45 in Growth Coherence and #44 in Climate Signal. Another reconstruction, Southern Sierra Nevada (Graumlich, 1993), scores the highest (class A) in all categories except Sample Replication, where it scores the lowest (class D). Conversely, the Colorado River reconstruction (MacDonald et al., 2008) scores low (class D) in all categories except Sample Replication, where it scores high (class A).
No geographical differences are apparent in the overall tree-ring hydroclimate reconstruction ranking. However, with only a few exceptions (e.g., two reconstructions from the humid United Kingdom), reconstructions from arid and semi-arid environments dominate class A. Reconstructions from humid environments are, on the other hand, overrepresented in class D, although several reconstructions from arid and semi-arid environments are also found there. We also find that recently developed reconstructions are not necessarily better than older ones, except for the ability to preserve low-frequency information. Three of the highest-ranking reconstructions, El Malpais (Grissino-Mayer, 1995), Southern Sierra Nevada (Graumlich, 1993) and White Mountains (Hughes and Graumlich, 1996), were actually among the earliest developed millennium-long hydroclimate reconstructions.

Implications of the ranking of hydroclimate reconstructions
This article attempts to provide an objective evaluation of the strengths and weaknesses of millennium-long tree-ring based hydroclimate reconstructions. Our ranking offers guidance for users of these reconstructions inside and outside the dendroclimatological community. It emphasizes the complexity of a comprehensive assessment in which the correlation with instrumental data, arguably the most intuitive quality criterion, is only one out of many aspects. In practice, different research questions will pose different selection criteria, so that the ranking presented here will not be equally applicable to all dendroclimatological studies. For example, if the objective is to infer the influence of drought stress on long-term agricultural productivity, it is desirable to select the best, regionally representative, reconstruction. Furthermore, if the focus is on the effect of climatic extreme events, a lack of low-frequency information may be less of a problem. On the other hand, a wide spatial coverage, even sample replication over time, and preserved low-frequency information are desirable if the goal is to investigate where warm–wet and warm–dry associations tend to occur or to understand the synoptic climate situations and feedback mechanisms responsible for such patterns. The design of our criteria includes variability at timescales from inter-annual to multi-centennial, with a specific accentuation on the lower frequencies that cannot be controlled in the period of instrumental overlap. An issue to consider is that poor replication during the first centuries, compared to the (20th century) calibration period, makes the quantification of the severity of medieval megadroughts or enhanced monsoon precipitation in comparison to recent "extremes" uncertain.
In this context, it can also be noted that several reconstructions, published as millennium-long, were excluded from this assessment as they either stopped just short of 1000 CE or did not have sufficient replication (of at least three samples) all the way back to 1000 CE (e.g., Büntgen et al., 2010a,b; Stambaugh et al., 2011). The threshold of at least three measurement series is set rather low. Generally speaking, at least 10 ring width measurement series from different trees ought to be included in a reliable reconstruction, though the precise number depends on the inter-series correlation (Rbar) and the climate signal strength inherent to the particular data.
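How the required number of series depends on Rbar can be illustrated with the widely used expressed population signal, EPS = n·Rbar / (1 + (n − 1)·Rbar). The sketch below is an illustration only: EPS is not one of the ranking criteria used here, and the 0.85 threshold is a common convention assumed for the example rather than a value taken from this assessment.

```python
def eps(n_series, rbar):
    """Expressed population signal for n_series with mean inter-series correlation rbar."""
    return n_series * rbar / (1.0 + (n_series - 1) * rbar)

def min_series_for_eps(rbar, threshold=0.85, max_n=1000):
    """Smallest number of series whose EPS reaches the (assumed) threshold."""
    for n in range(1, max_n + 1):
        if eps(n, rbar) >= threshold:
            return n
    return None  # threshold not reached within max_n series

# Strongly coherent versus weakly coherent (hypothetical) datasets
for rbar in (0.6, 0.3):
    print(rbar, min_series_for_eps(rbar), round(eps(3, rbar), 2))
```

Under these assumptions, three series are only adequate when the inter-series correlation is very high, whereas weakly coherent data need well over ten series to reach the same signal strength.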
Hydroclimate is a complex climatological metric as it includes precipitation, soil moisture and temperature-driven evapotranspiration. It also possesses a higher spatial heterogeneity than temperature and a multifaceted spectral character. The much shorter spatial co-variance of precipitation and all other metrics of hydroclimate compared to temperature makes it less feasible than for temperature to include only the highest-ranking hydroclimate reconstructions in further assessments or large-scale reconstructions. In the interpretation of low-frequency hydroclimate variability it is important to consider to what extent a reconstruction actually preserves information on longer than multi-decadal time-scales. We here identified a problematic feature of the tree-ring based hydroclimate reconstructions, as opposed to most state-of-the-art tree-ring based temperature reconstructions: the low proportion of reconstructions produced through RCS. The general application of individual-series detrending methods to produce most of the hydroclimate reconstructions risks removing centennial-scale variability. Including "noisy" reconstructions, with only a few measurement series back in time, does not necessarily improve any network analysis. It is rather recommended to evaluate each individual chronology and include only those reconstructions that can be expected to include relevant information. Thus, data selection based only on the calibration statistics is not recommended.
Evaluating the robustness of the tree-ring based reconstructions against other types of hydroclimate proxy records is unfortunately difficult for several reasons (and thus cannot be turned into an evaluation criterion). Tree-ring records are by far the most abundant natural climate archive with a temporal resolution and age control that allows for calibration and validation against instrumental observations. For many of the evaluated tree-ring chronologies, there exists no other comparable calibrated proxy record in the region. Investigating the agreement of the low-frequency signal in the hydroclimate reconstructions with that of lower resolution records is not as straightforward an option as it may appear. Recent studies (e.g., Schneider et al., 2019) show that a robust quality estimation requires a very dense proxy network, composed of many different archives, rather than a single neighboring proxy record.
The frequently short and unevenly distributed meteorological station data in Asia (normally starting after 1950) pose severe constraints on the calibration and verification statistics for this portion of the hydroclimate network. Several reconstructions from Asia, most notably the one from the Northeastern Tibetan Plateau (Yang et al., 2014), which reaches a correlation to instrumental precipitation data of r = 0.84, would rank high in the category Climate Signal along with the records from North America if a longer (reliable) instrumental calibration period was available. Allowing for a 100-year-long calibration period would potentially score the Northeastern Tibetan Plateau (Yang et al., 2014), Heihe River Basin (Yang et al., 2012), Khorgo and Uurgat in Climate Signal class A. Likewise, it could improve the ranking of A'nyêmaqên (Gou et al., 2010), Delingha (Shao et al., 2005), Hexi Corridor, and Qilian Mountains (Zhang et al., 2011).

Fig. 4. Tree-ring chronology age curves. Thin black curves show the mean tree age of the tree-ring width data used in the local hydroclimate reconstructions from Asia (a), Europe and North Africa (b), and North America (c). Colored curves are the arithmetic means calculated over the common period covered by all reconstructions in each of the three regions. (d) Comparison of mean replication curves for Europe/North Africa, Asia, and North America.

Comparison with the temperature reconstruction ranking
Unlike the tree-ring based temperature reconstructions, the hydroclimate reconstructions can include more (up to nine) species (Table 2). The largest difference between the ranking of the hydroclimate and temperature reconstructions is found for Sample Replication. A similar replication for the chronologies between continents is found for temperature reconstructions, compared to a much higher replication for Asia and Europe and a lower replication for North America for hydroclimate reconstructions. The relative Growth Coherence between continents is, on the other hand, similar for the hydroclimate and temperature reconstructions, with the lowest values for Europe and comparable ones for Asia and North America. The highest Chronology Development scores, with the smallest spread, are found in Europe for both hydroclimate and temperature reconstructions. A larger Chronology Development spread is evident for hydroclimate reconstructions in Asia and for temperature reconstructions in North America. Climate Signal scores are similar for each continent in both the hydroclimate and temperature reconstructions, with Europe having overall the highest scores (Fig. 9).
Severe climatic conditions for tree growth at the species' distribution limit (Fritts, 1976) resulted in the highest Growth Coherence scores for both tree-ring based hydroclimate and temperature reconstructions. The twelve best-scoring hydroclimate reconstructions in Growth Coherence are from arid or semi-arid environments in the southwestern United States (see e.g., St George, 2014; St George and , whereas the three best-scoring temperature reconstructions are all from northern Siberia: Indigirka (Sidorova et al., 2006), Yamal, and Taimyr (Briffa et al., 2008). The trees included in these reconstructions, growing in a shallow active layer in the continuous permafrost zone, likely experience a shorter growing season than the trees in any of the other temperature reconstructions included in Esper et al. (2016).
The four highest-ranking reconstructions in the category Chronology Development, both for hydroclimate and temperature, are from Europe. For hydroclimate, these are Central Europe (Büntgen et al., 2011), East Anglia, Southern Finland (Helama et al., 2009), and Southcentral England, whereas for temperature they are Northern Scandinavia (Esper et al., 2012), Finland (Helama et al., 2010), the tree-ring width version of Torneträsk, and Lötschental (Büntgen et al., 2006). High scores in Chronology Development typically result from a combination of a small age range and minor linear trends in the mean age curve over the past millennium, in combination with the application of RCS detrending, to emphasize centennial to multi-centennial climate variability.
Overall, the average correlation between the tree-ring reconstructions and the instrumental data is higher for hydroclimate reconstructions (mean r = 0.69 ± 0.11) than for temperature reconstructions (r = 0.59 ± 0.15), which perhaps appears surprising given the spatially heterogeneous nature of hydroclimate. The region with the generally highest relationship between tree growth and hydroclimate is found in the southwestern United States (see, e.g., St George, 2014; St George and  whereas the highest relationship between tree growth and temperature is generally found in high latitude Eurasia and in the European Alps. The calibration period is generally shorter for the hydroclimate reconstructions (mean 79 ± 23 years) than for temperature reconstructions (mean 101 ± 43 years). This makes it more challenging to skillfully calibrate the low-frequency component of hydroclimate variability in particular.
Typically, precipitation measurements are either shorter or contain more noise prior to the twentieth century than temperature measurements (Pauling et al., 2006; Harris et al., 2014).

Table 5
Growth Coherence scores. Mean, maximum, and minimum correlations among the series included in the reconstructions. 11th/20th is the ratio of the correlation during the 11th century relative to the 20th century correlation.

Table 6
Chronology Development scores. Detrending method 1 = RCS (and Signal Free), and 2 = individual detrending. Age range is the difference between highest and lowest point on the mean age curve over the past millennium. Age trend is the slope of a linear regression fit to the mean age curve over the past millennium (times 100). Maximum frequency indicates the wavelength of lowest frequency information retained in a reconstruction, with 1 = centennial to multi-centennial, and 2 = decadal to centennial.

Expansion of the hydroclimate tree-ring reconstruction network
At present, millennium-long tree-ring based reconstructions with a well-verified hydroclimate signal are only available from a few locations in the world (Fig. 1; Fig. 10). As tree-ring records are the only natural hydroclimate proxy with annual resolution and exact dating control, there is an urgent need to expand this network. In more mesic locations it is generally challenging to extend hydroclimate tree-ring records back in time, as such locations offer less favorable conditions for wood preservation. In China, subfossil woods in lake or river sediments are difficult to find (He et al., 2019), and old living trees and remnant woods can mainly be collected in the dry parts of the country. In some places, not least in Europe, tree-ring based reconstructions can be extended with wood from archeological sites and old buildings (Tegel et al., 2010).
An additional challenge is posed by the decrease in hydroclimate sensitivity of tree growth in cool and wet environments. One solution to this problem is to reconstruct soil moisture availability using tree-ring data from temperature-limited environments by considering the pivotal role of surface temperature in determining the land surface heat flux, evapotranspiration and consequently the water balance (Cook et al., 2015a,b; Seftigen et al., 2015a,b). However, such reconstructions need to be treated with caution: both Baek et al. (2017) and Ljungqvist et al. (2019) found that they may overestimate the influence of temperature variability on soil moisture. Moreover, temperature and precipitation contain different spectral characteristics, where the former contains larger low-frequency loadings than the latter (Bunde et al., 2013; Franke et al., 2013; Zhang et al., 2015), making it problematic to use temperature-sensitive tree-ring data for hydroclimate reconstructions.
Despite such constraints, it has been demonstrated that tree-ring chronologies with a strong hydroclimatic signal can be developed in cooler and wetter environments. Hydroclimate reconstructions have been developed in Scandinavia spanning the past three to five centuries (see e.g., Helama and Lindholm, 2003; Linderholm et al., 2004; Jönsson and Nilsson, 2009; Drobyshev et al., 2011; Seftigen et al., 2015a,b). The potential to develop millennium-long reconstructions is evident from the Helama et al. (2009) May–June precipitation reconstruction from south-east Finland. In European Russia (52–57°N, 35–52°E), most tree-ring chronologies have been shown to correlate weakly but significantly with hydroclimate (Matskovsky, 2016; Matskovsky et al., 2017; Solomina et al., 2017), but all the available hydroclimate tree-ring reconstructions at present only reach back to the eighteenth century.
The development of millennium-long hydroclimate-sensitive tree-ring records is particularly difficult in the sub-Arctic in general and, in particular, in those parts of the boreal zone that are underlain by permafrost, which serves as a source of additional water supply for the trees during dry summers (Sugimoto et al., 2002; Saurer et al., 2016). Although the potential to develop long chronologies in the region exists (Thomsen, 2001; Agafonov et al., 2016), only a limited number of Siberian sites show statistically significant, albeit weak, correlations between tree growth and either monthly (Shestakova et al., 2019) or summer (Hellmann et al., 2016) precipitation or monthly SPEI (Arzac et al., 2019). Not surprisingly, hydroclimate reconstructions in the warmer and drier southern Siberia have shown greater promise (Shah et al., 2015; Belokopytova et al., 2018; Kostyakova et al., 2018). A new impetus to long hydroclimate reconstructions in the boreal zone, particularly in Siberia, may be provided by the development of tree-ring stable isotope chronologies (e.g., Waterhouse et al., 2000; Kirdyanov et al., 2008; Sidorova et al., 2009, 2010; Knorre et al., 2010; Tei et al., 2013, 2015; Panyushkina et al., 2016; Shestakova et al., 2017).
The moisture-limited tree growth environments of Central Asia, the Middle East, and North Africa have a high potential to yield millennium-long hydroclimate tree-ring reconstructions, but comparatively little work has so far been done in the region. However, several century-long reconstructions have been developed for Turkey (D'Arrigo and Cullen, 2001; Touchan et al., 2003, 2005, 2007; Akkemik et al., 2008), Jordan (Touchan et al., 1999) (2018) showed that it is possible to extend hydroclimate reconstructions for this region for the full past millennium or more. Likewise, Esper et al. (2007) successfully developed a past millennium reconstruction from the Atlas Mountains of Morocco.
Although the vast majority of existing tree-ring based hydroclimate reconstructions are from the Northern Hemisphere, there is potential to develop moisture-sensitive chronologies in the Southern Hemisphere as well. Early efforts by Schulman (1956) recognized a number of South American tree species sensitive to precipitation variations, and in the 1970s the first tree-ring based estimates of past hydroclimate conditions were developed in southern South America (LaMarche, 1978; Holmes et al., 1979). Recent work includes streamflow reconstructions spanning the past four to six centuries from the sub-Antarctic (Lara et al., 2008, 2015), the temperate (Urrutia et al., 2011; Mundo et al., 2012; Muñoz et al., 2016) and the subtropical (Ferrero et al., 2015) regions along the Andes, and even longer hydroclimate reconstructions from the Andes of central Chile (Le Quesne et al., 2006; Masiokas et al., 2012) and the Bolivian Altiplano. Recent studies have also shown a potential in the South American tropics (Lopez et al., 2017; Granato-Souza et al., 2019), as well as Australia, although efforts in the latter region are hampered by large spatial hydroclimatic heterogeneity (Allen et al., 2019) as well as the short temporal extension of the data.

Figure caption (beginning missing): (grey), Asia (red), Europe and North Africa (blue), and North America (green), with a box drawn between the first and third quartiles; a line across the box shows the median, the black dot shows the mean, and whiskers indicate minimum and maximum values. (a) Age range between the highest and lowest point on the mean age curve over the past millennium. (b) Age trend as the slope of a linear regression fit to the mean age curve over the past millennium (times 100).

Table 7
Climate Signal scores. Length is the period of overlap with instrumental temperature data in years. Correlation is the Pearson correlation coefficient between the tree-ring chronology and the instrumental data over the calibration period. Calibration/verification difference indicates the correlation range between different periods of overlap with instrumental data. Truncation = 0.5 if the calibration period was shortened (e.g. due to divergence), truncation = 1 if this is not the case.

Recommendations for future hydroclimate reconstructions
The six recommendations presented by Esper et al. (2016) for tree-ring based temperature reconstructions also hold true for the development of hydroclimate reconstructions: (a) preserve centennial-scale variability, using RCS detrending, for understanding low-frequency variance, (b) avoid a strong decrease in the number of series back in time, (c) strive for a homogeneous sample composition over time, (d) avoid too large replication and inter-series correlation changes, (e) avoid strong age curve changes over time, and (f) keep in mind that the calibration statistics may give a false impression of reconstruction skill. Based on the results from this assessment, we find it important to improve the replication in the earlier parts of the reconstructions, especially in North America, as a weak replication during medieval times precludes robust comparisons with recent hydroclimate conditions. It is equally important to include young and old trees throughout time in the chronologies to achieve a more evenly distributed age curve. The most difficult trade-off, however, is likely between achieving a high sample replication (over time) and a strong growth coherence, as the inclusion of additional sites can degrade growth coherence within a reconstruction. It appears advisable to include at most two tree species in any reconstruction, ideally from the same genus. When tree-ring material is obtained from multiple sites, it is important that it originates from similar environments with regard to moisture stress. Whenever notable micro-site conditions exist (Düthorn et al., 2015), temporal clustering of a certain micro-site condition should be avoided.
For tree-ring datasets composed of relatively young trees it is essential to successfully apply RCS detrending to preserve low-frequency information. This requires a large number of raw measurement series with relatively evenly distributed tree ages over time. If the biological age of the measurements shows a steep increase towards the present, RCS should only be applied with great caution. The use of measurement series from very old trees as an alternative to RCS detrending, at the price of a steep age curve, to preserve low-frequency variability may introduce biases from a climate signal age effect (Esper et al., 2008; Konter et al., 2016) and should be avoided if possible.
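The basic logic of RCS detrending can be sketched as follows: all series are aligned by biological age, a single regional curve is computed as the mean growth at each age, each measurement is divided by the regional curve value at its age, and the resulting indices are averaged by calendar year. The sketch is a simplification under stated assumptions: the first measured ring is treated as biological age 1 (no pith offset), the regional curve is left unsmoothed, and the function name and synthetic series are hypothetical.

```python
import numpy as np

def rcs_chronology(series):
    """Simplified RCS. series: list of (first_year, ring_widths) tuples.

    Assumes the first ring of each series has biological age 1.
    Returns (years, chronology, regional_curve).
    """
    series = [(fy, np.asarray(w, dtype=float)) for fy, w in series]
    max_age = max(len(w) for _, w in series)
    sums = np.zeros(max_age)
    counts = np.zeros(max_age)
    for _, w in series:  # align by biological age
        sums[:len(w)] += w
        counts[:len(w)] += 1
    regional_curve = sums / counts  # mean growth at each age

    start = min(fy for fy, _ in series)
    end = max(fy + len(w) for fy, w in series)
    idx_sum = np.zeros(end - start)
    idx_n = np.zeros(end - start)
    for fy, w in series:  # indices = raw / regional curve, placed by calendar year
        lo = fy - start
        idx_sum[lo:lo + len(w)] += w / regional_curve[:len(w)]
        idx_n[lo:lo + len(w)] += 1
    chronology = np.full(end - start, np.nan)
    covered = idx_n > 0
    chronology[covered] = idx_sum[covered] / idx_n[covered]
    return np.arange(start, end), chronology, regional_curve

# Hypothetical example: two tree generations sharing one age-related growth curve
age_curve = np.exp(-np.arange(150) / 40.0) + 0.5
example_series = [(1000, age_curve[:150]), (1100, age_curve[:120])]
years_rcs, chron_rcs, rc = rcs_chronology(example_series)
print(years_rcs[0], years_rcs[-1], np.round(chron_rcs[:3], 2))
```

Because the regional curve is an average over all available series at each age, it is only well constrained when many series with evenly distributed ages contribute, which is why the replication and age-structure requirements above matter.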
We find that a strong and stable hydroclimate signal is of far greater importance than having a long calibration period. This implies that it is fully feasible to develop well-verified tree-ring based hydroclimate reconstructions also from regions with short instrumental records. Moreover, it needs to be kept in mind that the calibration statistics obtained, regardless of the length of instrumental measurements, typically are optimistic estimates in the sense that the inter-series correlation as well as sample replication typically decrease back in time. If the calibration had been conducted on a less replicated part of the reconstruction, with lower inter-series correlation values, the correlation values would in most cases have been lower. Tests including artificially reduced-sample chronologies (Esper et al., 2012) are thus recommended.
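A reduced-sample test of this kind can be sketched as below: the calibration-period chronology is rebuilt many times from a random subset of the available series, and the correlation with the instrumental target is recomputed each time. This is a minimal sketch and not the procedure of Esper et al. (2012); the function name, the number of draws, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def reduced_sample_correlations(indices, target, n_keep, n_draws=200, seed=0):
    """Correlations between the target and chronologies built from n_keep random series.

    indices: 2D array (years x series) of detrended index series over the
    calibration period; target: 1D instrumental series of the same length.
    """
    rng = np.random.default_rng(seed)
    indices = np.asarray(indices, dtype=float)
    n_series = indices.shape[1]
    rs = np.empty(n_draws)
    for i in range(n_draws):
        cols = rng.choice(n_series, size=n_keep, replace=False)
        chron = indices[:, cols].mean(axis=1)  # reduced-sample chronology
        rs[i] = np.corrcoef(chron, target)[0, 1]
    return rs

# Synthetic calibration period: 80 years, 40 series sharing a common signal plus noise
rng = np.random.default_rng(42)
target = rng.normal(size=80)
indices = target[:, None] + rng.normal(scale=1.5, size=(80, 40))
for n_keep in (40, 10, 3):
    rs = reduced_sample_correlations(indices, target, n_keep)
    print(n_keep, round(float(rs.mean()), 2))
```

Choosing n_keep to match the replication of the earliest part of a reconstruction gives a rough impression of how much the calibration statistics might drop if they were based on equally sparse data.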

Conclusions
Following a scheme developed by Esper et al. (2016) for temperature reconstructions, we assessed and ranked 46 millennium-long tree-ring based hydroclimate reconstructions. This scoring considers Data Homogeneity, Sample Replication, Growth Coherence, Chronology Development, and Climate Signal (Fig. 10). Most of these characteristics, with the exception of Climate Signal, are rarely if ever considered outside the dendrochronological community, but they affect paleoclimate reconstruction–model simulation comparison studies. Our assessment will guide secondary users of tree-ring based hydroclimate reconstructions by providing information on the strengths and limitations of the individual records beyond their simple correlation with instrumental data. Moreover, we hope these results will advance future work on developing new tree-ring based hydroclimate reconstructions or improving and extending the existing ones.
The ranking scores produced for each of the five evaluation categories represent an attempt at objectively identifying suitable and less suitable hydroclimate reconstructions to use for different purposes. For example, in a study of short-term hydroclimate impacts following large volcanic eruptions, long-term trends and variations are less important in a particular reconstruction. On the other hand, if the purpose is to compare the average hydroclimate conditions during medieval times with those of today, it is advisable to only consider reconstructions that realistically retain low-frequency variability. We conclude that the same ranking implications and related recommendations for tree-ring based temperature reconstructions by Esper et al. (2016) are also valid for tree-ring based hydroclimate reconstructions (see section 4.2).
The systematic assessment of 46 tree-ring based hydroclimate reconstructions, covering the past millennium, permitted ranking them into four groups (class A to class D) for each of the five categories Data Homogeneity, Sample Replication, Growth Coherence, Chronology Development, and Climate Signal. All reconstructions have their various strengths and weaknesses, and no reconstruction ranked A or D in all five categories, but there are some reconstructions that consistently perform well: Khorgo, the Northeastern Tibetan Plateau (Yang et al., 2014), and Uurgat from Asia; East Anglia and Southcentral England from Europe; Tavaputs Plateau (Knight et al., 2010), El Malpais (Grissino-Mayer, 1995), Southern Sierra Nevada (Graumlich, 1993), Summitville (Routson et al., 2011), and Bear River (DeRose et al., 2015) from North America. Though it is our goal to provide evaluations that will assist investigators in making informed selections for their purposes, we at the same time recognize that all the reconstructions contain valuable information depending on the questions asked of them.