Evaluating climate field reconstruction techniques using improved emulations of real-world conditions

Pseudoproxy experiments (PPEs) have become an important framework for evaluating paleoclimate reconstruction methods. Most existing PPE studies assume constant proxy availability through time and uniform proxy quality across the pseudoproxy network. Real multiproxy networks, however, are marked by pronounced disparities in proxy quality and a steep decline in proxy availability back in time, either of which may have large effects on reconstruction skill. A suite of PPEs constructed from a millennium-length general circulation model (GCM) simulation is thus designed to mimic these real-world characteristics. The new pseudoproxy network is used to evaluate four climate field reconstruction (CFR) techniques: truncated total least squares embedded within the regularized EM (expectation-maximization) algorithm (RegEM-TTLS), the Mann et al. (2009) implementation of RegEM-TTLS (M09), canonical correlation analysis (CCA), and Gaussian graphical models embedded within RegEM (GraphEM). Each method's risk properties are also assessed via a 100-member noise ensemble. Contrary to expectation, it is found that reconstruction skill does not vary monotonically with proxy availability, but is also a function of the type and amplitude of climate variability (forced events vs. internal variability). The use of realistic spatiotemporal pseudoproxy characteristics also exposes large inter-method differences. Despite comparable fidelity in reconstructing the global mean temperature, spatial skill varies considerably between CFR techniques. Both GraphEM and CCA efficiently exploit teleconnections and produce consistent reconstructions across the ensemble. RegEM-TTLS and M09 appear advantageous for reconstructions from highly noisy data, but are subject to larger stochastic variations across different realizations of pseudoproxy noise. Results collectively highlight the importance of designing realistic pseudoproxy networks and implementing multiple noise realizations of PPEs.
The results also underscore the difficulty in finding the proper bias-variance tradeoff for jointly optimizing the spatial skill of CFRs and the fidelity of the global mean reconstructions.


Introduction
Over the past few decades, multiple methods have been proposed to estimate hemispheric and global temperature variability from proxy data over the Common Era (see Jones et al., 2009; Tingley et al., 2012, for comprehensive reviews). Such reconstructions provide an important test bed for understanding multidecadal to centennial climate variability and the climate sensitivity to exogenous forcing, while providing an extended context prior to the instrumental era for anthropogenic warming (Jansen et al., 2007). The majority of such reconstructions target an index (e.g., northern hemispheric mean temperature; Briffa et al., 2001; Crowley and Lowery, 2000; Mann and Jones, 2003; D'Arrigo et al., 2006), while a few are derived from climate field reconstruction (CFR) methods that aim to estimate the spatial, as well as the temporal, aspects of large-scale temperature variability (Mann et al., 1998, 1999, 2009; Evans et al., 2002; Luterbacher et al., 2004; Rutherford et al., 2005; Tingley and Huybers, 2013).

Published by Copernicus Publications on behalf of the European Geosciences Union.
A leading challenge in producing credible real-world climate reconstructions is the assessment of their uncertainties. The uncertainty of a real-world reconstruction is a mixture of two sources: the uncertainty associated with using necessarily imperfect proxy and target data, and the uncertainty associated with the employed statistical methodologies. Data uncertainties include measurement errors in the proxies, uncertainty in proxy-temperature relationships, sampling errors in instrumental climate fields, chronological uncertainties, and the uncertainty resulting from the network's coarse spatiotemporal coverage. Methodological uncertainties include a given method's sensitivity to input data (type of data, resolution, noise level, and spatiotemporal variability), its sensitivity to model parameters, and the uncertainty associated with the choice of these parameters.
Until recently, assessments of reconstruction uncertainties have primarily relied on cross-validation (CV; Cook et al., 1994), which consists of calibrating CFR methods over a subset of the instrumental period, and then validating the methods with the remaining observations. This method has the advantage of being firmly grounded in statistical theory (e.g., Hastie et al., 2008, Chap. 7), and it relies solely on actual observations; however, it was recently shown that shortening the calibration interval can lead to estimates of low-frequency skill that are biased low (Emile-Geay et al., 2013a). Temporal variations in reconstruction skill may be crudely estimated from "frozen network" experiments (Jones et al., 1998; Crowley and Lowery, 2000; Mann and Jones, 2003; Hegerl et al., 2006; Mann et al., 2007; Emile-Geay et al., 2013a), but because instrumental records are only available since the 1850s, it is impossible to directly estimate skill prior to the 19th century. Reconstruction uncertainty, particularly on multidecadal to centennial timescales, is thus difficult to quantify.
In this study, we use pseudoproxy experiments (PPEs) to extend our skill assessments of CFRs to decadal and centennial timescales and to isolate the impacts of the two principal uncertainty sources discussed above. PPEs were originally proposed by Bradley (1996) and adopted by Mann and Rutherford (2002) as a means of methodological assessment, and have been widely used to assess the performance of different CFRs in reconstructing global or hemispheric temperature (see Smerdon, 2012, and references therein for more details). Only a few of these PPEs, however, have focused on comprehensive assessments of CFR spatial skill (Tingley and Huybers, 2010b; Smerdon et al., 2011; Li and Smerdon, 2012; Annan and Hargreaves, 2012; Werner et al., 2013). In keeping with these earlier investigations, we focus herein on direct assessments of the spatial skill associated with leading CFR methods. Our approach nevertheless relies on more realistically designed pseudoproxy networks that give us better insights into the true spatial and temporal uncertainties in currently available CFR products.
Pseudoproxies typically are derived from the output of GCM simulations. The synthetic proxy data mimic some aspects of real-world proxy networks, and reconstruction algorithms are applied to the data to backcast the GCM-simulated climate conditions. Thus PPEs allow for controlled assessments of reconstruction methods with regard to the geographical and temporal distribution of proxies, their quality, and the spectral characteristics of the noise (Smerdon, 2012). However, most PPEs to date have constructed pseudoproxies that are temporally invariant throughout the reconstruction interval and have uniform proxy quality. Such networks under-represent the complexity of real-world proxy networks, limiting the direct applicability of their results to real-world reconstructions.
Here we construct more realistic pseudoproxy networks that mimic the key spatiotemporal characteristics of the multiproxy network used by Mann et al. (2008) (hereinafter M08). Two novelties in pseudoproxy design are introduced in this work: (1) the decrease in proxy availability over time follows that of the M08 network; and (2) the spatial variations of proxy quality mimic those found in M08. The more realistic pseudoproxy design allows a more stringent test of the performance of different CFR techniques, and provides insights into at least three aspects: (1) assessing how the spatiotemporal characteristics of the proxy network affect reconstruction skill, (2) tracing factors that contribute to the spatial variations of reconstruction skill, and (3) evaluating a method's ability to produce skillful index and field reconstructions. The four reconstruction techniques that we evaluate are (1) truncated total least squares regression embedded within the regularized expectation-maximization algorithm (Schneider, 2001, hereinafter RegEM-TTLS), (2) the Mann et al. (2009) implementation of RegEM-TTLS (hereinafter M09), (3) canonical correlation analysis (Smerdon et al., 2010, hereinafter CCA), and (4) Gaussian graphical models embedded within the EM algorithm (Guillot et al., 2013, hereinafter GraphEM). We first explore the spatiotemporal characteristics of M08 proxies in Sect. 2, and then describe the employed CFR techniques in Sect. 3. We present results in Sect. 4, followed by a discussion (Sect. 5) and a summary of our findings (Sect. 6).

Properties of real-world proxy networks
We consider the M08 proxy network as the basis for our pseudoproxy emulation of a real-world proxy network. The M08 proxy network has a relatively extensive spatial coverage over land, and most proxies have a temporal resolution of less than 10 yr. More importantly, the M08 network has recently been used to derive real-world CFRs (Mann et al., 2009), in which the authors reconstructed spatial patterns of surface temperature over the past 1500 yr and explored their associated dynamical causes. Out of the total 1209 proxies in the network, we exclude 71 European composite surface temperature reconstruction records (Luterbacher et al., 2004), so that only true natural proxies are used as a basis for our emulation. Figure 1 shows the spatiotemporal distribution of the remaining 1138 proxies. Most proxies are concentrated in extra-tropical land regions of the Northern Hemisphere, particularly across North America and western Europe (Fig. 1a). Tree-ring width is the dominant proxy class, and fewer than 200 proxies in total are available prior to 1400 AD (Fig. 1b).

Spatial characteristics
Spatial relationships between proxies (P) and temperature (T) are explored by calculating the Pearson correlation coefficient (ρ) between each proxy and the HadCRUT3v surface temperature field (Brohan et al., 2006, the temperature target used in the M08 and M09 studies). As in M08, temperature grid boxes less than 10 % complete were removed from the HadCRUT3v data set, and missing values were infilled with the RegEM algorithm using ridge regression (Schneider, 2001) during the 1850-2006 AD period. ρ is calculated between annually averaged HadCRUT3v temperature observations and annual proxy data over the 1850-1995 AD period. The statistical significance of ρ is also taken into account, using a spectrum-preserving, nonparametric test (Ebisuzaki, 1997).
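The core idea of the Ebisuzaki (1997) test can be illustrated with a minimal numpy sketch (function names are our own, and the published test differs in details, e.g., its handling of autocorrelation in both series): surrogates preserve the proxy's power spectrum but randomize its Fourier phases, and the observed correlation is compared against the surrogate null distribution.

```python
import numpy as np

def phase_randomize(x, rng):
    """Surrogate series with the same power spectrum as x but
    randomized Fourier phases (spectrum-preserving null)."""
    n = len(x)
    fft = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0, 2 * np.pi, len(fft))
    phases[0] = 0.0            # keep the DC component real
    if n % 2 == 0:
        phases[-1] = 0.0       # keep the Nyquist component real for even n
    return np.fft.irfft(np.abs(fft) * np.exp(1j * phases), n)

def surrogate_pvalue(proxy, temp, n_surr=500, seed=0):
    """Fraction of surrogate correlations at least as large (in absolute
    value) as the observed proxy-temperature correlation."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(proxy, temp)[0, 1]
    null = np.array([np.corrcoef(phase_randomize(proxy, rng), temp)[0, 1]
                     for _ in range(n_surr)])
    return np.mean(np.abs(null) >= np.abs(r_obs))
```

Because the surrogates retain the proxy's spectral shape, serially correlated proxies are not unfairly rewarded for slow, coincidental covariation with temperature.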
In Fig. 2, |ρ|_max is plotted as a function of P-T distance, where |ρ|_max(i) = max_{j in [1, p]} |ρ(P_i, T_j)| is the highest absolute value of the estimated correlation coefficients between the ith proxy and all temperature grid points. The total number of temperature grid cells is p = 1732, as in M08. All the temperature and proxy data can be downloaded at http://www.ncdc.noaa.gov/paleo/pubs/mann2008/mann2008.html.
Contrary to common assumptions (e.g., Jones and Mann, 2004), we find that ρ(P, T) is not a monotonically decreasing function of distance. As shown in Fig. 2, the distribution of P-T distance is bimodal: one cluster of proxies is well correlated with local temperature (distance shorter than 2000 km, similar to findings in Hansen and Lebedeff, 1987), but the majority of proxies are at least 8000 km away from the temperature point yielding the highest |ρ|. On the other hand, the distribution of |ρ|_max is unimodal and positively skewed. The distribution exhibits a mode near 0.4, while high values are quite rare (95 % of ρ values are below 0.76). The average |ρ|_max is 0.45, corresponding to a P-T distance of 11 000 km. The counterintuitive pattern of Fig. 2 is a consequence of two effects: many proxies indeed are primarily sensitive to local temperature, but the probability of finding a spurious (non-physical) correlation also increases as the search radius increases. Some of the high non-local correlations may be reflective of long-range temperature dependencies (teleconnections, cf. Liu and Alexander, 2007), such as precipitation proxies in the southwestern United States (e.g., Cook et al., 2004, 2007; see also Fig. S23, Supplement). Others may arise by chance alone. Since we lack a theoretical criterion to distinguish real teleconnections from spurious correlations, we constructed pseudoproxies representing two end-member possibilities: one corresponding to local temperature associations, and the other mimicking each proxy's highest potential to capture large-scale teleconnections. An alternative network design, balancing the two extreme scenarios, is explored in the supplementary information (SI, Sect. 3).
Traditionally, pseudoproxies P(x, t) are generated according to

P(x, t) = T_s(x, t) + ε(x, t) / SNR,   (1)

where T_s is a time-standardized version of T. The primary data of T are grid cells extracted from GCM fields in a way that mimics instrumental data availability. ε(x, t) are independent realizations of a Gaussian white-noise process with zero mean and unit variance, and the signal-to-noise ratio (SNR) controls the amount of noise in the pseudoproxies (Mann and Rutherford, 2002; von Storch et al., 2004; Mann et al., 2005, 2007; Rutherford et al., 2005; Küttel et al., 2007; Smerdon et al., 2008, 2011; Christiansen et al., 2009; Emile-Geay et al., 2013a). SNR is related to proxy-temperature correlations ρ via

SNR = |ρ| / sqrt(1 − ρ²)   (2)

(Mann et al., 2007). While most studies have heretofore considered spatially uniform SNRs, it is clear that |ρ| is quite variable (Fig. 2), requiring pseudoproxy networks to contain such variability. In this study, spatial variations of SNRs reflecting the bimodal pattern of Fig. 2 are incorporated via Eq. (1). As illustrated in Fig. 3, even in the best-case scenario (max SNR), SNR is on average lower than 0.5, with fewer than 30 proxies exhibiting an SNR above 1.0. In the local SNR case, the mean SNR is even lower (0.27), close to the low end of SNRs usually considered in pseudoproxy studies (0.25). We emphasize that neither choice of SNR design is physically realistic; instead, each may only be viewed as an end-member experiment of real-world conditions. A middle-ground scenario, balancing locality with the ability to capture the largest correlations, is explored in the SI (Sect. 3). Results based on this intermediate SNR design are found to be very similar to the max SNR case, and thus are not shown in the manuscript.
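The pseudoproxy recipe of Eq. (1) and the SNR-correlation relation of Eq. (2) can be sketched in a few lines of numpy (function names are ours); with a unit-variance standardized target and unit-variance white noise scaled by 1/SNR, the expected proxy-temperature correlation is SNR / sqrt(1 + SNR²).

```python
import numpy as np

def make_pseudoproxy(temp, snr, rng):
    """Eq. (1) sketch: standardized temperature plus white noise,
    with the noise amplitude set by the signal-to-noise ratio."""
    t_s = (temp - temp.mean()) / temp.std()   # time-standardized target
    noise = rng.standard_normal(len(temp))    # unit-variance white noise
    return t_s + noise / snr

def snr_from_rho(rho):
    """Eq. (2) sketch: SNR implied by a proxy-temperature correlation rho."""
    return abs(rho) / np.sqrt(1.0 - rho ** 2)
```

For example, a proxy-temperature correlation of 0.5 implies an SNR near 0.58, consistent with the low SNR values that dominate Fig. 3.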
Similar characteristics were also considered in PPEs by Christiansen et al. (2009), in which empirical SNRs and noise values were used to reflect the heterogeneous proxy quality in the Mann et al. (1998) proxy network. Their work, however, did not model the temporal heterogeneity in proxy availability, and spatial skill evaluation was not their main focus. Our study seeks to evaluate the spatial heterogeneity of multiproxy networks and its impact on derived CFRs. Six networks are designed to address this problem, of which two model the spatial variation of SNRs in real-world proxies (local SNR and max SNR) and four have uniform SNRs. Following previous studies (Mann and Rutherford, 2002; von Storch et al., 2004; Mann et al., 2005, 2007; Rutherford et al., 2005; Küttel et al., 2007; Smerdon et al., 2008, 2011; Werner et al., 2013), the four networks of homogeneous quality are designed by assigning constant SNRs (SNR = ∞, 1.0, 0.5, 0.25) in Eq. (1). These six networks together provide the basis for our experiments.

Temporal characteristics
Another realistic characteristic we incorporate into the PPE design is the temporal heterogeneity of proxy availability. As shown in Fig. 1b, data availability decreases steeply back in time, and a staircase pattern is evident for all proxy classes. In a similar manner, the effective SNR, which is the average SNR of all proxies available at a given time point, also declines back in time (Fig. 4). The pattern is consistent with properties of the M08 network: most of the high-SNR proxies (such as tree rings and corals) are only available for several decades or centuries prior to widespread observational data. For instance, most tree-ring chronologies drop out of the network prior to the 16th century, with fewer than 100 (out of an original 1031) still available before the 14th century. Overall, only 47 proxies are available throughout the entire reconstruction period, of which 19 have decadal or lower resolution.
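The effective SNR defined above (the mean SNR over whichever proxies are available at each time step) is straightforward to compute; the following sketch, with names of our own choosing, makes the definition concrete.

```python
import numpy as np

def effective_snr(snr, available):
    """Mean SNR of the proxies available at each time step.

    snr:       (n_proxies,) array of per-proxy SNR values
    available: (n_time, n_proxies) boolean availability mask
    Returns an (n_time,) array; NaN where no proxy is available."""
    counts = available.sum(axis=1)
    total = available @ snr                     # sum of SNRs of available proxies
    return np.where(counts > 0, total / np.maximum(counts, 1), np.nan)
```

As high-SNR proxies drop out back in time, both the count and the sum shrink, reproducing the decline seen in Fig. 4.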
To isolate the impact of temporal availability, we specify two types of pseudoproxy networks:

1. M08 "flat" network: pseudoproxy availability is uniform through time.

2. M08 "staircase" network: pseudoproxy availability decreases back in time, following the temporal availability of the M08 network (Fig. 1b).
Contrasting these two cases will therefore characterize the impact of temporal heterogeneity on CFR performance.

(Fig. 4 caption: effective SNR (Eq. 2) for idealized pseudoproxy networks and observed networks. For the observed SNRs, we consider two possible scenarios as described in Sect. 2.1. Shaded areas correspond to the 2.5-97.5 percentile interval, providing a complete picture of effective SNR across the 100-member noise ensemble. Dashed lines correspond to the values of SNR for temporally invariant networks. SNR = ∞ is not plotted.)

Limitations
Despite the spatiotemporal characteristics that are modeled in the pseudoproxies, the networks are still idealized in various respects. The pseudoproxy networks do not model the temporal auto-correlation (persistence) present in real-world temperature and proxy data, nor do they consider the effect of using low-resolution data for reconstructions of annual temperature. All pseudoproxies are generated on an annual basis without regard for the proxies' actual resolution (as done in most other pseudoproxy studies). To more realistically model real-world proxies, future PPE designs should represent the actual resolution of each proxy (Christiansen, 2011).
Using Gaussian white noise for ε in Eq. (1) is a natural first step, but a more complex noise model could be used to better reflect real-world noisy proxies (Smerdon, 2012, and references therein), as done in Tingley and Huybers (2010a, b) and Christiansen and Ljungqvist (2012). Furthermore, mechanistic proxy models could be used to simulate synthetic proxy records with more realistic properties (Anchukaitis et al., 2006; Evans, 2007; Cobb et al., 2008; Thompson et al., 2011; Evans et al., 2013). Finally, the target field is assumed to be noise-free, yet in reality, gridded instrumental observations may contain substantial noise or interpolation errors, which can strongly influence the derived calibrations and thus the reconstruction in the pre-instrumental era (Emile-Geay et al., 2013a, b). While we explore herein the impacts of significant advancements in PPE design, incorporation of the above considerations will further improve the degree to which PPEs can be interpreted as representative of real-world CFR performance.

CFR techniques
Two classes of statistical methods are commonly used to perform CFRs. One is based on multivariate linear regression models, where inference is performed in a frequentist framework (e.g., Mann et al., 1998, 2008, 2009; Schneider, 2001; Luterbacher et al., 2004; Smerdon et al., 2010; Guillot et al., 2013), and the other uses Bayesian hierarchical models (BHMs; e.g., Li et al., 2010; Tingley and Huybers, 2010a). We restrict our attention to frequentist regression-based methods, since only those have heretofore been used to derive global/hemispheric CFRs.
Let P be an n_p × p_p matrix of proxy values and T be an n_t × p_t matrix of instrumental temperature records, where n_p and n_t are the number of years of available data (i.e., number of observations), p_p and p_t are the number of spatial locations (i.e., number of variables), and the subscripts p and t denote proxies and instrumental data, respectively. Traditional regression-based CFR methods assume a multivariate linear relationship between proxies and the climate variable of interest, e.g., temperature (Jones and Mann, 2004; National Research Council, 2006; Jones et al., 2009; Tingley et al., 2012). Additionally, each year is often treated as an independent observation. In this context, temperature may be estimated from the proxies via the regression equation

T = P B + ε,   (3)

where ε is an error term following a multivariate normal distribution with zero mean. In the sample-rich setting familiar from classic regression problems (e.g., Anderson, 2003), the optimal estimate of B would be given by the ordinary least squares (OLS) estimate

B̂_OLS = (P_c^T P_c)^(−1) P_c^T T_c,   (4)

where P_c is the submatrix of P spanning the calibration interval. This formulation is such that, in order to estimate B, P_c^T P_c must be invertible (non-singular). In paleoclimate applications, however, it is often the case that P_c^T P_c is rank-deficient (i.e., not invertible): instrumental temperature records are only available for the past 150 yr or so (n_t ≈ 150), and the number of proxies p_p is on the order of 10^3 (high dimension, low sample size). In this setting, the OLS estimate is no longer optimal, and may be wildly erroneous. Some form of regularization is needed to make P_c^T P_c invertible, and thus solving Eq. (4) amounts to finding a regularized least squares solution (Hansen, 1998). Each regression-based CFR technique accomplishes this in a different manner. Our study focuses on four such techniques: RegEM-TTLS, M09, GraphEM and CCA.
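To make the rank-deficiency problem concrete, the sketch below contrasts the OLS normal equations with a simple ridge-regularized variant; ridge is used here only as the most elementary illustration of regularization (the four CFR techniques in this paper regularize differently), and the function names are ours.

```python
import numpy as np

def ols_estimate(Pc, Tc):
    """OLS: B = (Pc^T Pc)^{-1} Pc^T Tc; valid only when Pc^T Pc is invertible."""
    return np.linalg.solve(Pc.T @ Pc, Pc.T @ Tc)

def ridge_estimate(Pc, Tc, lam):
    """Ridge: add lam*I so the normal equations are always solvable,
    even when the number of proxies far exceeds the number of years."""
    k = Pc.shape[1]
    return np.linalg.solve(Pc.T @ Pc + lam * np.eye(k), Pc.T @ Tc)
```

With n_t ≈ 150 calibration years and ~10³ proxies, Pc^T Pc has rank at most 150 out of ~1000, so OLS fails while the regularized system remains well posed.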

RegEM
RegEM (Schneider, 2001) is a variant of the EM algorithm (Dempster et al., 1977; Little and Rubin, 2002) designed for the imputation of missing values in spatiotemporal data sets typically encountered in climatology. Under the multivariate normal assumption, given an initial estimate of the mean μ and the covariance matrix Σ, the RegEM algorithm reduces to regressing the missing values onto the available ones (instrumental temperature data and overlapping proxies). The estimates of μ and Σ are updated at each iteration until convergence is achieved. Two regularization methods have been considered in RegEM: one is ridge regression (Tikhonov and Arsenin, 1977; Hoerl and Kennard, 1970a, b), and the other is truncated total least squares (TTLS). Ridge regression has been found to incur variance losses in this context; TTLS mitigates such variance losses, and therefore is chosen as our regularization method for this study. We employ two different styles of RegEM-TTLS: one following the standard formulation described in Schneider (2001), and the other following its paleoclimate-specific implementation described in Mann et al. (2009).

M09 implementation of RegEM-TTLS
The M09 implementation of TTLS uses a hybrid version of RegEM-TTLS (Mann et al., 2007) that treats low-frequency and high-frequency signals separately (the domains are split at a 20 yr period). The reconstruction is then performed in a forward stepwise approach, century by century. For each step, the target data are compressed in that only the first M leading modes of surface temperature are retained, where M is an estimate of the number of degrees of freedom in the proxy network, determined by a fit to its log-eigenvalue spectrum. Additionally, semi-adaptive choices are adopted for both the low-frequency truncation parameter k_l and the high-frequency truncation parameter k_h. The method selects (1) k_l to retain 33 % of the low-frequency multivariate data variance (Mann et al., 2009; Rutherford et al., 2010); and (2) k_h by detecting the first break in the log-eigenvalue spectrum of high-frequency multivariate data variance.
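The variance-retention rule for k_l can be sketched as follows (our own minimal rendering of the "retain 33 % of the variance" criterion; M09's actual implementation operates on the low-frequency component of the multivariate data and differs in detail).

```python
import numpy as np

def truncation_by_variance(eigenvalues, frac=0.33):
    """Smallest number of leading modes whose eigenvalues retain at
    least `frac` of the total variance (cf. the M09 33% rule)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cum = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cum, frac) + 1)
```

For a steep eigenvalue spectrum this rule keeps very few modes, which is exactly the strong regularization needed when the proxy network carries only a handful of effective degrees of freedom.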
Although these ad hoc modifications are not grounded in statistical theory, the M09 implementation of TTLS has proven effective in practice (Mann et al., 2009; Emile-Geay et al., 2013a, b). Indeed, the M09 implementation is the only technique that has ever been used for global-scale, real-world CFRs, so it is taken as the benchmark of our study. By comparing M09 to TTLS with fixed regularization (i.e., RegEM-TTLS), we investigate the merits of the M09 approach to parameter selection in an ensemble framework, which is a novel assessment of this heuristic approach.

CCA
CCA (Christiansen et al., 2009; Smerdon et al., 2010) is based on ideas presented in Barnett and Preisendorfer (1987). As discussed in Smerdon et al. (2010), CCA employs singular value decomposition (SVD) to perform dimensional reduction separately on T, P, and B. The basic assumption, as in most paleoclimate applications, is that the first few leading EOF-PC pairs contain most of the variance in the target climate field and the multiproxy network. The algorithm seeks an optimal set of truncation parameters (d_t, d_p, d_cca) that yields good approximations of T, P, and B, respectively. These truncation parameters are chosen by minimizing the area-weighted root mean square error (RMSE) of the reconstruction relative to the target field using a leave-half-out cross-validation procedure (e.g., Hastie et al., 2008, Chap. 7). Smerdon et al. (2010, 2011) applied CCA only to pseudoproxy networks with constant temporal availability. For this study, we modify the original CCA code to make it applicable to real networks with variable temporal availability, and also make it more computationally efficient. Readers are referred to the SI for more details.
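The cross-validated choice of a truncation parameter can be sketched with a single rank parameter and a truncated-SVD regression standing in for CCA's (d_t, d_p, d_cca) triple; this is a deliberately simplified analogue (function names and the single-parameter setup are ours, not the CCA code of Smerdon et al.), but it shows why held-out RMSE guards against both underfitting and overfitting.

```python
import numpy as np

def tsvd_regress(P, T, k):
    """Least squares through the k leading singular modes of P."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T @ T

def choose_rank_cv(P, T, ranks):
    """Leave-half-out CV sketch: fit on the first half of the record,
    score RMSE on the second half, return the best-scoring rank."""
    n = P.shape[0] // 2
    Ptr, Ttr, Pva, Tva = P[:n], T[:n], P[n:], T[n:]
    rmse = [np.sqrt(np.mean((Pva @ tsvd_regress(Ptr, Ttr, k) - Tva) ** 2))
            for k in ranks]
    return ranks[int(np.argmin(rmse))]
```

Too small a rank misses real covariance structure (large held-out error from underfitting); too large a rank inverts noise-dominated singular values (large held-out error from overfitting); the CV minimum sits in between.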

GraphEM
GraphEM (Guillot et al., 2013) is based on the theory of Gaussian graphical models (GGMs, a.k.a. Markov random fields; Whittaker, 1990; Lauritzen, 1996). A GGM makes use of the conditional independence structure of the climate field in order to reduce the dimensionality and obtain a parsimonious estimate of the precision matrix Ω ≡ Σ^(−1). The conditional independence relations are estimated by solving an ℓ1-penalized maximum likelihood problem (Friedman et al., 2008). Σ is then estimated in accordance with these conditional independence relations. The resulting estimate is sparse and better conditioned, and therefore is applicable within the OLS framework. This procedure is implemented within the standard EM algorithm without further need for regularization. GraphEM was extensively tested against RegEM-TTLS in Guillot et al. (2013), albeit with the more idealized proxy design of Smerdon et al. (2011). One goal of this study is to document GraphEM's performance in a more realistic context.
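The key property a GGM exploits, that conditional independence shows up as zeros in the precision matrix Ω, can be illustrated with a toy example (this demonstrates the concept only; it is not GraphEM's ℓ1-penalized estimation): in a Gaussian chain x1 → x2 → x3, x1 and x3 are strongly correlated, yet conditionally independent given x2, so Ω[1,3] is (near) zero.

```python
import numpy as np

# Toy Gaussian chain: x1 -> x2 -> x3.  Marginal corr(x1, x3) is large,
# but x1 and x3 are conditionally independent given x2, so the (1,3)
# entry of the precision matrix Omega = Sigma^{-1} vanishes.
rng = np.random.default_rng(0)
n = 100_000
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + 0.4 * rng.standard_normal(n)
x3 = 0.9 * x2 + 0.4 * rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
omega = np.linalg.inv(np.cov(X, rowvar=False))
```

A sparse Ω of this kind is what makes the implied covariance well conditioned in the high-dimension, low-sample regime of CFRs: only the graph's edges need to be estimated, not every pairwise dependence.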

Numerical experiments
As in Smerdon et al. (2011), all of our pseudoproxy experiments target the annual surface temperature field computed from the NCAR CSM1.4 integration of Ammann et al. (2007), using the correctly oriented version of the CSM1.4 field (Smerdon et al., 2010). Although multiple last-millennium simulations are becoming publicly available as part of the Paleoclimate Modelling Intercomparison Project Phase 3 (PMIP3) and through other projects (Fernández-Donado et al., 2013), we selected the CSM1.4 model to enable comparisons with previous work (Mann et al., 2005, 2007, 2009; Lee et al., 2008; Li et al., 2010; Ammann et al., 2010; Smerdon et al., 2010, 2011). The target field was spatially masked to approximate the availability of the HadCRUT3v data set used in the Mann et al. (2009) study. We generated 100 realizations of pseudoproxy series for each SNR case (SNR = ∞, 1.0, 0.5, 0.25, local SNR and max SNR) by varying ε in Eq. (1), and then performed reconstructions with the four CFR techniques. The first four cases have fixed SNR, corresponding to networks of homogeneous quality.
The ensemble approach allows us to identify spurious reconstruction skill arising from random noise, and to test the robustness of each method. We also conducted experiments using both the flat and staircase networks. Experiments using the flat network, in which the spatial distribution of pseudoproxies is temporally invariant, serve as the control group in this study. Experiments with the staircase network correspondingly serve as the test group. By comparing results between these two experiments, we test the null hypothesis that temporal heterogeneity does not affect reconstruction skill.
Reconstruction and calibration intervals are 850-1849 AD and 1850-1995 AD, respectively. Annual means of temperature and pseudoproxies are used for the reconstructions. The same input data are assigned to each CFR method, and standardization (if necessary) is performed internally in each method. Reconstructions are referenced to zero over the calibration interval. Possible trends during the calibration interval are not removed. We also consider computational cost, and investigate numerical methods to speed up each CFR technique. Further details are provided in the SI.

Skill metrics
As is common practice in paleoclimatology, we evaluate reconstruction skill using the coefficient of efficiency (CE), reduction of error (RE) and the coefficient of determination (R²) (Cook et al., 1994; Bürger, 2007). These validation statistics (Table 1) are related to the mean squared error (MSE) that is commonly used for statistical analysis. They are calculated for both the global mean temperature index and the spatial field; the former provides an aggregate summary of a method's ability to track global climate fluctuations, and the latter evaluates a method's spatial performance.

Table 1. Comparison between R², RE, and CE. y_i denotes the ith temperature value, and ŷ_i denotes the estimate of y_i. ȳ_c and ȳ_v refer to the mean of (y_1, y_2, ..., y_n) over the calibration period and verification period, respectively. Adapted from National Research Council (2006), Chap. 6.

Metric | Expression | Range | Tracks
R² | corr(y, ŷ)² over the verification period | [0, 1] | Phase
RE | 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ_c)² | [−∞, 1] | Phase, amplitude and mean
CE | 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ_v)² | [−∞, 1] | Phase, amplitude and mean
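The three statistics of Table 1 can be computed in a few lines (function name is ours); the only difference between RE and CE is the reference mean, the calibration-period mean for RE versus the verification-period mean for CE, which is why CE is the more stringent of the two.

```python
import numpy as np

def validation_stats(y_true, y_pred, y_cal_mean):
    """RE, CE and R^2 over a verification period (Table 1 conventions).
    y_cal_mean is the target mean over the calibration period."""
    sse = np.sum((y_true - y_pred) ** 2)
    re = 1.0 - sse / np.sum((y_true - y_cal_mean) ** 2)     # vs calibration mean
    ce = 1.0 - sse / np.sum((y_true - y_true.mean()) ** 2)  # vs verification mean
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2             # phase only
    return re, ce, r2
```

Note that a reconstruction offset by a constant bias keeps R² = 1 while RE and CE degrade, which is the sense in which R² tracks phase only.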
The global mean enables comparisons with index reconstructions, which comprise the majority of published reconstructions of hemispheric and global temperatures (Fig. 6.10 in Jansen et al., 2007).
As indicated in Fig. 3, the quality of pseudoproxies in the local SNR case is, on average, comparable with the SNR = 0.25 network, and the average SNR of pseudoproxies in the max SNR case is similar to the SNR = 0.5 network. Based on these considerations, we only show results from the spatially heterogeneous pseudoproxy networks. The reader is referred to the SI for results on spatially homogeneous networks.

Reconstructing the global mean
In the global mean temperature reconstructions, the overall shape of low-frequency variability is reasonably well reconstructed by all CFR methods (Fig. 5). Warm biases nevertheless are present in all of the reconstructed estimates (von Storch et al., 2004; Smerdon et al., 2011), as expected from regression dilution (e.g., Frost and Thompson, 2000; Tingley et al., 2012). In the local SNR case (Fig. 5a), GraphEM and CCA, in particular, underestimate the amplitude of variability by a factor of 3-5. The variance-bias decomposition (Hastie et al., 2008, Chap. 7) further shows that for GraphEM and CCA (Fig. 6, right column), the bias contributes more than 75 % of their MSE. RegEM-TTLS and M09, on the other hand, have a similar variance but a much smaller bias, and thus a correspondingly smaller MSE. Overall, M09 produces the most skillful global mean temperature series in both cases, closely followed by RegEM-TTLS. It would be erroneous, however, to conclude that these two methods produce the closest match to the target, as there is a large spread between different noise realizations.
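The decomposition invoked above rests on the identity MSE = bias² + variance, evaluated here across the noise ensemble; a minimal sketch (function name is ours) is:

```python
import numpy as np

def bias_variance_decomposition(recons, target):
    """Split ensemble MSE into squared bias plus ensemble variance.
    recons: (n_members, n_time) reconstructions; target: (n_time,) truth."""
    mean_recon = recons.mean(axis=0)
    bias2 = np.mean((mean_recon - target) ** 2)   # squared bias term
    var = np.mean(recons.var(axis=0))             # ensemble variance term
    mse = np.mean((recons - target) ** 2)         # equals bias2 + var
    return mse, bias2, var
```

In these terms, GraphEM and CCA sit at the low-variance, high-bias end of the tradeoff, while RegEM-TTLS and M09 accept more ensemble variance in exchange for smaller bias.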
In assessing the risk properties of each method (Fig. 5a), we find that GraphEM and CCA are more consistent estimators than RegEM-TTLS and M09: their ensemble spreads are much narrower, especially for early reconstruction intervals (prior to 1600 AD). This indicates that any given RegEM-TTLS or M09 reconstruction may yield an inaccurate depiction of the true temperature, a risk that should be kept in mind especially when using M09 and RegEM-TTLS for real-world reconstructions.

Spatial performance
We now examine the spatial performance of the four CFRs. In Figs. 7 and 8, we summarize the century-by-century skill variation and each CFR's ensemble spread using box plots of the globally averaged CE statistic. CE is shown because it is the most stringent metric of the three in Table 1 (Cook et al., 1994; Ammann and Wahl, 2007), and thus exposes significant contrasts between methods (Table S1, Supplement). Box plots characterize the full distribution of the 100-member ensemble: the median of each distribution assesses the tendency for temporal proxy availability to affect reconstruction skill, while the spread yields information about the consistency of each method. The impact of spatial heterogeneities is isolated by plotting spatial maps of CE using the flat network in Figs. 9 and 10, which help us trace spatial errors and inter-method similarities and differences.

Effect of temporal heterogeneities
As evident in Figs. 7 and 8, reconstruction skill varies substantially from century to century, even when proxy availability is time-invariant (gray box plots in Figs. 7, 8, S13, S14, and S3-S12, Supplement). In general, reconstruction skill is highest in the most recent 100 yr slice (1750-1849 AD) but does not decrease monotonically prior to this slice. For instance, reconstruction skill decreases from 1050 to 1450 AD, but increases from 850 to 1049 AD. During this interval (broadly coincident with the "Medieval Climate Anomaly"), the NCAR CSM1.4 simulation features strong forcing events, which may have more coherent spatial expressions than other fluctuations and thus appear to be more easily captured by the proxy network. This suggests that reconstruction skill is not only affected by proxy availability and quality, but is also a function of the type and amplitude of climate variations (i.e., internally generated vs. externally forced variations). See Sect. 4 in the Supplement for further discussion.
An additional important observation is that RegEM-TTLS in the local SNR case is distinct from the other three methods (Fig. 7) in that the temporal availability of input data dominates the reconstruction skill, which is a monotonically increasing function of time. The ensemble spread also widens back in time (consistent with Fig. 5a). The decreasing skill back in time for RegEM-TTLS is partially due to the fact that the method uses a fixed truncation parameter for the estimation of the regression coefficients, despite the declining availability (Fig. 3) and quality (Fig. 4) of pseudoproxies back in time. Consequently, the TTLS solution tends to be less regularized, and hence is dominated by noise (Sima and Van Huffel, 2007). For GraphEM, a fixed graph is used for the entire reconstruction interval. The graph identifies most of the significant proxy-temperature relationships, which GraphEM is thus able to exploit efficiently for the reconstruction. M09 and CCA, on the other hand, have semi-adaptive and adaptive criteria, respectively, when performing reconstructions, and are less sensitive to temporal heterogeneities in the pseudoproxies. In the max SNR case (Fig. 8), RegEM-TTLS shows a similar pattern to the other CFRs. This implies that for RegEM-TTLS, the relationship between reconstruction skill and data availability is conditionally dependent on data quality: when proxy quality is relatively high (max SNR), reconstruction skill is relatively insensitive to the choice of truncation parameters in RegEM-TTLS, but, like other CFRs, the skill is still sensitive to high-amplitude climate events.
Despite similarities across reconstructions of global mean temperature (Figs. 5b, 6; max SNR columns), the spatial metrics reveal large discrepancies among methods. Although M09 and RegEM-TTLS perform well in reconstructing the global mean temperature, their globally averaged spatial skill only rises above zero in the last two centuries of the reconstruction, even in the max SNR case (Fig. 8). GraphEM and CCA, on the other hand, display high spatial skill for most of the reconstruction period (in the max SNR case, Fig. 8).

Effect of spatial heterogeneities
To isolate the effect of spatial heterogeneities, and to better visualize spatial patterns, Figs. 9 and 10 display the spatial pattern of CE using the flat network over the first (850-949 AD) and last (1750-1849 AD) centuries of the reconstruction. In Fig. 9, a band of high CE scores connecting the eastern equatorial Pacific to North America is evident in all cases, and appears to be a feature of CSM1.4's climate (Smerdon et al., 2011). Similarly, there is some reconstruction skill over other oceans where no proxies are available. Collectively, this indicates that modeled teleconnections are effectively exploited by all methods to reconstruct surface temperature in regions with little to no proxy coverage. However, the pattern of enhanced skill associated with ENSO (El Niño-Southern Oscillation) teleconnections vanishes in the 850-949 AD interval when employing RegEM-TTLS or M09 on the max SNR network (Fig. 10), but is still visible in CFRs using CCA and GraphEM. This suggests that the latter two methods are more skillful in resolving such spatial patterns.
Using the local SNR network, we find that RegEM-TTLS and M09 both produce more skillful reconstructions than GraphEM and CCA. In the case of max SNR, however, the results are opposite: reconstructions with CCA and GraphEM are more skillful. In particular, GraphEM is the most skillful method almost everywhere (Fig. 10) and across all time intervals (Figs. 8, S10, Supplement). The goal of dimension reduction in CCA is to increase the solution stability, which is achieved by pre-filtering noise and retaining only a few leading modes. Similarly, GraphEM filters out spurious proxy-temperature relationships and noise by assigning zeros in the precision matrix (Friedman et al., 2008; Hastie et al., 2008; Guillot et al., 2013). In the case of the local SNR network, most proxies have SNRs lower than 0.3 (or equivalently, more than 92 % noise by variance), and hence are dominated by random noise. As a consequence, both CCA and GraphEM tend to treat those proxies as noise and filter them out, shrinking the reconstruction closer to the calibration mean. Figures 9 and 10 thus suggest that RegEM-TTLS and M09 are likely to be more powerful when data are very noisy (the local SNR network). Nevertheless, the poor risk properties of these two methods, as discussed above, indicate that single inferences drawn with them should be treated with caution.
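The correspondence between an SNR of 0.3 and roughly 92 % noise follows from treating SNR as the ratio of signal to noise standard deviations, so that noise contributes 1/(1 + SNR²) of the pseudoproxy variance. A quick check of this arithmetic (our reading of the SNR convention used here):

```python
def noise_variance_fraction(snr):
    """Fraction of pseudoproxy variance contributed by noise, with SNR
    defined as the ratio of signal to noise standard deviations."""
    return 1.0 / (1.0 + snr ** 2)

print(round(noise_variance_fraction(0.3), 3))  # 0.917, i.e., ~92 % noise
print(round(noise_variance_fraction(0.5), 3))  # 0.8
print(round(noise_variance_fraction(1.0), 3))  # 0.5
```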
Despite the differences among methods described above, some common features emerge when comparing reconstructions with the realistic SNRs to uniform SNR networks (Figs. S9-S12, Supplement). Compared with CFRs using the SNR = 0.25 network (Fig. S11, Supplement), CFRs using the local SNR network (Fig. S9, Supplement) produce similar results but are less skillful. This is primarily because the local SNR network contains only 312 records, while the SNR = 0.25 network includes all 1138 proxies in the M08 database. The comparison between CFRs with the SNR = 0.5 (Fig. S12, Supplement) and max SNR networks (Fig. S10, Supplement), however, shows that reconstructions using the max SNR network are much more skillful, especially during early reconstruction periods. We discuss this point below.

Discussion
In exploring the proxy-temperature relationship, we find that even in the max SNR network, where the highest possible SNR is taken for each proxy, the average SNR is 0.47 (close to the SNR = 0.5 case used in previous studies). Nevertheless, some of the proxies on the right end of the distribution (Fig. 3) may exhibit SNRs higher than 1.0; these proxies drive reconstruction skill upward, so that reconstructions derived from the max SNR network are more skillful than those using the SNR = 0.5 network (Figs. S10 and S12 in the Supplement, Table 2). In other words, a small number of high-quality proxies may contribute a majority of the reconstruction skill. This is encouraging and suggests that global surface temperature may be skillfully reconstructed without requiring uniform spatial sampling over the entire globe. Nevertheless, in our experimental settings, the number of unique temperature grid cells used for sampling T in Eq. (1) is only 128 and 551 in the local and max SNR networks, respectively.
Sampling the same grid cells multiple times effectively increases the SNR for those grid cells, and thus may contribute to the reconstruction skill. This feature is expected to operate in nature as well, however, and provides additional motivation for replicating proxy records. Additionally, we note again that such conclusions might be model-dependent. The skill observed here may result from the low internal variability in the NCAR CSM1.4 model (partially a consequence of its low resolution). To confirm these findings, similar experiments will be conducted with PMIP3-generation last millennium simulations (http://pmip3.lsce.ipsl.fr/).
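The replication effect can be quantified: averaging n proxies that record the same signal but carry independent noise raises the effective SNR by a factor of √n. A Monte Carlo sketch under those independence assumptions (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_time, n_reps, snr = 2000, 4, 0.5

signal = rng.standard_normal(n_time)
# n_reps pseudoproxies sampling the same grid cell, each with independent noise
proxies = signal[None, :] + (1.0 / snr) * rng.standard_normal((n_reps, n_time))
stacked = proxies.mean(axis=0)

noise_std = np.std(stacked - signal)
effective_snr = np.std(signal) / noise_std
print(effective_snr)  # ~ snr * sqrt(n_reps), i.e., around 1.0
```

In practice, proxy noise at co-located sites is unlikely to be fully independent, so √n is an upper bound on the gain from replication.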
We also find that differences across CFRs are much smaller in the case of the max SNR network (Fig. 10), indicating that, not surprisingly, reconstructions are much less sensitive to methodology when data quality is high. The spatial reconstruction skill, as expected, is highest in regions of dense proxy availability (e.g., North America), which is consistent with previous findings by Smerdon et al. (2011). The contrast between Figs. 9 and 10, both assuming constant time availability, suggests that our CFR methods, in particular GraphEM and CCA, are more sensitive to data quality than to temporal availability.
As mentioned in Sect. 4.3.1, the ensemble spread for each method is quite different. RegEM-TTLS consistently yields the largest spread, followed by M09, CCA, and GraphEM. As an errors-in-variables model, TTLS is designed to minimize the variance of residuals from both the predictands (T − T̂) and the predictors (P − P̂). The minimization is subject to the estimates of the regression coefficients, which in turn depend crucially on the choice of the truncation parameter. Since the noise ε in Eq. (3) is randomly generated, it is sometimes spuriously high in the calibration interval and makes the true signal too noisy for CFR methods to identify, especially in the local SNR network. Under such circumstances, reconstructions using TTLS with fixed truncation may therefore be over-fitting to noise. This confirms an important point made by Christiansen et al. (2009): there is a substantial element of stochasticity in the reconstructions. Hence, one might obtain very different results with the same method applied to different pseudoproxy noise realizations, or to different jackknifed proxy networks in real-world reconstructions. To improve reconstruction skill when employing RegEM-TTLS, we suggest the development of an algorithm that adaptively selects the regularization parameters using standard statistical theory.
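For readers unfamiliar with the regularization step, a bare-bones truncated TLS solve (outside the EM iteration, with an illustrative truncation level k) can be sketched as follows; this is a simplification for exposition, not the full RegEM-TTLS algorithm:

```python
import numpy as np

def ttls_solve(A, b, k):
    """Minimum-norm truncated total least squares solution of A x ~ b,
    keeping the first k singular values of the augmented matrix [A, b]."""
    C = np.hstack([A, b[:, None]])
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    V = Vt.T
    V12 = V[:-1, k:]   # discarded right singular vectors, predictor rows
    v22 = V[-1:, k:]   # discarded right singular vectors, last row
    return (-V12 @ v22.T / (v22 @ v22.T)).ravel()

# noise-free sanity check: with k equal to the number of predictors,
# TTLS recovers the true coefficients
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
x_true = np.array([1.0, -2.0, 0.5])
x_hat = ttls_solve(A, A @ x_true, k=3)
print(np.allclose(x_hat, x_true))  # True
```

With noisy data, choosing k too large leaves the smallest singular values, which are noise-dominated, in the solution; this is the under-regularization described above.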
Compared with RegEM-TTLS, M09 produces more skillful reconstructions. In particular, M09 appears advantageous at very high noise levels (local SNR; Figs. 7, S6, Supplement). This is not surprising, in part because of its heuristic truncation choices. M09 also benefits strongly from the hybrid-frequency approach: white noise contained in the low-frequency pseudoproxies is effectively filtered out, and hence the low-frequency components are better reconstructed in the PPEs. This noise-filtering advantage is likely not present in real-world reconstructions, and the absence of a theoretical justification for the selection criterion makes it vulnerable. Given a different data set or a different noise model (by varying ε in Eq. (1), for instance, using red noise instead of white noise), the 20 yr frequency split might no longer be optimal, and the "33 %-truncation" criterion might also need to change. For the global mean temperature, M09 produces the closest fit to the target overall (Fig. 5), because its truncation parameters are optimized to fit the global mean. However, our results suggest that this optimization comes at the expense of spatial skill, especially during the early reconstruction period (Figs. 10, S5, Supplement).
Reconstructions derived from CCA and GraphEM in general show very similar results, with GraphEM slightly outperforming CCA. In particular, as shown in Figs. 8 and 10, in the max SNR case GraphEM outperforms the other methods at all locations and across all time intervals. This indicates that, given enough high-quality data, GraphEM can produce the most skillful reconstructions. The strength of GraphEM is especially noticeable in regions of dense proxy sampling. For instance, over North America (Fig. S10, Supplement), the other three methods display negative CE scores prior to 1650 AD, yet GraphEM displays positive CE scores over the entire reconstruction interval. Nevertheless, as noted in Fig. 9, GraphEM does not perform as well in the local SNR setting as it does using the max SNR network. GraphEM's superior performance in the max SNR case and its poorer performance in the local SNR case are both largely due to the graphical structure selected by the GraphEM algorithm. In the max SNR case, pseudoproxies have, by design, a much higher SNR than in the local SNR case, so most of the significant proxy-temperature relationships are effectively detected and exploited by the method. In the local SNR case, on the other hand, very few significant relationships are detected, and GraphEM thus fails to produce meaningful CFRs. Beyond this sensitivity to data quality, another potential cause of GraphEM's poor performance in the local SNR case is the choice of the graph. Currently, as described in Guillot et al. (2013), the graph is fixed for the entire reconstruction period and is based solely on instrumental proxy-temperature relationships. To improve the performance of GraphEM-based CFRs, adaptive choices of the graph should be made for each century of the reconstruction; through this approach, the proxies available in each century would be used more effectively. Alternative methods for estimating the graph should also be explored.
As illustrated in Fig. 6, both GraphEM and CCA suffer from large mean biases, which contribute more than 75 % of the total MSE associated with reconstructions from each method. It is well known that introducing a certain amount of bias can lead to a reduction in MSE (compared to using an unbiased estimator). However, this often comes with a reduction of the corresponding variance, i.e., the amplitude of past climatic variations is underestimated (Fig. 5). CCA often does best by the measure of R² values (Table 2). This is expected: the method regularizes by maximizing the cross-correlation between the proxy and target matrices, but without further constraining the variance. One possible modification would be to match the variance of CCA reconstructions to the variance of the target data in each grid cell during the calibration interval. In doing so, more variance would be preserved for networks affected by declining data availability. However, this modification would inflate errors, and the solution could no longer be interpreted as minimizing the calibration misfit. By contrasting the CFRs derived from the four methods in both the spatial and temporal contexts, we find that, despite some general agreement (Fig. 5b) and reasonable skill (Table 1) in the global mean temperature reconstruction, the four methods yield large spatial differences, and their local errors in terms of CE can still be large. This confirms previous findings in Smerdon et al. (2011) that the global mean temperature series is a poor indicator of spatial skill, and that spatial performance metrics are crucial for the assessment of different CFR techniques (e.g., Li and Smerdon, 2012). The results also highlight the difficulty in jointly optimizing the spatial skill and the global mean temperature. Fundamentally, reconstruction skill can be assessed using MSE. As discussed in the previous paragraph, finding the lowest MSE (and thus the best reconstruction) requires a tradeoff between bias and variance. It is therefore likely that the lowest MSE for the spatial field and for the global mean time series are not given by the same set of regularization parameters.
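The variance-matching modification mentioned above can be sketched for a single grid cell (function and variable names are ours, for illustration only):

```python
import numpy as np

def match_calibration_variance(recon, target_cal, recon_cal):
    """Rescale a reconstructed series so that its calibration-period standard
    deviation matches that of the target, preserving the calibration mean."""
    scale = np.std(target_cal) / np.std(recon_cal)
    return np.mean(recon_cal) + scale * (recon - np.mean(recon_cal))

rng = np.random.default_rng(0)
target_cal = rng.standard_normal(150)   # calibration-period target, one cell
recon_cal = 0.4 * target_cal            # variance-deflated reconstruction
adjusted_cal = match_calibration_variance(recon_cal, target_cal, recon_cal)

print(np.isclose(np.std(adjusted_cal), np.std(target_cal)))  # True
```

As noted above, the rescaling restores amplitude but also inflates errors: the scaled series no longer minimizes the calibration misfit.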
We also calculated a suite of diagnostics for reconstruction skill and its dependence on (1) the number of proxies, (2) the average SNR, and (3) the sum of SNR in each grid box. No apparent relationships between these variables and spatial skill were found (Figs. S15-S20, Supplement). Our experiments also highlight the need for methodological refinements, since no method consistently performs well in all cases for both index and field reconstructions. We find that both RegEM-TTLS and M09 produce meaningful global mean reconstructions, but do not perform as well in the spatial field. This disagreement between field and index reconstructions was explored in Guillot et al. (2013), who found that the skillful performance of TTLS-based global mean temperature reconstructions involves considerable cancellation between positive and negative deviations from the true field at any given grid point (see Supplement). Hence, the fidelity of the reconstructed global mean is a poor indication of spatial skill (Figs. S21, S22, Supplement).
Additionally, we note that all methods consistently introduce a warm bias to the global mean temperature reconstruction, even in the max SNR setting. As previously found in von Storch et al. (2004), Christiansen et al. (2009), and Smerdon et al. (2011), regression-based CFR methods are generally associated with variance losses and large mean biases. These are inevitable by-products of linearly regressing temperature onto proxies, and are especially severe if proxies are subject to extensive errors. This well-known problem is called regression dilution (e.g., Frost and Thompson, 2000; Tingley et al., 2012); it commonly translates into reconstructions that are biased towards the mean of the calibration interval. Ammann et al. (2010) proposed a correction for regression dilution in the context of index reconstructions, which minimizes out-of-sample bias over subsets of the calibration interval. An alternative is to consider methods that respect the proxies' physical dependence on temperature, and express proxies as a function of temperature (Christiansen, 2011). Tingley and Li (2012) find that this leads to a reduced bias but may also lead to infinite variance in very noisy cases. They suggest an alternative solution leveraging Bayesian hierarchical models, wherein a proxy's dependence on temperature is formulated using process-based forward models at the data level, allowing for an elimination of the variance inflation and an internally consistent quantification of uncertainties (see also Christiansen, 2012, for further discussion).
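Regression dilution is easy to demonstrate: when the predictor is observed with noise, the OLS slope is attenuated by the factor λ = σ_x² / (σ_x² + σ_u²). A Monte Carlo sketch (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 2.0                         # true slope

x = rng.standard_normal(n)         # true predictor, variance 1
u = rng.standard_normal(n)         # predictor measurement noise, variance 1
y = beta * x + 0.1 * rng.standard_normal(n)
w = x + u                          # noisy observed predictor

slope = np.cov(w, y)[0, 1] / np.var(w)
attenuation = 1.0 / (1.0 + 1.0)    # lambda = var(x) / (var(x) + var(u)) = 0.5

print(slope)                       # ~ beta * attenuation = 1.0, not 2.0
```

Attenuated slopes translate directly into reconstructions shrunk towards the calibration mean, which is the warm bias and variance loss seen in Figs. 5 and 6.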
www.clim-past.net/10/1/2014/ Clim. Past, 10, 1-19, 2014

Conclusions

An updated pseudoproxy network design has been constructed with more realistic characteristics: for the first time, pseudoproxies were sampled with spatiotemporal characteristics that reflect heterogeneities in proxy quality and proxy attrition back in time. The updated network has allowed an assessment of the spatial performance of four different CFR techniques using a comprehensive suite of experiments.
Results based on the max SNR network show relatively small CFR sensitivity to the choice of methodology when SNR is high. However, results are strongly method-dependent in sample-starved settings. Overall, reconstructions are generally better in regions with dense proxy sampling, although teleconnections are also exploited by these CFR methods, in particular CCA and GraphEM, to derive spatial skill outside of directly sampled regions.
The effect of temporal heterogeneities of proxy availability is counterintuitive. We find that despite the declining data availability back in time, reconstruction skill does not necessarily follow suit. Rather, our experiments show that forced, high-amplitude climatic events have a larger impact on reconstruction skill and are more easily resolved by these methods, even when data availability is low. This conclusion is nevertheless model-dependent, and needs to be verified with PPEs using output from other GCMs.
Our experiments also show that no method universally outperforms the others, and that each method has its own strengths and weaknesses. Overall, RegEM-TTLS and M09 produce more skillful index reconstructions (global mean temperature, Fig. 6), and retain higher skill than the other methods when proxies are very noisy (local SNR network, Fig. 9). However, RegEM-TTLS displays large ensemble spreads, partially due to its fixed choice of truncation parameter and its high sensitivity to noise in the data. This emphasizes the high risk associated with conclusions drawn from a single noise realization (such as a real-world reconstruction). The stochasticity of a reconstruction method should therefore always be seriously considered when evaluating a real-world reconstruction (Christiansen et al., 2009).
The heuristic parameter choices proposed by M09 show the potential for RegEM-TTLS to produce meaningful global mean temperature reconstructions. The setup nevertheless deviates greatly from the standard implementation of RegEM-TTLS. Additionally, for global mean reconstructions, both RegEM-TTLS and M09 involve error cancellations that are not readily noticeable (Sect. 2.4 in the Supplement, Figs. S21, S22). This might explain some of the observed divergence between the quality of index and field reconstructions using these methods.
CCA and GraphEM generally produce very similar results, but the former suffers from larger variance losses and associated mean biases. This can be attributed to the manner in which the method selects the optimal estimates of the regression coefficients B. Given enough high-quality data, reconstructions using GraphEM display higher spatial skill than the other three methods everywhere in the field, and in particular over the oceans and regions with denser proxy sampling. This suggests that the reconstruction strongly benefits from the improved covariance estimation induced by the use of Gaussian graphical models.
Given the large performance differences among the various CFR methods in the pseudoproxy context, we emphasize that unless reconstructions with various methods provide very similar spatiotemporal information, real-world reconstructions derived from a single method should be viewed with caution. In agreement with Smerdon et al. (2011), we recommend applying as many methods as possible to reach robust conclusions. Additionally, the exact pattern of spatial skill varies according to the GCM simulation used as the basis of the PPEs. Multiple PMIP3 last millennium simulations should ideally be used to validate the present results. Future studies should also rigorously model real-world conditions, including persistence, noise characteristics, and a mechanistic representation of climate proxies. Finally, we emphasize the fundamental difficulty of finding a bias-variance tradeoff that optimizes the reconstruction of both the temperature field and its global mean. Future studies should explore solutions that jointly minimize spatial and temporal errors.

Fig. 1. Temporal and spatial availability of the M08 proxy network between 850 and 1800 AD. Top panel: the spatial distribution of the M08 proxies, with colored dots indicating the first year that each proxy record becomes available. Each marker represents a proxy class, as indicated in the legend. Bottom panel: the temporal availability of each proxy class.

Fig. 3. Estimated signal-to-noise ratio (SNR) for proxies in the M08 network. Top panel: local SNR scenario, in which SNR is calculated between each proxy and its closest temperature grid cell. Bottom panel: max SNR scenario, in which the highest SNR for each proxy is chosen from its correlations with all available temperature grid cells. Colors reflect the value of SNR assigned to each pseudoproxy, as per Eq. (2).

Fig. 4. Effective network quality expressed as SNR (Eq. 2) for idealized pseudoproxy networks and observed networks. For the observed SNRs, we consider two possible scenarios as described in Sect. 2.1. Shaded areas correspond to the 2.5-97.5 percentile interval, providing a complete picture of effective SNR across the 100-member noise ensemble. Dashed lines correspond to the values of SNR for temporally invariant networks. SNR = ∞ is not plotted.

Fig. 5. Area-weighted global mean time series comparison of the four CFR methods, with the staircase network: (a) local SNR, (b) max SNR. Only the low-frequency (20 yr lowpass) signal is plotted. Black line: target temperature from the CSM1.4 model output; colored lines: reconstructed temperature from the median of the reconstruction ensembles; shaded areas: 2.5-97.5 percentiles derived from the reconstruction ensembles.

Fig. 6. Summary of verification statistics of the global mean temperature reconstruction ensemble. Note that the range of ordinates differs for each metric. MSE = variance + bias².

Fig. 9. Spatial pattern of CE, based on the ensemble median using the local SNR flat network. 850-949 and 1750-1849 AD represent the periods with the minimum and maximum global mean CE over the entire reconstruction interval, respectively.

Maximum absolute correlation coefficient |ρ| between proxies of the M08 network and the HadCRUT3v grid point temperatures vs. the corresponding distance between the proxy location and the grid point. On the y axis is the histogram of the maximum |ρ|; on the x axis is the histogram of the distance between each proxy P_i and the corresponding temperature grid cell T_j that gives the highest |ρ|.