Constraining two climate field reconstruction methodologies over the North Atlantic realm using pseudo-proxy experiments

This study presents pseudo-proxy experiments to quantify the reconstruction skill of two climate field reconstruction methodologies for a marine proxy network subject to age uncertainties. The BARCAST methodology (Bayesian Algorithm for Reconstructing Climate Anomalies in Space and Time) is tested for sea surface temperature (SST) reconstruction for the first time over the northern North Atlantic region, and compared with a classic analogue reconstruction methodology. The reconstruction experiments are performed at annual and decadal resolution. We implement chronological uncertainties inherent to marine proxies as a novelty, using a simulated age-model ensemble covering the past millennium. Our experiments comprise different scenarios for the input data network, with the noise levels added to the target variable extending from ideal to realistic. Results show that both methodologies are able to reconstruct the Summer mean SST skillfully when the proxy network is considered absolutely dated, but the skill of the analogue method is superior to BARCAST. Only the analogue method provides skillful correlations with the true target variable in the case of a realistic noisy and age-uncertain proxy network. The spatiotemporal properties of the input target data are partly contrasting with the BARCAST model formulations, resulting in an inferior reconstruction ensemble that is similar to a white-noise stochastic process in time. The analogue method is also successful in reconstructing decadal temperatures, while BARCAST fails. The results contribute to constraining uncertainties in CFR for ocean dynamics which are highly important for climate across the globe. © 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
The well-documented changes in a number of observational variables demonstrate a significant role of anthropogenic activities in modifying the Earth's climate. Yet, the quantification of natural variability in the climate system is crucial to put recent and future climate change into a long-term perspective. Climate Field reconstructions (CFR) offer critical insights into the range and geographic characteristics of historical climate variability prior to the instrumental era, by scaling-up localized and sparse proxy measurements and providing information that can be compared with climate model simulation output.
A number of CFRs have been published over the last decades, quantifying late Holocene climate variations (Masson-Delmotte et al., 2013;Christiansen and Ljungqvist, 2017). CFR provides a distinct advantage over averaged climate reconstructions for instance by studying the response of climate in a region to some external forcing, such as a large tropical volcanic eruption. A wellknown class of CFR methodology is multivariate linear regression, where the reconstruction problem can be formulated either using direct or indirect regression between proxies and the climate variable. Regularization of the problem is needed because the system of equations is underdetermined. Examples of CFR methodologies using different regularizations include principal component regression (Luterbacher et al., 2002), canonical correlation analysis (Smerdon et al., 2010) and RegEM (Mann et al., 2008(Mann et al., , 2009. A second class of CFR methods includes Bayesian hierarchical models (BHM). The Bayesian methodology differs from the common regression-based method in the sense that model parameters are formulated in probability statements representing the current state of knowledge of the climate variable given observation data. This type of method is probabilistic, resulting in reconstruction ensembles (Tingley and Huybers, 2010;Werner et al., 2018). Yet another class of CFR methodology comprises data assimilation methods (Steiger et al., 2014), typically combined with forward proxy system modeling (Hakim et al., 2016;Tardif et al., 2019). This approach combines a sparse proxy network with climate model simulations through a prior, mapped to the proxy space using proxy system modeling. A Kalman filter is typically used to update the prior. A similar but simpler methodology is the analogue method, or the proxy surrogate reconstruction method (Graham et al., 2007;Franke et al., 2011). Relevant differences between the analogue method and data assimilation techniques are discussed in G omez-Navarro et al. (2017). Instrumental observations or climate model time slices are used as analogues to a gridded proxy network, but there is no prior distribution or Kalman filtering for the analogue method. The reconstruction at every time step is selected as one single or a function of the closest analogue time slices.
The existing CFRs defined with annual temporal resolution are based on dominantly terrestrial proxy records, representing terrestrial regions Huybers, 2010, 2013;Werner et al., 2018), the entire globe (Mann et al., 2009;Hakim et al., 2016;Tardif et al., 2019;Neukom et al., 2019), or the global marine realm (Singh et al., 2018). The majority of the published CFRs exhibits annual reconstructed values over the last 1e2 millennia, and so the SST field reconstruction of Tierney et al. (2020) differs in comprising only two time slices. One is an average over the Last Glacial Maximum (LGM) time period 23-19 kyr before present (BP), and the other is averaged for the late Holocene period 4-0 kyr BP. In summary, there are no time-resolving marine CFRs for the past millennium based on marine proxy data alone. This is unfortunate because the oceanic impacts on surface air temperature cannot be independently investigated when using the same, terrestrial proxies to reconstruct both land and ocean temperatures. Given the importance of circulation patterns and heat content of the North Atlantic Ocean in regulating atmospheric variability across the Northern Hemisphere, this missing piece of paleoclimate information has a central role in improving the understanding of natural climate variability.
The spatial patterns of North Atlantic sea surface temperature (SST) and atmospheric sea level pressure gradients drive atmospheric circulation that brings about large-scale changes in precipitation and temperature across the Northern Hemisphere, the strength and trajectory of Atlantic storms as well as ocean biological productivity (Goldenberg et al., 2001;Reid et al., 2003;Beugrand and Reid, 2003;Sutton and Hodson, 2005). Long-term shifts in the distribution of heat in the North Atlantic Ocean could also have significant socioeconomic impacts across the Northern Hemisphere, brought about by changes in the strength of the subpolar gyre and/or associated Atlantic meriodional overturning circulation (AMOC).
Index reconstructions exist for the AMOC and the Atlantic multidecadal variability (AMV) (Gray et al., 2004;Mann et al., 2009;Rahmstorf et al., 2015;Wang et al., 2017). These reconstructions are based entirely or dominantly on terrestrial data: Mann et al. (2009) has a proportion of 91% tree-ring records with marine records constituting just 1.6% of the proxy network. However, we note that the more recent multi-method CFR ensemble of surface temperature is improved in this respect with 19% marine records, all annually resolved (Neukom et al., 2019).
SST reconstruction based on terrestrial proxies relies on documented ocean-atmosphere teleconnections, with the advantage of high-precision dating and data abundance. The SST is measured more directly in marine proxy records (Moffa-S anchez et al., 2019). The marine records are considered prominent candidates for stateof-the-art marine CFR in this study. The North Atlantic Ocean including the Arctic is particularly suitable for initial testing purposes since the extratropical region is subject to high sampling density, demonstrated by the Ocean2k network (McGregor et al., 2015). From the total of 57 marine proxy records, 31 of these are sampled in the North Atlantic and the Arctic Ocean. A number of these records exhibit subdecadal resolution due to targeted research efforts on high-accumulation sites. Recent proxy development and improved statistical methods in age-model simulation and CFR contribute to our confidence that the field of northern North Atlantic SST can be skillfully reconstructed (Pyrina et al., 2017;Reynolds et al., 2018), despite the shortcomings of the marine proxy data. Proxy drawbacks include age-uncertainties, poor spatial availability, depth and seasonality biases. Additionally and specifically for marine sediment deposits, the proxy forming process is comparable to low-pass filtering, resulting in generic low temporal resolution.
In principle, all types of marine proxy data are subject to chronological uncertainty, either through the age-depth relationship of marine sediments, or in the form of miscounted bands/layers. Bivalve records are exceptions exhibiting annual bands, but crossdated to achieve absolute dating similar to tree-ring records (Butler et al., 2013;Pyrina et al., 2017;Reynolds et al., 2017Reynolds et al., , 2018. The temporal resolution for deposits is estimated as a function of depth and accumulation rate. This rate may vary down-core, and is best constrained by radiometric dating techniques and tephra layers as absolute age markers. Age-depth models for deposits can be simulated using Bayesian methods such as BACON (Blaauw and Christen, 2011). The "Banded archive modeling" (BAM) statistical method is designed to simulate age model ensembles for annually banded or layered records such as corals (Comboul et al., 2014).
We propose to generate an entirely marine CFR in this study, using marine data for reconstruction and targeting the SST variable. Two state-of-the-art CFR methods are described and tested in detail, evaluating their skill in reconstructing northern North Atlantic SST over the past millennium with annual and decadal resolution. The reconstruction methods to be tested are the Bayesian Algorithm for Reconstructing Climate Anomalies in Space and Time (BARCAST) (Tingley and Huybers, 2010), and the analogue method (Franke et al., 2011). This is the first time that BARCAST applicability is tested for an SST reconstruction. An extended version of the methodology is applied, allowing age-uncertain proxy records following Werner and Tingley (2015). This functionality is critical for handling age uncertainties properly when reconstructing northern North Atlantic SST. By including ensemble chronologies and their errors into the Bayesian reconstruction framework the age-uncertainties are, in theory, reduced compared to other reconstruction methodologies including the analogue method, where age-uncertainties will contribute to increased total uncertainty in the reconstruction (Werner and Tingley, 2015).
Ensembles of annual age-uncertain chronologies are simulated using the BAM technique. The experiment setup is similar to previous BARCAST experiments in terms of age-uncertainties, where the method has proven successful (Werner and Tingley, 2015;Werner et al., 2018). This bridges the gap between terrestrial and marine CFR using BARCAST. It is interesting to test for the pseudoreality whether the method is able to skillfully reconstruct annual SST, even if this high resolution is difficult to achieve for existing marine proxy networks. We have also implemented one set of pseudo proxy experiments producing decadally-resolved reconstructions, and another set combining BARCAST with a clustering technique, similar to Talento et al. (2019). The reconstruction skill was demonstrated to be higher for the annually resolved reconstruction than for decadal resolution in that study, although the BARCAST implementation was simple (no ageuncertainty) and the target variable was precipitation instead of SST.
The temporal resolution of North Atlantic proxy deposits varies from subdecadal to multicentennial (McGregor et al., 2015). The BACON technique would produce more realistic age-depth models for marine sediments than BAM, although with decadal resolution at best.
The manuscript is structured as follows. Sect. 2 describes the experiment design, input data and reconstruction methodologies, including their ensemble chronology extensions. The skill metrics are also outlined. In Sect. 3 we present the reconstruction skill results and the implications of age-uncertainties in marine data. A discussion follows in Sect. 4, before finalizing with concluding remarks in Sect. 5.

Data and methods
CFRs are generated from a limited number of proxy records, and validating such reconstructions is non-trivial because there is no perfect, independent record of the past climate evolution. The process of pseudo-proxy experiments overcomes this shortcoming by testing CFR methodologies using data from general circulation models (GCM) as a pseudo-reality (Mann et al., 2005(Mann et al., , 2007Lee et al., 2008). Pseudo-proxy experiments are used to test the mean and spatial skill of novel CFR methodologies as well as the sensitivity to the observation data network (Smerdon et al., 2011;Smerdon, 2012;Wang et al., 2014;G omez-Navarro et al., 2017).
The idea behind our idealized pseudo-proxy experiment is to extract target SST data from a climate model simulation. The target data is sampled in a spatiotemporal pattern that simulates real proxy and instrumental networks. The target data representing proxies are perturbed with noise to simulate real-world proxy data in a systematic manner, while the pseudo-instrumental data is only weakly perturbed with noise of magnitude typical for the realworld instrumental data errors. The pseudo-proxy and pseudoinstrumental data are used as input to the BARCAST and analogue reconstruction techniques, and the resulting reconstructions are compared with the true simulation target. The reconstruction skill is quantified through statistical metrics for a millennium-long validation interval.

Model data and target
The target variable for our pseudo-proxy experiment is the North Atlantic (57.5 W, 47.5 N to 37.5 E, 82.5 N) summer mean (June, July, August, JJA) SST, originating from one simulation of the Community Earth System Model, Last Millennium Ensemble (CESM-LME, run 001), (Otto-Bliesner et al., 2016). The ensemble simulation covers the time period 850-2005 AD, with transient forcing series based on reconstructed solar insolation , volcanic activity (Gao et al., 2008), greenhouse gases (Schmidt et al., 2011), land-use and land cover (Pongratz et al., 2008;Hurtt et al., 2011), orbital parameters (Berger, 1978) and prescribed stratospheric aerosols.
The original SST variable has spatial resolution of~1 in the northern North Atlantic Ocean, and is interpolated to a 5 by 5 regular grid for reconstruction purposes. This is the same resolution as the real-world HadSST.4.0.0.0 observation data (Kennedy et al., 2019).
Confined oceanic regions may be considered less representative of the larger northern North Atlantic, and are excluded from the reconstruction domain. The excluded regions comprise the Davis Strait, Baffin and Hudson Bay, the Baltic Sea and the Gulf of Bothnia.
In the real-world scenario, the temperature signal within these regions is strongly influenced by sea-ice and/or land processes (Moffa-S anchez et al., 2019;Stramska and Białstrokogrodzka, 2015). The target data is perturbed with white additive noise to simulate all types of errors and uncertainties in the instrumental and proxy data. This particular type of noise has been implemented and tested earlier for both methodologies, with satisfactory results Talento et al., 2019).

Resampling the SST field for the pseudo-instrumental period 1850 -2005 AD
The pseudo-instrumental network is constructed to mimic the characteristics of the HadSST.4.0.0.0 data set (Kennedy et al., 2019). It covers the period 1850e2005 in an uniform 5 by 5 grid. The resampled data is perturbed with additive white noise, resulting in pseudo-instrumental data with signal-to-noise (SNR) ratio of 3 by standard deviation. The noise represents e.g. measurement uncertainties and bias corrections in the real-world data.

Resampling the SST field for the pseudo-proxy period 850-2005 AD
A realistic proxy network is used as the baseline for selecting pseudo-proxy locations for the time period 850-2005 AD (Fig. 1a, Table A1). The North Atlantic proxy network comprises 22 records, with data originating from marine sediment cores or bivalves. Selected records originate from the North Atlantic Ocean2k contribution to the PAGES2k archives, denoted by "Ocn_" site IDs in Table A1 (McGregor et al., 2015;PAGES2k Consortium, 2017). Ocean2k records were excluded if the locations are outside the grid specified in Sect. 2.1, or if the dating information is missing from the age-models of the publications. Additional proxy records are included in the network even if they are not part of the 2015 version of the Ocean2k database (numbered site IDs in Table A1). Specifically, we sought for reconstructions published later than May 2013, which is the cutoff-period for inclusion in the 2k synthesis (McGregor et al., 2015). The spatial availability of the pseudo-proxy network is made artificially idealistic with the proxy locations as the baseline. These adjustments are optional, they may affect the reconstruction quality but are not critical for the functionality of the reconstruction methodologies. First, we opt to consider at most one proxy per grid cell for the pseudo-proxy experiments. Secondly, we reduce the number of grid cells by shifting the proxy locations onto the nearest neighboring instrumental location. The reduction of total number of grid nodes is a manner of easing the pressure on the computational resources for BARCAST. The resulting grid has 82 nodes in total, with 12 of these classified as unique pseudo-proxy locations, illustrated in Fig. 1b. The distance between neighboring pseudo-proxy locations ranges from 213 km to 1190 km.
The pseudo-proxy data are perturbed with additive white noise to simulate real-world proxy noise for the full simulation period 850-2005 AD. Three different levels of proxy-noise are tested, as the true SNR of marine proxies is generally unknown and may vary depending on the archive or proxy type. The noise perturbations are such that the resulting pseudo-proxies have a SNR of 10, 1 and 0.5 by standard deviation, respectively. The strongest noise level SNR ¼ 0.5 is considered the most realistic scenario, similar to the estimates obtained by (Laepple and Huybers 2014) for marine proxy records.

Original BARCAST methodology
BARCAST is a Bayesian hierarchical model with three levels: process, data and prior levels. The evolution of the target variable in space and time is described at the process level. The data level describes the relationship between the target and the instrumental/proxy data, while the statistical information of the priors is incorporated into the last level. The hierarchical structure implies that model parameters are dependent across levels.
The temperature field T is modeled as a multivariate first-order autoregressive model (AR(1)) in time: Where t denotes time in years, the scalar parameter m is the mean of the process, and a is the AR(1) coefficient. The innovations (increments) ε t are assumed to be IID normal draws ε t~N (0, S). The spatial covariance matrix depicting the covariance between locations x i and x j is: The spatial e-folding distance is 1/4. The simplicity of the exponentially decaying covariance structure keeps the computation time low, while the reconstruction skill is satisfactory according to earlier pseudo-proxy studies (Werner et al., 2013;Nilsen et al., 2018;Talento et al., 2019). The BARCAST assumption fits only partly with the behaviour displayed by the SST target (Fig. 2). The correlation is measured for Summer SST between every grid cell in the reconstruction region, but only the median correlations calculated within distance bins separated at 50 km are plotted in the figure. Initially, the correlations decrease and reach zero at distancẽ 2000 km, but they increase again as distances exceed 3000 km. We will test how well BARCAST reconstructs the SST field given these teleconnections, that contrasts to some degree the spatial covariance structures assumed within the BARCAST model Eq. (2).
On the data level, the observation equations for the instrumental and proxy data are: Where e I,t and e P,t are multivariate normal draws $ Nð0; t 2 I IÞ and $ Nð0;t 2 P IÞ. H I,t and H P,t are selection matrices of ones and zeros which at each year select the locations where there are instrumental/ proxy data. b 0 and b 1 are parameters representing the bias and scaling factor of the proxy records relative to the temperatures.
The remaining level is the prior. Weakly informative but proper prior distributions are specified for the scalar parameters and the temperature field for the first year in the analysis. The priors for all parameters except 4 are conditionally conjugate, meaning the prior and the posterior distribution has the same parametric form. The Markov-Chain Monte Carlo (MCMC) algorithm known as the Gibbs sampler (with one Metropolis step) is used for the posterior simulation (Gelman et al., 2003). Technical details of the Bayesian inference of BARCAST is found in Tingley and Huybers (2010).

BARCAST extension to probabilistic constraining ageuncertain proxy records
Age-uncertain proxies are associated with an ensemble of chronologies: fT k ; k ¼ 1;2;…Mg. BARCAST equation (3) is rewritten so that the proxy observations are conditioned on the selected chronology T , and considered a function of location s, rather than the uncertain time point t.
where e s $ N 0t 2 , I constitute the independent normal errors at location s. L T s represents the dependence on the chronology, and replaces H P,t from Eq.
(3). Elements are selected from the true temperature field T s corresponding to the proxy observations W s , depending on the chronology T . T. Nilsen, S. Talento and J.P. Werner Quaternary Science Reviews 265 (2021) 107009 A new conditional posterior is defined to update the probabilities on the members of the ensemble T k , see Werner and Tingley (2015) for technical details.
Overall, the BARCAST model parameters are unchanged, but the inference on the conditional posteriors now relies also on the currently selected chronology. The climate process and scalar parameters are sampled at each iteration step of the MCMC, and a chronology is then selected according to the conditional posterior.
The so-called "waiting time dilemma" is a deficiency of the described BARCAST extension, implying that the MCMC sampler will quickly iterate towards a limited number of the chronologies that are most compatible with the initial climate estimate. Running a Metropolis-Coupled MCMC sampler (MC 3 ) compensates this problem without other complications (Werner and Tingley, 2015, and references therein). The Metropolis coupling allows parallel tempering with several "heated" chains, allowing more rapid exploration of the probability space, coupled with one unheated chain to retain a stationary distribution of the MCMC (Wong and Liang, 1997;Li et al., 2009). Coupling is enabled by stateswapping between neighboring chains, i.e. the current values of all parameters, temperatures and the selected chronology may be swapped between parallel runs at certain intervals.
Finally, note that the scalar response parameters b 0 , b 1 , and t P are replaced with vectors in this version. This is in line with the assumption that individual proxies respond differently to surface temperature across the reconstruction domain.

The analogue method
The analogue method was first introduced for weather forecasting (Lorenz, 1969), but has since then been modified and applied for reconstruction purposes. The method has been successfully applied to reconstruct SST during the Dansgaard-Oeschger events 5e8 (Jensen et al., 2018), surface air temperatures and precipitation during more recent time periods (Franke et al., 2011;Gomez-Navarro et al., 2015;G omez-Navarro et al., 2017;Talento et al., 2019). It relies on the concept of combining proxy data and a pool of instrumental or model information identifying, for each time slice in the reconstruction period, one or several time slices in the instrumental or simulated period that could be considered as a potential analogue. In this manuscript, we use proxy and instrumental data, and denote this basic implementation as the "classic" analogue method. Along this we introduce a variation taking ageuncertainties into consideration. In a real-world application this would result in an analogue reconstruction independent from climate model simulations, in contrast to the version using a model-based pool of analogues.

Classic analogue method
To reconstruct the SST field at time t, the set of predictor variables consists of the individual proxy records at time t.
As a first step, a distance metric between time t and each t i in the instrumental period is defined. We select the distance between t and t i as the Euclidean distance between the vectors of proxy data at times t and t i : where Prox(l j , t) denotes the proxy information at location l j and time t. l 1 , …, l K denote the proxy locations (K ¼ 12).
Second, each time slice of the instrumental period is sorted according to its distance to t. Third, the N closest time-steps of the instrumental period are selected as analogues. Finally, the SST reconstruction at time t is defined as the average of the instrumental SST fields in the N analogues t 1 , …, t N : The number of analogues, N, could range between 1 and the total number of time slices in the instrumental period (156 in the present study). As N grows, there is a tendency for the reconstruction to become more similar to the instrumental-period mean. We select N ¼ 20, as it provides the best reconstruction skill results across the different scenarios tested in this paper.
The analogue method can be used to generate an ensemble reconstruction in the case proxy data exhibit age uncertainties. The method is run repeatedly, each time randomly selecting a realistic set of chronologies for the proxies as described in Sect. 2.6.

Constructing chronology ensembles
Age-uncertainties are included in the pseudo-proxy experiment scheme in order to validate the reconstruction skill for a marine multi-proxy network. We set up one experiment with all pseudoproxy records subject to age-uncertainties similar to layeredcounted proxies. This setup is simplistic rather than realistic: we use the most simple chronology modeling scheme in this initial analysis, which achieves the goal to quantify the reconstruction skill given an age-uncertain data network.
Ensembles of chronologies are generated using the BAM timescale modeling scheme as in Werner et al. (2018) and Comboul et al. (2014). BAM models the timescale counting procedure as a superposition of two cumulative Poisson processes with error rates {Q 1 , Q 2 }. Between the years i and i þ 1, the stochastic processes (P Q1 i ) and ðP Q2 i Þ counts the number of missing years or double-counted years respectively (Comboul et al., 2014). As a conservative estimate, we assume for our pseudo-proxy experiments that annual layers are missing or double counted by the same probability (Q 1 ¼ Q 2 ¼ 0.02). These probabilities are identical to Werner and Tingley (2015) for direct comparison. 2% is also the average probability of miscounting in the lake sediments and ice core proxy records used for BARCAST reconstruction in Werner et al. (2018).
The time increment d i is given by Data and time index i is removed and considered missing from the pseudo-proxy record if d i > 1. On the contrary, data is duplicated to the record at time i if d i < 1. The time increment d i will be exactly one most of the time, and the chronology is unchanged in this case. One might expect an overall zero-offset due to the symmetry of the error rates, but this is a potential pitfall as demonstrated by Comboul et al. (2014) and in our Fig. 3. The mismatch between an ensemble of 1000 age models and the true age is illustrated for one arbitrary pseudo-proxy in this figure. As layer counting starts from the top or the most recent date, the counting errors accumulate back in time. The possible offset for early dates may have profound effects on the resulting reconstruction product.

Metrics for reconstruction skill
The deviations between the reconstruction and the simulation target can be measured and quantified using a range of metrics. Using a proper scoring rule is advantageous in the case of a probabilistic forecast, securing that the maximum reward is given when the true probability distribution is reported (Gneiting and Raftery, 2007). The continuous ranked probability score (CRPS) is a proper scoring rule, and will be used in the following to quantify the reconstruction skill for the BARCAST reconstruction ensemble. The often-used reduction of error (RE) is not a proper scoring rule, and will not be used. RE ¼ 1 implies a deterministic forecast, but the maximum score is obtained when the mean (a point measurement) within the probability distribution P is used instead of the predictive distribution P itself. Additionally and for both reconstruction types, we calculate the Pearson's correlation coefficient and the root-mean-squared error (RMSE) with the true target, in order to have skill measures more suited for comparison.
The concept behind the CRPS is to provide a metric of the distance between the predicted (forecasted) and occurred (observed) cumulative distribution functions of the variable of interest. The lowest possible value for the metric corresponding to a perfect forecast is therefore CRPS ¼ 0. Following Hersbach (2000), the definition of the CRPS can be defined as follows: Where x is the variable of interest, x target denotes target (validation) data, u is the unit step function and P(x) is the cumulative distribution function of the forecast ensemble with a probability density function (PDF) of r(y): CRPS denotes the temporal average CRPS. CRPS is the sum of the average reliability and the average potential CRPS: Where the Reli can be interpreted as the validity of the uncertainty bands, or the mean-squared error of the confidence intervals. The CRPS pot measures the accuracy of the reconstruction, quantifying the spread of the ensemble and the mismatch between the best estimate and the target variable. See Gneiting and Raftery (2007) for details and elaboration on the CRPS.

Results
As mentioned above, pseudo proxy data was generated with three different SNR (10, 1, 0.5), with the last one being the most realistic (albeit still optimistic) experiment. For SNR ¼ 0.5, the experiment was repeated with chronological uncertainties. The reconstructions were generated using annually and decadally resolved input data. For each SNR experiment, the successful BARCAST ensemble of reconstructions consists of~250 members. Similar ensembles are generated for the analogue method when implementing chronological uncertainties. The following analyses are performed on the member-basis, but figures show the ensemble means.

Skill metrics
The skill results are shown for the Summer mean reconstructions in Figs. 4e6. Note that Figs. 5 and 6, depicting RMSE and CRPS, use the same color bar as in the cross correlation analysis Fig. 4, but best skill is achieved where the RMSE/CRPS is low as opposed to best skill for the highest cross correlations.
Significant correlations between 0 and 1 are plotted in color in Fig. 4. A p-value of 0.05 was used for significance testing, and all significant correlations are positive. The correlations between target and reconstruction are highest directly at the pseudo-proxy locations, and decreases with distance. The correlations also decrease systematically with increasing levels of proxy noise. Regions of minimum significance are detected for the longitude bands 50 N and~82 N for pseudo-proxies without chronological errors (Fig. 4aef). BARCAST is unable to produce skillful reconstructions when all proxies exhibit chronological errors (Fig. 4g), while the analogue method reconstructs significant, albeit low correlations for 95% of the grid cells (Fig. 4h). The analogue method outperforms BARCAST for all SNR experiments in terms of area average correlation, visualized by the figure box plots. The correlation skill is stable within the BARCAST reconstruction ensembles, with maximum difference in correlation skill of 0.024. The intraensemble skill variance is highest away from pseudo-proxy locations. The analogue-chronology ensemble has a maximum intraensemble skill difference of 0.31.
Different probabilities for miscounting have been tested for the simulation of chronology ensembles. Changing P1 and P2 from 2% to 5% results in a decreased reconstruction skill for the analogue method, but only minor changes in significance levels. Changing P1 and P2 to 1% is not enough to gain significant skill for BARCAST.
We observe similar skill results for the spatially distributed RMSEs (Fig. 5). Unexpectedly, the skill does not decrease Fig. 3. (a) Trace plots of the age model ensemble generated for one arbitrary pseudoproxy located at 12.5 W, 57.5 N. The mismatch between true age and counted age is shown on the y-axis, as a function of time. The blue, dotted lines constrains the 95% confidence range of the age models. The trace plot plotted in green (cyan) represents the single age model which is closest (most distant) from the true age, respectively. (b) Time series for the same record plotted in blue. The shaded grey area denotes the 95% confidence range for the age-uncertainties.
T. Nilsen, S. Talento and J.P. Werner Quaternary Science Reviews 265 (2021) 107009 monotonically for lower SNR, seen for Fig. 5c,e. The analogue method performs better than BARCAST for all SNRs considering the area average skill metric. The intra-ensemble RMSE metrics vary by < 0.035. The BARCAST scores of correlations and RMSEs for SNR ¼ 0.5 are inferior to those obtained in Werner et al. (2013), where the reconstruction target was terrestrial surface air temperature.
Lastly, CRPS pot and Reli are shown in Fig. 6. All CRPS scores are limited to <1 degree K, with the area average CRPS pot ranging from 0.45 K to 0.8 K (Fig. 6a, c, e, f). The area average Reli is better than 0.4 K for all SNR experiments (Fig. 6b, d, f, h).

Alternative experiment setups
We dedicated specific experiments to implement (a) decadal resolution of all input data and the reconstruction, and (b) an initial clustering step to the data before applying the reconstruction methodologies. The motivation for these implementations is the potential of increased reconstruction skill, and/or reduced computation time for BARCAST.
All input data and the chronology ensembles are generated on the annual scale before averaging to decadal resolution. Chronologies were simulated using BAM, as for the annual experiments. The analogue reconstructions are successful using this setup, but BARCAST fails to produce meaningful posterior parameter estimates. Correlation skill results are therefore only shown for the analogue method (Fig. 7), while BARCAST is deemed unsuitable for this particular configuration of decadal SST data. Selected grid cells in the decadal analogue reconstructions are significantly negatively correlated with the target (Fig. 7). In general, the box plots show that the area-average skill is lower than for annual data, and correlations across the region have a larger spread. The number of grid cells without significant correlation is higher.
The clustering separates the geographical domain of the northern North Atlantic into C clusters, based on similarities in the SST regimes during the instrumental period. The distance between two geographical points is defined as one minus the correlation between the instrumental SST time series at these locations. Then, the analogue method or BARCAST is applied inside each of the C clusters, independently. Finally, the reconstructions obtained within each cluster are merged together, producing a reconstruction covering the whole geographical domain. The results show no improvement in skill when clustering is implemented to the reconstruction schemes. In fact, the BARCAST skill is deteriorated, and we therefore regard the clustering approach as redundant for the SST reconstructions in this study. In summary, neither the decadal resolution nor the initial clustering step offers advantages for either method in this pseudo proxy setting. The remaining results therefore concern annual Summer mean only.

BARCAST posterior parameters
The mean values for the posterior distributions of the BARCAST model parameters are listed in Table 1. The SNR of the BARCAST reconstruction is estimated from the posterior parameters: The reconstructed SNR is not monotonically decreasing: SNR rec ¼ 7.2, 1.8, 2.18 and 0.01. Together with the overestimated SNR rec ¼ 2.18 we also find an inflated estimate of the instrumental data error variance t 2 I ¼0.43 and interpret this as BARCAST incorrectly ascribing measurement errors to the SST predictors for the instrumental period.
The estimated spatial e-folding distance 1/4 ranges from T. Nilsen, S. Talento and J.P. Werner Quaternary Science Reviews 265 (2021) 107009 2115 km for SNR ¼ 10, to 8141 km for SNR ¼ 0.5. Ideally, the spatial correlation structure should not be influenced so strongly from the proxy noise. The teleconnections observed for the input spatial network are likely contributing to this instability.
3.2.1. The AR(1) parameter a The posterior mean AR(1) coefficient a ranges between 0.08 and 0.12 for the SNR experiments. a can be interpreted as a measure of temporal persistence, where a~0 is similar to a white noise stochastic process. Hence, BARCAST reconstructs the North Atlantic SSTs as very weakly persistent in time. The parameter estimate is unexpected because the ocean has a much larger heat capacity than the atmosphere, and responds more slowly to external forcing. The expected a estimate should be similar, or even higher than that of terrestrial BARCAST reconstructions, where a~0.4 (Tingley and Huybers, 2010;, and a~0.8 . Searching to explain the lack of temporal persistence, we compare the BARCAST posterior a estimates with estimates for the real-world HadSST data, for the simulated model data, and those for the analogue reconstruction. The AR(1) parameter is estimated using the least-squares method and JJA mean SST data. For the instrumental HadSST data set, the subpolar and Arctic regions are missing more data points than the temperate region during the earlier part of the record (figure not shown). The area average a is estimated to 0.32 when data from the full time period 1850e2018 is included (Fig. 8a). The estimate increases to 0.48 when only the past 50 years of data are used for estimation, with better temporal coverage (Fig. 8b). The AR(1) coefficients for the CESM-LME simulation data are estimated on the grid-cell level using the full simulation period 850-2005 AD. This period is most realistic when comparing our BARCAST posterior estimate, although it makes the comparison with HadSST less clear. Spatial resolution of 1 by 1 or 5 by 5 gives nearly identical estimates of the AR(1) coefficients. The area average CESM model run AR(1) coefficient is estimated to 0.20 (Fig. 8c). As a validation exercise, we perform the same analysis for one individual paleoclimate simulation of the MPI-ESM-P (past-1000 r1, Jungclaus et al. (2014)) (Fig. 8d). The area average a is 0.26 for that run, hence the simulated SST for both model runs exhibit less temporal persistence than estimated for the existing landbased BARCAST reconstructions. At the same time, the model SST   T. Nilsen, S. Talento and J.P. Werner Quaternary Science Reviews 265 (2021) 107009 data exhibit a higher value of a than the resulting BARCAST pseudoreconstruction ensembles. The area average AR(1) coefficient of the analogue reconstructions vary between 0.11 and 0.21 for the different SNR experiments (figure not shown).

Suitability of the BARCAST AR(1) assumption for CESM SST data
It is possible that the BARCAST AR(1) model assumption itself is improper for the North Atlantic CESM target SST data, and that violation of the BARCAST assumptions causes the underestimation of the posterior AR(1) coefficient a. The AR(1) process is preferred for BARCAST calculations since it facilitates a reconstruction problem without supreme complexity. Previous studies show that a long-range memory (LRM) stochastic model such as the fractional Gaussian noise (fGn) is more consistent with local instrumental SST observations and SSTs from control-run simulations over this region than an AR(1) model (Fredriksen and Rypdal, 2016;Løvsletten and Rypdal, 2016).
An LRM stochastic process exhibits an autocorrelation function (ACF) and a power spectral density (PSD) of a power-law form: C(t) t bÀ1 , and S(f)~f Àb respectively. The spectral exponent b determines the strength of the persistence. Persistence properties of the target data are analyzed in the spectral domain using the periodogram as the estimator. Hypothesis testing is used to identify the most suitable statistical model similar to Nilsen et al. (2018). See Appendix B for details on how the periodogram is estimated and on the hypothesis testing.
The hypothesis testing provides important information regarding the temporal persistence of the target CESM SST data on the 5 by 5 grid-cell level. The persistence properties vary across the region, and there is no single preferred statistical model for all grid cells. In fact, the results cover all possible outcomes approximately equally: data points consistent with the AR(1) hypothesis only, data points consistent with the fGn hypothesis only, data points consistent with both null hypotheses, and data points inconsistent with both null hypothesis. The AR(1) hypothesis is most appropriate in the region illustrated by red color in Fig. 8c.
Weak LRM is identified if the fGn model is preferred, b 0.2.
Results for one arbitrary grid cell off the Norwegian coast are shown for illustrative purposes in Fig. 9. The AR(1) coefficient a and the LRM parameter b are estimated. These parameters are used to generate Monte Carlo ensembles of AR(1)/fGn processes, and spectra for the 95% confidence ranges of the two processes are plotted together with the CESM target power spectrum (Fig. 9b and  c). In this particular location, the CESM power spectrum is consistent with the AR(1) process spectra (Fig. 9b), and not with the fGn process spectra (Fig. 9c). The two processes exhibit spectral differences on all frequencies, visible when plotted together (Fig. 9d).
As a final remark on the results presented in this study, we note that the skill metric results and posterior parameter estimates are not automatically transferable to other ocean regions than the northern North Atlantic. For instance, the temporal covariability of the tropical Pacific SST are strongly influenced by the El Nino Southern Oscillation (ENSO), in addition to the year-to-year variability. Previous results show that the tropical Pacific SSTs are consistent with an autoregressive process of order 1 (AR(1)) nullmodel (Løvsletten and Rypdal, 2016). In contrast, our results demonstrate only around 25% of the North Atlantic SSTs are consistent with the AR(1) null-model. The tropical Pacific therefore complies better with the BARCAST model assumption. When other conditions are equal, the reconstruction skill may hypothetically be higher for the Pacific region than the North Atlantic.

Discussion
This study lays the groundwork for SST field reconstruction covering the northern North Atlantic region, using a marine proxy network subject to age uncertainties. The main objectives have been to (1) test the reconstruction skill of BARCAST and the analogue reconstruction methods using pseudo-proxy experiments, in order to determine whether one or both of the methodologies are suitable for SST field reconstruction. Objective (2) was testing how age-uncertainties in the marine records influence the reconstruction skill.
We find that both CFR methodologies generate skillful annually resolved reconstructions when the input data is perfectly dated (Figs. 4-6a-f). The reconstruction skill decreases away from proxy locations, and decreases with increased levels of proxy noise. The intra-ensemble skill variance is highest for the chronology ensemble version of the analogue method, confirming that repeated runs with random age-model selection actually inflate the reconstruction uncertainties as described in Werner and Tingley (2015). Overall, the analogue method performs better than BAR-CAST considering both the correlation and the RMSE skill metrics (Figs. 4 and 5). The superiority of the analogue method is likely linked to the flexibility in the modeled proxy-temperature relationship, as the method does not assume a stationary, linear relationship between proxies and the SST signal. Experimenting with decadal resolution and spatial clustering does not improve the reconstruction skill. BARCAST fails to reconstruct decadal temperatures, and deteriorates the skill in the case of clustering. The negative results are in line with Talento et al. (2019), yet we highlight the importance of performing these exercises because the spatiotemporal properties of SSTs differ from air temperatures (Fredriksen and Rypdal, 2016).
As for all CFR products, the reconstructions in our study would benefit from a higher-density proxy network. However, high- quality records are preferred for supplementing the network. We obtain significant skill for most of our experiments using this network (Figs. 4e6), so the spatial information in the observation data is considered satisfactory for our purpose.
In addition to the skill results, the posterior estimates of BAR-CAST parameters also provide indications of the reconstruction quality. The reconstructed temporal covariance is low compared with the target data, the ensemble mean posterior AR(1) parameter a 0.12. The reason behind this is unclear, but the spatial teleconnections in the target data could influence the temporal structure due to the coupled spatiotemporal model formulation of BARCAST.
Objective 2 is elaborated because BARCAST is unable to reconstruct skillful annual fields of SST when all proxies are subject to chronological errors in form of miscounted layers (Fig. 4g). This is discouraging in the context of real-world SST reconstruction, since chronological errors occur in the majority of the available marine proxy records. The earlier successful BARCAST reconstructions that implemented chronological uncertainties had the advantages of being applied exclusively to a terrestrial data network and reconstruction domain, in addition to either an evenly sampled domain (Werner and Tingley, 2015), or combined with a dominant proportion of absolutely dated proxies (31 out of 44 proxy records absolutely dated in Werner et al. (2018)). While our marine proxy network has none of these advantages, the three bivalve mollusc records stand out as absolutely dated (site IDs 4, 16 and 17, Table A1). The proxy development for climate reconstruction is relatively new for bivalves, and has been rapidly progressing over the past decade (Butler et al., 2019). Independent of the CFR methodology, prospects are good for a future higher-skill marine reconstruction under the condition that a predominantly absolutely dated proxy network is applied. Improved annual reconstruction skill will be difficult to achieve using a modified chronology modeling scheme such as BACON instead of BAM, although this modification would imply more realistic age model ensembles. Further work using BACON for agedepth modeling could house potential for marine CFR exhibiting decadal temporal resolution. There are other measures easily available to improve the annual reconstruction skill for the analogue method. The analogue implementation can be modified to construct the pool of analogues from climate model simulations instead of instrumental data. An optimal model pool comprises independent simulations originating from different climate models. This strategy avoids potential influence on the reconstruction skill from model bias. The reconstruction skill is expected to increase with a larger analogue pool (Bothe and Zorita, 2021), with the cost that the model-dependent analogue reconstruction cannot be used for traditional proxy-model comparison studies (PAGES 2k-PMIP3 group, 2015;Moffa-S anchez et al., 2019). The size of the analogue pool may differ by as much as two orders of magnitudes for the two options. The analogue method was implemented successfully for land temperatures over Europe in Bothe and Zorita (2020), facilitating slightly less than 10 000 model analogues. In comparison, there is only one observed instrumental data set covering the period back to 1850. Using the analogue reconstruction method it is possible to rank individual model simulations by their ability to replicate the climate signal observed in a known proxy network. This is a useful model evaluation tool, applicable when the analogue pool consists of climate model time slices. One strategy for model evaluation is outlined in G omez-Navarro et al. (2017), using an analogue pool of 16 simulations from seven Earth System Models and the surface temperature proxy archive of PAGES2k Consortium (2013). For every year to reconstruct, the study finds that the closest analogue time slices were selected from each of the simulations at some point, but two simulations were selected more often than the rest, and two others were selected less frequently. The target simulation is selected as one of several ensemble members from a single model, and the remaining ensemble simulations are not selected more often than simulations from other models.
Finally, extending the analogue method to search for analogues in the space of empirical orthogonal functions (EOF) is another promising option to potentially increase reconstruction skill in the North Atlantic Ocean (G omez-Navarro et al., 2017).

Conclusions
Pseudo-proxy experiments are helpful tools for ranking the reconstruction skills of the BARCAST and analogue reconstruction methodologies in this study. The northern North Atlantic marine proxy network is subject to age-uncertainties and proxy noise, and our analyses demonstrate that the analogue method is superior to BARCAST for SST reconstruction under these premises. A clear deficiency is that observed teleconnections in the climate model simulation SSTs are unaccounted for in the spatiotemporal structure of the BARCAST model assumptions.
Our results suggest that a future annually resolved SST reconstruction based on the BARCAST methodology should rely on a marine proxy network with dominance of absolutely dated records, and preferably uniform sampling in space. A smaller reconstruction region is also recommended without strong teleconnections. The analogue method is computationally cheap to run, allowing users to quickly test different modifications and identifying the optimal implementation for their specific use. The implementation for northern North Atlantic SST reconstruction will benefit from exploring a larger analogue pool based on climate model simulations.
In this study we bridge the gap between terrestrial and marine CFR for the common era. The research topic benefits from exploring a range of advanced statistical tools, the analytics are expected to become wider used in CFR applications after our systematic demonstrations. The experiment results contribute to constraining uncertainties of natural ocean variability influencing climate across the globe. Further CFR comparison with climate model runs also provides an important test bed for understanding multidecadal to centennial climate variability and the climate sensitivity to external forcing, while providing an extended context for anthropogenic warming prior to the instrumental era.

Data availability
The BARCAST code package is available at https://bitbucket.org/ jopewerner/barcast/commits/tag/Nilsen_2021. Codes and input data to the analogue method are available at https://bitbucket.org/ NilsenT/amcfr/src/master/. All proxy records are previously published, references are listed in Table A1. CESM LME model simulation data can be found at https://www.cesm.ucar.edu/projects/ community-projects/LME/data-sets.html. Data for the MPI-ESM past-1000 r1 simulation is available at https://esgf-data.dkrz.de/ projects/esgf-dkrz/. Other data and codes are available upon request from the corresponding author.
Author statement for manuscript nr JQSR-D-21-00136 All authors have contributed to the completion of the revised manuscript, and approve the final submitted version: Tine Nilsen (TN), Stefanie Talento (ST), and Johannes Werner (JW). TN and JW designed the study. TN has performed the majority of analyses, produced all figures and tables, and wrote the manuscript draft. ST contributed with source code and analyses for the analogue method. JW has provided BARCAST source code and infrastructure for public access to the code in GitHuB. All authors have worked together to revise the manuscript text.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix B. Spectral analysis of temporal persistence and hypothesis testing
The periodogram is defined here in terms of the discrete Fourier transform H m as (Malamud and Turcotte, 1999): For evenly sampled time series x 1 , x 2 , ….x N . The sampling time is an arbitrary time unit, and the frequency is measured in cycles per time unit: f m ¼ m N . Df ¼ 1 N is the frequency resolution and the smallest frequency which can be represented in the spectrum.
Log-binned power spectra are visualized in logelog plots (Fig. 9) since the spectral exponent b can be estimated by a simple linear fit to the spectrum. Log binning of the periodogram is used for analytical purposes, this representation weights all frequencies equally with respect to their contributions to the total variance. The first null hypothesis tested is that the target data set can be described using an AR(1) process at all frequencies, with the parameter a estimated from the CESM SST data on the grid-cell level. For testing we generate a Monte Carlo ensemble of AR (1) series with a value of the a parameter identical to the target data.
The power spectrum of each ensemble member is estimated, and the confidence range for the theoretical spectrum is then calculated using the 2.5 and 97.5 quantiles of the log-binned periodograms of the Monte Carlo ensemble. The second null hypothesis is that the target data set can be described using the fractional Gaussian noise (fGn) stochastic process at all frequencies, with scaling parameter b estimated from the power spectrum of the target data at grid-cell level. The respective null hypothesis is rejected if the log-binned spectrum of the target data is outside of the confidence range for the AR(1) or fGn model at any point.