BayClump: Bayesian Calibration and Temperature Reconstructions for Clumped Isotope Thermometry

,

• Bayesian and ordinary least squares linear models recover regression parameters and reconstruct temperatures with the highest accuracy, and Bayesian regression with the highest precision.
• BayClump is a Shiny dashboard with web-interface and standalone versions, that facilitates the use of Bayesian models which are data intensive and non-Bayesian models for calibration and reconstruction by the broader community.
!" has been found to scale linearly with 1/T 2 across temperature ranges of 0-100 o C, leading to the use of linear regression models to calibrate the temperature dependence of this proxy, and for the estimation of clumped isotope-derived temperatures (Ghosh et al., 2006;Eiler 2007).Most prior clumped isotope calibration studies have relied on either ordinary least squares linear (Ghosh et al., 2006) or error-in-variables regression models (e.g., Deming;Tripati et al., 2010;e.g., York;Kelson et al., 2017) for inferring model parameters that relate  !" and 10 # / $ values (i.e., regression slope and intercept).The ordinary least squares linear and York regression models mostly differ in how they treat uncertainty in measured  !" and 10 # / $ .Each has their own advantages and limitations.For instance, ordinary least squares linear are already implemented in many statistical packages and a commonplace in the field.However, a clear limitation of ordinary least squares linear models is the inability to account for errors in 10 # / $ from the modeling framework, even though error is intrinsic to both clumped isotope and temperature measurements used for deriving calibrations.Furthermore, the magnitude of uncertainty in  !" and 10 # / $ varies for different calibration datasets (e.g., depending on material, instrumentation used, standardization, knowledge of temperature for environment samples are from) and ordinary least squares linear regressions treat all of these equally when different datasets are combined.In contrast, the York and Deming regression models account for error in both variables (e.g., Tripati et al., 2010;Peral et al., 2018;Meinicke et al. 2020;Anderson et al., 2021).Nevertheless, the performance of these two later models is still to be tested under simulated conditions that are relevant to the field.
To date and to our knowledge, no study has critically and comparatively evaluated the performance of error-in-variable regression models on the accuracy and precision of clumped isotope temperature calibrations.Similarly, although Bayesian frameworks have been used for other temperature proxies including TEX86 and Mg/Ca and have provided a more robust method for estimating uncertainties in tracer-based estimates of temperature (Tingley and Huybers, 2010;Tierney andTingley, 2014, 2015;Khider et al., 2015;Tierney et al., 2019;Crampton-Flood et al., 2020;Martinez-Sosa et al., 2021), no study has utilized these frameworks for the calibration of clumped isotopes, or for reconstructing temperatures using  !" .Thus, it remains unclear whether accounting for uncertainties in both variables actually improves the reliability of inferred regression parameters and reconstructed temperatures using  !" , and how error-invariable models compare to Bayesian methods.
In this study, we extend the classic regression approach for calibrating the clumped isotopes paleothermometer into a Bayesian framework, and compare Bayesian and non-Bayesian regression models utilizing synthetic datasets.We focus on answering whether Bayesian models and error-in-variable models outperform models that ignore uncertainty in  !" .Our main goal is to discuss relative model performance between newly developed and existing models commonly used for calibrating the clumped isotope paleothermometer.In this study, we do not intend to provide a general equation for analyzing clumped isotopes datasets.Instead, we provide a critical overview on the methods that are used in the field.Given the increasing availability of clumped isotope data, the Bayesian models developed in this study should help to pave the way for a unified calibration equation for the clumped isotopes paleothermometer.
Inspired by BAYSPAR, a web-interface for Bayesian models for the TEX86 temperature proxy (Tierney and Tingley, 2014), we developed BayClump, a shiny dashboard created in R that provides community-wide access to the Bayesian and non-Bayesian models for the clumped isotope proxy from this study.We derive calibration regression parameters using a published synthesis of calibration data (Petersen et al., 2019;Anderson et al. 2021;Sun et al. 2021).Overall, this work allows us to demonstrate the conceptual and practical advantages of using Bayesian models for inferring model parameters and deriving reliable reconstructions for clumped isotopes, as it has been outlined before in other temperature proxies (Tierney and Tingley 2015).

General modeling framework
We examine the performance of Bayesian and non-Bayesian linear models primarily using synthetic datasets (but see section 3.4 for analyses using real-world data).Tables 1, 2, S1, S2 show the range of uncertainties in  !" , T, and 10 # / $ from existing calibration datasets, respectively.We use these distributions to define "low", "intermediate", and "high" uncertainties in each of the variables.Thus, the analyzed synthetic datasets, assuming the linear relationship between  !" and 10 # / $ , follow different levels of error in  !" and 10 # / $ .
Note that although the general practice in the field is to predict 10 # / $ from  !" values to reconstruct temperature using a regression model defined from a temperature calibration dataset, our approach relied on a "forward" modeling where regression model parameters are estimated by using  !" as the response variable.This forward approach is consistent with  !" being a response to temperature, as opposed to the cause.Using synthetic datasets (with low, intermediate, or high uncertainties in  !" and 10 # / $ ), we utilize different models to estimate regression parameters.Specifically, we compare the parameters inferred from each statistical model with the true parameters used to simulate the synthetic datasets.This approach allows us to assess whether different models (ordinary and weighted least squares linear, York, Deming, and Bayesian models) yield accurate and precise values for the slope and intercept.Finally, we utilize the inferred regression parameters and their uncertainties from each model to reconstruct temperatures for specific target  !" values of 0.600‰, 0.700‰, and 0.800‰ that correspond to temperatures that are low (~10 o C), moderate (~19 o C), and high (~60 o C).We account for different values of uncertainty in the analyzed target  !" (low, intermediate, and high).

Regression models
We fit seven types of regression models to the synthetic clumped isotope- !" calibration datasets (Fig. 1).Four models are non-Bayesian regressions and three are Bayesian models.Note that we use Bayesian linear models as a way to propagate uncertainty in regression parameters.Additionally, depending on the Bayesian model, we also propagate uncertainty in the measurements of both 10 # / $ and  !" .Model performance for proxy calibration is in this section assessed with 10 # / $ as the independent variable and  !" as the response variable.

Non-Bayesian linear regression models
Ordinary least squares: We first fit an ordinary least squares linear regression model.This regression model is the simplest model used in this study and assumes no errors in 10 # / $ (the independent variable in the regression).We fit the ordinary least squares linear regression model using the lm function in the stats R package version 4.1.0(R Core Team, 2021) under default parameters.The approach implemented in the lm function in R minimizes the sum of squared error (i.e., sum over the squared of residuals in  !" ) in the relationship between 10 # / $ and  !" .
Weighted least squares: Second, we fit an ordinary least squares linear model with observations being weighted based on the inverse of their squared uncertainty.In this model, observations with larger residual values (estimated using ordinary least squares linear models) have less importance in estimating the error of alternative proposed lines during the least square optimization of the model.Although this approach accounts for variable uncertainty in  !" , the weighted least squares model still does not account for uncertainties in 10 # / $ .The weighted least squares regression was fit using the lm function in the stats R package version 4.1.0(R Core Team, 2021).The weights argument is set to the inverse of the squared residuals for the observations.
Deming: Third, we fit a Deming regression using the deming R package version 1.4 (Therneau, 2018).In this study, the Deming regression model is the simplest model that explicitly accounts for measurement error in both  !" and 10 # / $ .With the Deming regression, the ratio of the variance in  !" and 10 # / $ (calculated in the deming R package using jackknifing-based uncertainties on 10 # / $ and  !" ) is assigned to be constant over all data points (Martin, 2000).To fit this model, we specify values for  !" and 10 # / $ , along with the corresponding inverse of the squared standard error for each of the observations of temperature and  !" .The Deming model also aims to minimize the sum of squared residuals, where the residuals are a function of the inferred errors in both variables and the specified variance ratio (Deming, 1943).
York: Fourth, we analyzed a York model using the york function in the IsoplotR R package version 3.4 (Vermeesch, 2018).This approach is based on the same ideas that underlie the Deming regression model, specifically accounting for errors in both  !" and 10 # / $ .However, under the York model, the ratio of the weights in  !" and 10 # / $ varies across data points instead of being constant for the whole dataset as in the Deming regression (Martin 2000).Note that the weights are based on the correlation between errors in variables.We specify observations in  !" and 10 # / $ , along with the corresponding standard error (transformed to the inverse of the squared error within IsoplotR's york function) for each observation, when fitting York models.

Bayesian linear regression models
Bayesian linear: Fifth, we fit a Bayesian linear regression, the simplest Bayesian model fit in the study, and is equivalent to the ordinary least squares linear regression model presented above.For this regression, instead of parameter estimates being derived based on ordinary least squares optimization, regression parameters are estimated under a Bayesian framework (see below).Under a Bayesian approach, we use information from prior studies and newly generated clumped isotope data (synthetic datasets) to update the relevant regression parameters (e.g., slope and intercept) that are used in the calibration and reconstruction steps.Below, we present the mathematical definition of this model: 10 #  $ %  ∼ (0.231,0.065)  ∼ (0.039,0.004)  = 1  $  ∼ (0, 100)  = 1, . . . .,  Priors for  and  follow previous publications (Table S3; references therein).We use diffuse priors (priors that make weak assumptions about the model) on the precision parameter  $ .We also present results that utilize diffuse priors for  and  by selecting the same mean parameter value outlined above but with three times a wider standard deviation.
Bayesian linear with errors: Sixth, we fit a linear regression model that accounted for uncertainties in both 10 # / $ and  !" .The Bayesian linear model with error in variables is defined as the Bayesian linear model outlined above except for the following terms that account for measurement error in both variables: .% .Therefore, the observed response is  !" !, with errors  * "# (see also Daëron, 2021).The T explanatory variable is the temperature in the form Note that the prior used for corresponds to the mean value of temperature within the environmental range and a standard deviation reflecting high temperature uncertainty (Table 1).
Bayesian linear mixed: Seventh, we fit a Bayesian linear mixed model that accounts for error in both variables (Hilbe et al. 2007).This model is different from the above "linear with errors" in that it assumes that different calibration materials can potentially have distinguishable differences in the relationship between  !" and 10 # / $ .Note that for a single material, this regression model should behave similarly to the previous Bayesian regression model.We use the Bayesian linear mixed model to examine whether a relatively more complex model potentially assuming multiple materials under a single material dataset can still perform similarly to models that intrinsically assume equivalent material behavior.The utility of this model will be used in upcoming papers for assessing if there is evidence for material-specificity in real-world datasets.Below, we present the mathematical definition of this model.Except in the following aspects that allow for material-specific regression parameters, this model is equivalent to the Bayesian linear with errors presented above: 10 #  $ %  0 ∼ (0.231,0.065)  0 ∼ (0.039,0.004)  ∈ {1,2, . . .}, where j is an indicator of the type of material Therefore, this model allows for material-specific regression parameters.Material identities are indicated under alternative .

Implementation of Bayesian regression models
All three Bayesian regression models are fit using the jags function in R2jags version 0.6-1 in R version 4.02 under JAGS version 4.3.0(Plummer 2003).Posterior distributions on parameter estimates are based on 20,000 iterations (three chains), with 50% of samples discarded as burn-in.We used informative priors for the slope and intercept in alternative analyses that are presented in the supplement to this article.We use the same seed in R for all the analyses conducted in this study (set.seed()set to 3 in R).All the code and datasets used in this study are available on GitHub (https://github.com/Tripati-Lab/BayesPaper;https://github.com/Tripati-Lab/BayClump).BayClump can be accessed at the following link: https://bayclump.tripatilab.epss.ucla.edu/.
Fig. 1.Conceptual representation of the seven regression models used in this study for the derivation of   -temperature calibrations.We compared the performance of a total of seven regression models in deriving calibration relationships for use in carbonate clumped isotope thermometry.Parameter estimates, or slopes and intercepts of  !" -temperature calibrations derived for each of these models, were optimized through minimizing squared residuals (non-Bayesian models) or maximizing a likelihood function (Bayesian models).A subset of the classical least squares and Bayesian models account for error-in-variables (i.e., the regression parameters calculated factor in uncertainties in both  !" and 10 # / $ ).We also developed a Bayesian model that can potentially account for differences in parameter estimates between materials or other types of sample groups.This model is equivalent in complexity to a Bayesian simple linear regression with errors when the number of materials is one.

Clumped isotope temperature proxy calibration: model performance based on parameter estimates on the synthetic datasets
Comparisons of model performance for parameter estimates were based on three synthetic datasets of simulated values of 10 # / $ and  !" .We examine the performance of regression models by simulating errors in 10 # / $ and  !" .We account for three sources of uncertainty in the 10 # / $ - !" relationship: replication error in  !" across labs, instrument noise in  !" , and errors in 10 # / $ .These three sources of uncertainty closely reflect the main three sources of uncertainty in real clumped isotope datasets, and span a realistic set of values reported across labs and  !" -10 # / $ relationships for different materials (Table 1, 2, S1, S2).For each source of error, we model different scenarios with low, intermediate, and high levels of error.

Replication error in 𝛥 !" across labs (𝜎 5 parameter)
Reproducibility error across labs can be caused by multiple factors, including the drift in instruments over time, choice of reference frame, and noise in the sample preparation for each replicate, among others.Therefore, this source of error is intrinsic to any clumped isotope dataset. !" errors describing reproducibility across labs (see the  5 parameter below), we follow the distribution of reported  !" errors in the Petersen et al. (2019) compilation.We estimate typical reproducibility of  !" by examining the distribution of reproducibility for each lab that participated in the Intercarb interlaboratory exercise (Supplemental Table 1 in Bernasconi et al., 2021; see Section 2.3.2 below).In our lab, long-term reproducibility of better than 0.02‰ is typical, and that for recent instrumentation, better than 0.01‰ is also routinely feasible with sufficiently large numbers of standards being run.We assign reproducibility errors of 0.0125‰, 0.0225‰, and 0.0275‰ corresponding to low, intermediate, and high error scenarios, respectively.2.3.2Instrument noise in  !" across labs ( 6 parameter) Instrument noise can be related to the stability of beams over a 1 to 2-hour period, or even the length of time that is used to integrate on a single gas.For error in  !" caused by instrument noise, we use 0.0025‰, 0.0075‰, and 0.0125‰ for low, intermediate, and high error scenarios, respectively (Bernasconi et al. 2021).This error describes stability of the instrument across the ~1 to 2-hour analysis time for each replicate.

Errors in 10 # /𝑇 $
For measurement error in temperatures (10 # / $ ) used for proxy calibration, we define our levels of error by examining the typical uncertainties reported for different types of carbonates used in published studies that have compiled different calibration data (Petersen et al., 2019; Table 1).Low error in formation temperature is defined using reported values for synthetic carbonates (e.g., Ghosh et al., 2006;Tripati et al., 2015;Bonifacie et al., 2017), which have the most well constrained temperatures due to precipitating in controlled environments.Our estimates for high uncertainty for  !" error across labs reflects either naturally occurring terrestrial carbonates with larger variability in precipitation temperature, such as lacustrine samples (Huntington et al., 2010;Li et al., 2021;Wang et al., 2021) or naturally-occurring dolomites (Winkelstern et al., 2016;Came et al., 2017).Estimates for intermediate error fall in between those for low and high (e.g., foraminifera; Tripati et al., 2010;Meinicke et al., 2020; some naturally forming carbonates with less seasonal variability such as marine mollusks and brachiopods (Eagle et al., 2013;Henkes et al., 2013).We use 0.25°C, 2°C, 5°C as the low, intermediate, and high error scenarios for T when prescribing 10 # / $ , respectively.

2.3.4
Integrating sources of error into the simulated  !" and 10 # / $ datasets We analyze model performance for an "all-low error scenario", with low values of error in measurement error in  !" errors and measurement error in 10 # / $ .Next, we examine model performance in an "all-intermediate error scenario", with intermediate values of error in  !" and measurement error in 10 # / $ .Finally, we use an "all-high error scenario" with high error in  !" and measurement error in 10 # / $ .For simplicity, we refer to each of these as low-, intermediate-, and high-error scenarios for proxy calibration.Analyses in the main text primarily focus on three simplified end-member error scenarios (Data Set S1-S3).
For each error scenario, we simulate a total of 1,000  !" and 10 # / $ observations assuming a true value for the slope of 0.0369 and intercept of 0.268.These values were chosen because they represent the mean in the range of values from previous calibrations across different materials (see Table S3 and references therein).We first generated a total of 1,000 observations of 10 # / $ with a normal distribution under the following parameters (informed using Table S2): ∼ (12.03,2.5) These observations are treated as the true 10 # / $ values (range of true 10 # / $ in our dataset is between 5-19).Next, we simulate random error in 10 # / $ using on a normal distribution with mean 0 and standard error following a given 10 # / $ error scenario (; values of 0.019, 0.155, 0.070 for low-, intermediate-, and high-errors based on Table 1): The observed values of 10 # / $ result from the addition of and . % ()''7') .Next, we simulate  !" values based on a given true slope, intercept, true 10 # / $ , and random error in  !" under a given error scenario ( 6 ; values=0.0125‰,0.0225‰, and 0.0275‰) based on Bernasconi et al. (2021;related to  $ in model 6): Finally, we account for measurement error in  !" values representing replication error based on the error  5 as estimated in Petersen et al. (2019; related to  * "# !$ in model 6; values used in the simulation: 0.0025‰, 0.0075‰, 0.0125‰): Thus,  !" ! 768)'9): values are calculated as the addition of each initial  !" !value to a corresponding  !" ! )''7' .

Fitting regression models on the simulated 'clumped isotope' datasets
We examine whether each of the models correctly recover the true slope and intercept for the calibration.For non-Bayesian models, we examine the distribution of slopes and intercepts using 1,000 replicates per model per error scenario.For Bayesian models, we run a total of 20,000 iterations (see details in Section 2.2.3).We analyze datasets with 10, 50, and 500 observations randomly sampled from the original 1,000 data point calibration dataset.Datasets with n=50 generally reflect the size of calibration datasets for individual materials in published clumped isotope calibration studies (e.g., Tripati et al., 2010;Petersen et al., 2019;Anderson et al., 2021).
2.5 Inverting the forward model to predict 10 6 /T 2 from  !" : Temperature reconstructions for unknowns In addition to examining whether models accurately and precisely recover regression parameters (Section 2.4), we examine model performance during the temperature reconstruction phase.Paleotemperatures can be inferred by applying a regression model to sample  !" values.To evaluate model performance for temperature reconstructions, we apply the estimated regression parameters to three  !" values (0.600‰, 0.700‰, and 0.800‰) with several scenarios for replicate measurement errors in  !" (i.e., 0.005‰, 0.010‰, and 0.020‰).Our goal is to show whether reconstructions based on these carbonates under-or over-estimate true temperature under each of the examined regression models and scenarios of error.We reconstruct temperatures and their uncertainties following the usual practice in the field: where,  (℃) is the reconstructed temperature in degrees Celsius, ,  (℃) is the error in reconstructed temperature (also in degrees Celsius),  !" ! is the analyzed target  !" !, S !" !T the uncertainty in the analyzed target  !" !, and both  and  the intercept and slope, both estimated during the calibration step.For non-Bayesian models, inversions were conducted using the invest function in the investr R package (Greenwell and Kabban, 2014).For Bayesian models, temperature reconstructions were conducted using Jags in two major steps (with uncertainty propagated across steps).First, we estimated the point estimate of temperature as follows: − 273.15 where,  !" ) is the analyzed target  !" ,  and , the intercept and slope (both corresponding to samples of the post-burnin distribution in the calibration step of the analyses),  is the precision (also based on the post-burnin distribution in the calibration phase of the analyses), is the reconstructed temperature (in °K), with priors set to reflect an environmental temperature range (mean) and high uncertainty (standard deviation; Table 1).Finally, the reconstructed point estimate of temperature in degrees Celsius is summarized in  ; .
Note that this Bayesian reconstruction model follows the same structure as the three Bayesian calibration models presented before in Section 2.2.2.However, in the reconstruction section (Section 2.2.2), we focus on k target  !" , instead of i  !" from the calibration set.Similarly, values for , , and  are derived from the posterior distribution estimates in the calibration step (Section 2.2.2) and not calculated here during the reconstruction phase of the analyses.We randomly sample 500 observations from the post-burnin posterior distribution for each of these parameters based on the calibration step.The ,  for Bayesian Mixed models are specific to the material that is utilized as an archive for reconstructing temperature and are matched to calibration-derived values.
Second, we estimated the uncertainty in reconstructed temperatures within a Bayesian framework by using a uniform prior for the true error in  !" ) .This prior on the temperature error was based on the uncertainty of the target  !" ) (( !" ) ), which is here assumed to be 0.005‰, 0.1‰, 0.2‰: Here, all the parameters follow the same structure outlined above for the point estimates of temperature reconstructions under Bayesian models.However, we note that instead of focusing on  !" ) , we used  !" ) * for reconstructions. !" ) * is here used to estimate a second point estimate of temperature ( ; * ), which is subtracted from the first point estimate of temperature ( ; ) to calculate the error in reconstructed temperature ( )''7'; ; ).Finally,  ; summarize the estimated temperature (in °C) with propagated uncertainty.

Results and Discussion
3.1 Model performance in calibration datasets with 50 data points Most of the existing calibration datasets have 20-50 data points (e.g., Tripati et al., 2010;Petersen et al., 2019;Anderson et al., 2021).Therefore, we focus on results based on examining model performance on datasets with 50 observations.We found differences in the performance of classical and Bayesian models with our synthetic datasets for each error scenario considered.
Our results suggest that while all the examined models are able to correctly recover true regression parameters, regressions differ in their accuracy and precision during the calibration stage (Figs.2, 3).In general, Bayesian and non-Bayesian simple linear models generally outperform other methods for inferring regression parameters with the highest accuracy and precision.Precision and accuracy in parameter estimate under Bayesian and non-Bayesian linear models are not strongly affected by the analyzed error scenario.Deming and York regression models are consistently the least accurate and precise of the models.These models show the best performance when uncertainty in the analyzed calibration scenario is intermediate, again in datasets with n ~ 50 datapoints.Our results suggest that York regressions generally underestimate the intercept and overestimate the slope more strongly than any other model examined in this study.Conversely, Deming regressions generally overestimate the intercept and underestimate the slope.

Patterns of accuracy and precision between models within scenarios of error
Under a low-error scenario, accuracy and precision is relatively high for all the analyzed models (Figs. 2, 3).All the models slightly underestimate the true intercept.The slope is slightly underestimated by Bayesian models and slightly overestimated by non-Bayesian models.Overall, weighted and unweighted linear regression models show the best accuracy among the examined models and Bayesian models show the best precision.Specifically, York regressions underestimate the intercept by 2.7% and overestimate the slope by 1.5%.Deming regressions underestimate the intercept by 1.2% and overestimate the slope by 0.25%.Weighted and unweighted linear regression models underestimate the intercept by 1.3-1.6% and overestimate the slope by 0.7-0.87%.Finally, Bayesian models recover intercept values that are 1.3-1.4% smaller than true intercept and estimate slope values 0.15-0.2%smaller than the true slope.In terms of precision, Bayesian regressions recover at least 1.7 times less parameter uncertainty than any of the non-Bayesian models analyzed in this study.
Under an intermediate-error scenario, Bayesian models overestimate the intercept and underestimate the slope.Conversely, non-Bayesian models underestimate the intercept and overestimate the slope.Weighted and unweighted linear models show the best accuracy and Bayesian regressions the best precision.York regressions underestimate the true intercept by 7% and overestimate the slope by 4%.Deming regressions underestimate the intercept by ~4% and overestimate the slope by 1.4%.Weighted and unweighted linear models underestimate the intercept by 2.6% and overestimate the slope by 1.4%.Finally, Bayesian models overestimate the intercept by ~4% and underestimate the slope by 3%.We also note that Bayesian models recover parameter estimates with at least 1.6 times less uncertainty than any of the alternatives.Results are for seven models (panels), and different datasets (colors within panels) indicating alternative error scenarios, reflecting a low-error scenario (measurement error in  !" = 0.0025‰, instrument error  !" = 0.0125‰, measurement error in 10 # / $ = 0.25 °C), an intermediate-error scenario (measurement error in  !" = 0.0075‰, instrument error  !" = 0.0225‰, measurement error in 10 # / $ = 2 °C), and a high-error scenario (measurement error in  !" = 0.0125‰, instrument error  !" = 0.0275‰, measurement error in 10 # / $ = 5 °C).Also shown are the true slope = 0.0369 and intercept = 0.268 (red vertical lines).In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Conclusions on model performance with synthetic datasets with 50 data points
In general, our analyses suggest that for calibration purposes (assuming 50-datapoint datasets) and across all error scenarios examined in this study, Bayesian and non-Bayesian linear models perform the best in terms of accuracy and precision.Non-Bayesian linear models generally outperform other models in terms of accuracy and Bayesian in terms of accuracy.We note that York and Deming regressions perform similarly across error scenarios, with these models over-or under-estimating the slope and intercept more strongly than any of the other models examined in this study.Therefore, our results suggest that Bayesian and non-Bayesian simple linear regression models should be used for calibrating the 'clumped isotopes' paleothermometer instead of York and Deming regression models.
We acknowledge that our results on the performance of the York model in the calibration part of our study might be unexpected for some readers.To our knowledge, only two studies have examined the performance of York regressions relative to any other model, within any context (clumped isotopes or otherwise).First, Wu and Yu (2018) compared the performance of multiple regression models, including York and ordinary least squares linear regressions.Using synthetic data with errors in both variables, these authors concluded that parameter estimates under York were less biased than those estimated under simple linear models.Their approach also involved examining the distribution parameter estimates under a given set of independent "runs" of each model.However, critical details on how the characteristics of each of these "runs" are missing from the study.For instance, it is unclear whether each of these "runs" was conducted on the complete dataset or a smaller set of observations.If analyses were run on subsampled datasets, the size of each smaller datasets is not provided.Similarly, we note that Wu and Yu (2018) concluded that ordinary least squares models tend to fail to recover true parameters when the mean Y to X ratio is larger than 1.Therefore, our results suggesting that OLS models outperform York regression potentially reflect the fact that the mean Y to X ratio in clumped isotopes datasets is consistently well below 1.Second, results presented in Höhener and Imfeld (2021) show similar patterns to the ones reported in our study.For instance, Höhener and Imfeld (2021) indicate that while ordinary least squares linear models produce narrower error estimates, York regressions recover true regression parameters with a larger error.Nevertheless, we highlight that results in Höhener and Imfeld (2021) are also not generalizable to our study.For instance, the simulated datasets analyzed in Höhener and Imfeld (2021) assumed no error in variables.Finally, we note that York et al. ( 2004) does not provide a direct comparison of the performance of the York regression relative to other models.

Small and large calibration datasets: 𝛥 !" -T calibration: Model performance with synthetic datasets
In addition to examining model performance on calibration datasets with intermediate sample size (n=50; Figs. 2, 3), we compare precision and accuracy between models in calibration datasets with a smaller (i.e., n=10 data points) and larger sample size (i.e., n=500).
Within our 10-datapoint datasets, Bayesian models show the worst performance across all the examined models regardless of the analyzed error scenario.For instance, under an intermediate error scenario, while the intercept was underestimated by ~20% in all the Bayesian models, non-Bayesian models underestimated the same parameter by only 2-7%.Among non-Bayesian models, Deming, simple, and weighted linear regressions show a similar performance, greater than the one for the York regression.Deming regressions outperform the simple and weighted linear models when error in the calibration dataset is high.Therefore, we recommend using (1) simple or weighted linear models when uncertainty in the calibration dataset is low, (2) Deming, simple, or weighted models when uncertainty is intermediate, and (3) a Deming model when uncertainty is high.In general, we recommend avoiding the use of Bayesian regression models when datasets are small (~10 observations).
Under 500-datapoint datasets, Bayesian regressions outperform any of the other models examined in this study in terms of both accuracy and precision.These results are independent of the analyzed scenario of error.For instance, under an intermediate scenario of errors, Bayesian models underestimate the true intercept by 0.5-0.7% and overestimate the slope by 0.3-0.5%.Non-Bayesian models underestimate the intercept by 2-5% and overestimate the slope by 1-3%.Therefore, Bayesian regressions are consistently a better choice for improving parameter estimation when the calibration dataset is large (n~500 datapoints).Therefore, as the size of datasets increase, the implementation of our Bayesian regressions will certainly help to improve the overall performance of reconstructions based on clumped isotope data.

Fig. 4. Patterns of parameter uncertainty across regression models based on calibration datasets of variable size.
Results are shown for an intermediate-error scenario (measurement error in  !" = 0.0075‰, instrument error  !" = 0.0225‰, measurement error in 10 # / $ = 2 °C).Patterns are largely similar to those for low-and high-error scenarios.We present mean parameter estimates for regression models (black line), the associated 95% CI (shaded blue region), and true model (red line).We fit regression models based on datasets with 10, 50, and 500 data points (columns) and using seven regression models (indicated by the column headings).As described in the text, for large calibration datasets (n > 50), Bayesian models typically yield the most accurate and precise regression parameters, while for n < 50, OLS is the best performing model.In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

General recommendations for selecting models to use for calibration in clumped isotope studies
We use the results presented in sections 3.1 and 3.2 to generalize major patterns of model performance between calibration datasets of different sizes and provide recommendations.Our analyses suggest that when the calibration dataset is small (n=10), Deming, weighted and unweighted OLS show the best accuracy.In datasets with intermediate sample size (n=50), weighted and unweighted OLS models recover the best accuracy.In large datasets (n=500), Bayesian regressions show the best performance in terms of precision and accuracy.Our analyses highlight the relative downsides of using York regressions in datasets of different sample size and with variable level of error.Instead, analyses conducted under OLS models (small and intermediate sample size) and Bayesian models (large sample size) should be prioritized for providing more reliable parameter estimates than other approaches.Due to the increasing trend in the number of observations in calibration datasets, our Bayesian framework provides a natural pathway to appropriately analyze large syntheses of clumped isotope data.
Finally, we also suggest that significant improvements on the performance of Bayesian models could be achieved in small datasets (e.g., 10 datapoints) by using maximum a posteriori estimation of regression parameters based on models fit in a larger number of replicates for the calibration dataset.In this study, parameter estimates under Bayesian models were based on fitting each model only once in each of the analyzed datasets.A more computationally intensive approach, fitting each model multiple times (e.g., 100-1,000) in each dataset and summarizing parameter estimates across the same number of posterior distributions, could have led to a better performance of Bayesian models in small datasets.
3.4 Inverting the forward model to predict 10 6 /T 2 from  !" : Temperature reconstructions for unknowns We evaluate model performance for temperature reconstructions.We use regression parameters derived from calibration datasets with a total of 50 observations.The corresponding results, summarized in the sections presented below, are also included in the supplement.We present results in the main text that reflects patterns associated with an intermediate level of error for  !" (0.01‰).
Note that patterns of accuracy across models outlined in this section are expected to generally reflect the overall patterns of model performance during the calibration step (see Section 3.1).Given that Bayesian models were fit a single time in each of the datasets, performance for these models in this section likely reflect only an example of parameter estimates that are possible for 50-datapoint datasets.Due to the sample size dependency in accuracy that was outlined in Section 3.2, slightly different results can be obtained if reconstructions are conducted using an alternative version of the 50-datapoint dataset (see distributions in Fig. 2).
3.4.1 Reconstructions for low-temperature carbonates ( !" = 0.8‰; T ~ 10 o C) Fig. 5 shows that for low-temperature carbonates, non-Bayesian models generally recover true temperature with better accuracy than Bayesian models.Under a low-error scenario, Deming, weighted and unweighted OLS overestimates true temperature by 0.5%, York by ~2%, and Bayesian models between 0.4% (no errors) and 1% (Bayesian linear mixed model).Under an intermediate-error scenario, Deming regression overestimates true temperature by 0.34%, weighted and unweighted OLS by 0.67%, York models by 4.5%, Bayesian simple linear models by 0.7%, and the Bayesian linear mixed model underestimates true temperature by 0.38%.Finally, under a high-error scenario, while non-Bayesian models overestimate temperature between 0.19% (weighted, unweighted OLS, and Deming regressions) and 8% (York model), Bayesian models underestimate temperature by 20%.

Reconstructions for intermediate-temperature carbonates (𝛥
Fig. 6 shows that for intermediate-temperature carbonates, Bayesian reconstructions only outperform any other approach when the error associated with the calibration dataset is low.However, non-Bayesian models are more appropriate when the error in the calibration dataset is intermediate or high.Under a low-error scenario, Bayesian models overestimate true temperature by 0.4-0.6% and non-Bayesian models underestimate temperature by ~1%.Under an intermediate error scenario, Bayesian models underestimate true temperature by ~6% and non-Bayesian models by ~2%.Finally, under a high-error scenario, Bayesian models underestimate true temperature by 7-8% and non-Bayesian models by 2-3%.In terms of precision, Bayesian reconstructions exhibit ~1.7 times less uncertainty in temperature relative to any of the alternative non-Bayesian models examined in this study.

Recommendations
We make a series of recommendations for calibration datasets with n=50 or more observations.For low-temperature carbonates ( !" ~0.8‰), non-Bayesian linear models reconstruct temperatures with the highest precision (e.g., Deming and weighted/unweighted OLS) and Bayesian models with the highest accuracy.For intermediate-temperature carbonates ( !" ~ 0.7‰), Bayesian models outperform any other models when uncertainty in the calibration dataset is low.However, non-Bayesian models (e.g., York, Deming, weighted/unweighted) may be preferred when uncertainty in the calibration set is intermediate or larger.For high-temperature carbonates ( !" ~ 0.6‰), Bayesian models reconstruct temperatures more accurately when the uncertainty in the calibration dataset is at either extreme (small or high).We recommend using non-Bayesian regression models (with the exception of York regressions) regression models when uncertainty in the calibration set is intermediate.

BayClump: A Shiny app for paleothermometry using 'clumped isotopes'
To support the use of Bayesian models and the analytical framework developed in this study for clumped isotope calibration and for temperature reconstructions, and to facilitate comparisons of Bayesian and classical models, we present a self-contained R Shiny Dashboard application, BayClump (Fig. 8).BayClump fits both classical and Bayesian linear regressions to calibration datasets and performs temperature reconstructions.It uses most of the models described in this study in a graphical user interface (GUI) environment, without the need for expertise in R or programming of any kind.BayClump is open source and analyses are highly reproducible.BayClump is available as a web application at https://bayclump.tripatilab.epss.ucla.edu,as a local application for users familiar with R and RStudio (https://github.com/Tripati-Lab/BayClump),or as a standalone Electron desktop application which requires no additional software (the Electron application will be made available through Zenodo upon acceptance for publication), as freeware with the only requirements being citation of this study, and including an appropriate statement if the software or calibrations are modified.2021) (Model 2), which will be updated as new calibration studies are published.Regression models can be developed using existing datasets or users can upload their own calibration data to work with by using a template available within the app.Users can also combine new calibration data with posted datasets to create a larger calibration set if desired.Any data that a user works with is not made available to others.
Based on current best practices (Bernasconi et al., 2021;Upadhyay et al., 2021), both Model 1 and Model 2 calibration sets are provided in the Intercarb Carbon Dioxide Equilibrium Scale (I-CDES).For this, we used an AFF to adjust values (i.e., 0.088 from Petersen et al. (2019)) so they are projected to the same acid digestion temperature (I-CDES is anchored to values that assume a reaction T of 90°C, not 25°C).Users may project their data into the I-CDES90 reference frame prior to adding into the template and uploading, for compatibility with default datasets, or they can exclusively use their own calibration data in any reference frame.Calibration models may be selected independently of one another, and options are available to scale data if needed.BayClump provides the ability for the user to download full calibration regression model output and any of the associated calibration regression model plots.
We also provide a GUI in BayClump for reconstructing temperature using both Bayesian and non-Bayesian models.A separate template in comma separated value (.csv) format is provided where users can add a table of sample Δ47 values and the combined error from measurement and standardization, and then download calculated temperatures in an Excel file.Currently, users can implement the Bayesian linear regression model with errors, utilize a Bayesian framework for estimating temperature that intrinsically accounts for both uncertainty in parameter estimates and error in target Δ47.
Alternatively, BayClump users can transfer over a distribution of Bayesian or non-Bayesian regression parameters derived from their own datasets from the Calibration tab to use for temperature reconstruction in the Reconstruction tab, either within or outside of a Bayesian framework.For non-Bayesian temperature estimates, reconstructed values of temperature will be shown for each of the selected models (in the calibration tab) when (1) parameter error is ignored in temperature reconstructions, or (2) when parameter error is accounted for in the reconstruction step.Summary information is provided for each selected regression model upon run completion.B): Plots of calibration data and each selected regression model are provided in the 'Calibration Plots' tab.Plots are fully interactive and downloadable.C) The application automatically transfers calibration models and data to the 'Reconstructions' tab.Users upload their own  !" data for temperature reconstructions using the provided template.Summary information is provided at run completion (output shown above is truncated).Both 'Calibration' and 'Reconstruction' tabs provide buttons to download full model output in tabbed and labeled Excel spreadsheets.

Reanalysis of the Petersen et al. (2019) dataset using BayClump
In addition to providing a general summary of model performance using synthetic datasets, we utilize BayClump to estimate regression parameters for a published synthesis of calibration data that contained results for 451 samples measured in several different laboratories on the Carbon Dioxide Equilibrium Scale reference frame (Petersen et al. 2019).In that study, the authors provided estimates of the slope and intercept using a Monte Carlo least squares regression based on 10,000 replicates (Table 3).Here, we analyze the same dataset using all seven regression models analyzed in this study (Fig. 9).We perform a total of 1,000 replicates of each non-Bayesian model implemented in BayClump.Bayesian models were analyzed using 50,000 iterations (50% burnin).Note that given our main focus was on re-examining the main equation in Petersen et al. (2019), we do not fit material-specific regression models (e.g., using B-LMM).The authors of the same study extensively discuss material specificity in the same article.Similarly, we focus on deriving a calibration for the full temperature range of the dataset (also largely the main goal of their study).Fig. 9 compares regression coefficients and their associated uncertainties.Except for York models, Bayesian and non-Bayesian linear models differed from the slope estimated in Petersen et al. (2019) by less than 1.5%.York recovered a slope estimate that was ~17% larger than the one in Petersen et al. (2019;slope in York=0.0397 vs Petersen 0.0383).In terms of the intercept, Deming and York models recovered parameter estimates that were between 5% (Deming) and 26% smaller than the ones in Petersen et al. (2019;intercept Deming=0.245 vs York=0.191 vs Petersen = 0.258).Based on our analysis of synthetic datasets, the divergence of York and Deming models from the rest of the regression models is expected.These two models generally show a poor performance and under-or over-estimate calibration regression parameters relative to Bayesian and OLS models.for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Reanalysis of the Anderson et al. (2021) dataset using BayClump
Using BayClump, we also reanalyze the Anderson et al. (2021) dataset (Fig. 10) using the same methodology as described in this study in Section 3.5.1.Anderson et al. (2021) provided parameter estimates based on York regression models.Instead of using a single version of the full dataset as in Petersen et al. (2019), we use two versions of the Anderson et al. (2021) calibration dataset.First, we fit regressions to the whole dataset, over the full range of temperature values.Second, we perform analyses that include only samples in the environmental temperature range (0-35 o C; Table 3).Specifically, given that the simulated datasets closely reflect an environmental temperature range, we subsampled the Anderson et al. (2021) dataset for being able to use our results in the synthetic component of this paper as a reference.Note that patterns of association between 10 6 /T 2 from  !" in a limited temperature range (~0-35 o C) we examined in Anderson et al. (2019) but not in Petersen et al. (2019).Similarly, we examined material-specific regression parameters (e.g., B-LMM) given that this aspect was not extensively discussed in Anderson et al. (2021).
Note that the uncertainty in the full and subsampled Anderson et al. (2021) dataset with a mean temperature error ~0.5 °C and mean error in Δ47=0.012‰(0.040‰ in the subsampled), roughly corresponds to the error structure of our low-error scenario.When using the entire Anderson et al. (2021) dataset, our parameter estimates for the based on the York model fit in BayClump differ by -0.5% (slope) and 1.1% (intercept) from parameter estimates in Anderson et al. (2021;Table 3).Our Bayesian simple linear models recover estimates of the slope that are -0.5-1%different than the one in Anderson et al. (2021) and intercept values that are 0.04-1.27%larger than the published ones.The rest of the models differed from parameter estimates in Anderson et al. (2021) by less than 3%.The most important difference is not actually between models but related to the analyzed partition of the dataset.Relative to the original Anderson et al. (2021) calibration for their full dataset, all of our models yield different regression parameters (15-29% smaller slope estimates and 27-55% larger intercept values relative to the published estimates) when the reduced Anderson et al. (2021) dataset, that includes only samples in the environmental temperature range, is used.Finally, we note that our Bayesian linear mixed models did detect differences in parameter estimates (full Anderson dataset: mostly related to the slope; ATM Anderson dataset: related to both the slope and intercept; Table 3) between calcite and dolomite samples in Anderson et al. (2021).2021) for samples in an environmental temperature range (0-35 o C; bottom row).Note that Bayesian linear mixed models estimate material-specific regression models (two lines and associated uncertainties in plot).However, parameter estimates for each material under might have suffered from further partitioning the already small dataset in Anderson et al. (2021).In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Reanalysis of the Sun et al. (2021) dataset using BayClump
We recalculate temperatures using data from Sun et al. (2021) for Late Miocene and recent carbonates from the Shuitangba hominoids site in Yunnan, China with BayClump (Fig. 11).This study represents a relevant study case where the main focus is directly related to addressing whether temperatures between two sites differed at different times (~0 and 6.2 Ma).For this reanalysis, we take regression parameters for the different models from Table 3 that are calculated using the three datasets described above (dataset 1 -the Petersen et al. (2019) calibration dataset from section 3.5.1;dataset 2 -the full sample set from the Anderson et al. (2021) calibration from section 3.5.2;dataset 3 -a restricted subset of the Anderson et al. (2021) calibration that uses their samples from environmental temperatures (0-35 °C) from section 3.5.2).Next, we perform temperature reconstructions that account for both regression parameter uncertainty (Table 3) and error in Δ47 (Sun et al., 2021;Table 4).Note that the uncertainty in Petersen et al. (2019) (mean temperature error=1.49°C, and mean error in Δ47=0.010‰),roughly corresponds to the error structure of our intermediate-error scenario (measurement error in Δ47= 0.0075‰, error in true Δ47= 0.0225‰, measurement error in temperature=2 °C).The mean Δ47 in Sun et al. (2021) is similar to our intermediate-temperature carbonates scenario (~0.7‰), and most of their samples have an intermediate level of error (~0.01‰).Given our results based on synthetic datasets (section 3.4.2;Table 3), we expect that Bayesian models should be the most accurate and precise among the examined models.We also compare reconstructed temperatures to the results reported in Sun et al. (2021) that utilized published calibrations (Henkes et al., 2013;Tripati et al., 2015).
Overall, our mean estimates of temperature based on Bayesian regression models suggest that there are not major differences in reconstructed temperatures between sites (Table 3).Temperature estimates are, however, very similar to those reported by Sun et al. (2021) based on calibrations from Tripati et al. (2015) and Henkes et al. (2013) (see also Table 4).Our Bayesian framework generally yields temperature values with much smaller uncertainties, depending on the analyzed dataset.
In general, our results based on the utilization of Bayesian and non-Bayesian regression models and a Bayesian framework for reconstructing temperatures support one of the main conclusions in Sun et al. (2021).Specifically, the authors and our study suggest similar temperatures in Shuitangba during the Late Miocene and the present-day temperatures in the Fuxian Lake area (both sites, ~17 o C).Results in Sun et al. (2021) noted that uncertainties in reconstructed temperatures were larger than the mean temperature difference between areas.Our Bayesian reconstructions were able to shrink uncertainty around median temperature values while also concluding that temperatures are similar between sites.2021)'s dataset reduced to environmental samples (bottom row).In each of these datasets, we fit seven regression models and reconstructed temperatures for the two sited in Sun et al. (2021).In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Conclusions
We examine the performance of seven regression models for calibrating the clumped isotope thermometer, including the first Bayesian implementations of regression models.We implement a Bayesian linear mixed model that can accommodate differences in regression parameters between groups, so that a range of materials can have different slopes and/or intercepts.Using simulated calibration datasets with variable number of observations (from 10 to 500 samples) and degrees of error in clumped isotope measurements and temperature, we find that Bayesian and non-Bayesian ordinary least squares linear models consistently outperform other regression models in terms of accuracy and precision under most synthetic scenarios when reconstructing true regression parameters.The performance of Bayesian linear models strongly improves when the number of observations in the calibration dataset exceeds 50 data points.
We also utilized different frameworks for reconstructing temperatures and found differences in temperature reconstruction performance between regression models.In general, Bayesian reconstructions were more precise and accurate than non-Bayesian reconstructions when error in the examined Δ47 was small (<0.01‰).Non-Bayesian reconstructions using Bayesian model-derived regression parameters were generally more robust than other approaches, and accurately recovered temperatures in a range of scenarios.Based on our analyses, we summarized the models that showed the best performance during the calibration and reconstruction phase.A Bayesian regression model when applied to published calibration syntheses yields reduced uncertainties.Some of the differences between temperature reconstructions based on published syntheses may originate from material-specific patterns in the calibration datasets, which should be investigated in future work.
The analytical tools developed in this study are available in BayClump, a Shiny dashboard with data templates that facilitates the use of Bayesian methods for both calibration and temperature reconstruction.Application to published clumped isotope data from Late Miocene and recent samples from the Shuitangba hominoids site in Fuxian Lake (Yunnan, China) show the potential of BayClump as a tool for resolving relatively small temperature changes with confidence in paleoclimatology and support the published conclusions.We expect the tools developed in this study to provide a basis for robustly choosing to use particular regression models for temperature calibration and reconstruction in clumped isotopes, and to reduce the uncertainty of temperature reconstructions.

Table captions
Table 1.Distribution of measurement error in temperature that was used to design the synthetic datasets for this study.We provide examples of the materials that correspond to each category.For example, calibration datasets for synthetic carbonates grown at known temperatures, or benthic foraminifera from intermediate and deep-ocean sites, often have very well-constrained temperatures with errors of less than 0.5 °C (e.g., Ghosh et al., 2006;Tripati et al., 2010).Levels of error were defined based on the distribution of typical uncertainties reported for calibration temperatures in a recent synthesis of calibration data (Petersen et al., 2019).Temperatures in degree °C are transformed into 10 # / $ , with T in °K, for calibration purposes (after Ghosh et al., 2006).
Table 2. Distribution of measurement error in  !" used to design our synthetic datasets.Measurement error in  !" was estimated using the distribution of reported  !" errors in a synthesis of calibration data (Petersen et al., 2019).We present distributions for natural, synthetic, calcite, and aragonite samples.Distribution across sample types are under the "all materials" heading.Petersen et al. (2019) relative to our new estimates based on the same dataset.Results in Petersen et al. (2019) are based on a Monte Carlo sampling (10,000 replicates) for synthetic carbonate samples only (n = 451 replicates).We use a total of 10,000 replicates for estimating each regression parameter under each of the regression models implemented in BayClump.For our newly fit regressions, we report the mean and SE for each parameter.Tripati et al. (2015) and Henkes et al. (2013) to values obtained using BayClump for different calibration datasets and the York, OLS, and Bayesian regression models.We use results presented in Table 2 from Sun et al. (2021), with data provided in their Table S4.Temperature uncertainties are in SE and 95% CIs.

Table 3. Comparison of regression parameters estimated in
Under a high-error scenario, patterns of model performance are similar to an intermediate-error scenario.York regressions underestimate the intercept by 8% and overestimate the slope by 5%.Deming regressions underestimate the intercept by 4% and overestimate the slope by 1%.Weighted and unweighted linear models underestimate the intercept by ~2.6% and overestimate the slope by ~1.5%.Bayesian linear models overestimate the intercept by ~3-4% and underestimate the slope by 3-4%.Parameter uncertainty is >1.4 times smaller in Bayesian models relative to any of the other models.

Fig. 2 .
Fig. 2. Performance of different statistical models for deriving regression parameters for clumped isotope temperature calibrations, evaluated using a synthetic dataset with different levels of uncertainty in both   and T. We show the distribution of regression parameters (A-intercept; B-slope) based on re-sampling of the calibration dataset.Results are for seven models (panels), and different datasets (colors within panels) indicating alternative error scenarios, reflecting a low-error scenario (measurement error in  !" = 0.0025‰, instrument error  !" = 0.0125‰, measurement error in 10 # / $ = 0.25 °C), an intermediate-error scenario (measurement error in  !" = 0.0075‰, instrument error  !" = 0.0225‰, measurement error in 10 # / $ = 2 °C), and a high-error scenario (measurement error in  !" = 0.0125‰, instrument error  !" = 0.0275‰, measurement error in 10 # / $ = 5 °C).Also shown are the true slope = 0.0369 and intercept = 0.268 (red vertical lines).In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Fig. 3 .
Fig. 3. Prior and posterior distributions for Bayesian analyses on the slope and intercept based on analyses run using informed and diffuse priors on regression parameters.Bayesian analyses yield posterior distributions that are robust.Results are for a low-error scenario (measurement error in  !" = 0.0025‰, instrument error  !" = 0.0125‰, measurement error in 10 # / $ = 0.25 °C).The true slope = 0.0369 and intercept = 0.268, and the vertical dashed black line in each panel indicates the true parameter value.Results shown for datasets with 50 calibration samples.We show results for analyses run using the Bayesian simple linear model with errors, including both informed and diffuse priors on regression parameters.In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "P-I" for Prior -Informative, and "P-D" Prior-Diffuse.

Fig. 8 .
Fig. 8. BayClump application screenshots for proxy calibration and for temperature reconstruction.A) Calibration options are chosen in the first tab of the application.Multiple preloaded datasets are included, or the user may opt to upload their own calibration data.

Fig. 9 .
Fig. 9. Reanalysis of the Petersen et al. (2019) dataset using BayClump.We present mean parameter estimates for each of the seven regression models implemented in BayClump (blue line), the associated 95% CI (shaded orange region), and the regression reported by Petersen et al. (2009) (red dashed line; 95% CI in grey).In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM"for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Fig. 10 .
Fig. 10.Reanalysis of the Anderson et al. (2021) dataset synthesis using BayClump.We present mean parameter estimates for each of the seven regression models implemented in BayClump (blue line), the associated 95% CI (shaded orange region), and the published syntheses (red dashed line; 95% CI in grey).Results are shown for Anderson et al. (2021) synthesis for all samples (top row), and Anderson et al. (2021) for samples in an environmental temperature range (0-35 o C; bottom row).Note that Bayesian linear mixed models estimate material-specific regression models (two lines and associated uncertainties in plot).However, parameter estimates for each material under might have suffered from further partitioning the already small dataset inAnderson et al. (2021).In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Fig. 11 .
Fig. 11.Temperature reconstructions presented in Sun et al. (2021) for a hominoid locality in Yunnan, China compared to those derived using BayClump.We present mean and associated error for reconstructed temperatures fromSun et al. (2021) based on calibrations byHenkes et al. (2013; "H-2013")  andTripati et al. (2015; "T-2015").Note that "H-2013" and "T-2015" are the same across rows in the figure (dotted lines).These reconstructions are used in each panel as a reference to the new calibrations performed in BayClump (solid lines).Our analyses in BayClump (i.e., solid lines) are shown for calibrations performed on the Petersen et al. (2019; top row), Anderson et al. (2021)'s full calibration dataset (second row), and Anderson et al. (2021)'s dataset reduced to environmental samples (bottom row).In each of these datasets, we fit seven regression models and reconstructed temperatures for the two sited inSun et al. (2021).In the figure, "B-SL" stands for Bayesian simple linear model without errors, "B-SL-E" for Bayesian simple linear model with errors, "B-LMM" for Bayesian linear mixed model, "OLS" for ordinary least squares regression, "W-OLS" for weighted ordinary least squares regression, "D" for Deming regression, and "Y" for York model.

Table 4 .
Comparison of Miocene and Recent temperatures reconstructed by Sun et al. (2021) using published calibrations from