Advances in understanding how stratigraphic structure impacts inferences of phenotypic evolution

A fundamental question in evolutionary biology and paleobiology is how quickly populations and/or species evolve and under what circumstances. Because the fossil record affords us the most direct view of how species lineages have changed in the past, considerable effort has gone into developing methodological approaches for assessing rates of evolution as well as what has been termed “mode of evolution” which generally describes pattern of evolution, for example whether the morphological change captured in fossil time series are best characterized as static, punctuated, or trending (e.g., Sheets and Mitchell, 2001; Hunt, 2006, 2008; Voje et al., 2018). The rock record from which these samples are taken, however, is incomplete, due to spatially and temporally heterogenous sediment deposition and erosion. The resulting structure of the stratigraphic record may confound the direct application of evolutionary models to fossil time series, most of which come from single localities. This pressing issue is tackled in a new study entitled “Identification of the mode of evolution in incomplete carbonate successions” (Hohmann et al., 2024). The “carbonate successions”

Thank you for submitting the revised manuscript.Your response resolved many reviewer questions, but I decided to solicit one more review to comment specifically on the outstanding issue of model performance.
Note that the reviewer was very positive about the study overall, not least of which its novelty, but also made some important clarifying statements about the modeling framework that might impact your thinking about the results and/or inspire some additional analyses.I invite you to consider these and look forward to seeing the revised manuscript.

Reviewed by Gene Hunt , 11 May 2024
This manuscript uses forward sedimentological modeling to generate synthetic stratigraphic sections, simulates trait evolution in the stratigraphic and time domains, and then fits evolutionary models to assess the degree to which stratigraphic architecture affects evolutionary interpretation.Surprisingly, they find that the geological effects are minor, but that a common way to analyze models of trait evolution seems to perform poorly, regardless of the geological filter.
There is a lot to admire about this manuscript.Explorations of how stratigraphic structure can affect interpretation of trait evolution in time-series has been almost completely ignored since Bjarte Hannisdal's excellent paper almost 20 years ago.I don't have the expertise to evaluate the sedimentological model, but it is well explained, and the exploration of a carbonate system is unique in the literature of evolutionary time-series, as far as I know.The study is well designed, using both idealized and empirical sea level curves, and sampling across different parts of the carbonate platform.Figures are very nice, and it is written clearly and with grace.
The one problem is that the surprisingly bad performance of the evolutionary models is based on a misunderstanding, and, as a result, a lot of the results and discussion will need to be reconsidered.The general issue is that one needs to evaluate not just model support (AICc and Akaike weights), but also model parameters to understand an analysis.In this study, specifically the use of the OU model creates confusion because this model can take on parameter values that cause it to converge to the other three models.The OU model has 4 parameters: the ancestral trait value (anc), the optimal trait value (theta), the strength of attraction to the optimum (alpha), and the stochastic component (vstep).
• When alpha is very strong, the traits are drawn very quickly to the optimum.In the limit (alpha -> Inf), you get a white noise process around theta, with a stationary variance of vstep/(2*alpha), equivalent to how stasis is modeled.
• When alpha is very weak, the optimum has little effect.As alpha -> 0, you get a random walk / Brownian motion.
• When alpha is weak and the optimum is very far from the starting trait value, you get a nearly linear trend from anc to theta.So, the initially confusing results of evolutionary model fitting can be understood: nearly all model support is received by either the generating model or the OU model with parameter values that cause it to mimic the generating model.For example, when stasis is the generating model, theta and anc will be very similar and alpha values will be quite strong (often easier to judge from the half-life, log(2)/alpha), and furthermore, vstep/(2*alpha) will be nearly exactly equal to the stasis variance.Note that considering parameter values removes all difficulties of interpretation: the user would find that all the model support goes to two different ways to parameterize white noise, which is the correct, generating model.
The Discussion alludes to this property of the OU model (line 687ff), but doesn't make the connection to interpreting the results in this light.The need to consider parameter values was one of the major points of Grabowski et al. (2023) in their correct criticism of Cooper et al.'s (2016) paper about OU models.
In terms of how to handle this, the paper can be revised to account for this interpretation, but I'd argue it would be better served by simply omitting the OU model, for two reasons.First, this model is not easily to justify biologically.Yes, the OU model can be used to model a population converging to a new adaptive peak.
But this dynamic is rapid, and is expected to last a few generations to, at most, a few thousand generations.
On the ~2 Myr scale of this study, there really isn't any expectation that this dynamic should be captured.
Second: URW, Trend/drift, and stasis are useful because they capture three qualitatively different evolutionary patterns: meandering, directional, and fluctuating, respectively.I don't see any benefit to adding a 4th model that just has the effect of mimicking the best-supported of the other models, essentially splitting the support for the correct dynamic over two nearly equivalent models.
Another contributing factor here is in how Akaike weights work with nested models.If one model is nested within another (e.g., Brownian motion within Brownian motion with drift; the other three models are also nested or nearly nested within OU), it is impossible for the simpler model to decisively beat the more complex one according to Akaike weight.The log-likelihood of the more complex model cannot be lower than that of the simpler model when they are nested.Therefore, the only way for the simpler model to be better is via the parsimony term in AIC.For models that differ by 1 parameter, this leads to a delta AIC of 2, and maximum Akaike weights of 0.73 for the simple model, even when the simple model is correct (see Hunt 2006, p. 596).This is for AIC; with AICc, the exact weight will be initially higher for the simple model and then converge to the AIC value with increasing n.This means that the 0.9 used as a threshold for Akaike weight is inappropriate: it is mathematically impossible for the simpler of nested models to reach this threshold for AIC (and for AICc except when the parsimony penalty is high at low n).This, by the way, also explains the puzzling behavior in Figure 10 in which performance seems to gets worse with increasing n: more complex models will face decreasing parsimony penalties as n increases, which explains the asymptotic increase in support for OU in these plots.I will say that I am puzzled that the OU model so consistently beats Stasis even with the two extra parameters in the OU model.It doesn't really matter much here because the dynamics will be basically equivalent (as discussed above), but this is something I am curious about.
Below I have added some minor comments, in manuscript order.Despite the problem I have identified above, I want to emphasize how much I like this study.With suitable revision, I think it will be an important contribution to the literature.

Minor comments, in manuscript order
• Line 34: here, and elsewhere in the manuscript, pulsed change is referred to as punctuated equilibrium.
I don't think this is quite accurate: the punc eq model has pulsed change but it occurs at lineage splitting.A pulsed change within an unbranched lineage is more evidence against than for punc eq because it involves large changes without speciation.(Gould would sometimes try to cloud this issue.)I'd recommend using terms like pulsed or punctuated change, and not punctuated equilibrium, for unbranched lineages.
• Line 88, before the Fossilized Birth Death model and related approaches, there was a phase in which fossil data was used a lot (sometimes naively) to get constraints for node dating approaches.
• Line 110ff: The presentation of completeness that I am familiar with (e.g., Shanan Peters' work) emphasizes that completeness will depend on the temporal scale of resolution.A section may be mostly complete when considered in 1 Myr bins, but will be much less so if the bins are 10 Kyr.
• Line 169: here and at a few other places, it seems to imply that previous approaches in paleo have required samples to be equally spaced in time.The model fitting approaches used here and in Hunt ( 2006) cited here have always allowed for arbitrary spacing of samples.
• Line 305: I don't think it needs to be done in this paper, but, as an FYI, it is not difficult to generate realizations of the OU model with unequal sampling.The sim.OU function in paleoTS does it one way, and there is another approach in which a whole time-series is a single draw from a multivariate normal distribution using the vector of means and covariance matrix from Hansen & Martins (1996).
• Line 322: not quite right as written, as the standard deviation would be sigma *sqrt(t), not sigma.The simulation code is correct, though.
• I would not say scenario 3 is "weakly directional".Both it and scenario 4 are strongly directional, really more so than just about any empirical sequence.This can be seen from the figures -both look almost like straight lines -and the results are basically the same throughout for both.Calculations from Hunt (2012, Table 1) indicate that directionality accounts for 98% and 99.5% of the evolutionary change in scenario 3 and 4, respectively.I'd recommend just keeping scenario 3 as representing trends and dropping the unrealistic scenario 4.
• Line 348ff.I see the need for the distinction, but it seems odd to call them both time-series.Perhaps instead they can be stratophenetic series and time-series?The former phrase has been used occasionally in this literature.
• Line 516, about stratigraphic completeness not being the driver of outcomes.This is an interesting and important point.
• Line 640ff: this section should be reconsidered based on my comments above.The discussion about Levy flights is interesting, as I agree that kind of dynamic would probably be favored when there are unrecognized hiatuses in a section.That model isn't implemented in paleoTS, but (within-lineage) punctuations are.
• Line 826: this references Hannisdal (2006) This project tackles an important issue in the study of phenotypic evolution using fossil data: that the structure of the stratigraphic record confounds the direct application of evolutionary models to these data.This is underexplored (and somewhat ignored) but is exactly the type of work paleobiologists need to be doing.Please find below three reviews from researchers who are very familiar with the evolutionary modeling side of things and/or the sedimentary modeling side of things.I think these comments indicate both the merit of the work but also the need for some revisions if this is going to be as impactful as it could be.
To this end, I was also particularly surprised by the performance of the model selection without stratigraphic biases.One reviewer's comments about the need for the continuous-time expansion may be relevant here.
But I also wonder if the results that you are getting are due at least in part to using model implementations for testing (those from the paleoTS package) which are different from the implementations used to simulate the data (those original to this study).To quickly assess this, I ran some simulations using the functions available in the paleoTS package.I found some interesting patterns and they are not completely different from what you found.For example, short time series (N=5) simulated under the stasis model in paleoTS are hard to distinguish from unbiased random walks (and this is expected given the smaller number of parameters in the unbiased random walk model).Similarly there is increasing support for OU as time series length increases.
However, stasis never "loses" to the same degree as shown in your Figure 10.In fact, stasis is the best model in the majority of cases across all time series lengths*.It is also worth noting that the results are sensitive to the amount of variance (stasis is more strongly supported for time series simulated with smaller variance).I have included the R script and figure here in a PDF (which is the file type I can upload; I'm happy to send the actual files to you directly).*Although rarely with an AICc > 0.9.One reviewer commented on the stringency (and interpretation) of relying on AICc values that high, which I think is worth considering.Another reviewer wondered about using AICc instead of AIC.Based on my knowledge of the paleoTS package and your R scripts (utils.R), you must have used AICc -so this should be clarified in the text.

I look forward to seeing the revised manuscript! Download recommender's annotations
Moreover, except for the stasis simulations, the paleoTS analyses tend to show highest relative support for the correct model, particularly for the Brownian drift (GRW) models.For both the weak and strong Brownian drift, the stratigraphic domain results show greater relative support for the correct model than the time domain analysis.This is not surprising, because the very strong trends generated in these models will be condensed stratigraphically into even more extreme trends, favoring even more strongly biased random walks.
As expected, the results for the Brownian motion (URW) models are less clear cut, and the counterintuitive effect of time series length may be interpretable, as discussed below.
For the stasis simulations, the paleoTS analysis clearly favors the wrong model (OU) across all the scenarios shown in this manuscript, which is obviously problematic and worth pursuing.I don't have any experience with or insight into this, but the authors may want to consult the literature on phylogenetic comparative methods that discuss issues surrounding bias towards OU models in AIC model selection (https://doi.org/10.1111/2041-210X.12285).
The sensitivity of the time domain analysis to time series length (Fig. 10) is counterintuitive, and the authors understandably struggle to make sense of it.How could increasing the number of observations in the time series reduce the relative support for the correct model?I wonder if this might be an expression of the kind of stratigraphic distortion the authors are seeking to investigate.In some of the simulations reported in Hannisdal (2006), I found that the greatest stratigraphic distortions could occur in the thickest, most densely sampled distal sections.These sections were not characterized by unusually large gaps per se, but rather by unusually high completeness of the deposits between the gaps, yielding dense temporal sampling of short time windows.The more densely these short windows were sampled, the greater the distortion, in some cases.This might be relevant for understanding the counterintuitive effect of denser sampling reported in this manuscript, if the denser sampling is constrained within short windows of deposition.

Additional comments
The aim and approach of this manuscript is quite similar to my ancient study (H06), allowing me the indulgence to reflect on the pros and cons of H06, which was part of my dissertation.On the one hand, Hohmann et al's approach is more sophisticated than H06 in terms of running multiple model realizations and scenarios, and replacing the old hypothesis tests with likelihood-based model selection.On the other hand, H06 attempted to use a biologically motivated model of trait evolution, taking into account population size, ecological preference, and preservation.
In terms of research design, however, these kinds of stochastic forward model experiments have their limitations.The space of possible models is too vast be explored exhaustively, but if done thoughtfully they can help build an intuition and a greater understanding of the problem and how to address it.It was pointed out to me back then that although understanding how stratigraphy can distort trait evolution is important, the real question is what to do about it.Already then, the field was shifting from classical null hypothesis testing towards parameter estimation and model selection.Hence, the H06 paper was merely preparing the ground for proposing a new, inverse modeling approach, part of which was published in Hannisdal (2007, Paleobiology), which I believe holds more important messages than H06.I get a sense that the authors of this manuscript also aim to set the stage for a new approach, and I look forward to seeing what they come up with.
Finally, I think the authors should engage a bit more with the literature on this long-standing problem.The introduction would greatly benefit from a more representative view of the history of this research topic.For what it's worth, I found a link to my 2006 dissertation, which includes a short introduction chapter with an overview of the literature up to that point, in the hope that the authors might find it useful.In addition to the two published papers H06 and H07, the thesis also includes an unpublished chapter on model selection: https://www.proquest.com/dissertations-theses/inferring-phenotypic-evolution-fossil-record/docview/304952421/se-2?accountid=8579

Reviewed by anonymous reviewer 1, 13 February 2024
The manuscript investigates the effects of stratigraphic biases on our ability to detect the true evolutionary mode in the fossil record.The authors point out that most of the work focusing on interpreting fossil time series uses age-depth models based on simplified and highly unrealistic assumptions regarding the regularity of the stratigraphic record.This is problematic, as stratigraphic biases might affect how we interpret phenotypic change in the rock record.The manuscript is therefore addressing an important issue and has the potential to be a significant contribution to the field.However, for reasons detailed below, I am not entirely convinced by, nor sure how to interpret, the results of the manuscript.

Main comment:
It seems a bit worrying that the correct (simulated) evolutionary model is not recovered under excellent sample conditions in the absence of stratigraphic biases.A simulation study like this is, to some extent, equivalent to an experiment in the sense that you want to have a 'control' so that you can reliably measure the effect of the variable(s) you later start to tweak.It therefore appears imperative to recover the true evolutionary model in the simulated data in instances when the factors that you want to test the effects of later on (e.g.stratigraphic bias, etc.) are not allowed to compromise the signal in the data.The fact that it is harder to detect the true evolutionary mode with longer time series is also worrying.Analyzing more data that is generated under the true model should make detection easier.I am not convinced by the suggested reasons the authors provide to explain their lack of ability to detect the true model in absence of any stratigraphic biases in the data, and I sense (but might be wrong) the authors are not entirely convinced by their own arguments either.I thus advocate that the authors investigate why they fail to recover the true model in their data that lack stratigraphic biases, as the rest of the results in this work are difficult to interpret without a fully functioning 'control'.If this can be fixed, or at least explained, this manuscript is likely to be a significant contribution.Line 338: Why include a sample variance?Including this will introduce noise into the data, and this is not one of the aspects under investigation.

Minor comments:
Line 351 -354: The rationale for including the OU model as one of the candidate models when investigating relative model fit is unclear.Wouldn't it make more sense to assess how a model of punctuated evolution performs?Such a model is available in the paleoTS package.Hiatuses of a specific duration appear to produce a punctuation-like pattern in the stratigraphic domain when the underlying model is either an unbiased or biased random walk.This test would connect the work more closely to the ongoing debate regarding the validity of the punctuated equilibrium hypothesis.
Using a criterion of 0.9 for AIC weights is quite stringent, particularly since the number of data points in each time series is relatively small, and the various models can generate largely similar trait dynamics (e.g., stasis and OU can produce very similar trait evolution).With small sample sizes, it is common to use the AICc, the bias-corrected version of the Akaike Information Criterions.AICc approaches AIC asymptotically as the sample size increases.Is there a specific reason why you have favored AIC over AICc?FYI: I am not an expert on computer simulations of sedimentary strata, so I am not in a position to critically assess whether the simulations using the CarboCAT Lite model of carbonate platform formation are sensible.that, although the real-world SL record is highly variable, it may result in a stratigraphic record that has a fairly good ability to preserve evolutionary change in the fossil record.

Tables and figures
The figures are appropriate and support the text, although I suggest a few minor modifications to some of the figures.Additionally, some of the figure captions need more explanation of the information conveyed in the figures.
In Figure 2, it would be helpful to have a color key for the different facies depicted in the simulated shelves.
Differentiating among facies is not a focus of this manuscript, however, they are clearly shown in this figure and in Figure 3.In the caption, please clarify what are "the graphs," or refer to other figures by number.
In Figure 3, please include a color key to the facies, as suggested for Figure 2. In the caption, I suggest either stating the explicit distance from shore or adding a line to the Wheeler diagrams to show where the graphs were extracted, rather than referencing the "middle" of the grid.Please clarify: the extracted "graphs" are 2C and 2F?
Figure 7 illustrates the point that stratigraphic completeness in and of itself has little bearing on the preservation of trait evolution patterns.It is unclear why a column from 2 km in scenario A is compared to a column from 6 km scenario B. Scenario B is so little discussed in the Results and Discussion, and there is little explanation for the rationale of comparing two different platform locations across the two scenarios.I think the point is that completeness doesn't vary too much, but it is hard to take that away from comparing two completely different column locations and scenarios.Please explain this choice in the caption and text or consider modifying the figure to show outputs from more readily comparable example columns.
Figures 8 and 9, 10: Captions need explanation of the abbreviated tested modes in the legends.
I suggest adding a table to report the results of the AIC analyses for the different simulations and models.
The pertinent results are reported in the text, but a table would be helpful for presenting all the AIC values.

Authors
. The only other example of a similar study I know of appears in one of the chapters of the Patzkowsky and Holland book.I think that should be cited somewhere in here as well.(It is possible that chapter refers to another paper that I am not remembering at the moment, too.) • Line 829: you would probably see more artefactual support for stasis if the generating parameters didn't have such high rates.With lower rates, sampling noise would be a larger component, and since sampling noise looks like stasis, you should get more spurious cases of stasis.Analyzing more shorter sequences might have a similar effect.• I love the forward modeling approach to investigate what happens under known conditions.I am curious if you think these sedimentation models might ever be used for the inverse problem, so as to generate more realistic age models for empirical fossil time-series?Signed, Gene Hunt Hunt, G. 2012.Measuring rates of phenotypic evolution and the inseparability of tempo and mode.Paleobiology 38(3):351-373.Patzkowsky, M. E., and S. M. Holland.2012.Stratigraphic Paleobiology: Understanding the Distribution of Fossil Taxa in Space and Time.University of Chicago Press, Chicago.Download the review Evaluation round #1 DOI or URL of the preprint: https://doi.org/10.1101/2023.12.18.572098Version of the preprint: 1 Hopkins , posted 29 February 2024, validated 01 March 2024 Dear Niklas and co-authors,

Figure 1 :
Figure 1: This figure is important as it describes the study design.I would have appreciated a more detailed figure caption to make it easier to understand the different steps in the study.