Estimating human mobility in Holocene Western Eurasia with large-scale ancient genomic data

Significance Ancient human DNA (aDNA) extracted from archaeological contexts allows reconstructing past population movements. Previous methods work by calculating proportions of shared ancestry among individuals or groups in order to answer specific, regional research questions. Here, we propose a large-scale algorithm to quantify human mobility through time and space using bulk aDNA data. The algorithm has two core components: i) interpolation of the spatiotemporal distribution of genetic ancestry to obtain a continuous ancestry information field and ii) probabilistic estimation of a spatial genetic similarity surface for each input sample by projecting its ancestry profile into this field. We apply this to thousands of published genomic samples in the last 10,000 y to trace diachronic mobility patterns in Western Eurasia.

To stay true to a 3D perspective, the printing order of each sample dot is according to the third dimension (the one not on the two axis) -with lower values always printed first. For A that means for example that the dots are printed in the order of their coordinate value on C3: Samples with lower values on C3 are printed first, so they are below samples with higher C3 values. D and E are ordered by C1. 5 Figure S5: Diachronic Gaussian process regression interpolation map matrix as in Figure 3, but here for the first five output dimensions of the projected PCA. Compare Figure S4.     Figure S10: Regional mobility curves just as in Figure 5 for the mobility estimation run with the first two MDS dimensions (MDS2). Identical to Figure 5, but with the two additional regions Southeastern Europe and Western Pontic Steppe.   Figure S11: Regional mobility curves for an mobility estimation run with the first five Projection PCA dimensions (PCA5). Beyond that just as Figure S10. See Figure S34 for a direct comparison.   Figure S12: Regional mobility curves for an mobility estimation run with the first two MDS dimensions (MDS2) and a lower retrospection distance.   Figure S13: Regional mobility curves for an mobility estimation run with the first two MDS dimensions (MDS2) and a higher retrospection distance.   Figure S15: Another view on the data in Figure S10. The same sliding window used to calculate the moving mean and standard error for Figure S10 was employed here to determine proportions of distance vectors smaller, in between and bigger 500, 1000 and 2000 kilometres. These fractions are displayed as regionwise stacked area charts. No-data windows are left blank.  Figure S16: Another view on the data in Figure S10. Each subplot in the plot matrix covers a 500-year time window, where each mobility vector is shown as a white line, connecting the sampling location (so usually the place of death) in green with the reconstructed point of highest genetic similarity one default retrospection distance in the past. These points are coloured by a gradient indicating sample age within the respective 500-year time window. This is helpful to see some processes, for example in the 2500-2000BC time window. 17 Dataset S1: Sample context information 32 Lists context information for all individuals/samples that went into the analysis.  with Dataset S1. The remaining 80 columns emerge as the product of ten output dimensions (C1-C10), four    Figure S2 shows the scatter plots of the first (C1) and the second (C2) output dimension for these 114 methods. The subplots E and F already indicate that simple PCA with mean imputation is not capable 115 to distinguish spatiotemporal clusters as clearly as the other methods, which we think is due to the under-116 performing imputation of missing data in ordinary PCA using mean allele frequencies. Experiments with the 117 correlation and out-of-sample prediction analysis below confirmed this observation: Simple PCA performed 118 worse there by a factor of 2 to 3. We therefore decided to exclude this method right away from further 119 consideration. 120 We also saw a clear separation of samples that were prepared via untargeted shotgun sequencing and 121 samples that went through a target-enriching capture preparation step (usually for the 1240K SNP set) on 122 the third output dimension of both the MDS and the EMU analysis ( Figure S3).  Figure S17 summarizes the results for these metrics.

138
The nugget term in S17 A is introduced in more detail below in Supp. Text 2. It is calculated inde-

148
The measure in Figure S17 B is calculated as the correlation of pairwise "genetic" distance (Euclidean 149 distance in the multivariate output dimension space up to dimension n, so e.g. C1-C7 for C7), and pairwise 150 spatiotemporal distance (Euclidean spatiotemporal distance scaled with 1 year = 1 kilometre) (e.g. Figure   151 S18). We report the R 2 value to summarise the output, so a higher value indicates a stronger correlation. All 152 methods perform generally well, and higher-dimensional spaces seem to be linked to some degree to higher 153 correlations for MDS and EMU up to C7. This does not necessarily contradict the results for the nugget 154 term: Here long-distance correlation dominates the result, which is deliberately omitted in the nugget. We 155 also observe, that the filtered SNP set performs consistently worse in all instances of this correlation test, 156 probably because the filter is not perfect and also removes valuable information.   C7 on the x-axis means C1-C7). The differences shown in Figure S17 C are normalised by the mean pairwise 163 distance in said space to make them comparable across methods (a similar normalisation as done in the 164 calculation of the nugget). Projection PCA performs best by that metric, especially for higher-dimensional 165 spaces. EMU and MDS lose accuracy quickly, the former already for C1-C3, the latter after C5. Note that the 166 general increase of values towards more included dimensions is expected due to the "curse of dimensionality".  Figure S18: Correlation of pairwise genetic and spatiotemporal distance with genetic distance in two-or three-dimensional MDS space. The pairwise distances are counted in bins and plotted as a density raster.
To conclude, the observations for these three measures together do not necessarily lead to an obvious 168 decision which multivariate analysis method, n, and SNP set is optimal for the spatiotemporal mobility 169 estimation we want to attempt. In an iterative process we could rule out some options, though:

170
The reduced, filtered SNP set performs almost always worse than its unfiltered counterpart. It clearly 171 avoids some of the shotgun vs. capture bias, but there seems to be spatiotemporal information encoded 172 exactly in this distinction -maybe through a complex interaction of the archaeological record, preservation 173 and research history. We seem to be better off with the additional 140,000 SNPs and decided to abandon 174 the filtered dataset.

175
The question which multivariate method to use is harder to decide -at least for low-dimensional spaces.

176
Our understanding is, that Projection PCA, EMU and MDS generally perform similarly well on C1 and C2, 177 with various local optima where one method trumps the others. Here we resorted to external factors, like 178 the complexity of the underlying algorithm and the amount of additional parameters necessary for a given 179 method, assuming that simpler and less is generally preferable. EMU employs a relatively complex imputation 180 algorithm on top of normal PCA, and Projection PCA requires a set of modern reference populations, which 181 has critical influence on the genetic space it generates. We therefore decided to rely on MDS (MDS2) for the 182 analysis presented in the main text. This includes the implicit assumption that the method would produce 183 similar and robust results even with other pairwise distance metrics beyond the 1-IBS measure we employed.

23
A forced limitation to C1 and C2 has the additional advantage that a 2D "genetic map" is relatively easy to 185 visualize and understand -one can intuitively conclude that its structure is meaningful on the spatial and 186 temporal scale of our analysis.

187
For higher dimensions beyond C2, MDS and EMU can be ruled out entirely due to the extreme bias the 188 shotgun vs. capture distinction introduces ( Figure S3) and which we could not reliably cancel out via SNP 189 filtering. Projection-based PCA has a massive advantage here, as it relies on an optimized, external data 190 source to inform the structural properties of the genetic space. It is robust for all the ten dimensions we 191 tested, and adding more dimensions could barely deteriorate correlation or predictive accuracy. It is unclear 192 though, how many of these additional components n add value to the similarity search implemented for this 193 paper. Figure  A key component of this paper is the interpolation of a genetic ancestry field based on the output dimen-202 sions of different multidimensional analysis methods. Here, we use Gaussian process regression, which is a 203 parameterised method. The following section explains the process we went through, to find an optimal set of 204 parameters. For increased clarity the plots only show the results for the first two or three output dimensions 205 of our MDS run with the unfiltered SNP set (see Supp. Text 1). As discovered above, the third dimension 206 is highly biased by the library preparation (capture vs. shotgun) and not directly correlated to space and 207 time. When we include it below, then only as a didactic reference point. 208 We consider a number of individuals distributed in space and time, with a single-dimensional (scalar) 209 genetic MDS (or PCA) component as dependent variable. We use the notation (x i , y i , t i , g i ) to denote for 210 each data point i the set of spatial coordinates (x i and y i ), an age t i and the value of the genetic component We intend to model our data points as a random Gaussian process, for which we are using the laGP R 213 package for local approximate Gaussian process regression [8]. As a technical note, one of the assumptions 214 in this package is a mean of zero in the Gaussian process, which we exactly achieve by first fitting a linear 215 model to the data, and then considering the residuals instead of the original genetic values.

216
In mathematical terms, the model including the linear fit can be presented as where g ′ reflects a mean-zero random field, which we model with GPR. For simplicity, and because this is a 217 one-time operation, we just continue using the notation g i , now actually denoting the residuals of the linear 218 model instead of the raw genetic component.

219
A key ingredient for Gaussian process regression is the covariance kernel function, for which we here 220 follow the standard choice of a squared exponential, which in general terms for p-dimensional input data 221 and in the notation of laGP is defined as where x and x ′ are positions in p-dimensional space, and θ k are lengthscale parameters for each dimension 223 k. η is a dimension-less additional noise term to be added only if x = x ′ , using the delta-distribution, the so 224 called nugget term. τ 2 is a general scaling parameter.

225
Specifically for the purpose of spatio-temporal modelling with isotropic space, we write this covariance 226 function as where we have changed notation slightly, and introduced the spatial kernel radius ρ and the tempo-

233
It is instructive to first consider variograms in the context of continuous fields, where the field value 234 g(x, t) is defined at all spatial points x (which in our case are two-dimensional) and all time points t. The 235 semivariogram is then defined as the mean squared difference of field values at given spatial and temporal 236 distances: where the average runs over all space-time points (s, t).

238
Under the assumption of constant variance ⟨g(s, t) 2 ⟩ = ⟨g(s + r, t + u) 2 ⟩ for all s, r, t, u, we can establish 239 the relationship of the semivariogram and the covariance function of the Gaussian process: So the variogram is directly related to the covariance of the Gaussian process: instead of continuous spatial and temporal "radius" values r and u, as in the continuous version, we now 245 consider bins R k = (r k , r k+1 ) and U l = (u l , u l+1 ), with boundaries r 1 < r 2 < . . . and u 1 < u 2 < . . .. We then where I(condition) is an indicator function that is 1 if the condition is true and zero otherwise, and the 248 normalization N is Figure S19 is one way to visualize this empirical semivariogram V (k, l) as a raster plot. The bins R k and 250 U l are here chosen such that r i − r i−1 = 100km and u l − u l−1 = 100years.

251
To finally determine the kernel parameters from equation 3, one needs to fit the variogram using the 252 kernel covariance function, thereby learning the four kernel parameters τ 2 , η, ρ and α. We realised that in 253 the case of our data, this was unfortunately not possible, as we cannot co-estimate Cov(0) and the kernel 254 radiuses simultaneously. Consider Figure S20, which shows only a single cut through the semivariogram. In 255 this case, we aim to fit three parameters from this curve (since we consider only the temporal dimension  pairwise half mean squared normalized residual distance ancestry component distance type Figure S21: Violin-and boxplot of the detrended pairwise distance distribution for different ancestry components in a short and narrow temporal and spatial distance window (< 50km & < 50years). The diamond shaped dot is positioned at the mean point of the distribution.
For the nugget term we have now an estimator This can be readily derived, since the variance τ 2 can be estimated as the overall sample variance of the data, i.e.τ where N is the number of data points andḡ is the mean genetic value. for θ s , one for the spatial x axis (θ x ), and one for the spatial y axis (θ y ). Despite this, we decided to apply 300 the algorithms here to get a first estimate for θ and to test our previous conclusion concerning η. 301 Figure S22 shows the result of multiple runs for each combination of algorithm and ancestry component. ) with different values of θ. The "pairwise distance" could for example be in kilometres or years.
One consistent observation across all three ancestry components considered here, is that θ x should be 311 different from θ y for an optimal model. So the above mentioned anisotropy issue does indeed affect the 312 outcome of the parameter estimation and poses a form of overfitting. We do not believe though that a model 313 with a latitude-longitude mismatch is justified in this context.

314
The estimated values for η are about one order of magnitude smaller than the ones estimated from the 315 variogram. We assume this to be an effect of the implausible anisotropy. Experimental interpolation test 316 runs with η < 0.04 led to overfitting in settings with fixed θ x = θ y and we therefore decided to keep η as 317 decided above.

318
laGP also provides the function mleGP to estimate θ and η in isotropic systems and we decided to employ 319 this algorithm as well. To account for the anisotropic nature of the space-time relationship we introduced a 320 scaling factor that manipulates the temporal axis. Starting from the default 1 (1km = 1y) we increased and 321 decreased the scaling factor in a rescaling sequence from 1/10 to 2. Figure S24 documents the result: mleGP 322 yields only one value for θ, which reacts to the forced temporal "contraction" and "inflation". One way to 323 imagine this is a rigid sphere in a changing cuboid universe: We contract or inflate the cuboids z-axis and 324 estimate for each setting i) if a sphere is a good assumption for predicting observations (Figure S24 B) and 325 ii) which radius the sphere should ideally have (Figure S24 A). As stated above, we used fixed values for η 326 here.

327
In the MDS setup shown here and for the spatiotemporally informative ancestry components C1 and 328 C2, increasing and decreasing the scaling factor, so temporal inflation and deflation, quickly deteriorates the model likelihood. A scaling factor of 1, so θ t = θ s and 1km = 1y, yields good results for both. The estimates 330 for the absolute values of √ θ are smaller, but on the same magnitude as for the anisotropic estimation above.     Figure S24: Results of mleGP exploration runs with variable scaling of temporal and geospatial space. mleGP assumes an isotropic sytem.

335
As a third and more independent method to estimate θ, we turned towards a simple crossvalidation approach,  Figure S26: Crossvalidation results (mean squared differences between prediction and observation) of the first three ancestry components of MDS and projected PCA for different θ s and θ t combinations. The combinations of θ s and θ t with the best mean predictive power are highlighted in red.

33
The latter is confirmed when we look at the mean squared difference rasters in Figure S26 A   Beyond that we also experimented with smaller kernels and kernels with θ s ≪ θ t and θ s ≫ θ t . Above The main question for this paper was to estimate and quantify genetic spatiotemporal similarity and therefore 378 ancestry relocation through human mobility. We assume this can in principle be done because people carry 379 their genetic ancestry profiles with them when they move. Mobility estimation requires i) a suitable dimension 380 reduction for "ancestry", ii) a handle on the sparseness of genetic data and iii) an algorithm to derive a 381 probabilistic measure of genetic similarity for individual samples through space and time, which considers 382 aforementioned sparsity. We finally need iiii), a method to assess the similarity space to quantify meaningful 383 signals of possible mobility. 384 We deal with the first requirement with different multivariate methods, which assign every individual  probability that our individual with ancestry C s matches our ancestry field C at location (x, y, t) using the 403 normal distribution (for brevity we hide t): From a Bayesian perspective, this is the conditional probability of observing data C s under a model 405 (x, y), where "model" here simply refers to an unobserved similarity point. Such a conditional probability, 406 where the data appears left and the model parameters right is a likelihood. We can flip this using Bayes' 407 formula to obtain the posterior probability for (x, y) given data C s , using a prior probability p(x, y): where Z simply is the normalization constant that makes this expression a valid probability over x, y: Figure S27: Schemata to explain the similarity search algorithm. This leads to a posterior "similarity" probability for potentially any point in space, so a similarity 410 probability field. In our implementation the resolution of this field is limited by the regular, interpolated 411 search grid ( Figure S27 D & E).

412
For real-world data this operation requires some further generalisation. Above we had a single component 413 C(x, y), but actually we have multiple, typically at least C 1 (x, y) and C 2 (x, y). To combine the two, we 414 simply compute probabilities p 1 (x, y|C s1 ) based on C 1 (x, y) and the focal value C s1 , and p 2 (x, y|C s2 ) based 415 on C 2 (x, y) and the focal value C s2 . To get a combined similarity probability, we multiply the two and 416 normalise again.

417
Real world data is also not precisely positioned in space and -even more severely -time, e.g. through 418 uncertainties in absolute dating, where most input samples informing the interpolated field are either dated 419 with radiocarbon ages or through archaeological context information. For the former we can derive a complex 420 post-calibration radiocarbon probability distribution, for the latter at least a uniform probability distribution 421 from the potential start to the potential end point. One solution to consider this is through iterations of 422 36 random sampling, which leaves us with sampling iterations for C and thus p(x, y|C s ). In this case a combined 423 probability field could be calculated as the mean of the individual fields, but for our large-scale mobility 424 proxy (see below) we in fact computed separate mobility vectors for each temporal resampling iteration.    Balkans, but lag behind on the Early Neolithic population shift (here disregarding the extreme sparsity of 493 pre-Neolithic observations). 494 We generally assume, that our similarity search algorithm is capable of distinguishing spatial origins,  For the simulation we assume a world with two spatial (x, y) and one temporal dimension (z). Each of these 501 three dimensions scales between 0 and 1 and the space is fully homogeneous and featureless. Within this 502 world, observations ("samples") are randomly distributed following a uniform distribution through space 503 and time. The sample size is a variable parameter of this setup to later measure the effect of sample sparsity.

504
One quarter of the spatial square landscape passes through a different genetic development as the rest (see 505 Figure S29), which puts it more or less apart through time. As an analogy to the real-world C1 mds u, the 506 genetic component is represented by a numeric value roughly scaling between 0 and 1.

537
Unsurprisingly we observe that smaller kernels yield more irregular, larger ones more smooth curves, but 538 generally the field succeeds in reconstructing the broad strokes of the input scenarios. The example in Figure S31 is informative, but as a single iteration not conclusive on the behaviour of the   The purpose of this simulation exercise was to get a better understanding the robustness of the similarity 588 search algorithm in different scenarios. From our analysis we conclude, that said robustness is high and 589 should generally be suitable for real-world data on the orders of magnitude relevant for this paper. There 590 are, of course scenarios, where the algorithm fails to yield meaningful output. These are notably significant 591 bi-directional genetic turnover, so when two regions swap their genetic make-up, and extensive cross-regional 592 synchronization. The former can be considered unlikely or at least rare, but the latter is a defining property 593 of the Western Eurasian archaeogenetic record for various regions after the Bronze Age, which happens to 594 be a focal research context for this study. 595 We considered this setup in the artificial limited scenario of this simulation and learned that large sample 596 sizes and large kernel sizes improve the accuracy of the search considerably. Larger kernels allow the past 597 to inform the present, which is a reasonably safe default for the similarity search application. For the large 598 scale mobility estimation attempted in this study, we even emphasised this effect through the introduction 599 of the retrospection distance parameter. 600 We finally highlight two more features of our similarity search algorithm that mitigate undesired effects: 601 1.) By picking not just one, but two or more ancestry components, we render it less likely that two regions 602 44 become fully similar in their ancestry profile. Two or more dimensions are less likely to be spuriously similar. 603 2.) Our algorithm is probabilistic, and reveals for each sample a probability distribution through space, which 604 captures the full uncertainty of our search. So in cases of ambiguity, we expect the probability distribution 605 to reflect this ambiguity and make it transparent to a user of our method.  Figure S10).

612
Southeastern Europe stands out in our analysis, because we could include a high number of compara- from Bulgaria (all from ref. [13]), whose mobility vectors point to Central Europe. Unlike other European 625 regions, the arrival of Steppe ancestry in Southeastern Europe is more gradual, beginning earlier and less 626 abruptly [13]. Few individuals show a clear mobility signal pointing to the far Northeast -e.g. I4175 [13].

627
For later periods, finally, we observe some remarkable outliers with strong mobility signals: For example Neolithic-related ancestry" [13] and thus long-distance affinity to the West and Southwest. Most data for 638 the Neolithic and the Bronze Age is from the Caucasus region and documents a complex, though relatively 639 local mobility history [19]. Within this time frame, multiple Globular Amphora context individuals (e.g.

640
ILK003 [13]) from present-day Ukraine show a strong mobility signal from the West. During the Iron Age, 641 more individuals with a relatively long-distance mobility signal appear, for example cim359 [20] and MJ-13 The mobility estimation presented in this paper depends on a large number of parameters. For many of them 648 there is no naturally optimal choice, so we had to make multiple intuitive or empirically informed decisions.

649
Supp. Text 1 explains how we selected the simplified genetic space to interpolate for our ancestry field and 650 how we settled on a 2D MDS and a 5D projection-based PCA. Supp. Text 2 explains how we determined 651 the parameters for the Gaussian process regression interpolation. As explained in Supp. Text 3, another key 652 parameter arises from the retrospection distance that should be used for the similarity search algorithm in 653 our large scale mobility estimation.  Cov(x, x ′ ) = 0.5 (default), a low one at Cov(x, x ′ ) = 0.8 (low ) and a high one at Cov(x, x ′ ) = 0.2 (high).

657
For each of these settings we reran the mobility estimation and produced the curve plots Figure S10, S12 658 and S13. An additional version in Figure S11 shows the PCA5 result, again with the default retrospection 659 distance. Figure S34 is finally an attempt to visualise the major differences between these iterations.

660
The curves for all three experimental settings ( Figure S11, S12, S13) are generally similar to the default 661 (S10). The main peaks and depressions generally overlap and seem to be detected robustly. A number of 662 differences emerge, though: A lower retrospection distance generally causes shorter mean mobility vectors, 663 whereas a higher one causes the peak similarity to be further away. This is a strong signal, but not surprising: 664 It is plausible that the further one goes back in time, the further away ones ancestors might have lived 665 originally. Especially the main long-distance events during the Early and Late Neolithic get exaggerated by 666 this effect (see e.g. the timeseries for the Stuttgart sample in Figure S7) and only few contexts and individuals 667 seem to deviate from this general pattern entirely.  Divergence from MDS2 with the default retrospection distance in km Figure S34: Comparison of different mobility estimation runs in a plot matrix. Each row of the plot matrix covers one analysis region, each column one of the three additional run configurations. The dots show the length of the summarized mobility vectors of one sample, just as in Figure 5. Vertical lines connect the results for the given run and the MDS2 run with default retrospection distance as in Figure 5 and Figure  S10. The diverging colour scheme of these lines highlight when, where and to which degree the runs yield different mean mobility vector lengths.