Ancestry and demography and descendants of Iron Age nomads of the Eurasian Steppe

During the 1st millennium before the Common Era (BCE), nomadic tribes associated with the Iron Age Scythian culture spread over the Eurasian Steppe, covering a territory of more than 3,500 km in breadth. To understand the demographic processes behind the spread of the Scythian culture, we analysed genomic data from eight individuals and a mitochondrial dataset of 96 individuals originating in eastern and western parts of the Eurasian Steppe. Genomic inference reveals that Scythians in the east and the west of the steppe zone can best be described as a mixture of Yamnaya-related ancestry and an East Asian component. Demographic modelling suggests independent origins for eastern and western groups with ongoing gene-flow between them, plausibly explaining the striking uniformity of their material culture. We also find evidence that significant gene-flow from east to west Eurasia must have occurred early during the Iron Age.

D uring the first millennium BCE, nomadic people spread over the Eurasian Steppe from the Altai Mountains over the northern Black Sea area as far as the Carpathian Basin 1 . The classical Scythians, who had lived in the North Pontic region since the seventh century BCE, are the most famous among them due to the early reports in the Histories of Herodotus (490/480-424 BCE) 2 . Greek and Persian historians of the 1 st millennium BCE chronicle the existence of the Massagetae and Sauromatians, and later, the Sarmatians and Sacae: cultures possessing artefacts similar to those found in classical Scythian monuments, such as weapons, horse harnesses and a distinctive 'Animal Style' artistic tradition. Accordingly, these groups are often assigned to the Scythian culture and referred to as 'Scythians'. For simplification we will use 'Scythian' in the following text for all groups of Iron Age steppe nomads commonly associated with the Scythian culture.
The origin of the widespread Scythian culture has long been debated in Eurasian archaeology. The northern Black Sea steppe was originally considered the homeland and centre of the Scythians 3 until Terenozhkin formulated the hypothesis of a Central Asian origin 4 . On the other hand, evidence supporting an east Eurasian origin includes the kurgan Arzhan 1 in Tuva 5 , which is considered the earliest Scythian kurgan 5 . Dating of additional burial sites situated in east and west Eurasia confirmed eastern kurgans as older than their western counterparts 6,7 . Additionally, elements of the characteristic 'Animal Style' dated to the tenth century BCE 1,4 were found in the region of the Yenisei river and modern-day China, supporting the early presence of Scythian culture in the East. Artefacts of the Scythian culture spread over a large territory shortly after its emergence, but the underlying population dynamics that may have driven the cultural diffusion are poorly understood.
Genetic studies on Central Asian populations based on both ancient 8 and modern mitochondrial DNA (mtDNA) [9][10][11][12] agree that Central Asia has historically been a crossroad for population movements from east to west and vice versa. It has been claimed that gene flow occurred from east to west Eurasia as early as the Palaeolithic 13,14 and the Mesolithic 15 , and from west to east Eurasia during the Bronze Age 16 . A recent genomic study 17 has emphasized the role of eastward migrations of people associated with the Yamnaya and Andronovo culture during the Bronze Age, followed by substantial admixture with East Asians. Most genetic studies on the later Iron Age nomads, however, have been limited by small sample size, restricted to a single cultural group, or based on the analysis of mtDNA alone [18][19][20][21][22] .
In this study, mtDNA data from 96 individuals associated with the Scythian culture in different geographical regions and time periods have been sequenced and analysed; additionally, genomic data from eight of these individuals was obtained and analysed (Supplementary Table 1). From the western part of the Eurasian Steppe, samples discovered in the North Caucasus dating to the initial Scythian period (eighth to sixth century BCE), classical Scythians from the Don-Volga region (third century BCE), and Early Sarmatians from Pokrovka, southwest of the Ural (fifth to second century BCE), were included. From the eastern part of the Eurasian Steppe, we analysed samples from East Kazakhstan dating to the Zevakino-Chilikta phase (ninth to seventh century BCE); from the site Arzhan 2, assigned to the Aldy Bel culture in Tuva (seventh to sixth century BCE); and from the Tagar culture of the Minusinsk Basin (fifth century BCE). The majority of the samples generated by this study or retrieved from the literature date to the fourth to third century BCE and were discovered at archaeological sites situated in the Kazakh, Russian and Mongolian parts of the Altai Mountains. These findings were all assigned to the Pazyryk culture, which is named after the first discoveries by Gryaznov in 1927 and 1929 in the Pazyryk Valley and famous for its rich frost-conserved graves, where human bodies, tapestry and clothing remained well-preserved [23][24][25] (Fig. 1).
While the eastern and western populations investigated here are separated by a distance of 2,000-3,500 km, archaeological evidence indicates that they were strikingly similar regarding their lifestyle and culture. The aims of this study are to investigate the extent to which these groups are genetically related to each other and whether they have a common origin, and to elucidate their demographic history and genetic relationships to modern living populations. We therefore divided the Iron Age steppe nomad data generated by this study and from the literature (n ¼ 147, Supplementary Table 1) into seven sample groups (see Fig. 2) based on geographical and chronological criteria, and analysed these ancient individuals together with an extensive sample of modern individuals from 86 populations (n ¼ 3,410, Supplementary Fig. 11 and Supplementary Table 19) from all parts of Eurasia. Our analysis included an array of statistical methods as well as a series of population genetic inference approaches, including explicit demographic modelling.

Results
Samples and sequence data. We generated genome-wide capture data on a target set of 1,233,553 SNPs 26,27 for six individuals: two Early Sarmatians from the southern Ural region (PR9, PR3, group #3 in Fig. 2; fifth to second century BCE), two individuals from Berel' in East Kazakhstan (Be9, Be11, #6) dating to the Pazyryk period (fourth to third century BCE), and two individuals found in kurgan Arzhan 2 (A10, A17, #5) assigned to the Aldy Bel culture in Tuva (seventh to sixth century BCE). For Be9 and two additional individuals from east Kazakhstan (Is2 and Ze6, #4) dating to the Zevakino-Chilikta phase (ninth to seventh century BCE), we generated low coverage (o0.3x) whole genome datasets by shotgun NGS (Table 1, Supplementary Tables 20 and 21).
We additionally obtained unambiguous and reproducible mtDNA sequences of the hypervariable region 1 (HVR1; bp 16,013-16,410) for 96 samples (out of 110 samples for which the analysis was attempted) (Supplementary Table 5). For 90/96 samples additional coding region SNPs were typed (Supplementary Table 6).
Genetic relationship and origin of the Scythian groups. The eastern sample group (n ¼ 113) can be divided in four cultural subgroups chronologically dispersed over the 1 st millennium BCE (Fig. 2). Analysing mtDNA, we found no significant genetic distance between those groups (Supplementary Table 8). We then used approximate Bayesian computation (ABC) 28 to test for continuity between the earlier (#4: Zevakino-Chilikta and #5: Aldy Bel; n ¼ 26) and the later (#6: Pazyryk; n ¼ 71) Scythian period in the East. The Tagar/Tes group (#7) had to be excluded because of their imprecise dating. These analyses revealed that they were most likely derived from one single population that was expanding over the time period considered here, that is, the two samples are unlikely to represent two independent populations that diverged earlier than 108 generations before present (g BP) or B2.7 ky BP. This scenario was highly supported in our model selection procedure (Supplementary Table 9, logistic regression, P ¼ 0.995, neural networks P ¼ 0.568, cf. confidence in model choice in Supplementary Table 10, model parameter posteriors in  Supplementary Table 11), whereas two scenarios that assumed that the eastern Scythian sample groups were derived from two previously diverged populations received very little statistical support (cumulative posterior probabilities: logistic regression, Po0.001, neural networks, P ¼ 0.001).
Since the genetic distance between the combined Scythian groups of the east versus those of the west is relatively low (F ST ¼ 0.01733; P-value ¼ 0.02148±0.0045), we used ABC to further assess if the eastern and western Scythians might share a common origin (Fig. 3). For these analyses we included contemporary samples representative of genetic diversity on the extremes of Eurasia (Supplementary Note 1). According to our model selection algorithm, a multiregional model provided the best fit to the empirically observed diversity patterns (Supplementary Table 12, 0.5% closest simulations, posterior probability P ¼ 0.708 for logistic regression, P ¼ 0.715 for neural networks method), while a model of western origin also received some support (Supplementary Table 12, 0.5% closest simulations, posterior probability P ¼ 0.286 for logistic regression, P ¼ 0.267 for neural networks method). Therefore, in contrast to the eastern origin model, a western origin cannot be fully discounted by our analysis. In addition, a pairwise comparison through the computation of Bayes factors reveals a substantial to strong (logistic regression) or weak to substantial (neural network) support for the multiregional origin over the western origin model (see Supplementary Note 1, Supplementary Table 14 and Supplementary Fig. 5b). These results suggest that western and eastern Scythian groups arose independently-perhaps in their respective geographic regions-and thereafter experienced significant population expansions (during the 1 st millennium BCE). Importantly, our simulations support a continuous gene flow between the Iron Age Scythian groups, with indications of asymmetrical gene flow from western to eastern groups, rather than the reverse (see Supplementary Tables 13 and 15 for details).
Because population movements across Central Asia during the Bronze Age are often archaeologically associated with the spread of the Andronovo culture 29 , we used ABC to fit a sample of Middle Bronze Age nomadic groups from western Siberia, most of them associated with the Andronovo culture, onto the preferred demographic model for the origin of Scythians. For this purpose-and based on low F ST values between these groups-we combined 40 samples related to the Andronovo culture in the west Siberian forest steppe 30 and nine samples from the same culture in the Krasnoyarsk region 31 , all of which were dated to the first half of the 2 nd millennium BCE. The results provided very strong support for a linkage between these Middle Bronze Age groups and eastern Scythians (Supplementary  Tables 16 and 17). However, these simulations were not able to fully capture the patterns of genetic diversity observed in the Bronze Age populations, suggesting that the true demographic history of the ancestry of Iron Age populations may have been more complex than considered here (see Supplementary Note 1 and references 30 and 32 for details).
Genetic diversity and ancestry of the Scythian groups.  Table 7).
Using nuclear SNP data, we performed a principal component analysis 33 (Fig. 4). The two Early Sarmatian samples from the West (group #3 in Fig. 2) fall close to an Iron Age sample from the Samara district 34  Since the PCA of west Eurasia in Fig. 4 does not allow one to examine the ancient samples in relation to contemporary East Asian populations, we also carried out PCA of all 2,345 modern individuals in the Human Origins dataset 35 , onto which we also projected the ancient individuals (Fig. 5). It is evident from this PCA that ancestry of the Iron Age samples falls on a continuum between present-day west Eurasians and eastern non-Africans, which is in concordance with the mitochondrial haplogroup analyses. The eastern Scythians display nearly equal proportions of mtDNA lineages common in east and west Eurasia, whereas in the western Scythian groups, the frequency of lineages now common in east Eurasia is generally lower, even reaching zero in four samples of the initial Scythian phase of the eight to sixth century BCE (group #1 in Fig. 2), and reaches 18-26% during later periods (sixth to second century BCE; #2 and #3) (Supplementary Table 7).
f-statistics. We used f4-statistics of the form f4(Test, LBK; EHG, Mbuti) and f4(Test, LBK; Han, Mbuti), which are zero for those Test samples that form a clade with LBK and positive for populations that have EHG-or Han-related ancestry, respectively. We plotted the results against each other, which resulted in a V-shaped pattern with Yamnaya at the apex (Fig. 6). The Iron Age Scythians are arrayed along a cline from Yamnaya to Ami (a population of East Asian ancestry that experienced no admixture), consistent with having ancestry from populations genetically similar to these two groups.
We also computed statistics of the form f3(Test; Yamnaya_Samara, Han) to check whether a Test population has intermediate allele frequencies between Yamnaya_Samara and Han, which are used as proxies for possible source populations. Intermediate allele frequencies can only occur if the test population is a mixture of populations related to these two sources 37 . These statistics are significantly negative for all Scythians, demonstrating that admixture occurred (Supplementary Fig. 13).
ADMIXTURE analysis. We carried out ADMIXTURE analysis 38,39 of 2,345 present-day humans 35 genotyped on the Human Origins array 35,37 and 175 ancient individuals on a set of 296,340 SNPs intersecting with those in the Human Origins array. The results for the ancient individuals are displayed in Fig. 7 for K ¼ 15, which has the highest log likelihood value (the complete analysis can be found in Supplementary Fig. 14). All   Table 26). Fig. 1) suggests that the ancient Scythian populations from the eastern and western part of the Eurasian Steppe are genetically closer to each other than are the modern populations of the respective regions. AMOVA analyses carried out for modern and ancient groups of the eastern and western steppe provided further support for this finding. We found F CT values to be higher between modern populations of the East and the West (F CT ¼ 0.0835) than between ancient populations of similar regions (F CT ¼ 0.0262).  [43][44][45]. At the two LCT loci associated with lactase persistence, the derived allele is observed only in heterozygotes, only in the eastern Scythian samples, and at low frequency (2-3%). The ancestral alleles at ADH1B rs3811801 and rs1229984 are nearly fixed in the Scythian dataset, as they are in modern Europeans (the derived alleles, which confer some resistance to alcoholism, are under selection in East Asians 46,47 ). We observe the derived allele at rs3827760 in the EDAR gene in a single Pazyryk individual (#6 in Fig. 2). This EDAR derived allele, which is related to tooth and hair morphology, is selected and at high frequency in modern East Asians (87%) 48 , and very rare in modern Europeans (B1%) 48 , although it has been observed in prehistoric hunter-gatherers from Sweden (7.9-7.5 kya) 34 . Thus, the results of the examination of phenotypic SNPs that show frequency differences between Europe and East Asia are consistent with gene flow across the steppe territory.

Discussion
Our results show that the Iron Age groups-long believed to be connected through shared cultural artefacts associated with the classical Scythians of the North Pontic region-also share a genetic connection. This is supported by our ABC analyses revealing population continuity over the 1 st millennium BCE in the eastern Scythians and low F ST values between eastern and western Scythian groups. However, ABC analyses that evaluated different models for the origins of Scythian populations provided the strongest support for a multiregional origin, with eastern and western groups arising independently within their own regions. Despite separate origins and the enormous geographic separation, demographic modelling infers ongoing and substantial gene flow between eastern and western groups, which provides a plausible demographic mechanism to explain the low F ST values and the general uniformity of the material culture of Scythians right across the Eurasian Steppe zone. Our genomic analyses reveal that western and eastern steppe inhabitants possess east Eurasian ancestry to varying degrees. In our ADMIXTURE analyses we find an East Asian ancestry component at K ¼ 15 in all Iron Age samples that has not been detected in preceding Bronze Age populations in either western or eastern parts of the Eurasian Steppe. Another ancestral component that is maximized in the north Siberian Nganasan population becomes visible from the 2 nd millennium BCE onwards in the eastern steppe (Okunevo, Karasuk, Mezhovskaya). This component appears later in all Iron Age populations but with significantly higher levels in the eastern steppe zone than in the West. These findings are consistent with the appearance of east Eurasian mitochondrial lineages in the western Scythians during the Iron Age, and imply gene-flow or migration over the Eurasian Steppe belt carrying East Asian/North Siberian ancestry from the East to the West as far as the Don-Volga region in southern Russia. In general, gene-flow between eastern and western Eurasia seems to have been more intense during the Iron Age than in modern times, which is congruent with the view of the Iron Age populations of the Eurasian Steppe being highly mobile semi-nomadic horse-riding groups.
In the East, we find a balanced mixture of mitochondrial lineages found today predominantly in west Eurasians, including a significant proportion of prehistoric hunter-gatherer lineages, and lineages that are at high frequency in modern Central and East Asians already in the earliest Iron Age individuals dating to the ninth to seventh century BCE and an even earlier mtDNA sample from Bronze Age Mongolia 49 . Typical west Eurasian mtDNA lineages are also present in the Tarim Basin 16 and Kazakhstan 8 and were even predominant in the Krasnoyarsk area during the 2 nd millennium BCE 31 . This pattern points to an admixture process between west and east Eurasian populations that began in earlier periods, certainly before the 1 st millennium BCE 13,50 , a finding consistent with a recent study suggesting the carriers of the Yamnaya culture are genetically indistinguishable from the Afanasievo culture peoples of the Altai-Sayan region. This further implies that carriers of the Yamnaya culture migrated not only into Europe 26 but also eastward, carrying west Eurasian genes-and potentially also Indo-European languages-to this region 17 . All of these observations provide evidence that the prevalent genetic pattern does not simply follow an isolation-by-distance model but involves significant gene flow over large distances.
All Iron Age individuals investigated in this study show genomic evidence for Caucasus hunter-gatherer and Eastern European hunter-gatherer ancestry. This is consistent with the idea that the blend of EHG and Caucasian elements in carriers of the Yamnaya culture was formed on the European steppe and exported into Central Asia and Siberia 26 . All of our analyses support the hypothesis that the genetic composition of the Scythians can best be described as a mixture of Yamnaya-related ancestry and East Asian/north Siberian elements.
Concerning the legacy of the Iron Age nomads, we find that modern human populations with a close genetic relationship to the Scythian groups are predominantly located in close geographic proximity to the sampled burial sites, suggesting a degree of population continuity through historical times. Contemporary descendants of western Scythian groups are found among various groups in the Caucasus and Central Asia, while similarities to eastern Scythian are found to be more widespread, but almost exclusively among Turkic language speaking (formerly) nomadic groups, particularly from the Kipchak branch of Turkic languages (Supplementary Note 1). The genealogical link between eastern Scythians and Turkic language speakers requires further investigation, particularly as the expansion of Turkic languages was thought to be much more recent-that is, sixth century CE onwards-and to have occurred through an elite expansion process. There are potentially many more demographic factors involved in the origins of Turkic language speakers, such as migration waves associated with Xiongnu, ancient Turkic or early Mongolian populations. The extent to which the eastern Scythians were involved in the early formation of Turkic speaking populations can be elucidated by future genomic studies on the historic periods following the Scythian times.

Methods
Sample material. For this study, human skeletal sample material from different parts of Russia and Kazakhstan was selected based on an association of the archaeological complex with Scythian burial rites and artefacts. We analysed 110 skeletal samples. Five samples yielded no DNA, eight gave only poor results, and for one only coding region SNPs could be obtained. In the end we could use 96 samples for the analyses of mtDNA. Eight individuals were converted into Illumina libraries; six thereof were used for a genomic capture, two for shotgun sequencing, and one for both capture and shotgun sequencing. Overall DNA preservation was remarkably good, with slight variations depending on region or sample site (Supplementary Tables 1 and 2).
Sample preparation. All pre-PCR sample preparation steps were carried out in a cleanroom facility physically separated from the post-PCR laboratories. Sample preparation, DNA extraction, the amplification of single mtDNA fragments and the amplification of nuclear loci were performed as previously described 45 . For mtDNA analyses 30 coding region fragments covering 32 haplogroup-specific SNPs of the mitochondrial genome were selected from the literature [51][52][53] . Primer systems (Supplementary Table 3) were designed using PrimerSelect, a part of the Lasergene software package (DNASTAR). Amplification was performed in three different multiplex reactions to avoid overlapping fragments (Supplementary Table 3). The Multiplex PCRs were carried out in a total volume of 40 ml, using 20 ml Qiagen Multiplex Kit Master Mix (Qiagen, Hilden, Germany), 11 ml ddH 2 O, 0.2 ml of each primer and 4 ml DNA extract. Multiplex PCR conditions were as follows: initial denaturation for 15 min at 95°C followed by 36 cycles of 40 s denaturation at 94°C, 40-90 s annealing at 56°C, 40-90 s elongation at 72°C and for some reactions a final elongation for 10 min at 72°C was added. Standard PCR was used for fragments that failed amplification in the multiplex reactions and to amplify the HVR1 of some additional samples.
Sequence analyses and authentication. To allow for parallel sequencing of 37 mtDNA fragments in up to 60 samples on a 454 GS FLX machine, barcodes were ligated to the multiplexed DNA fragments following the tagging protocol of Meyer 2008 (ref. 54), with the exception that the whole PCR product was used for the initial step. The three multiplex PCRs of one sample were pooled during the first purification step. For all purification steps the Qiagen MinElute PCR purification Kit was used. 454 library preparation and sequencing was carried out by GATC Biotec AG in Konstanz. Single PCR products were analysed using the ABI 3130 Sequence Analyzer after standard Sanger-cycle-sequencing using BigDye Terminator v1.1 Cycle Sequencing Kit (Applied Biosystems, Life Technologies, Darmstadt, Germany). For the nuclear markers redundant sequences were collapsed using CD-HIT-454 (ref. 55), single sequence clusters were removed and unique sequences were aligned with mafft 56 to a reference sequence for each of the markers and alleles were counted. For mtDNA sequences consensus sequences were created with SeqMan of the DNASTAR Lasergene package, and haplogroups were assigned with HaploGrep, a web-based application using phylotree build 15 (refs 57,58) (Supplementary Tables 4-6). To authenticate the results, every position of the HVR1 had to be covered by at least three unambiguous sequences from independent PCRs. The coding region fragments were sequenced at least twice with the 454 FLX. Independent reproduction has been carried out for 22 of the samples by A. Pilipenko at the Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Science. All results were concordant with results produced in the Palaeogenetics Lab, Mainz. During the sample preparation process blank controls were included in the pulverization, extraction and amplification steps. Overall contamination rate was 2.2%. The contaminations were monitored and compared with sample sequences of the same reaction steps. No correlation could be detected and since the sample sequences were reproduced in independent reactions all results are considered authentic.
Biostatistical analyses. For the population genetic analyses HVR1 sequences from position 16,040-16,400 of 96 samples analysed for this study and additional 51 samples taken from the literature (Supplementary Table 1) were used. The Arlequin 3.5.1.3 software 59 was used for AMOVA and to calculate gene diversity (haplotype diversity) 60 , nucleotide diversity 60,61 , F ST values and Reynolds' distance 62 . The pairwise distance method was used to calculate the F ST values with 1,000 permutations and a gamma value ¼ 0. Fu's F S test of selective neutrality was performed to test for population expansion. The significance level for the F S values was set at 0.02 (ref. 63). For results of summary statistics see Supplementary  Table 7.
ABC analyses. To explore the demographic history of Scythians we formulated multiple candidate scenarios, which provided the basis for simulating samples of genetic data for the HVR-1 region, using BayeSSC 64,65 . Calculations of summary statistics for the observed data were performed in DNaSP v5 (ref. 66). To confirm that candidate scenarios were able to reproduce the observed genetic data, we compared the prior distributions of simulated summary statistics with the empirically observed values. All analyses were performed using the abc package in R 2.15.1 (refs 67,68). We refer to the Supplementary Note 1 for full details on demographic scenarios and analyses employed here.
Genomic analyses. DNA-library preparation for subsequent shotgun sequencing was performed according to the protocol used in Kircher 2012 (ref. 69) with slight modifications for the shotgun samples (Supplementary Note 3). To analyse the genomic data f 3 -and f 4 -statistics were calculated (qp3Pop and qpDstat from ADMIXTOOLS) 37 , analysis of ancestry streams was applied (qpWave/qpAdm) 26 and an ADMIXTURE analysis was performed 38