Ancient DNA reveals genetic connections between early Di-Qiang and Han Chinese

Ancient Di-Qiang people once resided in the Ganqing region of China, adjacent to the Central Plain area from where Han Chinese originated. While gene flow between the Di-Qiang and Han Chinese has been proposed, there is no evidence to support this view. Here we analyzed the human remains from an early Di-Qiang site (Mogou site dated ~4000 years old) and compared them to other ancient DNA across China, including an early Han-related site (Hengbei site dated ~3000 years old) to establish the underlying genetic relationship between the Di-Qiang and ancestors of Han Chinese. We found Mogou mtDNA haplogroups were highly diverse, comprising 14 haplogroups: A, B, C, D (D*, D4, D5), F, G, M7, M8, M10, M13, M25, N*, N9a, and Z. In contrast, Mogou males were all Y-DNA haplogroup O3a2/P201; specifically one male was further assigned to O3a2c1a/M117 using targeted unique regions on the non-recombining region of the Y-chromosome. We compared Mogou to 7 other ancient and 38 modern Chinese groups, in a total of 1793 individuals, and found that Mogou shared close genetic distances with Taojiazhai (a more recent Di-Qiang population), Hengbei, and Northern Han. We modeled their interactions using Approximate Bayesian Computation, and support was given to a potential admixture of ~13-18% between the Mogou and Northern Han around 3300–3800 years ago. Mogou harbors the earliest genetically identifiable Di-Qiang, ancestral to the Taojiazhai, and up to ~33% paternal and ~70% of its maternal haplogroups could be found in present-day Northern Han Chinese.


Background
The Huaxia is the earliest Chinese dynasty to emergẽ 2000 BC along the Yellow River. This population grew from the Central Plain area and later became established as the Han Chinese during the Han Dynasty (206 BC to 220 AD). Throughout history, the Han Chinese continued to have complex interactions with surrounding ethnic minority groups in their vicinity [1,2], whose details are being studied and debated by historians, archaeologists, anthropologists and geneticists.
One important pastoral agriculturist group that interacted with the Han Chinese from the west near the upper reaches of the Yellow River in the Gansu-Qinghai (or Ganqing) region is a historical group called the Di-Qiang. Around the middle Neolithic, as people (including ancestors of the Han) expanded away from the Central Plain due to improved agricultural practices [3], they encountered the Di-Qiang people, and both groups have occupied the Ganqing [4,5]. A recent ancient DNA study goes further to suggest that a once Ganqing population, the Taojiazhai people, is related to the Di-Qiang, and even contributed genetically to the Han Chinese [6]. However, an issue with the Taojiazhai was that the archeological site dated to~1700-1900 yr. BP, which occurred well within the time period of the Han dynasty, raising the possibility that some Taojiazhai individuals might have been admixed in Han Chinese.
In this study, we overcome this problem by investigation of the Mogou cemetery (Fig. 1), a considerably older Di-Qiang site in the Ganqing region that is enclosed by the Qinghai-Tibetan Plateau to the west and the Tengger Desert to the north [7]. The accelerator mass spectrometry (AMS) radiocarbon dating of the Tomb M633 human bone samples (slightly more recent than specimens collected for this study) yielded 3145 ± 45 14 C yr. BP and 3526-3336 cal. yr. BP after correction with Damon's table [8]. Cultural artifacts, such as funerary pottery constructed of red clay with features found prominently in the Qijia culture, place this site in the late Neolithic to early Bronze Age (~3600 to 4200 yr. BP) [9] and associated with the Di-Qiang [10]. So the Mogou represents an early Di-Qiang predating the Han dynasty.
To clarify the genetic relationship between the ancient Di-Qiang and early Han Chinese, we analyzed the new Di-Qiang from Mogou, using hyper variable sequence I (HVS-I) and coding region of mtDNA and nonrecombining region of Y-chromosome (NRY), and compared them to other ancient DNA and contemporary groups across China.

Population samples and Mogou specimens
A total of 1793 individuals were collected, belonging to 8 ancient (235 individuals) and 38 modern (1558 individuals) groups (Table 1). The Mogou site (34°69′N 103°86′°E) is located in the Tibetan Autonomous Prefecture of Gannan, Gansu Province, being at the geographical center of China and upstream of the Yellow river [7]. The high altitude (2209 to 3926 m) and a continental climate with an annual average temperature of 3.28°C are favorable to ancient DNA preservation [11]. The Institute of Cultural and Historical Relics and Archaeology in Gansu Province excavated the cemetery, consisting of 26 graves, with permission from the State Administration of Cultural Heritage who has control over the archaeological excavations in China. Nearly every grave contained multiple individuals due to the complex structure of the tomb. In total, 60 ancient human remains were exhumed from the Mogou site for genetic analysis (see Additional file 1).

DNA extraction and laboratory environment
Prior to DNA extraction, the samples were cleaned using a series of treatments to remove the exogenous contaminants on the surface of the samples [12]. All of the operations were performed in separate rooms of an ancient DNA laboratory to strictly avoid any external contamination. Procedures were carried out independently of molecular biology experiments using present day DNA.
Before powdering, the bones and teeth surfaces were wiped down using cotton soaked in sodium hypochlorite solution. Bones and teeth were then soaked in a 5% sodium hypochlorite solution for 15 min, and carefully rinsed with 95% alcohol, and then UV-irradiated overnight. For dried bone material, a drilling machine was used to remove the top layer to avoid any remaining surface contaminants, and then powder was obtained for DNA extraction by drilling holes into the remainder of the bone. For the teeth, the dental calculus on the surfaces of the teeth was removed before drilling dental cervix to obtain cavitas pulpae powder. A turbid solution was then created, containing a mixture of about 200 mg of bone or tooth powder and 4.5 mL 0.5 mM EDTA (pH 8.0). This solution was stored at 4°C for 24 h, upon which 80 μL of 100 mg/ mL of proteinase K was added. The resulting solution was placed in the hybridization oven overnight at 56°C. The precipitate was removed by centrifugation (3 min at 8000 rpm), and the clear supernatant extract was concentrated to 100μL using an ultrafiltration tube (Centurion® YM-10) at 8000 rpm centrifugation. DNA was extracted using the concentrated solution in accordance with the QIAamp® Purification Kit manual. Furthermore, DNA extraction was performed at least twice for each sample, and every five ancient samples had one blank control.

Measures taken to ensure authenticity
To ensure the results are valid and reliable, we have kept in strict compliance with the rules indicated for extracting ancient DNA [13]. All laboratory personnel involved in the operation were female. Moreover, to obtain satisfactory results in ancient DNA research, the two guidelines were followed: 1) Pre-PCR and post-PCR protocols were carried out in two completely separate buildings. Experimenters were only allowed to move from the pre-PCR lab building to the post-PCR lab building each day, avoiding contamination from PCR products into the samples. The reverse was not allowed. The experimental areas including both the PCR room and the DNA extraction room have been equipped with Air Shower, which removes the dust, hair and other debris attached to clothes and reduces introduction of contaminants from laboratory personnel. 2) During the study period, we relocated our laboratory to a new campus, creating an opportunity to observe whether our results could be replicated in the new laboratory. Furthermore, different parts of the samples were randomly selected for replicate extraction and PCR amplification, in order to ensure the results are reproducible.

Mitochondrial DNA amplification and haplogroup assignment
Due to high degradation of DNA from the ancient samples, it is difficult to amplify long DNA fragments. We thus designed two sets of primers (see Additional file 2)  [45][46][47] to amplify and sequence the mtDNA HVS-I region between positions 16,051 and 16,384. We also used both the Sanger sequencing method and the amplified product-length polymorphisms (APLP) method [14,15], through the design of two or three sets of specific and corresponding primers (see Additional file 2). The PCR amplification was carried out in a 12.5 μL reaction mixture containing 2 μL of template DNA, 1.5× reaction buffer (Takara, Japan), 2.5 mM MgCl 2 (Promega, Germany), 0.25 mM dNTPs (Takara, Japan), 0.1 μM of each primer, 1 U of ExTaq®Hot Start Version DNA polymerase (Takara, Japan), 1 μL 20 mg/mL BSA, and RNase-Free Water (Takara, Japan). Cycling parameters were described as follows: initial denaturation at 94°C for 5 min, followed by 34 cycles at 94°C for 30s, 30s at 55°C, elongation for 30s at 72°C, with a final extension for 10 min at 72°C and storage at 4°C. Then, the PCR amplification products were examined by agarose gel electrophoresis. After the purification with the QIA quick Gel Extraction Kit (Qiagen, Germany), the amplification products were sequenced using the Big-Dye® Terminator V3.1 Cycle Sequencing kit (Applied Biosystems, USA). These sequences were analyzed, and an output file was generated from the ABI PRISM™310 automatic sequencer. In the end, the mtDNA haplogroups were called based on SNPs from the hypervariable and the coding regions, and the East Asian mtDNA classification tree [16,17].

Non-recombining region of the Y chromosome (NRY) capture
We performed NRY capture of two Mogou males (MG18 and MG48). The DNA library was prepared with NEBNext® Ultra™ DNA Library Prep Kit for Illumina® in accordance to manufacturer's instructions, which is similar to the Illumina TruSeq V2 protocol [23]. This library preparation will perform end repair with 5′ phosphorylation and dA-tailing, and make the damage-derived C-to-T in the 5′-endoverhang fragments to have the reverse complement nucleotides G-to-A substitutions at the 3′-end. After the ligation with NEBNext Adaptors (that includes hairpin loop with uracil), the uracils in the adaptor and DNA insert are then removed by USER (uracil-DNA-glycosylase (UDG) and endonuclease VIII), which would cause a small residual signal of C-to-T substitutions to be detected at the 5′ (~1.8%) and no influence to the G-to-A substitutions at 3′ terminal positions (for MG18~11% and for MG48~16%) [23]. However, within a CpG context, because the majority of cytosines are methylated invertebrate genomes, which when deaminated, leaves thymine instead of uracil, the deaminated cytosines in the majority of cases are not removed in this CpG part even with the USER treatment (CpG part: C-to-T substitutions for MG18~11% and for MG48~15%, see Additional file 3).
Next, the 7.18 Mb targeted unique regions on the NRY chromosome was used to design the array. We used a similar experimental method of the one described by Fu et al. [24], to do the in-solution hybridization enrichment for the libraries. We then focused on the reads passing Illumina quality control that had the expected index combinations for these libraries. We sequenced the libraries on the HiSeq X-Ten platform. We restricted analysis to pairs of reads that had at least 11 base pairs of overlap, merged the reads, and then mapped the merged sequences to the human genome reference hg19 using BWA. We removed duplicated molecules prior to analysis to reduce the influence of mapping errors. We restricted our analysis to unique regions in the genome, using Tandem Repeat Finder (for hg19) and mapability tracks (map 35-50%). Details of sequencing coverage on NRY are shown in Table 2. The fragments size distribution of two Mogou male specimens show short length fragments, which are typical for ancient DNA (see Additional file 4).  TempNet [27], which shows networks stacked in three dimensions (3-D), was used to explore the continuity of haplotypes across time. The phylogeny of Y-DNA haplogroup O was inferred using Figtree in the BEAST program [28] and tip dating of ancient DNA [29]. Demographic histories were simulated using Fastsimcoal [30] and parameter distributions inferred by Approximate Bayesian Computation (ABC) [31].

Results
A total of 55 of 60 samples from the Mogou site (Additional file 1) were successfully replicated, and verified to be different from the mtDNA of laboratory personnel (see Additional file 5). All sequences were submitted to GenBank under the accession numbers KX085423-KX085477.

Mitochondrial DNA analysis
The organic preservation was relatively high for Mogou, most likely related to its high elevation and low temperature. For example, from the captured MG48 library, we obtained 97-fold coverage for the mtDNA genome. The contamination of MG48 was 0.048% (95% C.I. 0.545%-0.008%) based on the match rate to the mtDNA consensus by running the contamination estimator Con-tamMix [24]. We genotyped 55 samples for mtDNA HVS-I and nt1040 0 T/C (for the M/N type), and further SNP loci detection was carried out on the coding region to ensure that haplogroup was correctly called based on the results of the HVS-I motif. We found a total of 46 haplotypes (Table 3) with certain haplotypes shared by two or more individuals buried in the same grave, suggesting a matrilineal kinship among some individuals. The haplotypes were analyzed using the correction criterion developed by Alzualde et al. [32] to control for the reduction in the genetic diversity due to kinship [33]. Table 3 shows these haplotypes could be assigned to 14 mtDNA haplogroups: Most of these haplogroups occur among East Asians [16] with M25 found in South Asians [34]. AMOVA was used to test how different classifications would affect the variance among groups (Additional file 7). We found geography explained the most variance among groups. Compared to the variance given when all groups were independent (variance among groups = 2.01%), the highest variance (variance among groups = 1.64%) was observed when two geographic groups were classified, i.e. Northern China (Mogou, Hengbei, Taojiazhai, Northern Han and Northern minorities, Tibeto-Burman) and Southern China (Southern Han and Southern minorities).

Y-chromosome analysis
Of the 55 samples, 15 males and 17 females were identified using molecular biology techniques. Remaining 23 samples were indeterminate for sex after testing the AMG-PCR product. Only six amplified Y-SNP products could be successfully replicated. Table 4 shows all six Mogou males belonged to Y-DNA haplogroup O3a2/ P201. The two male specimens (MG18 and MG48) selected for capture 7.18 Mb of the NRY chromosome, after retaining positions with coverage at least 3-fold, their Y-DNA haplogroups were identified to be O3a2c and O3a2c1a/M117, respectively. The MG48 was further analyzed since a higher 8.51-fold coverage was better to build the consensus.
We aligned MG48 with the published 71 HGDP East Asian individuals with O haplogroup [35] to verify that it could be properly placed within the haplogroup O lineage. The consensus length, between MG48 (retaining positions with coverage of at least 3-fold) and HGDP Ychromosome dataset, was 381,473 bp. Figure 2 shows that all 72 sequences could be confidently assigned to Y-DNA haplogroup O1, O2, O3 using 31 ISOGG defining SNPs, and that the posteriors leading up to the O3a2c clade that the MG48 falls under were 1.0, thus ensuring that its position was highly resolved. The inferred Y-DNA substitution rate of 7.76 × 10 −10 (95% CI 3.89 × 10 −10 to 1.13 × 10 −9 ) per site per year remained consistent with other ancient DNA studies [36].

Temporal network analysis
To identify whether Di-Qiang did have an influence on Han Chinese, we investigated the temporal network of Mogou, Taojiazhai, and Northern Han. Figure 6a shows haplotypes between Mogou and Taojiazhai were contiguous, and some sharing with Northern Han. Figure 6b shows that Mogou and Hengbei shared relatively more haplotypes with Northern Han compared to Taojiazhai.

Approximate Bayesian computation (ABC) simulations
To understand the genetic relationship between Mogou, Hengbei, and Northern Han, we proposed four models (Model 1-4; Fig. 7) that described the possible demographic history that occurred among them. After performing 1 million simulations for each model, the probability of model occurrence was assessed by two methods: acceptance-rejection (AR) [50] and weightedmultinomial logistic regression (LR) [31]. The quality of simulations was evaluated by R 2 , coverage, etc. compared to 1000 pseudo-observed, as described elsewhere [51]. The best supported model was Model 1 (49-79%) followed by Model 3 (16-31%; Fig. 7). We found the reason for the similarity between Model 1 and 3 was because Model 1 described Mogou contributed relatively few genes averaging at 15% (95% CI: 13-18%) into Northern Han around 3500 years ago (95% CI: 3301-3809; details in Table 5), which could approximate Model 3 that explained Northern Han is closer to Hengbei. The overall simulation quality was good, with the type I error (misclassified true models based on 1000 resamplings) of the four models being low~18%. Every parameter were estimated with high coverage and R 2 > 10% indicating that they were reliably estimated [51], and there were noticeable improvements (on averagẽ 12-fold; Additional file 10) to the posteriors of summary statistics used.

Discussion
In this study, we found that Mogou, being situated at the geographic center of China, also lay at intersection of Northern and Southern Chinese and Tibetans in terms of haplogroup frequencies, suggesting it plays an important role in the formation of early cultures along the Yellow River. We argue that it possibly has a northern origin, since more than 90% of its maternal haplogroups (A, B, C, D, F, G, M7, M8, M10, N9a, and Z) matches with those typically found among ancient groups across Northern China. In particular, the most frequent haplogroup D in Mogou (34.78%) is consistent with other ancient northern groups (Qilangshan 43.75%;        ht38, and ht45). Few haplotypes were carried across to the Northern Han on the temporal network. However, the Taojiazhai appeared to differ from the Mogou in its increased association with the modern Southern Chinese in terms of haplogroup frequencies and Fst genetic distance.
The closest modern relative to Mogou was the Northern Han (genetic distance Fst = 0.002) and the Northern minorities (Fst = 0.005), and then more distantly by the Southern Han (Fst = 0.03) and Southern minorities (Fst = 0.03-0.05). Generally, the Y-DNA haplogroup O3a2/P201 from Mogou males is a common subtype of the O3a/ M324 branch, which occurs at a high frequency in extant Han Chinese (43.37%) [43]. One Mogou male (MG48) was further identified as O3a2c1a/M117, which is a subclade of O3a2c1-M134 that is commonly found in Sino-Tibetan speakers and neighboring countries (e.g. Nepal, Bhutan, and Korea), but varies greatly in frequency among Miaoyao speakers. Furthermore, MG48 clustered with Han and Tibeto-Burman (e.g. Naxi, Yi, and Tujia) as opposed to southern groups (e.g. Dai, Miao) on the HGDP Y-DNA haplogroup O lineage (Fig. 2).
Finally, the present-day Tibeto-Burman speakers were also close to Mogou (genetic distance Fst = 0.01) than the Southern Han or Southern minorities. This was in agreement with historical records about the migration of ancient Di-Qiang people in the past. Some spread eastward, scattering in the middle reaches of the Yellow River, while others migrated southwest to form the Tubo, who are the ancestors of modern Tibetans, as well as contributing to the Southwestern minorities through integrating with the local population [5,10].

Conclusion
We identified Mogou to be the earliest~4000 yr. BP Di-Qiang population, and genetically related to Fig. 7 Probability of model 1 to 4 occurrence. Each model is followed by a brief description about their demographic history. Mogou and Hengbei are serial sampled at~4000 and~3000 yr. BP, respectively, and dashes indicate the uncertainty in whether they have direct modern descendants. In contrast Northern Han has solid line to the present-day. The probability (0-100%) of each model occurrence is assigned using AR (acceptance-rejection) and LR (logistic regression; details see main text)