Mitochondrial DNA transit between West Asia and North Africa inferred from U6 phylogeography

Background: World-wide phylogeographic distribution of human complete mitochondrial DNA sequences suggested a West Asian origin for the autochthonous North African lineage U6. We report here a more detailed analysis of this lineage, unraveling successive expansions that affected not only Africa but also neighboring regions such as the Near East, the Iberian Peninsula and the Canary Islands.
Results: Divergence times, geographic origin and expansions of the U6 mitochondrial DNA clade have been deduced from the analysis of 14 complete U6 sequences and 56 different haplotypes, characterized by hypervariable segment sequences and RFLPs.
Conclusions: The most probable origin of the proto-U6 lineage was the Near East. Around 30,000 years ago it spread to North Africa, where it represents a signature of regional continuity. Subgroup U6a reflects the first African expansion from the Maghrib returning to the east in Paleolithic times. Derivative clade U6a1 signals a later movement from East Africa back to the Maghrib and the Near East. This migration coincides with the probable Afroasiatic linguistic expansion. U6b and U6c clades, restricted to West Africa, had more localized expansions. U6b probably reached the Iberian Peninsula during the Capsian diffusion in North Africa. Two autochthonous derivatives of these clades (U6b1 and U6c1) indicate the arrival of North African settlers in the Canarian Archipelago in prehistoric times, most probably due to the Saharan desiccation. The present-day absence of these Canarian lineages in Africa suggests important demographic movements in the western area of this continent.


Background
The attested presence of Caucasian people in Northern Africa dates back to Paleolithic times. From the archaeological record it has been proposed that, as early as 45,000 years ago (ya), anatomically modern humans most probably spread the Aterian stone industry from the Maghrib into most of the Sahara [1]. More evolved skeletal remains indicate that 20,000 years later the makers of the Iberomaurusian industry replaced the Aterian culture in the coastal Maghrib. Several hypotheses have been put forward concerning the Iberomaurusian origin. They can be summarized as those that propose an arrival from the East, either from the Near East or from Eastern Africa, and those that point to west Mediterranean Europe, either the Iberian Peninsula, across the Strait of Gibraltar, or Italy, via Sicily, as the most probable homeland [2]. Between 10,000 and 6,000 ya the Neolithic Capsian industry flourished farther inland. The historic penetration of classical Mediterranean cultures into the area, ending with the Islamic domination, brought a strong cultural influx. However, it seems that the demic impact was not strong enough to modify the prehistoric genetic pool.
Linguistic research suggests that the Afroasiatic phylum of languages could have originated and spread with these Caucasians, either from the Near East or from Eastern Africa, and that later developments of the Capsian Neolithic in the Maghrib might be related to the origin and dispersal of proto-Berber speaking people into the area [3]. Nowadays the Berber speakers, scattered throughout Northwest Africa from the Atlantic to the Libyan desert and from the Mediterranean shores to the south of the Sahel, are considered the genuine descendants of those prehistoric colonizers. Some important issues must still be resolved to clarify the past and present of the North African Caucasians: To what extent did the Neolithic waves replace the Paleolithic populations? What is the most probable origin of these prehistoric occupants? Did they come from Europe, East Africa or Southwest Asia, or are they the result of an "in situ" evolution? Is there a correspondence between the Afroasiatic diversification and the spread of Caucasians?
Recently, molecular genetic research on North African populations has contributed new data to test the major issues raised on archaeological, anthropological and linguistic grounds. The studies based on uniparental genetic markers have been particularly informative. Both mitochondrial DNA (mtDNA) sequences [4,5] and Y-chromosome binary markers [6,7] detected specific North African haplotypes that confirm an ancient human colonization of this area and a sharp discontinuity between Northwest Africa and the Iberian Peninsula. From a mtDNA point of view, the most informative of these genetic markers is the North African clade U6. On the basis of complete mtDNA sequences, it has been proposed that U6 lineages, mainly found in North Africa, are the signatures of a return to Africa around 39,000-52,000 ya [8]. This stresses the importance of studying the clade in detail in order to trace one of the earliest Caucasian arrivals in Africa. Although it occurs at moderate frequencies, the geographic range of this clade extends from the Near East to the Canary Islands, along the Atlantic shores of Northwest Africa, and from the Sahel belt, including Ethiopia, to the southern Mediterranean rim. Outside this area, U6 has only been spotted in the Iberian Peninsula [9-12], Sicily [13], the north European Ashkenazic Jews [14], and Ibero-America. Its presence in the latter is, most probably, the result of the Spanish and Portuguese colonization [15,16].
In order to construct an unambiguous phylogeny for this clade and infer precise ages for the whole group and for its derivatives, we have fully sequenced eleven mitochondrial lineages representing the main branches of U6. Subsequently, we analyzed the geographic distribution range and relative diversity of these subclades to deduce their most probable expansion origins, based on sequence information from the first hypervariable segment (HVSI) of the mitochondrial control region and on newly discovered RFLPs that are diagnostic for them.

A new sublineage for U6
Haplogroup U splits from R by mutations 11467, 12308 and 12372. Three branches sprout from this root: U5 (3197, 9477, 13617 and 16270), U6 (3348 and 16172) and the rest of the U clade defined by mutation 1811 [8,17,18]. For this reason, a representative of U5 was chosen as an outgroup.
The phylogenetic tree based on complete mtDNA U6 sequences confirms that this clade is defined by mutations 3348 and 16172 (Fig. 1). The former can be detected by RFLP analysis using MboI [15]. The existence of three subgroups is also evident. U6a was defined by the presence of HVSI mutations 16172, 16219 and 16278 [4] and is now also defined by 7805 and 14179 in the coding region, which can be tested by RFLPs -7802 MaeI and +14179 AccI, respectively. Subgroup U6b was characterized by HVSI mutations 16172, 16219 and 16311 [4], to which mutation 9438 (detectable by RFLP -9438 HaeIII) can now be added. The new clade U6c is defined by HVSI mutations 16169, 16172 and 16189 and at least by coding-region mutations 4965 and 5081, which can be tested by RFLPs +4963 AciI and -5079 Tsp509I, respectively. In addition, a subgroup, U6a1, has been detected within U6a, characterized by the addition of HVSI mutation 16189 [4]. In the same way, HVSI mutation 16163 classifies subgroup U6b1, autochthonous to the Canary Islands [19]. Within the coding region, this subgroup can be further defined by RFLP +2349 MboI.
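The HVSI motifs listed in the paragraph above can be read as a small lookup table. The following is a minimal sketch, not the authors' procedure: the function name `classify_hvsi`, the table `HVSI_MOTIFS` and the label "U6 (unassigned)" are hypothetical, and the rule "pick the largest motif contained in the observed mutations" is an assumption made for illustration.

```python
# Hypothetical lookup table: HVSI motifs transcribed from the text above.
HVSI_MOTIFS = {
    "U6a": {16172, 16219, 16278},
    "U6a1": {16172, 16189, 16219, 16278},
    "U6b": {16172, 16219, 16311},
    "U6b1": {16163, 16172, 16219, 16311},
    "U6c": {16169, 16172, 16189},
}
BASAL_U6 = {16172}  # HVSI position shared by all U6 motifs listed above


def classify_hvsi(mutations):
    """Return the most specific U6 subclade whose motif is fully present.

    Assumption: the motif with the most positions wins; ties are broken
    arbitrarily (by name). Returns None if no U6 signal is found.
    """
    observed = set(mutations)
    matches = [
        (len(motif), name)
        for name, motif in HVSI_MOTIFS.items()
        if motif <= observed  # every motif position must be observed
    ]
    if matches:
        return max(matches)[1]
    if BASAL_U6 <= observed:
        return "U6 (unassigned)"
    return None


if __name__ == "__main__":
    print(classify_hvsi([16172, 16189, 16219, 16278]))
    print(classify_hvsi([16169, 16172, 16189]))
```

A motif-only assignment of this kind is fragile when HVSI positions mutate recurrently or revert, so the coding-region RFLPs listed above remain the firmer test.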
Fig. 1 raises an important question about the constancy of the mutation rate in the coding region. The mean number of substitutions accumulated in U6b lineages (Table 1) is significantly smaller than that in U6a (P = 0.013), and the difference is near significance for U6c (P = 0.058). These differences are mainly due to the number of mutations accumulated in the coding region. Following others [20], we used the likelihood-ratio test [21] to assess whether the mutations accumulated on the different branches were compatible with a uniform rate. The difference between the log-likelihood values obtained for the uniform clock model (L0 = -23060.25) and for the variable rate model (L1 = -23032.22) was statistically significant at the 5% level, so the simpler clock-like tree was rejected. On the other hand, the ratio of coding-region to HVSI substitutions in U6a is double that in U6b or U6c (Table 1). Furthermore, the ratio of synonymous to non-synonymous substitutions in the coding region is again twice as high in U6a as in U6b or U6c, a significant difference (P = 0.0237, two-tailed Fisher exact test). Both selection and stochastic processes have to be invoked to explain these data satisfactorily. A bias in lineage sampling is the most probable cause of the different substitution ratios between the D-loop and coding regions: the U6b and U6c lineages were chosen for their different geographic origins and comparatively large divergence in HVSI, whilst for U6a we chose central representatives of the different subclusters, except that of the Canary Islands. The differences in synonymous vs. non-synonymous ratios could be attributed to the action of purifying selection, which has had a stronger effect on the older U6a lineages. From this, we deduced that both U6b and U6c spread more recently. Finally, the apparent differences in substitution rates between U6b and U6a or U6c are better explained by genetic drift, such that the founder lineage that originated the U6b subgroup was less evolved than those that originated U6a and U6c. However, we have to point out that in a similar case, in which significant differences were found in the number of mutations accumulated on two clades of haplogroup L2, selection was suggested as the most probable cause [20].
Figure 1. Phylogenetic tree based on complete U6 mtDNA genome sequences. A U5b individual has been added in order to root the tree. Nomenclature of individuals as in Table 2.
Fig. 2 shows the reduced median network obtained from the 56 U6 haplotypes found for the HVSI region between positions 16086-16370. The basal motif for haplogroup U6 has varied as new data have been added. Algerian sequences [9] suggested that the ancestral sequence harbored mutations 16172 16189. Additional data [4] pointed to 16172 16219 as the most probable ancestral motif. However, the complete sequence of the individual with this motif relocates it in U6a, with a back mutation at HVSI position 16278. Our data point to 16172 as the only substitution present in the basal motif. Unfortunately, the high recurrence of this mutation makes it insufficient to diagnose this haplogroup. The highest frequencies for haplogroup U6 as a whole are found in Northwest Africa (Table 2), with a maximum of 29% in the Algerian Berbers [9]. Subgroup U6a and its derivative U6a1 have the widest geographic distribution, from the Canary Islands in the West to Syria and Ethiopia in the East, and from the Iberian Peninsula in the North to Kenya in the South. In contrast, U6b shows a more limited and patchy distribution, restricted to western populations. In the Iberian Peninsula, U6b is more frequent in the North whilst U6a is prevalent in the South. In Africa, U6b has been sporadically found in Morocco and Algeria in the North, and in Senegal and Nigeria in the South, pointing to a wider distribution in the past or to gene flow from a geographic focus that has not yet been sampled. Curiously, two Arab Bedouins [22] [9], classified as U* by RFLP analysis [5], belong to this subgroup. As for U6b, an autochthonous U6c subcluster (characterized by mutation 16129) was also detected in the Canarian Archipelago.

Relationships between areas
Linearized FST values distinguished three significantly differentiated geographical areas: Continental Africa, the Iberian Peninsula and the Canary Islands (Table 3). Nucleotide diversities within areas ( , and the former presents a higher nucleotide diversity (1.55 ± 1.11) than the latter (0.98 ± 0.75). Geographic distributions and diversity values of U6 are congruent with a western origin and radiation for all subclades except U6a1, which most probably had an eastern origin.

Radiation ages
Radiation ages for U6 and its subclades have been estimated on the basis of complete coding and HVSI sequences (Table 4). In general, ages obtained from HVSI are older than those deduced from the coding region. Both approaches have drawbacks for time estimation. It has been demonstrated that the coding region has evolved at a roughly constant rate [24]. However, as relatively few clades are fully sequenced, stochastic and/or intentional sampling may bias the representation of the chosen lineages. HVSI estimates, on the other hand, are based on a large number of individuals, which minimizes sampling errors, but they rely on a short sequence that has not evolved at a constant rate across all human lineages [24]. Furthermore, the phylogeny of complete U6 sequences (Fig. 1) shows, once more, that empirical time estimates are not independent of the demographic history of the population sampled.

African U6 origin and expansions
Discarding the Canary Islands, because the most ancient human settlement seems to be no earlier than 2,500 ya [25], and the Iberian Peninsula, because there are no consistent traces of U6 lineages in Europe, Northwest Africa is  (Table 4). Genetic diversities are congruent with a west to east expansion for U6a and a more probable east to west expansion for U6a1. Furthermore, the absence of U6b and U6c lineages in the East suggests that the population from which the U6a colonizers originated also lacked these lineages or carried them at very low frequencies. The fact that 5 of the 8 U6a haplotypes detected in the Near East are unique to this area (Fig. 2) points to prehistoric demic movements as the most probable cause of the U6a migration from Africa to Asia, although historic events cannot be completely ruled out. Archaeological data supporting early migrations from Africa into the Near East [26] are in line with the estimated age of U6a. The expansion of Caucasians in Africa has been correlated with the spread and diversification of Afroasiatic languages. There are different hypotheses to explain the Afroasiatic origin. For some, it would be the result of a Neolithic demic diffusion from the Near East to Africa [27,28]. For others, Afroasiatic originated in Africa and later spread demically to West Asia [29,30]. A third possibility is that Afroasiatic languages spread mostly through cultural contacts, either from Africa or from Asia [31]. Only demic diffusions could be correlated with the U6 expansions detected here.
Since an upper bound of 15,000 ya has been estimated for the proto-Afroasiatic origin, the coalescence age for U6a appears to predate the origin of the Afroasiatic phylum by far. However, the recent spread of U6a1 is more in line with the emergence of a proto-Afroasiatic language. This U6a1 expansion would favor an East African origin for Afroasiatic and a later expansion to West Africa and West Asia. However, a Near Eastern origin, most probably predating the Neolithic expansion, cannot be ruled out.

Iberian U6 origin and expansions
In Europe, U6 lineages have been consistently sampled only in the Iberian Peninsula. It has been mentioned that U6 nucleotide diversity is higher in Iberia than in Africa [12]. This has been confirmed here (Table 3). However, the number of segregating sites (S) is greater in West Africa. Considering the isolation of the different Berber groups, we think that, in this case, the latter is a better diversity measure. The absence of U6 representatives in the rest of Europe is also an argument against the hypothesis that these lineages could have migrated to North Africa from Europe. Naturally, this does not exclude the possibility that other mitochondrial lineages followed this route. Most probably, the presence of these African lineages in Iberia is the result of northward expansions from Africa. The time of this expansion has been predominantly attributed either to the Arab/Berber occupation that lasted seven centuries [10] or to prehistoric immigrations of North Africans to Iberia [12]. Both processes could have contributed to shaping the U6 landscape in Iberia. First, haplotype matches show that 10 of the 19 U6 lineages detected in Iberia are not present in Africa (Fig. 2), which argues against a single recent immigration. Second, the geographic distribution of the U6 lineages in Iberia is puzzling. Whereas the U6b lineages, nowadays very scarce in Africa, are mainly detected in the Northwest, the U6 lineages found at the highest frequencies in Africa are predominant in the south, where Islamic rule lasted longer. In light of these results we propose that U6b in Iberia is the signal of a prehistoric North African immigration that could also have brought some U6a lineages. Its current northern range could be the result of a forced retreat due to the arrival of new southern incomers to Iberia. However, the U6a distribution is better explained as the result of more recent gene flow from North Africa. The age of U6b (approx. 10,000 ya) might be considered an upper bound for the prehistoric wave. Curiously, around this time the Iberomaurusians began to be displaced by the incoming Capsian culture in the Maghrib. On archaeological grounds, it has been proposed that the Iberomaurusians slowly retreated towards the Atlantic coast, from where they sailed to the Canary Islands and southwards to the Malian Sahara [2]. Coincidentally, these are the same places where the U6b lineages have been spotted (Fig. 2).

Canary Islands U6 origin and expansions
At a genetic level, the Berber origin of the Guanches, the aboriginal population of the Canary Islands, and their survival after the Spanish occupation have been inferred from the high frequency of U6 lineages in the modern population (Table 2), similar to that of North Africa [19,32]. This has recently been confirmed in a mtDNA sequence study on aboriginal remains [33]. It was found that in the Guanche maternal gene pool, U6b1 and U6a were present at frequencies of 8.22% and 1.37%, respectively. U6c was probably also present in the aboriginal pool, as a haplotype (16129 16169 16172 16189) now known to belong to subhaplogroup U6c was proposed as a probable Canarian founder type [19]. As in Northwest Iberia, U6b was the dominant U6 subclade carried by the North African settlers of the islands. All three subclades are present in the modern Canarian population at frequencies of 1.3%, 13.0% and 3.3% for U6a, U6b and U6c, respectively, which is indicative of a broad aboriginal component in the present maternal pool. Perhaps the comparatively higher frequency of U6a lineages might be attributed to an additional Berber input as a result of the slave trade after the Spanish conquest [34,35]. What remains enigmatic about the undoubted prehistoric North African colonization of the Archipelago is that it was carried out by people whose U6 lineages mainly belonged to the U6b subclade, which has only been spotted at very low frequencies in the modern African populations of Morocco, Algeria, Senegal and Nigeria (Table 2). Moreover, the U6b and U6c insular haplotypes belong to the autochthonous U6b1 and U6c1 branches, differing by substitutions 16163 and 16129, respectively, from all their African counterparts. As the most probable arrival of the first prehistoric Canarian settlers was around 2,500 ya, it is highly improbable that these mutations occurred on the islands. Therefore, we expected to find these Canarian lineages somewhere in Africa. However, after extensive sampling they have still not been detected. It is possible that they are present somewhere at low frequencies but, in any case, this phylogeographic distribution suggests that Northwest Africa underwent important demic displacements in the past.
Besides U6, other genetic markers, such as the 110(-) haplotype of the CD4/Alu system [36] and the M81 Y-chromosome binary marker [6,7], point to an ancient and autochthonous human presence in Northwest Africa. An eastward decline in M81 frequencies has been detected; regrettably, the lack of extensive intra-M81 microsatellite diversity studies in Africa precludes phylogeographic comparisons such as those done with mtDNA. There are other coincidences between mtDNA data and other systems. For instance, using classical genetic markers, it was found that the Iberian Peninsula showed smaller genetic distances to East Africa than to West Africa [37]. The same pattern was observed in Y-chromosome studies [7], both in line with our results (Table 3). More studies with other genetic markers are necessary to corroborate, complement or even contradict the proposed U6 landscape.

Complete mtDNA lineages
We have fully sequenced eleven mitochondrial lineages belonging to different subclades of the North African subhaplogroup U6. DNA extraction, amplification and manual sequencing methods have already been described [8].  [11,12]. In order to distinguish putative U6 members, all these subjects and the U individuals from a sample of 1059 previously published [4,11,12,19,38], were amplified with primers L3073/H3670 [8], and tested for the presence of the 3348 MboI site [15], that characterizes all U6 members.

Phylogenetic analyses
Phylogenetic relationships among complete mtDNA sequences were established using the reduced median network algorithm [39]. In addition to our eleven sequences, four lineages were added: U6 and U5b [8] (Accession numbers: AF382008 and AF381980, respectively) and for the coding region, H84 and H229 [18].

U6 phylogeographic analyses
In addition to our 611 samples, 41 populations in which U6 haplotypes have been detected were included in our phylogeographic analysis (Table 2). Relationships among the different U6 haplotypes were inferred using the reduced median network algorithm [39]. To resolve reticulations, the highly recurrent mutations 16129, 16189, 16311 and 16362 were given lower weights.

Differences in accumulated mutations among U6 branches
A non-parametric test, the resampling probability estimate for the difference between the means of two independent samples (http://faculty.vassar.edu/lowry/VassarStats.html), was used to calculate the significance level of the differences in accumulated mutations between the U6 subclades. The likelihood-ratio test, as implemented in TREE-PUZZLE [21], was used to test a uniform clock model against a variable rate model for the U6 tree.
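The two tests named above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the VassarStats or TREE-PUZZLE code: the function names, the two-sided form of the resampling test, the number of resamples and the random seed are all hypothetical choices. The likelihood-ratio statistic is simply twice the difference in log-likelihood; it is usually compared with a chi-square distribution whose degrees of freedom (for a clock test, commonly the number of sequences minus two) are an assumption not stated in the text.

```python
import random


def resampling_p_value(sample_a, sample_b, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Group labels are shuffled n_resamples times; the p-value is the share
    of shuffles whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(sample_a)
    pooled = list(sample_a) + list(sample_b)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs_mean_diff(pooled) >= observed - 1e-12:  # tolerate float noise
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def lrt_statistic(log_l_clock, log_l_free):
    """Likelihood-ratio statistic: 2 * (lnL_free - lnL_clock)."""
    return 2.0 * (log_l_free - log_l_clock)


if __name__ == "__main__":
    # Log-likelihoods reported in the text for the clock and variable-rate models.
    print(round(lrt_statistic(-23060.25, -23032.22), 2))
    # Hypothetical mutation counts for two subclades (not data from Table 1).
    print(resampling_p_value([10, 11, 12, 13], [5, 6, 7]))
```

With the log-likelihoods reported in the Results, the statistic is 2 × 28.03 = 56.06.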

U6 diversity and differentiation within and between areas
The Arlequin package [40] was used to evaluate the U6 diversity within areas using nucleotide diversity (π) and segregating sites (S). Affinities between areas were obtained by means of linearized FST [41].
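The diversity measures named above can be illustrated on toy data. This is a minimal sketch, not Arlequin's implementation: the function names are hypothetical, diversity is expressed as the mean number of pairwise differences (Arlequin can also report it per site), and the linearization FST / (1 - FST) is assumed to be the form meant by [41].

```python
from itertools import combinations


def segregating_sites(seqs):
    """S: number of aligned positions at which the sequences are not all identical."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differing positions over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in pairs
    )
    return total / len(pairs)


def linearized_fst(fst):
    """Linearized distance FST / (1 - FST); assumed form, valid for fst < 1."""
    return fst / (1.0 - fst)


if __name__ == "__main__":
    # Toy aligned sequences of equal length (hypothetical, not U6 data).
    toy = ["ACGT", "ACGA", "ACCA"]
    print(segregating_sites(toy))
    print(mean_pairwise_differences(toy))
    print(linearized_fst(0.2))
```

Dividing the mean number of pairwise differences by the alignment length gives the per-site value of π.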

Time estimates
For HVSI, the age of clusters or expansions was calculated as the mean divergence ρ from inferred ancestral sequence types [42] and converted into time by assuming that one transition within np 16090-16365 corresponds to 20,180 years [43]. The standard deviation of the ρ estimator was calculated as previously described [44].
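The HVSI dating step can be sketched as follows. The conversion factor of 20,180 years per transition comes from the text; everything else is an assumption for illustration: the function name, the toy tree, and in particular the standard-error formula, which is one commonly used form for the ρ estimator and may not match [44] exactly.

```python
import math

YEARS_PER_TRANSITION = 20180  # from the text: one transition within np 16090-16365


def rho_age(branches, n_samples):
    """Estimate rho and its age from the branches of a rooted network.

    branches: list of (n_descendants, n_mutations), one entry per branch
              below the inferred ancestral type.
    n_samples: total number of sampled individuals in the cluster.

    rho = sum(n_i * l_i) / n is the mean number of mutations from the root.
    The standard error sqrt(sum(n_i**2 * l_i)) / n is an assumed form.
    """
    rho = sum(n * l for n, l in branches) / n_samples
    sd = math.sqrt(sum(n * n * l for n, l in branches)) / n_samples
    return {
        "rho": rho,
        "sd": sd,
        "age_years": rho * YEARS_PER_TRANSITION,
        "sd_years": sd * YEARS_PER_TRANSITION,
    }


if __name__ == "__main__":
    # Hypothetical tree with 4 individuals: a branch of 1 mutation carrying
    # 3 individuals, a sub-branch of 1 mutation carrying 1 of them, and a
    # separate branch of 2 mutations carrying 1 individual.
    print(rho_age([(3, 1), (1, 1), (1, 2)], n_samples=4))
```

In the toy tree the four individuals lie 1, 1, 2 and 2 mutations from the root, so ρ is 1.5 and the age is 1.5 × 20,180 = 30,270 years.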
For the complete sequences, only substitutions in the coding region (15,447 nucleotides), excluding indels, were taken into account. The mean number of substitutions per site to the most recent common ancestor of each clade (ρ) was estimated and converted into time using two substitution rates: 1.7 × 10^-8 [24] and 1.26 × 10^-8 [45].
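The coding-region conversion is a single division, sketched below. The region length and the two rates are taken from the text; the function name and the example input are hypothetical, and the rates are assumed to be expressed in substitutions per site per year.

```python
CODING_SITES = 15447  # coding-region length given in the text

# Rates from the text, assumed to be substitutions per site per year.
RATES = {"[24]": 1.7e-8, "[45]": 1.26e-8}


def coding_region_ages(mean_substitutions):
    """Convert a mean substitution count per lineage into ages in years.

    mean_substitutions: average number of coding-region substitutions
    separating the sampled lineages from their most recent common ancestor.
    """
    rho_per_site = mean_substitutions / CODING_SITES
    return {ref: rho_per_site / rate for ref, rate in RATES.items()}


if __name__ == "__main__":
    # Hypothetical input: five substitutions on average (not a value from Table 4).
    for ref, age in coding_region_ages(5.0).items():
        print(ref, round(age))
```

Because the slower rate sits in the denominator, the estimate based on [45] is always about 35% older than the one based on [24].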

Supplementary material
The eleven complete mitochondrial DNA sequences are registered under GenBank accession numbers: AY275527 to AY275537.