Eurasian house mouse (Mus musculus L.) differentiation at microsatellite loci identifies the Iranian plateau as a phylogeographic hotspot

The phylogeography of the house mouse (Mus musculus L.), an emblematic species for genetic and biomedical studies, is only partly understood, essentially because of a sampling bias towards its most peripheral populations in Europe, Asia and the Americas. Moreover, the present-day phylogeographic hypotheses stem mostly from the study of mitochondrial lineages. In this article, we complement the mtDNA studies with a comprehensive survey of nuclear markers (19 microsatellite loci) typed in 963 individuals from 47 population samples, with an emphasis on the putative Middle-Eastern centre of dispersal of the species. Based on correspondence analysis, distance and allele-sharing trees, we find a good coherence between geographical origin and genetic make-up of the populations. We thus confirm the clear distinction of the three best described peripheral subspecies, M. m. musculus, M. m. domesticus and M. m. castaneus. A large diversity was found in the Iranian populations, which have had an unclear taxonomic status to date. In addition to samples with clear affiliation to M. m. musculus and M. m. domesticus, we find two genetic groups in Central and South East Iran, which are as distinct from each other as they are from the south-east Asian M. m. castaneus. These groups were previously also found to harbor distinct mitochondrial haplotypes. We propose that the Iranian plateau is home to two more taxonomic units displaying complex primary and secondary relationships with their long recognized neighbours. This central region emerges as the area with the highest known diversity of mouse lineages within a restricted geographical area, designating it as the focal place to study the mechanisms of speciation and diversification of this species.


Background
The house mouse (Mus musculus L.) has long been viewed has an excellent model for the study of evolution and its genome has been one of the first to be nearly completely sequenced [1,2]. Moreover, its dispersal capacity through commensalism has ranked it as one of the "100 world worst most invasive alien species" (ISSG), therefore offering various possibilities to study adaptation to various environments [3]. At the same time, this species is one of the most studied vertebrates due to its use as a prominent laboratory model, but its phylogeography and population genetics is so far only partly understood [4]. The current knowledge has slowly accumulated over the last 30 years in a non-optimal fashion, since its most peripheral populations in Europe, Asia and the Americas have been studied before insights were gained for those from the Middle-Eastern centre of its distribution [5].
It is now widely recognised that Mus musculus L. constitutes a complex assembly of more or less well separated populations and subspecies. The term "subspecies" in itself is taken here in its broad sense of "genetically recognisable entities" but this does not imply on our part any deeper statement about the actual level of isolation among these entities. The last 45 years of literature on systematics of the house mouse revealed that nomenclatorial issues have been quite controversial, with the use of many terms ranging from biochemical groups, subspecies, semi-species to full species to designate the same entities. Here, we follow the generally held view that the more widely distributed populations are grouped into three different subspecies: Mus musculus musculus in Eastern Europe, Central and North East Asia, Mus musculus domesticus in Northern Africa and Western Europe, and Mus musculus castaneus in South East Asia. These last two subspecies have further expanded in modern times to the Americas, Australia and Oceania [6][7][8][9]. In addition, Mus musculus molossinus, a hybrid between M. m. musculus and M. m. castaneus found in Japan [10] is often considered as a subspecies on its own. Closer to the centre of the distribution, Mus musculus gentilulus has been identified in the eastern part of the Arabic peninsula on the basis of its mitochondrial DNA lineage [11] while from the same type of data [12,13] it has been shown that certain populations considered as M. m. castaneus in Iran, Pakistan and Afghanistan should probably be considered as belonging to further sub-specific groups. Moreover, another completely independent lineage has recently been identified on this basis in Nepal [13]. Hence, the taxonomic situation close to the Middle-Eastern centre is far from being fully clarified. Since taxonomy reflects history, this clarification is a prerequisite if we want to further study the evolutionary mechanisms accounting for the species' differentiation.
The present study aims at filling this gap through the analysis of genetic variation at nuclear loci and is the first attempt to directly compare a set of population samples covering most of the Eurasian distribution of the species. We report on the variability at 19 microsatellite loci typed in 963 individuals originating from 47 populations in Europe, Asia, Africa and the Middle-East ( Figure 1).
Our main findings support the global picture of a species having initially differentiated in several genetic entities in the Middle-East forming the extant subspecies. Several of these subsequently expanded outwards because of their propensity to engage in commensalism and finally colonise the entire world. Of particular interest is the situation of the populations inhabiting the Iranian plateau where, despite an important level of secondary admixture in this region, four main genetic groups could be identified, in partial congruence with the mitochondrial analysis.

Genetic diversity
The genetic diversity was calculated for all samples ( Table 1). The most diverse sample was the one from Ahvaz in Iran with a H exp of 0.89 and an average number of alleles across loci of 19.3. The lowest H exp is seen for Moscow (0.49) with only 3.1 alleles/loci, but this can be ascribed to the small sample size (N = 5). Among the three islands that are present in this study, La Palma (Canaries) and Cyprus displayed a H exp comparable to continental populations (0.74 and 0.77 respectively), while the Madagascar value was slightly lower (0.67). Mean and Median of H exp are 0.74 and 0.75 respectively. The global differentiation as measured by inter-sample  Removing those yields an average F ST of 0.14.

Correspondence analysis
The Correspondence Analysis (CA) depicts the relative positions of individual genotypes projected onto the 3D space of maximal differentiation of each sample's centroid ( Figure 2A). We chose to represent the samples assigned to the three peripheral sub-species (M. m. domesticus, M. m. musculus and M. m castaneus) in previous studies by blue, red and green squares respectively, the Malagasy sample in orange and the Iranian samples with a palette of colours. The first three axes explained more than 44.8% of the total inertia. The coordinates of the centroids on the 10 first axes can be seen in Figure 2B.  Figure 2A and B), but that of Yazd is strongly associated with the negative coordinates of axis 4 ( Figure 2B). Other peculiarities can be found on the other axes and, even if no strictly private alleles were found for any particular sample except for very rare ones, these associations of different samples with different axes is indicative of the existence of groups of particular alleles among the 964 present in the global data set that pull the signal along theses axes (not shown).

Population tree
A complementary graphical representation of population differentiation is provided by the Neighbour-Joining tree of Figure 3. The bootstrap values are not very high, but there is a high coherence in the phylogeographical groupings on the tree. The three major subspecies are clearly grouped together, with the only exception for the sample of Abkhazia (western Georgia) which did not cluster with M. m. musculus, while it was previously described from electrophoretical data as being predominantly of M. m. musculus composition (55% M. m. musculus, 45% M. m. domesticus - [18]). The Iranian mice other than those previously referred to M. m. domesticus or to M. m. musculus form two clearly distinct groups that are not closer to each other than they are from M. m. castaneus. One group called Central Iran (in pink) contains the samples from Tehran, Bandar-Abbas, Yazd and Isfahan and another group called South-East Iran (in yellow) contains samples from Birdjand-Zabol, Iranshahr and Chabahar. This last group was located opposite of M. m. domesticus on axis 1 of the CA and clearly stands on its own on the NJ tree.

Individual allele-sharing distance tree
Representing nearly a thousand individuals on a single tree is difficult; nevertheless, the clustering obtained at the population level can largely be seen also at the individual level (Figure 4). Although the relative proximity of the different branches of the tree cannot be evaluated properly, it is remarkable that the order in which the various samples are organised fits almost perfectly with the population tree of Figure 3, starting with the M. m. domesticus samples at one point and ending with the ones referable to M. m. musculus. Remarkable also is the fact that most samples appear rather homogeneous. For instance, the individuals from South-East Iran: Iranshahr (light pink empty diamonds) and Chabahar (dark green

Structure analysis
The Bayesian clustering procedure implemented in STRUCTURE [27] was performed with various numbers of partitions. Interestingly, the software was not able to converge for K larger than 2, and the standard deviation of the LogLikelihood among runs was quite high for larger values of K (Table 2). Accordingly, the criterion of Evanno et al. [28] (a steep change in the likelihood) or other more recently derived methods (see [29]) could not really be applied, the probable reasons for this are discussed further below.  five different types of configuration. It is noteworthy, however, that the non-stable alternatives proposed by the software always turn around the same handful of clustering hypotheses. Further, for every of these values of K, some samples appeared to display detectable levels of introgression (see for instance Ahvaz, Thailand or Pakistan samples-Additional file 1: Figure  S1).

Overall differentiation
As was to be expected, the population samples from the best described subspecies differentiate clearly from each other, and can be unambiguously assigned to M. m. musculus, M. m. domesticus or M. m. castaneus in keeping with previous studies (see [5] for a review). These long recognised entities are separated by the three first canonical axes of the CA in Figure 2, and constitute the three longest branches of the distance tree in Figure 3 and occupy different sectors of the circle in the allelesharing tree of Figure 4. For the sake of comparison with the literature, an AMOVA performed among these three entities only indicated a F CT value of 0.70 for an average within group differentiation of F SC = 0.12. When calculated inside each of our sampling of the peripheral subspecies, this becomes 0.10, 0.13 and 0.09 for M. m. castaneus, M. m. domesticus and M. m. musculus respectively, thus indicating a consequent amount of internal variation inside each of them. However, when the whole collection was submitted to the Bayesian assignation of STRUCTURE, only the partition between M. m. domesticus and all the rest appeared stable ( Figure 5A).
One noticeable point is that there is a quite good geographical coherence in each group. If one considers M. m. domesticus for instance, the fan-shaped sub-tree of Figure 3 separates well the Near-East samples from the European ones. The Senegalese sample appears clearly as a recent offshoot of European origin, while this is not the case for Northern-African ones (Morocco and Tunisia). For M. m. musculus, the feather-like sub-tree indicate remote branching for the Chinese and Kazak samples, a fact to be put in relation with the probable recent eastwards expansion of this subspecies from the Caucasus region [26,30]. As to the M. m. castaneus cluster, it is the one where unexpected assignations occur. As reported above, the Malagasy sample clearly belongs to this group despite it possessing M. m. gentilulus mtDNA. This is another example of the possibility for maternal lineage capture during an expansion process. Madagascar is thought to have been populated by mice quite recently (ca 1,000 years, [15]) with the development of trade between India and Indonesia (were the nuclear genome most likely comes from), the Arabic peninsula (for the M. m. gentilulus maternal lineage [11]) and Africa. Of interest also is the position of the Kenyan mice, both from the coast (Mombasa) and the interior (Nairobi) which appear also as a recent M. m. castaneus offshoot, despite harbouring a mixture of M. m. castaneus and M. m. domesticus matrilines [17] as well as supplementary matrilines (see Additional file 2). Another noticeable point is, as expected, that samples known to be introgressed as the above-mentioned Abkhazian one, tend to "move" toward the centre of the tree. This is also the case for the Armenian sample from Megri, which is M.  m. musculus for 2/3 of its nuclear genome [18] and was already shown to be under multiple genetic influences since it also contains for instance at the ABP locus three different alleles, each one ascribed to a given subspecies [31].

Status of Iranian plateau populations
The most interesting results of this study concern the samples of the Iranian plateau, which have sometimes been coined "central populations" because of many uncertainties as to their taxonomic affiliation. Putting aside the samples unambiguously referable to M. m. musculus (Khakh-Qaene and Iran North-East) and M. m. domesticus (Ahvaz and Hamedan), we are left with two clearly separated entities as described in the Results section. Interestingly, this fits rather well with the mitochondrial description recently provided [12] where each of the entities recognised in Iran is associated predominantly with one haplogroup (HG) essentially not found elsewhere. Namely, when recompiling the aforementioned data [12] for the populations' samples of this study, the Central Iran group is associated with mitochondrial haplogroup Hg1B (78%) and South-East Iranian is associated with haplogroup Hg3 at 88%, while these two HGs are practically non-existent outside Iran where M. m. castaneus is almost completely associated with Hg2. The mitochondrial tree of all studied populations is provided in Additional file 1: Figure S3 to illustrate this point. As discussed by Rajabi-Maham et al. [12], the separate coalescence of these phylogenetically independent haplogroups have most likely arisen in allopatry during past periods of geographical isolation, and our nuclear genome results fit quite well with this view as they also point towards the existence of two independent groups in Iran, one in the center, and one in the South-East. These two groups are loosely related to M. m. castaneus, as are their matrilines (Additional file 1: Figure S3). Furthermore, the separation of the different entities in Iran seems to fit the topography of the country. M. m. domesticus samples are separated from the Central Iran individuals by the Zagros Mountains, Central Iran from M. m. musculus through the Kavir desert and from south-east Iran through the Lut desert ( Figure 1B). The central and south-east Iranian phylogroups are as distinct from each other as they are from the south-east Asian M. m. castaneus, even if many signs of admixture and secondary contacts are to be found, be it on the nuclear genome or the mitochondrial data. This is likely the principal reason why the STRUCTURE clustering did not converge to stable partitions for values of K larger than 2; probably because certain loci may have been more prone to be exchanged than other after secondary exchange. Hence, the present genomic make-up of Iranian populations is likely to be a mosaic of locally evolved haplotypes (like the mitochondrial matrilines) and segments imported from their neighbours in proportions that remain to be estimated. Nevertheless, they necessarily have evolved as geographical isolates during a certain amount of time and cannot be considered as resulting from simple admixture of surrounding populations. Hence, we may consider them as new independent entities, even if monophyly will rarely be achieved because of the retention of ancestral polymorphisms and secondary gene flow.
Attempts at setting the time frame for these exchanges with ABC methods have been recently performed [32], but these did not formally consider the existence of independent Iranian entities. The ancient origin of most Iranian phylogroups has however been recently reinforced by independent data showing they may have retained ancestral morphological features [33]. Only more sophisticated model-based analyses relying on genome-wide multilocus data will be able to tell what kind of genetic exchanges they are still able to maintain. Such genome-wide data exist currently solely for the comparison between the peripheral M. m. domesticus populations in Germany and France, versus M. m. musculus in Czech Republic and Kazakhstan [34] and between wild-derived lines representing each of the three peripheral subspecies [35]. These studies have revealed complex patterns of mutual introgression but cannot enlighten us as to the history and status of the Iranian phylogroups.
From the taxonomical standpoint, this situation deserves further scrutiny, since the distinctiveness of these three phylogroups should prevent the use of a single Latin trinomial like M. m. castaneus. However, solving the problem would require having access to hypothetical voucher specimens and a better delineation of present day limits of these phylogroups. This is not an easy task, because the species as a whole has undergone a phase of post-glacial expansion triggered by its association with Neolithic humans which may have induced a phase of complex hybridization, introgression and de-differentiation. This de-differentiation reaches some clear limits when distantly related subspecies like M. m. domesticus and M. m. musculus are concerned since their contact amounts to hybrid zones with restricted gene flow (see for instance [36] and the literature cited therein), but is not so clear what happens between the four non-domesticus entities considered here.

Conclusion
Our analysis of a Eurasian collection of wild mice encompassing both peripheral populations and a large number of central ones is the first comprehensive approach aiming at providing a global view of the relationships among the taxa constitutive of this complex subspecific assemblage. We propose here that the Iranian plateau is home to two more units displaying complex primary and secondary relationships with their long recognized neighbours M. m. domesticus, M. m. musculus and M. m. castaneus. If one adds to this the M. m. gentilulus from the Arabic peninsula and the new clade described from Nepal [13], the taxonomic diversity is clearly higher than previously thought. This is not a surprise if one considers the highly tormented landscape of the central regions where the complex species originated from, where there was clearly space for more than three geographical isolates before its worldwide expansion. The fact that this happens in and around the Iranian plateau is consistent with the key geographical position of this region which constitute an obligate passage between the lowlands of central Asia to the North, the Fertile Crescent and the Near-East to the West, and the Indian subcontinent to the East. Although the present-day distribution of distinct phylogroups does not necessarily indicate where the ancestral species was originally located, it becomes clear from the above results that the Iranian plateau should be included in the candidate regions together with the Indo-Pak sub-continent and neighboring Afghanistan. Indeed, in this large Middle-Eastern region, no less than six distinct phylogroups have differentiated and presently interact with each other. This should now be taken into account when building evolutionary scenarios.

Mice sampling
Most of the individuals included in this study come from the DNA collection established at ISE-M between 1985 and 2005 with the exception of the samples from Massif Central, Cologne-Bonn, Czech Republic and Kazakhstan that come from the collection held at the MPI-Plön [19]. All these samples have been described and used in previous publications, the references of which are given in Table 1, together with their geographical locations.

Data analysis
Basic diversity parameters were obtained using Genetix [39] and Arlequin [40]. Several ways of dissecting the genetic variability present at the 19 microsatellites loci were used concurrently. We first performed a Correspondence Analysis (CA) using the AFC-3D procedure of Genetix. This is an unsupervised method which allows representation of the differentiation of samples along independent factorial axes. We did this by taking the centroids of samples as active elements (and individuals as supplementary elements) considering alternatively the whole collection (47 population  The pairwise Reynold's distances were computed using the Gendist program of the Phylip package [41] and 1,000 bootstrapped datasets were generated with Seqboot, followed by the Neighbor and Consense procedures from the same package to obtain a Neighbour-Joining consensus population tree. Additionally, an Allele Sharing distance tree considering all individuals was calculated using the software Populations 1.2.32 [42] and visualized with MEGA 6 [43]. Finally, we used the STRUCTURE program [27] to assign the collection of individuals to several putatively reproductively independent groups. The run parameters were as follow: a burn-in period of 1,000,000 simulations followed by a run length of 1,000,000 MCMC simulations and ten iterations for K (number of clusters) equals 1 to 5. The runs were performed using the admixture model with the Loc Prior option. Results were summarized using Structure Harvester [44]. K was chosen using the criterion of Evanno et al. [28]. To draw the structure diagrams the softwares CLUMPP [45] and Distruct [46] were used.
The mitochondrial D-loop sequences tree from Additional file 1: Figure S3 was produced with MEGA6 [43]. The inference was performed according to the maximum likelihood method based on the Tamura-Nei model [47]. The boostrap values (150 replicates) are shown. Initial tree(s) for the heuristic search were obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.1440)). The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 62.0668% sites). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 867 nucleotide sequences. All positions with less than 95% site coverage were eliminated. That is, fewer than 5% alignment gaps, missing data, and ambiguous bases were allowed at any position. There were a total of 849 positions in the final dataset.

Availability of supporting data
The microsatellites dataset supporting the results of this article on Dryad: doi:10.5061/dryad.ck276.