Neogene paleogeography provides context for understanding the origin and spatial distribution of cryptic diversity in a widespread Balkan freshwater amphipod

Background The Balkans are a major worldwide biodiversity and endemism hotspot. Among the freshwater biota, amphipods are known for their high cryptic diversity. However, little is known about the temporal and paleogeographic aspects of their evolutionary history. We used paleogeography as a framework for understanding the onset of diversification in Gammarus roeselii: (1) we hypothesised that, given the high number of isolated waterbodies in the Balkans, the species is characterised by high level of cryptic diversity, even on a local scale; (2) the long geological history of the region might promote pre-Pleistocene divergence between lineages; (3) given that G. roeselii thrives both in lakes and rivers, its evolutionary history could be linked to the Balkan Neogene paleolake system; (4) we inspected whether the Pleistocene decline of hydrological networks could have any impact on the diversification of G. roeselii. Material and Methods DNA was extracted from 177 individuals collected from 26 sites all over Balkans. All individuals were amplified for ca. 650 bp long fragment of the mtDNA cytochrome oxidase subunit I (COI). After defining molecular operational taxonomic units (MOTU) based on COI, 50 individuals were amplified for ca. 900 bp long fragment of the nuclear 28S rDNA. Molecular diversity, divergence, differentiation and historical demography based on COI sequences were estimated for each MOTU. The relative frequency, geographic distribution and molecular divergence between COI haplotypes were presented as a median-joining network. COI was used also to reconstruct time-calibrated phylogeny with Bayesian inference. Probabilities of ancestors’ occurrence in riverine or lacustrine habitats, as well their possible geographic locations, were estimated with the Bayesian method. A Neighbour Joining tree was constructed to illustrate the phylogenetic relationships between 28S rDNA haplotypes. Results We revealed that G. roeselii includes at least 13 cryptic species or molecular operational taxonomic units (MOTUs), mostly of Miocene origin. A substantial Pleistocene diversification within-MOTUs was observed in several cases. We evidenced secondary contacts between very divergent MOTUs and introgression of nDNA. The Miocene ancestors could live in either lacustrine or riverine habitats yet their presumed geographic localisations overlapped with those of the Neogene lakes. Several extant riverine populations had Pleistocene lacustrine ancestors. Discussion Neogene divergence of lineages resulting in substantial cryptic diversity may be a common phenomenon in extant freshwater benthic crustaceans occupying areas that were not glaciated during the Pleistocene. Evolution of G. roeselii could be associated with gradual deterioration of the paleolakes. The within-MOTU diversification might be driven by fragmentation of river systems during the Pleistocene. Extant ancient lakes could serve as local microrefugia during that time.

281 tree-based, phylogenetic approach using the Bayesian implementation of the Poison Tree 282 Processor (bPTP) (Zhang et al., 2013) and iv and v) the tree-based general mixed Yule coalescent 283 (GMYC) model-based method (Pons et al., 2006), using either single and multiple threshold 284 models. 285 The BIN method is implemented as part of The Barcode of Life Data systems (BOLD; 286 Ratnasingham . Newly submitted sequences are compared together as well as with 287 sequences already available in BOLD. Sequences are clustered according to their molecular 288 divergence using algorithms aiming at finding discontinuities between clusters. Each cluster is 289 ascribed a globally unique and specific identifier (aka Barcode Index Number or BIN), already 290 available or newly created if newly submitted sequences do not cluster with already known BINs. 291 Each BIN is registered in BOLD.

292
The ABGD method is based upon pairwise distance measures. With this method the 293 sequences are partitioned into groups (MOTUs), such that the distance between two sequences 294 from two different groups will always be larger than a given threshold distance (i.e. barcode gap). 295 We used primary partitions as a principal for group definition, as they are typically stable on a 296 wider range of prior values, minimise the number of false positive (over-split species) and are 297 usually close to the number of taxa described by taxonomists (Puillandre et al., 2012). The default 298 value of 0.001 was used as the minimum intraspecific distance. As there is currently no consensus 299 about which maximum intraspecific distance is reflecting delimitation of species, neither based on 300 morphology (Costa et al., 2007, Weiss et al., 2014, or Katouzian et al., 2016 nor on reproductive 301 barrier (Lagrue et al., 2014) we explored a set of values up to 0.01. The standard Kimura two-302 parameter (K2p) model correction was applied (Hebert et al., 2003). 303 The tree based bPTP method for species delimitation uses non-ultrametric phylogenies. 304 The bPTP method incorporates the number of substitutions in the model of speciation and assumes 305 that the probability that a substitution gives rise to a speciation event follows a Poisson distribution. 306 The branch lengths of the input tree are supposed to be generated by two independent classes of 307 the Poisson events, one corresponding to speciation and the other to coalescence. Additionally the 308 bPTP adds Bayesian support values (BS) for the delimited species (Zhang et al., 2013). For the 309 input tree we generated phylogeny using Bayesian inference in MrBayes 3.2 (Ronquist et al., 310 2012). Prior to tree construction, the out-group was removed from the alignment. The GTR+G+I 311 model was selected using AICM method of moments estimator (Baele et al., 2013) in Tracer 1.6 312 (Rambaut et al., 2014). Four MCMC, each with 20 M iterations, were run and sampled every 2000 313 iterations. This was enough to obtain potential scale reduction factor values to equal 1 for all 314 parameters. Stable convergence was determined by inspecting the log likelihood values of the 315 chains and the split frequencies that reach, after 10 million generations, the value above 0.01. The 316 consensus tree was constructed after removal of 25% burn-in phase. Analysis was performed on 317 the bPTP web server (available at: http://www.species.h-its.org/ptp/) with 500 000 iterations of 318 MCMC and 10% burn-in.

319
With the GMYC method, species boundaries are assessed based on sequences in a 320 maximum likelihood framework by identifying the switch from intraspecific branching patterns 321 (coalescent) to typical species branching patterns (Yule process) on a phylogenetic tree. First, a 322 log-likelihood ratio test is performed to assess if the GMYC mixed models fit the observed data 323 significantly better than the null model of a single coalescent species. Providing there is evidence 324 for overlooked species inside the phylogenetic tree tested, two different GMYC approaches, one 325 using the single threshold and the other one on multiple threshold model are used to estimate the 326 boundary between intra-and interspecific branching patterns. Since the GMYC-method requires 327 an ultrametric tree, we have used the already reconstructed Bayesian time calibrated phylogeny. 328 The consensus tree was loaded into the R software package 'SPLITS' (Species Limits by 329 Threshold Statistics) (Ezard, Fujisawa & Barraclough, 2009) in R v3.1.0 (R Core Team, 2012) and 330 analysed using the single and multiple threshold models. Presence of significant differences 331 between the two models was tested using likelihood ratio test (LRT). 332 333 Bayesian reconstruction of the ancestral states 334 The program BayesTraits (Pagel, Meade & Barker, 2004) was used to implement the MCMC 335 Bayesian method, aiming at reconstructing ancestral states of a given trait along a phylogenetic 336 tree. Two traits, habitat (river vs lake) and geographic coordinates (longitude and latitude) were 337 explored. The BI maximum clade credibility chronogram was used as the input tree and the actual 338 habitat type and geographic coordinates of its tips were used as input data for the actual state of 339 the trait. For each ancestral node along the tree the results were reported as i) a chance (in 340 percentage) for the ancestor to occur in a given habitat or ii) possible coordinates of the ancestor's 341 locality. In both cases the analysis was performed in two steps. Firstly, using the mentioned input 342 data, a distribution of models was estimated and saved into a file. Secondly, the input data with 343 file containing models distribution was used in the final analysis in which four independent, 10 M 344 iterations long, MCMC runs were performed and sampled every 1000 iterations, with 5% burn-in. 345 Continuous and multistate reconstruction options with equal priors models, were used for 346 geographic coordinates and habitat, respectively. All runs in each analysis produced convergent 347 results that were combined and summarised (see Table S3 (Bandelt, Forster & Rohl, 1999) 362 implemented in the program NETWORK 4.6.1.2 (http://www.fluxusengineering.com). Given the 363 differences in substitution rates, we applied a weight of 1 for transition and 2 for transversions. 364 The topology was obtained at the homoplasy level parameter default value (e = 0).

365
Diversity was assessed as the number of haplotypes (k), haplotypic diversity (h) and 366 nucleotide diversity (π) (Nei, 1987) with the DnaSP software (Librado & Rozas, 2009). Molecular 367 divergence was estimated as average Kimura 2 -parameters (K2p) distance between haplotypes 368 using MEGA 6.0 . In case of MOTUs present in more than one site (MOTUs 369 A, C, E, G and K) and given a minimum sampling size per site of 3 individuals, the differentiation    (Table 1). Only seven haplotypes were shared between a pair of locations 395 (Table 1). Each of the remaining 67 haplotypes was specific to only one location (Table 1). 396 Minimum, average and maximum K2p distance between haplotypes were, respectively, 0.0019 397 (SE = 0.0017), 0.1463 (SE = 0.0108) and 0.2405 (SE = 0.0108). Such high average K2p distance 398 suggests presence of cryptic diversity within the morphospecies (see also Table S4).  (Table S2).

408
The spatio-temporal pattern of lineage diversification within G. roeselii in the Balkan 409 Peninsula appeared to be very complex (Fig. 2a, Table S1). The diversification started ca. 18 Ma   (Fig. 2). 426 All the 173 sequences submitted to BOLD clustered into 27 BINs (details in Table S1).

427
The ABGD distance-based approach (barcode-gap analysis), resulted in eight primary 428 partitions of the analysed COI sequences. Four of them were stable over a wide range of P values 429 (0.001-0.10) and defined 12 MOTUs, being the same in each partition (Fig. 2).    (Fig. 2b). 475 The MOTUs belonging to the southern group are spatially scattered and have very localised 476 distributions (Fig. 2b). Each of them is restricted usually to only one waterbody or drainage area.   Projection of the reconstructed longitudes and latitudes on paleomaps (Fig 4b-d) suggests 505 that the early MRCA of extant G. roeselii MOTUs occurred in the area covered by the Neogene  512 For each of the 13 MOTUs identified by the COI data, one to 18 individuals (50 in total) were 513 sequenced for the nuclear 28S marker, producing 23 haplotypes (N1-N23) ( Table 1). The topology 514 of the Neighbor-Joining tree (Fig. 5) showing relationships between the 28S haplotypes was 515 partially concordant with the results derived from the mtDNA COI marker. For example, MOTUs 516 with high COI divergence (ca >15% K2p distance) were also divergent in the case of 28S. Only 517 MOTU E showed a peculiar pattern as its individuals, associated with three 28S haplotypes (N2, 518 N4 and N5), and belonged to two highly divergent clades of the 28S N-J tree (Fig. 5). Interestingly, 519 MOTU E was always sympatric (sites 11, 13, 20) with another distantly related MOTU G, J and 520 A, respectively. In two cases the individuals with COI haplotypes of MOTU E had a nuclear 521 haplotype either close (site 11, N2 vs N3 and N9) or even identical (site 13, N4 and N5) to the 522 nuclear haplotypes associated with the other MOTU present at the respective site.  (Fig. 6), revealed a generally rather 526 high level of haplotypic and nucleotide diversity as well as K2p distance from 0.002 to 0.029 527 (Table 2). This was also reflected by the fact that in numerous cases various species delimitation 528 methods have subdivided the MOTUs we defined into smaller units. In all cases this intra-MOTU 529 divergence is younger than 2.5 Ma. Such sub-MOTUs have very localized distributions, e.g. they 530 are restricted to one lake (e.g. Prespa, Vegoritis/Petron) or one lake system (Kastoria, Ohrid). This 531 suggests that the Pleistocene glaciations might have promoted diversification at the local scale. 532 Only MOTU M showed a clear sign of demographic expansion ( Table 2, Table S6). On the other 533 side, results of AMOVA (Table 2) (Table S7).   2 values (p) of the parameters were evaluated with 10000 simulations: ns=not significant, *P < 0.05; ***P < 0.001

Figure 1(on next page)
Maps of the studied area and sampling sites.      Table 1).