Samples and sequence data preparation
All of the 2nNCRC samples, including F1 to F5 generations used in this study were cultured in ponds at the Protection Station of Polyploidy Fish, Hunan Normal University, and fed artificial feed. The F1 generation of 2nNCRC is denoted as 2nNCRC-F1, the F2 generation of 2nNCRC was the F1 self-cross offspring and denoted as 2nNCRC-F2, and so on the 2nNCRC-F3, 2nNCRC-F4 and 2nNCRC-F5. One specimens of each generation (F1 to F3) were collected and a total of three mitochondrial genomes of 2nNCRC were sequenced to construct a phylogeny tree based on two both rRNAs and twelve protein-coding genes (ND6 was excluded) (the methods for DNA extraction, amplification, and sequencing are given in additional file 1 in Supplementary Information). To perform a comprehensive phylogenetic analysis, another 85 specimens of representative Cyprinidae fish (including 31 specimens of Carassius, 19 of Cyprinus, 14 of Megalobrama and 24 of other cyprinid fish) and 10 Catostomidae fish complete mitochondrial genomes (both rRNAs and twelve protein-coding genes) were retrieved from the Genbank (see additional file 2 in Supplementary Information). In addition, 837 cytb genes of Carassius were also acquired from the Genbank (see additional file 3 in Supplementary Information) to obtain the phylogeographic structure of Carassius across the Eurasia.
Previous study showed that the morphometric of crucian carp-like fish (2nNCRC) were very similar with that of C. auratus on the lateral scales and the conventional measures [16]. In this study, the morphological characteristics on qualitative and quantitative traits were further compared among the species of Carassius and 2nNCRC, as well as its parents (see Table S1 in Supplementary Information). Furthermore, Pairwise genetic distance of inter- and intraspecific variation in Carassius genus (see Table S2 in Supplementary Information) were calculated under the K2P model for the cytb dataset using MEGA 7 [76].
Phylogenetic analysis
All sequences were aligned with MAFFT [77] as implemented in Geneious 4.8.3. For mitochondrial genomes, all alignments (twelve proteins and two rRNAs) were combined. We tested the 14 genes data for saturation using DAMBE v. 6.4.41 [78]. The test revealed an ISS-value that was significantly lower than the ISS.c in all cases (P < 0.0000), indicating the suitability of the fourteen genes for phylogenetic analysis. The homogeneity of the fourteen genes was tested in *PAUP [79], and the P value was 0.69 (> 0.05). To test our hypothesis that the 2nNCRC is precisely the C. auratus, the ML and Bayesian trees incorporating the mitochondrial genomes (14 genes data) were generated using RAxML v. 8.0 [80] with 1000 bootstrap replicates and MrBayes v. 3.2 [81], respectively. In the phylogenetic tree analysis, 98 sequecnes (see additional file 2 in Supplementary Information) were used, and the species of Catostomidae were used as outgroup. We determined the best-fitting substitution models for each gene fragment using Moldetest 3.7 [82], which were shown in Table S3 in Supplementary Information. Mixing of the MCMC chains of the two independent runs was monitored with TRACER v. 1.7.1 [83] and the analysis was terminated after the average standard deviation of the split frequencies fell under 0.01. The first 25% of the sampled approximately 20 million generations were discarded as burn-in. The final trees were visualized in FIGTREE v. 1.4.4 [84].
To further visualize haplotype diversity and distribution of Carassius, we generated two haplotype network using 837 cytb genes, and colored each haplotype by the geographic region from where it was collected. The haplotype networks were constructed using network v. 5.0 (www.fluxus-engineering.com/sharenet.htm) and applying the median-joining and Maximum Parsimony options.
Divergence time estimation
To test our hypothesis that the divergence times of Cyprinus and Megalobrama are much earlier than that of Carassius. The Divergence time was estimated using a molecular clock approach as implemented in BEAST. We used the combined data in the phylogenetic tree and employed a (uncorrelated lognormal) relaxed clock, as a likelihood-ratio test (LRT) [85] rejected the strict molecular clock hypothesis for the data (P < 0.01), with three calibration points. We used the following calibration points with normal distribution priors: the oldest fossil of Plesiomyxocyprinus arratiae similar to Myxocyprinus asiaticus was constrained to from the middle Eocene or earlier, approximately 40–38 Mya [86], the minimum age of Catostomidae is 60 Ma based on a catostomid fossil from the Paleocene [87], and the timing of the drastic uplift of the Tibet Plateau between 25–17 Mya [88], which was utilized as one calibration point for the Schizothoracinae as the species endemic to Qinghai-Tibet Plateau [89–91]. The clade corresponding to each calibration point was constrained to be monophyletic. The GTR + I + G model was the best fit for the combined dataset (Table S3 in Supplementary Information). The ‘speciation: Yule process’ tree prior was used to construct the tree. We ran four independent runs for 20 million generations logging trees every 2,000 generations. Convergence was checked with TRACER v. 1.7.1 [83]. Posterior trees from the four runs were combined after removing the first 10% as burn-in in LogCombiner v. 2.5.2 (http://beast.bio.ed.ac.uk/logcombiner). The maximum credibility tree was created in TreeAnnotator v. 2.5.2 available in the BEAST package.
Furthermore, another divergence time was estimated using the cytb data. We used a reduced dataset of cytb: specimens for which two or more sequences from the same region were included. This resulted in 250 terminals – 237 representatives of the five species of Carassius and 13 outgroup sequences (C. carpio) from GenBank (see additional file 4 in Supplementary Information). We used a conservative approach by employing calibration points from previous studies [27]. The most recent common ancestor (TMRCA) of C. carassius in the northern and central-eastern European drainages and the Danubian catchment was constrained to 2.18–2.15 Mya. Another calibration point was the fossil of Carassius in Pliocene epoch (5.3–2.6 Mya) in north of China (Yushe, Shanxi province), The GTR + I + G model was also the best fit for the cytb dataset (Table S3 in Supplementary Information). The ‘speciation: Yule process’ tree prior was used to construct the tree. We ran four independent runs for 50 million generations logging trees every 5,000 generations. The other settings are same with the above.
Reconstruction of ancestral areas
To test our hypothesis that whether the C. auratus was derived from the processing of lineage differentiation with other species in Carassius or not, we further performed a biogeographic reconstruction of ancestral areas for species of Carassius using BioGeoBEARS [92] in RASP 4.02 [93]. The analyses were conducted on a fully resolved topology from the BEAST analysis containing five species of Carassius, eight major geographical areas were defined based on the worldwide distribution of Carassius according to their current distribution and Jeffries et al. (2016) [27]: (A) the Europe, except for the south of Alpes mountains and Danube River basin, (B) the south of Alpes mountains and Danube River basin in Europe, (C) Western Asia, (D) Siberia in Russia and Mongolia, (E) the Amur River (including the Amur River in China, Mongolia and Russia), (F) the Yangtze River Basin and its south area in China and Southeast Asia, (G) the East Asia—Japan, (H) the North America. All six models of geographic range evolution were compared in a likelihood framework (DEC, DEC + j, DIVALIKE, DIVALIKE + j, BAYAREALIKE and BAYAREALIKE + j). The best-fit model was assessed on comparing the Akaike's information criterion and likelihood-ratio tests and the DIVALIKE + j was choosed (Table S4 in Supplementary Information). Ancestral distributions were reconstructed using 24 sequences in Carassius (see additional file 5 in Supplementary Information). To account for phylogenetic uncertainty, 4000 post-burn-in trees resulting from the BEAST analysis were integrated for inference. The maximum number of ancestral areas was set to four, as C. auratus and C. gibelio can be widespread.
Species distribution and paleodistribution modeling
We used species distribution modeling (SDM) to construct a model of current, Last Glacial Maximum (LGM, about 22,000 years ago), and Last Interglacial (LIG; ~120,000–140,000 years before present) Carassius distributions. The occurrences records for nearly all Eurasia species in the genus Carassius (including C. carassius, C. cuvieri, C. langsdorfii, C. gibelio and C. auratus) were collected from the Global Biodiversity Information Facility (GBIF, http://data.gbif.org/), Fishbase (https://www.fishbase.de/), the Fish database of Taiwan (http://fishdb.sinica.edu.tw/) and complementary literature (sources summarized in additional file 6 in Supplementary Information). Data records were inspected and occurrences outside of Eurasia, without geo-referencing, were excluded from the analyses. Furthermore, we performed a careful quality check and meticulously scrutinized all records for any geographic or taxonomic issues. Geographic duplicate and very adjacent records, as well as records from introduced species or those pre-dating 1970 were removed, as the current bioclimatic dataset ranges over a 30-year period (1970–2000), and a total of 1,499 unique, geo-referenced and quality-checked occurrence records were finally retained across Eurasia for species in the genus Carassius.
We extracted current bioclimatic data from the WORLDCLIM dataset (v 2.1, http://www.worldclim.org/) at 30 arc-seconds resolution, LGM bioclimatic data from the Model for Interdisciplinary Research on Climate (MIROC) dataset at 2.5-min resolution [94], and LIG bioclimatic data from Otto-Bliesner et al. (2006) [95] at 30 arc-seconds resolution. There exist nineteen bioclimatic variables are included in the WORLDCLIM current and LIG [95] and LGM [96] datasets. The ArcGIS 10.0 (ESRI: Redlands, CA) was used to extract climatic variable layers to include only Eurasia to improve predictive power of Maxent models [97]. Prior to constructing SDM, the ENMTools (v 1.3) [98] was used to determine which bioclimatic variables were correlated, using R > 0.90 as a cutoff, and twelve variables which were found to be correlated with at least one other variable were removed.
We used the Maxent (v 3.4.0) [99] to model current and past Carassius distribution, and used the following settings for the Maxent model: hinge features only, regularization multiplier of 1, 10,000 max number of background points, replicate run type of 10 cross-validations, 500 maximum iterations, and 0.00001 convergence threshold. We ran jackknife tests to measure the importance of each bioclimatic variable. Models used 700, 49, 56, 489 and 205 (total 1499) records for testing and six BIOCLIM environmental layers (bio2, 3, 8, 15, 18, 19) to produce models for present and paleodistributions of C. carassius, C. cuvieri, C. langsdorfii, C. gibelio and C. auratus, respectively.