Supermatrix analyses reveal the importance of outgroup, gene and taxon sampling in Onosma (Boraginaceae) phylogenetics

. Tribe Lithospermeae (Boraginaceae) consists of ca. 26 genera and 470 species, in which Onosma constitutes approximately one third of the species (~150). Although the tribe is strongly supported as monophyletic, both generic and species boundaries remain ambiguous. Among them, not only the phylogenetic position of Eastern Asian Onosma species, but also the taxonomic limits of the genus remain unclear. Whether Eastern Asian Onosma is monophyletic, or the genus should be widened to include Maharanga , and maybe Cystostemon , are still open questions. For these reasons, I performed 16 phylogenetic analyses with different taxon coverages, alignments, gene regions and outgroups, with up to 746 taxa of tribe Lithospermeae and with five DNA regions, using data from GenBank. The results, with the widest taxon coverage to date, show that while genus Onosma is not monophyletic in any of the analyses, the phylogenetic relationships among Onosma s.s., Eastern Asian Onosma , Maharanga and Cystos-temon differ among analyses. However, the approximately unbiased (AU) test showed that the topology (((Eastern Asian Onosma + Ma-haranga ) Cystostemon ) Onosma s.s.) is overwhelmingly supported. Therefore, the current study highlights the importance of taxon, gene and outgroup sampling in Onosma phylogenetics.

Onosma L. is the largest genus of Lithospermeae with ~150 species (Chacón & al. 2017). It is distributed in north-western Africa, Europe and Asia, but the centre of diversity is in the dry, rocky and sunny habitats of Turkey and Iran (Mehrabian & al. 2011;Nasrollahi & al. 2019). While Onosma has been considered as a taxonomically difficult group; only a few morphological and molecular studies have focussed on this genus (e.g., Cecchi & al. 2011;Nasrollahi & al. 2019). Similarly, to date, its sampling in the previous studies was limited [ Table 1]. For instance, while Nasrollahi & al. (2019) sampled 122 Onosma samples (87 species), they included only five outgroups in their analyses. Similarly, while Chacón & al. (2019) included 257 Lithospermeae samples (180 species and 13 subspecies), their sampling of Onosma remained insufficient with only 64 newly sequenced samples (50 species and 4 subspecies). Furthermore, some recently described taxa (e.g., Onosma atila-ocakii O. Koyuncu & Yaylaci;Koyuncu & al. 2013) have never been included in a phylogenetic study, and their phylogenetic position remain unknown. In addition, it was reported that genus Onosma is not monophyletic, with the Eastern Asian genus Maharanga recovered as closely related to Eastern Asian Onosma, namely, O. rostellata Lehm., O. paniculata Bureau & Franch., O. hookeri C.B. Clarke, O. waltonii Duthie, O. sinicum and O. pyramidalis Hook.f. (Cecchi & al. 2011;Chacón & al. 2019;Nasrollahi & al. 2019), so not only the phylogenetic position of the Eastern Asian species, but also the taxonomic limits of the genus Onosma (i.e., whether Eastern Asian Onosma+Maharanga constitutes a different genus) remain unclear.
For these reasons, molecular analyses of Onosma with a wide taxon coverage and more genetic regions are needed. However, such a worldwide study would be both time and money consuming. Therefore, before attempting such an expensive job, a supermatrix analysis which covers more taxa and all available published sequences would be the first step to overcome these problems. With this aim, in the current study, I performed 16 phylogenetic analyses with

Taxon sampling, alignment and phylogenetic analyses
The present study includes up to 746 taxa from 24 genera of tribe Lithospermeae (Table 1). Five loci, namely, nuclear ITS, trnL gene, trnL-F spacer, trnS-G spacer and rps16 downloaded from GenBank (Appendices 1 and 2). The number of taxa, alignment length, number of variable characters, and the number of parsimony informative sites for each region are provided in Table 2. First, for the ITS region, three data matrices with different taxon sampling were created: dataset 1) ITS matrix with missing data (i.e., either ITS1 or ITS2 regions, or both available for each taxon) with 746 taxa; dataset 2) ITS data matrix without missing data with 555 taxa (i.e., both the ITS1 and ITS2 regions are available for each taxon); dataset 3) ITS data matrix without missing data with 323 taxa (i.e., for this dataset, the difficult taxa to align, possibly due to misidentifications and/or ITS region problems, were removed). Second, for the plastid data only one data matrix was created: dataset 4) trnL gene, trnL-F spacer, rps16 and trnS-G spacer were included for 339 taxa. Third, for the total evidence analyses (i.e., all five concatenated loci), four matrices were created: dataset 5) total evidence matrix with 350 taxa; dataset 6) total evidence matrix with 345 taxa [five taxa with only one sequence for each region, namely Halacsya Dörfl., Macrotomia DC. ex Meisn., Stenosolenium Turcz., Lithospermum tschimganicum B.Fedtsch (= Ulugbekia/Arnebia tschimaganica) and Aegonychon Gray were excluded]. With the addition of two different outgroup taxa (i.e., out of tribe Lithospermeae), namely Vahlia Thunb. and Anchusa L., two more datasets were created: dataset 7) total evidence matrix with distant outgroups, namely two Vahlia taxa, and 345 ingroup taxa; and finally dataset 8) total evidence matrix with closely related outgroups, namely two Anchusa taxa, and 345 ingroup taxa ( Table 2). Other than these last two datasets (7 and 8), no outgroup(s) out of tribe Lithospermeae were used. In terms of taxon coverage for each dataset, see Table 3. These strategies were used to detect the effect of outgroups (e.g., datasets 5, 6, 7 and 8), different alignments (e.g., dataset 1, 6 and 3), genes (e.g., datasets 3, 4 and 6) and missing data (e.g., datasets 1 and 2) on the phylogenetic relationships within Onosma (Smith 1994;Qiu & al. 2001). Sequences were aligned by Geneious Pro 4.8.4 (Kearse & al. 2012) and manually edited for 16 different Maximum likelihood (ML) and Bayesian Inference (BI) analyses (Table 2). ML analyses were conducted with RAxML version 8.2.12 (Stamatakis & al. 2014) on the CIPRES Science Gateway (Miller & al. 2011). The GTRGAMMA model was selected as the best-fit model by using the Akaike information criterion (AIC) in the software jModelTest2.1.10 (Guindon & Gascuel 2013; Darriba & al. 2012), and the "Let RAxML halt bootstrapping automatically" options selected. A rapid bootstrap analysis/search for best-scoring ML tree was performed. BI analyses were implemented using MrBayes 3.1.2 (Ronquist & Huelsenbeck 2003) on the CI-PRES Science Gateway (Miller & al. 2011), for 10 million generations, sampling every 100 generations, with a random starting tree. The first 25% of trees were discarded as "Burn-in" and the remaining trees were used to build the consensus tree. Tracer v1.5 (Rambaut & Drummond 2007 was used to inspect the MCMC output and to determine convergence of the two chains. iTOL (Interactive Tree of Life) (Letunic & Bork 2016) was used to create the tree image.

Alternative topology testing
The approximately unbiased (AU) (Shimodaira & Hasegawa 1999) test was used to evaluate the p-values for the possible (((Eastern Asian Onosma+Maharanga) Onosma s.s.) Cystostemon) and (((Eastern Asian Onosma+Maha-  Trifinopoulos & al. 2016), and by using the '-au' option and 10,000 bootstrap replicates with the "total evidence dataset with 350 taxa". Note that due to the possible problems with the ITS region (explained below), the AU test was performed just for the two possible topologies, indicated in Figures 2b and 2c.

RESULTS
The genus Onosma and its subsections, namely, Onosma, Haplotricha and Heterotricha Riedl (1967) did not resulte monophyletic in any of the analyses (Table 3, Fig. 1) (Appendix 3-10). In contrast, Onosma s.s. was monophyletic (93-100% BS, 0.99-1.00 PP) in all analyses, except the plastid matrix analysis, where the clade was not well-resolved The topologies describing the phylogenetic relationships among Eastern Asian genera Onosma, Maharanga, Onos-ma s.s and Cystostemon differed depending on the analysis (Table 3, Fig. 2). First, Onosma s.s. was not sister to the Eastern Asian Onosma+Maharanga clade in neither the ITS with 555 taxa nor the ITS with 746 taxa analyses ( Fig.  2a) although note that the relationships among Onosma s.s., Eastern Asian Onosma, Maharanga and Cystostemon were not resolved in the ITS with 333 taxa and plastid data analyses. Second, the "total evidence tree with 350 taxa" analysis yielded a (((Eastern Asian Onosma+Maharanga) Onosma s.s.) Cystostemon) topology (Fig. 2b). However, a possible (((Eastern Asian Onosma+Maharanga) Cystostemon) Onosma s.s.) relationship was recovered for the first time here, and apart than the distant and closely related outgroup analyses, this topology was also recovered from the "total evidence tree with 345 taxa" analysis ( Fig. 2c).
While the Eastern Asian Onosma+Maharanga clade was strongly supported as monophyletic (94-100% BS, 1.00 PP) in all analyses, only in the ITS with 555 and 746 taxa analyses the Eastern Asian Onosma emerged mono-    were also monophyletic in all analyses. In terms of the differences between nuclear (ITS) and plastid (trnL+trnL--F+trnS-G+rps16) datasets, while Onosma s.s. (100% BS, 1.00 PP), Eastern Asian Onosma (87-91% BS, 0.92-0.98 PP) was monophyletic in all ITS analyses, the clade was not monophyletic in the plastid data analysis (Table 3). Moreover, while the Eastern Asian Onosma+Maharanga clade received a 100% BS and 1.00 PP in all ITS analyses, in the plastid data analysis this clade was supported with 80% BS and 1.00 PP (Table 3).

DISCUSSION
To date, the current study is the first to encompass not only the widest taxon coverage in the tribe Lithospermeae, but also the Eastern Asian Onosma+Maharanga+ Cystostemon clade. The results clearly show that the phylogenetic relationships within the clade are clearly dependent on changes in taxon, gene and outgroup sampling (Table 3, Fig. 2). For instance, the "total evidence with 350 taxa" analysis yielded a topology, (((Eastern Asian Onosma+Maharanga) Onosma s.s.) Cystostemon), similar to the molecular dating analysis of total evidence tree of Chacón & al. (2019). However, this topology is rejected by the AU test (p-value = -0.00133) ( Table 4, Fig. 2). On the other hand, the results of AU test overwhelmingly supported the (((Eastern Asian Onosma+Maharanga) Cystostemon) Onosma s.s.) topology (p-value = 0.999) which was recovered from the distant and closely related outgroup analyses, and the "total evidence tree with 345 taxa" analysis (Table 4, Fig. 2), a topology never reported before.
Apart from the molecular data here presented, except for the corolla morphology, Cystostemon shares great  (Cohen 2014). However, whether genus Onosma is not monophyletic, or the genus should be extended to include Maharanga and maybe Cystostemon, or the "O. rostellata and Sino-Indian Onosma +Maharanga" constitutes a different lineage are questions that remain unanswered (Cohen 2014;Chacón & al. 2019;Nasrollahi & al. 2019). The main obstacle to make a taxonomic decision is that the taxon sampling of these clades is still very limited in each of the above-mentioned studies and the present one. Any further phylogenetic analyses for taxonomic rearrangement should include more Eastern Asian Onosma, Maharanga and Cystostemon samples and perhaps more genetic regions. For instance, the genus Maharanga comprises around 10 species and Cystostemon includes 16; yet sequences from only two and four species, respectively, are currently available in GenBank. Similarly, for the Eastern Asian Onosma, except O. hookerii (seven individuals), only a few (one to three) sequences have been published. These limitations highlight the importance of taxon sampling in phylogenetic studies as have been indicated before (Zwickl & al. 2002;Heath & al. 2008).
Another unresolved question addressed here is whether the Eastern Asian Onosma species are monophyletic or not. On one hand, ITS analysis of Nasrollahi & al. (2019) and plastid/ITS analyses of Chacón & al. (2019) revealed a non-monophyletic Eastern Asian Onosma (Maharanga samples embedded in the clade); on the other hand, the ITS analysis of Cecchi & al. (2011) and molecular dating analysis of Chacón & al. (2019) with ITS+ trnL-F+rps16+trnS-trnG data yielded a monophyletic Eastern Asian Onosma clade. The results are equivocal in the present study (Table 3, Fig. 2). Eastern Asian Onosma may indeed include Maharanga, because the two genera share great morphological similarities other than corolla and anther morphology (Zhu & al. 1995); however, these results may be a consequence of sampling gaps, or the known problems associated to the ITS region, such as, concerted evolution, paralogy, gene duplication and incomplete lineage sorting (Álvarez & Wendel 2003). Although these issues have never been reported for Boraginaceae (Cecchi & al. 2011); it is still possible that ITS data problems may be responsible for these peculiar phylogenetic relationships. Similarly, as in the previous studies in tribe Lithospermeae (e.g., Cecchi & Selvi 2009;Weigend & al. 2009;Cecchi & al. 2014;Coppi & al. 2015), the results of this study have possibly indicated that the relatively low number of parsimony-informative characters of the plastid data matrix (24%) caused the phylogenetic uncertainties, and therefore, limited-plastid data may not be informative enough to solve the genus Onosma and tribe Lithospermeae phylogenetic relationships.This was already pointed out byNasrollahi & al. (2019) who suggested to use "fast evolving genes" for future phylogenetic studies of Onosma. Unfortunately the results of the present study hint that even these types of genes may not be enough to solve the phylogenetic inconsistency, due to the reason of the effect of taxon sampling on the group phylogeny and possible complex evolutionary histories of both clades (Nasrollahi & al. 2019). Besides, morphological characters may also not be helpful to answer the phylogenetic questions within the genus (i.e., similar morphologies of Onosma, Maharanga and Cystostemon), even when they are supported by molecular data. Thus, both the data and taxon sampling (i.e., both the ingroup and outgroup) hold great importance in Onosma phylogenetic analyses.
While Nasrollahi & al. (2019) reported a monophyletic position for Onosma hookeri and to date only one O. sinicum Diels sample was included in the previous studies (e.g., Chacón & al. 2019), the analyses of the current study showed that neither O. sinicum nor O. hookeri are monophyletic (Fig. 1). Indeed, misidentifications are common in Boraginales (Dr. Aslı Doğru Koca, pers. comm.) and the amount of wrong identifications in voucher specimens is reported to be very high in public molecular repositories (Nilsson & al. 2006;Wu & al. 2021). In the current study, I did not filter GenBank sequences by their reliable IDs or deposited voucher numbers. Indeed, this could be one of the reasons for the non-monophyly of these taxa. For example, a quick survey has shown that among the 746 ITS sequences included in the current study, only 489 of them (~66%) have their voucher numbers included in GenBank (results not shown). Although a voucher specimen will not guarantee a correct identification it provides scientific reliability and possibility of verification in the future (Wu & al. 2019). Therefore, future studies should be mindful about possible wrong identifications while adding Gen-Bank sequences in their datasets.