Eocene Loranthaceae pollen pushes back divergence ages for major splits in the family

Background We revisit the palaeopalynological record of Loranthaceae, using pollen ornamentation to discriminate lineages and to test molecular dating estimates for the diversification of major lineages. Methods Fossil Loranthaceae pollen from the Eocene and Oligocene are analysed and documented using scanning-electron microscopy. These fossils were associated with molecular-defined clades and used as minimum age constraints for Bayesian node dating using different topological scenarios. Results The fossil Loranthaceae pollen document the presence of at least one extant root-parasitic lineage (Nuytsieae) and two currently aerial parasitic lineages (Psittacanthinae and Loranthinae) by the end of the Eocene in the Northern Hemisphere. Phases of increased lineage diversification (late Eocene, middle Miocene) coincide with global warm phases. Discussion With the generation of molecular data becoming easier and less expensive every day, neontological research should re-focus on conserved morphologies that can be traced through the fossil record. The pollen, representing the male gametophytic generation of plants and often a taxonomic indicator, can be such a tracer. Analogously, palaeontological research should put more effort into diagnosing Cenozoic fossils with the aim of including them into modern systematic frameworks.

In the case of DQ333823, pseudogeny is less obvious in the ITS1 and ITS2: the sequence appears to be a mix of ITS variants with different levels of pseudogeny as indicated by polymorphic calls (Y and R at overall conserved C and G positions).
DQ333859 may be a chimera (or recombinant) of a non-pseudogenous and (weakly) pseudogenous ITS variants: the area of overlap between typical forward and reverse reads has been blanked out by the authors indicating problems with the direct PCR sequencing. Lacking comparative data of this genus/species, the sequence was kept as-is.
The 3' portion of the ITS2 in Notanthera heterophylla DQ333855 is entirely different from other accessions of the subtribe/tribe and markedly off-alignment. The strand portion finds no similar hits in gene banks. Based on the general distinctness of the ITS of this taxon, the 3' ITS2 may be genuine and has been kept as-is.
One of the two accessions of Taxillus (T. chinensis, JX177497) differs strongly from the other sequence and is essentially off-alignment within ITS1 and the larger part of ITS2. Subsequent MEGABLAST identifies the sequence as Viscum (same order, family Viscaceae). In case EU544368 many deviations are indicative for pseudogeny (G→A, C→T) in addition to deviations that may be linked to bad sequence reads/editing artefacts (e.g. AATTA at pos. [485][486][487][488][489] instead of the consensual ATTGA). The sequence was excluded from analysis. EU544375 (Helicanthes elastic; Nickrent 4906) represents a unique 25S rDNA fragment, with best BLAST hits of 92-93% identity covering a wide range of taxa and clades (Solanales: Schizanthus, Apiales: Bolax; Saxifragales: Saxifraga, Telesonix; Ranuculales: Borax; etc.); top hits did not include members of Loranthaceae. The sequence was tentatively excluded from analyses as putative contaminant. The 18S sequence from the same voucher is equally problematic.
The sequence was excluded from analysis as showing tendency towards pseudogeny.
The same, but to a lesser degree applies to accessions EU544368 (Diplatia furcata, Nickrent 2824) and EU544382 (Loxanthera speciosa, Nickrent 4026). The latter was tentatively kept for analysis due to the importance of the taxon for the phylogenetic framework.
Accession EU5443779 (Nickrent 4332, labelled as Ixocactus sp. [= Phthirusa]) is composed of two strands. The first strand (170 nt from the 5' end) finds best BLAST-hits among the Loranthaceae, but lower identity than usually found in when BLASTing a member of the family. Then comes a sequencing gap and the following remaining c. 1200 nt find highest identities (>90%; obtained with MEGABLAST) only with members of other families of the Santalales (Osyris, Santalaceae; Haloragis, Halogaraceae) and unrelated singletons such as Eucnide bartonioides (JF321124), Loasa vulcanica (JF321125; Cornales), Fendlera rupicola (AY260041), Dryas octopetala (JF317384; Rosales), and Eucryphia lucida (AF036494, Oxalidales). Based on this result it is uncertain whether the sequence is genuine or a mislabelling/artificial chimera; hence, it was tentatively excluded from all analyses. Accession EU544401 differs from the general sequences types within members of its clade and Loranthaceae in general by potential pseudogene-like mutations in addition to sequence deviations that may be linked to sequencing/editing artefacts. Because of its overall dissimilarity to other Loranthaceae sequences in family diagnostic regions, MEGABLAST (performed 9/2/2016) failed to retrieve any Loranthaceae 25S within the listed hits. Note that the pollen of one species formerly included in Ixocactus (Phthirusa hutchisonii) does not confirm with morphotypes typical for Loranthaceae (Grímsson, Grimm & Zetter 2017) Plastid matK gene (Fig. S1-3) General-Data includes up-to-near-complete matK sequences starting with the first codon in three accessions (stop codon seems to be not covered in any accession). More than 3-taxon coverage starts at the 27 th codon (alignment was accordingly truncated). Length-polymorphism is restricted to singleton duplications or eliminations.
Curation-In regions with duplications/eliminations, the general alignment was adapted for codon positions when necessary. Defect (due to missing single-nucleotides) and incomplete codons were blanked (replaced by missing data symbol, "?").
Problematic data-The 5' part of accession EU544409 (Aetanthus nodosus) is markedly different from other members of its tribe and the entire family. MEGABLAST revealed that the first c. 260 nt are highly identical to accession of several species of Acacia (Fabales). The remainder of the sequence falls within the variation of the Loranthaceae. Thus, the accession is a (artificial) chimera of contaminated material and unverified Aetanthus (no comparative data available of the genus), and has to be excluded from analysis.
Odd placements-The trnLLF region is the only gene region included in the final dataset, where genera (to some degree: species) are usually represented by more than a single accession. The single-gene tree reveals several issues with the currently available sequence data:  Curating-Alignment of rbcL is straightforward (generally no length polymorphism). The region is generally conserved at the genus/subtribe level, which facilitates the identification of putative mislabelled sequences. The data includes one unidentified Loranthacea rbcL, which was not considered for further analyses.
Problematic data-Three sequences of Taxillus chinensis (JF949992; JN687568; KF447376) differ markedly from the other eight sequences of the species and other members of the genus/tribe (visible from pos. 250 onwards). MEGABLAST identified JF949992 as Elytrantheae/Gaiadendreae (99% identity; verified on the alignment) and JN687568 as Viscaceae, another family of the Santalales (most similar to Viscum, 97% identity; no Loranthaceae among the top hits). The sequences either represent mislabelled or misidentified specimens. KF447376 received no significant hit with identity >93% (best hits with Scurulla, the sister genus of Taxillus, generic affinity verified at hand of the full alignment).
The sequence represents either a bad sequence read or a strongly aberrant rbcL gene (potential pseudogene). All three sequences were excluded from analyses.
A similar sequence type than in JN687568 is found in the only accession of Oryctanthus cordifolius (JQ592409), differing from the otherwise near-identical sequence of Psittacanthinae. MEGABLAST identifies the sequence accordingly as Viscaceae rbcL (Phoradendron; 98% identity). The sequence is either mislabelled or Oryctanthus is not a Loranthaceae and has been omitted.
Another problematic sequence is the one representing Macrosolen brandisianus (JN687566) which differs increasingly towards the 3' end from other accessions of the genus/tribe. Deviation is usually related to same-nucleotide repeats (e.g. GGTGGT instead GTGGGT in all other Lorantheae at pos. 349ff; AATTTG instead AAATTG as pos. 460ff). The problematic sequence part encompasses 50% of the total sequenced strand (c. 500 nt), the first half of the sequence is identical to other Lorantheae. The sequence was tentatively excluded from analyses.
Odd placements-The rbcL data includes genuine sequences from phylogenetic studies as well as short(er) sequences from barcoding studies. The latter are often problematic, mislabelled accessions are not uncommon (see Problematic data). Moreover, and contrasting statements in many barcoding papers, there is so far no conclusive evidence that rbcL can be used as barcode at low hierarchical levels in plants. Reported high species-identification levels with rbcL fragments in barcoding studies are simply due to incomprehensive species sampling of large genera (G. Grimm, pers. obs.) The currently available rbcL data on Loranthaceae supports this observation: the resolution of the rbcL tree is generally poor.
Identical sequences can be shared by species of the same genus, sometimes between genera. Based on the current, relatively taxon-limited data it is not possible to distinguish between potentially representative vs. mislabelled accession in critical cases (Dendrophthoe, Helixanthera, Struthanthus, Taxillus). With respect to this taxonomic uncertainty, the generally weak signal and the limited species coverage (23 species with genuine rbcL data compared to 44/45 covered by matK and trnLF) of the rbcL data, we excluded this gene region from further analyses.

Set-up for phylogenetic analyses and dating
Phylogenetic tree inferences and branch support analyses relied on maximum likelihood (ML) as the optimality criterion, branch support was estimated using non-parametric bootstrapping. All analyses, files can be found in subfolder ML in OSA, were done using RAxML v.8.2.4 (Stamatakis 2014).
RAxML was invoked using custom shell files (included in subsequent subfolders in the OSA). All inferences used the -f a option that performs a (in our case, fast; option -x) bootstrapping analyses with subsequent inference of the 'best-known' ML tree under a general-time-reversible substitution model allowing for site rate variation, modelled using a gamma distribution (GTR+ model) in the final optimisation step. During bootstrapping and the first (fast) tree inference step, the per-site approximation for the gamma distribution was used (option -GTRCAT; Stamatakis 2006). Number of necessary bootstrap replicates were determined by the extended majority rule consensus criterion (Pattengale et al. 2009), bootstrap analyses were automatically cut off when reaching a maximum of 1000 replicates.
We use a seven-step protocol Step 1: Preanalysis-Using the initial alignments (not consensed, not concatenated gene bank accessions), we inferred single-gene trees for five of the six harvested regions: 18S, 25S, matK, trnLLF, rbcL. This step allowed us to further check for problematic accessions not directly visible or overlooked during the visual inspection of the alignments and to assess the general level of discriminative signal in the single gene regions (see Processing).
Step 2: Consensing-Modal and strict species-consensus sequences were computed from the initial alignments using the programme G2CEF (Göker & Grimm 2008); gaps were treated as missing data or fifth state (batch and files are included in the subfolder Step2_Consensi in this Online Supplementary Archive, OSA). For further analyses the strict consensus sequences with gaps treated as missing were used.
Step 3: Inference of comprehensive species trees-Using the species-consensus alignments, we produced three concatenated matrices: (1) including all four gene regions (18S, 25S, matK, trnLLF), (2) a nuclear-only matrix including 18S and 25S rDNA, and (3) a plastid-only matrix including the matK gene and the trnLLF region. ML analyses were run partitioned and unpartitioned, including or excluding the relatively gappy, length-polymorphic trnL-trnF spacer (i.e. analyses used five matrices with different gene coverage). The partitioning scheme was as follows: • 18S rDNA and 25S rDNA treated as independent partitions • First, second and third codon position of matK gene treated as independent partitions • Intron and intergenic spacer treated as independents partitions, exons (trnL 3' exon, trnF gene) included in the same partition.
Step 4: Clock rooting-Data sets including selected outgroups (Wilson & Calvin 2006;Vidal-Russell & Nickrent 2008) or representatives of all/most Santalales families (e.g. Su et al. 2015; see files provided in the OSA to Grímsson, Grimm & Zetter 2017) failed to produce unambiguous support for the exact position of the Loranthaceae root, although trees based on combined data usually will recognise Nuytsia as the first diverging branch in the Loranthaceae. All potential outgroups are (extremely) longbranched and Nuytsia is one of the most distinct Loranthaceae. Thus, ingroup-outgroup long-branch attraction may be inevitable (as detailed by GWG in Grímsson, Grimm & Zetter 2017, file S6). Hence, a clock-rooting (cf. Renner et al. 2008), was tried using the five matrices from step three. For each of the matrices we performed a BEAST (v. 1.8.2; Drummond & Rambaut 2007;Drummond et al. 2012) run under partition specific substitution models, unconstrained tree topology, a Yule tree prior and uncorrelated log-normal clock prior [ucld.mean ~Gamma(0.001, 1000)]. The best fitting substitution models per partition, among the available in BEAST, were selected with JMODELTEST (Darriba et al. 2012). The selected model for most partitions (rDNAs, matK codon positions, trnL-trnF spacer) was GTR+ whereas HKY+was suggested for trnL intron and HKY+I for the tRNA. Each BEAST analysis was conducted for 2*10 7 generations with a sampling frequency of 0.001. To explore the effect of missing data, we performed the clock-rooting also with the reduced dataset of 42 species covering the four best-sampled and alignable gene regions (18S, 25S, matK and trnLLF) that was used for the final dating step (Step 6). The rbcL gene was excluded since it was represented only by a minority of the samples. The analysis was run under the same priors as before.
Step 5: Inference of preliminary dated phylogenies including all data-Using the two allcomprehensive matrix from Steps 3 and 4, i.e. the matrix including all four gene regions (18S, 25S, matK, trnLLF), dated phylogenies were inferred for three different rooting scenarios: traditional (following earlier studies) outgroup-based root as informed by Nuytsia, the same root is found via clockrooting of the reduced 42-taxon, 4-gene region dataset; the clock-informed root based on the all-taxon datasets recognising two main clades: Lorantheae vs. Psittacantheae+root parasites; and a pollen-typeinformed root recognising Tupeia as sister to all other Loranthaceae (cf. Grímsson, Grimm & Zetter 2017). Table S1-1 lists the used age constraints including short discussions (see also subsections Use as age constraint included in the pollen descriptions in the main-text). Dating was performed with BEAST v. 1.8.2 (Drummond & Rambaut 2007;Drummond et al. 2012). Bayesian optimisation used four data partitions with the following nucleotide substitution models: GTR+I for 18S and GTR+ for 25S, matK, and trnLLF, respectively. The Monte Carlo Marko chains (MCMC) were run twice for 5*10 7 million generations under an uncorrelated lognormal clock model. All age calibration priors were modelled as normal distributions around the midpoint of the known time intervals. For rooting scenario 2, the analysis of the comprehensive matrix did not converge after the set number of generations, hence, no result tree is included in the according subfolder. For the other two rooting scenarios, the high gappyness of the comprehensive matrix caused similar problems (relatively low ESS for several parameters, late convergence of runs). For this reason, the final dating analysis used a reduced species set, only including those members of each lineage that were best-covered by data for all four gene regions. Step 6: Final inference of dated phylogenies-To compensate for eventual missing data artefacts on the estimates, we performed the same set of analyses with a taxon-reduced data set, a data set that only included 42 species with data for all five gene regions. The general set-up (priors etc.) was the same as for the preceding step, further details can be extracted from the xml-files provided in the according subfolder in the OSA.
At this step, we performed an additional analysis (Scenario 4, Table S1 Notanthera in all optimised topologies, apparently a "wrong" placement). Following the topology of Su et al. (2015), we constrained the Elytrantheae as sister to the Lorantheae.

Step 7: Post-analyses test (comparison of Bayes factors)
For each of the four topological configurations we estimated the marginal likelihood with two approaches, stepping-stone and path-sampling, both of which are implemented in BEAST (Baele et al. 2012;Baele et al. 2013), with 100 path steps with 20*10 7 chain length for each. For selecting the rooting scenario that best fit our data, we compared the marginal likelihood estimate of each run using Bayes factors (Kass & Raftery 1995).

Documentation
All analysis files are included in the OSA in corresponding subfolders. See ReadMe.txt in top folder for file labels and brief description of content. The OSA is mirrored at www.palaeogrimm.org/data; if you have questions regarding the data, files, and inferences contact GWG (mail-to: loranths@palaeogrimm.org).

Figures (appended at the end of the file)
Single-gene trees based on the harvested, unconsensed/-concatenated sequence data and optimised under ML; visibly problematic -as directly deduced from the alignment -sequences not included.
Branches, subtrees, etc. referring to major groups (tribes/subtribes) of Loranthaceae are coloured accordingly; branch thickness indicates bootstrap support under ML (see in-figure legends). Highlighted by red font and in bold, mislabelled accessions detected via the tree inference and excluded from subsequent analyses; accessions highlighted by orange font (bold) may be problematic but tentatively kept due to the unavailability of comparative data.

Helixanthera
Other two Psittacanthinae nested in non-monophyletic'