Resolving the backbone of the Brassicaceae phylogeny for investigating trait diversity

signatures that distinguish simple from complex leaves in Brassicaceae. (cid:1) Our study provides an easily extendable dataset for further advances in Brassicaceae systematics and a timely higher-level phylogenetic framework for a wide range of comparative studies of multiple traits in an intensively investigated group of plants.


Introduction
Comparative biology relies on a firm phylogenetic framework to extend the mechanistic insights derived from a handful of model organisms to the broad diversity of life. In plants, research on the model Arabidopsis thaliana (L.) Heynh broadly informs our understanding of development, physiology, secondary metabolism and plant-microbe interactions, as well as natural variation of these processes (Kr€ amer, 2015;Provart et al., 2016). Arabidopsis thaliana belongs to the diverse and economically important family Brassicaceae, which includes c. 4000 species distributed across a wide range of habitats around the globe, important crop plants like cabbage, rapeseed and mustard domesticated for food and biofuel, ornamentals, and invasive weeds (Appel & Al-Shehbaz, 2003;Franzke et al., 2010;Kiefer et al., 2014). The family generally features small genomes, which enabled the sequencing of the first plant genome and the highest number of genome sequences for any plant lineage to date (Arabidopsis Genome Initiative, 2000; Koenig & Weigel, 2015). The wealth of genetic and genomic resources coupled with broad trait diversity and ecological adaptations make the Brassicaceae an attractive system for addressing important biological questions, such as genome and chromosome evolution, the evolution of form, adaptation to environmental change, crop domestication and adaptive physiology, plant-animal and plant-microbe interactions, and metabolic diversity (Appel & Al-Shehbaz, 2003;Franzke et al., 2010;Koenig & Weigel, 2015;Kr€ amer, 2015;Nikolov & Tsiantis, 2017).
Recent progress in the systematics of Brassicaceae has assigned most of the species to 52 monophyletic groupings (tribes) (Bailey et al., 2006;Warwick et al., 2010;Al-Shehbaz, 2012) in three major lineages (I, II and III) (Beilstein et al., 2006(Beilstein et al., , 2008, or six clades (A-F) (Huang et al., 2015). The most comprehensive phylogenomic study of nuclear markers of Brassicaceae derives from 32 transcriptome samples representing 29 of the 52 tribes (Huang et al., 2015). These analyses have resulted in the development of a draft tribal classification that provides a comprehensive catalog of the independent lineages in the family. Despite these efforts, relationships along the backbone of the phylogeny and among tribes remain largely unresolved. Moreover, 11 genera are not yet assigned to a tribe (Kiefer et al., 2014). It has been argued that this lack of resolution reflects early rapid radiation, ancient and recent polyploidy, and hybridization across species, genera or even distant lineages that resulted in the lack of phylogenetically informative sequence variation (Franzke et al., 2010;Huang et al., 2015). Expanded genomic and taxonomic coverage are necessary to test these scenarios, and to minimize phylogenetic error related to insufficient character and taxon sampling (Heath et al., 2008) to resolve the relationships among mustard lineages.
The Brassicaceae exhibit considerable morphological diversity, especially in leaf, fruit and trichome characters (Appel & Al-Shehbaz, 2003;Koenig & Weigel, 2015;Nikolov & Tsiantis, 2017). Leaf shape is a model trait that has received significant attention in studies of the genetic basis of morphological change (Bar & Ori, 2015). The sister genus to the rest of Brassicaceae, Aethionema, has simple leaves with entire margins (Mohammadin et al., 2017). The rest of the family features simple leaves with entire margins and minor serrations, or more complex lobed or dissected leaves. Lack of a robust Brassicaceae phylogeny has limited the understanding of the distribution of and the transitions between leaf character states, and it is not clear how leaf complexity has evolved across the family. Comparative genetics has revealed several key developmental regulators of leaf complexity in model Brassicaceae (Vlad et al., 2014;Rast-Somssich et al., 2015). Tracing the evolutionary history of the molecular players that underpin these developmental events is of key importance for understanding how leaf form develops and diversifies, but the necessary phylogenetic framework is currently lacking.
Targeted sequence capture has emerged as a powerful and cost-effective method to produce genome-scale data that facilitates orthologous gene comparisons for many species (Hedtke et al., 2013;Lemmon & Lemmon, 2013;Buddenhagen et al., 2016;Johnson et al., 2018) that has not previously been employed in Brassicaceae phylogenetics. By contrast to other methods for genome reduction, such as transcriptome and restriction site-associated DNA sequencing, targeted sequence capture performs well with degraded archival samples that may be the only available source of material for rare or geographically isolated taxa. Targeting single-copy nuclear genes provides a promising way to generate abundant phylogenetically informative data while minimizing potential issues arising from complex gene lineage evolution (Lemmon & Lemmon, 2013). This approach has been used extensively to resolve difficult phylogenetic relationships in plants, including sunflowers (Mandel et al., 2014), milkweeds (Weitemier et al., 2014), sages (Fragoso-Mart ınez et al., 2017), legumes (Vatanparast et al., 2018), Dutchman's pipes (Wanke et al., 2017) and breadfruit (Johnson et al., 2016), and is widely used in metazoan phylogenomics (e.g. Faircloth et al., 2015;Prum et al., 2016;Alfaro et al., 2018;Espeland et al., 2018).
Here, we reconstruct the evolutionary history of Brassicaceae with high support using a phylogenomic dataset of 1827 target captured exons from 63 species, one-third of which were derived from otherwise difficult to sample herbarium material. When combined with data acquired from previously published genomes of 16 species, our sampling represents 50 of the currently recognized 52 tribes and 16 taxa unassigned to a tribe. We provide a phylogenetic hypothesis for Brassicaceae, test the robustness of our estimates using novel dataset partitioning and taxon selection schemes, and uncover previously unidentified clades. We use this robust phylogenetic framework to study the evolution of leaf shape in the family through comparative transcriptome analysis, which identified a core set of genes under selection at the gene expression and at the protein level associated with the evolution and development of leaf complexity in the Brassicaceae.

Materials and Methods
Information on Plant Material (Supporting Information  Table S1), Library Preparation and Sequencing, and Obtaining corresponding exons from sequenced genomes (Tables S2, S3) is included in Methods S1.

Probe design
In order to generate targets for probe design with improved capture efficiency and enhanced phylogenetic utility throughout the Brassicaceae phylogeny, the genomes of Arabidopis thaliana, Sisymbrium irio and Aethionema arabicum were used to identify a set of putative single-copy orthologous genes by compiling the complete set of coding sequences from each genome and identifying single reciprocal hits using BLAT v.32x1 (Kent, 2002). Only coding sequences with length ≥ 960 bp, ≥ 85% sequence identity to the first best hit from A. thaliana and second best hit with sequence identity ≤ 40% were retained. The putative single-copy orthologous genes were split into their corresponding exons. The target set included exons > 180 bp to allow for sufficiently long sequences for probe design, 30% < CG content < 70% to improve in-solution hybridization, and no sequence similarity to annotated transposable elements or organellar sequences to avoid enrichment of nonrelevant targets. Approximately 40 000 unique enrichment probes (baits) of biotinylated RNA 120-mers were designed and synthesized at MYcroarray (Ann Arbor, MI, USA) with a 60-base overlap (29 tilling) between baits.
Sequencing data processing for phylogenomics Please see Fig. S1 for phylogenetic reconstruction workflow. Raw reads were adaptor and quality trimmed in TRIMMOMATIC v.0.33 (Bolger et al., 2014) with the following parameters: illuminaclip: TruSeq3-PE2.fa leading:20 trailing:20 slidingwindow:5:20 minlen:36, followed by deduplication with fastx_collapser in FASTX-Toolkit (Hannon Lab, CSHL, Cold Spring Harbor, New York, NY, USA). All reads were pooled together irrespective of direction. Reference-based mapping followed by de novo extension using the unmapped reads were performed with the hybrid assembler YASRA (Ratan, 2009) as incorporated in the Alignreads pipeline (Straub et al., 2011), which accommodates highsequence divergence between reads and reference, with percentage identity to reference set to 'medium' and reads aligned in a single step without iteration. To minimize sequencing errors and alleviate complications arising from the putative polyploid nature of some species, heterozygosity, and the inclusion of potentially mixed individuals from herbarium sheets, SNP calling was performed with SAMTOOLS (Li et al., 2009) andVARSCAN v.2.4.1 (Koboldt et al., 2012) with the following parameters: minimum coverage per position 5, minimum frequency of observed allele 0.6, P-value 0.1, minimum number of reads supporting position 5; if these criteria were not met, N was called. Contig identity was assigned with BLAT v.35 using translated DNA against the respective exon reference sets, selecting the highest scoring hit, and contigs with score > 20 and percentage identity > 75% were retained. The contigs corresponding to each target exon derived from each of the three references were aligned together in MAFFT v.6.851b (Katoh et al., 2002) using [-maxiterate 1000 -genafpair] to call a consensus sequence for each exon; if coverage was absent, N was inserted, otherwise, majority rule applied (Table S4). Transcriptome-derived reads from Turritis glabra were processed similarly.
Exons from sequenced genomes were added to the assembled capture-derived exons and aligned in MAFFT v.6.851b according to the references, which were placed in frame. Alignments were trimmed at the border of the A. thaliana reference to remove overflowing noncoding sequence. Realignment by coding frame was performed in MACSE v.1.02 (Ranwez et al., 2011), and trimmed to remove entire codon positions if internal stop codon indicative of misalignment was present in any of the species (a total of 1205 internal stop codons for the entire alignment of 16 million codons), or if a codon position was too diverse (most prevalent amino acid identical for < 30% of the taxa). Positions with > 20% ambiguous amino acids resulting from unidentified nucleotides (Ns) were removed. Final sequences shorter than 35% of unambiguous nucleotide positions based on the reference exon length were removed. For plastome assembly, we utilized the FASTPLAST pipeline (McKain, 2017), which combines reference-guided and de novo assembly to generate plastid genomes using off-target organellar reads (genome skimming) (Table S5). No plastid assemblies with contigs long enough to meet our quality cut-offs were produced for Cremolobus peruvianus and Lunaria rediviva, or Turritis glabra (this likely resulted from polyA enrichment in transcriptome sequencing). Plastid genomes were annotated with GESEQ (https://chlorobox.mpimp-golm. mpg.de/geseq.html). Plastid protein-coding genes were processed similarly to the nuclear dataset to obtain multiple sequence alignments.

Exon-pruning and phylogenetic-signal analyses
Because some ingroup relationships can be sensitive to the choice of an outgroup, we constructed two matrices with different outgroup composition and shared exon sets (1101 exons)one with Carica papaya and Tarrenaya hassleriana, and another including only T. hassleriana. We used PARTITIONFINDER v.2.0 (Lanfear et al., 2014) and RAXML v.8.2.9 (Stamatakis, 2014 to reconstruct maximum likelihood trees after concatenation. Rooting with C. papaya for the more inclusive dataset, and T. hassleriana for the other dataset demonstrated that different outgroup schemes result in identical ingroup relationships for Brassicaceae. Therefore, to maximize the number of shared exons, we only included T. hassleriana as the outgroup in our in-depth analyses. We estimated the maximum-likelihood (ML) gene trees for each exon using the fast algorithm in RAXML with GTRGAMMA model of nucleotide substitution, where each exon has three partitions (for the 1 st , 2 nd and 3 rd positions of a codon), and 100 rapid bootstrap replicates. To reduce sequence biases and assess potential sources of misleading signal, we calculated and excluded the most extreme outliers in each of seven metrics: upper quartile of long-branch score (L; 20 exons), standard deviation of the long branch score (36 exons), average patristic differences (33 exons), and R 2 of the saturation score (S; 103 exons) and saturation slope (24 exons) using TRESPEX v.1.1 (Struck, 2014). We further calculated Matching Splits (M; excluded 26 exons) and Robinson-Foulds (11 exons) tree distances for individual exons as implemented in TREECMP v.1.0 (Bogdanowicz et al., 2012) and excluded outliers resulting from aberrant modes of molecular evolution or incorrect paralog assignment that could influence the combined analyses. To explore the effects of dataset partitioning and to test the stability of our results based on exon inclusion/exclusion, we employed two cut-offs for the S, L and M metrics to partition the resulting set of exons into three increasingly exclusive subsets were for each metric. These 27 unique exon sets were named according to the combination of metric intervals, such that the most inclusive dataset is S1L1M1 and the most exclusive dataset is S3L3M3 (Table S6). These unique exon sets were used to infer phylogenies via concatenation-and coalescence-based methods. The distances between trees were visualized with density plots of tree distances and multidimensional scaling plots in R.

Phylogenetic reconstruction
Maximum likelihood analyses of the concatenated 27 matrices were performed in RAXML v.8.2.9 after PARTITIONFINDER v.2.0 to find the computationally most efficient model of evolution that minimizes overall model complexity and accurately accounts for substitution processes (Table S6).
In order to evaluate the phylogenetic signal of each exon in a given phylogenetic matrix, we separately optimized the model of evolution with constrained tree topology to each of the eight unique topologies inferred previously in the maximum-likelihood (ML) analysis. The log-likelihood score of each exon at a given topology, log e L (Exon |Topology i) was computed as the sum of the log-likelihood scores of all sites in the exon log e LðExon jTopology iÞ ¼ X a2Exon log e LðSite ajTopology iÞ: Using the log-likelihood scores of the exons for all eight unique topologies, the phylogenetic signal of each exon in the given phylogenetic matrix was estimated as the sum of the absolute pairwise differences of the exon log-likelihood scores of the unique tree topologies log e L Exon jTopology j ð Þ j : (DGLS, phylogenetic signal of the exon; N pairs , number of unique tree topology pairs; i and j run over the set of unique tree topologies). This formula extends the definition of phylogenetic  Shen et al. (2017) to any number of tree topologies. The phylogenetic signals of the exons were computed separately for each of the eight most-inclusive phylogenetic matrices that generated the eight unique tree topologies (S1L1M1, S1L2M2, S2L2M1, S2L2M2, S1L1M3, S2L2M3, S3L1M1, S3L3M1). Exons that disproportionately contribute to the phylogenetic signal of a given phylogenetic matrix (exons with phylogenetic signal in the matrix > 10) were excluded from the matrix and ML tree estimations were performed in RAXML with the filtered datasets. The resulting topologies were renamed S1L1M1R, S1L2M2R, S2L2M1R, S2L2M2R, S1L1M3R, S2L2M3R, S3L1M1R and S3L3M1R.
Plastid-coding gene phylogenies were estimated in RAXML from the partial or complete plastid genomes obtained from offtarget reads and plastid sequences obtained from whole plastomes from GenBank (Table S3).
Because the supermatrix approaches can fail to fully account for the influence of conflicting gene-tree signal due to processes such as incomplete lineage sorting, we estimated coalescent species trees with the gene tree summation method for each of the 27 exon sets in ASTRAL-II v.4.10.12 (currently the only method that can analyze a dataset of our scale under the multispecies coalescent model).
Quartet-based computations of internode certainty (LQ-IC) as a measure of phylogenetic incongruence were calculated according to Zhou et al., 2017 with the S1L1M1R topology as the reference tree and sets of evaluation trees, corresponding to all bootstrap replicates from the RAXML analyses with (27 9 100 = 2700 trees) and without (8 9 100 = 800) disproportionate contributors from the phylogenetic informativeness analysis. To compute LQ-IC scores for the coalescent analyses, the 100 bootstrap replicates of each gene tree in a given dataset were used to compile 100 sets of tree replicates, where each exon is represented. These 100 sets were used to compute 100 trees under the coalescent model. This procedure was done separately for each of the 27 datasets to produce a total of 2700 evaluation trees.
Phylogenetic hypothesis testing (AU test) was performed in CONSEL v.1.20 (Shimodaira & Hasegawa, 2001) to test the statistical significance of topological differences between trees in the ML analyses and the ML analyses after excluding disproportionate contributors to the phylogenetic signal.

Comparative transcriptomics and identifying shifts in gene expression
De novo transcriptome assembly was performed with TRINITY v.2.4.0 (Grabherr et al., 2011) using default parameters after combining data from all three biological replicates for each species. Coding sequences within contigs were identified with Transdecoder (http://transdecoder.github.io/), and all open reading frames with homology to the A. thaliana proteome (GenBank build UP000006548_3702) longer than 100 amino acids were used in subsequent analyses. Considering only the longest isoform of each gene, OrthoFinder (Emms & Kelly, 2015) was used to identify orthogroups consisting of a single gene for all eight species. Expression values were calculated for each triplicate with RSEM (Li & Dewey, 2011) in Trinity, using the species-specific transcriptome assemblies. Gene expression values for each species independently were expressed as transcripts per kilobase million (TPM; the most readily comparable measurement between species), and log 2 -transformed after adding 0.0001 to each value to avoid -inf errors. Between-sample normalization of expression values was performed with the package R/POISSONSEQ, which implements the method of Li et al. (2012). We built a NJ tree with the R/APE package (Paradis et al., 2004) using as input a correlation matrix of the normalized expression values of all samples for the 3188 orthologous genes identified, to demonstrate that all replicates of each species clustered together, as expected.
Genes that deviate from background gene expression in the core Brassicaceae or in the complex-leaved species were identified in the context of the Ornstein-Uhlenbeck process modeling framework (Butler & King, 2004;Rohlfs et al., 2014) after excluding genes exhibiting high-expression variability in at least one species (SD > 0.59mean; 846 genes). The framework takes into account phylogenetic information (tree topology including only the species with the newly generated transcriptomes and branch lengths calculated from the in-frame alignment of 1421 genes) to fit alternative models to the expression of each of the resulting 2342 genes using the average of the normalized, log 2transformed expression values for each species. Directional expression denotes a deviation from an optimal expression level, which accommodates phylogenetic relationships and drift, and does not specify the actual direction of the change. Two sets of tests were performed: one based on phylogeny alone comparing a null model of a single expression value for all species with models that specify distinct expression levels separately for the core Brassicaceae, and a second set of tests based on leaf morphology (simple/complex). Alternative models (H0a, uniform expression along all branches of the tree; H1a, shift in expression in the core Brassicaceae; H0b, uniform expression along all branches of the tree; and H1b, shift in expression along the branches leading to the complex-leaved species, where the character state of internal nodes is not specified) were compared using likelihood score and likelihood ratio chi-square test with one degree of freedom and FDR-corrected P-values to identify genes where a model with two optima fits the data better than a model with a single optimum for all lineages.

Detecting positive selection on the protein sequence
In order to estimate the frequency of positive selection acting on coding sequences, codon-based multiple species alignments for all orthogroups were used to fit alternative phylogenetic models of evolution of nonsynonymous to synonymous substitution rate (x = dN/dS). Codons with > 60% missing data and sequences with > 60% missing codons were excluded (nine genes). A further 32 genes were excluded because they showed very large synonymous divergence indicative of alignment errors or aberrant molecular evolution (dS > 2.5). Individual ML gene trees estimated in phyml (Guindon & Gascuel, 2003) were compared to the species tree using the Shimodaira-Hasegawa test (Shimodaira & Hasegawa, 1999) Hasegawa, 2001) to exclude genes exhibiting significant phylogenetic conflict, that is, genes that preferred the individual gene tree over the species tree (18.6% of the initial 3188 genes). The remaining 2554 genes were used to fit alternative models with 'Branch-site test of positive selection', which allows x to vary among sites of the gene and branches of the phylogeny in codeml/PAML v.4.8 (Yang, 2007). The test identifies sites that are evolving neutrally or under negative selection in part of the tree (background) but exhibit a shift towards positive selection along branches of interest (foreground). We performed two sets of tests, one based on phylogeny and another based on leaf morphology, using a tree including only the taxa with newly generated transcriptomes. Significance was tested with likelihood ratio test with two degrees of freedom.

Data availability
Raw sequence read data are available from NCBI Short Read Archive (PRJNA518905).

Data generation
We designed the first exome targeted enrichment probe set for Brassicaceae based on single-copy nuclear markers derived from three reference genomes, Arabidopsis thaliana, Sisymbrium irio, and Aethionema arabicum, and targeted 1827 exons of average size 516 bp (range 180-6059 bp) from 764 genes, representing all Brassicaceae linkage groups (Fig. 1a,b). The focus on singlecopy genes aimed to reduce issues of paralogy. The exons were selected for size and nucleotide composition to maximize sequence capture and phylogenetic utility, and their collective length measured on the longest alignment of the three references was 942 066 bp, cumulatively representing c. 1.5% of the exome of A. thaliana. We collected novel exome-targeted enrichment data for 63 species (Methods S1; Fig. 1c). To break up long branches, we included 16 taxa with controversial or ambiguous phylogenetic position that may represent independent evolutionary lineages. One third of the samples were derived from herbarium material, which produced high-quality data despite limited input DNA quality and quantity. Our complete dataset includes 79 Brassicaceae species representing all currently recognized lineages of Brassicaceae except two recently described tribes (Hillielleae (Chen et al., 2016) and Shehbazieae (German & Friesen, 2014)). Nearly all targeted regions were captured for all species with average coverage c.1009, resulting in a dataset with few missing data per terminal (Fig. 2a,b; Table S4). Mapping to each of the three references resulted in comparable exon recovery from the target species, suggesting that phylogenetic distance was not a significant factor for capture success in Brassicaceae. At the level of mapped reads, we observed polymorphisms that could reflect allelic variation (heterozygosity), copy number variation (reads from paralogous sequences) and the sampling of multiple individuals (from herbarium material) in all samples. Because the lack of synteny data precludes differentiating among these scenarios, we masked these positions using conservative criteria for base calling in the initial contig assembly to minimize the effect of this variation on tree estimation.
In order to assess whether our targeted loci can be recovered from transcriptome data without target enrichment to enable direct comparisons with previous phylotranscriptomic studies, we sequenced the transcriptome of Turritis glabra seedlings. We analyzed this transcriptome similarly to the targeted sequence data, which resulted in near complete (1781 of 1827) exon recovery, demonstrating the potential for future merging of our dataset with similarly generated targeted enrichment datasets and phylotranscriptomic datasets. Although we did not specifically target plastid sequences, we were able to assemble partial or complete (Arabis ottonis-schulzii, Alyssopsis mollis, Bunias erucago, Dipoma iberideum, Kernera saxatilis, Murbeckiella pinnatifida, Pseudofortuynia esfandiarii, Subularia aquatica and Stanleya elata) plastid genomes from off-target capture reads (genome skimming) for all but three species ( Fig. S2; Table S5), enabling phylogenetic comparison of both biparentally-inherited nuclear and maternally-inherited plastid genomes. The extensive sequencing information we obtained also allowed for dataset partitioning to evaluate the contributions of different loci to the phylogenetic signal.

Phylogenomic analyses
Because phylogenomic datasets capture the unique evolutionary history of many genomic loci, we investigated relationships using concatenation and coalescence-based (ASTRAL-II) species tree approaches, and characterized potential systematic bias via three metrics often used to detect misleading signal in phylogenetic reconstructions: evolutionary rate heterogeneity (L) that may result in long-branch attraction; sequence saturation (S) that obscures phylogenetic signal; and distance between gene trees (M) generated from individual exons that may indicate hidden paralogy resulting from complex gene-lineage evolution or polyploidization (Fig. S3). Excluding the most extreme outliers for the three metrics (20 exons of 18 genes for L metric, 103 exons of 96 genes for S metric, and 26 exons of 25 genes for M metric; some exons are excluded by more than one metric) and loci not recovered from the genome of the outgroup Tarenaya hassleriana (79 exons of 25 genes) resulted in 1540 exons from 673 genes. To assess the robustness of phylogenetic estimates based on locus selection under different partitioning schemes, we concatenated the resulting exons into 27 matrices based on combinations of metric cut-off values (Dataset S1; Fig. S3; Table S6) and conducted maximum-likelihood (ML) analyses under the best-fitting models of evolution. These estimates resulted in species trees well supported along the backbone that produced eight unique topologies, which differed by the relationship between Idahoa and Subularia, the branching order of Cochlearia, Conringia orientalis and relatives (Conringia clade), and Kernera + Petrocalis in respect to the rest of the clade, and the placements of Megacarpaea and Biscutelleae + (Lobularia + Iberis) (Dataset S2; Fig. S4). Different phylogenetic matrices could not reject all alternative topologies based on the AU test, indicating that some Brassicaceae phylogeny before the current study (phylogeny based on multigene analyses at the intertribal level, e.g. Huang et al., 2015). One third of the samples were obtained from herbarium material (brown). Species with sequenced genomes are marked with a light blue dot; taxa previously unassigned to a tribe are marked with a dark blue dot. See Supporting Information Tables S1, S2 for species abbreviations and tribal assignments.  (Table S7). Concatenation averages out phylogenetic signal from numerous loci to estimate the species phylogeny but it has been demonstrated that genes with strong phylogenetic signal may bias these estimates (Shen et al., 2017). To study the effect of such strong contributors, we evaluated the phylogenetic informativeness of the exons measured by their relative contribution to the total likelihood of each of the eight unique topologies based on the mostinclusive concatenation dataset. As expected, some loci contributed disproportionally to the likelihood of a given topology (Fig. S5). Because misleading strong signal from a minority of loci can potentially overpower the phylogenetic signal of other, more reliable loci, we excluded these strong contributors from the respective most-inclusive phylogenetic matrices (S1L1M1, S1L1M3, S1L2M2, S1L3M1, S1L3M3, S2L2M1, S3L1M1, S3L3M1). The ML estimations of these eight matrices produced eight unique topologies, some indistinguishable based on the data, which by contrast with previous analyses consistently resolve a strongly supported clade of Megacarpaea, Cochlearia, Lobularia and Iberis (Dataset S2; Fig. S6). The topologies differed by the relationship between Idahoa and Subularia, the relationship among Cochlearia, Lobularia and Iberis, and the positions of Biscutelleae and Lineage V (see 'Backbone of the phylogeny'; Fig. S6; Table S7).
In order to determine how taxa with variable positions in the concatenation analyses influence tree topology, we conducted a series of taxon-pruning experiments, systematically removing Cochlearia officinalis, Lobularia maritima, Iberis linifolia, Megacarpaea spp., Subularia aquatica and Idahoa scapigera from the largest matrix excluding strong contributors (S1L1M1R) (Fig. S7). These removals led to variable placements of the other unstable taxa, demonstrating the high sensitivity of the analysis to the exclusion of even a single 'keystone' taxon, thus underscoring the importance of our broad sampling of recognized Brassicaceae tribes for accurate phylogenetic inference.

Aethionema reference
Sisymbrium reference Arabidopsis reference

New Phytologist
In order to assess gene conflict arising from deep coalescence, we estimated species trees for each of the 27 sets of gene trees with ASTRAL-II, which calculates the support of a topology based on the prevalence of quartet trees derived from the individual gene trees (Dataset S2). These analyses resulted in seven unique species tree topologies containing similar groupings as in the concatenated ML analyses that differed in the position of Idahoa, the placements of Iberis and Lobularia, and the relationships among Cremolobus, Brayopsis and Schizopetalon, and among Arabidopsis, Physaria and the rest of Lineage I (Fig. S8). By contrast to their conflicting placements in the ML trees, Cochlearia, (Kernera + Petrocalis) and the Conringia clade were resolved as successive sisters to the rest of Lineage II in all ASTRAL-II trees with high support. In all concatenation analyses, (Camelina + Capsella) is sister to (Boechera + Halimolobos + Crucihimalaya) + (Dipoma + Hemilophia + Geococcus), whereas in all ASTRAL-II it is sister to (Boechera + Halimolobos + Crucihimalaya). The difference in topologies suggests complex speciation in this part of the tree.

Backbone of the phylogeny
Analyses of nuclear loci produced well-resolved and largely congruent topologies consisting of six clades that form successive sister groups to the rest of Brassicaceae: (1) Aethionema; (2) a clade approximately representing the previously circumscribed Lineage III (Beilstein et al., 2006); (3) a clade comprising the tribes Arabideae, Stevenieae and Alysseae, which we call Lineage IV; (4) a comprehensive clade consisting of representatives of Lineage I (Beilstein et al., 2006); (5) a clade including relatives of the previously circumscribed Lineage II (Beilstein et al., 2006); and (6) a novel clade of taxa distributed primarily in the Southern Hemisphere, designated hereafter as Lineage V (Fig. 3; Notes S1). At lower phylogenetic levels, our results support previously recognized relationships among tribes, reveal novel clades and assign previously unassigned taxa to a lineage, some of which may justify the erection of new tribes ( Fig. S9; Tables S8, S9). The topology of the reconstructed plastid tree (Fig. S10) revealed conflicting signal between the nuclear and plastid genomes at deeper nodes, notably in the branching order of lineages I and III, and among the more terminally branching taxa in Lineage I (Fig. S11). Such cytonuclear discordance is consistent with prior findings that have invoked both chloroplast capture (Beilstein et al., 2008) and substantial nuclear introgression (Forsythe et al., 2017) in Lineage I, suggesting that such phenomena may influence phylogenetic relationships more broadly in the family.

Points of incongruence among analyses
Our results reveal areas of topological stability and highlight recalcitrant nodes that exhibit discordance at the individual gene tree level and after the interrogation of combinations of exons and gene trees. Much of the topological conflicts concern only few taxa with unstable positions in different analyses and datasets. For example, the placements of Subularia and Idahoa, although firmly resolved within the novel Lineage V, vary with respect to each other and to the Cremolobeae-Eudemeae-Schizopetaleae (CES) and the Asta + Scoliaxon clades in all analyses. Both Subularia and Idahoa exhibit higher rates of molecular evolution indicated by long branches, which can obscure phylogenetic relationships making them intractable to resolve even with datasets of our scale (e.g. King & Rokas, 2017).
Incongruent phylogenetic histories also may result from duplication and subsequent gene loss, incomplete lineage sorting and introgression (e.g. Wendel & Doyle, 1998). Another area of conflict in the Brassicaceae phylogeny is the relationship among Cochlearia, Iberis, Lobularia and Megacarpaea. This clade received little support in the coalescence and the initial concatenation analyses. High support was obtained only after exclusion of loci that contribute disproportionally to the phylogenetic signal in the concatenation analyses, suggesting difficulties in assigning gene orthology among taxa (Walker et al., 2017) as the reason for the observed results. Such difficulties may arise from whole-genome duplication (polyploidization) events and subsequent asymmetric gene loss of paralogs, resulting in long branches (Fares et al., 2005), as observed here. The larger average genome sizes of Anastaticeae, Cochlearieae and Iberideae (Megacarpaeeae genome size is not known), as well as the variable base chromosome number in all four tribes (Hohmann et al., 2015) lend support to a possible shared mesopolyploid origin of this clade. A more targeted approach, such as distribution of synonymous substitutions of paralogs over time (e.g. Mand akov a et al., 2017) and syntenic information from whole-genome sequences will clarify this issue. However, note that whole-genome duplication events along terminal branches are unlikely to result in phylogenetic uncertainty. Relationships in Lineage III, where we sampled several polyploid species, are consistently resolved in all analyses, suggesting that the polyploidization events they feature may not be shared among tribes in this clade.
The basal-most nodes of Lineage I were universally resolved, but conflicting resolution was evident for the branching order associated with Arabidopsis spp. and Physaria, and the relationship among Capsella rubella + Camelina sativa, Geococcus pusillus + Hemilophia rockii + Dipoma iribideum and Boechera stricta + Halimolobos pubens + Crucihimalaya himalaica. This uncertainty is reflected in cytonuclear discordance, where the plastid phylogeny unites Arabidopsis spp. and Capsella rubella + Camelina sativa (a monophyletic Camelineae), and Geococcus pusillus + Dipoma iberideum + Crucihimalaya himalaica, strongly suggesting a complex speciation. It has been suggested recently that this observation reflects massive nuclear introgression (Forsythe et al., 2017). Our broad sampling suggests that these processes are more pervasive and explain the inconsistent placement of Microlepidieae in Huang et al. (2015) better than a putative hybrid origin for the tribe. Incomplete lineage sorting likely contributed to some of the observed patterns although its extent in this clade is unclear. These three examples demonstrate the complexity of evolutionary processes shaping the current Brassicaceae diversity and further emphasize the utility of this model clade in studying complex evolutionary histories.   Fig. 3 Maximum-likelihood topology of the Brassicaceae relationships inferred from 79 species and 1421 exons. Nodes marked with a black circle are universally supported by all concatenation and coalescence analyses. Numbers associated with certain internodes are a quartet-based measure, lowest quartet internode certainty (LQ-IC) (Zhou et al., 2017), for quantifying the similarity between this reference tree and three sets of evaluation trees: RAxML bootstrap trees (2700 trees)/the RAxML bootstrap trees after removal of strong contributors (800 trees)/trees used to calculate local support in the coalescence analyses (2700 trees). LQ-IC varies between -1 and 1: it approaches '1' when the reference internode is prevalent among the evaluation trees, it is close to '0' when alternative topologies are common, suggesting incongruence between the reference tree and the evaluation trees with respect to this internode, and approaches 'À1' when the evaluation trees strongly contradict the internode present in the reference topology. For example, the branch leading to Cremolobus peruvianus + Schizopetalon walkeri (1/1/0.28; Lineage V) is universally supported in both maximum-likelihood analyses (LQ-IC score = 1) but not present in all coalescent evaluation trees, reflected by a lower LQ-IC score (0.28). (2019)

Leaf-form evolution in Brassicaceae
The newly developed understanding of relationships along the backbone of the Brassicaceae phylogeny offers considerable opportunities for comparative studies focused on the evolution of critical traits. Leaf shape is a model trait for understanding the genetic basis of phenotypic diversity (Nikolov & Tsiantis, 2017) and our phylogeny provides a macroevolutionary framework to study the distribution of leaf complexity states in Brassicaceae. For example, simple leaves with entire, dentate or serrate margins are present in all six major clades in the family. All species of Aethionema have undivided, entire leaves, and have an articulation (joint) between the petiole-like base and the rest of the leaf blade, a trait which is exceptionally rare in the rest of the  Identifying gene expression differences that may underlie phenotypic differences is crucial for understanding the genetic basis of morphological evolution (Peter & Davidson, 2011;Rowan et al., 2011). Previously we demonstrated the significant overrepresentation of transcription factors among the differentially expressed genes between Cardamine hirsuta and Arabidopsis thaliana during early leaf development (Gan et al., 2016). To extend these analyses to the family level and identify gene expression and coding sequence differences associated with leaf diversity in Brassicaceae, we generated the transcriptional profiles of young developing leaves from eight diploid species with sequenced genomes and contrasting leaf shapessimple or complex (including compound leaves)throughout the Brassicaceae phylogeny (Fig. 4a, Table S10). Comparing gene expression across species requires prior knowledge of phylogenetic relationships (Dunn et al., 2017) to account for neutral fluctuations in gene expression levels in different lineages (Rohlfs et al., 2014). Such analyses could not previously be performed with sufficient confidence because (1) branch lengths, as required for downstream analyses, could not be estimated accurately, and (2) the relationship among taxa with newly generated transcriptomes, in particular the placement of the emerging model species Arabis alpina, has not been resolved consistently before our phylogenomic study (Willing et al., 2015).
Employing Ornstein-Uhlenbeck process modeling (Rohlfs et al., 2014) using our explicit phylogenetic framework, we identified 64 genes exhibiting significant directional change in expression from neutral expectations in the core Brassicaceae (the clade sister to Aethionema), where diversity in leaf margin geometry is pronounced (q-value < 0.05) (Dataset S3). These genes included the developmental regulators AUXIN RESPONSE TRANSCRIPTION FACTOR 3/ETTIN (AT2G33860), CYCLIN D6;1 (AT4G03270), CHROMATIN REMODELING 9/SWITCH 2 (AT1G03750), Enhancer of polycomb-like transcription factor (AT4G32620) and the ethylene receptor ETHYLENE RESPONSIVE 2 (AT3G23150) (Table S11). Next, we tested for directional change in gene expression not compatible with neutral evolution along the branches leading to the species with complex leaves relative to species with simple leaves, and identified 29 such genes (Table S12, Dataset S3). These genes included the putative signaling components MAP KINASE 17 (AT2G01450), MAP KINASE 20 (AT2G42880), the microtubule-associated GROWING PLUS-END TRACKING PROTEIN 2 (AT3G53320) and several post-translational modifiers (metallopeptidase M24 family, trypsin family, peptidase family M48), which emerge as candidates for a shared core set of regulators associated with the evolution and development of leaf complexity in Brassicaceae. Because microtubules play important roles in growth polarity, proteins that spatially organize this portion of the plant cytoskeleton may have been recruited to contribute to marginal growth polarization associated with complex leaf morphogenesis (Barkoulas et al., 2008). One of the genes, AT1G19485, a WD40-repeat protein hypothesized to be a part of a CUL4-RING E3 ubiquitin ligase complex (Jackson & Xiong, 2010) exhibits both shift in expression and evidence for positive selection (Table S13) on residues in its N-terminus in complex-leaved species. This finding highlights a potential role of protein ubiquitination in the diversification of leaf form. Taken together, the gene expression results identify candidates for further functional studies into the genetic basis of leaf diversity in Brassicaceae.

Conclusion
We have reconstructed the backbone of the Brassicaceae phylogeny, identified six major clades that harbor the diversity of this economically important family, and provided putative placement of 16 taxa that were not resolved previously within the family. We also identified genes exhibiting significant directional change in expression or evidence for adaptive evolution that may be associated with leaf complexity. Our phylogenomic study provides a phylogenetic framework for future genomic, developmental and evolutionary studies among mustards.
TS 229/1-1 and SFB (Sonderforschungbereit) 680, and core grant by the Max Planck Society. MT also acknowledges support from CEPLAS Cluster of Excellence. LAN and PS were supported by Alexander von Humboldt Research Fellowships. CDB was supported by NSF Plant Genome Research Grant 1238731.

Author contributions
LAN and MT conceived and planned the study; LAN conducted all experimental work, and collected and curated sequence data; LAN and PS performed the phylogenomic analyses with input from CDB; LAN, BN and DF performed the transcriptome analyses, XG contributed sequenced genome data and bioinformatics expertise; IAAl-S provided taxonomic expertise and advice on the biology of Brassicaceae; LAN and MT wrote the paper with input from CDB whose discussions with MT over many years were instrumental for conceiving the study; and all authors approved the manuscript. MT provided funding and directed the study.

Supporting Information
Additional Supporting Information may be found online in the Supporting Information section at the end of the article.
Dataset S1 Exons included in each dataset.
Dataset S2 Phylogenetic trees derived from each dataset.
Dataset S3 Expression levels of leaf-expressed genes in 8 species of Brassicaceae that deviate from neutral expectations.          Methods S1 Supplementary Materials and Methods.
Notes S1 Tree description and placing taxa unassigned to a tribe.
Table S1 Taxon and voucher information for species with newly generated sequence data.

Table S2
Accession information for species with sequenced nuclear genomes included in this study.

Table S3
Accession information for species with sequenced plastid genomes.

Table S4
Sequencing reads and contig statistics for nuclear loci.   Table S7 AU tests to evaluate the support of a given matrix for a given topology.

Table S8
Accession information of marker genes used for the infratribal placement of unassigned to a tribe taxa.

Table S9
Summary of the phylogenetic placement of unassigned to a tribe taxa and taxonomic notes.
Table S11 Genes that deviate from the background optimal expression value in the family in the core Brassicaceae.
Table S12 Genes that deviate from the background optimal expression value in the family along the branches leading to species with complex leaves.