Challenges of comprehensive taxon sampling in comparative biology: Wrestling with rosids

Using phylogenetic approaches to test hypotheses on a large scale, in terms of both species sampling and associated species traits and occurrence data—and doing this with rigor despite all the attendant challenges—is critical for addressing many broad questions in evolution and ecology. However, application of such approaches to empirical systems is hampered by a lingering series of theoretical and practical bottlenecks. The community is still wrestling with the challenges of how to develop species-level, comprehensively sampled phylogenies and associated geographic and phenotypic resources that enable global-scale analyses. We illustrate difficulties and opportunities using the rosids as a case study, arguing that assembly of biodiversity data that is scale-appropriate—and therefore comprehensive and global in scope—is required to test global-scale hypotheses. Synthesizing comprehensive biodiversity data sets in clades such as the rosids will be key to understanding the origin and present-day evolutionary and ecological dynamics of the angiosperms.

The use of comprehensive taxon sampling, up to and including complete coverage, is central to future progress in answering key questions in evolution and ecology framed at broad scales. Despite promise, progress in building comprehensive, broad-scale phylogenies and their associated data layers (i.e., biologically relevant taxon-level data linked to tips in a phylogenetic tree, such as phenotypic traits and occurrence records) for testing hypotheses has been limited by diverse challenges, such as incomplete phylogenetic coverage, lack of associated and accessible data layers, and a lack of available infrastructure to disseminate phenotypic and geographic data in ways that facilitate integration with phylogenetic information.
Collating such large-scale data sets is not trivial; thus, a set of factors converges to render macroevolutionary studies on vast scales increasingly tractable, yet tantalizingly out of reach for many researchers. The fact that so many global-scale analyses (e.g., Jetz et al., 2012a) have focused on the rich data available for vertebrates (e.g., VertNet, http://vertnet.org/; FishBase, http://www.fishbase.org/; AmphibiaWeb, http://www.amphibiaweb.org) demonstrates how building linked biodiversity community resources spurs transformative research (for example, enabling assessment of drivers of diversification that may include phenotypic traits, geographic range, and ecological niche occupancy, among other candidates). Extending the technical and social approaches for developing such resources to other clades would lower barriers to performing macroscale comparative analyses in other groups. While the overall state of knowledge in the angiosperms generally lags well behind similar efforts in other groups (e.g., vertebrates, a more tractable target at perhaps one tenth the diversity of flowering plants), there are angiosperm subclades well suited for realizing the vision of comprehensively assembling the large-scale picture of evolution of terrestrial ecosystems. What are the ingredients for lowering this barrier in flowering plants?
Here we provide an example of a subclade within the angiosperms that exemplifies the value of broad-scale approaches, the rosids (Rosidae sensu Cantino et al. 2007). Rosids are a major angiosperm clade, with ~90,000 species (Sun et al., 2016; M. Sun et al., unpublished) representing 22% of all angiosperms (assuming 400,000 species of angiosperms), with properties that make this clade ideal for realizing the vision of global-scale hypothesis testing through a synthesis of biodiversity data.
In this paper, we ask: What are the grand challenge questions that could be addressed if a robust comparative framework (a well-resolved phylogeny linked with phenotypic and geographic data) were developed? This contribution is organized as a series of questions: 1. Why rosids? What is the case for building an exemplary comparative data set for this or any other large clade of life? 2. What challenges persist in building large-scale trees and trait layers despite progress to date, and how can these challenges be addressed? 3. Why use comprehensive approaches to analyze large clades of life? What motivations underlie large-scale analyses in ecology and evolution?

ROSIDS: AN EXEMPLAR CLADE FOR THE ANGIOSPERMS
Rosids, which capture many of the evolutionary and ecological dynamics of angiosperms as a whole, are ideal as a case study for demonstrating data-driven arguments behind building comparative resources in the flowering plants. Rosids exhibit substantial diversity in morphology, habit, reproductive strategy, and life history, and hence occupy a substantial portion of the phenotypic and ecological space that characterizes angiosperms as a whole. Near-complete phylogenetic and trait coverage would permit elucidation of the tempo and mode of global diversification of this large, ecologically dominant clade, enabling comparative analyses with other major lineages of life, and eventually global assessment and synthesis of the evolution of terrestrial landscapes. Because the rosid clade and its associated biomes constitute a major driver of terrestrial biodiversity, predicting future biodiversity patterns for rosids based on historical diversification may likewise be key to understanding the future of other terrestrial clades of life. In short, the rosid clade provides the opportunity to link our understanding of biodiversity from the past to both present and future. We proceed by outlining key properties of the clade and how these exemplify the prerequisites for building any large-scale comparative system.
The rosid clade originated in the Early to Late Cretaceous (115-93 million years ago [Ma]), followed by rapid diversification of two major subclades, the Fabidae and Malvidae crown groups, about 112 to 91 Ma and 109 to 83 Ma, respectively (Wang et al., 2009; Bell et al., 2010). The rosid clade is further divided into clades recognized as 17 orders and 135 families (APG IV, 2016; Fig. 1).

Rosids and terrestrial biome dynamics
Understanding rosid evolution also means characterizing the origin and diversity of major biomes. The radiation of the rosids represents the presumably rapid rise of angiosperm-dominated forests and associated co-diversification events that profoundly shaped much of current terrestrial biodiversity (Wang et al., 2009; Boyce et al., 2010). Among major clades in the land plants, perhaps only the grasses and conifers (both smaller clades that are better understood phylogenetically than rosids) could also lay claim to building biomes covering large sections of the globe. The megadiverse rosid clade is home to most dominant forest trees (e.g., Betulaceae [alder, birch  and comprise aquatics (e.g., Podostemaceae), desert plants (e.g., Euphorbiaceae), and parasites (e.g., Rafflesia).

Applied dimensions
Rosids exhibit spectacular diversity in biological processes that may be responsible for the many practical uses of members of the clade. Foremost among these are symbioses with nitrogen-fixing bacteria in legumes and nine other families, the phylogenetic distribution of which is remarkably concentrated in one clade, the nitrogen-fixing clade (Soltis et al., 1995; Werner et al., 2014; Li et al., 2015). This symbiosis has enabled many members to thrive in resource-poor soils; thus, the functional genomics of this symbiosis is of great interest for crop improvement (Stokstad, 2016). Rosids also exhibit diverse phytochemistry, providing potent biochemical defense mechanisms, such as glucosinolate production in Brassicales (Rodman et al., 1998; Edger et al., 2015). This chemical diversity is also associated with the many economic uses of members of Brassicaceae. The plant model Arabidopsis thaliana (Brassicaceae) is in the rosid clade; many other rosids are also genetic models with sequenced genomes, e.g., Brassica rapa, also of Brassicaceae (Brassica rapa Genome Sequencing Project Consortium, 2011), and several legumes (Sato et al., 2008; Schmutz et al., 2010, 2014; Young et al., 2011; Varshney et al., 2012, 2013).

State of the art
Despite the ecological and economic importance of rosids, after decades of data accumulation, our knowledge of the clade remains remarkably limited along any metric. Rosids thus not only serve as a case study for the possibilities of large-scale biodiversity research, but also reveal the constraints on this research due to limitations in basic biodiversity knowledge. This knowledge gap is characteristic of nearly all large clades across the Tree of Life with the possible exception of vertebrates. Shedding a quantitative light on these disparities is critical to raising awareness about how little we truly know about global biodiversity and identifying priorities for future efforts in flowering plants.
Mapping DNA sequence availability onto a supertree estimate of the complete rosid clade (Fig. 2, combining both phylogenetic and taxonomic knowledge from the Open Tree of Life; Hinchliff et al., 2015) shows that current DNA sampling of rosids is highly biased toward subclades of economic interest and significant temperate diversity (e.g., legumes). Groups with the worst representation (e.g., Malpighiales) have few economically important members, yet are critical elements of tropical floras. Only a minority of rosid species (30,234 of 90,000, or 34%) have sequence data of any kind in GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Many of these sequences are microsatellites, ESTs, or other sequences with low species coverage and are not usable for phylogenetics. Even well-known clades, such as Rosales (predominantly temperate), are poorly represented, with only 23% of species having usable DNA sequence data available (Table 1). Only one small group, Fagales, surpasses 50% coverage. Curating the available DNA sequence data for supermatrix phylogenetic analyses (Sun et al., 2016) results in further loss of data, leaving approximately 21% of species across the rosids represented as phylogenetic tips. The pattern of incomplete and biased taxon sampling in the rosids (cf. Fig. 2) is largely true of the angiosperms in general (see Fig. 2 of Eiserhardt et al. [2018] in this issue). Most known species still have no DNA data at all (Drew, 2013; Hinchliff et al., 2015); the vast majority of the flowering plant branch of the tree of life remains dark.
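The coverage figure quoted above is simple arithmetic, and the same calculation can be repeated for any subclade once species counts are in hand. The following is a minimal sketch using only the two numbers given in the text; the function name `coverage_percent` is a hypothetical helper introduced here for illustration.

```python
def coverage_percent(sampled: int, total: int) -> float:
    """Percentage of species in a clade that have data of a given kind."""
    if total <= 0:
        raise ValueError("total species count must be positive")
    return 100.0 * sampled / total


# Figures quoted in the text: 30,234 of ~90,000 rosid species
# have sequence data of any kind in GenBank.
rosid_pct = coverage_percent(30_234, 90_000)
print(f"{rosid_pct:.1f}% of rosid species have any GenBank data")
```

The result is about 33.6%, which the text rounds to 34%.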

Phylogenetic bias
Large-scale phylogenetic efforts typically require integrating efforts and data sets from heterogeneous sources, including focused phylogenetic analyses, DNA barcoding data sets, genomic resources, and other data that were not purpose-built for comprehensive species-level inference at global scales. The piecemeal assembly of data sets often makes it difficult to control for uneven sampling of clades. A future need is the development of approaches to assess and correct phylogenetic bias in taxon sampling (either directly through improved sampling or indirectly through modeling taxon absence). In principle, phylogenetically even but incomplete sampling can be accounted for under many models if taxon sampling is unbiased (e.g., FitzJohn et al., 2009). Change in the overall shape of the tree due to biased sampling is not easily controlled for and will likely alter conclusions under models that make inferences from tree topology and branch lengths.
As more researchers assemble large-scale phylogenomic data sets, we see a need for identifying gaps in the coverage of the tree of life and of deploying this knowledge in sequencing efforts to fill these gaps and avoid duplication of effort (see also Eiserhardt et al., 2018, this issue). Although some general-purpose loci have been developed for the angiosperms (e.g., Léveillé-Bourret et al., 2018 and the PAFTOL project; Eiserhardt et al., 2018, this issue), custom-developed, often non-overlapping loci remain the norm (e.g., Weitemier et al., 2014; Mandel et al., 2014; Folk et al., 2015; Chamala et al., 2015; Schmickl et al., 2015), creating greater difficulties for post-hoc aggregation across these experiments.

Spatial bias
In addition to building comprehensive phylogenetic hypotheses, an ongoing trend in comparative research has been the assembly of equally comprehensive and globally scaled data layers. Recent plant contributions in this spirit include Werner et al. (2014), Zanne et al. (2014), and Díaz et al. (2016). For many clades of land plants, traits and geographic data are missing for most species in existing databases. This lack of coverage results partly from bias in the cumulative assembly of species trait and occurrence data over time, typically from aggregating a long series of small-scale or specialized projects and digitization efforts. Such data accumulation is highly correlated with sociological factors such as gross domestic product, local funding sources, and distance to institutions performing digitization (Amano and Sutherland, 2013; Meyer et al., 2015). One hallmark of spatial bias is an inverse latitudinal gradient clearly observable in the rosids (Fig. 3A), where records are least heavily accumulated and species least completely represented in the tropics, some of the most biodiverse parts of the world for the rosids (Fig. 3B). Because major rosid clades are not evenly distributed across the globe (e.g., Malpighiales and Rosales are associated, respectively, with tropical and temperate latitudes), spatial and phylogenetic bias are likely to interact.
Figure caption (fragment): Outer band: Species that either have (yellow) or lack (blue) phylogenetically usable data ("usable" based on taxa remaining after a series of filtering steps described by Sun et al., 2016), based on matching nomenclature with tips present in Sun et al. (2016) against the Open Tree topology (excluding Open Tree tips with labels for fossil taxa, indicating subspecific or hybrid status, etc.). Note how few taxa have data (yellow) and how phylogenetically uneven this data coverage is.
Spatial bias may propagate to downstream analyses that do not explicitly include spatial data, such as those focusing on potentially correlated traits and taxon coverage. Hence, spatial bias can occur at multiple levels of sampling; accumulation of phylogenetic tips, occurrences, and species traits are all influenced by availability of material and digitization efforts. Most directly, spatial bias has an enormous impact on the spatial distribution of occurrence records, such that nearly any large-scale clade in the tree of life has an occurrence density pattern matching closely that seen in the rosids (Fig. 3A; compare with global mammal GBIF records: Boitani et al., 2011: fig. 2). This strong bias is partly due to historical differences in collection effort. However, differing levels of investment in biodiversity digitization among countries also contribute to this unevenness, which is compounded by the tendency of digitization efforts to be locally focused initially, even for internationally representative collections (Amano and Sutherland, 2013; Meyer et al., 2015).
As with phylogenetic bias, we see not only challenges but opportunities. It would be a major step toward enabling research if future efforts specifically assigned digitization priorities on the basis of evidence for data gaps in current infrastructure. For most herbaria, it is not feasible in the immediate future to completely digitize all specimens, including georeferences, images, and other data. Targeting data gaps would provide an evidence-based method to direct digitization efforts and maximize downstream research impact.

Linked data
Linking data sets such as those discussed above is critical for large-scale inference (Parr et al., 2012). For instance, a common task is to subset a tree for the group of interest using a list of taxon names. Linked data already have a role: providing linkages between taxonomic concepts and a phylogeny. If unusual phylogenetic placements are observed, it might be necessary to retrieve either original voucher specimen photographs or original sequence data. Finally, using the name list, linkages could allow users to subset trait data from online repositories such as the TRY Plant Trait Database (https://www.try-db.org/); both unusual trait scores and the possibility of polymorphism would warrant consulting original specimen material using online herbaria. Central to these aims are stable identifiers built around taxon concepts to facilitate linking of disparate data products. Links between genetic data, online herbaria, and phylogenetic tips are typically not explicit and need to be laboriously sought manually, although some linkages, such as that between GenBank and iDigBio (https://www.idigbio.org/), are currently being developed. For example, herbarium specimen records in iDigBio that serve as vouchers for GenBank sequences and have globally unique identifiers on GenBank are linked to their associated DNA sequences; unfortunately, globally unique identifiers are not consistently used or formatted properly (Guralnick et al., 2014), thwarting efforts to link most data directly.
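The task of subsetting a tree with a list of taxon names can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: the tree is a toy topology stored as nested tuples with placeholder tip labels ("A" to "E"), branch lengths are ignored, and the function name `prune` is hypothetical. A real analysis would use a phylogenetics library and would need to sum branch lengths when collapsing nodes.

```python
from typing import Optional, Set, Tuple, Union

# A tree is either a tip label (str) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def prune(tree: Tree, keep: Set[str]) -> Optional[Tree]:
    """Return the subtree induced by the tips in `keep`, or None if no tip is kept.

    Internal nodes left with a single child are collapsed into that child.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (prune(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


# Toy topology with placeholder tip names (not real taxa).
toy_tree = (("A", "B"), ("C", ("D", "E")))
name_list = {"A", "D", "E"}
print(prune(toy_tree, name_list))  # ('A', ('D', 'E'))
```

The sketch presupposes that the names in the list already match the tip labels exactly, which is where taxonomic linkages and name reconciliation enter in practice.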
Community consensus is lacking about minimal reporting standards for integrative research programs that include multiple data types. Minimally, we recommend that these projects should contain unique sample identifiers (e.g., GUIDs) as part of data deposition in standard data-specific repositories (e.g., GenBank and SRA; https://www.ncbi.nlm.nih.gov/sra). Unambiguous identifier practices will enable future researchers to scrape metadata for recognizable identifiers and retrieve matching information generated downstream from those samples, such as sequences, modeled geographic distributions, and other data and knowledge products.
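As a rough illustration of what scraping metadata for recognizable identifiers could look like, the sketch below searches free-text metadata for identifiers in the standard 8-4-4-4-12 hexadecimal UUID layout. This is an assumption made for the example: GUIDs in real repositories may follow other schemes, and the record string, its field name, and the function `find_guids` are hypothetical.

```python
import re
from typing import List

# Matches identifiers in the common UUID layout (8-4-4-4-12 hex digits).
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def find_guids(text: str) -> List[str]:
    """Return all UUID-style identifiers found in `text`, lowercased."""
    return [match.lower() for match in UUID_RE.findall(text)]


# Hypothetical metadata string; the identifier is a made-up example value.
record = "specimen_voucher: herbarium sheet, id=123E4567-E89B-12D3-A456-426614174000"
print(find_guids(record))  # ['123e4567-e89b-12d3-a456-426614174000']
```

Lowercasing the matches is a design choice that lets identifiers typed inconsistently in different repositories be compared directly; it does not rescue identifiers that were truncated or reformatted at deposition.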

Name reconciliation
Reconciling conflicting taxon identifiers is unavoidable for any project that attempts to accrue multispecies data from diverse sources, yet remains a core challenge of large-scale biology (Patterson et al., 2010). Many large-scale databases have their own internal taxonomy (e.g., GBIF, https://www.gbif.org/; GenBank; Open Tree, https://tree.opentreeoflife.org/), and standalone name products also exist (e.g., The Plant List, http://www.theplantlist.org; Tropicos, http://www.tropicos.org/). These taxonomies sometimes represent conflicting taxonomic opinions and often are incomplete and partially out of date. Taxonomic mismatch results in major discrepancies in accepted genera, total species number, and other important metrics that inform sampling, analysis, and synthesis. The availability of community reconciliation services (Boyle et al., 2013) is an important step toward resolving these issues, at least for providing current assessments of valid taxon names. A much-needed area of growth is the improvement of existing databases by digitizing and incorporating major, yet largely inaccessible, natural history literature (see below).
While necessary for building the framework of online taxonomies, a static, centralized approach to the name reconciliation problem (generally the approach used to date) will lack permanency given the continual flux of taxon delimitation (Lepage et al., 2014), meaning that a resource that is updatable, preferably by the community and in close to real time, will be critical to improving resources beyond those available to date.
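The core of name reconciliation, mapping names from heterogeneous sources onto one accepted name and flagging what cannot be matched, can be sketched as a table lookup. The example below is a deliberately simplified illustration: the binomials are invented placeholders rather than real taxa, the synonym table is hypothetical, and the functions `normalize` and `reconcile` are not part of any existing reconciliation service. Real services also handle authorship strings, misspellings, and conflicting taxonomic opinions.

```python
from typing import Dict, Iterable, List, Tuple


def normalize(name: str) -> str:
    """Standardize spacing, underscores, and capitalization of a binomial."""
    parts = name.replace("_", " ").split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def reconcile(
    names: Iterable[str], synonyms: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Map each input name to an accepted name; collect names that cannot be matched.

    `synonyms` maps normalized synonyms to accepted names.
    """
    accepted = set(synonyms.values())
    resolved: Dict[str, str] = {}
    unmatched: List[str] = []
    for raw in names:
        clean = normalize(raw)
        if clean in synonyms:
            resolved[raw] = synonyms[clean]
        elif clean in accepted:
            resolved[raw] = clean
        else:
            unmatched.append(raw)
    return resolved, unmatched


# Placeholder names, not real taxa.
toy_synonyms = {"Aus albus": "Aus bus"}
resolved, unmatched = reconcile(["aus_albus", "Aus  bus", "Cus dus"], toy_synonyms)
print(resolved)   # {'aus_albus': 'Aus bus', 'Aus  bus': 'Aus bus'}
print(unmatched)  # ['Cus dus']
```

Because the synonym table is passed in rather than hard-coded, it can be swapped for a newer version as taxon delimitation changes, which is the property argued for above.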

Expert and algorithmic range products
A rich heritage of geographic range products is available for tetrapods, resulting from massive data digitization that has enabled comprehensive macroecological analyses and conservation-oriented decision-making (Jetz et al., 2012a, b; Meyer et al., 2015). In addition to purely expert-drawn range maps, automated approaches based on point occurrences have also been developed recently (e.g., Merow et al., 2016, 2017), offering the potential for generating geographic range products in clades where few ranges have been expert-assessed. Range data are complementary to better-known occurrence record data, as range data have the potential to coarsely assess true species absence rather than pseudoabsence (Jetz et al., 2012a). Range products are not only useful for direct empirical analyses, but also for quality control of occurrence records for other research (Jetz et al., 2012b). Occurrence data sets too large to curate entirely by hand can be automatically checked against expert-derived range maps using a spatial join to remove data points likely to be incorrect. These maps typically require expert involvement to produce credible estimates and are themselves hypotheses open to reinterpretation with new reports of species detection (or lack thereof).
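The quality-control step described above, checking occurrence points against a range map, amounts to a point-in-polygon test. The following is a minimal sketch under strong assumptions: the range is a single simple polygon in plain longitude/latitude coordinates treated as a flat plane, with no handling of the antimeridian, holes, or multi-part ranges. The rectangle and points are invented values, and the function names are hypothetical. A production workflow would use a GIS library for the spatial join.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(lon: float, lat: float, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def filter_by_range(points: Sequence[Point], polygon: Sequence[Point]) -> List[Point]:
    """Keep only occurrence points that fall inside the range polygon."""
    return [p for p in points if point_in_polygon(p[0], p[1], polygon)]


# Invented rectangular "range" and two invented occurrence points.
toy_range = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
toy_points = [(5.0, 5.0), (20.0, 5.0)]
print(filter_by_range(toy_points, toy_range))  # [(5.0, 5.0)]
```

Because range maps are themselves hypotheses, points that fall outside a range are better flagged for review than silently discarded, since some may be genuine new records.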

Digitization of legacy natural history data
Enormous effort has been made in increasing access to data in biological collections (e.g., VertNet, iDigBio, and GBIF). The availability of these resources has facilitated growth in macro-perspectives in ecology and evolution; the vast number of papers using repositories of occurrence records (nearly 6000 according to GBIF.org, 2017) illustrates how natural history data drive progress in biodiversity science. Despite this effort, literature containing natural history data in plants remains an untapped resource that is as rich as specimen data. Rather than direct point observations, literature sources represent expert-assessed consensus values for geographic range (see above) and phenotype, as well as a consensus taxonomic product for a given region in the form of accepted taxa. For large-scale digitization strategies, large-scale floras are ideal data sources. These floras typically comprise comprehensive treatments of a specific area of the globe, covering information such as accepted species lists, partial synonymies, whole-plant trait data, coarse-scale geographic range descriptors at the country, state, or other regional level, and variable additional features including chromosome number and invasive status. Regional taxonomic treatments are rich data sets; products of broad utility that can be developed from these treatments include (1) improved taxon name resolution, which could be combined with existing name databases for an improved consensus product; (2) coarse-scale range maps such as are available for vertebrates, typically of political regions, for inferences of range evolution, invasive species status, or quality assessment of occurrence data and spatial bias; and (3) very large morphological matrices. eFloras, such as the Flora of North America (Flora of North America Editorial Committee, 1993 onward) and Flora of China (Wu et al., 1994 onward; Brach and Song, 2008), represent low-hanging fruit for data mining. The text in these efforts does not identify descriptors (e.g., morphological terms do not have explicit metadata), so that indirect text scraping strategies are needed to match descriptors among taxa. While text scraping requires considerable effort, the pay-off is substantial for obtaining organismal information for hundreds or thousands of phylogenetic tips. Some recent efforts (e.g., Flora of West Tropical Africa; https://archive.org/details/FloraOfWestTropi00hutc) are partially semantically tagged, so that sub-blocks of text, such as a trait-related text block, can be obtained for further processing. Unfortunately, few other flora projects are so accessible. Although this is changing, e.g., for Flora Malesiana (Nooteboom et al. 2010 onwards) and Flora of New Zealand (Breitwieser et al. 2010 onwards), many recent and ongoing floras are not available online. Addressing these gaps in flora production would facilitate significant progress towards the vision of illuminating the dark parts of the tree of life, going beyond simply populating the tree with tip taxa by adding geographic and trait data layers with the assistance of partially automated approaches (Burleigh et al., 2013; Liu et al., 2015; Cui et al., 2016; Endara et al. 2018).

WHY USE COMPREHENSIVE APPROACHES?
An obvious first step in performing large-scale analyses is identifying the motivation for what may be a costly and labor-intensive enterprise spanning years from planning to fruition. Why fill in the dark parts of the tree, for rosids or any other clade, if we already understand higher-level relationships? Why indeed "go big" in phylogenetics? Why not "go small" many times in succession on small subclades and ultimately sum these well-worked case studies up to the ecological and evolutionary whole? Discussion on this point is important because basic questions have been raised about the inherent value of large phylogenies for testing hypotheses in evolution and ecology (Donoghue and Edwards, 2014).

Exemplar clade
With respect to the rosids or any other group, the choice of taxon for addressing large-scale hypotheses should be evidence-based and targeted toward finding groups appropriate in scale and properties for a given research question. Explicitly or implicitly, much recent work in phylogenetics sets its aims more broadly than inferences solely constrained to the group of interest, such that the use of comprehensive approaches has contributed insights for decades in evolution and ecology (see an early review by Pagel, 1999). As has long been the case for small clades, large-scale phylogenetic research should explicitly provide reasons for studying exemplar clades embodying the prerequisites for understanding particular evolutionary or ecological dynamics. We use "exemplar clade" to denote a monophyletic group that captures generalizable ecological and evolutionary processes for the purpose of analytical inference. An exemplar clade (= "model clade"; e.g., Chanderbali et al., 2016) thus serves as a biodiversity "model" in a phylogenetic framework, with the aim of inference placed more broadly than the group under concern.
Selection of a study group should not be based primarily on data availability, a criterion that would likely only exacerbate existing knowledge gaps and phylogenetic biases in future investigations, steering them away from what are already dark parts of the tree of life. If the aim is to study generalizable principles and processes across the angiosperms, or in other parts of the tree of life, developing large exemplar clades as community resources puts global-scale research within reach, the conclusions of which will be reciprocally enhanced as other comprehensive comparative data sets are developed.

A tale of two approaches
The comparative method has as its goal the testing of hypotheses using multispecies samples in a phylogenetic framework (Felsenstein, 1985). Recently, a dichotomy has been proposed, identifying what may be complementary or conflicting alternative approaches to such macroevolutionary questions (Donoghue and Edwards, 2014). One could either (1) use an integrative, large-scale approach to test hypotheses in a single framework (e.g., Meredith et al., 2011; Jetz et al., 2012a; Zanne et al., 2014), or (2) accrue a large number of small-scale, well-characterized clades, which investigators would follow by a qualitative synthetic review (e.g., Soltis et al., 2006; Donoghue and Edwards, 2014) or quantitative meta-analyses (e.g., Mayrose et al., 2011) to test the same large-scale hypotheses.
Large-scale studies have been criticized by some based in part on three largely accurate observations: (1) robust and comprehensive clade and trait sampling is very challenging to achieve on large scales, (2) identifying appropriate evolutionary models is difficult, in that a sample representing a long timespan is likely to capture a large number of evolutionary dynamics, and (3) individual instances substantiating broad patterns are anonymized and massaged out of the message of many such studies. These issues are more easily overcome if taxonomic sampling is intentionally placed within modest limits.
Despite these concerns, the scale of systematics research is steadily increasing, through improved sampling of both taxa and loci, generating phylogenetic matrices that are growing both "taller" and "wider." The same growth is true for trait and occurrence data sets that accompany phylogenetic matrices. But a community trend does not constitute justification ipso facto; it is reasonable that the choice of a large-scale analytical approach should be accompanied by compelling reasons for being large, as we have outlined above. Likewise, are there also risks for intentionally small, well-circumscribed scales in biodiversity science?

Emergent processes
Perhaps the most immediate problem of integrating over large numbers of small case studies is the potential for consistently failing to recover patterns that inherently cannot appear in small data sets. This problem concerns analytical scale: how do we build data sets appropriate for the phylogenetic and temporal scales at which we are testing hypotheses? We argue that biodiversity questions posed globally across large taxonomic groups require sampling that is appropriate to global scales of inference. Synthesizing knowledge in this way across large expanses of space and time will consistently compel the analysis of large data sets. The use of small clades to answer questions at large scales leads to data sets that are well characterized but restricted in their sampling of biological diversity. We identify conditions below where such sampling scales could obscure emergent signals and impact hypothesis testing.
One core issue is statistical power. For inference of diversification and other approaches that use highly parameterized models, branches and their lengths are the data points. Hence, fairly large phylogenies, on the order of hundreds to thousands of taxa under idealized simulated conditions (e.g., diversification: Davis et al., 2013; Rabosky and Huang, 2016; phylogenetic correlation: Ackerly, 2009) are required to have sufficient sensitivity to detect shifts in diversification with high power. It is expected, therefore, that an intentionally taxon-limited approach will consistently underestimate the number of diversification shifts and the occurrence of character-associated diversification patterns. Although no quantitative studies have been performed to assess the effects of taxonomic scope beyond statistical power, we expect that the number of significant evolutionary patterns extractable from phylogenetic data will be consistently and artificially truncated by focusing on small case studies. Such a truncation is likely for the simple reason that such patterns may be present in subclades but without the context of broader sampling that would make them detectable.
Estimation error increases with increasingly deep trees (Salisbury and Kim, 2001), and even within a given tree, estimation error is expected to increase as estimated nodes approach the root (Garland et al., 1999), leading to unequal error in ancestral state reconstruction across a tree. If a particular ancestral state is of interest, it is possible that removing taxa could result in smaller estimated uncertainty by incompletely sampling evolutionary transitions (Heath et al., 2008a), thus underestimating trait evolutionary rates and decreasing the magnitude of estimated error (e.g., the confidence interval, cf. Garland et al., 1999). Hence, a smaller reported uncertainty does not necessarily imply that the "true" error of such an estimate has actually decreased due to sampling scheme alone. Building data sets appropriate to the scale of questions posed is therefore preferable; for global-scale analyses, this often means including data for as many extant species as possible, maximizing the information behind our inferences and the estimated uncertainty thereof.
The detection of some processes may fundamentally require large phylogenies, irrespective of statistical power. This problem is subtler, in that it cannot be easily measured or controlled for by performing statistical power studies or extending models to account for potential data set biases. Such a problem is likely to occur in instances where deep-level patterns in highly diverse clades (e.g., the root of major angiosperm clades) are the object of inference, but where inferences are sensitive to taxon sampling. This situation could appear in ancestral state reconstruction, where a deep-level node is of interest, but the polarity of ancestral states is impacted by a complex distribution of states in descendant extant taxa. Some of the risks of poor taxon sampling in this case include incomplete sampling of evolutionary transitions in the clade of interest (Heath et al., 2008a) and warping of overall tree shape by dropping taxa (Heath et al., 2008b). These concerns cannot both be addressed in small test cases (in this case, sets of trees with limited taxon sampling at deep levels) if the relevant information for accurately distinguishing among possible ancestral states is not present in the data, irrespective of our ability to detect it. Simulation studies have shown increased estimation error as proportional taxon coverage decreases (Salisbury and Kim, 2001; Litsios and Salamin, 2012; but see Li et al., 2008).
A final issue with a solely small-scale focus, raised by Beaulieu and O'Meara (2018, this issue), is ascertainment bias. The choice of idealized small-scale clades to understand broad-scale patterns, often resulting in a focus on groups showing especially frequent shifts in a biological trait, may result in overemphasis of unusual outlier taxa unrepresentative of overall variation patterns. Hence, large-scale biodiversity studies are needed to complement and contextualize focused clade-level studies. Likewise, as we have suggested for the rosids, the suitability of an exemplar clade is a testable assumption that can be directly assessed by asking how well a focal clade cross-sections broader diversity patterns.
Issues of both statistical power and levels of inference imply that questions exist that are uniquely suited to purposeful attempts at comprehensive taxon sampling, such that focusing solely on small, well-characterized case studies is neither always sufficient nor invariably necessary. Approaches in biodiversity science that use small study clades will continue to be relevant, particularly for understanding recent-scale evolutionary processes. By contrast, the application of such sampling schemes to global questions poses risks, possibly resulting in data sets with high confidence in individual data points but restricted and possibly biased coverage of the biodiversity that underlies many biological processes. Comprehensive phylogenetic approaches that span deep-time and global geographic scales are urgently needed for the kinds of grand challenges which the comparative approach to biology is poised to address, due not simply to an obsession with larger and more resolved data sets (Hahn and Nakhleh, 2016), but to their central necessity for answering questions on deep-time and global scales in highly diverse clades.

Ways forward
In our view, large- and small-scale approaches are complementary. Some questions are best addressed with small clades. Increasingly, however, phylogenetic effort is devoted to asking questions in evolution and ecology that require large trees and comprehensive taxon sampling (e.g., global patterns of diversification, deep-time ancestral state reconstruction and biogeography, correlated evolution of characters, community phylogenetics), often in a model-based or otherwise explicitly quantitative framework (e.g., Smith and Donoghue, 2008; Smith and Beaulieu, 2009). We argue that the need remains for large-scale, comprehensive approaches that are appropriate to address questions of major importance.
We stress that focused case studies on small clades remain crucial for addressing certain specific questions and serve as an important element of building comparative data sets. Nonetheless, despite substantial progress in many domains, 30 years of effort on small focal clades in molecular systematics have resulted in uneven and incomplete coverage of rosids in particular, and angiosperm diversity as a whole, suggesting that this approach alone may not suffice to eventually synthesize biodiversity knowledge across the flowering plants. Targeted and coordinated large-scale sampling efforts at the community level are needed to complement these efforts and directly address data and knowledge gaps that have persisted despite intense efforts by individual researchers. Rather than continually aggregating upward in scope from focused data sets to create incomplete and biased larger sets, we can do more to collect comprehensive biodiversity data broadly for future users to disaggregate downward for focused work.

CONCLUSIONS
Much progress has been made in understanding deep-level relationships in the angiosperms (Chase et al., 1993; Qiu et al., 1999; Soltis et al., 1999, 2011) with large-scale sequencing projects (e.g., 1KP, Matasci et al., 2014) resulting in robust backbone resolution (Wickett et al., 2014) and community consensus taxonomic products (APG IV, 2016). Current efforts in plant systematics beyond the backbone have largely remained centered on localized taxonomic sampling efforts, with less consideration of how to develop more comprehensive, community-based, synthetic investigations or of whether such goals are feasible without purposeful large-scale generation of phylogenetic data to fill in gaps. Yet, it is just these kinds of efforts that can provide the most critical insights and applications in biology, particularly those posed at global or deep-time scales. The effort to develop such synthetic analyses is still enormous, and bottlenecks are multidimensional.
We make the case for an evidence-based assessment as we build comprehensive community resources for phylogenetically informed hypothesis testing, with a focus on exemplary, hyper-diverse clades such as the rosids. Such resources, to maximize enabled research, should comprehensively sample phylogenetic tips and linked phenotypic and geographic data as a community priority. This approach is complementary to focal studies on smaller clades, which may address significant problems but on different phylogenetic and temporal scales; both can help with goals geared towards broad-scale synthesis. However, we believe that purpose-built comprehensive phylogenies covering global scales and ancient radiations are valuable resources that, when linked to other biodiversity data and knowledge products, will be an impetus for transformative research.

FIGURE 1. Upper panel: Summary tree for ~19,000 rosid species (four loci; Sun et al., 2016); the legend matches branch colors to recognized orders. Lower panel: Photographs of representatives of 10 familiar orders; symbols follow colors in the upper panel legend.

FIGURE 2. Phylogeny of all rosids integrating taxonomic and phylogenetic knowledge (84,153 species, from the Open Tree of Life; https://tree.opentreeoflife.org/). Branch coloration represents ordinal taxonomy and matches the legend of Fig. 1. Outer band: Species that either have (yellow) or lack (blue) phylogenetically usable data ("usable" based on taxa remaining after a series of filtering steps described by Sun et al., 2016), based on matching nomenclature with tips present in Sun et al. (2016) against the Open Tree topology (excluding Open Tree tips with labels for fossil taxa, indicating subspecific or hybrid status, etc.). Note how few taxa have data (yellow) and how phylogenetically uneven this data coverage is.

FIGURE 3. (A) Global distribution of occurrence records for species in the rosid clade with at least 30 occurrence records in GBIF (https://www.gbif.org/; downloaded October 2015; 6,085,341 records), plotted on an elevation data set from R package raster. (B) Country-wise species richness, color-coded by a Jenks natural breaks classification. Species counts used country DarwinCore fields from both georeferenced and ungeoreferenced records, aggregating GBIF data with an unpublished data set of Amazonian records. The distribution of records is largely characteristic of any globally distributed clade, revealing more about global digitization effort than geographic range dynamics, while species richness estimates from available data for the rosids are close to a priori expectations. Projection for both maps is EPSG:4326.
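
The two tallies described in the Figure 3 caption (keeping species with at least 30 records, and counting species per country regardless of georeferencing) can be sketched roughly as follows. This is only an illustration, not the workflow used to produce the figure: the record layout, the field names `species` and `countryCode`, the function names, and the toy records are all assumptions, and the Jenks classification step is not attempted here.

```python
from collections import Counter, defaultdict


def species_with_min_records(records, min_records=30):
    """Return the species that have at least `min_records` occurrence records."""
    counts = Counter(r["species"] for r in records if r.get("species"))
    return {sp for sp, n in counts.items() if n >= min_records}


def country_richness(records):
    """Count distinct species per country, using only the country field.

    Coordinates are not consulted, so georeferenced and ungeoreferenced
    records contribute equally. Records lacking a species or a country
    are skipped.
    """
    species_by_country = defaultdict(set)
    for r in records:
        sp, country = r.get("species"), r.get("countryCode")
        if sp and country:
            species_by_country[country].add(sp)
    return {c: len(s) for c, s in species_by_country.items()}


# Made-up records for illustration only.
records = (
    [{"species": "Quercus alba", "countryCode": "US"}] * 30
    + [{"species": "Quercus alba", "countryCode": "CA"}] * 2
    + [{"species": "Acer rubrum", "countryCode": "US"}] * 5
    + [{"species": "Inga edulis", "countryCode": "BR"}] * 3
    + [{"species": "Inga edulis", "countryCode": None}]
)

well_sampled = species_with_min_records(records)
richness = country_richness(records)
print(sorted(well_sampled))
print(dict(sorted(richness.items())))
```

Counts of this kind inherit the digitization biases noted in the caption, so a low value for a country may reflect sparse records rather than low diversity.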