Prototype Biodiversity Digital Twin: Phylogenetic Diversity

Phylogenetic diversity (PD) is a fundamental measure of biodiversity that captures the extent of evolutionary history within a group of species. The measure has been recognised by major environmental and scientific organisations, including the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services. Unlike traditional taxonomic richness, PD offers an evolutionary perspective on biodiversity that is essential for conservation planning and biodiversity management. This manuscript describes the development of a BioDT (Biodiversity Digital Twin) prototype, aimed at facilitating the calculation and visualisation of biodiversity metrics from global, dynamic data sources. By utilising the PhyloNext pipeline and integrating with global phylogenetic and species occurrence databases like the Open Tree of Life (OToL) and the Global Biodiversity Information Facility (GBIF), the prototype aims to significantly reduce computation time and enhance user interaction. This enables dynamic visualisation and potentially hypothesis testing, making it a valuable tool for researchers, monitoring initiatives, policy-makers and the public. The prototype's development focuses on improving the PhyloNext pipeline's scalability and creating a more intuitive user interface, expanding its utility for conservation efforts and biodiversity exploration.


Introduction
Phylogenetic diversity (PD) quantifies the extent of evolutionary history encompassed by a group of species, highlighting a crucial dimension of biodiversity. "Acknowledged by the Intergovernmental Science Policy Platform on Biodiversity and Ecosystem Services (IPBES, www.ipbes.net/), the Earth's evolutionary legacy is considered a vital component of biodiversity, safeguarding possibilities for future generations. The tree of life serves as a repository of possible advantages for humans and through the preservation of PD, we protect the diversity of traits (essentially, the wide array of evolutionary characteristics found within a group of species) and secure future opportunities for human benefit" (IUCN SSC Phylogenetic Diversity Task Force, https://www.pdtf.org/). Biodiversity is most commonly quantified through taxonomic richness. For example, it is common to describe how diverse a genus or a geographic area is by counting the number of species within it. On the other hand, PD, a metric that takes into account the branch lengths in a phylogenetic tree, provides an evolutionary perspective of biodiversity that cannot be estimated using species richness alone. PD (expected loss of phylogenetic diversity) was one of the proposed indicators for the Kunming-Montreal Global Biodiversity Framework (https://www.cbd.int/doc/decisions/cop-15/cop-15-dec-05-en.pdf). When combined with geographical constraints, a model utilising PD metrics can effectively compare and qualify competing areas as most relevant for the designation or expansion of protected natural areas.
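To make the contrast between species richness and PD concrete, the following minimal sketch computes PD as the sum of branch lengths on the paths connecting a set of species to the root of a tree. The toy tree, its node names and its branch lengths are invented for illustration, and the choice to include the path to the root is an assumption of this sketch; it is not the code used by PhyloNext or Biodiverse.

```python
# Toy rooted tree: node -> (parent, length of the branch leading to the node).
# All names and lengths are invented for illustration only.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "n2": ("n1", 1.0),
    "lion": ("n2", 1.0),
    "tiger": ("n2", 1.0),
    "leopard": ("n1", 2.0),
    "cheetah": ("root", 3.0),
}


def faith_pd(tree, species):
    """Sum the lengths of all branches on the paths from the species to the root."""
    branches = set()
    for tip in species:
        node = tip
        # Walk towards the root, recording each branch once (shared branches
        # are counted a single time because `branches` is a set).
        while tree[node][0] is not None:
            branches.add(node)
            node = tree[node][0]
    return sum(tree[node][1] for node in branches)


close_relatives = {"lion", "tiger"}
distant_relatives = {"lion", "cheetah"}

# Both sets have a species richness of 2, but they differ in PD.
print(len(close_relatives), faith_pd(TREE, close_relatives))
print(len(distant_relatives), faith_pd(TREE, distant_relatives))
```

In this toy case, two areas with the same number of species differ in PD because one pair of species shares most of its evolutionary history and the other does not.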
The Global Biodiversity Information Facility (GBIF) is an intergovernmental network and biodiversity data infrastructure that currently mediates almost 3 billion species occurrence records. The Open Tree of Life (OToL) provides a comprehensive, accessible and continuously updated synthesised phylogenetic tree of all known species. The PhyloNext pipeline (Mikryukov et al. 2024) integrates these two pivotal research data infrastructures and makes them more accessible to non-experts by generating PD metrics with the Biodiverse programme (Laffan et al. 2010). PhyloNext uses Docker containers, which allows for relatively easy local installation, and it can also be launched in a cloud environment. PhyloNext is operated by commands in a terminal. A demo graphical interface is hosted by GBIF.org. The tool can be used to explore PD across different geographic and/or taxonomic groups in relation to policymaking, prioritisation of conservation efforts etc. However, even relatively simple queries (for example, Felidae (cats) in South Africa) usually require several hours of computation, and the tool cannot presently be used easily as an interactive exploratory visualisation tool in near-real-time. With a significant improvement of computation time and an improved (and perhaps simplified) user interface, the tool would become invaluable for visualising and testing conservation strategies or simply exploring PD. Furthermore, by incorporating additional information (for example, shapefiles with geographical strata or classes), the tool may be extended to address specific hypotheses or questions, such as whether areas designated as nature reserves, based on species richness, also support high PD. The work on this Phylogenetic Diversity Digital Twin will focus on up-scaling the existing PhyloNext pipeline and creating a more intuitive user interface that targets the most common and relevant use cases. This will create a responsive user experience, allowing the tool to be used, for example, to identify localities to survey; data from such surveys would then feed into the next update of the model. We envision a future version with hypothesis-testing modules, allowing for comparative evaluation of alternative proposals for, for example, designating protected nature areas.

Objectives
The objective of this prototype is to leverage the Biodiverse programme, as implemented in the PhyloNext pipeline, to develop a tool that facilitates calculation and visualisation of PD metrics and other biodiversity metrics from large, standardised and dynamic global data sources. Computation should be fast enough to allow dynamic visualisation, so that users can interactively explore results and refine inputs (and potentially parameters) to fine-tune the output. A central developmental focus will be on upscaling the calculations to achieve near real-time outputs of metrics. Another focus will be to use (either directly or as inspiration) the existing demo interface to develop a user interface that is intuitive for experts and non-experts alike and to devise a dynamic and interactive visualisation module for refinement and optimisation of input parameters. The development will consider possible enhancements like hypothesis testing, based on shapefiles. We envision that this prototype will serve users across various domains, including researchers, monitoring initiatives, policy-makers and citizens with an interest in natural history and biodiversity.

Workflow
The existing GBIF demo interface (https://phylonext.gbif.org/) is a graphical user interface developed for the PhyloNext pipeline. It allows the user to provide settings in six sections of a web-based form. The first section, Name and description, allows the user to set a name and description for the particular pipeline/model to be run with the selected settings. In the Phylogeny section, the user can choose to upload a custom phylogenetic tree in Newick format or to use one of a few pre-defined trees, based on the Open Tree of Life. Information on the format of taxon labels is also required. In the Taxonomic filters section, the user defines whether the model should be restricted to certain taxa (at any of the classic taxonomic levels: phylum, class, order, family, genus). In Spatial and temporal filters, a range of years can be defined to which occurrences should be restricted, as well as spatial constraints in the form of a hand-drawn polygon, country name or an uploaded polygon in the GeoPackage format (https://www.geopackage.org/). In the section GBIF Occurrence filtering and aggregation, a number of filters can be applied to the GBIF occurrence data to exclude likely flawed data (e.g. occurrences with known suspicious coordinates of museums, country capitals, country centroids etc.), certain types of data (e.g. material samples) and likely spatial outliers identified through density-based clustering. Finally, the section Biodiverse settings allows the user to define the parameters of the Biodiverse programme that calculates the metrics from the filtered data. After starting the pipeline with the defined settings, a significant amount of time is needed before the user has a visual output.
The envisioned workflow of the final BioDT prototype will build on the demo interface, with a focus on user friendliness and simplification, and will aim for a more interactive, dynamic process in which the user can fine-tune parameters, based on the initial output. A hypothesis-testing module would likely be a combination of a graphical part and text, both for input and output.
The conceptual schema of the proposed workflow is shown in Fig. 1.
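As a rough illustration of how taxonomic, spatial and temporal filters of the kind described for the demo interface translate into a query against GBIF, the sketch below builds a URL for the public GBIF occurrence search API. It only assembles the URL and makes no network request. It is not how PhyloNext retrieves its data; the function name is hypothetical, and the taxon key used for Felidae is an assumption that should be verified against the GBIF backbone before use.

```python
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint.
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"

# Assumed GBIF backbone key for Felidae; verify before relying on it.
EXAMPLE_TAXON_KEY = 9703


def build_occurrence_query(taxon_key, country, year_from, year_to, limit=20):
    """Return a search URL restricted by taxon, country and a range of years."""
    params = {
        "taxonKey": taxon_key,
        "country": country,                  # ISO 3166-1 alpha-2 code
        "year": f"{year_from},{year_to}",    # inclusive range of years
        "hasCoordinate": "true",             # keep georeferenced records only
        "hasGeospatialIssue": "false",       # drop records flagged by GBIF
        "limit": limit,
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


# Felidae in South Africa ("ZA"), 2000 to 2020.
print(build_occurrence_query(EXAMPLE_TAXON_KEY, "ZA", 2000, 2020))
```

The search API is suited to small, exploratory queries; analyses over millions of records would normally rely on bulk downloads rather than paged searches.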

Data
The PhyloNext pipeline is already accessible and in practice able to use all global occurrence records from GBIF. Currently, these amount to almost 3 billion occurrences of taxa across the tree of life. Theoretically, all these taxa would also be in the Open Tree of Life phylogeny. However, in reality, there are several groups (especially of bacteria, fungi and micro-eukaryotes) where the taxonomy is in flux and there are many species that do not carry formal binomial names. Several of these groups are best defined by molecular species concepts, and several systems have emerged for making such dark taxa operational by providing them with persistent, unique identifiers. As these identifiers slowly find their way into both global molecular phylogenies and biodiversity databases, we will be able to obtain phylogenetic diversity metrics that vastly surpass what is currently available both in accuracy and taxonomic coverage and with much less bias. The data sources are described in Table 1.

Data source | Data type | Notes
Global phylogeny from Open Tree of Life (https://tree.opentreeoflife.org/) | Phylogenetic tree (Newick format, representing graph-theoretical trees with edge lengths using parentheses and commas) | The synthetic phylogenetic tree from the OToL, constructed using all the contributing trees, is available for download in Newick format. PhyloNext accesses OToL through web APIs and retrieves a taxon-specific tree, optionally filtering out taxa without phylogenetic support. The latest released synthetic tree (v.14.9) includes 2,392,578 tips (≈ taxa/species).
Global

Model
The final prototype biodiversity twin will, in most aspects, reuse the analytical flow and tools developed in the PhyloNext pipeline.
Key features of the model:
• Input data specifications: As described in the workflow, the model takes filtered occurrence data from GBIF and phylogenetic data from the Open Tree of Life.
• Phylogenetic tree preparation: The workflow supports pre-constructed phylogenetic trees, as well as retrieving synthetic trees from the Open Tree of Life. This step includes matching species names (from the tips of the phylogenetic tree) with GBIF species keys.
• Spatial binning: The workflow uses a discrete global grid system (for example, H3 by Uber, https://h3geo.org/) for the spatial binning of species occurrences.

• Diversity and endemism estimation: Using the Biodiverse programme (Laffan et al. 2010) for each grid cell of the study area, the workflow calculates an array of diversity metrics.
A schematic illustration of the dataflow in PhyloNext can be seen in Fig. 2. Image from Mikryukov et al. (2024).
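The spatial binning and per-cell estimation steps listed above can be sketched as follows. To keep the example free of external dependencies, a simple square latitude/longitude grid is used as a stand-in for a discrete global grid system such as H3, and species richness is the only metric computed. The function names, cell size and occurrence records are hypothetical and are not taken from PhyloNext.

```python
import math
from collections import defaultdict


def cell_id(lat, lon, cell_size_deg=1.0):
    """Assign a coordinate to a square grid cell (a stand-in for an H3 index)."""
    return (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))


def richness_per_cell(occurrences, cell_size_deg=1.0):
    """Count the distinct species recorded in each grid cell."""
    species_by_cell = defaultdict(set)
    for species, lat, lon in occurrences:
        species_by_cell[cell_id(lat, lon, cell_size_deg)].add(species)
    return {cell: len(species) for cell, species in species_by_cell.items()}


# Invented occurrence records: (species, latitude, longitude).
OCCURRENCES = [
    ("Panthera leo", -24.1, 31.5),
    ("Panthera leo", -24.7, 31.2),      # duplicate species in the same cell
    ("Panthera pardus", -24.3, 31.9),
    ("Acinonyx jubatus", -25.6, 20.4),
]

print(richness_per_cell(OCCURRENCES))
```

The per-cell species sets are the natural input for tree-based metrics such as PD; richness is shown here only because it needs no phylogeny.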

FAIRness
The occurrence data used for the modelling is FAIR in the sense that it is all publicly available as standardised data in the GBIF index under open licences and is findable and accessible via web interfaces (https://www.gbif.org) and APIs (https://techdocs.gbif.org/en/openapi/). The demo interface offers the functionality to generate a citable DOI for each PhyloNext pipeline execution and to create a sharable link to the results. Additionally, GBIF has implemented a mechanism for generating a unique DOI for a derived dataset (https://www.gbif.org/derived-dataset/about), facilitating tracking and proper accreditation of all individual datasets that contributed to the underlying occurrence data. Similarly, the data from OToL is licensed openly and accessible with APIs (https://github.com/OpenTreeOfLife/germinator/wiki/Open-Tree-of-Life-Web-APIs). The source code for the PhyloNext pipeline and the demo GUI are also fully open and accessible on GitHub (see above). The code to be developed for the BioDT prototype will also be fully open, ensuring its alignment with FAIR principles. Hosting at GBIF.org will make the DT fully open and available, and it will be provided under a CC-BY-NC licence. The outputs of the model will be downloadable using widely accepted standards for biodiversity and spatial data.

Performance
Because PhyloNext uses containerisation technologies (Docker and Singularity/Apptainer) to encapsulate all software dependencies, the pipeline was successfully installed on the petascale supercomputer LUMI. The initial tests have also been conducted with success. The next steps involve exploring various approaches for scaling up. The PhyloNext pipeline is highly adaptable for HPC and cloud environments. For example, it allows for the configuration of specific resource requirements (CPUs and RAM) for each process independently, so that the pipeline can launch these tasks as independent jobs through the SLURM workload manager. Alternatively, a fixed amount of resources (e.g. a single computational node) can be allocated exclusively for the pipeline's operation, facilitating optimisation for specific tasks or datasets. For storage, the pipeline is also capable of utilising S3-compatible object storage, which can significantly enhance performance by offering scalable, high-speed access to data. Additionally, a resource usage profiler is included, which allows us to monitor and optimise the resources required for the analysis. Furthermore, the Biodiverse programme, which calculates a wide array of diversity metrics, incorporates essential optimisations (e.g. caching and re-using computationally expensive calculations) that can significantly enhance the speed of analysis.
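The general idea of caching and re-using expensive calculations can be illustrated with a small sketch. Here the set of branches between a species and the root is computed once and then re-used for every grid cell in which that species occurs. This is a generic memoisation example on an invented toy tree with hypothetical names; it does not reproduce the optimisations implemented in Biodiverse.

```python
from functools import lru_cache

# Toy tree: node -> (parent, length of the branch leading to the node).
# The root has no entry. All values are invented for illustration.
PARENT = {
    "n1": ("root", 1.0),
    "n2": ("n1", 1.0),
    "lion": ("n2", 1.0),
    "tiger": ("n2", 1.0),
    "cheetah": ("root", 3.0),
}


@lru_cache(maxsize=None)
def branches_to_root(tip):
    """Return the branches between a tip and the root; cached per tip."""
    branches = []
    node = tip
    while node in PARENT:
        branches.append(node)
        node = PARENT[node][0]
    return frozenset(branches)


def pd_for_cell(species):
    """Sum the branch lengths spanned by the species found in one grid cell."""
    branches = frozenset().union(*(branches_to_root(s) for s in species))
    return sum(PARENT[b][1] for b in branches)


# Two hypothetical grid cells that share one species ("lion").
CELLS = {"c1": ("lion", "tiger"), "c2": ("lion", "cheetah")}
RESULTS = {cell: pd_for_cell(species) for cell, species in CELLS.items()}

print(RESULTS)
print(branches_to_root.cache_info())  # the path for "lion" is re-used once
```

With many grid cells and widespread species, most path lookups become cache hits, which is the kind of saving that matters when metrics are recalculated interactively.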

Interface and outputs
As described above, the demo interface is a web-based user interface with a panel where users can select the geography and taxonomy of interest (e.g. mammals of South Africa), choose a phylogenetic tree and configure model settings (the size of spatial bins, the number of randomisation iterations etc.). Results are shown on a map. Currently, some implicit knowledge about the procedure is expected from the user: for example, what the different types of taxon labels, bin sizes and numbers of randomisations mean, and what the various optional filtering terms in Darwin Core (the TDWG biodiversity data standard used by GBIF) mean and which values they can have. Additionally, it is easy for a user without prior knowledge to select parameters that conflict (e.g. using a tree that has a taxon focus other than the taxonomic filter for the occurrence data). The user interface of the Digital Twin prototype to be developed is intended to be a more user-friendly version of that GUI, for example, with more guidance, protection against conflicting values and fixed vocabularies where it makes sense. A number of pre-defined models (specific settings of the model) are also planned, so that users can start with examples that intuitively make sense.
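One possible form of protection against conflicting values is a check that the selected tree actually covers the taxa that pass the occurrence filters. The sketch below is an assumption about how such a check could look; the function names, the threshold of 50% and the wording of the warning are hypothetical and are not part of the existing demo interface.

```python
def overlap_fraction(tree_tips, occurrence_species):
    """Fraction of the filtered occurrence species that are tips in the tree."""
    occurrence_species = set(occurrence_species)
    if not occurrence_species:
        return 0.0
    return len(occurrence_species & set(tree_tips)) / len(occurrence_species)


def check_settings(tree_tips, occurrence_species, min_overlap=0.5):
    """Return a warning message if tree and taxonomic filter conflict, else None."""
    fraction = overlap_fraction(tree_tips, occurrence_species)
    if fraction < min_overlap:
        return (
            f"Only {fraction:.0%} of the filtered species occur in the selected "
            "tree; the tree and the taxonomic filter may have a different focus."
        )
    return None


# A tree of oaks combined with a filter for cats should trigger a warning.
print(check_settings({"Quercus robur", "Quercus petraea"}, {"Panthera leo"}))
# A matching tree and filter should pass silently (prints None).
print(check_settings({"Panthera leo", "Panthera pardus"}, {"Panthera leo"}))
```

Running such a check before the pipeline starts would let the interface flag a mismatch immediately, instead of after a long computation.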

Integration and sustainability
If the final prototype runs smoothly and is user-friendly as planned, one possibility is hosting it at the Global Biodiversity Information Facility (GBIF.org), similar to the current demo interface, but with a formal release and comprehensive user guidance.

Application and impact
If a fast and user-friendly version of PhyloNext, equipped with an intuitive graphical user interface, is successfully developed, the Digital Twin prototype may become a valuable tool and analytical hub for many years. The primary data source, occurrence data from GBIF.org, is constantly growing and supported by a stable infrastructure. Currently, the model by default uses the synthetic phylogenetic tree from the Open Tree of Life project, a resource that is also continuously expanding and improving. Thus, the estimates produced by the model will automatically improve over time.
A fast and interactive tool for visualising phylogenetic diversity (and other associated metrics) may serve numerous applications across various user groups as mentioned above. Researchers may use it to quickly visualise and explore taxonomic groups or geographic areas of interest as a tool to formulate new hypotheses. Monitoring initiatives may use the tool to visualise the impact of their work and identify areas of future attention or sampling. Policy-makers will be able to examine the potential impact of competing proposals for nature conservation.
By integrating modules for hypothesis testing and other advanced functionalities, the scope of potential applications could expand even further. Examples include comparative studies across ecosystems or biomes (which can provide information for conservation priorities and strategies at both the European level and globally), evaluation of the impact of different agricultural practices on biodiversity, invasive species management (e.g. identifying potential hotspots for invasive species spread and assessing the effectiveness of management strategies), ranking of potential areas for the expansion of nature reserves and various analyses across time series or other stratifications (e.g. data types).