Prototype biodiversity digital twin: crop wild relatives genetic resources for food security

Amidst population growth and climate-driven crop stresses such as drought, extreme weather, fungal and insect pests, as well as various crop diseases, ensuring food security demands innovative strategies. Crop wild relatives (CWR), wild plants in the same genus as the crop as well as wild populations belonging to the same species as the crop, offer novel genetic resources crucial for enhancing crop resilience against these stress factors. Here, we introduce a prototype digital twin (pDT) to aid in searching and utilising CWR genetic resources. Using the MoDGP (Modelling the Germplasm of Interest) tool, the pDT enables mapping geographic areas where stress-tolerant CWR populations can be found. With its graphical user interface, it offers flexibility in selecting genetic resources from CWR tailored to enhance resilience of various crops against diverse stress factors.


Introduction
Population growth and climate change are two of the major factors challenging food security. The human population has increased from one to eight billion over the past 200 years and is expected to reach 11 billion by the end of this century (Roser et al. 2023, United Nations 2023). However, potential agricultural production is challenged by climate change-driven abiotic stresses such as drought, extreme weather, soil acidity and mineral deficiencies, as well as biotic stresses, including fungal and insect pests and various crop diseases (Kumar et al. 2022). To meet the Sustainable Development Goal 2: Zero Hunger (SDG2), we need to boost crop yield by about 70%. For this, we need crops with adaptive capacities to changing environments. Domesticated crops have been under human selection pressure for ages and their gene pool is limited by the domestication bottleneck (Tanksley and McCouch 1997). Valuable genetic resources for broadening this diversity can be found within crop wild relatives (CWR).
CWR are wild plant species closely related to cultivated crops. Broadly, they encompass all wild plants within the same genus as the crop (Maxted et al. 2006). The category also includes wild populations of the same species as the cultivated crops. CWR constitute about 21% of the world's flora (Maxted and Kell 2009). CWR have survived in nature, enduring various selection pressures, both biotic and abiotic. Consequently, they harbour novel genetic resources that can play pivotal roles in crop improvement efforts.
Currently, two prominent challenges hinder the utilisation of CWR in breeding programmes. Firstly, plant breeders often depend on their established breeding lines and the potential contributions of CWR are not well investigated. Secondly, there is a notable absence of user-friendly tools for effective utilisation.
Plant breeders typically depend on the vast collections of plant genetic resources gathered (Loskutov 1999) and conserved ex-situ in several gene banks (FAO 2010). Numerous methodologies have been developed to systematically identify accessions possessing various traits from these collections. One of the earliest methods was the "core collection concept" (Frankel 1984), which aimed to characterise all accessions and create minimally redundant subsets that capture maximum genetic diversity with fewer samples. Initially, around 10% of accessions underwent field trials against various stresses (Frankel 1984, Brown 1989). However, for crops with extensive collections, this approach became impractical, leading to the development of the "mini-core collection", where only 10% of the core collection was evaluated (Upadhyaya and Ortiz 2001, Upadhyaya et al. 2013), leaving most collections untested.
To address this challenge, the FIGS ("Focused Identification of Germplasm Strategy") tool was introduced, building upon earlier work by Michael Mackay (Mackay 1985, Caradus et al. 2012). FIGS employs two main approaches: "FIGS filtering", which filters accessions based on expert knowledge and environmental data (Bouhssini et al. 2009), and "FIGS modelling", which predicts the presence of genetic resources of interest in uncharacterised accessions using field trial data (Sunitha et al. 2023). All these methods primarily serve to filter collections stored in gene banks.
For CWR, both collections and field evaluation data are scarce. To address this challenge, we are introducing the MoDGP ("Modelling the Germplasm of Interest") tool in the CWR pDT. MoDGP leverages species distribution modelling, relying on occurrence data of CWR, to produce habitat suitability maps; to establish mathematical correlations between adaptive traits (such as tolerance to drought and pathogens) and environmental factors; and to map geographic areas where populations possessing genetic resources for resilience against various biotic and abiotic stresses are potentially growing.

Objectives
The main objective of the CWR pDT is to streamline the identification and utilisation of novel genetic resources from CWR through automated data flow, automated modelling runs, uncertainty analysis and timely alerts on potential genetic resources of interest for plant breeders, policy-makers and conservation scientists. Our objective includes the creation of habitat suitability maps for all CWR with sufficient occurrence data, accessible via an intuitive graphical user interface implemented with the R Shiny framework. Our model is designed to be adaptable across different crop species and traits, empowering users to address key research questions in pre-breeding, such as identifying geographic areas where populations of CWR harbouring beneficial genetic resources for enhancing crop resilience to environmental stresses are potentially growing. Additionally, in the pDT, we are developing ecogeographic land characterisation (ELC) maps to identify ELC classes that are under-represented in ex-situ seed collections. This will help to assess gaps in current collection or ex-situ conservation efforts, aiding in the strategic planning of future genetic resource collections.

Workflow
The workflow of the CWR pDT includes automated access to occurrence and environmental data and automated model runs that generate habitat suitability maps for CWR via an ensemble modelling technique, in order to predict and map stress-tolerant populations of CWR for use in breeding programmes (Fig. 1, see also the model section). Additionally, the pDT incorporates a graphical user interface to facilitate end-users' interaction with the outputs of the pDT. The pDT is automated to re-run once a year, depending on the availability of updates in occurrence data.

Data
MoDGP relies on two types of data as input. Firstly, occurrence data are obtained from GBIF (CIAT 2024), with plans to expand sources to include ICARDA, Genesys PGR, EURISCO, RAINBIO and more (Table 1). Genesys, a global data hub for material conserved ex-situ in gene banks, not only provides occurrence data, but also serves as a valuable source of crop trait information. RAINBIO contains georeferenced occurrences, particularly from sub-Saharan tropical Africa, which can be filtered for CWR data. Other data sources are listed in Table 1.
Secondly, environmental variables such as climate (bioclimate data), soil and topographic data are utilised as predictor variables in raster format. We use climate data from ERA5, soil data from SoilGrids and elevation data from SRTM DEM (Table 1). At each occurrence point for each CWR species, values of environmental variables are extracted and prepared as input for MoDGP.
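The extraction step described above can be sketched as follows. This is a minimal illustration in Python (the pDT itself is built in R), assuming a north-up raster on a regular longitude/latitude grid; the `Raster` class, its fields and the `extract` helper are hypothetical names, not part of MoDGP.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Raster:
    """A north-up grid of predictor values (hypothetical structure)."""
    west: float   # longitude of the left edge, in degrees
    north: float  # latitude of the top edge, in degrees
    cell: float   # cell size, in degrees
    values: List[List[Optional[float]]]  # rows ordered from north to south

    def sample(self, lon: float, lat: float) -> Optional[float]:
        """Return the value of the cell containing (lon, lat), or None if outside."""
        col = int((lon - self.west) // self.cell)
        row = int((self.north - lat) // self.cell)
        if 0 <= row < len(self.values) and 0 <= col < len(self.values[row]):
            return self.values[row][col]
        return None


def extract(points: List[Tuple[float, float]],
            layers: Dict[str, Raster]) -> List[Dict[str, Optional[float]]]:
    """Build one record per occurrence point with the value of every layer."""
    return [{name: raster.sample(lon, lat) for name, raster in layers.items()}
            for lon, lat in points]
```

Each returned record holds one value per predictor layer, which is the tabular form that distribution modelling algorithms typically expect as input.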

Model
MoDGP uses different high-performing species distribution modelling algorithms, such as generalised additive modelling (GAM; Wood (2010)), generalised boosted regression modelling (gbm; Greg et al. (2024)) and maximum entropy modelling (MaxEnt; Phillips et al. (2006)), to produce habitat suitability maps of model targets (crops and crop wild relatives). The algorithms in MoDGP function by relating occurrence points to environmental variables to produce habitat suitability maps.
We aim to run models for all CWR with more than 40 unique occurrence records. To represent the absence data, we identify 10,000 points where other species of the same genus are present, but the model target is absent or not recorded. These points are chosen within a buffer area of 15 km from known presence points.
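The selection rule above can be sketched as follows. This is an illustrative Python sketch rather than the MoDGP implementation (which is in R): the function names are hypothetical, distances are assumed to be great-circle distances and the 10,000 points are assumed to be drawn at random when more candidates are available.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees
EARTH_RADIUS_KM = 6371.0088


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two lon/lat points, in kilometres."""
    lon1, lat1 = math.radians(a[0]), math.radians(a[1])
    lon2, lat2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def enough_data(presences: List[Point], minimum: int = 40) -> bool:
    """A species is modelled only if it has more than `minimum` unique records."""
    return len(set(presences)) > minimum


def pseudo_absences(presences: List[Point], congener_points: List[Point],
                    buffer_km: float = 15.0, n: int = 10_000,
                    seed: int = 0) -> List[Point]:
    """Pick up to n congener records lying within buffer_km of a presence point."""
    presence_set = set(presences)
    candidates = [p for p in congener_points
                  if p not in presence_set
                  and any(haversine_km(p, q) <= buffer_km for q in presences)]
    if len(candidates) <= n:
        return candidates
    return random.Random(seed).sample(candidates, n)
```

The brute-force distance check is quadratic in the number of points; a spatial index would be needed at the scale described here.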
To mitigate multicollinearity, we stack all predictor variables and extract their values at both the presence and absence points. Then, we compute Pearson's pairwise correlations and, from variables exhibiting an absolute correlation coefficient exceeding 0.8, only the variable with the lowest variance inflation factor is selected for model runs. Each model is replicated twice using two methods: bootstrapping and substitution of 75% of the data. In each replication, 75% of the data are randomly allocated for training, with the remainder used for evaluation. Consequently, we generate 12 habitat suitability maps for each species (three algorithms, each replicated twice under two replication methods). Results from all algorithms are evaluated against test data using the area under the ROC curve (AUC) and the True Skill Statistic (TSS). Maps from poorly performing models, i.e. those with AUC < 0.7 and/or TSS < 0.6, are dropped and only maps from high-performing algorithms and model settings are kept.
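The run count and the evaluation filter described above can be illustrated with a short Python sketch. The function names are hypothetical, the AUC is computed in its rank-based form and "and/or" is read here as dropping a model when either metric falls below its cut-off; that reading is an assumption.

```python
from itertools import product
from typing import List, Sequence, Tuple

ALGORITHMS = ("GAM", "GBM", "MaxEnt")
REPLICATION_METHODS = ("bootstrap", "substitution")
REPLICATES = (1, 2)


def planned_runs() -> List[Tuple[str, str, int]]:
    """Enumerate the runs per species: 3 algorithms x 2 methods x 2 replicates."""
    return list(product(ALGORITHMS, REPLICATION_METHODS, REPLICATES))


def auc(presence: Sequence[float], absence: Sequence[float]) -> float:
    """Probability that a presence outscores an absence (ties count as half)."""
    wins = 0.0
    for p in presence:
        for a in absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence) * len(absence))


def tss(presence: Sequence[float], absence: Sequence[float],
        threshold: float) -> float:
    """True Skill Statistic: sensitivity + specificity - 1 at a given threshold."""
    sensitivity = sum(s >= threshold for s in presence) / len(presence)
    specificity = sum(s < threshold for s in absence) / len(absence)
    return sensitivity + specificity - 1.0


def keep_model(auc_value: float, tss_value: float,
               min_auc: float = 0.7, min_tss: float = 0.6) -> bool:
    """Keep a model only if both metrics reach their cut-offs (assumed reading)."""
    return auc_value >= min_auc and tss_value >= min_tss
```

Here `planned_runs()` yields the 12 combinations per species, and `keep_model` decides which of the resulting maps are retained.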
The selected maps are combined through an ensemble approach and binary maps are produced using the maximum sum of sensitivity and specificity threshold to distinguish between suitable and non-suitable pixels. Values of abiotic stresses are extracted from suitable pixels and the ranges of tolerance to these stress factors are generated as response curves. CWR of a given crop are ranked based on their range of tolerances to stress factors. For model targets with high tolerance to these factors, geographic areas where plants presenting the desired genotypes are potentially growing will be mapped and provided.
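A minimal Python sketch of the ensemble and thresholding step follows. The text does not specify how maps are combined, so an unweighted per-pixel mean is assumed here; maps are represented as flat lists of pixel values and all function names are hypothetical.

```python
from typing import List, Sequence


def ensemble_mean(maps: Sequence[Sequence[float]]) -> List[float]:
    """Average suitability per pixel across the retained maps (assumed rule)."""
    return [sum(pixel) / len(pixel) for pixel in zip(*maps)]


def max_sss_threshold(presence: Sequence[float],
                      absence: Sequence[float]) -> float:
    """Threshold that maximises sensitivity + specificity on the test scores."""
    best_threshold, best_sum = 0.0, -1.0
    for t in sorted(set(presence) | set(absence)):
        sensitivity = sum(s >= t for s in presence) / len(presence)
        specificity = sum(s < t for s in absence) / len(absence)
        if sensitivity + specificity > best_sum:  # first maximum wins on ties
            best_threshold, best_sum = t, sensitivity + specificity
    return best_threshold


def binarise(suitability: Sequence[float], threshold: float) -> List[int]:
    """Mark pixels at or above the threshold as suitable (1), others as 0."""
    return [1 if s >= threshold else 0 for s in suitability]
```

The abiotic stress values would then be read only from pixels marked 1, which is what restricts the tolerance ranges to predicted suitable habitat.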

FAIRness
We will comprehensively document the entire workflow, spanning from the initial input data through each processing step and modelling, culminating in the generated output. We will ensure that the occurrence data utilised for modelling is referenced using persistent identifiers whenever feasible. Additionally, references to climate, soil and topographic data will be provided. All data employed in the models will be made publicly accessible and free for sharing and usage, with appropriate acknowledgement. The outputs from pDT and the modelling tools utilised to generate these outputs will also be openly available to the public as FAIR Digital Objects (FDOs; Wittenburg et al. (2023)).
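FDOs integrate persistent identifiers and structured metadata to enable cross-domain interoperability, crucial for platforms like the European Open Science Cloud (EOSC), aligning with FAIR principles emphasising machine-actionability (European Commission 2018, Jacobsen et al. 2020). We are building on the RO-Crate approach (Soiland-Reyes et al. 2022) to implement lightweight packaging of the pDT's model description and output together with rich metadata. Structured metadata are provided by Schema.org and its Bioschemas extension (Gray et al. 2017) to facilitate both readability of data packages by humans and processability (i.e. machine-actionability, Weiland et al. (2022)) by software agents. In this way, RO-Crate opens up an implementation path for web-based or "webby" FDOs and enables mobilisation and reuse of the pDT CWR across the Destination Earth framework. This approach aids integration with European initiatives like the European Green Deal, utilising two FDO types to describe computational workflows and capture FAIR data from simulations (Fig. 2). All developed model codes and scripts will be published as open source in the BioDT repository on GitHub (https://github.com/BioDT).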
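To give a sense of what RO-Crate packaging involves, the sketch below assembles a minimal `ro-crate-metadata.json` document in Python, following the general layout of the RO-Crate 1.1 specification (a metadata descriptor, a root Dataset and the described files). It is illustrative only: the function name, file paths and ORCID placeholder are hypothetical, and a complete crate would carry further properties such as a licence and publication date.

```python
import json
from typing import Dict, List


def build_ro_crate(name: str, description: str,
                   files: List[Dict[str, str]], author_orcid: str) -> dict:
    """Assemble a minimal RO-Crate metadata document as a Python dict."""
    graph = [
        {   # descriptor that identifies this JSON file as RO-Crate metadata
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root dataset: the packaged model description and outputs
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "author": {"@id": author_orcid},
            "hasPart": [{"@id": f["path"]} for f in files],
        },
        {"@id": author_orcid, "@type": "Person"},  # PID-based author entity
    ]
    for f in files:
        graph.append({"@id": f["path"], "@type": "File",
                      "name": f["name"], "encodingFormat": f["format"]})
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


if __name__ == "__main__":
    crate = build_ro_crate(
        name="Example habitat suitability output",       # hypothetical
        description="Illustrative package of one model output.",
        files=[{"path": "suitability.tif", "name": "Suitability map",
                "format": "image/tiff"}],                # hypothetical file
        author_orcid="https://orcid.org/0000-0000-0000-0000",  # placeholder
    )
    print(json.dumps(crate, indent=2))
```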

Performance
The CWR pDT aims to run models for tens of thousands of CWR species using different algorithms and model replications. This workload is highly suitable for parallel processing, as the different model runs are independent. In preparation for executing the operation in parallel, the R environment has been containerised with Docker and the container image can be pulled and executed on the CPU partition of the LUMI supercomputer through Apptainer/Singularity and on a cloud through Docker. Initial tests have been run on LUMI-C with this setup, but the parallelisation scheme is not fully implemented yet. The large parallel computing capacity of LUMI-C is expected to be advantageous for achieving the intended large-scale model processing. For smaller workloads, the containerised solution is also directly executable in cloud environments.
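Because the runs are independent, they can be expressed as a flat task list and handed to any worker pool. The Python sketch below illustrates the idea only; the pDT runs R inside a container and its parallelisation scheme is, as noted, not yet finalised. `run_model` is a hypothetical placeholder for a single model fit.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import product
from typing import List, Sequence, Tuple

ALGORITHMS = ("GAM", "GBM", "MaxEnt")
REPLICATION_METHODS = ("bootstrap", "substitution")
REPLICATES = (1, 2)

Task = Tuple[str, str, str, int]


def run_model(task: Task) -> str:
    """Placeholder for one independent model run; returns a label for its map."""
    species, algorithm, method, replicate = task
    return f"{species}/{algorithm}/{method}/{replicate}"


def run_all(species: Sequence[str], executor: Executor) -> List[str]:
    """Submit every species x algorithm x method x replicate task to the pool."""
    tasks = list(product(species, ALGORITHMS, REPLICATION_METHODS, REPLICATES))
    return list(executor.map(run_model, tasks))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = run_all(["species_a", "species_b"], pool)
    print(len(results), "runs completed")
```

A thread pool is used here only to keep the example small; for CPU-bound model fitting, a process pool or a batch scheduler's job array would be the more natural choice.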

Interface and outputs
To provide the best experience of interaction with pDT for multiple end-user groups, such as pre-breeders, researchers, conservation scientists and academicians, we are developing a web interface, based on the R Shiny (https://rstudio.github.io/shiny/authors.html) application. The interface will feature dropdown menus for crops and their corresponding: 1. wild relatives, 2. habitat suitability maps and 3. abiotic stress ranges, amongst others. This will allow users to effectively map the optimal overlap between environmental stress factors and habitat suitability to identify geographic areas where populations resilient to stresses can potentially thrive.
End users can collect samples from mapped areas of interest and test the performances of the genotypes. The user interface also enables users to constrain or relax the tolerance thresholds and decide geographical areas from which the germplasm of interest can be obtained. It can also enable them to prioritise the populations to be tested, based on quality and/or access. Distribution models capture potentially suitable habitats and, thus, may help the discovery of new populations and identify gaps in collection efforts or ex-situ conservation. With improvements in online occurrence data, the validity and robustness of the models can also improve over time. The modelling tools will also be published in open access journals and made available to users.

Figure 2. Actual outline of data model employing the RO-Crate approach for workflow preservation and aggregation (Khan et al. 2019), represented as information nodes in a directed graph using machine-interpretable semantic artefacts, such as schema.org (e.g. http://schema.org/Dataset), as well as PIDs, such as ORCID (https://orcid.org/).

Integration and sustainability
To ensure the long-term availability and accessibility of the pDT CWR, a pilot for the integration into the Big Data processing services of the Destination Earth Data Lake (DEDL; Duatis Juarez et al. (2023)) is under development together with the platform operator EUMETSAT.
A major objective of the pilot study is the implementation of data pipelines between DEDL, as a data aggregator, processing platform and provider of earth observation data, and the pDT CWR, which will serve as a blueprint to facilitate the integration of more Digital Twins into DestinE's core infrastructures. Comprehensive mappings between BioDT's core semantic artefacts, such as schema.org/Bioschemas (fundamental for RO-Crate), and specifications used in DEDL, such as SpatioTemporal Asset Catalogues (STAC), will be provided as FAIR Semantic Mappings to foster the reusability of all resulting data products (Broeder et al. 2021) and subsequently mobilised through BioDT's mapping tool mapping.bio (Wolodkin et al. 2023).

Application and impact
While plant breeders often rely on their breeding lines and landraces, CWR offer not only vast diversity, but have also undergone several (and ongoing) selection pressures and, thus, encompass novel genetic resources. CWR represent approximately 21% of the plant kingdom (Maxted et al. 2006). Assuming that a third of them have adequate occurrence data available, we aim to provide outputs for roughly 7% of the plant kingdom, equivalent to around 26,600 plant species. Different populations of these species exhibit adaptations to various crop stresses. The CWR pDT makes this abundant resource accessible through a graphical user interface, allowing plant breeders to choose amongst several populations of the 26,600 species. The outputs and impacts will grow with enhanced data availability and quality, improving future prospects.
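The arithmetic behind these figures can be written out as below. The total of about 380,000 plant species is not stated in the text; it is inferred here from the 7% share and the 26,600 species and should be read as an implied assumption.

```latex
\begin{align*}
0.21 \times \tfrac{1}{3} &= 0.07 \\
\frac{26{,}600}{0.07} &= 380{,}000
\end{align*}
```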
The suitability maps produced by pDT serve diverse purposes, including in-situ conservation, restoration, ex-situ conservation and seed collection gap analysis. As the pDT is envisioned to re-run automatically on an annual basis, its results are regularly updated, offering up-to-date outputs. These outputs are available at global scale and can be tailored to match different geographic scales, from country to continental levels.
In general, applications and impacts of the pDT can fall into two categories:
1. Climate change adaptation: Plant breeders can utilise the pDT to map populations of CWR possessing novel genetic resources, aiding in the development of crops with high resilience to stresses induced by climate change.
2. Conservation: By identifying geographic regions hosting populations of CWR with adaptive traits, the tool facilitates targeted conservation efforts, thereby aiding in the conservation of genetic diversity. The CWR pDT also plans to integrate ecogeographic land characterisation (ELC) maps via the CAPFITOGEN tool (Parra Quijano et al. 2021). These maps illustrate adaptive scenario classes that can be overlaid on to protected areas to assess conservation of diverse adaptive trait populations. Moreover, the maps facilitate gap analysis in ex-situ gene-banks, thereby improving both ex-situ and in-situ conservation efforts.

Policy implications and recommendations
Crop wild relatives play a critical role in ensuring food security and agricultural resilience in the face of environmental challenges. However, just like other organisms, CWR are facing threats from climate change (Jarvis et al. 2008) and land-cover/land-use changes (Maxted et al. 2012). CWR are also data deficient and under-represented in gene-bank collections, which shows that limited attention is given to both in-situ and ex-situ conservation.
To enhance the conservation and utilisation of CWR genetic resources, it is imperative to strengthen data management and collaboration amongst relevant stakeholders. Drawing from the recommendations by Arnaud et al. (2017) and the collaboration agreement between GBIF and FAO, policy-makers should prioritise the integration of CWR data into existing platforms, such as GBIF, Genesys, EURISCO and FAO PlantTreaty. This entails enhancing data fitness for use in agrobiodiversity through quality standards outlined in the GBIF-FAO collaboration agreement. Additionally, the establishment of a dedicated monitoring directive for CWR can streamline efforts in monitoring and managing CWR populations across Europe and beyond, ensuring their long-term conservation.
Moreover, in-situ conservation efforts for CWR should be supported through coordinated actions at the local, national and regional levels. Building on existing efforts, such as the Nordic CWR policy report, which advocates a regional approach (Fitzgerald et al. 2019), policy-makers should prioritise the development and implementation of comprehensive conservation plans tailored to regional contexts. This includes using genetic reserves adhering to quality standards to ensure effective conservation outcomes (Iriondo et al. 2012). Furthermore, collaboration with initiatives like Biodiversa+ and EIONET can facilitate funding and monitoring programmes for CWR conservation. By adhering to detailed in-situ conservation guidelines, policy-makers can strengthen the resilience of agricultural systems and safeguard the invaluable genetic diversity harboured by CWR populations.
Neither the European Union nor the European Commission can be held responsible for them.
We acknowledge the EuroHPC Joint Undertaking and CSC - IT Center for Science, Finland for awarding this project access to the EuroHPC supercomputer LUMI, hosted by CSC - IT Center for Science and the LUMI consortium, through Development Access calls.
We also thank Taimur Khan, Ingolf Kuhn, Jan Dick and one anonymous reviewer for reviewing and providing constructive comments, which have significantly improved the paper.

Figure 1. Simplified workflow of the crop wild relatives prototype digital twin. CWR - crop wild relatives; GBIF - Global Biodiversity Information Facility; Genesys - Global Information System on Plant Genetic Resources; ICARDA - International Center for Agricultural Research in the Dry Areas; MoDGP - modelling the distribution of germplasms of interest.

Table 1. Data and data sources for the crop wild relatives prototype digital twin.