Essential Biodiversity Variables: extracting plant phenological data from specimen labels using machine learning

Essential Biodiversity Variables (EBVs) make it possible to evaluate and monitor the state of biodiversity over time at different spatial scales. Their development is led by the Group on Earth Observations Biodiversity Observation Network (GEO BON) to harmonize, consolidate, and standardize biodiversity data from varied sources. This document presents a mechanism to obtain baseline data to feed the Species Traits Variable Phenology, or other biodiversity indicators, by extracting species characters and structure names from morphological descriptions of specimens and classifying such descriptions using machine learning (ML). A workflow that performs Named Entity Recognition (NER) and Classification of morphological descriptions using ML algorithms was evaluated with excellent results. It was implemented using Python, PyTorch, Scikit-Learn, Pomegranate, Python-crfsuite, and other libraries, applied to 106,804 herbarium records from the National Biodiversity Institute of Costa Rica (INBio). The text classification results were excellent (F1 score between 96% and 99%) using three traditional ML methods: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC), and Logistic Regression (LR). Furthermore, the results of extracting the names of species morphological structures (e.g., leaves, trichomes, flowers, petals, sepals) and character names (e.g., length, width, pigmentation patterns, and smell) using NER algorithms were competitive (F1 score between 95% and 98%) using Hidden Markov Models (HMM), Conditional Random Fields (CRFs), and Bidirectional Long Short-Term Memory networks with CRF (BI-LSTM-CRF).


Introduction
Biological diversity is a fundamental pillar of life on Earth. Therefore, the governments of the world committed themselves, through the United Nations Convention on Biological Diversity (CBD), to reduce the loss of biodiversity by aiming to meet the Aichi Biodiversity Targets Convention on Biological Diversity (CBD) (2011). However, the ambitious Aichi Biodiversity Targets proposed in the 2011-2020 Strategic Plan for Biodiversity were not achieved. According to reports from different countries to the CBD, the causes of failure related to knowledge and technologies included the lack of biodiversity data for relevant taxa and locations and the lack of monitoring systems to support conservation actions Secretariat of the Convention on Biological Diversity (2020).
Essential Biodiversity Variables (EBVs) are recommended as a global biodiversity monitoring and reporting system to assess the state of biodiversity over time. They provide the basis for generating biodiversity indicators that allow repeated assessments of progress against national and global conservation goals (e.g., the Sustainable Development Goals and the Aichi Biodiversity Targets) Kissling et al. (2018), Hardisty et al. (2019), Turak et al. (2017). The variables were proposed by a group of international ecologists led by the Group on Earth Observations Biodiversity Observation Network (GEO BON). The 22 EBV candidates were suggested in 2013 and organized into six classes (i.e., genetic composition, species populations, species traits, community composition, ecosystem structure, and ecosystem function). Although the EBVs were selected through a rigorous evaluation of dozens of options against criteria of scalability, temporal sensitivity, feasibility, and relevance, their practical implementation remains a challenge Kissling et al. (2018), Pettorelli et al. (2016), Brummitt et al. (2017), Turak et al. (2017).
Species traits include any measurable morphological, phenological, physiological, reproductive, or behavioral characteristic of an individual organism; nevertheless, they can also be generalized to the taxon and population levels. Recently, increasing efforts to integrate species traits have resulted in a significant amount of available data Kissling et al. (2018), Schneider et al. (2019); however, most of these data are associated with taxa rather than with specimens. Aggregating species traits at the taxon level causes critical data for monitoring changes in individual organisms or populations in a particular time and location to be lost Schneider et al. (2019). Species traits have been suggested as indicator variables for monitoring the response of organisms to changes in the environment; for instance, phenological trait information related to changes in the timing of plant leafing, flowering, and fruiting can be used as an indicator of climate change impacts Geijzendorffer et al. (2015), Kissling et al. (2018). Different authors suggest frameworks and ideas for feeding the Phenology EBV from specimen data Kissling et al. (2018). Additionally, there are focused efforts to measure trends in particular species: for example, the UK Spring Index, which tracks the impact of temperature change on the annual mean observation date of four biological events. These events are the first flowering of hawthorn (Crataegus monogyna), the first flowering of horse chestnut (Aesculus hippocastanum), the first recorded flight of an orange-tip butterfly (Anthocharis cardamines), and the first sighting of a swallow (Hirundo rustica) Parliamentary Office of Science and Technology, UK Parliament (2021).
On the other hand, the transformation of texts from taxonomic literature into structured data remains a key challenge in Biodiversity Informatics Hobern and Miller (2019), Miralles et al. (2020). NLP tools and algorithms have been successfully applied to information extraction tasks in biodiversity texts; for example, to extract taxonomic names using rules based on syntax, fuzzy logic, and dictionaries Gerner et al. (2010), Leary (2014), Wei et al. (2010), Sautter et al. (2006) and, in some cases, probabilistic models Akella et al. (2012); to structure complete texts using rules, regular expressions, dictionaries, and heuristics based on text style Sautter et al. (2012), Cunningham et al. (2011), Curry and Connor (2016); and to extract species morphological characteristics using rules, dictionaries, and ontologies Mora and Araya (2018), Duan et al. (2013), Cui (2012), Cui (2013), Balhoff et al. (2014).
Additionally, ML algorithms for tasks such as NER and Classification have been successfully applied in bioinformatics and biomedicine and, more recently, in Biodiversity Informatics. Text Classification and Named Entity Recognition (NER) are classic research topics in the NLP field. Text Classification is a fundamental NLP technique to categorize unstructured text data into predefined labels or tags (widely used, for example, in sentiment analysis). The Allerdictor tool is an example of an application in bioinformatics that models sequences as text documents and uses Multinomial Naïve Bayes (NB) or Support Vector Machine (SVM) for allergen classification Dang and Lawrence (2014), Demichelis et al. (2006).
NER is the first step in many NLP tasks. It seeks to locate named entities in free text and classify them into categories. The traditional NER task has expanded beyond identifying people, locations, and organizations to identifying dates, email addresses, book titles, protein names, and numbers, amongst other applications. Additionally, there has been strong interest in using NER for extracting product attributes from online data due to the rapid growth of E-Commerce Zheng et al. (2018), for assessing people's skill sets in Skill Analysis Fareria et al. (2021), and, in information retrieval, to extract the main elements of a user query to better identify what the user is looking for Putra et al. (2020). In E-Commerce, NER is used to autofill attribute specifications, to improve search, and to build product graphs. Some examples in Biodiversity Informatics include the Specialized Information Service Biodiversity Research (BIOfid), which facilitates automatic extraction of regular categories (e.g., person, location, organization) and taxon names from printed literature about plants, birds, moths, and butterflies hidden in German libraries over the past 250 years Akella et al. (2012), Rössler (2004). The National Commission for Knowledge and Understanding of Biodiversity (CONABIO) in Mexico has trained NER models to extract species names from text written in Spanish Barrios et al. (2015). The "TaxonGrab" method is a web-based project that allows users to upload a text and then displays the list of candidate taxonomic names mentioned in it Koning et al. (2005). NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing) and TaxoNERD use NER to recognize scientific names in biodiversity publications Akella et al. (2012), Le Guillarme and Thuiller (2021). However, at this point, no applied research results have been published on extracting phenological data from morphological descriptions of specimens using ML algorithms.
The main objective of this project was to obtain baseline data to feed the Species Traits Variable Phenology and other biodiversity indicators by extracting species characters and structure names from morphological descriptions of specimens and classifying the descriptions using machine learning (ML). To achieve this goal, an ML workflow was tested to classify specimen descriptions, in order to determine whether the plant had flowers and/or fruits at the time of collection, and to extract species characters and structure names mentioned in the descriptions. A database with 106,804 records from the Herbarium of the National Biodiversity Institute of Costa Rica (INBio) was used to illustrate the proposed approach Vargas (2016).
The remainder of the paper is structured as follows: Section "Materials and methods" provides the detailed workflow of the proposed materials and methods, section "Results" presents the evaluation metrics and results, and section "Discussion" analyzes the results. Finally, conclusions and future work are discussed in "Conclusions".

Materials and methods
This research work presents an effort to extract species morphological characters and structure names using NER algorithms and to classify specimen morphological descriptions to determine whether a given plant had flowers and/or fruits at the time of collection. Successfully applying ML algorithms to NLP problems requires defining a workflow that includes phases such as data selection and pre-processing, model training and testing, and model deployment. Fig. 1 shows the general workflow used in this research. It includes two phases: A) Data Selection and Preprocessing, using the Atta database (INBio). First, the data were cleaned by removing duplicate records, records written in English, and null morphological descriptions, amongst other processes. Then, two datasets were selected for the next phase, one for Classification and one for NER; these datasets were used for the training and test activities. B) During the Models Training and Test phase, models were generated using algorithms such as Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC), and Logistic Regression (LR) for Classification, and Hidden Markov Model (HMM), Conditional Random Fields (CRFs), and Bidirectional Long Short Term Memory Networks with CRF (BI-LSTM-CRF) for NER. Metrics such as accuracy, precision, recall, and F1 score were used to test them.

A. Data Selection and Processing Phase
A.1. Atta Dataset: Atta is an information system developed by INBio to manage data of specimens of different biological groups, such as plants, arthropods, fungi, and nematodes.
The database contains 350,007 records from the kingdom Plantae. Data related to taxonomy (i.e., scientific name and higher taxonomy); plant specimens (i.e., morphological description, date collected, locality, collectors, and sampling protocol, amongst other data); and geospatial data (i.e., locality and geographic coordinates) were obtained from Atta. All the selected specimens were collected in Costa Rica.

Histogram of records by year of collection. Years with few records, from 1892 to 1981, were excluded from the graph (i.e., 110 specimen records were not taken into consideration).

A.2. Cleaning and Random Selection of Data:
In this project, 106,804 records from Atta were used. Atta contains 350,007 records from the kingdom Plantae. Herbarium rules and regulations require sending duplicate specimens to the National Museum of Costa Rica and the Missouri Botanical Garden, so 64% of these records are duplicates. After removing duplicate records, records without a morphological description, discarded specimens, and descriptions written in English, about 93% of the remaining records (i.e., 106,804 records) were tagged (i.e., assigned to one of the classification target classes: has_flowers and has_fruits).
Morphological descriptions of plant specimens use a semi-structured language characterised by Mora and Araya (2018): • They use many abbreviations and omit functional words and verbs, turning sentences into telegraphic phrases to save space in scientific publications; • Texts are written in a very technical language because the formal terminology is based on Latin; • They contain primarily nouns, adjectives, numbers (measures) and, to a lesser extent, adverbs. Verbs are seldom used; • The vocabulary used is repetitive; • They are short because they are included on the specimen label, and sometimes the text is shortened to fit on the label. Fig. 5 shows the distribution of description lengths for specimens from the INBio Herbarium; • They use a highly standardised syntax even though they are written in natural language.
Supervised machine-learning algorithms were used to classify descriptions. Training supervised models involves adjusting their parameters using examples that allow models to map an input to the desired output, in this case, the target classes. Examples were built from the specimens' morphological descriptions by manually assigning each description to one of the classes (i.e., has_flowers and has_fruits). For example, the morphological description "Creciendo en tronco seco. Flores naranjas. Muestra conservada en alcohol" ("Growing on the dry trunk. Orange flowers. Sample preserved in alcohol", in English) was assigned to the has_flowers class, and the description "Arbusto de 35 m. en el sotobosque. Frutos de color verde y rojo a púrpura oscuro cuando están maduros. Escaso " ("35 m shrub in the understorey. Green and red fruits to dark purple when ripe. Scarce", in English) was assigned to the has_fruits class. Descriptions were standardised by changing their contents to lowercase, removing special characters, and tokenising each description (i.e., breaking descriptions into words, symbols, or other elements called tokens).
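The standardisation step described above can be sketched in Python as follows (a minimal illustration, not the project's exact code; the function name and the regular expression are assumptions):

```python
import re

def preprocess(description: str) -> list:
    """Lowercase a description, strip special characters, and tokenise it."""
    text = description.lower()
    # Keep letters (including Spanish accented characters), digits, and spaces
    text = re.sub(r"[^0-9a-záéíóúüñ\s]", " ", text)
    return text.split()  # simple whitespace tokenisation

print(preprocess("Creciendo en tronco seco. Flores naranjas."))
# ['creciendo', 'en', 'tronco', 'seco', 'flores', 'naranjas']
```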
Two classes were used to classify the specimen morphological descriptions and determine whether a plant had flowers and/or fruits at the time of collection: has_flowers and has_fruits, respectively. The 106,804 records from INBio's database (i.e., Atta) were tagged. Fig. 6 shows the number of records with zero, one, or two classes assigned in the selected samples. Records were tagged manually using SQL statements in a PostgreSQL database. Descriptions such as "sin flores" (no flowers), "sin frutos" (no fruits), and "sin flores ni frutos" (no flowers or fruits), amongst others, were not included in the experiments because very few descriptions presented that pattern.

A.4. Tagging Data for NER:
A small subset of the cleaned records used in the classification process was randomly selected for extracting species characters and structure names using supervised ML algorithms: eight thousand specimen records were chosen for this purpose.
To prepare examples, different standard approaches to sequence tagging Goyal et al. (2018) were evaluated, such as IO (Inside, Outside), BIO (Begin, Inside, Outside) Ramshaw and Marcus (1999), and BIOE (Begin, Inside, Outside, End) Sang and Veenstra (1999). Given the characteristics of the morphological descriptions mentioned above, the BIO scheme was used: B marks the token that begins an entity, I represents the intermediate tokens in multi-word entities (e.g., "botones florales", flower buds), and O any other token, including punctuation marks. Very few multi-word entities were found in the specimen morphological descriptions. Fig. 7 shows the number of words assigned to each label (i.e., B, I, O).
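To make the BIO scheme concrete, the following sketch converts token-level entity spans into B/I/O tags (illustrative only; the helper name and the span format are assumptions, not the project's code):

```python
def to_bio(tokens, entity_spans):
    """Assign BIO tags to `tokens` given entity spans as
    (start, end) token-index ranges (end exclusive)."""
    tags = ["O"] * len(tokens)          # default: outside any entity
    for start, end in entity_spans:
        tags[start] = "B"               # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"               # intermediate tokens of the entity
    return tags

tokens = ["botones", "florales", "de", "color", "crema"]
# "botones florales" (flower buds) is a two-token structure name
print(list(zip(tokens, to_bio(tokens, [(0, 2)]))))
# [('botones', 'B'), ('florales', 'I'), ('de', 'O'), ('color', 'O'), ('crema', 'O')]
```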
The following activities were carried out for the tagging process: • In addition to the has_flowers and has_fruits classes, the 106,804 specimens were associated with other classes, such as has_leaves and has_stems (has_root was not used because very few descriptions mentioned roots). These classes were used to randomly select two thousand records per class, in order to balance the presence of structures belonging to all classes. In total, eight thousand records were selected, covering the classes has_flowers, has_fruits, has_leaves, and has_stems. • The FreeLing v.4.2 morphological analyzers and taggers Padró and Stanilovsky (2012) were used for tokenizing, lemmatizing, and POS-tagging (part-of-speech tagging) the morphological descriptions. POS-tagging was used to semi-automatically assign a class (e.g., noun, adjective, verb, adverb, article) to each token; most plant structures and characters correspond to nouns in sentences.
• Using the POS tags generated by FreeLing, each token was assigned a B, I, or O tag, depending on its role in the sentence. • Two thousand of the eight thousand randomly selected records were assigned to each of the four team members for manual review of the labels.
The classification objective was to determine whether each specimen morphological description mentions the presence of flowers or fruits, that is, to assign each description to the has_flowers and/or has_fruits classes. Each sample could be assigned to zero, one, or both classes; therefore, the problem corresponds to a multi-label classification task. The algorithms Multinomial Naive Bayes (NB) Klampanos (2009), Linear Support Vector Classification (SVC) Chang et al. (2008), and Logistic Regression (LR) Bishop (2006) were used for the experiments.

B) Models Training and Evaluation
The input to the models was a one-dimensional vector (x1, x2, ..., xn) with the morphological descriptions. This 1D vector was converted into a feature matrix using TF-IDF (Term Frequency-Inverse Document Frequency) weights on the frequencies of words occurring in the descriptions, with an n-gram range of (1, 3) (i.e., unigrams, bigrams, and trigrams were extracted).
To estimate the skill of the models on new data, ten-fold cross-validation was used with the function cross_val_score (Scikit-Learn) in combination with the NB, SVC, and LR algorithms Pedregosa (2011). The One-vs.-Rest (OvR) strategy was applied to handle the multi-label classification problem. The parameters used with each algorithm were as follows: NB (learning class prior probabilities enabled, with priors adjusted according to the data), SVC (hinge loss function, tolerance of 1e-4, inverse regularization strength of 1.0, intercept fitting enabled, one-vs.-rest multi-class strategy, and a maximum of 1,000 iterations), and LR (tolerance of 1e-4, L2 penalty, inverse regularization strength of 1.0, and the L-BFGS solver for optimization Liu and Nocedal (1989)).
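The OvR multi-label setup can be sketched as follows. This is a toy illustration with invented texts and labels: the real study ran ten-fold cross-validation on 106,804 records, whereas here the model is simply fit on a handful of examples and applied to one new description:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy descriptions with multi-label targets [has_flowers, has_fruits]
texts = [
    "flores naranjas", "frutos verdes", "flores blancas y frutos rojos",
    "arbusto esteril", "flores moradas", "frutos maduros",
]
y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

# One binary SVC per label, with the hyperparameters reported above
clf = OneVsRestClassifier(
    LinearSVC(loss="hinge", tol=1e-4, C=1.0, fit_intercept=True, max_iter=1000)
)
clf.fit(X, y)
pred = clf.predict(vectorizer.transform(["flores y frutos"]))
print(pred.shape)  # one row, two label columns
```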

B.2. NER: Train Models using HMM, CRFs, and BI-LSTM-CRF:
Out of the 106,804 specimen records, 8,000 were randomly selected; 80% of these records were used for training and the remaining 20% for testing the models. The training and testing of the models were done using Python version 3 Python Software Foundation (2021). The aim of applying NER tagging to the data was to extract characters and structure names from the morphological descriptions (e.g., flowers, trunk, color, height), where every token of a description was assigned a B, I, or O tag. With this purpose in mind, the algorithms CRFs Lafferty et al. (2001), BI-LSTM-CRF Huang et al. (2015), and HMM Baum and Petrie (1966) were used for NER tagging. The information considered relevant for training the CRFs and BI-LSTM-CRF models was the token, its POS tag, and the assigned label; for HMM, only the token and its label were considered.
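The 80/20 partition described above can be sketched as follows (illustrative; the function name and the fixed seed are assumptions):

```python
import random

def split_records(records, train_frac=0.8, seed=42):
    """Shuffle records deterministically and split them into
    training and test subsets (80/20 by default)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_records(range(8000))
print(len(train), len(test))  # 6400 1600
```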
In order to train the HMM model, bigram, sequence starting, and sequence ending counts were used to estimate the probability distribution and generate every state and transition that the model would use for its predictions.
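The count-based estimation described above can be sketched in plain Python with relative frequencies (an illustration only; the study used the Pomegranate library, and sequence-ending counts are omitted here for brevity):

```python
from collections import Counter, defaultdict

def normalise(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def estimate_hmm(tagged_sentences):
    """Estimate HMM start, transition, and emission probabilities
    from (token, tag) sequences by relative-frequency counting."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        start[tags[0]] += 1                # sequence-starting counts
        for tok, tag in sent:
            emit[tag][tok] += 1            # emission counts
        for a, b in zip(tags, tags[1:]):
            trans[a][b] += 1               # bigram (transition) counts
    return (normalise(start),
            {k: normalise(v) for k, v in trans.items()},
            {k: normalise(v) for k, v in emit.items()})

data = [[("flores", "B"), ("naranjas", "O")],
        [("frutos", "B"), ("maduros", "O")]]
start_p, trans_p, emit_p = estimate_hmm(data)
print(start_p)   # {'B': 1.0}
print(trans_p)   # {'B': {'O': 1.0}}
```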
To train the CRFs model, each token in the training data was converted into a set of features that was later fed to the model. The features considered for every word were the word itself, its last three letters, whether it was a punctuation mark or a digit, its POS tag, and the first two letters of the POS tag. Each token was processed using its own features combined with those of the next and previous words in the sentence (where applicable). Afterwards, the model was trained with the hyperparameters listed in Table 1.
To train the BI-LSTM-CRF model, every word in the dataset was put into a dictionary that was later passed to the model; this had to be done for all records. The model worked with every sentence not as a string of words but as a tensor of the words' indexes in the dictionary. Once the data were ready, the model ran a forward pass with the negative log-likelihood cost function, then computed the loss and gradients, and updated the model parameters. This process was repeated for every sentence in the training set in every epoch. The model was trained with the hyperparameters listed in Table 2.
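The token features described above can be sketched as a feature function in the style commonly used with python-crfsuite (illustrative only; the feature dictionary keys and the EAGLES-style example POS tags are assumptions):

```python
def word2features(sent, i):
    """Features for token i in a sentence of (word, POS tag) pairs:
    the word, its last three letters, punctuation/digit flags, the POS
    tag, its first two letters, and the neighbouring words if present."""
    word, pos = sent[i]
    feats = {
        "word": word.lower(),
        "suffix3": word[-3:],
        "is_punct": not word.isalnum(),
        "is_digit": word.isdigit(),
        "pos": pos,
        "pos2": pos[:2],
    }
    if i > 0:
        feats["prev_word"] = sent[i - 1][0].lower()
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1][0].lower()
    return feats

# Example tokens with FreeLing-style POS tags (assumed for illustration)
sent = [("flores", "NCFP000"), ("naranjas", "AQ0FP0")]
print(word2features(sent, 0)["suffix3"])  # 'res'
```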

B.3. Models Evaluation (Accuracy, Precision, Recall, and F1 score):
The metrics generally used in classification and NER problems to evaluate the results are precision and recall Dandapat (2011). They measure the percentage of correct classification and the completeness of the method, respectively. In addition, the accuracy and the F1-score (the harmonic mean between precision and recall) were computed.
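For reference, these metrics can be computed directly from true-positive, false-positive, and false-negative counts (a minimal sketch; the study used Scikit-Learn's implementations):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (the F1 score)."""
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(p, r, f1)  # all three are 0.9 for this symmetric example
```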

Results
This section reports the experimental results for both the classification and the NER tests. Fig. 1 presents the general workflow of the project. Table 3 gives examples of morphological descriptions obtained from the Atta database. After a cleaning process that involved removing duplicate records, specimens without a morphological description, discarded specimens, descriptions written in English, and records with phrases such as "sin flores" (no flowers), "sin frutos" (no fruits), and "sin flores ni frutos" (no flowers or fruits), amongst other issues in the data, 106,804 records were tagged for the experiments. Table 4 presents the number of specimen morphological descriptions by class, the average description length in characters, and the standard deviation. The objective of the experiment was to train models that could automatically associate the non-exclusive classes has_flowers and has_fruits with the morphological descriptions. As each description can be assigned to more than one class, the One-vs.-Rest (OvR) strategy was used with three traditional ML algorithms: NB, SVC, and LR. The models' skills were estimated using ten-fold cross-validation to prevent overfitting and reduce bias. After executing the ten training and test runs of the different models, metrics such as accuracy, precision, recall, and F1 score were computed by algorithm and class, and the averages were calculated. Table 5 presents the results of the metrics used to estimate the models' skills.

Table 1. Hyperparameters used to train the CRFs model.
Table 2. Hyperparameters used to train the BI-LSTM-CRF model.
Table 5. Classification of morphological descriptions of specimens: performance of the NB, SVC, and LR algorithms.

To measure the impact of different collectors' writing on the results, in a second experiment the training and test data were partitioned using the number of specimens gathered per collector. The test was carried out to verify whether the resulting models were merely trained to parse the writing of the most prolific collectors. Specimen descriptions written by collectors with different numbers of gatherings were selected for testing the models; the rest of the samples were used for training. Fig. 8 shows the results of applying the algorithms to text written by collectors with from one up to 500 collected samples. In all tests, the model results remained above 98% (macro-average F1 score) for the SVC and LR algorithms; only for NB did the result fall to 94% (macro-average F1 score) for collectors with fewer than 10 samples.

Table 3. Examples of the types of morphological descriptions used in these experiments.

For the NER experiment, records such as those shown in Table 6 were used to test the models, and the data cleaning was similar to that used in the classification experiments: records that were duplicated, discarded, lacked a morphological description, or contained descriptions in English were not used in the research.
As seen in the examples, the aim was to tag the entities that appeared in the specimen's description. With this purpose in mind, CRFs, HMM, and BI-LSTM-CRF were used.
The Sklearn Pedregosa (2011) library was used to compute the metrics for evaluating every model's performance on the test data, including accuracy, precision, recall, and F1 score. Table 7 shows the results obtained by each model used in the NER experiment.

Table 4. Number of specimen morphological descriptions by class, average length in characters, and standard deviation.

Fig. 8. Results of applying the algorithms to text written by collectors with from one up to 500 collected samples. The test was carried out to measure the impact of different collectors' writing on the results and to verify whether the resulting models were merely trained to parse the writing of the most prolific collectors. Training and test data were partitioned using the number of specimens gathered per collector: specimen descriptions written by collectors with different numbers of gatherings were selected for testing the models, and the rest of the samples were used for training.

Discussion
A successful workflow was tested in the current project to extract phenological data from morphological descriptions of botanical specimens. Some elements of the project to highlight are: • The results achieved in the classification experiments showed that it is feasible, and generalisable to other biological groups, to use specimen morphological descriptions to automatically obtain phenological data, which, most of the time, are only available in text format. The SVC models surpassed the NB and LR models, with an average F1 score higher than 0.995. • The NER models had problems differentiating when an entity was composed of the name of an entity and an adjective (i.e., "frutos[B] maduros[I] rojos" - "red ripe fruits") and when that same adjective was used to describe the entity (i.e., "frutos[B] maduros" - "ripe fruits"). • The characteristics of the descriptions could explain why the FreeLing tools were not as effective in tagging nouns, which are key elements for NER; this made the manual review of the tagged text more time-consuming. • Although classes were highly unbalanced in all experiments and the description length ranged from 4 to 952 characters, the models' performance was not affected. This was mainly due to the large amount of data used during the training phase and the characteristics of the descriptions. • The data used were collected by INBio throughout the country, over a long period and by more than 400 botanists and technicians, which gives an idea of how variable the descriptions were. Figures 4 and 5 present these data in detail. • Most of the time, data from morphological descriptions of specimens are not shared in global networks that integrate biodiversity data, such as the Global Biodiversity Information Facility (GBIF); sharing them could make it easier to carry out experiments integrating multiple sources and multiple languages.

Table 7. Average precision (P), recall (R), accuracy, and F1-score (F1).

Conclusions
Phenological traits data, such as the timing of plant leafing, flowering, and fruiting, have been suggested as indicators to measure how organisms respond to disturbances and changes in environmental conditions. This document has proposed a workflow that uses ML and NLP algorithms to integrate phenological data extracted from morphological descriptions in text format with other structured data available in specimen records (such as geographic coordinates, taxonomy and collection date). The integrated data, combined with abiotic records (e.g., temperature, precipitation, and humidity), could enable users (e.g., decision-makers, researchers, biodiversity institutes) to answer questions related to the possible effects of environmental changes that occur in time and space on particular species.
As far as we know, this work is the first to apply ML algorithms to specimen morphological descriptions to extract phenological data on flowering and fruiting. The results showed that it is possible to classify specimen morphological descriptions with more than 99% success (F1 score) using a multi-label approach (with classes like has_flowers and has_fruits) and to extract the characters and structure names from descriptions with more than 98% success (F1 score) using NER.
Although models, like the one proposed in this project, achieve excellent results, it is crucial to consider that, even though there are records of the planet's biodiversity that have been systematically collected over hundreds of years, the available data are strongly unbalanced regarding taxa, locality, time, and the number of individuals.
The results of this project can be used to generate baseline data to feed the Phenology EBV from morphological descriptions of specimens written in any language, amongst other applications. Although data about event duration, as proposed by the USA National Ecological Observatory Network, were beyond the scope of this work, the proposed workflow can be applied to the morphological descriptions of specimens of different biological groups, and there are no restrictions on the language used. For biodiversity networks that integrate data from multiple sources using different languages, it is also vital to evaluate cross-lingual algorithms that alleviate the need to manually tag descriptions in a target language by leveraging tagged descriptions from other languages. For more complex texts, more robust algorithms, such as Recurrent Neural Networks (LSTM) and Transformers, can be applied.

Data and Code
Data from the National Biodiversity Institute of Costa Rica is used in this paper. The full dataset and documentation can be downloaded from https://www.gbif.org/dataset/3717f916-d983-4a81-bb13-5f91200871a6. Code for data cleaning and analysis is provided as part of the replication package. It is available at https://github.com/colibri-itcr.