Harmonizing taxon names in biodiversity data: A review of tools, databases and best practices

The process of standardizing taxon names, taxonomic name harmonization, is necessary to properly merge data indexed by taxon names. The large variety of taxonomic databases and related tools is often poorly described. It is often unclear which databases are actively maintained or what the original source of their taxonomic information is. In addition, software to access these databases is developed following non-compatible standards, which creates additional challenges for users. As a result, taxonomic harmonization has become a major obstacle in ecological studies that seek to combine multiple datasets. Here, we review and categorize a set of major publicly available taxonomic databases as well as a large collection of R packages to access them and to harmonize lists of taxon names. We categorized available taxonomic databases according to their taxonomic breadth (e.g. taxon specific vs. multi-taxa) and spatial scope (e.g. regional vs. global), highlighting strengths and caveats of each type of database. We divided R packages according to their function (e.g. syntax standardization tools, access to online databases, etc.) and highlighted overlaps among them. We present our findings (e.g. network of linkages, data and tool characteristics) in a ready-to-use Shiny web application (available at: https://mgrenie.shinyapps.io/taxharmonizexplorer/). We also provide general guidelines and best practice principles for taxonomic name harmonization. As an illustrative example, we harmonized taxon names of one of the largest databases of community time series currently available. We showed how different workflows can be used for different goals, highlighting their strengths and weaknesses and providing practical solutions to avoid common pitfalls. To our knowledge, our opinionated review represents the most exhaustive evaluation of taxonomic databases and related R tools, and of the links among them. Finally, based on our new insights in the field, we make recommendations for users, database managers and package developers alike.


| INTRODUCTION
In the era of big data, combining, harmonizing and analysing massive amounts of ecological data have played a central role in improving our understanding of biodiversity in a changing world (Hampton et al., 2013; La Salle et al., 2016; Michener & Jones, 2012; Wüest et al., 2020). While promising, this new era is also challenging. As exabytes of primary biodiversity data become publicly available, issues of quality control in data integration, interoperability and redundancy have become pressing concerns to address (Jin & Yang, 2020; Kissling et al., 2018; Lenters et al., 2021; Nelson & Ellis, 2019; Soberón & Peterson, 2004; Thomas, 2009; Wüest et al., 2020).
One of the biggest challenges in biodiversity data handling is maintaining a consistent taxonomy of species names associated with different biological attributes (Jin & Yang, 2020; Meyer et al., 2016; Tessarolo et al., 2017; Thomas, 2009). The dynamic nature of taxonomy, reinforced by the growing availability of information and the increasing use of genetic methods to identify species, means that the set of taxon names considered accepted is ever-changing. Taxonomists start by sampling individuals in the field and, when these are considered not yet described, name them based on best knowledge and defined procedures (Dayrat, 2005). These names become de facto accepted. However, some names can become obsolete when, for example, researchers later realize that the species had already been named. Those names are then treated as synonyms of another, now accepted, name (Lepage et al., 2014). In addition to the names per se, taxonomists refer to species through taxonomic concepts, that is, biological entities (Lepage et al., 2014). Which taxonomic concepts researchers use, that is, which are defined as legitimate and valid, can vary across research cultures (Lepage et al., 2014). For some taxonomic groups, general consensus on one taxonomic concept is far from being reached (Chawuthai et al., 2016), generating confusion.
This dynamic process results in difficulties for end users to point to single valid names referring unambiguously to single taxonomic concepts. The use of taxonomic databases helps resolve the different relationships that exist between names and taxonomic concepts (one-to-one, one-to-many, many-to-one or even many-to-many, see Lepage et al., 2014).
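The four kinds of name-to-concept relationships can be made concrete with a small sketch. It is written in Python purely for illustration and assumes a hypothetical flat list of (name, concept) links; the function name, the placeholder names and the concept identifiers are all invented here and do not come from any of the reviewed databases.

```python
from collections import defaultdict


def classify_relationships(pairs):
    """Label every (name, concept) link by the kind of relationship it is part of.

    `pairs` is a list of (taxon_name, concept_id) tuples, a hypothetical flat
    export of a taxonomic database. The label reads "<names>-to-<concepts>".
    """
    pairs = list(pairs)
    concepts_of = defaultdict(set)  # name -> concepts it can refer to
    names_of = defaultdict(set)     # concept -> names that refer to it
    for name, concept in pairs:
        concepts_of[name].add(concept)
        names_of[concept].add(name)

    labels = {}
    for name, concept in pairs:
        name_side = "one" if len(names_of[concept]) == 1 else "many"
        concept_side = "one" if len(concepts_of[name]) == 1 else "many"
        labels[(name, concept)] = f"{name_side}-to-{concept_side}"
    return labels


# Placeholder names and concept identifiers, invented for this example.
links = [
    ("Name A", "concept 1"),  # unambiguous name
    ("Name B", "concept 2"),  # Name B and Name C are synonyms
    ("Name C", "concept 2"),
    ("Name D", "concept 3"),  # Name D is used for two different concepts
    ("Name D", "concept 4"),
]
for link, label in classify_relationships(links).items():
    print(link, label)
```

Only the one-to-one and many-to-one links can be resolved to a single concept automatically; the other cases need additional information, such as authorship or taxonomic context.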
In an attempt to unify taxonomy across the tree of life, multiple initiatives have proposed curated lists of taxon names referenced against accepted taxon names. Taxonomic databases (Box 1) are usually based on extensive community and individual expert knowledge. Decisions on which taxon names are accepted are usually based on robust scientific evidence. These decisions may also rest on less objective grounds, such as the reliability of original resources relative to conflicting studies, or individual preferences for grammar and spelling (e.g. Isoëtes vs. Isoetes; Isaac et al., 2004). However, despite significant efforts in creating a single authoritative list of the world's taxa (e.g. [37]), taxonomic unification has largely advanced through multiple independent efforts with different aims and scopes (e.g. per taxon group or region; Costello, 2020; Garnett et al., 2020). For example, some taxonomic databases, that is, databases that primarily offer reference taxonomic data, focus on specific taxonomic groups (e.g. Freiberg et al., 2020), others on environmental realms (e.g. [34]), providing a reference at either global or regional scale, such as national databases (Figure 1). The last decade brought considerable progress in taxonomy in general towards overcoming the 'taxonomic impediment' (Rouhan & Gaudeul, 2021), the lack of comprehensive information per taxonomic group. These efforts have generated a large number of taxa lists with taxonomically curated information dispersed across very different repositories.
Taxonomic information, through taxon names (Figure 2), can serve as a common basis to index and merge different biodiversity data (e.g. Dyer et al., 2017; occurrences: GBIF: The Global Biodiversity Information Facility, 2020; conservation status: IUCN, 2021; traits: Jones et al., 2009; Kattge et al., 2020; phylogenetic relationships: Smith & Brown, 2018; Upham et al., 2019; invasion status: van Kleunen et al., 2019).
Aside from the challenges with maintaining updated and comprehensive taxonomic databases by themselves, combining and harmonizing additional biological data can be problematic since such datasets may have been created and updated at different times (sometimes spanning several decades), may use different taxonomic databases to standardize taxon names, and may not even be linked to any consistent taxonomic concept (Edwards et al., 2000; Farley et al., 2018; König et al., 2019). Ultimately, if taxonomic name harmonization is not properly executed, researchers are likely to introduce and propagate errors that can lead to misquantified biodiversity components or mismatched data (Bortolus, 2008). Larger amounts of data exacerbate the issue, because taxonomic inaccuracies accumulate as the number of species and the taxonomic breadth increase (Patterson et al., 2010).
Driven by the needs in data harmonization, multiple tools have emerged for this task. This has generated a diverse toolbox but no clear guidance on how these tools could be combined into a meaningful and efficient workflow. Improving our knowledge of the landscape of available taxonomic references and tools is thus critical to developing robust and comprehensive workflows to achieve high levels of data quality and accurate downstream analyses.
Here, we fill this gap by reviewing publicly available taxonomic databases and R packages for taxonomic harmonization, describing common pitfalls to avoid when using them, and proposing hands-on approaches to achieve accurate and precise harmonized lists of taxon names. To our knowledge, our study represents the most comprehensive review and assessment of tools and issues related to taxonomic name harmonization. We present and discuss the main steps towards robust and meaningful harmonization workflows. Specifically, we review taxonomic databases and R packages, and show how they depend on and interact with each other. We focus on R as it is the programming language of choice for ecologists (Lai et al., 2019). We present a Shiny R application that guides users through the labyrinth of tools and resources. We assess the efficiency of different possible taxonomic harmonization workflows through a concrete use case. We then formulate recommendations for end users, tool developers and taxonomic data managers.

| A typology of taxonomic databases
We categorized taxonomic databases (see Box 1) along two axes: taxonomic breadth and covered spatial scale (Figure 1). Taxonomic breadth describes the number of taxonomic groups covered by the database. We use the term 'taxonomic group' as a broad term to describe a group of taxa or taxonomic ranks at which people work (e.g. birds, class Aves; butterflies, order Lepidoptera). Databases have varying taxonomic and spatial breadths, from narrow taxonomic breadth but global scale (e.g. eBird [17]) to broad taxonomic breadth but regional/national scope (e.g. the Chinese Animal Species Database [4]). Some databases even aim to provide information without any taxonomic restriction at a global level, for example, Catalogue of Life [37].
Because navigating the landscape of taxonomic databases can be difficult for users, we provide a wide overview of available databases on as many taxonomic groups as possible at varying spatial scales and taxonomic breadths (Table 1). By covering many databases, this list provides an entry point for users to get a sense of potential sources of taxonomy. The immense variety of taxonomic databases, especially at regional scales, prevents our list from being exhaustive, but it includes most existing global databases.

| The wide landscape of R packages for taxonomy
With the increasing amount of data used in ecological studies, taxonomic harmonization cannot rely on manual curation. Computational tools are needed to help extract, evaluate, manipulate and visualize taxonomic information. Additionally, the use of computational tools [...] In this section, we present the most extensive review, to our knowledge, of R packages that can be used to process taxonomic information (Table 2).
BOX 1 The taxonomic terminology diversity
Across the literature, the terms taxonomic reference (list; e.g. Freiberg et al., 2020), taxonomic authority (list/file; Vanden Berghe et al., 2015), taxonomic databases (Rees, 2014), taxonomic backbone (e.g. Schulman et al., 2021) or taxonomic checklist (Costello, 2020) are used interchangeably, often without clear definitions. The terminological diversity makes it difficult to understand differences between terms and potentially to find the correct resources. For example, the expression 'taxonomic authority' can be confused with the authority when citing a species name, which is the citation of the author name associated with a taxon. Different expressions can sometimes reflect differences in sizes of provided databases, from a simple species list (e.g. to define the list of species names that occur in a given area), to a full nomenclatural reference (with a taxonomy), to systems that also provide synonymy resolution.
In this article, we use 'taxonomic databases' as a generic expression for digital collections of taxonomic information on many individual species, with processes to mitigate potential conflicts between taxonomic designations.
FIGURE 1 Typology of taxonomic databases according to their taxonomic breadth and their spatial scale. The x-axis represents increasing taxonomic breadth from a single taxonomic group to no clear taxonomic restriction (e.g. considering all biota or all Eukaryota). The y-axis represents spatial scale from regional to global.

| Description of the landscape of tools
We identified some packages that provide standardized technical infrastructure for taxonomic experts to develop and work with taxonomic information within R. Infrastructure packages provide basic 'building blocks' for other packages to build onto. taxa [51], used by metacoder [104], provides R-native objects and methods to represent taxonomic data. taxlist [52] contains objects and functions to store taxa lists, synonyms, taxonomic hierarchy and functional traits in a standardized format; it is used by vegdata [102]. taxview [53] provides basic visualization of taxonomic hierarchies; it is used by no other packages. The fact that virtually no other packages rely on them means that several tools reinvent the wheel instead of relying on standardized functions. More widespread reliance on infrastructure packages and associated methods within the small community of R taxonomy package developers could foster best development practices, easier interoperability and increased reproducibility, as has already been done, for example, for spatial data through the sp and sf packages (Bivand et al., 2013; Pebesma, 2018; Pebesma & Bivand, 2005).
We identified 47 packages providing direct access to online taxonomic databases. These packages let users search for a given taxon name in one (or several) online taxonomic database(s) and return a list of potential matching names, considering both accepted names and synonyms. Details about the packages, for example which taxonomic databases they access, are available in S2 and in taxharmonizexplorer, the Shiny app we developed specifically for this review (https://mgrenie.shinyapps.io/taxharmonizexplorer/). Users can explore which package(s) access which database(s), as well as additional useful characteristics, through taxharmonizexplorer, described in the following section.
To overcome these issues several packages provide or build local database copies. lcvplants [22] accesses the LCVP database fully offline through a local copy; it also offers functions to harmonize two lists of names. ncbit [60] provides similar access to the NCBI database [47]. taxadb [94,95] creates a unified local database from different data sources as specified by the user. taxalight [96], which is maintained by the same developers, is faster and has fewer dependencies; it will supersede taxadb (C. Boettiger, pers. comm.).
taxizedb [98] also downloads local copies of the database but, contrary to taxadb and taxalight, it provides the data without standardizing its format between sources. The user can then access the original information through SQL queries tailored for each data- [...]
We identified several packages that deal with taxonomic assignment from genomic data but considered them out of scope of this review (see S1 for the inclusion criteria).
FIGURE 2 Taxonomy as a unifying key for ecological datasets. The two sides represent two exemplary datasets, with A containing the conservation status of taxa (here species) and B their traits (colours show different traits). The datasets are indexed by taxon names 'Sp1' to 'Sp6'. The rounded rectangle in the middle depicts the taxonomic harmonization process: (a) the names are extracted from each dataset, respectively in the orange and purple rectangles; (b) both lists are then compared to a taxonomic database which harmonizes all names. Here the names 'Sp1' and 'Sp6' refer to the same taxon in the taxonomic database (as indicated by the dashed lines). Without taxonomic harmonization, the exact match of names would have resulted in the loss of Sp5 and Sp6 when merging both datasets. LC, NT, VU and CR are abbreviations of Red List statuses, meaning least concern, near threatened, vulnerable and critically endangered, respectively.
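The idea behind Figure 2, merging two datasets on harmonized rather than raw names, can be sketched in a few lines. The sketch below is in Python for illustration only; the toy datasets, the trait values, the synonym table and the function name are assumptions made up for this example and do not reproduce the contents of the figure.

```python
def merge_by_taxon(status, traits, synonym_to_accepted=None):
    """Join two name-indexed datasets, optionally harmonizing names first.

    `synonym_to_accepted` maps a synonym to its accepted name; names that are
    absent from the mapping are kept unchanged. If two names of the same
    dataset map to one accepted name, the later entry silently wins here, a
    case that a real workflow should detect and resolve explicitly.
    """
    mapping = synonym_to_accepted or {}
    status_h = {mapping.get(name, name): value for name, value in status.items()}
    traits_h = {mapping.get(name, name): value for name, value in traits.items()}
    shared = sorted(status_h.keys() & traits_h.keys())
    return {name: (status_h[name], traits_h[name]) for name in shared}


# Toy data invented for this example.
status = {"Sp1": "LC", "Sp2": "VU", "Sp3": "CR"}   # dataset A: Red List status
traits = {"Sp2": 1.2, "Sp3": 0.4, "Sp6": 3.1}      # dataset B: one trait value
synonyms = {"Sp6": "Sp1"}                          # 'Sp6' is a synonym of 'Sp1'

print(merge_by_taxon(status, traits))              # exact matching: Sp1 is lost
print(merge_by_taxon(status, traits, synonyms))    # harmonized: Sp1 is retained
```

With exact matching only the names shared verbatim survive the merge, whereas resolving the synonym first recovers the record that was stored under a different name.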

| Tools: Lessons learned and future direction
To avoid reinventing the wheel, whenever possible, package developers should build their tools on top of existing packages and functions; however, we found little evidence for package or function reuse across packages (see lack of network links in taxharmonizexplorer). As an exception, taxize [76, 77] relies on individual packages that provide functions to access specific online databases (e.g. it relies on rfishbase [67] to access FishBase). The lack of dependencies between packages is inefficient from a developer standpoint and unclear for end users, due to packages performing virtually identical tasks but in a slightly different

| A tool to guide users in the network of resources
To help users navigate the complex network of tools and databases, we developed a Shiny application that lets users explore the relationships between resources and their main characteristics (date of last update, taxonomic breadth, URL, etc.). We called it taxharmonizexplorer and it is available as a perennial archive on Zenodo (Grenié et al., 2021) but also accessible online at: https://mgrenie.shinyapps.io/taxharmonizexplorer.
The application presents on the right side a network that links taxonomic databases and packages (Figure 3). Global databases with a wide taxonomic breadth often aggregate taxonomies trying to provide a unified taxonomic backbone for all covered organisms, such

| STEPPING OUT OF THE TAXONOMIC HARMONIZATION LABYRINTH: RECOMMENDATIONS AND A COMPARISON OF EXAMPLE WORKFLOWS
In this section, we provide general guidelines and best practices to harmonize taxonomy in large biodiversity datasets to avoid common pitfalls. As an illustrative example, we harmonize taxon names from BioTIME (v. 02_04_2018, BioTIME Consortium, 2018; Dornelas et al., 2018), the largest global compilation of time-series assemblages, which includes 44,440 taxa spanning multiple taxonomic groups at broad spatial and temporal scales. BioTIME is often used (~145 citations) and is particularly interesting as it gathers information from different data sources (361 studies), which potentially leads to taxonomic inconsistencies between them. For the sake of simplicity we only focus here on birds, fishes and vascular plants in BioTIME.
We detailed the process and tools used for our taxonomic harmonization (packages, including versions, specific functions and parameter values used). To achieve full reproducibility we encourage others to detail their workflow in a similar fashion, as taxonomic harmonization workflows can be highly sensitive to the exact version of the tools or data used.
We applied four different workflows (WF, Figure 4) [...] (Step 1 as in WF1 and WF2), while in WF4 taxon names are passed directly from BioTIME to GBIF. We included these two workflows because they are intuitive and easy to implement and, as such, appeal particularly to non-taxonomists. We compared the performance of the different workflows by the number of identified names in the different taxonomic groups (birds, fishes and vascular plants).

BOX 2
'Fuzzy matching' is a method to match taxon names that differ by some characters.
How it works
Similarity measures are used to quantify the discrepancy between two names (Meyer et al., 2016; Rees, 2014). [...] matching, sensitivity analyses should be performed using fuzzy matching scores, for example, by randomly sampling taxon names using matching scores as probability weights.
When to use it
[...]

| STEP 1: PREPROCESS NAMES (A.K.A. CLEAN/UNIFY WRITING STYLE)
The writing style of taxon names can vary between sources, complicating harmonization (D. Patterson et al., 2016; Patterson et al., 2010) and becoming a source of errors. These differences arise because of the disparate use of upper and lower case, abbreviations, annotations, depictions of hybrids, authorships, etc. Removing these syntactic issues and standardizing taxon names are thus the starting point of taxonomic harmonization. To match all possible variations of a scientific name, these need to be divided into their stable (e.g. genus, species epithet and authorships) and prone-to-change elements (e.g. annotations) and then combined into only stable elements (Mozzherin et al., 2017). The result is a syntactically normalized list of names. We recommend keeping authorship, whenever possible, alongside the taxon names because it decreases errors. Using taxa authorship information also disambiguates between accepted names and synonyms (e.g. the IRMNG referencing binomial homonyms, Rees, 2021).
To standardize the writing style of taxon names across BioTIME, we used the function gn_parse_tidy() from package rgnparser v.0.2.0 [106]. After parsing taxon names, we kept only the first two words of each parsed name, which ideally represent the scientific binomial name of species (Genus species). We did not keep authorship as most names in BioTIME did not have it. We applied this step for all workflows except WF4. We found that of the 44,326 names reported in the original file, 4,734 taxa (11%) had spelling style differences, that is, species with the same binomial name after parsing. Of the remaining 39,592 unique taxon names, 6,692 were composed of only one word. We removed these taxa as our aim was to match only binomial names. Importantly, the remaining 32,900 names also contained common names and undetermined taxa with taxonomic abbreviation and keywords, for example, 'Family fam'. As our aim was to programmatically harmonize taxonomy using available R packages, we kept such binomial entries as they were returned from rgnparser [106]; such inaccuracies will be solved in the next steps. GBIF offers an alternative name parser, which can be used through rgbif with the parsenames() function [68].
FIGURE 3 Screenshot showing the network view of taxharmonizexplorer. The left section shows a table of each of the nodes in the network to let the user manually select nodes of interest; the top part presents a summary of the information on the selected node in the network. The right section displays the relationships between packages (which package depends on which other), between databases (how one populates another one) and between packages and databases (which packages access which databases).
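The logic of Step 1 (keep the first two words, unify case, drop one-word names, then deduplicate) can be illustrated with a deliberately crude sketch. It is written in Python for illustration and is not a substitute for a real name parser such as the one behind gn_parse_tidy(); the function names and the example strings are assumptions. The last lines only re-check the arithmetic of the counts reported above.

```python
def to_binomial(raw_name):
    """Reduce a raw name string to 'Genus species', or None if it has one word.

    Crude whitespace tokenization only: hybrids, abbreviations and annotations
    are NOT handled, which is exactly what dedicated parsers are for.
    """
    tokens = raw_name.split()  # also trims and collapses repeated whitespace
    if len(tokens) < 2:
        return None
    genus, epithet = tokens[0], tokens[1]
    return f"{genus.capitalize()} {epithet.lower()}"


def unique_binomials(raw_names):
    """Normalize all names and return the sorted set of distinct binomials."""
    parsed = (to_binomial(name) for name in raw_names)
    return sorted({name for name in parsed if name is not None})


# Example strings invented for this sketch (the authorship is a dummy).
raw = [
    "Abalistes stellatus",
    "  abalistes   STELLATUS ",
    "Abalistes stellatus (Anonymous, 1798)",
    "Abalistes",    # single word: dropped
    "Family fam",   # kept, as in the text: two words but not a real binomial
]
print(unique_binomials(raw))

# Consistency check of the counts reported for BioTIME in the text.
total, style_duplicates, single_word = 44_326, 4_734, 6_692
assert total - style_duplicates == 39_592
assert total - style_duplicates - single_word == 32_900
assert round(100 * style_duplicates / total) == 11
```

As in the text, entries such as 'Family fam' pass this purely syntactic filter and have to be caught in later steps.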

| Step 1.5: (if needed) Divide taxa in higher taxonomic groups
In WF2 BioTIME originally assigns taxonomic groups, but these are at the study level rather than for each species. For example, the species Abalistes stellatus was correctly assigned to the fish group except in one study, where it was assigned to the benthos group (to which most of the species in this study belong). To achieve maximal taxonomic accuracy, we reclassified species names into higher taxonomic groups using GBIF. We queried all names against GBIF and, based on higher clades (mostly taxonomic classes, e.g. Sarcopterygii, and unranked clades, e.g. Tracheophyta), we grouped names into three groups that could be referred to by taxon-specific databases: birds, fishes and vascular plants.
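Assigning each name, rather than each study, to a group based on its higher clades can be sketched as follows. The sketch is in Python for illustration and assumes that the lineage of every name has already been retrieved (for example from GBIF) as a list of clade names; the rule table, the function names and the hand-written lineages are illustrative assumptions, not the exact rules used for BioTIME.

```python
# Illustrative rules only: clade name -> analysis group.
CLADE_TO_GROUP = {
    "Aves": "birds",
    "Actinopterygii": "fishes",
    "Elasmobranchii": "fishes",
    "Sarcopterygii": "fishes",
    "Tracheophyta": "vascular plants",
}


def assign_group(lineage):
    """Return the group of the first clade in `lineage` that has a rule."""
    for clade in lineage:
        if clade in CLADE_TO_GROUP:
            return CLADE_TO_GROUP[clade]
    return None  # outside the groups of interest


def group_names(lineages):
    """Group names by analysis group; `lineages` maps name -> list of clades."""
    groups = {}
    for name, lineage in lineages.items():
        groups.setdefault(assign_group(lineage), []).append(name)
    return groups


# Hand-written, abbreviated lineages standing in for the result of a lookup.
lineages = {
    "Abalistes stellatus": ["Animalia", "Chordata", "Actinopterygii"],
    "Parus major": ["Animalia", "Chordata", "Aves"],
    "Quercus robur": ["Plantae", "Tracheophyta", "Magnoliopsida"],
    "Carcinus maenas": ["Animalia", "Arthropoda", "Malacostraca"],
}
print(group_names(lineages))
```

Because the group is derived per name, a fish recorded in a study labelled 'benthos' still ends up with the fishes and can be sent to a fish-specific database.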

| STEP 2: MATCH TAXONOMIC DATABASES
The selection of databases and packages for harmonization depends on the taxonomic breadth and the spatial coverage of the species list under study (Figure 1). In general, we recommend using the most updated and taxa-specific databases. In summary, the workflows using taxon-specific databases performed relatively similarly in the number of matched names, with WF1 matching slightly more species than WF2, but requiring three times the queries needed for WF2. WF3 and WF4 were faster, easier and matched the most species names, but this was at the expense of not resolving many synonyms. Which of these workflows is best depends ultimately on the goal of the taxonomic harmonization process, and users must choose what best suits the task at hand.
Yet, using taxon-specific databases (WF2) to match species names already divided into high taxonomic groups seems an optimal tradeoff between computational speed, programmatic complexity, accuracy and robustness of the harmonization process.

| STEP 3: (DO AT YOUR OWN RISK) RESOLVE UNMATCHED NAMES WITH FUZZY MATCHING
If not satisfied with the number of matches achieved through Steps [...] (Costello et al., 2013; Patterson et al., 2016; Patterson et al., 2010). Some misspellings may have been corrected during Step 2 if species names were matched using fuzzy matching.
To correct spelling errors, algorithms are available to calculate the probability of correspondence between an input taxon name and long lists of names. Although these fuzzy searches have some risks (Box 2), functions like gnr_resolve() from package taxize have arguments that reduce the probability of mismatching. Its argument with_context restricts the search to a narrower taxonomic context, reducing the probability of matching homonyms from different taxonomic groups (Costello et al., 2013; Shipunov, 2011). The IRMNG database, which references colliding genus names across the tree of life, can also be used to check potential typos (Rees, 2021). As fuzzy algorithms programmatically match names based on their orthographic similarity, often without considering additional taxonomic information, extra care should be taken if Step 3 is implemented, including sensitivity analyses and manual checking of matched names.
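The effect of restricting a fuzzy search to a taxonomic context can be illustrated with a generic similarity matcher. The sketch below uses Python's standard difflib module; it does not reproduce the algorithm behind gnr_resolve(), and the tiny reference list, the function name and the 0.9 cutoff are assumptions chosen for this example. Prunella is used because it is a genus name shared by birds and plants.

```python
import difflib

# Tiny hypothetical reference list: accepted name -> taxonomic group.
REFERENCE = {
    "Prunella modularis": "birds",
    "Prunella vulgaris": "vascular plants",
    "Parus major": "birds",
}


def fuzzy_match(name, reference, context=None, cutoff=0.9):
    """Return (best_match, score) for `name`, or None if nothing is close enough.

    When `context` is given, only reference names from that group are
    candidates, which prevents matches to look-alike names from other groups.
    """
    candidates = [
        ref for ref, group in reference.items()
        if context is None or group == context
    ]
    hits = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    if not hits:
        return None
    score = difflib.SequenceMatcher(None, name, hits[0]).ratio()
    return hits[0], round(score, 3)


misspelt = "Prunella vulgarus"
print(fuzzy_match(misspelt, REFERENCE))                             # no restriction
print(fuzzy_match(misspelt, REFERENCE, context="vascular plants"))  # right group
print(fuzzy_match(misspelt, REFERENCE, context="birds"))            # wrong group
```

Keeping the score alongside the matched name makes it possible to flag fuzzy matches and to weight them in later sensitivity analyses.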
We applied this step only to WF2. We looked for misspellings across the 777 names belonging to birds, fishes and plants (from Step 1.5) that were not matched in WF2. We used the function [...]
TABLE 3 Number of species matched using each workflow. Numbers of species matched were calculated after performing Step 2 but before performing Step 3.
Despite the improvement in the number of matches, these may be wrong due to fuzzy matching and orthographic corrections. Therefore, we recommend flagging matches obtained during this step and analysing their influence on downstream analyses to account for such potential issues (Box 2), for example, by randomizing the accepted fuzzy matched names based on their score.
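One possible reading of this recommendation is sketched below: in each random draw, a fuzzy match is accepted with a probability equal to its matching score, and a downstream statistic is recomputed per draw. The sketch is in Python for illustration; the function names, the placeholder names and scores, and the choice of species richness as the statistic are assumptions, and other randomization schemes are equally defensible.

```python
import random


def sample_fuzzy_matches(matches, n_draws, seed=0):
    """Draw `n_draws` random subsets of fuzzy matches.

    `matches` is a list of (input_name, matched_name, score) with scores in
    [0, 1]. In every draw a match is kept with probability equal to its score.
    """
    rng = random.Random(seed)  # fixed seed so that the analysis is repeatable
    draws = []
    for _ in range(n_draws):
        kept = {inp: hit for inp, hit, score in matches if rng.random() < score}
        draws.append(kept)
    return draws


def richness_per_draw(exactly_matched, draws):
    """Species richness per draw: exact matches plus accepted fuzzy matches."""
    return [len(set(exactly_matched) | set(kept.values())) for kept in draws]


# Placeholder names and scores, invented for this example.
exact = ["Species a", "Species b"]
fuzzy = [
    ("Speces c", "Species c", 0.95),  # likely a simple typo
    ("Spcs d", "Species d", 0.60),    # much less certain
]
draws = sample_fuzzy_matches(fuzzy, n_draws=1000)
richness = richness_per_draw(exact, draws)
print(min(richness), max(richness), sum(richness) / len(richness))
```

If the spread of the statistic across draws is small relative to the effect under study, conclusions are unlikely to hinge on the uncertain matches; if it is large, the flagged names deserve manual checking.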

| CONCLUSION
The correct treatment of taxon names is a prerequisite for robust biodiversity research. We proposed a typology of widely used taxonomic databases and extensively reviewed R packages that work with taxonomic data. Throughout our review we identified several areas for improvement, aiming for more integrated and user-friendly resources and processes to harmonize taxon names (Box 3).
Many issues we came across could have been prevented by a more open and inclusive communication across research communities (e.g. ecologists, data scientists and taxonomists). For instance, rigorous and widely spread communication on important new or updated taxonomic resources or relevant tools would help prevent using outdated data or developing redundant tools either as end user or developer. We suggest publishing short release notes of taxonomic databases and tools (and major updates of them) also in target journals of the respective user communities (often possible additionally to data papers).
On a technical side, we specifically see the design and documentation of taxonomic databases and tools as a major field to improve. We urge any researcher and potential tool developer starting with taxonomic name harmonization to do a thorough search for the most suitable (i.e. most reliable, most up-to-date) databases and existing related tools. Users should also document fully their harmonization workflow (software versions, functions, parameters and database versions) for the sake of reproducibility. Vice versa, database managers and tool developers need to make their resources discoverable for all researchers globally and describe them with all necessary meta-data (Box 3). From our review, it is clear that joint efforts between taxonomists and ecologists are strongly needed to understand how these two related fields can inform each other better, improving taxonomic harmonization on one side and making use of and improving existing tools and functions on the other. Teaching and workshops focused on taxonomic name harmonization could foster knowledge and best practices while helping connect both disciplines.
BOX 3 Recommendations and best practices for robust taxonomic harmonization (table columns: Target group, Recommendations)
What can the broad research community do to support these services for many of us? We can start by acknowledging this type of community service more, for example, in similar ways as for reviewing papers. Developing and especially maintaining databases and tools used by many should be more visible and more valued than citation counts alone suggest. Scientific evaluation should fully account for these aspects, and developers and data managers should mention these services prominently in their CVs. Funding agencies should also fund these types of projects and specifically their long-term maintenance or should support, at least, relevant existing structures, which could serve as home for these resources.
Ultimately, we are convinced that joint synthesis efforts across research communities towards a comprehensive resource overviewing taxonomic databases and useful tools, including meta-data and dependencies, will help any user to discover and work with the most suitable and robust information. This resource could be hosted, for example, on platforms already offering global cross-taxa information such as COL [37]. The research community will always need taxonomic experts and initiatives working on these individual resources, but we, as users, also need more guidance on where to find them and how to use them best. Our review and the Shiny app can only be a start, though hopefully a very useful one.