DNA barcode reference libraries for the monitoring of aquatic biota in Europe: Gap-analysis and recommendations for future work

Effective identification of species using short DNA fragments (DNA barcoding and DNA metabarcoding) requires reliable sequence reference libraries of known taxa. Both taxonomically comprehensive coverage and content quality are important for sufficient accuracy. For aquatic ecosystems in Europe, reliable barcode reference libraries are particularly important if molecular identification tools are to be implemented in biomonitoring and reporting in the context of the EU Water Framework Directive (WFD) and the Marine Strategy Framework Directive (MSFD). We analysed gaps in the two most important reference databases, Barcode of Life Data Systems (BOLD) and NCBI GenBank, with a focus on the taxa most frequently used in WFD and MSFD. Our analyses show that coverage varies strongly among taxonomic groups and among geographic regions. In general, groups that were actively targeted in barcode projects (e.g. fish, true bugs, caddisflies and vascular plants) are well represented in the barcode libraries, while others have fewer records (e.g. marine molluscs, ascidians, and freshwater diatoms). We also found that species monitored in several countries are often represented by barcodes in reference libraries, while species monitored in a single country frequently lack sequence records. A large proportion of species (up to 50%) in several taxonomic groups are represented only by private data in BOLD. Our results have implications for the future strategy to fill existing gaps in barcode libraries, especially if DNA metabarcoding is to be used in the monitoring of European aquatic biota under the WFD and MSFD. For example, missing species relevant to monitoring in multiple countries should be prioritized. We also discuss why a strategy for quality control and quality assurance of barcode reference libraries is needed and recommend future steps to ensure full utilization of metabarcoding in aquatic biomonitoring.

1. Introduction

1.1 DNA barcoding for monitoring aquatic life

[...] (European Commission, 2000). Moreover, the highest proportion of species extinctions to date has been recorded in freshwater (Young et al., 2016), highlighting the importance of monitoring and protecting these ecosystems.

To assess the ecological status, identification of aquatic organisms to family, genus or species level by morphology is necessary, but it is not a straightforward process. For instance, individual differences in expertise, experience and opinion of the identifiers can result in different taxonomic groups being documented from the same waterbody, potentially leading to contrasting ecological assessments (Carstensen and Lindegarth, 2016; Clarke, 2013). An extensive audit of 414 macroinvertebrate samples taken as part of the monitoring programs of German rivers and streams (Haase et al., 2010) documented that 29% of the specimens had been overlooked by the primary analyst in the sorting stage, and that the identification of >30% of the taxa differed between the primary analyst and the auditors. Importantly, these results led to divergent ecological assessments in 16% of the samples [...] general challenges in using short, standardized molecular markers for identification (Hebert et al., 2016), DNA barcoding and metabarcoding offer a less subjective approach than morphology for the identification of organisms in aquatic assessments (Leese et al., 2018).

While national DNA barcode initiatives often start opportunistically and register any species available for sampling, the focus shifts to filling the gaps in the databases as soon as a critical number of species is registered. Which taxonomic groups have priority is typically connected to funded projects, available taxonomic expertise and scientific collections, and is not necessarily the same in each campaign.
Among aquatic taxa, species-rich groups such as arthropods and polychaetes, or economically important groups such as fish, have seen some priority. However, when building barcode reference libraries, there has usually not been a general focus from the start on species or organisms that are particularly relevant for water quality assessments under the WFD or MSFD.

In addition to large national barcoding campaigns, there are smaller activities intended to generate reference barcodes of selected taxonomic groups (e.g. Trichoptera Barcode of Life) or regional biota (e.g. "Barcoding Aquatic Biota of Slovakia - AquaBOL.sk" and "Israel marine barcoding database"). These initiatives, even if lacking substantial funding, can provide important data and in many cases be better targeted towards filling the gaps of barcode libraries than more general campaigns.

Different organism groups are used as Biological Quality Elements (BQEs) to assess the Ecological Quality Status (EQS) of aquatic ecosystems under the WFD. In the MSFD, biodiversity data in general, along with other related descriptors, are used to define Environmental Status (Borja et al., 2013; Zampoukas et al., 2014).

The MSFD is the first EU legislative instrument related to the protection of marine biodiversity. The directive lists four European marine regions: 1) the Baltic Sea, 2) the North-east Atlantic Ocean, 3) the Mediterranean Sea, and 4) the Black Sea. Member States within one marine region, together with neighbouring countries sharing the same marine waters, collaborate in four Regional Sea Conventions (OSPAR, HELCOM, UNEP-MAP and the Bucharest Convention). These different regions naturally share, or aim to share, taxa/species lists for biodiversity assessments and reporting status. The status is defined by eleven descriptors in the MSFD (e.g. biological diversity, non-indigenous species, fishing, eutrophication, seafloor integrity, etc.). For some descriptors, species identification is critical. National marine environmental monitoring often focuses on regular sampling sites and observations of specific habitats and their inhabitants, i.e. groups of organisms such as benthic macroinvertebrates, phytoplankton, or fish. As already mentioned, there are large differences between countries in how biodiversity data are used to evaluate the quality status of aquatic ecosystems. This is indeed true for the marine environment, and only a few countries were able to support this study with national taxalists directly associated with the MSFD. The MSFD overlaps with the WFD, and in coastal waters the MSFD is intended to apply to the aspects of Good Environmental Status that are not covered by the WFD (e.g. noise, litter, other aspects of biodiversity) (European Commission, 2017).
In order to perform barcode gap-analyses for taxa of relevance to the directives and with a European marine perspective, we explored the potential of two existing taxalists: AZTI's Marine Biotic Index (AMBI; Borja et al., 2000) and the European Register of Marine Species (ERMS).
The AMBI is used as a component of the benthic invertebrates' assessment by several Member States in the four regional seas (Borja et al., 2009; European Commission, 2018), in the context of describing the sensitivity of macrobenthic species to both anthropogenic and natural pressures (see e.g. Borja et al., 2000). The index uses the abundance-weighted average disturbance sensitivity of macroinvertebrate species in a sample (Borja et al., 2000), each species being assigned to one of five ecological groups (EG I-V; Grall and Glémarec, 1997) [...]

However, all these studies underlined the necessity of well-curated reference libraries. In Europe, efforts to develop such a resource are made by a group of diatom experts, who curate the Diat.barcode library (Rimet et al., 2016). They also proposed innovative methodologies based on HTS to fill the gaps of this database (Rimet et al., 2018a).

Aquatic macrophytes are recognized as a valid taxonomic group for assessing water quality according to the WFD. They reflect the morphological conditions of the water bodies (diversity and dynamics of the substratum, degree of rigid management of the banks) and are particularly interesting for assessing nutrient pressure. Moreover, they react to anthropogenic interventions in the hydrological regime (potamalisation and water retention). Being plant organisms, macrophytes also have properties, such as longevity and immobility, that make them poor bioindicators in the short term: they integrate disturbed conditions over a considerably long period of time, and it is impossible to accurately locate the source of pressures and the area of impact (Pall and Mayerhofer, 2015). According to the traditional definition, macrophytes are aquatic plants whose vegetative structure develops either in the water on a permanent basis or at least for a few months, or on the surface of water (Cook et al., 1974).
These include species of the Charophyta (Charales), the Bryophyta (mosses), the Pteridophyta (ferns) and the Spermatophyta (seed plants). In the present study we decided [...]

This national-level taxonomic variation in part reflects the natural difference in species occurrences, but is necessary to consider when analysing gaps in the barcode libraries.

Freshwater fish are among the most commonly used organisms for assessing EQS according to the WFD, and their community composition and structure form the basis for a large number of different metrics in Europe (Birk et al., 2012a). Sampling is conducted using a variety of methods, including electro-fishing or netting, and should deliver data on abundance, species composition and age structure of fish present in a water body. However, large differences between countries exist in the percentage of occurring species considered for an assessment, and in whether non-native species influence the overall score or not. In [...] Protocols) this marker has a high potential to become the gold standard for regular eDNA-based fish monitoring in the future. We therefore also evaluate the completeness of the reference library for European freshwater fish species for 12S sequence data.

Aim of this study

The purpose of this paper is to identify gaps in DNA barcode reference libraries that are relevant for European countries when reporting water quality status to the EU in the context of the WFD and MSFD. The gaps for freshwater taxa are reported by country and taxonomic group, and compared across Europe, while gaps for marine organisms are evaluated by taxonomic group.
We also discuss the necessity of both quality assurance and quality control [...]

For freshwater fish, we treated Europe as a geographic entity, defined not by its political borders but as a "continent", with Turkey, Russia and Kazakhstan only partly included and only with faunistic elements occurring in watersheds that lie within Europe (see also Kottelat and Freyhof, 2007). All lists were made available to taxonomic coordinators of selected taxonomic groups (specialists among the authors) to assure conformity of taxonomy and correct spelling. In this process, the taxonomic validation tool available from the Global Biodiversity Information Facility (GBIF) and WoRMS were used. For fish, the applied taxonomy mostly follows the international Catalog of Fishes (Fricke et al., 2018), which is also the backbone for the BOLD taxonomy.

Finalized species-level checklists were concatenated and uploaded to BOLD, and initial gap-analysis reports were retrieved. The reports were examined by taxonomic specialists to see if any reported gaps were due to taxonomic incongruence between the checklist and the BOLD taxonomic backbone. These were corrected in the uploaded checklists before final analysis (Supplement 2). Separate spreadsheets retaining the country information for each taxonomic group were kept for downstream analyses.

2.2 Gap-report analyses

Two sources of data were obtained from BOLD for the majority of the taxonomic groups. Firstly, the checklist progress report option implemented in BOLD was used. Secondly, the checklists were compared to all publicly available sequence information in BOLD by using datasets for each taxonomic group. Progress reports and datasets were generated on 6 July 2018 for all groups except freshwater fish (1 February 2018), freshwater Annelida (17 September 2018) and Odonata (29 November 2018).
The dataset for Diptera used for the reverse taxonomy analysis was generated on 18 December 2018. The analyses were based on one or two barcode markers, depending on the taxonomic group (see Table 2).

Based on the BOLD gap reports, gap-analyses and summary statistics were calculated for all taxonomic groups using an analytical pipeline of custom-made Python scripts (deposited on GitHub: https://github.com/dnaquanet/gap-analysis.git). This pipeline was largely the same for all groups, except where specified under specific taxon treatment sections.

The data from taxonomic checklists with country information (i.e. nations in which the respective species are monitored) were combined with the information from BOLD. Species-based summaries were generated containing the number of countries in which a species is monitored by extracting the information from the taxonomic checklists. In addition, the total number of reference sequences stored in BOLD (i.e. sequences ≥ 500 bp), hereafter referred to as DNA barcodes, was taken from the progress report of each checklist. Additional BOLD quality criteria for barcodes, such as the availability of a trace sequence, were not considered. Using the publicly available data from the dataset output, it was possible to calculate the number of barcodes publicly stored in BOLD (BOLD public) or mined from GenBank (GenBank) as well as the number of privately stored barcodes in BOLD (BOLD private). Sequences flagged due to potential contamination, misidentification, or presence of stop-codons were excluded from the analyses. For some species, DNA barcodes were deposited under the valid species name as well as under synonyms. In these cases, synonyms were part of the BOLD checklists and the barcode hits were merged to the valid species names.
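The counting step described above can be illustrated with a minimal Python sketch. It is not taken from the deposited pipeline: the function names, the dictionary layout and the placeholder species names ("Aus bus" etc.) are assumptions made for illustration only. It assumes that private records can be inferred as the progress-report total minus the public BOLD and GenBank-mined records found in the dataset.

```python
from collections import defaultdict


def merge_synonyms(counts, synonym_to_valid):
    """Sum barcode counts under the valid species name.

    counts           : dict, name as listed -> number of barcodes
    synonym_to_valid : dict, synonym -> valid name (hypothetical mapping)
    """
    merged = defaultdict(int)
    for name, n in counts.items():
        merged[synonym_to_valid.get(name, name)] += n
    return dict(merged)


def split_sources(total, bold_public, genbank):
    """Return per-species counts for BOLD public, GenBank and BOLD private.

    total       : dict, species -> barcodes in the checklist progress report
    bold_public : dict, species -> public BOLD records in the dataset
    genbank     : dict, species -> GenBank-mined records in the dataset
    Private records are assumed to be the total minus the public ones.
    """
    result = {}
    for name, n_total in total.items():
        pub = bold_public.get(name, 0)
        gb = genbank.get(name, 0)
        result[name] = {
            "BOLD public": pub,
            "GenBank": gb,
            "BOLD private": max(n_total - pub - gb, 0),
        }
    return result


# Placeholder names, not real taxa: "Aus cus" is treated as a synonym of "Aus bus".
synonyms = {"Aus cus": "Aus bus"}
report_total = merge_synonyms({"Aus bus": 4, "Aus cus": 3, "Dus eus": 2}, synonyms)
per_species = split_sources(report_total, {"Aus bus": 2}, {"Aus bus": 1, "Dus eus": 2})
```

Records flagged for contamination, misidentification or stop-codons would have to be filtered out before counting; that step is omitted here.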
In a further step, the proportion of species represented by a minimum number of DNA barcodes (threshold of 1 or 5) was calculated for each checklist. Additionally, country-based summaries were generated, providing an overview of the number of monitored species together with the percentage of barcode coverage for each taxonomic group in the reference libraries (threshold of 1 or 5). For both summary overviews, the available barcode information was sorted into three classes: BOLD public, BOLD total (including BOLD public and BOLD private) and total (including BOLD public, BOLD private, and GenBank). The data were visualized using the Python modules matplotlib (Hunter, 2007) and cartopy (scitools.org.uk/cartopy) together with geographical information from naturalearthdata.com.

In contrast to all other gap-analyses, no geographical data were included for the marine taxa. Hence, the country-based analysis steps of the pipeline were omitted. Due to the large size of the ERMS checklists, no datasets could be produced in BOLD. Thus, only the results of the progress report were analysed for the availability of reference sequences. In the analysis of species used to calculate the AMBI, datasets could be produced in BOLD, and our analyses could distinguish between BOLD public, BOLD private, and GenBank sequence [...] and 18S data in the database. Both valid species names and synonyms were considered; subspecies were also accepted as valid. An overall gap-analysis and country-based summaries were generated. However, only a threshold of 1 was used. As all barcodes in Diat.barcode are publicly available at https://www6.inra.fr/carrtel-collection_eng/Barcoding-database, the differentiation between public and private data did not apply.
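The threshold-based coverage summaries used in this section (share of checklist species with at least 1 or 5 barcodes, in the cumulative classes BOLD public, BOLD total and total) can be sketched as follows. This is an illustrative example only, not the deposited pipeline; the function name, the input layout and the example counts are assumptions.

```python
def coverage(per_species, threshold=1):
    """Percentage of checklist species with >= threshold barcodes per class.

    per_species : dict, species -> dict with counts for the keys
                  "BOLD public", "BOLD private" and "GenBank"
    The three classes are cumulative, as described in the text.
    """
    classes = {
        "BOLD public": ("BOLD public",),
        "BOLD total": ("BOLD public", "BOLD private"),
        "total": ("BOLD public", "BOLD private", "GenBank"),
    }
    n_species = len(per_species)
    result = {}
    for label, sources in classes.items():
        hits = sum(
            1
            for counts in per_species.values()
            if sum(counts.get(s, 0) for s in sources) >= threshold
        )
        result[label] = 100.0 * hits / n_species if n_species else 0.0
    return result


# Invented counts for four placeholder species.
example = {
    "sp1": {"BOLD public": 5, "BOLD private": 0, "GenBank": 0},
    "sp2": {"BOLD public": 0, "BOLD private": 2, "GenBank": 3},
    "sp3": {"BOLD public": 1, "BOLD private": 0, "GenBank": 0},
    "sp4": {"BOLD public": 0, "BOLD private": 0, "GenBank": 0},
}
at_least_one = coverage(example, threshold=1)
at_least_five = coverage(example, threshold=5)
```

For a checklist without private records, such as the diatom data, the three classes would simply coincide. Country-based summaries could be obtained by applying the same function to the subset of species monitored in each country.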
With diatom diversity estimated at 100,000 species (Mann and Vanormelingen, 2013), the many low-frequency species could negatively impact the barcode coverage, even though the high-frequency (abundant) species could be sufficient for monitoring (Lavoie et al., 2009). Hence, we re-analysed the barcode coverage for two checklists (France freshwater phytobenthos and Croatia marine diatoms) using only high-frequency species.

Two standard barcode markers (rbcL or matK) are accepted for vascular plants in BOLD. However, the checklist progress report does not include information on which of the two barcode markers was covered for each taxon. Hence, the first part of the analyses described above was conducted for vascular plants regardless of which of the two markers was present (rbcL OR matK). In contrast, the BOLD dataset includes information on which marker is sequenced for a certain record. Hence, for the public data (BOLD public and GenBank), gap-analyses were performed for each marker as well as for the combination of both markers (rbcL AND matK).

For the gap-analysis of freshwater fish we also included the 12S marker. Since there are no 12S sequence data available in BOLD (as of 1 February 2018) for European freshwater fishes, we manually compared our target species list with the available mitochondrial genomes from MitoFish (http://mitofish.aori.u-tokyo.ac.jp), and NCBI's RefSeq and Nucleotide databases.

As a case study, we analysed the proportion of public barcodes originating from reverse taxonomy for freshwater macroinvertebrates, i.e. specimen identification via its DNA barcode and not by morphology. In the datasets obtained from BOLD, the entry "Identification [...] Taxonomy Match", "Tree based identification" or "DNA Barcoding". A full list is deposited in Supplement 3.
For each species, the number of public barcodes originating from reverse taxonomy was compared to the total number of available public barcodes in BOLD. Four cases were considered in which reverse taxonomy can have a strong influence: i) all public data originate from reverse taxonomy, ii) more than half of the public data originate from reverse taxonomy, iii) at least five public barcodes are present only when barcodes based on reverse taxonomy are included, and iv) fewer than five public barcodes are present and at least one originates from reverse taxonomy.

3. Results

Our results revealed considerable variation in barcode coverage for selected major groups in the queried databases (Table 1). Freshwater vascular plants and freshwater fish had the largest coverage, though still less than 70% of the species had five or more barcodes available. The lowest barcode coverage is found in the marine invertebrates of the ERMS list, from 9.9% (five or more barcodes) to 22.1% (one or more barcodes), and in diatoms (14.6%), while more than 60% of the 4,502 freshwater invertebrate species used in ecological quality assessments of freshwater ecosystems had one or more barcodes (Table 1).

[...] GenBank, but as much as 23% of those species only have private records (Fig. 1, Supplement 2), and 22% of those with barcodes are single specimen records.

If barcodes of a species were not recorded in the BOLD public library, the BOLD private library was queried, and subsequently GenBank. Numbers on bars refer to total number of species in checklist. Thick bars represent phyla, thin bars represent taxa of lower taxonomic rank. Taxonomic groups with fewer than ten species are not indicated.

Among the 10 largest taxonomic groups included in this particular analysis, the Chordata (excluding Vertebrata) displayed the lowest proportion of species with DNA barcodes (38%), though only 26 species (within Ascidiacea) were listed for this taxon. In comparison, the best represented taxon was the Nemertea, which has DNA barcodes for 81% of the 27 species considered, while the second most complete group has 67% (Echinodermata). Most of the remaining taxa have completion levels between 40 and 50%, including the three most species-rich taxa (Annelida, Mollusca and Arthropoda), which comprise 85% of the species in the European AMBI checklist (Fig. 1).
A narrower analysis of Mollusca shows that Bivalvia and Gastropoda have only moderate levels of completion (50 and 47%, respectively), whereas within malacostracan crustaceans, Decapoda (Arthropoda) is far more complete (84%) than Peracarida (45%). However, the number of species considered is highly disparate for these two groups (25 Decapoda vs. 649 Peracarida) (Fig. 1). The proportion of singletons (i.e. only one barcode sequence available) per taxonomic group ranges from 10% to 25%, although for some taxa the observed proportion of singletons was considerably higher (e.g. 50% in Brachiopoda and 38% in Sipuncula).

Most of the species from the AMBI checklist have public DNA barcodes available either from BOLD or GenBank, with only 11% represented exclusively by private records. Two groups have slightly higher values, Echinodermata (15%) and Arthropoda (12%). The levels of completion by AMBI's ecological groups (I to V) are similar, ranging from 43% in group IV to 56% in group III (Supp. Fig. 1). However, 215 species were not assigned to ecological groups, and among these the completion is low (ca. 38%). The proportion of species with barcodes found exclusively in BOLD private ranges from 10% (IV) to 13% (V) across AMBI's ecological groups.

The selection from the ERMS list on BOLD contains 16,962 species. Twenty-two percent of these species have at least one DNA barcode in BOLD (Fig. 2). Of these species, 26% are represented by singletons and nearly 10% have five or more DNA barcodes. These figures include DNA barcodes from GenBank that are present in BOLD. The highest coverage is found in Decapoda (50%), followed by Sipuncula (42%), a phylum with only 45 species in the ERMS list (Fig. 2). At the other end, the lowest coverage (11%) is observed in Brachiopoda (37 species). Nemertea also have a low coverage, 15% for the 380 listed species.
The coverage of most other taxonomic groups ranges from 20 to 30%.

[...] (143) and Holocephali (7). Overall, 82% of the species are barcoded (64% ≥ 5 barcodes), ranging from 100% (71% ≥ 5 barcodes) for the Holocephali to 81% (63% ≥ 5 barcodes) for the Actinopterygii, with the Elasmobranchii coverage in between (92% ≥ 1 barcode, 80% ≥ 5 barcodes) (Fig. 3).

Diatoms
Taxonomic checklists for diatoms were obtained from 16 countries and contained a total of 3,716 species, ranging from 6 (Albania) to 2,236 species (France). This list covers very different habitats: freshwater phytobenthos, freshwater phytoplankton and marine phytoplankton. Some national checklists did not mention which habitat was covered.

The general coverage of diatoms was very low, with 15% of all species having at least one sequence of rbcL or 18S (Fig. 4). The coverage of rbcL (13%) is slightly better than the coverage of 18S (11%). However, in most cases both markers are present if any sequence is available (9%). Per country, the coverage ranged from 10% (France) to 37% (Italy) when both markers are present, and from 15% (France) to 55% (Italy) when at least one of the markers is present (Suppl. Fig. 1).

[...] indicators, the most frequently monitored species have a moderate to high representation of both markers (Fig. 5B). Similar to all diatom datasets, most of the species monitored in eleven countries are represented by both markers (70%), with additional species barcodes for rbcL (20%). For species monitored by fewer countries, the coverage is considerably lower (below 20% for species in ≤ 4 countries).

Of the 2,236 freshwater phytobenthos species monitored in France, 553 were scored as abundant. In this subset, the barcode coverage was 33%, considerably higher than the 15% for all species. The proportion of species with both rbcL and 18S sequenced was 20%, compared to 10% for all species (Fig. 4). A similar picture was evident for the marine diatoms from Croatia. Of the 100 most frequently observed marine phytoplankton species (including diatoms, dinoflagellates, silicoflagellates and coccolithophorids), 32 were diatoms. Of these 32 species, 50% had at least one barcode available, compared to 36% in the total dataset of 729 species.
The proportion of species with both barcodes was 34%, compared to 25% for all species (Fig. 4).

In sum, rbcL is the best represented DNA barcode marker for vascular plants, with 75% of the species having publicly deposited sequences and 66% of the species having BOLD public data (Fig. 6). Sixty-six percent of the species have publicly deposited barcodes for matK, with only 46% of the species having sequences deposited in BOLD public.

[...] Poland and Switzerland, Fig. 7B). A higher and more homogeneous coverage was found for rbcL (67-90%; Fig. 7C) than for matK (0-74%; Fig. 7D), both for BOLD public and GenBank data (rbcL: 71-100%; matK: 50-87%; Supp. Fig. 2). Two species were monitored in twelve countries (Alisma lanceolatum and A. plantago-aquatica) and approximately one fifth of the species in more than four countries (Fig. 7E, F). The barcode coverage of these species was 100% when public and private data were taken into account. It decreased slightly for species monitored in four or fewer countries. Nevertheless, more than 40% of the 330 species monitored in a single country had rbcL and matK data deposited publicly in BOLD, and 73% had associated sequences when private BOLD and GenBank data were included.

Among all taxonomic groups considered in the analysis, the three insect orders Odonata, Trichoptera and Hemiptera, along with crustaceans, are best covered, with ≥80% of species barcoded in each taxonomic group. The groups with the least coverage are flatworms (less than 5%), followed by annelids, molluscs and certain insect orders, such as Diptera and Ephemeroptera, in which less than 60% of listed species are represented by at least one barcode (Fig. 8).
Only in Hemiptera are more than 80% of the species represented by at least five barcodes; apart from Odonata, Trichoptera, Coleoptera and Crustacea, less than 50% of the species are covered in the other macroinvertebrate groups. For some groups, such as molluscs, annelids and crustaceans, a substantial share of the available reference sequences is not deposited in BOLD, but present in GenBank (Fig. 8). The monitoring-relevant insect taxon with the lowest coverage on BOLD is Diptera (ca. 60% of the 2,108 species in the list). Hemiptera, with 76 species listed and ca. 92% already barcoded, will probably be the first group to have full coverage in the near future.

Insects are used for monitoring ecological status in 29 out of the 30 surveyed countries. All national monitoring checklists combined comprised 3,619 insect species (Supplement 2, Fig. 9D). However, the taxonomic resolution used differed substantially between countries. Seven countries exclusively assess taxonomic groups above species level, two countries only above genus level, and five countries only above family level (Supplement 1). Assessed taxa per country range from 10 (Albania) to 2,903 (Czech Republic, Fig. 10). In total, eleven insect orders are monitored, ranging from orders with only one relevant species (Hymenoptera) to orders with 2,108 species (Diptera, Fig. 8). The ten species monitored in the most countries all belong to Ephemeroptera, with Ephemera danica and Serratella ignita being the most frequently listed species (20 countries each).

Arachnids
A large proportion of the arachnid species records in BOLD is private (Fig. 8). The coverage of the 211 species reported from all countries in total is moderate, with 65% of the species represented by at least one barcode. It is remarkable that 201 of the 211 arachnid species are monitored by the Netherlands (Fig. 9B), and that 200 of these are monitored solely in this country. The spider Argyroneta aquatica, which is monitored by the most countries (7), has only private reference barcodes in BOLD, and five sequences in GenBank.

Crustaceans
A total of 193 crustacean species are included in the national checklists; 22 of the 30 surveyed countries monitor one or more crustacean species (Fig. 11). They represent four [...]

In general, the barcode coverage (including GenBank data) per country is good and relatively evenly distributed, from 70% to 100% of species barcoded in each country (Fig. 11D). These values drop sharply when only the public BOLD data are taken into account (Fig. 11B). In countries such as Italy and Ireland, not even 10% of the species are covered, while only in Germany, the UK, the Netherlands and Norway does the coverage approach 50% of the species monitored in each of these countries.

In total, 257 species of annelids are used in freshwater biomonitoring in the 21 countries that supplied lists (Fig. 12). They represent two classes: Clitellata, with the subclasses Oligochaeta, Hirudinea (leeches) and Branchiobdellida, and Polychaeta, with the subclass Sedentaria. Among them, three species of leeches, Erpobdella octoculata, Glossiphonia complanata and Helobdella stagnalis, are monitored in 20 countries (Fig. 9A).

Country-wise, the barcode coverage (including GenBank data) extends from ca. 50% of species barcoded in the Czech Republic and Slovakia to 100% in Norway (Fig. 12D). When only public BOLD records are considered, the barcode coverage per country drops to 20%-40% (Fig. 12B).

The national checklists of freshwater molluscs contain a total of 161 species, ranging from one (Cyprus) to 77 (Czech Republic) species per country (Fig. 13). Ancylus fluviatilis, the most commonly surveyed species, is included in 20 national checklists, while a total of 67 species are considered by a single checklist only (22 of them in Georgia) (Fig. 9D). The total barcode coverage of freshwater molluscs (about 60%) was in the range of most freshwater invertebrate groups (Fig. 8). While the proportion of species with public barcodes deposited in BOLD was relatively low (only 15%), the proportion of species with sequences derived only from GenBank was considerable (24%). A similar pattern was evident when a minimum coverage of five barcodes was used (Fig. 8B). Here, 41% of the species met the criterion when all public and private data were considered, 10% of the species were covered in the BOLD public database, while 21% of the species only had sufficient barcodes if GenBank data were considered together with data from BOLD.

[...] have barcode data (23%). The barcode coverage per country was relatively evenly distributed, with an average coverage of 23% (min: 0%, Cyprus; max: 38%, Italy) when public barcodes in BOLD were considered, 56% (min: 0%, Cyprus; max: 76%, Finland) when public and private data on BOLD were used, and 76% (min: 0%, Cyprus; max: 94%, Finland) for the full BOLD and GenBank datasets (Fig. 13).
completely lacking DNA barcode references in BOLD (coverage: 82.5%, Fig. 14A). When setting the threshold for the minimum number of DNA barcodes available to five, 212 species had fewer than five barcodes, or none, deposited in the database (Fig. 14B). After manually checking the resulting gap list and taking into account real synonyms and different taxonomic concepts such as generic assignments (e.g., Iberocypris vs. Squalius, Orsinigobius vs. wild species (Fig. 15B), only a few species are missing, the highest number of them (6) reported from Switzerland.

3.6 Reverse taxonomy

Documented use of reverse taxonomy was observed in all groups of freshwater macroinvertebrates where public data were available, except for Neuroptera (Fig. 16, Supplement 3). The proportion of identified sequences originating from reverse taxonomy compared to all available barcodes ranged from 1% (Crustacea, Ephemeroptera, Hemiptera, Lepidoptera and Odonata) to 20% (Coleoptera) and 59% (Diptera). Since these values rely on the cumulative number of BOLD-public, BOLD-private and GenBank data, and since the use of reverse taxonomy is known only from public sequences in BOLD, the calculated proportions may be underestimates. For instance, when only public data in BOLD are considered, reverse taxonomy can be found in up to 61% (Annelida) and 82% (Diptera) of the deposited sequences. The fraction of species with barcodes originating from reverse taxonomy ranged from 3% (Arachnida, Coleoptera and Ephemeroptera) to 16% (Diptera) and 20% (Megaloptera). Although the proportion of species for which reverse taxonomy has a potentially strong influence was low for most taxonomic groups, it was comparatively high for Diptera (12%) and Megaloptera

Commission, 2018). The taxonomic depth of data required for calculation of these indices is highly variable between countries.
In cases where indices are dependent on species-specific traits, all species counts and complete species-level identification are required. Thus, the checklists of species from each country that we received and have used as the basis for our gap-analysis can be grouped into four major types.

The first group contains 'full national lists of species'. Such lists are typically generated from the Pan-European species lists, or compiled individually from literature. Some countries (e.g. Czech Republic) use these complete lists as the basis for their WFD monitoring, even if many taxa are not regularly encountered. The second group includes lists from countries that use the national taxa lists as a basis, but narrow down the selection based on experience or challenges with identification to species level. In Hungary, for example, only species that were previously recorded during WFD monitoring are used. Other countries limit the identification of selected groups to family or genus level, or completely discard semi-aquatic taxa or taxa that are non-aquatic but closely connected to aquatic environments (e.g. Carabidae, Chrysomelidae or Curculionidae beetles). These restrictions have been taken into account in index development. In the third group, it is common to regularly monitor the frequency/occurrence of certain 'highly indicative' species/taxa, and use only these species in the calculation of MMIs. Thus, a highly restricted 'operational taxon list' for WFD monitoring is compiled. Such a list can be extensive or quite short, depending on the country.

Slovakia (Figs 10-13).

For marine benthic macroinvertebrates in the AMBI list, the three most species-rich phyla, Annelida, Mollusca and Arthropoda (ca.
85% of the total species in the list), have moderate levels of completion (40% to 50%), while less represented groups such as Nemertea, Sipuncula and Echinodermata have completion levels of at least 65%. Within the ERMS list, the levels of completion were lower than those of the AMBI list, but followed similar trends, with the exception of the nemerteans. The Annelida, Mollusca and Arthropoda, which accounted for ca. 77% of the species in the ERMS list, have fair levels of completion (20% to 30%), lower than those of less diverse groups in the list, such as Echinodermata (35%) and Sipuncula (42%).

Our results suggest that many of the barcode studies focused on Annelida, Mollusca and Arthropoda may have targeted particular species or groups at the order or family level (e.g. in the former, and ca. 50% in the latter. For a larger group such as the superorder Peracarida, which comprises 649 species in the AMBI list and 2,643 species in the ERMS list, the total number of barcoded species is far from complete (45% and 24%, respectively).

In addition to the globally modest levels of completion for marine macroinvertebrates, the gap-analyses based on the AMBI checklist also reveal some insufficiencies of the available data, namely the presence of a sizeable proportion of private records, which are unavailable for full access in bioassessment studies employing DNA-based tools. For some groups, the proportion of private records on BOLD was even higher than that of public records, such as for Sipuncula (25% versus 10%) and Annelida (20% versus 18%).
An ISI Web of Science search at the time of writing (30th November 2018), with the search terms "barcoding" AND "marine" AND "the taxonomic group of interest", also supports the absence of published reference libraries for Sipuncula, and the low number of studies found for Annelida, compared to other above-mentioned groups (e.g. fish and Crustacea). Another aspect worthy of consideration is the number of singletons in the reference libraries. Although the percentage of singletons is generally low, some taxa have a considerable proportion of species represented by a single record. Whereas relatively low levels of barcode coverage for some of these groups clearly reflect fewer efforts to barcode those taxa, a considerable proportion of the gap must also be ascribed to failed DNA sequencing, due to primer mismatch, sample contamination or PCR inhibitors. This is particularly obvious for the marine Annelida, for which COI sequencing success rates may be as low as 40-50% on average (Kongsrud et al., 2017). Barcoding of annelids has also revealed unexpectedly high levels of genetic diversity, prompting traditional species taxa to be split (Nygren,

Such representation is also key for efficient quality assurance, quality control and validation of reference libraries, as discussed below.

Within the AMBI list, almost half of the species fall into ecological group I, which comprises the "sensitive" species, and the remaining half is distributed among the other 5 ecological groups. However, the completion levels were higher for species from ecological groups III (56%) and V (52%) and lower for species that do not have any ecological group assigned (38%). Similar results were encountered in the first attempt to use a genetics-based marine biotic index (gAMBI), with available GenBank sequences for AMBI species (Aylagas et al., 2014).
At the time, the authors concluded that the available genetic data were not sufficient or did not fulfil the requirements for a reliable AMBI calculation, which needs an even distribution of taxa across the disturbance gradient. On the other hand, when gAMBI values were calculated by using the most frequent species within each ecological group, the reliability of AMBI values increased significantly (Aylagas et al., 2014). Nevertheless, in the current study we have found a much higher completion level (e.g. 48% versus 14%), since numerous new records have been generated in the meantime and our gap-analyses also included BOLD data.

library must be as comprehensive as possible in order to assign a high proportion of environmental sequences to known taxa, and it requires regular expert curation in order to maintain quality. This is why experts from several countries joined efforts to curate a single reference library, Diat.barcode (formerly called R-Syst::diatom). Our results show that a large majority of the most common species (registered in the checklists of all countries) are present in this library, but that many rare species lack representation.

A comprehensive barcode reference library for diatoms is difficult to achieve for two reasons. Firstly, more than 100,000 species are estimated to exist globally (Mann and Vanormelingen, 2013), many of which are undescribed. Registration of barcodes and metadata of all these species in the reference library would require an enormous effort. Thus, effort should be focused on the most common, not yet barcoded species. Secondly, diatoms need to be isolated and cultured in order to obtain high quality, vouchered barcode records. This work is tedious and often unsuccessful because many species are difficult or impossible to cultivate.
As a remedy to this, an alternative method using high-throughput sequencing of environmental samples was proposed by Rimet et al. (2018b). By using this method routinely, we will be able to quickly complete the barcode reference library of the most common diatoms in the near future.
applications have moved from classical single specimen identifications to highly parallelized characterisations of communities via DNA metabarcoding (Leese et al., 2018). Given the often overwhelming quantity of 'big biodiversity data' and automated pipelines in those HTS approaches, data quality aspects of DNA barcode references gain an even higher relevance. Thus, some research communities, such as European diatom experts, have worked with the European Committee for Standardization to publish a methodology as a first step towards standardization of reference barcode libraries for diatoms (CEN, 2018).

In principle, two quality components can be distinguished: Quality assurance (QA) is process-orientated, providing and maintaining quality standards for DNA barcodes and reference libraries. Quality control (QC), on the other hand, is user-orientated, enabling the cross-validation of taxonomic assignments or flagging of doubtful barcodes. More generally speaking, QA and QC measures can be seen as internal (or preventive) and external (or reactive) curation of reference libraries, respectively (Fig. 17). The implementation of QA measures during reference library development is the first important step for sustainable data quality management. Linked to a valid taxonomy, formally correct barcode sequences are deposited along with (digital) voucher specimens and extensive metadata. The taxonomic backbone should be regularly updated with modifications being visible to the users. An open access and fully transparent reference library allowing for versioning of barcode collections and the possibility to track taxonomic changes can be seen as the gold standard here. Simultaneously, this will allow a more sophisticated QC by the barcoding community. Library entries can be flagged for contamination and the most recent taxonomic changes (i.e.
newly described species, integrative revisions) incorporated into the reference library taxonomic backbone more easily. A library which communicates with other ecological or geographic datasets and which provides access to the full data lifecycle from deposition to publication of data will further facilitate the integrative use of barcode datasets. Finally, the generation of custom reference libraries and their annotation with digital object identifiers (DOI) can account for transparency and the specific demands of the users.

QC. Special cases of mito-nuclear discordance, the number of already known MOTUs for a given Linnaean species name and 'extraordinary' barcodes such as those originating from type specimens should be additionally highlighted in the output results. All this combined information could be used to establish an evaluation system for metabarcode identifications,