COInr and mkCOInr: Building and customizing a nonredundant barcoding reference database from BOLD and NCBI using a semi‐automated pipeline

Reference databases with wide taxonomic coverage are greatly needed in many fields of biology, most particularly for the taxonomic assignment of metabarcoding sequences. Therefore, it is fundamental to be able to access and pool data from different primary databases. The COInr database is a freely available, easy‐to‐access database of COI reference sequences extracted from the BOLD and NCBI nucleotide databases. It is a comprehensive database: not limited to a taxon, a gene region or a taxonomic rank; therefore, it is a good starting point for creating custom databases. Sequences are dereplicated between databases and within taxa. Each taxon has a unique taxonomic identifier (taxID), fundamental to avoid ambiguous associations of homonyms and synonyms in the source database. TaxIDs form a coherent hierarchical system fully compatible with the NCBI taxIDs, allowing their full or ranked lineages to be created. The mkcoinr tool is a series of Perl scripts designed to download sequences from BOLD and NCBI, to build the COInr database and to customize it according to the users’ needs. It is possible to select or eliminate sequences for a list of taxa, select a specific gene region, select for minimum taxonomic resolution, add new custom sequences, and format the database for blast, vtam, qiime and rdp classifier. This is a semi‐automated pipeline using command lines in a Linux environment. The COInr database can be downloaded from https://doi.org/10.5281/zenodo.6555985, and mkcoinr and its full documentation are available at https://github.com/meglecz/mkCOInr.


| INTRODUCTION
Reference databases of particular genes or markers are used for DNA-based identification and thus have various applications in mitogenomics (Ho & Gilbert, 2010), metagenomics (Santamaria et al., 2012), phylogenetics (Khater et al., 2021; Slater-Baker et al., 2022; Vijapure et al., 2019), identification of unknown barcodes (Hebert, Cywinska, et al., 2003; Hebert, Ratnasingham, & deWaard, 2003) and most particularly for metabarcoding. The use of metabarcoding has increased dramatically in the past decade due to technological advances, and the continuous reduction in sequencing costs has made it accessible for a wide range of studies (Slatko et al., 2018). Metabarcoding is applied mainly for biodiversity assessment, but it can be used in other fields such as studying interaction networks or understanding animal diets (Compson et al., 2020). One of the difficulties of metabarcoding lies in the taxonomic assignment of sequences and the completeness of the underlying reference databases. Methods of taxonomic assignment can be alignment-based, relying on sequence similarities detected by blast (Altschul et al., 1997) or vsearch (Rognes et al., 2016) implemented in different software (Bokulich et al., 2018; Huson et al., 2007) or based on machine learning (Murali et al., 2018; Pedregosa et al., 2011; Wang et al., 2007). However, for all methods, the quality of the reference database is crucial (Hleap et al., 2021). Many methods are sensitive to gaps in the taxonomic coverage of the reference database (Hleap et al., 2021), and thus the creation of a reference database with the best coverage available is greatly needed.
Several different markers can be used for metabarcoding, since each of them is subject to different taxonomic biases and provides different taxonomic resolution (Ruppert et al., 2019). The most widespread markers are the ribosomal RNA (rRNA) genes (18S, 28S, 16S), the cytochrome c oxidase subunit I (COI) gene and internal transcribed spacer sequences (ITS) (Creer et al., 2016; Porter & Hajibabaei, 2020). rRNA genes allow amplification from a wide range of taxa, and are the most widely used markers for microorganisms (Creer et al., 2016). The choice of the ideal marker is more difficult when dealing with Eukaryotes. Plant and fungal studies most often use ITS markers or rbcL, since the COI gene often contains indels of variable size and location and is not sufficiently variable in these groups. In addition, the taxonomic resolution of plant and fungal rRNA genes is relatively low (Bruns et al., 1991; Yao et al., 2010).
For animals, the use of both rRNA genes and COI sequences is widespread (Creer et al., 2016). The COI marker has been proposed as the marker of choice for animals and it is one of the most widely sequenced genes (Porter & Hajibabaei, 2018b), since it is the main marker of the Barcode of Life database (Hebert, Cywinska, et al., 2003; Hebert, Ratnasingham, & deWaard, 2003). Although it has become clear that the COI gene or any of its fragments is not sufficient to differentiate species in some groups, most particularly Diptera (Meier et al., 2006; Roe & Sperling, 2007; Rubinoff et al., 2006), it is still a frequently used marker, mostly because more animal taxa have been barcoded with COI than with any other marker (Andújar et al., 2018).
Regularly updated, curated and marker-specific databases are available for ITS, such as UNITE (Rolf Henrik Nilsson et al., 2019) and PLANTiTS (Banchi et al., 2020), and for rRNA genes, such as Greengenes (DeSantis et al., 2006) and SILVA (Pruesse et al., 2007). Conversely, COI sequences are deposited in two different major databases, which are not COI-specific: (i) the nucleotide database of NCBI (hereafter NCBI-nt database; Sayers et al., 2022) and its European (ENA) and Japanese (DDBJ) equivalents are generalist databases that do not focus on a taxon or a gene; and (ii) the Barcode of Life Data System (BOLD; Ratnasingham & Hebert, 2007) contains barcoding sequences of several markers, but most of the sequences are from the barcoding fragment of the COI gene. Although the overlap of data between these databases is considerable, each of them has sequences that are not found in the other database. Therefore, creating a merged database with sequences from both sources is highly desirable. Most existing COI databases are sourced from NCBI-nt (e.g., Bengtsson-Palme et al., 2018; Curd et al., 2019; Keller et al., 2020; Richardson et al., 2020) and only a few of them combine sequences from BOLD and NCBI (Arranz et al., 2020; Balech et al., 2022; Macher et al., 2017; Porter & Hajibabaei, 2018a).
A major challenge of pooling sequences from different sources into a single database is to reconcile their taxonomic lineages. This step is not trivial due to the presence of homonyms (e.g., Plecoptera is both an insect order and a moth genus), synonyms and misspellings. Therefore, the only clean solution to deal with taxon names is the use of unique taxonomic identifiers (taxID) that are connected to a nonambiguous, hierarchical system and allow the identification of the lineage for each taxon. Both the NCBI-nt and the BOLD databases use taxIDs, but the two systems are independent of each other, and thus they cannot be simply merged. Finding the equivalent taxon names and taxIDs between the two databases calls for a careful comparison of taxon names and their lineages in order to match them. Occasional inconsistencies of taxonomic lineages between databases (e.g., the genus Vexillata is a nematode belonging to the family Ornithostrongylidae according to BOLD, but to the family Trichostrongylidae according to NCBI taxonomy) further complicate the pooling of taxonomic information into a single coherent system.
Merging of COI sequences from the NCBI-nt and BOLD has been attempted in different programs. bold_ncbi_merger (Macher et al., 2017) uses a simple method based on identical taxon names.
metacoxi (Balech et al., 2022) obtains NCBI taxIDs and taxonomic lineages based on ENA flat files, when available. However, when this information is not available (the sequence is present only in BOLD), NCBI taxIDs are determined by simply matching taxon names to the NCBI taxonomy, without checking for homonymy. Furthermore, taxon names not present in the NCBI taxonomy do not receive a taxID, and therefore the taxID system is incomplete.
A further difficulty of creating custom (local) databases is the download of sequences from the original sources. NCBI provides different means of accessing data: a whole database can be downloaded via ftp sites, and filtered subsequently, or Application Programming Interfaces (APIs) are provided for targeted downloads (Kans, 2021). On the other hand, BOLD Systems does not provide an easy way to download the whole public data set, and the use of BOLD APIs needs considerable optimization to be able to access large data sets. Although the bold R package (https://docs.ropensci.org/bold/) is available to download data from BOLD, it is subject to failure for large taxa and takes several hours or days, depending on the size of the requested data.
The mkcoinr tool was designed to create the COInr database, which includes all COI sequences from NCBI-nt and BOLD sequences, irrespective of the region of the gene covered and the taxonomic group.

MEGLÉCZ
All sequences have a taxID, and all taxIDs form a coherent system compatible with, but not limited to, the NCBI taxIDs, allowing the user to unambiguously obtain taxonomic lineages even for taxon names with homonyms. Sequence redundancy within taxa is eliminated to reduce database size, without losing information. This database is freely available and can be easily and quickly downloaded from Zenodo (https://zenodo.org/record/6555985; Meglécz, 2022a, 2022b), thus saving the most complicated and time-consuming steps of custom database creation. Users can customize the downloaded database using mkcoinr scripts and format it for use with their preferred taxonomic assignment tool. It is a semi-automated pipeline using command lines in a Linux environment. It is possible to add local sequences, select or eliminate sequences of a list of taxa, filter sequences for minimum taxonomic resolution, and choose a gene region. The COInr database is planned to be updated annually, but all scripts are available with detailed documentation to re-create it at any time or produce a different database by modifying some of the filtering options.

| MATERIAL AND METHODS
mkcoinr is a series of Perl scripts that can be executed from the command line and are thus easily integrated into other pipelines. The scripts were written for Linux OS and can run on MacOS or other Unix environments.

Mitochondrion[Filter])" -T -v --cds
This allowed the download of all coding DNA sequences (CDS) returned with the keyword search for COI, CO1, COXI or COX1, and CDS from complete mitochondrial genomes. The scope of this search was intentionally very wide, and the downloaded sequences were further filtered by the format_ncbi.pl script to (i) only retain CDS with gene and protein names corresponding to COI, and (ii) eliminate genes with introns and sequences from environmental or metagenomic samples. Sequences with more than five consecutive internal Ns or outside the length range of 100-2000 nucleotides were also eliminated. Open nomenclature was not accepted in taxon names. If the taxID did not correspond to a correct Latin name format, the smallest taxon with a correct Latin name in the lineage was chosen for the sequence (e.g., Acentrella sp. AMI 1, taxID: 888165, rank: species was replaced by Acentrella, taxID: 248176, rank: genus). Sequences were then subjected to taxonomically aware dereplication by the dereplicate.pl script. Within each taxID, all sequences that were a substring of another sequence were eliminated. This allows the size of the database to be reduced without losing information, while keeping intraspecific variability.
In the next step, a taxon under the smallest taxon with NCBI taxID was assigned an arbitrary, negative taxID, and the new taxID was integrated into the taxID system, with the NCBI taxID as a parent. Each newly created taxID was characterized by a taxon name, a taxonomic rank and the taxID of its direct parent, forming a hierarchical system. This hierarchical taxID system allows the creation of the lineage of any taxID unambiguously, even in the case of homonymy and synonymy. As for NCBI sequences, the filtered BOLD data set was dereplicated by the dereplicate.pl script.
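The taxonomically aware dereplication described above can be illustrated with a minimal Python sketch. It is not the dereplicate.pl implementation: the function name, the in-memory tuple layout and the toy records are assumptions made for illustration. Within each taxID, a sequence is dropped when it is a substring of another retained sequence (exact duplicates included), while identical sequences carried by different taxIDs are all kept.

```python
from collections import defaultdict

def dereplicate_within_taxa(records):
    """records: iterable of (seq_id, taxid, sequence) tuples (assumed layout).
    Returns the records kept after removing, within each taxID, every
    sequence that is a substring of another sequence of the same taxID."""
    by_taxid = defaultdict(list)
    for seq_id, taxid, seq in records:
        by_taxid[taxid].append((seq_id, seq.upper()))

    kept = []
    for taxid, seqs in by_taxid.items():
        # Longest first: a sequence can only be contained in one at least as long.
        seqs.sort(key=lambda item: len(item[1]), reverse=True)
        retained = []
        for seq_id, seq in seqs:
            if not any(seq in longer for _, longer in retained):
                retained.append((seq_id, seq))
        kept.extend((seq_id, taxid, seq) for seq_id, seq in retained)
    return kept

# Toy records; sequences are invented, and -1 stands for an arbitrary negative taxID.
records = [
    ("s1", 248176, "ACGTACGTAA"),
    ("s2", 248176, "CGTACG"),   # substring of s1 within the same taxID: removed
    ("s3", 248176, "ACGTTT"),   # not contained in any other sequence: kept
    ("s4", -1, "CGTACG"),       # same sequence as s2 but another taxID: kept
]
result = dereplicate_within_taxa(records)
print(sorted(seq_id for seq_id, _, _ in result))
```

This pairwise comparison is quadratic within each taxID, which is acceptable for a sketch but would need a faster strategy for millions of sequences.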

| BOLD
To compare the effect of using only correct Latin names (as in COInr) or accepting all taxon names present in the input databases, the above pipeline was run a second time using systematically the smallest taxon in each lineage, even if it did not correspond to a correct Latin name.

| The COInr database
The BOLD and NCBI data sets were pooled into one single data set by the pool_and_dereplicate.pl script, where sequences for the taxIDs shared by the two source databases were dereplicated, while sequences from taxIDs unique to one of the sources were simply added to the combined database. This database is a starting point to create more specific custom databases according to the users' needs.
The core database consists of two simple-to-parse tsv files (tab-separated values). The sequence file has three columns (sequenceIDs, taxIDs and sequences), and contains sequences of all taxonomic groups that can cover any COI region, with variable taxonomic resolution from species to phylum level. The taxonomy file contains taxIDs, scientific names, parent taxIDs, taxonomic rank and taxonomic level index. The taxonomic level index contains integers from 0 to 8, each corresponding to a major taxonomic level (rank) similar to those used in rdp classifier (Wang et al., 2007): root, superkingdom, kingdom, phylum, class, order, family, genus, species. Intermediate taxonomic levels have 0.5 added to the index of the major level directly above them (e.g., 7.5 for subgenus, which lies between genus, 7, and species, 8). This file allows the reconstruction of the complete lineages of all taxa or the ranked lineages containing only the major taxonomic ranks.
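As an illustration of how such a taxonomy table supports lineage reconstruction, the following Python sketch walks from a taxID to the root through parent taxIDs and then keeps only the major ranks (integer level index). The dictionary layout, the function names and the toy taxonomy are assumptions for illustration; apart from the Acentrella taxID quoted in the text, the identifiers are invented, and the root is assumed to be its own parent, as in the NCBI taxonomy dump.

```python
def build_lineage(taxid, taxonomy):
    """taxonomy: dict taxid -> (name, parent_taxid, rank, level_index).
    Returns a list of (taxid, name, rank, level_index) ordered from root to taxid."""
    lineage = []
    seen = set()
    current = taxid
    while current in taxonomy and current not in seen:
        seen.add(current)  # guards against cycles, including a self-parented root
        name, parent, rank, level = taxonomy[current]
        lineage.append((current, name, rank, level))
        current = parent
    return lineage[::-1]

def ranked_lineage(taxid, taxonomy):
    """Keep only the major taxonomic levels, i.e. those with an integer level index."""
    return [entry for entry in build_lineage(taxid, taxonomy)
            if float(entry[3]).is_integer()]

# Toy taxonomy: negative taxIDs mimic taxa absent from NCBI; all IDs except 248176 are invented.
taxonomy = {
    1: ("root", 1, "no rank", 0),
    10: ("Arthropoda", 1, "phylum", 3),
    20: ("Insecta", 10, "class", 4),
    30: ("Ephemeroptera", 20, "order", 5),
    40: ("Baetidae", 30, "family", 6),
    248176: ("Acentrella", 40, "genus", 7),
    -5: ("Toysubgenus", 248176, "subgenus", 7.5),
    -6: ("Acentrella toyensis", -5, "species", 8),
}

full = build_lineage(-6, taxonomy)
ranked = ranked_lineage(-6, taxonomy)
print([name for _, name, _, _ in ranked])
```

Because every step follows a taxID rather than a name, homonyms cannot be confused: two taxa with the same name simply have different taxIDs and different parents.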

| Customizing the COInr database
The COInr database can be modified according to users' needs. Sequences can be selected for a list of taxa or, by contrast, removed from the database through the select_taxa.pl script. The script will also produce a lineage and a taxID for each taxon in the taxon list, allowing users to check for potential errors due to homonyms. In case of incoherence, the taxon list enriched by the correct taxIDs can be used to rerun the script with more precise selection. The same script also allows sequences to be selected with a minimum taxonomic resolution.
FIGURE 1 Flowchart of mkcoinr. Double lines represent the different options for customizing the COInr database. These steps can also be consecutive
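The selection logic of this customizing step can be sketched in Python as follows. This is a simplified illustration and not the select_taxa.pl code: the function names, the `keep` and `min_level` parameters and the toy identifiers are hypothetical. A sequence belongs to a target taxon if the target taxID appears among the ancestors of its own taxID, and the minimum taxonomic resolution is expressed with the taxonomic level index (e.g., 7 for genus).

```python
def ancestors(taxid, parent_of):
    """Yield taxid and then each of its ancestors; parent_of maps taxid -> parent taxid."""
    seen = set()
    while taxid in parent_of and taxid not in seen:
        seen.add(taxid)
        yield taxid
        taxid = parent_of[taxid]

def select_sequences(records, parent_of, level_of, target_taxids,
                     keep=True, min_level=0):
    """records: iterable of (seq_id, taxid, sequence).
    keep=True selects sequences of the target taxa; keep=False removes them.
    min_level: lowest accepted taxonomic level index of the sequence's taxID."""
    targets = set(target_taxids)
    selected = []
    for seq_id, taxid, seq in records:
        in_target = any(t in targets for t in ancestors(taxid, parent_of))
        if in_target != keep:
            continue
        if level_of.get(taxid, 0) < min_level:
            continue
        selected.append((seq_id, taxid, seq))
    return selected

# Toy taxonomy with invented taxIDs: 10 and 40 are orders, 20 a genus, 30 and 50 species.
parent_of = {1: 1, 10: 1, 20: 10, 30: 20, 40: 1, 50: 40}
level_of = {1: 0, 10: 5, 20: 7, 30: 8, 40: 5, 50: 8}
records = [("a", 30, "ACGT"), ("b", 10, "ACGT"), ("c", 50, "ACGT")]

in_order_10 = select_sequences(records, parent_of, level_of, [10], keep=True, min_level=7)
without_order_10 = select_sequences(records, parent_of, level_of, [10], keep=False)
print(in_order_10, without_order_10)
```

Working with taxIDs instead of names is what makes the selection robust to homonyms; the name-to-taxID check described in the text happens before this step.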
The select_region.pl script trims the sequences to a specific region of the COI gene. Using the usearch_global command of vsearch (Rognes et al., 2016), sequences of the database are aligned to a small bait file, which contains a taxonomically diverse pool of sequences already trimmed to the target region. The usearch_global program is similar to blast since it aligns each query sequence (sequences of the COInr) to the sequences of its database (the bait file in this case). Contrary to blast, usearch_global produces global alignments. The best alignment of each query sequence is used to trim the query sequence according to the alignment positions. The bait file can be provided by the users or can be produced by the same script by making an E-PCR on the core database. The E-PCR (electronic PCR) uses cutadapt (Martin, 2011) to select for a particular subregion of COI using a pair of primer sequences.
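The trimming step that follows the alignment can be sketched in Python as below. The sketch does not run vsearch and does not reproduce select_region.pl; it assumes that alignment results are already available as tuples holding the query ID, the identity as a fraction, and 1-based inclusive start and end positions of the aligned part of the query. The function name, the tuple layout and the `min_identity` parameter are assumptions for illustration.

```python
def trim_to_hits(sequences, hits, min_identity=0.7):
    """sequences: dict seq_id -> full sequence.
    hits: iterable of (query_id, identity, q_start, q_end), with identity as a
    fraction and q_start/q_end as 1-based inclusive query coordinates.
    Returns dict seq_id -> sequence trimmed to its best-scoring hit."""
    best = {}
    for query_id, identity, q_start, q_end in hits:
        if identity < min_identity:
            continue  # hit below the identity threshold: ignored
        if query_id not in best or identity > best[query_id][0]:
            best[query_id] = (identity, q_start, q_end)
    return {
        query_id: sequences[query_id][q_start - 1:q_end]
        for query_id, (_, q_start, q_end) in best.items()
    }

# Toy data: invented sequences and alignment coordinates.
sequences = {"q1": "AAACCCGGGTTT", "q2": "ACGTACGT"}
hits = [
    ("q1", 0.92, 4, 9),   # best hit for q1: positions 4-9 are kept
    ("q1", 0.80, 1, 6),   # weaker hit for q1: ignored
    ("q2", 0.55, 1, 8),   # below threshold: q2 is not retained
]
trimmed = trim_to_hits(sequences, hits)
print(trimmed)
```

Sequences without any hit above the threshold are absent from the output, which is how nontarget regions and unrelated sequences drop out of the trimmed database.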
The COInr database can also be supplemented with custom sequences. Users will need a taxon name and sequenceID for each custom sequence. The format_custom.pl script will produce a lineage file with all input taxa, which should be checked, and if necessary corrected and completed by the users. The add_taxids.pl script will add taxIDs to each lineage and complete the input taxonomy file (part of the COInr database). Sequences should then be dereplicated by the dereplicate.pl script and added to the COInr database using the pool_and_dereplicate.pl script. Figure 1 represents the customizing options of mkcoinr, each of them starting from the COInr database. However, the different steps can also be successive to produce a final database. For example, it is possible to start by selecting sequences for a list of taxa, then adding custom sequences to the newly created database, which in turn can be trimmed to the target region.
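How a taxon that is absent from the existing taxonomy can receive an identifier may be sketched in Python as follows. This is an illustrative simplification, not the add_taxids.pl algorithm: the function name, the rule of matching a taxon by its name together with its parent taxID, and the choice of the next free negative integer are assumptions. The toy taxonomy attaches the genus directly to the root for brevity.

```python
def add_taxon(taxonomy, name, rank, level, parent_taxid):
    """taxonomy: dict taxid -> (name, parent_taxid, rank, level_index); modified in place.
    Returns the taxID of the taxon, reusing an existing entry with the same name
    under the same parent, otherwise creating a new negative taxID."""
    for taxid, (known_name, known_parent, _, _) in taxonomy.items():
        if known_name == name and known_parent == parent_taxid:
            return taxid
    # New identifiers are negative, so they never collide with positive NCBI taxIDs.
    new_taxid = min(min(taxonomy), 0) - 1
    taxonomy[new_taxid] = (name, parent_taxid, rank, level)
    return new_taxid

# Toy taxonomy; species names are invented.
taxonomy = {
    1: ("root", 1, "no rank", 0),
    248176: ("Acentrella", 1, "genus", 7),
}
first = add_taxon(taxonomy, "Acentrella toyensis", "species", 8, 248176)
again = add_taxon(taxonomy, "Acentrella toyensis", "species", 8, 248176)
other = add_taxon(taxonomy, "Acentrella altera", "species", 8, 248176)
print(first, again, other)
```

Matching on name and parent together, rather than on the name alone, is one simple way to keep homonyms apart; the lineage file checked by the user supplies the parent information.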

| Format database
The very simple format of the database (sequence file and taxonomy file both in tsv format) allows users to easily obtain a database in their desired format. The format_db.pl script can produce databases ready to use for blast (Altschul et al., 1997), vtam (González et al., 2020), rdp classifier (Wang et al., 2007) and qiime (Bolyen et al., 2019). The "full" option will produce a single tsv file with sequence IDs, ranked lineages, taxIDs and the sequences, allowing users to parse it and produce basic statistics on the database content (e.g., number of sequences of each taxon).

| Benchmarking the select_region script
Identifying and trimming sequences to a target region is one of the most difficult steps in customizing the COInr database. It can produce false positives and false negatives if the search parameters are not set properly. The detailed protocol of the benchmarking and the associated scripts are found in the github repository of mkcoinr (https://github.com/meglecz/mkCOInr). Briefly, a positive test data set was produced by downloading CDS of all complete mitochondrial genomes from NCBI-nt and the whole COI gene sequences were identified by the format_ncbi.pl script. Sequences shorter than 1100 bp were filtered out to avoid using erroneously annotated incomplete sequences. Two negative data sets were also produced.
The negative-mito data set was derived from the above downloaded mitogenomes. After filtering out COI genes and selecting genes with length between 700 and 2000 bp, sequences were randomly selected to match the size of the positive data set. The negative-chloroplast data set was produced similarly, but from the complete chloroplast genomes downloaded from NCBI-nt.
First, the E-PCR option of the select_region.pl script was tested on the three test data sets. Using this option, the bait file is produced for the usearch_global step.
To test how variable a bait data set should be if it is provided by the user without using the E-PCR option, bait files were produced of varying diversity. The trimmed sequences of the most reliable E-PCR-based trimming of the positive data set were used to produce the baits. I randomly sampled one or five sequences per phylum, class or order and each random sampling was repeated 10 times.
Each of the resulting 60 bait files was used to trim the positive and the two negative test data sets by using two identity thresholds (0.6 and 0.7).  (Table 3).
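The sampling design (one or five sequences per group, three taxonomic ranks, 10 replicates, hence 2 × 3 × 10 = 60 bait files) can be sketched in Python as follows. The function name, the data layout, the seeding scheme and the toy data are assumptions for illustration; the actual benchmarking scripts are in the mkcoinr repository.

```python
import random
from collections import defaultdict

def sample_baits(trimmed, group_of, n_per_group, seed):
    """trimmed: dict seq_id -> trimmed sequence.
    group_of: dict seq_id -> group label (name of its phylum, class or order).
    Returns dict seq_id -> sequence with up to n_per_group random picks per group."""
    rng = random.Random(seed)  # seeded so that each replicate is reproducible
    groups = defaultdict(list)
    for seq_id in sorted(trimmed):
        groups[group_of[seq_id]].append(seq_id)
    baits = {}
    for label in sorted(groups):
        members = groups[label]
        for seq_id in rng.sample(members, min(n_per_group, len(members))):
            baits[seq_id] = trimmed[seq_id]
    return baits

# Toy data: 30 invented sequences spread over 2 phyla, 3 classes and 6 orders.
seq_ids = [f"seq{i:02d}" for i in range(30)]
trimmed = {seq_id: "ACGT" * 5 for seq_id in seq_ids}
groups_by_rank = {
    "phylum": {s: f"phylum{i % 2}" for i, s in enumerate(seq_ids)},
    "class": {s: f"class{i % 3}" for i, s in enumerate(seq_ids)},
    "order": {s: f"order{i % 6}" for i, s in enumerate(seq_ids)},
}

bait_sets = {
    (rank, n, rep): sample_baits(trimmed, groups_by_rank[rank], n, seed=rep)
    for rank in ("phylum", "class", "order")
    for n in (1, 5)
    for rep in range(10)
}
print(len(bait_sets))
```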

| RESULTS
The results of benchmarking the select_region.pl command using the E-PCR option is summarized in Table   with decreasing identity thresholds. When selecting one random sequence for each phylum, class or order, sensitivity increased steadily, reaching 98% and 99% for identity thresholds of 70% and 60%, respectively. Increasing the number of sequences per taxon also increased the sensitivity.
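For reference, the sensitivity reported here can be computed from the counts of trimmed and untrimmed sequences using the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The Python sketch below uses hypothetical counts that are not results of the benchmark; the function names are assumptions.

```python
def sensitivity(tp, fn):
    """Fraction of positive-set sequences that were trimmed (true positive rate)."""
    total = tp + fn
    return tp / total if total else float("nan")

def specificity(tn, fp):
    """Fraction of negative-set sequences that were left untrimmed (true negative rate)."""
    total = tn + fp
    return tn / total if total else float("nan")

# Hypothetical counts for illustration only, not values from the benchmark.
counts = {"TP": 950, "FN": 50, "FP": 3, "TN": 997}
sens = sensitivity(counts["TP"], counts["FN"])
spec = specificity(counts["TN"], counts["FP"])
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```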

| DISCUSSION
The need for high-quality databases can be measured by the number is designed to build an auto-curated database of Metazoan COI sequences. All the above-mentioned databases and tools are based exclusively on NCBI databases or on a data set already containing a coherent system of lineages. Several COI-specific databases containing sequences from NCBI databases and BOLD have also been

TABLE 1 The number of taxa and COI sequences of the input databases (NCBI-nt, BOLD), and in the COInr database (May 2022). COInr is the result of pooling and taxonomically aware dereplication of sequences in the input databases

TABLE 2 The number of taxa and sequences by phylum

and difficult steps. metacoxi is a COI database including metazoan sequences in an easy-to-parse format. However, no tools are provided at present for the creation of custom databases, and therefore basic programming skills are necessary to obtain a ready-to-use custom database from it. This is relatively easy for some tasks, such as selection of sequences of a taxon, but needs considerable effort to select a specific gene region, or to format the database for rdp classifier (Wang et al., 2007) or qiime2 (Bolyen et al., 2019). COInr and mkcoinr fill a gap by providing both a comprehensive, easy-to-access database and a versatile tool to customize it.

TABLE 3 Comparison of the number of sequences and taxIDs when accepting all taxon names or using only formal Latin names
Note: In the positive data set, trimmed sequences are true positives (TP), and untrimmed sequences are false negatives (FN). In the negative data sets, trimmed sequences are false positives (FP), and untrimmed sequences are true negatives (TN).

TABLE 4 Percentage of true and false positives after running the select_region script using the E-PCR option with different parameter settings

| Use of accepted Latin names
Both BOLD and NCBI contain a large number of taxon names at a species level, with unique taxIDs, which do not correspond to the binomial nomenclature. In most cases they correspond to taxon names of a higher level completed by an identifier or simply by "sp." In principle, they could be proxies of species, but they in fact reflect a lack of information. This phenomenon is particularly pronounced in NCBI, where the total number of taxa including all names is more than three times higher than the number of distinct Latin names. For example, many genus names in NCBI are completed by the sampleID of BOLD and used as species names (e.g., Platynothrus sp. BIOUG14078-H10). The utility of nonstandard taxon names is questionable for most metabarcoding applications. When accepting all names as they appear in the input database, a high proportion of the COI sequences are shared between taxa, and most importantly a high proportion of taxa contain only sequences that are identical to sequences of other taxa. Therefore, keeping artificial identifiers as species names uselessly inflates the number of taxa and thus hinders efficient, taxonomically aware reduction of redundancy: these names do not necessarily correspond to species, they are uninformative for most users, and in many cases their sequences cannot be distinguished from sequences of other taxa. The COInr database uses only taxa with correct Latin name format. To avoid the loss of sequences, sequences with incorrect taxon names are attributed to the lowest taxon in the lineage with a Latin name. Therefore, sequences are kept in the database, with a conservative level of taxonomic information, resulting in a more efficient dereplication, and thus a smaller database without the loss of crucial information. This particularity should be kept in mind when comparing the number of taxa to other databases that do not follow this strategy.
However, for users who wish to include nonstandard names, the pipeline can be re-run with the check_name option deactivated, thus keeping all taxon names as they appear in the source database.
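The fallback to the lowest taxon with a Latin name can be illustrated with the Python sketch below. The regular expressions and the list of open nomenclature terms are assumptions chosen for the example and are not the rules implemented in mkcoinr; ranks below species are ignored for brevity. The example lineage reuses the Acentrella names and taxIDs quoted in the text.

```python
import re

# Assumed name formats: one capitalized word for higher taxa, a binomial for species.
LATIN_HIGHER = re.compile(r"^[A-Z][a-z-]+$")
LATIN_SPECIES = re.compile(r"^[A-Z][a-z-]+ [a-z-]+$")
# Assumed, non-exhaustive list of open nomenclature terms.
OPEN_NOMENCLATURE = {"sp", "spp", "cf", "aff", "nr", "gen", "nov"}

def is_latin_name(name, rank):
    """True if the name matches the assumed Latin format for its rank."""
    pattern = LATIN_SPECIES if rank == "species" else LATIN_HIGHER
    if not pattern.match(name):
        return False
    return not any(word in OPEN_NOMENCLATURE for word in name.split())

def lowest_latin_taxon(lineage):
    """lineage: list of (taxid, name, rank) ordered from higher to lower taxa.
    Returns the lowest entry with an acceptable Latin name, or None."""
    for taxid, name, rank in reversed(lineage):
        if is_latin_name(name, rank):
            return taxid, name, rank
    return None

lineage = [
    (248176, "Acentrella", "genus"),
    (888165, "Acentrella sp. AMI 1", "species"),
]
print(lowest_latin_taxon(lineage))
```

With this rule the sequence keeps a conservative assignment at genus level instead of being lost or kept under an uninformative species-level label.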

| Selecting the target region
The COInr database includes sequences that can cover any region of the COI gene. For taxonomic assignment methods based on sequence similarity (Clemente et al., 2011; Huson et al., 2007; Kahlke & Ralph, 2019; Wood & Salzberg, 2014) the database can be used as it is, because sequences of the nontarget region will not be returned by blast or other similarity searches. The only disadvantage would be the database size, which could be reduced, if needed, by selecting only the part of each sequence that covers the target region.
On the other hand, for taxonomic assignment based on sequence composition or phylogeny (Murali et al., 2018; Nguyen et al., 2014; Rosen et al., 2011; Wang et al., 2007), or for the use of the database for phylogenetic, mitogenomic or genomic studies it is preferable to trim sequences to the target region. This can be done using the mkcoinr tool. It is possible to select only full-length sequences covering the whole target region. However, this comes at the price of losing partial sequences, and thus some taxa. Therefore, mkcoinr can also select sequences that cover user-defined portions of the target region to increase taxonomic coverage.
Sensitivity for different bait sets and %identity thresholds
high sensitivity. However, in some taxa with introns, or considerable length variation of the COI gene, it is important to include bait sequences of taxa with COI of atypical length. In this case, the E-PCR option can help to capture the variability of the data set, but the parameters should be carefully chosen to find the balance between sensitivity and specificity.

| Selecting the target groups
Using a large database with a wide taxonomic scope is convenient for users analysing different data sets with a varied taxonomic origin, since the same database can be used and can give a good first approximation of taxonomic assignment of sequences. It can also be helpful to detect contaminant sequences that are not expected in the study (e.g., human sequences or model species studied in the same laboratory) or sequences outside of the target group of the study (e.g., bacteria, algae and fungi when focusing on animals). By using a generalist database, these sequences can be identified and eliminated. On the other hand, the presence of reference sequences from taxa not relevant to the study can also have disadvantages: the database is larger and therefore taxonomic assignment is slower with generalist databases. Moreover, sequences can be assigned to unexpected taxa if the taxonomic coverage of the target group is incomplete. This can be avoided with databases specific to the target group (Axtner et al., 2019; Mathon et al., 2021; Valentini et al., 2016).
For example, many sequences from marine samples can be erroneously assigned to insects when using a generalist database, because most marine groups are insufficiently covered in the reference databases (Mugnai et al., 2021), while an overwhelming majority of the sequences are from insects (73%). Therefore, the possibility to easily create custom databases specifically tailored to the users' needs is particularly important, and mkcoinr provides the necessary tools to make this selection.

| Selecting sequences with different taxonomic resolution
Another consideration when creating custom databases is whether to keep reference sequences with incomplete lineages. Most sequences of a reference database assigned to an insect order without further precision are likely to be useless, since most insect reference sequences are determined at least to the genus level, and the taxonomic coverage of this group is wide. By contrast, for less well-covered groups, especially if species or higher-level groups are difficult to identify morphologically (e.g., Nematoda, Rotifera), reference sequences with partial lineages are still informative.

| Database curation
Erroneously annotated sequences in the reference database can have serious consequences on taxonomic assignments.
Unfortunately, both the NCBI and BOLD databases contain mislabelled sequences (Bidartondo et al., 2008;Meiklejohn et al., 2019). This problem should be addressed from the source by a public sequence database that can incorporate a community-curated annotation and allows third parties to improve the annotations of sequences. In the BOLD database, the detection of taxonomic incoherencies is principally based on BINs (Barcoding Index Numbers) (Ratnasingham & Hebert, 2013), and applications such as bags (Fontes et al., 2021) allow us to automatically flag some of them.
However, the BIN system is based on the existence of a barcoding gap, which does not exist for all taxa (Meier et al., 2006;Roe & Sperling, 2007;Rubinoff et al., 2006). Therefore, human expertise with curation jams is still very much needed for taxonomic revision (Radulovici et al., 2021). On the other hand, errors in primary sequence data in NCBI-nt can only be corrected by the authors, which is inefficient and unsustainable.
In the field of mycology, considerable progress has been made to identify undescribed taxa using a Taxon Hypothesis (Kõljalg et al., 2020), to make a concerted effort to identify high-quality sequences and to re-annotate erroneous or insufficiently annotated public ITS sequences (R. Henrik Nilsson et al., 2014) and include the improved annotations in the UNITE database (Rolf Henrik Nilsson et al., 2019). This database can be used as a reference for automated curation of some other error types such as chimeras (R. Henrik Nilsson et al., 2015). A similar approach would also be desirable for Metazoa, especially for taxa that are difficult to distinguish morphologically.
Given the lack of sufficient curation effort of the source databases, ideally, a local database derived from them should be curated to identify incorrectly assigned sequences. Published semiautomatic methods aiming to curate databases are not applicable to large databases (millions of sequences), since either their run time would be prohibitive or they include a manual curation step (Collins et al., 2021; Kozlov et al., 2016; Rulik et al., 2017). The COInr database is too large to be able to run an automatic curation step, which should be kept in mind when using the full database. However, if a small custom database is created from COInr, this curation step becomes feasible and is strongly recommended.

| CONCLUSIONS
COInr is a comprehensive database of COI sequences, and its major aim is to serve as a reference database for barcoding and metabarcoding studies. It can be used for taxonomic assignments of COI sequences as it is, since it is not limited in its taxonomic scope, or to a particular region of the gene. It is also a good starting point to create local, custom databases, since it saves the most time-intensive and complicated steps of database creation: (i) downloading a large number of sequences, (ii) creation of a coherent taxID system to avoid ambiguity due to homonymy and synonymy, and (iii) sequence dereplication.
The mkcoinr package provides the necessary tools both to re-create a whole COInr database between the planned annual updates and to produce a custom database starting from COInr. The possibility of refining the taxonomic composition of the database, selecting the gene region and formatting the output to widely used database formats (blast, rdp, qiime) fills the need for an easy way of creating customized COI databases.

AUTHOR CONTRIBUTIONS
EM designed the research, wrote the scripts, analysed the data and wrote the manuscript.

ACKNOWLEDGEMENTS
I thank Francesco Mugnai for testing mkCOInr and making valuable comments on its use, documentation and the present paper, and Gabriel Nève for language editing.

CONFLICT OF INTEREST
The author declares that she has no competing interests.

OPEN RESEARCH BADGES
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://doi.org/10.5281/zenodo.6555985

DATA AVAILABILITY STATEMENT
Data Accessibility: The complete COI database can be downloaded from https://doi.org/10.5281/zenodo.6555985 (Meglécz, 2022).
All scripts are available at https://github.com/meglecz/mkCOInr, including full documentation, and they are also archived in Zenodo at https://doi.org/10.5281/zenodo.6961340 (Meglécz, 2022).
Benefits Generated: Benefits from this research accrue from the sharing of my data and results on public databases as described above.