Reliability and Utility of Standard Gene Sequence Barcodes for the Identification and Differentiation of Cyst Nematodes of the Genus Heterodera

Abstract Difficulties inherent in the morphological identification of cyst nematodes of the genus Heterodera Schmidt, 1871, an important lineage of plant parasites, has led to broad adoption of molecular methods for diagnosing and differentiating species. The pool of publicly available sequence data has grown significantly over the past few decades, and over half of all known species of Heterodera have been characterized using one or more molecular markers commonly employed in DNA barcoding (18S, internal transcribed spacer [ITS], 28S, coxI). But how reliable are these data and how useful are these four markers for differentiating species? We downloaded all 18S, ITS, 28S, and coxI gene sequences available on the National Center for Biotechnology Information (NCBI) database, GenBank, for all species of Heterodera for which data were available. Using a combination of sequence comparison and tree-based phylogenetic methods, we evaluated this dataset for erroneous or otherwise problematic sequences and examined the utility of each molecular marker for the delineation of species. Although we find the rate of obviously erroneous sequences to be low, all four molecular markers failed to differentiate between at least one species pair. Our results suggest that while a combination of multiple markers is best for species identification, the coxI marker shows the most utility for species differentiation and should be favored over 18S, ITS, and 28S, where resources are limited. Presently, less than half the valid species of Heterodera have a sequence of coxI available, and only a third have more than one sequence of this marker.

Diagnoses of important plant pests are increasingly reliant on molecular gene sequence data. This is especially the case for plant-parasitic nematodes, a group for which traditional taxonomic expertise is in decline (Coomans, 2002;Eyualem-Abebe et al., 2006;Oliveira et al., 2011). Collectively, plant-parasitic nematodes are estimated to cause up to 15% loss of the total global crop production, valued at over 100 billion USD annually (Koenning et al., 1999;Abad et al., 2008;Nicol et al., 2011;Singh et al., 2013Singh et al., , 2015Phani et al., 2021). The great diversity and species richness of plant-parasitic nematodes and their broad range of host associations and life cycle strategies (e.g., Procter, 1984;Boag and Yeates, 1998;Siddiqi, 2000;Oliveira et al., 2011;Palomares-Rius et al., 2014;Salas et al., 2022) mean that knowledge of species-level biology and life history is often needed for effective control strategies. Therefore, the ability to distinguish between closely related species is of critical importance.
The tylenchid family Heteroderidae Filipjev & Schuurmans Stekhoven, 1941, includes seven genera of "cyst nematodes," a monophyletic lineage of sedentary plant parasites united primarily by the form taken by adult females at the end of their life cycle, a hardened sac containing embryonated eggs (Luc et al., 1986;Baldwin, 1992;Subbotin et al., 2001Subbotin et al., , 2010aSubbotin et al., , 2010bBert et al., 2008). The largest genus, Heterodera Schmidt, 1871, comprises about 85 species, many of which are devastating pests of important crops, including cereals, legumes, vegetables, and a wide variety of other crops (Nicol et al., 2007;Subbotin et al., 2010aSubbotin et al., , 2010bToumi et al., 2013;Smiley et al., 2017). The taxonomy of Heterodera is complex and morphological identification can be difficult as adults are sexually dimorphic; differences between many species are subtle and multiple life cycle stages are often required for accurate species identification and delineation (Subbotin et al., 2010b). Furthermore, a few species cannot be distinguished from one another morphologically at any life cycle stage (e.g., Subbotin et al., 2002). Thus, molecular sequence data have become an important part of diagnoses in this group.
Molecular identification of species of Heterodera initially relied on polymerase chain reaction (PCR) restriction fragment length polymorphism profiles (PCR-RFLP) (Waeyenberge et al., 2009). This methodology has largely been superseded by DNA sequencing, with species being characterized primarily with four markers commonly employed in DNA barcoding: the small subunit ribosomal RNA (18S rDNA), internal transcribed spacer (ITS; comprising ITS1-5.8S-ITS2), large subunit ribosomal RNA (28S rDNA), and the mitochondrial cytochrome oxidase subunit one (coxI) gene regions (e.g., Szalanski et al., 1997;Subbotin et al., 2000Subbotin et al., , 2010aSubbotin et al., , 2010bFerris et al., 2004;Mundo-Ocampo et al., 2008;Escobar-Avila et al., 2018;Powers et al., 2019). The number of sequences for species of Heterodera uploaded to the public database GenBank has accumulated rapidly since the late 1990s to early 2000s when these data first began to become available (e.g., Ferris et al., 1993;Szalanski et al., 1997;Subbotin et al., 2000Subbotin et al., , 2001. There have been some reports, however, of species pairs which cannot be differentiated using sequences of one or more of these genes (Subbotin et al., 2000(Subbotin et al., , 2001Waeyenberge et al., 2009;Vovlas et al., 2015;Sekimoto et al., 2017;Escobar-Avila et al., 2018). As we move into the genomic era, with all its implications for more rapid and better identification methodologies, it seems pertinent to assess the accuracy and utility of the pool of barcoding gene sequences that have accumulated over the last few decades as these data will undoubtedly be incorporated into the next generation of molecular diagnostic tools. Here, we evaluate these barcoding data with aims to identify the availability of sequences for species of the genus as a whole and across regions, the reliability of these data in terms of erroneous or otherwise unreliable sequences, and the utility of the 18S, ITS, 28S, and coxI markers for accurate species identification and delineation.

Materials and Methods
Analyses here are based on sequence data obtained from the NCBI public database, GenBank, up to 10 August 2021. We downloaded all available partial or complete sequences of the 18S rDNA, ITS, 28S rDNA, and coxI gene regions for all named species of Heterodera. We note that ITS is situated between the 18S and 28S genes and comprises the ITS1, 5.8S, and ITS2 genes; we considered a sequence as "ITS" if it included a partial fragment or complete sequence of one of the latter three genes. We searched for all valid species individually and also searched using junior synonyms. We created a database of these sequences where each line of data records the GenBank accession number, species, gene region, geographic collection information, and sequence author(s) or publication reference. In those cases where some of this information was not included with the sequence record on GenBank, we sought it from the publication listed and/or through web searches of scholarly literature aiming to identify if the GenBank accession number in question had been published or referenced. Where collection locality could not be determined and where possible, collection locality was inferred based on the institutional addresses of the author(s) listed in the respective sequence record. Base statistics were calculated in Microsoft Excel, and a map depicting the geographic spread of available sequences was generated using ggplot2 (Wickham, 2009) in R (https://www.R-project.org) and edited in Adobe Illustrator CS6.

Assessment of sequence reliability
We evaluated whether publicly available sequence data of the four markers selected were reliably assigned to the correct species of Heterodera, by determining the number of sequences labeled as a particular species of Heterodera, which were clearly not, or could not reliably be determined to be that species. To do this, we constructed individual sequence files for each species/gene marker combination represented in our database and added a sequence of Globodera pallida (Stone, 1973) Behrens, 1975and Globodera rostochiensis (Wollenweber, 1923) Skarbilovich, 1959 to each file to serve as outgroup taxa. Each sequence file was aligned using MUSCLE (Edgar, 2004) as implemented in MEGA X (Kumar et al., 2018), except where there were less than five sequences of a particular marker for a species. Alignments were not trimmed or restricted to specific regions of the gene under analysis. Neighbor-joining trees based on each alignment file were constructed in MEGA X with the following parameters: 100 bootstrap replications, the number of differences model, inclusion of substitutions and transversions, uniform rates among sites, and pairwise deletion of gaps and missing data. Trees were examined by eye for clear outliers and potentially problematic sequences. Problematic sequences were defined as those that diverged significantly from putative congeners in the neighborjoining trees generated. Such sequences were added to a new database and evaluated through reexamination of alignments and comparison against GenBank using BLAST (Altschul et al., 1990) to determine the source of observed divergence. Where there were less than five sequences of a particular marker for a species, sequences were compared directly against the GenBank database using BLAST. Divergence from other sequences attributed to the same, and other species was recorded.

Assessment of sequence utility
The second aim of our study was to assess the power of each of the four molecular markers for accurate delineation of species. For this, we first generated four new sequence files, each including all sequences of Heterodera for each respective marker as above. Problematic sequences identified in our initial assessment of reliability were excluded, and sequences of G. pallida and G. rostochiensis were added as outgroup taxa as above. Alignments for each dataset were constructed using MAFFT via the online service (Katoh et al., 2019) and examined by eye for additional problematic sequences which would impede tree-based phylogenetic methods; such sequences were added to the database of problematic sequences and removed from their respective sequence files. Sequences were then re-aligned. Neighbor-joining trees were constructed for each dataset in MEGA X as above and examined by eye for clades which included sequences of two or more putatively different species that were poorly or not differentiated from one another. New data files were created for the sequences from each of these ambiguous species pairs or groups with sequences of Globodera spp. added as outgroup taxa. These ambiguous sequence datasets were aligned using MUSCLE and neighbor-joining trees as above and maximum likelihood trees using 100 bootstrap replications, and the general time reversible model with uniform rates among sites were also computed in MEGA X. Intraspecific and interspecific variation were further examined using pairwise comparison tables generated in MEGA X.

Data availability
Our sequence databases and associated data, along with sequence FASTAs, alignments, trees, and other files are all made publicly available on the Commonwealth Scientific and Industrial Research Organisation (CSIRO) Data Access Portal.

General results
Our sequence database includes 2,737 entries, comprising 77, 1,723, 345, and 592 sequences of the 18S, ITS, 28S, and coxI gene regions, respectively. Of the 2,737 sequences in our database, only 1,669 (61%) could be associated with a paper published in academic or other technical literature. Altogether, sequences were available for 66% of valid species of Heterodera (57 out of 87, based on our count at the time of writing; see Li et al., 2020;Hodda, 2022). Eight species were represented by just a single sequence, and for seven of these, this was of the ITS gene region. Only 13 species had at least one sequence of each marker, and only 11 species had more than a single sequence of each marker (Table 1;  Supplementary Table 2).
Sequence data were attributed to specimens collected from 59 countries, from all continents except Antarctica (Fig. 1). More than half came from three countries: China (937), Turkey (272), and the USA (254), while 19 countries had less than five sequences attributed to them, including Afghanistan, Chile, Côte d'Ivoire (Ivory Coast), and Qatar, each with just one record (Supplementary Table 1). Notably, very few sequences originated from sub-Saharan Africa and only a single sequence originated from South America. Iran had sequences attributed to the most species (19), followed by the USA (18), China (16), and Germany (15) (Supplementary Table 2).

Assessment of reliability
The overall rate of erroneous sequences was low, with only 55 of the total 2,737 sequences (2%) being detected as potentially problematic. Of these 55 sequences, 25 were accurate but had been uploaded running in the negative strand (3¢ to 5¢) and required reverse complementing before they could be aligned with congeners. The remaining 30 sequences were considered truly erroneous or otherwise unreliable. Issues appeared to include poor-quality sequencing results and/or sequence editing errors, sequences of Heterodera uploaded under the wrong species name (e.g., Fig. 2), misidentification of related nematodes as Heterodera, and clear contamination (including a sequence of a fungus and cucumber).

Assessment of sequence utility
Each of the four molecular markers evaluated could not differentiate at least one species pair (Table 2). In many cases, putatively distinct species shared identical sequences in one or more of the markers evaluated (Table 2; Fig. 3). Many other comparisons included differences of only one or two base positions (bp), and these slight differences were generally inadequate for species delineation in tree-based methods and BLAST. Intraspecific genetic variation      Intraspecific differences are presented in the format "base pair difference range; median of base pair difference range (percent difference range; median of percent difference range)." Medians indicated only in cases of three or more comparisons; n = the number of sequences compared. Table 3. Continued observed for some species of Heterodera (Table 3) spans the interspecific variation present within some species groups (e.g., Table 2). There were, however, some slight differences in sequences that were consistent enough to delineate species reliably. For example, despite having ITS sequences differing by only two bp from those of H. avenae Wollenweber, 1924, H. mani Mathews, 1971, consistently formed a monophyletic clade to the exclusion of other related species in tree-based methods. Fifteen species pairs could not be reliably delineated using both the ITS and 28S gene regions, followed by 12 species pair issues in the 18S gene region and just two in the coxI region (  Sturhan, 2001, andH. trifolii Goffart, 1932 in the 18S and ITS gene regions and differs from these species by only one and two bp in the 28S gene region, respectively. Sequences of coxI had the most utility for distinguishing between species, with just one case of a weak, and one case of an inadequate species pair delineation; these notable cases are discussed further below.

Discussion
Our analyses indicate that several of the most commonly employed molecular markers for the characterization of species of Heterodera lack the necessary resolution for distinguishing between some species, including a number of plant pests of significant global biosecurity concern. Among those for which molecular identification issues were detected, 13 species (H. avenae, H. carotae Jones, 1950a, H. ciceri Vovlas, Greco & Di Vito, 1985, H. cruciferae Franklin, 1945, H. daverti Wouts & Sturhan, 1978, H. elachista Ohshima, 1974, H. filipjevi (Madzhidov, 1981) Stelter, 1984, H. glycines Ichinohe, 1952, H. goettingiana Liebscher, 1892, H. hordecalis Andersson, 1975, H. oryzae Luc & Brizuela, 1961 are listed as regulated pests in at least one country (Singh et al., 2013). This has numerous implications for routine molecular diagnoses of species of Heterodera, such as the potential for confusing a major pest with a minor or non-pest species. For example, H. avenae is a major pest of cereals in temperate regions and causes significant annual yield losses throughout its range (e.g., Meagher, 1982;Smiley et al., 2005;Nicol and Rivoal, 2008), but our analyses showed that it could not be reliably distinguished from H. arenaria Cooper, 1955 or H. pratensis using the most commonly employed molecular marker, ITS. Both latter species parasitize non-crop grasses ; thus, there is potential for confusing H. avenae with one of these non-pest species. Such cases of mistaken identity could result in misdiagnosed infestations and novel incursions. In addition, mixed infestations of some closely related species which parasitize similar crops, such as members of the Schachtii group, may go undetected.
Although a combination of morphological data and sequences from multiple gene regions is best for identification of species of Heterodera, our results suggest that where resources or expertise are limited, the coxI region should be favored over 18S, ITS, and 28S for basic diagnostic purposes. It is significant that a number of economically important species of Heterodera cannot be delineated using ITS; as to date this marker has been used more extensively than any other for both identification and phylogenetic Purposes (e.g., Subbotin et al., 2001Subbotin et al., , 2017Tanha Maafi et al., 2003). Furthermore, more species of Heterodera have been characterized with ITS than any other marker, and for many species, this is the only gene region for which sequences are available. Thus, when developing molecular diagnostic tests for plantparasitic nematodes like Heterodera spp., the tradeoff between delineation power and species coverage needs careful consideration. There is currently a great amount of interest in employing high-throughput sequencing methods, such as metabarcoding and eDNA, for detection and monitoring of pest species (e.g., Abdelfattah et al., 2018;Valentin et al., 2018;Ruppert et al., 2019;Hardulak et al., 2020;Young et al., 2021). Many metabarcoding studies have utilized short fragments of nuclear genes such as 18S for species detection and identifications (e.g., Macheriotou et al., 2019;Ruppert et al., 2019;Giebner et al., 2020), but for Heterodera, fragments of the nuclear genes tested here lack the power to distinguish between all species. Those using such tools will need to be aware of the characteristics and shortcomings of the marker(s) employed and may need to follow-up species detections with other methods to confirm identifications.
The coxI gene seems robust for species delineation in the genus Heterodera; however, we did observe two instances in which this marker showed potential shortcomings. The first issue was detected in our coxI phylogenetic analyses of the Avenae group, in which sequences of H. arenaria formed a clade within the larger H. avenae species cluster. The sequences of H. arenaria differ by just two bp from those of H. avenae, well below the intraspecific variation observed for the latter species. However, these two bp appear to be unique changes, suggesting that H. arenaria can at least be distinguished from H. avenae from a barcoding, if not a phylogenetic, perspective. This is consistent with the findings of  where H. arenaria and H. avenae were shown to be distinct in a haplotype network but formed a polyphyly in a tree derived from Bayesian analysis.  speculated that H. arenaria represents a species recently diverged from a European population of H. avenae and remarked that, from a phytosanitary perspective, it is best to retain the species status of H. arenaria because it parasitizes coastal grasses of no economic importance, rather than cereals as in H. avenae.
The second instance in which coxI showed a possible shortcoming relates to the case of distinguishing H. carotae from H. cruciferae. These two species have previously been reported as indistinguishable using PCR-ITS-RFLP profiles and ITS gene sequences (Subbotin et al., 2000(Subbotin et al., , 2001. In a study of H. carotae in Mexico, Escobar-Avila et al. (2018) produced novel coxI sequences of that species but also several coxI sequences for H. cruciferae, which were the only coxI sequences available for the latter species at the time of writing. Escobar-Avila et al. (2018) found that H. carotae and H. cruciferae could not be distinguished using coxI sequences and concluded that to differentiate these species an integrated approach including morphology and a test of host range was necessary. Although it is entirely possible that H. carotae and H. cruciferae are indistinguishable using coxI, considering the former species appears restricted to carrots as hosts (Jones, 1950;Winslow, 1954;Mugniery and Bossis, 1988) and the latter to brassicas and a few species of the Lamiaceae (Winslow, 1954;Baldwin and Mundo-Ocampo, 1991), it is somewhat surprising that these species would exhibit so little divergence in such a rapidly evolving gene. Escobar-Avila et al. (2018) provided three sequences of H. cruciferae, two from specimens from the USA and one from Russia, but did not provide a morphological account of these specimens, presumably because they were all consumed in molecular analyses. The Russian sequence of H. cruciferae differs from those from the USA by 15 bp, whereas the USA sequences of H. cruciferae differ from sequences of H. carotae from multiple countries by as little as 1 bp. Thus, the current intraspecific variation for H. cruciferae is greater than the difference between some sequences of H. carotae and H. cruciferae. Because these two species are fairly similar morphologically and could easily co-occur in mixed or rotated vegetable fields, it is possible that some of the putative specimens of H. cruciferae used by Escobar-Avila et al. (2018) were misidentified. At present, there are simply too few coxI sequences of H. cruciferae to be certain of the utility of this marker, or lack thereof, for the H. carotae-H. cruciferae species pair. Heterodera carotae is an important pest of carrots throughout its range (Greco et al., 1994;Esquibet et al., 2020), and although H. cruciferae is largely considered a minor pest, there have been a few reports of significant damage to crops infested by this species (Lear et al., 1965;Sykes and Winfield, 1966). It is critical then that a suitable molecular marker be identified which can reliably distinguish between these two species. Using a collection of microsatellite markers, Esquibet et al. (2020) demonstrated a high level of genetic divergence between H. carotae and H. cruciferae and remarked that microsatellites could be used to develop a diagnostic test for these species. If further study confirms that the coxI gene cannot reliably distinguish between H. carotae and H. cruciferae, other molecular methods such as microsatellites (e.g., Gautier et al., 2019;Esquibet et al., 2020) can fill this diagnostic gap.
Although the number of obviously erroneous sequences detected was low, for several species of Heterodera, we observed high levels of intraspecific variation within one or more of the molecular markers evaluated, primarily ITS. Unsurprisingly, this was most apparent in those species/marker combinations with a large pool of sequences, such as H. avenae, H. glycines, and H. latipons Franklin, 1969, each of which had over 100 ITS sequences available. In these species, we observed intraspecific maxima of 80-122 bp (12%-13%) in the ITS gene. Additionally, some species had large intraspecific variation despite lower sample size, such as H. goettingiana with 15 ITS sequences and an intraspecific maximum of 81 bp (11%) and H. hordecalis with 37 ITS sequences and an intraspecific maximum of 68 bp (9%). None of these intraspecific maxima are unbelievable when considering that these datasets include sequences of individuals sourced from many geographically disparate populations (e.g., Subbotin et al., 2003. However, these large levels of intraspecific variation do clash with the patterns observed for most other members of the genus and notably in several other species with medium to large sequence pools, such as H. schachtii and H. filipjevi, both of which exhibit intraspecific maxima of less than 5% in ITS. We think it is likely that for several species the overall level of intraspecific variation observed for some molecular markers is inflated. In our assessment of sequence reliability, in many instances, we could not be certain if the intraspecific variation observed was due to real population-level genetic variability or artifacts of poor-quality sequencing results and/or sequence editing errors. We flagged sequences that diverged greatly from their congeners as problematic; however, in larger sequence pools, relatively small levels of variation between individual sequences become amplified, resulting in the very large levels of overall intraspecific variation detected in pairwise comparisons. This is an important limitation of our analyses and of the overall pool of publicly available sequences. A related issue is species for which only a few sequences are available, but one or more of those sequences diverge significantly from the others. Again, it can be difficult to determine if such divergence represents natural variation or an artifact such as misidentification or sequence editing errors. Furthermore, the large number of sequences submitted to GenBank that are not associated with a published manuscript leaves many issues related to sequence identity ambiguous as there is no account of the morphology of specimens utilized. A good example of both of the above relates to the 28S sequences available for H. cruciferae. Sasanelli et al. (2013) performed a thorough study of H. cruciferae in Italian cabbage, identified specimens using morphology, and characterized them with ITS and 28S rDNA gene sequences. However, BLAST results of the 28S sequence of H. cruciferae from that study suggests that it is probably representative of H. schachtii, another cyst nematode widespread on brassicas (Subbotin et al., 2010b). There is one other 28S sequence of H. cruciferae available, but it is not associated with a published manuscript (KP114546; Jabbari et al., unpublished); this sequence is very close to H. carotae but not identical. This suggests that this "unpublished" sequence is accurate, but without a morphological account or other sequences with which to compare, it is still uncertain. This results in a situation where, although two 28S sequences are available for H. cruciferae, we still cannot be confident that either are truly representative of that species. Thirty species of Heterodera lack molecular data and thus cannot be diagnosed using barcoding methodologies. Of the 57 species of Heterodera for which molecular data are available, eight are represented by just a single sequence and over half of the remainder have just one sequence for one or more of the markers evaluated here. Where only a single sequence of a particular marker is available, it is best treated with caution when used for diagnostic purposes as additional sequences, ideally from independent studies, are needed for confidence that such sequences are truly representative of the species they are purported to be of.
There is little doubt that erroneous sequences we observed on GenBank were uploaded in good faith. However, to avoid errors, it is important that authors carefully compare their novel sequences with those available in public databases via BLAST or other phylogenetic methods before upload. Original sequencing results should be scrutinized for basecall quality, and poor-quality sequences should be discarded. Where possible, multiple sequence replicates of each gene should be produced and compared prior to upload to ensure base calls are consistent. It is also advisable to sequence multiple gene regions from individual isolates to ensure that molecular based identification is consistent across markers. We also encourage authors who have uploaded sequences which later prove to be mislabeled or problematic to correct them.
High-throughput sequencing techniques utilizing new and more accurate markers are already superseding other molecular-based diagnostic methods for cyst nematodes in many laboratories (e.g., Gautier et al., 2019;Esquibet et al., 2020). Despite this, we foresee the four markers evaluated here remaining in use for identification of plant parasitic nematodes for many years to come. With that in mind, of the four markers evaluated, coxI shows the greatest utility for identification and delineation of species of Heterodera. However, the cox1 gene might not be best for phylogenetic interpretation ; so, going forward, a combination of coxI and ITS, plus 28S where possible, seems ideal, especially as coxI data are presently available for only around a third of species of Heterodera. There is a real possibility of hybridization between species of Heterodera (Potter and Fox, 1965); thus, a combination of a mitochondrial and nuclear marker is recommended for areas where the range of multiple species overlap. Molecular diagnoses should be based not only on multiple molecular markers used in concert but also on a combination of sequence matching and tree-based phylogenetic methods. Lastly, it is important to keep in mind that no one technique is likely to be a panacea for the identification of species of Heterodera. New species are being described at a fairly steady rate (e.g., Li et al., 2020;Phougeishangbam et al., 2020;Jiang et al., 2022), and sequence data continue to accumulate and provide further insight into the population genetics, host associations, and phylogeography of Heterodera Esquibet et al., 2020;Oro and Tabakovic, 2020). It is telling that so few sequences have been generated from countries in South America and Sub-Saharan Africa as this seems an obvious result of lack of resources, rather than lack of Heterodera species richness. Thus, there is still a great need for traditionally trained taxonomists and diagnosticians that can employ a range of techniques to identify these problematic nematodes.