High‐throughput identification and diagnostics of pathogens and pests: Overview and practical recommendations

Abstract High‐throughput identification technologies provide efficient tools for understanding the ecology and functioning of microorganisms. Yet, these methods have been only rarely used for monitoring and testing ecological hypotheses in plant pathogens and pests in spite of their immense importance in agriculture, forestry and plant community dynamics. The main objectives of this manuscript are the following: (a) to provide a comprehensive overview about the state‐of‐the‐art high‐throughput quantification and molecular identification methods used to address population dynamics, community ecology and host associations of microorganisms, with a specific focus on antagonists such as pathogens, viruses and pests; (b) to compile available information and provide recommendations about specific protocols and workable primers for bacteria, fungi, oomycetes and insect pests; and (c) to provide examples of novel methods used in other microbiological disciplines that are of great potential use for testing specific biological hypotheses related to pathology. Finally, we evaluate the overall perspectives of the state‐of‐the‐art and still evolving methods for diagnostics and population‐ and community‐level ecological research of pathogens and pests.


| INTRODUCTION
Globalization and international trade of plants have greatly accelerated the frequency and magnitude of pest and pathogen invasions to agroforestry systems leading to novel encounters with plant hosts (Lenzen et al., 2012;Liebhold, Brockerhoff, Nuñez, Wardle, & Wingfield, 2017). In some rare cases, these invasive antagonists have caused large-scale transformations of native ecosystems and changed the ecological dynamics through local and regional extinction of native host species (Prospero & Cleary, 2017) and collapses of ancient civilizations (Santini, Liebhold, Migliorini, & Woodward, 2018). In addition, climate change facilitates the probability of establishment of introduced pests and pathogens and promotes range expansion of existing populations (Seidl et al., 2017). Botanical gardens and early warning sentinel systems represent means to identify new and emerging risks to natural plant communities and to improve surveillance globally (Barham, 2016;Vettraino et al., 2015).
Besides causing economic damage and disease in plants and animals including humans (Seyedmousavi et al., 2018), pathogens and pests help maintain plant diversity in natural ecosystems (Bagchi et al., 2014; Maron, Marler, Klironomos, & Cleveland, 2011). This Janzen-Connell phenomenon occurs mainly through herbivory or root decay by hexapods or microbial pathogens that are specialized on the dominant plant species and selectively increase their mortality at the seedling stage (Liang et al., 2016).
Traditionally, microbial organisms including pathogens have been identified based on symptoms of disease or culture morphology, whereas detection of pests usually relies on morphological characters of representative individuals. Many obligate intracellular pathogens do not grow in pure culture and never form reproductive structures, which renders their detection and identification difficult.
Furthermore, both microbial pathogens and animal pests may exhibit high intraspecific variability or comprise cryptic species that may strongly differ in niche and aggressiveness (Ashfaq & Hebert, 2016; Tuda, Kagoshima, Toquenaga, & Arnqvist, 2014). Within biological species, genotypes or races may also differ in their pathogenicity (Barnes et al., 2016; Brasier & Kirk, 2010), sometimes depending on the presence of accessory pathogenicity loci and chromosomes (Möller & Stukenbrock, 2017). These inter- and intraspecific differences emphasize the importance of precise detection of the organisms at the level of species and pathotypes or strains therein.
Rapid and accurate identification of pathogenic microorganisms and pests is essential for detection and employment of appropriate mitigation measures (Comtet, Sandionigi, Viard, & Casiraghi, 2015).

| THE EMERGING METHODS
In the last 15 years, researchers have taken advantage of the rapid development of high-throughput molecular identification methods to characterize the enormous diversity of microbial life aboveground and belowground. These methods enable identification of thousands of taxa per sample from hundreds of samples simultaneously and facilitate concurrent focus on any groups of organisms and viruses (Bork et al., 2015; Knief, 2014; Mendes et al., 2013; Uroz, Buee, Deveau, Mieszkin, & Martin, 2016). Based on their technical aspects, high-throughput identification methods can be divided into PCR-based quantification methods, hybridization-based methods (e.g., microarrays), second-generation fingerprinting methods (e.g., RAD-seq) and sequence-based methods such as metabarcoding, (meta)genomics and (meta)transcriptomics. The first and most influential examples of these methods and their applications in pathogens and pests are summarized in Table 1.
Droplet digital PCR (ddPCR) is based on microfluidics technology, which separates the amplification reaction into >20,000 individual droplets and allows absolute quantification of DNA from up to four target organisms or genes simultaneously, with a detection limit of 10⁻⁵ relative abundance (Hindson et al., 2011). Dreo et al. (2014) showed much greater precision of ddPCR than of ordinary qPCR for quantification of two bacterial plant pathogens with optimal and suboptimal primers. Currently, ddPCR can be run in 96-well and 384-well plates, but it is technically possible to increase the throughput of samples. It may also be possible to increase the number of fluorescent dyes to multiplex >4 reactions. The methodology and its use in pathology are reviewed in Gutierrez-Aguirre, Rački, Dreo, and Ravnikar (2015).
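The absolute quantification performed by ddPCR rests on Poisson statistics: because a droplet may contain more than one template copy, the mean number of copies per droplet is estimated from the fraction of positive droplets rather than from their raw count. The following minimal sketch illustrates that calculation. The function name and the default droplet volume of 0.85 nL are illustrative assumptions, not values taken from the studies cited above; the volume should be replaced by the specification of the instrument in use.

```python
import math


def ddpcr_copies_per_ul(positive: int, total: int,
                        droplet_volume_nl: float = 0.85) -> float:
    """Estimate target copies per microlitre of reaction from droplet counts.

    droplet_volume_nl is an assumed, instrument-specific value.
    """
    if total <= 0 or not 0 <= positive < total:
        # If every droplet is positive the Poisson estimate is undefined.
        raise ValueError("need total > 0 and 0 <= positive < total")
    fraction_positive = positive / total
    # Poisson correction: mean number of template copies per droplet.
    copies_per_droplet = -math.log(1.0 - fraction_positive)
    droplets_per_ul = 1000.0 / droplet_volume_nl  # 1 ul = 1000 nl
    return copies_per_droplet * droplets_per_ul


if __name__ == "__main__":
    # Hypothetical run: 2,000 positive droplets out of 20,000 accepted droplets.
    print(round(ddpcr_copies_per_ul(2000, 20000), 1))
```

Saturated reactions (all droplets positive) cannot be quantified this way and would need dilution, which is why the sketch rejects them.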
Quantification of marker or functional genes is possible by combining a spiking approach with HTS identification methods. For spiking, a known quantity of control DNA or individuals is added to the sample prior to DNA/RNA extraction, and the quantity of target organisms or genes is estimated from the relative number of sequences obtained (Pochon, Bott, Smith, & Wood, 2013; Tkacz, Hortala, & Poole, 2018). In theory, spiking allows absolute quantification of the DNA marker content of any pathogenic organism or gene, but this method has been little tested thus far. Differences in sequence length, G + C/A + T content, DNA secondary structure, etc. (see Technical biases below), may all affect accuracy of the spiking approach.
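The arithmetic behind the spiking approach can be sketched as a simple proportion: the target-to-spike read ratio is scaled by the known amount of spike added. The function and parameter names below are hypothetical, and the sketch assumes that target and spike are extracted, amplified and sequenced with equal efficiency, which the biases listed above may well violate.

```python
def absolute_copies_from_spike(target_reads: int, spike_reads: int,
                               spike_copies_added: float) -> float:
    """Scale target read counts to marker copies using a spike-in control.

    Assumes equal recovery of target and spike (an idealization).
    """
    if spike_reads <= 0:
        raise ValueError("spike-in was not recovered; cannot scale reads")
    return target_reads / spike_reads * spike_copies_added


if __name__ == "__main__":
    # Hypothetical sample: 10,000 spike copies added before extraction,
    # 100 spike reads and 500 pathogen reads recovered after sequencing.
    print(absolute_copies_from_spike(500, 100, 1e4))
```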

| Microarrays
Microarray technology is based on accommodation of multiple target-template hybridization reactions onto small chips, using robotics to generate arrays and perform multiple hybridization reactions simultaneously. Microarrays have been widely used for species diagnosis, detection of functional genes and gene expression (Sessitsch et al., 2006). Diagnostic microarrays were the earliest high-throughput identification methods that enabled targeting speci[...] While early microarrays used PCR-amplified templates, fine-tuning of sensitivity enabled detection of taxa from genomic DNA (DeAngelis et al., 2011). Microarrays also enable detection of single nucleotide polymorphisms (SNPs), which allows genotyping of plant pathogens and detection of aggressive strains (Lievens, Claes, Vanachter, Cammue, & Thomma, 2006). Although reusable microarrays are cheap to run, provide highly sensitive results rapidly and suffice for monitoring the presence and abundance of specific pathogenic taxa and pathogenicity-related genes in complex samples, their major disadvantages are that they miss a large proportion of the species and functions present in the targeted environment and that stringency is non-optimal for various probes (Sessitsch et al., 2006). Therefore, microarrays have been replaced by high-throughput sequencing (HTS) methods in the last decade.
[...] (Knief, 2014; Reuter, Spacek, & Snyder, 2015). During the first five years in the market, HTS platforms usually evolve rapidly in terms of throughput, data quality and reduction in analytical costs, but technical constraints become limiting soon thereafter. Fundamentally new HTS methods are announced almost every year, but only a fraction of these gain public attention and approximately half of those appear in the market (Heather & Chain, 2016). Table 2 provides an overview of widely used HTS platforms.

| HTS methods for identification of species
The first commercially available HTS method, 454 pyrosequencing (Roche Diagnostics, Basel, Switzerland), was developed in the early 2000s. The 454 technology was >100-fold cheaper (10⁻² EUR/read) than Sanger sequencing, and the analysis chemistry was rapidly optimized to provide high-quality reads from 50 to 700-1,000 bases at 1.2 million read throughput (Reuter et al., 2015). The 454 technology was rapidly adopted by microbial ecologists who made groundbreaking discoveries about the ultra-high diversity of prokaryotes (Leininger et al., 2006; Sogin et al., 2006). Anecdotally, much of the diversity turned out to be analytical artefacts, indicating the need for careful quality control and optimization of both sample preparation and analytical steps (Huse, Huber, Morrison, Sogin, & Mark Welch, 2007). Separation of artefacts from rare taxa is still the greatest issue of all HTS technologies. Soon after these pioneering prokaryote studies, 454 pyrosequencing was implemented to identify eukaryotes and to separate potentially pathogenic taxa from other guilds based on taxonomic information from indoor environments, animal samples, soil and foliage (Buee et al., 2009; Jumpponen & Jones, 2009; Luna et al., 2007; McKenna et al., 2008; Wegley, Edwards, Rodriguez-Brito, Liu, & Rohwer, 2007). Several years after implementation, the 454 method was used to identify macroorganisms such as plants from mammal and hexapod diet (Valentini et al., 2009) and animals including parasitic nematodes and other pests (Creer et al., 2010; Porazinska et al., 2009; Table 1).
The Illumina (www.illumina.com) and Ion Torrent (www.iontorrent.com) technologies replaced 454 in the early 2010s because of greater throughput at lower costs. Nonetheless, Ion Torrent is persistently hampered by short read length (up to 450 bp) and fluctuating sequence quality, which has limited its use in analysis of soil and plant samples (see Kemler et al., 2013). Compared with the 454 platform, the Illumina technology provides up to 3,000-fold greater throughput, several times greater accuracy and the possibility to sequence reads of up to 550 bp (2 × 300 paired-end option) at relatively low cost, 10⁻⁵ to 10⁻⁴ EUR/read. Generation of self-chimeric sequences and long artefactual inserts or deletions represents the main shortfall of Illumina sequencing compared to other platforms. At present, Illumina sequencing is by far the best option for short DNA/RNA barcodes and metagenomics, considering sequence quality and analytical costs of library preparation and sequencing (Knief, 2014). It will undoubtedly remain the most widely used HTS method until the end of this decade in spite of only negligible technological improvements since 2015. The ultra-high throughput of Illumina sequencing allows analysis of >1,000 samples in a single run at sufficient sequencing depth (Zinger et al., 2017). Illumina sequencing revealed that growing rotations of legume crops greatly increase the pathogen load in soils, with several-year legacy effects (Bainard et al., 2017). Cline et al. (2017) showed that relative abundance of soil pathogens increases with plant biomass in grasslands. In 2015, a paired-end ultra-HTS platform BGISEQ (www.seq500.com/en/), which is similar to the Illumina platform, was released. So far, it has been used for metagenomic detection of human pathogens (Cheng et al., 2018). Given its shorter read [...] reducing the error rate to a minimum (0.1%) at 9- to 11-fold consensus (Tedersoo, Tooming-Klunderud, & Anslan, 2017).
Given the average raw read length of 30 kb, PacBio allows sequencing of up to 5 kb DNA fragments at satisfactory quality (Heeger et al., 2018).
Sequencing of long fragments of a single molecule has become attractive in DNA barcoding; for example, Hebert et al. (2017) reported sequencing the DNA barcode in around 10,000 arthropod specimens simultaneously. In a pioneering study, PacBio was successfully applied to identify potential mycoparasites of the coffee rust, Hemileia vastatrix. In general, long fragments greatly improve identification via greater taxonomic resolution of unconserved regions and phylogenetic analysis based on relatively more con[...]
In spite of its high error rate, the nanopore technology holds great promise in disease diagnostics due to the low cost of equipment and analysis time of 1-2 days (Quick et al., 2016). The unique direct RNA sequencing option (without a cDNA reverse transcription step) is of particular interest, but it requires testing for environmental samples and analytical biases.
No universal primers exist for viruses, rendering metagenomics and metatranscriptomics the only suitable methods for detecting previously unrecognized viruses (Mokili et al., 2012; Zhang, Breitbart, Lee, Run, & Wei, 2005). Metagenomic- and metatranscriptomic-based identification of viruses has been recently reviewed in Massart, Olmos, Jijakli, and Candresse (2014), Roossinck, Martin, and Roumagnac (2015) and Adams and Fox (2016). Roossinck et al. (2015) in particular provide information about alternative analysis workflows for single- and double-stranded DNA and RNA viruses.
For dsRNA viruses, metatranscriptomics of the 21-24 base fragments of silencing RNA (siRNA) has become a popular identification tool of various viruses due to ease of analysis and high detection capacity of small RNA analysis (Kreuze et al., 2009;Roossinck et al., 2015).
Because of various biases introduced by primer choice and the PCR amplification process, PCR-free technologies offer great promise for molecular identification of organisms, particularly viruses and bacteria. In spite of generating huge amounts of sequence data, shotgun sequencing of full metagenomes or metatranscriptomes is an inefficient approach to taxonomic identification of eukaryotes (Alberdi, Aizpurua, Gilbert, & Bohmann, 2018; but see Geisen et al., 2015), because only a tiny fraction of the sequences is likely to originate from relevant marker genes. Furthermore, metagenome and metatranscriptome analyses suffer from several technical problems. Because organisms differ substantially in their AT:CG ratio, genomic fragments with extreme ratios may be disfavoured in the sequence analyses, depending on analysis platform (Shakya et al., 2013). The metagenomic fragments cover random stretches of the marker genes among other genomic regions, rendering it impossible to address species-level taxonomic richness in natural communities. The marker-based reference databases such as UNITE and SILVA [...]
Several examples of using RAD-seq in plant pathogens are given in Grünwald et al. (2016). This method revealed high genetic variability and recombination in a crop pathogen Fusarium graminearum, suggesting that these features facilitate rapid adaptation to resistant cultivars and biocides (Talas & McDonald, 2015). RAD-seq also revealed several coexisting groups of a mutualistic fungus Rhizophagus irregularis, most of which were globally distributed (Savary et al., 2018).
[...] (Table 1). Due to much greater genome size and organization of genetic material into multiple chromosomes, eukaryote genomes are more difficult and costly to sequence and assemble compared with those of prokaryotes.
Using a shotgun metagenomic approach, Duan et al. (2009) sequenced the genome of an uncultured plant pathogenic bacterium [...] because of multiple carbon sources diluting the isotopic signal. This is especially relevant for eukaryotic pathogens that grow slowly, incorporate ¹³C or nucleotide analogues into their DNA slowly and may use much of the labelled carbon for respiration. Nonetheless, ¹³C incorporated into DNA and fatty acids revealed flow of plant-derived carbon through the soil food web and a decline in pathogen-to-mycorrhiza ratio during secondary succession (Hannula et al., 2017). This method could be useful for distinguishing the pests and pathogens that use recent photosynthesis products from other organisms, or for detecting potential biocontrol agents.

METHODS
All molecular identification methods suffer from specific analytical biases. Marker bias may select for organisms that exhibit high copy numbers. Primer bias discriminates against targets that exhibit primer-template mismatches, particularly at the 3′ end of the primer, which reduces their relative amplification efficiency by 1-2 orders of magnitude (Ihrmark et al., 2012). Primer bias in the ITS region is important in several animal and plant pathogenic fungal groups, nematodes and alveolates. PCR bias represents unequal amplification of target species due to differences in AT:CG ratio, DNA secondary structure and marker length (Ihrmark et al., 2012). Some arthropod and fungal groups exhibit introns in rRNA genes or long ITS1 or ITS2 regions, which may render corresponding taxa entirely unrepresented (Tedersoo, [...]). The best example concerns the ash dieback disease agent Hymenoscyphus fraxineus that exhibits a long 3′ terminal 18S intron, which renders the species undetectable using the ITS1F/ITSOF forward primers (see Cross et al., 2017). Most Oomycota possess a long ITS2 region, which may be discriminated against in studies targeting all eukaryotes.
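Primer bias of the kind described above can be screened in silico by counting primer-template mismatches and flagging those near the 3′ end. The sketch below is a minimal illustration under stated assumptions: the primer and binding site are already in the same orientation and of equal length, degenerate primer positions follow the standard IUPAC nucleotide codes, and the five-base 3′ window is an arbitrary illustrative choice. The example primer is hypothetical.

```python
# Standard IUPAC nucleotide codes: each primer symbol maps to the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer: str, site: str, three_prime_window: int = 5):
    """Return (total mismatches, mismatches within the 3' window).

    primer and site are both written 5'->3' on the same strand.
    Any site character not accepted by the primer symbol counts as a mismatch.
    """
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have equal length")
    positions = [
        i for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if s not in IUPAC[p]
    ]
    cutoff = len(primer) - three_prime_window
    in_three_prime = sum(1 for i in positions if i >= cutoff)
    return len(positions), in_three_prime


if __name__ == "__main__":
    # Hypothetical degenerate primer checked against two candidate binding sites.
    print(primer_mismatches("ACGTRYACGT", "ACGTACACGT"))
    print(primer_mismatches("ACGTRYACGT", "ACGTACACGA"))
```

Templates with any mismatch in the 3′ window would be the first candidates for a redesigned or additional primer.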

| Design of HTS studies
Study design depends on the objectives of research. Purely descriptive studies with haphazard sample collection and insufficient replication are difficult to publish and not worth the effort, except sequencing genomes or transcriptomes, or validating novel methods.
Testing ecological hypotheses requires a proper well-replicated sampling design. Many researchers seem to forget that technical replicates, multiple spatially autocorrelated subsamples and thousands of recovered OTUs do not serve as independent biological replicates (Prosser, 2010). This is particularly relevant in the geographically structured sampling with inherent hierarchical design and multilevel spatial autocorrelation.
One of the main questions in pathological and microbiological research is whether or not to pool subsamples. Pooling may strongly reduce analytical costs, but it also reduces small-scale resolution. The answer depends on the research objectives, spatiotemporal scale and nature of the samples of the particular study. If individual samples (e.g., leaf, soil core) are small and expected to represent the community poorly, pooling multiple samples is a viable option. In case of hierarchical design (i.e., structured by block, plot or site), it is useful to pool multiple subsamples when the internal variation is not of interest. However, in most other cases, analysis of multiple independent samples is preferable due to the ability to estimate sampling error and address the importance of spatiotemporal variability. HTS analyses can easily recover slight shifts in taxonomic and gene composition; therefore, multivariate techniques require just 3-4 replicates to detect biologically important shifts in community composition (Balint et al., 2016). An extra replicate should be considered, because it is common to obtain low-quality DNA or a limited number of sequences from some (typically 1%-10%) samples. Analysis of richness and diversity measures and pathogen load requires more samples, because univariate tests have lower statistical power.
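The case for an extra replicate can be illustrated with a short binomial calculation. The sketch below assumes that each sample fails independently with the same probability, which is a simplification, and uses hypothetical function and parameter names.

```python
from math import comb


def prob_enough_replicates(n_collected: int, n_needed: int,
                           failure_rate: float) -> float:
    """Probability that at least n_needed of n_collected replicates succeed.

    Assumes independent failures with a constant per-sample failure rate.
    """
    success = 1.0 - failure_rate
    return sum(
        comb(n_collected, k) * success ** k * failure_rate ** (n_collected - k)
        for k in range(n_needed, n_collected + 1)
    )


if __name__ == "__main__":
    # With a 10% failure rate, compare collecting 3 versus 4 replicates
    # when 3 usable replicates are needed per treatment.
    print(round(prob_enough_replicates(3, 3, 0.10), 4))
    print(round(prob_enough_replicates(4, 3, 0.10), 4))
```

Under these assumptions, one additional replicate raises the chance of retaining three usable replicates from about 73% to about 95% at the upper end of the reported failure range.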

| Sample preparation for HTS analysis
HTS techniques are sensitive to spoiling, external contamination and cross-contamination, hence requiring careful collection, handling and pre-treatment to prevent contamination, overgrowth by fast-growing moulds and DNA/RNA degradation (Lindahl et al., 2013). [...], optimal protocols should be selected considering the mass, substrate and target organism (Brooks et al., 2015). Samples can be extracted at low overall mass (0.25 g), which recovers nearly comparable taxonomic richness with high-quantity extraction (10 g) for microorganisms (Song et al., 2015). For small extraction quantities, it is relatively more important to thoroughly homogenize the sample. For ultra-long amplicons and genomic analyses, bead beating should be kept at minimum duration. For metagenomic- and RNA-based analyses, it is of particular importance to minimize the concentration of co-extracted humic acids and saccharides that may interfere with downstream processes. The extracted DNA from soil and other organic-rich substrates may require an extra purification step using filter columns or magnetic beads for optimal performance in (meta)genomics analyses.

| Marker and primer selection
For HTS-based diversity analyses, it is very important to thoroughly consider the DNA/RNA marker based on desired taxonomic resolution. For routine community-level analysis, species-level resolution should always be targeted to avoid bulking together pathogenic organisms with closely related endophytes and saprotrophs (Critescu, 2014). Nonetheless, strains of the same antagonist species may differ strongly in pathogenicity, which ren[...] (Pawlowski, Audic, & Adl, 2012). The official animal barcode COI performs poorly in HTS analyses because of the lack of conserved regions for inclusive primers and loss of primer-template specificity with multiple degenerations (Geller, Meyer, Parker, & Hawk, 2013). [...] (Vestheim & Jarman, 2008), but it is probably impossible to effectively design for nuclear rRNA genes. On the other hand, co-targeting host DNA makes it possible to determine the abundance of the pathogen DNA marker relative to the host DNA marker in a way that is comparable (i.e., without systematic bias) across samples.
In analysis of complex pathological systems, food webs and diet of omnivores, organisms from multiple kingdoms can be targeted simultaneously. Strategies for this include a universal marker such as 18S rRNA or ITS for all eukaryotes, or different markers for each kingdom. Universal primers and primer mixes exist for the rRNA markers (Table 3). Different markers of similar length can be analysed in multiplex or mixed after separate amplification into a common library (de Barba, Boyer, Rioux, Coissac, & Taberlet, 2014). However, markers may yield >2 orders of magnitude difference in average sequencing depth, indicating that the relative discrimination factor should be considered beforehand. The metagenomic approach has proven a viable alternative for relative quantification of DNA of target organisms in soil and gut contents (Pearman et al., 2018).
One or both primers used for HTS should be tagged with a molecular identifier to enable multiplexing of samples. These tags of typically 6-12 bases should differ from each other by at least 4 bases/indels (e.g., the "error-correcting" Golay identifiers; Lundberg, Yourstone, Mieczkowski, Jones, & Dangl, 2013) to prevent random mutations in tags or impure synthesis from erroneously switching sequences among samples. The tagged primers may also include a platform-specific sequencing primer, but such long oligonucleotides may perform poorly (Lindahl et al., 2013). For Illumina sequencing, 96 combinations of Nextera indexes can be ligated by PCR. Primers tagged with identifiers only are cheaper and can be used for analysis on any sequencing platform, rendering these usable for many years. To reduce the competition among tagged amplicons in the ligation step, it is advisable to select all identifiers to start with the same nucleotide and use a 2-base linker sequence with no match to any of the templates. Identifier tags with an AT:CG ratio outside the range 0.25-4 tend to perform poorly. It is strongly recommended to add identifier tags to both the reverse and forward primers to minimize tag-switching (Gohl et al., 2016).
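The tag design rules above (pairwise difference of at least 4 bases, a shared first nucleotide, and an AT:CG ratio within 0.25-4) lend themselves to an automated check before primers are ordered. The following sketch is a simplified illustration: it measures substitutions only (Hamming distance) and therefore does not account for indels, it assumes all tags have equal length, and the function names and example tags are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def at_cg_ratio(tag: str) -> float:
    """Ratio of A+T bases to C+G bases; infinite when the tag has no C or G."""
    at = sum(tag.count(base) for base in "AT")
    cg = sum(tag.count(base) for base in "CG")
    return float("inf") if cg == 0 else at / cg


def check_tags(tags, min_distance: int = 4, ratio_range=(0.25, 4.0)):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    for tag in tags:
        ratio = at_cg_ratio(tag)
        if not ratio_range[0] <= ratio <= ratio_range[1]:
            problems.append(f"{tag}: AT:CG ratio {ratio:.2f} outside allowed range")
    if len({tag[0] for tag in tags}) > 1:
        problems.append("tags do not all start with the same nucleotide")
    for a, b in combinations(tags, 2):
        distance = hamming(a, b)
        if distance < min_distance:
            problems.append(f"{a}/{b}: only {distance} differing bases")
    return problems


if __name__ == "__main__":
    # Hypothetical 8-base tags; the third is too similar to the first.
    for problem in check_tags(["ACGTACGT", "AGCATGCA", "ACGTACGA"]):
        print(problem)
```

A full design tool would additionally use an edit distance that covers indels, as the error-correcting identifier sets cited above do.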

| PCR
Prior to PCR, it is recommended to quantify DNA and use equal amounts of template for each sample to be able to use the same number of PCR cycles across the study (Gohl et al., 2016). The PCR mix should include a high-affinity and high-processivity polymerase (e.g., Pfu, Phusion, Q5) to minimize incorporation of erroneous nucleotides and generation of partial reads that can be converted to chimeric sequences in subsequent extension cycles. These more expensive polymerases strongly reduce the number of chimeric sequences and artificial taxa comprised of error-infested sequences (D'Amore et al., 2016; Gohl et al., 2016). For HTS analysis, the primer annealing temperature could be reduced by ca 5°C compared to regular PCR to promote amplification of templates with one or two mismatches to primers. The number of PCR cycles should be kept at a minimum, so that a relatively weak band or smear of suitable size is seen on an agarose gel. Increasing extension time is also likely to reduce incomplete amplification and hence chimera formation (D'Amore et al., 2016; Lindahl et al., 2013). Low input DNA content results in a lower amount of inhibitors and fewer chimeric sequences (D'Amore et al., 2016; Gohl et al., 2016). Due to stochastic variation, it is recommended to use at least two PCR replicates that can be pooled post-amplification (Alberdi et al., 2018; Lindahl et al., 2013; Tedersoo et al., 2010).
Amplicon purification depends on further analyses and choice of sequencing platform. It is recommended to normalize amplicon concentration across samples to reduce variation in sequencing depth among samples several-fold (Lindahl et al., 2013). The equimolarly mixed amplicons are subjected to platform-specific adapter ligation in the library preparation step. It is recommended to order library preparation from a sequencing service provider to secure their quality standards and leave the risk of failure to the service provider.
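Equimolar mixing of amplicons amounts to converting each measured mass concentration into a molar concentration and pipetting the volume that delivers the same number of molecules per sample. The sketch below illustrates this under stated assumptions: an average mass of 660 g/mol per base pair of double-stranded DNA (a common approximation), a hypothetical target of 10 fmol per sample, and illustrative function and sample names.

```python
def pooling_volumes_ul(samples: dict, fmol_per_sample: float = 10.0) -> dict:
    """Volume (ul) of each amplicon needed to contribute equal moles to a pool.

    samples maps a sample name to (concentration in ng/ul, mean amplicon length in bp).
    """
    volumes = {}
    for name, (ng_per_ul, length_bp) in samples.items():
        if ng_per_ul <= 0 or length_bp <= 0:
            raise ValueError(f"{name}: concentration and length must be positive")
        # ng/ul -> nmol/l, assuming 660 g/mol per base pair of dsDNA.
        # Note that 1 nmol/l is the same as 1 fmol/ul.
        fmol_per_ul = ng_per_ul * 1e6 / (length_bp * 660.0)
        volumes[name] = fmol_per_sample / fmol_per_ul
    return volumes


if __name__ == "__main__":
    # Hypothetical measurements for three purified amplicons.
    measured = {"S1": (6.6, 1000), "S2": (3.3, 1000), "S3": (6.6, 500)}
    for sample, volume in pooling_volumes_ul(measured).items():
        print(sample, round(volume, 2))
```

Samples that would require an impractically large volume are candidates for re-amplification rather than inclusion at a lower molar share.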
Researchers should check the quantity and quality requirements of each service provider, because these may differ greatly. The quantity appears to be negotiable, because service providers usually request 5-10 times more material than they use. Due to high demand, it typically takes 1-2 months to receive the sequences. It does not pay off to order a bioinformatics service, because companies provide standard-quality .fasta- and .fastq-formatted files. These can be handled using custom options in any bioinformatics platform, whereas the service provider's analysis routine may be suboptimal (i.e., optimized for bacterial 16S rRNA gene, mouse or human samples) or untransparent.

| Controls and technical replication
To quantify contamination and technical artefacts such as sequenc[...] Technical replication is unnecessary in most cases, because technical replicates cannot be used as independent data points in the analysis. However, limited technical replication of a few samples may be useful for estimating the performance and reproducibility of the method, especially for newly developed protocols (Alberdi et al., 2018; Brooks et al., 2015).

| Quality filtering of HTS data
Analysis and quality filtering of HTS data are far more challenging than viewing and editing Sanger sequencing reads because of the large amounts of data and the lack of clearly readable chromatograms. There is a myriad of available software for bioinformatics data analysis, most of which, such as mothur (www.mothur.org) and QIIME (www.qiime.org), [...] and Windows, which all attract non-bioinformatician users (Anslan, Bahram, Hiiesalu, & Tedersoo, 2017). For analysis of non-alignable markers such as ITS, PipeCraft outperforms other bioinformatics pipelines in terms of input data formats, available analysis options and output quality. A comprehensive overview of available analysis platforms for amplicon and metabarcoding data is given in Oulas et al. (2015). Bioinformatics analysis of fungal data is reviewed in Nilsson et al. (2018).
Although the output of HTS platforms is converted to the same format, these data differ in the distribution of errors and require different options for analysis (Knief, 2014; Laehnemann, Borkhardt, & McHardy, 2015; Reuter et al., 2015). Quality trimming is the first step of bioinformatics analysis; it is usually based on removing the 3′ end of sequences (or entire sequences) if quality falls below a specific threshold, the optimum of which differs by sequencing platform. In a simultaneous sample demultiplexing process, sequences are re-assigned to biological samples based on the molecular identifiers. For demultiplexing sequence data with Golay barcodes, we recommend allowing 1-2 mismatches to tags and 1-2 mismatches to primers to account for random errors and natural primer-template mismatches. Demultiplexing from a single end typically recovers 40%-70% of all reads, but approximately a quarter of these are lost when accounting for the other tagged primer as well. However, dual-tag demultiplexing enables removal of tag-switching artefacts and incomplete sequences (Kozich, Westcott, Baxter, Highlander, & Schloss, 2013).
Notes. The newly reported primers have been designed to cover >99% of targeted taxa and tested in silico and in complex soil samples. Full sets of primers used for bacteria, fungi, oomycetes and eukaryotes in general can be found in Klindworth et al. (2013), Nilsson et al. (2018), Riit et al. (2016) and Adl, Habura, and Eglit (2014), respectively. 1 Superscript letters indicate matching forward and reverse primer pairs. 2 Correct primer sequences compared to the trimmed ones in the original publication.
[...] enables simultaneous removal of non-target organisms and focus on a shorter but more variable barcode that has improved taxonomic resolution (Bengtsson-Palme et al., 2013; Hartmann, Howes, Abarenkov, Mohn, & Nilsson, 2010).
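The quality-trimming and demultiplexing steps described in this section can be sketched in a few lines. This is a simplified illustration rather than a replacement for an established pipeline: it assumes fixed-length tags at the very start of each read, counts substitutions only, discards reads that match no tag or more than one tag, and uses a Phred threshold of 20 purely as an example value. All function names, tags and reads are hypothetical.

```python
def trim_3prime(seq: str, quals: list, min_q: int = 20):
    """Remove trailing bases whose quality score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def count_mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, tag_to_sample: dict, max_mismatches: int = 1):
    """Assign reads to samples by their 5' tag, allowing a few substitutions.

    Returns (assigned, unassigned); assigned reads have the tag removed.
    """
    assigned = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        hits = [
            (tag, sample) for tag, sample in tag_to_sample.items()
            if len(read) >= len(tag)
            and count_mismatches(read[:len(tag)], tag) <= max_mismatches
        ]
        if len(hits) == 1:  # keep unambiguous assignments only
            tag, sample = hits[0]
            assigned[sample].append(read[len(tag):])
        else:
            unassigned.append(read)
    return assigned, unassigned


if __name__ == "__main__":
    tags = {"ACGTACGT": "sample1", "TGCATGCA": "sample2"}
    reads = ["ACGTACGTGGGG", "ACGTACGAGGGG", "TTTTTTTTGGGG"]
    print(demultiplex(reads, tags))
    print(trim_3prime("ACGTAC", [30, 30, 30, 30, 10, 5]))
```

For dual-tag designs, the same assignment would be repeated on the reverse tag and only reads with a consistent tag pair retained.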

| Operational Taxonomic Units
Quality-filtered and trimmed sequences are subjected to clustering into OTUs, for which multiple algorithms exist (Kopylova et al., 2016). The best results are obtained when using open-source de novo clustering with single-linkage algorithms (Frøslev et al., 2017; Lindahl et al., 2013). Except for Illumina data, it is recommended to collapse homopolymers to trimers for clustering (Lindahl et al., 2013) or to lower the gap extension penalty, because other platforms are sensitive to indels in homopolymers. Although many protocols recommend removing sequences containing homopolymers of >8 or >10 bases, we do not encourage this practice for non-coding regions, because many organisms do have naturally long homopolymers in these markers (Potter et al., 2017; Tedersoo et al., 2017).
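Collapsing homopolymers to trimers before clustering can be done with a single regular expression. The sketch below is one possible implementation; the function name and the example sequence are illustrative.

```python
import re


def collapse_homopolymers(seq: str, max_run: int = 3) -> str:
    """Shorten every run of one repeated base longer than max_run to max_run bases."""
    # (.) captures a base; \1{max_run,} requires at least max_run further copies,
    # so only runs of max_run + 1 or more bases are matched and shortened.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda match: match.group(1) * max_run, seq)


if __name__ == "__main__":
    print(collapse_homopolymers("AAAAACGTTTTG"))
```

Because the original reads are needed for downstream identification, the collapsed sequences would normally be used for clustering only.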
In spite of different taxonomic resolution, the bacterial 16S and eukaryote 18S, 28S, ITS and COI sequences are typically clustered at 97% sequence identity, which is regarded as a compromise between natural intraspecific and interspecific sequence variation and random sequencing errors. The 97% sequence similarity threshold for all of these marker genes (except COI in some groups) is too conservative for species-level identification of most taxa. For example, some biological species of Fusarium display no variation at all in the relatively unconserved ITS region (Park et al., 2011). Therefore, HTS analysis of the ITS + 28S rRNA gene (Walder et al., 2017) and transcription elongation factor 1 subunit α (TEF; Karlsson et al., 2016) have been used to specifically distinguish Fusarium spp. With low-resolution markers, analysis of exact sequence variants can be performed using 100% similarity threshold or the DADA2 clustering program (Callahan et al., 2016).
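The effect of a similarity threshold on OTU delimitation can be illustrated with a toy single-linkage clustering. This is a didactic sketch, not the algorithm of any particular program: it assumes that sequences have already been aligned to equal length, computes identity as the proportion of matching positions, and compares all pairs, which is feasible only for small data sets. Function names and sequences are hypothetical.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Proportion of identical positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def single_linkage_otus(seqs, threshold: float = 0.97):
    """Group sequence indices so that any pair at or above threshold is linked."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    a = "A" * 100
    b = "A" * 98 + "CC"    # 98% identical to a
    c = "A" * 96 + "CCCC"  # 96% identical to a, 98% identical to b
    d = "G" * 100
    print(single_linkage_otus([a, b, c, d], threshold=0.97))
    print(single_linkage_otus([a, b, c, d], threshold=1.0))
```

In the example, sequence c joins the OTU of a through its link to b although a and c are only 96% identical, which is the chaining behaviour characteristic of single linkage. Setting the threshold to 1.0 groups only identical sequences.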
All clustering methods generate more OTUs than expected at any barcoding threshold with increasing sequencing depth, indicating accumulation of PCR and sequencing errors into rare "satellite" taxa (Frøslev et al., 2017). This can be ameliorated by performing two or more consecutive clustering steps, post-clustering removal of taxa based on co-occurrence or phylogenetic algorithms (Frøslev et al., 2017; Potter et al., 2017) or focus on longer DNA fragments, where random errors are evened out. It is further recommended to remove global singletons and perhaps OTUs with <5 or <10 sequences, depending on sequencing depth, as potentially artefactual (Frøslev et al., 2017; Lindahl et al., 2013; Nguyen et al., 2015; Tedersoo et al., 2010).
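Removal of global singletons and other rare OTUs is a simple filter on the total read count of each OTU across all samples. The sketch below assumes an OTU table stored as a nested dictionary; the function name, the data layout and the example counts are illustrative, and the cut-off should be chosen with sequencing depth in mind.

```python
def filter_rare_otus(otu_table: dict, min_total: int = 5) -> dict:
    """Keep OTUs whose read count summed over all samples is at least min_total.

    otu_table maps an OTU identifier to a {sample: read count} dictionary.
    min_total=2 removes global singletons only.
    """
    return {
        otu: counts
        for otu, counts in otu_table.items()
        if sum(counts.values()) >= min_total
    }


if __name__ == "__main__":
    table = {
        "OTU1": {"soil_A": 120, "soil_B": 80},
        "OTU2": {"soil_A": 1, "soil_B": 0},  # global singleton
        "OTU3": {"soil_A": 2, "soil_B": 2},  # below the default cut-off of 5
    }
    print(sorted(filter_rare_otus(table)))
    print(sorted(filter_rare_otus(table, min_total=2)))
```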

| Sequence-based taxonomic identification and taxon communication
Selection of one or more reference databases is essential for sequence-based identification (reviewed in Kashyap, Rai, et al., 2017b). Since up to 20% of the material in INSDc is of poor quality or misidentified, initiatives such as UNITE (https://unite.ut.ee/), SILVA (www.arb-silva.de) and UniEuk (https://unieuk.org/) have generated databases and reference data sets populated with filtered and third-party annotated sequences. SILVA is focused on nuclear SSU and LSU sequences of prokaryotes and eukaryotes, but both oomycetes and fungi are poorly represented and have problems with taxonomic assignment (Tedersoo, Tooming-Klunderud, et al., 2017; Yarza, Yilmaz, Panzer, Glöckner, & Reich, 2017). The UniEuk initiative is focused both on taxonomy and on curation of high-quality 18S rDNA sequences of eukaryotes (Berney et al., 2017). The current version of UNITE includes SSU, ITS and LSU sequence data for all eukaryotes, although only fungal and oomycete ITS sequences have been intensively annotated for taxonomy, sequence quality and ecological metadata. Some pathogenic fungal groups, in particular, have been thoroughly checked, annotated and assigned for type status in UNITE (Nilsson et al., 2014). The BOLD database (Ratnasingham & Hebert, 2007) maintains curated COI sequence data for animals, Oomycota and other specific groups of protists. Thus, these databases provide the best-suited species-level reference data for general molecular identification of pathogenic organisms. However, researchers focused on narrower groups such as Fusarium or Phytophthora could additionally use Fusarium-ID (Park et al., 2011) and the Phytophthora Database (www.phytophthoradb.org). Animal and human pathogens have annotated sequence data in the ISHAM-ITS database (Irinyi et al., 2015).
Metagenomic and metatranscriptomic analyses require inclusion of functional gene and genomics databases for combined taxonomic and functional analysis (Huson, Mitra, Ruscheweyh, Weber, & Schuster, 2011). Detection of viruses amongst genomic, metagenomic and metatranscriptomic reads requires some specific data mining effort.
For taxonomic assignments, it is most common to use BLAST-based similarity search methods for representative sequences of each OTU (Nilsson et al., 2014). The Naive Bayesian Classifier (Porras-Alfaro, Liu, Kuske, & Xie, 2014; Wang, Garrity, Tiedje, & Cole, 2007) is widely used for conservative identification in prokaryotes, but this method has gained little popularity among mycologists due to a low proportion of taxa identified to species or genus level. This has been improved in ProTax-Fungi, which provides statistical assessment of assignment precision to different taxa from species to phylum ranks (Abarenkov et al., 2018). In fungal and oomycete ITS sequences, species, genus, family and order levels can be roughly approximated at >97%-99%, >90%, >85% and >80% ITS sequence similarity, respectively, to the closest identified sequence (Tedersoo et al., 2017; data in Riit et al., 2016). Because of differing rates of rRNA gene evolution, there are multiple exceptions, with sordariomycete (Ascomycota) and oomycete species tending to exhibit greater similarity and early diverging fungal groups lower similarity.
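The approximate ITS similarity thresholds can be expressed as a simple lookup, sketched below. The function name and the default species cutoff of 98% (a point chosen inside the 97%-99% range given in the text) are assumptions for illustration. The thresholds are rough guides only, and lineages with unusually slow or fast rRNA gene evolution would need adjusted cutoffs.

```python
def lowest_supported_rank(similarity_pct, species_cutoff=98.0):
    """Return the finest rank supported by ITS similarity to the best hit.

    Thresholds follow the approximate values in the text: species at
    >97%-99% (species_cutoff is an assumed value within that range),
    genus >90%, family >85% and order >80%. Returns None below 80%.
    """
    thresholds = [
        ("species", species_cutoff),
        ("genus", 90.0),
        ("family", 85.0),
        ("order", 80.0),
    ]
    for rank, cutoff in thresholds:
        if similarity_pct > cutoff:
            return rank
    return None


if __name__ == "__main__":
    for pct in (99.5, 95.0, 87.2, 82.0, 75.0):
        print(pct, lowest_supported_rank(pct))
```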
An optional step is to assign functional traits such as pathogenicity information to OTUs, for which database-related tools exist. For Bacteria, the automated pipeline SINAPS (Edgar, 2017) enables prediction of custom traits using the ProTraits reference database (Brbic et al., 2016). The basic fungal traits can be assigned to taxonomic profiles using a tool in the FunGuild database. Its main limitation is genus-level operation, although it alerts that many genera contain both pathogens and saprotrophs or endophytes. As discussed above, the detected "pathogenic" OTUs may be non-pathogenic on non-hosts, rendering the assignments strongly context dependent. Therefore, more accurate metadata with host- and tissue-related traits assigned to species, species hypotheses (see next paragraph) or isolates/sequences are urgently needed.
HTS studies can recover tens of thousands of OTUs, most of which usually cannot be assigned to described species, which makes them difficult to communicate across studies. The UNITE and BOLD databases use taxon codes (species hypotheses and BINs, respectively) linked to Digital Object Identifiers (DOIs). These machine-readable DOIs enable communication of both named and unnamed taxa across studies and time (Kõljalg et al., 2013; Kõljalg, Tedersoo, Nilsson, & Abarenkov, 2016; Ratnasingham & Hebert, 2013).

| Post-bioinformatics data quality control
For soil and raw tissue samples with non-optimal storage conditions, it may be important to estimate sample quality due to potential overgrowth by moulds (Lindahl, Boer, & Finlay, 2010). This can be performed by measuring the average size of extracted DNA/RNA molecules on a gel or by calculating the relative abundance of moulds (fungal orders Hypocreales, Mucorales, Umbelopsidales and Mortierellales). Dominance of a single mould OTU, which is usually associated with reduced taxonomic richness, can be considered indicative of sample spoilage.
Similarly, it may be feasible to exclude samples with 5- to 10-fold fewer sequences than the median. Such poor recovery may be ascribed to the failure to normalize a sample, poor performance of particular identifier tags and/or dominance of particular organisms in a sample, which are disfavoured in the library preparation, sequencing or quality-filtering steps. In spite of attempts to normalize the quantity of amplicons, the number of retrieved sequences typically varies >3-fold. It is common to rarefy all samples to the same minimum sequencing depth, but this discards the vast majority of taxonomic information. Therefore, it is recommended to calculate residuals of richness relative to a square-root or logarithmic function of sequencing depth (whichever fits better), or to use these functions as covariates in uni- and multivariate statistics (Balint et al., 2016).
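Both recommendations in this paragraph can be sketched with the Python standard library. The helper names, the default 5-fold cutoff and the ordinary least-squares fit are illustrative assumptions rather than the procedure of Balint et al. (2016); passing `math.log` instead of `math.sqrt` gives the logarithmic alternative.

```python
import math
from statistics import median


def low_depth_samples(depths, fold=5.0):
    """Names of samples with more than `fold` times fewer reads than the median."""
    cutoff = median(depths.values()) / fold
    return sorted(name for name, n in depths.items() if n < cutoff)


def richness_residuals(depths, richness, transform=math.sqrt):
    """Residuals of OTU richness after least-squares regression on transform(depth).

    Assumes at least two samples with different sequencing depths.
    """
    names = sorted(depths)
    x = [transform(depths[n]) for n in names]
    y = [richness[n] for n in names]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {n: yi - (intercept + slope * xi) for n, xi, yi in zip(names, x, y)}


if __name__ == "__main__":
    depths = {"a": 10000, "b": 12000, "c": 900, "d": 11000}
    richness = {"a": 210, "b": 260, "c": 40, "d": 200}
    print(low_depth_samples(depths))
    print(richness_residuals(depths, richness))
```

The residuals can then serve as depth-corrected richness values in downstream models, which avoids discarding reads through rarefaction.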
Due to high sensitivity, HTS data commonly suffer from traces of environmental contamination or tag-switching (see above). Information about the OTUs in control and experimental samples enables evaluation of these technical biases and the need for extra quality filtering (Palmer et al., 2017). In case of extensive tag-switching, sequences can be removed according to statistical formulae (Larsson, Stanley, Sinha, Weissman, & Sandberg, 2018).
Although the tag-switching artefacts usually account for 0.1%-3% of all sequences (Palmer et al., 2017;Schnell et al., 2015;Tedersoo et al., 2017), these may blur qualitative diversity analyses and particularly network analyses that are sensitive to adding low-abundance OTUs. More importantly, tag-switching may generate false-positive implications of low-level presence of a pathogen or biocontrol organism, especially when these dominate some samples in the library.
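One simple way to guard against such false positives is a proportional cutoff, sketched below. This is a crude heuristic assumed for illustration, not the statistical formulae of Larsson et al. (2018): for each OTU, counts in a sample are set to zero when they fall below an assumed tag-switching rate multiplied by that OTU's total reads in the library. The function name, table layout and default rate of 1% (within the 0.1%-3% range reported above) are hypothetical, and the rate would ideally be estimated from control samples.

```python
def mask_probable_tag_switches(otu_table, switch_rate=0.01):
    """Zero out per-sample counts that could be explained by tag-switching.

    otu_table maps OTU -> {sample: read count}. A count is kept only if it
    reaches switch_rate * (total reads of that OTU in the library).
    """
    cleaned = {}
    for otu, counts in otu_table.items():
        cutoff = switch_rate * sum(counts.values())
        cleaned[otu] = {
            sample: (n if n >= cutoff else 0) for sample, n in counts.items()
        }
    return cleaned


if __name__ == "__main__":
    table = {"pathogen": {"s1": 9800, "s2": 5, "s3": 195}}
    print(mask_probable_tag_switches(table))
```

The trade-off is that genuinely rare occurrences of an OTU that dominates other samples are also removed, so the masked records are best flagged and reported rather than silently discarded.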

| HTS data analysis
HTS platforms generate enormous OTU-by-sample data matrices, which sometimes cannot be fully loaded into common spreadsheet programs. Therefore, experts use Python or Perl scripts to navigate and transform the data in text format. These large community matrices also test the limits of statistical software and processors. Many commonly used methods for community phylogenetics, bootstrap resampling and network analysis become computationally prohibitive. Thus, use of computation-efficient algorithms is warranted. To reduce the computation requirements, the data can be compressed by removal of rare species, which typically reduces unexplained variance and promotes statistical power (Põlme et al., 2018), but its effect on potential type I and type II error is not known in multivariate or network analyses.
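As an illustration of script-based handling of matrices too large for a spreadsheet, the sketch below streams a tab-delimited OTU table one row at a time instead of loading it whole. The file layout (a header row of sample names, then one OTU per row with integer counts) and the function name are assumptions; real tables may be transposed or carry extra annotation columns.

```python
import csv
import io


def stream_otu_totals(handle):
    """Yield (otu_id, total_reads, n_samples_present) for each OTU row.

    Only one row is held in memory at a time.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (OTU column name, then sample names)
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        counts = [int(value) for value in row[1:]]
        yield row[0], sum(counts), sum(1 for c in counts if c > 0)


if __name__ == "__main__":
    # A small in-memory stand-in for a large file opened with open(path).
    demo = io.StringIO("OTU\ts1\ts2\ts3\nOTU1\t5\t0\t0\nOTU2\t1\t1\t7\n")
    for otu, total, occupancy in stream_otu_totals(demo):
        print(otu, total, occupancy)
```

The per-OTU totals and occupancies produced this way can drive removal of rare OTUs before the reduced matrix is passed to statistical software.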
For multivariate analyses, we recommend downweighting abundant OTUs by Hellinger (square-root) transformation to account for the semiquantitative nature of HTS. Use of qualitative binary data (presence/absence) is not recommended, because of lower fit due to loss of the (semi)quantitative information and artificial equalization of potentially artefactual low-abundance (including tag-switch artefacts) and real high-abundance OTUs (Balint et al., 2016). We recommend use of PERMANOVA for explicit statistical testing of shifts in community composition, because it allows including interactions, random factors and nested designs. ANCOM and the Random Forest machine-learning algorithm provide statistical information about the performance of each OTU in the community matrix. General information about multivariate analysis methods suitable for HTS data is given in Buttigieg and Ramette (2014). Notably, the same multivariate techniques are commonly used to analyse standardized microarray, metagenomic and metatranscriptomic data (Thomas, Gilbert, & Meyer, 2012).
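The Hellinger transformation takes the square root of each OTU's relative abundance within a sample. A minimal sketch for one sample is given below; the function name and dictionary layout are illustrative assumptions.

```python
import math


def hellinger(sample_counts):
    """Hellinger-transform one sample: sqrt(count / total reads in the sample).

    sample_counts maps OTU -> read count. An empty sample (zero reads)
    is returned as all zeros to avoid division by zero.
    """
    total = sum(sample_counts.values())
    if total == 0:
        return {otu: 0.0 for otu in sample_counts}
    return {otu: math.sqrt(n / total) for otu, n in sample_counts.items()}


if __name__ == "__main__":
    print(hellinger({"OTU1": 1, "OTU2": 3, "OTU3": 0}))
```

Because the squared transformed values of a sample sum to one, differences in sequencing depth among samples are removed while dominant OTUs are downweighted relative to raw proportions.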
In univariate analyses, OTU richness, diversity, colonization, damage and relative abundance of certain taxonomic or functional groups are used as dependent variables. Apart from considering sequencing depth and treatment of rare OTUs, the analyses should follow best statistical practices including appropriate transformations, testing assumptions, etc. Balint et al. (2016) provide an overview of general recommendations for statistical analysis of HTS data, computation-efficient programs and potential pitfalls.

| HTS data storage and reporting
HTS data sets are stored both as raw data files and elaborated data sets. The raw .fastq files, metadata files and files with identifier tag and primer information are kept in the Short Read Archive (SRA). These files enable users to perform all steps of bioinformatics analyses including generation of the OTU table and identification. This is important for several reasons, such as confirming earlier findings with updated filtering procedures, addressing additional questions and performing metastudies using standardized filtering procedures.
However, submitting representative sequences of HTS-derived OTUs to public databases is discouraged because of their short length, potentially artefactual nature and unreliable taxonomic annotation. These environmental sequences would increase the proportion of poorly annotated and erroneous data and complicate identification in subsequent studies.
Curated OTU-by-sample matrices including technical and environmental metadata, representative sequences as well as taxonomic and functional annotations should be deposited in machine-readable FAIR data format in specific data repositories such as Dryad Digital Repository (www.datadryad.org) and DataOne (www.dataone.org). Metagenomic data should follow MIxS and MIMS (https://wiki.gensc.org/index.php?title=MIGS/MIMS) standards (ten Hoopen et al., 2017). The machine-readable FAIR data format allows researchers to understand and rapidly incorporate the data into meta-analyses. Such standardized data sets in digital repositories enable separate DOI-based citations.
In publications, it is important to refer to any additional data in the supplement or data repositories. It is also important to record and describe precisely all analytical steps including specific options in data filtering, because this information provides important details about the data quality and stringency of filtering to the readers. Nilsson et al. (2011) provide thorough recommendations about the required details for molecular and bioinformatics analyses.

| PERSPECTIVES
Only a fraction of the available high-throughput identification potential has so far been used in plant pathology. This reflects the practical, surveillance-oriented work of plant pathologists and entomologists, whereas molecular pathologists focus on human and animal subjects. Governmental plant health surveillance organizations need to follow certified protocols for diagnosis, which develop slowly due to time-consuming tests. Limited budgets also hinder the purchase of high-throughput analysis equipment by governmental institutions. Considering analysis costs and time, practicing pathologists would certainly take advantage of qPCR/ddPCR for real-time quantification of specific pathogens and custom microarrays for simultaneous detection and quantification of multiple selected pathogens. In the near future, it may be possible to detect multiple organisms including pathogens using high-throughput sequencing on portable pocket-size sequencers, as demonstrated for viruses using the Oxford Nanopore MinION platform (Loman et al., 2015). With a simplified procedure, essentially a single working day is required for sample collection, analysis and interpretation of results.
Other high-throughput identification methods are more time-consuming but also more sensitive and thus better suited for research purposes. Metagenomic and metatranscriptomic methods offer great potential when targeting viruses (Zhang et al., 2005) or these together with prokaryote and eukaryote pathogens and pests simultaneously (Chandler, Liu, & Bennett, 2015). Alternatively, nematodes, insect pests, oomycetes and fungi can all be assessed by using a mixture of degenerate primers targeting the same marker or multiplex primers targeting different markers via metabarcoding (de Barba et al., 2014; Tedersoo, Liiv, et al., 2016). Targeted template capture by use of specific hybridization probes and immunochemical methods allows marker genes and pathogenesis-related genes of antagonists to be concentrated (Dowle, Pochon, Banks, Shearer, & Wood, 2016) and further identified using PCR-free methods.
Because of great intraspecific resolution, high-throughput fingerprinting and population genomics approaches offer enormous potential for diagnosing aggressive strains or pathotypes and uncovering their patterns of dispersal (O'Hanlon et al., 2018) and potential hybridization (Qiu, Cai, Luo, Bhattacharya, & Zhang, 2016). Given appropriate quality filtering, these methods are sensitive enough to distinguish rare alleles and SNPs from noise (Isola et al., 2005) in hundreds of samples in parallel. Whole-genome sequencing and transcriptome analyses complement HTS-based identification methods by shedding light on pathogenesis mechanisms and facilitating generation of vaccines and biocides and selection of biocontrol agents (Grünwald et al., 2016).
For correct identification, community-curated and taxonomically annotated reference databases are urgently needed. Such databases are maintained only for a few most important pathogen groups and cover the main barcoding marker genes (Park et al., 2008, 2011). Sequence databases should share third-party metadata and taxonomic annotations whenever these are updated in any one of them (Nilsson et al., 2014). In spite of a large proportion of erroneous data, INSDc will certainly continue to play a central role in bridging more specific databases encompassing genes from all domains of life. So, let's contribute well-annotated and high-quality sequence data to INSDc to benefit the pathology research community! This also applies to HTS data sets and data matrices, the great practical and scientific value of which can be recognized perhaps after several decades. Alongside storing sequence data, it is important to maintain tissue and soil samples that can be resource efficiently kept dried at room temperature. In spite of their small size, accumulating DNA samples tend to rapidly fill refrigerators in entire rooms and are vulnerable to technical failure and power cuts. Besides the possibility of morphology-based re-identification and description of new species, both botanical and pathological herbaria provide excellent sources to trace back the evolution and dispersal of pathogens and pests (Drenkhan, Riit, Adamson, & Hanso, 2016; Yoshida et al., 2014).
Taken together, high-throughput identification techniques offer great promise for detection and rapid identification of new pathogens and diseases in humans as well as tree and crop plantations and early warning systems such as the sentinel nurseries and botanical gardens.
HTS has already demonstrated its usefulness in studies of soil- and plant-associated microbial communities for detection of new potential pathogens and potentially invasive species before their introduction to the new environment and contact with new hosts. We predict that rapid monitoring methods such as nanopore sequencing, microarrays and nanotechnological biosensors will become particularly useful for early disease diagnostics and smart application of countermeasures such as biocides and biocontrol agents.