Current methods, future directions and considerations of DNA-based taxonomic identification in wildlife forensics

Wildlife forensic analyses are frequently concerned with taxonomic identification, and very often employ amplification and Sanger sequencing of informative regions of the genome to achieve this. The materials submitted to wildlife forensic laboratories for taxonomic identification span a wide scope, from plant and animal parts in trade to assemblages of incidental biota at crime scenes. As these analyses take place within the context of legal proceedings, the wildlife forensic community is subject to unique requirements and considerations. These requirements and considerations are quite different from those of human forensic DNA


Introduction
One of the major aims of wildlife forensic analysis is to identify the taxonomic source (e.g., species, genus, etc.) of non-human biological material, either as items in trade, or direct or indirect evidence in other crimes. The trade in animals, plants, and their derivatives is expansive, and is one of the drivers of the current sixth mass extinction event [1]. Trade can be either licit (e.g., legally caught and marketed yellowfin tuna, carvings made of cow bone, boots made of ranched ostrich) or illicit (e.g., prohibited Goliath grouper, carvings made of elephant ivory, boots made of sea turtle). In some trades, such as timber and fisheries, the licit and illicit products are commonly co-mingled. In addition to the wide taxonomic scope of analyses required to detect violations in wildlife trade, wildlife forensic analyses may involve identification of direct evidence in wildlife crime (e.g., the alleged poisoning of a protected animal species) or of non-human biological trace evidence in human criminal investigations (e.g., pet hair, pollen, diatoms, insect parts, etc.). Identifications of assemblages of trace evidence can provide probative information for investigative leads, either to link the crime scene to the victim and/or perpetrator, or to determine where a crime occurred based on the suite of species present.
Taxonomic classification is a man-made construct to categorize organisms based on observed similarities (or differences) in underlying measured traits. Though our understanding of taxonomy changes with the acquisition of new knowledge and can result in revisions to established names, it should be recognized that species categorizations are by and large "real" [2]. Morphological classification is the foundational method for taxonomic identification of biological materials. Morphological examination is a reliable, inexpensive method for identifying items encountered in forensic casework when sufficient diagnostic features are present, and applicable expertise and suitable comparative reference material are available (e.g., whole animals, intact bones, flight feathers, as in Trail [3]). When these conditions cannot be met and genetic capability is available, DNA is often used for taxonomic identification. Wildlife forensic DNA analysts almost universally rely on sequencing of mitochondrial DNA (mtDNA) for specimen identification of animals (e.g., [4]), and a range of markers from the nuclear and chloroplast genomes of plants (e.g., [5,6]). The position and order of nucleotide bases in informative regions are the class characters used for diagnosis. Sequences from in-house and/or public databases are used as reference comparisons, and specimen identification is based on the degree of similarity between the unknown sequence and the reference. Such identifications are straightforward for well-separated groups with all closely related taxa characterized for a given genetic region, but are increasingly difficult in groups with incomplete taxon sampling and shallow coalescent depths [7].
This review paper aims to provide a perspective on current methods, future directions and considerations of DNA-based specimen identification in wildlife forensics. We take a deeper dive than previous reviews [8,9] and discuss: 1) commonly used regions for animals and plants; 2) sources of known sequences; 3) distance- and tree-based approaches for taxonomic identification; 4) biological considerations when using mitochondrial and chloroplast regions; 5) applications of next generation sequencing; and 6) minimum standards and best practices for DNA-based taxonomic identification. General requirements for laboratory practice involving DNA evidence in forensic casework, including those governed by laboratory accreditation (ISO 17025) and analyst certification, are not addressed. We wish to emphasize here that the analysis we refer to as "taxonomic identification" or "specimen identification" involves assigning unknown evidentiary material to an already-described taxon, and is not meant to include species discovery and description, which is a task best left to specialized taxonomists [10].

Commonly used regions for taxonomic identification of animals and plants
Regions from the mitochondrial (mt) genome are the primary choice for forensic taxonomic identification of animal products because of the overwhelming availability of that molecule; the high copy number, cellular location and other molecular features of the mitochondrial genome mean it is more likely to be recovered from forensic evidence, which often contains DNA of low quality and quantity [11,12]. Given that mtDNA is haploid and maternally inherited, there is a lower effective population size among individuals of the same species [13]. Mitochondrial DNA also shows accelerated rates of evolution and differences in the rate of fixation of mutations through adaptive evolution [11,14,15]. These mutations can be harnessed to permit discrimination of closely related species. The ability to design universal primers that permit amplification across a broad range of taxa also makes mtDNA highly advantageous in wildlife forensics [8,16,17], as often only general information is known about the evidence sample prior to analysis (e.g., sample from a shark but without information about what family of shark). Further, given that the majority of initial molecular evolutionary studies targeted mtDNA [18,19], an abundance of foundational information and sequence data exist in the published literature and public sequence databases, respectively (e.g., [11,20,21]).
Although the mtDNA molecule evolves rapidly, the genes encoded in mtDNA do not mutate at equal rates [11]. This has an impact on the informativeness of individual mitochondrial regions; some regions have a lower mutation rate (e.g., cytochrome c oxidase subunit 1 (CO1) or 12S ribosomal RNA (12S)) and will be more useful for taxa that are well-separated or for higher level taxonomic classifications, while those that have a greater mutation rate (e.g., control region (CR), cytochrome b (cytB) or NADH dehydrogenase subunit 2 (ND2)) will be more useful for species level classifications of closely-related taxa [17]. The genetic diversity at a given region varies due to differences in the rate of mutation and/or the rate of fixation of adaptive mutations among taxa, and these factors impact the informativeness of different regions for taxonomic identification. Further, the same mtDNA region can evolve at different rates in different taxa (e.g., [19,22–25]), meaning that while a given mitochondrial region may be informative at the species level in one taxon, it may only be informative to the genus level in another. For example, cytB is used to discriminate among species of rhinoceros (Perissodactyla: Rhinocerotidae), but it is only effective at the genus level among deer in the genus Odocoileus (Artiodactyla: Cervidae), even though both groups are composed of closely related species of large mammals [26,27]. Thus, when selecting a mitochondrial region for use in taxonomic identification, it is important to have as complete a picture as possible of the level of sequence variation within and between species in the taxonomic group of interest.
The mitochondrial regions initially used for taxonomic identification in wildlife forensics are closely aligned with those used in early molecular evolutionary studies (e.g., [16,17]) based on restriction fragment length polymorphism (RFLP) and enzymatic techniques. Since DNA sequence analysis has become more common, there has been a desire to find a single region to serve as a 'silver bullet' for taxonomic identification. In 2003, a consortium of scientists promoted adoption of a single region for molecular taxonomic assignment of all animals, specifically a ~650-bp region of CO1 [28]. While CO1 offers species-level discrimination in many vertebrate and invertebrate taxa, closely related species can share the same CO1 sequence (e.g., [29–32]). Other faster evolving mitochondrial regions, such as subunits of the NADH dehydrogenase and the CR, are often used to delineate species [17,33,34] and characterize intraspecies relationships (i.e., subspecies, strains, etc.) (e.g., [35–37]). To highlight the breadth of mtDNA regions used in casework, wildlife forensic practitioners provided information about the commonly used regions for DNA-based taxonomic identification, noting that often multiple regions are used (Fig. 1).
In plants, the mitochondrial genome evolves slowly, meaning that mtDNA regions are typically not good candidates for DNA-based taxonomic identification [5,6]. A range of regions from the chloroplast genome have long been used in evolutionary studies of land plants (e.g., [38,39]), given the higher substitution rate within the chloroplast genome. In 2009, the Consortium of the Barcode of Life completed a broad assessment of the utility of seven regions from the chloroplast genome across 397 species [5]. They identified a 2-region combination that provided the best compromise with respect to species level discrimination (72 %), amplification success, and sequence quality: ribulose bisphosphate carboxylase (rbcL) and maturase K (matK) [5]. Supplementary regions from both the chloroplast genome (e.g., trnL, psbA-trnH, rpoB, rpoC1) and nuclear genome (primarily the internal transcribed spacer subunit 2 (ITS2)) are often required to ensure accurate specimen identification in some groups (Fig. 1).

Sources of known sequences for taxonomic identification
Accurate DNA-based classification of animal and plant material to species of origin is critically dependent on the availability of reference sequences for the taxonomic group of interest. These reference sequences must be from an informative gene region for the taxonomic level in question, be derived from accurately identified reference specimens, and achieve sufficient taxon sampling to rule out all other possibilities.

In-house reference sequence databases
Laboratories that routinely perform DNA-based classification of specific taxa often have developed an in-house reference database for this purpose, based on genetic sequences generated at that facility. The key benefit of an in-house database over a collection of genetic sequences derived from an external source is the ability to directly control and document the suitability of those sequences as reference material for forensic casework, including: 1) the suitability of originating reference specimens and region(s) sequenced; 2) completeness of taxon sampling; and 3) the integrity of the link between the electronic sequence data, the tissue sample and the originating specimen. Sequences to be deposited into an in-house database are obtained from reference material meeting a defined set of minimum quality, provenance and documentary criteria. Ideally, each reference sequence in the database would demonstrably link back to a specimen with sufficient species-diagnostic morphological characters. This specimen would be permanently retained in a collection to enable subsequent examination and taxonomic revision [40]. DNA or tissue samples obtained from collaborators are often used to augment the in-house database; though the originating specimens are not retained in the forensic facility's collection, they should have been identified by an appropriate expert, with morphological characters defined by published taxonomic studies and for which suitable metadata exists (e.g., collection locality, collection date, name of taxonomic expert performing the identification, etc.). To the fullest extent possible, reference specimens should be documented in such a manner that their taxonomic identity can be verified should questions arise [41], for example with photographs of species-diagnostic traits.
In practice, because of the difficulty of sourcing reference specimens, especially for rare taxa, in-house databases are often composed of a core dataset generated in-house, augmented by additional sequence data meeting defined criteria (e.g., datasets from peer-reviewed phylogenetic or phylogeographic studies that include accessioned museum specimens). The core dataset may, for example, establish the discriminatory power of a given region for a desired level of inquiry in a specific taxonomic group, or may consist of sequences derived from accessioned museum specimens for taxa frequently misidentified in the field. For taxonomically complex groups, as many species as possible should be represented by genetic data derived from accessioned type specimens. The metadata associated with genetic sequences, tissue samples, and originating specimens are to be stored in a suitable relational database to ensure the integrity of the association among them (e.g., [42–44]). Forensic laboratories adhering to published standards implement multiple layers of checks and balances to ensure data and specimen integrity, and to isolate possible explanations for unexpected data.
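The link between sequence, tissue sample and originating specimen described above can be illustrated with a small relational schema. The following is only a minimal sketch using Python's built-in sqlite3 module; the table names, column names and the helper functions are hypothetical and are not taken from any laboratory's actual database. It shows one way foreign keys could prevent a sequence from being stored without a traceable tissue sample and specimen.

```python
import sqlite3

# Hypothetical three-table schema: every sequence must point to a tissue
# sample, and every tissue sample must point to an identified specimen.
SCHEMA = """
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    taxon TEXT NOT NULL,
    identified_by TEXT NOT NULL,
    collection_locality TEXT,
    collection_date TEXT
);
CREATE TABLE tissue (
    tissue_id INTEGER PRIMARY KEY,
    specimen_id INTEGER NOT NULL REFERENCES specimen(specimen_id)
);
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,
    tissue_id INTEGER NOT NULL REFERENCES tissue(tissue_id),
    region TEXT NOT NULL,
    bases TEXT NOT NULL
);
"""


def open_reference_db(path=":memory:"):
    """Create the illustrative schema with foreign-key enforcement enabled."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def orphan_sequence_rejected(conn):
    """Return True if a sequence with no matching tissue record is refused."""
    try:
        conn.execute(
            "INSERT INTO sequence (tissue_id, region, bases) "
            "VALUES (999, 'cytB', 'ACGT')"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

In this sketch, a sequence record can only be inserted once its tissue and specimen records exist, so a query joining the three tables always recovers the taxon and the identifying expert for any stored sequence.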

Public sequence databases
Publicly available DNA databases contain a wealth of information contributed by researchers around the world, and while their use requires caution, they are often suitable sources to augment in-house reference sequence databases, or to supply sequences of infrequently encountered species to laboratories unable to source material from a curated collection. There are numerous large public databases that contain DNA sequences, with the largest and most well-known being the International Nucleotide Sequence Databases (INSD), which is managed collaboratively by GenBank, the European Molecular Biology Laboratory (EMBL) and DNA Data Bank of Japan (DDBJ) [45]. Since the public release of the GenBank database in 1982 by the National Center for Biotechnology Information (NCBI), hundreds of millions of DNA sequences have been made publicly accessible, providing reference sequence data for mitochondrial, chloroplast and nuclear regions from hundreds of thousands of organisms [46]. More recently, whole genome sequences have been uploaded to GenBank, providing a rich array of material for sequence comparisons. Aside from GenBank, several other public sequence databases exist, some of which are taxon and region specific, for example, UNITE for eukaryote ITS sequences, and the Barcode of Life Data System (BOLD) for 'barcode' region sequences from plants, animals and fungi.
In addition to using public databases to source reference sequences, wildlife forensic geneticists use the native search interfaces of public sequence databases (e.g., BOLD, NCBI nucleotide search) at least occasionally for putative determination or exclusion of taxa at higher taxonomic levels, screening for contamination, or guiding marker choice with taxa not routinely processed by that laboratory. Identifications seldom rest solely on public database search results, as the output of such searches often does not contain enough information on the validity of the returned reference sequences. When using public databases, interpreting results relies on knowledge of relevant phylogenetic information, such as the composition of the genus or family in question, and any challenges at the species level. Additionally, species nomenclature undergoes constant review and revision, not only to resolve instances of synonymy (a single species that has two taxonomic names) but also to describe newly recognized species. Thus, the nomenclature for older submissions in some public databases may not be consistent with currently accepted nomenclature (e.g., the currently accepted Ursus americanus vs. the older, invalid name Euarctos americanus for the American black bear), or there may not be a consensus on the nomenclature for a species (e.g., Cervus elaphus vs. Cervus canadensis for red deer).
It is important to re-emphasize that analysts should be mindful that while sequences in a public database can be a valuable resource, the presence of incorrectly identified and erroneous sequence data has been well-documented (e.g., [28,47–53]). To remedy this to some extent, GenBank has developed a Reference Sequence (RefSeq) collection, containing public data that have been subjected to varying levels of validation, annotation and manual curation by staff [54]. However, the number of sequences in this collection is limited, such that it would not include sufficient representatives for many wildlife forensic applications. For example, the total number of CO1 sequences in GenBank is ~3,370,000 (as of March 2021), while the CO1 sequences in the RefSeq collection number just over 2,000. While an improvement in quality over other GenBank sequences, RefSeq samples are not required to meet the standard of validation for a forensic reference collection. A higher level of curation is performed on all sequences deposited in BOLD, including confirming that the sequence is not from a contaminant, is derived from a protein coding gene, and is a functional gene copy [55]. Importantly, however, BOLD does source data from GenBank for inclusion in their publicly available dataset, thus misidentified and erroneous data may appear in BOLD search results. As data from more species are added to public databases, it will be easier to identify and remove potentially misidentified sequences [56]. Fundamentally, using DNA sequences obtained from a public repository as reference data in a forensic setting requires substantial domain knowledge and interpretation by the analyst to reach an accurate identification at an appropriate taxonomic level.

Approaches for taxonomic identification
After sequencing the DNA from an unknown and gathering relevant reference sequences from in-house or public databases, two main approaches are typically undertaken to achieve taxonomic identification in wildlife forensics: distance-based and/or tree-based. It should be noted that identifications are often made using a combination of both approaches, unless a single approach has been validated for the taxon in question.

Distance-based approaches
Comparing the genetic distance, a measure of the number of nucleotides that differ between two sequences, is a common approach used for taxonomic identification. The Kimura-2-Parameter (K2P) model is often used to calculate genetic distances as it: 1) allows for different, undefined substitution rates between transitions and transversions; and 2) works effectively when nucleotide variation is limited [57]. Specimen identification based on calculated genetic distances often relies on either a 'gap' or 'threshold'. In a 'gap' based approach, the intraspecific genetic distance (i.e., between individuals of a single species) must not overlap the interspecific genetic distance (i.e., between individuals of differing species). The most common use of a 'gap' for taxonomic identification is for the CO1 barcode region, in which the general rule of thumb is that the intraspecific variation should be <3 % and interspecific variation should be >3 %, or 10 times the mean of the intraspecific variation [58,59]. In the absence of complete taxonomic sampling or when the efficacy of a new region is being established, an 'experience threshold' is used for taxonomic assignment based on empirical observations of forensic practitioners across many taxa. A conservative 'threshold' of 1 % is often implemented in the scientific literature to minimize misidentifications that can occur when conspecifics are not included in the reference database (e.g., [10]), but it must be re-emphasized here that there is no single threshold that will differentiate all taxa [7,60].
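To make the K2P calculation concrete, a minimal sketch is given here. It assumes two sequences that have already been aligned to equal length, and uses the standard K2P formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions, respectively. The function name and the decision to skip gaps and ambiguous bases are illustrative assumptions, not a description of any particular software package.

```python
import math

# Purine-purine and pyrimidine-pyrimidine changes are transitions;
# every other mismatch between A, C, G and T is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: gaps and ambiguous bases are ignored
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitional differences
    q = transversions / compared  # proportion of transversional differences
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For short or highly variable fragments the result can differ noticeably from the uncorrected proportion of differing sites, which is one reason the compared sequence length matters when distances are interpreted.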
As several publications have examined the shortcomings of using either a 'gap' or 'threshold' approach for specimen identification (e.g., [10,31,60,61]), only the main concerns raised in those studies most pertinent to wildlife forensics will be highlighted here. First and foremost, broad implementation of a 'gap' or 'threshold' for specimen identification is often not appropriate, and can lead to both false negatives and false positives [7]. In closely related species or sub-species which have a shallow coalescent depth, nucleotide variation at commonly analyzed regions may be limited. This scenario could result in a 'false negative', where multiple taxa would be lumped together given the interspecific distances are shallower than the proposed 'gap' or 'threshold' [10,29]. This was highlighted for a pair of sister species of forensically important Australian flesh flies (Diptera: Sarcophagidae) [62] and tuna species within the genus Thunnus [63], whereby the interspecific distances using the CO1 barcode region were below 3 %. Alternatively, targeted regions in fast evolving species may exhibit higher than expected intraspecific variation, causing morphologically indistinguishable individuals to be assigned to more than one species (known as splitting) [29]. For example, a high 'false positive' rate was reported for marine gastropods (cowries) when a 2 % 'threshold' was applied; 20 % of taxa were inaccurately split into more than one species [29]. This was rectified when the 'threshold' for specimen identification was increased. A few smaller, but still noteworthy considerations for distance-based specimen identification for wildlife forensics include: 1) the length of the sequence, as calculations based only on highly variable sections of the targeted region can inaccurately inflate the genetic distance (and vice versa) [8]; 2) calculating the genetic distance between focal species does not provide any indication as to the relationship between other species (i.e., species X can be equidistant to both species Y and Z, but the distance to X gives no insight into the distance between Y and Z); and 3) species that cannot be assigned confidently and accurately based on the genetic distances may still be recovered as reciprocally monophyletic in a phylogram.
Before implementing genetic distances for specimen identification in wildlife casework, the accuracy of this approach for the chosen taxonomic group must be assessed. Firstly, it should be verified whether individuals from across the extent of a species' geographical range were included when setting the 'gap' or 'threshold'; if individuals from only one population were sampled, the expected variation would reflect the 'local gap' rather than the 'global gap', and if used broadly could result in misidentifications [10,31]. Secondly, the minimum and maximum genetic variation of the species in question should be established. Studies have shown that using the mean inaccurately inflates the interspecific variation, resulting in misleading specimen identification [64]. Finally, it is prudent to compare genetic distances reported in the scientific literature to those calculated from in-house reference sequences. It is important to emphasize that wildlife forensic scientists are not alpha taxonomists. It is not within the scope of their role to describe new species or set thresholds for how species are defined (either genetically or morphologically). Rather, they draw from previously published research by taxonomic experts in zoology and botany and apply that information in a forensic context.
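The difference between judging a 'gap' from means and judging it from the minimum and maximum values can be sketched as follows. This is an illustrative example only: the function name, the input format (a list of species-pair distances) and the numbers in the usage note are hypothetical and do not come from any published dataset.

```python
def gap_summary(pairwise):
    """Summarize intra- and interspecific distances.

    pairwise: iterable of (species_a, species_b, distance) tuples, one per
    pair of reference individuals (hypothetical input format).
    """
    intra = [d for sp_a, sp_b, d in pairwise if sp_a == sp_b]
    inter = [d for sp_a, sp_b, d in pairwise if sp_a != sp_b]
    if not intra or not inter:
        raise ValueError("need both intra- and interspecific comparisons")
    return {
        "mean_intra": sum(intra) / len(intra),
        "mean_inter": sum(inter) / len(inter),
        "max_intra": max(intra),
        "min_inter": min(inter),
        # A gap exists only if the smallest interspecific distance exceeds
        # the largest intraspecific distance.
        "gap_present": min(inter) > max(intra),
    }
```

With the made-up values `[("A", "A", 0.002), ("A", "A", 0.004), ("A", "A", 0.030), ("A", "B", 0.020), ("A", "B", 0.080), ("A", "B", 0.090)]`, the mean interspecific distance is roughly five times the mean intraspecific distance, yet the ranges overlap (0.030 versus 0.020), so no gap is reported.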

Tree-based approaches
Construction of phylograms, which visually display an estimate of the evolutionary relationships among taxa, is also commonly used for taxonomic identification. Using discrete data, such as scored morphological characters or protein or DNA sequences, software algorithms create phylograms. Regardless of the software used or data type, aligned data are required as input. Commonly used multiple sequence alignment software packages, such as ClustalW [65] and MAFFT [66], are freely available either as standalone packages, within larger sequence analysis software packages (e.g., Sequencher, Geneious, CLC Genomics Workbench, MEGA), or within open source web-based platforms (e.g., Galaxy, EMBL-EBI). As the sensitivity and accuracy of the resulting alignment can be impacted by the scoring scheme (i.e., matches, mismatches) and gap penalty settings (i.e., the cost to insert a gap into the alignment), care needs to be taken when choosing software settings for use. After manual review of the resulting alignment, evolutionary relationships are reconstructed using one of four main algorithms, each differing in complexity and a priori assumptions about the data: 1) Distance (e.g., neighbor joining, UPGMA), based on pairwise distances; 2) Maximum Parsimony, in which the simplest hypothesis is preferred; 3) Maximum Likelihood, based on the most likely tree topology when assuming a specific model of evolution; and 4) Bayesian, a statistical inference method based on Bayes' theorem. Confidence in the resulting topology is gleaned from nodal support (e.g., bootstrap percentages, posterior probability out of 1.0) and also branch length (long branches indicate more nucleotide substitutions). Assigning an unknown to a particular taxon is done by examining the placement of that unknown in the phylogram. For example, if the unknown sequence is recovered within a clade (or group) composed only of individuals of a single species (known as monophyly), then the unknown specimen would be identified as that species (Ross et al. [7] refers to this as a "strict" tree-based approach). An example phylogram, generated using CO1 barcode region sequences from forensically relevant canids, is shown in Fig. 2.
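The logic of a "strict" tree-based assignment can be sketched in a few lines. The example assumes a tree that has already been inferred and is represented as nested tuples, with reference leaves labeled "species|identifier" and the evidence sequence labeled "UNKNOWN"; this representation, the labels and the function names are hypothetical simplifications, and nodal support is not considered.

```python
def leaves(node):
    """Return all leaf labels below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def clades_containing(node, target):
    """List the clades on the path from the root down to the target leaf."""
    if isinstance(node, str):
        return []
    for child in node:
        if target in leaves(child):
            return [node] + clades_containing(child, target)
    return []


def species_of(label):
    return label.split("|")[0]


def strict_assignment(tree, unknown="UNKNOWN"):
    """Assign the unknown only if it sits inside a monophyletic species clade."""
    for clade in clades_containing(tree, unknown):  # largest clade first
        refs = [leaf for leaf in leaves(clade) if leaf != unknown]
        species = {species_of(leaf) for leaf in refs}
        if len(species) == 1:
            candidate = species.pop()
            total = sum(
                1 for leaf in leaves(tree)
                if leaf != unknown and species_of(leaf) == candidate
            )
            # Require that no reference of this species falls outside the clade.
            return candidate if total == len(refs) else None
    return None
```

In this sketch, an unknown grouped with references of one species is left unassigned whenever other references of that species fall elsewhere in the tree, reflecting the requirement that the species itself be recovered as monophyletic.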
While reconstructing a phylogram for specimen identification might seem like a straightforward endeavor, there are many considerations that are especially pertinent to forensic casework. Firstly, the regions targeted for species-level assignments should not be used to infer higher-level evolutionary relationships (e.g., families or orders) without further validation, as their utility lies precisely in species-level resolution. For instance, the CO1 barcode region is not suitable for inferring inter-species relationships, and often higher-level nodes are recovered with poor support (i.e., when that topology is recovered in <50 % of replicate trees; typically reported as a bootstrap support). Thus, if using a DNA region best suited for species-level relationships to instead infer higher-level relationships, analysts should proceed with caution or use data from other regions that are calibrated for higher taxonomic levels. Though forensic analysts are not called upon to determine if deep nodes in a tree accurately reflect evolutionary history, they do commonly "back up" and identify evidence sequences to genus or family when species determination is difficult due to shallow coalescent depth or incomplete taxonomic sampling. While species-level markers are generally safe for genus- and sometimes family-level diagnosis of an unknown nested within a monophyletic species clade, diagnosing higher-level taxonomic groupings of an unknown sequence becomes increasingly difficult with increasing genetic distance from the nearest known.
Secondly, every attempt should be made to ensure that reference sequences in the alignment not only include conspecifics and congenerics from a range of biogeographical populations, but also, where possible, an appropriate outgroup. Collins and Cruickshank [60] emphasized that "comprehensive sampling and complete reference libraries [67,68], bring arguably the single biggest improvement to DNA barcode identification success [69]." Phylograms reconstructed with missing taxa and/or missing data can provide misleading results, and this issue has been the topic of hundreds of published studies. For example, in a study targeting vertebrates, adding incomplete sequences (i.e., those spanning only 10 % of the targeted region) of missing taxa greatly improved the resolution and associated node support in the resulting tree [70]. However, it is important to note that in some forensic scenarios complete taxon sampling is not feasible. Sampling of rare and endangered species is heavily regulated to protect them, and sampling remote populations can be challenging, especially if the wildlife in question is difficult to capture. In such scenarios, the resulting tree should be interpreted with caution.

Biological considerations when using mitochondrial and chloroplast regions for taxonomic identification
When using DNA-based approaches for taxonomic identification, it is important to take into account that biological processes such as heteroplasmy, polyploidy, introgression, hybridization, nuclear copies of mitochondrial genes (nUMTs) and incomplete lineage sorting can cause interpretation and reporting issues when not known and accounted for.

Heteroplasmy
Heteroplasmy is the presence of more than one mtDNA haplotype in an individual. Since mtDNA is a rapidly evolving molecule, new mutations arise frequently among the thousands of copies present in a cell as the result of imperfect replication and repair events [71,72]. Aside from random mutations, paternal mtDNA leakage during early development has been reported as a mechanism for heteroplasmy in some taxa (e.g., Drosophila and cicadas [73]). Regardless of heteroplasmy origin, the means by which these new haplotypes become predominant in a maternal line are not well understood [71]. There are some taxa in which heteroplasmy is more common, for example bivalves [74], crickets [40], bees [73], lizards [75], and treefrogs [76]. For most animals, the actual prevalence of mtDNA recombination is not known, and it is thought to be maintained in populations through random genetic drift and mutation [77]. As heteroplasmy can produce functional haplotypes, identifying heteroplasmic sequences via the presence of premature stop codons or frameshift mutations is typically fruitless [73]. Rather, heteroplasmy is typically characterized by multiple base positions in a single Sanger sequence that have two alternate peaks. Cloning and/or bidirectional sequencing of a region may resolve the predominant sequence, or may just confirm that a length or point heteroplasmy is not a sequencing artifact. In the latter case, analysts should not attempt to assign one nucleotide, but rather use a degenerate base instead [73,78]. The inclusion of unverified heteroplasmic sequences in downstream data analysis could lead to incorrect specimen identification, as the number of unique species would be overestimated [79]. For example, when using CO1 for species identification in a genus of bees (Hylaeus), only 75 % of species known to have heteroplasmy were correctly recovered on a phylogenetic tree [73]. The inclusion of highly variable heteroplasmic sequences contributed to individuals of morphologically identifiable species being split into multiple clades on the phylogenetic tree. Thus, care must be taken when dealing with a forensic unknown, as heteroplasmic sequences can exacerbate difficulties of taxonomic identification within poorly separated taxa [73].
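The use of a degenerate base at a confirmed point heteroplasmy can be illustrated with the standard IUPAC nucleotide ambiguity codes. The sketch below is a simplified illustration: it assumes the analyst has already decided which bases are genuinely present at each position, and the function names and input format are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC = {
    frozenset(bases): code
    for bases, code in {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
        "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
    }.items()
}


def degenerate_base(bases):
    """Return the IUPAC code covering every base observed at one position."""
    return IUPAC[frozenset(b.upper() for b in bases)]


def call_with_heteroplasmy(peaks):
    """Build a sequence from per-position base calls.

    peaks: one entry per position, each a string (or set) of the bases
    supported at that position, e.g. "CT" for two alternate peaks.
    """
    return "".join(degenerate_base(p) for p in peaks)
```

For example, a position showing both C and T peaks would be recorded as Y rather than forced to a single nucleotide, so `call_with_heteroplasmy(["A", "CT", "G"])` gives "AYG".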

Polyploidy
Polyploidy, a form of reticulate evolution, refers to a biological condition in which an organism acquires additional copies (or chromosomes) of the genome [80,81].While most common in plantsapproximately 70 % of all angiosperms are estimated to be polyploids [82] polyploidy has also been reported to a lesser extent in fungi [83], fish, insects, amphibians, reptiles, birds, molluscs, and mammals [80].
Polyploidy is classified into one of two types based on the mode of origin: 1) allopolyploids, where multiple copies originated from different species during hybridization; and 2) autopolyploids, where multiple copies originated from genome duplication within a species during meiosis [80,84]. Regardless of the type, polyploidy generates substantial genetic and genomic variation, complicating taxonomic assignment; sequenced regions from polyploid individuals reflect species complexes which are difficult to accurately assign [85]. Studies have suggested that all copies of the target region should be sequenced from polyploid individuals (possible using either cloning or next generation sequencing) and included in subsequent analyses to ensure accurate assignment [86]. Given that this is not feasible for most wildlife forensic laboratories, a more appropriate solution would be to target regions derived through divergent evolution as they are less likely to be influenced by polyploidy (e.g., chloroplast regions in plants). Laboratories should be aware of the level of polyploidy documented in the scientific literature for the taxonomic group in question and choose regions appropriately, which in some cases might mean targeting multiple regions.

Introgression and hybridization
Hybridization can be defined as interbreeding between individuals from genetically distinct populations, producing offspring that possess genetic characteristics from each distinct group, while introgression is the gene flow between the hybridizing populations [87,88]. Introgression is most simply defined as the incorporation of genetic material from one species into the gene pool of a second divergent, but closely related, species [89,90]. The issues of hybridization and introgression are often discussed simultaneously with various species concepts, their definitions, and what constitutes a hybrid [88,91,92]. Wildlife forensic examiners, however, do not define what constitutes a species, and must use available information to identify the taxonomic source of an unknown item to the lowest taxonomic rank possible. It is estimated that approximately 10 % of animal species [93–95] and 25 % of plant species [94,95] hybridize to some extent. This proportion can be even higher in some groups. Ducks (Anatidae), for instance, have a rate of approximately 75 % [95], and hybridization between wild and domestic species is well documented [96,97]. The complexities of hybridization and introgression are unique to each species; understanding the phylogenies of the species group in question, as well as knowledge of anthropogenic activities involving the species (e.g., hybridization with invasive species or intentional hybridizations [87,98,99]), is critical to ensuring correct taxonomic identification for forensic purposes.
The topic of identifying hybrids is beyond the scope of this paper; however, mtDNA region choice, types of hybrids (F1, F2, etc.), geographic gene flow, and possible human-mediated events should all be considered when using mtDNA for taxonomic identification. One example is wild canids in North America. Gray wolves Canis lupus, Eastern wolves C. lycaon and coyotes C. latrans, as well as domestic dogs C. familiaris, are all known to naturally hybridize with each other, and multiple types of hybrids and backcrossing events are possible [91,100,101]. Approximately 60 % of Eastern wolves in the Western Great Lakes region (Wisconsin, Michigan, Minnesota) of the United States have been documented to contain 'coyote-like' mitochondrial haplotypes as a result of a shared common ancestor, yet ongoing Eastern wolf x coyote hybridizations are known to occur in southern Ontario. Genetic studies on the relationships between these populations are numerous and somewhat contentious. The mitochondrial research has focused on the less conserved control region (CR), and specific haplotypes have been defined as exclusive to specific populations. In this scenario, attempting to identify evidence items suspected to be from a wolf using mtDNA regions, without an understanding of the phylogeny of the organism, possible hybrids, or known introgression, can lead a forensic analyst to incorrectly assign the item in question to a coyote. Use of an additional nuclear region, combined with a clear understanding of the phylogeny and history of difficult hybridizing animals such as wolves, can help avoid incorrect taxonomic identification.
Anthropogenically mediated hybridization events can also lead to widespread introgression in natural populations. Throughout the 1990s and early 2000s, overharvesting of sturgeon from the Caspian Sea region (e.g., Huso huso, Acipenser gueldenstaedtii/persicus and A. stellatus) led to widespread replacement of their caviar with caviar from other Acipenseriform species such as the North American white sturgeon A. transmontanus and American paddlefish Polyodon spathula [102]. Forensic scientists at the USFWS National Fish and Wildlife Forensics Laboratory tested over 6,000 caviars between 1998 and 2008, of which 85 % were declared as originating from Caspian Sea species. However, almost 30 % of caviars declared as Russian sturgeon A. gueldenstaedtii exhibited cytB sequences identical to Siberian sturgeon A. baerii [102]. This surprising result was believed to reflect not intentional replacement but the successful hybridization of Siberian sturgeon, released from aquaculture along the Danube River in Germany, with wild Russian sturgeon, resulting in Siberian haplotype introgression into the wild Russian populations [102,103].
Knowledge of the geographic distribution of a species in question, as well as areas of overlapping distribution (sympatry) with hybridizing species and the amount of introgression, is also an important consideration when identifying unknown samples. Members of the deer genus Odocoileus (O. hemionus, mule deer and black-tailed deer; O. virginianus, white-tailed deer) are widespread across North America with a large range of sympatry between the species, and hybrids are known to exist [104–106]. Shared cytB sequences have been reported between white-tailed and mule deer, and it is currently unclear whether this is due to introgression [107] or the paraphyletic sorting of shared mtDNA haplotypes [108]. Despite this, it should be noted that the degree of mtDNA divergence between white-tailed, black-tailed and mule deer is inconsistent with their current taxonomy; there is high sequence divergence between mule and black-tailed deer (both currently O. hemionus) in some geographic areas, and low sequence divergence between white-tailed and mule deer in others [105,109,110].

Nuclear copies of mitochondrial genes (nUMTs)
The utilization of mtDNA for specimen identification can be problematic since portions of the mitochondrial genome can become integrated into the nuclear genome, where they are known as nuclear mitochondrial pseudogenes (nUMTs) [11,111–114]. Given there are fewer functional constraints when mtDNA segments are incorporated as non-coding regions of the nuclear genome, nUMTs typically accumulate mutations more quickly and can lead to the existence of distinct copies of some mitochondrial genes within an individual [111,115]. Using universal primers, true functional copies of mitochondrial genes and nUMTs can be amplified equally for downstream sequencing [79]. Subsequent taxonomic assignment based on nUMTs may provide misleading results, as the nUMT and the true mtDNA target sequence will have diverged more than anticipated for the taxon in question [79,111]. Most nuclear copies can, however, be recognized by several identifiable characteristics, predominantly premature stop codons and indels, and eliminated prior to data analysis. Alternatively, nUMTs can be characterized in reference materials along with the mtDNA region from which they originated, and the data maintained in comparison databases.
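The two characteristics named above (premature stop codons and indels) can be screened for programmatically. The sketch below is a minimal illustration only: the function name, its parameters, and the example sequences are hypothetical, and the default stop codon set assumes the vertebrate mitochondrial genetic code (NCBI translation table 2), which would need to be changed for other taxa. A sequence that passes this screen is not thereby shown to be of mitochondrial origin.

```python
# Stop codons under the vertebrate mitochondrial code (NCBI translation table 2).
# Assumption: the sample is a vertebrate; other codes use different stop sets.
VERT_MT_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def screen_for_numt(seq, ref_length, frame=0, stops=VERT_MT_STOPS):
    """Flag features suggestive of a nuclear copy in a protein-coding fragment.

    seq        -- ungapped, primer-trimmed nucleotide sequence
    ref_length -- expected fragment length from a verified reference
    frame      -- offset (0, 1 or 2) of the first complete codon
    """
    seq = seq.upper()
    coding = seq[frame:]
    # Split into complete codons only; a trailing partial codon is ignored.
    codons = [coding[i:i + 3] for i in range(0, len(coding) - 2, 3)]
    # For an internal gene fragment, any in-frame stop codon is premature.
    stop_positions = [frame + 3 * i for i, c in enumerate(codons) if c in stops]
    length_diff = len(seq) - ref_length
    return {
        "premature_stops": stop_positions,         # 0-based start positions
        "frameshift_indel": length_diff % 3 != 0,  # length change breaks frame
        "length_difference": length_diff,
    }


# Invented toy sequences, far shorter than any real barcode fragment.
print(screen_for_numt("ATGGCACTA", ref_length=9))
print(screen_for_numt("ATGTAGGCACT", ref_length=9))
```

In this sketch the length comparison is a crude proxy for indels; an alignment against the reference would be needed to locate them.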

Incomplete lineage sorting
It is commonplace in evolutionary biology to use a single gene region (or a combination of regions) to infer the relationships among a set of species using a phylogram that is typically referred to as a 'gene' tree. Incomplete lineage sorting (ILS) is a process which can cause incongruence between 'gene' and 'species' trees, complicating specimen identification [116]. ILS occurs when closely related species interbreed or have not completely diverged, resulting in the differing patterns of inheritance reflected by mitochondrial, chloroplast and nuclear DNA (or between different regions of a single DNA type). For instance, ancestral mitochondrial states may have been retained, while derived states are evident in the nuclear genome, or vice versa [117]. Phylogenetic relationships among species exhibiting ILS may be polyphyletic, such that taxa that do not share a recent common ancestor are clustered together in a tree. This often occurs with species that have long generation times such as sturgeon [118]. One example from Crotaphytus lizards shows how evolutionary events rather than phylogenetic descent can be reflected in the mitochondrial genome, which could lead to erroneous taxonomic identification [119]. Caution should be taken when attempting to identify closely related species using a single highly conserved gene region.

Application of next generation sequencing to taxonomic identification
Next-generation sequencing (NGS), also known as massively parallel sequencing and high throughput sequencing, has revolutionized molecular biology, making it easier, quicker and cheaper to generate large volumes of sequence data. Regardless of the specific NGS platform used (Illumina, PacBio, Ion GeneStudio, MinION, etc.), the underlying premise is the same: multiple DNA fragments are sequenced in parallel, and sophisticated bioinformatic workflows are used to process and interpret the resulting data [120]. Whilst some human-focused forensic laboratories have validated NGS for certain casework applications (e.g., Federal Bureau of Investigation for mitochondrial control region sequencing [121]; genotyping by sequencing [122]), at the time of this review we are not aware of any ISO 17025 accredited forensic wildlife laboratories using NGS for taxonomic identification of wildlife in casework.
Several published studies have highlighted the utility and benefits that NGS could bring to wildlife casework. In addition to the higher throughput capabilities afforded by NGS, the increased sensitivity facilitates the recovery of full target sequences even from low quantity and quality DNA samples. Standard PCR and Sanger sequencing of hard matrices such as ivory, teeth, bone and timber is often difficult, as few intact DNA fragments remain. Using NGS, successful taxonomic assignment has been achieved for highly degraded ivory [123], mammoth tusks [124] and timber regulated by the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) [125,126]. Food products encountered in forensic cases often pose a challenge for traditional analysis given that high temperatures along with mechanical and chemical treatment during processing can shear and degrade DNA. NGS has been applied to processed seafood products, in which taxonomic assignment is necessary to either confirm authenticity [127] or detect possible undisclosed allergens (e.g., crustaceans [128]). Commercial NGS assays have also been developed (ThermoFisher Scientific; A38452, A38454) to streamline taxonomic identification of meat and fish in food mixtures for authentication purposes [129]. The analysis of mixed samples can also be streamlined using NGS; individual reads are generated which can be assigned to separate taxa, whereas a mixed Sanger chromatogram cannot be reliably interpreted. Further, NGS can be used to detect and characterize taxa present in mixtures at very low levels (1 %), which might confirm a CITES or mislabeling violation (e.g., [129–132]). For taxa in which multiple regions are needed to ensure reliable and accurate taxonomic identification, new techniques coupled with NGS can streamline the generation of data whilst conserving valuable DNA extract. For example, Shaili et al. [133] used "genome skimming" (a method in which all DNA templates in a sample are sequenced without PCR) on the portable MinION sequencer to quickly generate full mitochondrial genome sequences for specimen identification of a CITES listed shark. To discriminate between wolves and coyotes, hybridization capture (a method in which biotinylated RNA "baits" complementary to the DNA of interest are used to isolate target DNA prior to NGS) was used to generate full mitochondrial genome sequences [140]. It is worth noting, though, that cases from the research literature demonstrating what is technically possible in terms of detection of violations in the field may not currently be adequately validated for or compatible with the workflow of forensic casework, or may not be useful in an enforcement context [4].
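The read-level handling of mixtures described above can be sketched as a simple tally: each read carries a taxon label, and taxa are reported only when their share of reads meets a minimum fraction. The function name, the taxon labels and the read counts below are invented for illustration, and the 1 % default merely echoes the detection level mentioned in the text; it is not a validated reporting threshold.

```python
from collections import Counter


def taxa_above_threshold(read_assignments, min_fraction=0.01):
    """Return {taxon: fraction of reads} for taxa at or above min_fraction.

    read_assignments -- iterable with one taxon label per read, as produced
                        by some upstream classifier (not shown here)
    min_fraction     -- reporting cut-off; 0.01 is an assumption, and each
                        laboratory would set its own value through validation
    """
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        taxon: n / total
        for taxon, n in counts.items()
        if n / total >= min_fraction
    }


# Invented mixture: 1,000 reads from a product declared as a single species.
reads = (
    ["declared species"] * 970
    + ["undeclared species"] * 25
    + ["background"] * 5
)
for taxon, fraction in taxa_above_threshold(reads).items():
    print(f"{taxon}: {fraction:.1%}")
```

In this toy input the undeclared species is retained at 2.5 % of reads, while the 0.5 % background signal falls below the cut-off.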
Despite the highlighted advantages of using NGS for taxonomic assignment, there are numerous considerations that are especially critical within a forensic context. First, while increased sensitivity is advantageous for processing challenging samples, it can also increase the likelihood that low level environmental and reagent contaminants are sequenced. This was highlighted by Dormontt et al. [134], who observed contamination in negative control samples, likely a result of sample carry over during DNA isolation. Likewise, contamination between evidence samples, which are often packaged together (Fig. 3), will be visible with NGS, meaning that validations must address acceptable levels of contamination. Each laboratory will need to conduct extensive validation studies to determine interpretation coverage thresholds and address error rates, which are inherent at some level with any NGS platform. Second, when universal primers are used with a mixed sample, bias in primer annealing may cause taxa with primer binding site mismatches to amplify poorly or not at all, leading to a false negative result [135,136]. Additionally, inhibitors that are commonly co-extracted with forensic samples can negatively impact the ligation of indices and adaptors during library preparation [135]. Without this ligation, downstream sequencing is not possible. Finally, there is a substantial cost investment to bring NGS into casework, which might be out of reach for smaller wildlife forensic laboratories. These costs extend beyond the instrumentation, to the reagents and consumables needed for validation and analyst training. Further, while NGS is more affordable than Sanger sequencing on a per nucleotide basis, this cost benefit is only realized when processing at high throughput. Given most wildlife forensic laboratories have low throughput, the cost of NGS still might be out of reach for at least the near term.
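The primer binding site mismatches mentioned above can be counted in silico before a universal primer is relied upon for a given taxon. The following minimal sketch treats degenerate primer positions according to the IUPAC ambiguity codes. The function names, the toy primer and the binding sites are invented for illustration, and the five-base window at the 3' end is an arbitrary assumption rather than a validated criterion.

```python
# Bases permitted by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer, site):
    """Count positions where the template base is not allowed by the primer.

    Both strings are written 5'->3' on the same strand and must be the same
    length; the binding site is assumed to be already located and extracted.
    """
    if len(primer) != len(site):
        raise ValueError("primer and binding site must be the same length")
    return sum(
        1 for p, s in zip(primer.upper(), site.upper()) if s not in IUPAC[p]
    )


def three_prime_mismatches(primer, site, window=5):
    """Count mismatches within the last `window` bases at the 3' end."""
    return primer_mismatches(primer[-window:], site[-window:])


# Invented toy primer with degenerate positions, and two invented sites.
primer = "ACGTRYN"
print(primer_mismatches(primer, "ACGTACA"))          # 0
print(primer_mismatches(primer, "ACCTCGA"))          # 3
print(three_prime_mismatches(primer, "ACCTCGA", 3))  # 2
```

Mismatches near the 3' end are reported separately because they are generally expected to affect primer extension more than mismatches elsewhere; how many are tolerable would need to be established empirically.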
Fig. 3. It is common in wildlife forensic laboratories to receive evidence with many individuals packaged together, such as this assemblage of seahorses from multiple species submitted to the NOAA Forensic Laboratory. Cross-contamination between evidence items is seldom detected with PCR and Sanger sequencing, but could be apparent with NGS, which is more sensitive. Image kindly provided by M. Katherine Moore.
For laboratories considering implementing NGS in forensic casework, there are additional logistical challenges. Aside from completing the necessary validations, analysts would need to be trained on the laboratory workflows, which differ substantially from those currently used for Sanger sequencing. For simplicity, some of the time-consuming and analyst-sensitive library preparation steps (e.g., bead purification) can be completed by liquid handling robots. Further, the reagent and consumable set up for most NGS instruments is very straightforward; the analyst is typically only required to pipette their sample into a reagent cartridge and subsequently load the cartridge into the instrument. A challenge unique to NGS concerns the storage and analysis of large amounts of sequence data. Given a single Illumina MiSeq v3 run generates up to 15 GB of data, laboratories would likely need a dedicated storage platform. Commercially developed and open source software programs are available to complete basic NGS analysis steps, such as sample demultiplexing, quality filtering and primer trimming. To complement these, several bioinformatic pipelines have been developed to streamline taxonomic assignment, such as for metazoans [137], fish [138], CITES listed species (e.g., [125,132]), and taxa associated with environmental samples [139]. These pipelines provide a good starting point for the analysis of NGS data, and could be adapted to meet the specific needs (i.e., DNA region, taxa) of the laboratory.
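Two of the basic processing steps named above, primer trimming and quality filtering, can be sketched in a few lines of Python. This is a simplified illustration and not a substitute for the validated software referred to in the text: the function names, thresholds and reads are hypothetical, qualities are assumed to use the Phred+33 encoding, primers are matched exactly at the 5' end only, and demultiplexing is not shown.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def trim_primer(seq, qual, primer):
    """Remove an exactly matching 5' primer; return None if it is absent."""
    if seq.upper().startswith(primer.upper()):
        return seq[len(primer):], qual[len(primer):]
    return None


def process_reads(reads, primer, min_mean_q=20, min_length=1):
    """Keep reads that carry the primer and pass length and quality filters.

    reads -- iterable of (name, sequence, quality) tuples
    The thresholds are placeholders; real values would come from validation.
    """
    kept = []
    for name, seq, qual in reads:
        trimmed = trim_primer(seq, qual, primer)
        if trimmed is None:
            continue  # primer not found: discard the read
        seq_t, qual_t = trimmed
        if len(seq_t) < min_length:
            continue  # nothing informative left after trimming
        if mean_phred(qual_t) < min_mean_q:
            continue  # remaining bases are of low average quality
        kept.append((name, seq_t, qual_t))
    return kept


# Invented reads: r1 passes, r2 fails on quality, r3 lacks the primer.
reads = [
    ("r1", "ACGTTTGACC", "IIIIIIIIII"),
    ("r2", "ACGTTTGACC", "IIII######"),
    ("r3", "GGGTTTGACC", "IIIIIIIIII"),
]
print(process_reads(reads, primer="ACGT"))
```

Production tools typically tolerate primer mismatches and trim by position-specific quality, which this sketch deliberately omits for brevity.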

Developed standards and guidelines for molecular taxonomic identification
To ensure forensic science is admissible in court, the methods used to analyze forensic evidence must be performed to recognized standards and guidelines. The Organization of Scientific Area Committees (OSAC) for Forensic Science was created in 2014 to address the lack of discipline-specific forensic science standards. This U.S.-based organization is composed of 550-plus members who are leading forensic practitioners from private and public forensic laboratories, industry, and academia. The two Biology subcommittees, Human and Wildlife Forensic Biology, have developed several general standards and guidelines for DNA analysis, interpretation, and reporting (e.g., ANSI/ASB 019, 048). Additional standards are directly applicable to DNA-based taxonomic identification in wildlife forensics, including training in DNA isolation and purification methods (ANSI/ASB 023), validating new primers for Sanger sequencing (ANSI/ASB 047), training in mtDNA analysis for taxonomic identification (ANSI/ASB 111) and report writing (ANSI/ASB 029). Further, forthcoming standards focusing on a) DNA sequencing using capillary electrophoresis, b) prevention, monitoring and mitigation of DNA contamination, c) in-house sequence databases for taxonomic assignment of wildlife, and d) use of reference sequences from public databases for taxonomic identification, have been drafted. Published standards and documents pertinent to the interpretation and reporting of DNA-based taxonomic identification can be accessed online via https://www.nist.gov/topics/organization-scientific-area-committees-forensic-science/wildlife-forensics-subcommittee and http://www.asbstandardsboard.org/published-documents/wildlife-forensics-published-documents/.

Conclusions
One of the main requests of law enforcement to a wildlife forensic laboratory is identifying the species origin of an evidence item. As outlined in this review, amplification and Sanger sequencing of informative regions of the genome is the most commonly used approach in both animals and plants. Wildlife forensic laboratories often rely on the regions identified and analysis techniques developed by the research community, given that personnel and monetary resources needed to develop a tailored approach for a specific species in question are extremely limited. When completing DNA-based taxonomic identifications, wildlife forensic biologists have to employ both rigor and pragmatism: forensic science has little tolerance for error, the scope of species submitted for analysis is often much broader than that in most academic laboratories (e.g., a single laboratory could process samples from birds, fish, mammals, and timber), and the court system demands either a 'yes' or 'no' answer within a reasonable timeframe. Accurate taxonomic identification requires not only technical expertise in the laboratory, but also knowledge of evolutionary and coalescent theory, characteristics of mtDNA, and the phylogeny and biogeography of the species in question.

Disclaimer
The findings and conclusions in this article are those of the author(s) and do not necessarily represent the views of the U.S. Fish and Wildlife Service or the National Oceanic and Atmospheric Administration. Mention of tradenames does not imply U.S. Government endorsement of commercial products.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 2. Example phylogram for forensically relevant canid species. Sequences pertaining to the barcode region of CO1 were downloaded from GenBank and aligned in CLC Genomics Workbench (Qiagen). The UPGMA construction method and Kimura 80 nucleotide distance measure were used, and 1,000 replicates were generated to determine bootstrap support. Red fox (Vulpes vulpes) was used as the outgroup. Multiple individuals of each species are recovered as a single clade with strong bootstrap support (100). Images sourced from the open source collection from the USFWS Digital Library and Wikipedia (Golden jackal).