A DNA barcode survey of insect biodiversity in Pakistan

Although Pakistan has rich biodiversity, many groups are poorly known, particularly insects. To address this gap, we employed DNA barcoding to survey its insect diversity. Specimens obtained through diverse collecting methods at 1,858 sites across Pakistan from 2010–2019 were examined for sequence variation in the 658 bp barcode region of the cytochrome c oxidase 1 (COI) gene. Sequences from nearly 49,000 specimens were assigned to 6,590 Barcode Index Numbers (BINs), a proxy for species, and most (88%) also possessed a representative image on the Barcode of Life Data System (BOLD). By coupling morphological inspections with barcode matches on BOLD, every BIN was assigned to an order (19) and most (99.8%) were placed to a family (362). However, just 40% of the BINs were assigned to a genus (1,375) and 21% to a species (1,364). Five orders (Coleoptera, Diptera, Hemiptera, Hymenoptera, Lepidoptera) accounted for 92% of the specimens and BINs. More than half of the BINs (59%) are so far only known from Pakistan, but others have also been reported from Bangladesh (13%), India (12%), and China (8%). Representing the first DNA barcode survey of the insect fauna in any South Asian country, this study provides the foundation for a complete inventory of the insect fauna in Pakistan while also contributing to the global DNA barcode reference library.

The effectiveness of DNA barcoding coupled with advances in sequencing technology allow it to support large-scale biodiversity analysis (Wilson et al., 2017). However, the intensity of study has varied among regions (Weigand et al., 2019). For example, the BIN count (84,000) for Canada is 8Â that for Russia (11,000), although the latter nation is 1.7Â larger (www.boldsystems.org, accessed 7 September 2021). In a similar fashion, the BIN count for Germany (23,000) is 4Â that for India (5,800), although the latter nation is 9Â larger. The current study extends DNA barcode coverage for Pakistan to both advance knowledge of the taxonomic composition of its insect fauna and to develop a barcode reference library that supports routine eDNA and metabarcoding studies in the future.

Sample collection and preparation
Insects were sampled at 1,858 sites across Pakistan ( Fig. 1) from 2010-2019 using both active and passive collecting methods including sweep nets, hand collections, hostplant beating, light traps, Malaise traps, pitfall traps, and UV illuminated sheets. Plans for the specimen collections/sites were approved by the Director, National Institute for Biotechnology and Genetic Engineering, Faisalabad under the project HEC No. 20-1403/ R& D/09. The specimens were identified to an order and, where possible, to lower taxonomic ranks. Large specimens were either pinned and preserved dry or placed in Matrix tubes with 95% ethanol. Small specimens were individually placed in a well containing 30 ml of 95% ethanol in 96-well microplates. Specimen metadata and an image (except for Malaise samples where only representative specimens of each BIN were imaged) were submitted to BOLD where the information can be accessed on both the specimen page and corresponding BIN page. Voucher specimens are archived at the National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Pakistan (with sample ID prefix NIBGE) or at the Centre for Biodiversity Genomics (CBG), Guelph, Canada (with ID prefix BIOUG).

DNA barcoding
A total of 60,273 insects were barcoded following standard protocols (deWaard et al., 2019a(deWaard et al., , 2019b. In brief, a leg was removed with sterile forceps from each large specimen and transferred to a well preloaded with 30 ml of 95% EtOH. As smaller specimens were already in plates, they were ready for analysis, but vouchers were recovered after DNA extraction (Porco et al., 2010). DNA extraction, PCR amplification, and sequencing were performed at the Canadian Centre for DNA Barcoding (CCDB) following established protocols (Ivanova, deWaard & Hebert, 2006;Hebert et al., 2018;deWaard et al., 2019b). PCR reactions were either 6 ml or 12 ml (Hebert et al., 2013). Three quarters (73%) of the specimens were Sanger sequenced while the rest were analyzed using SMRT sequencing on a Sequel platform (Pacific Biosciences, Menlo Park, CA, USA). Sanger sequencing employed BigDye Terminator Cycle Sequencing Kit (v3.1) on an Applied Biosystems 3730XL DNA Analyzer. Sequences were assembled, aligned and edited using CodonCode Aligner before submission to BOLD. SMRT sequencing employed protocols described by Hebert et al. (2018). The resultant sequences were uploaded to mBRAVE (Multiplex Barcoding Research and Visualization Environment; www.mbrave.net) for editing (sequence trimming, quality filtering, de-replication), identification, and generation of operational taxonomic units (OTUs). The edited sequences were subsequently exported to BOLD for BIN assignment and reference library development. The specimen records, sequence data, electropherograms, and primer details are available in the dataset "DS-INSCTPAK" (dx.doi.org/10.5883/DS-INSCTPAK). All DNA extracts are stored within the DNA archive facility at the CBG.

Data analysis
The final dataset (N = 50,592) included 50,094 new barcode records and 498 public records on BOLD from specimens collected in Pakistan (Table S1). All records were assigned taxonomy and BINs following the workflow outlined by deWaard et al. (2019b). In brief, once the barcode data was on BOLD, each record went through a taxonomic assignment and verification workflow. Earlier studies (Ashfaq et al., 2013;Nazir et al., 2014;Iftikhar et al., 2016b;Akhtar et al., 2018;Naseem et al., 2019) on five taxa (antlions, aphids, butterflies, grasshoppers, thrips) coupled analysis of barcode results with detailed morphological study by taxonomic specialists. All sequences meeting the quality criteria were either assigned to an existing BIN or founded a new one (Ratnasingham & Hebert, 2013). Sequences founding a new BIN had to possess >500 bp of the barcode region with <1% ambiguous bases and no stop codons. Shorter sequences (300-495) that met the latter two quality criteria and that were a close sequence match to an established BIN were assigned to it (deWaard et al., 2019a). The remaining short sequences (1,230) that failed to gain a BIN assignment were run through the stand-alone version of the RESL algorithm (Ratnasingham & Hebert, 2013) (using the function Cluster Sequences on BOLD) to estimate the number of additional OTUs among them. One representative from each OTU was then queried against the BOLD ID Engine to link them with known BINs (deWaard et al., 2019b). The BIN details with specimen records and representative images (where available) are accessible on BOLD (dx.doi.org/10.5883/DS-INSCTPAK).
Various statistical approaches were used to estimate the number of insect species in Pakistan (Chao & Chiu, 2016) including the parametric estimator Preston's log-normal as well as non-parametric estimators Chao1, and the first-order and second-order jackknife. A bias-corrected version of each non-parametric estimator, designed to improve performance under conditions of low sampling effort, was also included (Lopez et al., 2012). All estimates were calculated using the R packages vegan and BAT. In addition, a species accumulation curve was drawn based on a sample-size-based rarefaction and extrapolation to at most double the minimum observed sample size, guided by an estimated asymptote using the R package iNEXT (Hsieh, Ma & Chao, 2016).

RESULTS
DNA barcodes were recovered from 50,094 (83%) of the 60,273 specimens analyzed. The other 17% either failed to amplify or generated problematic sequences (e.g., contamination, NUMTs, stop codons, endosymbionts) that were excluded from subsequent analysis. Considering orders with 100 or more specimens, sequence recovery ranged from a low of 63% for Blattodea to 95% for Lepidoptera. Sequence recovery for the other four major orders of insects showed considerable variation (Diptera: 91%, Coleoptera: 80%, Hymenoptera: 78%, Hemiptera: 69%).
All 50,592 insects with a barcode were assigned to one of 19 orders while 99.8% received an assignment to one of 362 families (Table 1, Table S1). Five orders represented 92% of the specimens: Diptera (40%), Hymenoptera (21%), Lepidoptera (12%), Hemiptera (11%), and Coleoptera (8%) (Fig. 2). Six orders (Mantodea, Neuroptera, Odonata, Orthoptera, Psocodea, Thysanoptera) were each represented by >100 specimens while the remaining eight possessed fewer representatives (Fig. 2, Table S1). Most of these sequences (98%) received a BIN assignment, leading to a total of 6,590 BINs. The other 1,230 barcode sequences did not meet the criteria for BIN assignment but included 629 OTUs when analyzed using "Cluster Sequences" function on BOLD. The BOLD ID Engine assigned 82 of these OTUs to known BINs, but the other 547 OTUs likely represent taxa new to BOLD. Many (57%) of the 6,590 BINs were represented by two or more sequences, but 43% were represented by just a single specimen. The ratio of these singletons was above 40% in all five major orders but was highest in Hymenoptera (48%). Most BINs (88%; N = 5,754) possessed an image of at least one voucher.
The percentage of records in each of the five major orders with a BIN assignment ranged from 93% (Coleoptera) to 99% (Diptera, Lepidoptera) with Hemiptera and Hymenoptera intermediate (96%) ( Table 1). These five orders also contributed most of the BINs (92%) and families (81%) ( Table 1, Figs. 3A, 3B). Only 40% of BINs were placed to a genus and 21% to a species, but this still led to records for 1,375 genera and 1,364 species (Table 1, Table S1). Among the five major orders, more BINs were identified to a genus (72%) and species (41%) in Lepidoptera than in the other four orders (Table 1). For example, just 13.8% of Diptera BINs and 10.2% of Hymenoptera BINs were assigned to a species (Table 1).
Specimen counts for the 362 families were highly variable as 15 families were each represented by >1,000 specimens while 38 had just one (Table S1). This pattern was also reflected in the number of BINs as 15 families had >100 BINs while 86 had just one. The Chironomidae (N = 3,258) and Braconidae (N = 2,174) were represented by the most specimens while Cecidomyiidae (238 BINs) and Platygastridae (230 BINs) were most diverse. Figure 4 shows the BIN diversity and BIN:specimen ratio for the 15 families with >100 BINs. The ratio was highest (0.33) for Geometridae (Lepidoptera) and lowest (0.05) for Chironomidae. The species accumulation curve did not reach an asymptote indicating more species await detection (Fig. 5). Species estimates for the country ranged from 9,253 to 12,246 species suggesting that, on average, 40% of species remain to be sampled (Table 2). BOLD was searched to ascertain if the 6,590 insect BINs from Pakistan were known from other countries. This analysis showed that 2,684 BINs (41%) were shared with at least one of 199 other countries while the others (3,906) are so far only known from Pakistan. The percentage of shared species ranged from 0.02% to 13%. Figure 6 shows the overlap values between Pakistan and countries with >1,000 BINs. BIN overlap was higher with nearby countries (Bangladesh: 13%, India: 12%, China: 8%) than for other regions. For example, Pakistan shared just 5% of its BINs with Australia, South Africa, and Germany (Fig. 6). The overlap between Canada and Costa Rica, both countries with >50,000 insect BINs, was only 4% and 1% respectively (Fig. 6).

DISCUSSION
Current estimates of the number of insect species which occur in Pakistan range from 5,000 (Ministry of Climate Change, Pakistan, 2019) to 20,000 species (Hasnain, 1998), but they are certainly too low (Baig & Al-Subaiee, 2009). The current study aimed to refine estimates of species richness by coupling DNA barcoding with the BIN system. With over 50,000 specimens sequenced, this study represents, by far, the largest effort to assemble a  DNA barcode registry for the insect fauna of any South Asian country. While success (83%) in DNA barcode recovery was good, it varied considerably among orders from 63% for Blattodea to 95% for Lepidoptera. Similar variation in barcode recovery among different insect taxa has been reported in other studies (Geiger et al., 2016;Pentinsaari et al., 2020). For example, a study on the insect fauna of French Polynesia reported 91% recovery for Diptera vs 63% for Coleoptera (Ramage et al., 2017). Similarly, a large-scale Canadian study revealed 95% recovery for Diptera vs 77% for Hemiptera and 74% for Hymenoptera (deWaard et al., 2019a). Although DNA quantity and quality play an important role (Ballare et al., 2019;Velasco-Cuervo et al., 2019), failures in primer binding often underlie low sequence recovery (Hajibabaei et al., 2005. Such failures can lead to the underestimation of species richness in insect groups where recovery is low . Other factors, such as co-amplification of pseudogenes (Song et al., 2008), Wolbachia (Smith et al., 2012), recent speciation (van Velzen et al., 2012), or incomplete lineage sorting (Mallo & Posada, 2016) may also limit the efficacy of barcodes to delimitate species, consequently affecting the diversity estimates. Moreover, there are instances where the BIN system overestimated species diversity in certain insect groups, such as Chironomidae (Lin, Stur & Ekrem, 2015;Ekrem et al., 2018). The coupling of morphological inspection with barcode matches on BOLD (deWaard et al., 2019a(deWaard et al., , 2019b was very effective at placing BINs to an order (100%) and family   (>99%). However, just 40% of the BINs could be assigned to a genus and 21% to a species indicating the need for better parameterization of the barcode reference library. This was particularly true for the three most diverse orders where species assignments were less than 15% (Diptera: 13.8%, Coleoptera: 13.3%, Hymenoptera: 10.2%). Considerably higher assignment success has been reported for Malaise samples from Germany (34%) and Canada (38%) (Geiger et al., 2016;deWaard et al., 2019a) reflecting the more comprehensive DNA barcode reference libraries available for these nations. Despite the limited reference database (Virgilio et al., 2010), the present analysis identified representatives from 1,375 genera and 1,364 species showing the value of the global reference library (BOLD) which far exceeds the results obtained by morphology alone (Marshall, Paiero & Buck, 2009). The present analysis revealed 6,590 BINs with species richness estimates indicating that the fauna of Pakistan certainly includes more than 10,000 species. As these estimates are based on specimens collected with uneven sampling and limited geographic coverage, they are likely to increase with more comprehensive efforts.
Fifteen of the 362 families dominated with 1,000 or more specimens and this pattern was also reflected in the BIN diversity. The uneven detection of families in the survey is supported by the fact that 38 families were represented by just one specimen and 88 by one BIN. Interestingly, nine of the 15 families with the most BINs were dipterans and hymenopterans with Cecidomyiidae and Ichneumonidae comprising the highest BIN: specimen ratio.
Because BOLD now hosts around nine million DNA barcode records for more than 760,000 animal species, it provides a good basis for assessing faunal overlap using BINs. Only 41% of the 6,590 insect BINs from Pakistan are currently known from other countries. As expected, BIN overlap was highest with neighboring countries. This result reflects the endemism of biodiversity (Werneck et al., 2012) and underscores the need to develop local biodiversity inventories. The current survey represents a first step towards building an inventory for the insect fauna of Pakistan.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This study was enabled by grant 106106-001 "Engaging Developing Nations in iBOL" from the International Development Research Centre (IDRC) in Canada and by grant HEC No. 20-1403/R& D/09 "Sequencing DNA Barcodes of Economically Important Insect Species from Pakistan" from the Higher Education Commission of Pakistan. Sequence analysis was made possible by a grant from the Government of Canada through Genome Canada and Ontario Genomics in support of the International Barcode of Life (iBOL) project. This is a contribution to the 'Food from Thought' project supported by the Canada First Research Excellence Fund. Paul Hebert and Nazeer Ahmed were supported by the New Frontiers in Research Fund through its award to BIOSCAN. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
Muhammad Ashfaq conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Arif M. Khan conceived and designed the experiments, performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Akhtar Rasool conceived and designed the experiments, performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Saleem Akhtar conceived and designed the experiments, performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Naila Nazir conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Nazeer Ahmed conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, sample collections, and approved the final draft. Farkhanda Manzoor conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, sample collections, and approved the final draft. Jaclyn McKeown performed the experiments, analyzed the data, authored or reviewed drafts of the paper, specimen imaging, and approved the final draft. Ahmed Ali Samejo conceived and designed the experiments, performed the experiments, analyzed the data, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Imran Khaliq conceived and designed the experiments, performed the experiments, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Michelle L. D'Souza analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Shahid Mansoor conceived and designed the experiments, performed the experiments, authored or reviewed drafts of the paper, sample collections, and approved the final draft. Paul D. N. Hebert conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, specimen identifications, and approved the final draft.

Field Study Permissions
The following information was supplied relating to field study approvals (i.e., approving body and any reference numbers): Field experiments were approved by the Director National Institute for Biotechnology and Genetic Engineering Faisalabad (project number; HEC No. 20-1403/R& D/09).

Data Availability
The following information was supplied regarding data availability: The sequences are available at Barcode of Life Data Systems: dx.doi.org/10.5883/DS-INSCTPAK.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.13267#supplemental-information.