Chromosome-level genome assembly of bean flower thrips Megalurothrips usitatus (Thysanoptera: Thripidae)

Bean flower thrips Megalurothrips usitatus is a staple pest of cowpea and other legumes and causes dramatic economic losses. Its small size allows for easy concealment, and large reproductive capacity easily leads to infestations. Despite the importance of a genome in developing novel management strategies, genetic studies on M. usitatus remain limited. Thus, we generated a chromosome-level M. usitatus genome using a combination of PacBio long read and Hi-C technologies. The assembled genome was 238.14 Mb with a scaffold N50 of 13.85 Mb. The final genome was anchored into 16 pseudo-chromosomes containing 14,000 genes, of which 91.74% were functionally annotated. Comparative genomic analyses revealed that expanded gene families were enriched in fatty acid metabolism and detoxification metabolism (ABC transporters), and contracted gene families were strongly associated with chitin-based cuticle development and sensory perception of taste. In conclusion, this high-quality genome provides an invaluable resource for us to understand the thrips’ ecology and genetics, contributing to pest management.

Estimation of genomic characteristics. Genomic characteristics were determined based on 55.86 Gb of short-read data using a K-mer-based statistical analysis in JELLYFISH version 2.1.3 15 with the following parameters: 'count -m 17 -C -c 7 -s 1 G -F 2' . Genome heterozygosity and genome size were estimated in GenomeScope version 2.0 16 with default parameters. Based on 17-mer depth analysis, genome size and heterozygosity were estimated to be 255.81 Mb and 0.85%, respectively (Fig. 1). Genome assembly. We assembled a draft genome using wtdbg2 version 2.5 with default parameters 17 . We then had it polished using RACON version 1.4.13 18 with parameters '-m 8 -x −6 -g −8 -w 500 -u' and Pilon version 1.14 19 with default parameters based on 98.30 Gb long reads and 55.86 Gb short reads.
A scaffolding pipeline based on Durand (2016) 20 was used to generate a high-quality chromosome-scale genome. Initially, Hi-C data were mapped to the contig assembly using BWA-MEM version 0.7.17 21 with the following parameters: 'mem -SP5M' . Next, the DpnII sites were generated using the 'generate_site_positions.py' script in Juicer version 1.5 20 . The 3D-DNA pipeline (-r 2) was subsequently employed to order, orient, and cluster the contig 22 . After viewing Hi-C contact maps, the chromosome-scale genome was assembled in Juicebox version 1.11.08 (https://github.com/aidenlab/Juicebox). The genome assembly was screened for contaminant sequences by using the "Contamination in Sequence Databases" in NCBI. A total of 33 sequences were labeled as contaminant and removed (available in Figshare). To identify the mitochondrial genome, we amplified the cytochrome oxidase subunit 1 (COI) gene fragment with primer pairs LCO1490 and HCO2198, and obtained a DNA barcode sequence of approximately 610 bp 23 . We then used BLAST version 2.2.28 24 (-evalue 1e-5) to find assembly sequences of a high similarity to the COI fragment (>98%), and identified one unplaced sequence (scaffold46) as mitochondrial sequence. The resulting chromosome-level genome was 238.14 Mb with a scaffold N50 of 13.85 Mb, maximum length of 20.88 Mb, and GC rate of 55.90% (Table 2). 91.89% of the genome was anchored to 16 pseudo-chromosomes (Table 2), which were well-distinguished from each other based on the chromatin interaction heatmap (Fig. 2). www.nature.com/scientificdata www.nature.com/scientificdata/ Predicting repeats. Repeat sequences were annotated in Extensive de novo TE Annotator (EDTA) version 1.9.4 25 . In brief, LTR retrotransposons were identified in LTR FINDER version 1.07 26 , LTRharvest 27 , and LTR retriever version 2.9.0 28 with default parameters. Next, TIR Learner 29 and HelitronScanner 30 were used to classify DNA transposons with default parameters. RepeatMasker version 4.0.7 (-gff -xsmall -no_is) 31 and RepeatProteinMasker version 4.0.7 (-engine wublast) were utilized to identify repeat sequences based on RepBase edition 20170127 32 . Repeats were masked with de novo predictions using RepeatModeler version 2.0.1 with parameters '-engine ncbi -pa 4' . Additionally, Tandem Repeats Finder 33 was used to annotate tandem repeats with parameters '2 7 7 80 10 50 500 -f -d -m' . Overall, 20.20% of the assembled genome was classified as repetitive sequences in the M. usitatus genome (Table 3). Tandem repeat elements were found to be the most abundant (8.42%), followed by the terminal inverted repeat category (5.39%) ( Table 3).
Gene and functional predictions. Genes in the assembled genome were predicted using a combination of homology-based, transcriptome-based, and ab initio methods. Homology-based predictions involved downloaded sequences of peptides and transcripts from Aptinothrips rufus (http://v2.insect-genome.com/ Organism/87), Frankliniella occidentalis (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/697/945/ GCF_000697945.3_Focc_3.1), and Thrips palmi (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/012/932/325/ GCF_012932325.1_TpBJ-2018v1). The IsoSeq version 3.4.0 workflow was utilized to generate 28,608 high-quality transcripts from the full-length transcriptome data, with quality parameters of 0.99 (https://github.com/ PacificBiosciences/IsoSeq). Next, RNA-seq short data were mapped to the reference genome using HISAT2 version 2.2.1 34 with the parameter '-k 2' . The mapped reads were then assembled into transcripts using StringTie version 2.4.0 35 with default parameters. Homologous proteins and transcripts were aligned using Exonerate version 2.4.0 with default parameters to train the gene sets. Meanwhile, a sorted and mapped bam file of RNA-seq data was transferred to a hints file using the bam2hints program in AUGUSTUS version 3.2.3 36 with the parameter '-intronsonly' . The trained gene sets and hint files were combined as inputs for AUGUSTUS version 3.2.3 36 , which predicted coding genes from the assembled genome with default parameters. Finally, homology-based, de novo-derived, and transcript genes were merged in MAKER version 2.31.10 to generate a high-confidence gene set 37 . It resulted in the annotation of 14,000 M. usitatus genes. The average transcript length was 2,243.30 bp with an average length of coding sequence (CDS) of 1,588.94 bp. The average exon number per gene was 7.38, and the average exon length was 303.85 bp (Table 4).   www.nature.com/scientificdata www.nature.com/scientificdata/ Gene structure and annotations were determined through several methods, including eggnog-mapper 38 (-m diamond-tax_scope auto-go_evidence experimental-target_orthologs all-seed_ortholog_evalue 0.001-seed_ortholog_score 60-query-cover 20-subject-cover 0 -override), InterProscan version 5.0 39 (-iprlookupgoterms -appl Pfam -f TSV), BLAST version 2.2.28 24 (-evalue 1e-5), and HMMER version 3.3.2 40 (-noali-cut_ga Pfam-A.hmm). These methods were used to search against multiple public databases, including NCBI non-redundant protein (Nr), Gene Ontology (GO), Clusters of Orthologous Groups of Proteins (COG), Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-Prot, and Pfam. Most genes (91.74%) were successfully annotated with at least one public database (Table 5).

Comparative genomic analysis.
To identify single-copy orthologous genes, we utilized the longest protein sequence of each gene from M. usitatus and multiple other species (    www.nature.com/scientificdata www.nature.com/scientificdata/ Tribolium castaneum 48 , Apis mellifera 49 and Daphnia galeata 50 . We performed all-to-all single-copy ortholog BLAST comparisons in OrthoFinder version 2.5.4 51 with the parameters '-a blast -M msa' . We aligned the resulting single-copy orthologous genes using MAFFT version 7.487 (-auto) 52 and further trimmed the poorly aligned regions using Gblocks version 0.91b 53 (-t = p -b4 = 5). We maintained the genes that met the stationary, reversible and homogeneous (SRH) assumptions 54 using IQ-TREE version 2.2.0 55 with a p-value cut-off of 0.05. We finally obtained 1,573 single-copy genes under these criteria. Next, We used FASconCAT-G version 1.05.1 56 to concatenate the genes to form a supermatrix, which was used for subsequent phylogenetic analysis.
We performed a maximum likelihood analysis of concatenated sequences in IQ-TREE version 2.2.0 55 with 1,000 UFBoot replicates (-bb 1,000 -model JTT + I + G4). The minimum correlation coefficient for the convergence criterion was set at 0.99 (-bcor 0.99). The age of each node was estimated using a correlated rates clock in MCMCTREE of PAML version 4.4 57 . To estimate the divergence times, we selected fossil records listed in Table 7.
Gene-family expansion and contraction were estimated using CAFÉ version 4.2 with parameters 'lambda -s -t' , based on maximum likelihood and reduction methods 58 . Phylogenetic tree topology and branch lengths were considered when inferring the significance of changes to gene-family size in each branch. The results revealed 684 expanded gene families and 1,639 contracted gene families in M. usitatus (Fig. 3). Next, functional enrichment   www.nature.com/scientificdata www.nature.com/scientificdata/ analysis (GO enrichment and KEGG pathway) was performed in KOBAS version 3.0 59 . Significantly enriched GO terms were those with an adjusted p < 0.05 under Fisher's exact test. Expanded gene families were enriched in cAMP signaling pathway, fatty acid metabolism, detoxification metabolism (ABC transporters) and the ionotropic glutamate receptor pathway (Fig. 4a, available in Figshare). Contracted gene families were enriched in chitin-based cuticle development, sensory perception of taste and NADP + activity (Fig. 4b, available in Figshare).

Genome completeness
Complete  Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRR22137481 64 .
The final chromosome assembly was deposited in GenBank at NCBI under accession number JAPTSV000000000 65 .
The contaminant file, single-copy orthologous genes, gene-family expansion and contraction, gene function annotation, and repeat annotation are available in Figshare 66 . technical Validation DNA integrity. The integrity of extracted genomic DNA was determined using 0.75% agarose gel electrophoresis and analyzed with an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). DNA concentration was measured using a Nanodrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) and Qubit 2.0 (Thermo Fisher Scientific, USA). Absorbance at 260/280 nm was approximately 1.8. assessment of genome assemblies. We assessed the accuracy of the final genome assembly by mapping Illumina short reads to the M. usitatus genome with BWA-MEM version 0.7.17 21 . The analysis showed that 96.52% of short reads were successfully mapped to the M. usitatus genome (Table 8). We further assessed the base quality of genome assembly by estimating the quality value score (QVS) using Merqury version 1.1 67 , which showed a high QVS of 32.65 (Table 8). These findings indicate that the quality of our assembled genome is high.
Furthermore, we evaluated the completeness of the final genome assembly using Benchmarking Universal Single-Copy Orthologs (BUSCO version 3.0.2) insecta_odb10 68 , which includes 1,367 orthologous genes. The analysis revealed a high completeness of 97.40% for the M. usitatus genome with only 0.60% of BUSCO genes being fragmented, 2.00% being missing, and 0.40% being duplicated (Table 8). These BUSCO results were comparable to the completeness for other thrips genomes, such as T. palmi (97.20%), F. occidentalis (98.50%), and A. rufus (95.00%) ( Table 9).

Code availability
No specific codes or scripts were used in this study. All software used is in the public domain, with parameters clearly described in the Methods section.