The Pacific Biosciences de novo assembled genome dataset from a parthenogenetic New Zealand wild population of the longhorned tick, Haemaphysalis longicornis Neumann, 1901

The longhorned tick, Haemaphysalis longicornis, feeds upon a wide range of bird and mammalian hosts. Mammalian hosts include cattle, deer, sheep, goats, humans, and horses. This tick is known to transmit a number of pathogens causing tick-borne diseases, and was the vector of a recent serious outbreak of oriental theileriosis in New Zealand. A New Zealand-USA consortium was established to sequence, assemble, and annotate the genome of this tick, using ticks obtained from New Zealand's North Island. In New Zealand, the tick is considered exclusively parthenogenetic, and this trait was deemed useful for genome assembly. Very high molecular weight genomic DNA was sequenced on the Illumina HiSeq4000 and the long-read PacBio Sequel platforms. Twenty-eight SMRT cells produced a total of 21.3 million reads, which were assembled with Canu on a reserved supercomputer node with access to 12 TB of RAM, running continuously for over 24 days. The final assembly dataset consisted of 34,211 contigs with an average contig length of 215,205 bp. The quality of the assembled genome was assessed by BUSCO analysis, an approach that provides quantitative measures of assembly completeness. Over 95% of the BUSCO gene set was found in the assembled genome. Only 48 of the 1066 BUSCO genes were missing and only 9 were present in a fragmented condition. The raw sequencing reads and the assembled contigs/scaffolds are archived at the National Center for Biotechnology Information.


Keywords: Pac Bio; de novo assembly; Genome annotation; Cattle tick

Data
Haemaphysalis longicornis is a three-host tick with a wide distribution in temperate regions of Asia, Australia, and New Zealand [1]. This tick is capable of parthenogenetic reproduction, which allows rapid invasion of new areas and explosive population growth in established ranges. Haemaphysalis longicornis has recently established stable populations in several regions of the United States, and the tick's capacity for harboring and spreading several pathogens has heightened researchers' interest in this tick [2]. Very high molecular weight genomic DNA was purified from eggs collected from parthenogenetic female H. longicornis ticks sourced from New Zealand. The genomic DNA was sequenced using 28 SMRT cells on the Pacific Biosciences Sequel platform and 3 full lanes on the Illumina HiSeq4000 platform. A genome assembly was produced with Canu from the Pacific Biosciences reads alone. The dataset contains the raw sequencing data and the assembled genome of the tick. Raw read data are available at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the accession numbers SRR9226158 (Pacific Biosciences Sequel) and SRR9226159 (HiSeq4000). The whole genome shotgun assembly project has been deposited under the accession VFIB00000000. The version described in this paper is the first version, VFIB01000000. The overall BioProject ID is PRJNA540490 and the BioSample accession is SAMN11539514. Information about the sequence reads, assembled genome, and genome repeats analysis is presented in Tables 1, 2, and 3, respectively.

Tick tissue
Female ticks were removed from cattle on a ranch at Whangara, near Gisborne, New Zealand, during January 2018 by research staff from the Massey University School of Veterinary Sciences. Live females were maintained under ambient laboratory conditions and allowed to oviposit. Approximately 1 g of eggs was obtained, incubated under ambient laboratory conditions for 4 weeks, and then frozen at −80 °C.

Value of the Data
At the time of writing, this assembled genome is the highest quality tick genome publicly available. Researchers studying arachnid and tick genomics, arachnid evolution, and comparative genomics will find the assembled genome valuable. The dataset can be used to study parthenogenesis-related genes, as this tick reproduces exclusively by parthenogenesis in New Zealand. Developers of novel tick control technologies for this and other tick species will also find this genome useful.

Genomic DNA isolation and sequencing
A protocol from Sambrook et al. [3] was used to purify very high molecular weight genomic DNA from the eggs [4]. The protocol consisted of pulverizing frozen material in a liquid nitrogen-cooled mortar and pestle and adding it to an aqueous buffer, followed by RNase treatment, digestion with proteinase K, phenol extraction, and dialysis against 50 mM Tris, 10 mM EDTA, pH 8.0. The resultant DNA was determined by agarose gel electrophoresis to be >200 kb in size. The DNA was concentrated for sequencing using Centricon Plus 70 Centrifugal Filter Units (Molecular Weight Cut Off = 3000; Millipore Sigma, Burlington, MA, USA) and 3 washes of approximately 50 ml wash buffer (50 mM Tris, 10 mM EDTA, pH 8.0), centrifuging at 2500 × g and 8 °C. Ten ml of this buffer was used to recover a final total of 0.4 mg of purified genomic DNA at a concentration of 37 µg/ml.
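The recovery figures above can be cross-checked with a short calculation. The sketch below is illustrative only: it assumes the reported concentration is in µg/ml, which is the unit consistent with about 0.4 mg of DNA in 10 ml of buffer, and the variable and function names are not from the original work.

```python
# Values taken from the text; the concentration unit (µg/ml) is an assumption.
RECOVERY_VOLUME_ML = 10.0
CONCENTRATION_UG_PER_ML = 37.0
REPORTED_YIELD_MG = 0.4


def total_dna_mg(volume_ml: float, conc_ug_per_ml: float) -> float:
    """Total DNA mass in mg from a volume (ml) and a concentration (µg/ml)."""
    return volume_ml * conc_ug_per_ml / 1000.0  # 1000 µg per mg


yield_mg = total_dna_mg(RECOVERY_VOLUME_ML, CONCENTRATION_UG_PER_ML)

if __name__ == "__main__":
    print(f"Computed yield: {yield_mg:.2f} mg (reported: {REPORTED_YIELD_MG} mg)")
```

Under that assumption the computed mass rounds to the reported 0.4 mg.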

Assembly and analysis
Sequencing was performed at the Texas A&M AgriLife Genomics and Bioinformatics Service, College Station, TX, using 28 SMRT cells on the Pacific Biosciences Sequel platform and 3 lanes on the Illumina HiSeq4000 platform. Read quality checks and filtering of raw reads were conducted via the manufacturer's standard protocol and protocols developed at the Texas A&M AgriLife Genomics and Bioinformatics Service prior to submission to NCBI and assembly.

The original intent was to use the Illumina reads to error-correct the Sequel long reads. However, because error-correcting and assembling large genomes requires substantial computational resources, we chose to create a Sequel-only assembly using the Canu [5] pipeline. We hypothesized that the parthenogenetic nature of New Zealand's H. longicornis populations [1] would minimize genomic DNA heterogeneity and allow for a high-quality Pacific Biosciences-only genome assembly. We utilized allocations on the Pittsburgh Supercomputing Center Bridges system [6], granted through the Extreme Science and Engineering Discovery Environment (XSEDE) program sponsored by the National Science Foundation [7]. The Canu assembly took over 24 consecutive days, running on a reserved node with access to 352 cores, 12 TB of RAM, and node-local disk storage to avoid unnecessary data transfers. Program parameters were corMhapSensitivity = high, corOutCoverage = 100, batOptions = -dg 3 -db 3 -dr 1 -ca 500 -cp 50, and an input genome size assumption of 3 Gb, estimated from our experience with Rhipicephalus tick genomes. The Canu assembly output estimated the genome size to be 7.36 Gb, and we are working to verify this result using independent genome size determination protocols. When the 34,211 assembled genome contigs were submitted to NCBI for archiving, only four contigs were found to contain contaminating sequence, and those contaminations were corrected by NCBI staff.

BUSCO (v. 3.0.2) analysis was run on the assembled genome against the Arthropoda BUSCO set, using the pre-configured AUGUSTUS fly prediction model with the arguments -m genome -sp fly -c 8 [8]. Statistics from the sequencing are in Table 1, while features of the assembled genome are in Table 2. De novo repeats were identified with RepeatModeler v. 1.0.11 [9] using the NCBI engine (BLAST+ software v. 2.8.1) and then masked with RepeatMasker v. 4.0.9 [10] using a combined repeat database of classified repeats from RepeatModeler, the ticks library included as part of RepeatMasker, Dfam 3.0, and RepBase-20170127, with the following parameters: -e ncbi -gccalc -frag 2000000 -qq -xsmall. The results from the repeats analysis are shown in Table 3.
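The reported Canu parameters and summary figures can be collected and cross-checked in a few lines. The following is a minimal sketch, not the authors' actual scripts: the output prefix, directory, and reads file name are hypothetical, the `-pacbio-raw` input flag assumes a Canu 1.x release (the Canu version is not stated), and reading "found" BUSCO genes as complete plus fragmented is an interpretation. Because the average contig length is rounded, the computed assembly span is approximate.

```python
# Figures copied from the text.
N_CONTIGS = 34_211
MEAN_CONTIG_BP = 215_205
BUSCO_TOTAL = 1066
BUSCO_MISSING = 48
BUSCO_FRAGMENTED = 9

# Canu parameters as reported; genomeSize=3g encodes the 3 Gb input assumption.
CANU_PARAMS = {
    "genomeSize": "3g",
    "corMhapSensitivity": "high",
    "corOutCoverage": "100",
    "batOptions": "-dg 3 -db 3 -dr 1 -ca 500 -cp 50",
}


def assembly_span_bp(n_contigs: int, mean_contig_bp: int) -> int:
    """Approximate total assembly length from contig count and mean length."""
    return n_contigs * mean_contig_bp


def busco_summary(total: int, missing: int, fragmented: int) -> dict:
    """Derive complete counts and percentages from missing/fragmented counts."""
    complete = total - missing - fragmented
    return {
        "complete": complete,
        "complete_pct": 100 * complete / total,
        # Assumption: "found" means complete plus fragmented.
        "found_pct": 100 * (complete + fragmented) / total,
        "missing_pct": 100 * missing / total,
    }


def canu_argv(prefix: str, out_dir: str, reads_path: str) -> list:
    """Build an argument list for Canu; paths and prefix are hypothetical."""
    argv = ["canu", "-p", prefix, "-d", out_dir]
    argv += [f"{key}={value}" for key, value in CANU_PARAMS.items()]
    argv += ["-pacbio-raw", reads_path]  # assumption: Canu 1.x raw-read flag
    return argv


if __name__ == "__main__":
    span = assembly_span_bp(N_CONTIGS, MEAN_CONTIG_BP)
    print(f"Approximate assembly span: {span / 1e9:.2f} Gb")
    stats = busco_summary(BUSCO_TOTAL, BUSCO_MISSING, BUSCO_FRAGMENTED)
    print(f"BUSCO complete: {stats['complete']} ({stats['complete_pct']:.1f}%)")
    print(f"BUSCO found (complete + fragmented): {stats['found_pct']:.1f}%")
    print(" ".join(canu_argv("hlong", "hlong_canu", "sequel_reads.fastq")))
```

The product of contig count and average contig length agrees with the 7.36 Gb figure from the Canu output, and the BUSCO counts are consistent with the statement that over 95% of the gene set was found when fragmented genes are included. The argument list keeps `batOptions` as a single element so that it can be passed to `subprocess.run` without shell quoting.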