A chromosome-length genome assembly and annotation of blackberry (Rubus argutus, cv. “Hillquist”)

Abstract Blackberries (Rubus spp.) are the fourth most economically important berry crop worldwide. Genome assemblies and annotations have been developed for Rubus species in subgenus Idaeobatus, including black raspberry (R. occidentalis), red raspberry (R. idaeus), and R. chingii, but very few genomic resources exist for blackberries and their relatives in subgenus Rubus. Here we present a chromosome-length assembly and annotation of the diploid blackberry germplasm accession “Hillquist” (R. argutus). “Hillquist” is the only known source of primocane-fruiting (annual-fruiting) in tetraploid fresh-market blackberry breeding programs and is represented in the pedigree of many important cultivars worldwide. The “Hillquist” assembly, generated using Pacific Biosciences long reads scaffolded with high-throughput chromosome conformation capture sequencing, consisted of 298 Mb, of which 270 Mb (90%) was placed on 7 chromosome-length scaffolds with an average length of 38.6 Mb. Approximately 52.8% of the genome was composed of repetitive elements. The genome sequence was highly collinear with a novel maternal haplotype-resolved linkage map of the tetraploid blackberry selection A-2551TN and genome assemblies of R. chingii and red raspberry. A total of 38,503 protein-coding genes were predicted, of which 72% were functionally annotated. Eighteen flowering gene homologs within a previously mapped locus aligning to an 11.2 Mb region on chromosome Ra02 were identified as potential candidate genes for primocane-fruiting. The utility of the “Hillquist” genome has been demonstrated here by the development of the first genotyping-by-sequencing-based linkage map of tetraploid blackberry and the identification of possible candidate genes for primocane-fruiting. This chromosome-length assembly will facilitate future studies in Rubus biology, genetics, and genomics and strengthen applied breeding programs.


Introduction
Blackberries (Rubus spp.) are specialty fruits in the Rosoideae subfamily of Rosaceae, which are prized for their sweet, juicy berries that have a delicate aroma and a deep black color. The global blackberry industry has experienced rapid growth and change during the past 2 decades (Strik et al. 2007). Americans spent just over $656 million on blackberries during 2020, a 17% increase over the previous year (Produce Market Guide 2022). This growth has been driven by increased consumer demand, advanced production methods, year-round product availability, and new cultivars.
The Rubus genus likely has a North American origin and is divided into 12 subgenera (Focke 1910;Carter et al. 2019). Other economically important crops in the genus Rubus include red raspberries (Rubus idaeus) and black raspberries (Rubus occidentalis), both of which are diploid species belonging to subgenus Idaeobatus. In contrast, blackberries belong to subgenus Rubus and range from diploid to 12x (2n ¼ 2x ¼ 14 to 2n ¼ 12x ¼ 84). Species belonging to subgenus Rubus are believed to have diverged from other subgenera, including Idaeobatus, Chamaebatus, Cylactis, Dalibardastrum, and Malachobatus, approximately 15-20 MYA (Carter et al. 2019). Cultivated blackberries are not assigned a specific epithet because most cultivars have several species in their ancestry ). In North America, erect and semierect blackberries grown for fresh-market production are bred at the tetraploid (2n ¼ 4x ¼ 28) level and are composed mostly of species native to the Central and Eastern United States, including R. allegheniensis, R. argutus, and R. trivialis. Processing cultivars with trailing growth habit are typically bred at higher ploidy levels (primarily 2n ¼ 6x/7x ¼ 42/49), and are most closely related to the Western North American blackberry species R. ursinus (Finn and Clark 2012).
Rubus plants are unusual among fruit crops because they typically have perennial crowns and root systems and biennial canes.
First-year canes, which are usually vegetative, are called primocanes, while second-year canes that have overwintered are called floricanes. Floral initiation typically begins on primocanes in short days during the autumn, with flowers and fruits developing on floricanes the following spring (Williams 1959;Takeda et al. 2003;Sønsteby and Heide 2008). Raspberry and blackberry cultivars with this customary flowering trait are described as floricane-or biennial-fruiting. Primocane-or annual-fruiting red raspberry cultivars that initiate flowers in the early summer and produce fruit on the tip portion of primocanes or primocane branches during the late summer and autumn ( Fig. 1) were first developed in the 1950s and 1960s (Keep 1988), with primocanefruiting blackberries first commercially released in the early 2000s (Clark et al. 2005). Primocane-fruiting cultivars differ from traditional floricane-fruiting types in that they have no short-day requirement for flower induction and low-temperature requirement for flower emergence. Primocane-fruiting raspberries and blackberries have grown in economic importance over the past 2 decades because they confer several advantages for growers. The primocane crop is typically distinctly later than the floricane crop, which allows for season extension and the possibility for "double-cropping" by producing a floricane crop followed by a primocane crop from the same plant in each year. Furthermore, primocane-fruiting allows for production in an expanded geographical area, including tropical areas where there would be insufficient chilling hours for floricane cultivars, and regions where winter injury to canes is problematic (Clark 2008).
The only known source of primocane-fruiting in tetraploid blackberry cultivars is a recessive allele from the wild diploid accession "Hillquist" (R. argutus; PI 553951; Fig. 1). "Hillquist" was initially discovered in Ashland, VA by L.G. Hillquist, who noticed that some of the wild blackberries growing in his backyard had an unusual fruiting habit. The accession was later donated to the New York State Agricultural Experiment Station by Mrs Hillquist in 1949 (GRIN 2022). "Hillquist" was first used as a male parent in crosses with the tetraploid, floricane-fruiting blackberry cultivar "Brazos" in 1967, but the first primocane-fruiting cultivars, "Prime-Jim" and "Prime-Jan" were not released until nearly 40 years later (Clark et al. 2005;Clark 2008). Since then, many public and private blackberry breeding programs have accessed this germplasm, and "Hillquist" is in the pedigree of many important floricane-fruiting and primocane-fruiting cultivars grown around the world.
Despite their economic importance, very few genomic resources are available for blackberries compared with other fruit crops. Pseudo-chromosome level genome assemblies are available for over 20 Rosaceae crops, including apple (Malus Â domestica) (Velasco et al. 2010;Daccord et al. 2017;Zhang et al. 2019), peach (Prunus persica) (Verde et al. 2013), and Asian pear (Pyrus pyrifolia) (Gao et al. 2021). Within the Rosoideae subfamily, which is characterized by a base chromosome number of x ¼ 7, there are highquality genome assemblies available for rose (Rosa chinensis) (Raymond et al. 2018) and diploid (Fragaria vesca) (Shulaev et al. 2011;Edger et al. 2018) and octoploid (F. Â ananassa) (Edger et al. 2019) strawberry. The first Rubus genome sequenced was a highly homozygous diploid black raspberry selection, ORUS 4115-3 (VanBuren et al. 20164115-3 (VanBuren et al. , 2018Jibran et al. 2018). More recently, chromosome-length assemblies have been published for the red raspberry cultivar "Anitra" (Davik et al. 2022) and R. chingii (Wang et al. 2021). To date, however, the only published Rubus genome assemblies are for species in subgenus Ideaobatus, and there is no genome sequence data for any close relatives of cultivated blackberries in subgenus Rubus in public databases.
Here, we present a chromosome-length genome assembly and annotation of the diploid R. argutus accession "Hillquist." "Hillquist" was chosen for the assembly because it is the original source of the primocane-fruiting used in cultivated tetraploid blackberries and is now represented in the pedigree of public and private blackberry breeding germplasm around the world. The R. argutus assembly was produced using Pacific Biosciences (PacBio) long-read single-molecule real-time (SMRT) sequencing and scaffolded using high-throughput chromosome conformation capture (Hi-C) sequence data. The full assembly is 298.2 Mb in length, with 270.0 Mb (90.1%) assigned to seven scaffolds with an average length of 38.6 Mb. Repetitive elements were predicted to make up 52.8% of the genome, with Gypsy superfamily lineages accounting for the largest fractions of long-terminal repeat (LTR)retrotransposable elements (REs). The computational annotation was performed with support of RNA-sequencing (RNA-seq) and Iso-seq data generated from root tips and actively growing leaves and stems of primocane and floricanes. A total of 38,503 proteincoding genes were predicted from the genome, 72.2% of which were functionally annotated. The practical value of the R. argutus genome assembly and annotation was demonstrated by comparing the genome sequences of related Rosoideae species, anchoring the scaffolds to a novel modified genotyping-bysequencing (GBS)-based (GBSpoly) linkage map of tetraploid blackberry, and identifying possible candidate genes for primocane-fruiting within a previously mapped locus.

Plant material and genome size estimation
The R. argutus germplasm accession "Hillquist" (PI 553951), sourced from the USDA National Clonal Germplasm Repository (NCGR) was used for genome sequencing and assembly. Leaf material was harvested from a single plant of this cultivar grown in the greenhouse at the USDA-NCGR, in Corvallis, Oregon for flow cytometry, DNA extraction, and PacBio, 10Â Chromium, and Hi-C sequencing. "Hillquist" plants propagated by NCGR staff were sent to North Carolina State University (NCSU) and grown in a greenhouse. Tissue from root tips and actively growing leaves and stems from primocanes and floricanes for RNA sequencing and IsoSeq was obtained from plants grown at NCSU. Nuclear flow cytometry with DAPI staining was used to measure DNA content and estimate the genome size of R. argutus "Hillquist." Flow cytometry was performed using young, unexpanded "Hillquist" leaves in biological triplicate with Vinca major as an internal standard.

Pacific Biosciences
High molecular weight DNA was extracted from young, unexpanded leaves of R. argutus "Hillquist" using a modified cetyl trimethylammonium bromide method (Porebski et al. 1997). DNA quality was evaluated with Pulsed Field Gel Electrophoresis (BioRad, Hercules, CA, USA), and quantification was performed with a Qubit fluorometer (ThermoFisher Sci., Waltham, MA, USA). Genomic DNA was sheared to achieve fragments in the 15-40 kb size range using a 26-gauge blunt end needle (ThermoFisher UK Ltd HCA-413-030Y guanine-cytosine Syringe Replacement Parts 26 g, 51 mm) and 1 ml luer-loc syringe. The sheared DNA was then cleaned using 1Â AMPure PB Beads before library preparation. Fragments were enzymatically repaired and used to construct a long read (20 kb) PacBio Sequel genomic library with a SMRTbell Template Prep Kit 1.0-SPv3 according to the manufacturer's recommendations (Pacific Biosciences Inc., Menlo Park, CA, USA). The resulting SMRTbell templates were size selected using BluePippin electrophoresis (Sage Science Inc., Beverly, MA, USA) and template DNA ranging in size between 15 and 50 kb was sequenced in eight PacBio Sequel SMRT cells on a PacBio Sequel instrument at the NCSU Genomic Sciences Laboratory.

Hi-C and 103 Genomics
Five grams of young leaf tissue for Hi-C and 10Â Genomics library preparation was collected from a "Hillquist" plant subjected to 48 h of darkness. An in situ Hi-C library was prepared following (Rao et al. 2014) and sequenced as 150 base pairs (bp) paired-end reads using the Illumina HiSeq4000 platform. 10Â Genomics linked read libraries were made at the Wellcome Sanger Institute High-Throughput DNA Sequencing Centre by the Sanger Institute R&D and pipeline teams using the Chromium Genome Reagent Kit (v2 Chemistry) following the manufacturer's recommended protocol. These libraries were then sequenced on Illumina NovaSeq 6000 platforms at the Wellcome Sanger Institute High-Throughput DNA Sequencing Centre.

Genome sequence assembly
A contig-scale assembly was generated with PacBio sequence data using the FALCON and FALCON-Unzip software applications (Chin et al. 2016). Error correction on the phased assembly was performed using the Arrow consensus model in the PacBio GenomicConsensus package following default parameters. The kmer distribution of unassembled, corrected PacBio reads for "Hillquist" showed a bimodal distribution, indicating high heterozygosity. Therefore, the Purge Haplotigs pipeline was used to curate the heterozygous diploid genome assembly and resolve under-collapsed heterozygosity by identifying syntenic pairs of contigs and moving one to a haplotig pool (Roach et al. 2018). Hi-C data were aligned to the Purge Haplotigs draft assembly using Juicer v1.6.2 (Durand et al. 2016). A candidate assembly and contact maps visualizing the alignments with respect to the draft and the new reference were built using the 3D de novo assembly (3D-DNA) pipeline (Dudchenko et al. 2017), and the genome was reviewed and polished using Juicebox Assembly Tools (https:// github.com/aidenlab/Juicebox). Chromosome nomenclature and orientation were assigned following Fragaria conventions (Shulaev et al. 2011).
Heterozygosity and genome size were estimated by analysis of the k-mer count histogram generated with 10Â Chromium Illumina reads using the online version of GenomeScope (GenomeScope, RRID: SCR_017014; Vurture et al. 2017). The k-mer profile measures how often substrings of length k occur in raw short read sequencing reads. GenomeScope fits a mixture model of 4 evenly spaced negative binomial distributions to the k-mer profile to measure the relative abundances of heterozygous (unique) and homozygous (2-copy) sequences to estimate heterozygosity and estimates genome size by normalizing the observed k-mer frequencies to the average coverage value for homozygous sequences, excluding likely sequencing errors.

Linkage map of autotetraploid blackberry
A mapping population consisting of 119 F 1 progeny from the cross A-2551TN Â APF-259TN (Supplementary Fig. 1) were used to generate a maternal haplotype map. Full methods and results for map construction are provided in Supplementary File 1. Multiplexed NGS-based reduced representation sequencing libraries for parents and progeny were prepared following the GBSpoly protocol optimized for heterozygous and polyploid genomes (Wadl et al. 2018;Mollinari et al. 2020) and sequenced on the HiSeq 2500 (Illumina, San Diego, CA, USA) and the SP flow cell of the NovaSeq 6000 (Illumina, San Diego, CA, USA) system at the Genomic Sciences Laboratory at NCSU to generate 615.4 million sequencing reads after demulitplexing and quality filtering. Raw Fastq files were processed and filtered with the ngsComposer (Kuster et al. 2021) pipeline (https://github.com/bod eolukolu/ngsComposer) and were aligned to the black raspberry (VanBuren et al. 2018) and "Hillquist" genomes using Burrows-Wheeler Aligner (BWA)-MEM (https://github.com/lh3/bwa). The GBSapp pipeline (https://github.com/bodeolukolu/GBSapp), which integrates original and third-party tools (bwa, samtools, picard, bcftools, GATK, java, R-ggplot2, and R-AGHmatrix), was used for variant calling and filtering. Only single dose markers segregating in A-2551TN were used to construct the haplotyperesolved maternal linkage map. Markers that had less than 5% missing data, were heterozygous in A-2551TN (0/0/0/1 Â 0/0/0/0), and segregated in a 1:1 ratio in the progeny were used to create a maternal linkage map in JoinMap 4.1 (Van Ooijen 2006).

Clustering-based characterization of repeats
A clustering characterization of the repetitive component of the "Hillquist" genome was performed using RepeatExplorer2 (Nová k et al. 2020) with default parameters with a random set of 1,000,000 paired-end sequences. To reduce the number of unknown retrotransposon clusters, BLASTN and tBLASTX (Altschul et al. 1990) analyses were performed using Basic Local Alignment Search Tool (BLAST) v2.6.0 with default parameters against the libraries of the characterized Rosaceae full-length LTR-REs.

Full-length LTR-retrotransposon discovery and characterization analysis
The genome assemblies of "Hillquist," F. vesca, P. micrantha, P. persica, and M. domestica were scanned for a structural identification of Class I full-length LTR-REs using LTRharvest v1.5.10 (Ellinghaus et al. 2008) with the following parameters: -minlenltr 100; -maxlenltr 6000; -mindistltr 1500; -maxdistltr 25000; -mintsd 5; -maxtsd 5; -similar 85; -vic 10; -motif tgca. The libraries of fulllength LTR-REs were submitted to domain-based annotation by using DANTE v1.0.0, available on the RepeatExplorer Galaxybased website (https://galaxy-elixir.cerit-sc.cz/). The annotation process was performed with default parameters using the REXdb of transposable element protein domains (Neumann et al. 2019) and a BLOSUM80 scoring matrix. The protein matches were filtered by significance using the parameters provided by the platform, and nested elements were manually removed. To reduce the number of uncharacterized full-length LTR-REs, we performed BLASTN and tBLASTX between uncharacterized elements and characterized elements in conjunction with the annotated contigs produced by the comparative clustering analysis.

RNA extraction, library preparation, and sequencing
Total RNA was extracted from 5 tissue types (root tips, as well as actively growing leaves and stems from both primocane and floricane canes of the same plant) for sequencing with RNA-Seq and Iso-Seq technologies using the Spectrum Plant Total RNA Kit (Millipore Sigma, Burlington, MA, USA) following the manufacturer's protocol. Cane types were distinguished by the presence of trifoliate leaves on floricanes and pentifoliate leaves on primocanes. The purity and concentration of the extracted RNA was determined using a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), and the integrity of the samples was determined using a Qubit 4.0 fluorimeter (Thermo Fisher Scientific, Waltham, MA, USA). Samples with an RNA integrity number value above 7.0 were submitted for subsequent sequencing. Two duplicate RNA-Seq libraries were produced for each tissue type and sequenced with an Illumina HiSeq X instrument at Scientific Operations core at the Wellcome Sanger Institute. Total RNA from the same 5 tissue samples were pooled and used for Iso-Seq library preparation. Standard PacBio Iso-Seq SMRTbell libraries were prepared by Genewiz (South Plainfield, NJ, USA) and one SMRT cell was sequenced with Sequel II. Full-length transcripts were identified using the Iso-Seq 3 application in SMRTLink 5.0. First, multiple reads of the same SMRTbell sequence or the subreads from the same polymerase read were combined to produce one high-quality circular consensus sequence (CCS). Next, the CCS reads were classified as full-length based on the presence of both complementary DNA primers and polyA tails in the reads. Full-length reads were further classified as chimeric or nonchimeric reads based on whether or not primers were found in the middle of the sequences. Finally, unpolished consensus isoforms were extracted using the iterative clustering and error correction algorithm and polished to obtain high-quality and low-quality isoforms.

Structural gene annotation
A repeat library of transposable element families was generated using RepeatModeler2 (Flynn et al. 2020). Repeat sequences, interspersed repeats, and low complexity DNA sequences were identified and soft-masked using RepeatMasker (Smit et al. 2013). Repeat masking was further refined using Iso-Seq transcript sequences. The representative Iso-Seq open reading frames (ORFs) supported by protein or RNA-Seq evidence were used to reduce the amount of repeat-masked coding sequence by unmasking the masked regions overlapping the Iso-Seq defined ORFs. RNA-Seq mapping originated intron hints were obtained by aligning paired RNA-Seq reads to the "Hillquist" genome using STAR (Dobin et al. 2013) with filters for the intron coverage value !3. Additionally, consensus high-quality Iso-Seq isoforms were aligned to the genome by GMAP (Wu and Watanabe 2005) with filters for !95% identity and !90% coverage. The longest ORF (lORF) was identified in each aligned transcript. In loci with overlapping isoforms, a single representative transcript with the longest lORF was selected thus making a set of nonoverlapping Iso-Seq isoforms. Transcripts with lORFs shorter than 300 nucleotides or with introns longer than 10,000 nucleotides were filtered out from this set. Protein hints to splice sites and translation initiation and termination sites were generated by ProtHint (Brůna et al. 2020) using proteins from the Plantae section of the OrthoDB v10 protein database (Kriventseva et al. 2019).
Genes were annotated using a protocol similar to BRAKER2 (Brůna et al. 2021), with additional integration of RNA-Seq and Iso-Seq data ( Supplementary Fig. 2). GeneMark-ET (Lomsadze et al. 2014) with RNA-Seq intron hints was used to create a set of predicted genes. In this analysis, introns mapped with coverage !100 were used for initial parameter estimation. Genes predicted by GeneMark-ET were subsequently used as seed regions in ProtHint to generate protein hints. Next, protein and RNA-Seq hints were used together to predict genes with GeneMark-EPþ (Brůna et al. 2020). By default, GeneMark-EPþ directly uses protein hints generated by ProtHint. This hint set was extended by adding RNA-Seq intron hints. Introns found in the intersection of RNA-Seq and protein hints were added to GeneMark-EPþ's highconfidence hint set. Genes predicted by GeneMark-EPþ and ORFs from the set of nonoverlapping Iso-Seq isoforms were combined to create the new seed regions. In case of an overlap between the Iso-Seq and GeneMark-EPþ defined seeds, the Iso-Seq seed was selected if its ORF was >50 nt longer than the GeneMark-EPþ seed. GeneMark-EPþ was then run on the genome with updated repeat-masking and protein hints delivered by the second iteration of ProtHint. Again, RNA-Seq hints were added to the hints set in the same way as described for the first GeneMark-EPþ run. GeneMark-EPþ predictions fully supported by mapped Iso-Seq transcripts or protein hints were selected for the training of AUGUSTUS (Stanke et al. 2006). AUGUSTUS was run on the "Hillquist" genome sequence with refined repeat-masking and ProtHint proteins hints in agreement with the BRAKER2 protocol (Brůna et al. 2021) to generate the final gene predictions.
The predicted genes were categorized according to their support by external evidence. Multiexon transcripts were fully supported by Iso-Seq if all introns had support by at least a single Iso-Seq transcript. The supporting Iso-Seq transcript could not contain any additional introns, except in its 5' and 3' UTRs. Multiexon transcripts were fully supported by proteins or RNA-Seq if all their introns were supported by protein or RNA-Seq hints. Single exon transcripts were fully supported by Iso-Seq if a matching lORF was found in one of the Iso-Seq transcripts. Single-exon transcripts were fully supported by proteins if the start and stop codons were supported by protein hints. Transcripts supported by any evidence were required to have a part of their gene structure supported by an Iso-Seq, RNA-Seq, or a protein hint. The Benchmarking Universal Single-Copy Orthologs (BUSCO) (Seppey et al. 2019) toolkit was used to assess how many predicted R. argutus genes were coding for Universal Single-Copy Orthologs. Furthermore, we used Liftoff (Shumate and Salzberg 2021) to map annotated genes from F. vesca (annotation v4.0.a2; assembly v4.0.a1; Li et al. 2018) onto the R. argutus assembly.

Functional gene annotation
Putative gene function was determined through interrogation of the Swiss-Prot, Araport11, NCBI nr, Refseq, and TrEmbl protein databases with BLASTþ blastp-fast algorithm (Camacho et al. 2009) using the predicted protein-coding sequences of the 38,503 genes identified in the structural annotation as queries with an expectation value cutoff of 1eÀ6. BLASTþ analyses were executed using the Galaxy platform (Afgan et al. 2018) with locally installed databases except for Araport11, which was downloaded from The Arabidopsis Information Resource (TAIR, https://www. arabidopsis.org/). InterProScan v5 (Zdobnov and Apweiler 2001) was used to assign InterPro domains, and Gene Ontology (GO) terms to the predicted proteins. KEGG ortholog and KEGG pathway mapping were performed with BlastKOALA v2.2 (Kanehisa et al. 2016) and eggNOG-mapper v2 (Huerta-Cepas et al. 2017), respectively.

Potential candidate genes for primocane-fruiting in blackberry
To explore candidate genes for the primocane-fruiting trait in blackberry, blackberry homologs of the Arabidopsis flowering time genes listed in FLOR-ID database were mined from the "Hillquist" genome sequence (Bouché et al. 2016). A previously mapped 11.2 Mb region (Castro et al. 2013) corresponding to R. argutus chromosome Ra02 at 25,901,374-37,085,204 bp was specifically targeted for potential primocane-fruiting candidate genes.

Results and discussion
Chromosome-length genome assembly A combined total of 3.8 million PacBio post-filtered reads with an average length of 6,824 bp were generated from the eight SMRT cells, resulting in a total of 25.9 Gb of sequence ($77Â Genome Coverage) (Supplementary Table 2). These reads were used to generate an initial FALCON-Unzip assembly comprised 374 Mb of sequence in 1,756 contigs with an N50 of 486 kb and a maximum contig length of 5.9 Mb. After Purge Haplotigs was used to resolve under-collapsed heterozygosity, the optimized assembly consisted of 297 Mb assigned to 811 primary contigs with a contig N50 of 650 Kb and a maximum contig length of 5.9 Mb. The Hi-C library was sequenced to produce 559,559,351 paired-end reads. Hi-C data were aligned to the Purge Haplotigs draft assembly to create a new 298 Mb assembly composed of 350 scaffolds with an N50 of 38.6 Mb and a maximum scaffold length of 45.5 Mb (Table 1 and Fig. 2). Among these Hi-C scaffolds, seven chromosome-length scaffolds with a total length of 270 Mb (90% of the 298 bp genome) corresponded directly to the seven R. occidentalis and F. vesca chromosomes (Supplementary Table 3).

Genome size estimation
The nuclear flow cytometry generated estimate of the R. argutus genome size was 337.4 Mb (1C ¼ 0.345 pg). This estimate falls within the reported range of other diploid species in subgenus Rubus (R. hispidus, R. canadensis, R. trivialis, R. canescens, and R. sanctus), which was between 1C ¼ 0.295-0.375 pg (Thompson 1995;Meng and Finn 2002). Heterozygosity and genome size were also estimated by analysis of the k-mer count histogram generated with 10Â Chromium Illumina reads using the online version of GenomeScope [GenomeScope, RRID: SCR_017014 (Vurture et al. 2017)]. The size and heterozygosity of the genome were estimated as 298.06 Mb and 1.04% (Supplementary Fig. 3). The k-mer based genome size estimate was within 172.4 kb of the Hi-C assembly length, which suggests that the genome was nearly complete. However, the flow cytometry estimate of R. argutus genome size was 337 Mb, indicating that 88.4% of the genome was incorporated in the assembly.

Synteny with Rosoideae genomes
The "Hillquist" assembly showed a high degree of collinearity to the other Rosoideae genomes ( Fig. 3; Supplementary Fig. 4). Collinearity to the genomes of R. idaeus "Anitra" and R. chingii was particularly high, with no large-scale rearrangements, translocations, or inversions observed across any of the 7 chromosomes when compared with these 2 species (Fig. 3c; Supplementary Fig.  4). As with the previously published comparison of the R. idaeus "Anitra" and R. occidentalis genome assemblies (Davik et al. 2022), several areas of noncollinearity were observed between the "Hillquist" and R. occidentalis genomes. The most notable areas of non-collinearity between the genomes were the large degree of rearrangement on one half of chromosomes 1 and 4 and the 2 large inversions originating from the same chromosomal breakpoint identified on chromosome 6 (Fig. 3d). Other authors (Davik et al. 2022) have suggested that these differences could be the result of errors in the assembly of the R. occidentalis genome, and the data presented here support that hypothesis. The pattern of rearrangements observed between the "Hillquist" genome and F. vesca "Hawaii 4" and R. chinensis "Old Blush" genomes were similar to those previously reported for the "Anitra" genome (Davik et al. 2022). Two large inversions on chromosomes 5 and chromosome 7 and with several smaller inversions on chromosomes 3 and 4 were observed between the "Hillquist" genome and that of F. vesca "Hawaii 4" (Fig. 3a). Two significant translocations were observed between chromosome 1 and chromosome 6 of the "Hillquist" and R. chinensis "Old Blush" genomes, along with small inversions on chromosomes 2 and 7 (Fig. 3b). These rearrangements reflect the evolutionary timescales since the Rubus, Fragaria, and Rosa ancestral genomes diverged from a common ancestor (Longhi et al. 2014).

Linkage map of autotetraploid blackberry
The best available blackberry linkage map was constructed using 119 simple sequence repeat (SSR) markers developed from red raspberry and a blackberry expressed sequence tag library (Castro et al. 2013). Due to the paucity of markers, this SSR-based map contained large genetic regions with no marker coverage. The utility of the "Hillquist" genome sequence for use in freshmarket blackberry breeding was therefore assessed by anchoring the pseudo-chromosomes to a novel linkage map of the autotetraploid breeding selection, A-2551TN, from the University of Arkansas System Division of Agriculture Fruit Breeding Program. The linkage map consisted of 2,935 sequence-characterized markers that were identified using a modified GBS protocol (GBSpoly) that is robust for highly heterozygous and polyploid genomes. In total, 85.9% of quality filtered reads were mapped to unique positions and 2,022,664 polymorphic markers were identified when the "Hillquist" genome was used as a reference, while only 67.3% of reads mapped to unique positions and 1,811,617 polymorphic markers were discovered when the black raspberry genome was used as the reference (Supplementary Table 4).
Only single dose markers segregating in A-2551TN were used to construct the haplotype-resolved maternal linkage map, which was composed of 2,935 markers assigned to 30 linkage groups, with between 5 and 249 markers per linkage group. The total map length was 2,411.81 cM, with linkage groups ranging from 18.61 to 146.65 cM in length and an average of one marker every 0.82 cM (Supplementary Tables 5 and 6 and Fig. 5). The physical positions of the mapped markers on the "Hillquist" pseudochromosomes were used to identify 4 homologous linkage groups corresponding to 5 chromosomes (1, 2, 3, 4, and 6), and 5 homologous linkage groups corresponding to the remaining 2 chromosomes (5 and 7). The A-2551TN maternal haplotype map was strongly collinear with the "Hillquist" genome, with no major translocations or inversions (Fig. 4). While many of the linkage groups in the A-2551TN maternal haplotype map contained markers that aligned to physical positions across the length of Fig. 3. Whole-genome alignment plots between the "Hillquist" blackberry (R. argutus) genome assembly and the chromosome-length assemblies of (a) woodland strawberry (Fragaria vesca V. 4), (b) rose (Rosa chinensis), (c) red raspberry (R. idaeus), and (d) black raspberry (R. occidentalis v.3). each of the chromosomes, 10 linkage groups had markers aligned to physical positions spanning less than 10 megabase pairs (Mbp) in the "Hillquist" genome. Based on the physical positions of these markers on short linkage groups, it is likely that linkage groups 7b/7d and 5c/5e belong to the same haplotype of A-2551TN. Gaps in the linkage map can likely be attributed to the high inbreeding coefficients of A-2551TN (F ¼ 0.100) and its progeny from the A-2551TN Â APF-259TN cross (F ¼ 0.099). The high percentage of reads mapped to unique positions on the "Hillquist" genome and the collinearity between the physical map of "Hillquist" and the A-2551TN maternal haplotype map validate the order and orientation of the Hi-C-based chromosome-length assembly of "Hillquist" and demonstrate its utility for genomic breeding research in polyploid fresh-market blackberries.

Clustering-based characterization of repeats
Of the 1 million "Hillquist" paired-end Illumina reads randomly selected for de novo clustering, 555,442 reads were processed by RepeatExplorer2 (Nová k et al. 2020). Of these processed reads, 262,064 (47.2% of the genome) were considered singlets and did not fall into the category of repeated sequences according to the thresholds imposed by the program. The remaining 293,379 reads (52.8% of the genome) were characterized as repeats and grouped in 51,851 clusters, each of which represented a single repeat sublineage. One hundred and seventy-three clusters were classified   4. Comparison of the tetraploid A-2551TN maternal haplotype map with the "Hillquist" blackberry (R. argutus) physical map. As expected, 4 homologous linkage groups (haplotypes a-d) were identified for chromosomes Ra01, Ra02, Ra03, Ra04, and Ra06. Five homologous linkage groups (haplotypes a-e) corresponded to chromosomes Ra05 and Ra07. Based on the physical positions of the markers on chromosomes Ra05 and Ra07, it is likely that linkage groups 5c and 5e and 7b and 7d actually belong to the same haplotype of A-2551TN.
as top clusters with a genome proportion greater than 0.01%, representing the most abundant repeat families. Copia and Gypsy superfamilies accounted for the largest fractions of the genome (10.51% and 23.44%, respectively; Table 2). In particular, Athila-related clusters were the most abundant. No DNA transposons, non-LTR elements, or satellite DNA were among the top clusters. The absence of DNA transposon and satellite DNA in top clusters indicates that these repeat types are scarce in the Rubus genome. Illumina reads related to these repeats were assembled in clusters accounting for less than 0.01% of the genome. Finally, 18.14% of the repetitive component remained unclassified.

Full-length LTR-retrotransposon discovery and characterization analysis
A total of 636 full-length LTR-REs were identified in the "Hillquist" genome assembly, with 217 and 409 LT-REs belonging to the Gypsy and Copia superfamilies, respectively (Supplementary Table 7). The number of full-length LTR-REs isolated from the other 4 genome assemblies of Rosaceae species varied from a minimum of 204 in F. vesca to a maximum of 2,662 in M. domestica (Supplementary Table 7). Copia elements were more abundant than Gypsy LTR-REs in "Hillquist" (1.9:1), F. vesca (2.7:1), and P. persica (4.5:1), while LTR-REs in Copia and Gypsy superfamilies were equally represented in P. micrantha (1:1), and Gypsy elements were slightly more abundant in M. domestica (0.7:1). The lineage level annotation of most elements revealed considerable quantitative and qualitative variability among the 5 species, with several lineages that were not detected in some species. However, it is possible that very ancient and rearranged elements may not have been identifiable through structural features due to the stringency of the parameters used in the identification process.

Gene prediction and annotation
RNA extraction, library preparation, and sequencing A total of 135,518,570 paired reads were generated from the 10 RNA-Seq libraries (2 duplicate libraries of 5 tissue types: root tips, and actively growing leaves and stems from both primocane and floricane canes of the same plant), with 9,457,856-20,119,374 paired reads per library. One SMRT cell with a library prepared from pooled RNA from the same 5 tissue samples was sequenced with Sequel II to generate a total of 5,959,439 polymerase reads with a mean length of 39,878 bp per read, an average insert length of 7,387 bp, and a mean subread length of 1,614. A total of 2,830,415 CCS reads with a mean length of 1,526 bp were generated from these reads. Finally, 185,699 and 290 polished highquality and low-quality isoforms were generated from the IsoSeq data.

Structural gene annotation
One hundred and thirty-five megabase pairs (45.4%) of the "Hillquist" genome was repeat masked prior to structural annotation. The repeat length distribution is shown in Supplementary  Fig. 6. Masking refinement based on aligned Iso-Seq transcripts unmasked 1.9 Mbp of the sequence at 7,257 distinct Iso-Seq loci. The final set of predicted genes contained 38,503 coding genes and, with counting alternative isoforms, 40,397 coding transcripts. A total of 13,364 of these transcripts were fully supported by Iso-Seq transcripts, while RNA-Seq data fully supported 13,469 transcripts, and 17,848 transcripts had full protein support (Fig. 5,  a and b); 31,326 transcripts were partially supported by some evidence, and the remaining 9,407 transcripts were pure ab initio predictions. Transcripts in the unsupported group were rather short (average protein length 166 AA), with a large fraction (5,129; 55%) lacking any introns. Overall, 19,937 genes had full support from at least one of the external evidence types. The average length of proteins encoded by transcripts with at least one type of external evidence support was 400 AA; this set included 6,125 intronless transcripts. The 38,503 coding genes had an average length of 2,183 bp, containing an average of 3.4 introns per gene and median intron and exon lengths of 152 and 132 bp, respectively. Of these coding genes, 36,836 had no alternative isoforms, 1,466 had 2 isoforms, and 201 had 3 or more isoforms ( Supplementary Fig. 7). The number of predicted genes in R. argutus was comparable to other Rubus genomes including R. idaeus (39,448;Davik et al. 2022), R. chingii (33,130;Wang et al. 2021), and R. occidentalis (34,545;VanBuren et al. 2018). Liftoff, using default parameters, mapped 21,480 genes (63% of genes in the F. vesca annotation v4.0.a2) and the mapped gene structures closely agreed with our predicted gene structures: 68% and 95% of mapped exons matched exons in our R. argutus annotation exactly and partially, respectively. In the predicted set of genes, 2,134 (91.7%) complete R. argutus genes orthologous to the BUSCO families were identified, along with 74 (3.2%) genes with partial match. A small fraction of the BUSCO families (5.1%) were not identified among the predicted R. argutus genes ( Supplementary  Fig. 8). These results suggest that the "Hillquist" assembly and the gene complement are 94.9% complete.

Potential candidate genes for primocane-fruiting in blackberry
In blackberry, the primocane-fruiting trait is caused by a single recessive locus that has been mapped between markers FF683693.1 RH_MEa0007aG06 and FF683518.1 RH_MEa0006aC04 Table 2. Classification of clusters produced by RepeatExplorer2 and proportion of repeat types in the genome of "Hillquist" (R. argutus).

Classification
Genome proportion (%) Number of clusters in an SSR-based linkage map of the tetraploid population "Prime-Jim" Â "Arapaho" (Castro et al. 2013). While these markers were originally placed on linkage group 7 of blackberry, it was later shown that the flanking markers and most others from linkage group 7 of the "Prime-Jim" Â "Arapaho" aligned to chromosome 2 of R.  ). Interestingly, different loci have been found to control primocane-fruiting in raspberry (Jibran et al. 2019) and everbearing flowering in diploid and octoploid strawberries (Koskela et al. 2012;Gaston et al. 2013), suggesting that flowering in first-year shoots has evolved multiple times in the Rosaceae.
To explore candidate genes for the primocane-fruiting trait in blackberry, blackberry homologs of the Arabidopsis flowering time genes listed in FLOR-ID database were mined from the "Hillquist" genome sequence (Supplementary Table 10; Bouché et al. 2016). Based on blackberry gene annotations and BLAST analyses, 18 flowering gene homologs were identified within the $11.2 Mb primocane-fruiting locus on "Hillquist" chromosome Ra02 (Table 3). Almost half of the genes were involved in epigenetic processes that control gene expression through histone methylation, histone ubiquitinylation, small RNA processing, or as a component of nucleosome assembly. Moreover, 6 putative transcription factors and 3 photoperiodic flowering pathway genes (LATE, PRR7, CIB4) were identified in the primocane-fruiting locus.
Ten and 8 of the 18 flowering genes in the locus encoded activators and repressors of flowering in Arabidopsis, respectively. Floral repressors are primary candidates for primocane-fruiting because a loss-of-function mutation in a repressor could cause this trait to be recessively inherited. Among transcription factors in the locus that repress flowering, LATE is a C2H2 zinc-finger protein that represses the expression of photoperiodic pathway genes CO and FT (Weingartner et al. 2011) and ATH1 is involved in the activation of FLC in Arabidopsis (Proveniers et al. 2007). Furthermore, many of the identified epigenetic regulators, including STG8, ATX2, UBP26, and EMF1, functioned as floral repressors in Arabidopsis by activating the expression of FLC (Zhao et al. 2005;Saleh et al. 2008;Schmitz et al. 2009;Kim et al. 2012). No clear FLC ortholog was found in the "Hillquist" genome assembly, but these epigenetic regulators likely regulate other targets in blackberry as observed in Arabidopsis (Saleh et al. 2008;Kim et al. 2012).
Other promising candidate genes identified were PRR7 and FD. PRR7 encodes a floral activator in Arabidopsis (Nakamichi et al. 2007), and a homologous gene called BTC1 is involved in the annual to biennial transition in sugar beet (Pin et al. 2012). However, if PRR7 controls primocane-fruiting in blackberry, the mechanism is different from beet. In beet, recessive btc1 alleles confer an obligatory vernalization response and postpone floral initiation into the spring of the second year (Pin et al. 2012), while recessive alleles of the primocane-fruiting locus cause flowering during the first year in blackberry (Lopez-Medina et al. 2000).
Previous studies have shown that TFL1 encodes a strong repressor of flowering in several Rosaceous species. For example, in diploid woodland strawberry, nonfunctional TFL1 alleles cause rapid and perpetual flowering in long day conditions (Koskela et al. 2012). Similarly, RNA-silencing of TFL1 orthologs in cultivated strawberry, apple, and pear caused comparative phenotypes in these species (Flachowsky et al. 2012;Freiman et al. 2012;Koskela et al. 2016). Therefore, TFL1 is also expected to play an important role in the control of flowering in blackberry, and it is a Table 3. Rubus argutus flowering gene homologs identified in the primocane-fruiting locus from 25.9 to 37.1 Mb on chromosome Ra02.
Rubus argutus gene potential target of identified candidate genes. A gene encoding the bZIP transcription factor FD was identified just outside the primocane-fruiting locus in the "Hillquist" genome. Recent results show that TFL1 competes for binding to FD with floral activator FT to control common target genes in Arabidopsis (Zhu et al. 2020). Therefore, a mutation in FD could potentially prevent TFL1 from repressing floral activators that are needed for floral initiation during the first season in primocane-fruiting genotypes, leading to the observed phenotype.

Conclusions
The first high-quality chromosome-length genome assembly and annotation of the diploid blackberry R. argutus "Hillquist" is reported in this manuscript. Comparisons of the "Hillquist" genome with the related species R. idaeus (Davik et al. 2022) and R. chingii (Wang et al. 2021) demonstrated that the Hi-C assembly represented the majority of the genome and was of high quality. BUSCO analysis and comparisons of predicted genes with other Rubus genomes showed that the structural and functional annotations of the assembly were also comprehensive. Analysis of repeat content revealed that approximately 52.8% of the genome was composed of repetitive elements and that the Gypsy superfamily of LTR-REs accounted for the largest fractions of the genome. Developing new GBS-based maternal haplotype map of the tetraploid blackberry breeding selection A-2551TN that was highly collinear with the physical sequence of "Hillquist" demonstrated the utility of this new genome for molecular breeding applications in tetraploid fresh-market blackberries. The new "Hillquist" genome assembly and its annotation were also used to identify potential candidate genes for the economically important trait of primocane-fruiting. The "Hillquist" genome sequence and annotation presented here will assist blackberry breeders and scientists in marker development and genomic-assisted breeding and facilitate future studies of Rubus biology, genetics, and genomics.