The genome sequence of lesser trefoil or Irish shamrock, Trifolium dubium Sibth. (Fabaceae)

We present a genome assembly from an individual Trifolium dubium (lesser trefoil; Tracheophyta; Magnoliopsida; Fabales; Fabaceae) as part of a collaboration between the Darwin Tree of Life and the European Reference Genome Atlas. The genome sequence is 679.1 megabases in span. Most of the assembly is scaffolded into 15 chromosomal pseudomolecules. The two mitochondrial genomes have lengths of 133.86 kb and 182.32 kb, and the plastid genome assembly has a length of 126.22 kilobases.


Background
Lesser trefoil (Trifolium dubium Sibth.), also known as lesser hop clover or suckling clover, is a common clover species that is considered by most to represent the traditional Irish shamrock. It is native and common across Europe, north to Scandinavia and south to Morocco and Turkey, but it is also found in many temperate regions of the world as an introduced species (POWO, 2023).
Trifolium dubium is a mat-forming annual, which has up to 20 tiny yellow flowers packed in dense globular flower heads (Figure 1). Most commonly, it occurs in unimproved grassland, but it is also found in other habitats such as lawns, pastures, coastal meadows, roadsides, waste places and disturbed areas. Its adaptability to different environmental conditions has contributed to its prevalence in both natural and anthropogenic landscapes across its range.
There has been much discussion on the identity of the "true" shamrock, but for over a century the majority of people surveyed have considered T. dubium to be the real one (Colgan, 1892; Colgan, 1893; Nelson, 1991). Shamrock flowers from May to October in Ireland, so it is not generally in flower on St Patrick's Day (17 March); however, leaves of T. dubium are worn on St Patrick's Day, and have become a floral symbol of Ireland. Trifolium dubium appears in numerous emblems of state and non-state organisations and companies across the Republic of Ireland, Northern Ireland, and beyond. Together with the harp, the shamrock is registered as an international trademark by the Government of Ireland.
The legend of the shamrock holds that St Patrick used its three-parted clover leaflets to explain to the Irish people the Christian concept of the Holy Trinity (Van Treeck & Croft, 1936), although the word "shamrock" derives from the Irish words seamair (clover) and óg (young) (Nelson, 1991).
While T. dubium is not typically cultivated as a primary crop, like most legumes it is capable of fixing atmospheric nitrogen through its symbiotic relationship with nitrogen-fixing bacteria in root nodules (Brock, 1973). This enriches the soil as well as the plants themselves, which therefore provide a good source of macro- and micronutrients and protein for livestock (Brock, 1973; Gounden et al., 2018). This species and several related species of Trifolium also produce condensed tannins (unlike the major crop clover species T. repens L. and T. pratense L.), making them of interest to breeding programmes of forage legumes, because they are less likely to cause legume bloat in ruminants (Fay & Dale, 1993).
While many cytological studies of Trifolium species have indicated that most (about 80%) are diploid based on x = 8 (with descending dysploidy giving rise to x = 7, 6 or 5 in some species; Ellison et al., 2006), counts of T. dubium have suggested it is a tetraploid, although there has been some discrepancy as to whether it is 2n = 28 or 30 (Ansari et al., 2008; Taylor et al., 1983; Vižintin et al., 2006; Zohary & Heller, 1984), or 2n = 32 (based on a chromosome count of a plant from Kent, England; Gornall and Bailey, 1993). Recent molecular cytogenetic studies of T. dubium with 2n = 30 are in agreement with the genome assembly reported here, and have provided important insights into its genetic composition and evolution (e.g. Ansari et al., 2008; Vozárová et al., 2021). Such studies have proposed that the species is an allotetraploid that likely arose from natural hybridisation between T. campestre Schreb. (2n = 14) and T. micranthum Viv. (2n = 16) (Ansari et al., 2008).
Whole genome sequence data are now available for at least six Trifolium species (e.g. Bickhart et al., 2022; Garg et al., 2022; Griffiths et al., 2019; Santangelo et al., 2023), and here we present the first high-quality genome for T. dubium, stemming from a collaboration involving the Darwin Tree of Life Project and the European Reference Genome Atlas pilot project. We anticipate this genome will be a valuable genomic resource for a range of future studies. These include comparative analyses focused on the evolution of allopolyploid genomes, as well as studies exploring its potential as an additional nutritional source for livestock, especially given its high condensed tannin content.

Genome sequence report
The genome was sequenced from a specimen of Trifolium dubium collected from Gorebridge, UK (55.84, -3.04). Using flow cytometry, the genome size (1C-value) was estimated to be 0.84 pg, equivalent to 820 Mb. A total of 72-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 283 missing joins or mis-joins, reducing the scaffold number by 61.95%, and increasing the scaffold N50 by 14.41%.
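The relationship between the 1C-value in picograms and the genome size in megabases can be illustrated with a short sketch. It assumes the commonly used conversion factor of 1 pg of DNA ≈ 978 Mb; the text does not state which factor was applied, and the names `MB_PER_PG` and `pg_to_mb` are hypothetical.

```python
# Assumption: 1 pg of DNA corresponds to roughly 978 Mb.
# The genome note does not state the factor it used.
MB_PER_PG = 978.0


def pg_to_mb(c_value_pg: float) -> float:
    """Convert a 1C-value in picograms to an approximate genome size in Mb."""
    return c_value_pg * MB_PER_PG


if __name__ == "__main__":
    estimate_mb = pg_to_mb(0.84)   # flow cytometry 1C-value quoted in the text
    assembly_mb = 679.1            # assembly span quoted in the text
    print(f"Estimated genome size: {estimate_mb:.0f} Mb")
    print(f"Assembly span as a share of the estimate: {assembly_mb / estimate_mb:.1%}")
```

Under this assumption, 0.84 pg corresponds to roughly 822 Mb, in line with the rounded figure of 820 Mb, and the 679.1 Mb assembly spans somewhat over four fifths of the flow cytometry estimate.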
The final assembly has a total length of 679.1 Mb in 153 sequence scaffolds with a scaffold N50 of 46.0 Mb (Table 1).
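For readers less familiar with the scaffold N50 statistic, the following minimal sketch shows how an Nx value can be computed from a list of scaffold lengths. The function name `nx_length` and the toy lengths are hypothetical and are not taken from the drTriDubi3.1 assembly; the assembly pipeline may compute these metrics differently.

```python
from typing import Iterable


def nx_length(lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Return the Nx length: the length L such that scaffolds of length >= L
    together cover at least `fraction` of the total assembly span."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # only reached through floating-point edge cases


if __name__ == "__main__":
    toy_lengths = [10, 8, 6, 4, 2]  # hypothetical lengths, not the real assembly
    print("N50:", nx_length(toy_lengths, 0.5))
    print("N90:", nx_length(toy_lengths, 0.9))
```

Applied to the real scaffold lengths, the same calculation with `fraction=0.5` and `fraction=0.9` would be expected to give the N50 and N90 values reported for the assembly.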
The snail plot in Figure 2 provides a summary of the assembly statistics, while the distribution of assembly scaffolds on GC proportion and coverage is shown in Figure 3. The cumulative assembly plot in Figure 4 shows curves for subsets of scaffolds assigned to different phyla. Most (99.51%) of the assembly sequence was assigned to 15 chromosomal-level scaffolds.
Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 5; Table 2). The order and orientation of contigs on chromosome 1 between 37.5 Mb and 42.4 Mb are uncertain. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial and plastid genomes were also assembled and can be found as contigs within the multifasta file of the genome submission.

Sample acquisition, genome size estimation and nucleic acid extraction
Leaf and flower samples of Trifolium dubium were collected from Gorebridge, Scotland, UK (latitude 55.84, longitude -3.04) on 2021-08-11. One specimen was used for DNA sequencing (specimen ID EDTOL02342, ToLID drTriDubi3); another was used for Hi-C sequencing (specimen ID EDTOL02341, ToLID […]
RNA was extracted from flower tissue of drTriDubi4 in the Tree of Life Laboratory at the WSI using the RNA Extraction: Automated MagMax™ mirVana protocol (do Amaral et al., 2023). The RNA concentration was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer using the Qubit RNA Broad-Range Assay kit. Analysis of the integrity of the RNA was done using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.
Protocols developed by the WSI Tree of Life core laboratory are publicly available on protocols.io (Denton et al., 2023).

Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing libraries were constructed according to the manufacturers' instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. DNA and RNA sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences SEQUEL II (HiFi) and Illumina NovaSeq 6000 (RNA-Seq) instruments. Hi-C data were also generated from flower and leaf tissue of drTriDubi2 using the Arima2 kit and sequenced on the Illumina NovaSeq 6000 instrument.
Table 3 contains a list of relevant software tool versions and sources.

Table 3. Software tools: versions and sources.

Wellcome Sanger Institute -Legal and Governance
The materials that have contributed to this genome note have been supplied by a Darwin Tree of Life Partner. The submission of materials by a Darwin Tree of Life Partner is subject to the 'Darwin Tree of Life Project Sampling Code of Practice', which can be found in full on the Darwin Tree of Life website. By agreeing with and signing up to the Sampling Code of Practice, the Darwin Tree of Life Partner agrees they will meet the legal and ethical requirements and standards set out within this document in respect of all samples acquired for, and supplied to, the Darwin Tree of Life Project.
Further, the Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use.
The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material

Yoshinori Fukasawa
Center for Bioscience Research and Education, Utsunomiya University, Tochigi, Japan

This manuscript reports the genome sequence of Trifolium dubium, the Irish Shamrock, assembled from PacBio HiFi and Hi-C data. The final assembly (drTriDubi3.1) has a total length of 679.1 Mb in 153 sequence scaffolds, with a scaffold N50 of 46.0 Mb and 99.51% of the assembled sequence assigned to 15 chromosomal-level scaffolds. The authors also report the assembly of the mitochondrial (133.86 kb and 182.32 kb) and plastid (126.22 kb) genomes.

Major Points:
- Haplotype phasing: The manuscript states that the assembly represents one haplotype and that contigs from the second haplotype were also deposited. However, the statistics for the phased assembly are not available. It would be beneficial to clarify this aspect and explain the rationale behind not aiming for a fully phased assembly.
- Chromosome 1 structure: The manuscript indicates that the order and orientation of contigs on chromosome 1 between 37.5 Mb and 42.4 Mb are uncertain. It would be beneficial to provide further detail on the difficulties encountered in this region and to present potential solutions or avenues for further investigation that could resolve this ambiguity.
- Genome annotation: The manuscript makes a cursory mention of the planned use of RNA-Seq data for genome annotation and presentation through Ensembl. However, providing a more detailed rationale for the decision to extract RNA only from flower tissue would be beneficial for readers. Additionally, discussing the expected timeline for the annotation release would be helpful for researchers interested in utilizing this resource.

Minor Points:
- K-mer source for Merqury: It is imperative that the source of the k-mers used in Merqury be explicitly described, in particular whether the k-mers are derived from the HiFi reads. If they are, readers should be aware that this could lead to an overestimation of QV within the Merqury framework.
- Subgenome identification: The authors may employ existing genomic resources of the proposed parental species (T. campestre and T. micranthum) to identify homeologous regions within the T. dubium assembly.

Rizky Dwi Satrio
1 Department of Biology, The Republic of Indonesia Defense University, Bogor, Indonesia
2 Universitas Pertahanan Indonesia, Tajur, Indonesia

The manuscript titled "The genome sequence of lesser trefoil or Irish shamrock, Trifolium dubium Sibth. (Fabaceae)" presents a high-quality genome assembly of Trifolium dubium, a species often considered to represent the traditional Irish shamrock. The genome was sequenced as part of a collaboration between the Darwin Tree of Life and the European Reference Genome Atlas. The study reports a genome assembly size of 679.1 megabases, scaffolded into 15 chromosomal pseudomolecules, along with two mitochondrial genomes and a plastid genome.

Review
Rationale for Creating the Dataset
○ Clarity: The manuscript clearly explains the motivation behind sequencing the Trifolium dubium genome. The species holds cultural significance and has potential agricultural importance due to its ability to fix nitrogen and produce condensed tannins, making it a valuable resource for future comparative studies and breeding programs.
○ Critique: The rationale is well articulated. However, even though this manuscript is a data note, it could benefit from a more detailed discussion of specific research questions that the dataset could address, particularly in comparative genomics and plant breeding.

Appropriateness of Protocols and Technical Soundness
○ Protocols: The manuscript describes the sequencing and assembly methods in great detail. High molecular weight (HMW) DNA extraction was performed, followed by sequencing with PacBio HiFi, Hi-C, and RNA-Seq technologies. The assembly was curated using multiple tools, ensuring high quality.
○ Technical Soundness: The work is technically sound, with a high degree of thoroughness in the methods described. The genome assembly metrics, such as scaffold N50 and BUSCO scores, indicate a high-quality assembly.
○ Critique: While the protocols are appropriate and robust, it would be beneficial to include more details on the challenges faced during the assembly process and how they were addressed, especially during manual curation.

Detailing of Methods and Materials for Replication
○ Detailing: The manuscript provides detailed descriptions of the methods, including specific protocols for DNA and RNA extraction, sequencing, and data processing. It references publicly available protocols on platforms like protocols.io, which enhances reproducibility.
○ Critique: The level of detail is generally sufficient for replication. However, the manuscript could improve by providing more insights into the specific conditions used during sequencing and assembly, such as environmental factors during sample collection, particularly for RNA-seq analysis.

Presentation and Accessibility of Datasets
○ Presentation: The datasets are presented clearly, with comprehensive tables summarizing the assembly metrics, accession numbers, and scaffold information. The manuscript also provides links to interactive tools for further exploration of the data.

Reviewer Expertise: Plant Genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1. Photographs of Trifolium dubium (a and b are representative images for the species, but not the specimen or population used for genome sequencing; c is a representative plant from the population that was used for genome sequencing). a) https://commons.wikimedia.org/wiki/User:Rasbak b) https://commons.wikimedia.org/wiki/User:Kenraiz c) Markus Ruhsam.

Figure 2. Genome assembly of Trifolium dubium, drTriDubi3.1: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 679,499,717 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (64,644,275 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (46,006,535 and 34,190,264 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the fabales_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/drTriDubi3_1/dataset/drTriDubi3_1/snail.

Figure 3. Genome assembly of Trifolium dubium, drTriDubi3.1: BlobToolKit GC-coverage plot. Scaffolds are coloured by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/drTriDubi3_1/dataset/drTriDubi3_1/blob.

Figure 4. Genome assembly of Trifolium dubium, drTriDubi3.1: BlobToolKit cumulative sequence plot. The grey line shows cumulative length for all scaffolds. Coloured lines show cumulative lengths of scaffolds assigned to each phylum using the buscogenes taxrule. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/drTriDubi3_1/dataset/drTriDubi3_1/cumulative.

Figure 5. Genome assembly of Trifolium dubium, drTriDubi3.1: Hi-C contact map of the drTriDubi3.1 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=F0MkUMXRRMqAc9HJXoQ8LA.

○ Accessibility: The data is openly accessible via public repositories, with clear instructions on how to access it.
○ Critique: The presentation is strong, but the manuscript could enhance usability by providing more user-friendly summaries or visualizations, such as a Circos diagram giving a chromosome-by-chromosome overview of the genome assembly.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

Table 2. Chromosomal pseudomolecules in the genome assembly of Trifolium dubium, drTriDubi3. Columns: INSDC accession, Chromosome, Length (Mb), GC%.
Agreement entered into by the Darwin Tree of Life Partner, Genome Research Limited (operating as the Wellcome Sanger Institute), and in some circumstances other Darwin Tree of Life collaborators.


https://doi.org/10.21956/wellcomeopenres.23438.r87955
© 2024 Satrio R. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.