The genome of the endangered Macadamia jansenii displays little diversity but represents an important genetic resource for plant breeding

Abstract Macadamia, a recently domesticated expanding nut crop in the tropical and subtropical regions of the world, is one of the most economically important genera in the diverse and widely adapted Proteaceae family. All four species of Macadamia are rare in the wild with the most recently discovered, M. jansenii, being endangered. The M. jansenii genome has been used as a model for testing sequencing methods using a wide range of long read sequencing techniques. Here, we report a chromosome level genome assembly, generated using a combination of Pacific Biosciences sequencing and Hi‐C, comprising 14 pseudo‐molecules, with a N50 of 52 Mb and a total genome assembly size of 758 Mb of which 56% is repetitive. Completeness assessment revealed that the assembly covered −97.1% of the conserved single copy genes. Annotation predicted 31,591 protein coding genes and allowed the characterization of genes encoding biosynthesis of cyanogenic glycosides, fatty acid metabolism, and anti‐microbial proteins. Re‐sequencing of seven other genotypes confirmed low diversity and low heterozygosity within this endangered species. Important morphological characteristics of this species such as small tree size and high kernel recovery suggest that M. jansenii is an important source of these commercial traits for breeding. As a member of a small group of families that are sister to the core eudicots, this high‐quality genome also provides a key resource for evolutionary and comparative genomics studies.


| INTRODUCTION
Macadamia is a recent domesticate with a complex domestication history (Peace, 2005). The four currently recognized Macadamia species are endemic to the central coast of eastern Australia (Mast et al., 2008).
However, macadamia was domesticated in Hawaii around 100 years ago, with most of the global production based upon the Hawaiian domesticated germplasm (Hardner, 2016). Macadamia is a member of the Proteaceae family, one of a group of families that are sister to the core eudicots (Christenhusz & Byng, 2016;Gross & Weston, 1992).
Macadamia was the first Australian native plant to be widely grown as a food plant (Peace et al., 2013). All the Hawaiian macadamia cultivars have been reported to be based upon only a few or possibly even a single tree from Australia . The resulting narrow gene pool makes it susceptible to disease and climate change, whereas the unexploited wild macadamia germplasm of Australia provides an opportunity for great improvement of this newly domesticated crop. Despite a rapid international increase in macadamia production, breeding is restricted because of lack of genomic information .
Macadamia remains the most widely grown Australian native food crop (Peace et al., 2013). Macadamia production was valued at USD 1.17 billion in 2019, and production is expected to grow at a rate of 9.2% from 2020 to 2027 (https://www.grandviewresearch.com/ industry-analysis/macadamia-nut-market). Among the Macadamia species, M. integrifolia, the species from which most of the domesticated gene pool is derived (Hardner, 2016), was the first genome to be sequenced . This genome, of cultivar HAES 741, has supported initial efforts at genome based breeding (O'Connor et al., 2018) and has recently been upgraded to chromosome level with a contig N50 of 413 Kb (Nock et al., 2020). The other species that has been a contributor to domesticated germplasm, M. tetraphylla, has also been sequenced with contig N50 of 1.18 Mb (Niu et al., 2020).
All species are rare in the wild, but M. jansenii is endangered and is only found in a limited area to the north-west of Bundaberg, Queensland (Hayward et al., 2021;Shapcott & Powell, 2011).
Macadamia jansenii is endangered under the Australian (EPBC) Act and critically endangered under the Queensland (Qld Nature Conservation Act) legislation (Gross & Weston, 1992). Due to the expected low heterozygosity associated with the extremely small population size, this species has been used as a model to compare available genome sequencing technologies (Murigneux et al., 2020;Sharma et al., 2021). Macadamia jansenii has been sequenced (Murigneux et al., 2020), using three long read sequencing technologies, Oxford Nanopore (PromethION), PacBio (Sequel I), and BGI (Single-tube Long Fragment Read). The genome was recently updated by sequencing using the PacBio HiFi sequencing (Sharma et al., 2021). Here, we report chromosome level assembly of the same genotype using Hi-C and annotation of the genome. This provides a platform that allows analysis of key genes of importance in macadamia breeding, a reference genome in this group of angiosperms, and insights into the impact of rarity on plant genomes.
This high-quality reference genome also provides a platform for analysis of three unique attributes of macadamia, the high levels of unusual fatty acids (Hu et al., 2019), high cyanogenic glucoside content , and the presence of a novel anti-microbial peptide (Marcus et al., 1999). The fatty acid, palmitoleic acid (16:1), is found in large amounts in macadamia and has been considered to have potential human health benefits (Solà Marsiñach & Cuenca, 2019;Song et al., 2018). Cyanogenic glycosides in plants are part of their defense against herbivores. However, the highly bitter nuts of M. jansenii are not edible, and the use of this species in macadamia breeding will require selection to ensure high levels of cyanogenic glycosides are avoided. Identification of the associated genes could assist by providing molecular tools for use in breeding selection. A novel antimicrobial protein was reported in the kernels of M. integrifolia (Marcus et al., 1999). These small antimicrobial proteins were found to be produced by processing of a larger pre-cursor protein. As fungal infection and insect herbivores are major hurdles in macadamia production (Dahler et al., 1995;Marcus et al., 1999;Nock et al., 2016), retention of the antimicrobial protein and cyanogenesis in some parts of the plant may be important. Analysis of candidate genes for these traits may assist in understanding and manipulating them in macadamia breeding. The assembly and annotation of the genome presented here will allow characterization of the role of these novel genes and facilitate their use in breeding.

| Genome sequencing and assembly
A pseudo-molecule level genome assembly of PacBio contigs (Murigneux et al., 2020) was produced using Hi-C. The estimated genome size of M. jansenii has been reported to be 780 Mb (Murigneux et al., 2020), and the size of the final Hi-C assembly was 758 Mb, comprised of 219 scaffolds with an N50 of 52 Mb (Table 1).
Of this, 97% was anchored to the 14 largest scaffolds representing the 14 chromosomes ( Figure S1; Table S1). The statistical summary of self-interacting genomic region analysis or topologically associating domain (TAD) analysis at three different resolutions is given in Tables S2 and S3. Comparison of the PacBio assembly with the Hi-C chromosome assembly shows the number of scaffolds decreased from 762 to 219 and the length of the longest scaffold increased sixfold (Table 1). The L50 reduced from 135 to 7 scaffolds, and the N50 was improved from 1.58 to 52 Mb.

| Assembly completeness and repeat element analysis
The completeness of the M. jansenii assembly was assessed by Benchmarking Universal Single-Copy orthologs (BUSCO) analysis (Simão et al., 2015). This analysis revealed 97.1% complete genes (single and duplicated) in the Hi-C assembly (

| Structural and functional annotation
A total of 31,591 genes were identified in the repeat-masked Hi-C M. jansenii genome using a homology-based and RNA assisted approach. The average length of the genes was 1,368 bp (Table 3).
Of a total of 31,591 transcripts, only 22,500 sequences (71%) were annotated by BLAST2GO ( Figure S2). Then, the transcripts were functionally annotated using Gene Ontology (GO) terms to assess the potential role of the genes in the M. jansenii genome. The most abundant M. jansenii specific gene families were organic cyclic and heterocyclic compounds among the molecular function; organic and cellular metabolic among the biological process; and proteincontaining binding membrane and intracellular organelle among the cellular component ( Figure S3). The annotation edit distance (AED) score plot, which calculates how well the predicted genes agrees with the external evidence, is given in Figure S4, where zero represents the perfect evidence match and one represents no support for the predicted genes. The comparison of the three Macadamia genomes, assembled so far, showed M. jansenii has the highly continuous assembly with highest number of BUSCO genes (Table 4).

| Cyanogenic glycoside genes
Analysis of genes of cyanogenic glycoside metabolism detected a total of 11 putative genes with high homology in the M. jansenii genome.
These genes were distributed throughout the genome. The largest number of these genes (five) is encoded by CYP79 which is involved in the first step of the cyanogenesis, responsible for conversion of amino acids to oxime. In contrast, only one gene of CYP71 was found which is responsible for conversion of oxime to hydroxynitrile (Table S4).

| Fatty acid metabolism genes
This study identified the key enzymes involved in fatty acid biosynthesis: elongases (e.g., KAS, FATA, and FATB) and desaturases (e.g., SAD). A total of 12 of these genes were found in the M. jansenii genome. Stearoyl-ACP desaturases (SAD) which convert 18:0 to 18:1 were found to be abundant with five genes present. Ketoacyl-ACP-synthase (KAS), which is responsible for elongation of butyryl-ACP from a 4 to 14 carbon chain, was found to be two in number (Table S5).

| Heterozygosity and genetic diversity
To study the genetic diversity within the species, resequencing was undertaken for eight accessions including the one for which the genome assembly was generated.   (Table S8).

| Genome duplication
The MCScanX tool identified 6,303 genes ($20%) of the total genes as WGD/segmental genes. Among all the types of duplicated genes, dispersed genes (15,875 genes) were found to be highest in number, followed by WGD, singleton (3,183 genes) and tandem (2,966 genes).
The proximal gene pairs which are separated by only a few genes on the chromosomes were found to be lowest in number, as 2,573 (Table S9). The distribution of the duplicated genes is shown in Figure 2, where pseudo-molecules 3 and 10 represents a relationship based upon duplicated genes. The whole genome duplication (WGD) tool computed Ks distribution graph with a peak at 0.3 ( Figure S7).
This shows that M. jansenii genome has undergone a single whole genome duplication event about 20 Million years ago.

| DISCUSSION
A major constraint to the use of M. jansenii for commercial breeding is the risk of an inedible kernel due to high levels of toxic cyanogenic glycosides. Cyanogenic glycosides have been observed in all the four species of Macadamia. However, the concentration varies at different developmental stages (Castada et al., 2020). Even the edible cultivars derived from M. integrifolia have genes involved in the cyanogenic glycoside pathway . However, cyanogenic glycoside levels are extremely low in the kernel of the commercially important species M. integrifolia and M. tetraphylla (Dahler et al., 1995). The high level of bitterness in the seeds of M. jansenii may be associated with high concentrations of cyanogenic glycosides. Knowledge of these genes will support efforts to avoid their transfer to domesticated Macadamia when using M. jansenii as a source of other desirable genes.
Plants may produce antimicrobial proteins as part of their defense against microbial attack. Macadamia seed might have antimicrobial proteins that protect them against attack when germinating in the warm and moist rainforest environment. A new family of antimicrobial peptides, MiAMP-2, was discovered in the seeds of M. integrifolia (Marcus et al., 1999). In addition to antimicrobial properties, these seed storage proteins are homologous to vicilin 7S globulins and have F I G U R E 2 Chromosomal distribution of duplicated genes of M. jansenii, generated using SynVisio. The parallel horizontal lines represent the 14 pseudo-molecules of M. jansenii genome, with connected ribbons representing the duplicated genes been identified as putative allergens (Rost et al., 2016(Rost et al., , 2020. A cDNA sequence, from M. integrifolia, encoding these proteins, MiAMP-2, has been reported to contain four repeat segments, with each segment comprised of cysteine rich motifs (C-X-X-X-C-[10 to12] X-C-X-X-X-C), where X is any other amino acid residue. Although only a single gene was found in the M. jansenii genome, it encoded a protein with four domains that correspond to the previously reported antimicrobial peptides, suggesting that four copies of the peptide could be derived from each translation of this gene. This is the first report of a gene structure for the macadamia anti-microbial peptide with a single intron. This gene has potential for wide use as an antimicrobial protein in plant defense.
Macadamia oil has a unique composition being 75% fat, 80% of which is monounsaturated, for exampe, oleic oil (C18:1) 55-67%, This is currently the most complete genome sequence available for a macadamia and any member of the more than 1,660 Proteaceae species (Christenhusz & Byng, 2016) making a useful contribution to the goal of sequencing plant biodiversity (Lewin et al., 2018). The Proteaceae belongs to the basal eudicot order Proteales, a sister group to core eudicots (Chanderbali et al., 2016;Drinnan et al., 1994).
Among the basal eudicots there are few well characterized genomes.

| DNA and RNA isolation
Leaf tissue was coarsely ground under liquid nitrogen using a mortar and pestle and further ground under cryogenic conditions into a fine powder using a Tissue Lyser (MM400, Retsch, Germany). All accessions were used for DNA isolation. DNA was extracted as per an established method (Furtado, 2014) with minor modification where phenol was excluded from the extraction method. DNA was extracted from 2-3 gm of leaf tissue and dissolved in up to 400 μl of TE buffer.

| Chromosome level assembly
4.3.1 | Chicago library sequencing and sequencing DNA was isolated as per an established method (Furtado, 2014).
Then, the library was prepared as described in Putnam et al. (2016).
Briefly, $500 ng of HMW gDNA was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested with DpnII, the 5 0 overhangs filled in with biotinylated nucleotides, and then, free blunt ends were ligated. After ligation, crosslinks were reversed, and the DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments.
The DNA was then sheared to $350 bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeqX platform to produce 213 million 2 Â 150 bp paired end reads, which provided 88.11 Â physical coverage of the genome (1-100 kb pairs).

| Dovetail Hi-C library preparation and sequencing
A Dovetail Hi-C library was prepared in a similar manner as described previously (Lieberman-Aiden et al., 2009). Briefly, for each library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DpnII, the 5 0 overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed, and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to $350 bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters.
Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeqX platform to produce 156 million 2 Â 150 bp paired end reads, which provided 3,601.74 Â physical coverage of the genome (10-10,000 kb pairs).

| Scaffolding the assembly with HiRise
The input de novo assembly, shotgun reads, Chicago library reads, and Dovetail Hi-C library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al., 2016). An iterative analysis was conducted. First, Shotgun and Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and to make joins above a threshold. After aligning and scaffolding Chicago data, Dovetail Hi-C library sequences were aligned and scaffolded following the same method. After scaffolding, shotgun sequences were used to close gaps between contigs.

| Re-sequencing
To study the genetic diversity within the species, re-sequencing of the eight different genotypes was performed on the DNBseq platform (Drmanac et al., 2010

| Genome assembly quality evaluation and repetitive element evaluation
The completeness of the genome assembly was evaluated by checking the integrity of the protein coding genes in the Hi-C assembly using Benchmarking Universal Single-Copy Orthologs (BUSCO) (version v5.0.0) analysis with eudicot odb10 dataset with 2,326 genes.
Repetitive elements in the Hi-C assembly were identified de novo and classified using RepeatModeler (version 2.0.1). The repeat library obtained from RepeatModeler was used to identify and mask the repeats in the Hi-C assembly file using RepeatMasker (Version 4.1.0).

| Structural annotation and functional annotation
The prediction of the protein coding genes in the repeat masked genome was carried out using ab initio and evidence-based approach.

| Gene families
To identify the anti-microbial genes in the genome, BLAST homology search was performed to identity transcripts similar to the M. integrifolia antimicrobial cDNA (MiAMP2, GenBank: AF161884.1) (Marcus et al., 1999). Then, sequence alignment was undertaken using Clone Manager ver 9.0 (SciEd, USA). Multiple Alignment was undertaken using a reference sequence as indicated in the results and alignment parameter scoring matrix of Mismatch (2)  Accession-specific polymorphic SNP sites present in up to seven of the eight accessions were determined as outlined below.
Accession-specific SNP sites (homozygous and heterozygous), identified by filtering "sample Count" for "≤7," were represented as a percentage of the M. jansenii genome size (780 Mb). Accession-specific unique polymorphic SNP sites, defined as those variant sites present only once in any of the eight accessions, were identified by filtering the "sample Count" for "≤1."

| Genome duplication
MCScanX (Wang et al., 2012) was used to identify whole-genome duplication (WGD)/segmental along with tandem, dispersed, singleton, and proximal duplication on the M. jansenii genome. An allversus-all BLASTP was performed (E value: 1eÀ10, max target sequences: 5 and m6 format output), for the M. jansenii wholegenome protein sequences. The duplicate gene classifier and MCScanX program was executed using the M. jansenii genome annotation (gff file) along with the BLASTP output file using the default parameters. The collinearity file (MCScanX output file) and the genome annotation file were used to generate the synteny plot using SynVisio toolkit (Bandi & Gutwin, 2020). The Ks distribution graph was generated using WGD tools (Zwaenepoel and van de Peer, 2018).
Timing of WGD was estimated as described by Magall on et al. (2015).

This project was funded by the Hort Frontiers Advanced Production
Systems Fund as part of the Hort Frontiers strategic partnership initiative developed by Hort Innovation, with co-investment from the