A genomic toolkit for ‘the soybean of the tropics’ – winged bean (Psophocarpus tetragonolobus)

A sustainable supply of plant protein is critical for future generations. We also need to reduce carbon emissions from agriculture, while increasing resilience in the face of elevated temperatures and climate volatility. Agricultural diversi�cation with more crop options is hampered by limited available genomic resources, lack of understanding of the genetic structure of breeding germplasm and important trait inheritance-needed to breed desired ideotypes. Winged bean (Psophocarpus tetragonolobus), a high protein seed tropical legume, has been termed ‘the soybean for the tropics’ with potential to help reduce soybean imports in many tropical and semi-tropical countries. Here we present the �rst chromosome level winged bean genome assembly (98.9% of 536.1MB scaffolds are assigned), an investigation of genetic diversity of 136 worldwide accessions for breeding, together with the �rst two genetic maps and a trait QTL analysis for genomic regions with desirable ideotype traits for breeding; namely architecture, protein content and phytonutrients.


Introduction
Winged bean (Psophocarpus tetragonolobus (L.) DC.), also known as goa bean, is a dicotyledonous (2n = 2x = 18) legume belonging to the Fabaceae family and Papilionoideae subfamily.It is the only domesticated species of genus Psophocarpus and has three pairs of short and six pairs of long chromosomes 1 .It is cultivated mainly in hot, humid, equatorial countries of Southern Asia, Melanesia and the Paci c region.It is grown as a vegetable for its immature pods and for leaves and for tuberous roots, with consumption preferences differing geographically 2 .Its ability to x atmospheric nitrogen with a wide range of Rhizobium 3,4 makes it an ideal horticultural crop for more sustainable agriculture particularly in the humid and sub-humid tropics.Nevertheless, its uncontrolled branching habit, indeterminate growth and inconsistent productivity all limit current large-scale utilisation.Our previous work has suggested that an optimal trade-off could be achieved between vegetative growth and yield (pod number and seed number), for example, by controlling branch number 5 .Individuals with fewer but longer branches could maintain high pod productivity, while potentially reducing the vegetative biomass.
Its description as 'a soybean of the tropics' was coined in the 1980s after work by the US National Academy suggested it had potential to partially replace soybean in tropical countries 6 .In particular, its mature seeds are a good source of protein and other nutrients.Mature dried winged bean seeds are reported to contain between 16.4 to 21.3% lipid and 34.3 to 40.7% crude protein with a favourable amino acid balance and micronutrient composition 7,8 .However, its dietary diversi cation and nutritional potentials have been hampered by the crop architecture and the 'hard-to-cook' phenomenon, with the presence of anti-nutritional factors (ANFs), such as hemagglutinating lectins, trypsin and chymotrypsin inhibitors, and tannins 9 , in the seed.The crude protein content in the seeds of 25 accessions from the International Institute of Tropical Agriculture (IITA) ranged from 28.4 to 31.3% with tannin and phytate varying signi cantly 10 .Some of the issues caused by antinutritional factors can be overcome through different cooking and processing methods 11,12 .Coupled with architectural changes and breeding selection for high protein and/or high lipid, genetic improvement could widen winged bean production from a purely horticultural crop to an arable crop for its seed protein, also helping to reduce dependency on soybean imports for many countries and the environmental impact associated with soybean production expansion.
Genome-assisted breeding improvement strategies can lead to genetically improved winged bean to contribute to food and nutritional security.However, the lack of molecular resources has hampered a better understanding of the diversity and phylogenetic relationships of germplasm conserved ex situ.Using Genotype-by-Sequencing (GbS) we have evaluated the genetic structure of a wide range of global accessions, developed two controlled cross genetic maps and performed the rst QTL analysis of architectural, protein and a nutritional trait.We also present the rst chromosome-scale genome assembly through a combinatorial approach integrating Illumina and Nanopore reads with BioNano optical mapping and within species genetic maps.

Genome assembly and annotation
Ma3, a cultivar (renamed from the previously reported M3 5 ) released by Malaysian Agricultural Research and Development Institute (MARDI) was selected for genome sequencing.Our approach of using Oxford Nanopore Technology (ONT) reads corrected by Illumina reads generated 1,164 scaffolds with the largest being 19.99 Mb in length.An improvement was further made with the addition of BioNano optical mapping, yielding a total of 48 hybrid scaffolds with the largest scaffold being 38.64 Mb and a N50 at 23.88 Mb (Table 1 and Table S1).A total of 94.6% of Illumina reads were successfully remapped back to the assembly.Subsequently, SNP markers in two genetic maps derived from Genotype-by-Sequencing (DArTSeq™) were used to assign scaffolds for the nal assembly at a pseudo-chromosome level, spanning 530.28 Mb in total.This assembly accounts for 98.9% of the assembled scaffolds, with only 5.85 Mb of scaffolds left unassigned.The genetic maps derived from Cross XT (F 2 = 184) and Cross XB (F 2 = 223) have Ma3 as the common paternal parent and are reported for the rst time here (Fig. 1a).The genetic map of Cross XT comprises of 692 high quality spaced SNP markers (with no missing data across the whole population) covering 1,102.1 cM (an average interval of 1.6 cM), the XB map comprises of 393 spaced SNP markers covering 1,336.9cM (3.4 cM average distance) (Table S2).The nine anchored pseudo-chromosomes ranged from 24.61 to 84.79 Mb in length (Table S2).Of the SNP markers present in the linkage maps, 97.4% and 95.4% SNPs, respectively, could be mapped back to the genome assembly.From the full DArTseq™ dataset derived from the mapping and diversity analysis, 95.7% of SNP (n = 21,412) and 77.2% (n = 40,557) of presence-absence variants (PAV/in silico DArT), respectively, had their sequence tags mapped to the nal genome (Fig. 1b).Repetitive elements represented 37.7% (n = 242,355) of the winged bean genome, with the majority being Terminal Inverted Repeat elements (TIR; 49.02%) followed by Long Terminal Repeat elements (LTR; 28.66%) (Fig. S1).The Class I/Class II transposable elements (TE) ratio was observed to be 0.81 (Table S3 and S4).Of the dominant Long Terminal Repeat (LTR) in Class I retrotransposons, the Gypsy superfamily was six times more abundant than Copia-type (Fig. S1).
Other than de novo gene prediction and homology comparison, initial transcriptomic resources from leaf, root, pod, and reproductive tissue (comprising of bud and ower) 13 were also included, assisting in 26,370 protein coding genes being annotated.Of 1,614 Embryophyta larger group single-copy core genes from both Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis, 84.1% complete conserved genes were recovered with 9.5% as missing and 6.4% are fragmented (Table 1).Additionally, nearly 80% of the annotated genes have an Annotation Edit Distance (AED) score ≤ 0.3 14 (Fig. S2a), indicating a high degree of quality and continuity for the current annotation.These gene models have an average length of 4,203 bp and an average coding-sequence length of 1,396 bp (Fig. S2b), with an average of ve exons per gene.As many as 1,551 of them are transcription factors, accounting 5.9% of the total annotated genes.Together, transcription factor classes MYB and bHLH are dominant and accounted for 17.6% of the identi ed transcription factors (Table S5).Additionally, 10,303 small nucleolar RNAs and 446 tRNA genes were identi ed.
The previously reported genome size of winged bean varies widely; Vatanparast and colleagues 15 have suggested it to be 1.22 Gbp/1C in size whereas previous ow cytometry has estimated it to be 782 Mb (or 0.8 pg)/1C 16 .Our K-mer distribution analysis has predicated a smaller genome size at 569 Mb although GC-rich regions and repetitive sequences might have been underestimated 17 (Fig. S3).This would also explain a lower proportion of repetitive elements in the winged bean genome, as compared to 49.5% in cowpea genome 18 and 41% of TE in P. vulgaris genome v2.1 19 .The genome sequenced material has around 5.4% heterozygosity detected from the 21,412 gene-enriched SNPs, with 49.9% of DArTseq SNPs located within coding regions.Overall, the current reference genome covers 93.2% of the K-mer estimated genome size of 569 Mb and is expected to provide comprehensive coverage of the coding regions of winged bean.By comparing the extent of protein-coding gene homology (at e-value < 1e − 10 ), it was observed that winged bean shares 87.2% with Glycine max (Wm82.a4.v1) 20 , 86.5% with Phaseolus vulgaris (Pvulgaris v2.1) 21 and 87.0% with Vigna unguiculata (Vunguiculata v1.2) 18 .The majority of the genes are conserved between the species, as illustrated in Fig. 2. for example.However, among highly conserved orthologues identi ed on Pt01 nd Pt07(at e = 0) 549 and 496, respectively, we did nd genes distributed across all 11 chromosomes of common bean, suggesting extensive chromosomal rearrangement or gene translocation in winged bean compared to common bean (Fig. S4).Even with the smallest Pt09 pseudochromosome, blocks of gene collinearity were found on eight Pv chromosomes.Taken together, our approach has demonstrated a cost-effective means for generating species-speci c genomic landscapes, particularly for crops lacking well annotated closely related species, such as winged bean.The strategic sequencing of individual species within clades where no current crop genome exists could allow progress to be made in multiple species.

Winged bean germplasm diversity
For any genomics-based breeding effort to attain maximum genetic gain, an understanding of the genetic relationships among the available breeding germplasm is needed.The genetic diversity of winged bean germplasm was investigated using 171 individuals originating from 16 countries (Table S6) based on GbS (DArTSeq) markers.Using 7,791 SNP [with minor allele frequency (MAF) > 0.01], the level of heterozygosity ranged from 0.4 to 56.6% with a median value at 6.1%, largely re ective of a self-pollinating species (Table S6).Nevertheless, the presence of individuals with high heterozygosity implies the possibility of outcrossing and is consistent with our observations of bees and other insects visiting owers in open eld conditions.As high as a 16.4% average out-crossing rate has been reported from 457 Thai accessions evaluated using 14 Simple Sequence Repeat (SSR) markers 22 .
Bayesian clustering and phylogenetic analysis [Neighbour-Joining (NJ) tree] and Principal Coordinate Analysis (PCoA) results all closely corresponded to each other and suggest the existence of three major gene pools (CP, OT, LT) (Fig. 3 and Fig. S5).Based on the membership coe cient ≥ 0.7, approximately a quarter of these lineages show allelic admixture.As previously reported 2,23 , the clustering does not correlate well with the country of origin, which might re ect seed exchange among genebanks with limited documentation or historical exchange with subsequent collection capturing the secondary collection sites, although broader regional clustering can be seen (Fig. S5b).Comparing among the countries having more than ve accessions included in this study, the materials originating from Colombia and Thailand (44.4% out of nine accessions and 45.2% out of 31 lines, respectively, excluding biological replicates) show high levels of admixture.This suggests possible in-eld hybridisation events between accessions from genetic exchanges of germplasm.
This has further supported our previous hypothesis that the germplasm were most likely introduced from Asia to South America given that winged bean is not been reported as a native or traditional crop consumed in South America 2 .In contrast, potentially more limited diversity and possible founder effects might be re ected in the germplasm collected from India and the Philippines (Fig. 3d).
The CP subpopulation, being the smallest cluster, is composed of mainly Colombian and Papua New Guinean accessions whereas the LT subpopulation predominantly consists of accessions collected from Sri Lanka and Thailand.It is also interesting to note that for 57 accessions, replicate samples from the same breeding accession based on pedigree information were available.However, 44% (25 out of 57) of these were found not to be highly similar when using molecular markers.Although mislabelling at some stage cannot be ruled out, outcrossing might have also contributed to the differences between genotypes in or from the same accession.As such, our ndings could help breeders by emphasising the need for quality assurance for breeding programmes and to test pedigree information, as well as providing insights into some of the publicly available germplasm.
Analysis of Molecular Variance (AMOVA) reveals that the genetic variance observed is not only attributed to the differences between individuals within population (45.7%) but also between populations (30.9%).
Comparatively, genetic differentiation was lower within individual accessions, accounting for 23.4% of the total variance.Nevertheless, the population differentiation was signi cant at all levels (Table S7).The moderate, yet greatest, difference (0.4479) in pairwise Nei's G ST between the CP and LT subpopulations is in line with the population structure analysis as the CP and OT population clustered together when K = 2 (Table S8).Similarly, the rst principal component (Coordinate 1 = 39.0%variation explained) in the PCoA distinguishes CP and LT subpopulations whereas the second component (Coordinate 2 = 23.6%)reveals the divergence between OT and CP subpopulations (Fig. 3b).However, genetic differentiation among OT and LT subpopulations was relatively low at 0.1792 (Table S9).
Both Shannon-Wiener index (H) and Simpson's index (lambda) suggested that the OT subpopulation (n = 70) had the highest genotypic richness or genetic diversity among all populations (Table S9).Consistent with this, the expected heterozygosity (H e ) from Nei's Index suggests a stronger selection signature or founder effect in the CP subpopulation (H e =0.102) in comparison with OT (H e =0.247) and LT (H e =0.134) subpopulations (Table S9).The genetic richness preserved within the OT gene pool might re ect different domestication selection criteria for the diverse traits in this crop, due to its various edible parts and their varying consumption preference in different countries (Tajima's D = 1.0204).For example, tubers are mainly consumed in Myanmar, whereas young pods are preferred in Malaysia.On the other hand, the observed Tajima's D value of -0.3394 indicates that the CP cluster could be derived from directional selection pressure thus driving the divergence, while balancing selection or population contraction could have happened in OT and LT gene pools (Tajima's D = 1.0204 and 0.6142, respectively) with OT being more severely affected (Table S9).

Linkage disequilibrium (LD) decay
The genome-wide linkage disequilibrium decay rate of winged bean was estimated to be in the range of 76 to 113 kb based on GbS SNPs when the MAF was set at > 0.01 and > 0.05, respectively.As reported in a peanut (Arachis hypogaea) mini-core collection that consisted of 107 accessions and also in 632 lines of a temperate and tropical maize collection, LD is affected signi cantly by MAF selection 18,24 .At MAF > 0.05, 6,149 SNP were distributed across the genome with approximately 683 SNP per chromosome on average, with Pt08 showing the fastest LD decay (Fig. S6 and Table S10).Only 0.8% of SNP pairs showed high LD (r 2 ≥ 0.8).This would argue that while winged bean is predominantly autogamous, outcrossing is still a major contributor to current genetic diversity.This LD decay rate is similar to cultivated pigeonpea (Cajanus cajan).
When compared to other closely related autogamous legumes such as soybean (250-375kb) 25,26 , common bean (240-250kb) 27,28 and cowpea (809-4,705 kb) 29 , LD decay in winged bean is more rapid.High LD decay could indicate that phenotypic evaluation of this set of germplasm can identify close gene-trait associations through association analysis, although larger numbers of markers will also be needed.

Quantitative Trait Locus (QTL) analysis of XB population
As a rst step in breeding and trait analysis we developed a number of controlled crosses with the Ma3 genotype as the common parent, focusing on architecture and pigmentation.Two of these populations were used for genetic map construction to assist in scaffold assignment and orientation to generate the physical map of winged bean (as mentioned above).The segregation for morphological traits in the XB F 2 between FP15 and Ma3 has previously been reported 5 .For this cross, a QTL analysis was also conducted (illustrated in Fig. 4) and is summarised in Table 2 and Table S11.To test the utility of the genome resources generated, we then investigated the inheritance of several traits of potential breeding interest in the XB population in additional to those previously reported 5 .Variation in seed storage proteins (SSP) compared to soybean The review by Kadam et al in the 1980s reported that winged bean seeds has a comparable essential amino acid composition to soybean 7 .From this base, molecular breeding could enhance protein composition and content.It was also reported that protein components in the winged bean seed are predominantly vicilin-type 7S globulin occupying approximately 31.9% of the protein extract, with a lack of 11S legumin-type 30,31 .In addition to the 18 genes putatively encoding 7S globulin identi ed from the genome sequence (Table S12), we also found three genes (Psote01G0072700, Psote03G0109900 and Psote03G0110000) homologous to the 11S globulin in soybean, Medicago truncatula and chickpea 32 .Of these, two are approximately 6Mb from the qSSP-1 QTL.Given that legumin contains a higher proportion of sulphur-containing amino acids than 7S vicilin/convicilin 33 , it is worth investigating whether these genes are functional or are being expressed at low levels, so were undetectable by previous methodologies.Additionally, 2S albumin was reported to vary among winged bean cultivars 34 , another component rich in sulphur in common bean 35 .It was reported to be the second most abundant protein fraction occupying 26.5% of total protein in winged bean.Quantifying sulphur content in each protein fraction could provide further targets for winged bean improvement for good nutritional quality and amino acid balance (Table S12).
In contrast to soybean, developmental stage characterisation has suggested that protein bodies in winged bean seeds accumulate at a late stage of maturation (45 days after owering) when starch granules are decreasing rapidly 36,37 .The protein pro les characterised by Makeri and colleagues further indicated that the N content in the globulin fraction is highest in winged bean seed, but lowest in soybean globulins, but also that N content is signi cantly different from other protein fractions in both crops 31 .These observations, in turn, might suggest that the targets for genetic improvement in seed protein content could be different between winged bean and soybean.
The paternal Ma3 genotype had a measured average of 34.0 ± 1.8% (n = 14) total seed protein content, whilst the maternal FP15 had 39.3 ± 3.2% (n = 9) although a wider difference has been observed between other lines from other studies 6 .Among the F 3 seed from the XB cross and assuming this represents the maternal genotype (n = 161 lines evaluated), transgressive segregation was observed in the population, ranging from 27.2 to 44.6% total protein per seed (Fig. 5).In other legumes such as pigeonpea (Cajanus cajan), soybean and pea (Pisum sativum) similar seed protein segregation was reported, although 24.6% in pigeonpea and 33% in pea were considered as high protein genotypes 38,39,40 and are lower than seen here in winged bean.A signi cant QTL explaining 10.8% protein variation was detected on Pt03.1, and three globulin genes were identi ed within two of the three putative QTL, with the QTL cumulatively explaining 18.7% of the total phenotypic variation (Table 2).Given the high genetic complexity underlying the accumulation of seed protein -for example, 33 consensus amino acid content QTL were identi ed in soybean 41 -it is not surprising that candidate gene identi cation is not straightforward (Table S13).Nevertheless, Psote03G0171400 (qSSP-1) and its paralogue Psote02G0416500 (approximately 2.3Mb from qSSP-4) are homologous to xylem bark cysteine peptidase 3 (P34, Glyma.08G116300), which is the second highest abundance transcript detected in all four stages of soybean seeds during storage protein accumulation 42 .Although this is not a well characterised gene, it is highly expressed at 40 days after owering (DAF) in soybean and associated with enriched GO terms 'transmembrane transport' and 'nutrient reservoir activity' 43 .Furthermore, using the same accessions, the authors compared their DEGs with the Harada-Goldberg soybean RNA-seq dataset (http://seedgenenetwork.net) comprising of 15 tissues with this gene found to be tissue speci c and expressed in an early maturation of seed stage 42 .Although the soybean orthologues of Psote03G0156500, Psote03G0160400 and Psote03G0160600 (qSSP-1) are not tissue speci city, they were expressed highly in all four soybean seed developmental stages 42 .In addition, their orthologues in the common bean have been identi ed as the target genes of microRNAs expressed in the developing seed at 20 days after anthesis (DAA), considered to be the mid-lling/maturation stage.The genomic sequences surrounding the SNP of anking markers detected which detected QTL for protein content (and other traits) could facilitate marker assisted selection (MAS) (presented in Table S14).Overall, from this initial work, it is clear that there is potential within winged bean germplasm to select for a high protein legume seed crop, either through conventional approaches or MAS.
At the moment, winged bean is primarily a horticultural crop, grown on trellis structures and manually harvested.For large-scale seed production, the crop could be grown as a creeping vine on the at (assuming that owering and pod development is not interrupted) or changes in the architecture of the plant could be bred for, to facilitate mechanisation of sowing and harvesting or even to optimise intercropping with complementary amino acid sources, such as cereals.

Altering architectural traits to improve mechanisation
Our previous work 5 suggested that fewer but longer branches could achieve the best trade-off between yield and vegetative biomass.Both branch length and total number of branches traits had a signi cant positive correlation with pod number (r s = 0.44, p < 0.001 and r s = 0.28; p < 0.01, respectively).Selection for these traits could therefore have major impact on plant architecture and yield of winged bean.There could have been selection for this trait during domestication, given the observed reduction in the higher orders of branches in the domesticated species, compared to closely related species like P. scandens (Tanzi AS, pers.comm.).Our QTL analysis has further suggested a putative qNoB (branch number QTL, measured from branches > 10cm in length within the rst ten nodes) that co-localises with a signi cant total branch number QTL (qTNoB-1; LOD 4.4, 21.1% PVE, counted from the entire main stem) with the former having a narrower 2-LOD range (Chr03.5:13,353,743 .. 24,206,688), implying that yield variation in this material may be contributed to by branches departing from upper stem nodes.Although lower branch number yet higher branch length could be a better selection criterion for yield, the branch length QTL (qLoB) is currently putative at LOD 3.6.The Arabidopsis CUC2 orthologues (Psote01G0176400 and Psote03G0433200) both located within the QTL (qTNoB-1/qNoB and qTNoB-2) are promising candidate genes that could be examined further (Table S15).Li and colleagues have recently discovered the CUC2/3-DA1 UBP15 regulatory module and its role in controlling axillary meristem initiation 44 .The AtUBP15 orthologue (Psote03G0414300) is positioned within the qTNoB-1/qNoB region as well.CUC2 is a member of the NAC domain of transcription factors that de nes organ boundaries and organ numbers 45 .

Inheritance of phytonutrient content -purple pigmentation
The maternal parent of the XB population, FP15, has pods which are near-full or full purple when they approach maturity, while the seeds are purple or near black (Fig. S7).Being a member in the CP subpopulation with extremely low heterozygosity, this line might have undergone active selection by farmers for its distinctive pigmentation.The paternal parent Ma3 (OT subpopulation) has green stems and fully green pods, with the F 3 population (F 2 maternal tissue) derived having pigment segregation in the seed testa and for pod specks.As seed coat is a maternal tissue, it is not surprising that F 1 seeds had dark purple without segregation.Of the F 3 seed screened, 80 had a brown seed coat and 122 were purple.This ratio (χ 2 = 1.411; p = 0.24) could be attributed to two dominant loci with a duplicate recessive epistasis interaction.Two major signi cant QTL (P1 and P2 alleles, Table 2) were detected explaining 61.7% and 58.5% of the phenotypic variation, respectively.Maternal tissue interaction and/or environmental stimulus were speculated to be contributing to the variation in seed coat colour intensity in an Trifolium interspeci c cross 46 .Similar mechanisms could be present in winged bean.
These pigmentation loci (Table 2) demonstrate the complexity in the regulatory network of anthocyanin biosynthesis pathway in a tissue-speci c manner.The pigmentation in pod and calyx showed the same segregation patterns as the seed testa (χ 2 = 0.067, p = 0.80; χ 2 = 0.229, p = 0.63) and exhibited co-dominance and/or an incomplete dominance.Of these traits, the F 1 had purple-green calyx and purple specks on pods.
While the pod pigmentation varied in speck quantity and even expanded to purple 'wings' (pod fringes), no F 2 progeny were identi ed which had as full a purple pod as the maternal parent.Notably, lines with green calyx had brown seed testa, suggesting the lack of anthocyanin expression in both tissues (Fig. S8).In support of this notion, QTL for calyx colour and pod pattern were identi ed also on Pt02 and Pt03, respectively.In soybean, there have been six loci identi ed; I, T, Wp, W1, R and O, producing colours ranging from yellow, green buff and brown to black 47 .Both brown (irT) and black (iRT) genotypes contain proanthocyanidins (PAs) but only black ones contain anthocyanin 47 .
Both putative P1 and P2 loci are on Pt05, approximately 50cM apart (Recombination Fraction = 0.32).PtMYB113 (Psote05G0045200) is the most likely the candidate gene for the P1 locus, despite being 1.2 Mb away from the current QTL range (Table S16), which might be partly attributable to our spaced genetic map.It is homologous to MYB113 in Arabidopsis, the R locus gene (Glyma.09G235100) in soybean 48,49 and VuMYB90-1 in cowpea 50 .PtMYB114 (Psote05G0128800) gene is likely to be the gene for the adjacent QTL peak of the P1 locus (Fig. S9).In contrast to the R2R3 MYB activators of the MBW complex in Arabidopsis, grape (Vitis vinifera), cowpea and cauli ower (Brassica oleracea) which have arisen from tandem duplication and the paralogues remain in very close linkage 51, 52, 53, , PtMYB113 (Chr05.1:4,657,704..4,659,076) and PtMYB114 (Chr05.1:13,552,089..13,553,762) are not co-localised next to each but are still linked.It is deduced from a phylogenetic study that both belong to the Subgroup 6 which is the anthocyanin-related clade (Fig. 6a and Table S17) 54,55,56 .Their orthologues (Fig. 6b and Table S18) are known to be the activators for late biosynthetic genes (LBGs) through interactions at their R3 repeats with a conserved motif ) in the N-terminal MYB-interacting region (MIR) of the bHLH TFs (Fig. 7) 56,57 .The SG6 KPRPR[S/T]F motif is more conserved in PtMYB113 than in PtMYB114 but both are conserved in the anthocyanin-promoting (S/A)NDV motif (Fig. 7) 58,59,54, .Although there is a G/A SNP at the 95th amino acid leading to a change from a hydrophobic glycine to hydrophilic glutamic acid in PtMYB113 in the purple FP15 line, this same mutation was not observed in an additional four accessions, comprising of two purple-seeded and two cream-seeded lines.Similarly, the polymorphism of four SNPs and a C InDel in two introns were not consistently observed in these additional lines.No SNP variation was found at the 5' UTR region of PtMYB113 as well as in the coding regions of PtMYB114 (two SNPs were identi ed in introns) for both parental lines ampli ed with gene-speci c primers (Table S19).A 13kb transposon TgmR* was observed to be inserted into the Intron 2 of the soybean R locus 48 .Despite the gene being expressed at low level, TgmR* in the black seeded soybean is more methylated, preventing transpose binding and thus correct intron splicing could take place to upregulate anthocyanidin synthase genes 58 .A similar scenario may be true in the winged bean P1 locus.(Fig. 7).Instead of ANS, PtANL2a (Psote05G0083500) located at the P1 locus is likely to be the target gene of this TF.In Arabidopsis, the loss-of-function anl2 mutant had reduced anthocyanin levels in the subepidermal cells, in addition to generating an aberrant root structure 59 .Studies from Wei et al. have suggested that the TTG1-dependent MBW complex could activate the expression of ANL2 indirectly although the exact mechanism is yet to be unravelled 60 .This is also supported by the previous nding that all four ANL2 transcripts were up-regulated in winged bean leaf containing higher tannin with the fold-change ranging from 1.2 to 4.3, despite not reaching statistical signi cance 61 .In line with Wei et al.'s nding, the absence of PtMYB113 expression in this transcriptomic study leads us to postulate that its expression could be at an earlier stage for ANL2 activation before avonoid begins to be accumulated.It is also worth noting that the clustering of anthocyanin biosynthesis pathway genes correlates well with high LOD scores observed throughout Chr05.1 (Fig S10).

Discussion
The winged bean genome is of value for molecular breeding improvement as most plant species with sequenced genomes are taxonomically distant, limiting the cross-species use of positional information.
Stefanović and colleagues 63 63 .This rst winged bean genome lays the foundation for crop genome guided breeding improvement programmes, to enhance the wider adoption of the species as a protein source for the future, given the limited conserved synteny with the well annotated P. vulgaris.The high levels of genetic diversity seen within the breeding material diversity study could be explained by: i) ex-situ conservation efforts by the gene banks might have been conducted early, ii) geographical preferences for different edible parts and cropping methods have limited the loss of genetic diversity overall, iii) seed phenotypes have not been as strongly visually distinctive compared to other legumes, particularly with limited variation in seed mottling and hilum pattern, with the major selection focus on the vegetative parts of the plant.
Thus far, our approach in integrating Illumina, ONT, BioNano and genetic mapping have resulted in a pseudochromosome level genome spanning 530.28 Mb.The genome annotation has facilitated the identi cation of a number of candidate genes associated with QTL relevant to ideotype development for breeding selection, in particularly, variability for seed protein content, plant architecture and pigmentation.In addition, it has been possible to identify the regions of the genome containing QTL anking marker variation, potentially allowing the development of simple KASP markers for further selection of desirable architectural, protein and nutritional traits.Selection for alleles of genes associated with these candidates (if validated) could be subsequently utilised for varietal development or anking markers around the QTL of interest could be used directly for breeding.

Materials And Methods
Library preparation of material for whole genome sequencing The common paternal line Ma3 for both Cross XT and Cross XB was selected for whole genome sequencing, based on single-seed descent derived material from the individual plant used during controlled hybridisation.
The plant was grown in Future Crop glasshouse under 28°C day/23°C night temperature cycle (Sutton Bonington, UK).High molecular weight (HMW) genomic DNA was isolated from 100 mg fresh leaves using DNeasy Mini Kit (Qiagen).DNA quantity and quality were assessed by Nanodrop, Qubit dsDNA BR assay (ThermoFisher) and the Genomic DNA ScreenTape assessed (Agilent) prior to library preparation.
ONT library preparation: A rapid library prep was prepared from 400 ng of DNA (ONT; SQK-RAD004) and run over a MinION ow cell (ONT; FLO-MIN106 R9.4) on the GridION X5 mk.A small-scale ligation sequencing library (ONT; SQK-LSK109) was also prepared and run over a MInION ow cell on the GridION.A large-scale ligation sequencing library was subsequently prepared and the entire library prep was loaded onto a PromethION ow cell (ONT; FLO-PRO001) and sequenced on the PromethION platform.
Illumina library preparation: An Illumina-compatible sequencing library was prepared from 500 ng of DNA using the KAPA HyperPlus Kit (Roche) and the KAPA Single-Indexed Adapter Kit, Illumina Platforms, Set B (Roche).Genomic DNA was fragmented for 10 min and 2 cycles of PCR were used during the library ampli cation stage.Post-ampli cation SPRI-based size selection was performed using a 0.7/0.9ratios of AMPure XP beads (Beckman Coulter).The library was quanti ed using the Qubit Fluorometer and the Qubit dsDNA HS Assay Kit (ThermoFisher) and the fragment length distribution was assessed using the Agilent TapeStation 4200 and the Agilent High Sensitivity D1000 ScreenTape Assay (Agilent).Final library QC was performed using the KAPA Library Quanti cation Kit for Illumina (Roche).The library was sequenced on the Illumina NextSeq500 (NextSeq Control Software v2.2.0) using a NextSeq500 Mid Output 300 cycle kit v2.5 (Illumina) to generate 150-bp paired-end reads.

Bionano mapping
High molecular weight genomic DNA was extracted from fresh young leaves using the Bionano Prep Plant Tissue DNA Isolation Kit (Bionano Genomics: 80003) according to the Isolation Base Protocol (Bionano Genomics: Document #30068-RevD).Due to the partially insoluble nature of the DNA quanti cation was not possible so 21 µl (including the bulk of DNA) was taken into labelling.
DNA was labelled with DLE-1 enzyme using the Bionano Prep DLS Kit (Bionano Genomics: 80005) and following the standard protocol (Bionano Genomics: Document #30206-RevF).The labelled sample was quanti ed by Qubit Fluorometer 4 (Thermo Fisher: Q33238) and the Qubit dsDNA HS Assay Kit (Thermo Fisher: Q32854).The average concentration of the labelled sample was 3.36 ng/µl (CV = 0.09).The labelled DNA reaction was run over one ow cell on a Bionano Saphyr Chip (Bionano Genomics: 20319) on the Bionano Saphyr to generate 540 Gbp of data.For analysis, the molecules le was used to generate a de novo assembly using the default Bionano Access settings (Bionano Access: 1.3.0;Bionano Tools: 1.3.8041.8044; Bionano Solve: Solve3.3_10252018;RefAligner: 7915.7989rel;HybridScaffold/SVMerge/VariantAnnotation: 10252018).This assembly was used to generate a hybrid scaffold from the winged bean NGS assembly.The hybrid scaffold was constructed using default settings; con ict resolution was set to 'Resolve Con icts' for both the Bionano assembly and sequence assembly.

Genome Assembly and TE characterisation
Raw ONT reads rstly ltered by quality score (< 7) and read length (< 1k), and then trimmed and corrected by Canu v1.7 64 Processed reads with high qualities were assembled by wtdbg2 with default settings 65 .Assembled contigs were polished twice by Illumina PE reads using Racon 66 and further polished by Pilon v1.22 67 .EDTA pipeline 68 with default settings was used for transposable elements characterisation with the classi cation in reference to Wicker et al. 69 .The Illumina assembly was subjected to K-mer analyses using both Jelly sh 70 and GenomeScope 71 .

Gene Annotation
De novo transcriptome assembly was conducted by Trinity v2.11.0 72 with previously published RNA-seq data 14 (PRJNA374598).Raw assembled transcripts were clustered by CD-HIT 73 and corrected by CAP3 73 .
Collinear blocks are de ned using default parameters in MCScanX 76 and visualised using SynVisio 77 .
Genotype-by-Sequencing (GbS) using DArTseq Genomic DNA was extracted from young leaves using Qiagen DNeasy ® 96 Plant Kit and subsequently genotyped by DArTseq™ platform (Diversity Arrays Technology Pty Ltd) in which the DNA was fragmented through PstI-MseI double digestion, followed by adaptor ligation, ampli cation and then sequenced on Illumina Hiseq 2500.SNP and PAV calling were then carried out using proprietary DArT analytical pipelines 78 .
Genetic mapping markers from DArTseq were ltered so as to be polymorphic between parental lines and to have no missing values across the populations for a total of 221 and 183 F 2 individuals, respectively in XB and XT populations.Maps were constructed using JoinMap® v5 79 , rst linkage groups (LGs) were obtained through a logarithm of odds (LOD) value of 30 using "grouping (tree)" function.Initial mapping was then performed using a maximum likelihood (ML) algorithm, in order to deal with computational load of more than 100 markers in each LG.After su cient markers had been removed using iterative reduction of markers with ML, nal LGs were generated using regression mapping with default parameters: recombination frequency ≤ 0.40, LOD value ≥ 1.0, goodness-of-t jump threshold for removal of loci equal to 5.0, and rippling performed after every marker was added into the map.Recombination frequency was converted into map distance (cM) using the Haldane's function and markers were checked for nearest neighbour (NN) t values and mean χ 2 contribution for the goodness-of-t within each LG map.
Markers showing signi cant zygotic segregation distortion (SD) were rst removed from initial mapping calculations, and later re-introduced using regression mapping.This was monitored by comparing initial LGs (without SD markers) with newly obtained ones (with SD markers), in order to ensure the marker order between non-distorted markers was consistent.To improve the QTL analysis for qCC, qSC, and qCLX, markers on XB.lg05 were ltered out with a 0.95 threshold and the linkage group was mapped using a maximum likehood algorithm.

Genetic diversity and population structure
Marker quality control was by ltering biallelic SNP loci with minor allele frequency (MAF) > 0.01 and missing value of not more than 10%.The ancestry of each individual was estimated using fastSTRUCTURE 80 with admixture as the ancestry model.fastSTRUCTURE was run using the structure.pyscript provided with the software.The software was run using multiple choices of K (K = 1 to K = 8) to investigate the structure of the data.ChooseK.py was used to choose the most likely number of clusters.
F 2 lines were assessed in three blocks between June to November 2017 in a complete randomized block design (CRBD), each with ve replicates of each parent.The soil had a sandy loam pro le with pH 5.0, while the day/night temperatures recorded were 32 ± 0.9°C and 23 ± 0.3°C (on site weather station; DeltaT).Traits measurements from 87 individuals (which were not border plants) were recorded as reported 5 .In addition, the average chlorophyll concentration was obtained from the middle lea ets of the top three fully opened leaves, each measured 3 times at 28 DAE using a SPAD-502 (Konica Minolta).For qTNoB, the number of branches along the entire stem were calculated, as opposed to qNoB in which represents the number of branches within the rst ten nodes.
QTL analyses were conducted using MapQTL v6 88 using both non-parametric and parametric tests.Each trait was rst analysed through a Kruskal-Wallis (KW) test, in order to establish single marker-trait associations (at p < 0.01).This was followed by Interval Mapping (IM) analysis with LOD threshold calculated through 10,000 permutation tests, taking the genome-wide (GW) signi cant threshold at α = 0.05.Only QTLs that were consistent between these two tests were reported.A QTL was considered signi cant when equal to or above the GW-LOD threshold and explained ≥ 10% of phenotypic variance (PVE).Multiple-QTL model (MQM) mapping was utilised for seed protein content QTL analysis.All QTL have been mapped onto their corresponding LG using MapChart v2.32 89 , including markers as supporting intervals using a 2-LOD drop.

Phylogenetic analysis of R2R3 MYB genes
For the MYB subgroup characterisation, the acid sequences (sequence accessions listed in Table S17 and S18) were aligned using MUSCLE 90 in MEGA X 91 using default settings.The phylogeny was constructed as a NJ tree with 5,000 bootstrap replicates.
Figures     Transgressive segregation in seed protein content (%) observed in the XB population (n=161).
Figure 6 (a) Characterisation of MYB genes (PtMYB113, PtMYB114, PtMYB21, PtMYB102a, PtMYB102b and PtMYBL2, marked with circles) near the pigmentation loci (Table S16) with designated R2R3-MYB subgroups as previously reported using NJ method (bootstrap values with 5000 replicates are shown as percentage next to the branches) in MEGA X.(b) Characterisation of PtMYB113 and PtMYB114 with designated R2R3-MYB Subgroup 6 anthocyanin activators.The evolutionary distances were computed using the Poisson correction method.The subgroups are designated as previously reported [54][55][56][57] .The species and accession numbers of labelled MYB proteins are listed in Table S17 and S18.

Figure 1 Overview
Figure 1

Figure 2 The
Figure 2

Table 2 .
QTL analysis results * only major peaks listed, refer to TableS12for candidate genes in Chr05.1 ** signi cant QTL are indicated in bold have suggested P. tetragonolobus diverged approximately 11 Mya from Erythrina soursae.The common ancestor between this clade and the clade consisting of Phaseoleae subtribe Glycininae and subtribe Phaseolinae which arose approximately 21 Mya, with an estimated divergence slightly earlier (approximately 23 Mya) from the Cajaninae clade