Illuminating the lineage-specific diversification of resin glycoside acylsugars in the morning glory (Convolvulaceae) family using computational metabolomics

Abstract Acylsugars are a class of plant defense compounds produced across many distantly related families. Members of the horticulturally important morning glory (Convolvulaceae) family produce a diverse subclass of acylsugars called resin glycosides (RGs), which comprise oligosaccharide cores, hydroxyacyl chain(s), and decorating aliphatic and aromatic acyl chains. While many RG structures are characterized, the extent of structural diversity of this class in different genera and species is not known. In this study, we asked whether there has been lineage-specific diversification of RG structures in different Convolvulaceae species that may suggest diversification of the underlying biosynthetic pathways. Liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS) was performed from root and leaf extracts of 26 species sampled in a phylogeny-guided manner. LC–MS/MS revealed thousands of peaks with signature RG fragmentation patterns with one species producing over 300 signals, mirroring the diversity in Solanaceae-type acylsugars. A novel RG from Dichondra argentea was characterized using nuclear magnetic resonance spectroscopy, supporting previous observations of RGs with open hydroxyacyl chains instead of closed macrolactone ring structures. Substantial lineage-specific differentiation in utilization of sugars, hydroxyacyl chains, and decorating acyl chains was discovered, especially among Ipomoea and Convolvulus—the two largest genera in Convolvulaceae. Adopting a computational, knowledge-based strategy, we further developed a high-recall workflow that successfully explained ~72% of the MS/MS fragments, predicted the structural components of 11/13 previously characterized RGs, and partially annotated ~45% of the RGs. Overall, this study improves our understanding of phytochemical diversity and lays a foundation for characterizing the evolutionary mechanisms underlying RG diversification.


Introduction
The plant kingdom has historically been a source of numerous metabolites of use in medicines, foods, cosmetics, agriculture, and cultural practices. Many of these metabolite classes e.g. glucosinolates, alkaloids, terpenes, acylsugars, or cardenolides show lineagespecific structural diversification. Such specialized metabolites play important functional roles in plant pollination, defense, and resilience to abiotic stress conditions in the respective plants' ecological niche. One such class of specialized metabolites is acylsugars, which, in the Solanaceae family, are produced in the trichomes of many species [1][2][3][4][5]. In Solanaceae, this class of anti-herbivory compounds comprises a sugar core (usually a sucrose but sometimes glucose or inositol) to which acyl chains derived from fatty acid or branched amino acid metabolism are esterified (Fig. 1A) [3][4][5]. A previous study catalogued acylsugar diversity across 40 species in the family, and identified some species, such as Salpiglossis sinuata and Nicotiana alata, that produce over 600 acylsugars on individual leaves [5]. Acylsugars also show variation among individuals of the same species [2]. This extreme structural diversity of acylsugars is hypothesized to occur due to the promiscuity of the BAHD acyltransferases involved in their biosynthesis, and the diverse precursor coenzyme A-activated acyl groups in the trichome metabolite pool.
While "acylsugars" commonly refers to the class of compounds that is known to be restricted only to members of the Solanaceae family, other structurally and possibly biosynthetically analogous acylated sugar compounds are also documented across the plant kingdom. For example, Ibicella lutea (Martyniaceae) [6], Silene gallica (Caryophyllaceae) [7], Cerastium glomeratum (Caryophyllaceae) [8], and Geranium carolinianum (Geraniaceae) [9] are also known to produce acylated sugar compounds (Fig. 1C-E). In the known examples, the central sugar in these compounds is a glucose, with acyl or hydroxyacyl chains attached via ester and/or ether linkages. Given all these compound classes are technically acylsugars, we refer to Solanaceae acylsugars as "Solanaceae-type acylsugars" while referring to the combined class as "acylsugars" (Fig. 1). While the extent of the diversity of acylsugars across plants is not known, structures of another similar compound class-resin glycosides (RGs)-have been extensively characterized in morning glories (Convolvulaceae) since their discovery in the 1990s [10,11] (Fig. 1B). Over 300 RG structures have been characterized from several Convolvulaceae species using nuclear magnetic resonance (NMR) (reviewed in [12,13]), and they reveal three primary structural elements: (i) an oligosaccharide core of 2-7 sugars comprising pentoses, deoxyhexoses, and hexoses; (ii) a 14-18 carbon long hydroxy-or dihydroxy fatty acid that may form an intra-molecular macrolactone ring; and (iii) decorations of aromatic or aliphatic acyl side chains that can be 2-16 carbons long. Some RGs undergo intermolecular condensation to form larger, more complex units [12,13]. RGs have also been extensively characterized functionally, with studies finding their roles in bacterial and fungal inhibition [14], in herbivory defense against nematodes [14,15] and leaf herbivores [15] as well as in allelopathic interactions [10]. RGs are also responsible for the purgative properties of powders of Ipomoea purga root-which accumulate resin glycosides up to 20% of their root dry weight-and which are used in traditional Indian and Mexican medicine [12,13].
Despite their structural and functional characterization in multiple species, the diversity of RGs has not yet been assessed in a phylogenetic context. In addition, due to incomplete sampling, it is not clear if the characterized RG structures from different species are representative only of specific species or are broadly conserved across the entire family. Most structures have also been characterized using NMR spectroscopy, which is not a highthroughput method of assessing overall phytochemical diversity. Characterizing common patterns of fragmentation in mass spectrometry can help in the rapid identification of RGs.
To address these goals, we sampled 26 Convolvulaceae species using ultra high-performance liquid chromatography tandem mass spectrometry (UHPLC-MS/MS), characterized the structural diversity RGs in a phylogenetic context, and structurally elucidated a novel RG in the species D. argentea. This study was enabled by our preliminary observations that RGs tend to fragment in a characteristic manner producing some signature fragments. We built upon this observation by computationally sampling a wide repertoire of possible moieties in the RG structural space. Our investigation revealed thousands of LC-MS peaks with MS/MS fragmentation patterns similar to known RGs but with unknown structures. This is emblematic of most LC-MS/MS studies of natural extracts where a majority of the peaks cannot be annotated. Thus, in this study, we investigated the RG fragmentation patterns in greater detail and developed a computational pipeline for identifying the putative structural components of over 1600 RG peaks, most of which are novel. This study lays the foundation for further molecular analysis of Convolvulaceae RGs.

Sampling and identification of resin glycosides from Convolvulaceae species using MS/MS and NMR
To characterize RG structural diversity, we first generated a collection of 31 plants from 26 Convolvulaceae species (Supplementary File 1). Six of the twelve tribes and all five major clades in the family [16,17] were represented. The identity of each of these species was verified by sequencing and comparative analysis of a part of the maturase K (matK) gene region from the plastid DNA (Supplementary File 2). We found that while most species were correctly defined, some such as Distimake quinquefolius, Ipomoea pubescens, and Calystegia sepium needed re-assignment as Ipomoea sp., Merremia sp., and Convolvulus sp., respectively (Supplementary Fig. 1).
Root and leaf extracts of these 31 plants, grown in growth chambers and greenhouses, were obtained at variable stages of growth and assessed using UHPLC-MS/MS (Supplementary File 3). The goal of this analysis was to obtain a snapshot of the RG profiles of individual species, similar to the profiling performed for Solanaceae-type acylsugars from the New York Botanical Gardens [5]. Thus, consistency of growth stage and conditions-which is impossible when sampling 26 globally distributed species with variable growth habits and evolutionary adaptations to different environments-was not maintained. Exploratory analysis revealed substantial profile diversity for jalapinolic acid (11-hydroxyhexadecanoic acid; C16-OH; m/z = 271.2279)containing compounds ( Fig. 2A  After pre-processing and filtering the peaks obtained using UHPLC-MS/MS, we computationally identified high-confidence RG-like peaks from the untargeted metabolomics data using multiple criteria (Supplementary Fig. 2E). The primary criterion for RG peak selection was the presence of fragment pairs-one, a hydroxy-or dihydroxy-acyl chain of varying lengths, and two, the acyl chain attached to a sugar (pentose, deoxyhexose or hexose) (Supplementary File 4, see Methods)-defined based on previous knowledge [12,13] and our empirical observations (Supplementary Figs 3  and 4). Overall, 2687 peaks congruent with the above two criteria and other filtering criteria were identified across all sampled species (Supplementary File 1). We randomly and manually assessed multiple observable peaks containing the fragment pairs in different samples and found that most or all of the peaks were being captured by our computational pipeline. However, we cannot completely rule out the possibility that using these signature fragments underestimated the RG diversity in our samples.
To confirm whether the peaks identified were indeed RGs, we catalogued 30 previously characterized RGs from six species in our study (Supplementary File 5) and asked whether they were detected in our data. Thirteen out of the 30 RGs were captured in our processed dataset, and the rest were not detected. As seen also in later results, the low number of detected RGs likely speaks to the within-species RG structural diversity. None of the eight previously identified Convolvulus arvensis RGs were detected in our study, despite 150 and 215 RG signals detected in the two C. arvensis accessions sampled (Fig. 3).
To further confirm the detection of RGs, we purified and structurally elucidated one highly abundant predicted RG peak with [M-H]-m/z 1279.67, from D. argentea Silver Falls using NMR. Its molecular formula was determined to be C 62 H 104 O 27 , with the structure comprising a fucose, two glucoses, and a rhamnose (Fig. 2C Supplementary Fig. 5). The jalapinolic acid moiety was found to be in an open structure instead of in a macrolactone ring. Previous RGs identified from Dichondra repens were also found to contain two rhamnoses and glucoses (Dichondrin A, B) or two rhamnoses and one glucose (Dichondrin C) [18], both with open hydroxyacyl chain instead of a macrolactone ring. We thus call this structure Dichondrin D. The MS/MS fragmentation pattern of Dichondrin D could be explained by the characterized structure ( Supplementary Fig. 6). This result further confirmed that the fragmentation patterns being assessed were from true RGs.

Resin glycoside accumulation patterns are generally consistent within a single plant but vary between varieties and species
Substantial previously uncharacterized RG diversity was found at various phylogenetic scales (Fig. 3). First, we found RGs in all sampled species including smaller taxa such as Dichondra and Jacquemontia [17]. RGs have also been documented in Cuscuta [19], whose phylogenetic placement in Convolvulaceae is not completely clear [20]. This suggests that the evolution of RGs occurred in the ancestor of all Convolvulaceae species. Second, at the genus level, over a hundred and up to 334 RG signals were detected from different Convolvulus, Calystegia, Ipomoea, Dichondra, and Merremia species. In contrast, other genera showed only 1-40 RGs. This could be due to unequal sampling in those genera, or due to low RG diversity in those species under the sampling conditions. Differences between the two most sampled genera (Ipomoea and Convolvulus) indicate that the number of RGs can vary substantially between species; thus, additional sampling in underrepresented genera (e.g. Aniseia, Jacquemontia, Argyreia) would be important to reveal the true number of RGs in those taxa.
Nonetheless, substantial variability was found in number of RGs at the species and variety levels. Over 300 RG signals were found in Convolvulus equitans-the most among all plants sampled, while an unidentified Convolvulus/Calystegia species produced only 20 RGs. Two C. arvensis accessions collected from Yakima, in Washington State and Huntley, in Montana, USA that are ∼800 miles apart produced 150 and 215 RGs, indicating genotype, environment or genotype × environment interactions in defining RG profiles. The fact that only 13/30 previously identified RGs could be detected in our study from the same species (Supplementary File 5) further supports this inference. Such population-level diversity has previously also been documented among Solanaceae-type acylsugars in populations of Solanum habrochaites [2,21] and Solanum pennellii [22,23].
Metabolite sampling from the 26 species was performed from both leaves and roots, providing an opportunity to understand organ-specific localization of RGs. We divided the signals into three groups-only leaves, only roots, and both (Fig. 3A). In almost all cases, RG signals were found in both organs. Analysis of RG peak areas in individual species from leaves and roots revealed low correlation between the peak areas in almost all species ( Fig. 2F; Supplementary File 1). The most abundant RGs in roots vs. leaves were also very different between roots and leaves of every species (Supplementary File 1). In Merremia sp., three Dichondra species and two Ipomoea species (Ipomoea trifida, Ipomoea purpurea), most or all RGs were found in the roots. This observation strongly suggests that RGs are produced in the roots. It is not clear, however, whether RGs are also produced in the leaves or are simply transported there.

Substantial structural variation exists among resin glycosides
The phylogeny-guided sampling of RGs revealed substantial structural diversity in each species at the level of oligosaccharide lengths, hydroxyacyl chains, and the decorating acyl chains. Previous studies have reported existence of hydroxyacyl chains of varying lengths; thus, we computationally scanned the identified RG peaks to identify the type of long acyl chains seen in our data. Overwhelmingly, we saw m/z = 271.2279, which corresponds to 11-hydroxyhexadecanoic acid (jalapinolic acid) (Fig. 3B). The most diversity in hydroxyacyl chains was found, unsurprisingly, in the two species with the most overall RG diversity-C. arvensis and C. equitans-with hydroxypentadecanoic acid (C15-OH; m/z = 257.2122), dihydroxyhexadecanoic acid (C16-OHOH; m/z = 287.2228) and hydroxyoctadecanoic acid (C18-OH; m/z = 299.2586) found in multiple RGs.
The MS/MS peak m/z = 271.2279 co-occurs with m/z = 417.2854 ( Supplementary Fig. 4), which corresponds to C16-OH bonded to a deoxyhexose. We thus searched the MS/MS data across all species to identify hydroxyacyl chains of all lengths with the three types of sugars previously reported in RGs, namely pentoses (xylose), deoxyhexoses (rhamnose, fucose, quinovose), and hexoses (glucose) in RGs (Fig. 3B). Most RG peaks had the above-noted m/z peak association; however, a substantial number were also associated with a hexose, which most likely is glucose. The relative frequencies of pentose vs. deoxyhexose vs. hexose were speciesdependent. Many Convolvulus and Calystegia species produced >20 RGs with deoxyhexose and pentose as the most and second-most frequent sugars, respectively, while the Merremia and Ipomoea species contained deoxyhexose and hexose (Fig. 3B). Previously identified RGs from Calystegia show the hydroxyacyl macrolactone ring connected to a glucose (a hexose) and quinovose (a deoxyhexose) [24,25]. Since the sugar-hydroxyacyl paired fragment in our study detects only one end of the macrolactone ring connection, the glycosidic linkage between deoxyhexose and the hydroxyacyl chain may be more stable than the ester linkage with hexose in electrospray ionization and collision-induced dissociation in LC-MS. Both pentose and hexose were equally frequent in D. argentea as an alternative sugar (Fig. 3B). In contrast, the two Operculina species and Xenostegia tridentata had hexose instead of deoxyhexose or pentose as their most preferred sugar for adding the hydroxyacyl chain. Eight previously published structures from Operculina macrocarpa and Operculina turpethum [26,27] with the hydroxyacyl chain bonded to glucose agree with this finding. Based on these observations, we infer that hydroxyacyl chain addition (or at least the glycosidic bond formation) with deoxyhexoses was likely the ancestral state for RG biosynthesis, with lineagespecific shifts to hexose or pentose. The mechanism of this addition is not clear.
We also tested the occurrence of fragments corresponding to small aliphatic or aromatic acyl chains esterified to the sugar hydroxyl groups (Fig. 3D). Using a database of all possible acyl CoAs from the Chemical Entities of Biological Interest (ChEBI) database [28] between 50-200 Da (the mass range of Solanaceae-type acylsugar acyl groups) as well as previously identified RG acyl groups [12,13], we asked if the fragment masses of the small acyl chains could be identified. Twenty-two acyl-related fragment ion exact masses were searched for (Supplementary File 4). The most frequent m/z seen across multiple RG peaks was m/z = 101.061, corresponding to five carbon aliphatic saturated acids (e.g. methylbutyric, isovaleric, pentanoic) (Fig. 3D), which we cumulatively refer to as C5 or "pentanoic" below. Mass-tocharge ratios of 71.0133, 85.0289, and 101.0239 normally correspond to acrylic, crotonic, and acetoacetic acids, respectively; however, it is not clear if the occurrence of these fragments indicates presence of these chains rarely found in previous RGs or whether they are degradation products of larger groups. For example, sequential demethylation and unsaturation of a C5 chain from insource fragmentation can also produce fragments of m/z 85.0289 and 71.0133.
Differences in acyl incorporation between species were observed. RGs from Convolvulus species tended to contain m/z = 117.059 (hydroxymethylbutyric acid, HMBA) while those from Ipomoea had m/z = 101.061 (pentanoic acid) ( Fig. 3D; Supplementary Fig. 7). Cinnamic acid was predicted mostly in Ipomoea species; presence of trans-cinnamic acid has been previously reported in Ipomoeas [29][30][31]. Calystegia species were all predicted to contain C5 acyls, except Calystegia soldanella also with substantial tiglic RGs. This pattern is corroborated by studies in Calystegia hederacea and C. soldanella containing 2S-methylbutyric acid as the C5 acyl chain [25,32]. As per prediction here, the two studies in Operculina also demonstrate a dominance of C5 (2S-methylbutyric) and tiglic acids in two Operculina species [26,27]. Generally, short-chain fatty acids were more represented in RG structures than long-chain saturated and unsaturated fatty acids.

Molecular networking reveals significant structural differentiation between Convolvulus and Ipomoea resin glycosides
To visualize the structural differences in RGs across the Convolvulaceae, we clustered MS/MS fragmentation patterns using molecular networking, which groups similar spectra based on their cosine scores. A total of 2100 RGs (78.1% of detected RGs) were clustered into a single large network, and these were displayed using genuslevel occurrence patterns (Fig. 4). We found large, centrally located regions 1 and 2, comprising fragmentation spectra shared by multiple genera. The greatest difference was seen between Ipomoea and Convolvulusalthough these genera shared peaks in regions 1 and 2, dozens of unique RGs were found in regions 4, 6, 7 (Convolvulus) and region 3 (Ipomoea). These observations suggest fixed lineage-specific differences between the major genera. Analysis of the pseudo-molecular ion masses (Fig. 4B) revealed that Convolvulus produced hundreds of high molecular RGs with m/z > 1500 while Ipomoea species produced many low molecular weight RGs not observed in other species. This finding agrees with previously reported RGs from C. arvensis, containing many penta-, hexa-, and hepta-saccharides [33,34] vs. predominantly tri-and tetra-saccharides from Ipomoea species [11,35]. Above results (Fig. 3D) showing differential acyl chain patterns in Ipomoea vs. Convolvulus may also contribute to overall spectral divergence.

A knowledge-based algorithm can explain many MS/MS peaks and help predict resin glycoside structural components
Based on the above substructures identified across our entire LC-MS/MS dataset, we sought to rapidly annotate the MS/MS fragments of individual RG peaks. The Metabolomics Standards Initiatives guidelines [36] specify four levels of annotation, with Level 3 corresponding to putatively identified compound classes, and Level 2 corresponding to putatively identified compounds. The compounds called RGs herein are annotated at Level 3 based on MS/MS information and validating NMR data from one peak. Acquiring Level 2 annotation for individual peaks requires information about the types and stereochemistry of sugars and acyl chains that is not available from LC-MS/MS experiments. However, the spectral information can be used to predict the structural components of the identified RG peaks, and thus narrow their structural neighborhood.
To facilitate putative component identification, we first attempted to better understand the different types of peaks produced upon fragmentation and any patterns of their co-occurrence. Three types of peaks showed frequent co-occurrences: (i) the hydroxyacyl fragment + one and two sugar (typically, 271-417-561/579); (ii) formate adduct (−46) or CO 2 (−44) losses paired with a specific MS/MS fragment; and (iii) water (−18) loss (Fig. 5). This water loss could arise from an open or closed macrolactone ring structure-both forms are found in diverse RG structures across Convolvulaceae and in our NMR results (Fig. 2)-or via a dehydration-led rearrangement occurring during electrospray ionization [37] (Fig. 5).
With the above spectral knowledge, we established a pipeline for RG component prediction. First, we matched the top 20 observed MS/MS peaks of each MS1 peak to an in-house database of known moieties found in RGs [12,13] (Supplementary File 4). Second, we performed pairwise comparisons of all fragments, and inferred neutral losses and the potential moieties those losses would correspond to. On an average, 72.5% of the 43 261 total MS/MS fragments across all samples could be annotated automatically using this manner. Third, the difference between the maximum MS/MS fragment with annotation and the pseudo-molecular ion mass of the parent compound was used to infer the remaining possible moieties. To reduce false positive predictions, we only considered acyl chains with occurrence in >5% of the RG peaks ("Frequent", Supplementary File 4) and only predicted up to hepta-saccharide cores with up to three frequent acyl chains. We also considered water loss and formate/no-formate adducts. Using these rules, 1245 (45.9%) of the RGs were assigned putative component identifications (Supplementary File 6). Multiple predictions were also obtained for each structure (low precision), all of which we provide for further manual consideration. Using knowledge of observed MS/MS fragments, prediction of formate-related and other neutral losses can help filter through the multiple predictions manually and thereby obtain a smaller set of component predictions.
Of the 13 RGs identified in our analysis, correct prediction was obtained for 11 of them (high recall; Supplementary File 5), with Dichondrin D (Fig. 2C) containing five acyl chains and Merremin D containing decanoic acid being the exception. When the algorithm was extended to five acyl chains and included decanoic acid, correct prediction was obtained for both RGs. As an example of the prediction, Tricolorin A is a tetrasaccharide containing two rhamnoses, one fucose, one glucose, 11-hydroxyhexadecanoic acid, and two methylbutyric acid groups. There were two additional peaks corresponding to the m/z of its formate adduct 1067.563, which could correspond to Tricolorins E and Dstructural isomers of Tricolorin A (Fig. 6A). These peaks were accurately predicted to contain three deoxyhexoses and one hexose, two C5 acyls and a C16-OH. MS/MS peaks for multiple moieties were correctly isolated (Fig. 6B).
Although the recall of this knowledge-based approach is high, the precision is low. One reason is presence of some ions with insufficient good-quality fragmentation to produce informative peaks. The most common issue, however, is multiple predictions due to multiple residue combinations producing the same exact mass (Supplementary File 6). For example, an RG containing a deoxyhexose and pentanoic acid will have its mass equivalent to one with hexose and hydroxymethylbutyric acid-both being common side chains. This issue can be mitigated by manually checking the extracted MS/MS fragments for presence/absence of formate adduct and specific acyl chains (Fig. 6C). Nonetheless, the computational pipeline is able to explain a large proportion of the top MS/MS fragments of RGs and is able to provide tentative identifications of the RG structure based on the universe of likely fragments. Future iterations of the workflow can potentially allow for more precise predictions based on species-level component preferences.

Discussion
In this study, we performed a phylogeny-guided analysis of RG structural diversity. We sampled 26 species from 12 genera of the Convolvulaceae family and constructed a phylogeny using the matK sequence from each sampled plant. However, there are 1500-1600 estimated species in the Convolvulaceae across 55-60 genera [16,17]. Thus, our study represents only a very limited sampling of the broader diversity of RGs that may be present across the Convolvulaceae. We also note while the matK-based tree generally recapitulated known species relationships [16,17], inconsistencies due to using only a single sequence, e.g. as seen by the positions of Merremia species and Ipomoea tricolor varieties (Fig. 3A), still exist and can affect inferences of biochemical evolution based on the reconstructed species relationships.
Substantial diversity was found in RG structures in each plant, within species as well as between species. All species were found to accumulate RGs in the roots.
Evidence of five different hydroxyacyl groups complexed with sugars with occurrence in >10 RGs was obtained, with C16-OH being the most common (Fig. 3C). This was followed by hydroxyacyl chains of C15, C17 length, and then by C16-OHOH. Odd-chain fatty acids (C15, C17) are rarer compared to even-chain ones [38]; however, both have previously been described in RGs [26,33]. In addition, hydroxylated C17 fatty acid is the second most prevalent macrolactone ring fatty acid except in C. arvensis Yakima, which preferred C15 as the alternate chain. Biosynthesis of odd-chain fatty acids has been proposed to occur (i) from propionyl coA as the primer in fatty acid elongation pathway instead of acetyl coA, (ii) via one carbon elongation of fatty acids e.g. in acylsugars [39], and (iii) α oxidation of the more abundant evenchain fatty acids [40], but which of these mechanisms is prevalent in morning glories is unknown. These observations-combined with the unusual 11th position hydroxylation in the hydroxyacyl chain-reveal the unique biochemistry of RGs. 12-Hydroxyhexadecanoic acid and 11(S)-hydroxyheptadecanoic acid have been previously documented in O. turpethum [26] and C. arvensis [33], respectively, but our study reveals the previously unreported extent of hydroxyheptadecanoic acid's prevalence across the Convolvulaceae RGs. Furthermore, based on the chain distribution patterns (Fig. 3C), we predict that a single enzyme-preferring jalapinolic acid and with promiscuous activities with other hydroxy/dihydroxy acyl chains-is responsible for acylation of the oligosaccharide core.
Among the acyl chains, five-carbon long chains (e.g. valeric acid, 2-methylbutyric acid, 3-methylbutyric acid) were the most frequent while four-carbon moieties (e.g. isobutyric acid, n-butyric acid), tiglic and acetic were also found in >10% of the RGs. Longer saturated acyl chains were not as frequently detected, indicating the predominant role of branched chain amino acid biosynthesis-from which most of the small acyl chains are likely derived-in RG decorations. Such diverse acylations are indicative of the involvement of BAHDs, similar to the promiscuous BAHDs of the Solanaceae acylsugar biosynthetic pathway [1,5,41]. Solanaceae-type acylsugar BAHDs as well as anthocyanin sugar acylating BAHDs [42] show conservation of acylation positions in orthologs despite their acceptor promiscuity; however, in RGs, analysis of previously characterized structures does not suggest positional uniformity [12,13]. Nonetheless, the diversity of acyl chains added to RGs and the variable positions suggests the involvement of one promiscuous or multiple BAHDs in the process. Lineage-specific allelic divergence or gains/losses of the BAHDs may produce the observed lineage-specific patterns of acylation.
Our findings reveal structural divergence indicative of underlying biosynthetic divergence in different lineages. At the genus level, the divergence between Ipomoea and Convolvulus is the most stark. Detailed analysis of spectral patterns performed in this study revealed some of the basis for this divergence-larger RGs in Convolvulus and use of different decorating acyl chains between Convolvulus and Ipomoea, i.e. HMBA and pentanoic, respectively (Fig. 3B). Identification of the underlying glycosyl and acyltransferases will help determine the molecular basis of this divergence.
The plant kingdom is estimated to contain over a million metabolites [43]; however, <5% can be confidently identified structurally using public databases (unpublished results). This inadequacy results in a vast proportion of the plant metabolome remaining unannotated, which can further create roadblocks in annotating the genes involved in metabolism. Here, we used the knowledge of fragmentation patterns in RGs to annotate their structural components. Such a knowledge-based approach is popularly used in lipid annotation by tools such as LipidBlast [44], whose internal database contains in silico fragmentation for >120 000 lipids from 30 structural classes. Recently, a deep learning method CANOPUS [45] was developed to predict metabolite classes using untargeted LC-MS/MS data. This approach was used to predict hundreds of flavonoids and anthocyanins in sweet potatoes [46]; however, more algorithmic development is needed for other phytochemical classes. Furthermore, much of the specialized metabolic diversity lies outside of current reference species that are the most well studied [5,47]. Thus, approaches to rapidly annotate this "non-model diversity" are needed. The knowledge-based approach developed here for RGs provides a template for other acylsugars across the plant kingdom (Fig. 1). Given the bonds remain generally the same-glycosidic, ester, and ether linkages-we predict that the fragmentation pattern would be consistently reproducible if using comparable MS methodology. Overall, the findings presented here not only provide an extensive insight into the unique chemistry and diversification of RGs in Convolvulaceae-a horticulturally important familybut also provide a foundation for using the lineagespecific variation to probe the biochemistry of these compounds in the future.

Plant germination and growth conditions
Depending on the thickness of the seed coat, seeds were either sterilized using concentrated sulfuric acid or 10% trisodium phosphate (TSP) solution. Seeds with thick seed coats (e.g. I. tricolor) were acid scarified and sterilized with concentrated sulfuric acid for 15 minutes. Seeds with thinner seed coats (e.g. D. argentea) were washed in 10% TSP for 10 minutes for surface sterilization. For both conditions, seeds were washed five times with sterile water afterwards. Student grade filter paper (VWR, Radnor, PA) was placed within a sterile petri dish and 1 mL of sterile water was added. The sterilized seeds were added to the petri dish and sealed with parafilm. Seeds were kept in the dark at room temperature for 4-10 days until germinated. When the radicle was clearly visible the seedlings were transferred to 4-inch pots and grown in a growth chamber under a constant light/dark (16 h/8 h) regime at 24 • C for 4 weeks. After 4 weeks, plants were moved to a greenhouse, re-potted if necessary and grown for approximately 3 months until samples were harvested.

Genomic DNA extraction and amplification of maturase K partial gene sequences
Leaf samples were taken using clean tweezers and scalpel, and f lash frozen in liquid nitrogen and stored at −80 • C until use. Approximately 50 mg of frozen tissue was ground in a 1.5 ml reaction tube using a disposable tube pestle. Genomic DNA from the ground leaves of all species grown was isolated using the E.Z.N.A. Plant DNA Kit (OMEGA Bio-Tek, Norcross, GA) following the manufacturer's recommendations. The gDNA was eluted using 100 μl nuclease-free water and the concentration was estimated using a NanoDrop™ One instrument (ThermoFisher Scientific, Waltham, MA). The 1:10 diluted DNA was used to amplify a region of the maturase K gene using Q5 ® High-Fidelity DNA Polymerase (NEB, Ipswich, MA) following the manufacturer's recommendations with following primer combinations: matK_F1 (5 -ATA CTT TAT TCG ATACAA ACT CCT TTT TTT-3 ) and matK_R1 (5 -AGT ATT GCA ATT TGA ATA GTT TCA TTA C-3 ) using 53 • C as annealing temperature or matK_F2 (5 -ATA CTT TAT TCG ATA CAA ACT CCT TTT TTT GGA AG-3 ) and matK_R2 (5 -AGT ATT GCA ATT TGA ATA GTT TCA TTA CTC GAA A-3 ) using 56 • C as annealing temperature. Both primer pairs amplify the same target while matK_F2 and matK_R2 bind approximately 40 bp upstream of their respective F1/R1 counterparts. PCR products were directly sequenced using Sanger sequencing at the Cornell Institute of Biotechnology using the above primers.

Confirmation of species identity via matK barcoding and generating a species phylogeny
The obtained matK sequences for each sampled species were searched against NCBI's nr database using blastn [48] default parameters to confirm that the correct species were sampled. All nucleotide sequences including those of the best hits were aligned using MAFFT v.7.453-with-extensions [49] in Geneious Prime v.2020.1.1 using default parameters and provided as input to IQ-TREE v.1.6.10 [50], which was run with model selection (ModelFinder, [51]), single branch test using SH-like approximate likelihood ratio test (1000 replicates) [52], and 1000 standard non-parametric bootstrap replicates using following parameters: -alrt 1000 -b 1000 -nt AUTO -ntmax 32. The optimal tree was obtained using the TVM + F + R2 model.

Resin glycoside extraction
Leaves and roots of plants were harvested into 50 ml falcon tubes and snap frozen in liquid nitrogen. Frozen samples were rough ground and homogenized with a spatula. Approximately 100 mg of sample was weighed while keeping the material frozen with liquid nitrogen and 300 uL of methanol containing 10 uM telmisartan per 100 mg of tissue was added to samples (except in species 28-32; Supplementary File 1), and vortexed for 30 seconds. Two metal beads (2.3 mm) were added to the sample and samples were homogenized using a mixer mill (MM400, Retsch, Haan, Germany) at 30 bpm in 2-minute intervals for a total of 6 minutes within a 15-minute extraction period. Afterwards, samples were centrifuged for 30 seconds at 14000 g to remove particulate. If supernatant still contained particulate, it was centrifuged again with the same parameters. Once all particulates were removed, the supernatant was transferred to HPLC vials for further analysis or stored at −80 • C.

UHPLC-MS/MS run conditions
LC-MS analysis was performed on a ThermoScientific Dionex Ultimate 3000 HPLC equipped with an autosampler coupled to a ThermoScientific Q-Exactive Orbitrap mass spectrometer using solvent A [water + formic acid (0.1% v/v)] and solvent B (acetonitrile) at a flow rate of 0.6 ml/min, on a Phenomenex Kinetex 150 mm C18 column (model: 00F-4462-AN). Metabolites were detected using full MS1 coupled to a data-dependent MS2 method, extracting the top 10 most abundant ions (Full-MS/ddMS2) in negative ionization mode. The total scan range used was 500 to 2000 m/z. Additional details of chromatographic, mass spectrometric, and MS-DIAL-based data analysis methods are described in Supplementary File 3. RAW files from the LC-MS runs were converted to abf files using AbfConverter (Reifycs, Tokyo, Japan) and used for further data analysis with MS-DIAL [53]. Converted raw files for leaf and root samples for each species were aligned to each other, and an alignment-based mgf file was exported. These output files were then filtered using custom Python scripts as described in Supplementary Fig. 2.

Liquid-chromatography mass-spectrometry data analysis
RGs were identified using a multi-step analysis. First, we performed manual analysis of multiple LC-MS runs to study RG fragmentation patterns. In addition, previously reported MS/MS fragmentation patterns [12,13] as well as MS/MS fragmentation simulated in ChemDraw v19.0 (PerkinElmer, Waltham, MA, USA) provided us with a general understanding of the fragmentation pattern and allowed us to compile a database of RG-specific fragment ions (Supplementary File 4). Automated identification and quantification of RGs in the analyzed samples was carried out using in-house Python scripts as outlined in Supplementary Fig. 2E.

Identification of resin glycoside peaks from LC-MS data
The alignment files produced using MS-DIAL per species were analyzed individually. Multiple filters were applied using custom Python scripts to separate likely true RGs from potential noise, and the selected peaks were then extracted from the mgf format files. The filters applied were as follows: (i) The peak area of the selected MS1 peak in either leaf or root needed to be >10× the signal of that peak vs. blank as well as >1000. (ii) Every selected peak had to contain at least one out of 30 special pairs of MS/MS fragments: each pair comprised one hydroxyacyl fragment (out of 10) and a combined fragment of the hydroxyacyl bound to a sugar (out of 3) (e.g. 271.2279-417.2858 for deoxyhexose+C16-OH). (iii) The MS/MS peak areas of the fragments needed to be >10% of the maximum MS/MS ion intensity, so as to reduce noise. The last filter also resulted in some true RG peaks being filtered out e.g. m/z 1079.4929 in O. turpethum, which contained MS/MS fragments 257.2108 and 419.2613 corresponding to a C15-OH complexed to a hexose sugar-a motif frequently found in this species. However, the 257.2108 fragment was only 4% of the maximum ion intensity. (iv) MS-DIAL identifies peaks as linked to adducts or to ions of higher masses. While all adducts were processed as RGs, peaks linked to other peaks of higher masses were processed only to consider one of the two peaks. The Dichondrin D peak (1279.6695) was associated with a higher mass 1364.642, and thus was lost to downstream steps because of this filter. Distinction between only root, only leaves, and ambiguous was made as follows: if the unnormalized peak area was >1000 in root or leaves and 0 in the other organ, it was assigned to root or leaves, respectively. Conversely, if the peak area was >1000 in both, it was assigned to both organs. Peaks not satisfying these scenarios were deemed ambiguous. The relatively low area threshold of 1000 was used-based on prior experience with our Orbitrap instrument-to avoid excluding low-abundance, yet likely real RG peaks.

Compound purification for NMR
For structural elucidation of Dichondrin D, larger amounts were extracted from roots of D. argentea. Root tissues were collected and frozen in liquid nitrogen and stored at −80 • C until bulk extraction. First, the tissues were homogenized using a pre-chilled Victoria™ cast iron grain mill under constant infeed of liquid nitrogen to keep tissues frozen. Subsequently, 3 g of tissue were extracted using 9 mL of methanol for D. argentea. The extraction was performed under constant shaking at 25 • C for 15 minutes in an Eppendorf New Brunswick Excella E24 Orbital Shaker (180 rpm). Afterwards, the samples were vacuum filtered using a Büchner funnel and Whatman grade 1 qualitative filter paper to remove particulate matter. The extract was dried down until solidified using a rotary evaporator (Rotavapor R-II, Büchi, Flawil, Switzerland) equipped with a water bath set to 25 • C and a vacuum of 50 mbar. The solid samples were then reconstituted in 300 μL MeOH (HPLC grade). Dichondrin D was then purified on an Agilent HPLC 1100 System using an Agilent Eclipse XDB-C18 Semi-Prep (5 μm, 9.4 × 250 mm) column over 32 minutes at a flow rate of 3.5 mL per minute. The gradient started with 40% solvent A (acetonitrile, HPLC grade) and 60% solvent B (H2O with 0.1% formic acid, HPLC grade) from minute 0 to 3, and increasing the concentration of solvent A to 100% at minute 20, 100% solvent A for 5 minutes, followed by equilibrating the column to the starting conditions for 7 minutes. The samples were purified over 10 injections of 30 μL each in order to purify high enough amounts for subsequent NMR spectroscopy. Eluting RGs were examined using a Waters Quattro Ultima Triple Quad Micromass mass spectrometer and the corresponding fractions for Dichondrin D (m/z 1279.6, RT 19.5-20.0 min) were collected using a ISCO Foxy 200 Fraction Collector. The samples were further assessed using a ThermoScientific Dionex Ultimate 3000 HPLC coupled to a ThermoScientific Q-Exactive Orbitrap and determined to be of approximately 90% purity. The fractions of interest were then dried via rotary evaporation and reconstituted in 99.5% deuterated MeOH. The drying procedure was repeated, and the sample was reconstituted in 1 mL 99.9% deuterated MeOH before being transferred to the NMR tube.

NMR spectroscopy data acquisition and analysis
NMR spectra were recorded on the Bruker AVANCE III HD 800 MHz (800 MHz) spectrometer at State University of New York Environmental Science and Forestry NMR facility. [1] H NMR chemical shifts are reported in ppm (δ) relative to residual solvent peaks (3.31 ppm for methanol-d 4 ) [13]. C NMR chemical shifts are reported in ppm (δ) relative to residual solvent peaks (49.0 ppm for methanol). All NMR data processing was done using MNOVA 14.2.1 (https://mestrelab.com/).

Molecular networking analysis
The filtered mgf files and peak area files were submitted to GNPS [54] for molecular networking. The networking parameters were kept at default values except "Precursor Ion Mass Tolerance" was set to 0.05, "Min Pairs Cosine" was set to 0.5, "Network TopK" was set to 15, "Maximum Connected Component Size" was set to 0, "Minimum Matched Fragment Ions" was set to 3, and "Minimum Cluster Size" was set to 1. The resulting graphml file was imported into Cytoscape v 3.8.0 [55] and visualized using the Prefuse Force Directed Layout.

Data availability
Raw LC-MS files, processed mgf with RG MS/MS information, peak alignment files and metadata is deposited in MetaboLights under the accession ID MTBLS3113. All Python scripts for parsing MS-DIAL exported files and for generating component annotations have been deposited on the moghelab/ResinGlycoside GitHub repository.