Exploring novel secondary metabolites from natural products using pre-processed mass spectral data

Many natural product chemists are working to identify a wide variety of novel secondary metabolites from natural materials and are eager to avoid repeatedly discovering known compounds. Here, we developed liquid chromatography/mass spectrometry (LC/MS) data-processing protocols for assessing high-throughput spectral data from natural sources and scoring the novelty of unknown metabolites from natural products. This approach automatically produces representative MS spectra (RMSs) corresponding to single secondary metabolites in natural sources. In this study, we used the RMSs of Agrimonia pilosa roots and aerial parts as models to reveal the structural similarities of their secondary metabolites and to identify novel compounds, which led to the isolation of nine new compounds of three types: three pilosanidin-type and four pilosanol-type molecules and two 3-hydroxy-3-methylglutaryl (HMG)-conjugated chromones. Furthermore, we devised a new scoring system, the Fresh Compound Index (FCI), which grades the novelty of single secondary metabolites from a natural material using an in-house database constructed from 466 representative medicinal plants from East Asian countries. We expect that the FCIs of RMSs in a sample will help natural product chemists to discover other compounds of interest with similar chemical scaffolds or novel compounds and will provide insights relevant to the structural diversity and novelty of secondary metabolites in natural products.


Results
Overview of our LC/MS data-processing protocols for the representative MS spectra. We attempted to develop an LC/MS data-processing pipeline to extract the MS spectral information for interpretation of the structures of small secondary metabolites from large quantities of raw MS spectral data. Briefly, the automated protocols developed in this study comprise noise filtering, deisotoping, and clustering after similarity scoring between consecutive MS spectra (Fig. 1). After these automated processes, several thousand raw MS spectral scans from a sample are combined into tens to hundreds of RMSs. The detailed data-processing protocols are presented in Supplementary Note 1. The RMSs are tentatively considered to be derived from single metabolites that are well separated on the UPLC system, and the RMSs are then used to investigate the structural characteristics of the secondary metabolites in the extracts of natural materials.

Optimization of LC/MS data-processing protocols using model datasets. We compared and optimized the data-processing parameters, including noise filtering, the similarity score thresholds and the deconvolution filters, using a natural product extract as a model dataset to improve the quality of the RMSs. The methanolic extracts of Agrimonia pilosa (Rosaceae) 31, which is a perennial plant distributed throughout Korea, Japan and China, were used in the study, along with the spectral information of eight new compounds and six known compounds that have previously been reported from the roots of A. pilosa 32.
Our protocols focused only on the clear separation of the chromatographic peaks based on the acquisition of RMSs. The two raw MS spectral datasets generated from the roots and aerial parts of A. pilosa were processed with a noise filtering threshold that retained only the ion peaks with m/z and intensity values of over 100, followed by a deisotoping process. After noise filtering and deisotoping, consecutive scans whose similarity scores, calculated by a modified dot-product method, were above the threshold were clustered to generate the RMSs. As the similarity score thresholds were increased to 0.95 (roots) or 0.90 (aerial parts), the number of RMSs gradually increased (Supplementary Fig. S1). Since higher similarity score thresholds, e.g., 0.99, reduced the number of chromatographic peaks apparently derived from single compounds in the samples, the two datasets of raw spectra from the A. pilosa roots and aerial parts were processed into 145 and 212 RMSs with thresholds of 0.95 and 0.90, respectively, which gave separation qualities that were much better than those at higher or lower thresholds (Supplementary Fig. S2). In addition, after clustering based on the similarity scores, we applied two deconvolution filters to separate the unresolved peaks derived from co-eluted compounds: a single RMS was split into two different spectra when the consecutive MS spectra used to generate it showed different base peak ions or a convex downward pattern (Supplementary Fig. S3). In further studies, we set the similarity score threshold to 0.95, which appeared to allow the correct detection of the chromatographic peaks of interest and the removal of noise peaks, and we used the two deconvolution filters to improve the separation of single RMSs generated from co-eluted metabolites.
As a result, two sets of raw MS spectra consisting of 2699 scans were converted to 205 RMSs for roots and 232 for aerial parts (Fig. 2 and Supplementary Table S1).
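The clustering of consecutive scans into RMSs can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the plain cosine score stands in for the modified dot-product method, peaks are binned to integer m/z, deisotoping and the deconvolution filters are omitted, and all function names are hypothetical.

```python
from math import sqrt

def preprocess(scan, min_mz=100.0, min_intensity=100.0, total=1000.0):
    """Noise-filter one scan (dict of m/z -> intensity) and scale its sum to `total`.

    Binning to integer m/z is an assumption made only to keep the sketch short.
    """
    binned = {}
    for mz, intensity in scan.items():
        if mz > min_mz and intensity > min_intensity:
            key = round(mz)
            binned[key] = binned.get(key, 0.0) + intensity
    s = sum(binned.values())
    return {k: v * total / s for k, v in binned.items()} if s else {}

def cosine(a, b):
    """Plain dot-product (cosine) similarity; a stand-in for the modified score."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_consecutive(scans, threshold=0.95):
    """Group runs of consecutive scans whose pairwise similarity exceeds `threshold`."""
    processed = [preprocess(s) for s in scans]
    groups, current = [], []
    for i, spectrum in enumerate(processed):
        if current and spectrum and cosine(processed[current[-1]], spectrum) > threshold:
            current.append(i)
        else:
            if current:
                groups.append(current)
            current = [i] if spectrum else []
    if current:
        groups.append(current)
    return groups

def representative(scans, group):
    """Average the processed scans of one group into a single RMS."""
    merged = {}
    for i in group:
        for k, v in preprocess(scans[i]).items():
            merged[k] = merged.get(k, 0.0) + v / len(group)
    return merged

# Four toy scans: two from one compound, two from another (values are invented).
scans = [
    {301.0: 900, 151.0: 300},
    {301.0: 1000, 151.0: 320},
    {455.0: 800, 407.0: 200},
    {455.0: 820, 407.0: 210, 90.0: 5000},  # the m/z 90 peak is filtered out
]
groups = cluster_consecutive(scans)
rms_list = [representative(scans, g) for g in groups]
```

Raising `threshold` splits runs of scans more readily, which is one way to see why the number of RMSs changes with the similarity score threshold.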
Dereplication study of Agrimonia pilosa. The RMSs corresponding to fourteen previously reported compounds (1, 2, 6-11 and 16-21) were successfully identified in the LC/MS chromatogram of A. pilosa roots (Fig. 3 and Supplementary Fig. S4) 31 and were used in the dereplication study to discover other secondary metabolites. The symmetric Pearson's correlation distance matrix consisting of the similarity score profiles between the RMSs in a sample was subjected to hierarchical clustering analysis (HCA) (Supplementary Fig. S5). To facilitate the interpretation of the HCA results, we analysed only 189 of the 205 total RMSs of A. pilosa roots. The 16 RMSs excluded from the HCA were regarded as unimportant scans derived from a mixture of nonpolar metabolites, such as lipids, that were strongly retained on the column owing to their high affinity. The fourteen RMSs for the six agrimonolides (16-21) and eight acylphloroglucinolated catechins (1-2; pilosanidins,

Novelty scoring using the reference RMSs. We attempted to devise a method that can identify the MS spectral patterns of secondary metabolites with novel structures while setting aside common structures, such as triterpenes and flavonoids, that have already been intensively studied or that are produced by many plants. We constructed an in-house database consisting of 65,322 reference RMSs in negative mode using the above user-defined parameters in our LC/MS data-processing protocols.
The reference RMSs were derived from metabolites that have not been identified but are unambiguously present in the 466 representative Korean medicinal plants (Supplementary Table S2); some of the representative plants were deposited as standard medicinal herbs in the Korea Plant Extract Bank (Korea Research Institute of Bioscience and Biotechnology, Ministry of Science, ICT and Future Planning, Cheongju, South Korea), and others were directly collected from a Korean medicinal herb garden (Seoul National University, Goyang, South Korea). The garden contained plants that are native to East Asian countries, such as Korea, China and Japan, making these plants the most common sources of medicinal materials for traditional Korean medicines, and their chemical compositions have been intensively studied for many years. We assumed that the 466 plants randomly sampled in Korea are representative of plants native to East Asia, and we used them to construct an in-house database of secondary metabolites to investigate the novelty of RMSs in a given sample. The structural novelty of compounds given by the FCI in a sample was calculated as the normalized value of the dissimilarity and similarity indices against the reference RMSs (Fig. 5a). The RMSs corresponding to the secondary metabolites with more novel structures in A. pilosa samples showed higher FCIs than did the RMSs of the metabolites common to many plants (Fig. 5b). The FCIs of pilosanidins (1-5), pilosanols (6-15) and agrimonolides (16-21), which are only found in the genus Agrimonia or in A. pilosa, are 89.4 ± 0.1, 80.4 ± 4.6 and 88.1 ± 0.9, respectively, but the FCIs of triterpenes (27-35) or flavonoid derivatives (36-43), which are common to many plants, are 60.7 ± 7.9 and 72.4 ± 3.6, respectively (Table 1).
In addition, the trend lines of the cumulative relative frequency of the similarity scores of the RMSs corresponding to 43 secondary metabolites isolated from A. pilosa samples against the reference RMSs in our in-house database showed cumulative patterns that differed according to the chemical scaffolds, consistent with the FCI profiles (Fig. 5c). The RMSs of triterpenes and flavonoids have relatively higher similarity scores against the reference RMSs than do those of the pilosanidins (1-5), pilosanols (6-15), agrimonolides (16-21) and chromones (22-25).
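A cumulative relative frequency curve of this kind can be computed from one RMS's similarity scores against the reference set. The short Python sketch below assumes scores in the range 0 to 1 and ten equal-width bins; the bin count and the function name are illustrative choices, not values taken from the study.

```python
def cumulative_relative_frequency(scores, n_bins=10):
    """Return the cumulative fraction of scores falling at or below each bin."""
    counts = [0] * n_bins
    for s in scores:
        # Scores equal to 1.0 are placed in the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    curve, running = [], 0
    for c in counts:
        running += c
        curve.append(running / len(scores))
    return curve

# Invented scores: most references are dissimilar, one is nearly identical.
curve = cumulative_relative_frequency([0.0, 0.05, 0.15, 0.95])
```

A curve that rises steeply in the lowest bins corresponds to an RMS with few close matches in the reference set, whereas a curve that rises later indicates many similar reference spectra.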

Discussion
Discovering compounds with structural novelty from natural products has contributed to expanding the known chemical diversity. Accordingly, the development of methodologies for supporting this process has helped to accelerate natural product research. Hence, various MS-based dereplication approaches have been developed to avoid the rediscovery of known compounds from natural materials 27,33-37. Recently, a popular approach has been molecular networking (MN), which visualizes the connectivity of molecules with similar MS/MS spectral patterns generated by the data-dependent acquisition (DDA) mode; many natural product chemists have applied MN to the discovery of novel compounds by tracking the connections between the nodes from known metabolites and unknown compounds 38-41.
www.nature.com/scientificreports
In the present study, we developed a new data-processing protocol based on MS spectra that were acquired by the data-independent acquisition (DIA) method, which potentially permits the simultaneous fragmentation and detection of peaks regardless of the ion abundances 42. The raw MS2 spectra, which consist of fragment ions rapidly and continuously detected from precursor ions in the MS1 spectra without an ion selection step, were acquired in an unbiased and parallel manner by DIA analyses and converted into the RMSs using our data-processing protocol. The RMSs contain the essential MS spectral information corresponding to every secondary metabolite in a sample and are directly mapped on an LC chromatogram. Our protocol can directly verify the separation performance of a chromatographic method by checking the quality of the well-resolved peaks while adjusting the data-processing parameters.
Furthermore, RMSs can be used for applied studies, such as dereplication studies and the rapid discovery of novel compounds based on the structural relationships between the massive volume of secondary metabolites in natural products using computational methods. The HCA of the symmetric matrix consisting of the similarity scores between the RMSs provided more reliable results than did MN, which visualizes only the similarity between two nodes. When using A. pilosa samples as the model datasets, two RMSs sharing the same ion peaks were connected in MN, but clusters of compounds containing the same chemical scaffolds but lacking common fragments were not connected (Supplementary Fig. S8). On the other hand, compounds generating more similar MS spectral patterns were located in adjacent clusters on the dendrogram, and the HCA of the symmetric Pearson correlation matrix provided more useful information for the discovery of novel compounds than that provided by MN (Fig. 4). Our method was successfully applied to identify structurally similar but novel compounds (3-5 and 12-15) in the sub-clusters adjacent to RMSs that were already known. In addition, new compounds with different backbones, namely, six chromones (16-21), were identified in sub-clusters that were far removed from sub-clusters containing known compounds.

High-resolution ultra-performance liquid chromatography (UPLC) was used to obtain highly separated peaks corresponding to as many components as possible in a sample by using a long analysis time prior to MS analysis; however, among the many secondary metabolites in the sample, a few pilosanols and triterpenes with similar physicochemical properties were co-eluted from the column and simultaneously detected. Among the 43 compounds isolated from A. pilosa, the RMSs of three pilosanols (6, 12 and 15) were located far from the sub-clusters containing the other pilosanols. Their RMSs suggested the presence of other derivatives, which were co-eluted from the column; the signals indicative of triterpene or pilosanidin derivatives were more intense (Supplementary Fig. S9).
In LC/MS metabolomics or dereplication studies, peak identification has focused on finding the exact structures of unknown metabolites in a sample by comparing their spectral data to those of known compounds deposited in mass spectral databases. However, natural product chemists are more interested in the discovery of unknown metabolites that only exist in certain species. In the present study, to discover novel secondary metabolites, we chose to use an MS spectral database that contains secondary metabolites that have not been identified but are unambiguously present in natural products. We introduced a new scoring system, the FCI, which grades the structural novelty of RMSs in a sample against the "real but unknown" reference RMSs in our in-house database. The FCIs of the RMSs in the sample were calculated against the 65,322 reference RMSs from 466 representative Korean medicinal plants, which were automatically extracted by our LC/MS data-processing protocols. The 466 samples used to construct our in-house database were regarded as a representative set of the medicinal plants distributed in East Asian countries, including Korea, China and Japan, which have similar climates and geographical conditions. The FCIs can be used for the discovery of secondary metabolites with high structural novelty or with similar chemical scaffolds.
The FCI profile is shown with a 95% confidence interval, in which the maximum error in the estimate is based on the standard deviations of the FCIs calculated from the reference RMSs in 10 sub-groups of 46-47 species randomly sampled from the 466 plants. This profile shows that the new scoring method can be reliably applied to predict the structural novelty of unknown secondary metabolites and to discover new compounds in natural materials (Fig. 6).
The structural complexity of secondary metabolites in natural products is one of the greatest challenges in natural product research. In the present study, we introduced DIA-based LC/MS data-processing protocols that allow natural product chemists to inspect their raw MS data and identify meaningful MS spectral information. In the future, we will continue to add reference RMSs from plants to our in-house database, and we expect that the intensive study of RMSs with higher FCIs will guide the rapid discovery of novel secondary metabolites. This approach will facilitate the laborious and tedious isolation process and accelerate the discovery of novel secondary metabolites.

Figure caption. The symmetric matrix consisting of the similarity score profiles between m RMSs in a sample and n reference RMSs in our in-house database for the HCA (a). $x_{i,j}$ denotes the dot-product similarity score between the $i$th RMS ($S_i$) in a sample and the $j$th reference RMS ($S_j$) in our in-house database. $FCI_i$, the normalized sum of the similarity scores vector of $S_i$, represents the structural novelty of a secondary metabolite in a sample relative to the reference RMSs in our in-house database. The LC chromatograms mapped with RMSs are shown in red (upper), and the FCI profiles corresponding to the RMSs (lower) are shown for A. pilosa roots (b) and the aerial parts (c). Compounds 1-43 are shown in violet for pilosanidins (1-5) and pilosanols (6-15), blue for agrimonolides (16-22), dark red for chromones (22-26), yellow for triterpenes (23-35) and sky blue for flavonoids (36-43).

Table S2). The extracts were dissolved at a concentration of 5 mg/ml in 50 or 100% LC-grade MeOH depending on their solubility. After passage through a 0.2-μm membrane filter (Minisart, Sartorius Stedim Biotech GmbH, Göttingen, Germany), the samples were stored in a deep freezer at −80 °C.

Isolation and structural determination of secondary metabolites from A. pilosa. The isolations of fourteen secondary metabolites (1, 2, 6-11 and 16-21) were conducted as previously described in the literature 31. Twenty-nine compounds (3-5, 12-15 and 22-43) were isolated from A. pilosa roots and aerial parts using a wide range of chromatographic techniques in accordance with the RMS profiles and their FCIs. The structural elucidation of each of these compounds by spectroscopic methods, such as 1D and 2D NMR, MS and UV analyses, is described in detail in Supplementary Note 3.

UPLC-qTOF analytical conditions. The LC/MS systems consisted of a Waters Acquity UPLC system (Waters Co., Milford, MA, USA) with a binary solvent delivery system and an auto-sampler. The UPLC column was a Waters Acquity UPLC BEH C18 (150 mm × 2.1 mm, 1.7 μm). The temperatures of the auto-sampler and the column oven were 15 °C and 40 °C, respectively. The flow rate was 300 μl/min. For the detection of polar and nonpolar metabolites in a sample, the mobile phases were 0.1% formic acid in H2O (A) and acetonitrile (B), and the following gradient was used: 5-95% B (0-14 min), 95% B (14-17 min), 50-70% B (10-17 min) and 5% B (17.1-20 min). The injection volume was 2 μl. The MS experiments were performed on a Waters Xevo G2 QTOF mass spectrometer (Waters MS Technologies, Manchester, UK) equipped with an electrospray ionization (ESI) interface. The MS/MS ion patterns were obtained using a collision energy ramp from 15 to 45 eV in MSE mode. The ESI parameters were set as follows: in negative ion mode, a capillary voltage of 2.5 kV, cone voltage of 45 V, source temperature of 120 °C, desolvation temperature of 350 °C, cone gas flow of 50 l/h, and desolvation gas flow of 800 l/h. The ion acquisition rate was 0.25 s with resolution in excess of 20,000 FWHM, and the inter-scan delay time was 0.014 s. The energy for collision-induced dissociation (CID) was set to 4 V for the precursor ion. The mass range was from m/z 100 to 1800. The instrument was calibrated using a sodium formate solution as the calibration standard as suggested by the manufacturer, and this calibration allowed for mass accuracies of <5 ppm. To ensure the mass accuracy and reproducibility of the optimized MS conditions, leucine enkephalin (m/z 554.2615 in negative mode) was used as the reference lock mass at a concentration of 200 pg/μl and a flow rate of 5 μl/min and was sprayed into the MS instrument every 10 s.

Data processing for the acquisition of RMS. MS spectral data acquired from the UPLC-qTOF instrument were processed using source code written in the R statistical language (ver. 3.2.2), which is available from the authors upon request. The detailed processing procedures are described in Supplementary Note 1. Briefly, after converting the raw data files into mzXML files, every MS scan in a sample was processed according to the data-processing protocols, such as the removal of low signal-to-noise signals and the deisotoping step for the monoisotopic patterns. Then, the sum of all the peaks in a processed MS scan was scaled to 1000 to minimize the influence of peaks with high intensities in the similarity scoring step between the consecutive scans. The consecutive processed MS scans with similarity scores above a user-defined threshold, based on a modified dot-product similarity scoring method, were combined into an RMS 43.

Hierarchical clustering analysis and network visualization of RMSs. For n RMSs, the similarity score of every RMS was calculated by a modified dot-product method against the other RMSs in the same sample, and the scores were compiled into an n × n matrix. The similarity score vectors of each row were hierarchically compared based on several distance methods, such as Euclidean and Pearson, and linkage methods, such as average, centroid, and ward.D, using the 'Dist' function of the 'amap' package in R. The differences in the sub-clustering of RMSs due to the distance and linkage methods were evaluated based on the dendrograms visualized by the 'dendlist' function of the 'dendextend' package.
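The clustering of similarity score vectors can be sketched in plain Python. This is not the authors' R workflow: the distance used here is the centred Pearson correlation distance (1 − r), which is an assumption because the text does not state which Pearson variant was chosen, only average linkage is shown, and the function names and the toy matrix are hypothetical.

```python
from math import sqrt

def pearson_distance(u, v):
    """Centred Pearson correlation distance, 1 - r, between two score vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 1.0  # constant vectors are treated as uncorrelated
    return 1.0 - cov / (su * sv)

def average_linkage(rows, n_clusters):
    """Naive agglomerative clustering: merge the closest pair until n_clusters remain."""
    dist = [[pearson_distance(a, b) for b in rows] for a in rows]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return sorted(clusters)

# Toy symmetric similarity matrix for four RMSs (values are invented).
similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
clusters = average_linkage(similarity, n_clusters=2)
```

Because each row is a full similarity profile, two RMSs can end up in the same cluster when they resemble the same set of other spectra, even if they share few fragment ions with each other directly.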
Calculation and statistical analysis of the FCI values. The general idea of the novelty of the RMSs in a sample, or the FCI, is as follows: the FCI of the $i$th RMS is determined by the difference between two values, the dissimilarity index (DI) and the similarity index (SI). The DI of the $i$th RMS is the ratio of the number of reference RMSs with similarity scores of 0 to the total number of reference RMSs, and the SI is the weighted sum of the similarity scores against the reference RMSs with non-zero similarity scores among the total reference RMSs. The FCI is calculated from the following equation:
where $m$ and $X_i = \{x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{im}\}$ denote the number of reference RMSs with non-zero similarity scores against the $i$th RMS among $N$ (= 65,322) total reference RMSs and the similarity score vector of the $i$th RMS, respectively.
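The verbal definition of the DI and SI can be turned into a small Python sketch. The exact equation is not reproduced in this text, so two details below are assumptions rather than the published formula: the SI is taken as the sum of the non-zero similarity scores divided by N, and the result is scaled to 0-100 to match the magnitude of the reported FCI values. The function name is hypothetical.

```python
def fresh_compound_index(scores, scale=100.0):
    """Sketch of an FCI-like score for one RMS.

    `scores` holds the similarity scores of the RMS against all N reference RMSs.
    DI follows the text: the fraction of references with a similarity score of 0.
    SI is an ASSUMED weighting: the sum of the non-zero scores divided by N.
    """
    n_total = len(scores)
    nonzero = [x for x in scores if x > 0]
    m = len(nonzero)
    di = (n_total - m) / n_total
    si = sum(nonzero) / n_total  # assumption, not the published weighting
    return scale * (di - si)

# Invented example with N = 10 references, two of which match to some degree.
example = fresh_compound_index([0, 0, 0, 0, 0, 0, 0, 0, 0.5, 1.0])
unmatched = fresh_compound_index([0.0] * 10)
```

Under these assumptions, an RMS with no matching reference scores 100, and each additional or stronger match lowers the value, which mirrors the intended behaviour of higher FCIs for more novel structures.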
To calculate the 95% confidence intervals for the population mean of the FCIs of the RMSs, the means and standard deviations were repeatedly obtained from 10 groups formed by random sampling without replacement from the 466 plants, which were taken to approximate all the medicinal plants in East Asia, and the intervals were computed using the t-distribution. The mean and the 95% confidence intervals were visualized as a solid line and a shaded sky-blue band, respectively, using the 'plot' function in R.
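The sub-grouping and the t-based interval can be illustrated with the Python standard library. The sketch assumes that one FCI value per sub-group is available for a given RMS and uses the two-sided 95% critical value of Student's t with 9 degrees of freedom (2.262); the function names, the seed and the example values are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

T_CRIT_95_DF9 = 2.262  # two-sided 95% critical value of Student's t, 9 d.f.

def split_without_replacement(items, n_groups, seed=0):
    """Shuffle the items once and deal them into n_groups nearly equal groups."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[g::n_groups] for g in range(n_groups)]

def t_confidence_interval(values, t_crit=T_CRIT_95_DF9):
    """Return (lower, upper) bounds of the t-based interval for the mean."""
    centre = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width

# 466 plant identifiers split into 10 sub-groups of 46 or 47 plants each.
groups = split_without_replacement(range(466), 10)
sizes = sorted(len(g) for g in groups)

# Invented FCI values of one RMS, one per sub-group.
fci_per_group = [88.9, 89.6, 89.1, 89.8, 89.3, 89.5, 89.0, 89.7, 89.2, 89.4]
lower, upper = t_confidence_interval(fci_per_group)
```

Dealing a single shuffled list into groups guarantees that no plant appears in two sub-groups, which is what sampling without replacement requires here.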

Data availability
The spectral data used in this study are available from the corresponding author upon request.