Discovery of the Potential Biomarkers for Discrimination between Hedyotis diffusa and Hedyotis corymbosa by UPLC-QTOF/MS Metabolome Analysis

Hedyotis diffusa Willd. (HD) and Hedyotis corymbosa (L.) Lam. (HC), two closely related species of the same genus, are both used for health benefits and disease prevention in China. HC is also indiscriminately sold as HD in the wholesale chain and food markets. This confusion has led to growing concern about their identification and quality evaluation. To further understand the molecular differences between them, we focused on screening their chemical components and analyzing non-targeted metabolites. In this study, UPLC-QTOF-MSE, the UNIFI platform and multivariate statistical analyses were used to profile them. First, a total of 113 compounds, including 80 chemical constituents shared by the two plants, were identified from HC and HD using the UNIFI platform. Second, the differences between the two herbs were highlighted by comparative analysis. As a result, a total of 33 robust biomarkers enabling the differentiation were discovered using multivariate statistical analyses. For HC, there were 18 potential biomarkers (either present at much higher contents than in HD or detected only in HC), including three iridoids, eight flavonoids, two tannins, two ketones, one alcohol and two monoterpenes. For HD, there were 15 potential biomarkers (either present at much higher contents than in HC or detected only in HD), including two iridoids, eight flavonoids, one tannin, one ketone and three anthraquinones. Considering the contents or the MS responses of the chemical components, hedycoryside A and B, detected only in HC, could be used for rapid identification of HC. The compounds 1,3-dihydroxy-2-methylanthraquinone and 2-hydroxy-3-methylanthraquinone, detected only in HD, could be used for rapid identification of that plant.
The systematic comparison of similarities and differences between these two easily confused Chinese herbs provides reliable characterization profiles to clarify their fundamental pharmacological substances. HC should not be used as a substitute for HD.


Introduction
Hedyotis diffusa Willd. (HD) is a well-known Chinese folk medicine with a spectrum of pharmacological activities, including anti-cancer, antioxidant, anti-inflammatory, anti-fibroblast, immunomodulatory and neuroprotective effects, of which the anti-cancer effect is the most important in practice [1].
Acetonitrile and methanol were of UPLC-MS pure grade (Fisher Chemical Company, Geel, Belgium). Formic acid for UPLC was purchased from Sigma-Aldrich Company (St. Louis, MO, USA). Deionized water was purified using a Millipore water purification system (Millipore, Billerica, MA, USA). All other chemicals were of analytical grade.

Sample Preparation and Extraction
All whole plants, including HC (HC1~HC10) and HD (HD1~HD10), were air-dried, ground and sieved (40 mesh) to obtain homogeneous powders. Then, the powder of each of the 20 samples (200 mg per sample) was extracted with 80% methanol (2 L × 3) at 80 °C three times (3 h each time) by the reflux method. The extraction procedure was repeated until the extracted solution was colorless. After filtration, the extracts of each sample were combined, concentrated and evaporated to dryness. As a result, 20 desiccated extract powders were obtained. Each powder was dissolved in 1.0 mL of 80% methanol. Subsequently, each methanolic solution was filtered and injected directly into the UPLC system. The injection volume was 2 µL for each run. Furthermore, a methanol blank was run with the same gradient program between samples throughout the whole sample list. The wash volume between injections was sufficient to avoid carryover. Meanwhile, 20-µL aliquots of each HD and HC sample were mixed to obtain a quality control (QC) sample, which contained all of the components in the analysis. The QC sample was run after every five samples to monitor the stability of the system.
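The injection scheme described above (a methanol blank between samples and a QC injection after every five samples) can be sketched as a small run-list builder. This is only an illustration: the function name, the labels "BLANK" and "QC", and the order in which HC and HD samples are injected are assumptions, since the text does not state the actual sequence.

```python
def build_run_sequence(sample_ids, qc_every=5):
    """Build an injection list with a blank between samples and a QC
    injection after every `qc_every` samples (hypothetical ordering)."""
    sequence = []
    for index, sample_id in enumerate(sample_ids, start=1):
        sequence.append(sample_id)
        if index % qc_every == 0:
            sequence.append("QC")      # pooled quality-control sample
        if index < len(sample_ids):
            sequence.append("BLANK")   # methanol blank before the next sample
    return sequence


# Assumed order: all HC samples first, then all HD samples.
samples = [f"HC{i}" for i in range(1, 11)] + [f"HD{i}" for i in range(1, 11)]
run_sequence = build_run_sequence(samples)
```

With 20 samples this gives 4 QC injections and 19 blanks; in practice the sample order would often be randomized, which this sketch does not do.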

Ultra-High Performance Liquid Chromatography with Quadrupole Time-of-Flight Tandem Mass Spectrometry (UPLC-QTOF-MS)
The separation and MS detection of components were performed on a Waters Xevo G2-XS QTOF mass spectrometer (Waters Co., Milford, MA, USA) connected to the UPLC system through an electrospray ionization (ESI) interface. UV wavelength did not trigger the MS detection of components. The column used was an ACQUITY UPLC BEH C18 (100 mm × 2.1 mm, 1.7 µm) from Waters Corporation (Milford, MA, USA). The mobile phases consisted of eluent A (0.1% formic acid in water, v/v) and eluent B (0.1% formic acid in acetonitrile, v/v) at a flow rate of 0.4 mL/min with the following linear gradient program: 10% B from 0 to 2 min, 10-90% B from 2 to 25 min, 90% B from 25 to 26 min and 90-10% B from 26 to 26.1 min. The temperatures of the UPLC column and the sample were set at 30 °C and 15 °C, respectively. Mixtures of 10/90 and 90/10 water/acetonitrile were used as the strong wash and the weak wash solvents, respectively. The optimized instrumental parameters were as follows: capillary voltage at 2.6 kV (ESI+) or 2.2 kV (ESI−), cone voltage at 40 V, source temperature at 150 °C, desolvation temperature at 400 °C, cone gas flow at 50 L/h and desolvation gas flow at 800 L/h. In MSE mode, the collision energy of the low-energy function was set to 6 V, while the ramp collision energy of the high-energy function was set to 20-40 V. Each sample was analyzed in UPLC-QTOF-MSE mode; data acquisition was performed by rapidly switching the mass spectrometer from a low-collision energy (CE) scan to a high-CE scan during a single LC run. The low-CE experiment provides information about the intact molecular ion, e.g., [M+H]+, while the high-CE scan generates fragment ion information. Alignment of the low-CE and high-CE data is performed automatically by the software. To ensure mass accuracy and reproducibility, the mass spectrometer was calibrated over a range of 100-1200 Da with sodium formate. Leucine enkephalin was used as the external Lock Spray™ reference, infused at a constant flow of 10 µL/min.
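The gradient program can be written as a list of time/composition breakpoints with linear interpolation between them. The following is a minimal sketch; the names GRADIENT and percent_b are illustrative and not part of any instrument software.

```python
# (time in min, % eluent B) breakpoints taken from the gradient program above
GRADIENT = [(0.0, 10.0), (2.0, 10.0), (25.0, 90.0), (26.0, 90.0), (26.1, 10.0)]


def percent_b(t_min):
    """Return the programmed % B at time t_min by linear interpolation."""
    if t_min < GRADIENT[0][0] or t_min > GRADIENT[-1][0]:
        raise ValueError("time is outside the gradient program")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


mid_gradient = percent_b(13.5)  # halfway through the 2-25 min ramp
```

Eluent A is simply 100 minus the returned value at any time point.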
In addition, MassLynx data were recorded in continuous mode during acquisition.

Chemical Information Database for the Components of HC and HD
In addition to the Waters Traditional Medicine Library in the UNIFI software, a systematic investigation of chemical constituents was conducted. A self-built database of compounds isolated from HC and HD was established by searching online databases such as China Journals of Full-Text Database (CNKI), PubMed, Medline, Web of Science and ChemSpider. The name, molecular formula and structure of each component from HC and HD were recorded in the database.

Data Analysis by UNIFI Platform
Data analysis was performed with UNIFI 1.7.0 software (Waters, Manchester, UK). Emphasis was put on analyzing structural characteristics and MS fragmentation behaviors, especially characteristic fragments. A minimum peak area of 200 was set for 2D peak detection. For 3D peak detection, the selected parameters were a high-energy peak intensity above 200 counts and a low-energy peak intensity above 1000 counts. A mass error of up to 5 ppm was allowed for identified compounds. We selected positive adducts containing +H and +Na and negative adducts including +COOH and −H. For exact mass accuracy, leucine enkephalin was used as the reference compound in the UNIFI platform, with [M+H]+ 556.2766 for positive ion mode and [M−H]− 554.2620 for negative ion mode.
The MS raw data were processed using the streamlined workflow of the UNIFI software to quickly identify the chemical components that met the match criteria with the Traditional Medicine Library. First, because some compounds were missing from the Traditional Medicine Library, an in-house scientific library containing the chemical components of the target herbs reported in the literature was created, saved in Mol file format and imported into the analysis method. Second, the raw data were compressed with the Waters Compression and Archival Tool v1.10 and imported into the software. Third, automated screening and identification were performed by the UNIFI platform instead of manually extracting each individual chromatographic peak, calculating the elemental composition and then analyzing MS fragmentation behaviors. Fourth, we set up a filter to refine the results, requiring a mass error between −5 and 5 ppm and a response value greater than 6000. Finally, the compounds were further verified by comparison with the retention times of reference substances and with characteristic MS fragmentation patterns reported in the literature. After processing and filtering of the data by UNIFI, all selected components were listed for further verification, including information such as compound name, chemical structure, mass error, adducts, response, extracted ion chromatograms and low-energy and high-energy spectra. The components were listed in descending order of response and confirmed with reference substances or by comparison with the literature.
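The filtering step (mass error within ±5 ppm, response above 6000, results listed by descending response) can be illustrated with a short sketch. It is not UNIFI's implementation: the function names, the dictionary keys and all candidate values below are hypothetical and chosen only to show the arithmetic.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def passes_filter(candidate, max_abs_ppm=5.0, min_response=6000):
    """Apply the two criteria from the text: |mass error| <= 5 ppm and response > 6000."""
    error = ppm_error(candidate["observed_mz"], candidate["theoretical_mz"])
    return abs(error) <= max_abs_ppm and candidate["response"] > min_response


# Hypothetical candidates (names and numbers are made up for illustration).
candidates = [
    {"name": "candidate_A", "observed_mz": 415.1250, "theoretical_mz": 415.1240, "response": 25000},
    {"name": "candidate_B", "observed_mz": 415.1280, "theoretical_mz": 415.1240, "response": 25000},
    {"name": "candidate_C", "observed_mz": 300.0000, "theoretical_mz": 300.0003, "response": 4000},
    {"name": "candidate_D", "observed_mz": 200.0005, "theoretical_mz": 200.0000, "response": 90000},
]

# Keep candidates that pass, listed by descending response.
kept = sorted(
    (c for c in candidates if passes_filter(c)),
    key=lambda c: c["response"],
    reverse=True,
)
kept_names = [c["name"] for c in kept]
```

Here candidate_B fails on mass error (about 9.6 ppm) and candidate_C fails on response, so only candidate_D and candidate_A remain.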

Metabonomics Analysis
MarkerLynx XS V4.1 software (Waters, Manchester, UK) was used to process the raw data for alignment, deconvolution, data reduction, etc. As a result, a list of mass and retention time pairs with corresponding intensities was obtained for all the detected peaks from each data file. The main parameters were as follows: retention time range 0-26 min, mass range 100-1200 Da, mass tolerance 0.10, minimum intensity 5%, marker intensity threshold 2000 counts, mass window 0.10, retention time window 0.20, and noise elimination level 6. Furthermore, also with the MarkerLynx XS V4.1 software, principal component analysis (PCA) and orthogonal projections to latent structures discriminant analysis (OPLS-DA) were applied to the resulting data. Whether the two species differ is judged from the separation between the HD and HC groups: an obvious separation in the PCA score plots means that they can be differentiated. The supervised pattern recognition approach OPLS-DA can visualize and depict general metabolic variation between two groups. To identify the metabolites contributing to the discrimination, S-plots and VIP-plots were obtained via OPLS-DA to find potential biomarkers that significantly contributed to the difference between HC and HD. Each spot in the S-plots represents a variable. The importance of each variable to the classification is determined by its variable importance in the projection (VIP) value, and metabolites with a VIP value above 2.0 were considered potential markers.
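To make the role of the mass window, retention time window and marker intensity threshold concrete, the sketch below groups peaks from different runs into shared mass/retention-time markers. It is a simplified greedy stand-in, not MarkerLynx's actual algorithm, and it assumes the windows are in Da and min; the peak values and function names are hypothetical.

```python
def same_marker(peak_a, peak_b, mass_window=0.10, rt_window=0.20):
    """True if two peaks fall within both the mass and retention time windows."""
    return (abs(peak_a["mz"] - peak_b["mz"]) <= mass_window
            and abs(peak_a["rt"] - peak_b["rt"]) <= rt_window)


def group_peaks(peaks, intensity_threshold=2000):
    """Greedily assign peaks to markers, skipping peaks below the threshold."""
    markers = []
    for peak in sorted(peaks, key=lambda p: (p["rt"], p["mz"])):
        if peak["intensity"] < intensity_threshold:
            continue  # below the marker intensity threshold
        for marker in markers:
            if same_marker(marker[0], peak):
                marker.append(peak)
                break
        else:
            markers.append([peak])  # start a new marker
    return markers


# Hypothetical peaks from two runs.
peaks = [
    {"run": "HC1", "rt": 5.02, "mz": 415.12, "intensity": 5000},
    {"run": "HD1", "rt": 5.10, "mz": 415.15, "intensity": 8000},
    {"run": "HC1", "rt": 5.05, "mz": 415.40, "intensity": 6000},
    {"run": "HD1", "rt": 12.30, "mz": 301.07, "intensity": 1500},
]
markers = group_peaks(peaks)
marker_sizes = [len(m) for m in markers]
```

The first two peaks are merged into one marker, the third is kept separate because its mass differs by more than 0.10, and the fourth is dropped for low intensity.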

Identification of Components from HC and HD
A total of 113 compounds were identified or tentatively characterized in both positive and negative modes from HC and HD (Table 2). The base peak intensity (BPI) chromatograms are shown in Figure 1, and the chemical structures are shown in Figure 2. In HC and HD, 109 and 104 compounds were characterized, respectively. Both herbs are rich in natural components with various structural patterns, including iridoids, flavonoids, organic acids and organic acid esters, tannins and alcohols; the rest are triterpenoids, coumarins, alkaloid, phenol, amide and glycoside. The contents of the above components were similar in the two herbs.

Biomarker Discovery for HD and HC
PCA, a classic unsupervised dimension-reduction pattern recognition model, can be used to select distinct variables and to find potential biomarkers. It was first established based on the spectra of HD and HC samples to discern the presence of inherent similarities in mass spectral profiles, as displayed in Figure 3. Two parameters, R²X(cum) and Q²(cum), are commonly used to assess the quality of a PCA model, with values close to 1.0 indicating good fit and predictive ability. In the present study, R²X(cum) and Q²(cum) were 0.6909 and 0.6257, respectively, indicating good fit and predictive ability of the constructed PCA model.
Based on the obtained PCA score plots (Figure 3), the 20 samples were clearly divided into two main groups according to species (HD and HC). The HD samples overlapped noticeably, which indicates good similarity among them, and the same was observed for the HC samples. Meanwhile, the HD group and the HC group were completely separated, indicating that these two herbal species could be differentiated. The QC samples lay between the two species, which is consistent with their being mixed from the two in equal volumes (50% each). In order to distinguish HD from HC, OPLS-DA models were built in both positive and negative modes. The OPLS-DA score plot, S-plot, variable trend and VIP (variable importance in the projection) values were obtained to understand which variables are responsible for the separation [109].
As shown in Figure 4, OPLS-DA models were constructed to discriminate the difference under the already established separation between different groups based on the PCA results. Each model has 2 score components (HD and HC). These scores are weighted averages of the original variables, hence providing a good summary. In addition, these scores display the separation of the groups in both ESI+ and ESI− modes. The scores t[1] (x-axis) and to[1] (y-axis) are the two most important new variables in summarizing and separating the data. Each point in the plot corresponds to an observation. The groups are shown in different shapes, and the separation of the groups is easily visible in t[1]. The to[1] score values show the variation within each class. This variation can be caused either by biological variation or by systematic changes in the experimental setup. Figure 5 displays the variable importance (VIP) versus the PLS-regression coefficients. Important X-variables have large positive VIP values and large positive or negative coefficient values. The covariance p[1] and correlation p(corr)[1] loadings from a two-class OPLS-DA model are shown in S-plot format (Figure 6). The points are Exact Mass/Retention Time pairs (EMRTs). The upper right quadrant of the S-plot shows those components that are elevated in HC, the control group, while the lower left quadrant shows EMRTs elevated in HD, the treated group. The farther a point lies along the x-axis, the greater its contribution to the variance between the groups, while the farther it lies along the y-axis, the higher the reliability of the analytical result. Based on VIP values (VIP > 4) (Figure 5) and p values (p < 0.05) [110] from univariate analysis, and the identification of components from HC and HD (Table 2), 33 robust known biomarkers enabling the differentiation between HD and HC were discovered and marked in the S-plots (Figure 6).
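The selection rule described above (VIP > 4, p < 0.05, and the S-plot quadrant deciding the species in which a variable is elevated) can be sketched as follows. The VIP, p[1], p(corr)[1] and p values are assumed to have been exported from the modelling software beforehand; the function name, dictionary keys and all numbers are hypothetical.

```python
def classify_markers(variables, vip_cutoff=4.0, alpha=0.05):
    """Keep variables with VIP > cutoff and p < alpha, then assign each one
    to a species by its S-plot quadrant (upper right: HC, lower left: HD)."""
    selected = {"HC": [], "HD": []}
    for var in variables:
        if var["vip"] <= vip_cutoff or var["p_value"] >= alpha:
            continue
        if var["p1"] > 0 and var["p_corr"] > 0:
            selected["HC"].append(var["emrt"])
        elif var["p1"] < 0 and var["p_corr"] < 0:
            selected["HD"].append(var["emrt"])
    return selected


# Hypothetical EMRT variables; the values are illustrative only.
variables = [
    {"emrt": "5.02_415.12", "vip": 6.1, "p_value": 0.001, "p1": 0.12, "p_corr": 0.95},
    {"emrt": "9.80_253.05", "vip": 5.3, "p_value": 0.004, "p1": -0.10, "p_corr": -0.91},
    {"emrt": "3.40_389.11", "vip": 2.2, "p_value": 0.010, "p1": 0.03, "p_corr": 0.60},
    {"emrt": "7.15_609.15", "vip": 4.8, "p_value": 0.200, "p1": 0.08, "p_corr": 0.40},
]
biomarkers = classify_markers(variables)
```

The third variable is excluded by its VIP value and the fourth by its p value, leaving one candidate for each species.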
In order to systematically evaluate the biomarkers, a heatmap was generated from these biomarkers (shown in Figure 7), which shows distinct segregation between the two species.

Discussion
There were 109 and 104 compounds characterized from HC and HD, respectively. Sixty compounds were identified in ESI− mode and 53 compounds were identified in ESI+ mode. According to the BPI chromatograms of HC and HD, ESI− ionization mode appears to be better than ESI+ in terms of the number and the responses of the identified compounds, but it is still necessary to run the ESI+ mode because some compounds showed a better response there than in ESI− mode.
It was revealed that HD and HC differed in their chemical composition according to HPLC analysis [19]. It was also indicated that 6-O-(E)-p-coumaroyl scandoside methyl ester and 6-O-(E)-p-coumaroyl scandoside methyl ester-10-O-methyl ether were the main components of HD. In 2007, Liang et al. reported that HD and its substitutes could be identified based on HPLC chemical fingerprints and mass spectrometric analysis [25]. MS combined with UV spectra and literature values was used to obtain the chemical information. As a result, four compounds, asperuloside, 6-O-(E)-p-coumaroyl scandoside methyl ester, 6-O-(E)-p-coumaroyl scandoside methyl ester-10-methyl ester and 6-O-p-feruloyl scandoside methyl ester, were recommended as chemical markers for quality evaluation and chemical authentication of HD and its substitutes. In addition, scandoside methyl ester detected in the chromatograms of HC can be used as a characteristic peak [25]. Furthermore, a previous report found that hedyotiscone A could be used to differentiate HC from HD using a TLC method [22]. In our study, asperuloside, 6-O-(E)-p-coumaroyl scandoside methyl ester-10-methyl ester, scandoside methyl ester, 6-O-p-feruloyl scandoside methyl ester and hedyotiscone A were shared by HC and HD, but the reported result concerning 6-O-(E)-p-coumaroyl scandoside methyl ester was consistent with our findings.
In another report, another marker compound, 10(S)-hydroxypheophytin a, isolated with a yield of 22 mg from 600 g of HC, was identified exclusively in HD [23]. Unfortunately, it was not detected under our experimental conditions. Similarly, (9R,10S,7E)-6,9,10-trihydroxyoctadec-7-enoic acid, isolated with a yield of 47.9 mg from 20 kg of HC, was reported to be useful for differentiating HC from HD [26]. It was not detected under our experimental conditions either.
In this study, 33 known compounds enabling the robust differentiation between HC and HD were detected. For HC, there were 18 potential biomarkers, including three iridoids (23, 55, 66), eight flavonoids (30, 35, 40, 42, 47, 71, 75, 81), two tannins (19, 45), two ketones (22, 91), one alcohol (92) and two monoterpenes (89, 90). Among these potential biomarkers, the contents of nine components (19, 22, 23, 30, 35, 40, 45, 66
However, there are still some unresolved issues. First, the pharmacological effects associated with these identified compounds should be screened in the future. Second, as shown in the BPI chromatograms, although 113 compounds were identified, there are still some unidentified components. Further research should be carried out based on the formulas of these unknown compounds. Third, the source material does not reflect seasonal variation, as it was collected during the summer. Fourth, collecting HC and HD in the same area might allow a better comparison, but in this study HC was collected in Haikou City and HD in Fuzhou City. To some extent, these samples might serve as negative controls for the other species, because this could eliminate the influence of the region on the analysis of the samples. Unfortunately, the regional factor could not be evaluated here, because that would require more samples per region.

Conclusions
Under the optimized conditions, a total of 109 chemical compounds with different structural types were identified from HC and 104 from HD. The similarities and differences between these two herbs were also highlighted in the paper. Various structural patterns, including iridoids, flavonoids, organic acids and organic acid esters, tannins, alcohols, ketones, coumarins, anthraquinones, monoterpenes and triterpenoids, were present in these two herbs, of which 80 compounds were shared by HC and HD. The parent structure types differ considerably between HC and HD. A total of 33 robust biomarkers enabling the differentiation between HC and HD were discovered. For HC and HD, 18 and 15 potential biomarkers, respectively, were identified in this paper. Two iridoids, hedycoryside B (compound 55) and hedycoryside A (66), might be used for rapid identification of HC, and two anthraquinones, 1,3-dihydroxy-2-methylanthraquinone (compound 69) and 2-hydroxy-3-methylanthraquinone (78), might be used for rapid identification of HD based on their presence and content. These robust biomarkers are recommended for further use in recognizing and distinguishing HC and HD. The results provide reliable characterization profiles to identify these two herbs and to clarify their fundamental pharmacological substances. Different chemical compositions will inevitably lead to different biological effects of HC and HD in clinical application. HC should not be used as a substitute for HD. The results provide data on the chemical constituents of HC and a reference for the quality control of HD with respect to quantitative determination.