Secondary Metabolites Extracted from Annonaceae and Chemotaxonomy Study of Terpenoids

The Annonaceae family of plants is one of the most anatomically and structurally uniform families. Chemotaxonomy is commonly used to determine the chemical patterns within plant families at different phylogenetic levels. The aim of this study was to build a dataset of all the secondary metabolites isolated from the Annonaceae family and to perform the corresponding chemotaxonomic analysis using self-organizing maps (SOMs). The dataset comprises 5321 botanical occurrences and 1860 unique molecules present in all subfamilies of the Annonaceae. Diterpenes account for 366 unique compounds and 533 botanical occurrences, found in both the Annonoideae and Malmeoideae subfamilies. The Annoneae, Xylopieae and Miliuseae tribes had the highest number of botanical occurrences and were therefore selected for the analysis. Molecular descriptors of the diterpenes and their respective botanical occurrences were used to generate the SOMs. These SOMs showed clear tribe separations, with a match rate higher than 70%. Our results corroborate the morphological and molecular data. These models can be used to predict the phylogenetic position of certain diterpenes and to accelerate research on Annonaceae secondary metabolites and their biological potential.


Introduction
The Annonaceae family was first described by Antoine Laurent de Jussieu in 1789 and is known for its striking anatomical and structural uniformity. The family is very consistent morphologically and forms a distinctive, primitive group of angiosperms that is easy to identify. [1][2][3][4] Two recent studies are particularly relevant to the phylogenetic classification of the Annonaceae family. The first, carried out by Chatrou et al., 5 used eight plastid markers and representatives of 94 genera to formally classify the Annonaceae into four subfamilies: Anaxagoreoideae, Ambavioideae, Annonoideae and Malmeoideae. The two largest subfamilies, Annonoideae and Malmeoideae, were divided into 14 tribes. The second, conducted by Guo et al., 6 examined the phylogenetics of the Annonaceae using a supermatrix of eight chloroplast loci and 749 accessions representing 705 species (29% of ca. 2,400 species) from 105 genera (98% of the 107 genera currently accepted). This matrix included almost four times as many species, as well as representatives of 15 additional genera, compared with the first large phylogenetic study by Chatrou et al. 5 In addition to rebuilding the most comprehensive Annonaceae evolutionary tree, Guo et al. 6 also determined the phylogenetic position of five genera, Bocagea, Boutiquea, Cardiopetalum, Duckeanthus and Phoenicanthus, that had not been included in any previous phylogenetic reconstruction. Their work assessed the monophyletic status and phylogenetic relationships within each major clade, highlighting possible non-monophyly of genera and evaluating alternative resolutions for nomenclatural problems. Additionally, they identified and discussed unresolved problems such as the phylogenetic position and taxonomy of two genera, Froesiodendron and Melodorum, which have not yet been sampled. Finally, they provided an updated overview of the genera currently recognized in the family, together with their species richness.
Overall, Guo et al. 6 reorganized the phylogenetics and taxonomy of Annonaceae and concluded their study stating that the family contains four subfamilies, 15 tribes, 107 genera and 2400 species.
Annonaceae are economically important because their products are used in many ways: the fruits are used in cooking, the plants are used in the production of ropes, the wood is both light and durable, and the great diversity of chemical compounds with pharmacological activity inspires new medicines. [7][8][9] These chemical compounds, also known as secondary metabolites, have great structural diversity in this family and represent many chemical classes including but not limited to alkaloids, terpenes, acetogenins, and steroids. [10][11][12] One of the most common classes in Annonaceae is the terpenes. Terpenes are a very diverse class of substances and, in addition to their important role in natural plant defense mechanisms, they have several therapeutic uses in humans. 9,13 In the natural biosynthetic route, terpenes are formed from isoprene units, which are considered the basic units for the formation of both terpenes and steroids. Subclasses of terpenes include monoterpenes (two isoprene units, 10 carbons in their structure), sesquiterpenes (15 carbons), diterpenes (20 carbons) and triterpenes (30 carbons). 9,14
The information gathered from the chemical structures found in different species and genera has been and continues to be used in chemotaxonomy, that is, to determine the chemical phylogenetic patterns of a given family. [15][16][17] For chemotaxonomy studies, it is common practice to use machine learning with either supervised or unsupervised algorithms. Examples of these techniques include neural networks (NN), support vector machines (SVM) and k-nearest neighbors (k-NN). [15][16][17]
Self-organizing maps (SOMs), which were developed by Kohonen, 18 are the main algorithm used in this study. A SOM is an unsupervised neural network that recognizes patterns and performs groupings based on exploratory analysis of the input data to generate non-linear relationships. [18][19][20] The SOM learning phase is competitive: there is no convergence or minimization criterion, and training runs for a defined number of iterations and weight adjustments. In addition, each variable is mapped onto a finite space of neurons organized in a typically two-dimensional arrangement (Kohonen map). [19][20][21] To generate the SOM model, the model is first trained on a portion of the data set aside for training. A second portion, the test set, is then used to evaluate the trained model. Using the results of the test set evaluation, we then select the models capable of correctly mapping the test set, whose instances are not present in the training data. [20][21][22][23][24] Vesanto et al. 25 created a unified distance matrix (U-matrix) that uses Euclidean distances to further analyze the SOM. This matrix makes it easier to visualize the possible groupings of the analyzed data. [25][26][27]
The goal of this study is to compile and integrate secondary metabolites isolated from Annonaceae into one curated dataset and to perform a chemotaxonomic analysis of diterpenes.
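The competitive SOM learning phase described in this introduction can be illustrated with a minimal sketch in plain Python. It is not the SOM Toolbox implementation used in this study; the function names, the linear decay of the learning rate, the Gaussian neighbourhood function and all default parameter values are assumptions chosen for illustration only.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, n_iter=500, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map for a fixed number of iterations (no convergence test)."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 if sigma0 is not None else max(rows, cols) / 2.0
    # Initialise every unit with random weights in [0, 1).
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # assumed linear learning-rate decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed shrinking neighbourhood radius
        x = rng.choice(data)
        br, bc = best_matching_unit(weights, x)  # competition: find the winning unit
        for r in range(rows):
            for c in range(cols):
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
                w = weights[r][c]
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])           # weight adjustment
    return weights
```

In this sketch each input would be the descriptor vector of one diterpene, scaled to a common range, and the trained `weights` grid plays the role of the Kohonen map.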

Results and Discussion
We collected and processed all Web of Science-indexed research papers published between 1970 and 2019 to create a database of secondary metabolites isolated from Annonaceae; the acetogenin class, which is exclusive to this family, was not included. As seen in Figure 1, interest in studying Annonaceae plants has grown over time. One explanation for this growth is the abundant and diverse biological activity of the Annonaceae, which arises from the structural diversity of their secondary metabolites. Alkaloids, for example, exhibit a wide variety of pharmacological activities and have been clinically studied for the treatment of cancer, Parkinson's disease, cardiovascular diseases, and various viral infections. [1][2][3][4]8,28,29 Our database consists of 5321 botanical occurrences and 1860 unique molecules present in all subfamilies, 12 tribes, 64 genera and 380 species of the Annonaceae. Terpenes and alkaloids are the largest classes present in these plants (Figure 2).
It is important to note that although Annonaceae has 107 genera and 2400 species, only a small percentage of them have been studied chemically, and our database was therefore considered comprehensive. The alkaloids present in the Annonaceae are isoquinolines; by biosynthetic origin, the main nuclei occurring in the family are the simple isoquinoline, proaporphine, aporphine, benzylisoquinoline, protoberberine, and phenanthrene types. 30,31 Terpenes, the second most common class in Annonaceae, occur in all subclasses (mono-, sesqui-, di-, and triterpenes), with the diterpenes being the most abundant. The most frequent diterpene skeletons are kaurane, trachylobane, labdane, and atisane, of which kaurane is the most common. Figure 3 shows the skeletons of some of the most frequent alkaloids and diterpenes in this family.
Once the database was compiled and the classes and skeletons of the secondary metabolites most present in Annonaceae were identified, the chemotaxonomic analysis was performed.
Chemotaxonomy is defined as a taxonomic classification method based on the chemical similarity of compounds identified in the organisms/plants being classified. 32 Thus, we sought to investigate chemical molecules that serve as taxonomic markers of the Annonaceae.
Given the assortment of the secondary metabolites collected, the terpenes were selected for the chemotaxonomic studies because they were the predominant class (46% of metabolites). As mentioned earlier, terpenes can be classified into mono-, sesqui-, di-, and triterpenes.
Among these four subclasses, about 50% of the terpenes were diterpenes. Annonaceae diterpenes have promising anti-inflammatory activity, making compounds of this class excellent candidates for clinical trials in anti-inflammatory therapy. 33 Diterpenes represented a total of 366 unique chemical structures and 533 botanical occurrences; the number of botanical occurrences is higher than the number of structures because some compounds are present in several species.
These 533 botanical occurrences are distributed across 8 tribes, 13 genera and 50 species in two subfamilies, Annonoideae and Malmeoideae, which are the largest subfamilies of the Annonaceae. The phylogenetic classification of the Annonaceae family proposed by Guo et al. 6 was used.
The three tribes with the highest number of botanical occurrences and molecules were then selected for the self-organizing maps, as the high number of diterpenes allows chemical patterns among the tribes to be recognized. These tribes were Annoneae, Xylopieae and Miliuseae, and Table 1 contains the botanical characteristics and quantities of the selected molecules.
The genera represented in each selected tribe are: Annona (Annoneae), Xylopia (Xylopieae), Polyalthia, Pseuduvaria, Piptostigma and Greenwayodendron (Miliuseae). Malmeoideae is the most studied genera of these tribes.
For the 521 molecules of the three selected tribes, molecular descriptors were calculated using the DRAGON 7.0 software, 34 which has 5270 descriptors organized in 30 logical blocks. From three of these blocks (ring descriptors, functional group counts, and atom-centred fragments), 60 molecular descriptors were selected.
The botanical occurrences were classified into the three selected tribes, and the values of the 60 molecular descriptors were used as input data in the SOM Toolbox software. 25 The self-organizing map of the diterpenes was then generated, with the diterpenes grouped into the three aforementioned tribes according to their chemical similarity. The resulting classification was then compared with the phylogenetic classification proposed by Guo et al., 6 which can be seen in Figure 4.
In the generated maps, the hit rate using the two types of DRAGON 7.0 descriptors was > 77%. A 5-fold cross-validation was then performed for the generated SOM model, in which the diterpenes were divided into five training sets and five test sets, always maintaining the proportion of molecules from the three tribes (Annoneae, Xylopieae and Miliuseae). The results of the validation are described in Table 2. Tables 2 and 3 also report the accuracy values for each training and test set. Accuracy summarizes the overall performance of the model, that is, the overall hit rate. It varies between 0 and 1, and the closer it is to 1, the more often the model assigns molecules to their correct tribes, for example, a molecule from the Annoneae tribe to the Annoneae tribe. Models with an accuracy greater than 0.70 are considered to perform excellently. 24 Table 2 shows that the hit rate was > 70% overall, with the best hit rate, 95%, for the Miliuseae tribe. The average hit rate of the test sets was 80%, very close to the average training hit rate of 83%, which indicates not only good predictive power but also that the model is robust. The applicability domain was also analyzed and covered > 99% of the predictions for the test sets.
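The stratified 5-fold partition and the accuracy metric described above can be sketched as follows. This is a minimal illustration in plain Python rather than the procedure actually run in the SOM Toolbox; the function names and the round-robin assignment of samples to folds are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the members of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def accuracy(true_labels, predicted_labels):
    """Fraction of samples assigned to their correct class (between 0 and 1)."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)
```

Each fold serves once as the 20% test set while the remaining four folds form the 80% training set, so every diterpene is predicted exactly once as an unseen sample.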
To verify the tribes' dependence on chemical similarity and the ability to separate them accordingly, the chemotaxonomic analysis was also performed using other machine learning algorithms, namely the support vector machine (SVM) and the k-nearest neighbors (k-NN) algorithm, in addition to neural maps generated using the fingerprint descriptors calculated by the DRAGON 7.0 software. The results for the Annoneae, Xylopieae and Miliuseae tribes are shown in Table 3 and, like those in Table 2, the hit rates are excellent.
To visualize the generated SOM, we use a U-matrix and display it alongside a principal component analysis (PCA) computed from the correlation matrix of the dataset used to generate the SOM. The PCA projection uses the two eigenvectors with the highest eigenvalues. In the PCA projection, neighboring map units are connected by lines to make the data on the map easier to visualize. The PCA has an explained variance of 37.04%, that is, the first two principal components alone capture more than one third of the total variance. Figure 5 shows the U-matrix of the generated SOM, where a chemical pattern separates the three tribes Annoneae (blue), Xylopieae (red) and Miliuseae (green); the separation is best observed in the PCA projection.
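The explained variance quoted above follows from the eigenvalues of the correlation matrix. As a short worked relation, where the symbols are notation introduced here for illustration (p is the number of descriptors and the eigenvalues are sorted in decreasing order), and using the fact that the eigenvalues of a p x p correlation matrix sum to p:

```latex
\[
\text{explained variance (two components)}
  = \frac{\lambda_1 + \lambda_2}{\sum_{i=1}^{p} \lambda_i}
  = \frac{\lambda_1 + \lambda_2}{p},
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0 .
\]
```

A reported value of 37.04% therefore means that the two largest eigenvalues together amount to about 0.37 p.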
We can see that the Miliuseae tribe, despite having the smallest number of diterpenes and, consequently, the fewest botanical occurrences, was the tribe with the best hit rates (greater than 85% for all algorithms and for the different descriptors in the SOM) and is structurally more distant from the Annoneae and Xylopieae tribes, corroborating the phylogenetic classification of Guo et al. 6 seen in Figure 4.
Annoneae and Xylopieae are part of the same subfamily, Annonoideae, which explains the proximity of the two tribes in the SOM, while Miliuseae is part of the Malmeoideae subfamily and is therefore further away. When observing the diterpenes of the tribes represented in the SOM (Figure 6), we can see that each tribe has a higher frequency of a certain subtype of diterpene. The subtypes present in the Annoneae and Xylopieae tribes, although different, maintain a certain chemical similarity in their skeletons, explaining once again the proximity of these two tribes in the SOM. Figure 6 shows some of the diterpenes isolated from each of the analyzed tribes, focusing on the most frequent skeletons identified in each tribe. The Miliuseae tribe has the clerodane subclass of diterpenes. The clerodane skeleton can undergo structural changes that generate several subtypes, 35 and the kolava subtype is present in the Miliuseae tribe. The Annoneae and Xylopieae tribes have kaurane and trachylobane diterpenes, respectively. Although different, these subclasses have similarities in their chemical skeletons, further supporting the closeness of the two tribes in the SOM.
The most significant descriptors in the separation of each cluster (each tribe in the SOM) are represented in Figure 7. For the Annoneae tribe, the descriptors with high values were (i) NROH, which describes hydroxyl groups (OH) linked to aliphatic groups, (ii) nOHp, which counts primary alcohols, (iii) C-006, which indicates CH2 carbons attached to a radical that is in turn attached to an OH, and (iv) O-056, which describes the alcohol function. These descriptors indicate that the diterpenes of this tribe are distinguished by the large number of hydroxyls in their chemical structure (Figure 6).
For the Xylopieae tribe, the most representative descriptors were nCIR, which indicates the number of circuits (rings/cycles connected to each other) present in the molecule, RFD, the ring fusion density, and RCI, which provides information about the ring complexity of the molecule. These descriptors point to the presence of molecules with a large number of interconnected rings/cycles; as seen in Figure 6, the diterpenes of this tribe have many interconnected rings/cycles and a certain degree of complexity. For the Miliuseae tribe, the descriptors with the highest values were nConj, which expresses the presence of conjugated non-aromatic carbons (sp2), NNRS, the normalized number of ring systems, which is the ratio between the number of ring systems (NRS) and the cyclomatic number (nCIC, which discriminates cyclic from acyclic compounds) and provides information related to the presence of aromatic rings in the chemical structure, and lastly ARR. The ARR, or aromatic ratio, is the ratio of the number of aromatic bonds to the total number of bonds in the molecule. These descriptors reveal that the diterpenes of this tribe have an aromatic ring and conjugated non-aromatic bonds, which can also be seen in Figure 6.
Figure 5. U-matrices (top) and PCA projection (bottom) of the generated SOM. The left U-matrix does not identify the tribes, while the right U-matrix identifies the tribes by color: Annoneae is blue, Xylopieae is red, and Miliuseae is green. The values shown on the scale between the two U-matrices represent the values of the molecular descriptors of the diterpenes, varying between 0.603 and 5.96. These values were used to group the diterpenes by tribe. The PCA projection of the SOM uses its two eigenvectors with the highest eigenvalues, and the tribes are plotted using the same identification colors as in the U-matrix.
Scotti et al. 15 constructed a SOM with nuclear magnetic resonance (NMR) data of 118 diterpenes from three genera of the Annonaceae: Xylopia, Polyalthia and Annona. The SOM was able to separate the diterpenes of the three genera using the NMR data, and specific 13C chemical shift values were observed for the skeletal carbons of each type of diterpene in each genus. Kaurane skeletons were found for Annona, while trachylobanes were found for Xylopia and clerodanes for Polyalthia.
Review papers concerning the Annona genus and some of its species have suggested that ent-kauranes are the most abundant diterpenes. [36][37][38] A review by Barbosa and Vega 9 highlights that diterpenes are the second most common class of secondary metabolites in species of the Xylopia genus, with kaurane, labdane, atisane and trachylobane diterpenes being the most frequent. Of these, trachylobanes are considered chemotaxonomic markers of Xylopia, as they are most abundant in Xylopia and are difficult to find elsewhere in Annonaceae. 9,39 The four genera selected from the Miliuseae tribe are those with the most phytochemical studies, with the Polyalthia and Pseuduvaria genera being the most chemically and biologically studied of the tribe. As for the other genera, studies in the literature show that the diterpenes most frequently isolated from Polyalthia and Pseuduvaria species are clerodanes. [40][41][42][43]

Conclusions
The literature corroborates the information obtained in this study. This study of Annonaceae diterpenes thus establishes a way to separate the Annoneae, Xylopieae and Miliuseae tribes in accordance with the family's morphological and taxonomic separation. This separation makes it possible to predict the position of a given diterpene among the Annoneae, Xylopieae and Miliuseae tribes of the Annonaceae and to search for these secondary metabolites and their biological potential more effectively.

Construction of the Annonaceae database
The articles used for the construction of the database were selected by means of an electronic search in the Web of Science database and comprised studies and literature reviews on secondary metabolites isolated from plants of the Annonaceae. The following terms were used in the search for scientific articles: "Annonaceae", "secondary metabolites", "terpenes", "alkaloids", "flavonoids". All secondary metabolites, the species from which they were isolated, and the geographic locations will be registered on the SISTEMATX 44
Of the 30 blocks of molecular descriptors available in the Dragon 7.0 software, 34 only the ring descriptors, functional group counts, and atom-centred fragments blocks were selected.

Pre-processing of data
In this step, the variables/descriptors were selected. This selection identifies the descriptors that are most important for grouping the diterpenes, which in this case were mostly those related to the tribes. The selection of descriptors is an important step that must be carried out before the model is generated, since it reduces the dimensionality of the data, helps to obtain a general rather than overfitted model, reduces computational cost, simplifies data extraction and transformation, and simplifies the presentation of the data. 48 In short, this step helps to reduce overfitting, increases the accuracy of the model, and reduces training time.
The pre-treatment criteria removed descriptors that had the same value for every compound in the series, descriptors that had only one different value, and descriptors that had a correlation greater than 0.99 with another descriptor. Most descriptors ended up being removed because many were inter-correlated; within each inter-correlated group, the descriptor retained was the one most correlated with the dependent variable.
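The three pre-treatment criteria can be sketched as a simple filter. This is an illustrative plain-Python version, not the routine actually applied to the Dragon output; the function names are hypothetical, and for brevity the sketch keeps the first descriptor of each highly correlated pair instead of the one most correlated with the dependent variable.

```python
import math
from collections import Counter


def pearson(a, b):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def filter_descriptors(columns, max_corr=0.99):
    """columns maps descriptor name -> list of values; return the names kept."""
    kept = []
    for name, values in columns.items():
        counts = Counter(values)
        # Constant descriptor: a single distinct value in the series.
        if len(counts) == 1:
            continue
        # Near-constant descriptor: all compounds but one share the same value.
        if max(counts.values()) >= len(values) - 1:
            continue
        # Highly correlated with a descriptor already kept (simplified rule).
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept
```

Because only non-constant columns ever reach the correlation check, the variances in `pearson` are never zero in this filter.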

Self-organizing maps (SOMs)
To build the neural maps, molecular descriptors were selected for the database of molecules isolated from the Annonaceae. The functional group, atom-centred fragment, and ring descriptors were selected. Then, within each block of descriptors, the constant variables and those with only one different value in the series were excluded.
The selected molecular descriptors were analyzed with SOMs in Matlab 6.5 and SOM Toolbox 2.0. 25,26,49 The SOM Toolbox is a set of Matlab functions for building and applying neural networks, containing functions for the creation, visualization, and analysis of self-organizing maps. The data set was presented to the network before any adjustments were made. Subsequently, at each training stage, the data were partitioned according to the regions of the weight vectors of the map. Then, the correct predictions for each set and the total correct predictions for the compounds were evaluated. For the most relevant models, the data were divided into training and test sets to assess predictive capacity. Training and test performance were assessed by calculating the proportion of samples correctly classified by the SOM. For each map, a 5-fold cross-validation was performed, with each partition comprising 80% training and 20% test data. In the SOM, the sites containing molecules were identified for each descriptor to highlight existing chemical patterns.
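One common way to turn a trained map into a classifier, and hence to compute the proportion of correctly classified samples, is to label each map unit with the majority class of the training samples it attracts. The sketch below assumes this majority-vote labelling; the text does not state which rule the SOM Toolbox workflow applied, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def nearest_unit(codebook, x):
    """Index of the codebook vector closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda u: sum((w - v) ** 2 for w, v in zip(codebook[u], x)))


def label_units(codebook, train_x, train_y):
    """Give each map unit the majority class of the training samples it attracts."""
    votes = defaultdict(Counter)
    for x, y in zip(train_x, train_y):
        votes[nearest_unit(codebook, x)][y] += 1
    return {unit: counts.most_common(1)[0][0] for unit, counts in votes.items()}


def hit_rate(codebook, unit_labels, xs, ys):
    """Proportion of samples whose best-matching unit carries their own class."""
    hits = sum(unit_labels.get(nearest_unit(codebook, x)) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)
```

Here `codebook` is a flat list of unit weight vectors; a test sample that lands on a unit that attracted no training samples counts as a miss in this sketch.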

SVM and k-NN models
Knime 3.6.2 software 50 was used to perform all the following analyses. The class descriptors and variables were imported from the Dragon 7.0 software 34 and, for each, the data were divided in the "partitioning" node with the "stratified sample" option to create a training set and a test set, covering 80 and 20% of the compounds, respectively. Although the compounds were selected at random, the same proportion of samples from each tribe was maintained in both sets. Two models were generated, one using the support vector machine (SVM) algorithm 51 and one using the k-nearest neighbors (k-NN) algorithm. 52 External cross-validation was performed five times.
SVM is a supervised machine learning algorithm that analyzes data and recognizes patterns. 51,53 For all the models generated, the SVM used a polynomial kernel with power 1.0, bias 1.0, and gamma 1.0.
k-NN is an instance-based machine learning method in which the function is approximated only locally (from the neighbors), so that all computation is postponed until classification. 53,54 It is a technique that weights the contributions of the neighbors, so that the closest neighbors contribute more to the average than the more distant ones. [52][53][54] The parameter selected for the k-NN for all the generated models was k = 3.
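The distance-weighted vote with k = 3 can be sketched in a few lines of plain Python. This is not the Knime node used in this work; the inverse-distance weighting scheme and the function name are assumptions made for illustration.

```python
import math
from collections import defaultdict


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by a distance-weighted vote among its k nearest neighbours."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbours:
        # Assumed weighting: closer neighbours contribute more to the vote.
        votes[label] += 1.0 / (distance + 1e-9)
    return max(votes, key=votes.get)
```

No model is fitted in advance: all distances are computed only when a query compound is classified, which is what makes the method instance-based.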