MicotoXilico: An Interactive Database to Predict Mutagenicity, Genotoxicity, and Carcinogenicity of Mycotoxins

Mycotoxins are secondary metabolites produced by certain filamentous fungi. They are common contaminants found in a wide variety of food matrices, thus representing a threat to public health, as they can be carcinogenic, mutagenic, or teratogenic, among other toxic effects. Several hundreds of mycotoxins have been reported, but only a few of them are regulated, due to the lack of data regarding their toxicity and mechanisms of action. Thus, a more comprehensive evaluation of the toxicity of mycotoxins found in foodstuffs is required. In silico toxicology approaches, such as Quantitative Structure–Activity Relationship (QSAR) models, can be used to rapidly assess chemical hazards by predicting different toxicological endpoints. In this work, for the first time, a comprehensive database containing 4360 mycotoxins classified in 170 categories was constructed. Then, specific robust QSAR models for the prediction of mutagenicity, genotoxicity, and carcinogenicity were generated, showing good accuracy, precision, sensitivity, and specificity. It must be highlighted that the developed QSAR models are compliant with the OECD regulatory criteria, and they can be used for regulatory purposes. Finally, all data were integrated into a web server that allows the exploration of the mycotoxin database and toxicity prediction. In conclusion, the developed tool is a valuable resource for scientists, industry, and regulatory agencies to screen the mutagenicity, genotoxicity, and carcinogenicity of non-regulated mycotoxins.


Introduction
Mycotoxins are common contaminants present in several human food products and animal feed. They are produced during the secondary metabolism of different filamentous fungi (molds), being the most common mycotoxin-producing fungal species Aspergillus, Fusarium, Penicillium, Claviceps, and Alternaria [1]. These genera are responsible for the production of an important variety of mycotoxins, which also include the main mycotoxins reported, such as aflatoxins, trichothecenes, fumonisins, ochratoxins, patulin, and zearalenone [2]. Some differences can be observed between strains: while Fusarium species usually infect growing crops in the field, Aspergillus and Penicillium species frequently grow on foods and feeds during the storage stage [3]. As a result, there is a wide range of Some of the larger, more traditional categories have a higher structural variability because they include several subgroups of compounds. For instance, cytochalasan alcaloids can be subclassified into chaetoglobosins, daldinins, and chalasins. In particular, trichothecenes, the most abundant category, with more than 350 compounds, has a high number of subcategories, as can be explored in the web server (https://chemopredictionsuite. com/MicotoXilico, accessed on 20 May 2023).
Some compounds could not be associated with a specific structural category, and they have therefore been classified into more general groups comprising alkaloids, terpenoids, amino acid derivatives, peptides, clyclic peptides, nucleotides, anthraquinones, benzoquinones, naphthalenes, phenols, phtalates, furans, lactones, thiazolidines, and fatty acid-like compounds. Furthermore, 79 compounds could not been associated with any category and were labelled as "not classified".

QSAR Models Building
Once we had generated a comprehensive database of mycotoxins, we wanted to optimize several machine learning models to predict the genotoxicity, mutagenicity, and carcinogenicity of these compounds. The tests chosen for the characterization of each type of toxicity follow the recommended endpoint protocols of the OECD [26]. For data searching, we overlapped our ensemble of mycotoxins with different databases of mutagenicity, genotoxicity, and carcinogenicity to obtain experimental data based on the Ames test for mutagenicity, in vitro and in vivo MN assay for genotoxicity, and data from in vivo models for carcinogenicity. As expected, experimental data could only be found for a relatively low number of compounds (350-100 compounds, depending on the endpoint).
These data were then used for the building and validation of robust QSAR models to predict the four endpoints. For model building, we followed the protocols described in the materials and methods section that meet the requirement of the five principles of the OECD for QSAR model building in a regulatory context [27].
A summary of the characteristics of the final classification models can be found in Figure 2, including the number of compounds in the training and test set, the descriptor selection method, the number of descriptors, the compound/descriptor ratio, and the model algorithm. In the Appendix A, a list with the molecular descriptors selected for each model can be found (Tables A2-A6). It is worth mentioning that the descriptors were selected from a panel of more than 4000, including 2D and 3D descriptors, in order to obtain the best chemical description of the complex mycotoxin structure.
For mutagenicity, models based only on mycotoxin data could be build, as data for up to 365 compounds could be retrieved, with a balanced proportion between mutagenic and non-mutagenic compounds. Two mutagenicity QSAR models (model A and model B) were generated applying two different data selection criteria (for details, see Section 5.3). In both cases, all the parameters included in the metrics were higher than 0.8, thus showing a good performance. Both models were able to correctly predict the mutagenicity of almost 90% of compounds in the internal cross validation.
For genotoxicity, data retrieval was much more complicated, as very few data from non-genotoxic mycotoxins could be detected. Therefore, we decided to build mixed models including mycotoxins and other organic compounds (for details, see Section 5.4). Regarding the in vitro genotoxicity QSAR model, information on the in vitro MN assay from 455 compounds was considered (Figure 2). For a more comprehensive analysis of genotoxicity, we decided to also include an in vivo genotoxicity model. In this case, the ProtoPRED model (https://protopred.protoqsar.com/, accessed on 20 May 2023) based on the in vivo MN assay was applied, built from a training set that included mycotoxins. Regarding the performance of these models, most parameters were higher than 0.8 ( Figure 3d). Only the specificity was moderate, as some non-genotoxic compounds were predicted as genotoxic in the internal validation. This result could relate to the fact that only a few data from non-genotoxic mycotoxins could be found.
For carcinogenicity, again we applied the in vivo carcinogenicity QSAR model from ProtoPRED (https://protopred.protoqsar.com/, accessed on 20 May 2023), containing mycotoxins in the training set, as not enough carcinogenicity data were retrieved to build a specific model. The metrics obtained for the carcinogenicity QSAR model showed a good performance on the training set, with all parameters being higher than 0.9. Parameters on the test set showed lower values, especially for precision and sensitivity, but are still close to 0.7, the value recommended for QSAR models by ECHA [28].
When we compare the metrics of the training and test sets in Figure 2, we can see that there are almost no differences for the mutagenicity models, small differences in the in vitro genotoxicity model, and a higher difference in the carcinogenicity and the in vivo genotoxicity. This result is coherent with the fact that the mutagenicity models were built only with mycotoxins, while in the other models, only part of the training set were mycotoxins, meaning that we have more structural differences between the training and test set.      When we compare the metrics of the training and test sets in Figure 2, we can see that there are almost no differences for the mutagenicity models, small differences in the in vitro genotoxicity model, and a higher difference in the carcinogenicity and the in vivo genotoxicity. This result is coherent with the fact that the mutagenicity models were built

QSAR Model Application to an External Validation Set
After model building and internal validation, we decided to further confirm the performance of the models by performing an external validation with an independent set Since only the mutagenicity model A is considered valid from a regulatory point of view, we decided to subject only this model to external validation. The list containing the different mycotoxins used for the external validation for each model can be found in Tables A7-A10 of the Appendix A. Figure 3 shows the resulting confusion matrix for all model validations, proving that the experimental data were in general well predicted.
For mutagenicity (Figure 3a), the confusion matrix showed that the model was capable of correctly classifying more than 0.8 of the compounds (0.83 accuracy). The in vitro genotoxicity QSAR model showed the highest accuracy (0.93) within all developed models. All genotoxic compounds were predicted as positive (1.00 of sensitivity), while only one compound out of 7 non-genotoxic mycotoxins was predicted as genotoxic (0.86 of specificity) (Figure 3b), thus improving the specificity of the internal validation of the in vitro model. For the in vivo genotoxicity model, however, although obtaining good values for accuracy (0.81) and sensitivity (0.88), the specificity was again low (0. 46), proving that this model was not accurate for the prediction of non-genotoxic compounds.
The carcinogenicity model was applied to a validation set of 75 mycotoxins, showing an accuracy of 0.81. The 89% of non-carcinogenic mycotoxins were predicted as inactive compounds, while 77% of carcinogenic mycotoxins were predicted as active compounds ( Figure 3d).
Thus, even if the models were built with a small data set of experimental data from mycotoxins, they seem to have a good predictive power on the external validation set of mycotoxins.
In order to see how well our models were adapted to mycotoxins, in comparison with other, more general toxicological QSAR models, we also performed a prediction with the validation data set applying three other reference QSAR tools: VEGA [29], Leadscope, and Case Ultra (the last two being integrated into QSARToolbox) [30]. We predicted the same four endpoints, obtained from the same or a very similar protocol. The metrics of these predictions can be found on Table A11.
We can observe that, in general, our models provide a better prediction for the selected test mycotoxins. Only for mutagenicity do we obtain a better prediction with Case Ultra; however, six compounds could not be predicted by this model because they were outside the applicability domain. Also, for the other models, the prediction of several mycotoxins could not be performed because they were not in the applicability domain. We have also found that, in some cases, VEGA could not correctly read and normalize structures with aromatic rings, probably due to the aromaticity model that the software uses. These results show that our models are better adapted to complex structures with several aromatic rings, which are typical mycotoxin structures.

Mutagenicity, Genotoxicity, and Carcinogenicity Prediction of the General Mycotoxin Database
After model building, we wanted to perform a prediction of the genotoxicity, mutagenicity, and carcinogenicity of the whole mycotoxin database described in Section 2.1. A general overview of the results is presented in Figure 4, and detailed predictions can be explored in the MicotoXilico web application (https://chemopredictionsuite.com/ MicotoXilico, accessed on 20 May 2023).  From the figure, we can appreciate that a very high percentage of mycotoxins predicted to be genotoxic, mutagenic, and/or carcinogenic. This result is not surprisin the definition of mycotoxin already assumes that we are dealing with toxic compou In particular, genotoxicity seems to be a property of most mycotoxins, as 80-90% of t compounds are predicted as genotoxic. However, further studies are required to con these results, as the models were built with only a few data from non-genotoxic myco ins, and the genotoxicity models only have a moderate specificity.
We also used the Benigni and Bossa rules for mutagenicity [31], implemented in toICH software (https://protopred.protoqsar.com/ accessed on 23 May 2023) to iden the presence of structural mutagenicity alerts in our database. The number of molec detected as positive based on structural alerts were significantly less in comparison w the molecules detected as positive based on QSAR models (https://chemopredict suite.com/MicotoXilico accessed on 23 May 2023). On the contrary, when we perfor the same comparison with a set of over 6000 general organic compounds, most of compounds predicted as mutagenic had a mutagenicity alert (data not shown).
In order to compare the toxicity between mycotoxin categories, we represented percentage of genotoxic, mutagenic, and carcinogenic compounds for the 30 major m toxin categories in a heatmap ( Figure 5). From the figure, we can appreciate that a very high percentage of mycotoxins are predicted to be genotoxic, mutagenic, and/or carcinogenic. This result is not surprising, as the definition of mycotoxin already assumes that we are dealing with toxic compounds. In particular, genotoxicity seems to be a property of most mycotoxins, as 80-90% of these compounds are predicted as genotoxic. However, further studies are required to confirm these results, as the models were built with only a few data from non-genotoxic mycotoxins, and the genotoxicity models only have a moderate specificity.
We also used the Benigni and Bossa rules for mutagenicity [31], implemented in Pro-toICH software (https://protopred.protoqsar.com/, accessed on 20 May 2023) to identify the presence of structural mutagenicity alerts in our database. The number of molecules detected as positive based on structural alerts were significantly less in comparison with the molecules detected as positive based on QSAR models (https://chemopredictionsuite.com/ MicotoXilico, accessed on 20 May 2023). On the contrary, when we performed the same comparison with a set of over 6000 general organic compounds, most of the compounds predicted as mutagenic had a mutagenicity alert (data not shown).
In order to compare the toxicity between mycotoxin categories, we represented the percentage of genotoxic, mutagenic, and carcinogenic compounds for the 30 major mycotoxin categories in a heatmap ( Figure 5). As we can see, there are several categories that are specially concerning, as they have a very high toxicity prediction in mutagenicity, genotoxicity, and carcinogenicity ( Figure  5), such as aflatoxins, chromanes, fusidic acids, griseofulvins, nitropropionic acids, sterigmatocystins, and verticillins. Some categories have non-positive predictions in some of the toxicity endpoints (taxols and lovastatins), but there is no type of compound that is negative in all three toxicities.
In many categories, the mutagenicity alerts correspond to a very high mutagenicity prediction (<80%), such as in aflatoxins, emodins, napththazarins, and sterigmatocystins. However, it is worth highlighting that several categories without alerts showed a high predicted mutagenicity index, such as alternarenes, chromanes, enniatins and beauvericins, fusidic acids, nitropropionic acids, and verticillins. As we can see, there are several categories that are specially concerning, as they have a very high toxicity prediction in mutagenicity, genotoxicity, and carcinogenicity ( Figure 5), such as aflatoxins, chromanes, fusidic acids, griseofulvins, nitropropionic acids, sterigmatocystins, and verticillins. Some categories have non-positive predictions in some of the toxicity endpoints (taxols and lovastatins), but there is no type of compound that is negative in all three toxicities.
In many categories, the mutagenicity alerts correspond to a very high mutagenicity prediction (<80%), such as in aflatoxins, emodins, napththazarins, and sterigmatocystins. However, it is worth highlighting that several categories without alerts showed a high predicted mutagenicity index, such as alternarenes, chromanes, enniatins and beauvericins, fusidic acids, nitropropionic acids, and verticillins.

Discussion
In this work, we explored the application of in silico QSAR models to obtain toxicity data of mycotoxin, which are urgently needed for public health regulations. In order to obtain a general overview of the type and number of mycotoxins that exist, we performed a comprehensive search, generating for the first time a database containing almost 4400 compounds. Our research revealed that the number and diversity of the identified mycotoxins is much higher than generally assumed: most publications only indicate the existence of several hundreds of mycotoxins, while we found more than 4000 known mycotoxins. It is worth mentioning the high structural diversity of these compounds, which generally have relatively high molecular weights and many structures including sugar, aromatic, or peptide rings.
The constructed database allowed us to perform a search of mutagenicity (Ames test), genotoxicity (in vivo and in vitro MN assay), and carcinogenicity (in vivo models) data by overlapping with different experimental databases. The search confirmed a low availability of experimental data, covering only a small percentage of the total existing mycotoxins. Nevertheless, we could obtain and validate robust QSAR models that predicted the four endpoints on an external validation set of mycotoxins with a relatively high accuracy, sensitivity, and specificity. These good results could be achieved by taking into account the specific characteristics of mycotoxins during the model building process, creating an appropriate applicability domain by the inclusion of mycotoxin and mycotoxin-like structure in the training set, and using a broad panel of 2D and 3D chemical descriptors. This was further confirmed by a comparison of the prediction of the mycotoxin's validation set with three other, more general, QSAR reference tools (Table A11), which provided a worse prediction and only included part of the molecules in their applicability domain. In the case of the genotoxicity, the specificity of the predictions was moderate to low, probably due the fact that only a few data from non-genotoxic mycotoxins were retrieved and incorporated in the models. Thus, further experiments are required to confirm the existence of non-genotoxic categories of mycotoxins. However, following the caution principle, a model with lower specificity (predicting false positives) is preferred over a model with lower sensitivity (predicting false negatives).
When applying the prediction models to the database constructed containing almost 4400 mycotoxins, we obtained a very high proportion of mutagenic, carcinogenic, and especially genotoxic compounds. This result is not unexpected, as mycotoxins are defined per se as toxic compounds, and compounds of many categories proved to induce acute toxicity. However, differences between categories can be observed, mainly due to differences in the chemical structure. Concerning genotoxicity, all major categories included compounds with a positive genotoxicity prediction. Among them, some categories are well known to be genotoxic, such as aflatoxins, ochratoxins, or sterigmatocystins. Indeed, 97% of mycotoxins from the ochratoxin family have been predicted as genotoxic compounds. This result agrees with the scientific opinion published by the EFSA in 2020 [32] indicating the genotoxicity of ochratoxin A and thus eliminating the previously established TDI and establishing instead an MOE (Margin Of Exposure), as no threshold can be allowed for genotoxic compounds. In the case of sterigmatocystin, a mycotoxin structurally related to aflatoxin B1, it has been demonstrated to induce tumors in diverse animal species, and thus, it is a known carcinogen mycotoxin [33], which agrees with the prediction performed with our QSAR models.
However, some categories showing a high percentage of genotoxic potential, are not well studied. For instance, griseofulvins were predicted as genotoxic by both in vitro and in vivo QSAR models. In the literature, animal studies have shown evidence that they are able to cause a variety of acute and chronic toxic effects, including liver and thyroid cancer in rodents, abnormal germ cell maturation, teratogenicity, and embryotoxicity in various species [34].
Regarding enniatins and beauvericins, commonly named as emerging Fusarium mycotoxins, the EFSA concluded in 2014 that a risk assessment was not possible given the lack of relevant toxicity data [35]. On one hand, in vitro genotoxicity data available suggested a potential genotoxic effect for beauvericin, while in vitro genotoxicity data for enniatins were negative. These results agree with those predicted by the in vitro QSAR model developed in our study ( Figure 6). On the other hand, there are no in vivo genotoxicity data for either beauvericin or enniatins and no studies on carcinogenicity of beauvericin and enniatins have been identified, and thus, the use of in silico predictions for these endpoints can provide valuable information. Thus, according to predictions on enniatins and beauvericin, 79% and 71% of compounds from this category were predicted as genotoxic (in vivo model) and carcinogenic, respectively. In addition, 71% of enniatins and beauvericin were also predicted as mutagenic, thus suggesting a careful assessment of the emerging Fusarium mycotoxins toxicity.
lack of relevant toxicity data [35]. On one hand, in vitro genotoxicity data available suggested a potential genotoxic effect for beauvericin, while in vitro genotoxicity data for enniatins were negative. These results agree with those predicted by the in vitro QSAR model developed in our study ( Figure 6). On the other hand, there are no in vivo genotoxicity data for either beauvericin or enniatins and no studies on carcinogenicity of beauvericin and enniatins have been identified, and thus, the use of in silico predictions for these endpoints can provide valuable information. Thus, according to predictions on enniatins and beauvericin, 79% and 71% of compounds from this category were predicted as genotoxic (in vivo model) and carcinogenic, respectively. In addition, 71% of enniatins and beauvericin were also predicted as mutagenic, thus suggesting a careful assessment of the emerging Fusarium mycotoxins toxicity. Compounds from several categories have been classified as carcinogenic by the IARC [36], such as aflatoxins, trichothecenes, or fumonisins. Other classes, such as actinomycins, cyclosporings, and lovastatins, have still not been classified, but they are labelled as potentially carcinogenic by the ECHA. Furthermore, in several categories, carcinogenicity predictions proved to have an impact on the DNA of cells. For instance, aphidicolin is an inhibitor of eucaryotic nuclear DNA. Brevianamide produced a slightly teratogenic effect in chick embryos [37]. Emodin is suspected to create DNA strand breaks and/or non-covalently binding to DNA and inhibiting the catalytic activity of topoisomerase II (Toxin and Toxin Target Database (T3DB)); nevertheless, a genotoxic effect could be confirmed [38]. Mycophenolic acid inhibits the de novo pathway of guanosine nucleotide synthesis without incorporation into DNA (Toxin and Toxin Target Database (T3DB)).
Alternariols and alternarenes are Alternaria mycotoxins that can be found in cereals around the world, but little relevance is still given to this fact. Currently, the toxicity of several altenariols is being investigated, including alternariol, alternariol monomethyl ether, altertoxins, altenuene, tenuazonic acid, and tentoxin. Among them, tenuazonic acid, alternariol, alternariol monomethyl ether, altenuene, and altertoxin I are the most important mycotoxins that can be found as contaminants in fruits and vegetables [39]. In 2011, the EFSA carried out a risk assessment on Alternaria toxins, as they were reported to induce genotoxicity, cytotoxicity, and reproductive and developmental toxicity, among other adverse effects [40]. Regarding their genotoxic effects, it was reported that alternariol, alternariol monomethyl ether, and altertoxins could induce gene locus mutation, DNA damage or synthesis disorder, chromosome aberration, and other effects in in vitro Compounds from several categories have been classified as carcinogenic by the IARC [36], such as aflatoxins, trichothecenes, or fumonisins. Other classes, such as actinomycins, cyclosporings, and lovastatins, have still not been classified, but they are labelled as potentially carcinogenic by the ECHA. Furthermore, in several categories, carcinogenicity predictions proved to have an impact on the DNA of cells. For instance, aphidicolin is an inhibitor of eucaryotic nuclear DNA. Brevianamide produced a slightly teratogenic effect in chick embryos [37]. Emodin is suspected to create DNA strand breaks and/or non-covalently binding to DNA and inhibiting the catalytic activity of topoisomerase II (Toxin and Toxin Target Database (T3DB)); nevertheless, a genotoxic effect could be confirmed [38]. Mycophenolic acid inhibits the de novo pathway of guanosine nucleotide synthesis without incorporation into DNA (Toxin and Toxin Target Database (T3DB)).
Alternariols and alternarenes are Alternaria mycotoxins that can be found in cereals around the world, but little relevance is still given to this fact. Currently, the toxicity of several altenariols is being investigated, including alternariol, alternariol monomethyl ether, altertoxins, altenuene, tenuazonic acid, and tentoxin. Among them, tenuazonic acid, alternariol, alternariol monomethyl ether, altenuene, and altertoxin I are the most important mycotoxins that can be found as contaminants in fruits and vegetables [39]. In 2011, the EFSA carried out a risk assessment on Alternaria toxins, as they were reported to induce genotoxicity, cytotoxicity, and reproductive and developmental toxicity, among other adverse effects [40]. Regarding their genotoxic effects, it was reported that alternariol, alternariol monomethyl ether, and altertoxins could induce gene locus mutation, DNA damage or synthesis disorder, chromosome aberration, and other effects in in vitro studies. In fact, according to the in vitro genotoxicity model developed, 100% of mycotoxins from the alternariols category was predicted as genotoxic compounds. In addition, alternariols have been related to the high incidence of esophageal cancer in Linxian, China [39], which can be related to our findings, as 62% of compounds belonging to this family were predicted as carcinogenic mycotoxins. Thus, special attention should be paid to this mycotoxin category. In this sense, maximum levels have been recently recommended by the EU in the Commission Recommendation 2022/553 [41] for alternariol, alternariol monomethyl ether, and tenuazonic acid.
A high percentage of trichothecenes have been predicted as genotoxic by the in vitro model (97%) and the in vivo model (76%). The main trichothecene reported to occur in food commodities is deoxynivalenol. Although deoxynivalenol is not genotoxic by itself, it has recently been shown that this toxin exacerbates the genotoxicity induced by model or bacterial genotoxins. In addition, other trichothecenes, namely, T-2 toxin, diacetoxyscirpenol, nivalenol, fusarenon-X, and the newly discovered NX toxin, were also reported as compounds able to exacerbate the DNA damage inflicted by various genotoxins [42]. In addition, in the study reported by Yang et al. [43], deoxynivalenol was able to cause damage to the membrane, the chromosomes, and the DNA at all times of culture in human peripheral blood lymphocytes, thus concluding that deoxynivalenol potentially triggers genotoxicity in human lymphocytes. In other study performed on Sprague Dawley rats, deoxynivalenol increased the percentage of chromosomal aberration, DNA fragmentation, and comet score [44].
Citrinin, a mycotoxin classified in the chromanes category, has been reported to be genotoxic at high concentrations in cultured human lymphocytes, as it caused a significant concentration-dependent increase in MN frequency in human lymphocytes [45], a result according to our genotoxicity predictions for the chromanes category, where almost 90% of compounds were predicted as genotoxic by both the in vitro and the in vivo models.
The same occurs with zearalenones, which have been predicted as genotoxic by both genotoxicity models, according to some data reported showing that zearalenone and some of its metabolites increased the percentage of chromosome aberrations in mouse bonemarrow cells and in HeLa cells [46] and can increase the frequencies of polychromatic erythrocytes micronucleated and chromosomal aberrations in bone marrow cells from Balb/c female mice [47].
Regarding fumonisins, 87% of mycotoxins belonging to this family were predicted as genotoxic by the in vitro model. The most predominant, fumonisins, fumonisin B1, fumonisin B2, and fumonisin B3, are carcinogenic and genotoxic secondary metabolites found in corn-based foods worldwide and are produced by Fusarium verticillioides and F. proliferatum [48]. Fumonisin B1 is defined by IARC as a possible human carcinogen in Group 2B, and it shows genotoxic activity via oxidative stress, DNA damage, cell cycle arrest, apoptosis, inhibition of mitochondrial respiration, and deregulation of calcium homeostasis [49]. Some studies revealed that exposure to fumonisin B1 caused a significant increase in micronucleus frequency in a concentration-and time-dependent manner in rabbit kidney cells [50], and in HepG2 cells, fumonisin B1 has shown clastogenic effects [51].
Studies on the genotoxic activity of ergot alkaloids, also predicted as genotoxic by our models, are very limited. In the scientific report delivered by EFSA in 2012 [52], it was stated that genotoxicity studies on ergot alkaloids were insufficient, and more concretely, some studies evaluating the genotoxic and mutagenic effects of ergotamine revealed different results. In the literature, it has been reported that ergotamine is able to induce chromosomal abnormalities in human lymphocytes and leukocytes [53] but does not show mutagenic effects in mouse lymphoma cells [54]. Other authors have demonstrated that ergotamine and ergometry can induce sister chromatid exchange in ovarian cells [55]. Due to the scarce and different data obtained, further studies are necessary to evaluate the genotoxic and mutagenic potential of ergot alkaloids.
Furthermore, our results reveal that several mycotoxin categories are predicted as mutagenic but have no mutagenicity alert following ICH-M7 criteria. This did not happen when we performed the same comparison with a general database of organic compounds, where almost all molecules predicted as mutagenic had an alert. For some of these categories, no mutagenicity has been detected previously (alternarenes, enniatins and beauvericins, fusidic acids, and verticillins), while for others, some experimental evidence exists already (chromanes, nitropropionic acids, among others) [56,57]. This suggests that the structure of the mycotoxins could have been underestimated in the expert analysis of the mutagenicity, and that ICH-M7 criteria do not take into account specific mutagenic structural motives present in mycotoxins. Regulatory agencies should take this into account and request a revised version of these criteria to obtain a better coverage of mycotoxins, which are a danger for public health, and thus prioritize mycotoxins based on their mutagenic, genotoxic, and carcinogenic potential, as already suggested in other studies [58][59][60].

Conclusions
The web server developed in this work represents a valuable resource for scientists, industry, and regulatory agencies, including a comprehensive database of over 4000 mycotoxins divided into categories that can be easily explored by an interactive visualization. To our knowledge, this is the first database containing such a high number of mycotoxins. Furthermore, the user can directly access a prediction of the mutagenicity, genotoxicity, and carcinogenicity of the whole ensemble of mycotoxins, without the need to perform a prediction workflow. The data are based on mycotoxin specific and robust QSAR models, that were built up according to OECD principles and are adapted to REACH criteria, which means that they can be used for regulatory purposes. Thus, the developed models are a valuable tool for screening toxicity of non-regulated mycotoxins. Future perspectives of our work include the experimental validation of our models by analyzing selected mycotoxins from categories with no data in the training set, which have to be synthesized or purified, as most of them are not commercially available. Furthermore, the new QSAR models will be included in our ProtoPRED platform.

Materials and Methods
To achieve the development of adequate and robust QSAR models, several elements are required. First of all, a data set providing experimental values of a biological activity or property for a group of already tested chemicals is necessary; in this case, the biological properties evaluated were mutagenicity, genotoxicity, and carcinogenicity; thus, experimental results derived from the mutagenic assay Ames test and the in vitro and in vivo MN test, and in vivo long-term carcinogenicity assay on rodents have been collected, as described in Section 5.1. Secondly, statistical methods (often called chemometric methods) are employed to find and validate the relationship between the calculated descriptors and the toxicity properties of the mycotoxins. The exact workflow is summarized in Figure 6 and described in the following sections.
This list was further completed by searching in PubChem for compounds of the same family, and their metabolites, resulting in a list of 4360 compounds. Only compounds that were directly produced by fungi and their metabolites were included. For each compound, the isomeric SMILES together with the CAS number and the PubChem CID code was retrieved. The resulting database was further curated by normalizing the smiles and removing counterions, salts, and mixtures. Finally, each compound was assigned to a specific family or category to their chemical structure, obtaining 170 different categories (Table A1).

Mycotoxin Clustering and Chemical Space Distribution
To study the chemical space of the generated mycotoxins data set, a fingerprinting system was generated using the LSH Forest algorithm from TMAP (https://tmap.gdb.tools/, accessed on 20 May 2023) for dimensionality reduction. Each point in the TMAP represents a fingerprint of a unique chemical transformation generated using the fully trained model. The points were colored by categories or their outcome in the toxicity predictions. The Faerun package (https://pypi.org/project/faerun/, accessed on 20 May 2023) was used to create an interactive visualization tool for the clustering scheme.

Mycotoxin Mutagenicity QSAR Model
For QSAR mutagenicity model development, the endpoint Bacterial Reverse Mutation Test (Ames test) (OECD test guideline 471) was selected (Figure 7), considered the first outcome to assess the possible mutagenicity of a substance [67,68]. We scanned several highquality databases (Carcinogenic Potency Database CEBS, CCRIS Mutagenicity assay, Vega, QSAR Toolbox, EFSA OpenFoodTox, ECVAM, etc.) as well as the scientific literature [68] for data of the compounds of our previously generated mycotoxin database. For each compound, CAS number, isomeric SMILES, and experimental data expressed as active (1) or inactive (0) were compiled in a table.

Mycotoxin Clustering and Chemical Space Distribution
To study the chemical space of the generated mycotoxins data set, a fingerprinting system was generated using the LSH Forest algorithm from TMAP (https://tmap.gdb.tools/ accessed on 23 May 2023) for dimensionality reduction. Each point in the TMAP represents a fingerprint of a unique chemical transformation generated using the fully trained model. The points were colored by categories or their outcome in the toxicity predictions. The Faerun package (https://pypi.org/project/faerun/ accessed on 23 May 2023) was used to create an interactive visualization tool for the clustering scheme.

Mycotoxin Mutagenicity QSAR Model
For QSAR mutagenicity model development, the endpoint Bacterial Reverse Mutation Test (Ames test) (OECD test guideline 471) was selected (Figure 7), considered the first outcome to assess the possible mutagenicity of a substance [67,68]. We scanned several high-quality databases (Carcinogenic Potency Database CEBS, CCRIS Mutagenicity assay, Vega, QSAR Toolbox, EFSA OpenFoodTox, ECVAM, etc.) as well as the scientific literature [68] for data of the compounds of our previously generated mycotoxin database. For each compound, CAS number, isomeric SMILES, and experimental data expressed as active (1) or inactive (0) were compiled in a table. As a result, raw data of 356 mycotoxins were recovered, which were further refined as follows: For a first model A, data of 120 compounds (75 mutagens and 45 no mutagens) were selected that strictly fulfil the OECD guideline TG471: assayed with at least five Salmonella strains (TA1535, TA1537 or TA97 or TA 97a, TA98, TA100, and TA102) with and without microsomal activation by S9 fraction, since often the interaction with genetic material occurs after metabolic activation. For a second model B, data were selected less strictly, to cover a greater part of the chemical space, including data from at least two strains with or without metabolic activation. The data set of model B was composed of 287 compounds, with a ratio of mutagens to non-mutagens of 108/179. A third data set of 24 compounds, fulfilling the OECD guideline, was saved for external validation (Section 5.6).
For the development of the QSAR models, around 4000 different chemical descriptors from 15 different categories were calculated for each compound with an in-house python script. Then, descriptors were selected by recursive feature elimination (RFE) to 7 for model A and 13 for model B (Tables A2 and A3). In the next step, the data set was divided into a training set for model building, and a test set, in a proportion of 75-25%, As a result, raw data of 356 mycotoxins were recovered, which were further refined as follows: For a first model A, data of 120 compounds (75 mutagens and 45 no mutagens) were selected that strictly fulfil the OECD guideline TG471: assayed with at least five Salmonella strains (TA1535, TA1537 or TA97 or TA 97a, TA98, TA100, and TA102) with and without microsomal activation by S9 fraction, since often the interaction with genetic material occurs after metabolic activation. For a second model B, data were selected less strictly, to cover a greater part of the chemical space, including data from at least two strains with or without metabolic activation. The data set of model B was composed of 287 compounds, with a ratio of mutagens to non-mutagens of 108/179. A third data set of 24 compounds, fulfilling the OECD guideline, was saved for external validation (Section 5.6).
For the development of the QSAR models, around 4000 different chemical descriptors from 15 different categories were calculated for each compound with an in-house python script. Then, descriptors were selected by recursive feature elimination (RFE) to 7 for model A and 13 for model B (Tables A2 and A3). In the next step, the data set was divided into a training set for model building, and a test set, in a proportion of 75-25%, respectively. Mutagenic and non-mutagenic compounds were distributed homogeneously in both sets. For model building, several algorithms were tested on the training set, obtaining the best metrics for Logistic Regression.

Mycotoxin Genotoxicity QSAR Models
For in vitro genotoxicity QSAR modelling, the in vitro MN assay was chosen, a robust and quantitative assay of chromosome damage with the capacity to detect not only clastogenic and aneugenic events but also some epigenetic effects and recommended by the OECD for genotoxicity evaluation in test guideline 487 (Figure 7).
Databases (Genetic Toxicology Data Bank in PubChem and eChemPortal) and the scientific literature [69][70][71] were scanned for data for the in vitro MN assay, fulfilling the OECD guideline, retrieving data for 91 compounds, of which 15 were saved for external validation (Section 5.6). To enable model building, the data set was further completed with 379 compounds that were not mycotoxins but close in the chemical space, generating a final data set of 455 compounds. The ratio of genotoxic to non-genotoxic compounds was 264/191, respectively. For model building, the same procedure as in Section 5.3 was employed, also obtaining model based on 25 descriptors (Table A4)

based on Logistic Regression.
For in vivo genotoxicity QSAR modelling, the in vivo MN assay on mammalian erythrocyte was chosen (OECD test guideline 474), which identifies substances that cause micronuclei in erythroblasts sampled from bone marrow and/or peripheral blood cells of animals, usually rodents. As not enough data for mycotoxins could be retrieved (72 mycotoxins from the ISSMIC public database), the QSAR model from ProtoPRED (https://protopred.protoqsar.com/, accessed on 20 May 2023) including mycotoxins in the training set, was applied. The model was built with 13 descriptors (Table A5) and based on the Support Vector Machine Classifier (SVC) algorithm. The model was then externally validated with the external set of 72 mycotoxins obtaining good results. All mycotoxins were included in the applicability domain of the model.

Mycotoxin Carcinogenicity QSAR Model
For carcinogenicity assessment, the long-term carcinogenicity study on rat was employed, fulfilling the OECD test guideline 451 ( Figure 7). As there was not enough data to build a specific model for mycotoxins (75 mycotoxins from different databases, such as PubChem AID_1259411, Carcinogenic Potency Database CEBS and OpenFoodTox TX22525), the QSAR model from ProtoPRED (https://protopred.protoqsar.com/ accessed on 20 May 2023) was applied, showing excellent results when applied to the 75 mycotoxins in the external validation. All mycotoxins were included in the applicability domain of the model.
The data for developing the model was extracted from Carcinogenicity I (ISSCAN) public database retrieved from QSAR Toolbox, which contains curated information on chemical compounds tested with the long-term carcinogenicity bioassay on rodents (rat and mouse). The main primary sources of data are the NTP, CPDB, CCRIS, and IARC repositories. After curation and preprocessing, the database was formed by 652 experimental results, with 251 of positive values (38.5%) and 401 negative values (61.5%). For model building, the same procedure as in Section 5.3 was employed, also obtaining the best metrics for Light Gradient Boosting Machine (LGBM) Classifier.

External Model Validation
The predictive potential of the QSAR models was evaluated with a completely independent data set of mycotoxins, in terms of accuracy, precision, sensitivity, and specificity. The metrics considered to evaluate the performance of the built classification models were calculated according to the following formulas:

Conflicts of Interest:
The authors declare no conflict of interest.

Descriptor Definition
C-030 X-CH-X atom-centered fragments    Table A5. Molecular descriptors selected by Recursive Feature Elimination (RFE) based on Support Vector Machine (SVM) using linear kernel (C = 1), for QSAR in vivo genotoxicity model.

GATS4i
Geary autocorrelation of lag 4 (log function) weighted by ionization potential.
B02(C-C) presence/absence of C-C at topological distance 2.

EState_VSA6
EState VSA descriptor 6.                          QTB = QSARToolbox; FN = false negatives; TN = true positives; FP = false positives; TP = true positives. The models described in this article are labelled in bold. * Several compounds could not be predicted because they were not included in the applicability domain of the model. ** Some compounds could not be predicted because the program did not accept the input structure. *** Some compounds could not be predicted because the program did not accept the input structure or the prediction was unclear.