Predicting Chemical Hazard across Taxa through Machine Learning

We applied machine learning methods to predict chemical hazards, focusing on fish acute toxicity across taxa. We analyzed the relevance of taxonomy and experimental setup, showing that taking them into account can lead to considerable improvements in classification performance. We quantified the gain obtained through the introduction of taxonomic and experimental information, compared to classification based on chemical information alone. We used our approach with standard machine learning models (K-nearest neighbors, random forests and deep neural networks), as well as the recently proposed Read-Across Structure Activity Relationship (RASAR) models, which were very successful in predicting chemical hazards to mammals based on chemical similarity. We obtained accuracies of over 93% on datasets where, due to noise in the data, the maximum achievable accuracy was expected to be below 96%. The best performances were obtained by random forests and RASAR models. We analyzed metrics to compare our results with animal test reproducibility, and despite the fact that most of our models "outperform animal test reproducibility" as measured through recently proposed metrics, we show that the comparison between machine learning performance and animal test reproducibility should be addressed with particular care. While we focused on fish mortality, our approach, provided that the right data is available, is valid for any combination of chemicals, effects and taxa.


Introduction
One of the pillars of modern civilization is the ability to synthesize and/or use an enormous range of chemicals, which allow for new and improved products and serve as pharmaceuticals, pesticides, food additives and the like. However, their benefits need to be weighed against their risks. For that purpose, governments put risk assessment procedures in place decades ago, with the aim of evaluating the impact of chemicals on human and environmental health. In these assessment procedures, exposure, i.e. the level of chemical to which organisms are subjected, and hazard, i.e. the ability of chemicals to cause negative effects on organism/population health, are considered together as determinants of risk [1]. Hazard assessment traditionally depends, to a large part, on animal tests using vertebrates, such as tests with mice, rats and fish, depending on whether assessments are for human or environmental health. These animal tests are ethically controversial. They are also highly resource consuming (time, personnel, test material needed) and cannot, in fact, keep pace with the demand for rapid testing of an ever-increasing chemical universe in need of evaluation [2]. On these grounds, there are global efforts to refine hazard assessments using alternative or supplementary strategies [3,4]. One of them is the employment of machine learning [5].
Machine learning (ML) is a term broadly applied to the use of computational algorithms to infer emerging patterns from data. In the field of chemical hazard assessment, one can mine chemical structural and physico-chemical information, along with effect data reported from animal toxicity tests. This often comes with the advantage that the domain of applicability is unrestricted, provided that appropriate training data is available.
One of the first applications of ML to hazard assessment was the prediction of acute oral toxicity in rats (LD50 values), using local lazy learning, achieving correlation coefficients of R^2 = 0.72 on a test set of 2896 compounds [6]. Further ML approaches to predict rat acute oral toxicity by other investigators [7] improved the R^2 to 0.86, and, using discrete outputs to measure accuracies, values reached close to 95% [8]. However, with their focus on single-species data, these studies do not cover the dimensionality of taxa and species sensitivity differences that must be considered if one wants to know chemical impacts on a wider range of taxa. A notable recent development are the Read-Across Structure Activity Relationship (RASAR) models [16], which were very successful in predicting chemical hazards to mammals and were reported to "outperform animal-test reproducibility". RASAR models were developed for binary classification, and adopt a succession of unsupervised and supervised learning, which allows the exploitation of information on endpoints that are different from the target one. For example, one could predict the LC50 of a chemical by integrating a Mortality with a Behavior dataset. Given the great success of RASAR models, a natural question is whether they perform equally well on the datasets used in ecotoxicology, and how they compare to standard ML models.
Here, we perform an extensive analysis of ML models applied to an ecotoxicological dataset focused on fish acute toxicity data, with an eye on the extra information that can be obtained from knowing the toxicity of the same compound on more than one species. We compare several standard machine learning models with the two variants of the RASAR model developed by Luechtefeld et al. [16]. We extend RASAR models to an arbitrary number of classes, which allows us to study both binary (more toxic versus less toxic) and 5-class (non-toxic, slightly toxic, moderately toxic, highly toxic, very highly toxic) classification. We carefully analyze the impact of introducing taxonomic and experimental information as extra input features in addition to the chemical ones, since this allows for models with a wide taxonomic range of applicability, which can learn the patterns relating toxicity to taxonomic/experimental variability. Finally, we compare our results with two indicators of the reproducibility of the experiments. These suggest that, given the level of noise in in-vivo data (e.g. the same experiment repeated twice may have very different outcomes), we are close to the maximum achievable performances. However, this should not be taken as an indication that these models are outperforming animal test reproducibility.

Problem Setup
We are interested in predicting whether an experiment will result in a certain chemical being more or less acutely toxic to a given species. The outcome of the experiment is the LC50, which is the concentration of a chemical required to kill half of a population, in our case 50% of fish in a 24h, 48h, 72h or 96h acute exposure scenario. We treat the problem as binary classification according to the European regulation (EC) No 1272/2008 on the classification, labelling and packaging of substances and mixtures (CLP Regulation) for acute toxicity data [17]. To do so, we define the binary target label y 2, which separates more toxic from less toxic outcomes through an LC50 threshold [Eq. (1)]. Our main goal is to deduce y 2 from the data, which we denote as x. In Appendix F we show analogous results for five-class labels.
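As an illustration of this labelling step, the following minimal Python sketch assigns the binary label from an LC50 value. The threshold used here is a placeholder: the actual cut-off is the one defined in Eq. (1) following the CLP Regulation, which is not reproduced in this snippet.

```python
import pandas as pd

# Hypothetical threshold, for illustration only; the real cut-off is the one
# in Eq. (1), following the CLP Regulation.
THRESHOLD_MG_L = 1.0

def binary_label(lc50_mg_l: float) -> int:
    """Return 1 for 'more toxic' (LC50 at or below the threshold), 0 otherwise."""
    return int(lc50_mg_l <= THRESHOLD_MG_L)

experiments = pd.DataFrame({"lc50": [0.2, 3.5, 0.9, 120.0]})
experiments["y2"] = experiments["lc50"].apply(binary_label)
print(experiments)
```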

Data
The data, x, contains information on chemicals, x ch, on the taxonomy of the tested organism, x tax, and on the experimental conditions, x ex:

x = (x ch, x tax, x ex).

The vectors x ch, x tax and x ex have respectively 893, 5 and 6 components. To build x and the related set of labels [Eq. (1)], we take in-vivo experiments on fish from the Ecotox database (https://cfpub.epa.gov/ecotox/), provided by the Environmental Protection Agency of the USA. This database contains x tax and x ex, but it only contains the CAS identifier of the used chemical. In Appendix A we describe in detail the sources of the data, and which features are contained in x ch, x tax and x ex. The information on the chemical has two different natures: we use some standard properties of each chemical (e.g. logP, water solubility, etc.), and the Pubchem2D vectors, which identify chemicals through a one-hot encoded 881-dimensional vector (see Appendix A). In other words, we can write x ch as the union of the properties x pr and the Pubchem2D data x pubc,

x ch = (x pr, x pubc).

We study in-vivo mortality experiments performed on fish, and take into account all the chemicals (mainly organic compounds) and species available in the dataset. As for the taxonomic ranks, we use class, order, family, genus and species as features. When the same experiment is repeated more than once, we merge the multiple entries into a single one, and take the median LC50 as the outcome. The data cleaning is described in Appendix A.
At the end of the process, we have n chem = 2199 chemicals, tested for mortality on n spec = 345 different species of fish. In total, our dataset contains 20128 entries. In Fig. 1, we show how the data is distributed among the classes, when we split the outputs into 2 (left) or 5 (right) classes.
In addition to mortality, we also use data related to different effects, which is needed as input for the Data-Fusion RASAR model (Appendix B). This adds 3713 further entries to our dataset, spanning 543 chemicals. Of those, 117 do not appear in the dataset related to mortality. See Appendix A for details.

Euclidean Distances
The KNN and RASAR algorithms require measuring a distance between feature vectors. Initially, we will use Euclidean distances. In other words, the distance between two feature vectors u and v is

d Euc(u, v) = [ Σ_i (u_i − v_i)^2 ]^(1/2),   (4)

regardless of whether these vectors have ordinal or one-hot components.

Per-category Distances
The fact that our data is clearly separated into the three sources of information, x ch , x ex and x tax , allows us to tailor our models to achieve better performance. One way to do this, is to define the distances in the KNN and RASAR models as a linear combination of the distances in different subspaces of the data.
We decide to treat separately the pubchem vectors (binary features), the chemical fingerprints (ordinal features), and the union of experimental and taxonomic details (these are categorical features, so we denote them with x cat = ( x tax , x ex )).
We measure the distance between pubchem vectors through the Hamming distance, d Ham, which counts the number of entries in which the two vectors differ. The distance between chemical fingerprints is a Euclidean distance, d Euc, and for the categorical features we also use the Hamming distance. In order to have distances normalized to one (so that each set of features has the same weight), we divide the Hamming distances between two vectors by their number of components, and the Euclidean distances by their maximum over the training set. We call d̄ Ham and d̄ Euc the resulting normalized distances.
The distance between two experiments u and v can then be written as

d(u, v) = α pr d̄ Euc(u pr, v pr) + α pubc d̄ Ham(u pubc, v pubc) + α cat d̄ Ham(u cat, v cat),   (5)

where α pr, α pubc and α cat are constants whose optimal values are found through cross validation (CV).
Since we are interested in the relative (and not absolute) values of these constants, we can safely set α pr = 1. Then, to find the optimal values of α pubc and α cat, we perform a search on a grid in the plane (α pubc, α cat), tuning one dimension at a time, and choosing the α with the best validation accuracy. When two or more values of α give the same validation accuracy (with a tolerance of 0.001), we choose the smallest value. For both α pubc and α cat, we first find the optimum on a coarse logarithmic grid, and then fine-tune it on a finer one. Since we have several models that employ these distances, and because this hyperparameter tuning is computationally expensive, we only find the optimal αs for the K-NN algorithm, and use those values for the S and DF RASARs.
The advantage of using the distances in Eq. (5) is twofold:
1. It allows us to assign a different weight to different subsets of features. This is useful because we do not expect all the groups of features to be equally important. Splitting the features into even smaller groups could further improve the performances, but at the expense of a longer hyperparameter tuning.
2. Since the distances d̄ Euc, d̄ pubc and d̄ cat are normalized, the constants α automatically tell us the relative importance of each source of data. In particular, the value of α cat gives us a direct indication of the role played by taxonomy and experimental details.
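The sketch below illustrates the per-category distance of Eq. (5). The α values, the feature blocks and d_euc_max are placeholders; in our pipeline the α constants are selected through the cross-validation search described above, and the maximum Euclidean distance is computed on the training set.

```python
import numpy as np

def normalized_hamming(u, v):
    # Fraction of entries in which the two vectors differ (Hamming distance
    # divided by the number of components).
    u, v = np.asarray(u), np.asarray(v)
    return float(np.mean(u != v))

def combined_distance(exp_u, exp_v, alpha_pr=1.0, alpha_pubc=0.1,
                      alpha_cat=0.05, d_euc_max=1.0):
    """Per-category distance of Eq. (5). Each experiment is a dict holding its
    'pr' (chemical properties), 'pubc' (Pubchem2D bits) and 'cat'
    (taxonomic + experimental categories) blocks. The alpha values here are
    placeholders; d_euc_max is the maximum Euclidean distance over the
    training set, used for normalization."""
    d_pr = np.linalg.norm(np.asarray(exp_u["pr"]) - np.asarray(exp_v["pr"])) / d_euc_max
    d_pubc = normalized_hamming(exp_u["pubc"], exp_v["pubc"])
    d_cat = normalized_hamming(exp_u["cat"], exp_v["cat"])
    return alpha_pr * d_pr + alpha_pubc * d_pubc + alpha_cat * d_cat

u = {"pr": [0.2, 0.7], "pubc": [1, 0, 1, 1], "cat": ["Perciformes", "96h"]}
v = {"pr": [0.1, 0.9], "pubc": [1, 1, 1, 0], "cat": ["Perciformes", "48h"]}
print(combined_distance(u, v))
```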

Training and Assessing Performance
We divide our dataset into a training and a hold-out test set with an 80:20 ratio. Unless stated otherwise, the training set is further split into train and validation sets through a 5-fold cross-validation procedure (the splitting is again 80:20) [20]. After finding the best model hyperparameters through an exhaustive grid search during CV, we retrain the model on the whole train+validation set. The final performances are calculated on the test set. Except when explicitly stated, we do not use methods to mitigate class imbalance.
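As a sketch of this protocol, the snippet below performs the 80:20 hold-out split and a 5-fold cross-validated grid search with Scikit-Learn; the dataset and the hyperparameter grid are dummy placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # dummy data

# 80:20 hold-out split; the test set is only touched once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Exhaustive grid search with 5-fold CV on the training portion only.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}  # illustrative grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X_train, y_train)   # refit=True retrains the best model on train+validation
print(search.best_params_, search.score(X_test, y_test))
```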
We measure the performances in terms of accuracy (a), recall (r, often called sensitivity), specificity (s), precision (p) and F1 score (F 1 ). The definition of these indicators is provided in Appendix C.

Error estimates
When possible, in order to get an intuition on whether two results are significantly different, we compute error bars on the performances as the standard error across the cross-validation folds. These error bars represent the performance fluctuations of models trained on the training folds and evaluated on the validation folds. Since the validation folds are smaller than the test set, the fluctuations on the test set should be smaller. In addition, while the performances on the validation folds come from models trained only on the training folds, the performance on the test set comes from a model trained on the training+validation data, so we can expect the test performance to fluctuate less, or at most equally. While not rigorous, this procedure allows for an easier interpretation of the results. The error bars are the numbers given in parentheses and apply to the last significant digit (e.g. 0.03(1) stands for 0.03±0.01).

Reproducibility of the experiments
Even though from the perspective of a regulator each experiment has one true outcome (in the binary case, more vs less toxic), if an experiment is performed more than once it sometimes happens that different repetitions of the experiment result in different values of the label y 2 (or y 5, if we are in the multiclass setting). We can interpret this variability as a source of fluctuations (deriving from elements that are not completely under control) around a ground truth value. Since these fluctuations are present in our whole data set, the test set will also have some degree of disagreement with the ground truth. As a consequence, testing a model that perfectly predicts the ground truth will result in an accuracy ã lower than 100%. This means that the maximum accuracy that we should expect from our models is not 100%, but ã, so the performances that we find should be measured in units of ã. However, we do not have access to ã, so the best we can do is to estimate it by analyzing the mutual (dis)agreement of experiments that were performed more than once. We use two metrics: (a) an estimated upper bound to ã, and (b) a conditional estimator proposed in Ref. [16].

Upper Bound
Intuitive definition. We estimate an upper bound, A up, to the maximum achievable accuracy ã. When all the repetitions of an experiment agree, we can assume that they correspond to the ground truth. When they do not coincide, the accuracy is maximized by assuming that the value of y 2 (or y 5) corresponding to the ground truth is the one occurring most often. We then define A up as the accuracy obtained by assuming that repetitions of an experiment measure the ground truth more often than not. If we restrict the measurement of A up to the experiments with a majority of positive outcomes, we obtain an upper bound to the recall, R up, whereas if we restrict it to the negative outcomes, we obtain an upper bound to the specificity, S up. The upper bound to the balanced accuracy is the average of R up and S up. In Appendix D we provide the formal definitions, for the binary and multiclass cases.
An example. To provide some intuition, let us focus on a single experiment (n rep = 1), with binary labels, and assume that it is performed five times, with results y 2 ≡ (y 21 , y 22 , y 23 , y 24 , y 25 ) = (0, 1, 0, 1, 1). Since the value y 2 = 1 appears most often, we assume that it corresponds to the ground truth, and obtain A up = 0.6 (similarly, we could obtain a lower bound A down = 0.4). Notice that, by construction, A up cannot be smaller than 0.5.

Conditional Estimator
Definition. The conditional estimators of recall and specificity used in Ref. [16] aim at calculating the actual value of those quantities, instead of an upper bound. Their procedure postulates that one of the repetitions of the experiment is the true result, and validates the other repetitions against it. For a formal definition we refer the reader to Ref. [16].
An example. We can use the previously defined example of an experiment repeated five times, with results y 2 = (0, 1, 0, 1, 1). We take the first outcome, y 2,1 = 0, as true, and see that the accuracy with respect to the remaining repetitions is a 1 = 0.25. We then iterate through all the outcomes, and obtain a = (0.25 + 0.5 + 0.25 + 0.5 + 0.5)/5 = 0.4.

Caveat. Note that these estimators of ã are measured through different repetitions of the same experiment, whereas our models are trained and tested on the medians of these repetitions, which arguably makes the labeling more stable. Thus, it is possible that ã is higher than what we would expect from our estimators.
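Both estimators can be computed in a few lines of Python; the sketch below reproduces the worked example above (the exact aggregation over many experiments follows the formal definitions in Appendix D and Ref. [16]).

```python
import numpy as np
from collections import Counter

def upper_bound_accuracy(repetitions):
    """A_up for a list of experiments, each given as a list of repeated labels:
    assume the most frequent label of each experiment is the ground truth."""
    fractions = []
    for labels in repetitions:
        counts = Counter(labels)
        fractions.append(max(counts.values()) / len(labels))
    return float(np.mean(fractions))

def conditional_accuracy(repetitions):
    """Conditional estimator in the spirit of Ref. [16]: take each repetition
    in turn as the 'true' outcome and score the remaining ones against it."""
    accs = []
    for labels in repetitions:
        labels = np.asarray(labels)
        for j, truth in enumerate(labels):
            others = np.delete(labels, j)
            accs.append(np.mean(others == truth))
    return float(np.mean(accs))

example = [[0, 1, 0, 1, 1]]
print(upper_bound_accuracy(example))   # 0.6
print(conditional_accuracy(example))   # 0.4
```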

Dealing with many species
Compared with training and testing models on a single species, dealing with a large number of species requires a different perspective. On one side, [i] only a small number of species-chemical pairs are present in this kind of dataset. In other words, an ecotoxicological dataset is sparse. If our dataset were dense, it would have n spec n chem = 831222 entries, instead of only 20128. On the other side, [ii] different chemicals have different degrees of toxicity depending on which species is exposed to them.
We can test statement [ii] on our dataset. For each chemical that is tested on more than one species, we measure the fraction f of species for which it is classified as more toxic (i.e. it has a label y 2 = 1). If a chemical has the same effect on all taxa, then f = 0 or f = 1. We plot the histogram of f, h(f), in Fig. 2, where it can be seen that a sizable number of chemicals have 0 < f < 1, meaning that these chemicals act differently on different species. The two peaks at 0 and 1 are expected, since a number of chemicals should have a similar effect across all the orders (especially because our analysis is restricted to a narrow taxonomic range, i.e. fish). Furthermore, (i) many of the entries at f = 0, 1 come from chemicals which are tested on only a couple of species, and (ii) we did not filter f according to the taxonomic distance between species.
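A minimal sketch of how f can be computed with pandas, assuming a hypothetical table with one row per (chemical, species) pair and the binary label y 2:

```python
import pandas as pd

# Hypothetical minimal layout: one row per (chemical, species) pair with its label.
df = pd.DataFrame({
    "chemical": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "species":  ["s1", "s2", "s3", "s1", "s2", "s1"],
    "y2":       [1,    0,    1,    0,    0,    1],
})

# f = fraction of tested species for which the chemical is 'more toxic',
# restricted to chemicals tested on more than one species.
per_chem = df.groupby("chemical")["y2"].agg(["mean", "size"])
f = per_chem.loc[per_chem["size"] > 1, "mean"]
print(f)  # values strictly between 0 and 1 flag taxon-dependent outcomes
```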
The quantity f fixes the chemical, and checks how different the outcomes are on different species. We can instead fix the taxon, in order to see whether the outcomes on a specific taxon are very different from those on the others. To do this, for every taxon t, we take all the chemicals that were tested on t. Then, for each of these chemicals, we calculate the mean value of f, as defined in the previous paragraph. Finally, we calculate what we call the taxonomic disagreement parameter, g, to indicate how different the outcomes on a taxon are from those on other taxa. If, for taxon t, g = 0, then all the chemicals that were tested on t had the same outcome on all the other taxa on which they were tested. If g > 0, it means that part of the chemicals that were tested on t had a different outcome when tested on different values of the taxonomic rank. In other words, if it were useless to use taxa in our dataset, we would always have g = 0. The larger g (the maximum value is 1), the more crucial it is to use the taxonomic information. In Fig. 3 we show the mean value of g for each taxonomic order (excluding those for which only a single experiment is available, see Appendix A). We chose the order for representation clarity, but as we explained this can be done in a similar way with any taxonomic rank. The value of g is positive and steady across the orders, indicating that experiments on one order are not fully representative of other orders, and that there is no single order that represents the other orders better or worse. The only exception would seem to be Amiiformes, but this datum is poorly significant because we only have two experiments on this order (the black line in Fig. 3 indicates the number of available experiments).

Figure 3: The bars depict, for each order, the taxonomic disagreement, g, as defined in the main text. If g = 0 for an order, all chemicals tested on that order give the same result also on the other orders. If g > 0, that order reacts to chemicals differently from other orders. The black line (top x axis) indicates how many chemicals were tested on that order.

The C setup: predictions based on chemical properties alone
Predicting toxicity outcomes based on chemical properties is the domain of quantitative structure-activity relationships (QSARs) [21]. Most often, existing QSARs are trained on the impact that a set of chemicals has on a single species. In our notation, this amounts to deducing y 2 from x ch . Approaches of this kind are sometimes called global, but since we refer to generic ways of using only x ch , we call this approach the C setup. In this section, we show the outcome of this kind of approach on our dataset.
There are two ways in which we can train models using only x ch (and ignoring x tax and x ex):
1. Train and test the models on the whole dataset, merging the x ch related to different taxa. The performances of the different models are shown in Tab. 2.
2. Train and test the models on the data from a single species, and then naïvely extrapolate to the other species (Tab. 3).

From Tabs. 2 and 3, it seems less advantageous to blindly mix all the taxa than to train on a single species and then naïvely extrapolate. The reason for this can be that the former method involves merging data coming from different taxa before training. This implies i) that the dataset becomes smaller and unbalanced (see Appendix A; this can be partly offset by using the F1-score as a metric), and ii) that the labels y 2 are perturbed in an unknown way, since a label that correctly applies to one taxon does not necessarily apply to another. Further tests of the C approach are shown in Appendix G.

The CTE setup: predictions including taxonomic and experimental information

When, in addition to x ch, we also provide the models with the taxonomic and experimental features x tax and x ex (we call this the CTE setup), the performances improve considerably (Tab. 4, to be compared with Tabs. 2 and 3). Moreover, the gap between the LR and our best non-linear models is below 10%, demonstrating that much of the information can be captured through simple linear relationships. The best model is the RF, followed by the DF RASAR [5]. The better performance, lower data requirement, and higher simplicity of implementation are strong factors in favor of the RF. However, we must keep in mind that the DF RASAR is explicitly designed for problems with a high availability of data from different biological effects, while in our dataset the effect "mortality" is dominant with respect to the other effects, which the DF RASAR uses as input to predict mortality (see Appendix A and Appendix B). Furthermore, there is room for improving the RASAR models. One way is to replace the Euclidean distance [Eq. (4)] used by the RASAR with a distance that gives a different weight to different sources of information [Eq. (5)], as described in Sec. 2.3.2. In Tab. 5 we show the performances obtained with this procedure; only the models that use distances [Eqs. (4) and (5)] are shown, since the others are unaffected. We note an improvement throughout most of the metrics of all the models (K-NN, S-RASAR and DF-RASAR), especially the DF-RASAR, which reaches a = 92.3% and F 1 = 90.8%.

From Sec. 2.6 we know that the maximum achievable accuracy is not 100%, but rather an unknown value ã. We cannot measure ã directly, but we can estimate it by measuring the mutual agreement of repeated experiments. From Table 1, we see that the accuracy of the experiments is at most A up = 0.958, so we can use this value to rescale our performances. For example, the rescaled performance of the RF becomes 0.932/0.958 ≈ 0.973 ≈ 97%. On the other hand, we note that most of our models perform better than what would be indicated through the conditional estimator of ã (Table 1), suggesting that this metric is a too pessimistic estimate of ã.

[5] In Appendix G we provide further comparisons between the C and CTE approaches, showing that with different kinds of train/test splitting the CTE approach still outperforms the C approach, and that RF and DF-RASAR perform best.

Feature Importance

Importance from Random Forests
We now discuss the relative importance of the features when using our best model (the RF of Tab. 4). These importances are estimates of how relevant each feature is for the predictions. Since there is no single best method to calculate feature importances, we evaluate them through two of the most commonly used methods: permutation-based and impurity-based (using Scikit-Learn [23]). While the permutation-based importance falls short with collinear features, the impurity-based one tends to overestimate the relevance of high-cardinality categorical features [24]. Even though we do observe some (expected) differences between the two methods, the conclusions we draw are consistent across both kinds of analysis. Here we show the results obtained with the impurity-based method, and provide the permutation importance analysis in Appendix E.
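For reference, both kinds of importance can be obtained with Scikit-Learn as in the following sketch (dummy data; in our case the estimator is the RF of Tab. 4).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances: computed from the fitted trees, may favour
# high-cardinality features.
impurity_imp = rf.feature_importances_

# Permutation importances: computed on held-out data, may underrate
# collinear features.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_imp = perm.importances_mean
print(impurity_imp, perm_imp)
```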
In the top chart of Fig. 5, we coarsen the feature importances into three groups, corresponding to x ch, x tax and x ex. It can be clearly seen that x tax and x ex contribute almost 30% of the information. The contribution of the features related to taxonomy and experimental details is even more striking when we look at them one by one: in the bottom panel of Fig. 5, we show the importance of the 20 most important individual features. The single most important feature is the species. The most important features from x ch are water solubility and LogP. The control type and observation duration (details in Appendix A) appear to be the most influential experimental details, and the most important pubchems have the labels 335, which indicates the presence of the substructure C(∼C)(∼C)(∼C)(∼H), and 39, which indicates the presence of more than 4 Cl atoms [25]. Also with the permutation importance we find that the taxonomy- and experiment-related features, taken one by one, carry a large importance compared with the chemical descriptors taken one by one. These observations are consistent with Ref. [14], where the species is identified as the most important taxonomic descriptor, and the LogP as the most important chemical one.

Importance from α
Since the distances used in Eq. (5) are normalized to one, the optimal values of each α tell us how relevant each set of features is. In particular, α cat is an indicator of how relevant the data on experiment and taxon, x ex and x tax, are with respect to the information on the chemical. In Tab. 6 we show the optimal values of α for several models. The first thing that we point out is that α cat is not zero, meaning that taxonomy and experimental details provide useful information. This is consistent with our previous feature importance analysis; moreover, the fact that α cat is ∼ 20 times smaller than the other two constants suggests that these models rely more heavily on x ch than the RFs do. Second, we notice that the pubchems (x pubc) now seem to play a smaller role than the explicit chemical properties (x pr), compared with what we had seen with the RFs. We see two possible reasons for this: (i) the same information is contained both in the pubchems and in the properties, and which of the two is used is decided by the model during training. This is expected at least to a degree, since physicochemical properties can be predicted from the molecular structure [26]. (ii) The distances in the pubchem space do not discern which chemical properties are different, and assign the same weight to all kinds of variation. This is also true for the chemical properties, x pr, but the properties are fewer, non-binary features. One way to get around this would be to treat different groups of pubchem components separately, by introducing more α constants.

Multiclass models
The whole analysis that we showed for binary classification was repeated for 5-class labeling. Other than requiring the RASAR to be adjusted to handle an arbitrary number of classes (Appendix B), the analysis is analogous to the binary case. The results are also similar, so we relegate them to Appendix F. Here, we only make some remarks: • The performances of our models are lower for multiclass, but so is the self-consistency of the data, since with more classes it is more likely that the same experiment repeated twice falls into two different categories. If we normalize by the upper bounds shown in Tab. 1, we obtain a rescaled accuracy of 89% and a rescaled recall of 95%.
• The performance of the multiclass DF-RASAR model matches that of the RF.

Conclusions
In this paper, we addressed the problem of predicting the acute toxicity of chemicals on different taxa through machine learning. We focused on fish datasets where one is interested in knowing about the effect of many chemicals on a large number of taxa, and the available data only includes few combinations of taxon and chemical. We addressed both binary and 5-class classification, with similar results.
Gain from including taxonomic and experiment information. We showed that taking into account the information on the taxon and on the experimental conditions is highly beneficial, providing over a 10% gain in F1 score (and other metrics, such as accuracy) with respect to the standard procedures for dealing with the same dataset. Remarkably, a linear model trained on a dataset which includes taxonomic information performs better than more sophisticated models which only make use of chemical information. Additionally, we showed how one can take into account the peculiarities of each source of data to obtain a further gain.
Best models and comparison to RASARs. The best performances are obtained by the RF, followed by the DF-RASAR and other models whose performances were not far behind. This similarity between different models was already found previously [13,15], and the superiority of RFs had already been pointed out [15]. Even though the DF-RASAR obtains good performances, it does not seem to outperform other traditional machine learning models. In particular, the S-RASAR does not outperform K-NN, making its deployment less justifiable. The DF-RASAR, instead, has several advantages in the integration of data from different effects, and a potential for improvement through some small variations.
Best performance in terms of maximum achievable performance. For binary classification, we obtain a maximum prediction accuracy of 93.2%. However, since this number comes from a test set which contains some inconsistent examples, it is effectively higher than it appears, since it should be normalized by the accuracy of the test set. We cannot estimate the latter exactly, but we see that it is roughly upper-bounded by A up = 0.958. By rescaling appropriately through A up, we can argue that the real accuracy is around 97%. Similar remarks hold for the multi-class setting.
Metrics for comparing ML models with experiment reproducibility. If we use the metrics presented in Ref. [16], all our models (as long as we use a CTE approach) outperform animal experiments. As also mentioned in Ref. [27], this statement should be refined, given the non-rigorous nature of this kind of estimator. To deal with this, instead of trying to estimate the accuracy (as well as other metrics) of the experiments, we estimated a heuristic upper bound. With these upper bounds, we still find evidence that our best models (RASAR, RF, KNN and MLP) outperform animal tests (Sec. 2.6). However, these comparisons should be taken with a grain of salt. The main problem is that, as stated earlier, this kind of comparison is not rigorously defined. When we estimate the accuracy of the experiments, we have no truth value with which we can compare; we only have the results of repetitions of the same experiment. Therefore, we cannot estimate it without making strong assumptions. When we then calculate the accuracy of our models, we test them on a random selection of experimental outcomes. In other words, we are making the implicit (wrong) assumption that the data in the test set is not mislabeled (i.e. we are assuming that all the experimental outcomes faithfully reflect the ground truth). To overcome this issue, we made heuristic arguments on how the test accuracy of the models would be affected by the noise in the labels. However, these arguments are also not fully representative, since the accuracy of the labels is calculated on single repetitions of the same experiments (assuming that all experiments fluctuate in the same way), whereas the data used for training and testing coarsens repeated entries into a single one, since we take the median LC50. Taking the median is desirable, since it arguably produces higher-quality data, but it makes it even more far-fetched to compare the performance of the models with that of the experiments. Anyhow, albeit not mathematically rigorous, these metrics arguably give a reasonable indication of the reliability of experimental results, and can be used to get a gross understanding of how accurate machine learning models are. In our case, they indicate that some of our models are close to the maximum obtainable performance, but only as long as we include taxonomic variability as an input variable. This means that, for the data we analyzed, the model selection is only marginally relevant with respect to the right choice and curation of the input features.
Most important descriptors. We discriminated which input features played the most important role. Most of the important features belong to the chemical description, which explains why it is still possible to obtain decent predictions based on the chemical alone. Nonetheless, the role of taxonomy and experimental details is relevant (e.g. it explains 30% of the inter-tree variability in the RFs), and the single most important feature is the species (the species descriptor is more crucial than the LogP towards assessing toxicity), which is in agreement with previous observations [14]. Moreover, in this work we restricted our analysis to fish, but expanding the focus to wider taxonomic ranges would certainly further increase the importance of including the taxonomy.
Relation to human toxicology. Training models that are able to generalize efficiently across taxa is central in ecotoxicology, but it is also crucial for human-centered toxicology. In fact, the investigation of toxic effects on humans relies on experimentation on mammals, such as rats, mice, guinea pigs and rabbits, and there is no guarantee that these species react in the same way as humans to novel toxicants. In recent years, in-silico methods, including machine learning, have been used to predict human toxicity (most often based on oral toxicity tested on rodents) [28,29,30,31,32], under the implicit assumption that humans react in the same way as the tested mammals. Therefore, mastering generalization across taxa can help data-driven approaches to extrapolate better from animals to humans.
Replacing in-vivo animal testing. In the long term, it would be desirable to replace in-vivo testing through machine learning. How far are we from that? In order to do this, we would like in-silico predictions to be as accurate, or nearly as accurate, as directly performing an in-vivo test. Our results represent a fraction of all the possible chemicals and taxa that one could be interested in testing. In order to expand the domain of applicability of our models, we would need to use a wider range of chemicals and taxa both for training and, more crucially, for testing. A known problem in machine learning is that the performances obtained on the test set often degrade when the models are deployed in real application settings, because the datasets used to test these models did not comprehensively represent the whole feature space. This problem is known as dataset shift [33,34]. One main source of dataset shift is the fact that the sampling of data points is not completely random, but follows patterns. This is certainly the case in (eco)toxicology, where the taxa and chemicals that appear in the databases are chosen according to specific needs. Therefore, before the deployment of machine learning for risk assessment can become an option, it is necessary to have extensive-enough test sets, and a thorough study of dataset shift. Even then, it is foreseeable that for exotic chemical/taxonomic categories the machine learning predictions might need to be confirmed through tests. Therefore, it is unlikely that machine learning alone will soon be able to completely replace animal testing, but it can certainly allow for a drastic reduction. In addition, given that it is anyhow hopeless to test all chemicals on all taxa, machine learning can be effectively used out of the box, even at the current stage of development, for prioritization: the results of in-silico methods can guide us on which chemicals/taxa should be tested first.

Funding Sources
This project was funded by the SDSC grant "Enhancing Toxicological Testing through Machine Learning" (project N o C20-04).

Data and code availability statement
Our analyses were performed on an openly available data set [35]. Our code for data cleaning and for model training and testing is freely available at github.com/mbaityje/ML-Tox.

Appendix A. Data
We use the Ecotox database, provided by the Environmental Protection Agency of the USA [35], and download the whole ASCII database. The dataset contains information on the taxonomy of the tested organism, on the conditions under which the experiment was carried out, a CAS identifier of the used chemical, and several endpoints related to the studied effects.
Chemicals. From the CAS codes provided in each Ecotox entry, we extract the SMILES string representing the chemical. We use the rdkit python library [36] to extract the following features from the SMILES: number of atoms, number of alone atoms, number of OH groups, number of bonds, number of double bonds, number of triple bonds, number of rings, LogP (octanol-water partition coefficient) and Morgan density. Additionally, we extract molecular weight, water solubility and melting point from the Comptox database [37]. From the SMILES we also obtain the Pubchem2D vector through the pubchempy library [38]. In a small number of cases we are not able to retrieve crucial information, such as the SMILES or the Pubchem vectors, and we drop those data entries.
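A sketch of this extraction step with rdkit is shown below; only a subset of the listed descriptors is included, and the remaining ones (e.g. Morgan density, water solubility, melting point) come from other sources or definitions not shown here.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

def chemical_features(smiles: str) -> dict:
    """Subset of the per-chemical descriptors extracted from a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {}  # entries without a valid SMILES are dropped
    bonds = list(mol.GetBonds())
    return {
        "n_atoms": mol.GetNumAtoms(),
        "n_bonds": mol.GetNumBonds(),
        "n_double_bonds": sum(b.GetBondType() == Chem.BondType.DOUBLE for b in bonds),
        "n_triple_bonds": sum(b.GetBondType() == Chem.BondType.TRIPLE for b in bonds),
        "n_rings": rdMolDescriptors.CalcNumRings(mol),
        "logP": Descriptors.MolLogP(mol),
        "mol_weight": Descriptors.MolWt(mol),
    }

print(chemical_features("c1ccccc1O"))  # phenol, as a quick sanity check
```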
Taxonomy. We filter the dataset, keeping only the experiments on fish. This allows us to discard the high-rank taxonomic features, ending up with 2 classes (Actinopterygii and Cephalaspidomorphi), 26 orders, 83 families, 219 genera and 345 species. We also exclude organisms at the embryo life stage. Note that the data is not equally distributed among taxonomic categories. For example, in Fig. A.6 we show how many times every order appears in our dataset (after cleaning): the distribution is highly imbalanced. In Fig. A.7 we also show, for each order, the fraction of chemicals that result more toxic to it.
Target. We then filter on the desired endpoint and effect. For mortality experiments, the endpoints EC50 and LC50 are equivalent, so we keep both. The distribution of the LC50 in our dataset is shown in Fig. A.8. For the other effects, that are needed by the DF RASAR as input, we take the EC50.
Imputation and aggregation. We drop the lines with one or more missing features, and aggregate the experiments with the same CAS, taxonomy, and experimental conditions into a single record, whose LC50 (or equivalent target) is the median LC50 of those repeated experiments. We then use the median LC50 to calculate the labels y 2 and y 5. While for the mortality data, which constitutes our target, we use the thresholds described in Eq. (1) (or Eq. (F.1) for multiclass), for the other effects, which are only used as input features for the DF-RASAR models, we use the median (or the 20th, 40th, 60th and 80th percentiles for multiclass) in order to obtain maximally well-separated input features.
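A minimal pandas sketch of this aggregation step, with hypothetical column names:

```python
import pandas as pd

# Hypothetical minimal layout of the raw experiments.
raw = pd.DataFrame({
    "cas":      ["c1", "c1", "c2"],
    "species":  ["s1", "s1", "s1"],
    "duration": [96,   96,   48],
    "lc50":     [0.8,  1.4,  30.0],
})

group_cols = ["cas", "species", "duration"]  # illustrative subset of the grouping keys
aggregated = (raw.dropna()                   # drop rows with missing features
                 .groupby(group_cols, as_index=False)["lc50"]
                 .median())                  # median LC50 of the repeated experiments
print(aggregated)
```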
Standardization. All the features are rescaled between 0 and 1, and with a prior logarithmic transformation when the quantity spans several orders of magnitude. The scripts we used for the data cleaning are freely available [39].
We end up with 20128 entries in our mortality dataset, including n chem = 2199 chemicals and n spec = 345 species.
Class balance. As shown in Fig. 1, the data set is balanced both in the binary and in the 5-class classification. If we neglect all the features other than x ch, we need to merge into a single entry all the lines related to different taxa/conditions for the same chemical, which makes the dataset less balanced. We end up with 461 positives and 1738 negatives in the binary case, and with 543 non-toxic, 640 hazardous, 555 toxic, 302 very toxic and 159 extremely toxic chemicals in the multiclass case (Fig. A.9). If we restrict to fathead minnows, we have 2830 entries; if we restrict to rainbow trouts, we have 2508 entries.

Appendix B. Models
In this section we explain, where necessary, the implementations of our machine learning models. All our models are trained by splitting the dataset (80:20) into a training and a hold-out test set. The training set is used for 5-fold CV, implying a further 80:20 splitting into training and validation sets. The test set is not used until the very end, to produce the performance metrics.
Except for the RASARs, the models we use are quite standard, so we will not go into the details of most models, but rather focus on hyperparameter tuning and implementation choices.

Appendix B.1. Logistic Regression and Random Forest
For LR and RF we use the standard implementations from the Scikit-Learn python library [23,39]. In the preprocessing step, we rescale all numerical features through min-max normalization and one-hot encode all categorical features (using the OneHotEncoder function from Scikit-Learn [23]). The hyperparameters tuned for the logistic regression include (value ranges in parentheses): C (10^-3 to 10^3), l1 ratio (0 to 1), and max iter (100 to 800).
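A sketch of the corresponding Scikit-Learn pipeline is shown below; the column names and the demo data are hypothetical, and the grid is a coarser version of the ranges quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

num_cols = ["logP", "water_solubility"]   # hypothetical numerical columns
cat_cols = ["species", "control_type"]    # hypothetical categorical columns

pre = ColumnTransformer([
    ("num", MinMaxScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = Pipeline([
    ("pre", pre),
    ("lr", LogisticRegression(penalty="elasticnet", solver="saga")),
])

# Coarser version of the grid quoted above.
grid = {"lr__C": np.logspace(-3, 3, 4),
        "lr__l1_ratio": [0.0, 0.5, 1.0],
        "lr__max_iter": [100, 800]}
search = GridSearchCV(pipe, grid, cv=5, scoring="f1")

# Dummy data, just to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "logP": rng.normal(2, 1, 100),
    "water_solubility": rng.lognormal(0, 1, 100),
    "species": rng.choice(["Pimephales promelas", "Oncorhynchus mykiss"], 100),
    "control_type": rng.choice(["neg", "pos"], 100),
})
search.fit(demo, rng.integers(0, 2, 100))
print(search.best_params_)
```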

Appendix B.2. Multi-Layer Perceptron
As a first preprocessing step, we one-hot encode all the categorical features. Then, we apply a min-max scaler to the data so that the scaled data has a minimum of 0 and a maximum of 1. The lower bound of 0 is chosen to keep the sparsity of the categorical features after one-hot encoding.
We use a tensorflow [40] implementation of the MLP, and the hyperparameter tuning is performed on the training set using the Hyperband tuning algorithm [41]. The tuned hyperparameters include (value ranges in parentheses): number of layers (2 to 5), number of nodes in each layer (64 to 1024), dropout rate (0 to 0.5), and learning rate (10^-5 to 10^-2). One dropout layer is placed after the dense layers and before the output layer. The tanh activation function is used in all dense layers. For the output layer, we use the sigmoid and the softmax activation for binary and multiclass classification, respectively. The selected hyperparameters for each case are shown in Table B.7.
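The following sketch builds such an MLP with tensorflow.keras; the default hyperparameter values are placeholders (in our work they are selected by the Hyperband tuner, whose search loop is not shown here).

```python
import tensorflow as tf

def build_mlp(n_features: int, n_layers: int = 3, n_nodes: int = 256,
              dropout: float = 0.2, lr: float = 1e-3, n_classes: int = 2):
    """Sketch of the MLP described above: tanh dense layers, one dropout layer
    before the output, sigmoid (binary) or softmax (multiclass) output.
    The default hyperparameters are placeholders."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_features,)))
    for _ in range(n_layers):
        model.add(tf.keras.layers.Dense(n_nodes, activation="tanh"))
    model.add(tf.keras.layers.Dropout(dropout))
    if n_classes == 2:
        model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
        loss = "binary_crossentropy"
    else:
        model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
        loss = "sparse_categorical_crossentropy"
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss=loss, metrics=["accuracy"])
    return model

model = build_mlp(n_features=900)
model.summary()
```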

Appendix B.3. K Nearest Neighbors
The KNN algorithm assigns to an unknown entry the label that appears most often among its K nearest-neighbor data points. If two labels are equally frequent, the one that is most frequent in the training set is chosen. For this reason, when the dataset is unbalanced (e.g. Tab. 2), even values of K give a higher performance. We use the KNN functions from Scikit-Learn [23]. We select and report the value of K that gives the highest validation accuracy. Differently from what is found in Ref. [15], we obtain a better test performance with small values of K. In Fig. B.10 we show the validation performance for the different values of K that we tested, for both the C and the CTE setups.

Appendix B.4. RASAR
The RASAR model, presented in Ref. [16], comes in two variants. The S-RASAR requires data on a single effect (mortality in our case), whereas the DF-RASAR integrates several effects.

Appendix B.4.1. Simple RASAR
The S-RASAR is composed of two steps. In the first step, a metadataset is created through a nearest-neighbor algorithm, while in the second step, the metadataset is trained through a supervised learning method. The first step does not require any hyperparameter tuning, while the second follows the protocol described in Appendix B.1.
For each entry i in the training set, we save the normalized distance d̄+ from the nearest positive outcome and the normalized distance d̄− from the nearest negative outcome (we use the distances defined in Sec. 2.3). This results in a meta entry

m i = (d̄+ i, d̄− i).

We now have a supervised learning problem, where we need to deduce the labels y 2i from m i. We do this through LR or RF.

Multiclass. In Ref. [16], this algorithm is presented only as a binary classification algorithm, but we can easily extend it to multi-class classification. It is enough to measure the distance to the nearest element of each class. The multiclass metadata point is then

m i = (d̄ i,0, d̄ i,1, . . . , d̄ i,C−1),

where C is the number of classes and d̄ i,c is the normalized distance of entry i from the nearest training example of class c.
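The unsupervised step can be sketched in a few lines of Python; the snippet below computes, for every query point, the distance to the nearest training neighbour of each class. Plain Euclidean distances are used here for brevity (the per-category distances of Eq. (5) can be substituted), and for training points one would exclude the point itself from its own neighbours.

```python
import numpy as np

def rasar_meta_features(X_train, y_train, X_query, n_classes=2):
    """Unsupervised step of the (multiclass) S-RASAR: for every query point,
    the distance to its nearest training neighbour of each class."""
    meta = np.empty((len(X_query), n_classes))
    for c in range(n_classes):
        X_c = X_train[y_train == c]
        # pairwise Euclidean distances from each query point to class-c points
        d = np.linalg.norm(X_query[:, None, :] - X_c[None, :, :], axis=-1)
        meta[:, c] = d.min(axis=1)
    return meta

X_train = np.random.rand(50, 10)
y_train = np.random.randint(0, 2, 50)
X_query = np.random.rand(5, 10)
meta = rasar_meta_features(X_train, y_train, X_query)
# 'meta' is then fed to a supervised learner (LR or RF in our case).
```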

Appendix B.4.2. Data Fusion RASAR
The DF-RASAR extends the idea of the S-RASAR, by including different sources of information in the metadataset. Since the definition in the binary case is already provided in Ref. [16], we define it here in the case of C classes.
The first, unsupervised, step is performed for E extra effects, beyond the target one, which we denote with a 0 (in our study, mortality). For each extra effect e, we construct an extra metadataset. This metadataset contains C + 1 elements. The first C elements are the distances to the nearest experiment of each label (distance to the nearest 0, to the nearest 1, . . .), and the (C +1) th is the label of effect e. In other words, if chemical A was tested on taxon B for mortality, now we are including the impact of A on B based on another effect e. If this information is unavailable, we use an extra "unknown" label. In the multiclass problem, the introduction of an "unknown" label implies that the six labels are no longer ordered and need to be treated as categorical variables.
In formulas, for each extra effect e, the metadata point corresponding to the i-th experiment (i.e. a taxon-chemical couple) is

m̂ i(e) = (d̄ i,0(e), . . . , d̄ i,C−1(e), y C i(e)),

where y C i(e) is the C-class label (see Appendix F) of the e-th effect, and d̄ i,c(e) is the distance from the nearest entry of class c within effect e. It can happen that the features used for the e-th effect are different from those of the target effect; in those cases, we restrict the distance to the common features.
As a result, the i-th meta datapoint for a C-class DF-RASAR takes the form of the concatenation of the target-effect metadata point m i with the extra metadata points m̂ i(1), . . . , m̂ i(E). We then use RFs on this metadataset to relate it to the target, y C i(0).

Appendix C. Performance metrics
Here, we explicitly define the performance metrics used in this manuscript: accuracy (a), recall (r, often called sensitivity), specificity (s), precision (p) and F1 score (F 1).
The accuracy is the most intuitive of these quantities, and measures the fraction of correct answers,

a = (# correct guesses) / (total number of guesses).   (C.1)

In terms of the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the remaining metrics are defined as

r = TP / (TP + FN),
s = TN / (TN + FP),
p = TP / (TP + FP),
F 1 = 2 p r / (p + r).

The recall measures, out of all the positives in the test set, how many were guessed correctly. The specificity does the same, but for the negatives. In balanced datasets (i.e. with a similar number of positives and negatives), these are usually similar. With unbalanced datasets, high accuracies can be an artifact of the imbalance, and this usually results in very different values of r and s. In those cases, a better indicator of the performance is either the balanced accuracy, b = (r + s)/2, or the F 1 score [42]. For multiclass classification, we calculate the metrics on each class, and then average over the classes without taking into account how many examples there are in each class (this is called a macro average [43]). In the multiclass case, the specificity is not very informative, since the negatives of one class include several classes. Therefore, we report the precision, which indicates, out of all the examples that were labeled as positive (for a given class) by the classifier, how many were actually correct.
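For reference, these metrics correspond to the following Scikit-Learn calls (binary example; for multiclass we would pass average="macro"):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

a  = accuracy_score(y_true, y_pred)
r  = recall_score(y_true, y_pred)                # sensitivity
s  = recall_score(y_true, y_pred, pos_label=0)   # specificity = recall of the negatives
p  = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
b  = balanced_accuracy_score(y_true, y_pred)
print(a, r, s, p, f1, b)

# Multiclass macro average (every class weighted equally):
# f1_macro = f1_score(y_true_5class, y_pred_5class, average="macro")
```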

Appendix D. Reproducibility of the experiments: formal definitions
Here, we formally define the quantities A up, R up, S up and BA up that we introduced in Sec. 2.6. Let y(i) = 0, . . . , C − 1 be a generic label (in this work C is set to 2 or 5 classes), related to an experiment i, which is performed n i times. Let us call n rep the number of experiments that were performed at least twice, and y j(i) the label resulting from the j-th repetition. Let us also call ȳ(i) the label of experiment i that appears the most times over the n i repetitions. Then, we have

A up = (1 / n rep) Σ_i [ (1 / n i) Σ_j δ(y j(i), ȳ(i)) ],   (D.1)

where δ(a, b) is the Kronecker delta, which is equal to 1 if a = b, and equal to 0 if a ≠ b. Therefore, the term in square brackets is the fraction of repetitions of the experiment that resulted in the most common outcome. By restricting the measurement of A up to the experiments with a majority of positive outcomes, we get an upper bound to the recall, R up. By restricting to the negative outcomes, we obtain an upper bound to the specificity, S up:

R up = (1 / n rep,1) Σ_i δ(1, ȳ(i)) [ (1 / n i) Σ_j δ(y j(i), ȳ(i)) ],   (D.2)

S up = (1 / n rep,0) Σ_i δ(0, ȳ(i)) [ (1 / n i) Σ_j δ(y j(i), ȳ(i)) ].   (D.3)

Here, we are separating the experiments into two subsets. In Eq. (D.2) we only consider the experiments with a majority of y j(i) = 1 outcomes. This is enforced by the term δ(1, ȳ(i)), which is zero unless the most common label is a 1. Since we discarded a fraction of the experiments, the normalization factor is no longer n rep, but the total number of experiments with a majority of y j(i) = 1 outcomes, which we call n rep,1. The definition in Eq. (D.3) is equivalent, but we consider the experiments with a majority of y j(i) = 0 outcomes (normalized by their number, n rep,0).

We obtain an estimated upper bound for the balanced accuracy by taking

BA up = (R up + S up) / 2.

For multiclass classification, A up is calculated in the same manner. Since in this case the specificity is not informative, we restrict to the recall. We calculate separately the average recall R up,c related to each class c, and then average over all the classes (i.e. we measure the macro-averaged recall):

R up = (1 / C) Σ_c R up,c.

Appendix E. Permutation-based feature importance
Since the impurity-based feature importance analysis is at risk of giving too much weight to high-cardinality features [24], we also use permutation importance, to check that the species (which indeed has a high cardinality) really is one of the most important features. In the bottom panel of Fig. E.11 we see that, also when using permutation importance, the species remains one of the most important features. We notice that the importance of the chemical-related features is diminished with respect to the impurity-based method (this is also visible in the top chart). This can be expected, since the features in x ch are not linearly independent (i.e. similar information can be carried by more than one feature).

Appendix F. Multi-class performances
In this section, we show the results of the main paper, but with a multi-class target, defined in Eq. (F.1). For the binary labeling, the labels distinguish more (1) from less (0) toxic outcomes. For the multi-class labeling, we refer to the standard thresholds [44,13]: non-toxic (0), slightly toxic (1), moderately toxic (2), highly toxic (3), very highly toxic (4).
In Tabs. F.9, F.10, F.11 and F.12 we report, for multiclass classification, the same results that in the main text we reported for binary classification (Tabs. 2, 3, 4 and 5; we report the precision instead of the specificity, since it is better defined for multi-class classification). We observe the same trend shown for binary classification. The F1-score increase obtained by including taxonomy- and experiment-related information is of around 15%. The best accuracy that we are able to reach with multi-class classification is 0.781 (Tab. F.12), which is close to the estimators of the reproducibility of the experiments (A up = 0.878, Tab. 1), indicating that our models are likely close to achieving the best possible performance on this dataset. If we rescale the maximum obtained accuracy by A up, we obtain 0.781/0.878 ≈ 0.890 ≈ 89%, which is close to what we obtained for binary classification. The recall is even closer to its estimated maximum R up, as 0.787/0.833 ≈ 0.945 ≈ 95%.
Even in the multi-class case, the best models are RF, MLP and DF-RASAR, as for binary classification. From the values of the α parameters in Tab. 5 we find the same indications, with α cat being smaller, though not negligibly so, than α pr and α pubc, and the pubchems seemingly less important when a smaller number of neighbors is taken into account.

Table F.9: Metrics of different multi-class models, trained and tested on x ch, i.e. without distinguishing among taxa and experimental details (same as Tab. 2, but for multi-class classification). The numbers in parentheses are an uncertainty estimate (see Sec. 2.5).

Appendix G. Stratified splitting
Here, we show how our models perform in the case of a different splitting of the dataset. We choose two ways of doing stratified splitting: by chemical and by taxon. In the first, each CV fold, as well as the hold-out test set, contains exclusively different chemicals. In the second, they contain different taxa.
To do this, we used the GroupShuffleSplit function from Scikit-Learn [23] to split the data into training and test sets, ensuring that the chemical (or taxon) groups in the test set are disjoint from those in the training set. In the 5-fold CV, we used the GroupKFold function. For the LR and RF we mitigate class imbalance through class reweighting.
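A sketch of this grouped splitting with Scikit-Learn, using dummy data and chemicals as the grouping variable:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Dummy data; 'groups' is the chemical (or taxon) of each entry.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
groups = np.random.randint(0, 20, 100)   # e.g. 20 distinct chemicals

# Hold-out split keeping every group entirely on one side of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups))

# 5-fold CV within the training set, again grouped.
gkf = GroupKFold(n_splits=5)
for tr, val in gkf.split(X[train_idx], y[train_idx], groups[train_idx]):
    pass  # fit on 'tr', validate on 'val'
```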

Appendix G.1. Splitting by chemical
In Table G.14 we show the performance of the C approach when splitting by chemical, and in Table G.15 we show the performance of the CTE approach under the same conditions. The performances of the CTE approach are better than those of the C approach also in this case.