STopTox: An in Silico Alternative to Animal Testing for Acute Systemic and Topical Toxicity

Background: Modern chemical toxicology is facing a growing need to Reduce, Refine, and Replace animal tests (Russell 1959) for hazard identification. The most common type of animal assays for acute toxicity assessment of chemicals used as pesticides, pharmaceuticals, or in cosmetic products is known as a “6-pack” battery of tests, including three topical (skin sensitization, skin irritation and corrosion, and eye irritation and corrosion) and three systemic (acute oral toxicity, acute inhalation toxicity, and acute dermal toxicity) end points. Methods: We compiled, curated, and integrated, to the best of our knowledge, the largest publicly available data sets and developed an ensemble of quantitative structure–activity relationship (QSAR) models for all six end points. All models were validated according to the Organisation for Economic Co-operation and Development (OECD) QSAR principles, using data on compounds not included in the training sets. Results: In addition to high internal accuracy assessed by cross-validation, all models demonstrated an external correct classification rate ranging from 70% to 77%. We established a publicly accessible Systemic and Topical chemical Toxicity (STopTox) web portal (https://stoptox.mml.unc.edu/) integrating all developed models for 6-pack assays. Conclusions: We developed STopTox, a comprehensive collection of computational models that can be used as an alternative to in vivo 6-pack tests for predicting the toxicity hazard of small organic molecules. Models were established following the best practices for the development and validation of QSAR models. Scientists and regulators can use the STopTox portal to identify putative toxicants or nontoxicants in chemical libraries of interest. https://doi.org/10.1289/EHP9341


Introduction
Historically, animal testing has been required by regulatory agencies for hazard categorization, labeling, and packaging.1 However, there have been multiple calls, especially in the last two decades, for reducing, refining, and replacing animal tests for hazard identification.2 Nevertheless, the potential for chemicals, especially those used in cosmetic products, to cause several types of acute toxicity in humans is still evaluated by the animal in vivo assays historically termed the "6-pack".3 This battery of tests includes three topical (skin sensitization, skin irritation and corrosion, and eye irritation and corrosion) and three systemic (acute oral toxicity, acute inhalation toxicity, and acute dermal toxicity) endpoints.3 The European Union (EU) has maintained an animal testing ban on finished cosmetic products since 2004 and on cosmetic ingredients since 2009. Since 2013, this ban has been applied irrespective of the availability of alternative non-animal tests.4 In the United States, although animal testing is not prohibited, the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) and its supporting center, the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), have been coordinating the development, validation, acceptance, and harmonization of alternative toxicological test methods throughout the U.S. Federal Government. Still, traditionally, all compounds registered under the Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) are required to be tested in the 6-pack assays. In 2015, 6-pack guidelines were revised to include in vitro and ex vivo test methods for classification and labeling for eye irritation; these new guidelines applied both to household antimicrobial cleaning products and to conventional pesticide products.4 In 2018, the EPA's Draft Interim Science Policy allowed in vitro tests for skin sensitization as well; however, only pure substances may be used as test materials for submission purposes, and an Integrated Testing Strategy (either the "2 out of 3 Defined Approach" or the "Key Event 3/1 Sequential Testing Strategy") must be followed.5 However, the European Commission reported 9.58 million uses of animals in research and testing in the EU in 2017, 23% of which were for regulatory purposes.6

Along with ethical concerns, the validity of animal testing has also come under question in recent years. Several studies have shown that animal-based assay outcomes do not always equate with human response7,8 and that animal models are less reproducible than some alternative methods.9 However, despite the international adoption of a variety of in vitro alternatives, their widespread implementation has been hampered by concerns over concordance with the traditional animal tests and by practical considerations as compared to the traditional animal tests.10

Computational approaches, such as structural alerts, read-across, and Quantitative Structure-Activity Relationship (QSAR) modeling, have earned broad acceptance as weight of evidence for assessing chemical toxicity.11,12 Structural alerts are molecular substructures that are associated with a particular adverse outcome.13 Read-across is a technique that proposes to identify potential hazards of untested compounds by associating them with structurally similar compounds that have been tested.14 QSAR modeling is a computational approach that employs statistical or machine learning techniques to establish correlations between intrinsic chemical properties (chemical descriptors) and measured properties or toxicological effects.15 QSAR modeling has been used extensively to model and predict chemical toxicity, and best practices for model development and validation have been developed to ensure their reliability.16 Both structural alerts and read-across approaches have been preferred by regulators due to ease of use, transparency, and mechanistic interpretability. However, there have been concerns that these tools often do not help with reliable assessment of whether the underlying compounds present true hazard to humans and the environment. For instance, we have previously demonstrated that alerts have a tendency to flag compounds as toxic even when the experimental evidence shows otherwise.17

In the last several years, both our18-20 and other21,22 groups have developed reliable computational models for predicting skin sensitization potential of chemicals. These and other models developed for one or more of the 6-pack endpoints are summarized in Table 1, which indicates that the development of reliable computational models for predicting the outcomes of all 6-pack tests is still a significant challenge. Herein, we have compiled, integrated, and curated the most comprehensive collection of experimental in vivo data on 6-pack endpoints. We especially emphasize, with vivid examples, the importance and impact of data curation on the rigor of the study design and reliability of its outcomes. We have developed and rigorously validated QSAR models for all 6-pack assays and demonstrated their utility in identifying potentially safe or unsafe chemicals in industrial products (Figure 1). We integrated these models into a software package called STopTox (Systemic and Topical chemical Toxicity) and made it publicly available to the research community via a dedicated web portal (https://stoptox.mml.unc.edu/).

Data Curation and Cross-endpoint concordance analysis
Although predictive models have been developed and reported previously for subsets of the 6-pack endpoints (Table 1), many of these models did not fully comply with the model validation guidelines specified by the Organization for Economic Cooperation and Development (OECD)23 and, most notably, lacked proper data curation. As can be seen in Figure 2, applying data curation protocols decreased the size of the available data by more than 50%. Our final database represented a sparse matrix containing 11,941 compounds with activity measurements for at least one of the 6-pack endpoints. Only 12 compounds were tested in all six endpoints. This low number can be due to waivers granted by regulators in an effort to decrease animal use in toxicity assessment.24 Figure 3 shows a heat map of the pairwise concordance and overlap (in parentheses) between compounds tested in different endpoints. We have identified 328 compounds overlapping between acute oral toxicity and acute inhalation toxicity. These two endpoints showed the highest (80%) pairwise concordance among all six endpoints. A previous study25 analyzed the binary concordance between the acute systemic toxicity tests in "pesticide" and "chemical" sets. The oral-inhalation overlap for "pesticides" was 328 compounds with concordance as low as 24%, while for "chemicals" (71 compounds) the concordance was 72%.
The oral-dermal overlap for pesticides (307 compounds) demonstrated a concordance of 71%.
The oral-dermal overlap for "chemicals" (1,569 compounds) resulted in high concordance of 93%. Our concordance analysis also showed an oral-dermal concordance of 71% estimated on the overlap of 1,308 compounds. Among compounds that were "Not Classified" in acute oral tests, 98% were also "Not Classified" in acute dermal tests. While the oral route is mandatory for acute toxicity testing under most regulatory frameworks for food-use pesticides, selection of a second route of exposure (dermal or inhalation) is based on the amount of a chemical being commercialized, taking into account the inherent acute toxicity of the chemical and the primary route of exposure when handling the material.26 Computational models built on curated datasets and following the best practices for development and validation of QSAR models can support decisions to waive dermal tests when a substance already has an acute oral toxicity profile, because there remains a small possibility that the chemical is toxic upon skin contact.27 Skin irritation and eye irritation showed the same binary outcomes in 60% of 380 overlapping compounds. A concordance of 67% has been previously observed for 205 formulations tested for both skin irritation and eye irritation.28 This concordance was higher for "Not Classified" compounds: 89% of formulations classified as non-irritant for eye irritation were also non-irritants for skin irritation. Chemicals that were considered corrosive or severely irritant to the skin or highly toxic (category 1 or 2) in the dermal test do not need to be tested for eye irritation and corrosion, according to the OECD guidance on considerations for waiving mammalian acute toxicity tests.24 Also, skin sensitization and skin irritation datasets showed the same binary outcomes in 72% of 237 overlapping compounds.
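The pairwise concordance reported above can be illustrated with a minimal sketch. It assumes each endpoint is stored as a mapping from a compound identifier to a binary call (1 = toxic, 0 = Not Classified); the function name, data layout, and toy values are hypothetical and are not taken from the study.

```python
from itertools import combinations

def pairwise_concordance(labels_by_endpoint):
    """labels_by_endpoint: {endpoint: {compound_id: 0 or 1}} (assumed layout).

    Returns {(ep_a, ep_b): (n_overlap, fraction_agreeing)}; the fraction is
    None when two endpoints share no compounds.
    """
    results = {}
    for ep_a, ep_b in combinations(sorted(labels_by_endpoint), 2):
        a, b = labels_by_endpoint[ep_a], labels_by_endpoint[ep_b]
        shared = a.keys() & b.keys()  # compounds tested in both endpoints
        if not shared:
            results[(ep_a, ep_b)] = (0, None)
            continue
        agree = sum(a[c] == b[c] for c in shared)
        results[(ep_a, ep_b)] = (len(shared), agree / len(shared))
    return results

# Toy data, invented for illustration only.
toy = {
    "oral": {"c1": 1, "c2": 0, "c3": 0, "c4": 1},
    "dermal": {"c1": 1, "c2": 0, "c3": 1, "c5": 0},
}
concordance = pairwise_concordance(toy)
print(concordance)
```

Only compounds measured in both endpoints enter the denominator, which is why the overlap size is reported alongside each concordance value.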

QSAR modeling
Previously, we have built models to predict skin sensitization endpoints using a combination of animal,29 OECD-validated in vitro assay,18 and human data.18,20,30 In this study, we have developed consensus models for predicting outcomes of skin sensitization testing in animals, as STopTox is intended as a reliable alternative to animal testing in the 6-pack assays.
Therefore, all the models reported here were built using only data collected from animal tests that followed the OECD protocols. Statistical characteristics of QSAR models developed in this study are summarized in Table 2. All cross-validated models showed high predictive accuracy on independent external evaluation sets based on several metrics including correct classification rate (CCR), sensitivity (SE), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV). The acute toxicity models showed CCRs ranging from 70% to 78%; SEs ranging from 67% to 79%; SPs ranging from 68% to 81%; PPVs ranging from 71% to 79%; and NPVs ranging from 70% to 78%.
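As a hedged illustration of the statistics listed above, the sketch below computes them from binary labels. It assumes CCR is the mean of sensitivity and specificity (balanced accuracy), a common definition in QSAR work that this section does not spell out; the function name and toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute SE, SP, CCR, PPV, NPV for binary labels (1 = toxic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        # Return NaN rather than raising when a class is empty.
        return num / den if den else float("nan")

    se = ratio(tp, tp + fn)  # sensitivity
    sp = ratio(tn, tn + fp)  # specificity
    return {
        "SE": se,
        "SP": sp,
        "CCR": (se + sp) / 2,  # assumed definition: balanced accuracy
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }

# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Unlike plain accuracy, this form of CCR is not inflated when one class dominates the evaluation set.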

Comparative assessment of new "6-pack" models versus alternative tools
Over the decades, many computational tools have been built to evaluate adverse health effects of chemical compounds from their chemical structure.31,32 Predictive models, such as quantitative structure-activity relationship (QSAR), facilitate replacement and reduction of animal testing in toxicology and risk assessment; however, many subject-matter experts acknowledge that "these methods are not always reliable and must be assessed on their individual merit for the compound and context in question".33 For this and other reasons catalogued by Dearden et al.,34 results derived from QSAR and other in silico models are usually met with caution. Model predictions are almost always used in combination with other evidence, or only for the purpose of the initial screening/ranking of compounds for further testing.35 The curation of the ECHA database proved to be an extremely laborious task and the most time-consuming part of this work. It is important to emphasize that much of the 6-pack endpoint data included in the ECHA database should not be used for model development. As seen from the summary of data curation (Figure 2), the major reduction in the size of individual datasets used eventually for QSAR model development was due to a large fraction of inconsistent data in the original ECHA database which, upon careful inspection, were found to be reported as predictions made by QSAR models, or read-across, or expert systems, or labeled as "not reliable".
Comparison to models developed for the same endpoints without rigorous data curation36 suggests that our extensive data curation procedures resulted in a net decrease in both the dataset sizes and model performance. Indeed, we compared the models produced in this study to those reported by Luechtefeld et al.,36 who described the development of a suite of in silico models, termed read-across structure activity relationships (RASAR), for the 6-pack endpoints.
Since the predictions made by RASAR can only be accessed through a fee-based commercial platform (https://www.ulreachacross.com), we have performed an indirect comparison of statistics (see Table S1 in the supplementary materials). Our models showed, on average, a 10% lower CCR. Furthermore, the amount of data reported in the study mentioned above36 was, on average, five times larger than the size of the carefully curated dataset used in this study. Previously, we already expressed concerns that the high accuracy of models as reported36 could be the consequence of inadequate data curation leaving many duplicate compounds in the modeling datasets.37 We posit that our results more accurately reflect the actual model performance for these endpoints since we have eliminated such confounders as duplicate entries or the use of predicted or "not reliable" values and conducted more rigorous validation procedures according to the established guidelines. We strongly suggest that this exercise reemphasizes the importance of proper data curation and cautions against overinterpreting results from models built on non-curated datasets.

STopTox usability and interpretation
The limitations of the experimental assay results used in model development define the usability and interpretation of STopTox. It is important to note that the binary compound annotations in all 6-pack assays as toxic or non-toxic are derived from dose-dependent testing conditions. For instance, the skin sensitization potencies for substances are based on a function of lymph node cell proliferation induced by the test chemical and expressed as a stimulation index (SI) relative to values obtained with concurrent controls. If SI ≥ 3, the substance is considered a sensitizer at the tested concentration. The skin irritation criteria are based on a single dose of a chemical applied to the skin of an experimental animal. The degree of irritation/corrosion is based on an erythema/edema scale system that is subjectively scored at specified observation intervals. Similarly, eye irritation is based on subjectively scoring lesions of conjunctiva, cornea, and iris at specific intervals after application of a single dose of the test substance. If the effects are reversible after 21 days, the chemical is considered an irritant, but if they do not reverse (or if a severe score is noted at any timepoint), the chemical is labeled as corrosive. The tests for systemic endpoints in STopTox (Acute Dermal Toxicity, Acute Inhalation Toxicity, and Acute Oral Toxicity) rely on LD50 data, meaning the dose of a substance required to kill half of the experimental animals. The Globally Harmonized System of classification and labelling of chemicals (GHS)38 threshold for Acute Oral and Dermal toxicity for chemicals is LD50 < 2000 mg/kg bodyweight. As for Acute Inhalation toxicity, the threshold varies with the form of tested substance (gases: LC50 ≤ 2500 ppm; vapors: LC50 ≤ 10 mg/L; and dusts/mists: LC50 ≤ 20 mg/L).
It is essential to note that, if the model predicts a compound as toxic or non-toxic, such a prediction should be considered only in the context of the specific dose-dependent observation for each assay; increasing the dose of any compound in any assay could often lead to toxic effects. These considerations are often overlooked when making predictions or assertions concerning the expected chemical toxicity. The ultimate goal of any method for evaluating acute toxicity is to provide an accurate assessment of the potential risk of a chemical concerning human safety.39 Therefore, we reinforce that the limitations of the assays should influence the interpretation of the predictions made by the models and how these models can be used by toxicologists to help in decision making. Predictions with QSAR models implemented in STopTox (and, actually, with any models) do not take the dose into account; they merely state whether a chemical is predicted toxic or non-toxic in each assay. Thus, users interpreting these predictions should always be familiar with, and keep in mind, the underlying experimental conditions briefly discussed above under which compounds in the training sets have been annotated as toxic or non-toxic. Further, these models are limited to binary hazard-based predictions, rather than providing information on potency and GHS or EPA subcategorization. Thus, they are not directly applicable to the many regulatory classification and labeling requirements that demand a higher level of granularity. However, these models are well suited to assist in hazard assessment and chemical screening/prioritization, and, because of high accuracy in terms of both sensitivity and specificity, they are especially useful in identifying non-toxic compounds (tested in the same conditions as those identified as toxic where additional subcategorization is indeed important).
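The binarization rules quoted in this section can be written down as a short sketch. The thresholds are exactly those stated in the text (SI ≥ 3; LD50 < 2000 mg/kg; LC50 limits by physical form); the function names and the `form` keys are hypothetical choices made for illustration and do not describe the authors' code.

```python
# Inhalation limits as quoted in the text; gases in ppm, others in mg/L.
INHALATION_LC50_LIMITS = {
    "gas": 2500.0,
    "vapor": 10.0,
    "dust_mist": 20.0,
}

def is_sensitizer(stimulation_index):
    """LLNA call: sensitizer when the stimulation index is at least 3."""
    return stimulation_index >= 3

def is_oral_or_dermal_toxic(ld50_mg_per_kg):
    """Acute oral/dermal call: toxic when LD50 is below 2000 mg/kg bodyweight."""
    return ld50_mg_per_kg < 2000

def is_inhalation_toxic(lc50, form):
    """Acute inhalation call; `form` must be a key of INHALATION_LC50_LIMITS."""
    return lc50 <= INHALATION_LC50_LIMITS[form]

print(is_sensitizer(3.4), is_oral_or_dermal_toxic(2500), is_inhalation_toxic(8.0, "vapor"))
```

Because each label depends on a dose or concentration cutoff, a model trained on these labels inherits the same cutoffs, which is the point made in the following paragraph about interpreting predictions.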

Predictions of known toxicants
For additional external validation of our models, we have conducted a literature search for toxicants described in clinical studies or known toxicants for each endpoint that were absent in our database. A list of 45 potential skin sensitizers in cosmetic ingredients was compiled by the Norwegian Scientific Committee for Food Safety.40 Eleven out of 45 compounds were absent in our skin sensitization dataset, and eight of the eleven chemicals were correctly predicted as sensitizers by our skin sensitization model (SE = 72%).
MS-222, a fish anesthetic commonly used in aquaculture,41 and sodium lauryl sulfate,42 a product commonly used in personal care products, are both known skin irritants that were not present in our skin irritation training data. Our models predicted sodium lauryl sulfate as a skin irritant and MS-222 as not classified (according to OECD protocols, chemicals not classified as skin irritants are considered "not classified"). Three compounds were not present in our eye irritation dataset: glutaraldehyde, glyphosate, and Paraquat (1,1'-dimethyl-4,4'-bipyridinium dichloride). Our model predicted glutaraldehyde and glyphosate as eye irritants.
Exposure to glutaraldehyde during cataract surgery led to the development of toxic anterior segment syndrome in six patients.43 Ocular glyphosate exposure was reported to lead to the development of chemosis, heart palpitations, raised blood pressure, headache, and nausea.44 In two cases of accidental eye exposure to Paraquat, eye damage was reported.45 Dichloromethane46 and methanol47 have been reported as systemic toxicants after dermal exposure and were not in the modeling set. Our acute dermal toxicity model correctly predicted both compounds as toxic after dermal exposure. Twenty chemicals commonly present in occupational inhalation accidents were compiled in a study.48 There were five organic chemicals in this list that were absent in our acute inhalation dataset. All five compounds were correctly predicted by acute inhalation models. One clinical case of accidental oral exposure to the pyrethroid deltamethrin led to the poisoning of a 4-year-old girl who consumed insecticidal chalk and was found unconscious 20 minutes after going outside to play.49 Mephedrone, a psychoactive drug, has been proven toxic in a study reporting cases of acute toxicity related to self-reported use of mephedrone.50 Our acute oral toxicity model predicted both compounds as toxic if swallowed. Together, our models correctly predicted 80% of the known toxicants compiled (Figure 4).
Figure 5 shows the predictions generated for N-phenyl-p-phenylenediamine, a known skin sensitizer that is usually added to temporary black henna tattoos, leading to many cases of contact allergy.51 We also generated maps showing the relative significance of fragment contributions, providing a graphical interpretation of developed models (Figure 5), where atoms and structural fragments enhancing toxicity are highlighted in pink and those decreasing toxicity are shown in green. These maps are generated for each of the six 6-pack endpoints independently.

Virtual screening of CosIng, REACH and all STopTox compounds
As a case study illustrating STopTox usability, we have applied our QSAR models to the European Commission Cosmetic ingredient database (CosIng), REACH, and to all STopTox compounds. The majority of compounds in each of these datasets were predicted as Not Classified by each individual model. In the CosIng dataset (n = 3,930 compounds), 1,655 compounds were predicted as skin sensitizers, 1,400 compounds were predicted as skin irritants, 2,177 compounds were predicted as eye irritants, 551 compounds were predicted as toxic if swallowed, 407 compounds were predicted as toxic if inhaled, and 372 compounds were predicted as toxic after dermal exposure. Out of N=XX total compounds, there were 3,398 compounds predicted as toxic in at least one endpoint and 1,381 compounds predicted as Not Classified in all six endpoints. All compounds and corresponding predictions are listed in Supplementary Table S2.
In the REACH dataset (n = 10,465 compounds), 4,018 compounds were predicted as skin sensitizers, 2,445 compounds were predicted as skin irritants, 4,605 compounds were predicted as eye irritants, 2,679 compounds were predicted as toxic if swallowed, 2,139 compounds were predicted as toxic if inhaled, and 1,899 compounds were predicted as toxic after dermal exposure. There were 7,641 compounds predicted as toxic in at least one endpoint and 2,824 compounds predicted as Not Classified in all six endpoints. All compounds and corresponding predictions are listed in Supplementary Table S3.
In the STopTox dataset (n = 11,941 compounds), 4,792 compounds were predicted as skin sensitizers, 2,491 compounds were predicted as skin irritants, 4,766 compounds were predicted as eye irritants, 5,232 compounds were predicted as toxic if swallowed, 2,394 compounds were predicted as toxic if inhaled, and 2,902 compounds were predicted as toxic after dermal exposure. There were 7,641 compounds predicted as toxic in at least one endpoint and 2,824 compounds predicted as Not Classified in all six endpoints. All compounds and corresponding predictions are listed in Supplementary Table S4.
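The screening summaries above (per-endpoint counts, compounds toxic in at least one endpoint, and compounds Not Classified in all six) can be tallied as in the following sketch. The input layout, function name, and toy predictions are assumptions for illustration, not the format of the supplementary tables.

```python
from collections import Counter

def summarize_screen(predictions):
    """predictions: {compound_id: {endpoint: True if predicted toxic}} (assumed layout).

    Returns (per_endpoint_counts, n_toxic_in_any, n_not_classified_in_all).
    """
    per_endpoint = Counter()
    n_any = 0
    for calls in predictions.values():
        for endpoint, toxic in calls.items():
            if toxic:
                per_endpoint[endpoint] += 1
        if any(calls.values()):
            n_any += 1
    return dict(per_endpoint), n_any, len(predictions) - n_any

# Toy predictions, invented for illustration only.
toy_screen = {
    "c1": {"oral": True, "dermal": False, "eye": True},
    "c2": {"oral": False, "dermal": False, "eye": False},
    "c3": {"oral": False, "dermal": False, "eye": True},
}
counts, n_any, n_none = summarize_screen(toy_screen)
print(counts, n_any, n_none)
```

By construction, the last two totals sum to the number of screened compounds, which gives a quick sanity check on any reported summary.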

Model implementation
The STopTox web-based application (Figure 6) runs machine learning routines written in Python, using Flask v. 0.12.2, a lightweight web microframework, at the backend. Models were developed using Scikit-Learn. Angular 4 and TypeScript were used for the development of the frontend, and Docker and Docker-Compose for the orchestration of containers. The developed models and all datasets are publicly available at https://stoptox.mml.unc.edu/.

Conclusions
We have developed STopTox, a comprehensive collection of computational models that can be used as an alternative to in vivo 6-pack tests for predicting chemical toxicity hazard.
Models were established following the best practices for the development and validation of QSAR models16,23 using the largest publicly available and carefully curated datasets that we compiled for all 6-pack assays. To the best of our knowledge, STopTox is the first publicly available portal that enables accurate predictions of chemical hazards in all the 6-pack endpoints at once. Despite the limitations of these models with respect to potency classes, they are reliable for predicting chemicals that do not require classification, i.e., those expected to be nontoxic if tested following the same protocols used for compounds in the modeling set. We suggest that these models are valuable for both regulatory agencies and respective industries in helping them identify safer alternatives to chemicals of interest. We reinforce that, in order to build predictive models, it is not enough just to utilize adequate chemical descriptors and powerful statistical techniques.52 We stress that STopTox is the only 6-pack endpoint predictor in the public domain developed with extensively curated data. The STopTox web app provides users with access to statistically significant and externally predictive QSAR models of acute toxicity tests.
The web app can be used for rapid evaluation of acute toxicity hazards in chemical inventories.
STopTox is freely available at https://stoptox.mml.unc.edu/. To the best of our knowledge, STopTox does not have analogs in terms of the level of data curation, validated statistical accuracy of constituting models, transparency of the data, modeling methods and software tools, and public accessibility.
Unfortunately, there were numerous problems with the collected raw data. For instance, many numerical data were represented as string variables, the units of measurement were not standardized across the datasets, and there were many "free text" data. Therefore, we extensively cleaned and standardized all the data and converted measurements to the same units in each dataset. We also used regular expressions to find important features for the database that were described in text format; this was key to endpoint classification into GHS hazard classes.
Following this laborious data preparation and standardization, we performed both chemical and biological data curation. This attention to detailed data curation at different levels of the data preparation protocol is, unfortunately, uncommon in computational chemical toxicology, as we noted previously.37
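To make the text-cleaning step concrete, the sketch below shows one way a dose reported as free text could be parsed with a regular expression and converted to mg/kg. The pattern, the supported units, and the example strings are assumptions for illustration; they are not the expressions used in this work.

```python
import re

# Hypothetical pattern: optional qualifier, a number (with optional thousands
# commas), and a mass-per-kg unit. Decimal commas are deliberately rejected.
_DOSE = re.compile(
    r"(?P<qual><=|>=|<|>)?\s*"
    r"(?<![\d.,])(?P<value>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mg|g)\s*/\s*kg",
    re.IGNORECASE,
)
_TO_MG = {"mg": 1.0, "g": 1000.0}

def parse_dose(text):
    """Return (qualifier, value in mg/kg), or None if no dose is recognized."""
    match = _DOSE.search(text)
    if match is None:
        return None  # leave the record for manual review
    value = float(match.group("value").replace(",", ""))
    unit = match.group("unit").lower()
    return (match.group("qual") or "=", value * _TO_MG[unit])

print(parse_dose("LD50 > 2,000 mg/kg bw"))
print(parse_dose("LD50: 1.5 g/kg"))
```

Returning `None` for anything ambiguous, rather than guessing, keeps unparsed records visible so they can be inspected by hand.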

Data curation
Datasets were thoroughly curated following the workflows developed by us earlier.52 First, we excluded inconsistent data, which significantly decreased the size (number of data points) of our datasets (Figure 2). Data were categorized as inconsistent if they were not generated following the OECD protocols; if compounds were tested at few or only one concentration and could not be classified into GHS classes; if they were labeled as non-experimental (e.g., obtained using QSAR and/or read-across predictions and/or weight of evidence decisions); if the measurements differed from the standard protocols for the 6-pack endpoints (e.g., NOAEL, NOEL, LC10, LC0, etc.); if they lacked an identification number (CAS number, EC number); or if they referred to formulations or complex mixtures. Then, we performed chemical structure curation and removed mixtures, inorganics, and organometallic compounds, cleaned and neutralized salts, normalized the specific chemotypes, and applied special treatment to chemicals with multiple replicated records as follows: (i) when replicated records presented the same binary outcome, only one record was kept; (ii) when the majority of replicated records presented the same binary outcome and one had a different binary outcome, only one record with the agreeing binary outcome was kept; (iii) when replicated records had different binary outcomes, all of them were removed. All the curated data will be available in Supplementary Table S5.
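Rules (i) to (iii) for replicated records can be sketched as follows. This is one reading of the rules, labeled as an assumption: a compound is kept when all replicates agree, or when exactly one replicate dissents from a majority of at least two; every other disagreement removes the compound. The function name and record layout are hypothetical.

```python
from collections import Counter, defaultdict

def resolve_replicates(records):
    """records: iterable of (compound_id, outcome) with binary outcome 0/1.

    Returns {compound_id: outcome} for compounds retained after curation.
    """
    grouped = defaultdict(list)
    for compound_id, outcome in records:
        grouped[compound_id].append(outcome)

    kept = {}
    for compound_id, outcomes in grouped.items():
        counts = Counter(outcomes)
        if len(counts) == 1:
            kept[compound_id] = outcomes[0]  # rule (i): all replicates agree
            continue
        (major, n_major), (_, n_minor) = counts.most_common(2)
        if n_minor == 1 and n_major > 1:
            kept[compound_id] = major  # rule (ii): single dissenting record
        # rule (iii): otherwise the compound is dropped entirely
    return kept

# Toy records, invented for illustration only.
records = [("A", 1), ("A", 1), ("B", 1), ("B", 1), ("B", 0),
           ("C", 0), ("C", 1), ("D", 0)]
curated = resolve_replicates(records)
print(curated)
```

Under this reading, a discordant pair such as compound "C" is always removed, since neither record forms a majority.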

Skin sensitization (dataset A)
Skin sensitization data were compiled from two sources: (i) National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods on behalf of ICCVAM (http://ntp.niehs.nih.gov/go/40500) and (ii) the REACH study results database publicly available at https://iuclid6.echa.europa.eu/reach-study-results. The original ICCVAM database included 1,060 chemical records with local lymph node assay (LLNA) data. After curation, 516 unique compounds (332 sensitizers and 184 non-sensitizers) were retained. The REACH database initially comprised 10,588 records for 9,801 chemicals. The REACH dataset is composed of many types of assays and study categories. In vitro and "weight of evidence" categories were discarded. Data from different OECD (Organization for Economic Co-operation and Development) skin sensitization assays (OECD guidelines 406, 411, 429 and 442B)62-65 were available; only the data corresponding to LLNA assays (429 and 442B) were selected, resulting in 1,275 data points with LLNA records. After curation, 566 compounds (197 sensitizers and 369 non-sensitizers) were retained. Eventually, we merged the curated data from ICCVAM and REACH and examined the content of this combined data. There were 58 pairs of duplicate chemicals between these two datasets, and the sensitization potential of five of these pairs was different. These discordant records were removed, and only one record for each concordant pair of duplicates was kept. The merged dataset had 1,000 unique compounds (481 sensitizers and 519 non-sensitizers).

Skin irritation and corrosion (dataset B)
Experimental animal data on skin irritation and corrosion were retrieved from the REACH study results database (https://iuclid6.echa.europa.eu/reach-study-results). After removing inconsistent data, 1,631 out of the original 5,274 data points were left. After removal of mixtures, inorganics, and neutralization and removal of Not Classified counter-ions, 1,326 records remained. Among 124 duplicate chemicals in the dataset, 95 were concordant and 29 were discordant. All the discordant replicates and one of each concordant replicate were removed. The final dataset has 1,012 unique chemical compounds including 40 corrosives, 277 irritants, and 695 non-irritants. As there were only a few corrosive compounds in our dataset, we decided to merge the corrosive and irritant classes and model only irritant versus non-irritant compounds. We note that, given the data available, these models currently have limited regulatory value for compounds predicted as toxic, because regulators typically would like to see more granular measurement or prediction at the level of specific subcategories of toxicity. However, we emphasize that our models make accurate predictions of nontoxic compounds, thereby helping both regulators and respective regulated industries to develop safer chemicals. Our resulting dataset contains 317 irritants versus 695 non-irritants. Because the dataset is imbalanced, we applied an undersampling technique where the majority class was sampled in a way to match the number of records of the minority class. This was done by searching for the compounds in the majority class that had higher similarity (Tanimoto coefficient) with compounds in the minority class. The balanced dataset consisted of 554 compounds (277 irritants and 277 non-irritants).
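The similarity-guided undersampling described above can be sketched as follows. The text does not specify the exact selection procedure, so this version assumes that each majority-class compound is scored by its highest Tanimoto similarity to any minority-class compound and that the top-scoring compounds are kept until the classes are equal in size. Fingerprints are represented as plain sets of on-bit indices; in practice they would come from a cheminformatics toolkit. All names and toy values are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def undersample_by_similarity(majority, minority):
    """majority, minority: {compound_id: set_of_on_bits}.

    Returns the ids of the majority-class compounds retained, chosen as those
    most similar to the minority class, one per minority compound.
    """
    if not minority:
        return []
    scored = []
    for compound_id, fp in majority.items():
        best = max(tanimoto(fp, m_fp) for m_fp in minority.values())
        scored.append((best, compound_id))
    # Highest similarity first; ties broken by identifier for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [compound_id for _, compound_id in scored[:len(minority)]]

# Toy fingerprints, invented for illustration only.
irritants = {"irr1": {1, 2, 3}, "irr2": {4, 5, 6}}
non_irritants = {"non1": {1, 2, 3, 7}, "non2": {8, 9}, "non3": {4, 5}, "non4": {10}}
selected = undersample_by_similarity(non_irritants, irritants)
print(selected)
```

Keeping the majority-class compounds closest to the minority class concentrates the training set near the decision boundary, at the cost of discarding dissimilar negatives.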

Eye irritation and corrosion (dataset C)
The eye irritation and corrosion dataset was retrieved from the REACH study results database (https://iuclid6.echa.europa.eu/reach-study-results) and the literature. 53-61 We first curated data from each source separately, then merged the curated datasets and checked for overlapping compounds. After removing inconsistent data, 7,196 of the original 7,332 experimental animal data points for eye irritation and corrosion were left. After removal of mixtures, inorganics, and neutralization and removal of Not Classified counter-ions, 5,985 records remained. All the discordant chemical replicates and one of each concordant replicate were removed. The final dataset has 3,547 unique chemical compounds including 1,146 irritants and 2,401 non-irritants. Because the dataset is imbalanced, we applied an undersampling technique in which the majority class was sampled to match the number of records in the minority class. This was done by searching for the compounds in the majority class that had higher similarity (Tanimoto coefficient) to compounds in the minority class. The balanced dataset consisted of 2,292 compounds (1,146 irritants and 1,146 non-irritants).

Acute dermal toxicity (dataset D)
The acute dermal toxicity dataset was retrieved from the REACH study results database (https://iuclid6.echa.europa.eu/reach-study-results), the publicly available database ToxValDB, and a literature dataset. 27 After removing inconsistent data, 5,259 of the original 29,824 data points were left; the major reason for compound removal was the presence of many compounds without a defined LD50. After removal of mixtures, inorganics, and organometallic compounds, 4,601 records remained. Among 1,979 chemical replicates in the dataset, 1,836 were concordant and 143 were discordant. All the discordant replicates and one of each concordant replicate were removed. The final dataset has 2,622 unique chemical compounds including 382 dermally toxic compounds and 2,234 Not Classified compounds. Because the dataset is imbalanced, we applied an undersampling technique in which the majority class was sampled to match the number of records in the minority class. This was done by searching for the compounds in the majority class that had higher similarity (Tanimoto coefficient) to compounds in the minority class. The balanced dataset consisted of 764 compounds including 382 toxic compounds and 382 Not Classified compounds.

Acute inhalation toxicity (dataset E)
The acute inhalation toxicity dataset was retrieved from the REACH study results database (https://iuclid6.echa.europa.eu/reach-study-results) and from the publicly available database ToxValDB. After removing inconsistent data, only 2,061 of the original 8,176 data points were left. This dramatic reduction of the dataset was mainly due to the presence of many compounds without a defined LD50 and the absence of information regarding the exposure method used (gas, dust, or mist), which is essential for GHS classification. After removal of mixtures, inorganics, and neutralization and removal of Not Classified counter-ions, 1,637 records remained. Among 527 chemical replicates in the dataset, 501 were concordant and 26 were discordant. All the discordant replicates and one of each concordant replicate were removed. The final dataset has 681 unique chemical compounds including 345 toxic compounds and 336 Not Classified compounds.
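To illustrate why the exposure method is essential, the sketch below assigns a GHS acute inhalation category from a reported lethal concentration. It is an assumption-laden illustration rather than the classification procedure used in this work: the cutoffs are the GHS upper bounds for Categories 1 to 4 as we understand them (gases in ppmV; vapours and dusts/mists in mg/L), the optional Category 5 is ignored, and the names `GHS_INHALATION_LIMITS`, `ghs_inhalation_category`, and `lc50` are hypothetical. The binary cutoff used to separate toxic from Not Classified compounds is not stated here, so none is implied.

```python
# Assumed GHS upper bounds for acute inhalation Categories 1-4.
GHS_INHALATION_LIMITS = {
    "gas": (100, 500, 2500, 20000),      # ppmV
    "vapour": (0.5, 2.0, 10.0, 20.0),    # mg/L
    "dust_mist": (0.05, 0.5, 1.0, 5.0),  # mg/L
}


def ghs_inhalation_category(lc50, exposure_form):
    """Return the GHS category (1-4), or None if above all cutoffs.

    `lc50` must be expressed in the unit that matches `exposure_form`;
    an unknown exposure form raises KeyError, mirroring the fact that such
    records cannot be classified.
    """
    limits = GHS_INHALATION_LIMITS[exposure_form]
    for category, upper_bound in enumerate(limits, start=1):
        if lc50 <= upper_bound:
            return category
    return None


# The same numeric value maps to different categories by exposure form.
for form in ("gas", "vapour", "dust_mist"):
    print(form, ghs_inhalation_category(3.0, form))
```

Because an identical numeric value falls into a different category for each exposure form, a record without this information cannot be assigned a hazard class and has to be discarded.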

Acute oral toxicity (dataset F)
The acute oral toxicity dataset was retrieved from the NICEATM (National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods) workshop for the Collaborative Acute Toxicity Modeling Suite (CATMoS) 66 project, in which our team participated. After removing inconsistent data, 8,981 of the original 8,994 data points were left. After removal of mixtures, inorganics, and neutralization and removal of counter-ions, 8,979 records remained. A total of 406 chemical replicates were found in the dataset. All replicates with different toxicity calls were removed, and one record was kept for each set of duplicates with concordant toxicity calls. The final dataset has 8,442 unique chemical compounds including 4,803 toxic compounds and 3,639 Not Classified compounds.

CosIng (Dataset g)
CosIng is the European Commission database for information on cosmetic substances and ingredients (https://ec.europa.eu/growth/sectors/cosmetics/cosing_en). This dataset contained 5,166 chemical records with a defined chemical structure. After curation, 3,930 unique chemical substances were kept for prediction purposes.

REACH (Dataset e)
The REACH data come from registration dossiers submitted to ECHA by May 2019 (https://echa.europa.eu/information-on-chemicals/registered-substances). The database originally contained 20,000 substances, of which 15,438 are chemical records with a defined chemical structure. After curation, 10,465 unique chemical substances were kept for prediction purposes.

Figure 3. Cross-endpoint pairwise concordance of binary outcomes and number of overlapping compounds (in parentheses).

Figure 4. Toxicants identified in the literature that were absent in our modeling data. STopTox predictions for a given endpoint are listed below each structure. "Not classified" compounds were predicted as non-toxic.

Figure 5. Predicted probability maps and predictions of each model for N-phenyl-p-phenylenediamine. The predicted probability of a toxic effect is accompanied by the colored map