Two-Stage Machine Learning-Based Approach to Predict Points of Departure for Human Noncancer and Developmental/Reproductive Effects

Chemical points of departure (PODs) for critical health effects are crucial for evaluating and managing human health risks and impacts from exposure. However, PODs are unavailable for most chemicals in commerce due to a lack of in vivo toxicity data. We therefore developed a two-stage machine learning (ML) framework to predict human-equivalent PODs for oral exposure to organic chemicals based on chemical structure. Utilizing ML-based predictions for structural/physical/chemical/toxicological properties from OPERA 2.9 as features (Stage 1), ML models using random forest regression were trained with human-equivalent PODs derived from in vivo data sets for general noncancer effects (n = 1,791) and reproductive/developmental effects (n = 2,228), with robust cross-validation for feature selection and estimating generalization errors (Stage 2). These two-stage models accurately predicted PODs for both effect categories with cross-validation-based root-mean-squared errors less than an order of magnitude. We then applied one or both models to 34,046 chemicals expected to be in the environment, revealing several thousand chemicals of moderate concern and several hundred chemicals of high concern for health effects at estimated median population exposure levels. Further application can expand by orders of magnitude the coverage of organic chemicals that can be evaluated for their human health risks and impacts.


Feature Preprocessing Steps
The QSAR models for predicting points of departure (PODs) consisted of a pipeline of feature preprocessing steps and a machine learning estimator (e.g., random forest) (Figure 1B).
The following preprocessing steps were involved:
1. Remove zero-variance features.
2. Exclude any features with over 30% missing values.
3. Apply a power transform to continuous features to make them more Gaussian-like. The Yeo-Johnson transform was applied, which supports both positive and negative values.1
4. Impute missing values with the median for a given feature.
5. Center and scale continuous features using Equation S1. The median and median absolute deviation (MAD) were used for robustness to potential outliers. Given a continuous feature $\vec{x}_n$ with $I$ samples $x_{1,n}, x_{2,n}, \ldots, x_{I,n}$:

$$x'_{i,n} = \frac{x_{i,n} - \mathrm{median}(\vec{x}_n)}{\mathrm{MAD}(\vec{x}_n)} \quad \text{(Equation S1)}$$

Where:
- $x_{i,n}$ is the original value of the ith sample for the nth feature.
- $x'_{i,n}$ is the centered and scaled value.
Note that centering and scaling do not affect decision-tree-based methods, such as Random Forest, but such regularization may improve performance for other regression-based methods.
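The robust centering and scaling in step 5 (Equation S1) can be illustrated with a minimal, stdlib-only Python sketch. This is an illustration of the transformation described above, not the study's actual implementation:

```python
from statistics import median

def robust_scale(values):
    """Center and scale a continuous feature with the median and
    median absolute deviation (MAD), as in Equation S1."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Features with a null MAD cannot be scaled (see RDKit Features
        # section); here they are only centered.
        return [v - med for v in values]
    return [(v - med) / mad for v in values]
```

Unlike standardization with the mean and standard deviation, a single extreme value (e.g., 100 in a feature otherwise ranging 1-4) barely shifts the median and MAD, so the remaining samples keep a stable scale.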

Model Training Steps
The general steps for model training are listed below, followed by the specific parameters used. Figure S1 illustrates the training steps and includes pseudocode representing the algorithm:
- For each repetition r in {1, 2, …, R} and each fold k in {1, 2, …, K}:
  • Split the full dataset into one test set, k, with the remaining data as the training set.
  • Train a baseline model m on the training set.
  • Evaluate m on the test set using the root-mean-squared error (RMSE) as the reference score s.
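The repeated K-fold loop above can be sketched in plain Python. This is a hedged illustration only: the `fit` callable is a hypothetical stand-in for the study's random forest estimator, and `rmse` mirrors the reference score s:

```python
import math
import random

def rmse(y_true, y_pred):
    """Root-mean-squared error used as the reference score s."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def repeated_kfold_scores(X, y, fit, R=50, K=5, seed=0):
    """For each repetition r and fold k, hold out fold k as the test set,
    train on the remaining data, and record the test-set RMSE."""
    rng = random.Random(seed)
    scores = []
    for r in range(R):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        folds = [idx[k::K] for k in range(K)]
        for k in range(K):
            test = folds[k]
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            model = fit([X[i] for i in train], [y[i] for i in train])
            preds = [model(X[i]) for i in test]
            scores.append(rmse([y[i] for i in test], preds))
    return scores
```

With R repetitions of K folds, this yields R x K held-out RMSE estimates, whose spread reflects the variability of the generalization error.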

Model Performance Metrics
Figure S2 illustrates the model evaluation scheme and includes pseudocode representing the algorithm. To quantify performance, we used the root-mean-squared error (RMSE), median absolute error (MedAE), and coefficient of determination (R2). RMSE is conceptually like a standard deviation with respect to the prediction errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{I}\sum_{i=1}^{I}\left(\hat{y}_i - y_i\right)^2} \quad \text{(Equation S3)}$$

Where:
- $\hat{y}_i$ is the predicted value of the ith sample.
- $y_i$ is the corresponding observed (or measured) value of the ith sample.

MedAE is a metric that is robust to outliers:

$$\mathrm{MedAE} = \mathrm{median}\left(\left|y_1 - \hat{y}_1\right|, \ldots, \left|y_I - \hat{y}_I\right|\right) \quad \text{(Equation S4)}$$

R2 represents the proportion of variance of Y that has been explained by the features for a given model:

$$R^2 = 1 - \frac{\sum_{i=1}^{I}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{I}\left(y_i - \bar{y}\right)^2} \quad \text{(Equation S5)}$$

Where:
- $\bar{y}$ is the arithmetic mean of the observed (or measured) values.

The best possible R2 score is 1, and a score of 0 would correspond to a constant model that always predicts the expected (average) value of Y. Note that in this formulation (unlike for linear regression), R2 can also be negative if a model is worse than the constant model.

Specifically, we used the median value for each chemical after conversion to human-equivalent LD50s. This sensitivity analysis evaluated the robustness of using predicted LD50s, which are available for many more chemicals than experimental LD50s.
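The three metrics above can be implemented directly from their definitions. The following stdlib-only sketch is illustrative and is not the study's code:

```python
import math
from statistics import mean, median

def rmse(y_true, y_pred):
    """Root-mean-squared error of the prediction errors."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))

def medae(y_true, y_pred):
    """Median absolute error: robust to outlying prediction errors."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination; negative when the model is worse
    than the constant (mean) model."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

For example, predicting a constant 2 for observations [1, 2, 3, 4] gives a MedAE of 1.0 and a negative R2, since the constant 2 explains less variance than the mean 2.5 would.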

CompTox Features
Combined features from OPERA and TEST (Toxicity Estimation Software Tool) from the CompTox Chemistry Dashboard (2.1.1) by the U.S. EPA.3 These features are described in a supplemental Excel file (Table S4). This sensitivity analysis evaluated the impact of using readily available preselected/pre-calculated features instead of generating features directly from the OPERA 2.9 software. For consistency, we preprocessed these features by excluding any chemicals that had been filtered out in the QSAR standardization workflow by Mansouri et al.4,5

RDKit Features
Used all two-dimensional descriptors from the RDKit Python library (2022.09.05). First, for each chemical, a molecule object was instantiated from the "QSAR-ready" SMILES string. Next, the function "CalcDescriptors" was applied to each molecule, resulting in 208 features. Some features had null median absolute deviations and were therefore not scaled in the preprocessing pipeline described in the section "Feature Preprocessing Steps."

No Imputation
Used only samples for which all OPERA 2.9 features had no missing values, so that no imputation was performed.
These sensitivity analyses assessed how the generalization error responds to different datasets, feature preprocessing steps, and machine learning estimators. Our baseline Final Model, described in the main text, involved feature selection among all 39 OPERA 2.9 features, imputation of missing values, and the Random Forest Regressor. All models were applied to the same chemicals, except that the model involving no imputation was restricted to those chemicals with no missing feature values (n = 184-227).

Figure S3. Distributions of raw OPERA 2.9 features.9,10 Continuous features are represented with histograms, whereas discrete features are represented with bar plots indicating the count of samples for each unique value. The data are shown for all chemicals in this study. Feature descriptions are included in a supplemental Excel file (Table S3).

Figure S4. Proportions of data completeness for OPERA 2.9 features.9,10 The figure is subdivided by the target effect category for training data chemicals (left and middle panels) and for the application chemicals (right panel) that were on the Merged NORMAN Suspect List (SusDat)11,12 and within the applicability domain of SEEM3,8 excluding any training chemicals. A vertical dashed line denotes the threshold above which features were excluded (see section, Model Training and Evaluation): biodegradation half-life for compounds containing only carbon and hydrogen (BioDeg_HalfLife_pred): 74-75% missing; Caco-2 permeability (CACO2_pred): 49% missing; rate constant for the atmospheric, gas-phase reaction with photochemically produced hydroxyl radicals (OH_pred): 44-45% missing. Features with no missing values are not shown. Feature descriptions are included in a supplemental Excel file (Table S3). Note: n, sample size.
Figure S7. Feature importance scores for the final model for general noncancer effects. Features were extracted from OPERA 2.9.9,10 These scores were used to select important features (see section, Model Training Steps). The feature selection scheme is illustrated in Figure S1. The boxes show the median and interquartile range with outliers omitted. Feature descriptions are included in a supplemental Excel file (Table S3). Note: RMSE, root-mean-squared error; MedAE, median absolute error; R2, coefficient of determination.

Figure S8. Feature importance scores for the final model for reproductive/developmental effects. Features were extracted from OPERA 2.9.9,10 These scores were used to select important features (see section, Model Training Steps). The feature selection scheme is illustrated in Figure S1. The boxes show the median and interquartile range with outliers omitted. Feature descriptions are included in a supplemental Excel file (Table S3). Note: RMSE, root-mean-squared error; MedAE, median absolute error; R2, coefficient of determination.

Figure S9. Feature importance scores for the replicate models for general noncancer effects. Features were extracted from OPERA 2.9.9,10 These scores were used to select important features (see section, Model Training Steps). The feature selection scheme is illustrated in Figure S1. The boxes show the median and interquartile range with outliers omitted. Feature descriptions are included in a supplemental Excel file (Table S3). Note: RMSE, root-mean-squared error; MedAE, median absolute error; R2, coefficient of determination.

Figure S10. Feature importance scores for the replicate models for reproductive/developmental effects. Features were extracted from OPERA 2.9.9,10 These scores were used to select important features (see section, Model Training Steps). The feature selection scheme is illustrated in Figure S1. The boxes show the median and interquartile range with outliers omitted. Feature descriptions are included in a supplemental Excel file (Table S3). Note: RMSE, root-mean-squared error; MedAE, median absolute error; R2, coefficient of determination.

Inputs and other options
Current model outputs are available for > 30,000 chemicals expected to occur in the environment11,12 that are within the applicability domain of SEEM3,8 based on input of chemical identifiers (DTXSID) (Figure 4). A graphical user interface will be made available for downloading predictions.

2.3 Model accessibility
The model will be publicly available upon publication.

Defined domain of applicability
3.1 Clear definition of the applicability domain and limitations of the model
The paper defines the applicability domain (AD) in two stages. First, chemicals must pass the "QSAR-ready" standardization workflow in order to be considered within the "general AD" of the model. Second, feature-specific ADs are generated from OPERA 2.9.9,10 Predictions can still be generated for chemicals outside this feature-specific AD with median imputation, but those features are flagged.

Appropriate measures of goodness-of-fit, robustness and predictivity
4.1 Goodness-of-fit, robustness
The paper transparently describes the statistical metrics and cross-validation method used to evaluate performance (Equations S3-S6 and Figure S2).

Predictivity
Extensive, two-stage cross-validation was used to estimate the predictive power (R2) and prediction errors (RMSE, MedAE) for an external dataset (Figure S2). Performance was also evaluated relative to reference values in the form of authoritative PODs (Figure S5).

Mechanistic interpretation
5.1 Plausibility of the mechanistic interpretation
The most important feature is consistently the QSAR-predicted LD50, derived from in vivo rat acute oral toxicity studies,15 which is an indicator of the acute mammalian potency of a chemical. Other important features are physical and chemical properties with clear interpretations (Figures S6-S10). Feature descriptions are included in a supplemental Excel file (Tables S3-S4).
Results of applying the (Q)SAR Assessment Framework to our modeling framework (Figure 1), demonstrating how our framework conforms to general principles and criteria for use of QSAR models.17

Figure S13. Pairwise scatterplots and kernel density estimate plots for selected features.

- For each feature n and permutation repetition p in {1, 2, …, P}:
  • Permute the feature values.
  • Evaluate the model with the permuted feature.
  • Compute the RMSE score for the permuted model, $s'_{n,k,p}$.
  • Compute the raw importance score $i_{n,k,p}$ for feature n using Equation S2:

$$i_{n,k,p} = s'_{n,k,p} - s \quad \text{(Equation S2)}$$

- Aggregate the importance scores across repetitions R, folds K, and permutations P to form a vector $\vec{i}_n$.
- Select the top 10 features with the largest median of their respective $\vec{i}_n$.
- Train the final model using all samples and the top 10 features.

Parameters used: K = 5 (empirically shown to yield balanced bias-variance test error rate estimates),2 R = 50, and P = 5.
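The permutation-importance step can be sketched in plain Python. This is a stdlib-only illustration assuming the difference form of Equation S2 (permuted score minus reference score); `model` and `score` are hypothetical stand-ins for the study's fitted estimator and RMSE scorer:

```python
import math
import random

def rmse(y_true, y_pred):
    """Root-mean-squared error, used as the score."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(model, X, y, score, n_features, P=5, seed=0):
    """Raw permutation importance as in Equation S2: for each feature,
    shuffle its column, re-score the model, and record the increase over
    the unpermuted reference score."""
    rng = random.Random(seed)
    s_ref = score(y, [model(row) for row in X])
    importances = {n: [] for n in range(n_features)}
    for n in range(n_features):
        for p in range(P):
            col = [row[n] for row in X]
            rng.shuffle(col)
            X_perm = [row[:n] + [col[i]] + row[n + 1:] for i, row in enumerate(X)]
            s_perm = score(y, [model(row) for row in X_perm])
            importances[n].append(s_perm - s_ref)  # Equation S2
    return importances
```

A feature the model relies on yields large positive scores when shuffled, while an uninformative feature leaves the score essentially unchanged.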

Figure S1. Overview of model training with feature selection. The top panel shows pseudocode representing the algorithm in Python. The corresponding components are illustrated in Panels 1-4. The general training steps are listed in the section above (Model Training Steps).

Figure S2. Overview of model evaluation. The top panel shows pseudocode representing the algorithm in Python. The corresponding components are illustrated in Panels 1-3. The performance metrics are defined in the section above (Model Performance Metrics). Figure S1 shows an overview of model training with feature selection.

Figure S5. Model performance benchmarking. Point of departure estimates are compared against authoritative values. "ToxValDB Surrogate" refers to the surrogate values from Table S5 of Aurisano et al. (2023).13 "QSAR" refers to the final model developed in this study, described in the main text. "ToxCast/httk" refers to the combination of high-throughput in vitro bioactivity data with toxicokinetic data using reverse dosimetry. Specifically, these values are the PODNAM,50 values from Table S2 of Paul Friedman et al. (2020).14 The figure is further subdivided by the target effect category from left to right. Note: POD, point of departure; QSAR, quantitative structure-activity relationship; RMSE, root-mean-squared error; MedAE, median absolute error; R2, coefficient of determination; n, sample size.

Figure S6. Frequency of features deemed important across replicate models. Features were extracted from OPERA 2.9.9,10 The figure is subdivided by the target effect category from left to right. The x-axis represents the number of times each feature was deemed important across the cross-validated replicate models illustrated in Figure S2. The feature selection scheme is illustrated in Figure S1. Features present in the final models are highlighted with a distinct color. The remaining important features were excluded from the final models to avoid overfitting. Feature descriptions are included in a supplemental Excel file (Table S3).

Figure S11. Cumulative distributions of points of departure across different data sources. "Authoritative" refers to the values from authoritative and regulatory assessments from Figure S5 of Aurisano et al. (2023).13 "ToxValDB Surrogate" refers to the surrogate values from Table S5 of Aurisano et al.13 "QSAR" refers to the final model developed in this study, described in the main text. The intersection of chemicals is shown in the top half of the figure. The bottom half shows all chemicals with original authoritative PODs (PODauthoritative), all chemicals from ToxValDB with surrogate PODs (PODsurrogate), excluding those chemicals with PODauthoritative values, and all "application chemicals" with QSAR-derived PODs (PODQSAR), excluding those chemicals in the other two datasets. The application chemicals were on the Merged NORMAN Suspect List (SusDat)11,12 and within the applicability domain of SEEM3.8 The figure is further subdivided by the target effect category from left to right. Note: POD, point of departure.

Figure S12. Predicted points of departure with feature selection versus without feature selection for all chemicals in this study. "Application chemicals" refers to those on the Merged NORMAN Suspect List (SusDat)11,12 and within the applicability domain of SEEM3,8 excluding any training chemicals. The figure is subdivided by the target effect category from left to right. Note: POD, point of departure; RMSE, root-mean-squared error; MedAE, median absolute error; R2, coefficient of determination; n, sample size.

Table S1. Descriptions of models included in the sensitivity analysis.

Table S2. OECD Checklist for the assessment of (Q)SAR models.