Discovery of new STAT3 inhibitors as anticancer agents using ligand-receptor contact fingerprints and docking-augmented machine learning

STAT3 belongs to a family of seven vital transcription factors. High levels of STAT3 are detected in several types of cancer; hence, STAT3 inhibition is considered a promising anticancer therapeutic strategy. In this work, we used multiple docked poses of STAT3 inhibitors to augment the training data for machine learning QSAR modeling. Ligand–Receptor Contact Fingerprints and scoring values were used as descriptor variables. Escalating docking-scoring consensus levels were scanned against orthogonal machine learners, and the best learners (Random Forests and XGBoost) were coupled with a genetic algorithm and Shapley additive explanations (SHAP) to identify the critical descriptors that determine anti-STAT3 bioactivity, which were then translated into pharmacophore model(s). Two successful pharmacophores were deduced and subsequently used for in silico screening of the National Cancer Institute (NCI) database. A total of 26 hits were evaluated in vitro for their anti-STAT3 bioactivities. Of these, three hits of novel chemotypes showed cytotoxic IC50 values in the nanomolar to low-micromolar range (35 nM to 6.7 μM). However, two of them are potent dihydrofolate reductase (DHFR) inhibitors and are therefore expected to have significant indirect STAT3 inhibitory effects. The third hit (cytotoxic IC50 = 0.44 μM) is a purely direct STAT3 inhibitor (devoid of DHFR activity) and caused, at its cytotoxic IC50, a more than two-fold reduction in the expression of STAT3 downstream genes (c-Myc and Bcl-xL). The presented work indicates that data augmentation using multiple docked poses is a promising strategy for generating valid machine learning models capable of discriminating active from inactive compounds.


Naïve Bayes
Naïve Bayes (NB) is a simple classifier that predicts class labels for external observations from the descriptor vectors of a limited set of training observations. The NB classifier presumes that each descriptor contributes independently to the probability that a given observation (i.e., docked pose) belongs to a particular class (i.e., active, intermediate, or inactive) 23 . The probability that an observation belongs to a given class is therefore the product of the individual probabilities of that class within each descriptor 24 .
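The independence assumption described above can be illustrated with a minimal sketch of a Bernoulli naïve Bayes classifier over binary fingerprint bits. This is an illustration only, not the implementation used in this work: the function names, the Laplace smoothing constant `alpha`, the toy fingerprints, and the two-class labelling are all assumptions made for the example.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit per-class priors and per-bit 'on' probabilities (Laplace-smoothed)."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        p_on = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                for j in range(n_features)]
        model[c] = (prior, p_on)
    return model

def predict_proba(model, x):
    """Posterior per class: prior times the product of per-descriptor terms."""
    log_post = {}
    for c, (prior, p_on) in model.items():
        lp = math.log(prior)
        for bit, p in zip(x, p_on):
            # Each descriptor contributes independently (the 'naive' assumption).
            lp += math.log(p if bit else 1.0 - p)
        log_post[c] = lp
    # Normalise in a numerically stable way.
    m = max(log_post.values())
    unnorm = {c: math.exp(lp - m) for c, lp in log_post.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical toy data: rows are docked poses, columns are contact bits.
X_demo = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_demo = ["active", "active", "inactive", "inactive"]
nb_model = train_bernoulli_nb(X_demo, y_demo)
probs = predict_proba(nb_model, [1, 1, 0])
print(probs)
```

Working in log space avoids numerical underflow when the product runs over many descriptors.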

Probabilistic Neural Network PNN
PNN is a feed-forward neural network used for classification and pattern recognition. It is formed by replacing the sigmoid activation function with an exponential function, which allows it to compute nonlinear decision boundaries approaching the Bayes optimal. A PNN classifier, which has four layers, can map any input pattern to any number of classes 25,26 . PNN models were built using the default settings of the PNN learner node within KNIME Analytics Platform (Version 4.3.3).
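As a rough illustration of the four-layer structure (input, pattern, summation, and output layers), the following sketch implements a classic Parzen-window PNN with a Gaussian (exponential) kernel. It is not the KNIME node's algorithm or its default settings; the smoothing width `sigma`, the Euclidean distance, and the toy data are assumptions.

```python
import math

def pnn_predict(X_train, y_train, x, sigma=1.0):
    """Classify x with a Parzen-window PNN; returns (label, class scores)."""
    sums, counts = {}, {}
    for xi, yi in zip(X_train, y_train):
        # Pattern layer: one exponential unit per training observation.
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        activation = math.exp(-d2 / (2.0 * sigma ** 2))
        # Summation layer: accumulate activations per class.
        sums[yi] = sums.get(yi, 0.0) + activation
        counts[yi] = counts.get(yi, 0) + 1
    scores = {c: sums[c] / counts[c] for c in sums}
    # Output layer: pick the class with the largest averaged activation.
    return max(scores, key=scores.get), scores

# Hypothetical two-descriptor toy set.
X_pnn = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y_pnn = ["inactive", "inactive", "active", "active"]
label_a, _ = pnn_predict(X_pnn, y_pnn, [0.2, 0.4])
label_b, _ = pnn_predict(X_pnn, y_pnn, [4.8, 5.5])
print(label_a, label_b)
```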

k-Nearest Neighbors (kNN)
The kNN classifier relies on a distance-based learning methodology that predicts the activity of an unknown member from the bioactivities of a certain number (k) of its nearest neighbors in the training set. In this classifier, similarity is measured by a distance metric 17 . We implemented the kNN Learner node within KNIME Analytics Platform (Version 4.3.3). The value of k was scanned from 3 to 5. The runs were repeated to cover all possible combinations of k with the influence of neighbors set to be either distance-dependent or distance-independent.
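The scan over k and over the two neighbor-weighting modes can be sketched as follows. This is a minimal illustration, assuming a Euclidean distance metric and inverse-distance weighting, neither of which is specified above; the function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Vote among the k nearest training points, optionally weighted by 1/distance."""
    neighbors = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))
    votes = defaultdict(float)
    for d, label in neighbors[:k]:
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical toy set: one very close active, several more distant inactives.
X_knn = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
y_knn = ["active", "inactive", "inactive", "inactive", "inactive"]
query = [0.1, 0.1]

# Scan k = 3..5 with and without distance weighting.
knn_results = {
    (k, weighted): knn_predict(X_knn, y_knn, query, k=k, weighted=weighted)
    for k in (3, 4, 5)
    for weighted in (False, True)
}
for setting, label in knn_results.items():
    print(setting, label)
```

In this toy case the two weighting modes disagree for every k, which shows why both settings are worth scanning.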

Multilayer perceptron (MLP)
MLP, which is a modification of the standard linear perceptron, has a neural network architecture consisting of layers with several nodes each, where each node connects to nodes in the subsequent layer. The concept of a neural network (NN) is that each input into a neuron has its own weight, which is adjusted to train the network. Between the input and output layers, there may be several hidden layers 27 .
We implemented the MLP learner node in its default settings within the KNIME Analytics Platform (Version 4.3.3).
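To make the role of weights and hidden layers concrete, the sketch below runs a forward pass through a tiny MLP with one hidden layer. The weights are set by hand (not trained) so that the network reproduces XOR, a mapping that a single linear perceptron cannot represent. The architecture, the weights, and the function names are assumptions for illustration and are unrelated to the KNIME node defaults.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: each node takes a weighted sum of all inputs."""
    return [sigmoid(sum(w * v for w, v in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

def mlp_forward(x):
    # Hidden layer: node 1 behaves like OR, node 2 like NAND (hand-set weights).
    hidden = dense(x, [[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0])
    # Output layer: behaves like AND of the two hidden nodes.
    return dense(hidden, [[20.0, 20.0]], [-30.0])[0]

xor_outputs = {(a, b): mlp_forward([a, b]) for a in (0, 1) for b in (0, 1)}
for pair, out in xor_outputs.items():
    print(pair, round(out, 4))
```

During training, these weights would be adjusted iteratively rather than fixed by hand.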

SM3. Biological evaluation of the captured hits

Cell viability assay (MTT) and IC50 determination
The MTT assay is a rapid colorimetric test based on the capacity of mitochondrial dehydrogenase enzymes to transform 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) into a purple formazan precipitate. After the formazan crystals have been dissolved, the optical density is measured with a multi-well plate reader. Owing to its simplicity and suitability for automation, MTT has emerged as the preferred technique for initial drug screening in cell lines to assess cell viability 54 .

To determine the IC50 of the NCI hits for the cells under study, an MTT assay was first performed.
Approximately 8 × 10³ cells per well were seeded into a 96-well plate (Corning, USA) and treated with different concentrations of the NCI ligands. Both treated and control cells were incubated at 37 °C in a 5% CO2 incubator for 72 h, after which the old media was aspirated and the MTT salt (Bioworld, USA) in 100 μl of fresh media was added to each well. The plates were then incubated at 37 °C for 3 h, and 50 μl of solubilization solution (DMSO) was added to each well. The absorbance of the solution was measured at 570 nm using a Glomax plate reader (Promega, USA) to determine cell viability.
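The text above does not state how IC50 values were derived from the absorbance readings. The sketch below shows one common approach under stated assumptions: viability is expressed as a percentage of the untreated control, and the IC50 is estimated by log-linear interpolation between the two concentrations that bracket 50% viability. The function names, the optional blank correction, and the example numbers are hypothetical and are not data from this study.

```python
import math

def percent_viability(a_treated, a_control, a_blank=0.0):
    """Viability of treated wells relative to untreated control (absorbance at 570 nm)."""
    return 100.0 * (a_treated - a_blank) / (a_control - a_blank)

def ic50_by_interpolation(concs, viabilities):
    """Estimate IC50 on a log-concentration scale; returns None if 50% is not bracketed."""
    points = sorted(zip(concs, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 50.0) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_ic50
    return None

# Hypothetical dose-response values (concentrations in μM, viability in %).
demo_concs = [0.01, 0.1, 1.0, 10.0]
demo_viab = [95.0, 75.0, 25.0, 5.0]
demo_ic50 = ic50_by_interpolation(demo_concs, demo_viab)
print(demo_ic50)
```

A nonlinear fit of a four-parameter logistic curve is a frequently used alternative when enough concentration points are available.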

Quantitative Polymerase Chain Reaction (qPCR) analysis
qPCR analysis was performed to determine the expression levels of the target genes (c-Myc and Bcl-xL).

[Figure: Hypo(6nuq-10)]