Main

Target-based high-throughput screening dominates the conventional drug discovery process. It has been the focus of computer-aided drug discovery for decades, including recent applications of deep learning; however, the readout from the modulation of a single protein by a chemical is poorly correlated with organism-level therapeutic effects or side effects. As a result, the failure rate from a lead compound generated from the target-based screening to approved drug is high. Phenotype-based screening has created renewed interests for identifying cell-active compounds but suffered from low throughput and difficulty in target deconvolution. A high-throughput, mechanism-driven phenotype compound screening method will therefore facilitate drug discovery and development.

Gene expression profiling has been widely used to characterize cellular and organismal phenotypes. Systematic analysis of genome-wide gene expression of chemical perturbations on human cell lines has led to considerable improvements in drug discovery and systems pharmacology. In particular, gene expression profiling can be applied to drug repurposing1,2,3,4, discovering drug mechanisms5, lead compound identification6 and predicting side effects for preclinical compounds7. The use of genome-wide chemical-induced gene expression was initially made possible by the appearance of Connectivity Map (CMap)8, which consists of gene expression profiles of five human cancer cell lines perturbed by ~1,300 compounds after 6 h; however, the limited data availability across cell types restricts the performances of the above-mentioned analyses, which heavily depend on the coverage of chemicals and human cell lines. To overcome this limitation, a novel gene expression profiling method, L1000 (which is an extension of the CMap project), was developed by the National Institutes of Health (NIH) library of integrated network-based cellular signatures (LINCS) programme9. After Phase I of LINCS, the L1000 dataset consists of ~1,400,000 gene-expression profiles on the responses of ~50 human cell lines to one of ~20,000 compounds across a range of concentrations. The L1000 dataset and its normalization versions10 were recently widely used in drug repurposing and discovery11,12. Despite these successes, there are several major problems when utilizing L1000. First, although the number of gene expression profiles is much larger than that in CMap, many missing expression values remain in the vast combinatorial space of chemicals and cell lines. Second, there are hundreds of millions of drug-like, purchasable chemicals that are potential drug candidates13. It is infeasible to experimentally test all of these chemicals for their chemical-induced gene expression profiles across multiple cell lines. Finally, due to various experimental problems (for example, the batch effect), many experiment measurements are not reliable (as shown in Supplementary Fig. 1). These serious obstacles will limit the effectiveness and scope of utilizing L1000 dataset in drug discovery. Predicting gene expression values for unmeasured and unreliable experiments is therefore necessary.

Missing entries in the combinatorial space is not a problem exclusive to the L1000 dataset. Before the appearance of L1000, several methods of imputing missing values had been proposed for gene expression datasets. We categorize these methods into two main approaches that pivot on the dependence of other information beyond gene expression data. The first approach does not use any extra information. Works following this approach include k-nearest neighbour (kNN)14, singular value decomposition14, least mean square15,16,17, Bayesian principal component analysis18, Gaussian mixture clustering19 and support vector regression20. The second approach uses additional information to predict expression profiles. For example, chemical structures are used to predict chemical-induced gene expressions but that work does not consider cell-specific information21.

The approaches described above are designed for matrix-structured data (that is, gene × experiment), whereas the L1000 dataset is formulated as tensor-structured data (that is, gene × chemical × cell × dosage × time) and therefore cannot be applied to capture high-dimensional associations that help to impute missing values for L1000. Several methods were proposed to predict gene expression profiles in the L1000 dataset. In particular, to deal with high-dimensional structured data, an extension of a linear regression model named polyadic regression is developed to capture interactions emerging across features22. Matrix completion methods are also adapted to handle tensor-structured gene expression data23,24.

The above methods for the L1000 dataset just focus on imputing the missing values of some gene expression profiles or the whole gene expression profiles of some missing experiments. They are not very useful in the real setting of drug discovery where the chemical-induced gene expression profile of new chemicals needs to be identified. This motivates us to solve a more practical but more challenging problem: predicting gene expression profiles for de novo chemicals (chemicals that do not appear in training data). Solving this problem is necessary as it helps to infer gene expression profiles of new chemicals without conducting experiments that require time and human resources. More importantly, this problem can be expanded to predict gene expression profiles for new cell lines, which can be difficult for measuring in in vitro environments. However, current computational approaches for predicting gene expression values for L1000 cannot work well in a de novo setting. In particular, the tensor completion approach cannot predict gene expression profiles for new chemicals due to inaccessibility to chemical features. Polyadic regression, theoretically, can predict gene expression profiles for high-dimensional data in a de novo chemical setting due to using chemical features; however, in practice, it is not feasible due to the huge computational resources required for handling high-dimensional data (that is, this method fails when applied to data in dimensions greater than three). There is therefore a strong incentive to develop a new and effective method that exploits high-dimensional data for predicting gene expression profiles for a de novo chemical setting.

To address the aforementioned problems, we design a mechanism-driven neural network-based model, DeepCE25, which captures high-dimensional associations among biological features, as well as non-linear relationships between biological features and outputs, to predict gene expression profiles when given a new chemical compound. Our proposed DeepCE considerably outperforms state-of-the-art models for predicting gene expression profiles in L1000, not only in a de novo chemical setting but also a traditional imputation setting. Several novelties in the architecture of the model contribute to the success of DeepCE. First, we leverage a graph convolutional network (GCN) to automatically extract chemical substructure features from data. Second, an attention mechanism is used to capture associations among chemical substructures and genes, and among genes in cell lines. Finally, gene expression values of all L1000 genes are predicted simultaneously from hidden features by a multi-output, multilayer feed-forward neural network. Aside from developing this neural network-based model, we propose a data augmentation method by which we can extract useful information from unreliable experiments in L1000 to improve the prediction performance of our model. We also verify the effectiveness of DeepCE by comparing the performances of several classification models trained on gene expression profiles generated from DeepCE and those trained on original gene expression profiles in L1000 for two downstream tasks: predicting the targets and indications of drugs. Finally, we assess the value of our proposed method for a challenging and urgent problem: finding treatment for COVID-19 by in silico screening all chemical compounds in DrugBank against COVID-19 patient clinical phenotypes. The prioritized lead compounds are consistent with existing clinical evidences. The source code of DeepCE and the generated gene expression profiles of all chemical compounds in DrugBank are publicly available for research purposes, which could make an important contribution to drug discovery and development in particular, and computational chemistry and biology research in general.

Chemical-induced gene expression prediction models and datasets

In this section we present datasets used in our study and our proposed model, DeepCE, as well as baseline models for predicting gene expression profiles, such as linear models, a vanilla neural network, kNN and tensor-train weight optimization (TT-WOPT) models. The general framework for training and testing these computational models for L1000 gene expression profile prediction is shown in Fig. 1. Basically, computational models take L1000 experimental information (that is, the chemical compound, cell line, time stamp and chemical dosage) from L1000 as inputs, transform them into numerical representations and then predict L1000 gene expression profiles on the basis of these representations. The details of the numerical feature transformation process for chemical and biological objects used in our study and model implementation of DeepCE and other baselines are shown in the Supplementary Notes 2 and 4. We also present the data augmentation method that extracts useful information from unreliable experiments in L1000 to improve the prediction performance of our models, and the evaluation method for our models.

Fig. 1: A general framework for training computational models for L1000 gene expression profile prediction and using them for downstream application (that is, drug repurposing for COVID-19 treatment).
figure 1

θ is the set of model parameters, f is the function of θ that maps experiment information to gene expression profiles, and l is the function of θ that computes the differences between predicted and ground-truth gene expression profiles. The objective for the learning process is to minimize the loss between predicted profiles and ground–truth profiles in the L1000 dataset. After training, the models are used to generate profiles for new chemicals in an external molecular database (DrugBank). These profiles are then used for in silico screening (comparing with patient gene expression) to find potential drugs for COVID-19 treatment.

Datasets

In the following paragraphs we present the details and usages of several biological datasets in our study, including L1000, STRING, DrugBank and transcriptome data of COVID-19 patients. We also provide a summary of these datasets in Supplementary Table 1.

Bayesian-based peak deconvolution of L1000 dataset

After the original version of L1000 was released9, many efforts have been made to improve the quality of this dataset. For example, instead of using a k-means clustering algorithm as per the original version, some works propose to use a Gaussian mixture model to enhance the accuracy of the peak deconvolution step26,27. One work, alternatively, develops a multivariate method called characteristic direction to compute gene signatures instead of using the moderated z-score from the original version10. In our study we conduct experiments on a Bayesian-based peak deconvolution L1000 dataset, which has been shown to generate more robust z-score profiles from L1000 assay data and therefore gives better representation for perturbagens28. In particular, we train and evaluate our proposed methods on level 5 data of this dataset. The gene expression profiles that result from experiments featuring the seven most frequent cell lines and six most frequent chemical dosages in the L1000 dataset are used to construct our gene expression dataset. We then select high-quality experiments from our dataset and split them into a high-quality training set as well as a development and testing set. We also construct the original training set by keeping unreliable experiments in our gene expression dataset and the augmented training set generated by our data augmented algorithm. The details of constructing these sets are described in the Supplementary Note 1. The statistics of these training, development and testing sets are shown in Supplementary Table 2.

STRING database for human protein–protein interactions

STRING29 is a multisource database of protein–protein interactions. These interactions—which can be known or predicted, direct (physical) or indirect (functional)—are collected from five main sources, including genomics context prediction, high-throughput laboratory experiments, conserved co-expression, automated text-mining and past knowledge databases. In our setting we extract the human protein–protein interaction network (that is, ~19,000 nodes (proteins) and ~12,000,000 edges (interactions)) from this database to compute vector representations for L1000 genes. The drug-target vector representations for chemical compounds used in our study are also computed from this human protein–protein interaction network. The details of generating these representations from the STRING database are shown in the Supplementary Note 2.

DrugBank database for drug–target interactions and disease predictions

DrugBank is a well-known, comprehensive database used in many bioinformatics and cheminformatics tasks30. This database consists of information about drugs and their targets. In our experiments we extract Anatomical Therapeutic Chemical (ATC) labels derived from the first level of the ATC tree and targets of drugs that appear in the L1000 dataset from DrugBank. There are 698 drug targets and 14 ATC labels in the extracted dataset. We select the most frequent ATC labels and drug targets—on the basis of their frequency as drug labels in this dataset—to form drug-target and ATC prediction datasets, respectively. These datasets are used to evaluate the performance of gene expression profiles generated from our models. We also predict gene expression profiles for all drugs in DrugBank and use them to screen potential candidates for COVID-19 treatment.

Patient expression in response to SARS-CoV-2 infection

Patient expression datasets for this study are downloaded from National Genomics Data Center (NGDC, PRJCA002273)31 and the National Center for Biotechnology Information (NCBI, GSE147507)32. While the former includes eight SARS-CoV-2 patients and twelve healthy samples, the latter has only one SARS-CoV-2 patient and two healthy samples. For each dataset, we use expression profiles from both SARS-CoV-2 patients and healthy negative controls for differential expression analysis. The first dataset can be thus considered as population-based gene expression analysis whereas the second dataset a patient-specific gene expression analysis. The DESeq233 package is used to generate the differential gene expression profiles of the patients. Not all L1000 genes appear in the result of DESeq2 package and therefore we only consider genes that appear in both the L1000 dataset and DESeq2 package when comparing with chemical-induced gene expression profiles.

Overall architecture of DeepCE

Our neural network-based model for L1000 gene expression profile prediction, DeepCE, consists of several components. First, we use a GCN to learn the numerical representation of a chemical compound from its graph structure, and a feed-forward neural network to learn numerical representations of the cell line and chemical dosage. We also use numerical representations for L1000 genes, which are derived from the human protein–protein interaction network (described in the Supplementary Note 2). These vector representations are then put into the interaction component to capture high-level feature associations including chemical substructure–gene and gene–gene feature associations. Finally, the prediction component takes the outputs of the interaction component as inputs to predict the gene expression values for all L1000 genes simultaneously. The overall architecture of DeepCE and its hyperparameters used in our experiments are shown in Fig. 2 and Supplementary Table 4, respectively. The following paragraphs describe each component of DeepCE in detail.

Fig. 2: The overall architecture of DeepCE.
figure 2

This model consists of three main components are as follows: the feature transformation component that uses GCN to generate features for chemical compound, pre-trained information to represent L1000 genes, and feed-forward neural network to generate features for cell and dosage; the interaction network that learns high-level feature associations (the details of the second layer, which has similar architecture to the first layer in the interaction network, are omitted to save space); the prediction network that predicts gene expression profiles from high-level features.

GCN for neural fingerprints

Data-driven chemical fingerprints were recently shown to be more effective than predefined chemical fingerprints (for example, PubChem, Extended Connectivity Fingerprint (ECFP)) for many biological prediction problems. We therefore propose to use a GCN to capture the chemical substructure information. The original GCN model for chemical fingerprints34 takes a graph structure of a chemical compound as input and updates vector representations for each node (atom) in the graph (chemical compound) from its neighbourhoods by convolutional operation. The vector for each node after convolutional operation can thus be seen as the representation of chemical substructures. The final vector (which is the sum of vectors of every node) is used as the chemical fingerprint. The GCN model used in our experiments is primarily based on that model but with a minor modification. In particular, we output vector representations for every node instead of one vector representation for the chemical compound as we want to model the associations of chemical substructure features with gene features. In our settings we use the GCN model with two convolutional layers (radius, R = 2). It means that the output vector from GCN for each atom represents the chemical substructure, which is a span of two-hop from that atom. The initial representation of the atom, which captures the symbol, degree, number of Hydro neighbourhoods and aromaticity of atoms, and the initial representation of bond, which captures type of bond are multi-hot vectors that have lengths of 62 and 6, respectively. The details of GCN model used in our experiments are shown in Supplementary Algorithm 1.

Multihead attention for gene–gene and chemical substructure–gene feature associations

Attention mechanisms where an element of one set selectively focuses on a subset of another set (attention) or its set (self-attention) on the basis of attention weights are used widely in neural network-based models and are effectively applied to many artificial intelligence tasks, including computer vision and natural language processing. In our experiments we propose to apply the attention method named multihead attention for modelling associations among gene features, and gene and chemical substructure features. Multihead attention was first proposed in the transformer model, which achieves state-of-the-art results for many natural language processing tasks35. Basically, each element in sets can be represented by a set of three vectors: query, key and value. An individual attention module is a function of mapping queries and sets of key–value pairs to output matrix computed by:

$${\mathrm{Attention}}({Q},{K},{V})={\mathrm{softmax}}\left(\frac{{QK}^{T}}{\sqrt{{d}_{k}}}\right){V}$$

where Q, K, V are matrices (sets) of queries, keys, values, respectively, T is a transpose operation, and dk is a scaling factor. Multihead attention focuses on different representation subspaces by concatenating several individual attention modules:

$${\mathrm{MultiHead}}({Q},{K},{V})={\mathrm{concat}}({\mathrm{hea{d}}}_{1},...,{\mathrm{hea{d}}}_{h}){{W}}^{O}$$

where \({\mathrm{hea{d}}}_{i}={\mathrm{Attention}}({Q}{{W}}_{i}^{Q},{K}{{W}}_{i}^{K},{V}{{W}}_{i}^{V})\). WO, WQ, WK, WV are learned parameters and h is the number of heads.

This multihead attention mechanism is the main ingredient used to construct the interaction component of DeepCE. In particular, the interaction component consists of two identical layers where outputs of the first layer are used as inputs for the second layer. For each layer, we use two separate multihead attention modules with four heads for each module to model associations among genes in gene set and among elements in gene set and chemical substructure set. The lengths of the query, key and value vectors are set to 512. Outputs from these two multihead attention modules are concatenated and put into normalization layer followed by feed-forward layer and another normalization layer. The abstract architecture of interaction component is shown in Fig. 2.

Multi-output prediction

The multi-output prediction component—a two-layer feed-forward neural network with a rectified linear unit (ReLU) activation function—takes input as the concatenation of the chemical neural fingerprint, the gene feature generated by the interaction component, the cell line and chemical dosage features, to predict gene expression values for all L1000 genes together as follows:

$${Y}={{W}}_{2}(\mathrm{{ReLU}}({{W}}_{1}{X}+{{\bf{b}}}_{1}))+{{\bf{b}}}_{2}$$

where W1, W2, b1, b2 are the weight matrices and bias vectors of this network. The output size of this feed-forward neural network is set at 978, which is the number of L1000 genes.

Objective function

The objective function used in DeepCE model is the mean squared error (MSE) between predicted and ground-truth gene expression values and is computed as follows:

$${{\bf{loss}}}_{\mathrm{DeepCE}}({{\Theta }})=\frac{1}{NM}\mathop{\sum }\nolimits_{i = 1}^{N}\mathop{\sum }\nolimits_{j = 1}^{M}{({z}_{i,j}-{y}_{i,j})}^{2}$$

where Θ is the set of parameters in the DeepCE model; N and M are the number of gene expression profiles in the dataset and number of L1000 genes, respectively; and zi,j and yi,j are ground-truth and predicted gene expression values, respectively, of the jth gene in the ith gene expression profile.

Baseline models

In this section we describe several baseline models used in our experiments including linear models, a vanilla neural network, kNN and TT-WOPT24.

Linear models

We experiment with a multi-output linear regression model and its regularization versions, including Lasso regression (L1 regularization) and ridge regression (L2 regularization) models. Similar to DeepCE, input for these models is the concatenation of numerical representations for chemical, gene, cell line and chemical dosage features, but we use predefined chemical fingerprints and drug-target features instead of data-driven representations derived from GCN for chemicals. The details of these representations are described in the Supplementary Information. Multi-output linear models can be seen as one-layer feed-forward neural network without activation function.

Vanilla neural network

The vanilla neural network used in our experiments can be seen as a simpler version of the DeepCE model that does not include the interaction network component for modelling gene–gene and gene–chemical substructure feature associations and GCN for generating neural fingerprints. Input for this vanilla neural network is similar to its for linear models. The following layers in this network are similar to the prediction network component in DeepCE model, which is a two-layer feed-forward neural network with an ReLU activation function.

kNN

We also propose a kNN-based approach for gene expression prediction within a de novo chemical setting. In particular, a gene expression profile for a new chemical compound in one particular setting (that is, cell line, chemical dosage) is generated by averaging gene expression profiles of its nearest neighbourhoods in the training set in the same setting. In our research we experiment with different numbers of neighbourhoods from one to fifteen and different similarity measures including cosine, correlation, Jaccard and Tanimoto, as well as euclidean distance.

Tensor-train weight optimization

Tensor-train weight optimization (TT-WOPT) is a tensor completion approach proposed to retrieve missing values in tensor data from existing values. It has been shown to be effective for predicting values missing from the L1000 dataset, which can be formulated as a tensor-structure object without using additional information24. In our research, we conduct experiments to compare TT-WOPT with our proposed model, especially in a de novo chemical setting. As this model does not require additional information, inputs are therefore L1000 gene expression values formulated as a tensor.

Data augmentation

We can see from Supplementary Fig. 1 that only a small number of experiments in L1000 are reliable (average pearson correlation (APC) score ≥ 0.7) and thus it would be wasteful if we cannot exploit useful information from a large number of unreliable experiments. We show in Table 1 that simply adding unreliable experiments to the high-quality training set (original training set) makes the performances of our models worse. We thus propose the data augmentation method by which we can effectively exploit unreliable experiments to improve the performances of our models. We argue that although an experiment (level 5 data) is unreliable, not all of its bioreplicate experiments (level 4 data) are also unreliable and we will extract these reliable bioreplicate experiments by our proposed data augmentation method. The basic idea is that we first train our model on the high-quality training set and then generate predicted gene expression profiles for unreliable experiments. These predicted gene expression profiles are compared with their bioreplicate gene expression profiles and we incorporate bioreplicate gene expression profiles that have the similarity scores with their predicted gene expression profiles larger than the threshold. Supplementary Algorithm 2 presents this data augmentation method in detail. In our settings, the similarity score is Pearson correlation.

Table 1 Performances on a testing set of a vanilla neural network, kNN, linear models with different chemical features, TTWOPT and DeepCE with its simpler variants trained with different training sets

Performance evaluation

The Pearson correlation coefficient is used as the main metric to evaluate performances of models in our experiments. Correlation scores that measure the relationship between ground-truth and predicted gene expression profiles have been shown to be more effective than error measures for microarray data analysis36,37. Moreover, using Pearson correlation allows us to conduct unbiased evaluation for our models that are optimized for MSE. We calculate the average Pearson correlation for a dataset as follows:

$$r=\frac{1}{N}\mathop{\sum }\nolimits_{i = 1}^{N}\frac{\mathop{\sum }\nolimits_{j = 1}^{M}({z}_{i,j}-{\bar{z}}_{i})({y}_{i,j}-{\bar{y}}_{i})}{\sqrt{\mathop{\sum }\nolimits_{j = 1}^{M}{({z}_{i,j}-{\bar{z}}_{i})}^{2}}\sqrt{\mathop{\sum }\nolimits_{j = 1}^{M}{({y}_{i,j}-{\bar{y}}_{i})}^{2}}}$$

where \({z}_{i,j},{y}_{i,j},{\bar{z}}_{i},{\bar{y}}_{i}\) are ground-truth and predicted gene expression values of the jth gene in the ith gene expression profile, and ground-truth and predicted mean values of the ith gene expression profile, respectively.

Aside from the Pearson correlation, we also report the performances of models by other metrics including root mean squared error (r.m.s.e.), gene set enrichment analysis (GSEA)38,39 and Precision@k. Although Pearson correlation and r.m.s.e. capture the variations among all L1000 genes, GSEA and P@k (including both positive and negative P@k) only focus on the most important up- and down-regulated genes. Using multiple metrics thus allows us to measure the performances of models in different aspects. The details of these additional metrics are shown in the Supplementary Note 3.

Furthermore, we use the area under the receiver operating characteristic curve (AUC) to verify the effectiveness of these predicted profiles for downstream binary classification tasks including drug-target and ATC code predictions.

Results and discussions

The results and discussions below are mainly based on Pearson correlation; we also observe the same patterns via other metrics.

DeepCE considerably outperforms baseline models in the novel chemical setting

In this experiment we compare DeepCE and its simpler variants constructed by removing either the whole interaction component or just one part of it (that is, chemical substructure–gene or gene–gene feature association modules) with several baseline models including a vanilla neural network, kNN, linear models and TT-WOPT. Although TT-WOPT predicts output on the basis of gene expression values only, other models learn the relationship between experimental information and gene expression profiles to make predictions. For DeepCE, we use neural fingerprints whereas for other models, we use predefined fingerprints including PubChem and circular (ECFP6) fingerprints, and drug-target information including latent target interaction profile (LTIP)40 and our proposed drug-target feature to represent chemicals. All models are trained on the high-quality training set and are evaluated on the test set.

As listed in Table 1, the DeepCE model and its variants achieve order-of-magnitude improvements over baseline models. In particular, the DeepCE model considerably outperforms other models including a vanilla neural network, kNN, linear models and TT-WOPT by achieving a Pearson correlation of 0.4907 on the testing set (paired t-test, P-value < 4.63 × 10−15). In comparison with its simpler variants whose interaction components are removed, DeepCE also achieves better performance, indicating that the effectiveness of modelling chemical substructure–gene and gene–gene feature associations. Specifically, the performance of DeepCE decreases to 0.4620, 0.4477 and 0.4418 when removing the chemical substructure–gene feature association part (Deep CE−drug−gene attn), gene–gene feature association part (Deep CE−gene−gene attn) and the whole interaction component (Deep CE−attn) (paired t-test, P-value < 2.25 × 10−5), respectively. We also delve deeper into the performance of DeepCE by looking it over cell lines, chemical dosages and L1000 genes. The results of this analysis is shown in Supplementary Figs. 2 and 3. For baseline models, vanilla neural network and kNN achieve pretty good performances. Linear models including linear regression, Lasso and ridge regression do not work well for our problem. It indicates that the linear relationship is not sufficient to model the dependencies among variables in this dataset. TT-WOPT, which, as expected, does not leverage additional features beyond gene expression values to make predictions, does not work well for de novo chemical setting. In particular, it achieves a Pearson correlation of 0.0144, which is similar to randomness. We also provide error estimate for these performances by conducting cross-validation on the high-quality dataset. The results are shown in Supplementary Table 5.

DeepCE outperforms state-of-the-art methods in the imputation setting

We further investigate the performance of DeepCE for the traditional imputation setting that does not require the chemicals in the testing set to be different from those in the training set, and compare it with TT-WOPT, which has been shown to be effective for this setting. To do that, we randomly split the high-quality dataset to the new training, development and testing sets, and conduct the experiment on these sets. Note that, at this time, we split the dataset by gene expression profile instead of chemical compound. The details of the training, development and testing set for imputation setting are shown in Supplementary Table 3.

For the traditional imputation setting, we observe that DeepCE outperforms TT-WOPT by a large margin. In particular, DeepCE achieves a Pearson correlation of 0.7010 versus 0.5113 for TT-WOPT. This result indicates that DeepCE consistently achieves the best performances for both de novo chemical and traditional imputation settings by effectively leveraging features of chemical and biological objects including chemical compounds and genes.

Chemical similarity has an impact on prediction performance

To thoroughly investigate the prediction performance of our models, we explore the impact of chemical similarity between the testing and training sets. In particular, we compute the distance between one experiment in the testing set and its nearest-neighbour experiments in the training set, which are induced by the most similar chemicals (determined by comparing their fingerprints with the fingerprint of the chemical compound induced the experiment in the testing set) on the same cell line. The distance between the two experiments is the Tanimoto coefficient of PubChem fingerprints of their two chemicals, and the distance between the experiment in the testing set with its nearest-neighbour experiments in the training set is the average of distances between that experiment and each of its nearest neighbours. After computing the distances to the training set for all experiments in the testing set, we sort them by ascending order and compare the Pearson correlation scores of these experiments. We calculate the average Pearson correlation scores of all experiments in the testing set that have their distances to the training set smaller than the first quartile (Q1), from Q1 to the second quartile (Q2), from Q2 to the third quartile (Q3) and larger than Q3 of the sorted list. Figure 3 shows the average Pearson correlation scores with these distances of three models including DeepCE, a vanilla neural network and kNN; we can see the same pattern for all models that the prediction performances are higher when the experiments in the testing set are more similar to their nearest neighbour experiments on the training set. We also recognize that DeepCE achieves better performances than the vanilla neural network and kNN for all distance categories, especially for experiments that have their distances to the training set smaller than Q1.

Fig. 3
figure 3

Performances of DeepCE, vanilla neural network, and kNN with different distances among chemicals in the training and testing sets. d is the distance measured by the Tanimoto coefficient between chemical compounds in training and testing sets, and Q1, Q2, and Q3 are the first, second, and third quartiles of the sorted list of distances.

Data quality has a significant impact on prediction performance

Aside from the sparseness problem, the L1000 dataset also includes many unreliable gene expression profiles. To investigate the impact of noisy profiles on the prediction performances of our models, we train two baseline models (including a neural network and kNN) on different training sets generated by filtering unreliable gene expression profiles with different APC thresholds varying from –1 (original training set) to 0.7 (high-quality training set). The PubChem fingerprint is the chemical feature used in this experiment.

As shown in Fig. 4, all models have the same pattern. Starting at the threshold of 0.1, they achieve better performances on the testing set when the threshold is higher and the best setting is training our models on the high-quality training set (that is, a Pearson correlation of 0.3923 for the vanilla neural network and 0.3903 for kNN). For training on the original training set and other training sets generated by filtering unreliable experiments with thresholds <0.1, the ground-truth and predicted gene expression profiles are uncorrelated showing the randomness of the model predictions. These results indicate that unreliable data have a severely negative impact on prediction performances and removing this part from the dataset is necessary for achieving good performances.

Fig. 4
figure 4

Pearson correlation scores of vanilla neural network and kNN at different APC threshold settings. These models are trained on training sets generated by filtering unreliable experiments with different APC thresholds and then are evaluated on the high-quality testing set.

A novel data augmentation method improves the model performance

We propose the data augmentation method (described in detail in Supplementary Algorithm 2) to effectively exploit useful information from unreliable gene expression profiles. In this experiment we evaluate the impact of this method on our models. In particular, DeepCE trained on high-quality training set are used to generate gene expression profiles and the threshold for selecting bioreplicate profiles is 0.5, which is similar to the performance of DeepCE. The statistics of this augmented training set are shown in Supplementary Table 1.

The experimental results for training vanilla neural network, kNN and DeepCE on the augmented training set are shown in Table 1. We can see that the performances of all models trained on this augmented training set are improved in most cases. For example, the Pearson correlation of DeepCE increased from 0.4907 to 0.5014 (paired t-test, P-value <0.05). These results indicate that information extracted from unreliable gene expression profiles by our data augmentation method is effective for gene expression prediction.

The selection of chemical feature affects model performance

In this experiment we investigate the effectiveness of several chemical feature representations for our models. Models used in this experiment are a vanilla neural network for PubChem, ECFP fingerprints, our proposed drug-target features, and LTIP, and DeepCE model without interaction component for neural fingerprint. These models are trained on the high-quality training set. We also create random chemical features by generating random binary vectors whose size is similar to PubChem fingerprint from discrete uniform distribution.

Table 1 shows the performances measured by Pearson correlation of these models with different chemical feature representations. First, chemical features achieve much better performances than the random feature, indicating that chemical features capture important information about chemicals which is useful for predicting gene expression profiles. Second, DeepCE which uses neural fingerprint achieves the Pearson correlation of 0.4418 which is the best performance compared to other settings (paired t-test, P-value < 4.89 × 10−5). For other chemical features, biological-based features including drug-target feature and LTIP achieves slightly better performances than chemical-based features including PubChem and ECFP fingerprints. All of these observations are verified by the paired t-tests with P-values < 0.01. In fact, most of the P-values are much less than 0.01.

We also conduct an ablation study to investigate the impact of other features (that is cell line, dosage) to the predictive performance by removing them from the feature vectors. The results in Supplementary Table 6 show that removing these features decreases the performance of DeepCE and the worst scenario is when removing both cell line and dosage information.

DeepCE is effective in predictive downstream tasks

In this section we design an experiment to answer a question about whether these predicted gene expression profiles can provide added values for downstream prediction tasks, especially in the case that original gene expression profiles in L1000 dataset are unreliable. We first extract gene expression profiles of chemicals that do not have reliable experiments in L1000 (original feature set) as well as use a DeepCE model trained on a high-quality training set to generate gene expression profiles for these drugs (predicted feature set). We then use these sets as the features for drugs to train classification models for two tasks: ATC code and drug-target predictions. The details of constructing these datasets are presented in the Supplementary Note 1 and Supplementary Table 7. Finally, we train four popular classification models, including logistic regression, support vector machine, kNN and a decision tree, using fourteen different versions of chemical features (seven cell-specific features for each original and predicted feature sets) for fourteen binary classification tasks (that is, ten ATC codes and four drug-targets). For each experiment setting, we use cross-validation and report the average results.

The differences in AUC between training classification models with predicted and original feature sets for drug-target and ATC prediction tasks are shown in Extended Data Fig. 1. The improvements in AUC when using predicted features instead of original features are recognized in all cell-specific profiles (Extended Data Fig. 1a), all classification models (Extended Data Fig. 1b), eight-tenths of ATC codes (Extended Data Fig. 1c) and three-quarters of drug-targets (Extended Data Fig. 1d), and these improvements are significant (paired t-test, P-value < 4.87 × 10−5). The details of AUC scores for predicted and original features for each setting (that is, per model, cell line, ATC code and drug-target) are shown in Supplementary Table 8. These results indicate that we can substitute unreliable gene expression profiles in L1000 dataset with gene expression profiles generated from DeepCE to achieve better performances on downstream prediction tasks.

Drug repurposing for COVID-19

To further demonstrate the value of DeepCE, we use a chemical-induced gene expression profile to discover potential drugs for COVID-19 treatment. As the disease state and symptom of COVID-19 patients vary dramatically depending on many factors such as age, gender, underlying conditions and so on, we therefore evaluate the drug repurposing for COVID-19 task under two settings including population- (group of patients) and individual-based (individual patient) analysis. In particular, we first use the trained DeepCE on the high-quality part of L1000 dataset to generate predicted gene expression profiles for all of 11,179 drugs in the Drugbank database at the largest chemical dosage. For patient gene expression profiles, we use SARS-COV-2 gene expression datasets from NGDC and NCBI to calculate the differential gene expression profiles of the patients under population- and individual-based settings, respectively. Specifically, the DESeq2 package is used to generate the patient profiles from eight SARS-CoV-2 patients and twelve healthy samples (population-based), and from one a SARS-CoV-2 patient and two healthy samples (individual-based). We then screen drugs in Drugbank by computing Spearman’s rank-order correlation scores between their gene expression profiles with the patient gene expression profiles and select drugs that give the most negative scores as the potential drugs. Here we incorporate the gene expression profiles of A549—the cancerous lung tissue—next to the main seven cell lines in the high-quality dataset. Aside from the predicted profiles, we also include the gene expression profiles extracted from the high-quality part of L1000 dataset. For each cell line, we extract the top 100 drugs that have the most negative correlation scores with the patient profile as the potential drugs. Finally, we output drugs that have potential for COVID-19 treatment at all cell lines as the result of our screening process.

The results for the population- and individual-based drug repurposing are shown in Table 2 and Extended Data Fig. 2, respectively. COVID-19-induced acute respiratory failure is thought to be related to both direct viral pathogenicity and dysregulated inflammatory host response. As shown in Table 2, among the ten drugs we identified for population-based analysis, three drugs are antiviral drugs used in hepatitis C treatment and two drugs are immunosuppressive agents. In particular, voclosporin and cyclosporine are immunosuppressant and calcineurin inhibitors that share similar structures. Cyclosporine has been used to prevent organ rejection and to treat T-cell-associated autoimmune diseases, and recently shows potential in preventing the uncontrolled inflammatory response, SARS-CoV-2 replication and acute lung injury caused by COVID-1941,42,43,44. Calcineurin inhibitors have also been shown to be promising treatment for severe COVID-19 cases45,46. Alisporivir, which is a non-immunosuppressive analogue of cyclosporine with potent cyclophilin inhibition properties, is shown effective in reducing SARS-CoV-2 RNA production in Vero E6 cells47. Moreover, valspodar inhibits P-glycoprotein, which affects the transportation of immunosuppressive agents, and ceftobiprole medocaril is used in hospital- and community-acquired pneumonia48.

Table 2 The chemical structures, status and known uses of potential drugs for COVID-19 treatment (that is, drugs that appear in the list of top-100 drugs for all eight cell lines when comparing their cell-specific predicted gene expression profiles with the polulation-based patient profiles by Spearman’s correlation). Experimental and investigational drugs are those at the preclinical or animal testing stage and in human clinical trials, respectively

For individual-based analysis, among the fifteen drugs we identified (Extended Data Fig. 2), nine drugs are antiviral drugs and seven of them are used for treating hepatitis C as a NS5A inhibitor. They are similar to the top-ranked drugs that are identified from the population-based analysis. Two from hepatitis C treatment (elbasvir and velpatasvir) in particular have been shown as potential candidates for COVID-19 treatment by using other approaches49,50,51. Moreover, two drugs show anti-inflammatory or immune-regulating function and have the potential to regulate the immune response under COVID-19 infection. Laniquidar can suppress the function of P-glycoprotein 1 and affect transportation of immunosuppressive agents. The individual-based analysis also recognizes the drugs with the similar mode of actions. AMG-487 targets chemokine receptor CXCR3, which can regulate leukocyte trafficking. It is noted that all potential drugs here are not available in L1000 dataset, showing the effectiveness of DeepCE for phenotype compound screening under both population-based and individual-based settings.

Conclusion

Deep learning has attracted a great attention in drug discovery. Past and existing efforts mainly focus on accelerating compound screening against a single target52. However, such a one-drug one-gene paradigm proved to be less successful in tracking complex diseases. A systematic compound screening approach—which both takes information on a biological system into consideration and uses a chemical-induced systematic response as readouts—will provide new opportunities on discovering safe and effective therapeutics that module the biological system. In this study we have proposed DeepCE, a novel and robust neural network-based model for predicting chemical-induced gene expression profiles from chemical and biological objects, especially in a de novo chemical setting. Our model achieves state-of-the-art results in predicting gene expression profiles compared with other models not only in de novo chemical setting but also in the traditional setting. Furthermore, we have addressed the unreliable measurement problem of L1000 by introducing the data augmentation method to effectively exploit useful information from unreliable gene expression profiles to improve the prediction performances of our models. Furthermore, the downstream prediction task evaluation shows that training classification models with gene expression profiles generated from DeepCE achieve better performances than training them with unreliable gene expression profiles in L1000, indicating the added values of DeepCE for downstream prediction. Finally, DeepCE is shown to be effective in the challenge and urgent problem, finding treatment for COVID-19, by in silico screening all chemical compounds in DrugBank against COVID-19 patient clinical phenotypes (that is, comparing chemical-induced gene expression profiles generated from DeepCE with the patient profiles). In summary, DeepCE could be a powerful tool for phenotype-based compound screening.