Curiosity in exploring chemical space: Intrinsic rewards for deep molecular reinforcement learning

Computer-aided design of molecules has the potential to disrupt the field of drug and material discovery. Machine learning, and deep learning in particular, has been developing at a rapid pace in this field. Reinforcement learning is a particularly promising approach since it allows for molecular design without prior knowledge. However, the search space is vast, so reinforcement learning agents need to explore it efficiently. In this study, we propose an algorithm to aid efficient exploration. The algorithm is inspired by a concept known in the literature as curiosity. We show on three benchmarks that a curious agent finds better-performing molecules. This indicates an exciting new research direction for reinforcement learning agents that can explore the chemical space out of their own motivation, which may eventually lead to unexpected new molecules that no human has thought about so far.


Introduction
The development of new drugs and functional materials is an important but expensive undertaking. To bring the cost down, the scientific community introduced computational screening methods that frame the discovery process as an optimization problem of desired properties over a large molecule database or chemical space. This is also known as the inverse molecular design problem [1,2]. However, the search space is enormous [3,4], rendering an exhaustive search of all possible molecules infeasible. Thus, various artificial intelligence (AI) approaches have been developed to tackle this problem, including variational autoencoders (VAEs) [5,6], recurrent neural networks [7,8,9] and generative adversarial networks (GANs) [10,11]. These approaches are very promising, but they require a training dataset, sometimes a large one. Such a dataset does not exist for every class of molecules, or it is very small. Furthermore, the use of a dataset biases the model and thus makes it unlikely to find interesting molecules outside of the given data distribution.
Other approaches, such as genetic algorithms [12,13,14] or reinforcement learning (RL), allow designs potentially far away from any known data distribution [15,16,17]. Instead of a dataset, they need only a reward function that measures how good a generated molecule is. However, due to the vast chemical space, efficient exploration is necessary.
Here we take inspiration from the field of RL for video games, where the use of a method called curiosity [18] has led to impressive results without access to actual rewards from an environment [19]. Curiosity falls under the wider category of intrinsic motivation techniques [18,20,21], which are loosely modeled after human curiosity. In this work, we propose for the first time the use of intrinsic motivation for molecular design and show that the more curious agents in our study perform better than their less curious counterparts on three commonly used benchmarks.

Reinforcement Learning basics
Reinforcement learning is a technique used to find a policy $\pi_\theta$, parameterized by $\theta$, that maximizes the expected cumulative reward over state-action trajectories in an environment. Formally, the environment is described as a Markov decision process $M = (S, A, \mathcal{T}, \mu_0, \gamma, R, T)$. Here, $S$ is the state space, $A$ is the action space, $\mathcal{T}: S \times A \to S$ is the transition function, $\mu_0$ is the initial state distribution, $\gamma \in (0, 1]$ is the discount factor, $R: S \times A \to \mathbb{R}$ is the reward function, and $T$ is the maximal length of an episode. We define $r_t := R(s_t, a_t)$. For every policy $\pi$, the expected reward is defined as the reward that an agent will collect when it is in a certain state: $V^{\pi}(s_{t^*}) = \mathbb{E}_{\pi}\left(\sum_{t=t^*}^{T} \gamma^t R(s_t, a_t) \mid s_{t^*}\right)$. This quantity is also called the value of the state $s_{t^*}$. Analogously, we define the Q value of an action in a state as $Q^{\pi}(s_{t^*}, a_{t^*}) = \mathbb{E}_{\pi}\left(\sum_{t=t^*}^{T} \gamma^t R(s_t, a_t) \mid s_{t^*}, a_{t^*}\right)$. The goal is to find a policy so that $J(\theta) = \mathbb{E}_{s_0 \sim \mu_0}\left(V^{\pi_\theta}(s_0)\right)$ is maximized.
There are many different ways to train a policy. Throughout this paper, we use a policy gradient method, namely proximal policy optimization (PPO). We use the hyperparameters provided by the original paper [22], which were tuned on Atari games. The authors of [23] conducted a thorough hyperparameter search and confirmed that these hyperparameters are also nearly optimal for molecular design, and other authors [16] reported that they could not find better settings.
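As a minimal sketch of the quantity inside the expectation above, the following function computes the discounted sum of rewards for a single sampled trajectory. The function name `discounted_return` and the example rewards are illustrative assumptions and not part of the method; the discount exponent follows the convention of the definition in the text ($\gamma^t$ with the absolute time index $t$).

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0, t_start: int = 0) -> float:
    """Sum of gamma**t * r_t for t = t_start, ..., T over one trajectory.

    Averaging this quantity over many trajectories sampled from a policy
    gives a Monte Carlo estimate of the value of the state at t_start.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return sum(gamma ** t * r for t, r in enumerate(rewards) if t >= t_start)


# Illustrative trajectory with a single reward at the last step.
example_rewards = [0.0, 0.0, 1.0]
print(discounted_return(example_rewards, gamma=0.5))
```

With $\gamma = 1$, the setting used for the molecular rewards later in the paper, the return reduces to the plain sum of the rewards.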

Reinforcement Learning for molecular design
For molecular design, we define the state $s_t$ as the SELFIES [24] string constructed so far. SELFIES is a 100% robust string-based representation of molecules. This is in contrast to SMILES or other string-based representations, which frequently produce semantically or syntactically invalid strings. The robustness of SELFIES is advantageous because our RL agent produces a valid molecule at every step. The training therefore requires no post-processing or filters and is simpler. The action $a_t$ is the next character to be appended to the string. The molecule is finished either when the maximum number of steps is reached, which we set to 35 throughout our experiments, or when the agent emits the [STOP] symbol. For some property $p$ that we wish to optimize, and denoting the molecule at time step $t$ as $\mathrm{mol}(t)$, the reward at every time step can be formulated in two ways. Either as
$$r_t = \begin{cases} p(\mathrm{mol}(t)) & \text{if } t = T \\ 0 & \text{otherwise} \end{cases}$$
or as
$$r_t = p(\mathrm{mol}(t)) - p(\mathrm{mol}(t-1)).$$
For both formulations the cumulative reward is the property of the final molecule ($\sum_{t=0}^{T} \gamma^t r_t = p(\mathrm{mol}(T))$) for $\gamma = 1$. The first formulation is sparse, but only needs one evaluation of the reward per trajectory, while the second formulation is denser and therefore more informative, but also requires many more evaluations of the reward function. The chemical space is huge, and therefore efficient exploration is necessary to find good solutions. In the next section we discuss a technique from the reinforcement learning literature that is known to help with exploration and adapt it to our use case.
Figure 1: An illustration of curiosity: The agent generates molecules and gives them to the property prediction network. Initially, the predictions are wrong everywhere in the chemical space, but over time the network learns to predict the properties of molecules it has already encountered well. By using the prediction error as an intrinsic reward, the agent is incentivised to move to regions it has not yet explored, because the prediction network will make more errors there. The combination of the intrinsic and extrinsic rewards is then used as feedback to train the agent.
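The episode structure and the sparse and dense reward formulations described in this section can be illustrated with a small sketch. Everything here is a hypothetical stand-in: `toy_property` is not a real chemistry score, the token lists only mimic SELFIES symbols, and the dense reward is assumed to be the per-step change in the property, with the empty string as the starting point.

```python
from typing import Callable, List

MAX_STEPS = 35  # maximum episode length stated in the text

Tokens = List[str]


def toy_property(tokens: Tokens) -> float:
    """Hypothetical stand-in for p(mol): counts [S] tokens."""
    return float(sum(tok == "[S]" for tok in tokens))


def rollout(actions: Tokens) -> List[Tokens]:
    """Partial strings after each step; ends at [STOP] or after MAX_STEPS."""
    states: List[Tokens] = []
    current: Tokens = []
    for tok in actions[:MAX_STEPS]:
        if tok == "[STOP]":
            break
        current = current + [tok]
        states.append(current)
    return states


def sparse_rewards(states: List[Tokens], p: Callable[[Tokens], float]) -> List[float]:
    """One property evaluation: only the final molecule is scored."""
    if not states:
        return []
    return [0.0] * (len(states) - 1) + [p(states[-1])]


def dense_rewards(states: List[Tokens], p: Callable[[Tokens], float]) -> List[float]:
    """Assumed dense form: change in the property at every step."""
    values = [p([])] + [p(s) for s in states]
    return [after - before for before, after in zip(values, values[1:])]


states = rollout(["[S]", "[C]", "[S]", "[STOP]", "[S]"])
print(sparse_rewards(states, toy_property))
print(dense_rewards(states, toy_property))
```

For $\gamma = 1$ both lists sum to the property of the final molecule, provided the empty string has property zero; the dense variant calls the property function once per step instead of once per episode.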

Related Work
The literature on reinforcement learning often distinguishes between intrinsic and extrinsic rewards. An extrinsic reward is anything that comes directly from the environment. Intrinsic rewards are any rewards that are generated by the agent itself. Pathak et al. [18] introduce an intrinsic reward called curiosity. Curiosity guides the exploration of an agent into regions of the state space where it has not yet understood the effect of its actions on the environment. They introduced a separate neural network that tries to predict the next state the agent will be in, given the action it took. Then, the total reward the agent gets is
$$r(t) = r_{\text{extrinsic}}(t) + r_{\text{intrinsic}}(t),$$
where $r_{\text{extrinsic}}(t)$ is the normal reward provided by the environment, and $r_{\text{intrinsic}}(t)$ is the error of the prediction from the newly introduced neural network. The prediction network can also be seen as implicitly storing information in its weights about which areas of the state-action space have already been visited, since the more often the agent is in a certain region of that space, the smaller the prediction error is going to be. By exploiting this information, the agent will continue exploring new regions and not get stuck in local optima. This situation is depicted in Figure 1.

Curiosity for molecular design
Our goal is to adapt the work described in the previous section to molecular design. For this, we use a prediction network that predicts the property of the next molecule and use the prediction error as the intrinsic reward (see Figure 1). In order to test different variations of this idea, we formulate it generally as
$$r_{\text{intrinsic}}(t) = \mathrm{dist}\big(\hat{p}(\mathrm{mol}(t, \theta), \eta),\; p(\mathrm{mol}(t, \theta))\big).$$
Here, $\hat{p}(\cdot, \eta)$ is the prediction network parameterized by $\eta$ that tries to predict the real value of the considered property $p$ of the molecule, $\mathrm{mol}(t, \theta)$ is the molecule that the agent parameterized by $\theta$ generates at time step $t$, and $\mathrm{dist}(\cdot, \cdot)$ is a distance metric, for example L1 or L2. We also try an alternative formulation that we call "greedy curiosity". It uses a mask function to be curious only in promising directions (see appendix A). For training the predictor network we consider two options: either we update the predictor network each time the agent generates new molecules, or we collect them in a buffer and train on the whole buffer (see appendix B).
Unlike Pathak et al. [18], we do not predict the next state. The reason is that, given the current state (the string so far) and the next action (the character to append), predicting the next state (the string so far with the new character appended) does not require learning anything about the chemical space. Instead of storing information about the state-action space, our prediction network only remembers regions of the state space the agent has visited. We also consider a very simple alternative, $r_{\text{intrinsic, alternative}}$, where we explicitly store the last $N$ molecules in a buffer and base the reward on the average Tanimoto similarity (TS) between the Morgan fingerprints (MF) [25] of the new molecule and those of the stored molecules. This approach explicitly makes a choice of a particular distance metric. In contrast, curiosity defines one implicitly, because molecules with a low prediction error are close to previously encountered molecules in the learned problem-specific feature space of the predictor network. Since this feature space is by construction useful for predicting the property that the agent ought to optimize, we hope that the induced similarity metric is more useful for the optimization task at hand.
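The idea of using the prediction error as an intrinsic reward can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: a tiny linear model replaces the prediction network, the feature vector is a hypothetical placeholder for an encoded molecule, and "L2" is interpreted as the squared error because the property is a scalar.

```python
from typing import List, Sequence


def dist(prediction: float, target: float, metric: str = "L1") -> float:
    """Distance between predicted and true property (assumed L1 = |e|, L2 = e**2)."""
    error = abs(prediction - target)
    if metric == "L1":
        return error
    if metric == "L2":
        return error ** 2
    raise ValueError(f"unknown metric: {metric}")


class LinearPredictor:
    """Stand-in for the prediction network p_hat(., eta); a linear model for brevity."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.eta: List[float] = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.eta, features))

    def update(self, features: Sequence[float], target: float) -> None:
        """One gradient step on the squared prediction error."""
        error = self.predict(features) - target
        self.eta = [w - self.learning_rate * error * x for w, x in zip(self.eta, features)]


def curiosity_reward(predictor: LinearPredictor, features: Sequence[float],
                     true_property: float, metric: str = "L1") -> float:
    """Intrinsic reward: how wrong the predictor is about this molecule."""
    return dist(predictor.predict(features), true_property, metric)


# Visiting the same (hypothetical) molecule repeatedly makes it less rewarding.
predictor = LinearPredictor(n_features=2)
molecule_features, molecule_property = [1.0, 2.0], 3.0
rewards = []
for _ in range(3):
    rewards.append(curiosity_reward(predictor, molecule_features, molecule_property))
    predictor.update(molecule_features, molecule_property)
print(rewards)
```

The decreasing rewards show the intended effect: regions the predictor has already learned yield little intrinsic reward, so the agent is pushed towards molecules whose property it cannot yet predict.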
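The memory-based alternative can be sketched in a few lines. Several details are assumptions made for illustration: fingerprints are represented as sets of on-bit indices rather than computed with a cheminformatics toolkit, the reward is taken to be one minus the mean similarity to the stored molecules, and the class name `SimilarityMemory` is hypothetical.

```python
from collections import deque
from typing import Deque, FrozenSet, Iterable

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity of two bit sets: |a & b| / |a | b|."""
    union = len(a | b)
    # Two empty fingerprints are treated as identical (a convention chosen here).
    return len(a & b) / union if union else 1.0


class SimilarityMemory:
    """Keeps the last N fingerprints and scores new ones by dissimilarity."""

    def __init__(self, max_size: int) -> None:
        self.memory: Deque[Fingerprint] = deque(maxlen=max_size)

    def add(self, fingerprint: Iterable[int]) -> None:
        self.memory.append(frozenset(fingerprint))

    def novelty(self, fingerprint: Iterable[int]) -> float:
        """Assumed reward: 1 - mean Tanimoto similarity to the stored molecules."""
        fp = frozenset(fingerprint)
        if not self.memory:
            return 1.0
        mean_similarity = sum(tanimoto(fp, m) for m in self.memory) / len(self.memory)
        return 1.0 - mean_similarity


memory = SimilarityMemory(max_size=2)
memory.add({1, 2, 3})
print(memory.novelty({1, 2, 3}))  # identical to the stored molecule
print(memory.novelty({4, 5}))     # shares no bits with the stored molecule
```

Each call to `novelty` compares the new fingerprint with every stored one, so the cost grows linearly with the memory size, which is why a bounded `deque` is used here.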

Experiments
We test our method on 3 different tasks: optimizing the Quantitative Estimate of Druglikeness [26] (QED), optimizing penalized logP [5] (plogP), and maximizing the similarity (in terms of Tanimoto similarity of Morgan fingerprints) to the target molecule Celecoxib, a task from the Guacamol benchmark [27]. QED is a statistical measure that quantifies whether a specific set of 8 properties of a molecule fits those of 771 approved orally administered drugs. Penalized logP rewards the logP value and synthetic accessibility score of a molecule, and penalizes its ring count. This property is often used in the drug design literature. For penalized logP the best-known optimum is the sulfur chain [12]. It turned out that the carbon chain was a good local optimum in which all agents got stuck. We therefore provided the [S] symbol as the initial state so that some of the agents were able to find the sulfur chain. For each task we train 3 agents for all possible combinations of the curiosity weight $\alpha \in \{0, 0.01, 0.1, 1\}$, which scales the intrinsic reward, the use of $r_{\text{intrinsic, alternative}}$, the distance metric (L1 or L2), and whether or not greedy curiosity and the buffer are used. We divide the extrinsic and intrinsic rewards by their respective running averages so that their influence is on the same order of magnitude and the scaling factor $\alpha$ is comparable across experiments. The agent consists of an LSTM [28] with 64 units to encode the state and a linear network to predict the action. The prediction network has the same architecture, but does not share any neural network weights.
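The reward normalization described above might look like the following sketch. Two points are assumptions, since this section does not spell them out: the running average is taken over the absolute reward values seen so far, and the rewards are combined as extrinsic plus $\alpha$ times intrinsic. The names `RunningAverage` and `combined_reward` are illustrative.

```python
class RunningAverage:
    """Tracks the mean absolute value of a reward stream and rescales by it."""

    def __init__(self, eps: float = 1e-8) -> None:
        self.mean = 0.0
        self.count = 0
        self.eps = eps  # avoids division by zero when all rewards are zero

    def normalize(self, value: float) -> float:
        """Update the running mean with |value|, then return value / mean."""
        self.count += 1
        self.mean += (abs(value) - self.mean) / self.count
        return value / (self.mean + self.eps)


def combined_reward(r_extrinsic: float, r_intrinsic: float, alpha: float,
                    extrinsic_avg: RunningAverage, intrinsic_avg: RunningAverage) -> float:
    """Assumed mixing rule: normalized extrinsic + alpha * normalized intrinsic."""
    return extrinsic_avg.normalize(r_extrinsic) + alpha * intrinsic_avg.normalize(r_intrinsic)


# Rewards on very different scales contribute comparably after normalization.
extrinsic_avg, intrinsic_avg = RunningAverage(), RunningAverage()
for r_ext, r_int in [(0.5, 100.0), (0.7, 80.0), (0.6, 20.0)]:
    print(combined_reward(r_ext, r_int, alpha=0.1,
                          extrinsic_avg=extrinsic_avg, intrinsic_avg=intrinsic_avg))
```

Because both streams are rescaled to order one, a given value of $\alpha$ has a similar meaning for QED (values in [0, 1]) and for properties with a much larger range.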

Figure 2: Best molecule generated on the different tasks. Tasks from top to bottom: (a) QED, (b) penalized logP, (c) similarity. Left (with curiosity): best molecule generated by the curious agent. Right (without curiosity): best molecule generated by a non-curious agent. Note that no chemical stability or synthesizability filters were employed in this work, so the structures may look a bit strange to organic chemists.

Results
The averaged results of the 3 runs for the best-performing hyperparameter sets are shown in Tables 1a-1c for the 3 different tasks. Additionally, the best value for an agent without curiosity ($\alpha = 0$) and the best value for an agent using $r_{\text{intrinsic, alternative}}$ are shown.
The agents with curiosity perform best. Moreover, the best-performing agents always have the highest curiosity weight ($\alpha$) of all tested hyperparameter sets. For the plogP task, only two agents, both of which use curiosity, found the sulfur chain. This indicates that curiosity can indeed help to escape local optima. The alternative formulation of the intrinsic reward seems to help on the similarity task but not on the QED task, and it does not help to find the sulfur chain. The version where we optimize the predictor network after every step of the agent consistently performed better than the version where we used a buffer, which is probably because we trained the predictor only two times during the agent's lifetime. Figure 2 shows the best generated molecule for each task.
Please note that for this work we did not employ any synthesizability or stability screening of the final molecules. This results in structures that are not so pleasant to the eye of the trained chemist. Further work can add these as suitable additional terms in the reinforcement learning framework or alternatively as post-selection filters.

Conclusion
In this work, we develop the first curious agents in the domain of molecular design and show that they outperform their less curious competitors in three distinct molecular design tasks. Our results point towards a new, efficient RL-based exploration strategy for identifying high-performance molecules and compounds. Since the agent explores the chemical space without any prior knowledge, in the future this can lead to the discovery of new molecules completely different from the ones scientists know of today. Interesting future experiments could further investigate the curiosity rewards and explore how they are related to, or can be combined with, other exploration strategies, such as count-based exploration [29,30], parameter space noise injection [31], Go-Explore [32] and others.
Table 1: The best values of the generated molecules, averaged over the 3 runs for the 3 best-performing hyperparameter settings over all tasks. Additionally, the average best value of an agent without curiosity ($\alpha = 0$) and of one that uses $r_{\text{intrinsic, alternative}}$ are shown. The best agents all used the intrinsic reward ($\alpha = 1$).
These positive results indicate that curiosity, and intrinsic rewards in general, have the potential to significantly improve de novo molecular design. We imagine three very interesting domains of application. First, the exploration of the large chemical space could be improved, leading to better solutions. Second, if scientists are interested in exceptionally unique solutions that are very different from known structures, curiosity-based optimization techniques might be a way to obtain these. Lastly, our main motivation is to explore the potential advantage of intrinsic curiosity-based rewards in RL environments. We show in three examples how curious agents outperform their non-curious counterparts. As future research, it will be interesting to see how these insights can be combined to build algorithms [33] that compete in a wide variety of different molecular design tasks.
Importantly, to understand how to exploit the advantage of curiosity in realistic applications, we will need to train on other more complex molecular properties, as well as to make the model aware of synthesizability and stability constraints by adding filters [34] and/or numerical approximations [35] of the feasible regions within the chemical space. These two aspects, in general, should be a major area of focus for most of the work in this field.

B Curiosity training details
For training the predictor network we consider two options. The first is to update the predictor network after every episode with the new batch of generated samples. A potential downside of this is that the predictor might forget about older samples. The second option is to use a buffer, collect all samples, and train the predictor on all of them. One can either reinitialize the predictor every time before training, which makes it very resource-intensive, or one can do warm starts. With warm starts, however, old samples will be seen more often than new samples, leading to overfitting. Thus we opted for reinitializing and training the predictor only two times, after 200 and again after 500 episodes. In our setup, when using equation (1) we have access to the property of the molecule at every time step. Thus, we can use all molecules at every time step to train the prediction network. If we use equation (2) instead, we have access to molecules only at the end of each trajectory, and therefore use only the last molecule of the trajectory to train the predictor.
The alternative reward formulation has the downside that the runtime scales with the size of the memory and becomes the most time-consuming step even for a small number of stored molecules $N$. We therefore stored only the last 2 batches of molecules in the memory, which gave it about the same computational budget as the prediction network.
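The two training options and the choice of training samples can be summarized in a short sketch. The function names and the returned labels are hypothetical and only encode the schedule described above; the actual training loop is not shown.

```python
from typing import List, Sequence, Tuple

REINIT_EPISODES: Tuple[int, ...] = (200, 500)  # retraining points stated in the text


def predictor_action(episode: int, use_buffer: bool) -> str:
    """Decide what happens to the predictor after a given episode."""
    if not use_buffer:
        # Option 1: train on the newest batch after every episode.
        return "update_on_new_batch"
    if episode in REINIT_EPISODES:
        # Option 2: reinitialize and train on everything collected so far.
        return "reinit_and_train_on_buffer"
    return "collect_only"


def training_samples(trajectory: Sequence[str], property_every_step: bool) -> List[str]:
    """Select which molecules of a trajectory are used to train the predictor."""
    if property_every_step:
        return list(trajectory)    # every intermediate molecule has a known property
    return list(trajectory[-1:])   # only the final molecule was evaluated


print(predictor_action(200, use_buffer=True))
print(training_samples(["mol_0", "mol_1", "mol_2"], property_every_step=False))
```

With the buffer option the predictor stays fixed between the two retraining points, so its error on frequently visited molecules does not shrink in the meantime.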