1 Introduction

With the increasing availability of reaction data, the development and growing accessibility of computational methods, and a drive to further automate design, make, test, analyze (DMTA) cycles within drug discovery [1], computer-assisted synthesis planning (CASP) has seen renewed interest [2]. This has been spurred by recent achievements in applying neural networks combined with search algorithms [3, 4], which build on breakthroughs in games such as chess and Go [5].

Retrosynthetic analysis, the task underlying CASP, refers to the strategy used by chemists to deconstruct a compound into simpler precursors. Like games, it is fundamentally a decision-making task in an intractable search space with complex optimal solutions, where evaluating the position and the available moves at each step is difficult (Table 1). A knowledge base of reactions that a chemist has learned throughout their career, coupled with extensive literature searching, forms the basis of an initial pattern recognition step, from which applicable reactions can be identified and prioritized.

Table 1. Comparison of search spaces in games vs retrosynthetic analysis.

Recent studies have shown that neural network policies framed as multi-class classification problems can identify likely reactions despite the noise in the knowledge base [3, 4]. However, we have found that they are heavily weighted towards frequently occurring reactions, owing to imbalanced datasets, and thus miss less frequent yet feasible alternatives. In the present study, we explore and tune neural network architectures with the aim of maximizing the number of synthetically feasible options at each step. This is supplemented by curation and analysis of the underlying knowledge base, which is extracted from available reaction datasets; few such datasets are publicly available.
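To make the classification framing concrete, the following minimal sketch ranks template classes by softmax probability and measures how much of a training set is covered by its most frequent templates. It is an illustration only, not the policy studied here: the function names, the toy logits, and the toy label distribution are all assumptions.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert raw scores into a probability distribution over template classes."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_templates(logits, k):
    """Return the indices of the k most probable templates, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


def coverage_of_most_common(labels, n):
    """Fraction of training examples that belong to the n most frequent templates."""
    counts = Counter(labels)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(labels)


# Hypothetical toy data: template 0 dominates the training labels.
toy_labels = [0] * 8 + [1] + [2]
toy_logits = [2.0, 0.5, 1.0]

dominant_share = coverage_of_most_common(toy_labels, 1)
best_two = top_k_templates(toy_logits, 2)
```

When a single template accounts for most of the labels, as in the toy data, a classifier trained on them tends to rank that template first regardless of the input, which is the imbalance effect described above.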

2 Methods

The US patent office extracts are a set of text-mined reactions from the patent literature [6]. Given the reaction SMILES, an extension of the SMILES notation used to represent molecular structures [7], we used a modified version of the algorithm of Coley and coworkers to extract reaction templates [8], that is, the transformations required to convert the reactants into the products. These templates form the core of our knowledge base, on which we can train a policy to enumerate retrosynthetic pathways in the form of a tree. To evaluate performance on a state-of-the-art model, we have opted to reimplement a variant of the policy used by Segler and Waller [4].
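As an illustration of the input format, a reaction SMILES has the form `reactants>agents>products`, with the molecules in each role separated by `.`. The sketch below only splits such a string into its roles; it does not perform template extraction, and the `Reaction` type and `parse_reaction_smiles` function are hypothetical names introduced for this example.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got {len(parts) - 1}")
    # Molecules within a role are separated by '.'; an empty role gives an empty list.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return Reaction(reactants, agents, products)


# Esterification of acetic acid with ethanol, written without agents.
example = parse_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
```

Note that this simple split treats every dot-separated fragment as its own molecule, so salts written as separate ions would need additional handling in practice.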

3 Results

We found that the applicability of the predicted templates from our trained policies varied with the template specificity (Fig. 1), the size of the template library, and the network architecture. Additionally, we developed a method for determining the selectivity of our templates during extraction and validation; we are currently investigating its effect on the trained model.

Fig. 1.
Assessment and comparison of the model’s ability to predict templates that can be successfully applied, for subsets of 10,000 randomly sampled compounds from each of ChEMBL and FDB17. The radius refers to the number of bonds from the reaction center that are considered.

Preliminary evaluation of the models was performed on a random selection of 10,000 compounds from each of ChEMBL [9] and FDB17 [10]. This enabled assessment of both the model’s predictive ability and the validity of the templates across a range of druglike and novel scaffolds. Using these assessment criteria, we aim to maximize the number of options available to our policy at each step in the subsequent tree search, thereby enabling the later prediction of full synthetic pathways, which is a necessity for accelerating automated DMTA cycles.
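One way such an assessment could be summarized is sketched below: for each compound, count how many of the predicted templates can actually be applied, then report the mean count and the fraction of compounds with at least one applicable template. This is an assumed formulation, not necessarily the metric used in this work; `applicability_summary` and the `is_applicable` callback are hypothetical, and in practice the callback would be backed by a cheminformatics toolkit rather than a lookup table.

```python
from typing import Callable, Dict, List


def applicability_summary(
    predictions: Dict[str, List[str]],
    is_applicable: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Summarize how many predicted templates can be applied to each compound."""
    if not predictions:
        raise ValueError("predictions must contain at least one compound")
    # Number of predicted templates that apply, one entry per compound.
    n_applicable = [
        sum(1 for template in templates if is_applicable(compound, template))
        for compound, templates in predictions.items()
    ]
    n_compounds = len(n_applicable)
    return {
        "mean_applicable": sum(n_applicable) / n_compounds,
        "fraction_with_any": sum(1 for c in n_applicable if c > 0) / n_compounds,
    }


# Hypothetical toy data: compound and template identifiers are placeholders.
toy_predictions = {"A": ["t1", "t2", "t3"], "B": ["t1", "t4"], "C": ["t5"]}
toy_applicable = {("A", "t1"), ("A", "t3"), ("B", "t4")}

summary = applicability_summary(
    toy_predictions, lambda compound, template: (compound, template) in toy_applicable
)
```

Reporting both numbers separates a model that offers many options for a few compounds from one that offers at least one option for most compounds.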

Whilst our viewpoint on the performance of the models has shed new light on the way in which a model may be evaluated, there is still much detail to investigate. This paper introduces preliminary results for a template-based synthesis planning methodology; a more rigorous study of the problems faced in computer-assisted synthesis planning is currently underway. It will encompass larger datasets, template design, data curation, the network architecture, and the implementation of appropriate metrics.