SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors

There is growing evidence that machine learning can be successfully applied in materials science and related fields. However, datasets in these fields are often quite small ($\ll1000$ samples). As a result, the most advanced machine learning techniques remain neglected, as they are considered applicable to big data only. Moreover, materials informatics methods often rely on human-engineered descriptors, which must be carefully chosen, or even created, to fit the physicochemical property that one intends to predict. In this article, we propose a new method that tackles both the issue of small datasets and the difficulty of developing task-specific descriptors. The SMILES-X is an autonomous pipeline for molecular compounds characterisation based on an \{Embed-Encode-Attend-Predict\} neural architecture with a data-specific Bayesian hyper-parameters optimisation. The only input to the architecture, the SMILES strings, is de-canonicalised in order to efficiently augment the data. One of the key features of the architecture is the attention mechanism, which enables the interpretation of output predictions without extra computational cost. The SMILES-X shows new state-of-the-art results in the inference of aqueous solubility ($\overline{RMSE}_{test} \simeq 0.57 \pm 0.07$ mols/L), hydration free energy ($\overline{RMSE}_{test} \simeq 0.81 \pm 0.22$ kcal/mol, which is $\sim 24.5\%$ better than molecular dynamics simulations), and octanol/water distribution coefficient ($\overline{RMSE}_{test} \simeq 0.59 \pm 0.02$ for LogD at pH 7.4) of molecular compounds. The SMILES-X is intended to become an important asset in the toolkit of materials scientists and chemists. The source code for the SMILES-X is available at \href{https://github.com/GLambard/SMILES-X}{github.com/GLambard/SMILES-X}.


Introduction
In the fields of bio- and cheminformatics, machine learning (ML) algorithms combined with human-engineered molecular descriptors 1,2 have shown great potential in predicting physicochemical properties of molecular compounds. In practice, however, it is often necessary to run a blind scan through a large number of such combinations in order to find the most accurate inference model, and even this may not lead to success. Most of the descriptors are task- or domain-specific, which makes their use impossible for more general problems, such as virtual screening, similarity searching, clustering and structure-activity modelling [3][4][5][6].
For these purposes molecular fingerprints have been developed. A fingerprint is a binary representation of a molecule: its structural or functional features are translated into a string of bits in such a way that the fingerprint stays invariant to rotations, translations and property-preserving atomic permutations (see, e.g., extended circular fingerprints 7 ). Even though molecular fingerprints are known to be helpful for drug discovery or compound search among various databases, they may as well be detrimental to materials characterisation and design. Therefore, while both descriptors and fingerprints may be beneficial, they come with restrictions.

arXiv:1906.09938v2 [physics.comp-ph] 4 Jul 2019
In fields like materials science it is common to have datasets of roughly 1000 samples or fewer, which is considered to be too small for a direct deep learning application. Some research groups use neural architectures (NAs) for secondary tasks, such as building novel high-level features as non-linear combinations of molecular descriptors [8][9][10]. Others use NAs to automatically learn features based on 2D/3D images 11,12 , molecular graphs 13 , SMILES (simplified molecular input line entry system) [14][15][16], N-gram graphs 17 or a combination of the mentioned inputs 18 , similar to computer vision (CV). Still, none of them intends to design an NA for property prediction on small datasets. There are some works on transfer learning 11,19,20 , but the results vary greatly depending on the correlation between the tasks, which is often unknown a priori. Moreover, most of the NAs used in the fields of CV or natural language processing (NLP) are trained on big data and impose architectures that do not fit small datasets.
Aside from the lack of data, another bottleneck on the way to using NAs in physics and chemistry is the lack of interpretability. A method for explaining neural networks has been recently proposed 15 . It consists in training an additional neural network to generate a mask identifying the most important SMILES characters. Despite the respectable coherence in the interpretation of the chemical solubility, the explanation network is entirely correlated to its prediction network, which forces the training phase to be doubled for each dataset. Moreover, even though the explanation network makes it possible to identify the groups having the highest weight in the property prediction, there is no evidence that the original prediction network has also learned the known chemistry concepts needed for a proper characterisation.
In this article we propose a method that overcomes the issues of data scarcity, descriptor engineering and ambiguous prediction interpretation at the same time. The algorithm benefits from the natural ability of NAs to learn a suitable and task-specific representation of the data. It designs a simple yet effective NA dedicated to small datasets, based on the attention mechanism [21][22][23]. To achieve this, we borrowed the latest techniques from the CV and NLP fields to build an entirely autonomous system, the SMILES-X. To the best of our knowledge, this is the first time in materials science related fields that an NA is specifically designed to manage small datasets, and the first attempt to integrate an NLP-based attention mechanism for predicting physicochemical properties of molecular compounds. This mechanism reduces the number of trainable parameters and provides the interpretation of the results at no extra cost. The SMILES-X achieves state-of-the-art results, predicting any physicochemical property given the molecule's SMILES 24,25 as the sole input.
The structure of the article is as follows. First, we describe the entire pipeline of the SMILES-X in Section 2. The SMILES augmentation and formatting are detailed in subsections 2.1 and 2.2, respectively, while the procedures of building the NA frame and its data-specific optimisation are presented in subsection 2.3. Subsection 3.1 is dedicated to the performance of the SMILES-X on three benchmark datasets for regression tasks from MoleculeNet 26 : ESOL 27 , FreeSolv 28 and Lipophilicity 29 . The SMILES-X offers three modes of interpretation of its results, which are discussed in subsection 3.2. Finally, in Section 4, we conclude, discuss further possible improvements of the SMILES-X, and propose more potential target properties to be inferred using the algorithm.

The SMILES-X pipeline
The SMILES-X has been conceived to meet the following requirements: (i) The SMILES format is the only representation of a molecular compound; computable characteristics, such as fingerprints or physical descriptors, are left out. (ii) The SMILES canonicalization 24 is removed in order to exploit the full capacity of the molecular compound representation. (iii) The core architecture is simple enough to handle small datasets without sacrificing the prediction accuracy. (iv) Outcomes of the SMILES-X are interpretable. Figure 1 is a sketch of the main steps within the SMILES-X pipeline. The primary input is a list of SMILES strings with corresponding property values. Then, a splitting into training, validation and test sets is performed via equiprobable sampling. The subsequent steps are detailed below.

Augmentation
It has been shown in CV that data augmentation approaches such as flipping, rotation, scaling, cropping and other image transformations are effective in reducing the error rate on classification tasks and improving generalisation 30 . Here, we introduce a technique called SMILES augmentation, similar to Bjerrum 14 . The first step consists in removing the canonicalization 24 of the SMILES. Canonicalization is the default procedure to standardise the SMILES across databases, therefore removing it leads to an expanded number of individual SMILES representations. Then, augmentation is done by iterating over the following two steps: (i) Renumber the atoms of a given SMILES by rotation of their index. (ii) For each renumbering, reconstruct a grammatically correct SMILES under the condition of conserving the initial molecule's isomerism and prohibiting kekulisation 24,25 . In the end, one obtains an expanded list of SMILES together with their corresponding property and cardinality $n_{augm}(s_i)$ (number of augmentations for a SMILES $s_i$), if any. Duplicated SMILES are removed. The SMILES augmentation is individually performed after splitting into training, validation and test sets to avoid any information leakage. The procedure is performed using the RDKit library 31 .
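The renumbering in step (i) and the final duplicate removal can be illustrated without a chemistry toolkit. The sketch below is only an illustration: the helper names `rotated_orders` and `deduplicate` are hypothetical, and step (ii), turning each atom ordering back into a valid non-canonical SMILES, is not shown. RDKit functions such as `Chem.RenumberAtoms` and `Chem.MolToSmiles` with `canonical=False` would be one plausible route for that step, which is an assumption rather than something stated in the text.

```python
def rotated_orders(n_atoms):
    """Return every cyclic rotation of the atom indices 0..n_atoms-1.

    Each rotation is one candidate renumbering of the molecule's atoms.
    """
    indices = list(range(n_atoms))
    return [indices[shift:] + indices[:shift] for shift in range(n_atoms)]


def deduplicate(smiles_list):
    """Drop repeated SMILES strings while keeping first-seen order."""
    return list(dict.fromkeys(smiles_list))


if __name__ == "__main__":
    # A 4-atom molecule yields 4 candidate renumberings.
    for order in rotated_orders(4):
        print(order)
    # Different renumberings can produce the same string; keep one copy.
    print(deduplicate(["CCO", "OCC", "CCO"]))
```

The number of distinct strings left after deduplication plays the role of $n_{augm}(s_i)$ for that molecule.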

Tokenisation
Tokenisation consists in dividing the SMILES into unique tokens, each token being a set of characters. The procedure of SMILES tokenisation is as follows 24 : The characters between square brackets, which may include inorganic and aromatic organic atoms, isotopes, chirality, hydrogen count, charges or class number, form a single token (brackets included, e.g., [NH4+]). (iii) Unlike in NLP analysis, the beginning token is not different from the termination one: both of them are represented by a whitespace, which is added at both ends of a tokenised SMILES. This is important to keep its reading direction invariant. Finally, a set of unique tokens is extracted to form the representative chemical vocabulary for a given dataset. To become an interpretable NA input, this vocabulary is then mapped into integers, and is kept in memory for future usage.
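A minimal sketch of such a tokeniser is given below. Only two rules are taken from the text: a bracketed expression is one token, and a whitespace token is added at both ends. The rest of the pattern (two-letter halogens `Cl` and `Br`, two-digit ring closures `%NN`, every other character as its own token) is an assumption made for illustration, and the function names are hypothetical.

```python
import re

# Assumed token pattern, tried in order: bracket expressions, two-letter
# halogens, two-digit ring closures (%NN), then any other single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens, padded with a whitespace token
    at both ends so that start and end are marked identically."""
    return [" "] + TOKEN_PATTERN.findall(smiles) + [" "]


def build_vocabulary(tokenized_smiles):
    """Map each unique token of the dataset to an integer index."""
    unique = sorted({tok for tokens in tokenized_smiles for tok in tokens})
    return {tok: idx for idx, tok in enumerate(unique)}


def encode(tokens, vocab):
    """Turn a token list into the integer sequence fed to the network."""
    return [vocab[tok] for tok in tokens]


if __name__ == "__main__":
    dataset = [tokenize(s) for s in ["C[NH4+]Cl", "CO"]]
    vocab = build_vocabulary(dataset)
    print(dataset[0])
    print(encode(dataset[1], vocab))
```

Storing the vocabulary as a plain dictionary makes it easy to reuse the same integer mapping when new SMILES are tokenised later.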

Architecture search
The neural architecture search has recently reached a new milestone in finding the optimal NA for a given task, by using, e.g., reinforcement learning techniques 32,33 or evolutionary algorithms 34 . However, these techniques are not only computationally expensive, but they also do not necessarily deal with recurrent blocks. It has therefore been decided to fix the overall NA geometry (Figure 2) and search for the best set of hyperparameters through Bayesian optimisation 35 . As mentioned earlier in Section 2, this geometry is NLP-oriented and treats SMILES strings as sentences in the chemical language; it has low complexity so as to be applicable to small datasets, and its outcomes are interpretable. Inspired by the hierarchical neural architecture 36 , which gives cutting-edge results on document classification, we have built the SMILES-X frame based on a four-step formula: {Embed, Encode, Attend, Predict} 37 .

Embed
The embedding layer 38 transforms the tokens, derived from the dataset's vocabulary in the form of integers, into dense $n_{embed}$-dimensional float vectors. Unlike arbitrary ordinal numbers, these vectors encapsulate the semantic meaning of tokens and their relations. This operation transforms a SMILES into a series of $n_{embed} \times 1$ vectors, or an $n_{tokens} \times n_{embed}$ tensor, where $n_{tokens}$ corresponds to the number of tokens in a tokenised SMILES string.

Encode
The encoding phase is responsible for modifying the embedding, so that it captures the relationships between tokens in the context of the dataset. It consists of two neural layers: a bidirectional CuDNN long short-term memory (LSTM) layer 39,40 followed by a time-distributed fully connected one. The former consists of $n_{LSTM}$ LSTM blocks and maps the input SMILES, represented now by an $n_{tokens} \times n_{embed}$ tensor, into a context-aware $n_{tokens} \times n_{LSTM}$ tensor. After training, each row of the tensor represents the meaning of a given token within the context of the rest of the SMILES string containing it. The bidirectionality forces the embedded SMILES to be sequentially passed forwards and backwards, conserving the invariance of their reading direction. The forward and backward encodings of a SMILES are then concatenated, resulting in an $n_{tokens} \times 2n_{LSTM}$ output tensor. The time-distributed dense layer is then applied to each of the $n_{tokens}$ tokens. This captures the relationships between tokens in greater detail, or in other words deepens the LSTM layer (similar to the effect of adding an extra dense layer to a vanilla neural network). Given that the number of hidden units in this layer is $n_{dense}$, the output after encoding is an $n_{tokens} \times n_{dense}$ tensor. It should be noted that we specifically use CuDNN LSTM 41 blocks for efficient optimisation and training phases on NVIDIA GPUs. Without the CuDNN version of LSTM, the speed of training would drop by a factor of $\sim 10$, making the optimisation phase intractable.

Attend
The attention layer detects the salient tokens, compressing the tensor $H \in \mathbb{R}^{n_{tokens} \times n_{dense}}$ into an $n_{dense}$-dimensional vector $c$ with minimum information loss 23 (Equation 1). Here $W_a \in \mathbb{R}^{n_{dense} \times 1}$ and $b_a \in \mathbb{R}^{n_{tokens} \times 1}$ are trainable parameters, $\alpha \in \mathbb{R}^{n_{tokens} \times 1}$ is the attention vector and $c \in \mathbb{R}^{n_{dense} \times 1}$ is the output. Thus, the attention layer performs two important tasks at once: (1) it collapses the representation $H$ of a variable-length chain of tokens into a fixed-length vector $c$ by applying a weighted sum over the tokens that best fits the final property, and (2) the weights in $\alpha$ represent the importance of each token towards the final property prediction, which leads to a straightforward interpretation. Therefore, the attention layer has two modes, one returning the output vector $c$, and the other returning the attention vector $\alpha$ (see Section 3). The two modes are switchable at will without extra computational cost.
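One form of this layer that is consistent with the stated shapes is $e = \tanh(H W_a + b_a)$, $\alpha = \mathrm{softmax}(e)$ and $c = H^{\top} \alpha$. The sketch below implements exactly that in plain Python. The $\tanh$ scoring function is an assumption borrowed from common feed-forward attention layers, not a confirmed detail of the SMILES-X, and the function name `attend` is hypothetical.

```python
import math


def attend(H, W_a, b_a):
    """Feed-forward attention over the tokens of one SMILES.

    H   : list of n_tokens rows, each a list of n_dense floats
    W_a : list of n_dense floats (trainable)
    b_a : list of n_tokens floats (trainable)
    Returns (c, alpha): the n_dense output vector and the n_tokens
    attention weights.
    """
    # One score per token; the tanh non-linearity is an assumption.
    scores = [
        math.tanh(sum(h * w for h, w in zip(row, W_a)) + b)
        for row, b in zip(H, b_a)
    ]
    # Numerically stable softmax over the tokens.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Attention-weighted sum of the token representations.
    c = [sum(a * row[j] for a, row in zip(alpha, H)) for j in range(len(W_a))]
    return c, alpha


if __name__ == "__main__":
    H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    c, alpha = attend(H, W_a=[0.5, -0.5], b_a=[0.0, 0.0, 0.0])
    print("alpha:", alpha)  # interpretation mode
    print("c:", c)          # prediction mode
```

Returning both `c` and `alpha` from a single pass mirrors the two modes described above: the same computation serves prediction and interpretation.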

Predict
The final NA layer transforms the attention layer output $c$ into a single property value $\mathrm{Prop}(s_i)$ by a simple linear operation. The interpretation from $\alpha$ in Equation 1 and the prediction are thus linearly connected and are accessible without any additional treatment of the input data or NA, unlike the pipelines in other works 15,42,43 .
It should be noted that all the above tensors or vectors have one additional dimension, $n_{SMILES}$, omitted for the sake of simplicity. This dimension corresponds to the batch size of a single iteration passed to the network, i.e. the maximum number of SMILES that it processes at once. All of the steps above are implemented in the Keras API 44 and Tensorflow 45 with GPU support.
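The tensor shapes described for the four steps can be summarised with a small bookkeeping helper. This is a sketch only: the function name `trace_shapes` is hypothetical, it performs no computation on actual data, and it assumes that all SMILES in a batch are padded to a common length of `n_tokens`.

```python
def trace_shapes(n_smiles, n_tokens, n_embed, n_lstm, n_dense):
    """List the output shape of each stage, with the batch dimension
    n_smiles written first."""
    return [
        ("input token ids", (n_smiles, n_tokens)),
        ("embed", (n_smiles, n_tokens, n_embed)),
        # Forward and backward LSTM outputs are concatenated.
        ("bidirectional LSTM", (n_smiles, n_tokens, 2 * n_lstm)),
        ("time-distributed dense", (n_smiles, n_tokens, n_dense)),
        # Attention collapses the token axis.
        ("attend", (n_smiles, n_dense)),
        ("predict", (n_smiles, 1)),
    ]


if __name__ == "__main__":
    for stage, shape in trace_shapes(32, 40, 64, 128, 16):
        print(f"{stage:>24}: {shape}")
```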

Results & discussion
To evaluate the regression performance of the SMILES-X, we tested it on three benchmark physical chemistry datasets from MoleculeNet 26 . These datasets are considered small, with fewer than 5000 compound-property pairs, and therefore present a challenge to machine learning models. The ESOL 27 dataset contains the logarithmic aqueous solubility (mols/L) for 1128 organic small molecules; FreeSolv 28 consists of the calculated and experimental hydration free energies (kcal/mol) for 642 small neutral molecules in water; and Lipophilicity 29 stores the experimental data on the octanol/water distribution coefficient (logD at pH 7.4) for 4200 compounds.
In the present report the splitting ratio for training/validation/test is set to 0.8/0.1/0.1. Following the procedure from MoleculeNet 26 , we performed 8 splits, each time using a new seed for the Monte-Carlo sampling. The seeds have been fixed for the sake of reproducibility. We use the RMSE averaged over the 8 test sets as the comparison metric of performance.
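A seeded random split of this kind can be sketched as follows. The function name `split_indices`, the rounding rule and the seed values 0 to 7 are illustrative assumptions; the actual seeds are not given in the text.

```python
import random


def split_indices(n_samples, seed, ratios=(0.8, 0.1, 0.1)):
    """Shuffle sample indices with a fixed seed and cut them into
    training, validation and test subsets."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(ratios[0] * n_samples))
    n_valid = int(round(ratios[1] * n_samples))
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


if __name__ == "__main__":
    # Eight splits, one per (hypothetical) seed.
    for seed in range(8):
        train, valid, test = split_indices(1128, seed)
        print(seed, len(train), len(valid), len(test))
```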
The optimal model architecture is determined via Bayesian optimisation individually for each split. We used the Python library GPyOpt 46 for this purpose. The search bounds are as follows: ($n_{embed}$, $n_{LSTM}$, $n_{dense}$ and $n_{SMILES}$) $\in \{8, 16, 32, 64, 128, 512, 1024\}$ and $\gamma \in [2; 4]$ with a step of 0.1, where $\gamma$ is related to the optimiser learning rate as $lr \equiv 10^{-\gamma}$, making a total of 50421 configurations. For the Lipophilicity dataset, $n_{SMILES}$ and the learning rate are fixed to 1024 and $10^{-3}$, respectively, leaving 343 potential architectures to search among. First, 25 architectures are randomly sampled and trained. Then, a maximum of 25 architectures are proposed via the expected improvement acquisition function 47 . Each of the architectures is sequentially trained for 30 epochs for ESOL and FreeSolv, and 10 for the Lipophilicity set (these values have been chosen based on the speed/efficiency ratio). The best proposed architecture is finally trained using a standard Adam optimiser 48 with checkpoint and early stopping. The early stopping is configured to stop the training if the validation loss does not improve for 50 consecutive epochs, and a checkpoint saves the parameters of the model with the minimal validation loss. The maximum number of epochs is set to 300, but because of the early stopping condition this value has never been reached. Depending on whether the SMILES augmentation is requested or not, the code needs from 1 to 4 GPUs running in parallel.
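The two configuration counts quoted above follow directly from the search bounds: seven candidate sizes for each of four hyperparameters and 21 values of $\gamma$ give $7^4 \times 21 = 50421$, and fixing the batch size and learning rate leaves $7^3 = 343$. The sketch below enumerates both grids to check this; the variable and function names are hypothetical.

```python
from itertools import product

# Candidate sizes shared by n_embed, n_LSTM, n_dense and n_SMILES.
SIZES = [8, 16, 32, 64, 128, 512, 1024]
# gamma from 2.0 to 4.0 in steps of 0.1 (21 values).
GAMMAS = [round(2.0 + 0.1 * k, 1) for k in range(21)]


def full_search_space():
    """All (n_embed, n_lstm, n_dense, n_smiles, gamma) combinations."""
    return list(product(SIZES, SIZES, SIZES, SIZES, GAMMAS))


def lipophilicity_search_space():
    """Batch size and learning rate are fixed, so only three sizes vary."""
    return list(product(SIZES, SIZES, SIZES))


def learning_rate(gamma):
    """Learning rate as defined in the text: lr = 10 ** (-gamma)."""
    return 10.0 ** (-gamma)


if __name__ == "__main__":
    print(len(full_search_space()))
    print(len(lipophilicity_search_space()))
    print(learning_rate(3.0))
```

With at most 50 architectures evaluated per split (25 random, then up to 25 proposed), the optimisation visits only a small fraction of the full grid.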

Predictions
We compare the performance of the SMILES-X against the best-to-date results from MoleculeNet 26 , and for FreeSolv additionally to the calculations based on molecular dynamics simulations 28 (Table 1). The results in MoleculeNet 26 are reported for the molecular graph-based models that achieved the best results on a given dataset: concretely, a message passing neural network 49 for the ESOL and FreeSolv datasets, and a graph convolutional model 50 for the Lipophilicity dataset. Bayesian optimisation is also used there for the layer sizes, batch size and learning rate. We include both the results on canonicalised SMILES (Can) and on SMILES that have been augmented (Augm) (see Section 2.1). When a SMILES string $s_i$ is augmented to $n_{augm}$ strings, its predicted property value is averaged over the $n_{augm}$ predictions. Table 1 shows that the SMILES-X reaches the best results for the FreeSolv and Lipophilicity datasets, improving the prediction accuracy by 30% and 9%, respectively, while having a comparable performance on the ESOL data. It is unclear why our algorithm fails to improve on the ESOL data. We thought that the number of tokens per SMILES may be the culprit. However, Figure 3 shows that this is not the case. Note that even using the standard canonicalised SMILES strings, the property can be predicted quite well without employing any chemical knowledge (i.e., using no descriptors). Interestingly, machine learning achieves a better accuracy than the molecular dynamics simulations. We think there are three main reasons that permitted the SMILES-X to achieve these results: (i) The success is mainly attributed to the attention layer, which shows similar improvements in document classification tasks 36 . Comparing our performance to a similar NA without an attention layer 15 , we see an improvement in accuracy of about 32.5%. (ii) Bayesian optimisation is a valuable tool that efficiently finds the best hyper-parameters in a short time. (iii) SMILES augmentation brings a clear improvement (Can versus Augm in Table 1), and was necessary to achieve the best current results. Also, one can note that a graph-based NA would not allow such data augmentation.
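The averaging of predictions over augmented SMILES and the test metric can be sketched as below. The function names are hypothetical, and the use of a population standard deviation for the spread over splits is an assumption, since the text does not say how the reported uncertainty is computed.

```python
import math
from collections import defaultdict


def average_over_augmentations(parent_ids, predictions):
    """Average the predictions of all augmented SMILES that stem from
    the same original molecule (identified by its parent id)."""
    grouped = defaultdict(list)
    for parent, pred in zip(parent_ids, predictions):
        grouped[parent].append(pred)
    return {parent: sum(vals) / len(vals) for parent, vals in grouped.items()}


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean_and_std(values):
    """Mean and population standard deviation, e.g. of per-split RMSEs."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    # Molecule 0 has two augmented SMILES, molecule 1 has one.
    averaged = average_over_augmentations([0, 0, 1], [1.0, 3.0, 5.0])
    truth = {0: 2.5, 1: 4.0}
    ids = sorted(truth)
    print(rmse([truth[i] for i in ids], [averaged[i] for i in ids]))
```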

Interpretability
As mentioned before, one of the great advantages of our method is its interpretability. Figure 4 shows an example of the trained token embeddings. We used a principal component analysis (PCA 51,52 ) to reduce dimensionality from $n_{embed} = 1024$ down to two, for the [...] Instead, we are interested in the interpretation of the network property prediction. With the SMILES-X, we are able to visualise the importance of each single token towards the final prediction of the property of interest (Figure 5).
There are three ways of visualisation available: (a) a 1D map built from the attention vector $\alpha$ (see Equation 1) juxtaposed with the SMILES string, (b) a similar 2D version for the molecular graph and (c) the temporal relative distance $T_{dist}$ to the predicted property. For the first two, the redder and darker the colour, the stronger the attention on a given token.
$T_{dist}(n)$ shows the evolution of the prediction for the SMILES while reading it token by token from left to right. It is inspired by Lanchantin 53 and is defined from $\mathrm{Prop}(n)$, the property value predicted from the first $n$ tokens of the SMILES, for $n \in [1, ..., n_{tokens}]$. Note that it converges to the final prediction $\mathrm{Prop}(n_{tokens}) \equiv \mathrm{Prop}(s_i)$ (the prediction based on the entire SMILES). This also makes it possible to judge how much a token influences the property of a compound. In this example, the prediction based on the fragment 'Cc1ccc(O' is almost identical to the final prediction on the whole structure. For the compound that we used as an example, the oxygen atom ('O') is considered to be the most influential element of the molecule for the hydration free energy prediction, which reflects chemical reality.
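The prefix-by-prefix reading can be sketched as follows. Since the defining formula is not shown here, the sketch assumes a relative signed difference, $T_{dist}(n) = (\mathrm{Prop}(n) - \mathrm{Prop}(s_i)) / |\mathrm{Prop}(s_i)|$, which vanishes at $n = n_{tokens}$ as the text requires; the actual definition may differ. The `toy_predict` function is a made-up stand-in for a trained model, and its numbers carry no chemical meaning.

```python
def temporal_relative_distance(tokens, predict):
    """Distance between the prediction on each token prefix and the
    prediction on the full SMILES (assumed relative signed difference)."""
    final = predict(tokens)
    if final == 0:
        raise ValueError("relative distance is undefined for a zero prediction")
    return [
        (predict(tokens[:n]) - final) / abs(final)
        for n in range(1, len(tokens) + 1)
    ]


def toy_predict(tokens):
    """Hypothetical stand-in for a trained model: each 'O' token lowers
    the predicted value by 1.0, every other token by 0.1."""
    return sum(-1.0 if tok == "O" else -0.1 for tok in tokens)


if __name__ == "__main__":
    tokens = ["C", "c", "1", "c", "c", "c", "(", "O", ")", "c", "c", "1"]
    for tok, dist in zip(tokens, temporal_relative_distance(tokens, toy_predict)):
        print(f"{tok!r}: {dist:+.3f}")
```

A sharp jump in the sequence at a given token marks that token as influential, which is the reading applied to the oxygen atom in the example above.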

Conclusions
A new neural architecture for the characterisation of chemical compounds, the SMILES-X, has been developed. In this article, we have presented the pipeline and performance of the SMILES-X. We demonstrate its aptitude to provide state-of-the-art results on the inference of several physicochemical properties, concretely the logarithmic aqueous solubility ($\overline{RMSE}_{test} \simeq 0.57 \pm 0.07$ mols/L), hydration free energy ($\overline{RMSE}_{test} \simeq 0.81 \pm 0.22$ kcal/mol) and octanol/water distribution coefficient ($\overline{RMSE}_{test} \simeq 0.60 \pm 0.04$ for LogD at pH 7.4). These results show that it is now possible to successfully predict a physicochemical property employing no chemical intuition, even with a small dataset at hand. The success of the SMILES-X rides on three key factors: (i) The Embed-Encode-Attend-Predict architecture, which simplifies the whole architecture thanks to the attention mechanism (i.e., it has fewer trainable parameters), and therefore reduces the risk of over-fitting. (ii) The Bayesian optimisation of the neural network's hyper-parameters, which achieves a close-to-optimal representation of the molecular compounds, per task and dataset. (iii) The use of SMILES strings as the sole input representation of chemical compounds, which allows efficient data augmentation.
Thanks to the attention mechanism, the SMILES-X comes with three modes of interpretation of the inference outcomes. This provides the end-user with the insights on which fragments of the chemical structure have the highest (or the lowest) influence on the property of interest. This kind of artificial intuition is a valuable asset not only for the tasks of characterisation and design of novel compounds, but also to re-purpose already-known materials.
As for future improvements of the SMILES-X, we plan to use a BERT-like 54 NA skeleton in order to reduce the accuracy gap existing between the ESOL, FreeSolv and Lipophilicity datasets studied here. LSTM blocks are known to have memory problems with very distant dependencies within long sentences, and an architecture that is entirely based on the attention mechanism, i.e. free from LSTM blocks, like BERT, may overcome this weakness. Another way to improve the inference accuracy may be via informative sampling 55 .
In our forthcoming article we will address classification tasks, still using the MoleculeNet datasets 26 . This means that the SMILES-X will be modified in order to handle single-to-many, many-to-many and many-to-single classification tasks.

Conflicts of interest
There are no conflicts to declare.