Molecular Identification from AFM Images Using the IUPAC Nomenclature and Attribute Multimodal Recurrent Neural Networks

Spectroscopic methods—like nuclear magnetic resonance, mass spectrometry, X-ray diffraction, and UV/visible spectroscopies—applied to molecular ensembles have so far been the workhorse for molecular identification. Here, we propose a radically different chemical characterization approach, based on the ability of noncontact atomic force microscopy with metal tips functionalized with a CO molecule at the tip apex (referred to as HR-AFM) to resolve the internal structure of individual molecules. Our work demonstrates that a stack of constant-height HR-AFM images carries enough chemical information for a complete identification (structure and composition) of quasiplanar organic molecules, and that this information can be retrieved using machine learning techniques that are able to disentangle the contributions of chemical composition, bond topology, and internal torsion of the molecule to the HR-AFM contrast. In particular, we exploit multimodal recurrent neural networks (M-RNN), which combine convolutional neural networks for image analysis and recurrent neural networks for language processing, to formulate the molecular identification as an image captioning problem. The algorithm is trained using a data set—which contains almost 700,000 molecules and 165 million theoretical AFM images—to produce as final output the IUPAC name of the imaged molecule. Our extensive tests with theoretical images and a few experimental ones show the potential of deep learning algorithms in the automatic identification of molecular compounds by AFM. This achievement supports the development of on-surface synthesis and overcomes some limitations of spectroscopic methods in traditional solution-based synthesis.


S1 IUPAC tokenization

In order to tokenize the IUPAC nomenclature, we selected a set of terms consisting of prefixes, suffixes, numbers, etc., whose combinations generate the 686,000 IUPAC names corresponding to the molecules belonging to the Quasar Science Resources - Universidad Autónoma de Madrid - Atomic Force Microscopy Image Dataset (QUAM-AFM). 1 Deep learning models require large datasets in which each target is repeated many times in order to be learned during the training.
We have analysed the number of times that each term appears in QUAM-AFM and removed those terms that are repeated fewer than 100 times. This reduces the total number of terms considered to 199, shown in Table S1. Consequently, we have discarded molecules whose decomposed IUPAC name contains any of the removed terms. Similarly, extremely long IUPAC names have very little representation in QUAM-AFM, so we also discard molecules whose IUPAC name decomposes into more than 57 terms. In this way, we have slightly reduced the set of available inputs to a total of 678,000 molecules.
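The filtering described above can be sketched in pure Python. This is a minimal illustrative sketch, not the authors' implementation: the greedy longest-match decomposition and the function names are our assumptions, and the thresholds simply mirror the values quoted in the text.

```python
from collections import Counter

def tokenize_iupac(name, vocab):
    """Greedy longest-match decomposition of an IUPAC name into known terms."""
    tokens, i = [], 0
    while i < len(name):
        for j in range(len(name), i, -1):  # try the longest candidate first
            if name[i:j] in vocab:
                tokens.append(name[i:j])
                i = j
                break
        else:
            return None  # name cannot be decomposed with this vocabulary
    return tokens

def filter_dataset(names, vocab, min_count=100, max_len=57):
    """Drop names using rare terms or decomposing into more than max_len terms."""
    decomposed = [(n, tokenize_iupac(n, vocab)) for n in names]
    counts = Counter(t for _, toks in decomposed if toks for t in toks)
    rare = {t for t, c in counts.items() if c < min_count}
    return [n for n, toks in decomposed
            if toks and len(toks) <= max_len and not (set(toks) & rare)]
```

For example, with a toy vocabulary {"benz", "amide", "prop", "ane"}, the name "benzamide" decomposes into ["benz", "amide"].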
acen acet acid acr alde alen amate amide amido amim amine amino amo ane ani anil ano anone anthr ate ato aza aze azi azido azin azo azol benz brom but carb chlor chr cyclo ene eno eth fluor form furan furo hyde hydr ic ida ide idin idine ido imid imidin imine imino ind ine ino iod iso itrile ium meth mine naphth nitr nium nyl oate oic ol ole oli olin oline olo om one oso oxo oxol oxy oxyl oyl phen phth prop pter purin pyr pyrrol quin sulf thi tri urea yl ylid yridin zin zine dodeca undeca lambda phosph cinnam xanth porph coron deca guan hept amic pent enal octa anal nona mido nida azet tetr ulen anol hypo pine ysen anth tere acyl yrin inin tris pino mid hex nia bis per nio pin ite rin alo en di 11 yn an cy de 15 10 13 14 12 in 18 21 az al bi et ep id ox il or 16 17 on ( 6 ) -[ a ] N ' 1 7 4 5 3 9 2 8 H , o b e c C d O g f

Table S1: Table of terms for International Union of Pure and Applied Chemistry (IUPAC) decomposition. The elements above the bold line are the subset of 100 attributes considered. The grey cell does not correspond to any term; it has been coloured in order to distinguish it from the term that spells an empty space between two words.

problems, and that has led to the development of the transformers. 6,7,8,9 In this case we assume that the relationship between each IUPAC name and each molecule is biunivocal (although there are some exceptions, we suppose that the names of each compound have been entered un-

S2 M-RNN A and AM-RNN models: A layer description

As previously mentioned, the architecture developed for molecular identification with the IUPAC nomenclature is composed of two models, M-RNN A and AM-RNN (see Figure 2 of the main text). Each one is composed of a CNN, an RNN and a multimodal component ϕ.
Figure S1 displays the type of layers that constitute each component of both M-RNN A and AM-RNN.
The architecture of the CNN component is identical in both models, while the RNN and ϕ share the same structure and, except for the RNN layer, the same type of layers with a different number of units (e.g. kernels in convolutional layers, vector length in fully connected layers, etc.) according to the specific purpose of the model. Figure S2 provides the details, highlighting the differences between the layers of the RNN and ϕ components of the two models.
The CNN component consists of a modification of the Inception ResNet V2 model, 10 identical for both M-RNN A and AM-RNN. This well-known model was developed to be applied to 2D images, whereas in our case we process 3D maps (stacks of 10 AFM images with various tip-sample distances). Therefore, we have replaced the first 2D convolutional layers in Inception ResNet V2 with two 3D convolutional layers, each one with 32 filters, a (3,3,3) kernel size and (2,1,1) strides, followed by a dropout layer. We have verified that this dropout layer is essential for the model to generalize to different images, such as the experimental ones. In addition, we have removed the last fully connected layer of the model, which is specific to the original classification task, obtaining an output vector (v) of length 1539.
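The effect of the two strided 3D convolutions on the image stack can be checked with the standard output-shape formula for a convolution. This is a sketch under stated assumptions: we assume "valid" padding and a 256×256 lateral image resolution, neither of which is specified in the text; only the (3,3,3) kernel and (2,1,1) strides come from the description above.

```python
def conv3d_out_shape(in_shape, kernel=(3, 3, 3), strides=(2, 1, 1)):
    """Output spatial shape of a 3D convolution with 'valid' padding."""
    return tuple((d - k) // s + 1 for d, k, s in zip(in_shape, kernel, strides))

# Stack of 10 constant-height AFM images: (depth, height, width)
shape = (10, 256, 256)
for _ in range(2):          # the two 3D convolutional layers
    shape = conv3d_out_shape(shape)
```

Under these assumptions, the stride of 2 along the tip-sample-distance axis collapses the 10 slices to 4 after the first layer and to a single slice after the second, so the remainder of Inception ResNet V2 can operate on 2D-like feature maps.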
The goal of the RNN component is to use sequential data to add new terms to the formulation. To this end, the architecture of this component is developed according to two key objectives: first, to embed a representation of each term based on its semantic meaning and, second, to store the semantic temporal context in the recurrent layers. To represent each term in a vector space, the RNN component has an embedding layer that is able to capture relationships between terms (see section S6 for a more detailed analysis). The embedding layer is followed by a dropout layer that acts as a regularizer; finally, the data go through a recurrent layer which, with the temporal context generated by all the predictions made in previous time steps, makes a proposal of predictions that is processed by the multimodal component (see fig. S1). The recurrent layer is a GRU in M-RNN A (attribute prediction) and an LSTM in AM-RNN (term prediction). The multimodal component ϕ first processes the CNN output v in two fully connected layers with a dropout between them. Subsequently, this output is concatenated with the output of the RNN (see fig. S1). Finally, the result of the concatenation feeds two fully connected layers: the first one is activated with the Rectified Linear Unit (ReLU) function, whereas the second one is activated with a softmax function, which converts the outputs of the layer into a vector whose components represent probabilities that sum to one.
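The data flow through ϕ can be sketched as a toy numpy forward pass with random weights. This is purely illustrative: the hidden widths (256 for the fully connected layers, 512 for the RNN output) are our assumptions; only the CNN output length (1539) and the idea of a term vocabulary (here 199 outputs) come from the text, and dropout is omitted since it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_phi(v, h, n_terms=199, d=256):
    """Toy forward pass of the multimodal component phi with random weights."""
    W1 = rng.standard_normal((d, v.size)) * 0.01
    W2 = rng.standard_normal((d, d)) * 0.01
    v2 = relu(W2 @ relu(W1 @ v))        # two FC layers on the CNN output v
    z = np.concatenate([v2, h])         # concatenate with the RNN output h
    W3 = rng.standard_normal((d, z.size)) * 0.01
    W4 = rng.standard_normal((n_terms, d)) * 0.01
    return softmax(W4 @ relu(W3 @ z))   # probabilities over the term vocabulary

p = multimodal_phi(rng.standard_normal(1539), rng.standard_normal(512))
```

The final softmax guarantees that the 199 outputs are non-negative and sum to one, as required for per-term probabilities.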

S3 Model learning
In principle, it could be possible to train the CNN in conjunction with the RNN in an end-to-end process. However, this would require a massive amount of computational time (more than a year even with the good GPU resources available to us).

S3.1 Molecular Classes for Transfer Learning
We train the CNN by minimizing the Negative Log Likelihood loss function with the Adaptive Moment estimator (Adam) optimizer. 13 To speed up the convergence, we apply batch normalization, 10,14 setting the mean to zero and the variance to one in the input layers. 15,16 We found that this normalization not only makes the training faster but also improves the classification results, reaching an accuracy of 0.97 in the test prediction.
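The normalization step (zero mean, unit variance) can be sketched in a few lines of numpy. This is a minimal sketch of the standardization only; the learnable scale and shift parameters of full batch normalization, and its train/inference running statistics, are omitted.

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Standardize a batch feature-wise to zero mean and unit variance."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)  # eps avoids division by zero
```

Applied to a batch of inputs, every feature column of the result has (approximately) zero mean and unit variance, which is what speeds up the convergence mentioned above.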

S3.3 M-RNN and AM-RNN optimization
θ* = arg max_θ Σ_(I,S) log p(S | I; θ),

where θ represents the model parameters, and I is the 3D image stack. Note that the prediction of the IUPAC nomenclature, according to the decomposition performed on a set of terms, must depend on the predictions already performed, i.e. the prediction must take into account previously predicted terms in order to have semantic meaning, so it is a time series. During the training, we feed the model at each time step t with the targets of previous time steps (S_0, S_1, ..., S_{t−1}) and not with the predictions performed (Y_1, ..., Y_{t−1}), which in some cases are wrong. We refer to each input of the model at the time step t as the pair (S_0, S_1, ..., S_{t−1}; I).
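Under this teacher-forcing scheme, the per-molecule loss is simply the sum of the negative log-probabilities that the model assigns to each ground-truth term. A minimal numpy sketch (the function name and the index-based term encoding are our illustrative assumptions):

```python
import numpy as np

def sequence_nll(step_probs, target_terms):
    """Negative log-likelihood of a target term sequence.

    step_probs[t] is the model's probability distribution over the term
    vocabulary at time step t, computed with teacher forcing, i.e.
    conditioned on (S_0, ..., S_{t-1}; I); target_terms[t] is the index
    of the ground-truth term S_t.
    """
    return -sum(np.log(p[s]) for p, s in zip(step_probs, target_terms))
```

For instance, if the model assigns probability 0.5 and 0.75 to the two ground-truth terms of a short sequence, the loss is −(ln 0.5 + ln 0.75).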
Note that, in this way, the final prediction length of each molecule depends on t. Thus, the prediction of S depends on the prediction of each specific term, which in turn depends not only on I but also on all the predictions performed in previous time steps. Since the model predicts a single term of the sequence at each time step, it is natural to apply the chain rule to model the joint probability over the sequential terms. Hence, the probability of obtaining a correct prediction for the complete sequence is described by the sum of the logarithmic probabilities over the terms, and the maximum log-likelihood function is as follows:

log p(S | I; θ) = Σ_t log p(S_t | I, S_0, S_1, ..., S_{t−1}; θ).

As the deep learning optimization techniques consist of searching for a minimum rather than a maximum, the loss function is described by the sum of the negative log-likelihoods:

L(I, S; θ) = − Σ_t log p(S_t | I, S_0, S_1, ..., S_{t−1}; θ).

In the same way as we do for the CNN pre-training described in section S3.2, a random combination of the simulation parameters described in 1 is selected for each input at each epoch.

Figure S5: Details of the training scheme for each stage of each of the models. Lr is short for learning rate.

S4 Influence of the molecular torsion in the model performance: gas-phase versus flat configurations

HR-AFM shows an outstanding lateral resolution for quasi-planar molecules, but it is more limited when it comes to correctly discerning molecules with a significant torsion. This is due to the nature of the contrast and the probe flexibility. The exponential behavior of the Pauli repulsion with respect to the tip-sample distance makes this contribution the most relevant contrast source. Thus, each atom in the sample behaves as an umbrella of around 2-3 Å, veiling any other electronic charge density below this region (except in some cases where deformations in the charge density may occur). Furthermore, the CO molecule will tilt under this repulsion and will block the access of the probe to the region underneath.
Previous works 17 suggest that it is difficult to retrieve information from parts of the molecule that are located more than 1.5 Å below the topmost atoms. Accordingly, we expect our algorithm to perform worse when finding out the IUPAC names of molecules showing significant out-of-plane distortions. In order to quantify this, we have carried out some tests to directly show how the model accuracy improves for planar structures. Figure S6 shows two molecules where the prediction for the non-planar configuration failed but where we obtain a highly accurate result for the planar configurations (an almost perfect match in the cases shown, with only an error in the misplacement of one of the functional groups). Figure S7 shows the result for two other molecules that contain functional groups that are difficult to discern, such as fluorine atoms, easily mistaken for other terminations like diketones, which may display faint AFM features.
Although the BLEU 4-gram score for the planar configuration is not high, there is a clear improvement compared to the non-planar cases.

Figure S6: Comparison of the model predictions for the gas-phase and the forced planar configuration of a representative molecule, whose ball-and-stick depiction and height map are shown on the left. The upper AFM images correspond to the simulation with the structure in a completely planar configuration, while the lower ones correspond to the images for the gas-phase structure. To the right of the AFM images is the 4-gram score corresponding to the prediction shown (flat and gas-phase structure, respectively). Further to the right is the average of the 4-gram scores obtained for images generated with the 24 combinations of operational parameters considered in QUAM-AFM. The prediction example shown is the one that scored closest to the mean.

Figure S7: As fig. S6 for two molecules that include chemical groups, like F atoms (top) and diketones (bottom), that display faint AFM features and thus are more difficult to recognise by the model.

S5 Experimental test
We have performed a test with 3D stacks of experimental AFM images for dibenzothiophene and 2-iodotriphenylene (see fig. S8). In the first case, despite the strong noise in the images and the white lines crossing the images diagonally, we obtain a perfect prediction. In the second case, with images covering a tip-sample distance range of 72 pm (smaller than the 100 pm range used for the training), the prediction performed by the model is "2iodtriphenylene", which is very close to the ground truth, just missing a hyphen but providing all the relevant chemical information. Notice that, in both cases, the experimental images have been scaled close to the size of the unit cell used in QUAM-AFM before feeding the model. Without this scaling, the score in the 4-gram evaluation falls dramatically for both cases, suggesting that the typical sizes of chemical moieties learned during the training are an important component in the success of the identification process. This sensitivity to variations in image sizes can possibly be reduced by increasing the zoom range of ±15% applied in our data augmentation during the training.
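The scaling step applied to the experimental stacks can be sketched with a nearest-neighbour zoom in numpy. This is an illustrative sketch only: the interpolation scheme actually used is not specified in the text, and the function name is our own.

```python
import numpy as np

def rescale_stack(stack, factor):
    """Nearest-neighbour rescaling of each image in a 3D AFM stack
    (depth, height, width) by a common lateral zoom factor."""
    d, h, w = stack.shape
    nh, nw = int(round(h * factor)), int(round(w * factor))
    rows = (np.arange(nh) / factor).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / factor).astype(int).clip(0, w - 1)
    return stack[:, rows[:, None], cols[None, :]]  # broadcast fancy indexing
```

Only the lateral (height, width) dimensions are rescaled; the depth axis, which encodes the tip-sample distance, is left untouched.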

S6 Embedding
The weights of the embeddings of the RNN components of models applied to image captioning 20,21 are usually pretrained with Word Embedding to represent the semantic high-level features in a vectorial space. The usefulness of this technique lies in the representation of words in a vector space in which words with similar semantic meaning are represented close together, as shown in fig. S9. In this way, the models manage to use synonyms, providing versatility and improving their expressive capacity. The versatility of languages to express the same idea in different ways is further exploited in image captioning by providing the same input image with different annotation outputs. Those models then succeed in learning that different words have similar meaning.

Figure S9: Distances in the L2 sense between some of the terms in the AM-RNN embedding layer. A higher intensity of blue corresponds to smaller distances, while lighter cells correspond to longer distances. The distance from a term to itself is zero, so it matches the darkest blue.
However, there are no trivial synonyms in our case, due to the close relationship between the IUPAC names assigned to the functional groups in a molecule. Thus, any change in the output sequence will likely result in an error, as a different molecule would be predicted. Additionally, there is no freely available pre-trained Word Embedding for the IUPAC nomenclature, and therefore all weights of the RNN component are initialised randomly. Consequently, we train the embedding layer with the rest of the RNN component (stages 1 and 3 described in section S3.4). Here we find that, although the IUPAC nomenclature does not use synonyms, the representations made in this vector space have certain semantic relationships. To check this, we define the distance in the L2 sense and calculate the distances among the terms (of the AM-RNN). We find, for example, that the terms represented closest to brom are chlor, fluor and iod. In addition to the terms associated with halogens, those designating numbers are also represented in close proximity. It is also noteworthy that the terms closest to nona are octa, deca, undeca and dodeca, as shown in fig. S9. It should be recalled that each of the terms is represented by a number to feed the network, so the model is not learning lexical relations but syntactic relations.
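The nearest-term analysis above amounts to ranking embedding vectors by L2 distance. A minimal numpy sketch (the function name and the toy embedding are our illustrative assumptions; the real embedding matrix comes from the trained AM-RNN):

```python
import numpy as np

def nearest_terms(emb, terms, query, k=3):
    """Rank vocabulary terms by L2 distance to the query term's embedding.

    emb is an (n_terms, dim) embedding matrix; terms[i] is the string
    associated with row i of emb."""
    q = emb[terms.index(query)]
    dist = np.linalg.norm(emb - q, axis=1)  # L2 distance to every term
    order = np.argsort(dist)                # closest first (query itself is at distance 0)
    return [terms[i] for i in order if terms[i] != query][:k]
```

With the trained embedding, a call like nearest_terms(emb, terms, "brom") would return the halogen terms reported above.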