FragNet, a Contrastive Learning-Based Transformer Model for Clustering, Interpreting, Visualizing, and Navigating Chemical Space

The question of molecular similarity is core to cheminformatics and is usually assessed via a pairwise comparison based on vectors of properties or molecular fingerprints. We recently exploited variational autoencoders to embed 6M molecules in a chemical space, such that their (Euclidean) distance within the latent space so formed could be assessed within the framework of the entire molecular set. However, the standard objective function used did not seek to manipulate the latent space so as to cluster the molecules based on any perceived similarity. Using a set of some 160,000 molecules of biological relevance, we here bring together three modern elements of deep learning to create a novel and disentangled latent space, viz. transformers, contrastive learning, and an embedded autoencoder. The effective dimensionality of the latent space was varied such that clear separation of individual types of molecules could be observed within individual dimensions of the latent space. The capacity of the network was such that many dimensions were not populated at all. As before, we assessed the utility of the representation by comparing clozapine with its near neighbors, and we also did the same for various antibiotics related to flucloxacillin. Transformers, especially when coupled (as here) with contrastive learning, effectively provide one-shot learning and lead to a successful and disentangled representation of molecular latent spaces that at once uses the entire training set in its construction while allowing "similar" molecules to cluster together in an effective and interpretable way.


Introduction
The relatively recent development and success of "deep learning" methods involving "large" artificial neural networks (e.g., [1][2][3][4]) has brought into focus a number of important features that can serve to improve them further, in particular with regard to the "latent spaces" that they encode internally. One particular recognition is that the much greater availability of unlabelled than labelled (supervised learning) data can be exploited in the creation of such deep nets (whatever their architecture), for instance in variational autoencoders [5][6][7][8][9], or in transformers [10][11][12].
A second trend involves the recognition that the internal workings of deep nets can be rather opaque, and especially in medicine there is a desire for systems that explain precisely the features they are using in order to solve classification or regression problems. This is often referred to as "explainable AI" [13][14][15][16][17][18][19][20][21][22]. The most obviously explainable networks are those in which individual dimensions of the latent space more or less directly reflect or represent identifiable features of the inputs; in the case of images of faces, for example, individual latent dimensions may come to encode recognizable attributes such as pose or hair colour.

Molecular Similarity
Molecular similarity (like any other kind of similarity) [70][71][72] is a somewhat elusive but, importantly, unsupervised concept, in which we seek a metric describing, in some sense, how closely related two entities are from their structure or appearance alone. The set of all small molecules of possible interest for some purpose, subject to constraints such as commercial availability [73], synthetic accessibility [74,75], or "druglikeness" [76,77], is commonly referred to as "chemical space", and it is very large [78][79][80][81][82][83][84][85][86][87][88][89][90][91][92][93][94][95][96][97]. In cheminformatics, the concept of similarity is widely used to prioritize, from this chemical space or by comparison with those in a database, the choice of molecules "similar" to an initial molecule (usually a "hit" with a given property or activity in an assay of interest), on the grounds that "similar" molecular structures tend to have "similar" bioactivities [98].
The problem with this is that typical metrics of similarity, whether using molecular fingerprints or vectors of the values of property descriptors, tend to give quite different values for the similarity of a given pair of molecules (e.g., [99]). In addition, and importantly, such pairwise evaluations are done individually, and their construction takes no account of the overall structure and population of the chemical space.

Deep Learning for Molecular Similarity
In a recent paper [5], we constructed a subset of chemical space using six million molecules taken from the ZINC database [100] (www.zincdocking.org/, accessed on 28 February 2021), employing a variational autoencoder to construct the latent space used to represent 2D chemical structures. The latent space is a space between the encoder and the decoder, of a certain dimensionality D, such that the position of an individual molecule in the latent space, and hence in the chemical space, is simply represented by a D-dimensional vector. A brief survey [5] implied that molecules near each other in this chemical space did indeed tend to exhibit evident and useful structural similarities, though no attempt was made there either to exploit contrastive learning or to assess degrees of similarity systematically. Thus, it is correspondingly unlikely that we had optimized the latent space from the point of view of either optimal feature extraction or explainability.
The most obvious disentanglement for small molecules, which is equivalent to feature extraction in images, is surely the extraction of molecular fragments or substructures that can then simply be "bolted together" in different ways to create any other larger molecule(s). Thus, it is reasonable that a successful disentangled representation would involve the principled extraction of useful substructures (or small molecules) taken from the molecules used in the training. In this case we have an additional advantage over those interested in image processing, because we have other effective means for assessing molecular similarity, and these do tend to work for molecules with a Tanimoto similarity (TS) greater than about 0.8 [99]; such molecules can then be said to be similar, providing positive examples for contrastive learning (although in this case we use a different encoding strategy). Pairwise comparisons returning TS values lower than, say, 0.3 may similarly be considered to represent negative examples.
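As a concrete illustration of these thresholds, the following minimal sketch (our own, using RDKit; the function name and fingerprint choice are for illustration only) labels a molecule pair by its Tanimoto similarity. Note that FragNet itself builds its positive pairs by SMILES enumeration (Section 4), not by this rule.

```python
# A minimal sketch of labelling a pair of molecules by Tanimoto similarity
# with the thresholds quoted above; FragNet itself constructs positive
# pairs by SMILES enumeration (Section 4), not by this rule.
from rdkit import Chem, DataStructs

def pair_label(smiles_a: str, smiles_b: str) -> str:
    """Return 'positive', 'negative', or 'ambiguous' for a pair of molecules."""
    fp_a = Chem.RDKFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = Chem.RDKFingerprint(Chem.MolFromSmiles(smiles_b))
    ts = DataStructs.TanimotoSimilarity(fp_a, fp_b)
    if ts > 0.8:
        return "positive"    # confidently similar
    if ts < 0.3:
        return "negative"    # confidently dissimilar
    return "ambiguous"       # the grey zone in between
```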
Nowadays, transformer architectures (e.g., [3,11,12,[101][102][103][104][105][106][107][108][109][110]) are seen as the state of the art for deep learning of the type of present interest. Following the contrastive learning framework described in [61,66], we add an extra autoencoder in which the encoder behaves as a projection head. The outputs of the transformer encoder, which we regard as the representations, are of relatively high dimension, so it can still take a considerable computational effort to compute the similarity between them directly. To this end, we add a simple encoder network that maps the representations to a lower-dimensional latent space on which the contrastive loss is computationally easier to define. Then, to convert the latent vector back into representations suitable for feeding into the transformer decoder network, we add a simple decoder network.
In sum, therefore, it seemed sensible to bring together both contrastive learning and transformer architectures so as to seek a latent space optimized for substructure or molecular fragment extraction; consequently, we refer to this method as FragNet. The purpose of the present paper is to describe our implementation of this, recognizing that SMILES strings represent sequences of characters just as do the words used in natural language processing. During the preparation of this paper, a related approach also appeared [111], but it used graphs rather than a SMILES encoding of the structures. Figure 1 shows the basic architecture chosen, essentially as set down by [112]; it is described in detail in Section 4. Pseudocode for the algorithm used is given in Scheme 1.

Transformers are computationally demanding (our largest network had some 4.68M parameters), and so (as described in Section 4), instead of using the 6M ZINC molecules (which the memory available in our computational resources could not accommodate), we studied datasets consisting overall of ~160,000 natural products, fluorophores, endogenous metabolites, and marketed drugs (the dataset is provided in [113]). We compared contrastive learning with the conventional objective function, in which we used the evidence lower bound of the KL divergence. The first dataset (Materials and Methods) consisted of ~5000 (actually 4643) drugs, metabolites, and fluorophores, plus 2000 UNPD natural product molecules, while the second consisted of the full set of ~150 k natural products.

"Few-shot" learning (e.g., [114][115][116]) means that only a very small number of data points are required to train a learning model, while "one-shot" learning (e.g., [117][118][119][120][121]) involves the learning of generalizable information about object categories from a single related training example. In appropriate circumstances, transformers can act as few-shot [3,122,123] or (as here) even one-shot learners [124,125]. We thus first compared the learning curves of transformers trained using cross entropy versus those trained using contrastive loss (Figure 2). In each case, the transformer-based learning essentially amounts to one-shot learning, especially in the contrastive case, and so the learning curve is given in terms of the effective fraction of the training set. We note that recent studies happily imply that large networks of the present type are indeed surprisingly resistant to overtraining [126]. In Figure 2A the optimal temperature seemed to be 0.05, and this was used for the larger dataset (Figure 2B). The clock time for training an epoch on a single NVIDIA V100 GPU system was ca. 30 s and 23 min for the two datasets illustrated in Figure 2A and Figure 2B, respectively.
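Scheme 1 itself is not reproduced here; purely for orientation, a hedged sketch of one training step under the Figure 1 architecture might look as follows. Every name in this sketch (enumerate_smiles, model.encode, model.project, nt_xent, and so on) is hypothetical, not the authors' code.

```python
# Illustrative sketch of one FragNet training step: two SMILES enumerations
# of each molecule pass through a shared transformer encoder and projection
# head; NT-Xent pulls the paired projections together while a reconstruction
# (cross-entropy) term trains the decoder. All names are hypothetical.
def train_step(batch_canonical_smiles, model, optimizer, tau=0.05):
    x_i = [enumerate_smiles(s) for s in batch_canonical_smiles]   # view 1
    x_j = [enumerate_smiles(s) for s in batch_canonical_smiles]   # view 2

    h_i, h_j = model.encode(x_i), model.encode(x_j)     # transformer representations
    z_i, z_j = model.project(h_i), model.project(h_j)   # latent projections

    loss = nt_xent(z_i, z_j, tau)                       # contrastive term
    loss = loss + model.reconstruction_loss(z_i, x_i)   # decoder cross-entropy term

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```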
Figure 3 gives an overall picture using t-SNE [127,128] of the dataset used. Figure 3A recapitulates that published previously, using standard VAE-type ELBO/K-L divergence learning alone, while panels Figure 3B-E show the considerable effect of varying the temperature scalar (as in [112]).
Figure 3. The temperature scalar (as in [112]) was varied between 0.02 and 0.5 as indicated. (Reducing t below this range led to numerical instabilities.) All drugs, fluorophores, and Recon2 metabolites are plotted, along with a randomly chosen 2000 natural products (as in [113]).

Results
It can clearly be seen from Figure 3B-E that as the temperature was increased through the series 0.02, 0.05, 0.1, and 0.5, the tightness, and therefore the separability, of the clusters progressively decreased. For instance, considering mainly the fluorophores (red colors) in the plotted latent space for each of the four temperatures, the separability and tightness of the cluster were best at temperatures of 0.02 and 0.05. As the temperature increased to 0.1, the data points became more dispersed, and at a temperature of 0.5 they were the most dispersed. We therefore suggest that (while the effect is not excessive) a reduced temperature may lead to the data points being more tightly clustered, although the apparent dependency is not linear.
We also varied the number of dimensions used in the latent space, which served to provide some interesting insights into the effectiveness of the disentanglement and the capacity of the transformer (Figure 4).
In Figure 4, trace 0 gives the number of dimensions that were non-zero for every molecule in the dataset; in other words, for every molecule, at least that number of dimensions (the value on the y-axis) was always non-zero. Thus, for the 256-dimensional latent space, three dimensions were always non-zero. Trace 1 gives the average number of non-zero dimensions over the dataset. Finally, trace 2 gives the highest number of dimensions recorded as populated for that particular latent-space dimensionality. This shows (and see below) that while GPU memory requirements meant that we were limited to a comparatively small number of molecules per training batch, the capacity of the network was very far from being exceeded, and in many cases some of the dimensions were not populated with non-zero values at all. At one level this might be seen as obvious: if we have 256 dimensions and each could take only two values, there are 2^256 positions in this space (~10^77). This large dimensionality at once explains the power and the storage capacity of large neural networks of this type.
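For concreteness, assuming the latent vectors have been stacked into a matrix Z of shape (n_molecules × D), the three traces could be computed along the following lines (a sketch under that assumption):

```python
# Computing the three occupancy traces of Figure 4, assuming the latent
# vectors are stacked into a NumPy array Z of shape (n_molecules, D).
import numpy as np

def occupancy_traces(Z: np.ndarray):
    nonzero = Z != 0                          # per-molecule, per-dimension mask
    per_molecule = nonzero.sum(axis=1)        # nonzero dimensions per molecule
    trace0 = int(nonzero.all(axis=0).sum())   # dimensions nonzero for *every* molecule
    trace1 = float(per_molecule.mean())       # mean nonzero dimensions over the dataset
    trace2 = int(per_molecule.max())          # most dimensions populated by any molecule
    return trace0, trace1, trace2
```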
To illustrate the effectiveness of the disentanglement in more detail, we plotted a small fraction of the values of the 25th dimension alone against a UMAP [129,130] X-coordinate. Despite the tiny part of the space involved (shown on the y-axis), it is clear that this dimension alone has extracted features that involve tri-hydroxylated cyclohexane moieties (Figure 8A) or halide-containing moieties (Figure 8B).
Another feature of this kind of chemical similarity analysis involves picking a molecule of interest and assessing what is "near" to it in the high-dimensional latent space, as judged by conventional measures of vector distance; we variously used the cosine similarity or the Euclidean distance. As before [5], we chose clozapine as our first "target" molecule and used it to illustrate different features of our method. Figure 9 illustrates the relationship (using a temperature factor of 0.05) between the cosine similarity and the Tanimoto similarity for clozapine (using RDKit's RDKFingerprint encoding (https://www.rdkit.org/docs/source/rdkit.Chem.rdmolops.html, accessed on 28 February 2021)). It is clear that (i) very few molecules showed up as being similar to clozapine in Tanimoto space, while (ii) prazosin (which competes with it for transport [131]) had a high cosine similarity despite having a very low Tanimoto similarity. In particular, none of the molecules with a high Tanimoto similarity had a low cosine similarity, indicating that our method does recognize molecular similarities effectively.
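A minimal sketch of such a neighbourhood query, comparing the latent-space (cosine) and fingerprint (Tanimoto) views around a target molecule, might look as follows; the dictionaries `smiles` and `latent`, mapping molecule names to SMILES strings and latent vectors, are assumptions for illustration.

```python
# Sketch of a neighbourhood query around a target molecule (e.g., clozapine),
# ranking by cosine similarity in the latent space and reporting the Tanimoto
# similarity alongside; `smiles` and `latent` are assumed dicts.
import numpy as np
from rdkit import Chem, DataStructs

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def neighbours(target: str, smiles: dict, latent: dict, top_k: int = 10):
    fp_t = Chem.RDKFingerprint(Chem.MolFromSmiles(smiles[target]))
    scores = []
    for name, z in latent.items():
        if name == target:
            continue
        fp = Chem.RDKFingerprint(Chem.MolFromSmiles(smiles[name]))
        scores.append((name,
                       cosine(latent[target], z),                  # latent-space view
                       DataStructs.TanimotoSimilarity(fp_t, fp)))  # fingerprint view
    return sorted(scores, key=lambda t: -t[1])[:top_k]             # rank by cosine
```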
To show other features, Figure 10A plots the cosine similarity against the Euclidean distance; the two were tolerably well correlated, with an interesting bifurcation, implying that the cosine similarity is probably to be preferred. This is because a zoomed-in version (Figure 10B) shows that the two sets of molecules with a similar Euclidean distance of around 1.5 really are significantly different from each other, and their cosine similarities also differ. By contrast, the molecules with a similar cosine similarity within a given arm of the bifurcation really are similar. The zooming in also makes it clear that the upper fork tends to have a significantly greater fraction of "Recon2" metabolites than does the lower fork, showing further how useful the disentangling that we have effected can be.
In a similar vein, varying the temperature scalar caused significant differences in the values of the cosine similarities for clozapine vs. the rest of the dataset (Figure 11). A similar plot is shown, at a higher resolution, for the cosine similarities with temperature scalars of 0.05 and 0.1 (Figure 12) and of 0.05 vs. 0.5 (Figure 13). The closeness of clozapine to the other "apines", as judged by cosine similarity, did vary somewhat with the value of the temperature. Notably, the latter value brings prazosin very close to clozapine, indicating the substantial effects that the choice of the temperature scalar can exert.
A similar exercise was undertaken for "acillin"-type antibiotics based on flucloxacillin, with the results illustrated in Figures 14-18. In the case of flucloxacillin, the closeness of the other "acillins" varied more or less monotonically with the value of the temperature parameter; thus, for particular drugs of interest, it is likely best to fine-tune the temperature parameter accordingly. In addition, the bifurcation seen in the case of clozapine was far less substantial in the case of flucloxacillin.
That the kinds of molecules that were most similar to clozapine do indeed share structural features is illustrated (Figure 19) for a temperature of 0.1 in both cosine and Euclidean similarities, where the 10 most similar molecules include six known antipsychotics, plus four related natural products that might be of interest to those involved in drug discovery.
Figure 19. Molecules closest to clozapine when a temperature of 0.1 is used, as judged by both cosine similarity and Euclidean distance.
Finally, we show (using, for clarity, drugs and fluorophores only; Figure 20) the closeness of chlorpromazine and prazosin in UMAP space when the NT-Xent temperature factor is 0.1.


Discussion
The concept of molecular similarity is at the core of much of cheminformatics, on the simple grounds that structures that are more similar to each other tend to have more similar bioeffects, an elementary idea typically referred to as the "molecular similarity principle" (e.g., [98,[132][133][134]). Its particular importance commonly comes in circumstances where one has a "hit" in a bioassay and wishes to select from a library of available molecules of known structure which ones to prioritize for further assays that might detect a more potent hit. The usual means of assessing molecular similarity are based on encoding the molecules as vectors of numbers based either on a list of measured or calculated biophysical or structural properties, or via the use of so-called molecular fingerprinting methods (e.g., [135][136][137][138][139][140][141][142]). We ourselves have used a variety of these methods in comparing the "similarity" between marketed drugs, endogenous metabolites and vitamins, natural products, and certain fluorophores [91,99,113,[143][144][145][146][147][148].
At one level, the biggest problem with these kinds of methods is that all comparisons are done pairwise, and no attempt is thereby made to understand chemical space "as a whole". In a previous paper [5], based in part on other "deep learning" strategies (e.g., [80,96,[149][150][151][152][153][154][155][156][157][158][159]), we used a variational autoencoder (VAE) [6] to project some 6M molecules into a latent chemical space of some 192 dimensions. It was then possible to assess molecular similarity as a simple Euclidean distance.
A popular and more powerful alternative to the VAE is the transformer. Originally proposed by Vaswani and colleagues [11], transformers have come to dominate the list of preferred methods, especially for strings such as those involved in natural language processing [106,[160][161][162][163]. Since chemical structures can be encoded as strings such as SMILES [164], it is clear that transformers might be used with success to attack problems involving small molecules, and they have indeed been so exploited (e.g., [10,12,104,[165][166][167][168]). In the present work, we have adopted and refined the transformer architecture.
A second point is that in the previous work [5], we made no real attempt to manipulate the latent space so as to "disentangle" the input representations, and if one is to begin to understand the workings of such "deep" neural networks it is necessary to do so. Of the various strategies available, those using contrastive learning [11,62,66,[169][170][171] seem the most apposite. In contrastive learning, one informs the learning algorithm whether two (or more) individual examples come from the same or different classes. Since in the present case we do know the structures, it is relatively straightforward to assign "similarities", and we used a SMILES augmentation method for this.
The standard transformer does not have an obvious latent space of the type generated by autoencoders (variational or otherwise). However, the SimCLR architecture admits its production using one of the transformer heads. To this end, we added a simple autoencoder to our transformer such that we could create a latent space with which to assess molecular similarity more easily. In the present case, we used cosine similarity, Tanimoto similarity, and Euclidean distance.
There is no "correct" answer for similarity methods, and as Everitt [172] points out, results are best assessed in relation to their utility. In this sense, it is clear that our method returns very sensible groupings of molecules that may be seen as similar by the trained chemical eye, and which in the cases illustrated (clozapine and flucloxacillin) clearly group molecules containing the base scaffold that contributes to both their activity and to their family membership ("apines" and "acillins", respectively).
There has long been a general recognition (possibly as part of the search for "artificial general intelligence" (e.g., [173][174][175][176][177][178][179])) that one reason human brains are more powerful than artificial neural networks may be, at least in part, simply that the former contain vastly more neurons. What is now increasingly clear is that very large transformer networks can both act as few-shot learners (e.g., [3,108]) and demonstrate extremely powerful generative properties, albeit within somewhat restricted domains. Even though the limitations on the GPU memory that we could access meant that we studied only some 160,000 molecules, our analysis of the contents of the largest transformer trained with contrastive learning indicated that it was nonetheless very sparsely populated. This both illustrates the capacity of these large networks and leads necessarily to an extremely efficient means of training.
Looking to the future, as more computational resources become available (allowing transformers to use larger networks), we can anticipate the ability to address and segment much larger chemical spaces, and to use our disentangled transformer-based representation for the encoding of molecular structures in a variety of both supervised and unsupervised problem domains.

Materials and Methods
We developed a novel hybrid framework by combining three things, namely transformers, an autoencoder, and a contrastive learning framework. The complete framework is shown in Figure 1. The architecture chosen was based on the SimCLR framework of Hinton and colleagues [61,112], to which we added an autoencoder so as to provide a convenient latent space for analysis and extraction. Programs were written in PyTorch within an Anaconda environment. They were mostly run on one GPU of a 4-GPU (NVIDIA V100) system. The dataset used included ~150,000 natural products [91,99,148], plus fluorophores [113], Recon2 endogenous human metabolites [143,144,146,147], and FDA-approved drugs [99,[143][144][145], as previously described. Visualization tools such as t-SNE [127,128] and UMAP [129,130] were implemented as previously described [113]. The dataset was split into training, validation, and test sets as described below.
Here we develop a novel hybrid framework built upon the contrastive learning framework using transformers. We explain each of its components below.

Molecular SMILES Augmentation
Contrastive learning is all about uniting positive pairs and discriminating between negative pairs. The first objective is thus to develop an efficient way of determining positive and negative data pairs for the model. We adopted the SMILES enumeration data augmentation technique from Bjerrum [180], whereby any given canonical SMILES string can generate multiple SMILES strings that represent the same molecule. We used this technique to sample two different SMILES strings x_i and x_j from every canonical SMILES string in the dataset, and regarded these as positive pairs.
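A minimal sketch of this augmentation, using RDKit's ability to emit randomized (non-canonical) SMILES, is given below; the wrapper function is our own illustration of the enumeration technique of [180].

```python
# Sketch of SMILES enumeration with RDKit: MolToSmiles with doRandom=True
# emits a randomised atom ordering for the same molecule.
from rdkit import Chem

def augment_pair(canonical_smiles: str):
    """Sample two randomised SMILES strings (x_i, x_j) for one molecule.
    Occasionally the two draws coincide; in practice one would resample."""
    mol = Chem.MolFromSmiles(canonical_smiles)
    x_i = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    x_j = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    return x_i, x_j

print(augment_pair("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, two random views
```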

Base Encoder
Once we had the augmented, randomized SMILES, they were summed with their respective positional encodings. The positional encoding is a sine or cosine function defined according to the position of a token in the input sequence; it is applied in order to take the order of the sequence into account. The next component of the framework is the encoder network, which takes in the sum of the input sequence embedding and its positional encoding and extracts representation vectors for those samples. As stated by Chen and colleagues [112], there is complete freedom in the choice of architecture for the encoder network. We therefore used a transformer encoder network, which has in recent years become the state of the art for language modelling tasks and has subsequently been extended significantly to chemical domains as well.
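For concreteness, the standard sinusoidal positional encoding of the original transformer paper (assuming an even d_model, as here) can be computed as follows:

```python
# The standard sinusoidal positional encoding of the original transformer
# paper; assumes an even d_model (64 in the SI1 settings of Section 4.7).
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even indices
    angle = pos / torch.pow(10000.0, i / d_model)                 # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # sines on even dimensions
    pe[:, 1::2] = torch.cos(angle)   # cosines on odd dimensions
    return pe                        # added to the token embeddings
```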
As set down in the original transformer paper, the transformer encoder comprises two sub-blocks. The first sub-block has a multi-head attention layer followed by a layer normalization layer. The multi-head attention layer lets the model attend to the values at neighbouring positions when encoding a representation for one particular position. The layer normalization layer then normalizes the sum of the inputs obtained from the residual connection and the outputs of the multi-head attention layer. The second sub-block consists of a position-wise feed-forward network. Then, as before, layer normalization is applied to the position-wise sum of the outputs from the feed-forward layer and the residually connected output from the previous sub-block.
The output of the transformer encoder network is an array of feature-embedding vectors that we call the representation (h_i). The representation obtained from the network is of dimension (sequence length × d_model); that is, the transformer encoder network generates a feature-embedding vector for every position in the input sequence. Normally, these transformer encoder blocks are repeated N times, with the output representation of one encoder serving as the input to the next. Here, we employed 4 transformer encoder blocks.
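A sketch of such an encoder stack, using PyTorch's built-in modules as our simplification of the blocks just described (with the SI1 settings of Section 4.7):

```python
# A 4-block transformer encoder matching the description (SI1 settings:
# d_model = 64, 4 attention heads, 20% dropout); the use of PyTorch's
# built-in modules is our simplification.
import torch.nn as nn

d_model, n_heads, n_blocks = 64, 4, 4
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, dropout=0.2)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_blocks)
# Input: token embeddings plus positional encodings, shape (seq_len, batch, d_model);
# output h has the same shape: one feature-embedding vector per sequence position.
```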

Projection Head
The projection head is a simple encoder neural network that projects the feature-embedding representation vector of shape (input sequence length × d_model) down to a lower-dimensional representation of shape (1 × d_model). Here, we used an artificial neural network of 4 layers with the ReLU activation function. This gave an output projection vector z_i, which was then used for defining the contrastive loss.
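A sketch of such a projection head follows; the hidden widths are our assumptions, since the text specifies only the depth and the ReLU activation.

```python
# Sketch of the projection head: 4 linear layers with ReLU, flattening the
# (seq_len x d_model) representation h down to a single d_model-sized z.
# Hidden widths (4096, 1024, 256) are assumptions, not reported values.
import torch.nn as nn

seq_len, d_model = 450, 64   # SI1 settings
projection_head = nn.Sequential(
    nn.Flatten(),                                 # (batch, seq_len * d_model)
    nn.Linear(seq_len * d_model, 4096), nn.ReLU(),
    nn.Linear(4096, 1024), nn.ReLU(),
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, d_model),                      # z: (batch, d_model)
)
# Assumes h has first been permuted to (batch, seq_len, d_model).
```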

Contrastive Loss
As the choice of contrastive loss for our experiments, we used the normalized temperature-scaled cross entropy (NT-Xent) loss [64,112,181,182]:

$$\ell_{i,j} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$

where z_i and z_j are the projection vectors of a positive pair when two transformer models are run in parallel, 1_{[k ≠ i]} is a Boolean indicator evaluating to 1 if k is not the same as i, τ is the temperature parameter, and sim(·,·) is the similarity metric for estimating the similarity between z_i and z_j. The idea behind this loss function is that when sampling a batch of data of size N for training, each sample is augmented as described in Section 4.1, giving 2N samples in total. For every sample there is thus one other sample generated from the same canonical SMILES, which we treat as its positive pair, and 2N − 2 other samples, each of which we treat as a negative pair.
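A compact PyTorch sketch of this loss, with cosine similarity as sim(), is:

```python
# Sketch of NT-Xent with cosine similarity: rows 0..N-1 of z are the first
# views and rows N..2N-1 the second views of the same molecules.
import torch
import torch.nn.functional as F

def nt_xent(z_i: torch.Tensor, z_j: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)       # (2N, d), unit norm
    sim = z @ z.t() / tau                                      # cosine similarities / tau
    sim.fill_diagonal_(float("-inf"))                          # drop the k == i terms
    n2 = sim.size(0)
    targets = torch.arange(n2, device=z.device).roll(n2 // 2)  # positive of i is i +/- N
    return F.cross_entropy(sim, targets)                       # mean -log softmax at positives
```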

Unprojection Head
Unlike SimCLR or any previous contrastive learning framework, we also opted to include a simple decoder network, followed by a transformer decoder network, through which we also taught the model to generate a molecular SMILES representation whenever queried with latent-space vectors. With this architecture, we thus developed a novel framework that can not only build nicely clustered latent spaces based on the structural similarities of molecules, but also has the capability of navigating those latent spaces intelligently to generate other, highly similar molecules.
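A sketch of the unprojection head as the mirror image of the projection head above (hidden widths again our assumptions):

```python
# The unprojection head: expands a latent vector z back to a
# (seq_len x d_model) representation for the transformer decoder.
# Hidden widths mirror the assumed projection head above.
import torch.nn as nn

seq_len, d_model = 450, 64
unprojection_head = nn.Sequential(
    nn.Linear(d_model, 256), nn.ReLU(),
    nn.Linear(256, 1024), nn.ReLU(),
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, seq_len * d_model),
    nn.Unflatten(1, (seq_len, d_model)),   # back to (batch, seq_len, d_model)
)
```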

Base Decoder
This final component of our architecture, the base decoder, consists of a transformer decoder network, a final linear layer, and a softmax layer. The transformer decoder network adds one more multi-head attention block, which takes in the attention vectors K and V from the output of the unprojection. Moreover, a masking mechanism is applied in the first attention block to mask the output embedding, shifted right by one position; with this, the model is only allowed to take into consideration the feature embeddings from the previous positions. The final linear layer is a simple neural network that converts the position-vector outputs from the transformer decoder network into a logit vector; this is followed by a softmax layer that converts the array of logit values into probability scores, and the atom or bond corresponding to the index with the highest probability is produced as the output. Once the complete sequence of the molecule has been generated, it is compared with the original input sequence using cross-entropy as a loss function.
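A sketch of the base decoder with the causal mask follows; PyTorch's built-in modules are our simplification of the blocks described above.

```python
# Sketch of the base decoder: a masked transformer decoder, a linear layer
# to vocabulary logits (79 tokens for SI1), and cross-entropy on top.
import torch
import torch.nn as nn

d_model, n_heads, n_blocks, vocab_size = 64, 4, 4, 79
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, dropout=0.2)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_blocks)
to_logits = nn.Linear(d_model, vocab_size)

def decode(tgt_emb: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
    """tgt_emb: right-shifted output embeddings, (tgt_len, batch, d_model);
    memory: the unprojected representation supplying K and V."""
    tgt_len = tgt_emb.size(0)
    causal = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
    out = decoder(tgt_emb, memory, tgt_mask=causal)   # attend only to earlier positions
    return to_logits(out)   # cross-entropy vs. the input sequence uses these logits
```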

Default Settings
We referred to the first dataset of ~5 k molecules containing natural products, drugs, fluorophores, and metabolites as SI1, and that of ~150 k natural product molecules as SI2.
For both datasets, the optimizer was Adam [183], the learning rate was 10^−5, and the dropout [184] was 20%. Our model has 4 encoder and 4 decoder blocks, and each transformer block has 4 attention heads in its multi-head attention layer. For the SI1 dataset, the maximum sequence length of a molecule (in its SMILES encoding) was found to be 412; we therefore chose the input sequence length after data preprocessing to be 450. The vocabulary size was 79, and d_model was set to 64. With these settings, the total number of parameters in our model was 342,674, and we chose the maximum batch size that would fit on our GPU set-up, which was 40. We randomly split the dataset in the ratio 3:2 for training and validation. In this particular scenario, however, we augmented the canonical SMILES and trained only on the augmented SMILES; our model was shown none of the original canonical SMILES during training and validation. Canonical SMILES were used only for obtaining the projection vectors during testing and the analyses of the latent space.
For the SI2 dataset, the maximum molecule length was 619, and we therefore chose to train the model with an input sequence length of 650. The total vocabulary size of the dataset was 69. The dimensionality d_model of the model was varied for this dataset from around 48 to 256; for most of our analyses, however, we chose the 256-dimensional latent space (d_model = 256), and we therefore focus on the settings for this case only. The batch size was set to 20, and the model had a total of 4,678,864 training parameters. In this case, the dataset was split such that 125,000 molecules were used for training and 25,000 were reserved for validation.
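For reference, the stated defaults can be collected into a single configuration (values exactly as reported above):

```python
# The reported default settings collected into one reference configuration.
CONFIG = {
    "shared": dict(optimizer="Adam", learning_rate=1e-5, dropout=0.2,
                   encoder_blocks=4, decoder_blocks=4, attention_heads=4),
    "SI1": dict(max_input_len=450, vocab_size=79, d_model=64,
                batch_size=40, n_parameters=342_674,
                split="3:2 train/validation"),
    "SI2": dict(max_input_len=650, vocab_size=69, d_model=256,
                batch_size=20, n_parameters=4_678_864,
                split="125,000 train / 25,000 validation"),
}
```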

Conclusions
The combination of transformers, contrastive learning, and an autoencoder head allows the production of a powerful and disentangled learning system that we have applied to the problem of small molecule similarity. It also admitted a clear understanding of the sparseness with which the space was populated even by over 150,000 molecules, giving optimism that these methods, when scaled to greater numbers of molecules, can learn many molecular properties of interest to the computational chemical biologist.