AMCG: a graph dual atomic-molecular conditional molecular generator

Drug design is both a time-consuming and expensive endeavour. Computational strategies offer viable options to address this task; deep learning approaches in particular are gaining traction for their ability to handle chemical structures. A straightforward way to represent such structures is via their molecular graph, which in turn can be naturally processed by graph neural networks. This paper introduces AMCG, a dual atomic-molecular, conditional, latent-space, generative model built around graph processing layers, able to support both unconditional and conditional molecular graph generation. Among other features, AMCG is a one-shot model allowing for fast sampling, explicit assignment of atomic type histograms and property optimization via gradient ascent. The model was trained on the Quantum Machines 9 (QM9) and ZINC datasets, achieving state-of-the-art performance. Together with classic benchmarks, AMCG was also tested by generating large-scale sampled sets, showing robustness in terms of sustained throughput of valid, novel and unique molecules.


Introduction
Drug discovery is a complex and expensive task: the capitalized cost of a single drug can reach 4.5 billion dollars [1]. Computer-aided strategies to tackle this problem can be roughly divided into two categories: methods that leverage libraries of existing compounds (e.g. virtual screening and docking) and de novo drug design. The first approach is the de facto standard in the early stages of drug design [2], whereas the second, although pioneered long ago [3], has only recently become a viable route. The main limitation of the first class of methods is the inherent restriction to the set of predefined compounds present in available databases. De novo drug design algorithms, in particular those belonging to the deep learning realm, are instead acquiring huge popularity because of their ability to interpolate among chemical structures. Among these, an exponentially increasing number of works involve molecular graph representations, which in turn are most often supported by graph neural networks (GNNs) [4]. Within this family of methods, graph-based generative models have proven feasible for the task [5].
As in any deep learning design endeavour, several choices emerge here as well. A first aspect is the learning framework, ranging from generative adversarial networks to variational autoencoders (VAEs) to reinforcement learning. A second element is the generation process, which can be one-shot or sequential; techniques to enforce the chemical validity of the generated samples are also relevant [6]. Furthermore, an additional challenge consists in conditioning the generation towards the optimization of desired (numerical) properties such as the quantitative estimate of drug-likeness (QED), the synthetic accessibility score (SAS) or bioactivity [7].
In this work we present AMCG, a dual atomic-molecular conditional generator: a VAE-like, GNN-based framework that allows for both unconditional and conditional generation. We propose a novel one-shot generative model for molecular graphs and apply it to two well-known benchmarks for the generation of small organic molecules, the Quantum Machines 9 (QM9) [8] and ZINC [9] datasets. Our approach consists of a dual network in which molecular and atomic representations co-exist. In detail, we jointly train two sub-networks: the atomic one, which is easier to converge, and the molecular one, which simultaneously learns from data and distills knowledge from the atomic branch. Among other features, the proposed network works with no prescribed bound on the number of atoms, allows an explicit histogram of atomic types to be used during sampling, and is able to optimize properties through gradient ascent. Since the architecture is based on a latent representation, there is full control over the geometry of the space, hence sampling can be directed to the regions of interest. Controlling the sampling space is fundamental; to this aim, we use a Gaussian mixture model (GMM) to focus the sampling on proper subspaces of the latent representation. Fitting a GMM over the latent space is an approach that we found to be rather fast and effective, providing new samples that reasonably belong to the underlying latent distribution while retaining the advantages of the single-iteration sampling of the VAE Gaussian prior. Other techniques such as score-based models in latent space [10, 11], conversely, require a multiple-iteration sampling strategy and are far more complex to train, as they are deep learning models themselves. We provide several GMM versions to explore the continuum between the default single-Gaussian VAE prior and the more complex distributional environment supported by GMMs. Experimental results show the feasibility of the approach.
The paper is organized as follows: in section 2 the AMCG architecture is described and discussed; in section 3 the experiments on the benchmark datasets are presented; finally, in section 4 final insights and future perspectives are provided.

Method
The VAE model is well known in the machine learning community. It was introduced in [12] and has since been a central component of many state-of-the-art generative frameworks, with several applications in molecular generation [13-15].
VAEs leverage probabilistic latent variables to model complex data distributions and represent a viable solution for both data generation and latent space exploration. The VAE architecture comprises two subnets. The first subnet is an encoder network that maps the input examples x_i ∈ X into the parameters µ_i, σ_i of a Gaussian. The resulting approximate posterior is then pushed towards a zero-centered, unit-variance Gaussian distribution p(z). Subsequently, a latent sample vector is drawn from this distribution and given to a second network, the decoder. During training, the decoder tries to reconstruct the original input (i.e. the whole network is trained as an autoencoder). Once the VAE has been trained, generation is achieved by sampling from the known prior distribution and feeding the latent sample vector to the decoder. Formally, the VAE loss function is defined as

L = −E_{q(z|x)}[log p(x|z)] + KL(q(z|x) ∥ p(z)), (1)

where q(z|x) represents the approximate latent posterior, p(x|z) is the generative distribution, and KL(q(z|x) ∥ p(z)) quantifies the Kullback-Leibler divergence between the approximate posterior and the prior latent distribution, which acts as a regularizer during the training process. It can be shown that this loss function is a variational lower bound (ELBO) on the log-likelihood of the data and is the sum of the reconstruction loss (the first term) and the similarity loss (the KL divergence). Minimizing this loss not only enables effective data reconstruction but also encourages the latent space to follow the desired prior distribution p(z).
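As a concrete reference, the following is a minimal PyTorch sketch of this objective; the mean-squared-error reconstruction term and the β weight are illustrative assumptions, not the exact choices used in AMCG.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, eps ~ N(0, I): keeps the sampling step differentiable.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x_hat, x, mu, log_var, beta=1.0):
    # Reconstruction term: a stand-in for -E_q[log p(x|z)].
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```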

VAE-like architecture
The main limitation of a VAE-based approach is the need for a trade-off between reconstruction ability and regularization of the latent space. Indeed, in order to reconstruct each data point, the encoding and decoding mappings have to be injective functions [16]. As such, the model may tend to create a highly sparse space in which the empty regions often do not contain meaningful objects; it is in fact well known that overweighting the regularization term helps to compact the space more properly [17]. Owing to the shape of the prior distribution, however, this possibly induces an overly constrained mapping. It has been shown that employing score-based methods in such a space [10, 11] can be a way to overcome this partial failure in shaping the latent manifold. The need for more control over the latent space is so significant that even graphical tools have been devised to navigate it manually [18]. The graphical, manual solution is intuitive but not algorithmic, whereas score-based methods are rigorous yet computationally demanding. A compromise between these two approaches is to sample from a slightly different prior, that is N(µ, σ), where µ is the mean of the latent embeddings and the best σ is found empirically. Both quantities are estimated after training; this is similar in spirit to [15], where however the Gaussian distribution parameters are learnt at training time. The immediate refinement of this approach is to take advantage of GMMs [19] to capture the learnt prior. Note that here the GMM is used solely to navigate the data manifold. Hence, with respect to the canonical use of GMMs for clustering, one is in this case more resilient to initialization biases, as there is no real interest in the samples-to-clusters assignment; the consequence is that GMMs are rather robust for this task. The AMCG network introduced here is a dual atomic-molecular network. It possesses two branches: an atomic branch (left side in figure 1) and a molecular branch (right side in figure 1). The two branches process the molecule and, during the training phase, both aim at reconstructing the input molecular graph, as commonly done in autoencoder architectures. During generation the model relies solely on the molecular branch: as in classical VAEs, generation is carried out by sampling the molecular latent space and then feeding the latent representation to the molecular branch.
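A minimal sketch of this GMM-based sampling strategy, assuming the latent embeddings of (a subset of) the training set have been precomputed; the component count, covariance scaling and file name are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

latents = np.load("train_latents.npy")        # (N, m) latent vectors, assumed precomputed

# Full-covariance mixture fitted on the latent embeddings.
gmm = GaussianMixture(n_components=32, covariance_type="full").fit(latents)

# A diagonal-covariance variant whose covariances are scaled post-fit to
# widen (or narrow) the explored region; sample() reads covariances_ directly.
gmm_diag = GaussianMixture(n_components=32, covariance_type="diag").fit(latents)
gmm_diag.covariances_ *= 1.5                  # beta, an illustrative scale factor

z_new, _ = gmm.sample(10_000)                 # latent vectors to feed to the decoder
```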
The two branches rely on different processing of the input graph. The atomic branch keeps the full amount of incoming atomic information without reductions; in our architecture this leads to a deterministic feature vector per atom. The molecular branch relies only on a global representation of the molecule (i.e. a sampled latent vector). This dual solution was rather beneficial at the training stage: we found that using only a global representation of the molecule (completely discarding the atomic information during training) would lead to a non-converging gradient descent. The converse is also true, as using only the atomic branch would make conditioning very difficult and would have required defining an a priori upper bound on the number of atoms of the generated molecules. For this reason, the easy-to-converge atomic branch of the network is used during training to help the molecular part via a self-distilling approach. To further describe the model, let us divide it into four main components: an encoder, a combiner, a molecular decoder and a shared decoder. Formally, the encoder network takes as input a molecular graph G = (X_G, A_G, E_G), where X_G is the node feature matrix, A_G is the graph adjacency matrix and E_G is the edge feature matrix. Differently from traditional VAEs, it generates an atom-wise representation X^A (i.e. an embedding vector per node) and a (latent) molecular embedding x^M. From now on we will use A and M superscripts to indicate atomic and molecular representations, respectively. The two representations X^A and x^M are fed to the combiner, which outputs a final atom-wise representation X^A_F. The molecular decoder instead takes only the molecular embedding x^M as input and generates a surrogate atom-wise representation X̃^A_F, trying to mimic X^A_F. Finally, the shared decoder is trained to reconstruct the original molecule (i.e. predicting the adjacency matrix and the bond types). At each training step, the shared decoder takes as input X^A_F (coming from the left branch) and X̃^A_F (coming from the right branch), simultaneously training the two branches. The data flow is visualized in figure 1.
AMCG is trained using the weighted sum of several loss terms, coming from each part of the model. The final loss is described in equation (7); in the following we detail each component.

Encoder
The encoder is made of two parts: an atomic encoder and a molecular encoder. The atomic encoder is a GNN relying on graph attention (GAT) layers [20], equipped with HeteroLinear (HL) [21] activation functions, that is, each atom type has its own linear activation. It predicts two feature vectors per atom, µ^A and σ^A, which are the parameters of a multivariate Gaussian distribution N(µ^A, diag(σ^A)). The atomic encoder consists of a first, shared layer followed by two distinct layers for µ^A and σ^A respectively. This small number of steps avoids oversmoothing, preventing atoms of the same species from becoming indistinguishable. This two-hop design is incidentally similar to [22], although we chose GAT layers instead of GCN layers. Furthermore, the HL activation allows us to keep independent data flows per atom type. This is tightly related to the way the generation process works (section 2.1.3). The output is finally sampled as

x^A = µ^A + σ^A ⊙ ε,

where ε ∼ N(0, I) and ⊙ is the Hadamard product. This is done to increase the robustness of the downstream network (i.e. the decoder part) [23] and to break the symmetry in reconstructing the original graph [24]. At this point the node features are fed to the molecular encoder, which first transforms them via a multilayer perceptron (MLP) and then aggregates them via a global aggregator to obtain the molecular embedding. The chosen aggregator was a global sum. This operator, in agreement with theoretical [25] and empirical studies [26, 27], led to lower loss values with respect to other aggregators (see appendix C for details). Finally, the molecular embedding is sent to an MLP that calculates the parameters of a (molecular) multivariate Gaussian distribution N(µ^M, diag(σ^M)) of dimension m. The final molecular representation is sampled as

x^M = µ^M + σ^M ⊙ ε.

This distribution is required to adhere to a unitary Gaussian via a KL divergence term

L_KL = KL(N(µ^M, diag(σ^M)) ∥ N(0, I)). (2)
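A sketch of the atomic encoder in PyTorch Geometric, with one shared GAT layer followed by two distinct heads for µ^A and σ^A and per-type HeteroLinear activations; layer sizes are illustrative, and σ is parameterized through its logarithm for numerical stability (an assumption of this sketch):

```python
import torch
from torch_geometric.nn import GATConv, HeteroLinear, global_add_pool

class AtomicEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim, num_atom_types):
        super().__init__()
        self.shared = GATConv(in_dim, hidden_dim)           # first, shared layer
        self.gat_mu = GATConv(hidden_dim, latent_dim)       # distinct layer for mu
        self.gat_sigma = GATConv(hidden_dim, latent_dim)    # distinct layer for sigma
        # HeteroLinear keeps an independent linear flow per atom type.
        self.hl_mu = HeteroLinear(latent_dim, latent_dim, num_atom_types)
        self.hl_sigma = HeteroLinear(latent_dim, latent_dim, num_atom_types)

    def forward(self, x, edge_index, atom_type):
        h = self.shared(x, edge_index).relu()
        mu = self.hl_mu(self.gat_mu(h, edge_index), atom_type)
        log_sigma = self.hl_sigma(self.gat_sigma(h, edge_index), atom_type)
        eps = torch.randn_like(mu)
        return mu + torch.exp(log_sigma) * eps              # x^A = mu^A + sigma^A ⊙ eps

# The molecular embedding is then obtained by a global sum over the nodes,
# e.g. x_mol = global_add_pool(mlp(x_atoms), batch).
```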

Combiner
The molecular representation x^M is concatenated to each atomic embedding vector x^A, and the result is fed to a four-layer MLP N with HL activation. The output of the combiner is thus computed as

x^A_F = N(x^A ∥ x^M),

where ∥ is the concatenation operation. This combination operator proved to be more flexible than a sum or averaging operator, also allowing for different atomic and molecular embedding sizes. Let us denote by X^A_F the resulting atomic feature matrix.

Molecular decoder
The molecular decoder (figure 2) takes as input the molecular representation generated by the encoder and generates new atomic representations trying to match the ones from the combiner. This task is split into four sub-tasks (a sketch of the sampling step follows the list):
• First, the molecular representation is given to an MLP regressor that predicts the histogram H of the atom types of the molecule. Its loss function is the mean squared error (MSE); let us denote it by L_H. The presence of the histogram confers significant flexibility on the generative process: at inference time one can define the desired atom type population, resulting in a built-in conditioning mechanism.
• Second, let k be the number of atom types present in the dataset. The molecular representation is given to k atom type-specific MLPs. For each atom type AT, the corresponding network outputs the parameters of a normal distribution N(µ_AT, diag(σ_AT)).
• Third, the atom type-specific distributions are sampled according to the inferred histogram. During sampling, a positional bias constant term (positional with respect to the sampling order) is added to each atom to guarantee that elements generated from the same distribution are distinct; this ensures variety in the output of the network. Producing equivalent atoms would lead, for instance, to over-connected graphs (e.g. carbon atoms all pairwise connected). This step ultimately produces a matrix X̃ whose number of rows equals the number of atoms.
• Finally, the atomic embeddings are fed to an MLP post-processor with HL activation, returning a matrix X̃^A_F.
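The histogram-driven sampling step can be sketched as follows; the embedding layout and the positional bias scale are illustrative assumptions:

```python
import torch

def sample_atoms(type_params, histogram, bias_scale=0.1):
    """type_params: dict atom_type -> (mu, sigma), each a (d,) tensor.
    histogram: dict atom_type -> count, as inferred by the histogram regressor."""
    rows = []
    for atom_type, count in histogram.items():
        mu, sigma = type_params[atom_type]
        for i in range(count):
            eps = torch.randn_like(mu)
            # Positional bias (constant in the sampling order) keeps atoms drawn
            # from the same distribution distinguishable from one another.
            rows.append(mu + sigma * eps + bias_scale * i)
    return torch.stack(rows)    # one row per generated atom
```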
While the atomic branch never loses information about the ordering chosen for the feature matrix X^A_F, which stays unaltered through the entire data flow, the same does not hold for X̃^A_F, which is obtained by simply sampling the atom type-specific Gaussian distributions described above. This is not an issue during generation since, as detailed in the next subsection, adjacency matrix and bond type inference are permutation-independent operations. During training, however, an atom ordering has to be chosen in order to compute the reconstruction loss for the molecular branch. To do this, the left (atomic) branch of the network is used to guide the alignment, matching the matrix X̃^A_F with its counterpart X^A_F. To this aim we use the Hungarian algorithm [28], which in general has a complexity of O(n³). In this case, since atoms carry type information, there is no need to perform an all-vs-all alignment: only atoms belonging to the same class are aligned, reducing the complexity from pure cubic scaling to O((max_k n_k)³), where n_k is the number of atoms of type k. Since the requirement is for X^A_F and X̃^A_F to be as close as possible, an alignment loss is added, defined as the MSE between the original atom-wise representation and the surrogate one. Let us denote it by L_AL.
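A sketch of the type-wise alignment, using squared Euclidean distances as matching cost (an assumption of this sketch):

```python
import torch
from scipy.optimize import linear_sum_assignment

def align_by_type(X_f, X_s, atom_type):
    """Reorder the surrogate rows X_s to best match X_f, type by type."""
    perm = torch.arange(len(atom_type))
    for t in atom_type.unique():
        idx = (atom_type == t).nonzero(as_tuple=True)[0]
        cost = torch.cdist(X_f[idx], X_s[idx]).pow(2)       # (n_t, n_t) cost matrix
        _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
        perm[idx] = idx[torch.as_tensor(cols)]
    return X_s[perm]

# Alignment loss: L_AL = torch.nn.functional.mse_loss(X_f, align_by_type(X_f, X_s, t))
```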

Shared decoder
The last part of the framework, the shared decoder, works as follows:
• first, X (either the atomic representation X^A_F or the surrogate X̃^A_F) is fed to an MLP preprocessor P with HL activation, returning Z = P(X);
• at this point Z is fed to an MLP regressor that, for each atom (i.e. each row of the matrix Z), predicts the number of hydrogens connected to it. A hydrogen prediction loss L_HY is added to the global loss;
• third, Z is fed to an adjacency matrix decoder, originally introduced in [22], that reconstructs the adjacency matrix of the molecular graph, returning

Ã = σ(α ZZᵀ + β),

where α and β are learnable scalar parameters. Hence, a reconstruction loss term is added to the global loss:

L_A = (1/N²) Σ_{i,j} l(A_ij, Ã_ij),

where N is the number of atoms and l is the binary cross-entropy

l(a, ã) = −a log ã − (1 − a) log(1 − ã);

• Z and Ã are then used to generate new atomic features Z″ via a composition of two linear transformations (ZA embedder in figure 2);
• finally, an MLP classifier C is used to predict the bond type. Let {e_ij} be the set of edges predicted by the previously introduced adjacency matrix decoder and z″_i, z″_j the i-th and j-th rows of the feature matrix Z″, respectively. The bond type of the edge e_ij is predicted as b_ij = C(z″_i ∥ z″_j). A bond reconstruction loss term L_B is hence defined by the cross-entropy between the predicted and target bond types.
At this point, the network output consists of an adjacency matrix, bond types, atom types, and the count of hydrogen atoms for each atom. Subsequently, we attempt to generate a molecule by first adding all the bonds and hydrogen atoms, and we then validate its consistency. If inconsistencies arise, we retain only the hydrogen atoms associated with atoms involved in aromatic bonds and make another attempt. If issues persist, we proceed by disregarding the hydrogen atoms and using only the bond information. This process ultimately yields the reconstructed molecular graph.
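A sketch of the adjacency decoding and edge extraction steps; the threshold value is an illustrative assumption:

```python
import torch

class AdjacencyDecoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.ones(()))     # learnable scalar
        self.beta = torch.nn.Parameter(torch.zeros(()))     # learnable scalar

    def forward(self, Z):
        # A_tilde[i, j] = sigmoid(alpha * <z_i, z_j> + beta)
        return torch.sigmoid(self.alpha * Z @ Z.T + self.beta)

def edges_from_adjacency(A_tilde, threshold=0.5):
    # Keep off-diagonal entries above threshold as predicted bonds {e_ij}.
    mask = (A_tilde > threshold) & ~torch.eye(A_tilde.size(0), dtype=torch.bool)
    return mask.nonzero()                                   # (i, j) index pairs
```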

Unconditional generation
To train the model for the unconditional generation task, the chosen loss is the weighted sum (with weights w) of all the components described in the previous sections and has the form

L = w_H L_H + w_AL L_AL + w_HY L_HY + w_A^A L_A^A + w_B^A L_B^A + w_A^M L_A^M + w_B^M L_B^M + w_KL L_KL, (7)

where the A and M superscripts denote the adjacency and bond reconstruction losses coming from the atomic and molecular branches, respectively. Due to the complexity of the loss we employ a curriculum approach: we first concentrate on learning the atomic branch (left side of figure 1), next we switch to the molecular branch (right side of figure 1), and lastly we promote manifold compaction by pushing on the VAE-like KL divergence. Our schedule is detailed in table 1.
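The curriculum idea can be sketched as a phase-dependent weighting of the loss terms; the phase boundaries and values below are illustrative, not the schedule of table 1:

```python
def loss_weights(epoch):
    if epoch < 100:     # phase 1: focus on the atomic branch
        return {"w_atomic": 1.0, "w_molecular": 0.0, "w_kl": 0.0}
    elif epoch < 200:   # phase 2: molecular branch, distilling from the atomic one
        return {"w_atomic": 0.1, "w_molecular": 1.0, "w_kl": 0.0}
    else:               # phase 3: compact the latent manifold via the KL term
        return {"w_atomic": 0.1, "w_molecular": 1.0, "w_kl": 0.5}
```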

Conditional generation
The methodologies applied to condition the generation are tightly related to the chosen framework.
A straightforward way to achieve this goal in the case of a VAE is to navigate the latent space. This can be done when (1) the space is compact enough to give a meaning to each intermediate point (so that decoding is robust) and (2) the obtained space is structured enough to allow for gradient ascent on the property of interest. To achieve (1), training the model by simply minimizing the usual KL divergence term is enough. To achieve (2), a property predictor is jointly trained together with the unconditional model. We also observe that conditioning the single, molecular latent space results in a more controllable behaviour than conditioning its atomic counterpart. In AMCG the property predictor is a four-layer MLP regressor, trained with a weighted MSE loss L_PP which is added to equation (7). Our molecular optimization strategy is described in algorithm 1; a sketch follows.
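A minimal sketch of this latent-space gradient ascent; the hyper-parameter values and the decode_fn helper (returning a molecule and its property value, or None on a failed decoding) are assumptions of this sketch:

```python
import torch

def optimize_latent(z, property_predictor, decode_fn, lr=0.01, steps=50, decode_every=5):
    z = z.clone().requires_grad_(True)
    best_mol, best_val = None, float("-inf")
    for step in range(1, steps + 1):
        score = property_predictor(z).sum()
        (grad,) = torch.autograd.grad(score, z)
        z = (z + lr * grad).detach().requires_grad_(True)   # ascend the property
        if step % decode_every == 0:
            decoded = decode_fn(z)                          # decode an intermediate point
            if decoded is None:                             # failed decoding: stop here
                break
            mol, val = decoded
            if val > best_val:                              # keep the best molecule on the path
                best_mol, best_val = mol, val
    return best_mol
```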

Datasets
The model was tested on two of the most important de novo molecular design benchmark datasets: QM9 [8], which contains ∼130k samples made of up to 9 heavy atoms with atom types in {C, O, N, F}, plus hydrogens, and ZINC250k, a set of ∼250k samples extracted from the ZINC dataset [9], with atom types in {C, O, N, F, P, S, Cl, I, Br}.

Data preprocessing
Atoms and bonds have been equipped with three kinds of features: chemical, topological and physical. All the chemical descriptors have been generated using RDKit v.2023.03.3 [29]; hydrogen-related descriptors were not used for ZINC since in this dataset these atom types are not explicitly defined, due to the SMILES representation. To generate the physical descriptors, we took advantage of the AmberTools suite, version 22.3, and BiKi Life Sciences [30]. In particular, from AmberTools we used the antechamber tool to convert .pdb files into .mol2 files, so as to obtain generalized Amber force field (GAFF) atom types. When coordinates are not available (ZINC), we translate the SMILES string into a 3D structure, first by using the method of Riniker and Landrum [31] and then by minimizing it using the UFF force field parameters [32] (a sketch of this step follows). Next, we match the atoms of the input SDF (for QM9) or SMILES (for ZINC) files, from which we get the topology, to GAFF atom types to recover the equilibrium distance and spring constant of each bond; this piece of information is used as an edge descriptor. In case the exact GAFF parameter was not found, we used the parmchk2 tool to generate the missing bond parameters by chemical analogy, as commonly done when devising parameters for molecular dynamics simulations [30]. Finally, we computed a graph positional embedding, trying to capture the positional information of each node in the graph as done in [33]. This feature has been extracted using a torch-geometric built-in function. The complete list of descriptors can be found in table 2. Differently from the competing methods in table 3, we do not kekulize the training dataset, as we explicitly deliver the putative aromaticity of the bonds of the generated molecule. Overall, our bond types are: single, double, triple and aromatic.
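The 3D structure generation step for ZINC can be sketched with standard RDKit calls (ETKDG is RDKit's implementation of the Riniker-Landrum method):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_3d(smiles):
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)                         # hydrogens needed for a sensible geometry
    AllChem.EmbedMolecule(mol, AllChem.ETKDG())   # distance-geometry 3D embedding [31]
    AllChem.UFFOptimizeMolecule(mol)              # UFF force-field minimization [32]
    return mol
```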

Unconditional generation: sampling the correct subspace
The first experiment we perform is aimed at estimating the capability of the network to generate novel, unique and valid molecules without any form of conditioning. Validity is defined as the ratio between the number of valid samples (i.e. those that pass RDKit's Chem.SanitizeMol() function) and the total number of generated samples; uniqueness is the ratio between the number of unique valid molecules and the total number of valid molecules; novelty is the ratio between the number of unique valid molecules not contained in the training dataset and the number of unique valid molecules. We also define the product of validity, uniqueness and novelty as VUN, serving the role of a global aggregated metric for the model (a sketch of these metrics follows the configuration list below). The experiments have been run using the model in several configurations (i.e. using different priors to sample the latent space):
• VAE: the latent space is sampled in the classic VAE fashion, that is, from the prior distribution N(0, I).
• VAE-like: we sample from the prior N(µ, diag(α·σ)), where µ and σ are respectively the mean of the latent representations of the dataset and their standard deviation, and α is a scalar hyper-parameter.
• GMM-F: obtained by training a GMM in the latent space.
• GMM-D1/D2: obtained by training a GMM in the latent space with the restriction of diagonal covariance matrices, scaled by a scalar hyper-parameter β. The two models differ (see later) in the optimization strategy of β.
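A sketch of how these metrics can be computed with RDKit; train_smiles is assumed to hold the canonical SMILES of the training set:

```python
from rdkit import Chem

def vun(generated_mols, train_smiles):
    valid = []
    for mol in generated_mols:
        try:
            Chem.SanitizeMol(mol)                 # the validity check
            valid.append(Chem.MolToSmiles(mol))   # canonical SMILES
        except Exception:
            pass
    validity = len(valid) / len(generated_mols)
    unique = set(valid)
    uniqueness = len(unique) / len(valid)
    novel = unique - set(train_smiles)
    novelty = len(novel) / len(unique)
    return validity * uniqueness * novelty
```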
To train the GMM models we used a random subset of 10 000 molecules of the training set after embedding them in the VAE space. Hyper-parameters such as α, β and the number of GMM components offer several layers of flexibility in choosing which part of the latent space to visit. Increasing the number of components, for instance, allows for higher validity values, but novelty and uniqueness are of course affected. On the other hand, scaling up the covariance matrices allows for exploration of the entire latent space and for the generation of samples that adhere less to the original dataset. In our case, the coefficient α of the VAE-like prior has been chosen so as to maximize the VUN without resampling, and likewise for GMM-F. More explorative approaches have been adopted for GMM-D1 and GMM-D2: the former aims at a compromise between the VUN and the validity rejection efficiency (number of resamplings before acceptance), while the latter has been specifically chosen to overfit the VUN, relying largely on resampling. Due to the one-shot nature of our model, indeed, the sampling operation is fast (∼1500 molecules per second on a single V100 GPU). For this reason, to generate K valid samples one can iterate the generation until K samples that meet the validity requirements are found, discarding the others [39]. This allows the model to achieve 100% validity by design, while it may perform poorly in other metrics such as uniqueness and novelty; it will be shown later that AMCG is not affected by this problem. Results are presented in table 3 and compared with some state-of-the-art latent variable methods found in the literature. As is common practice, the column Validity w/o check reports the validity value for our model without resampling. In our results the validity value is often modest when compared to competing methods: this effect is a deliberate design choice made in order to favour uniqueness and novelty, which indeed represent the most compelling metric values in a real-world drug design campaign. While one can often algorithmically obtain valid molecules via resampling or post-hoc corrections and fixes, this does not hold for novelty and uniqueness. For this reason, while validity is important, it is a more technical concept than novelty and uniqueness, which bear more semantics. Correspondingly, uniqueness and novelty, whose values refer to a set of 10⁴ valid samples as in [14, 34-38], are state-of-the-art. We report in figure 3 some examples of molecules generated via the GMM-F model.
Estimating the VUN is standard practice for benchmarking de novo molecular design models. However, this metric may not correctly capture the genuine task objective, namely reproducing the data distribution. It has been shown that trivial models such as the AddCarbon model achieve an almost perfect VUN score [40]. Novelty, for instance, is subject to the copy problem, whereby the generated molecules differ from the original ones by one bond type, one atom, or one symbol in the SMILES string, producing an artificially high value for this metric. Also, due to RDKit's behaviour when sanitizing a molecule, single bonds may be erroneously assigned everywhere, disregarding Kekulé structures. In the competing methods of table 3 the training dataset is kekulized, hence there is no consistency check between an expected aromaticity (which is not predicted) and the actual generated molecule; this possibly contributes to artificially increasing novelty in those reports. To avoid this, if kekulization errors occur, the generated molecule is marked invalid and discarded in our tests. Furthermore, as observed in [39], uniqueness varies considerably with the number of sampled molecules and tends to decay as the sample size increases. Table 3 shows that, by judiciously choosing a configuration (GMM-D2) and using resampling, our model can achieve a near-perfect VUN as well. However, to get a deeper understanding of the real capabilities of our framework and to challenge the static VUN viewpoint, we check how novelty and uniqueness (or their product, UN) decay when sampling more and more valid molecules. This is the real-world scenario in which a medicinal chemist would like to evaluate the highest possible number of diverse hypotheses. We show in figure 4 what we call a UN persistence plot: it is realized by sampling 2000 valid molecules at each step and calculating their UN.
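A sketch of how such a plot can be built, assuming the UN is evaluated on the cumulative set of sampled molecules (an assumption of this sketch) and that sample_valid_batch returns canonical SMILES of valid molecules:

```python
def un_persistence(sample_valid_batch, train_smiles, n_steps=50, batch_size=2000):
    seen, curve = [], []
    train = set(train_smiles)
    for _ in range(n_steps):
        seen.extend(sample_valid_batch(batch_size))   # 2000 new valid molecules per step
        unique = set(seen)
        novel = unique - train
        curve.append((len(unique) / len(seen)) * (len(novel) / len(unique)))
    return curve                                      # UN value at each step
```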
Simple graphical inspection of the UN persistence plot shows that GMM-D2 has a very small decay rate, hence this model bears a strong UN dynamical behaviour. Solid performance is also obtained by GMM-D1 and GMM-F, whereas the VAE-like prior tends to decay more quickly. Undoubtedly, what these priors do is to better capture the latent manifold with respect to the usual Gaussian VAE prior, for which the UN persistence plot was omitted since it already achieved a poor score in the small-sample regime (table 3). Correctly navigating the manifold using GMMs improved generalization, acting as a fast and simple surrogate of Langevin models in latent space [10]. The VAE-like prior, however, which is a further simplification of the GMMs (or of score-based methods), incurred a (partial) mode collapse.
At this stage it would be tempting to conclude again that GMM-D2 is the best model, yet this does not compare against the reference distribution, which is our ultimate goal. In figure 5 we compare the reference (dataset) distribution of some molecular properties against the one coming from the generated samples. As we can see, the most conservative approaches provide better results in terms of mimicking the properties of the dataset, whereas more explorative ones produce different results. Means and standard deviations of the properties of interest are summarized in table 4, together with the evaluation of the internal diversity of the generated samples, defined as the average pairwise Tanimoto distance between Morgan fingerprints [41]. In accordance with the nature of the chosen priors, the diversity value also increases when moving to the less conservative ones.
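The internal diversity metric can be sketched as follows; fingerprint radius and bit size are illustrative choices:

```python
from rdkit import DataStructs
from rdkit.Chem import AllChem

def internal_diversity(mols, radius=2, n_bits=2048):
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits) for m in mols]
    dists = [1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
             for i in range(len(fps)) for j in range(i + 1, len(fps))]
    return sum(dists) / len(dists)    # average pairwise Tanimoto distance
```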

Conditional generation
In this section we move from unconditional to conditional generation. In the proposed architecture we can structurally condition the generation process by defining an arbitrary histogram of atom types. We term this conditioning structural in that no explicit additional property predictor is needed; the conditioning is already available in the proposed architecture. Next we discuss the more familiar property-induced conditioning.

Conditioning with respect to atom type histogram
In section 2.1.3 we observed that the model is robust to modifications of the predicted histogram. This can have several applications, since there are cases of chemical interest where the replacement or addition of a single atom is beneficial. A compelling case is the substitution of, for instance, an oxygen with an electronically similar atom such as sulfur [42-44]; this helps in escaping existing patents, allowing more freedom to operate. Perturbing the histogram can also bring into play so-called activity cliffs [45]: relatively small changes in the atomic composition of a molecule which lead to significant changes in potency profiles.
Trading off validity for fantasy, we perturbed the predicted histograms when sampling from the most conservative priors, VAE-like and GMM-F. We applied four different modifications (a sketch follows the list):
• ±1: the predicted histogram was modified by randomly adding or subtracting 1 to each atom type;
• ±2: the predicted histogram was modified by randomly adding or subtracting 2 to each atom type;
• random-p: the histogram was generated randomly according to the atom type probabilities of the dataset;
• random-u: the histogram was generated randomly according to a uniform probability.
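Two of these perturbations, sketched on a histogram represented as a dict from atom types to counts (an illustrative layout):

```python
import random

def perturb_pm(hist, delta=1):
    # The ±1 (or ±2) variant: randomly add or subtract delta for each atom type.
    return {t: max(0, c + random.choice([-delta, delta])) for t, c in hist.items()}

def perturb_random_p(type_probs, n_atoms):
    # The random-p variant: redraw the histogram from the dataset's type frequencies.
    types = random.choices(list(type_probs), weights=list(type_probs.values()), k=n_atoms)
    return {t: types.count(t) for t in set(types)}
```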
We then proceeded with the VUN evaluation as in section 3.3. Results are presented in table 5. As one can see, the obtained molecules are still unique and novel, with the ultimate, not obvious, result of improving the VUN. Evidently such molecules lie outside the support of the training distribution; it is nevertheless remarkable that they still bear chemically sound features.

Conditioning with respect to molecular properties
As already anticipated, gradient ascent was used to navigate the latent space with the aim of optimizing a given property. We tried to optimize penalized logP (defined as the octanol-water partition coefficient minus the SAS and minus the number of cycles of length greater than 6) and QED, as commonly done in the literature [7, 37, 46]. The experiment was carried out trying to maximize the value of the desired property for 10 000 random training molecules. In order to choose the gradient ascent hyper-parameters (i.e. the learning rate, the number of steps and the decoding step size; see algorithm 1), we first optimize 1000 molecules from the training set and record the maximum number of steps we can take without incurring a failed decoding. Once this maximal number of gradient ascent steps is found, we use it at inference time and obtain the whole set of molecules along the path; we declare as final molecular output the best molecule along the path in terms of property value. In figure 6 we show an example of an optimization path for the penalized logP property; in this case the path first opens the ring system and then increases the dipole moment of the molecule, also rendering it more linear.
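A sketch of penalized logP as defined above; sascorer is the SA score script shipped in RDKit's contrib folder, assumed to be importable:

```python
from rdkit.Chem import Descriptors
import sascorer  # from RDKit contrib (SA_Score), assumed on the path

def penalized_logp(mol):
    log_p = Descriptors.MolLogP(mol)                 # octanol-water partition coefficient
    sas = sascorer.calculateScore(mol)               # synthetic accessibility score
    n_long_cycles = sum(1 for ring in mol.GetRingInfo().AtomRings() if len(ring) > 6)
    return log_p - sas - n_long_cycles
```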
We collect the overall results in figure 7, where we report the shift of the property distribution towards higher values. Furthermore, we define the validity V as the percentage of successfully decoded samples along the path; we then report the optimization rate O, that is, the fraction of molecules for which we found a molecule with an improved property value; the success rate S, that is, the fraction of optimized molecules that are also novel; and finally the Tanimoto distance (diversity) D between the initial molecule and the optimized one.

Model assessment over ZINC dataset
The same models that were developed for QM9 were also developed for ZINC. In this case we found that compacting the latent space was too difficult; this issue made it impossible to fit the Gaussian mixture of the GMM-F model after training. For this reason, the chosen priors were: VAE, VAE-like and GMM-D, defined in section 3.3, plus what we call GMM-PW, where PW stands for pointwise. The latter is a Gaussian mixture model defined as the normalized sum of N(µ_i, α·σ_i), where µ_i and σ_i are the training data embedding vectors inferred by the model and α is a scale factor. In table 6 details about validity, novelty and uniqueness of the generated samples are reported.
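A sketch of sampling from this point-wise prior, assuming the per-molecule embeddings µ_i and σ_i are stacked into tensors:

```python
import torch

def sample_gmm_pw(mus, sigmas, n_samples, alpha=0.5):
    # Pick a training embedding at random, then sample around it: N(mu_i, alpha * sigma_i).
    idx = torch.randint(len(mus), (n_samples,))
    return mus[idx] + alpha * sigmas[idx] * torch.randn(n_samples, mus.size(1))
```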
As we can see, although the space is not well structured overall, in the vicinity of the learnt embedded training molecules µ_i the sampling process can occur safely. Analogously to QM9, sampling from the VAE-like prior leads to evident benefits with respect to VAE. Finally, the GMM-D prior exhibits poor validity without resampling, yet upon validity rejection the results are excellent. We mention that, to improve molecular quality, we applied a post-hoc algorithm to remove chords from the generated graphs. In figure 8 we show some randomly generated samples from the GMM-PW model. To corroborate our findings, we provide in figure 9 the UN persistence plot for the ZINC dataset. Again, results for the VAE prior have been omitted due to its poor performance in the small-sample regime. In this case we evaluate the UN for K = 400 000 valid samples. As we can see, the GMM-D prior is not at all affected by the extensive use of the validity rejection strategy.
Finally, in table 7 and figure 10 we can see the distribution of the molecular properties for 10 000 novel, unique samples. Consistently with the nature of the chosen priors, sampling from the VAE-like prior means sampling from a reduced portion of a hard-to-compact latent space. This results in overly concentrated property distributions, while GMM-PW, as expected, provides the best performance in mimicking the behaviour of the original dataset. One aspect that needs to be underlined, however, is that for GMM-D our molecules tend to be slightly smaller than the ones belonging to the training distribution. This is due to the inherent difficulty in building bigger molecules when departing from the embeddings µ_i. In this regard we can conclude that if limited fantasy and tight adherence to the dataset are desired, the best performing model is GMM-PW, whereas better fantasy and sustained UN persistence can be achieved with GMM-D. Lastly, we mention that in order to obtain converging models we used logP as a property to help guide the training process.

Conclusions
In this paper we introduced AMCG, a VAE-like generative model for de novo molecular design. The learnt models show state-of-the-art results on the QM9 dataset. In particular, AMCG presents an almost unique feature, namely conditioning the generation via a histogram of the number of atoms of each species; this feature fits the requirements of drug discovery campaigns well. This structural conditioning capability is coupled with the possibility of conditioning via desired physico-chemical properties, analogously to other existing methods.
To develop AMCG we employed a dual atomic-molecular representation whose training is done via a distillation strategy which could also be beneficial to other neural models. At variance with all the previously analyzed models, we assessed the capability of the network to generate unique and novel molecules with a sustained throughput. We dubbed the corresponding curve the UN persistence plot, as it provides a concise and informative representation of the model's performance in terms of structural diversity and similarity to the training data. This again fits a real-world drug discovery campaign scenario well. Lastly, we challenged the network with the notoriously difficult ZINC dataset, providing a state-of-the-art model. In conclusion, the presented architecture performed competitively or better than existing methods while bearing unique features and architectural choices. We plan to further modify this architecture to better support bigger molecules, as in the ZINC dataset it is evident (as pointed out in [47]) that a history-dependent approach to building the graph is beneficial; the model should also be improved in terms of conditioning and space compaction. In this sense, other sampling strategies might play a significant role, as well as improved aggregation strategies to extract a molecular representation (section 2.1.1). Finally, thanks to the modularity of the architecture, we plan to modify it to support specific property signals such as docking scoring functions and chemical similarity with respect to a reference chemical structure. These two properties combined will ultimately support tasks such as scaffold hopping [48], which is of paramount importance in drug design.

Figure 1. A high-level representation of the AMCG framework. Black blocks represent data objects, while red blocks represent network components. The left and right branches are the atomic and molecular ones, respectively.

Figure 2. Architectural details of the right (molecular) branch. By HMLP we mean an MLP equipped with HeteroLinear activation. During training its goal is to reconstruct the original graph; this is the only part of the model that is used during generation.

Figure 3. Randomly generated molecules from the GMM-F model.

Figure 4. UN persistence plot on the QM9 dataset. On the x axis, the number of valid generated molecules; on the y axis, the product of uniqueness and novelty (UN). The VAE-like model shows a quickly decaying generative quality, whereas the GMM-based models sustain their UN performance much better.

Figure 5. Molecular property distributions for 10 000 valid, novel and unique molecules generated after training on the QM9 dataset. The properties are: logP, QED, synthetic accessibility (SA) score and heavy molecular weight. The dashed line is the original dataset; continuous lines are the developed models. The GMM models, in particular GMM-F and GMM-D1, adhere more closely to the original dataset.

Figure 6. Property optimization path example. The optimization progresses from left to right; this leads to the opening of the ring system and an increase in the dipole moment.

Figure 7. Property optimization results on the QM9 dataset. On top of the pictures we report validity (V), optimization rate (O), success rate (S) and diversity (D). In blue the original distribution, in orange the one obtained by generating molecules. The optimized distribution shows a slight shift towards bigger values.

Figure 8. Randomly generated molecules from the GMM-PW model.

Figure 9. UN persistence plot on the ZINC dataset. On the x axis, the number of valid generated molecules; on the y axis, uniqueness multiplied by novelty. The GMM-PW and VAE-like priors are conservative but lead to a quick degradation, whereas GMM-D is able to sustain generation more effectively.

Figure 10. Molecular property distributions for 10 000 valid, novel and unique molecules generated after training on the ZINC dataset. The dashed line is the original dataset; continuous ones are our models. GMM-PW is the most conservative model, whereas GMM-D tends to produce small molecules.

Figure C.11. A plot of the AMCG training loss. Multi-aggregator denotes a linear combination of the global sum, global mean and global max aggregators, implemented via MultiAggregation(). The peculiar shape of the loss curve is due to the curriculum used to assign different weights to different components of the loss (see table 1).

Table 1. The weight schedules used to train AMCG: (a) was used to train the model on the QM9 dataset, while (b) was used for the ZINC dataset (see section 3).

Table 2. The descriptors used to train AMCG.

Table 3. Comparison of AMCG models with competing latent-variable methodologies on the QM9 dataset. The size of the generated sample set was 10⁴. The VAE model employs the regular Gaussian prior N(0, I) and collapses in terms of uniqueness. The VAE-like model uses a Gaussian prior N(µ, σ) and improves uniqueness significantly. The GMM models employ various flavours of mixture models and obtain the best results.

Table 4. Mean values and standard deviations (in brackets) for the molecular properties of 10 000 valid, novel and unique samples generated after training on the QM9 dataset. The first line, named QM9, shows the reference values. In terms of mimicking the original manifold, the GMM-F model is the best one.

Table 5. VUN evaluation for histogram perturbation. Here we perturb the histogram step of the method with several random techniques (one per line). The more we depart from the original manifold through random perturbation of the histogram, the more we lose in validity and gain in uniqueness and novelty.

Table 6. Comparison of AMCG models with competing latent-variable methodologies on the ZINC dataset. The size of the generated sample set was 10⁴. The more we depart from the manifold, the more we reduce validity. The GMM-PW model, as it samples in the close vicinity of the training points, is the most conservative one.