Exploring and mapping chemical space with molecular assembly trees

Description

. Summary of the computational complexity of the problems to find the shortest assembly pathway and to compute the assembly tree.

Problems
To find the shortest assembly pathway for a single molecule = To find the shortest addition chain of an integer + Subgraph isomorphism problem Computational complexity At least as hard as NP-complete Hard, but not been proven to be NP-complete or NP-hard NP-complete

Problems
To compute the assembly tree for a group of molecules = To find an addition chain that simultaneously forms each of a sequence of integers + Subgraph isomorphism problem Computational complexity At least as hard as NP-complete NP-complete NP-complete integers, but close to. The third column shows the measurement error (standard deviation). The second column equal to the third divides the first column, in percentage. All fragments whose measurement error is larger than 5% are dropped.

How to obtain the fragments distribution?
This part of the algorithm is to calculate a distribution that shows how many times a molecular structure or fragment is duplicated in the original molecule. For convenience, here we use a string (Fig. S2a) as an example to illustrate instead of a molecule, but the same principle applies to molecules. Now, the question is to find how many instances of each fragment (namely, substring) are contained in the original string. We see that the fragment CUCGAC appears twice, GAC appears four times, CG appears three times, etc. Ultimately, the algorithm should be able to tell us those information.
More precisely, the algorithm should tell us the value w(S) that describes how many instances of each fragment S are contained in the original string. This algorithm is conceptually simple, and has two main steps: 1. Randomly fragment the original string for many times, and count how many times each fragment appears, thus a histogram obtained.
2. Weight the counts more for longer fragments. Generally speaking, the weighting scheme is: "the weighted counts of a fragment S" is proportional to "(the actual counts of S) × (2 to the power of the length of S)".
3. Normalized the weighted counts, and then we obtain w(S).
The output of this algorithm applied to the exemplified string is shown in Fig. S1b. The details of why this can work are shown in the following.

Notation
First of all, we denote a specific fragment as s (lowercase). Note that if we want to distinguish two fragments that are identical but in different positions, we will use the lowercase s. For example, in Fig. S2a, the fragment GAC that consists of the 4th, 5th and 6th letter is denoted as s(GAC,1); while we denote another GAC fragment that consists of the 7th, 8th and 9th letter as s (GAC,2). On the other hand, if we do not want to distinguish fragments that are identical but in different positions, we will use S (uppercase). As shown in Fig. S2a, all fragments GAC are denoted as S(GAC). Therefore, the ultimate goal of this algorithm is to find w(S), rather than w(s). With that being clear, we can proceed.
Theoretically, we have n: the number of links of the original string (i.e., the number of letters minus 1); m: the number of links of the fragment in question; N = 2 n − 1: the total number of unique fragmenting events; Ma and Mb: in these N unique events, for any fragment s on the edge, it will appear for M a = 2 n−m−1 times, while for any fragment s in the middle, it will appear for Mb = 2 n−m−2 times; Π: the total number of fragments appeared in these N unique events, that is Empirically, we have # : the number of fragmenting events in the experiments (referred to as fragmenting trails); # : the total number of fragments appeared in these # empirical trails.
So, in these # empirical trails, the number of times that one fragment s appears is denoted as ℎ ' ! ( ). If s is on the edge, it is denoted as ℎ ' " ( ); if s is in the middle, it is denoted as ℎ ' # ( ).
Therefore, based on Wilson score interval in statistics, the probability of the fragment s appears is (mean ± standard deviation): where x is replaced by a if the fragment s is on the edge, and x is replaced by b if s is in the middle; z is the probit function (when the confidence level is 95%, then z = 1.96). One step further, the probability of the fragment S appears on the edge is: where # " ( ) is the total number of times that S appears on the edge. The probability of the fragment S appears in the middle is: where # # ( ) is the total number of times that S appears in the middle.
Now we can show how to calculate w(S) based on the empirical data Referring to the string in Fig. S2a, the fragment GAC has 4 instances. In the N unique fragmenting events, the fragment s(GAC,1) will appear Mb = 2 23−2−2 times, denoted as hb(s(GAC,1)). The same for s(GAC,2) and s (GAC,3). On the other hand, in the N unique fragmenting events, the fragment s(GAC,4) will appear M a = 2 23−2−1 times, denoted as ha(s (GAC,4)). Eventually, we can use these counts hx to achieve the answer 4, i.e., In general, we have where Ha(S) is the total number of times that S appears on the edge, and Hb(S) is for S in the middle. If we substitute the empirical values, we obtain the measured w(S), i.e., .( ). In order to evaluate the empirical error, we use Eq. (S3.1) and (S3.2) to calculate Ha(S) and Hb(S), rather than directly using # " ( ) and # # ( ). Therefore, we have Equivalently, the mean is

Monte Carlo
Step 1: 1.1. Read mol-file to construct a "molecule" object (MOL_BOND class), which stores all of the atoms and bonds, also giving each atom an index and each bond (connecting two atoms) an index.
1.2. Find all atoms that have more than one bond linked to it, and make a list recording them.
1.3. For each atom in the list, randomly choose one possible scheme to "split the bonds". For example, this atom has bond 1, 3 and 6 linked to it; so in total it has five ways to split the bond: no split (136), three ways to split into two (1,36), (3,16) and (6,13), split into three (1,3,6). In general, there is a mathematical formula to generate all of the possible splitting.
1.4. Commit the split, meaning: add atoms to the split end of the bond. For example, if you choose to cut into (1,36), then add an atom to the split end of bond 1; for bond 3 and 6, we don't need to add atoms since we can just use the original end-atom.
1.5. After go through all atoms that have more than 1 bond linked to it, we obtain a molecule with bonds cut (still a MOL_BOND class object). Now we can use depth-first-search algorithm to work out all the disconnected parts. Each disconnected part is a fragment of this molecule. Then we obtain a list of fragments.
1.6. For each fragment, we know which bonds it consists of, and we can also calculate the number M (referring to SI section 3.3 for details about M). All of these gives hist_item (the information of one item of the overall fragment histogram). As 1.3 -1.6 (the blue part in the flowchart Fig. S3) are the main processes in Step 1, we summarize them further here: the outcome of 1.3 -1.6 is to obtain hist_item that will be used in 1.7 to construct the distribution of fragments of this molecule (namely, a histogram).
hist_item is a list of tuples each of which stores the complete information of a fragment (so this list of tuples represents all of the fragments that constitute this molecule). Each tuple consists of three things: InChI string of this fragment, the bond indices that constitute this fragment (called "fragment identifier"), a number (denoted as M) associated with this fragment that will be used to weight the counts to get the final histogram. Refer to SI section 3.3 to see how M is calculated and used to weight.
1.7. Add hist_item to the overall histogram, where the x-axis is the InChI of a particular structure and y-axis is the count of that structure. A map is also constructed which associates a list of fragment identifiers and corresponding M values with each InChI string.
That is, the key of this map stores InChI strings; and under any InChI, there is a list of fragment identifiers (that all have the same InChI) and the corresponding M.
After repeating 1.3 -1.7 for Nstep1 (a predefined parameter) number of times, we get the overall histogram. The y-axis is the count of how many times a fragment appears in Nstep1 random cuttings. For example, we are looking at this molecule C-N-C-O-C-N-C; and the fragment C-N-C appears 1 million times in all these cuttings, then y-axis for C-N-C is 1 million.
But 1 million is not what we want. Instead, we want to know how many times a fragment has been duplicated in this molecule. In this case, it is 2. So we need to convert the counts (1 million) to number of duplications (two). This conversion can be done by using the counts and the number M (see details in SI section 3.3), which gives the statistical measurement of how many times a fragment is duplicated, mean ± error.
1.8. Clean up the histogram. We first delete the items that have the error > Hist_Err_Threshold (a predefined parameter, 5% by default). Then we round the means to integers. Finally, we delete the items whose corresponding y-axis values are 1 (i.e., it only appears once in this molecule).

Monte Carlo
Step 1 is now complete. In the end, we obtain a histogram where x-axis is the InChI string of fragment and y-axis is how many times it duplicates in this molecule (with 100%-5% = 95% confidence by default).
Randomly choose a fragment x based on the histogram f(x), i.e., the relative y-value represents the probability that a certain x is chosen.
2.2. Randomly choose a "position" for x. Here "position" means the "fragment identifier".
2.4. Commit, so we delete 3, 5 and 6 from the list S, means that we consider this fragment has been taken.
2.5. If the position is not available, then it means x must be overlapped with one or more already-taken fragments. In the flowchart, get the overlapped bulk means figuring out these already-taken fragments, say, (y1, y2, y3, …), and make a new fragment that consists of x and y1, y2, y3…. We denote this big new fragment as bulk x'.
2.6. Check if x' is contained in the histogram. If yes, it means x' is a duplicated fragment; otherwise not.
2.7. Commit here means delete each index in x' from the list S, meaning they have been taken.
2.8. Here n means for how many times we cannot find an available position for new fragments continuously. If it is larger than a predefined number N, we stop this searching process, meaning that we assume no position available. By default, N is set to twice the number of bonds of the molecule.
2.9. We now have a list of fragments that are used to make the molecule. Now we check which fragments appear more than once in this list, the "duplications", and then remove them, storing them separately (2.10 below). The leftover fragments constitute the part we call the "residue".
2.11. The residue is a list of non-duplicated fragments. Now we consider them as one whole molecule (namely a MOL_BOND object) although it consists of disconnected parts. We need to obtain a fragment histogram for it, similar to f(x) described above to the original molecule. Naively we can repeat Monte Carlo Step 1 on the residue to get such a histogram, but we have a shortcut here: We check every InChI string in the original f(x), to see whether it is a fragment of the residue. If it is, check how many duplications it has, and then we obtain one histogram item. Ultimately, we obtain the histogram for the residue, still denoted as f(x) which will be used again.
2.12. At this point, we have repeated the above process until no further duplication can be found. In each loop we obtained a bag (or multiset) of duplications, and now we add them together to get the bag that contains all the duplications, which is exactly the multiset representation of one assembly pathway.
2.13. Lastly, based on the multiset representation, we can calculate the index of this pathway.
2.14. Run 2.1 -2.13 for Nstep2 (a predefined parameter) number of times, to find the pathways that have the lowest pathway index.

Executable program manual (AssemblyMC.exe)
Download zip file AssemblyMC.zip. Unzip it and you will see six files: 5. example_Adenine_pathway.txt, one file of the results, which will be explained later.
6. example_Aenine_histogram.txt, one file of the results, which will be explained later.
Keep them in the same folder. Then: 1. Download the mol file of the molecule that you wish to calculate, and put it into the same folder of AssemblyMC.exe (here we will just take Adenine.mol as an example).
2. Open Windows command prompt (cmd.exe), and navigate to this folder.
3. Enter and it will then analyze the file Adenine.mol, namely, molecule adenine.
There are two optional parameters. The first parameter (Nstep2, referring to SI section 3.4) specifies how many possible assembly pathways to try in order to find the shortest one (-1 is the default value, meaning there is no limit); while the second parameter (Nstep1, referring to SI section 3.4) means how many fragmenting schemes to try to obtain the fragments histogram (-1 is the default value, meaning either 1% of all possible fragmenting schemes, or 100000, whichever is greater).
The example above uses no parameter. The following example uses both parameters which we will take as an example to show what to expect when it runs and after it finishes. It first calculates the fragments histogram, and shows the process as When the number (e.g., >67144) shown equal to the second parameter, this process to obtain the fragments distribution (Monte Carlo Step 1, referring to Fig. S3) finishes. And the message below will be displayed in the prompt: After that, it will start Monte Carlo Step 2 (Fig. S3) immediately. It will keep displaying the symbol ">" until 10000 pathways have been tried. Then, in the same folder, a file Adenine_pathway.txt will be created that is the ultimate result and contains all the information we need, which we will explain how to interpret soon (it will also display some texts in the prompt, which basically repeats what is written in the file Adenine_pathway.txt). Note that the program will continue to run until it has tried Nstep2 (the first parameter you input) number of possible assembly pathways; but each time it has tried 10000 pathways, it will update and overwrite the file Adenine_pathway.txt. If the first parameter you input is the default value -1, you may close the program manually when you think it has tried enough possible pathways. Now we will explain how to interpret the results contained in Adenine_pathway.txt. We will take the file example_Adenine_pathway.txt as an example to explain due to the fact that this method is Monte Carlo (thus contains randomness) so there might be slight differences at each time it runs. Indeed, example_Adenine_pathway.txt is the file Adenine_pathway.txt generated for a particular run, and we just renamed it as example_Adenine_pathway.txt.
The block started with "| |" records how the molecule is represented in the program. The lines starting with capital letters represent atoms. For example, the first line "C #8 --> 8, 10" means the carbon (C) atom's ID is 8 (denoted by symbol #), and it is attached by bond 8 and 10; the seventh line "N #4 --> 3, 8" means the nitrogen (N) atom's ID is 4, and it is attached by bond 3 and 8. The lines starting with integers represent bonds. For example, the first line "0 *1 : #0 --#1" means this bond's ID is 0, which is a single bond (denoted by symbol *), and it connects atom 0 and atom 1; the fourth line "10 *2 : #5 --#8" means this bond's ID is 10, which is a double bond, and it connects atom 5 and atom 8. All other information is self-evident. Now let's look at the lines "Number of pathways: 2" and "Assembly index is: 7". It tells you that the assembly number (MA) of this molecule adenine is 7 (namely, the assembly index of the shortest assembly pathways of adenine), and there are two assembly pathways with this MA, each of which is displayed in the following lines, starting with "=== Pathway i ===".

{
, } Note that we should ignore all of the hydrogens as we have ignored them from the beginning (these hydrogens appear in the graphs because we have to use them as some "placeholders" to generate proper InChI's).

Examples: several molecules' shortest assembly pathways calculated
Here we apply AssemblyMC.exe to five molecules as examples. We always set the parameters Nstep1=100000 and Nstep2=10000. The first example is aspirin. The command we run is From the output of this program (see SI section 3.5 for details), we know that aspirin's MA is 8 and one of the shortest assembly pathways in multiset representation is as follows: Note that we should ignore all of the hydrogens as we have ignored them from the beginning (these hydrogens appear in the graphs because we have to use them as some "placeholders" to generate proper InChI's).
Likewise, the 2nd example is hexachlorobenzene. Its MA is 5, and one of its shortest assembly pathways is: The 3rd example is tryptophan. Its MA is 11, and one of its shortest assembly pathways is: The 4th example is 5-[(1-Carboxyvinyl)oxy]-4-hydroxy-3-(phosphonooxy)-1-cyclohexene-1carboxylic acid. Its MA is 14, and one of its shortest assembly pathways is: The 5th example is sildenafil, or commonly known as the brand name Viagra. Its MA is 25, and one of its shortest assembly pathways is:

Comparison with reassembling from arbitrary fragments
In the main text, we have compared our method (denoted as "assembled") with reassembling from only single bonds (denoted as "random"), see Fig. 8 in the main text. Here, we further compare the method of reassembling from arbitrary substructural fragments (denoted as "arbitrary"), i.e., the fragment to be reassembled can be any substructure of the original set of molecules. Any other settings are exactly the same as that in Fig. 8. The results are shown in Fig. S17. We can see that (1) for Tanimoto similarity, the "arbitrary" method performs worse than our proposed "assembled" method (referring to Fig. 8) but better than the "random" method, while (2) for QED measurement, the "arbitrary" method performs worse than our proposed "assembled" method and pretty much the same as the "random" method. To summarize, our proposed "assembled" method performs better than the "arbitrary" method, as expected. Besides the worse performance, the "arbitrary" method has some other problems. First of all, the number of all possible fragments is much larger, over half a million for this set of 6 opiates for example. When we need to include lots of original molecules in practical drug designs, this would cause a big problem of storing and handling this large amount of data, since, roughly speaking, the number of possible fragments increases exponentially with the number of bonds of molecules. Second, when we were doing the "arbitrary" method in silico, it happens very frequently that a new generated molecule is simply a large fragment or probably with an extra single bond attached somewhere (which is very unlikely to happen in the "assembled" method).
This is not mysterious because for all possible unique substructures, a large fraction is taken up by large substructures (this is straightforward to see, e.g., for a non-chain molecule, there are many ways to detach two bonds without splitting the molecule into two, and each alternated structure is a valid "substructure"). But these "new" molecules would be no use and we should filter them for good. All in all, the ultimate problem of the "arbitrary" method lies in the fact that the useful information is submerged (practically lost) in the tremendously large set of all possible substructures, while our method is exactly used to extract this useful information.