t-SMILES: a fragment-based molecular representation framework for de novo ligand design

Effective representation of molecules is a crucial factor affecting the performance of artificial intelligence models. This study introduces a flexible, fragment-based, multiscale molecular representation framework called t-SMILES (tree-based SMILES) with three code algorithms: TSSA (t-SMILES with shared atom), TSDY (t-SMILES with dummy atom but without ID) and TSID (t-SMILES with ID and dummy atom). It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph. Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold show the feasibility of constructing a multi-code molecular description system, where various descriptions complement each other, enhancing the overall performance. In addition, it can avoid overfitting and achieve higher novelty scores while maintaining reasonable similarity on labeled low-resource datasets, regardless of whether the model is original, data-augmented, or pre-trained then fine-tuned. Furthermore, it significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks. It also surpasses state-of-the-art fragment-, graph- and SMILES-based approaches on ChEMBL, Zinc, and QM9.

Reviewer #1 (Remarks to the Author):

Maybe, the authors should just concentrate on t-SMILES and the distribution reproduction experiments from the GuacaMol benchmark. Dump all the rest, or try to publish the rest in a future paper, once t-SMILES have been established.
If indeed the authors have created a new string-based representation for molecules, then this is a key and much needed contribution nowadays. Please distill this paper and concentrate on t-SMILES vs. SMILES, DeepSMILES and SELFIES. If you indeed improve on those three, or even just over SMILES and DeepSMILES, that would be interesting already.
Minor comments:
===============
- p5: why the validity of generated molecules is not 100%? Can't you shoot for this? You are at 99% already! If you have a proper algorithm, I don't see why it would not work all the time.
- at this step, potential users will only be interested by an open-source encoder and decoder for t-SMILES; possibly in Python
- p7 and Fig 1: I don't even understand how you fragment molecules. Can you show several famous molecules fragmented / t-SMILES encoded by your approach? Caffeine, aspirin, paracetamol, etc. From a single example, and if your explanations are not good enough, how are people supposed to follow what you are doing? Humans, like deep-learning models, might need a few examples in order to generalize. The fragmentation in Fig 1 seems like heavy atoms were saturated with hydrogens after bonds were cut; some bonds were duplicated after breaking in two a fused ring system (what fragmentation algorithm on earth does this? and how to reconnect fragments after such a molecular "disintegration"?)
- are there constraints on the molecular fragmentation scheme t-SMILES can use? E.g. is opening rings OK?
- p15: proper data augmentation for SMILES is via SMILES randomization; cf. Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11, 71 (2019). https://doi.org/10.1186/s13321-019-0393-0 Is this what you call SMILES enumeration?
Reviewer #2 (Remarks to the Author):

In this paper, the authors introduce t-SMILES, a new molecular representation that includes two additional special characters to the SMILES vocabulary to account for rings and branches in a way that does not require paired brackets or SELFIES tokenization. t-SMILES uses breadth-first rather than the depth-first search used for constructing SMILES, to reduce the nesting depth of characters and reduce the need for generative models to capture long-range dependencies in the grammar. They show benefits to generating valid and novel molecules when training generative models on this input representation.

Comments and questions
In line 259, it's stated that candidates are randomly selected when assembling pieces during reconstruction of a molecule from t-SMILES. What is the "recovery rate" for a common dataset (e.g. Zinc 250k) when converting back and forth from SMILES to t-SMILES? The inverse of this question may be answered in Table 2.

In line 288, are the authors saying that "&^" characters occur frequently enough in the tokenization for generative models to pick up on?

Because the authors refer to t-SMILES in the tables using the names of the approaches they use for substructure identification (e.g., JTVAE), I'd suggest highlighting those columns and somehow indicating that they are all flavors of t-SMILES.

In line 325, what is meant by "models that do not require training…introduce other problems such as long training time"?

How does t-SMILES compare to CREM and Group SELFIES? These approaches also introduce chemical diversity through either "chemically reasonable mutations" (CREM) or assembling fragments (Group SELFIES).
Actually, we really hoped to show the generality of our proposed approach as efficiently as possible by comparing it with SOTA tools. We apologize for not achieving our objective. Due to our limited experience, we submitted an original manuscript that did not meet the desired standards.
However, we have done a major revision to enhance its academic quality.As listed in Response 1.3, we have condensed, merged, relocated, or removed some sections and content.
As a result, I don't understand what are the key contributions and key experiments that the authors would like to show us.

Response 1.0 Key contributions
Reviewer 2 has provided a clear and objective summary of our contributions. It is quoted directly here: In this paper, the authors introduce t-SMILES, a new molecular representation that includes two additional special characters to the SMILES vocabulary to account for rings and branches in a way that does not require paired brackets or SELFIES tokenization.
t-SMILES uses breadth-first rather than the depth-first search used for constructing SMILES, to reduce the nesting depth of characters and reduce the need for generative models to capture long-range dependencies in the grammar.
They show benefits to generating valid and novel molecules when training generative models on this input representation.
Jean-Marie Lehn's famous quotation, "Atoms are letters, molecules are the words, supramolecular entities are the sentences and the chapters" (Lehn, 1988) (Jean-Marie Lehn - Interview, n.d.), was cited by researchers (Cadeddu et al., 2014) who found that the rank distribution of fragments in organic molecules is similar to that of words in the English language. This idea inspired us to use advanced NLP methodologies for molecular modeling.
Therefore, we must first address two key questions: 1) What are 'chemical words'? and 2) How can they be encoded as 'chemical sentences'? Defining 'chemical words' or 'chemical fragments' is a significantly challenging task, more difficult than word segmentation in NLP. Fortunately, there are some published algorithms available that can generate chemical fragments.
The main contribution of our work is the t-SMILES framework, along with its encoding and decoding algorithms, which addresses the second question. Since it is not possible to cover all ideas, research, and experiments in one paper, at least the following points should be included:

1) t-SMILES serves as a scalable molecular description that encodes fragmented molecules as a string. Instead of using a dictionary ID, classical SMILES is utilized to describe the molecular fragments. As is widely recognized, SMILES is in some respects more advantageous than graphs in describing molecules, for example in representing chirality. By utilizing t-SMILES, we can not only maintain SMILES' benefits but also enhance them.

2) Compared to atom-based classical SMILES, t-SMILES introduces only two extra, unpaired characters and uses a BFS rather than a DFS algorithm. The main reason the classical SMILES model generates invalid strings is deep nesting and long-term dependencies. The t-SMILES algorithm effectively reduces the nesting depth of characters and the need for generative models to capture long-term dependencies in the grammar. On the other hand, the BFS algorithm obtains the shortest path between two nodes, which better reflects the properties of real molecules.
3) The t-SMILES framework includes a decoding procedure that reconstructs t-SMILES strings into valid chemical molecules. This integration unifies distributional and non-distributional approaches into a single system. Our research evaluated the random and goal-oriented algorithms separately and obtained promising results. Advanced algorithms like MCTS and CReM have the potential to further enhance performance in the future.

4) t-SMILES is an open and flexible framework, which is able to integrate classical SMILES as a special case and construct a multi-code system for describing molecules that enables efficient exploration of larger chemical spaces. Our study evaluated the JTVAE, BRICS, MMPA, Scaffold, and Open-Ring algorithms and demonstrated their diversity, especially in low-resource and goal-oriented tasks. It is believed that t-SMILES will be able to support additional fragmentation algorithms, such as pharmacophores, functional groups, and maximum common substructures (MCS), in the future, to help address the challenges of molecular design tasks in real-world environments and the research of the 'chemical word' problem. On the other hand, although SMILES is used in this study to demonstrate the t-SMILES framework, it is important to note that other formats such as DeepSMILES or SELFIES can also be used to describe fragments, resulting in t-DeepSMILES or t-SELFIES, to take advantage of their benefits. This could be an interesting topic for further research.

5) t-SMILES uses a tree to encode multiscale and hierarchical molecular topologies, so whether the tree structure can be learned, and how LMs go beyond superficial statistical correlations to learn the chemical knowledge of molecules, remains to be explored in depth, since some research has shown that LLMs can understand well-formed English syntax.
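The BFS encoding described in point 2 can be illustrated with a toy sketch. This is not the actual t-SMILES implementation (the class and function names are ours): a binary tree of fragment SMILES is serialized in breadth-first order, with the unpaired '&' marking an absent child and '^' separating tokens, so no bracket pairing is required.

```python
from collections import deque


class Node:
    """A node of a fragment tree; each node holds one fragment SMILES."""

    def __init__(self, smiles, left=None, right=None):
        self.smiles, self.left, self.right = smiles, left, right


def bfs_encode(root):
    """Serialize a binary fragment tree in BFS (level) order.

    '&' marks an absent child and '^' separates tokens, so no paired
    brackets are needed -- a toy sketch of the t-SMILES idea only.
    """
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append('&')
            continue
        out.append(node.smiles)
        queue.append(node.left)
        queue.append(node.right)
    return '^'.join(out)


# A two-fragment tree: an ethane fragment with a cyclohexane left child.
print(bfs_encode(Node("CC", Node("C1CCCCC1"), None)))
# -> CC^C1CCCCC1^&^&^&
```

Because serialization is level by level rather than depth first, a long chain of branches never produces deeply nested delimiters, which is the property the framework relies on.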
Systematic experiments indicate that t-SMILES exhibits impressive performance on low-resource datasets, whether the model is original, data-augmented, or pre-trained then fine-tuned. It significantly outperforms classical SMILES, DeepSMILES, and SELFIES in goal-directed tasks. Meanwhile, t-SMILES models surpass SOTA fragment-, graph- and string-based approaches on ChEMBL, Zinc, and QM9.

Response 1.1 Key experiments
Since well-optimized language models have been shown to be effective in various studies, including on the open-ring problem and goal-directed tasks, we evaluated t-SMILES from the following perspectives. Firstly, t-SMILES was systematically evaluated by exploring its distinctive properties, considering that the boundaries between different codes depend heavily on their fundamental distinctions. Following this, comprehensive experiments were performed on two labeled, low-resource datasets: JNK3 and AID1706. Our study aims to compare and evaluate the advantages of t-SMILES and its alternatives using standard, data-augmentation, and pre-training fine-tuning models. In line with our goal, we evaluated twenty goal-directed tasks on ChEMBL in parallel, as presented in the revised manuscript.
Additionally, we conducted experiments on three widely-used datasets, ChEMBL, Zinc, and QM9, employing all code algorithms to evaluate the overall performance of t-SMILES.We evaluate t-SMILES by comparing it to its counterparts, the fragment-based and graph-based baseline models.
To evaluate the adaptability and flexibility of t-SMILES with respect to fragmentation algorithms, we employed four previously published fragmentation algorithms (JTVAE, BRICS, MMPA, and Scaffold) to break down molecules.
Furthermore, we conducted a rapid ablation study in order to enhance our comprehension of, and confidence in, t-SMILES. This experiment indicated that TSSA models achieve higher novelty scores across all these models, with reasonable FCD scores.
The table below is a brief summary of the experiments conducted in our study.

Maybe, the authors should just concentrate on t-SMILES and the distribution reproduction experiments from the GuacaMol benchmark.

Response 1.2 GuacaMol benchmark
We are most grateful for your suggestion.
Overall, we use distribution-learning benchmarks which are described in GuacaMol to evaluate the fundamental performance of t-SMILES models.From the perspective of optimization, it could be considered that the task can be solved once the molecules with a high score of a certain index are generated, but these generated molecules may not be useful.Therefore, we also use Wasserstein distance metrics for physicochemical properties in various experiments to assess the ability of generative models to effectively learn the physical and chemical characteristics of molecules in the training set.
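For equal-size empirical samples, the 1D Wasserstein distance used here for property distributions reduces to the mean absolute difference between order statistics. A minimal sketch (the function name is ours, and the equal-size assumption is a simplification; libraries such as SciPy handle the general case):

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-D Wasserstein-1 distance between two samples.

    For two samples of equal size n, W1 equals the mean absolute
    difference between their sorted values (order statistics).
    """
    assert len(xs) == len(ys), "this sketch assumes equal-size samples"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


# Shifting a distribution by 1 yields a distance of 1.
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # -> 1.0
```

Computing this per physicochemical property (logP, molecular weight, QED, etc.) between generated and test sets gives the distances reported in the tables.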
The Distribution-Learning metrics of models: ORGAN, LSTM, CharacterVAE, AAE, and Graph MCTS were listed as key baseline models in the original manuscript, with the exception of the random sampler.We have updated the reference information in the revised version to prevent any misleading reference numbers.
GuacaMol includes Distribution-Learning Benchmarks and Goal-Directed Benchmarks. In response to Reviewer 2, we performed goal-directed learning with 20 subtasks on ChEMBL using TSDY to evaluate t-SMILES, SMILES, DeepSMILES, and SELFIES in the revised manuscript.
Dump all the rest, or try to publish the rest in a future paper, once t-SMILES have been established.
If indeed the authors have created a new string-based representation for molecules, then this is a key and much needed contribution nowadays.
Please distill this paper and concentrate on t-SMILES vs. SMILES, DeepSMILES and SELFIES.

Response 1.3 Distil & future paper
We are very grateful for your review and suggestions. We have refined this paper to improve its academic quality. However, to ensure systematic evaluation, we have to retain some contents of the original version. Here is a list of condensed, merged, relocated, and removed contents:

We understand that many contents of the original manuscript seem more suitable to publish in a next paper, once t-SMILES has been established. But as we pointed out before, we feel that it would be very difficult to convince the audience to accept our proposal to make any changes to the famous classical SMILES system, the de-facto standard for string-based representation of molecular information in silico. We hoped to try our best to provide evidence for our proposal.

Response 1.4 t-SMILES Vs. SMILES, DeepSMILES and SELFIES
We have conducted additional experiments on ChEMBL, ZINC, and QM9 and performed analyses focusing on the comparison of t-SMILES vs. SMILES, DeepSMILES, and SELFIES.
If you indeed improve on those three, or even just over SMILES and DeepSMILES, that would be interesting already.

Response 1.5 Encouragement
We appreciate the reviewer's kind feedback. Recently, pre-trained Transformer-based language models (LMs) have demonstrated their ability to generate English text that closely resembles human writing.
For the English language, an arbitrary combination of letters from the Latin alphabet will not necessarily form a valid word; in this sense, English is not robust. SELFIES, by contrast, is robust with respect to chemistry, as discussed by the authors tracing the evolution from SMILES to SELFIES toward perfect robustness (Krenn et al., 2022).
We hope to borrow NLP models for solving chemical problems and believe that the original SMILES model was really the best starting point in selecting models for describing molecules.
Besides making SMILES more robust with respect to chemistry, as SELFIES does, we focus our efforts on encoding-decoding algorithms and on enriching the chemical information embedded in the new molecular representation framework, t-SMILES, which is based on fragmented molecules described with classical SMILES-type strings.
Our experiments demonstrate that the t-SMILES family can outperform SMILES, DeepSMILES, and SELFIES in certain tasks that utilize appropriate singleton or hybrid models and are well-optimized, particularly on low-resource datasets and in goal-directed tasks.

Thank you for your comments.
We are sorry that the fragmentation procedure in Fig. 1 seems difficult to understand, leading to the confusion: "what fragmentation algorithm on earth does this? and how to reconnect fragments after such a molecular 'disintegration'?".
Actually, the fragmentation algorithms demonstrated in this work, including JTVAE, BRICS, MMPA, and Scaffold, were all published years ago. We simply use them and focus our attention on the encoding-decoding algorithms.
We agree that the decomposition algorithm of JTVAE is somewhat complex and differs from BRICS, MMPA, and Scaffold from a chemical perspective. Since this paper focuses on a different topic, we suggest interested readers refer to the original paper for a more detailed explanation of how molecules are cut into pieces.
- are there constraints on the molecular fragmentation scheme t-SMILES can use? E.g. is opening rings OK?
Response 1.9 Opening rings

Although TSSA could theoretically support ring opening, the logic is somewhat complex. We also realize that ring opening is an important chemical problem when studying retrosynthesis and reaction prediction, so we optimized t-SMILES with TSDY and TSID. TSDY and TSID use a dummy atom '*', with or without an ID, instead of the shared atom, to reduce complexity and achieve higher performance.
To evaluate the scalability and adaptability of t-SMILES for the ring-opening problem, we use RBrics (Zhang et al., 2023) to fragment molecules and TSID as the encoding-decoding algorithm. The experiments on ChEMBL are given in SI.E.8 for reference. Fragmenting molecules is a complex problem, but not the main focus of this study. Therefore, the experiments are included in the Supplementary Information (SI) for reference and future research.
As mentioned above, the length of this paper is already substantial, yet TSDY and TSID need to be published. As a compromise, some experiments have been moved to the SI.
- p15: proper data augmentation for SMILES is via SMILES randomization; cf. Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11, 71 (2019). https://doi.org/10.1186/s13321-019-0393-0 Is this what you call SMILES enumeration?

Response 1.10 SMILES enumeration
We thank the reviewer for pointing out this difference.
SMILES randomization has the same meaning as SMILES enumeration in our manuscript.

Reviewer #2 (Remarks to the Author):
In this paper, the authors introduce t-SMILES, a new molecular representation that includes two additional special characters to the SMILES vocabulary to account for rings and branches in a way that does not require paired brackets or SELFIES tokenization.
t-SMILES uses breadth-first rather than the depth-first search used for constructing SMILES, to reduce the nesting depth of characters and reduce the need for generative models to capture long-range dependencies in the grammar.
They show benefits to generating valid and novel molecules when training generative models on this input representation.

Response 2.0 Key contributions
We deeply appreciate the reviewer's objective and concise summary of this study's key contribution.

Comments and questions
In line 259, it's stated that candidates are randomly selected when assembling pieces during reconstruction of a molecule from t-SMILES. What is the "recovery rate" for a common dataset (e.g. Zinc 250k) when converting back and forth from SMILES to t-SMILES? The inverse of this question may be answered in Table 2.

Response 2.1 Recovery rate
We thank the reviewer for this new metric.
1) If "recovery rate" is defined as the fraction of valid and unique molecules that are found in the training dataset after reconstruction (using canonicalized SMILES), it serves as an inverse measure of "novelty" as specified in this study. In this scenario, to facilitate comparison, we use novelty for all experiments. Please refer to Table 1.

In line 288, are the authors saying that "&^" characters occur frequently enough in the tokenization for generative models to pick up on?

Response 2.2 Frequent occurrence of "&^" characters
We deeply appreciate the reviewer for bringing this matter to our attention.
We also recognize that this is a tricky problem. If the frequent occurrence of "&^" characters were the key reason generative models can pick them up correctly, then why can't generative models generate 100% correct '(' and ')', which also receive the second-highest frequency scores in SMILES? Here we just want to point out that, in terms of frequency of occurrence, the "&^" characters in t-SMILES play a role similar to that of '(' and ')' in SMILES. But the crux of the matter is not the frequency of occurrence; rather, it is the design of the t-SMILES framework itself, which addresses the syntactic invalidity associated with unbalanced parentheses. The newly introduced symbols '&' and '^' in t-SMILES do not need to be paired, while '(' and ')' require pairing, causing SMILES syntax to have deep recursion. t-SMILES, using smaller SMILES fragments, reduces the long-term dependencies in the grammar and simplifies the overall complexity. For this reason, we also calculated the Nesting Depth. This is a very basic understanding; however, proving it in theory goes beyond the scope of this study and could be a target for future research. Although t-SMILES and classical SMILES appear similar, there are some fundamental differences between them.
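The Nesting Depth statistic referred to above can be computed with a single scan over the string. A minimal sketch (our own helper, not the paper's code):

```python
def nesting_depth(s):
    """Maximum nesting depth of '(' ... ')' branches in a SMILES-type string.

    Unpaired markers such as '&' and '^' never increase the depth, which
    is why t-SMILES strings stay shallow.
    """
    depth = max_depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ')':
            depth -= 1
    return max_depth


# Alanine has a branch inside a branch:
print(nesting_depth("CC(C(=O)O)N"))  # -> 2
```

A generative model emitting the deeper string must keep every open parenthesis "in mind" until its partner appears, which is exactly the long-range dependency t-SMILES avoids.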
On the other hand, the addition of the "&^" symbols does not introduce a new challenge, such as a scarcity-reasoning problem, as evidenced by their high frequency of occurrence.
Analyzing deep learning models is a complex and arduous topic. Regrettably, we needed to exclude some basic discussion, such as section 3.1.2 on Entropy, due to length limitations.
Nevertheless, we are continuing our research to improve our understanding of the fundamental properties of t-SMILES from both an algorithmic and chemical perspective.
In response to this issue, we have revised it as: "It means that a crucial task of the SMILES-based model is to learn and predict paired '(' and ')' symbols. Conversely, the t-SMILES-based model must learn how to reason about the non-paired symbols '&' and '^'."

Because the authors refer to t-SMILES in the tables using the names of the approaches they use for substructure identification (e.g., JTVAE), I'd suggest highlighting those columns and somehow indicating that they are all flavors of t-SMILES.

Response 2.3 Abbreviations
We are most grateful for your kind suggestion. In the revised manuscript, new names and abbreviations have been used to describe the t-SMILES algorithms; for example, TSSA_J denotes a TSSA-style code algorithm that uses JTVAE as the fragmentation algorithm.
In line 325, what is meant by "models that do not require training…introduce other problems such as long training time" ?

Response 2.4 Training molecules
We thank the reviewer for bringing our attention to the omission of the keywords "training molecules". This has been corrected as follows: "The authors point out: models that do not require training molecules are free from this problem, …"

How does t-SMILES compare to CREM and Group SELFIES? These approaches also introduce chemical diversity through either "chemically reasonable mutations" (CREM) or assembling fragments (Group SELFIES).
We thank the reviewer for suggesting these analyses. These two articles have been cited. We discuss them separately in two items: Response 2.5 and Response 2.6.

Response 2.5 CReM
CReM presents an intriguing approach to generating molecules from fragment data. However, the authors note that it is "less suitable for a stochastic sampling of compounds similar to ones used for generation of a fragment database", and that the "other major limitation of the current implementation is the inability to create new ring systems, so the performance depends on their representativeness in the input compound database".
1. Rigorous and systematic experiments on goal-directed learning have been added in the revised manuscript, where CReM serves as a key baseline for evaluating t-SMILES algorithms.
2. Experiments on t-SMILES using goal-directed reconstruction exhibit the potential for achieving better scores than CReM through appropriate optimization processes. More comprehensive experiments and nuanced discussions, tables and figures have been added, such as Table 4 and Fig. 4.
3. In fact, the CReM approach differs from t-SMILES in that it seems to struggle with learning the distribution of the training data. On the contrary, t-SMILES can effectively learn the distribution of the training data.

4. t-SMILES has the ability to generate fragments, including new rings, that can serve as a foundation for CReM. It would be an interesting direction for future research to use a simple CReM process to reconstruct fragments in the t-SMILES pipeline. The CReM and t-SMILES frameworks could complement each other for better performance.

Response 2.6 Group SELFIES
Group SELFIES is a dictionary-ID-based solution that relies on SELFIES and fragments. While some research shows that SELFIES outperforms SMILES, other studies (Krenn et al., 2022) (Chen et al., 2023) suggest that, besides validity, other metrics are difficult to optimize to outperform classical SMILES, and that its advanced grammar makes some strings difficult to parse.
Moreover, it is evident that models based on dictionary IDs suffer from some fundamental problems, such as in-vocabulary (IV) and out-of-vocabulary (OOV) issues, and high-dimensional sparse representation (the curse of dimensionality).
In response we have revised the Introduction to include a reference and highlight the common problem with solutions relying on dictionary IDs.
Molecular fragmentation schemes are based on specific chemical principles. From this perspective, it is not easy to make a definitive judgment about which option is superior to another. Furthermore, in real-world molecular design experiments, the goal is often to address a specific problem, such as designing a molecule with a particular scaffold. Similarly, chemists may wish to perform a thorough investigation of ring-opening problems. In these scenarios, different fragmentation schemes should be selected to accomplish the given task. For example, if the target task is related to opening rings, none of BRICS, MMPA, Scaffold, or JTVAE is suitable; only algorithms such as RBrics, which can cut rings, are appropriate.
Although it was only a minor comment (concerning the open-ring problem) in the last revision, we took it seriously and published TSDY and TSID. Our action, however, actually contradicts your major suggestion to 'dump all the rest, or try to publish the rest in a future paper, once t-SMILES have been established'. We took this risk only after careful consideration. Our belief is that this action benefits readers by providing a comprehensive overview of the t-SMILES framework, although many more experiments are required. We appreciate the reviewer providing additional ideas from a chemical perspective, which has made the t-SMILES code system more flexible and versatile.
From systematic experiments and nuanced discussions, it is evident that TSSA, TSDY, and TSID have distinct advantages at different points. They complement each other well but cannot be completely replaced by one another. A figure with highlighted features is included as a visual aid in the discussion. We are most grateful for the reviewer's thoughtful and constructive suggestion.
In the revised manuscript, the title has been updated.

Response 1.2 words correction
We are most grateful for reviewer's kind and professional suggestion.All of them have been updated in the revised manuscript.

Response 1.3 Fig.4 table and curves
We would like to apologize for any confusion caused by the abundance of models.Fig. 4 visualizes a kernel density estimation of these distributions, similar to the figures used in the MOSES benchmark.
To enable easy comparison, these three figures utilize the same baseline model as the first one. This baseline model consists of 10,000 randomly selected molecules from the training dataset. The Wasserstein distance for this baseline model is defined as zero; tiny deviations are a result of random sampling. The same applies to Zinc.
To quantitatively compare the distributions in the generated and test sets, we compute a 1D Wasserstein distance between the property distributions of the generated and test sets. The tables that correspond to Fig. 4 were previously located in the SI file. These tables are quite large, with over 40 rows.
In addition, the Wasserstein distance is a single value with limited information, whereas Figure 4 presents a range that could provide more comprehensive information.
We hope this clarification will help alleviate any concerns.In response to this issue, Fig. 4 has been revised for clarity by removing some curves.
1) The curves of the hybrid models represent an average of various algorithms, which are demonstrated in C3 and Z3.
2) Only one algorithm of different t-SMILES algorithms is demonstrated in C2 and Z2.
The updated figure is presented below.
In our source code, the reconstruction of t-SMILES to generate molecules is separated from the rest of the process as a standalone task. This task can be optimized later using faster programming languages and advanced techniques. We thank the reviewer for carefully examining our manuscript and pointing out this issue. We could consider investigating it in depth and publishing it in a future paper once t-SMILES has been established.
Because the principles for generating molecules with t-SMILES and with the recommended papers are different, we report the time cost in two separate parts to make comparison easier.
The corresponding revision is as follows: "Due to the similarity between t-SMILES and SMILES strings, training time is not a key indicator for evaluation. One key difference is that the t-SMILES model must reconstruct the string into molecules, which distinguishes it from the SMILES-based model. The cost of reconstructing 1000 molecules is shown in the SI."

The newly added statement is shown below, marked in green: "The study of fragmentation methodologies and their applications continues to reveal new opportunities for efficient molecular design and development." A review paper summarized a total of 15 published algorithms, such as eMolFrag etc.
2. In addition, another fragment assembling solution eSynth [2], which comes from this recommendation, mentioned that: This protocol mimics a real application, where one expects to discover novel compounds based on a small set of already developed bioactives.
Some statements in the Background also note: These large generic collections, like Zinc, have a very low probability of exhibiting the desired bioactivity for a specific target protein. Consequently, the chances of identifying novel, high-quality leads from large compound repositories are low.
These statements align perfectly with our goal of conducting experiments on low-resource datasets.
Our experiments on JNK3 also support these statements. It is evident that the SMILES-based pre-trained then fine-tuned models are not the best ones to achieve higher active-novelty scores. In the revised manuscript, eSynth has also been cited as a key reference. We would like to express our sincere appreciation for the reviewer's constructive suggestion and for bringing this valuable reference to our attention.
The following statement has been added to the first paragraph of Experiments on Low-Resource Datasets and is marked in green: The scarcity of labeled data presents a challenge for implementing deep learning in target-oriented drug discovery.This section simulates a real-world scenario, where novel compounds are discovered based on a small set of pre-existing bioactive compounds.This is because in the generic large compound libraries, such as Zinc, the vast majority of compounds have a very low probability to exhibit the desired bioactivity for a specific target protein.As a result, the chances to identify novel, high-quality leads from large compound repositories are low.
distribution parameters. Although FragDgm uses a segmented mode and is based on distributional learning, its FCD value of 0.303 is the lowest among all listed models.
2. CReM, FASMIFRA (the second reference) and eSynth (a related algorithm in the first reference) use different ways to assemble fragments; they all belong to a broad category of fragment-assembling algorithms that could serve within the t-SMILES framework as a reconstruction algorithm.
Solutions of this kind appear to suffer from "the limitation of inability to create new ring systems, so the performance depends on their representativeness in the input compound database." In contrast, the t-SMILES model can effectively generate new rings and new fragments. From this point of view, they are complementary to t-SMILES.
Given the total volume of this study, this broad category of algorithms was not originally covered extensively within its scope. Currently, although only random algorithms are used to select candidates, even in goal-directed reconstruction, t-SMILES models outperform baseline models by a wide margin in goal-directed tasks. In future research, it may be worthwhile to use any of these methods to reconstruct fragments in the t-SMILES process for improved performance. As you suggested, further exploration could be considered in a future paper once t-SMILES has been established.
We thank the reviewer for their helpful feedback, constructive criticism and their suggestion to include more references, which has helped us to improve the quality of our manuscript and provide a more comprehensive comparison of our solution with other state-of-the-art models.

Response 2.0
We would like to express our deep appreciation to the reviewers for their review of our manuscript.
Fig 6: you might plot Novelty against FCD score and use colors and shapes to indicate methods and training epoch, to better show the tradeoff. The t-SNE plots in Fig 7 do not have any clearly identifiable clustering structure and may be better suited to the SI if the conclusion is just that training and generated samples overlap.

4) Condense section 3.3, "Chemical space and data augmentation for t-SMILES".
5) Condense and merge sections 3.4 and 3.5: "Experiments on JNK3".

Fig. 1
has been revised using the TSID algorithm. The original Fig. 1 has been relocated to the Supporting Information file as SI.Fig.A.1.1 to demonstrate the TSSA procedure with a very detailed explanation. MMPA instead of JTVAE is used as the example to decompose the molecule in both figures. In addition, the molecule demonstrated in Fig. 1 and SI.Fig.A.1.1 has been changed to Celecoxib. Furthermore, more examples, such as Aspirin, Caffeine, Paracetamol and a chiral molecule, have been added to SI.A.1 to demonstrate how to decompose molecules using different fragmentation algorithms and t-SMILES codes. More comments on the key points of the different coding algorithms have also been added to the revised manuscript. Three coding algorithms are presented in the revised manuscript. The difference between TSSA and TSDY/TSID is that in TSSA, two fragments share one real atom as a connecting point, whereas TSDY and TSID use an attachment point (indicated by a dummy atom '*', with or without an ID, in the t-SMILES string) to indicate how the group bonds. In fact, the dummy atom '*', with or without an ID, is shared by two pieces in TSDY and TSID. TSDY and TSID use the same logic to generate t-SMILES strings and decode them back. However, since the fragment representation is somewhat simpler, their encoding and decoding performance is slightly better than TSSA's, as can be observed in Fig. 1 and SI.A.4. In particular, TSID has been optimized to achieve a significantly low reconstruction-novelty score, making it easier to support open-ring, retrosynthesis, reaction prediction, etc. SI.A.3 to SI.A.5 have been revised to provide a brief introduction to the fragmentation algorithms.
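As the manuscript describes, a t-SMILES string is obtained by a breadth-first traversal of the full binary tree built from the fragmented molecular graph. The following is a minimal, self-contained Python sketch of that traversal step only; it is not the authors' implementation, and the fragment payloads ("A", "B", "C") and the `&`/`^` placeholder and separator tokens are illustrative stand-ins.

```python
from collections import deque

class FragNode:
    """Node of a binary tree whose payload is a fragment SMILES string."""
    def __init__(self, fragment, left=None, right=None):
        self.fragment = fragment
        self.left = left
        self.right = right

def bfs_serialize(root, empty="&", sep="^"):
    """Breadth-first traversal emitting fragment strings; absent children
    are written as a placeholder token so the tree shape can be recovered
    when decoding the string back into a tree."""
    tokens, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            tokens.append(empty)
            continue
        tokens.append(node.fragment)
        queue.append(node.left)
        queue.append(node.right)
    return sep.join(tokens)

# Tiny illustrative tree with three placeholder fragments.
tree = FragNode("A", FragNode("B"), FragNode("C"))
print(bfs_serialize(tree))  # A^B^C^&^&^&^&
```

Because every missing child is written out explicitly, the string fixes the tree topology unambiguously, which is what allows a decoder to rebuild the fragment tree before reassembling the molecule.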

Fig. 4
Fig. 4 Performance of the Goal-Directed Benchmarks for T16.SMPO with different training epochs. The TSMG model yields markedly high results. For further comprehensive experiments and nuanced discussion, please refer to section SI.B.3.4.
In addition, we have included it as a baseline model for comparison with the t-SMILES model in the experiments on Zinc. It reads: "As to Group-VAE, which uses fragments and SELFIES, it achieves a lower novelty score compared to SELFIES-VAE in published experiments. Because Group-VAE uses a different way to calculate FCD, a direct comparison is impossible. But in this experiment, the SELFIES-based model obtains lower scores for both novelty and FCD when compared to the five t-SMILES-based models."

In Fig 6 you might plot Novelty against FCD score and use colors and shapes to indicate methods and training epoch, to better show the tradeoff.

Response 2.7 Figure of Novelty-FCD
We are most grateful for your suggestion. The figure has been revised as shown below.

The t-SNE plots in Fig 7 do not have any clearly identifiable clustering structure and may be better suited to the SI if the conclusion is just that training and generated samples overlap.

Response 2.8 Figure of JNK3
Fig 7 has been relocated to SI.D.1.
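One way to realize the suggested Novelty-versus-FCD view is a scatter plot in which the marker shape encodes the method and the color encodes the training epoch. Below is a hedged matplotlib sketch; the method names, epochs and (FCD, Novelty) values are invented placeholders for illustrating the layout, not results from the paper.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

# Placeholder (method, epoch, FCD, novelty) tuples, for illustration only.
points = [
    ("model-A", 10, 0.20, 0.85), ("model-A", 50, 0.15, 0.80),
    ("model-B", 10, 0.12, 0.92), ("model-B", 50, 0.08, 0.88),
]
markers = {"model-A": "s", "model-B": "o"}  # marker shape encodes the method

fig, ax = plt.subplots()
for method, epoch, fcd, novelty in points:
    # Color encodes the training epoch via a shared colormap scale.
    ax.scatter(fcd, novelty, marker=markers[method],
               c=[epoch], cmap="viridis", vmin=0, vmax=50)
ax.set_xlabel("FCD (lower is better)")
ax.set_ylabel("Novelty (higher is better)")
fig.savefig("novelty_vs_fcd.png")
```

With both quality (FCD) and exploration (Novelty) on one pair of axes, the tradeoff across methods and training epochs becomes directly visible, as the reviewer suggested.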

Fig.R. 1
Fig.R.1 To reduce the number of figures and tables, we have selected only SAScore and merged the curves of ChEMBL and Zinc into one figure to highlight their differences. For instance, on GPU: NVIDIA Quadro RTX 4000, CPU: Intel(R) Xeon(R) W-2265, a classical SMILES-based model takes almost 24 hours to train, while the TSID_S model takes almost 19 hours with the same training parameters: epochs = 10, batch_size = 128, and tokens such as B, Br, C, Cl, F, I, N, O, P, S, [Cl+2], [Cl+3], etc. The cost of training depends on various factors such as hyperparameters and the tokenization method. It is important to note that the reference run time does not indicate that the t-SMILES model requires less time than the SMILES model; it merely serves as a reference point. When generating 1000 molecules on an NVIDIA Quadro RTX 4000 GPU with a batch size of 128, the TSID_S model takes approximately 40 seconds, i.e., roughly 25 molecules per second, while the SMILES model takes around 20 seconds, producing approximately 50 molecules per second. The difference is mainly because the TSID_S and SMILES models have different token dictionaries.

Table: Code / Joint Point / Frag. Alg. / Experiments — Relative Pros and Cons of TSSA, TSDY and TSID
1. Classical SMILES in t-SMILES format. This table has also been added to the SI file as SI.Table.B.1.
Thank you for your comment. Although the t-SMILES code has improved performance by reducing long-term dependency grammar, there may still be cases of incorrect grammar in fragments because the generative model used in this study is probabilistic. For this reason, we initially reported >0.99. However, the t-SMILES framework is technically capable of generating chemically valid molecules. Therefore, the validity score has been updated to 1.0.

Humans, like deep-learning models, might need a few examples in order to generalize. The fragmentation in Fig 1 seems like heavy atoms were saturated with hydrogens after bonds were cut; some bonds were duplicated after breaking in two a fused ring system (what fragmentation algorithm on earth does this? and how to reconnect fragments after such a molecular "disintegration"?) with an appropriate fragmentation algorithm, is expected to significantly aid chemists in their molecular modeling efforts. At this step, potential users will only be interested in an open-source encoder and decoder for t-SMILES, possibly in Python.

Response 1.7 Open-source code
Thank you. The data used in this study and the open-source Python code are available at: https://github.com/juanniwu/t-SMILES/ The original function names have been renamed to encode_single() and decode_single().

p7 and Fig 1: I don't even understand how you fragment molecules. Can you show several famous molecules fragmented / t-SMILES encoded by your approach: caffeine, aspirin, paracetamol, etc.? From a single example, and if your explanations are not good enough, how are people supposed to follow what you are doing?
Thank you for bringing up the open-ring problem. In theory, the t-SMILES framework places no limitations on molecular fragmentation schemes from an algorithmic point of view. In practice, we recommend performing a sanity test for any new fragmentation algorithm. Unfortunately, none of the four publicly available fragmentation schemes (BRICS, MMPA, Scaffold, and JTVAE) used in this study supports open-ring, which is the main reason why we did not claim to support it.
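The sanity test recommended above is essentially an encode/decode round trip over a sample of molecules. A generic, self-contained Python sketch follows; the `encode`/`decode` callables and the identity `canonical` function are placeholders (in practice one would plug in the t-SMILES codec and an RDKit canonicalizer), and the string-reversal "codec" below exists only to keep the example dependency-free.

```python
def recovery_rate(smiles_list, encode, decode, canonical=lambda s: s):
    """Fraction of molecules that survive an encode/decode round trip.
    `canonical` should map a SMILES to a canonical form so that
    chemically equivalent strings compare equal; the identity function
    is used here to keep the sketch self-contained."""
    hits = sum(canonical(decode(encode(s))) == canonical(s)
               for s in smiles_list)
    return hits / len(smiles_list)

# Trivial demonstration with a lossless stand-in codec (string reversal).
sample = ["CCO", "c1ccccc1", "CC(=O)O"]
rate = recovery_rate(sample,
                     encode=lambda s: s[::-1],
                     decode=lambda s: s[::-1])
print(rate)  # 1.0
```

A recovery rate below 1.0 for a new fragmentation scheme would flag exactly the kind of incompatibility this sanity test is meant to catch before the scheme is used for training.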
for novelty scores. Recovery rate of random reconstruction on 10K samples: this table has been added as SI.Table.E.1. Additionally, the definition of Novelty in SI.B.3.2 has been revised for accuracy.
The small error arises from two factors: 1) the samples were randomly selected, and 2) a t-SMILES string A can be reconstructed into SMILES B due to the presence of closely related molecules in the training dataset. SI.Table.E.1

Table 4. Results of Goal-Directed Benchmarks on ChEMBL.
SI.Table.B.1.2.3 *The term 'candidate' refers to the number of sub-fragments selected for the next step during reconstruction.