Open data and algorithms for open science in AI-driven molecular informatics

Recent years have seen a sharp increase in the development of deep learning and artificial intelligence-based molecular informatics. There has been growing interest in applying deep learning to several subfields, including the digital transformation of synthetic chemistry, the extraction of chemical information from the scientific literature, and AI in natural product-based drug discovery. The application of AI to molecular informatics is still constrained by the fact that most of the data used for training and testing deep learning models are not available as FAIR and open data. As open science practices continue to grow in popularity, initiatives that support FAIR and open data as well as open-source software have emerged. It is becoming increasingly important for researchers in molecular informatics to embrace open science and to deposit data and software in open repositories. With the advent of open-source deep learning frameworks and cloud computing platforms, academic researchers can now deploy and test their own deep learning models with ease. With the development of new and faster hardware for deep learning, the increasing number of initiatives towards digital research data management infrastructures, and a culture promoting open data, open source, and open science, AI-driven molecular informatics will continue to grow. This review examines the current state of open data and open algorithms in molecular informatics, as well as ways in which they could be improved in the future.


Introduction
Considerable improvements in artificial intelligence (AI) research through the introduction of deep neural networks promise to transform society [1][2][3][4] and the way research is conducted [5,6]. However, in most areas of molecular informatics, the amount of training data available is insufficient for the use of today's most powerful deep neural network architectures, which demonstrate superior performance only when trained with large amounts of data [7]. In addition, a thorough assessment of a model's true predictive performance in practice is a rare exception (e.g. the Critical Assessment of Protein Structure Prediction (CASP) [8]).
Because of this lack of accessible experimental data [9,10], machine learning predictions in chemistry are generally too error-prone to realize the potential of the new methods at this time. This necessitates a change in the way chemists publish their data and the type of data published [11,12]. The call for open data, open source, and open science (ODOSOS) in chemistry is not new [13,14], but with the advent of more powerful data-driven algorithms, it has never been more important.
Journals and funders demanding the deposition of research data, and the necessary establishment of suitable research data infrastructures, will inevitably alleviate the data shortage problem in the future [15,16]. The German government, for example, has recently decided to implement and provide long-term funding for a national research data infrastructure (Nationale Forschungsdateninfrastruktur, NFDI) [17] with 30 consortia across all areas of science, collaboratively developing open research data management (RDM) e-infrastructures, coordinated by an umbrella process and a joint directorate. One of those consortia is NFDI4Chem, which is building an RDM e-infrastructure for chemistry that follows the FAIR data principles [18] to make chemical data findable, accessible, interoperable, and reusable [19,20]. One flagship project of NFDI4Chem is nmrXiv, an open and FAIR repository and analysis platform for NMR spectroscopy data [21].
In recent years, advances in artificial intelligence and data-driven applications in molecular informatics have provided a glimpse into the magnitude of future accomplishments and have made open data a necessity for machine learning algorithms. Here, we attempt to present some of the major milestones of the past years and discuss obstacles that are yet to be overcome to enable similar AI-driven progress in (nearly) every area of chemistry.

The importance of openly available resources and data
One cause of the dissatisfying data shortage is the past lack of a culture of data deposition and sharing in chemistry, even though, at least from the early 1990s onwards, with the advent of the internet, widespread data deposition and sharing would have been possible. There have been notable exceptions, such as the crystallography community, which developed data deposition cultures even earlier. Both small-molecule and biomacromolecule structures have been and are being deposited in the Protein Data Bank (PDB) [22,23] and the Cambridge Crystallographic Database (CCD) [24]. Of particular note, the open PDB, in combination with openly available protein sequence information (for multiple sequence alignments), formed the basis for the outstanding success of the AlphaFold protein 3D structure prediction system [5]. Similarly, open databases such as PubChem [25], ChEMBL [26], ChEBI [27], DrugBank [28], the Human Metabolome Database (HMDB) [29], the Collection of Open Natural Products (COCONUT) [30], the Natural Products Atlas [31], the Natural Products Magnetic Resonance Database [32], and ZINC [33] fundamentally broaden the research opportunities [34]. The PubChem database is used by millions of users every month [35]. One example of the use of these databases is a classifier, trained on data from COCONUT, that determines whether a natural product (NP) originates from fungi, plants, or bacteria based on its chemical structure [36]. The ZINC database has recently been used for the in silico identification of drug candidates that inhibit the main protease of SARS-CoV-2 [37].
Another crucial aspect is the availability of open software libraries to handle and process chemical information, like the Chemistry Development Kit (CDK) [38], Indigo [39], RDKit [40], or OpenBabel [41], as well as the recently published Python-based Informatics Kit for Analysing Chemical Units (PIKAChU) [42]. Without these open-source projects, the research community would lack basic tools for programmatically reading, modifying, and processing chemical information. Accordingly, they are fundamental for every researcher in the field of molecular informatics.
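To illustrate the kind of low-level functionality such toolkits build upon, the following is a minimal, self-contained sketch of one basic operation, computing an average molecular mass from a molecular formula. It is not taken from any of the cited libraries; the atomic mass table is abbreviated and the parser handles only simple formulas without parentheses or isotopes.

```python
import re

# Abbreviated table of average atomic masses (values rounded);
# a real toolkit covers the full periodic table.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "S": 32.06, "Cl": 35.45, "Na": 22.990}

def molecular_mass(formula: str) -> float:
    """Average molecular mass of a simple formula such as 'C9H8O4'."""
    mass = 0.0
    # Each element token is an uppercase letter, an optional lowercase
    # letter, and an optional count, e.g. 'C9', 'Cl', 'H'.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return mass

print(round(molecular_mass("H2O"), 3))     # water
print(round(molecular_mass("C9H8O4"), 3))  # aspirin
```

Libraries like the CDK or RDKit provide this and far more (graph perception, aromaticity, canonicalization) behind well-tested APIs, which is why reimplementing such functionality is rarely advisable.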
Molecular string representations such as DeepSMILES [43] and SELFIES [44] enable processing chemical structures with models like transformers that are designed to process sequential data. Recently, a study investigated the performance of transformers on different tasks using SMILES, DeepSMILES, and SELFIES. The number of invalid chemical structures returned decreased when using DeepSMILES and especially SELFIES instead of SMILES, although the overall best performance was achieved using SMILES [45].
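Before such a string representation reaches a sequence model, it is typically split into tokens. As a hedged sketch, the following regex-based SMILES tokenizer shows the idea; the pattern is a simplified variant of tokenization schemes used in the literature and does not cover every SMILES feature (e.g. two-digit ring closures after `%`).

```python
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# common organic-subset atoms, bonds/branches, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[=#$/\\().+%@~-]|\d)"
)

def tokenize(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The resulting token sequence is what a transformer actually consumes; representations like SELFIES are designed so that (nearly) any such token sequence decodes back into a valid molecule, which is why they reduce the fraction of invalid outputs.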
Without open libraries such as TensorFlow [46] and PyTorch [47] for the implementation and training of neural networks, as well as the ubiquitous availability of Graphics Processing Units (GPU) and Tensor Processing Units (TPU) in cloud environments [48], the big leaps in molecular AI research would not have been possible.

An approach to the protein folding problem - AlphaFold
The problem of protein folding is considered one of the fundamental challenges of molecular biology because the large number of degrees of freedom of bonds and atoms in a protein leads to a combinatorial explosion in the number of possible low-energy arrangements [49]. In 2020, the DeepMind team announced a widely recognised breakthrough in the prediction of protein 3D structures from their amino acid sequences with their deep learning-based system AlphaFold [5]. The system participated in the 13th and 14th Critical Assessment of Protein Structure Prediction (CASP) competitions [8], outperforming all competitors. Since then, it has been made openly available and used to fill the open AlphaFold Protein Structure Database [50], which contains more than 200 million predicted protein 3D structures, covering nearly every known protein on earth [51]. Within a short period of time, the structures of 98.5% of the human proteome were predicted using AlphaFold, while the previous decades of experimental research had yielded structures for 17% [52]. The system was trained on structural data openly deposited in the Protein Data Bank [22,23], which was founded and announced in 1971 [53]. The success story of AlphaFold illustrates what is possible today when researchers are able to access the data that scientists have produced over the course of 50 years.
It is important to mention that challenges like the prediction of the relative positions of protein domains and their changes when an external stimulus is applied remain partially unsolved. Additionally, the transition from disordered to ordered domain states cannot be elucidated using AlphaFold, and the system is limited to structures with fewer than 2700 amino acids [54]. Nevertheless, the high impact of its accurate protein structure predictions is indisputable [55]. For example, the predicted structural information about nucleoporins has been combined with cryo-electron tomography (cryo-ET) to generate a model that precisely explains 90% of the scaffold of the human nuclear pore complex (NPC) [56]. Another example is the identification of tens of thousands of previously unknown potential binding sites for iron-sulfur clusters and zinc ions in more than 360,000 proteins [57].

Digital transformation of synthetic chemistry
Similar to other fields, the foundation for successful machine learning applications in synthetic organic chemistry is the availability of extensive experimental data [58]. Recently, Strieth-Kalthoff et al. demonstrated the benefit of using real experimental data for machine learning-based chemical yield predictions [12], while the prediction of reaction outcomes and yields remains a challenge in general [59]. Nonetheless, there have been impressive developments using attention-based deep learning methods to explore the chemical reaction space [60]. Schwaller et al. trained a transformer to predict chemical reaction outcomes with state-of-the-art results [61]. The resulting model, referred to as the molecular transformer, was then used in combination with hypergraph exploration to automatically plan retrosynthesis routes [62]. Since then, the molecular transformer has been extended to predict the products of enzymatic reactions [63]. Based on the aforementioned retrosynthesis planning system, Probst et al. have published a biocatalysed synthesis planning system [64].
Schwaller et al. have also shown that the attention matrix weights of transformers trained on unlabelled chemical reaction data can be used to determine accurate atom mappings [65]. Additionally, they demonstrated that attention-based models are highly suitable for the classification of chemical reactions [66]. Similar model architectures were successfully used to generate specific synthesis instructions [67] and to determine the yield of a given chemical reaction formula [68]. Andronov et al. successfully demonstrated the prediction of reagents from given reaction SMILES strings using transformers. They were then able to use the reagent prediction model to fill in missing reagents in incomplete reaction data from US patents, leading to an improved state-of-the-art model [61] for the prediction of reaction products [69]. Recently, Rohrbach et al. demonstrated the translation of synthesis protocols in the literature into a standardized chemical language, which could then be executed by their automated synthesis system [70].
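The reaction SMILES strings mentioned above encode a reaction as three ">"-separated sections: reactants, agents (reagents, catalysts, solvents), and products, each a dot-separated list of molecules. A minimal sketch of parsing this format (not code from any of the cited works):

```python
def parse_reaction_smiles(rxn: str) -> dict:
    """Split a reaction SMILES 'reactants>agents>products' into its
    dot-separated component lists; '>>' denotes an empty agents section."""
    reactants, agents, products = rxn.split(">")

    def to_list(part: str) -> list:
        return [s for s in part.split(".") if s]

    return {"reactants": to_list(reactants),
            "agents": to_list(agents),
            "products": to_list(products)}

# Esterification of acetic acid with ethanol, sulfuric acid as agent:
rxn = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
print(parse_reaction_smiles(rxn))
```

Tasks such as reagent prediction amount to filling in the middle section of this representation, while product prediction generates the final section from the first two.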
Again, the described advances are exemplary cases of the synergy between deep learning-based models and the availability of training data. Datasets extracted from US patents [66,[71][72][73][74], the scientific literature [75], and high-throughput experiments (HTE) [76] are available [60]. Recently, the Open Reaction Database (ORD) has been launched as a platform to replace unstructured reaction data in the supporting information of publications [77]. If it is accepted by the research community, the ORD may become part of the solution to the problems caused by the aforementioned lack of data and report bias [11,12]. Providing structured data in standardized formats may become a key step towards the digital transformation of synthetic chemistry.
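To make the idea of a structured, machine-readable reaction record concrete, here is a hedged sketch of what such a record could look like. The field names and values are purely illustrative and are NOT the actual ORD schema (which is defined as a set of protocol buffer messages); the point is only the contrast with free-text supporting information.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

# Illustrative record; field names and values are hypothetical,
# not the Open Reaction Database schema.
@dataclass
class ReactionRecord:
    reactants: list
    products: list
    agents: list = field(default_factory=list)
    yield_percent: Optional[float] = None
    temperature_celsius: Optional[float] = None

record = ReactionRecord(
    reactants=["CC(=O)O", "OCC"],  # acetic acid + ethanol
    products=["CC(=O)OCC"],        # ethyl acetate
    agents=["OS(=O)(=O)O"],        # sulfuric acid
    yield_percent=78.0,            # invented value for illustration
    temperature_celsius=80.0,      # invented value for illustration
)
print(json.dumps(asdict(record), indent=2))
```

Unlike a sentence buried in a PDF, every field here can be queried, validated, and fed directly into a model, which is exactly what the transition to standardized formats enables.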

Extraction of chemical information from the scientific literature
Besides enforcing FAIR data publication standards today and in the near future, it is important to tackle the damage that has already been done by publishing chemical data almost exclusively in a human-readable form, with unstructured text and images, in the past decades. The advances in the fields of natural language processing (NLP) [78][79][80] and computer vision (CV) [81][82][83] have made a new generation of chemical literature mining tools possible. These can be considered AI-driven solutions that enable further AI-driven advances by making concealed data accessible in structured, machine-readable formats.
The field of optical chemical structure recognition (OCSR) deals with the translation of images of chemical structures, as published in the scientific literature, into machine-readable representations of the underlying molecular graph [84,85]. In the past two years, a variety of deep learning-based OCSR methods [86][87][88][89] have been published, of which DECIMER Image-Transformer [90], Img2Mol [91], and SwinOCSR [92] provide openly available source code and trained models. For the segmentation of chemical structure images from whole pages, the open-source tool DECIMER Segmentation can be used [93]. With the publication of the open-source depiction generation tool RanDepict, efforts have been made to standardize and diversify the training data for deep learning-based OCSR tools [94]. The newest version of DECIMER was trained on more than 400 million images using the latest Tensor Processing Units [95] available on the Google Cloud platform. Currently, DECIMER achieves an accuracy of above 90% and is regarded as an important point of reference for future work [85]. Without open databases like PubChem, from which one can download over 100 million chemical structures for free, this would not have been possible.
Since its original release in 2016, the chemical literature mining toolkit ChemDataExtractor [96] has been continuously developed [97,98]. The highly adaptable toolkit uses user-defined models of the information to be extracted, in a pipeline with readers for different publisher formats, a system for interdependency resolution with a set of parsers, and a sophisticated chemical named entity recognition system [99], to extract chemical information in a structured data format [97]. In recent years, ChemDataExtractor has been extensively used to automatically generate databases of refractive indices and dielectric constants [100], battery material properties [101], properties of semiconductors for building solar cells [102], magnetic properties [103], as well as UV/Vis spectra [104].
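As a heavily simplified, hedged illustration of what chemical entity spotting involves, the toy regex below flags formula-like tokens in running text. This is not how ChemDataExtractor works; its named entity recognition relies on trained statistical models and dictionaries, which is necessary to handle trivial names, abbreviations, and ambiguous words that a pattern like this cannot.

```python
import re

# Toy formula spotter: a word made of two or more element-like tokens
# (capital letter, optional lowercase letter, optional digits),
# e.g. 'NaCl' or 'MgSO4'. Purely illustrative; real chemical NER
# uses trained models, not a single regex.
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")

def find_formula_candidates(text: str) -> list[str]:
    return FORMULA.findall(text)

sentence = "The sample was washed with NaCl solution and dried over MgSO4."
print(find_formula_candidates(sentence))
```

Even this crude pattern hints at the core difficulty: ordinary capitalized words, gene names, and units produce false positives and negatives, which is why robust extraction pipelines combine learned models with format-aware document readers.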
In addition to the technical obstacles, scientific publishers essentially hinder literature mining by hiding publications behind paywalls and limiting the number of publications that can be downloaded and used, even if a subscription is available. Some publishers like Elsevier offer markup versions of their publications for text mining purposes to academic researchers [105], but there is a long way to go to truly make all published chemical information available. In 2018, an international group of research funders announced the initiative Plan S, which requires scientists who benefit from their funding to publish in open-access journals [106]. Recently, the US government announced that it will require all publicly funded research to be openly accessible from 2026 on [107]. Once RDM e-infrastructures are established as the mandatory scientific data publication standard, the kind of literature mining methods described herein will become obsolete. For now, they are indispensable for artificially intelligent data-driven applications.

AI in natural product-based drug discovery
The field of drug discovery has shifted towards approaches based on the analysis of large amounts of data and deep learning [108]. As a result of the growing demand for efficient new drugs, the field has experienced rapid growth in the last few years. NP are attractive to drug developers due to their availability and their potential affinity to protein drug targets [109,110].
There have been significant advances in various areas of the field, such as the prediction of biochemical effects of NP based on their molecular structure [111], genome mining for the discovery of bioactive compounds [112], the mining of mass spectrometry-based metabolomics data [113], and integrative approaches that combine metabolomics and genomics data [114].
The initial hope that large-scale data analysis in the different omics-related research fields would boost the drug discovery rate has not yet materialised [115], but the methods are progressing continuously. The open access to databases and repositories such as MetaboLights [116], the HMDB [29], the Metabolomics Workbench [117], and METASPACE [118] is crucial for the identification of metabolites and NP [112]. In 2021, the Paired Omics Data Platform (PODP) was launched as a community-driven platform that provides linked metabolome and genome data according to the FAIR principles [119].
NP-based drug discovery has greatly benefited from models developed for NLP [120]. For example, in 2021, Huang et al. published MolTrans, a state-of-the-art deep learning-based framework for the in silico prediction of Drug-Protein Interactions (DPI) [121]. In the following year, Wang et al. presented their structure-aware multimodal deep DPI prediction model STAMP-DPI, which outperforms MolTrans. The tool has been published along with a large, high-quality training and benchmarking dataset [122]. The adaptation of sequence models like the transformer [78] for AI-based drug design requires large amounts of well-curated, high-quality data.
Recent developments in the field of deep generative models help researchers generate molecules with desired properties [123], but a model that generalises well and generates molecules with desirable properties requires a large amount of training data. When dealing with artificially generated structures, it is also necessary to consider their synthetic accessibility. To successfully use deep learning on published NP structures, well-curated data is essential. Published data resources are often incomplete, inaccessible, or no longer available [124], which makes available resources like the Natural Products Atlas [31], LOTUS [125], and COCONUT [30] even more important.
Overall, the development of deep learning-based models has advanced drug discovery, with continued progress in model development and increasing access to open data and open databases helping the field grow. We hope that the research community will continue to actively contribute to openly available data sources to enable further progress in the field.

Conclusions
The developments of the past years demonstrate the potential of data-driven machine learning applications in the field of molecular informatics in an impressive manner [5,65,70]. An obvious requirement to benefit from this development is the availability of open, structured experimental data [11,12]. The integration of open data infrastructures will enable AI to be used in nearly every field of chemistry. The application of deep learning methodologies and the sharing of code and data in the field of chemistry are still in their early stages and require more community standards to be developed. Many models are still being trained from scratch using in-house servers and GPUs, which is a time-consuming and restrictive process. The rapid growth of the field will be enabled by the sharing of already-trained models and curated data with the public. When sharing code or data, high quality and data standards must be maintained [126]. Using public cloud infrastructures readily allows researchers to take advantage of the latest developments in hardware and software, which will lead to faster growth and a reduction in energy consumption. There are several initiatives working continuously to implement open data, open source, and open science in their individual research areas [13,14,17,18,20,21,77,106,107,127,128]. Fueled by the availability of more and more open research data, AI-powered molecular informatics will be a key driver of the digital transformation of chemistry in the coming years.

Abbreviations
AI: Artificial Intelligence
CASP: Critical Assessment of Protein Structure Prediction
CCD: Cambridge Crystallographic Database
CDK: Chemistry Development Kit
cryo-ET: cryo-Electron Tomography
COCONUT: COlleCtion of Open Natural Products
CV: Computer Vision
DECIMER: Deep lEarning for Chemical ImagE Recognition
DPI: Drug-Protein Interaction
FAIR: Findable, Accessible, Interoperable, and Reusable
GPU: Graphics Processing Unit
HTE: High-Throughput Experiments
HMDB: Human Metabolome Database
NFDI: National Research Data Infrastructure
NFDI4Chem: National Research Data Infrastructure for Chemistry
NLP: Natural Language Processing
NP: Natural Products
NP-MRD: Natural Products Magnetic Resonance Database
NPC: Nuclear Pore Complex
OCSR: Optical Chemical Structure Recognition
ODOSOS: Open Data, Open Source and Open Science
ORD: Open Reaction Database
PDB: Protein Data Bank
PODP: Paired Omics Data Platform
PIKAChU: Python-based Informatics Kit for Analysing CHemical Units
RDM: Research Data Management
TPU: Tensor Processing Unit