Curcumin Chalcone Derivatives Database (CCDD): a Python framework for natural compound derivatives database

We built the Curcumin Chalcone Derivatives Database (CCDD) to enable the effective virtual screening of highly potent curcumin and its analogs. The two-dimensional (2D) structures were drawn using the ChemBioOffice package and converted to 3D structures using Discovery Studio Visualizer V 2021 (DS). The database was built using different Python modules. For the 3D structures, different Python packages were used to obtain the data frame of compounds. This framework is also used to visualize the compounds. The webserver enables the users to screen the compounds according to Lipinski’s rule of five. The structures can be downloaded in .sdf and .mol format. The data frame (df) can be downloaded in .csv format. Our webserver can help computational drug discovery researchers find new therapeutics and build new webservers. The CCDD is freely available at: https://srampogu-ccdd-ccdd-8uldk8.streamlit.app/.


INTRODUCTION
Modern drug development involves identifying possible hits and optimizing them to improve selectivity, specificity, activity, and extent of absorption. Once a molecule with all of the necessary qualities has been found, drug development and clinical trials can begin. Computer-aided drug design is used in one or more phases of the drug discovery procedure. The pharmaceutical business invests a significant amount of money in the modern drug discovery process.
Natural compounds and their derivatives have been at the forefront in contributing to the therapeutic field. The compounds that have gained enormous credit are curcumin and chalcone.
Curcumin is a diarylheptanoid polyphenol with two methoxy rings and an ortho phenolic OH group. Curcumin is attached to a conjugated seven-carbon chain with an enone and 1,3-diketone group. Curcumin has two double bonds, o-methoxy and phenolic groups, and a 1,3-keto-enol ( Fig. 1A) (Yang et al., 2017). Chalcones are a class of open-chain flavonoid compounds with a carbonyl group composed of two aromatic rings and a range of a,β-unsaturated substituents ( Fig. 1B) (Salehi et al., 2021).
The rhizome of the turmeric plant, Curcuma longa, contains the polyphenol curcumin, also known as diferuloylmethane, which is also present in other Curcuma species and is widely used as a medicinal herb in Asian countries (Hewlings & Kalman, 2017). Traditional treatments have employed curcumin as a home remedy to treat a number of illnesses, including rheumatism, sinusitis, anorexia, cough, hepatic disorders, and biliary disorders (Asakura & Kitahora, 2018). Curcumin also acts against diabetes, multiple sclerosis, cardiovascular diseases, inflammation, and Alzheimer's disease (Suresh, Yadav & Suresh, 2006;Fadus et al., 2017). Due to curcumin's high therapeutic potential, several research groups have developed analogs that have proven to be active compounds against several diseases (Fig. 2).
The other names of chalcones are β-phenylacrylophenone, a-phenyl-βbenzoylethylene, benzylideneacetophenone, and phenyl styryl ketone (Tekale, 2020). Generally, chalcones are benzyl acetophenone compounds consisting of two fragrant rings A and B joined by three aliphatic carbons, and they have a,β-unsaturated ketones demonstrating varied organization of substituents (Salehi et al., 2021). The biological actions of chalcones, which include antibacterial, anticancer, cytotoxic, antioxidative, antiinflammatory, and antiviral properties, are well characterized. Chalcone derivatives are being utilized extensively in the treatment of viral infections, cardiovascular conditions, stomach cancer, food additives, and chemicals in cosmetic formulations ( Fig. 2) (Salehi et al., 2021).
The Natural Compound Derivatives Database (NCDD) objectives include establishing a database containing an extensive number of derivative compounds in a single location, as To expedite the drug design process, computational methodologies are advantageous for interpreting and guiding experiments (Walsh, 2003). The two main categories of computer-aided drug design (CADD) methods are structure-based drug design (SBDD) and ligand-based drug design (LBDD) (Schneider & Fechner, 2005). Utilization of open access resources for cheminformatics and structural bioinformatics as well as public platforms for code deposition, such as GitHub, is increasing in the realm of research. This combination facilitates and encourages the development of modular, reproducible, and easy-to-share computer-aided drug design (CADD) pipelines (Pirhadi, Sunseri & Koes, 2016).
Virtual screening is an essential step in computational drug discovery (CDD). CDD approaches reduce the time of discovering a new drug candidate and involve lower costs and human burden. Typically, CDD consists of the screening of databases of small molecules either by pharmacophore modeling methods or molecular docking approaches. The small molecules from the databases will be subjected to Lipinski's rule of 5 to enhance the opportunities of the candidate drug during clinical trials.

Data curation
We manually extracted information on the structures of chalcone and curcumin from the literature. With the use of keywords such as "curcumin", "curcumin derivatives", "modified curcumin", "chalcone", "modified chalcone", and "chalcone derivatives", we searched the PubMed database and Google Scholar. More than 30 complete literary works were downloaded overall. To learn more about the specifics of curcumin and chalcone structures, we manually searched the literature.

Database generation
Chemdraw professional software was employed to draw the two-dimensional (2D) structure of the derivatives, which was then uploaded to the Discovery Studio (DS) visualizer 2021 to obtain the 3D structures. These structures were saved in the .sdf format to the local disk and then uploaded to Jupyter Notebook. The dataframe (df) was then generated using PandasTools. Correspondingly, the unused columns were deleted and Lipinski's rule of five was calculated for df and stored in the .csv file. This .csv file was loaded onto Streamlit to create the webserver (Fig. 3). The code was written in Python IDLE Shell 3.9.5 (Napoles-Duarte et al.

RESULTS
The CCDD was built using the .csv file obtained from Python and can be divided into three steps. In the first step, the 3D structures from the DS were upgraded onto the Jupyter Notebook, and the data frame was created using rdkit and PandasTools. Then, Lipinski's descriptors were then calculated, and all the data with null values and duplicates were deleted. The resultant 474 compounds were saved in the .csv format and were taken onto Streamlit to build the webserver.
In the second step, the code was written in Python IDEL Shell for Streamlit. The overview of the framework shows the df and the filter in terms of parameters corresponding to Lipinski's rule of 5. Upon moving each parameter the changes are reflected in df. Users can filter the df according to their use and the df can be downloaded (in .csv format) by clicking the 'Get Compounds' button.
The third step is to view the structure on the DS, form the .csv file, delete all the contents and retain only the SMILES and save the file. Then, the file format should be changed from .csv to .smi using the rename option. Then drag and drop the .smi file onto the DS. The filtered data can be viewed in 3D structure form.
The 2D structures can be viewed by entering the SMILES code of the selected compound and then clicking enter. Furthermore, the database is equipped to retrieve the data in the row-wise selection after entering the SMILES.
How to access CCDD: access the database of curcumin chalcone derivatives. Typically, a database consists of four layers (1) The df has the SMILES code of compounds along with their corresponding descriptors. The side bar panel is equipped with the app to screen the compounds according to the user's interest (Fig. 4). The selected compounds or the df in whole can be downloaded in the .csv format by clicking 'Get compounds' (Fig. 4). Figure 3 Illustration of the methods adopted for building the webserver. The curation of curcumin and chalcone derivatives is carried out in Step 1. The left section describes how to draw 2D structures and then convert them into 3D structures. This demonstrates how to prepare a dataframe and save it as the .cvs file format.
Step 2 details the construction of a database using Streamlit, while Step 3 demonstrates the visualization of 2D structures in a webserver builtwith Streamlit.
Full-size  DOI: 10.7717/peerj.15885/ fig-3 (2) After opening the .csv file, delete all columns, retaining the SMILES column. Save the file to the .smi format. Drag and drop it on the DS visualizer to the view of 3D structures (Fig. 5).
(3) Within the web app, the 2D structure can be viewed by giving the SMILES as an input (Fig. 6A).
(4) Furthermore, a particular row can also be retrieved by giving the input as SMILES (Fig. 6B).

UTILITY AND DISCUSSION
The CCDD is a free resource for scientists and research groups studying natural products. The database comprises curcumin and chalcone analogs, the largest collection of their analogs in 2022. This webserver is useful for scientists and researchers studying natural products (curcumin and chalcones) for a variety of purposes, including medication discovery and research into biodiversity. Additionally, the CCDD is the first significant chemical database with analog compounds of curcumin and chalcones that use Python frame work and streamlit. This server is built using different python-based tools from the 2D structures. The Jupyter Nootbook and PandasTool were adapted to create the filterable df. This allows users to filter the data based on descriptors according to Lipinski's rule. Since this parameter is already computed, users can directly use the data according to their needs or they can simply download the whole content in .csv form. The database currently has 474 analogs. To the best of our knowledge this is the first web platform containing analog compounds. The database will be regularly updated to furnish new compounds. LIMITATIONS Our website does not currently have the capability to automatically update newly introduced chemicals. In the next version, we will attempt to incorporate this functionality.

CONCLUSIONS
A natural compound database was built taking the structures from the literature using Python modules. This server will help the virtual screening process that is usually conducted for the identification of candidate compounds. Our webserver could assist computational drug discovery researchers in finding new therapeutics for different diseases and could further serve in building new webservers.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the National Research Foundation of Korea (2020R1A2C1006909 and 2022R1A4A1021817) and the Samsung Science and Technology Foundation (SSTF-BA1701-10). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: National Research Foundation of Korea: 2020R1A2C1006909 and 2022R1A4A1021817. Samsung Science and Technology Foundation: SSTF-BA1701-10.

Competing Interests
Shailima Rampogu is employed by Cachet Big Data Laboratories.

Author Contributions
Shailima Rampogu conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft. Thananjeyan Balasubramaniyam conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft. Joon-Hwa Lee conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.