A data-driven approach to predict the in vitro dissolution time of sustained-release tablets using raw material databases and machine learning algorithms

Tablets are the most common pharmaceutical dosage form. Sustained-release (SR) tablet formulations are designed to release the drug into the bloodstream gradually and often require less frequent dosing. Current strategies to optimize SR tablet dissolution time still rely on the traditional trial-and-error approach, which is time-consuming and expensive. Here, we demonstrate alternative machine learning and deep learning models built through the TPOT AutoML platform. Six models were compared to improve the methodology for dissolution time prediction: the decision tree regressor (DTR), gradient boosting regressor (GBR), random forest regressor (RFR), extra trees regressor (ETR), XGBoost regressor (XGBR), and a deep learning (DL) model. The results indicate that machine learning methods, especially the random forest regressor, predict dissolution time convincingly; after hyperparameter tuning of the deep neural network, however, the DL model with a 10-fold cross-validation scheme demonstrated superior predictive performance, with an NRMSE of 8% and an R² of 0.92. The main factors affecting the dissolution time of SR tablets were explained using the SHAP method.


Introduction
Pharmaceutical design needs to be tailored for each novel drug entity because of the variability in these relationships (Hayashi et al. 2023). The primary goal of many pharmaceutical drugs is to attain a consistent blood level that is both therapeutically effective and safe over a long period of time (Mishra et al. 2019). Sustained-release (SR) tablet dosage forms are therefore intended to achieve an extended therapeutic effect. SR dosage forms aim to enhance drug efficacy by prolonging therapeutic effects, allowing for less frequent dosing, and utilizing lower doses (Bhagat et al. 2023). They are also an attractive choice for drugs with short half-lives, which typically need regular administration to sustain therapeutic levels. Overall, a lower dosage and less frequent dosing can reduce drug-related complications. This is particularly beneficial in the treatment of chronic diseases, where long-term adherence to medication is crucial (Mishra et al. 2019). There has been a notable increase in SR drug delivery systems over the last two to three decades, driven by factors such as the high cost of developing new drug candidates, the expiry of current patents, and the availability of newly developed polymers that enable controlled drug release (Bhagat et al. 2023).
The choice of polymer material, including hydrophilic, hydrophobic, and biodegradable substances, shows the versatility of these formulations. In vitro dissolution studies can be used to determine the drug release rate (Santoshrao et al. 2014). A formulation can contain many components, such as polymers, lubricants, solubilizers, binders, and fillers. The choice of components depends on the specific dosage form and the requirements of the formulation procedure. Each component and its quantity directly impact the critical quality attributes (CQAs) of the dosage form (Szlek et al. 2022).
Current strategies to optimize the dissolution time of SR tablets still rely on conventional trial-and-error methods, which are time-consuming and expensive; more efficient and streamlined approaches are needed (Yanga et al. 2019). One way to disrupt the conventional approach is to apply machine learning (ML) and deep learning (DL) models. ML has become a transformative force in various research domains, including pharmaceuticals, because ML models can make data-driven predictions from datasets of preliminary data.
ML methods, such as optimization, can be employed to find combinations of ingredients and process parameters that produce the required properties, such as drug release profiles. This can lead to significant savings in materials, resources, and time. ML models can also help maintain product consistency by identifying the key factors that influence formulation quality (Sustersic et al. 2023). DL is a subset of ML that has gained significant attention and success across various domains of artificial intelligence. DL involves training artificial neural networks with several layers (deep neural networks, DNNs) to learn and make predictions or decisions.
This approach has proven successful in a variety of applications, including natural language processing and image and speech recognition, and DL often outperforms traditional ML methods in predictive performance. Many DL-based studies have been published in the last five years: in vitro prediction of pharmaceutical formulations (Yanga et al. 2019), prediction of adverse drug reactions in drug discovery (Mohsen et al. 2021), prediction of pharmacological drug properties (Aliper et al. 2016), and DL-based dose prediction (Gronberg et al. 2023). Further examples include the prediction of dissolution release by artificial neural network (ANN) and regression methods (Wang et al. 2022), prediction of the particle size distribution (PSD) of HPMC polymer matrices using ANN, support vector machine (SVM), and ETR models (Galatta et al. 2021), prediction by ANN of how tablet surface area and volume affect the release profile (Mazur et al. 2023), and prediction of the physicochemical and powder characteristics of an API by a four-layered ANN (Takayama et al. 2017). Multivariate tools and ANN methods have thus been shown to perform well in understanding the underlying relationships between CQAs and process parameters.
Building on these findings, ML and DL modeling methods can be used to assess the dissolution time of SR tablets, and further comparisons of DL against ML offer additional routes toward effective formulations. In the present work, we developed an optimal method for evaluating the dissolution time of SR tablets based on six models: DTR, GBR, XGBoost regressor, RFR, ETR, and DL. All of these models have been applied successfully in pharmaceutical formulation development, process development, and destructive analytical testing. By leveraging data-driven approaches, these methods contribute to more efficient and effective pharmaceutical development and manufacturing. Nevertheless, successful implementation requires careful consideration of data quality, model interpretability, and regulatory compliance in the pharmaceutical industry (Loua et al. 2021). This paper showcases a range of ML and DL methods for accurately forecasting the dissolution time of SR tablets.

Methodology Pharmaceutical data description
An existing literature-based data model (Hana et al. 2018) was chosen for development and organization. The data were filtered to include only records concerning SR tablets and their attributes, such as tablet hardness, thickness, friability, drug content, dissolution time, and dissolution release profile. To extend the database, a literature review was conducted in the Scopus, Web of Science, and Google Scholar databases. The search was built around SR tablet formulations: "HPMC" or "hydroxypropyl methylcellulose" or "hydroxypropylmethylcellulose" or "hydroxypropylmethyl cellulose" or "Hypromellose" tablets (Yanga et al. 2019).
A total of 210 articles were retrieved through the database search, of which 152 research articles were selected for feature data extraction. Upon hand searching, 80 articles did not meet all the inclusion criteria and were removed from the study. From the remaining 72 articles, a total of 1215 formulations were retrieved. The formulation data, including the API name, excipient names, and process details, were documented in the dataset. The final dataset includes the following parameters for each formulation: API name, dose, excipient name and dose (each excipient as a separate column), hardness, friability, thickness, drug content, and cumulative drug release (Momeni et al. 2023).

Workflow
Developing an ML model involves three stages: data preprocessing, modeling, and model interpretation, followed by selection of the best-performing model.

Preprocessing of data
After data collection, the data must be processed before building the predictive models to ensure their robustness and effectiveness. Several commonly employed methods are required to analyze the data, including data cleansing, dimensionality reduction, imbalanced-data solutions, and data splitting strategies. Data cleansing addresses missing data points by replacing them with median or mean values. There are, however, constraints on replacing missing values, and a decrease in data size might affect the accuracy of the model. Dimensionality reduction eliminates the dataset's least significant features, which lowers overfitting and reduces model complexity.
Common approaches to dimensionality reduction include principal component analysis (PCA), high-correlation filtering, and random forest feature selection. Imbalanced-data solutions address the uneven distribution of classes in the database, since training a prediction model on an unbalanced dataset leads to poor performance. Data splitting is another key step: the entire dataset is randomized and split into three subsets, training, validation, and testing. The training set trains the models, the validation set is used to tune hyperparameters and prevent overfitting, and the testing set measures predictive ability on unseen data. A typical ratio for these three subsets is 70:20:10, although the ratio depends on the data size. Data preprocessing and splitting strategies are therefore essential steps before modeling (Jiang et al. 2022).
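The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the column names and the synthetic data are hypothetical, and only median imputation plus a 70:20:10 split are shown.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-in for the curated formulation table.
df = pd.DataFrame({
    "hpmc_pct": rng.uniform(10, 40, 100),
    "lactose_pct": rng.uniform(0, 30, 100),
    "hardness_kp": rng.uniform(4, 12, 100),
    "dissolution_time_h": rng.uniform(2, 24, 100),
})
# Introduce some missing values to demonstrate cleansing.
df.loc[df.sample(frac=0.05, random_state=0).index, "hardness_kp"] = np.nan

# Data cleansing: replace missing values with the column median.
df = df.fillna(df.median(numeric_only=True))

X = df.drop(columns="dissolution_time_h")
y = df["dissolution_time_h"]

# 70:20:10 split: carve off 30% first, then split that 2:1 into val/test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=1/3, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 70 20 10
```

With 100 records this yields 70 training, 20 validation, and 10 test records; as noted above, the ratio would be adjusted for the actual data size.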

Modelling
ML modeling tasks cover various techniques, including classification, regression trees, neural networks, and many other algorithms. These models are trained on the prepared databases, and their performance is evaluated with error metrics: common choices are accuracy, precision, recall, and the F1-score for classification tasks, and the normalized root mean squared error and R² score for regression tasks. Keeping track of different modeling methods and exploring various features can be challenging and computationally expensive, so automated machine learning (AutoML) was utilized. AutoML approaches frequently use ensemble learning strategies, which combine several model types to produce predictions that may be more reliable. Here, a K-fold cross-validation technique was employed within TPOT AutoML to generate a final production model, selecting features based on a predefined threshold. Each fold consists of a distinct training-testing pair, with 568 records randomly selected for training, 244 records for validation, and 348 records for testing.
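The model comparison described above can be sketched as a cross-validated search over several regressors. This is a simplified stand-in: the paper used TPOT AutoML, whereas this sketch hand-rolls the comparison with scikit-learn so it stays self-contained, and the regression data is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic placeholder data (the real inputs are the formulation features).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

models = {
    "DTR": DecisionTreeRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "RFR": RandomForestRegressor(random_state=0),
    "ETR": ExtraTreesRegressor(random_state=0),
}

# 5-fold cross-validation, scoring each candidate pipeline by R².
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="r2").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

TPOT automates exactly this kind of search (plus preprocessing and hyperparameter choices) with a genetic algorithm, exporting the winning pipeline as Python code.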

Model training
After the ML modeling process, it is essential to assess the predictive performance of the models to understand how well they generalize to new, unseen data. ML models are prone to overfitting, which happens when a model learns not only the underlying patterns in the training data but also the noise and random fluctuations that accompany them. To avoid overfitting and maintain a stable model, the choice of features offers insight into which features had the biggest impact on the model's predictions, revealing the black-box nature of ML models. A subset of the data was also held out for evaluating the model using the K-fold approach. In this study, models were trained and verified via a five-fold cross-validation scheme, then split for feature selection by a Python script. The training and validation procedures were repeated 50 times to ensure complete coverage of the input database and obtain the best model. The final model was trained using a 10-fold cross-validation approach after the final input feature vector was chosen. Six algorithms drawn from the TPOT AutoML platform were utilized for feature selection and final model development: DTR, GBR, XGBoost, RFR, ETR, and DL. The root mean square error (RMSE), normalized root mean squared error (NRMSE), and coefficient of determination (R²) were used to measure the accuracy of the models:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\mathrm{obs}_i - \mathrm{pred}_i)^2}$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\mathrm{obs}_{\max} - \mathrm{obs}_{\min}} \times 100\%$$

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \qquad SS_{res} = \sum_{i=1}^{n}(\mathrm{obs}_i - \mathrm{pred}_i)^2, \qquad SS_{tot} = \sum_{i=1}^{n}(\mathrm{obs}_i - \overline{\mathrm{obs}})^2$$

where "obs_i, pred_i" are the observed and predicted values, "i" is the data record number, "n" is the total number of records, "obs_max" is the highest observed value, "obs_min" is the lowest observed value, "R²" is the coefficient of determination, "SS_res" is the sum of squares of the residual errors, "SS_tot" is the total sum of squares, and "obs̄" is the arithmetic mean of the observed values (Szlek et al. 2022).
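The three metrics defined above can be written as plain functions; this sketch follows the definitions in the text directly, with NRMSE normalized by the observed range and expressed as a percentage.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.sqrt(np.mean((obs - pred) ** 2))

def nrmse(obs, pred):
    """RMSE normalized by the observed range, as a percentage."""
    obs = np.asarray(obs, float)
    return 100.0 * rmse(obs, pred) / (obs.max() - obs.min())

def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

obs = [2.0, 4.0, 6.0, 8.0]
pred = [2.5, 3.5, 6.5, 7.5]
# RMSE = 0.5, NRMSE ≈ 8.33%, R² = 0.95 for this toy example.
print(rmse(obs, pred), nrmse(obs, pred), r2(obs, pred))
```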

Machine learning models
The accuracy of ML results cannot be improved simply by fitting data into models. As datasets become larger and more complex, better data handling techniques are essential.

Model interpretation
As ML models are inherently black boxes, efforts are undertaken to explain their prediction mechanisms. In our workflow (Fig. 1), we used the SHapley Additive exPlanations (SHAP) approach of Lundberg et al. to describe the global relationship between input and output variables. The SHAP method is a widely used approach derived from cooperative game theory and has found applications in various domains, including pharmaceuticals.
The notion of Shapley values is used to evaluate the contribution of each participant (or individual) toward the overall team effort or outcome. In machine learning, Shapley values have been adapted to explain the contribution of each feature in a predictive model: SHAP values provide a way to allocate the model's prediction to each feature fairly and consistently. The Shapley value of feature j is given by

$$\phi_j = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p-|S|-1)!}{p!}\left(\mathrm{val}_x(S \cup \{j\}) - \mathrm{val}_x(S)\right)$$

where "S" is a subset of the features employed in the model, "x" is the vector of feature values to be explained, and "p" is the number of features. "val_x(S)" is the prediction for the feature values in set "S", marginalized over the features excluded from set "S" (Fadel et al. 2022).
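The formula above can be computed by brute force for a small number of features. This is an illustrative sketch, not the SHAP library itself: the linear "model" is hypothetical, and features outside S are marginalized by averaging over a background sample, a common approximation.

```python
import itertools
import math
import numpy as np

def model(X):
    # Hypothetical linear predictor: f(x) = 3*x0 + 2*x1 - x2.
    return 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2]

def shapley_values(x, background, predict, p):
    """Exact Shapley values via enumeration of all feature subsets."""
    phi = np.zeros(p)

    def val(S):
        # Prediction with features in S fixed to x, the rest averaged
        # over the background data (marginalization).
        Xb = background.copy()
        Xb[:, list(S)] = x[list(S)]
        return predict(Xb).mean()

    for j in range(p):
        others = [k for k in range(p) if k != j]
        for r in range(p):
            for S in itertools.combinations(others, r):
                # Shapley weight |S|!(p-|S|-1)!/p! for this subset.
                w = math.factorial(len(S)) * math.factorial(p - len(S) - 1) / math.factorial(p)
                phi[j] += w * (val(S + (j,)) - val(S))
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))
x = np.array([1.0, -1.0, 0.5])
phi = shapley_values(x, background, model, p=3)
# For this linear model, phi_j = coefficient_j * (x_j - background mean of column j),
# and the values sum to f(x) minus the mean background prediction.
print(np.round(phi, 2))
```

The SHAP library computes the same quantities far more efficiently (e.g. TreeSHAP for tree ensembles), which is what makes summary and dependence plots practical for models like RFR.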

Database
Table 1 demonstrates that the variables were not normally distributed; instead, the proportions of formulations were notably positively skewed (right-skewed). However, the database was split using a balanced 10-fold cross-validation approach to ensure that the input variables were distributed fairly across the splits.

Choosing features and creating the final model
Choosing features and creating the final model were accomplished with the TPOT AutoML method. The AutoML settings are given in Table 2, which also shows the accuracy of the developed models. The NRMSE, RMSE, R², MAE, and MSE values were further analyzed to check the accuracy and precision of the model outputs.
As expected, ML techniques, especially RFR, proved to be the best pipeline in the TPOT AutoML analysis of the curated dataset, but the DL model, once hyper-tuned, performed on par with its ML counterparts, with strong R² and NRMSE values. Following the preliminary evaluation of the DL models under a five-fold CV scheme, the DL model was retrained with three hidden layers of 100 neurons each, 35 input neurons, and one output neuron over 2200 epochs, combined with a 10-fold cross-validation scheme; model accuracy then jumped, showing significant improvement in NRMSE and R².
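The final network shape described above (35 inputs, three hidden layers of 100 neurons, one output) can be sketched with scikit-learn's MLPRegressor as a compact stand-in for the tuned deep network. The synthetic data, the scaling step, and early stopping are assumptions for illustration; the layer widths and the 2200-epoch cap are taken from the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data with 35 features, matching the 35 input neurons.
X, y = make_regression(n_samples=400, n_features=35, noise=10.0, random_state=0)

net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100, 100, 100),  # three hidden layers
                 max_iter=2200,                        # epoch cap from the text
                 early_stopping=True,
                 random_state=0),
)
net.fit(X[:300], y[:300])
pred = net.predict(X[300:])
print(pred.shape)  # one prediction per held-out record
```

In the study itself, this architecture was trained and evaluated under a 10-fold cross-validation scheme rather than the single hold-out split shown here.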

Feature selection of input variables based on scaled importance
The input variables were categorized into two main groups: composition and manufacturing parameters. Except for those in the composition group, features from all categories that fell below the variable importance threshold were eliminated. Consequently, the final input vector contained 58 inputs.
The scaled importance of the input variables shows that polymers dominate: a higher concentration of polymer has a strong influence on the predicted values. As shown in Table 3, lubricants and solubilizers (PEG) are the least important components of the tablets, which could be related to the positive skew in the variable distribution.

Model performance
SHAP summary plots were produced to determine the effect of the two categories of inputs, formulation composition and manufacturing specifications, on the output. Higher dissolution times are predicted at higher concentrations of polymers such as Polyox WSR, HPMC, and Carbopol. For diluents (lactose, MCC), an inverse relationship is seen with lactose, where a higher amount of lactose produces a lower dissolution time. For the lubricants (magnesium stearate, Aerosil, and Avicel), two different effects can be seen: high concentrations of Aerosil lead to greater dissolution times, which may be due to its hydrophilic nature, while the lipophilic nature of magnesium stearate decreases dissolution times at higher concentrations. For binders such as povidone and maltodextrin, higher dissolution times are associated with higher concentrations of povidone. For glidants (talc, Aerosil), a decreased amount of Aerosil leads to increased dissolution time.

Discussion
Solid oral dosage forms occupy a major place in the field of pharmaceuticals, and their efficacy depends strongly on the absorption of the API in the human body. The major factor impacting the API is its dissolution behavior, which is essential for ensuring the bioavailability and therapeutic effectiveness of the API. The dissolution profile, however, is influenced mostly by the physical characteristics of the materials, such as the solubility and filler materials employed in the development procedure, and by process variables such as compression force. It therefore requires a careful balance of formulation design, choice of materials, and control of manufacturing processes. Regulatory agencies often set specifications for dissolution profiles to ensure the consistency and efficacy of the product. The traditional dissolution method involves testing a small number of tablets, which can be laborious, slow, and expensive; additionally, it may not provide a complete representation of the overall dissolution behavior. As a result, there is interest in developing surrogate or alternative methods that provide insight into dissolution attributes without the need for extensive testing (Fink et al. 2023). On this account, an alternative method was employed: ML models are powerful function approximators that can learn and forecast complex systems from large amounts of input data.
Lately, ML has been productively implemented in pharmaceutical research in the formulation development area, and the developed models have achieved high accuracy in predicting the dissolution curves of SR tablets (Hana et al. 2019). DL is one of the most extensively utilized ML methods, and DL and ML have reshaped many fields. Optimizing the formulation of an SR dosage form is a crucial step in pharmaceutical development. To achieve this efficiently, researchers often employ statistical experimental designs, which are valuable techniques for systematically exploring the relationships between formulation variables and the dissolution rate; these designs help achieve the desired dissolution times with minimal experimentation.
In the present study, as described in the Methodology section, the developed models were compared to determine which ML technique produced the best results. The outcomes displayed in Table 1 follow the 5-fold cross-validation scheme. The ML techniques, especially RFR, performed best in the TPOT AutoML analysis, but upon hyperparameter tuning the DL model yielded better R² and NRMSE values. After the TPOT AutoML output analysis, the final model was developed under a 10-fold cross-validation scheme, in which the DL model was trained over 2200 epochs, giving better results than before. In Fig. 2, the critical parameters that affect dissolution time are plotted, and the factors underlying the final model's forecasts were analyzed using Shapley values. The findings not only provide insight into dissolution time prediction but also reinforce the suitability of AutoML-based approaches for addressing the challenges posed by complex pharmaceutical tasks. The results showed that DL achieved significantly better outcomes than the other ML models, and our predictions were consistent with previous work (Hana et al. 2018; Yanga et al. 2019).

Conclusion
SR tablets represent a notable advancement in formulation development, especially when compared to the traditional trial-and-error methods that have been in use for hundreds of years. The traditional approach often involves repeated experimentation and refinement, consuming significant time, financial resources, and human effort. The present research advanced ML and DL models that provide useful estimates of the dissolution time of SR tablet formulations. In summary, the employed models could constructively anticipate the key attributes, with DL predictions outperforming those of the other ML models. The proposed DL model not only streamlines SR tablet development but also shows potential applications across different dosage forms and pharmaceutical research areas. Its contribution to reducing development timelines, optimizing resources, and facilitating robust drug product development underscores its significance in advancing pharmaceutical formulation research.

Figure 2. Scatter plot between actual and predicted values for dissolution time of the DL model.

Figure 3. SHAP dependence diagram for the ML models for the top 20 attributes.

Table 1. Descriptive statistics of the dataset.

Table 2. Robustness of the TPOT AutoML-developed models and the DNN model.

Table 3. Selected input variables for the best predictive ML models (excerpt: variable, category, scaled importance).

…                               Polymer, composition      0.071688362
Colloidal silicon dioxide [%]   Lubricant, composition    0.068848904