Correlating physico-chemical properties of analytes with Hansen solubility parameters of solvents using machine learning algorithm for predicting suitable extraction solvent

Artificial neural networks (ANNs) are biologically inspired algorithms designed to simulate the way in which the human brain processes information. In sample preparation for bioanalysis, liquid–liquid extraction (LLE) is an important step, and selecting the extraction solvent is its most laborious part. In the current work, a robust and reliable ANN model for LLE solvent prediction was generated to predict a suitable solvent for analyte extraction. The developed ANN model takes a set of chosen descriptors of the analyte as input and predicts the corresponding Hansen solubility parameters (HSPs) of the suitable extraction solvent as output. Then, from the solvent combinations appendix, the analyst can easily and efficiently identify the proposed extraction solvent combination for the analyte. For experimental validation of the model's prediction capabilities, twenty structurally diverse drugs belonging to different pharmacological classes were extracted from human plasma. Extraction was performed using the predicted extraction solvent combination for each drug, and extraction recovery was quantitatively estimated by HPLC/UV methods. The developed LLE solvent prediction model is in line with the global trend towards green chemistry since it limits the consumption of organic solvents.


Materials and reagents
The drugs used were supplied by different pharmaceutical companies. HPLC-grade acetonitrile and methanol were purchased from Sigma-Aldrich (Germany). Ortho-phosphoric acid, acetic acid and potassium hydroxide were supplied by EL-Nasr Pharmaceutical Chemicals Co., Egypt. Potassium dihydrogen phosphate and ammonium acetate were supplied by Sigma-Aldrich (Germany). Bi-distilled water was produced in-house (Aquatron Water Still, A4000D, UK). Membrane filters with a pore size of 0.22 μm were purchased from ChromTech (UK). Human blank plasma was obtained from the Holding Company for Biological Products and Vaccines (VACSERA, Egypt) and stored at −70 °C.

Instrumentation
The HPLC instrument (Agilent 1100 series) was composed of an Agilent isocratic pump (G1310A), an Agilent UV-visible detector (G1314A), an Agilent manual injector (G1328B) with a 20 µL injector loop, and an Inertsil ODS-3 column (5 µm, 150 mm × 4.6 mm). An Agilent syringe (50 µL, USA) and a Powersonic 405 ultrasonic processor (Human Lab Inc., Hwaseong City, Korea) were employed. The pH was adjusted by adding ortho-phosphoric acid or potassium hydroxide, using a pH meter equipped with a glass electrode (Jenway 3505, Essex, UK).

Dataset construction
The extraction data of sixty-three structurally diverse drug molecules, belonging to different pharmacological classes and covering a wide range of physicochemical properties, were collected from the literature. The selected extraction solvents were ethyl acetate, diethyl ether, tert-butyl methyl ether, and dichloromethane, whereas drugs extracted with toxic solvents (e.g., chloroform) were excluded. The HSP values of the solvents were obtained from Dr Manuel Díaz de los Ríos, Director of Derivatives Division, ICIDCA [23].

Drawing structures and molecular descriptors calculation
Molecular Operating Environment (MOE, 2020.0901) software was used for all the molecular modeling studies. Canonical SMILES of the sixty-three drugs were imported from PubChem [24] into MOE and converted into 3D structures. Energy minimization was performed for the built compounds with the MMFF94x force field until an RMS gradient of 0.05 kcal mol⁻¹ Å⁻² was reached, and the partial charges were calculated automatically. MOE molecular mechanics descriptors were calculated for each compound, and RapidMiner 7.1.000 Basic Edition [25] was used to remove low-variance ("redundant") descriptors with the Remove Useless Attributes operator, as they add no information to the model; this left a pool of 301 descriptors. Based on the relation of the different descriptors to the target parameters (Hansen solubility parameters), we found that dipole moment (Dipole), van der Waals volume in Å³ (Vdw_vol), van der Waals energy (E_vdw), and log octanol/water partition coefficient (logP(o/w)) are the most important descriptors.
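The low-variance filtering step can be illustrated with a minimal Python sketch. It is not the RapidMiner Remove Useless Attributes operator itself: the function name `drop_low_variance`, the variance cut-off, and the toy descriptor values are all assumptions made for illustration.

```python
from statistics import pvariance


def drop_low_variance(table, min_variance=1e-8):
    """Return a copy of `table` without near-constant descriptor columns.

    `table` maps a descriptor name to its values (one value per compound).
    `min_variance` is a hypothetical cut-off, not a value from the paper.
    """
    return {
        name: values
        for name, values in table.items()
        if pvariance(values) > min_variance
    }


# Toy values for three compounds (invented for illustration only).
descriptors = {
    "dipole": [1.2, 3.4, 0.8],
    "vdw_vol": [250.0, 310.5, 198.2],
    "constant_flag": [1.0, 1.0, 1.0],  # identical for every compound
}

kept = drop_low_variance(descriptors)
print(sorted(kept))
```

A column that takes the same value for every compound cannot help distinguish one analyte from another, which is why such descriptors can be dropped before modeling.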

Training set and test set generation
The 63 selected drugs were split manually and at random into a training set of 48 molecules and an external test set of 15 molecules. The split was arranged so that the test set maintains the same distribution of Hansen solubility parameters (HSPs) as the original dataset, by keeping the ratio of the different solvents in the training and test sets equal to that in the original dataset (Supplementary Table S1 and Supplementary Table S2 [74–88] in Supplementary File).
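A split that preserves the solvent ratios can be sketched in Python as a stratified random split. This is an illustration only: the split in this work was done manually, the function name `stratified_split` is hypothetical, and the per-solvent counts below are invented (the actual assignments are in Supplementary Tables S1 and S2).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split sample indices into train/test, keeping each label's share similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Per-label rounding; the total test size is not guaranteed in general.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical solvent counts summing to 63 (not the paper's actual counts).
labels = (
    ["ethyl acetate"] * 21
    + ["diethyl ether"] * 17
    + ["TBME"] * 13
    + ["dichloromethane"] * 12
)

train_idx, test_idx = stratified_split(labels, test_fraction=15 / 63)
print(len(train_idx), len(test_idx))
```

Because the test count is rounded separately for each solvent, other class sizes may give a test set that is one or two compounds off the intended total and would need a manual adjustment.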

LLE model generation
MATLAB (version 7.12.0.635, R2011a) was used for generating the ANN models. Mean Absolute Error (MAE) is the model evaluation metric used to describe the average model performance. Linear Layer (design) was used in the ANN model generation.
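As a rough illustration of a designed linear layer, the Python sketch below fits the weights and bias of a single linear output by ordinary least squares. It assumes that "Linear Layer (design)" amounts to a least-squares fit; it is not the authors' MATLAB script, it uses pure Python to stay dependency-free, and the function names and toy data are hypothetical. In this setting one such weight vector would be fitted for each of the three Hansen parameters.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def fit_linear_layer(inputs, targets):
    """Least-squares weights (bias stored last) for one linear output."""
    rows = [list(x) + [1.0] for x in inputs]  # constant column for the bias
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return solve_linear_system(xtx, xty)  # normal equations


def predict(weights, x):
    """Weighted sum of the inputs plus the bias (last weight)."""
    return sum(w * v for w, v in zip(weights, x)) + weights[-1]


# Toy data generated from y = 2*x1 - x2 + 3 (invented, two descriptors only).
inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 5.0)]
targets = [5.0, 2.0, 4.0, 6.0, 4.0]

weights = fit_linear_layer(inputs, targets)
print([round(w, 6) for w in weights], round(predict(weights, (4.0, 2.0)), 6))
```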

Model validation
To assess the prediction ability and the robustness of the generated models, the developed model was validated using: (a) Internal validation: this was carried out using leave-20%-out cross-validation (CV_L20%O), in which the training set was split into five subsets, and training and test subsets were chosen such that each point appears in the test subset exactly once. Five ANN models were generated using the linear layer design network. (b) External validation: this was carried out by using the generated model to predict the Hansen solubility parameters for the independent test set. This directly simulates the real-case scenario, which requires prediction for new compounds (unseen by the model).
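The leave-20%-out scheme can be sketched in Python as a five-fold partition of the 48 training compounds. The function name `five_fold_indices`, the random seed, and the way compounds are dealt into folds are illustrative assumptions; the text does not specify how the five subsets were formed.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, held_out) index lists; each sample is held out once."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    # Deal the shuffled indices into n_folds nearly equal subsets.
    folds = [order[k::n_folds] for k in range(n_folds)]
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(held_out)


splits = list(five_fold_indices(48))
for train, held_out in splits:
    print(len(train), len(held_out))
```

One model is then fitted on each `train` subset and evaluated on the matching `held_out` subset, giving five models in total.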

Experimental validation
To test the generated model in a real-case scenario, experimental validation of the model prediction was carried out. The developed ANN model was applied to twenty structurally diverse drugs from different pharmacological classes (Fig. 1) to predict their suitable extraction solvent combinations.

Prediction of the HSPs of the extraction solvent combination
First, the model's four descriptors were calculated for the 20 drugs using MOE (Supplementary Table S3 in Supplementary File). The ANN linear layer design model was then applied to them to predict the HSPs of the solvent combination to be used to extract each drug. Finally, using the solvent combination appendix (Supplementary Table S4 in Supplementary File), the corresponding solvent mixture for each drug was determined based on its predicted HSPs.

Determination of the solute recovery from spiked plasma using the predicted solvent combinations
The twenty drugs were extracted from spiked human plasma using the predicted solvent combinations. Various mobile phases and chromatographic conditions were used for the separation and quantitation of those drugs using HPLC/UV methods (Supplementary Table S5 and Supplementary Fig. S1 in Supplementary File). Selectivity of the developed chromatographic methods was confirmed by the absence of any interfering peaks from plasma samples at the retention times of the investigated drugs (Supplementary Fig. S2 in Supplementary File).

Preparation of standard solutions
Stock solutions (1 mg/ml) were prepared by dissolving each drug in the appropriate HPLC-grade solvent (water, methanol or acetonitrile) and stored at a nominal temperature of 4 °C. These stock solutions were diluted with a mixture of methanol and water (50:50, v/v) to obtain the required working solutions (100, 200 and 300 μg/ml).
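The dilution arithmetic follows the usual C1·V1 = C2·V2 relation. The short Python sketch below computes the stock volume needed for each working solution; the 10 ml final volume and the function name `stock_volume_ml` are assumptions for illustration, since the text does not state the volumes used.

```python
def stock_volume_ml(stock_ug_per_ml, target_ug_per_ml, final_volume_ml):
    """Volume of stock needed so that C1 * V1 = C2 * V2."""
    if target_ug_per_ml > stock_ug_per_ml:
        raise ValueError("target concentration exceeds stock concentration")
    return target_ug_per_ml * final_volume_ml / stock_ug_per_ml


STOCK_UG_PER_ML = 1000.0  # 1 mg/ml stock, as stated in the text
FINAL_VOLUME_ML = 10.0    # hypothetical final volume

volumes = {
    target: stock_volume_ml(STOCK_UG_PER_ML, target, FINAL_VOLUME_ML)
    for target in (100.0, 200.0, 300.0)
}
for target, volume in volumes.items():
    print(f"{target:.0f} ug/ml: {volume:.1f} ml stock made up to {FINAL_VOLUME_ML:.0f} ml")
```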

Preparation of human plasma samples and analyte extraction
Plasma samples (0.5 ml) containing the analyte were vortexed for 30 s. The extraction solvent mixture was added to the spiked plasma and blank samples. Samples were vortexed for 1.5 min and centrifuged at 4500 rpm for 10 min. The clear supernatant was transferred into a clean Wassermann tube and evaporated to dryness at 45 °C under a stream of nitrogen, and the dried extract was reconstituted with 100 μl of the mobile phase.

Procedure for extraction recovery calculations
The recovery following sample preparation with the LLE model was evaluated by comparing the mean peak area of extracted samples at three concentration levels (low, medium, and high) with the mean peak area of plain (non-extracted) standards of equivalent concentrations. Six replicates were performed for each concentration with the established extraction procedure.
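The recovery calculation described above can be sketched in Python as the ratio of mean peak areas expressed as a percentage. The peak areas are invented for illustration, and the function name `extraction_recovery` is hypothetical.

```python
from statistics import mean


def extraction_recovery(extracted_areas, standard_areas):
    """Percent recovery: mean extracted peak area over mean standard peak area."""
    return 100.0 * mean(extracted_areas) / mean(standard_areas)


# Invented peak areas for one concentration level (arbitrary units).
extracted = [1520.0, 1495.0, 1510.0]
standards = [1600.0, 1590.0, 1610.0]

recovery = extraction_recovery(extracted, standards)
print(f"{recovery:.1f} %")
```

The same function would be applied at each of the three concentration levels, and the three results averaged to give a single recovery per drug.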

Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.

Results and discussion
A correlation between some of the molecular mechanical descriptors of the drugs (dipole moment, van der Waals volume, and log octanol/water partition coefficient) and the target property (HSPs) of the extraction solvents was established using an ANN. The descriptors were selected on the basis of their high intercorrelation with mutual solubility. The target properties (HSPs) are physicochemical parameters that are commonly used to estimate the form of the interactive forces that govern material compatibility. The HSP approach assumes that the cohesive energy (E) can be divided into three parts: atomic dispersion (Ed), molecular dipolar interactions (Ep), and hydrogen-bonding interactions (Eh). ANNs are a type of computer program that can be taught to mimic relationships in data sets. After the ANN has been "trained", it can be used to predict the outcome of a new set of input data, such as a different composite system. Linear Layer (design) was used in the ANN model generation (Fig. 2). The generation was done using a custom script written in MATLAB (version 7.12.0.635, R2011a). The mean absolute error of a model is the mean of the absolute values of the individual prediction errors over all instances in the dataset. Each prediction error is the difference between the predicted value and the true value for the instance:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, and $n$ is the sample size.
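The mean absolute error defined above can be computed with a few lines of Python. The predicted and reported values below are invented for illustration and are not taken from the supplementary tables.

```python
def mean_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over all instances."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Invented values of one Hansen parameter for three compounds.
predicted = [15.8, 16.4, 17.0]
reported = [15.5, 16.0, 18.0]

mae = mean_absolute_error(predicted, reported)
print(round(mae, 3))
```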

MAE of external validation
The MAE of the Hansen solubility parameters was 0.79 ± 0.56 for Hansen D, 1.14 ± 0.97 for Hansen P, and 1.23 ± 0.35 for Hansen H. The MAE of external validation was calculated by summing the absolute differences between the predicted and reported values and dividing by the number of compounds in the test set (Supplementary Table S7 in Supplementary File).

Predicted HSPs of the investigated drugs and the predicted solvents' combinations
The predicted Hansen solubility parameters of the twenty drugs were obtained by applying the developed ANN model to those drugs. For better extraction recovery, a combination of extraction solvents is recommended over a single extraction solvent. Supplementary Table S5 in Supplementary File shows the Hansen solubility parameters of different combinations of the four extraction solvents used in the developed model at different ratios. The HSPs of each combination were obtained by multiplying the HSPs of each solvent by its fraction in the mixture and summing the resulting values for the two solvents. By visual inspection of the solvent combinations table and the predicted HSP values obtained from the model, one or two solvent combinations with HSP values close to the predicted values could be selected (Table 1).
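The mixing rule and the look-up step can be sketched in Python as follows. Several points are assumptions: the fractions are treated as volume fractions, the closest blend is chosen by plain Euclidean distance in (D, P, H) space rather than by the visual inspection used in this work, and the solvent HSP values are commonly tabulated approximate figures that may differ from those supplied by ref. 23. The function names are hypothetical.

```python
from math import dist


def blend_hsp(hsp_a, hsp_b, fraction_a):
    """Fraction-weighted HSPs (D, P, H) of a two-solvent blend."""
    fraction_b = 1.0 - fraction_a
    return tuple(fraction_a * a + fraction_b * b for a, b in zip(hsp_a, hsp_b))


def closest_blend(target, hsp_a, hsp_b, steps=10):
    """Return (fraction_a, blend HSPs) for the blend nearest to `target`."""
    candidates = [
        (k / steps, blend_hsp(hsp_a, hsp_b, k / steps)) for k in range(steps + 1)
    ]
    # Euclidean distance stands in for the visual comparison with the table.
    return min(candidates, key=lambda c: dist(c[1], target))


# Approximate literature HSPs (D, P, H); illustrative, not the paper's values.
ETHYL_ACETATE = (15.8, 5.3, 7.2)
DICHLOROMETHANE = (17.0, 7.3, 7.1)

predicted_hsp = (16.4, 6.3, 7.15)  # invented model output for one drug
fraction, blend = closest_blend(predicted_hsp, ETHYL_ACETATE, DICHLOROMETHANE)
print(fraction, tuple(round(v, 2) for v in blend))
```

Scanning the ratio in steps of 10% mimics a table of pre-computed combinations; a finer grid or additional solvent pairs could be added in the same way.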

Recovery of the investigated drugs
Recovery of each drug was determined by comparing the results obtained from the analysis of plasma spiked at three different concentrations with those of non-extracted samples of equivalent concentrations (Table 1).

Conclusion
A robust LLE solvent prediction model, which helps in predicting organic extraction solvent combinations for different drugs from aqueous-based matrices, was built and validated. This was done by correlating some of the molecular mechanical descriptors of the drugs with the target property (HSPs) of the extraction solvents using an ANN. The prediction ability and the robustness of the generated model were assessed by internal and external validation. The generated ANN model was applied to twenty drugs from different pharmacological classes. The investigated drugs were extracted using the predicted extraction solvent combination for each drug, and their extraction recovery was quantitatively estimated by HPLC/UV methods. Good extraction recoveries were achieved. Therefore, bioanalysis could be much easier and more eco-friendly with the aid of the developed LLE solvent prediction model. The generated ANN model can be continuously improved by adding more input data to extend its prediction capabilities.

Figure 1. Chemical structures of the investigated drugs.

Figure 2. The developed ANN model structure.

For the leave-20%-out cross-validation (CV), the MAE of the Hansen solubility parameters was 0.77 ± 0.48 for Hansen D, 1.19 ± 0.87 for Hansen P, and 1.12 ± 0.46 for Hansen H. The MAE of CV was calculated by summing the absolute differences between the predicted and reported values and dividing by the number of compounds in the training set (Supplementary Table S6 in Supplementary File).

Table 1. Predicted Hansen solubility parameters and extraction solvent mixtures of the investigated drugs using the linear layer design model, with their extraction recovery. *Extraction recovery is the average recovery of three concentrations for each drug, where each concentration was repeated six times. *TBME: tertiary-butyl methyl ether.