A machine learning framework for the prediction of antibacterial capacity of silver nanoparticles

Their biocompatibility has made silver nanoparticles strong candidates for various nanomedical applications. Research interest in silver nanoparticles as a viable alternative to antibiotics is growing because of their enhanced antimicrobial activity and low cytotoxicity. Machine learning (ML) has become a state-of-the-art analytic and modelling tool in recent times, owing to its predictive capability and the accuracy of its results. In this work, we present machine-learning techniques to predict the antibacterial capacity of silver nanoparticles and extend the work to antifungal studies. In the first phase, we reviewed 50 articles and collected data points for training the models; the data consist of features such as core size, shape of the nanoparticle, dosage and bacteria/fungi species, together with the zone of inhibition (ZOI). We then trained eight different machine-learning regression algorithms on the data and validated the models' performance using four metrics: RMSE, MSE, MAE and R². Furthermore, the importance of the features used in the prediction models was evaluated. The feature importance revealed that the core size of silver nanoparticles is the most prominent feature in the prediction of antibacterial capacity. The optimum model for the prediction of antibacterial and antifungal activity was identified, and its validation was demonstrated. This work enables researchers to utilize machine learning, which in turn can address the challenges of time and cost in laboratory experiments while minimising the reliance on trial and error.


Introduction
A significant challenge facing medical science in recent times is antibiotic resistance [1]. Antibiotic resistance is rising to dangerous levels, which poses a risk to global health and makes the treatment of common infections more difficult [2]. The continuous usage of chemical compounds has led to the development of genetic resistance in microorganisms [3]. Consequently, in recent years, a vital strand of research has focused on the development and modification of antibacterial agents to enhance their antibacterial effectiveness [4].
Nanotechnology has been applied to a wide variety of fields, including the physical, chemical, and biological sciences. Innovative techniques designed to probe and manipulate single atoms and molecules are being developed in various fields of engineering [5]. Advancements in nanotechnology have opened new horizons in nanomedicine, biomedicine, pharmaceuticals, sensors, electrochemistry, catalysis, food technology, and cosmetics [6][7][8].
Nanoparticles are nanometer-scale particles that have one or two dimensions within the range of 100 nm or less and that have unique properties compared to the bulk material, depending upon their morphology and size [9][10][11]. Among all nanoparticles, metal nanoparticles attract great research interest because of their excellent properties, such as a high surface-to-volume ratio, size- and shape-dependent behaviour, and highly tunable physical and optical properties. These properties have led to their widespread integration into nearly every discipline of scientific research, ranging from catalysis and sensing to optics, antibacterial applications, nanomedicine and data storage [12][13][14][15][16][17]. Among metal nanoparticles, silver nanoparticles have acquired great interest due to their strong plasmonic properties, low cytotoxicity, anti-inflammatory properties and potent antibacterial activity [18]. These properties are attributed to their smaller size and larger surface area, which differ significantly from those of bulk silver [19]. The morphology of silver nanoparticles determines their physical and chemical properties [20].
Silver nanoparticles are applied to wounds and burns to inhibit bacterial infection [21]. Silver has been used for a very long time as an antiseptic and as an antibacterial agent against many Gram-positive and Gram-negative bacteria [22,23]. There are numerous reports of silver nanoparticles exhibiting substantial antibacterial efficiency against a wide range of bacteria present in humans, such as Escherichia coli, Staphylococcus aureus, Pseudomonas aeruginosa and Klebsiella pneumoniae [22][23][24][25]. Although mechanisms of antibacterial activity have been proposed, uncovering the exact mechanism remains of great interest to researchers. The available experimental data provide support for various mechanisms that take into account the physical and chemical characteristics of silver nanoparticles (AgNPs), including their size and surface properties. These properties enable AgNPs to interact with, and potentially traverse, cell walls or membranes, exerting a direct impact on intracellular components [26]. AgNPs are considered able to prevent or limit the emergence of antibiotic-resistant bacteria and to inhibit the evolution of resistant strains [27,28].
Researchers face numerous challenges during antimicrobial studies. The process from synthesis to antibacterial and antifungal characterization is time-consuming. It demands a significant amount of trial and error, making it a resource-intensive endeavour. Furthermore, the entire process can be quite expensive. To address the issues of time and cost in laboratory experiments and to minimize the reliance on trial and error, researchers can rely on machine learning (ML). With the help of ML, scientists can make more efficient decisions, saving valuable time and resources in the experimental process. The novelty of our work is that it enables researchers to utilize machine learning to address these challenges of time, cost and trial and error in laboratory experiments.
The availability of supercomputing platforms, along with artificial intelligence techniques, boosts the traditional approach in nanomedicine [29]. Over the past few years, machine learning (ML) and artificial intelligence (AI) have been gaining significant importance due to their predictive capabilities and the increased efficiency of their results. Machine learning, which lies at the crossroads of computer science and statistics, is a rapidly progressing field of technology. It forms a crucial part of artificial intelligence and data science. The remarkable advancements in machine learning can be attributed to the emergence of new learning algorithms and theories, as well as the abundant availability of online data and low-cost computation [30]. The fundamental qualities of the datasets determine the predictive ability of machine learning models [31]. The literature strongly supports the utilization of machine learning techniques in predicting the toxicity profiles of nanoparticles [32][33][34] and in drug design and the development of novel antibiotics [31,35,36].
Machine learning tools help researchers by making predictions about the antibacterial and antifungal effects of silver nanoparticles, based on the physical and chemical properties and the exposure conditions of the nanoparticles. Mahsa Mirzaei et al proposed an ML tool to predict the antibacterial activity of nanoparticles against a vast range of Gram-positive and Gram-negative bacteria. They focused on three different nanoparticles, namely zinc oxide, iron oxide and silver nanoparticles [37]. Tarun Mateti et al developed a polynomial equation using machine learning that predicts the antibacterial activity of silver nanoparticles against S. aureus and E. coli, focusing only on spherical silver nanoparticles [38]. Afshin Saadat et al studied machine-learning techniques to predict the antibacterial activity of AgNPs against Escherichia coli, Pseudomonas aeruginosa, Staphylococcus aureus, and Klebsiella pneumoniae [39]. Eyup Bilgi et al employed two different machine learning approaches, namely decision tree (DT) and artificial neural network (ANN), to predict the cytotoxic potential of nanosilver based on material and assay-related parameters [40]. Anjana S Desai et al utilized decision tree and random forest models to comprehend the relationship between the physical parameters of silver nanoparticles and their cytotoxicity [41]. Lei Liu et al reported a meta-analysis for the cytotoxicity classification of photosynthesized AgNPs using two machine learning algorithms, namely decision tree and random forest [42]. Devina et al studied microbial activity in silver nanoparticles using a modified convolution network [43]. In this work, we used numerous machine-learning models to predict the antibacterial capacity of silver nanoparticles against different Gram-positive as well as Gram-negative bacteria. We also utilised the models to predict the antifungal capacity. The antibacterial and antifungal capacity of silver nanoparticles is predicted with the help of physical-chemical properties, the exposure conditions of the antibacterial/antifungal experiment and the in-vivo properties of the bacteria/fungi. The label selected for the antibacterial prediction is the zone of inhibition (ZOI), which is given in mm and represents the diameter of the clear region, or halo, formed around the antibacterial agent under evaluation where the growth of the bacteria is inhibited.
During the machine learning process, we perform a random split of the data into training and test sets, which allows for multiple realizations of the training and test sets with the same split ratio. To gain a more comprehensive picture of the performance metrics, we simulated over 50 realizations of each model and plotted box plots of each performance metric for the 8 machine learning models. To our knowledge, there is currently no research on the antibacterial and antifungal properties of silver nanoparticles that employs eight different machine learning models along with an analysis of performance metrics through simulations (50 realizations) for each model. The work therefore enables researchers to utilize machine learning to better understand the antibacterial and antifungal behaviour of silver nanoparticles, thereby addressing challenges such as the time, cost and significant amount of trial and error involved in laboratory experiments.

Antimicrobial effectiveness of silver nanoparticles
Silver nanoparticles are known to be highly toxic to various microorganisms [44]. The literature offers three different postulates for how silver nanoparticles exert antibacterial action. The first suggests that the action of silver nanoparticles occurs at the cell membrane, as they can enter the outer membrane and accumulate within the inner membrane. The accumulation leads to the adhesion of nanoparticles to the cell, which results in its destabilization and causes damage [45,46]. Electrostatic interactions and the affinity of metallic ions for sulphur proteins enhance the bonding of silver nanoparticles to the cell wall of bacteria [47,48]. The second postulate suggests that the nanoparticles possess the ability not only to disrupt and traverse the cell membrane, modifying its structure and permeability, but also to enter the inner part of the cell. Within the cell, it has been proposed that silver nanoparticles exhibit an affinity for sulphur or phosphorus groups found in intracellular components such as DNA and proteins. This interaction can eventually lead to changes in the structure and functionality of these biomolecules. Similarly, silver nanoparticles can alter the respiratory chain in the inner membrane by interacting with thiol groups present in enzymes. The interaction initiates the production of reactive oxygen species and free radicals, causing harm to the intracellular machinery. The last postulate operates alongside the previous two and involves the release of silver ions from the silver nanoparticles. Owing to their size and charge, these silver ions can interact with cellular components, which in turn leads to alterations in metabolic pathways and membranes [49][50][51][52][53].
The antimicrobial effectiveness of nanoparticles is influenced by two primary factors: (i) the physical and chemical characteristics of the nanoparticles, including surface changes and composition, and (ii) the species of bacteria/fungi being targeted [37]. Both factors play an important role in predicting the antibacterial and antifungal capacity of silver nanoparticles, so in our study we made use of these properties in determining that capacity.

Methods
Machine learning (ML) techniques are algorithms used for classification, clustering or regression. Regression algorithms can be used to predict various properties of nanoparticles. In this research, the Python programming language is used for the implementation of the ML algorithms. Scikit-learn, a machine learning library that provides a vast array of algorithms, is utilised for the implementation. Its documentation provides narrative examples along with sample code, which allows the community to implement standard machine learning algorithms (e.g. regression, clustering and classification) with little effort [54,55]. We also utilised the Matplotlib, pandas, NumPy and seaborn libraries in our machine learning work. The models used in this work are XGBoost, LightGBM, Random Forest, AdaBoost, the multilevel model or MLM (also known as a mixed model with random effects), Ridge, Lasso and Elastic Net regression. The models are discussed in detail in section 2.6.

Data collection
Studies investigating the antibacterial activity of silver nanoparticles were compiled from Google Scholar and Web of Science. The keywords used in the search were 'Silver nanoparticles', 'antibacterial' and 'bactericidal'. Only papers published in the English language were included. We focused mainly on the antibacterial properties of silver nanoparticles, which exhibit substantial antibacterial efficiency against a wide range of bacteria present in humans [22-28, 47, 48, 50-52]. Since silver nanoparticles have enhanced antimicrobial activity, some papers reported both antibacterial and antifungal activity [56][57][58]. We included both antibacterial and antifungal data in our work for a better understanding. While reviewing articles for this study, we focused on the coating, core size, shape, specific surface area, aggregation, dose, bacteria/fungi class, family, species, and zone of inhibition. We reviewed 50 articles and collected data for the model training. Figure 1 shows the model workflow from dataset collection to model validation. Data feature preparation includes the selection of features and label for the machine learning models. In general, a machine learning model's performance depends upon the underlying dataset. The dataset mainly contains three categories: physical-chemical properties, exposure conditions of the bacteria/fungi and in-vivo characteristics of the bacteria/fungi. Specific surface area, core size, aggregation, shape and coating come under the physical-chemical properties. Dose and duration come under the exposure conditions of the antibacterial and antifungal study. The in-vivo properties consist of the class, family, and species of the bacteria/fungi. Specific surface area and aggregation were eliminated because more than 90% of their values were missing. The outcome was selected as the ZOI (zone of inhibition) based on the antibacterial study. Most of the antibacterial work on silver nanoparticles has been conducted on Escherichia coli (E. coli), which is a Gram-negative bacterium. Unlike other work that focuses mostly on E. coli, we collected data points covering Gram-positive as well as Gram-negative bacteria, and fungi, so that all kinds of organisms are represented.

Datafeature preparation
To predict the antibacterial and antifungal properties of silver nanoparticles, the machine learning model requires features as well as a label. We collected the features and the label from the literature review. Table 1 shows the data feature preparation table (features). The physical-chemical properties, the exposure conditions of the bacteria/fungi and the in-vivo characteristics of the bacteria/fungi were selected as the features. The ZOI (zone of inhibition) was selected as the label for determining the antibacterial capacity of silver nanoparticles. Several measurements are used to determine the antibacterial efficacy of silver nanoparticles, including the ZOI, the minimum bactericidal concentration (MBC), and the minimum inhibitory concentration (MIC). We chose the ZOI as the label because it was reported in most papers. The ZOI is measured with a fast and inexpensive assay called the disk diffusion method.

Data pre-processing
One of the key phases in ML is the pre-processing of the data. Data cleaning, data normalisation and transformation are all included in this phase. The data pre-processing phase significantly affects the extent to which a supervised ML algorithm generalises [59]. After the data pre-processing, we had 267 rows and 9 columns for training the models. Missing-value processing, one-hot encoding, and the train and test data split are also part of data pre-processing.

Missing values
Missing data, or a missing value, refers to the absence of a stored data value for a specific variable in an observation. Such an absence of values can significantly impact the outcomes. In our case, certain features in the primary dataset have missing values, which in turn affects the outcomes of the regression model. As the performance of the models would be affected by the missing values, we handled them by deleting the rows with missing values [37,60].
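As an illustration of the row-deletion strategy described above, the following is a minimal sketch using pandas. The column names and values are hypothetical and are not taken from the study's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; names and values are illustrative only.
df = pd.DataFrame({
    "core_size_nm": [12.0, np.nan, 35.0, 50.0],
    "shape": ["spherical", "spherical", None, "triangular"],
    "dose_ug_ml": [25.0, 50.0, 50.0, 100.0],
    "zoi_mm": [14.0, 11.0, 9.0, 8.0],
})

# Listwise deletion: drop every row that has at least one missing value.
clean = df.dropna(axis=0, how="any").reset_index(drop=True)

print(f"kept {len(clean)} of {len(df)} rows")
```

Listwise deletion is simple but discards information, which matters for a dataset of a few hundred rows; imputation would be an alternative, although it is not what the text describes.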

One hot encoding
Most ML algorithms cannot operate directly on categorical values. One-hot encoding converts each categorical variable into a set of binary indicator columns, one per category. It is well known for its high accuracy [61,62]. We utilised one-hot encoding in our work for better performance of the machine learning models. The absence and presence of each original category were represented by the values 0 and 1 respectively.
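A short sketch of one-hot encoding with pandas is given below. The `shape` categories and the column names are assumed for illustration; the study's actual category levels are not listed here.

```python
import pandas as pd

# Hypothetical example: one categorical feature and one numeric feature.
df = pd.DataFrame({
    "shape": ["spherical", "triangular", "spherical", "rod"],
    "dose_ug_ml": [25.0, 50.0, 50.0, 100.0],
})

# Each category of "shape" becomes its own 0/1 indicator column;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(df, columns=["shape"], dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so the encoding does not impose an artificial ordering on the categories.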

Normalization
In machine learning it is common for features to have different ranges, which can bias the weights assigned to them. The objective of normalization is to standardize the numeric columns in a dataset to allow effective comparisons and computations. It brings the features to a common scale. Different techniques, such as z-score, min-max, mean and median absolute deviation scaling, can be used to achieve normalization [37,59,63]. In our work, we performed the normalization by applying a z-score.
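The z-score transformation mentioned above subtracts the column mean and divides by the column standard deviation. A minimal sketch using only the Python standard library follows; the sample values are made up, and the population standard deviation is assumed (the convention also used by scikit-learn's `StandardScaler`).

```python
from statistics import fmean, pstdev


def z_score(values):
    """Return the z-scores of a numeric column (population standard deviation)."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Hypothetical core sizes in nm.
core_size_nm = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = z_score(core_size_nm)

print([round(v, 3) for v in scaled])
```

After the transformation the column has zero mean and unit standard deviation, so features measured in nm, hours or concentration units become comparable.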

Data splits
Data splitting is an important step in data science, particularly when working with data-driven models. It involves dividing the dataset into two sets: one is used for training and the other for testing the models. The training dataset is used to train and fit the ML model, and the testing dataset is used to assess the performance of the trained model. In this work, we randomly split our dataset into training and testing datasets. The training dataset includes 90% of our dataset and the remaining 10% is assigned for testing.
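A 90/10 random split can be sketched with scikit-learn's `train_test_split`. The data below are synthetic placeholders with the same number of rows as the study's dataset; the number of columns and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the ZOI label (267 rows).
rng = np.random.default_rng(0)
X = rng.normal(size=(267, 4))
y = rng.normal(size=267)

# test_size=0.1 holds out 10% of the rows; random_state fixes one realization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

print(X_train.shape, X_test.shape)
```

Changing `random_state` yields a different realization of the training and test sets with the same split ratio.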

Feature selection
Feature selection is a machine-learning technique that involves the selection of a subset of relevant features (variables) for the model construction [64]. The objective of the feature selection technique is to eliminate redundant or irrelevant features and to identify the features that exhibit strong correlations [64]. The feature selection process is carried out with minimal loss of information [64]. The feature selection method addresses the challenges associated with high-dimensional data by selecting the most relevant features. The learning

XGBoost
eXtreme Gradient Boosting (XGBoost) is an ensemble ML system that focuses on tree boosting. XGBoost, being an open source package, is portable, reusable and supports a wide range of weighted classification and rank objective functions. It utilizes a gradient-boosting framework which is specifically engineered for speed and performance [74]. XGBoost is well known for its model performance and execution speed [74]. The most important factor of XGBoost is its scalability in all situations, due to several important systems and algorithmic optimizations. The excellent organization, portability, and flexibility make XGBoost suitable for various applications [75].

AdaBoost
The AdaBoost algorithm, developed by Freund and Schapire, was the pioneering practical boosting algorithm [76]. It continues to be one of the most extensively used and researched boosting algorithms, applied in numerous fields. It can elevate the performance of a weak learning algorithm: numerous weak learners are combined into a weighted ensemble that forms a strong learner with improved performance [77].

Ridge regression
Ridge regression, proposed by Hoerl and Kennard, is a regularized regression method for data analysis [78]. In its kernelized form, ridge regression allows the execution of a nonlinear regression by constructing a linear regression function in a high-dimensional feature space [79]. The selection of the ridge parameter is of utmost importance in determining the effectiveness of the algorithm [80].

LASSO
The Least Absolute Shrinkage and Selection Operator (LASSO) regression is a popular approach for variable selection and shrinkage estimation [81]. LASSO regression improves the interpretability of the model and the prediction accuracy by combining the qualities of ridge regression and subset selection [64].

Elastic net regression
Elastic Net Regression (ENR) combines the penalties from both the Ridge and LASSO regression methods to regularize a model. ENR achieves better performance than LASSO, particularly when the number of predictors significantly exceeds the number of observations [37,82].
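For reference, the three penalized least-squares objectives can be written side by side. This is the generic textbook form, given here as a sketch; the symbols $w$ (coefficient vector), $X$ (feature matrix), $y$ (label vector) and the penalty weights $\lambda$, $\lambda_1$, $\lambda_2$ are notation introduced for this illustration, and software libraries differ in how they scale the individual terms.

```latex
\begin{aligned}
\text{Ridge:}       \quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2 \\
\text{LASSO:}       \quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2
\end{aligned}
```

Setting $\lambda_1 = 0$ in the Elastic Net objective recovers Ridge, and setting $\lambda_2 = 0$ recovers LASSO, which is the sense in which ENR combines the two penalties.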

LightGBM
LightGBM is a gradient-boosting decision tree (GBDT) algorithm that is well known for its high efficiency and execution speed. It is faster than conventional GBDT algorithms [83]. LightGBM mainly involves two techniques: Gradient-based One-Side Sampling and Exclusive Feature Bundling. Gradient-based One-Side Sampling handles a large number of data instances, and Exclusive Feature Bundling handles a large number of features [39,83].

MLM
The multilevel model (MLM), also known as a mixed model with random effects, is a commonly employed data analysis technique for exploring hierarchical data structures across various disciplines. It is suited to complex data structures that cannot be adequately explored using single-level analytical techniques such as multiple regression, path analysis, and structural modelling [84].

Model validation
The algorithms mentioned under section 2.6.1 were used in this study for the prediction of the antibacterial and antifungal capacity of silver nanoparticles. Cross-validation can assist in the comparison of different machine-learning models. It also helps to detect over-fitting, which occurs when a model learns to fit the training data too closely and consequently performs poorly on unseen data [85,86].
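As a generic illustration of cross-validation, the sketch below runs 5-fold cross-validation with scikit-learn on synthetic data. The fold count, the Ridge model and the data are assumptions made for the example; they are not a description of the exact validation protocol used in this study.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem with a known linear signal plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Each fold is held out once while the model is fitted on the other four.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=0.3), X, y, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean R2:", round(float(scores.mean()), 3))
```

A large gap between the training score and the held-out scores would indicate over-fitting.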
The model performance was validated by comparing the values of the performance metrics: the R² value, root mean square error (RMSE), mean square error (MSE) and mean absolute error (MAE).
The coefficient of determination, commonly known as R-squared (R²), is an important evaluation metric for regression-based algorithms. R² typically ranges between 0% and 100%, where a higher value indicates that a larger proportion of the output variation can be attributed to the input variables; on held-out test data it can also be negative when a model predicts worse than the mean of the observed values. A higher R² therefore means that the variation of the output is explained well, which is desirable. MSE, or mean square error, is a widely used and conventional error metric for regression models. It is calculated by averaging the squared differences between the predicted and actual values in a dataset:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations/rows [39].
RMSE, or root mean square error, extends the MSE and is calculated by taking its square root:
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$
MAE, or mean absolute error, is another important metric for regression models, calculated as
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
For R², the larger the value, the better the prediction. As MSE, RMSE and MAE are error measures, the smaller the value, the better the prediction.
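The four metrics can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the sample ZOI values are hypothetical, and R² is computed with the usual definition of one minus the ratio of the residual to the total sum of squares.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for paired actual and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    return {"MSE": mse, "RMSE": sqrt(mse), "MAE": mae, "R2": 1.0 - ss_res / ss_tot}


# Hypothetical ZOI values in mm.
zoi_actual = [10.0, 12.0, 14.0, 16.0]
zoi_pred = [11.0, 12.0, 13.0, 18.0]
metrics = regression_metrics(zoi_actual, zoi_pred)

print(metrics)
```

In practice the same quantities are available from `sklearn.metrics` (`mean_squared_error`, `mean_absolute_error`, `r2_score`).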

Results and discussion
The machine learning regression models considered are XGBoost, LightGBM, Random Forest, AdaBoost, the multilevel model or MLM (also known as a mixed model with random effects), Ridge, Lasso and Elastic Net regression. The performance metrics considered are the R² value, root mean square error (RMSE), mean square error (MSE) and mean absolute error (MAE).
To achieve optimal modelling, it is crucial to strike a balance that avoids both underfitting and overfitting through the adjustment of hyperparameters. In the case of the Ridge, Lasso and Elastic Net models, various statistical techniques were utilized to evaluate the model using different values of alpha (α). The value of α has a significant impact on the machine learning model. A higher α gives the regularization term greater influence, thereby reducing errors caused by variance (overfitting). We used the same value, α = 0.3, in the Ridge, Lasso and Elastic Net regressions. In addition to α, the l1 ratio was set to 0.3 in the Elastic Net regression. In the MLM regression, the feature variable family (here, the family of the bacteria/fungi) was taken as the group variable.
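The hyperparameter values quoted above map onto scikit-learn estimators as sketched below. The use of the `Ridge`, `Lasso` and `ElasticNet` classes is an assumption based on the library named in the Methods, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: only the first two of five features carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

# alpha = 0.3 for all three models; l1_ratio = 0.3 for Elastic Net.
models = {
    "Ridge": Ridge(alpha=0.3),
    "Lasso": Lasso(alpha=0.3),
    "ElasticNet": ElasticNet(alpha=0.3, l1_ratio=0.3),
}

# Fit each model and keep its coefficient vector for comparison.
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}

for name, coef in coefs.items():
    print(f"{name:10s}", np.round(coef, 2))
```

Note that scikit-learn scales the data-fit term differently in `Ridge` than in `Lasso` and `ElasticNet`, so the same numeric α does not imply the same regularization strength across the three models.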
The models are trained on a training dataset comprising 90% of the total data, and the performance metrics are evaluated on the test set (the remaining 10% of the total data). Since the data are split randomly into training and test sets, we can have multiple realizations of the test and training datasets with the same test split ratio of 0.1. Accordingly, we simulated over 50 realizations of each model. To give a better picture of how the models' performance metrics vary across realizations, we plotted box plots for each performance metric. The box plot for the R² value is shown in figure 2. For the R² value, the XGBoost model outperforms all other models and is followed by the Random Forest model. Although the Random Forest model's median R² value is better than that of the LightGBM model, the LightGBM model has a slightly better standard deviation of R² values. Among all the models, the MLM regression performs least well for the underlying dataset and problem.
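The repeated-realization procedure can be sketched as a loop over random seeds. The example below uses synthetic data and a random forest as a stand-in for any of the eight models; the seeds, the tree count and the data-generating formula are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the study's dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(267, 4))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=267)

r2_values = []
for seed in range(50):  # one realization of the 90/10 split per seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    r2_values.append(r2_score(y_te, model.predict(X_te)))

print("median R2:", round(float(np.median(r2_values)), 3))
print("mean R2:", round(float(np.mean(r2_values)), 3))
print("std R2:", round(float(np.std(r2_values)), 3))
```

The list `r2_values` is what a box plot would summarize, for example with `matplotlib.pyplot.boxplot(r2_values)`; repeating the loop for each model and each metric gives one box per model.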
Using the different realizations of the models, the box plots of the root mean square error (RMSE), mean square error (MSE) and mean absolute error (MAE) of the models' predictions for the test dataset are shown in figures 3, 4 and 5 respectively. For RMSE, MSE and MAE, the XGBoost model outperforms all other models with the lowest values, followed by the LightGBM and Random Forest models. The LightGBM model also has a slightly better standard deviation of RMSE, MSE and MAE values. The MLM, Ridge, AdaBoost and Elastic Net regression models also give reasonably good median RMSE, MSE and MAE values. However, the MLM model has the highest standard deviation of RMSE, MSE and MAE values. We believe this is because the MLM cannot model the random effects well in some realizations (visible as large errors) in which the training/test data poorly represent the group variable (family). In such instances, the XGBoost model's fit appeared to be robust. Having more data points for each family of bacteria might have resulted in better performance in all the models, especially the MLM model.
To summarise, tables 2 and 3 show the median and mean values of the performance metrics over the 50 simulations of the models. XGBoost shows the highest median R² value, followed by LightGBM and Random Forest.
Figure 6 shows the feature importance for XGBoost, Random Forest and AdaBoost respectively. It is evident that the core size of the nanoparticles is the prominent factor that controls the antibacterial and antifungal properties of silver nanoparticles. In the XGBoost model, the core size is followed by the shape of the nanoparticle, the duration of exposure, and the dosage. In the Random Forest model, the core size is followed by the dosage, the shape of the nanoparticle and the duration, and in the AdaBoost regression model, the core size is followed by the dosage and the shape of the nanoparticle. Duration does not have a significant impact on the antibacterial and antifungal properties in the AdaBoost regression model. Among the categorical variables, the species of the bacteria/fungi (e.g. Ganoderma, Shigella, Scedosporium, C. tropicalis, etc) is the dominant feature. It should be noted that the feature importance can vary with each realization because the underlying data are different for each realization, so the order of importance of the features can change slightly as well. The size, shape and coating of the silver nanoparticle play an important role in the antimicrobial activity. Recent studies show that the surface coating and size of silver nanoparticles significantly influence their antibacterial activity. Specifically, these studies suggest that smaller nanoparticles tend to exhibit greater toxicity [87][88][89][90]. Sotiriou et al reported that the antibacterial activity of silver nanoparticles depends on the size of the nanoparticle: the antibacterial activity increases as the size of the nanoparticle decreases [91][92][93]. Smaller nanoparticles exhibit high antibacterial activity because they can pass through the bacterial cell more easily. The literature suggests that the shape of silver nanoparticles also plays an important role in their antibacterial activity [94][95][96]. The antibacterial efficacy depends on the contact between the bacterial cell membrane and the nanoparticle [94]. The surface coating of the nanoparticles also plays a vital role in the antibacterial activity of silver nanoparticles. Biofunctionalized silver nanoparticles exhibit superior antibacterial activity compared to non-functionalized nanoparticles [97,98]. The antibacterial activity also depends on the species of bacteria being targeted [99,100]. The dose and duration of the antibacterial studies also play a pivotal role in determining the antibacterial activity of silver nanoparticles [101]. The feature importance analysis in our study supports the view that the core size, shape, dose, duration, coating and bacteria/fungi species all contribute to the antibacterial and antifungal efficacy of silver nanoparticles.
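Tree-ensemble feature importances of the kind discussed above can be read from a fitted scikit-learn model, as in the sketch below. The data are synthetic and deliberately constructed so that the first feature dominates; the feature names are hypothetical labels and the result says nothing about the study's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which "core_size" drives most of the label variation.
rng = np.random.default_rng(0)
feature_names = ["core_size", "dose", "duration"]
X = rng.uniform(size=(300, 3))
y = 10.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranking:
    print(f"{name:10s} {importance:.3f}")
```

Because the importances depend on the training sample, refitting on a different random split can reorder features whose importances are close, which is consistent with the variation across realizations noted in the text.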
In figure 7, the actual ZOI is plotted against the predicted ZOI. Because both axes share the same scale, points where the predicted value equals the actual value lie on the 45-degree line through the origin and correspond to zero error. In real systems, however, there is always some error. A better algorithm produces a scatter plot whose data points lie closer to the 45-degree line. The XGBoost and Random Forest scatter plots show this behaviour. We have already shown that these models have the least error, and the scatter plots help to visualize it.
Overall, XGBoost outperforms all other models considered in the analysis. Note that the data set (training and test combined) comprised only 267 rows, owing to the limited availability of experimental data on the antibacterial activity of silver. Even with this limited data, the XGBoost model proved robust, and the LightGBM and Random Forest models performed next best. Having more data points for each family of bacteria/fungi might have improved the performance of all the models, especially the MLM model. As it stands, the MLM model struggled to model the random effects well in some realizations where the training/test data had poor representation of the group variable (family).
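The idea of repeated realizations can be sketched as follows: the data are reshuffled and split 50 times, a model is fitted on each training part, and the test RMSE is summarised by its median and mean. The sketch also counts realizations in which a family appears in the test set but not in the training set, which is the kind of poor group representation described above. Everything here is an assumption made for illustration: the rows are synthetic, the 80/20 split ratio is not taken from the study, and a one-variable least-squares line stands in for the actual models.

```python
import random
import statistics

def make_toy_rows(n_rows, seed=0):
    """Generate hypothetical (core_size_nm, family, zoi_mm) rows; not real data."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        size = rng.uniform(5.0, 100.0)
        # Two rows belong to a rare family "E"; the rest are spread over A-D.
        family = "E" if i < 2 else rng.choice("ABCD")
        zoi = 25.0 - 0.15 * size + rng.gauss(0.0, 1.5)
        rows.append((size, family, zoi))
    return rows

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def one_realization(rows, seed, test_fraction=0.2):
    """Shuffle, split, fit on the training part and score on the test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    a, b = fit_line([r[0] for r in train], [r[2] for r in train])
    squared = [(r[2] - (a + b * r[0])) ** 2 for r in test]
    rmse = (sum(squared) / len(squared)) ** 0.5
    # Families present in the test set but never seen during training.
    unseen = {r[1] for r in test} - {r[1] for r in train}
    return rmse, unseen

rows = make_toy_rows(267)  # same row count as the data set described above
results = [one_realization(rows, seed) for seed in range(50)]
rmses = [rmse for rmse, _ in results]
median_rmse = statistics.median(rmses)
mean_rmse = statistics.mean(rmses)
n_unseen = sum(1 for _, unseen in results if unseen)
print(f"median RMSE {median_rmse:.3f}, mean RMSE {mean_rmse:.3f}")
print(f"realizations with an unseen family in the test set: {n_unseen}")
```

Reporting both the median and the mean over realizations is useful because a few unfavourable splits can pull the mean away from the median.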

Conclusion
In this work, we investigated the effectiveness of various machine-learning techniques to predict the antibacterial capacity of silver nanoparticles against a wide range of Gram-positive and Gram-negative bacteria. We also used the models to predict the antifungal capacity. The analysis showed that XGBoost outperforms all other models considered, giving the highest R2 and the least error. Even with the limited data, the XGBoost model proved robust, and the LightGBM and Random Forest models performed next best. Feature importance analysis identified the features that have a great influence on the antibacterial and antifungal capacity of silver nanoparticles; the core size of silver nanoparticles is the prominent feature in the prediction of the antibacterial and antifungal capacity. This work enables researchers to utilize machine learning, which in turn can address the challenges of time consumption and cost in laboratory experiments while minimising the reliance on trial and error.

Figure 4. Box plot for MSE value.

Figure 5. Box plot for MAE value.

Figure 7. Scatter plot of predicted and actual values of ZOI in different models (XGBoost, Random Forest and Elastic net).

Table 1. Preparation of features (NA means Not Applicable).

[...] increased and the computation can be decreased by this method. There are different methods for feature selection such as filter methods, embedded methods and wrapper methods [65][66][67].
2.6. Machine learning models
Since this work is related to prediction, we discuss various regression models.
2.6.1. Regression models
Machine Learning, part of AI (Artificial Intelligence), empowers machines to learn from data and past experience in order to enhance performance and make predictions. Machine learning algorithms identify patterns by analysing and processing vast amounts of data, which eventually leads to better prediction of results [68,69]. Regression techniques involve creating a model that can predict new numerical values based on the input variables. In this study, we employed several regression algorithms as potential options for developing our model in order to determine which one offers the highest prediction accuracy. The machine learning regression models considered in this study are XGBoost, LightGBM, Random Forest, AdaBoost, the Multilevel model (MLM, also known as a mixed model with random effects), Ridge, Lasso and Elastic Net regression.
2.6.1.1. Random forest
Random Forest (RF) constructs decision trees and then combines the results of each decision tree to predict the final result [55,70]. Tree-based algorithms are highly attractive due to their execution speed. Random forest can be used for both classification and regression [71]. Random sampling of the training data is done to enhance the diversity of the decision trees. This strategy, known as bagging or bootstrap aggregating, creates multiple training data subsets. Bagging is done to enhance prediction accuracy and improve stability during the training process [72]. Random feature selection is another element of the RF regression technique: during the growth of each decision tree, the best split point is determined from a randomly selected subset of features. Boosting, in contrast, enhances prediction accuracy by building each new tree with the help of prior trees; it is the strategy behind the boosted models considered here (XGBoost, LightGBM and AdaBoost) rather than a component of RF [72,73].
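The bagging strategy described for Random Forest can be sketched in a few lines. Each tree is trained on a bootstrap sample (rows drawn with replacement) and may only split on a random subset of the features; the ensemble prediction is the average over the trees. This is a deliberately minimal sketch, not the implementation used in the study: the trees are single-split stumps, the data are synthetic, and all names and parameter values (`fit_stump`, `fit_bagged_stumps`, 25 trees, 2 of 3 features) are hypothetical.

```python
import random

def fit_stump(rows, targets, feature_indices):
    """Find the single split (feature, threshold) that minimises squared error."""
    best = None
    for j in feature_indices:
        values = sorted({r[j] for r in rows})
        # Threshold at each distinct value except the largest, so that
        # both sides of the split are always non-empty.
        for thr in values[:-1]:
            left = [t for r, t in zip(rows, targets) if r[j] <= thr]
            right = [t for r, t in zip(rows, targets) if r[j] > thr]
            left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, left_mean, right_mean)
    if best is None:  # every candidate feature is constant in this sample
        mean = sum(targets) / len(targets)
        return lambda r: mean
    _, j, thr, left_mean, right_mean = best
    return lambda r: left_mean if r[j] <= thr else right_mean

def fit_bagged_stumps(rows, targets, n_trees=25, n_sub_features=2, seed=0):
    """Train stumps on bootstrap samples with random feature subsets; average them."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: with replacement
        feats = rng.sample(range(n_features), n_sub_features)
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], feats))
    return lambda r: sum(tree(r) for tree in trees) / len(trees)

# Hypothetical rows: (core_size_nm, dosage, duration_h). Synthetic, not the paper's data.
rng = random.Random(1)
rows = [(rng.uniform(5, 100), rng.uniform(10, 100), rng.uniform(1, 48))
        for _ in range(60)]
targets = [25.0 - 0.15 * r[0] + rng.gauss(0.0, 1.5) for r in rows]

forest = fit_bagged_stumps(rows, targets)
small_particle = forest((10.0, 50.0, 12.0))
large_particle = forest((90.0, 50.0, 12.0))
print(f"predicted ZOI: small {small_particle:.2f}, large {large_particle:.2f}")
```

A full random forest grows deep trees and draws a fresh feature subset at every split; with one split per tree, as here, drawing the subset once per tree amounts to the same thing.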

Table 2. The median value of performance metrics over 50 simulations of models.

Table 3. The mean value of performance metrics over 50 simulations of models.