Computational enrichment of physicochemical data for the development of a -potential read-across predictive model with Isalos Analytics Platform

The physicochemical characterisation data from a library of 69 engineered nanomaterials (ENMs) has been exploited in silico following enrichment with a set of molecular descriptors that can be easily acquired or calculated using atomic periodicity and other fundamental atomic parameters. Based on the extended set of twenty descriptors, a robust and validated nanoinformatics model has been proposed to predict the ENM ζ -po-tential. The five critical parameters selected as the most significant for the model development included the ENM size and coating as well as three molecular descriptors, metal ionic radius ( r ion ), the sum of metal electronegativity divided by the number of oxygen atoms present in a particular metal oxide ( Σ χ /n O ) and the absolute electronegativity ( χ abs ), each of which is thoroughly discussed to interpret their influence on ζ -potential values. The model was developed using the Isalos Analytics Platform and is available to the community as a web service through the Horizon 2020 (H2020) NanoCommons Transnational Access services and the H2020 NanoSoveIT Integrated Approach to Testing and Assessment (IATA).


Introduction
Engineered nanomaterials (ENMs) have become part of everyday life, as they are being used in a wide number of consumer and commercial products. The unique properties of ENM, their versatility, complexity and tuneable properties have put them at the centre of innovation in various fields (e.g. nanomedicine, photonics, energy production) (Kraegeloh et al., 2018). However, the innovation potential of ENMs is inhibited by concerns regarding their potential adverse effects. These can include toxic effects e.g. cytotoxicity, cell apoptosis, oxidative stress, genotoxicity, ecotoxicity etc. following accumulation in different organs, Trojan horse effects and more (Valsami-Jones and Lynch, 2015;Yan et al., 2019;Lin et al., 2018;Lead et al., 2018;Saarimäki et al., 2020). These concerns have led regulatory agencies, including the European Chemicals Agency (ECHA, 2020) and the U.S.A. Environmental Protection Agency (EPA, 2017) to publish guidelines on the testing and safety assessment of ENM. These regulatory frameworks (REACH, TSCA) are based on datasets generated in accordance with the OECD test guidelines for the physicochemical characterisation of ENMs, which currently have 15 listed endpoints (OECD, 2016), and for toxicity/ ecotoxicity assessment using model species and modified test guidelines (Rasmussen et al., 2019) including for dispersion of ENMs (Au-Kaur et al., 2017). Furthermore, the OECD and regulatory bodies are promoting the development of Alternative Testing Strategies (ATS) for the evaluation of ENMs and the development and application of safer by design (SbD) strategies (Hjorth et al., 2017;Fritsche et al., 2017;Basketter et al., 2019). These aim to reduce the need for animal testing and the cost of detailed ENM testing via long-term animal studies, through the design of safer ENMs prior to production. A number of reviews deal with the application of SbD strategies in various ENM application areas (Kraegeloh et al., 2018;Yan et al., 2019;Lin et al., 2018).
The application of in silico approaches (Afantitis, 2020) for the prediction of specific ENM properties is emerging as an important step towards complete in silico hazard and risk assessment. Computational approaches can take advantage of existing data to develop predictive nanoinformatics models, which can be used to either design ENMs with specific properties (Melagraki and Afantitis, 2014;Varsou et al., 2019;Ling et al., 2018;Martin et al., 2019) or to predict their behaviour and effects (Afantitis et al., 2018;Toropova et al., 2016;Basei et al., 2019). One of the limitations ll to the widespread application of in silico approaches is the lack of large quantities of high-quality data, or of data with adequate metadata that will allow dataset interoperability and their combination to create larger datasets. The effects of a lack of sufficient data or metadata on result reliability have been demonstrated in several meta-analysis studies (Bilal et al., 2019;Gernand and Casman, 2014;Oh et al., 2016;Ha et al., 2018;Shin et al., 2018;Labouta et al., 2019). In fact, a number of studies (Bilal et al., 2019;Oh et al., 2016;Ha et al., 2018;Labouta et al., 2019) found a significant correlation between the assay type and cytotoxic effects, with another (Shin et al., 2018) demonstrating the correlation between the ENM dispersion protocols and cytotoxicity. Furthermore, Labouta et al. (2019) demonstrated the advantages of data combination and computational approaches (decision trees) using the meta-analysis of 93 peer-reviewed papers, corresponding to over 3000 data points, on the cytotoxicity of organic and inorganic ENMs, to uncover hidden relationships. Thus, as more data is generated, and data management practices improve (Papadiamantis et al., 2020a), the power of in silico approaches will increase exponentially.
The role of metadata in facilitating dataset combination and interoperability becomes even more prominent when dealing with parameters which are dependent on the ENMs surrounding environment and can affect its behaviour, the so-called extrinsic parameters (Lynch et al., 2014;Casals et al., 2017). One example of an extrinsic, or contextdependent ENM parameter is ζ-potential, which corresponds to the electrostatic potential of the ENM at the slipping plane; i.e., at the surface where the ENM is considered to interact with its surrounding environment (Fig. 1). The size of this surface depends on both the properties of the ENM, e.g. size, coating, and the ionic strength of the surrounding solution (Lowry et al., 2016). In practice, it consists of an ionic liquid layer (Stern layer), which is strongly bound and moves with the particle as it flows inside the solution (Brownian motion) and an outer loosely bound layer (the diffusion layer) (Lowry et al., 2016). While in uncoated ENMs (Fig. 1, left) the Stern layer, and subsequently ζ-potential, is outside the core of the ENM, this is not the case for polymeric coated ENMs (Fig. 1, middle and right). The ζ-potential of polymeric coated ENMs will be affected by both the bare ENM''s ζ-potential and the charge of the respective coating, as the polymeric coating will act as a form of cation-exchange membrane separating the Stern from the diffusion layer (Batley and Apte, 2005). The charge of the polymeric coating is called the Donnan potential (Batley and Apte, 2005;Ohshima, 2006). Assuming a negatively charged ENM, covered with a negatively-charged polymeric coating (Fig. 1, middle and right), cationic species will be transported through the coating, driven by the negative charge, until the Stern layer is formed (Batley and Apte, 2005). If the coating is sufficiently thick (Fig. 1, middle), it will mask the bare particle ζ-potential and the apparent potential will be almost equal to the Donnan potential (Lowry et al., 2016;Ohshima, 2006). For thinner layers (Fig. 1, right), the apparent ζ-potential will be affected by both the bare particle ζ-potential and the coating charge and for a positively charged bare particle ζ-potential and negatively charged coating, it will be less than the Donnan potential (Lowry et al., 2016).
ζ-potential is considered one of the key ENM physicochemical parameters which need to be reported in regulatory dossiers, and several standards (e.g. ISO 13099) exist regarding its measurement. ζ-potential has been linked with ENM stability and its behaviour in solution, as well as with its toxicity. Furthermore, ζ-potential has substantial applications in nanoscience research broadly, as it can regulate the behaviour and functionality of ENMs when used in fields like nanomedicine (Honary and Zahir, 2013a;Honary and Zahir, 2013b), energy production (Sen et al., 2017;Caimi et al., 2018), fuel and combustion (Saxena et al., 2017;Kumar et al., 2020), construction (Ogunsona et al., 2020), consumer products (Ogunsona et al., 2020), water treatment (Ogunsona et al., 2020) and remediation of environmental heavy metal pollution (Yang et al., 2013).
A high value (empirically ≥ ±30 mV) is considered to provide sufficient electrostatic repulsion between ENMs to prevent agglomeration Fig. 1. Schematic representation of a negatively charged uncoated (left), a negatively charged thickly polymer-coated (middle) and negatively charged thinly polymer-coated (right) ENM. In the uncoated ENM the Stern layer consists of firmly bound positively charged ions, which move with the ENM. Positive and negative charges outside the slipping plane and within the ENM's diffusion layer are loosely bound to the ENM and remain with the bulk fluid. In this case, the ζ-potential corresponds to the electric charge at the slipping plane. For a negatively charged and thickly coated ENM the Donnan potential falls between the metal core and the coating surface and is approximately equal to the apparent ζ-potential. For a similar thinly coated ENM, the apparent ζ-potential is less than the Donnan potential and is affected by the bare ENM ζ-potential, the coating charge and the surrounding solution. In the image, white and black dashes correspond to the charge of the metal core and the coating respectively. Grey and red spheres correspond to positive and negative free radicals (counterions) inside the solution. Image adapted from Lowry et al. (2016). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) (in the absence of steric stabilisation resulting from coatings) (Fatehah et al., 2014), while the diffusion layer (regulated by the ζ-potential) plays a vital role in ENM dissolution (Borm et al., 2006). Increasing medium ionic strength, changes in pH, and subsequent alterations in ζ-potential can affect the way in which ENM react with the cell membrane, altering the degree of ENM uptake by organisms or cells. This will influence the extent of reactive oxygen species (ROS) production, which is the most common adverse effect following exposure to ENM (Abdal Dayem et al., 2017). As a result, being able to predict an ENM's ζ-potential based on specific characteristics such as its elemental composition, will allow the development of more stable, functional and safer ENMs. Previous ζ-potential related predictive studies (Varsou et al., 2020a;Mikolajczyk et al., 2015;Toropov et al., 2018;Wyrzykowska et al., 2016;Toropov et al., 2016) have mainly focussed on metal oxides, whereas recent work by Varsou et al. (Varsou et al., 2020a) has developed a fully validated model for the prediction of ζ-potential of both metal oxide and metal ENMs. These previous studies developed ζ-potential predictive models through the analysis of TEM images of existing ENMs (Varsou et al., 2020a;Mikolajczyk et al., 2015), predicted the ζ-potential of ENM in different media based on known aqueous ζ-potential values (Wyrzykowska et al., 2016), or did not take into account different surface characteristics (coating, functionalisation) (Toropov et al., 2018;Toropov et al., 2016). Most studies used literature curated data (Mikolajczyk et al., 2015;Toropov et al., 2018;Wyrzykowska et al., 2016;Toropov et al., 2016) that could result in discrepancies regarding the experimental protocols used for data generation. Protocol variation has been shown to have a significant effect on the acquired results (Bilal et al., 2019;Gernand and Casman, 2014;Oh et al., 2016;Ha et al., 2018;Shin et al., 2018;Labouta et al., 2019;Langevin et al., 2018) and should be taken into account (potentially as an extra variable) while preparing datasets for in silico exploitation. In our case, the ENM physicochemical characterisation data were all generated within one project (the Framework Program 7 (FP7) project NanoMILE, 2013), which minimises the risk of protocol variation.
The aim of this study was to enrich this library containing the full physicochemical characterisation of 69 ENMs, with molecular descriptors to increase the value of the available information. The enriched dataset was used to develop an in silico workflow to predict ENM ζ-potential based on a number of descriptors that can be used as part of SbD approaches for design and production of safer and more functional ENMs. The read across predictive model has been made publicly and freely available as a webservice through the Horizon 2020 (H2020) NanoCommons project (http://enaloscloud.novamechanics.com/nan ocommons/mszeta/) and the H2020 NanoSolveIT Cloud Platform (https://mszeta.cloud.nanosolveit.eu/) to ensure accessibility to the community and interested stakeholders.

Experimental data
The experimental dataset used for the development of the nanoinformatics predictive model was generated during the FP7 NanoMILE project (2013). The dataset, published by Joossens et al. (2019), contains the physicochemical characterisation in water of 69 ENM and is available through the NanoPharos database (NovaMechanics Ltd., 2020) developed under the H2020 NanoSolveIT (Afantitis et al., 2020a) and (NanoCommons (2018)) www.nanocommons.eu projects. The dataset consists of pristine Ag (n = 3), AgO (n = 1), Ag 2 S (n = 1), AlOOH (n = 1), Au (n = 5), BaTiO 3 (n = 1), CeO 2 (n = 13), Zr-doped CeO 2 (n = 5) and their (n = 5) aged equivalents, CePO 4 (n = 1), CuO (n = 1), Fe 2 O 3 (n = 1), hydroxylapatite (n = 1), SiO 2 (n = 14), TiO 2 (n = 8), ZnO (n = 7) and ZrO 2 (n = 1) ENMs. The dataset contains information on the core chemistry, where applicable coating type and coating charge (positive, negative, neutral), morphology (identified using TEM) and whether the particles were aged or not (no ageing protocol was provided). Furthermore, the physicochemical characterisation of ENM in MilliQ water includes: core size (measured using TEM or STEM), hydrodynamic diameter, ζ-potential (although the data producers have not provided information on the concentrations used during the ζ-potential measurements but it was performed in parallel to the dynamic light scattering measurements and as such was typically ~10 mg/L), specific and geometric surface area (calculated using primary particle size and morphology) and energy band gap (calculated from UV-Vis measurements using Tauc plots (Tauc et al., 1966)). The characterisation methodologies and protocols used for the ENM characterisation can be found in (Joossens et al., 2019;Römer et al., 2019;Briffa et al., 2019), while Fig. 2 presents an overview of the metals present in the ENM library along with basic atomic parameters and atomic periodicity trends. All information was handled as data points (descriptors), either nominal (core chemistry, coating type, coating charge, morphology, ageing) or continuous (core size, diameter, ζ-potential, specific and geometric surface area, energy band gap).

Data pre-processing and enrichment
The NanoMILE dataset was initially evaluated in terms of completeness. To increase the model robustness and reliability all descriptors with data gaps that could not be filled either experimentally or using suppliers' information were excluded. Hence, 7 descriptors were included in the model development. Core chemistry, coating, coating charge, ageing, primary particle size, morphology and geometric surface area were used as the independent variables and ζ-potential was used as the dependent variable.
Taking into account that ζ-potential is regulated by the intrinsic molecular properties of the ENM, the dataset was enriched using a number of molecular descriptors, which have been used in past studies for model development and computational workflows (Kar et al., 2014;Zhang et al., 2012). The descriptors used are metal atomic radius (r at ), metal ionic radius (r ion ), metal electronegativity (χ), sum of metal electronegativity (Σχ), sum of metal electronegativity divided by the number of oxygen atoms present in a particular metal oxide (Σχ/n O ), number of metal atoms (N Metal ), number of oxygen atoms (N Oxygen ), the charge of the metal cation corresponding to a given oxide (χ ox ), the molecular weight (MW) and the period and group of the metal in the periodic table. Furthermore, the absolute electronegativity (Mulliken electronegativity, χ abs ), which is a measure of the tendency of an atom to attract electrons, and energy band gap (EBG) were calculated using the equations reported by Portier et al. (2001aPortier et al. ( , 2001b. In short, χ abs was calculated using: (1) and EBG was calculated using: where EBG is the energy band gap in eV, E ΔН is the standard enthalpy of formation of the ENM and A is a pre-exponential constant, which varies from 0.5-1.7 depending on the metal. E ΔН values were taken from Portier et al. (2001aPortier et al. ( , 2001b or other literature or database sources. The A value was taken from Portier et al. (2001b) where possible. For the rest, the average A value reported by Portier et al. (2001aPortier et al. ( , 2001b, depending on the position of the metal in the periodic table was used: • For s-and f-block elements the average A value is 0.80 • For d-block elements the average A value is 1.00 • For p-block elements the average A value is 1.35 All data are available through the NanoPharos database (https://db. nanopharos.eu/Queries/Datasets.zul) (NovaMechanics Ltd., 2020) and as a supplementary file.

Model development and read across
Model development was performed using the Isalos Analytics Platform, powered by the Enalos+ Tools (Varsou et al., 2018;Afantitis et al., 2020b). The dataset was first passed through a low-variance filter to remove descriptors that did not present significant variance and thus did not contribute to the discrimination power of the model (Ojha and Roy, 2011). The upper variance bound was set to 0.2, thus all descriptors having 20% or more equal values to another descriptor were excluded. Due to the varying numerical ranges of the used descriptors, Z-score normalisation was then applied to the dataset to ensure that the values of all descriptors followed Gaussian distribution with a mean value of 0.0 and a standard deviation of 1.0 (Leach and Gillet, 2007). Random partitioning of the data into training and test sets took place using a ratio of 75%: 25% respectively. The training set was used to train the model and to blindly evaluate its performance, following parameter tuning through cross-validation (Bishop, 2007;Xu and Goodacre, 2018). The most significant descriptors, those contributing the most to dataset variance, were then identified using the Correlation based Feature Selection (CfsSubset) algorithm combined with the BestFirst evaluator (Hall et al., 2009;Witten et al., 2016).
The ζ-potential predictive model was developed using the Enalos implementation of the k-nearest neighbours (EnaloskNN, Enalos Chem/ Nanoinformatics tools) regression methodology (Papadiamantis et al., 2020b). This is an instance-based (lazy) method used for data classification and regression. EnaloskNN predicts the unknown endpoint based on the k (k = 1, 2, 3, …) nearest neighbours in the features space R n where n is the total number of descriptors used for the prediction. ζ-potential prediction was achieved based on the Euclidian distance (similarity measure) of the target variable from its closest neighbours (Witten et al., 2016). The prediction was then performed using the weighted average of the ζ-potential values of these neighbours, with the weighing factor for each neighbour being the inverse of its Euclidean distance (Varsou et al., 2019;Witten et al., 2016). For nominal descriptors, the individual values are compared and if the values are the same the Euclidean distance is set equal to 0, otherwise it is set to 1 (Larose and Larose, 2014). The optimal number of neighbours (k) was identified based on the best model performance.
The kNN algorithm can be used within a read-across strategy that can also be employed for ENMs. According to ECHA's read across framework (ECHA, 2017) a computational workflow needs need to include the following steps: 1. Gathering of the required descriptors (physicochemical, molecular) for each ENM. 2. Construction of a data matrix including properties and endpoints. 3. Development of an initial grouping hypothesis that correlates an endpoint to different behaviour and reactivity properties. Assignment of the samples to groups.

Fig. 2.
Schematic overview of the metals included in the NanoMILE ENM library along with their atomic characteristics (atomic number, atomic weight, valence, atomic and ionic radii, electronegativity). The schematic presents the recurring atomic trends (i.e. atomic periodicity) to assist with the visual understanding of the chemical similarities and differences between the different metals.
4. Assessment of the applicability of the approach using computational techniques, and data gaps filling. If no regular pattern can be identified, an alternative grouping hypothesis must be proposed. 5. In the case that the grouping hypothesis is robust, but adequate data are not available, additional testing should be considered. 6. Justification of the method.
Using the Euclidian distance as a metric, it is possible to predict ζ-potential and identify the respective groups of neighbouring ENMs. By identifying and studying the resulting groupings it is possible to map the prediction space into specific groups that can be translated into read across strategies, as per the requirements of ECHA's read across framework. The EnaloskNN node has the benefit of providing not only the predictive results, but also the specific neighbours along with their Euclidian distances, as well allowing the visualization of the entire predictive space.

Model validation
Internal and external model validation followed the OECD principles for validation, for regulatory purposes, of (Q)SAR models (OECD, 2004). Goodness-of-fit, robustness and predictivity were used to validate the model internally and externally (Zhang et al., 2006;Puzyn et al., 2018). The statistical criteria used to evaluate the model's performance were the coefficient of determination between experimental values and model predictions (R 2 ), validation through an external test set, the leave-manyout cross validation procedure and Quality of Fit and Predictive Ability of a continuous predictive model according to Tropsha's tests (Tropsha, 2010). This was achieved using the Enalos Model Acceptability Criteria, where the following equations (to calculate Tropsha's tests) have been implemented: where ntest is the number of ENM in the test set, y tr is the average ζ-potential for the training set; y i , ỹ i , i = 1, 2, …, ntest are the experimentally measured and the predicted ζ-potential values for the validation set, respectively; and ỹ is the average predicted ζ-potential over all of the model predictions ỹ i , i = 1, 2, …, ntest. Furthermore, according to Tropsha (2010), a predictive model is considered predictive if all of the below conditions are satisfied: The statistical significance and robustness of the produced model was also evaluated using Y-randomisation. During Y-randomisation the modelling calculations were repeated several times using all original descriptors and shuffling the ζ-potential predictions to produce 10 different datasets. The model acceptability criteria, described above, were calculated in each case and were expected to be reduced compared to those of the original model. This would demonstrate that the developed model was not due to chance correlation, as in this case we would also get similar acceptability criteria. If this was the case (i.e., if the model was based on chance correlation), it would not be possible to obtain a valid model using the specific partitioning (Tropsha et al., 2003).

Domain of applicability
To ensure the applicability of the produced model to external datasets it is important to define the limits within which future predictions will be considered reliable; i.e., the applicability domain (APD). If the model is applied to ENMs that are too different from those used to train the model, the predictions will be outside the defined limits, these will be flagged as unreliable and filtered out (Melagraki and Afantitis, 2014). The APD can be defined through the calculation of the Euclidian distances of all ENM in the training set. The APD was calculated using: where <d> and σ are the average and standard deviation of all Euclidian distances in the training set. Z is an empirical cut-off value and was set to 0.5 (Zhang et al., 2006). All calculations were performed using the Isalos Analytics Platform (Papadiamantis et al., 2020b;Afantitis et al., 2020b) and the full demonstration that the produced model meets the OECD criteria as listed above is demonstrated via the completed QSAR Model Reporting Format (QMRF) template which is included in the supplementary information (SI).

Results
The main purpose of this study was the development of a robust and predictive model for the prediction of ζ-potential that can be used in SbD approaches for the design and development of novel ENM. Using the Isalos Analytics Platform (Papadiamantis et al., 2020b) (Afantitis et al., 2020b) and the respective Enalos+ nodes (Varsou et al., 2018) we worked with a dataset (Joossens et al., 2019) of 69 ENM, which was further enriched with a number of molecular descriptors. The benefits of using this dataset is that the physicochemical characterisation was performed using harmonised protocols, which reduced the risk for data variability due to experimental practices which has been identified as a major obstacle for data interoperability and reusability. This eliminates the need to identify protocols used and include them as potential descriptors, the significance of which has been demonstrated in previous meta-analysis studies where the type of assay was identified as a key parameter in ENM toxicity studies (Bilal et al., 2019;Gernand and Casman, 2014;Oh et al., 2016;Ha et al., 2018;Shin et al., 2018;Labouta et al., 2019). Following exclusion of incomplete descriptors and enrichment with calculated descriptors, the used dataset includes a total of 20 descriptors -7 physicochemical and 13 molecular descriptors (Table 1).
Model development took place by randomly dividing the dataset into training and test sets using a ratio of 75%: 25%. The descriptors of the training set were normalised using Z-score normalisation and the applied normalisation parameters were then used to normalise the test set. The most significant parameters, those contributing the most to dataset variability, were identified using the CfsSubset algorithm combined with the BestFirst evaluator (Hall et al., 2009;Witten et al., 2016). Out of the 20 descriptors imported into the computations, as none of the descriptors presented variance lower than 20%, 5 were identified as the most significant. Two of those were physicochemical; i.e. CT and SZ, and The developed workflow provided the opportunity to test a number of different predictive algorithms. This enabled us to pick kNN as the best performing algorithm for the specific dataset. kNN was applied to the training dataset with k = 7, which was the value that presented the best model performance (Table 2), based on the cross validated R cvext 2 performnce. The R 2 (R Pred 2 ) value; i.e., the coefficient of determination between experimental values and model prediction on the test set, also presented its maximum value (R 2 = 0.900, Table 2) for k = 7. This denotes a substantial correlation between the values measured experimentally and the predicted values (presented graphically in Fig. S1 of the ESI).
The model was validated as described in the Experimental Section as per the OECD's guidelines. The produced model successfully passed Tropsha's tests (Tropsha, 2010) (Table 3) demonstrating the robustness and predictivity of the model. The calculation of the presented parameters was performed using the regression between the experimentally measured ζ-potential vs. the predicted values and vice versa. Similarly, Y-randomisation demonstrated the model's robustness and validity. Finally, the APD to test potential model limitations was calculated to provide future users with an indication of the reliability of their predictions. The APD limit was calculated, in this case, to be 1.947 with all predictions being deemed reliable (test set range: 0-1.605).
The results acquired from the EnaloskNN node were used to study the neighbouring space relative to the Euclidian distance between the ENM descriptors. This allowed us to perform an initial grouping of the ENM based on their distances (Fig. 3) The ζ-potential predictive model has been made available through the Enalos Cloud Platform (Varsou et al., 2020b) to facilitate further use within the wider nano-community and interested stakeholders. The corresponding webservice can be found at: http://enaloscloud.novamechanics.com/nanocommons/mszeta/ The service is complemented with a user-friendly interface (Fig. 4). The user needs to input the required values (ENM size, coating, metal r ion , Σχ/n O and χ ab ) that were selected as the most significant descriptors during model development. In the case of the coating type, the user needs to pick from the available coatings in the dropdown menu, as, currently, only these coatings are supported in the model as they constituted the training set. A note has been added to the tool (Fig. 4) warning the users that the prediction assumes specific dispersion conditions, i.e., in water at pH 7 and with a theoretical ionic strength of 10 -6.998 mol/L. The linked tutorial provides a short library of the molecular properties (r ion , Σχ/n O and χ ab ) used during model development that are available for re-use, since these are ENM size/coating independent. Upon submission of the values, the results page appears (Fig. 5) indicating the value of the predicted ζ-potential along with the 7 closest neighbours that were used to perform the prediction. The Euclidean distances of each neighbour from the tested ENM are provided as well. Furthermore, the prediction's reliability is stated to indicate whether the prediction falls within the calculated APD or not, and the reliability value is next to the domain value calculated for the specific ENM, based on its distances from the identified closest neighbours. All results appear on screen, and can be downloaded as a .CSV file.

Discussion
ζ-potential is a key parameter from a regulatory context, as it can directly affect ENMs behaviour in different media and their interaction with biological organisms. However, ζ-potential's correlation with medium properties like ionic concentration and chemistry makes it difficult to measure and interpret (Kirby and Hasselbrink Jr., 2004) and highly context dependent. Additionally, the calculations for ζ-potential assume that measurements are performed on spherical and reasonably monodisperse particles, which is often not the case for ENMs. This results in substantial variability in the ζ-potential measurements reported for the same ENM by different laboratories, on top of variability due to different experimental protocols used.
An in silico workflow for the development of a robust and validated ζ-potential kNN predictive model is presented. The dataset published by    Joossens et al. (2019), which contained the physicochemical characterisation in water of 69 ENM, was exploited. Eight descriptors were included in the model development: core chemistry, coating, coating charge, ageing, primary particle size, morphology, geometric surface area and ζ-potential. The dataset was evaluated regarding its completeness and enriched using 12 molecular descriptors: metal atomic radius (r at ), metal ionic radius (r ion ), metal electronegativity (χ), sum of metal electronegativity (Σχ), sum of metal electronegativity divided by the number of oxygen atoms present in a particular metal oxide (Σχ/n O ), number of metal atoms (N Metal ), number of oxygen atoms (N Oxygen ), the charge of the metal cation corresponding to a given oxide (χ ox ), the molecular weight (MW), the period and group of the metal in the periodic table and the absolute electronegativity (Mulliken electronegativity, χ abs ).
The resulting dataset was passed through a low-variance filter to remove any descriptors that did not present significant variance and thus did not contribute to the discrimination power of the model (Ojha and Roy, 2011) and the dataset was then divided into training and test  The user inputs the required parameters. For the coating type, the user needs to pick from those available in the dropdown menu, while for the rest of the parameters the user needs to manually input them. The calculation is then performed automatically. sets, using a ratio of 75%: 25% respectively. The training set was used to identify the descriptors contributing the most to the dataset variance. Five descriptors were deemed as the most statistically significant, two of which were physico-chemical, i.e., CT and SZ, and 3 of which were molecular, i.e., r ion , Σχ/n O and χ abs . These descriptors were used to train and validate the model using the training data set. To do this, the Enalos implementation of the kNN algorithm was used, the accuracy of which has been cross-checked using the established and widely used Weka kNN implementation in the Konstanz Information Miner (KNIME) opensource data analytics platform. The Enalos kNN algorithm offers the benefit of providing the calculated Euclidean distances to each of the nearest neighbours allowing their relative influence to be understood and visualised.
Based on the identified variables, the kNN model was trained, its parameters fine-tuned and validated internally and externally using the OECD principles for validation of (Q)SAR models for regulatory purposes (OECD, 2004). Goodness-of-fit, robustness and predictivity were used to validate the model internally and externally (Zhang et al., 2006;Puzyn et al., 2018), while the coefficient of determination between experimental values and model predictions (R 2 ), validation through an external test set, the leave-many-out cross validation procedure and Quality of Fit and Predictive Ability of a continuous predictive model according to Tropsha's tests (Tropsha, 2010) were used to evaluate the model's performance (Table 3). Furthermore, the model's robustness and statistical significance were tested using Y-randomisation. Finally, the APD of the model was calculated using the Euclidean distances of all ENM in the training set, to be able to evaluate whether a future prediction is reliable or not. The APD limit was found to be 1.947 such that all predictions resulting in an APD value lower than this are considered reliable.
The descriptors identified as significant, and used for model development can be easily calculated using periodic table or bibliographic information (χ abs , r ion , Σχ/n o ,), or can be imported from the manufacturer's description of the ENM (ENM core size, coating) without the need for any experimental measurements. Thus, the developed model can be used as a SbD tool for the development of stable and safer ENMs, as ζ-potential can affect ENM behaviour by influencing agglomeration (Fatehah et al., 2014) or dissolution (Borm et al., 2006). ζ-potential has also been linked with ENM toxicity, in terms of reactivity with cell membranes and ENM uptake from organisms (Schwegmann et al., 2010;Deryabin et al., 2015;Cho et al., 2012), especially for those ENM that undergo dissolution. Given that suspension medium ionic strength and pH changes affect ζ-potential and ENM dissolution, subsequent models can incorporate such information enabling prediction of dissolution and toxicity as part of a complete in silico workflow and IATA.
ζ-potential as a parameter is distinct from the surface charge of an ENM (Fig. 1). When suspended in an ionic solution, a negatively charged uncoated ENM (Fig. 1a) attracts ions from the surrounding solution. As a result, the electrical double layer (EDL) is formed between the surface of the ENM and the bulk of the solution (Schmickler, 2014). The innermost layer, called the Stern layer, is firmly bound to the surface of the particle and consists of positively charged ions, which are attracted to the particle via the Coulomb force and move around in the liquid with the ENM. The outermost layer, consisting of ions that are loosely bound to the rest of the particle and the Stern layer, due to electrostatic attraction, is called the diffusion layer (Schmickler, 2014). In fact, the EDL thickness can be calculated through the Debye length (λ D , eq. 11), which expresses the distance over which the electrostatic effect of the ENM charge persists (Russel et al., 1989). For a colloidal or ionic solution, the Debye length is given by the equation: where ε r is the solution's dielectric constant, ε 0 is the free space permittivity, k B is Boltzmann's constant, T the temperature in K, e is the electron charge, n i is the number of ions in the solution and z i is the valence of the i th ion (Russel et al., 1989). ζ-potential is calculated at the slipping plane, where the EDL (and thus the Debye length) ends, using Henry's equation (Eq. (12)) for electrophoretic mobility (u e ) (Delgado et al., 2005): where ε rs is the relative permittivity of the electrolyte solution, ε 0 is the electric permittivity of vacuum, η is the dynamic viscosity of the solution, ζ is the ζ-potential, κ is the inverse Debye length, α is the ENM size and f 1 is Henry's function, the value of which varies from 1 to 1.5 based on the ratio of the ENM size to the Debye length (κα). Hence, electrophoretic mobility is directly correlated to the ENM size, confirming its presence as one of the key parameters for the prediction of ζ-potential. Interestingly, it was the ENM core size that was the key parameter rather than the experimentally measured hydrodynamic diameter that was found to be most predictive. Furthermore, the presence of the Debye length in the ζ-potential formula (Eq. (12)); i.e., the thickness of the EDL, means that the measured ζ-potential, depends on the ability of an ENM to attract and redistribute ions. Based on studies on solid-solid electrification, which are dominated by electron transfers, Wang and Wang (2019) suggested, extending to solid-liquid interface, that the EDL is formed in two stages. Firstly, the solution molecules interact with the atoms on the surface of the solid and form strong overlaps of electron clouds, with electron transfers taking place to ionise the atoms on the solid surface to form the first electrostatic charges. These are followed by ion exchange, which is the dominant process thereafter, between the charged solid atoms and free H + and OH − radicals in the solution, forming the EDL (Wang and Wang, 2019). These results were further supported by Lin et al. (2020), who demonstrated that both electron and ion transfers co-exist at the solid-liquid interface. In practice, this explains the identification of χ abs as a significant parameter for the prediction of ζ-potential, as it is the average (Eq. (13)) of the ionisation energy (E i ) and the electron affinity (E ea ) and, according to Mulliken (Mulliken, 1934;Mulliken, 1935), expresses the tendency of an atom to attract electrons and thus create the EDL.
This also explains why the ionic radius (r ion ) of metals has been identified as a significant parameter, also reported by Wyrzykowska et al. (2016), for the prediction of ζ-potential, which regulates the bond strength of the atom's valence electrons, and thus the energy needed to detach them and commence the electron exchange needed to form the EDL. The higher r ion , the further the valence electrons are from the nucleus and, thus, the easier it is to remove an electron from its orbital. Similarly, metal electronegativity plays an important role in determining ζ-potential since its increase corresponds to a higher number of protons in the atom's nucleus, increasing its electron attracting potential. Thus, the summation of total metal electronegativity (Σχ) would provide an indication of the ENMs' total metal reactivity and ability to attract electrons. On the other hand, weighing Σχ with the total number of oxygens present (n O ) would provide an indication of the total ENM electronegativity, and thus reactivity, and is based on Sanderson's electronegativity equalisation principle (Sanderson, 1951). Sanderson's principle states that when two atoms with different electronegativities combine, their electronegativities equalise in the formed molecule, with molecular electronegativity being the geometric mean of the electronegativities of the isolated atoms (Parr and Bartolotti, 1982). Hence, Σχ/n O can be considered as a measure of the reactivity of the entire ENM, with ions and electrons, explaining why it has been identified as a significant parameter in our model.
In the case of coated ENMs (Fig. 1b and c), ζ-potential measurement is more complicated and depends on both the ENM's EDL and the coating's charge and thickness (Lowry et al., 2016). For an ENM with a thick polymeric coating layer (Fig. 1b), which is thicker than the EDL, the coating completely masks the core ENM ζ-potential pushing the slipping plane to the surface of the coating (Ohshima, 2006). As a result, the ζ-potential measured in this case will correspond to the Donnan potential (Batley and Apte, 2005;Ohshima, 2006). This is the result of the Gibbs-Donnan effect of two ionic solutions separated by a semipermeable membrane, the role of which is played by the coating, which are not distributed evenly (due to certain charged elements being unable to cross the membrane) resulting in a potential difference measured as the ζ-potential. For thinly polymer coated ENM (Fig. 1c), the apparent ζ-potential will be influenced by both the coating charge (neutral, positive, negative) and the ζ-potential of the core ENM.
Having studied the kNN ENM grouping, certain trends could be observed that could provide a meaningful basis for ENM grouping and read-across. Elements, in general, group with other similar elements in terms of core chemistry, in most cases where enough neighbouring (n = 7) ENM species exist. This is the case for SiO 2 and CeO 2 ENM, which present the same molecular properties with their neighbours, but differ in terms of size and coating. Results demonstrate that, for same core chemistry, the closest neighbours are those with similar sizes, followed by those with different surface properties. These are followed by changes in size initially and then coating demonstrating the ranking of the descriptors significance. While the molecular properties will initially define the ENM properties and reactivity, the larger the ENM core size the lower the effect of the coating on EDL and ζ-potential. Similarly, Zrdoped CeO 2 ENM are grouped with other similar ENM with different doping ratios and ZrO 2 ENM, spanning the entire range from low to complete Ce 4+ substitution by Zr 4+ that can be considered a form of "grading" the behaviour of ENM based on varying doping ratios.
In the case where not enough similar neighbours exist, certain patterns of grouping can also be observed. Metallic Ag ENMs are grouped initially with other Ag ENMs (metallic Ag, AgO, Ag 2 S) and are then grouped with Au ENM as they are in the same group of the periodic table, and thus present chemical similarity. AgO ENM are grouped with metallic ENM and ZnO, CuO and Fe 2 O 3 ENM. Ag and Cu 2+ are in the same periodic table group, and thus present chemical similarity, while Fe 3+ and Zn 2+ have electronegativities (Zn: 1.65, Fe: 1.83) close to those of Ag (1.93). The fact that AgO is grouped with other metal oxides and not with mostly metallic Ag or Au points out the significance of both metal electronegativity and oxygen presence and how these interact based on Sanderson's principle of electronegativity discussed above. The grouping of BaTiO 2 with CeO 2 and Zr-doped CeO 2 can be explained, initially, by the chemical similarity between Ti 2+ and Zr 4+ , which belong to the same periodic table group. Barium (Ba 2+ ) and Ce 4+ , on the other hand, are in the same periodic table period (row) having the same number of electron shells and a difference of two protons (Ba: 56, Ce: 58) in their nuclei and close electronegativities (Ba: 0.9, Ce: 1.12). Period (row) grouping is also taking place in the case of ZnO ENM. These are grouped not only with other ZnO ENMs, but with TiO 2 , CuO, Fe 2 O 3 as well. This can be explained by the trends existing for electronegativity, atomic structure, electron affinity and ionisation energy for elements in the same period row (as shown in Fig. 2). Similar grouping is also observed for TiO 2 ENM with other TiO 2 species, ENM from the same row (Fe 2 O 3 , ZnO), CeO 2 , which is in the subsequent periodic table group, and AlOOH ENM with which they present similar χ and the closest Σχ/n O value in the dataset.
Although the scope of the produced model is to be used for safe by design strategies, i.e. to design new safer and more functional ENM, it can also be used to test existing ENMs as long as there is experimental information on the ENM core size. As a result, it would be of benefit to test how the current model compares with other ζ-potential models, like the "Enalos Nanoinformatics Model for Zeta Potential Prediction" (), which uses the pH of the suspension used for ζ-potential measurement and the ENM's main elongation as extracted through TEM image analysis () (Varsou et al., 2020a). Using ENMs characterised in MiliQ water (n = 8) that were not used in the current model's training or test sets, the results suggest that the current model has better predictive ability (R 2 = 0.921) when compared to the "Enalos Nanoinformatics Model for Zeta Potential Prediction" (R 2 = 0.906). Furthermore, our model's predictions were all deemed reliable, compared to two unreliable measurements produced with the Enalos model.
The significance of the presented model is further enhanced due to the importance of ζ-potential and thus the applicability of the model to different fields in which ENM are applied, including nanomedicine (Honary and Zahir, 2013a;Honary and Zahir, 2013b), energy production (Sen et al., 2017;Caimi et al., 2018), fuel and combustion (Saxena et al., 2017;Kumar et al., 2020), construction (Ogunsona et al., 2020), consumer products (Ogunsona et al., 2020), water treatment (Ogunsona et al., 2020) and remediation of heavy metal pollution (Yang et al., 2013). In most cases, ζ-potential plays a key role in keeping the ENMs dispersed inhibiting agglomeration to maximise their functional potential. In the case of construction and consumer products (Ogunsona et al., 2020), high ENM dispersibility assisted by high ζ-potential values can enhance their potential as antimicrobial agents in substrates (e.g. coatings), alternative food packaging, antifouling coatings and more. Furthermore, ζ-potential can assist with maximising the efficiency of redox flow batteries, as well as dispersed non-agglomerating suspensions of electrochemically active ENMs to reduce system viscosity such that the suspensions can undergo electrochemical charge and discharge (Sen et al., 2017) and thus increase the battery's conductivity. Similarly, the inclusion of stable ENM suspensions in diesel and biodiesel fuels for combustion engines has led to substantial enhancement in thermophysical and chemical properties and decrease in environmental pollutants (Saxena et al., 2017;Kumar et al., 2020). In nanomedicine, an extensive review by Zahir (2013a, 2013b) demonstrates the significance of ζ-potential with respect to blood circulation by inhibiting agglomeration, targeting specific sites, absorption through bodily membranes and controlled drug release. Finally, ζ-potential can substantially help with decreasing environmental pollution through water treatment (Ogunsona et al., 2020) and binding of heavy metal pollutants from soil and groundwater (Yang et al., 2013). In these cases, ENMs with a sufficiently negative charge can attract and bind positively charged metals, like Pb or As, inhibiting their release into the environment.
The ζ-potential predictive model has been made available as a webservice and as such it is now easily accessible and fully exploitable by interested users. It requires the submission of the values of the identified significant parameters (ENM core size, coating, r ion , Σχ/n O and χ abs ). The predicted ζ-potential values, as well as the domain of applicability warnings, are returned to the user within seconds. It is important to note that the presented ζ-potential predictive model has been developed based on experimental measurements performed with water as the dispersion medium. As such, users must assume that the acquired predictions are at pH 7 and that the theoretical ionic strength is 10 -6.998 mol/L. Furthermore, model usage is restricted to the coatings offered in the webservice's dropdown menu currently. Direct usage for different conditions or coatings would result in unreliable measurements. Another point to take into account when using the tool, is that a concentration-related effect has been reported when measuring the ζ-potential of ENM (Kaszuba et al., 2010;Tantra et al., 2010;Wang et al., 2013). These changes were observed when the ENM concentration dropped below a certain threshold (Tantra et al., 2010;Wang et al., 2013), which Tantra et al. (2010) have reported to be between 10 − 2 and 10 − 4 wt% depending on the tested material. This change was attributed to either ambient CO 2 interacting with the ENM surface functional groups to form ⋰ ⋱ MOCO 2 − (Wang et al., 2013) or to extraneous particulate matter (Tantra et al., 2010) the contribution of which became significant as the ENM concentration decreased. While in most cases, we would not expect this to affect model performance, users need to take care when predicting ζ-potential values and applying the results to lowconcentration systems. The webservice is offered as a service through the NanoCommons research infrastructure and is fully compatible with, and will be integrated into, the NanoSolveIT IATA that is currently under development within the NanoSolveIT project.

Conclusions
In silico approaches can substantially reduce the effort and time needed for the design, production and evaluation (e.g. preparation of regulatory dossiers) of stable and safe ENM. This study presents a robust, validated and applicable ζ-potential predictive model, developed using the Isalos Analytics Platform and the Enalos+ nodes (Papadiamantis et al., 2020b;Afantitis et al., 2020b). The model was developed via a library of 69 ENMs, using 7 physico-chemical descriptors, enriched with 13 molecular descriptors, which can be easily identified or calculated using periodic table information. The produced model, which predicts ζ-potential (R 2 = 0.9) can be applied in SbD workflows for the development of novel ENM, and is fully independent of experimental input parameters, requiring just descriptive information and molecular descriptors calculated from the periodic table. The parameters identified as significant, i.e., ENM core size, coating, absolute electronegativity (χ abs ), metal ionic radius (r ion ) and sum of metal electronegativity divided by the number of oxygen atoms present (Σχ/n o ) play a clear role in regulating ζ-potential in water. The benefits of this group of parameters are that they can either be calculated or acquired through the periodic table or existing libraries of calculated descriptors (χ abs , r ion , Σχ/n o ), or can be imported based on ENM physicochemical descriptions typically provided by manufacturers (core size, coating type and charge), and thus is straightforward to use to design specific ENM as only limited and easily acquired input data is required. Furthermore, the used descriptors allow ENM read-across, based on chemical similarity and specific trends observed within the groups (columns) and periods (rows) of the periodic table, followed by ENM core size and the coating used. In any case, no ENM synthesis or physicochemical characterisation is needed to successfully use the model.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.