Investigation of water desalination/purification with molecular dynamics and machine learning techniques

: This paper incorporates a number of parameters, such as nanopore size, wall wettability, and electric field strength, to assess their effect on ion removal from nanochannels filled with water. Molecular dynamics simulations are incorporated to monitor the process and a numerical database is created with the results. We show that the movement of ions in water nanochannels under the effect of an electric field is multifactorial. Potential energy regions of various strength are formed inside the nanochannel, and ions are either drifted to the walls and rejected from the solution or form clusters that are trapped inside low potential energy regions. Further computational investigation is made with the incorporation of machine learning techniques that suggest an alternative path to predict the water/ion solution properties. Our test procedure here involves the calculation of diffusion coefficient values and the incorporation of four ML algorithms, for comparison reasons, which exploit MD calculated results and are trained to predict the diffusion coefficient values in cases where no simulation data exist. This two-fold computational approach constitutes a fast and accurate solution that could be adjusted to similar ion separation models for property extraction.


Introduction
Ion and contaminant removal from aqueous solutions is an ever-growing research trend in the last decades, mainly due to water scarcity all around the globe, which has become prominent nowadays. The idea of removing undesired substances from water through membrane-based nanochannels in which ion transport is guided by an external electric field perpendicular to the walls [28,31]. The potential energy and electric field maps are created, and diffusion coefficients of the solution are extracted. It is found that the critical parameters that affect ion drift are the channel width, h, the external electric field, E, and the wall/fluid interaction strength ⁄ . These parameters are, then, fed into an ML computational procedure and predictions outside the simulation range are made, with the coefficient of determination approaching values close to unity. MD simulation is a technique that can provide ample numerical data to train an ML algorithm. Towards proposing a framework of combined MD/ML (Figure 1), a new computational approach arises that could be incorporated in property calculation platforms, at least in similar applications, posing as a hybrid technique that could boost calculation times, while making ML an indispensable tool that advances our ability to investigate science and technology problems.

Molecular Dynamics model
Water separation from Na + and Clions is performed with an MD model consisted of two impermeable carbon walls (a nanochannel) and an electrical field acting as the driving force towards the walls. The system is considered periodic in the x-, y-, and z-directions ( Figure 2). Table 1 lists all the parameters and variables of the dimensions of the materials used. Wall and fluid particles are given Lennard-Jones (LJ) parameters to make the simulation less computationally demanding. Equation 1 describes the total potential as sum of the LJ potential between two particles i and j and the coulombic potential due to electrostatic forces (where present), as where ε represents the interconnection's strength, σ the length range, r c is the cut-off radius, C is the energy conversion constant, q i and q j are the charges of interacting atoms, and ε 0 is the dielectric constant. Arithmetic Lorentz-Berthelot combining rules are applied to all interatomic interconnections based on the values in Table 2. Wall wettability is determined by the ⁄ ratio of wall-to-fluid interaction. The wall is considered strongly hydrophilic when ⁄ = 1.0, hydrophilic when ⁄ = 0.5, hydrophobic when ⁄ = 0.2, strong hydrophobic when ⁄ [33][34][35]. Water molecules conform to the SPC/E (extended simple point charge) pair potential, which has been found to adequately reproduce water's structural and dynamic properties [36]. ). Green and red circles are Na + and Clions, respectively. The external electric field E z is applied perpendicular to the flow (driven by the external force F x ).  The SPC/E water model is incorporated here, which is composed of LJ terms and Coulombic potential for the ion/water solution. Water bonds are constrained using the SHAKE algorithm [37] and no dissociation phenomena are observed. A particle-particle particle-mesh (PPPM) method has been used to calculate the long-range electrostatic force. An external electric field (similar to a solution between two charged surfaces) should result in anion drift towards the "positive" surface and cation drift towards the "negative" surface. The electric field is caused by the homogeneous distribution of opposite sign charges on the two walls, which produces an electric force F e = qE z acting on the z-direction. In this work, the applied E z is investigated over the range 0.0-1.0 V/Å.
The MD procedure involves the following steps, as can be also seen from Figure 1, (a) an equilibrium MD run of t eq = 10 ns is incorporated, (b) in the first production runs (parallel independent runs), the desired desalination procedure takes place when most ions approach the walls due to the effect of E z , (c) then, on each fluid particle, an external driving force is applied along the x-direction, and flow is induced (in the same manner as pressure difference acts on Poiseuille flow), and (d) as ions are forced to diverge from the solution, clean water flows through the channel interior and must be collected appropriately at the outlet (not shown here).
The results are saved, averaged (we have considered n = 5), and secured in the simulation database for later use. Wall atoms absorb the increase in fluid kinetic energy caused by the application of external fields, and Nosé-Hoover thermostats are used at the walls to maintain a constant system temperature (T = 300 K).

Computational details
The impact of an external electric field on ion/water flows in conduits is investigated. A multivariate interaction environment is considered, taking into account the hydrophobic/hydrophilic nature of the wall as well as the repulsive/attractive forces between positive and negative ions. As a result, potential energy maps are extracted to reveal probable ion positions. The computational space is divided into mxnxk bins along (x,y,z)-dimensions and local calculations are conducted.
Potential energy as a local quantity, , is calculated in each one of the mxnxk bins as (2) refers to the local Lennard-Jones interaction potential within each computational bin, which includes wall/fluid interactions in terms of hydro/ion-phobicity and hydro/ion-philicity. The potential energy associated with long-range electrical interactions between all charged atoms, H + and Ofrom water molecules, and Na + , Clas free-running ions in the solution, is denoted by the second term, . The result of applying the external field perpendicular to the walls, E z , is associated with . The potential energy map inside the channels has been extracted using Eq 2. Because the goal of the proposed model is to present a method of ion drift towards the walls, this would be useful information, because low (negative) potential regions can reveal possible ion caging positions inside the channel, which would obstruct their transport.
The induced electric field caused by the presence of several N c charged atoms within a bin is calculated.
or equivalently where denotes the dielectric constant of water. For each mxnxk bin, the electric field is calculated locally. We get the local value by adding the external field , which is distributed evenly throughout the channel.
Using Einstein's relation, the average diffusion coefficient throughout the channel is calculated by time-averaging the mean square displacement of N fluid particles.
where r j is the j th atom's position vector. Equation 7 is usually used for extracting the diffusion coefficients for systems at equilibrium. Here, we have a non-equilibrium system, where forces are applied on the x-dimension to drive the flow. However, it can be used in such cases, provided one excludes the drift contribution from the flow [38].

Decision trees
The decision tree (DT) ML algorithm has recently gained popularity as a supervised machine learning (ML) method for predicting processes and events. This algorithm is useful in problems involving data sets that must be classified. A decision tree is made up of several transitional (decision-makers) and leaf nodes. Each decision node separates the data into two parts based on one or more input variables, and this process is repeated hierarchically from middle to leaf nodes. Each leaf node specifies the final clause value [39]. The tree can be described by the following function: The three critical variables for the DT classifier are the maximum depth of the tree, the minimum number of samples required for splitting, and the number of maximum features. Decision Tree (DT) algorithm is often used as a comparative candidate among other ML methods, without always having the best performance [40].

Random Forest
The Random Forest (RF) algorithm is a grouping of DTs that predicts the final value using the averaging method where Y b is each DT and X' is the number of unknown scenarios. The total number of DTs is presented by the variable b. A segment of training dataset known as the "bootstrapped dataset" creates each tree in the forest. To build the trees, a random set of characteristics is also chosen. There is no need to perform input normalization before training the RF model. The RF method can automatically determine the most influential parameters. In comparison to the DT model, the forest developed with randomness has a better prediction performance and has one additional critical value which is the number of trees. The RF algorithm was utilized for predicting salt absorption capacity with great success [41], and it has also been the best performing ML method to predict salt passage [42].

Gradient Boosting
The Gradient Boosting Regressor (GBR) is a supervised ML method where an initial basic learner is formed by a function f 0 (x), using a sample of known x, y values as trainer. The goal is to estimate the function F(x), which maps variables x to their output values y, by decreasing the expected value of a given loss function, F w (x) = L(y, F(x)), forming a node tree-like schematic. Gradient boosting produces a weighted sum of functions as an additive estimate of F(x), as (10) where p w is the weight of the w th function, h w (x). This method, due to its weakness at overfitting, is best suited with data that are regularized or combined with other ML methods for optimum performance. Gradient Boosting tree algorithm was successfully used in Thin Film Nanocomposite RO membranes to predict and form models of water permeability and salt rejection rate [41]. The GB method was the best performer among other ML algorithms in predicting properties of a 2D membrane during MD simulations, while a hybrid GBR implementation has performed well in predicting the internal concentration polarization in Forward Osmosis membranes [43].

Multi-Layer Perceptron
One of the most fundamental and significant ANN models is the Multi-Layer Perceptron (MLP), a supervised ML method [44], which mimics the transcriptional patterns of human brain. These networks usually have an input layer, one or more middle-hidden layers, and an output layer. The number of hidden layers can be altered and calibrated based on the task's sophistication to achieve the best results for process simulation. The depth (or shallowness) of a machine learning model is determined by the number of hidden layers. The input signal is routed through the network layer by layer, so that each sensory unit analyzes the signal after receiving it from a neural or non-neural neuron and conveys the result to another sensory unit. This behavior will continue until a valid conclusion is reached, a selection will be made, and further processing will begin. The majority of the network behavior of the human brain and signal transmission are scrutinized in this model. (11) where Y, w, x i , and b represent output data, weighted vector data, input data, and the input bias term, respectively. A supervised learning method is employed to calculate weights. MLP can understand nonlinear function approximation for classification and regression given a set of parameters. The MLP model has shown high capability and accuracy in predicting the amount of water produced by the CDI method [45], and CO 2 solubility in salt solutions [46].

Potential energy maps
To study the effects of electric field strength on ion localization we employ a hydrophobic ( 0.1 ⁄ ) nanochannel of size h = 3 nm. At first, pure water solution for E z = 0 V/Å is studied, and Figure 3a shows the potential energy map. We remind here that the lowest negative potential energy regions are preferred by fluid particles [37]. The negative potential energy regions are randomly distributed throughout the channel, while regions near the walls may attract water molecules due to the hydrophobic walls [34]. When an external field E z = 0.01 V/Å is applied to the z-direction, this tendency becomes slightly stronger (Figure 3b). The solution is then enriched with Na + and Clions for E z = 0.1 V/Å (Figure 3c). A new behavior occurs, as low/high potential energy regions have been formed near the upper and lower walls, along with scarce negative potential energy regions near the channel center. It becomes clear that the presence of ions has altered the potential energy distribution, and, at least for the given electric field strength value, all ions may not drift towards the wall, as negative potential regions near the channel center may prevent their movement. For E z = 0.5 V/Å (Figure 3d), "shady" potential regions are formed primarily near the walls, hinting ion separation. Another important factor affecting fluid particle localization is wall/fluid interaction, which is stronger near the walls, inside the interaction radius of the wall particles [47][48][49]. Two cases are presented next, a highly hydrophobic case ( 0.1 ⁄ ) in Figure 4a, and a highly hydrophilic case ( 1.0 ⁄ ) in Figure 4b. The channel size is h = 21 nm and the electrical field is E z = 0.5 V/Å. The 0.1 ⁄ case indicates that regions of lower, negative potential appear near the walls, along with few positive potential regions. When we consider highly hydrophilic walls, 1.0 ⁄ , in Figure 4b, both low and high potential regions appear near the upper and lower walls, forming an interaction region where ions could be trapped. This could also hinder the presence of Na + and Clions in this region. In any case, there appears to be no ionic interactions near the channel center, as potential energy values are close to zero. As a result, ion removal from the channel center region has been accomplished.  We now turn our attention on the effect of channel size on the potential energy map. Ion/ water flow simulations in this work include the investigation of channels spanning from 3 nm 21 . We consider 0.1 ⁄ and E z = 0.1 V/Å. Low and close to zero potential regions are revealed near the upper wall of the h=6nm channel, along with a strong negative potential region at the channel center ( Figure 5a). As the channel size increases to h = 9 nm (Figure 5b), negative potential regions appear near the upper wall, and some negative regions still remain near the channel center. For the h = 15 nm channel, a multiple interaction environment is also observed near the walls (Figure 5c).
Finally, in Figure 5d, where the channel dimension has been increased to h = 21 nm, lower potential regions are found throughout the channel, making it more difficult for ions to reach the walls. To summarize the effect of channel size, the presence of negative potential energy regions near the walls, which indicate probable ion position during the simulation, is not solely affected by the strength of the external electric field. Although wider channels receive more electric energy when the same E z is applied, ions must travel a greater distance and may become trapped inside a low potential region away from the walls.

Ion localization
When an electric field is applied perpendicular to the walls in similar models, water molecules are aligned, with hydrogen atoms pointing in the negative direction [38]. Without ions in the solution, the electric field lines would be straight lines parallel along the z-direction. Transport along the z-axis (towards the upper or lower wall) is expected in the presence of ions, if E z is strong enough to overcome interatomic interactions. Ion separation is not always clear; Na + /Clion pairs form at low E z strength and may move together towards the walls or, in the worst-case scenario, wander around the channel centerline. Furthermore, Na + and Clions have distinct transport mechanisms and diffuse at different rates [50]. Larger ions (Cl -) are less bonded to water molecules and thus more susceptible to being driven by an electric field than smaller Na + ions [15]. As a result, ion movement is somewhat complicated. As ions move towards the upper or lower wall, a new electric field, E ion , opposite to E z , emerges and competes with E z . Further research is required to elucidate this mechanism.
We evaluate the electric field effect inside the ion/water solution. At first, we consider E z = 0.1 V/Å and h = 3 nm, in Figure 6a. It is observed that E z = 0.1 V/Å is not strong enough to anticipate high ion drift movement towards the walls as ions are mostly clustered just above the channel's center. Another observation is that the presence of ions has redirected the external field (z-direction) flow lines in nearby regions, which appears to be impeding their movement towards the walls. The potential energy map shown previously in Figure 3c has captured this effect by revealing local minima in the same region where ions are now set. As a result, ions must absorb more energy to overcome this barrier. In Figure 6b, we increase the electric field strength to E z = 1.0 V/Å. In this case the ion movement is clear, electric field lines are straight, and this makes ions be localized near the upper and lower walls.
Consequently, the effect of wall/fluid-particle interaction is investigated. We also note the distance d which refers to the minimum ion distance from the lower wall, inside the h = 21 nm channel for two cases, a hydrophobic wall ( 0.2 ⁄ ), in Figure 6c, and a hydrophilic wall ( 0.5 ⁄ ), Figure 6d. We have incorporated a strong electric field, E z = 1.5 V/Å. Parameter d shows no significant difference between the two cases. Therefore, we come into conclusion that effect of wall wettability on ionic separation from water solutions is not critical for the proposed desalination model at this scale. Nevertheless, it may be significant in other situations, such as those resembling biological processes that take place at smaller scales, around 1-2 nm, as it has been found elsewhere [28,51] .
From another point of view, it has been shown that in confined nanochannels, where electric fields are applied perpendicular to the walls, one wall acts as hydrophobic and the other as hydrophilic [52]. In such cases, small external electric fields do not affect hydrogen bonding, while a large E can break these bonds. Water molecules are oriented, with hydrogen atoms pointing on the negative direction and oxygen on the positive direction of the field. We have to note here that ion concentration considered in this work is rather small to affect the electrical behavior of internally induced electric fields, so water behavior near the walls is mainly due to molecules orientation. Breaking of electric lines continuity observed in Figure 6a,c,e,f is a local phenomenon and is attributed to increased local ion concentration, which takes place inside a small region of the channel.
The impact of channel size (h = 6 nm and h = 9 nm) on ionic separation in water solutions is shown in Figure 6e-f, respectively, for E z = 1.0 V/Å. We note that the electric field E z forms parallel lines across the z-direction, indicating that all ions are drifted near the walls for h = 6 nm. This finding validates the proposed desalination procedure, at least for this system size. However, field orientation is not parallel to the z-direction for the h = 9 nm channel, and ions appear to be blocked near the channel center.
To summarize our findings, the electric field distribution studied in the previous cases shows that ion removal in this scale is independent of wall wettability and channel size for cases h < 6 nm, but it is strongly affected for h > 6 nm. It seems that the most important parameter for controlling ion separation for this solution is the strength of the electric field, which is the driving force of the ion drift movement. Channel size is important, as well, since smaller channels achieve better ion separation, and this is consistent to observations presented in similar works [10,25].

Diffusion coefficients
Diffusion coefficient values for various values of h, E z , and ⁄ have been calculated according to Eq 7 with MD simulations and all values are given in Table 3. The MSD values are extracted from Eq 6 and the respective diagrams can be found in the Supplementary Information. To eliminate statistical errors, we have run five independent simulations (see Figure 1) and have taken D as the average value, with σ(D) being the error margin.  10 . This means that calculated D values range around water bulk values as channel dimensions increase.

Machine learning predictions for diffusion coefficients
The four ML algorithms implemented in the context of this paper. i.e., DT, RF, GBR, and MLP, have been trained on data taken from MD simulations, as shown in Table 3, and suggest predictions with accuracy metrics presented in Figure 7a-d. The coefficient of determination. R 2 , is shown for every algorithm. We observe that R 2 > 0.700 for all cases, and this shows more than adequate algorithm performance. The three algorithms that are based on decision trees (DT, RF, GBR) in Figure 7a-c have achieved higher R 2 compared to MLP, with DT achieving R 2 = 0.853.
The main outcome from the application of ML techniques to this investigation is that ML can pose as an alternative approach to obtaining properties in similar flow systems, complementary to simulations and/or experiments. The diffusion coefficient has been chosen as the candidate property. It is a fact that only scarce experimental data are available in the literature for diffusion coefficients, and, as far as we know, it is particularly difficult to perform experiments. Moreover, simulation data provided in the literature are most of the times calculated over specific conditions and lack in details that could be fed as input parameters to train an ML model, at least in the problem we are dealing with in this paper. In this direction, MD simulations seem as the proper to obtain diffusion data, and they can be boosted by employing ML prediction techniques.
The amount of data incorporated for training an ML algorithm is one of the main considerations for obtaining an accurate result. However, as our dataset is uniformly distributed between a short range of channel widths, and most indicative cases of wall wettability, and electric field values (at least, in cases where no chemical phenomena occur), we can achieve nice-fitting results. We also believe that the problem of overfitting in ML methods is anticipated here by employing multiple ML algorithms, that are based on different internal mechanisms. Furthermore, we did not acquire "perfectly" fit for any algorithm, which might have been an indication of overfitting.
Finally, to anticipate the fact that diffusion data presents high statistical errors in some cases, even after statistical averaging of their simulation values after 5 independent runs (see Figure 1 the need of exploiting such a two-fold approach arises

Conclusions
This paper has tried to reveal the ion behavior mechanism within water solutions, when a number of factors are taken into account, such as the electric field strength, the wall wettability strength, and the channel width. Molecular Dynamics simulations have been incorporated to provide detailed potential maps inside the nanochannels, spot possible ion position in the solution, and calculate the diffusion coefficients of the solution.
There have been cases where Na + and Clions along with water molecules have been found to form clusters and/or hydrated ions, preventing ion rejection towards the walls. The increase of the electric field value is the main force that ensures ion drift to the walls, as it overcomes internal energy barriers that keep ions away from the walls. It has also been observed that in smaller nanochannels, the ratio of wall/fluid interaction may affect ion/water flow characteristics. In terms of diffusion coefficients, calculated values with MD simulations for a wide range of system conditions are shown here. These values are also exploited to train various ML models in order to construct a computational prediction scheme that can act in parallel to simulations and use their output to maximize the span of the obtained results. Here, we have achieved good accuracy in the prediction of diffusion coefficient values with tree-based ML algorithms.
We envision that the proposed two-fold simulation scheme could be improved and incorporated in various materials science computational methods, especially for properties that are hard to obtain, such as shear viscosity, thermal conductivity, and electrical conductivity in nanofluidic devices.