Automation of some macromolecular properties using a machine learning approach

In this study, we employed a newly developed method for predicting macromolecular properties that uses a bootstrapping swarm artificial neural network (ANN) as a machine learning approach. In this method, molecular structures are represented by feature description vectors, which serve as the training input data for the neural network. This study aims to develop an efficient approach for training an ANN using either experimental or quantum mechanics data, and to introduce an error model that controls the reliability of the prediction confidence interval using a bootstrapping swarm approach. We created different datasets of selected experimental or quantum mechanics results. Using the optimized ANN, we then predict properties and their statistical errors for new molecules. Four datasets are used in this study: a dataset of 642 small organic molecules with known experimental hydration free energies, a dataset of 1475 experimental pKa values of ionizable groups in 192 proteins, a dataset of 2693 mutants in 14 proteins with experimental values of the change in Gibbs free energy, and a dataset of 7101 quantum mechanics heat of formation calculations. All the data are prepared and optimized using the AMBER force field in the CHARMM macromolecular computer simulation program. The bootstrapping swarm ANN code for performing the optimization and prediction is written in the Python programming language. The descriptor vectors of the small molecules are based on the Coulomb matrix and sum-over-bond properties, while for the macromolecular systems they consider the chemical-physical fingerprints of the region in the vicinity of each amino acid.


Introduction
Recently, neural network methods have seen a broad range of applications in molecular modeling [1,2]. In [3], a hierarchical interacting particle neural network approach is introduced that uses quantum models to predict molecular properties; in this approach, different hierarchical regularization terms are introduced to improve the convergence of the optimized parameters. In [4], machine learning (ML) like-potentials are used to predict molecular properties, such as enthalpies or potential energies. The degree to which general features should be included in characterizing the chemical space of molecules to improve the predictions of these models is discussed in [5,6]. Tuckerman and co-workers [7] used a stochastic neural network technique to fit high-dimensional free energy surfaces characterized by a reduced subspace of collective coordinates. In [8], an ab initio based neural network potential energy function is introduced to model the interactions of the zinc ion in water for use in molecular dynamics simulations. Very recently [9], a comparison was performed between the neural network approach and Gaussian process regression for fitting potential energy surfaces. One recognized problem in using ML approaches to predict free energy surfaces is that the training data may not accurately represent the available features of the surface topology. To improve on this, a combination of metadynamics molecular dynamics with neural network chemical models has been proposed [10]. It is worth noting that an accurate representation of the reduced subspace can be important in the prediction of free energy surfaces.

Materials and methods
The ML approach provides a potential method for predicting the properties of a system using decision-making algorithms based on predefined features characterizing these properties [48]. The neural network method considers a large training dataset and tries to construct a system of rules for recognizing the patterns within the training dataset through a learning process, which can be either supervised or unsupervised. There is particular interest in accelerating chemical discovery with ML [49], as well as in other applications in molecular modeling [1,2]. Here, we used an improved version of the standard supervised ANN, namely the BSANN [50], which is introduced in the following.

Bootstrapping swarm ANN
In general, for a supervised ANN with K hidden layers (see also figure 1), the output Y_i is defined as

Y_i = f(W^(K) f(W^(K-1) ... f(W^(1) X_i + b^(1)) ... + b^(K-1)) + b^(K)).  (1)

Here, W and b are free parameters, which need to be optimized for a given set of training inputs and their known outputs. To optimize these parameters, the so-called loss function

S(W, b) = sum_i (Y_i - Y_i^0)^2  (2)

is minimized using the gradient descent method [51], where Y^0 represents the true output vector. For that, the gradients of S(W, b) with respect to W and b are calculated [51]:

W <- W - gamma_w dS/dW,  b <- b - gamma_w dS/db.  (3)

To avoid over-fitting, which is one of the known problems of ML approaches [52], the following regularization terms have been introduced in the literature:

W <- W - gamma_w (dS/dW + gamma_1 W),  b <- b - gamma_w (dS/db + gamma_1 b),  (4)

where gamma_w is the learning rate for the gradient and gamma_1 is the regularization strength. The gradient descent method often converges to a local minimum, and hence it provides only a local optimization of the problem. Therefore, a new BSANN method was proposed in the literature [50]. The standard ANN method initializes the parameters W and b with random numbers; therefore, the optimal solution differs from run to run. In other words, there exists an uncertainty in the calculation of the optimal solution (i.e. in determining W and b). To quantify these uncertainties in the estimation of the optimal parameters W and b, a bootstrapping ANN approach was proposed in [53], along with similar methods [54]. In this approach, M copies of the same neural network run independently using different input vectors. Here, we additionally swap the optimal parameters (i.e. W and b) between neighboring neural networks at regular intervals, which is equivalent to increasing the dimensionality of the problem by one; that is, if the dimensionality of each replica is d = K x L, then the dimensionality of the bootstrapping ANN method is d + 1. Figure 2 shows the layout of this configuration.
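The replica/swap mechanism above can be illustrated with a minimal sketch. This is not the BSANN code itself: a toy linear model stands in for the full network, each of M replicas is trained by gradient descent on its own bootstrap resample, neighboring replicas exchange parameters at regular intervals, and the spread of the replica predictions gives the uncertainty estimate. All names and values here are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: y = 2x + 1 with noise (stand-in for real data).
X = rng.uniform(-1.0, 1.0, size=64)
Y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=64)

M, steps, lr, swap_every = 8, 500, 0.1, 50

# Each replica trains on its own bootstrap resample of the data.
idx = [rng.integers(0, len(X), size=len(X)) for _ in range(M)]
params = [np.array([rng.normal(), rng.normal()]) for _ in range(M)]  # (w, b)

def grad(p, x, y):
    """Gradient of the mean squared loss of the toy model w*x + b."""
    w, b = p
    r = w * x + b - y
    return np.array([2.0 * np.mean(r * x), 2.0 * np.mean(r)])

for t in range(steps):
    for m in range(M):
        x, y = X[idx[m]], Y[idx[m]]
        params[m] = params[m] - lr * grad(params[m], x, y)
    if t % swap_every == 0:
        # Swap optimal parameters between neighboring replicas.
        for m in range(0, M - 1, 2):
            params[m], params[m + 1] = params[m + 1], params[m]

# Ensemble prediction at x = 0.5: mean and bootstrap spread.
preds = np.array([p[0] * 0.5 + p[1] for p in params])
mean, std = preds.mean(), preds.std(ddof=1)
```

The replica-to-replica standard deviation `std` is the quantity the bootstrapping scheme uses to attach an uncertainty to each prediction.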
Furthermore, to achieve a good sampling of the phase space spanned by the vectors W and b, we introduce two other regularization terms, similar to the swarm-particle sampling approach. First, we define two vectors for each neural network n, namely W_n^Lbest and b_n^Lbest, which represent the best local optimal parameters of that network. We also define W^Gbest and b^Gbest, which represent the global best optimal parameters among all neural networks.
Then, the expressions in equation (4) are modified by introducing these two regularization terms as follows:

Delta W_n = -gamma_w (dS/dW_n + gamma_1 W_n) + gamma_2 U(0,1) (W_n^Lbest - W_n) + gamma_3 U(0,1) (W^Gbest - W_n),
Delta b_n = -gamma_w (dS/db_n + gamma_1 b_n) + gamma_2 U(0,1) (b_n^Lbest - b_n) + gamma_3 U(0,1) (b^Gbest - b_n),  (5)

for each neural network configuration n, n = 1, 2, ..., M. Here, U(0, 1) is a random number between zero and one, and gamma_2 and gamma_3 represent the strength of the biases toward the local best and global best optimal parameters, respectively. The first bias term represents the individual knowledge of each neural network and the second bias term the social knowledge shared among the neural networks. Then, the weights W_n and biases b_n of each neural network n are updated at each iteration step according to

W_n <- W_n + Delta W_n,  b_n <- b_n + Delta b_n.  (6)

A single hidden layer neural network with identically independently distributed initial parameters is equivalent to the Gaussian process described above in the limit of infinite network width, that is, L_1 -> infinity [55]. This, in turn, allows a Bayesian inference framework to be established for the infinite-width neural network. Furthermore, kernel functions can be generated to describe multi-layer ANNs, which can then be used as covariance functions for Gaussian process regression, allowing full Bayesian prediction for an ANN [25].
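The swarm update rule of equations (5) and (6) can be sketched on a toy quadratic loss in place of the full network loss S(W, b); the values of gamma_w, gamma_1, gamma_2 and gamma_3 below are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy loss: squared distance to a known target (stands in for S(W, b)).
target = np.array([1.0, -2.0])
loss = lambda p: float(np.sum((p - target) ** 2))
grad = lambda p: 2.0 * (p - target)

M = 6  # number of replica "networks"
params = [rng.normal(size=2) for _ in range(M)]
local_best = [p.copy() for p in params]
global_best = min(params, key=loss).copy()

gamma_w, gamma_1, gamma_2, gamma_3 = 0.05, 1e-4, 0.1, 0.1

for _ in range(300):
    for m in range(M):
        u1, u2 = rng.uniform(), rng.uniform()
        # Equation (5): gradient + regularization + local/global attraction.
        dp = (-gamma_w * (grad(params[m]) + gamma_1 * params[m])
              + gamma_2 * u1 * (local_best[m] - params[m])
              + gamma_3 * u2 * (global_best - params[m]))
        params[m] = params[m] + dp          # equation (6)
        if loss(params[m]) < loss(local_best[m]):
            local_best[m] = params[m].copy()
    best = min(params, key=loss)
    if loss(best) < loss(global_best):
        global_best = best.copy()
```

The local-best term carries each replica's individual knowledge; the global-best term shares social knowledge across replicas, biasing all of them toward the best solution found so far.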
Figure 2. Layout of the BSANN as adopted by [50]. It is characterized by M different input vectors, each of dimension n, K hidden layers of l_LK neurons each, and M different output vectors, each of dimension m. Every two neighboring neural networks communicate regularly with each other by swapping the optimized parameters.

In the standard ANN, assume that the initial weights and bias parameters are taken as identically independently distributed random variables, that is,

W_ij^1 ~ N(0, sigma_w^2 / L_1),  b_i^1 ~ N(0, sigma_b^2);  (7)

then x_j^1 and x_j'^1 are independent for j != j'. In addition, Z_i^1(x) (see also equation (13)) is a sum of identically independently distributed terms; therefore, by the Central Limit Theorem, in the limit of infinite network width (L_1 -> infinity) it follows that Z_i^1(x) is Gaussian distributed. Moreover, any finite collection of these variables has a joint Gaussian distribution; that is, it forms a Gaussian process. Therefore, Z_i^1 is a Gaussian process with mean mu^1 and covariance K^1 [24],

Z_i^1 ~ G(mu^1, K^1),

where mu^1(x) = 0 and

K^1(x, x') = sigma_b^2 + sigma_w^2 C(x^1(x), x^1(x')),

with C being the covariance of the first-layer activations. For a K-hidden-layer neural network, the output of the l-th layer is

Z_i^l(x) = b_i^l + sum_j W_ij^l x_j^l(x),

where the Z_i^l(x) are identically independently distributed random variables, and Z_i^l(X) forms a Gaussian random process for L_l -> infinity: Z_i^l ~ G(0, K^l), with covariance K^l given as [55]

K^l(x, x') = sigma_b^2 + sigma_w^2 C(x^l(x), x^l(x')),

which can be re-written in the following recursive form [25]:

K^l(x, x') = sigma_b^2 + sigma_w^2 G_f(K^(l-1)(x, x')),

where G_f is a deterministic function depending on the choice of the activation function f. This indicates that the ANN computation can be performed as a series of computations obtaining K^l, as in the Gaussian process regression described above. Therefore, there exists an equivalence between the Gaussian process and the ANN in the limit L_k -> infinity, provided that the weights and bias parameters are initially drawn from identically independently distributed random variables according to equation (7). Therefore, we can use the Gaussian process to perform Bayesian training of the ANN [24]. Following [25], assume a dataset D with elements (X_i, Y_i) for i = 1, 2, ..., N_train representing the input-reference data-point pairs. The aim is to make a Bayesian prediction at some test point X* using the distribution of the outputs Z(X) obtained from a trained ANN with probability

P(Y* | D, X*) = Integral P(Y* | Z, X*) P(Z | D) dZ,  (17)

where Y* is the predicted output value for the input value X*. Note that the well-known relation p(x, y) = p(x|y)p(y) is used twice to obtain the final expression. In equation (17), P(Y|Z) gives the probability of obtaining the reference distribution Y from an ANN with output distribution Z on the training dataset; therefore, it represents the error in the output of the ANN, and it can be modeled as noise centered at the output distribution Z with variance sigma_eps^2, estimated without bias from the training residuals. That is, under the stated condition on the initial choice of the parameters, and assuming that the network width is infinite, the process is a Gaussian process, and hence P(Y*, Z | X*, X) ~ G(0, K) is a multivariate Gaussian with covariance [25]

K = [ K_D,D + sigma_eps^2 I    K_D,* ]
    [ K_*,D                    K_*,* ],  (20)

where X* is the test point. In equation (20), K_D,D is an N_train x N_train block matrix with elements K_ij = K(X_i, X_j), where both X_i and X_j are drawn from D. On the other hand, K_*,D is a block matrix whose elements are K_ij = K(X*, X_i), with only X_i in D. The integration in equation (17) can be performed exactly to give [25]

Y* | D, X* ~ G(mu_bar, K_bar),
mu_bar = K_*,D (K_D,D + sigma_eps^2 I)^(-1) Y,
K_bar = K_*,* - K_*,D (K_D,D + sigma_eps^2 I)^(-1) K_D,*.

However, the prediction of P(Y* | D, X*) and the calculation of the mean value and variance of the predicted value Y* rest on the assumption of infinite network width, needed to apply the Central Limit Theorem, which is not easy to satisfy in practice. Therefore, the bootstrapping approach introduced here represents an alternative way of evaluating P(Y* | D, X*) and the mean value and variance at the test data points.
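The exact Gaussian process posterior that equation (17) evaluates to can be sketched as standard GP regression; an RBF kernel and a small noise variance are chosen here purely for illustration.

```python
import numpy as np

def gp_predict(X_train, Y_train, X_test, kernel, noise=1e-2):
    """Posterior mean and standard deviation of a GP regression:
    mean = K_*D (K_DD + noise*I)^-1 Y,
    cov  = K_** - K_*D (K_DD + noise*I)^-1 K_D*."""
    K_dd = kernel(X_train[:, None], X_train[None, :])
    K_sd = kernel(X_test[:, None], X_train[None, :])
    K_ss = kernel(X_test[:, None], X_test[None, :])
    A = K_dd + noise * np.eye(len(X_train))
    mean = K_sd @ np.linalg.solve(A, Y_train)
    cov = K_ss - K_sd @ np.linalg.solve(A, K_sd.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Illustrative kernel and data: fit sin(x) on [-2, 2].
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
Xtr = np.linspace(-2.0, 2.0, 20)
Ytr = np.sin(Xtr)
mu, sd = gp_predict(Xtr, Ytr, np.array([0.0, 1.0]), rbf)
```

The posterior standard deviation `sd` plays the same role as the bootstrapping error bar: it quantifies the uncertainty of each test-point prediction.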

Feature description vectors
To construct data-driven models, such as in the ML approach, we need to specify a list of the physical and chemical properties of the input (macro)molecule that contain the necessary information about the system. Here, the input data are presented as a vector of length N, called X. This process is called feature description, and the input data are called feature descriptors.
Often, the simplified molecular input line entry system (SMILES) is used to represent a small molecule as a string of letters [56]. In such a case, the atoms could be encoded by a single integer, such as H = 1, C = 2, N = 3, and so on, or by the nuclear charge Z, such as H = 1, C = 6, N = 7, and so on [57]. That creates an (unnecessary) ordering relationship between the input data, namely H < C < N, which could influence the performance of the network. Other encoding models have also been suggested, for instance, representing each atom of the input molecule by the following fingerprint: H = [1 0 0 ...], C = [0 1 0 ...], N = [0 0 1 ...], and so on [57]. However, these fingerprints also have drawbacks: the dimension of the encoding vector depends on the number of atoms in the structure and may vary from molecule to molecule; moreover, with this model, atoms belonging to the same group of the periodic table do not behave the same.
In this study, we used the Coulomb matrix, C, to encode the molecular features, which contains both the geometrical information of the three-dimensional structure and the atom types [14]. For any two atoms i and j in a given input molecule, the matrix element C_ij is as follows:

C_ij = 0.5 Z_i^2.4        for i = j,
C_ij = Z_i Z_j / r_ij     for i != j,

where Z_i is the atomic number of the ith atom and r_ij is the distance between atoms i and j. The fingerprint represented by the Coulomb matrix C has some advantages: it considers the three-dimensional molecular structure, and it is invariant under rotation and translation of the structure. To calculate C for a given molecular structure, we need the nuclear charge of each atom and the Cartesian coordinates of the atomic positions taken from the equilibrium geometry. However, note that C is not invariant under permutations of the atom order in a molecule. Therefore, the spectrum of eigenvalues of the matrix C is used as a fingerprint of the molecule, since the eigenvalues are invariant under both rotation/translation and permutations of the rows and columns. A second feature descriptor used in this study is the sum over bonds, a numerical descriptor representing the vector of bond types present in a molecule, similar to [21]. If N_b is the number of unique bond types in the dataset of the compounds studied, then a vector of dimension N_b is constructed for each molecule, with each entry either zero or an integer giving the frequency of appearance of that bond type in the molecular structure. This fingerprint descriptor vector has a fixed length within the dataset. The sum-over-bonds descriptor vector is then concatenated to the end of the Coulomb matrix descriptor.
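The Coulomb matrix and its eigenvalue-spectrum fingerprint can be sketched as follows; zero-padding the spectrum to a fixed length is one common convention for handling molecules of different sizes, assumed here for illustration.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: 0.5*Z_i^2.4 on the diagonal, Z_i*Z_j/r_ij off it."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

def cm_eigen_fingerprint(Z, R, size):
    """Sorted eigenvalue spectrum, zero-padded to a fixed length `size`;
    invariant under rotation, translation, and atom reordering."""
    w = np.sort(np.abs(np.linalg.eigvalsh(coulomb_matrix(Z, R))))[::-1]
    out = np.zeros(size)
    out[: len(w)] = w
    return out

# Illustrative example: a water molecule (approximate geometry, in angstroms).
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
fp = cm_eigen_fingerprint(Z, R, size=5)
```

Reordering the atoms changes the rows and columns of C but not its eigenvalues, which is exactly why the spectrum is the usable fingerprint.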
To construct the input descriptor vector for a macromolecule, we introduced the following model. Suppose, for example, that we would like to calculate the change in the Gibbs free energy upon mutation (either single or multiple mutations) or perform pKa calculations for a selected residue in a protein. We label each residue or nucleotide of the input sequence with an ID from 1 to 24. That is, we form a descriptor vector X_1 of length N_1 = 24, which is a vector of zeros and ones indicating the presence of each entry of the amino acid dictionary. Here, the 'A. A. dictionary' represents a dictionary of the names of all amino acids. In addition, to characterize the environment around a mutation point, we determine another descriptor vector, X_2, of length N_2 = 24, defined as follows. For each mutation-point amino acid i, we determine the nearest-neighbor amino acids k in {i_1, i_2, ..., i_n.n.}, based, for example, on the center-of-mass distance. Then, the jth element X_j^(2) of the vector X_2 is defined through a modified 'Coulombic matrix' (equation (24)), in which the first sum runs over all point-mutation amino acids and the second sum runs over all nearest neighbors of amino acid i. In equation (24), r_ik denotes the center-to-center distance between the two amino acids. To take into account the polarity of the amino acids, we introduce a binary vector X_p of dimension N_p = 3, in which X_p^(1) indicates a non-polar amino acid, X_p^(2) an uncharged polar amino acid, and X_p^(3) a charged polar amino acid. In addition, we add another component to the net vector, a real value representing the percentage of the buried part of the amino acid (%SASA_buried), defined as the ratio of the buried surface area to the solvent accessible surface area of the amino acid in the protein structure; it is represented by the vector X_4.
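The residue vector X_1 and polarity vector X_p can be sketched as follows. The 24-entry dictionary (20 standard residues plus 4 placeholder entries) and the polarity classes below are our own illustrative assumptions, not the paper's actual tables.

```python
# Hypothetical 24-entry amino-acid dictionary: 20 standard one-letter
# codes plus 4 placeholder entries (the paper's dictionary may differ).
AA = list("ACDEFGHIKLMNPQRSTVWY") + ["X1", "X2", "X3", "X4"]

# Illustrative polarity classes (one common textbook grouping).
NONPOLAR = set("AVLIMFWPG")
POLAR_UNCHARGED = set("STCYNQ")
CHARGED = set("DEKRH")

def x1_vector(residues):
    """X_1: 24-dim vector marking which dictionary entries occur."""
    v = [0] * len(AA)
    for r in residues:
        v[AA.index(r)] = 1
    return v

def polarity_vector(residue):
    """X_p: 3-dim binary vector (non-polar, polar uncharged, charged)."""
    return [int(residue in NONPOLAR),
            int(residue in POLAR_UNCHARGED),
            int(residue in CHARGED)]
```

The full descriptor is then the concatenation of X_1, the environment vector X_2, X_p and X_4, giving one fixed-length vector per mutation site.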
Note that the vector X_4 can also include other properties, such as the temperature, the salt concentration, and the pH value of the experiment; therefore, we can write

X_4 = (%SASA_buried, T, c, pH),

where T, c and pH are the temperature (in kelvin), the concentration (in molar) and the pH, respectively.
To determine the descriptor vector of a macromolecule, such as a protein, we concatenate the vectors X_1, X_2, X_p and X_4 into a net descriptor vector X of length N = 55. Note that in the expression given by equation (24), other properties can be encoded. For example, we can encode the dielectric properties in the vicinity of each amino acid in the structure by scaling each term of equation (24) by 1/eps_ik, where eps_ik is the dielectric constant of the environment in the vicinity of the mutated amino acid i. This can be taken as a simple distance-dependent dielectric constant between the amino acid i and its nearest neighbor k, eps_ik = D r_ik, where D is a constant, or as another, more complicated distance-dependence function [58,59]. In this work, a more complicated distance-dependent dielectric constant is considered, namely a sigmoidal function of the distance [58,59]. Here, r_ik is the distance between the two amino acids i and k, and eps_w = 80 is the dielectric constant of water. A plot of eps_ik versus the distance r_ik is presented in figure 3 for both the simple distance-dependent dielectric constant and the sigmoidal distance-dependence function. The sigmoidal function gives a smooth variation of the dielectric constant screening the electrostatic interactions, from 2 (the dielectric constant of the protein interior) to 80 (the dielectric constant of bulk water), as shown in figure 3. Note that these fingerprints of the structures are rotation and translation invariant. Furthermore, the sequence of the amino acids in a macromolecular structure, which can be taken from the protein data bank (RCSB PDB) [60], is unique. Therefore, the descriptor vector X is an exclusive representation of a macromolecule in a dataset. The descriptor vector X also has the same length for any set of macromolecules used as input.
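The two dielectric models can be sketched as follows. The sigmoidal form below is a generic logistic interpolation between eps = 2 (protein interior) and eps = 80 (bulk water), standing in for the published function of [58,59]; the midpoint `r0` and `slope` parameters are our own illustrative choices.

```python
import math

def eps_linear(r, D=4.0):
    """Simple distance-dependent dielectric: eps = D * r."""
    return D * r

def eps_sigmoidal(r, eps_p=2.0, eps_w=80.0, r0=6.0, slope=2.0):
    """Hypothetical sigmoidal dielectric: a logistic curve that rises
    smoothly from the protein-interior value eps_p toward the bulk-water
    value eps_w as the distance r (angstroms) grows."""
    return eps_p + (eps_w - eps_p) / (1.0 + math.exp(-(r - r0) / slope))
```

At short range the sigmoidal model screens interactions weakly (eps near 2), while beyond the midpoint it rapidly approaches the bulk-water limit of 80, giving the smooth crossover shown in figure 3.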
It is important to note that if the chemical sample space of the input descriptor vector becomes quite large, then principal component analysis [61] can be performed to reduce the number of degrees of freedom.

Datasets
In this study, we used four different databases. The first contains 642 small organic molecules with known experimental hydration free energies [26]; note that this database has also been the subject of previous studies [23,47]. The second contains 1475 experimental pKa values of ionizable groups in 192 proteins, both wild type (153 proteins) and mutated (39 proteins) [27][28][29][30]. The third has 2693 experimental values of the Gibbs free energy changes in 14 mutant proteins [31,32]. The last database has 7101 quantum mechanics heat of formation calculations [6] (and the references therein), namely QM7, a subset of the GDB13 molecules optimized at the quantum mechanics level with the Perdew-Burke-Ernzerhof hybrid functional (PBE0) [14]. Figure 4 shows the distribution of the average experimental pKa values for each residue in the wild-type and mutated proteins within the database.
In figure 5, we show the distribution of the percentage of each type of mutation in the database for which we know either ∆∆G or ∆∆G H2O , namely 1: single mutation; 2: double mutations, and so on, up to 6: six mutations.
All the data are prepared and optimized in advance using the AMBER force field for proteins and nucleic acids, ff99SBnmr [44,45], and the generalized AMBER force field for small organic molecules [46] (as in [26,47]), using the CHARMM macromolecular computer simulation program [42]. The BSANN code performing the optimization and prediction is written in the Python programming language. The small molecules' descriptor vectors are based on the Coulomb matrix and the sum-over-bond properties. For the macromolecular systems, they consider the chemical-physical fingerprints of the region in the vicinity of each amino acid.
Software implementing the methods discussed in this study is free for download from the website http://hkamberajibu.wikidot.com/machine-learning. Also, one can access the databases of all molecular structure and topology files used in our calculations as prepared with a general AMBER force field from the same site.

Data analysis
Consider a method to learn a function from a finite dataset D of input-output pairs (X, Y), where X is the feature descriptor input vector of each data point and Y is the corresponding reference output, such as the hydration free energy, the change in the Gibbs free energy, the heat of formation, an amino acid pKa, and so on. The dataset is then split into a training dataset D_train used for learning (or gaining experience) and a validation dataset D_valid used for testing the knowledge, such that

D = D_train U D_valid,  D_train and D_valid disjoint.

In this study, we discuss the ability of the training dataset to optimize the parameters of a supervised ANN as a function of the size of the training dataset. We aim to estimate the average bootstrapping confidence interval of the error, which can then be used in predictions for new test data.
It is important to note that an optimal average range is believed to provide the highest confidence level within which most of the predicted values lie. In the following discussion, we use the term 'match' for such cases, that is, when the prediction interval of error coincides with that provided by the experimental value at a statistical confidence of 95%, which is verified using the Student t-distribution.
To determine the average confidence interval, we used the following statistical justification [62]. Suppose that there are N_train training data points, and n is the number of neural networks defining the bootstrapping model given above. Then, the 100(1 - alpha)% bootstrapping confidence interval of the average value of the ith data-point prediction is given by

( <Y_i> - c_i,u sigma_i / sqrt(n),  <Y_i> - c_i,l sigma_i / sqrt(n) ),

where sigma_i is the unbiased standard deviation obtained from the bootstrapping data distribution. c_i,u and c_i,l are the upper and lower critical values, respectively, determined from the empirical distribution function F of the studentized bootstrapping random variable t_i obtained from the data points of the ith prediction [62]:

c_i,l = F^(-1)(alpha/2),  c_i,u = F^(-1)(1 - alpha/2).

Then, the average statistical error over all N_train predicted data points is calculated using error propagation (the chain rule) as

sigma = (1/N_train) ( sum_i sigma_i^2 )^(1/2),

where it is assumed that the statistical errors obtained for each of the training data points are independent, which is indeed the case. Then, the average 100(1 - alpha)% bootstrapping confidence interval of the average predicted value is defined as

<Y> +/- t_(n-1,alpha/2) sigma / sqrt(n).  (34)

Equation (34) is used to determine the upper and lower bounds of the average values in all our predictions shown in the graphs of the following discussion. Note that, in practice, the Student t-distribution can be used to approximate the distribution of the bootstrapping dataset, and hence c_u = -c_l = t_(n-1,alpha/2), the critical value of the t-distribution. In this study, the significance level was alpha = 0.05.
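The bootstrapping confidence interval can be sketched as follows, using empirical percentiles of the replica predictions in place of the studentized critical values (a simplifying assumption on our part).

```python
import random
import statistics

def bootstrap_ci(samples, alpha=0.05):
    """Confidence interval for one data point from the empirical
    distribution F of its replica predictions: the lower and upper
    bounds are the alpha/2 and 1 - alpha/2 empirical quantiles."""
    s = sorted(samples)
    n = len(s)
    lo = s[max(0, int((alpha / 2) * n))]
    hi = s[min(n - 1, int((1 - alpha / 2) * n))]
    return statistics.fmean(s), lo, hi

# Illustrative replica predictions for a single data point.
random.seed(0)
preds = [random.gauss(3.0, 0.5) for _ in range(1000)]
mean, lo, hi = bootstrap_ci(preds)
```

A prediction 'matches' an experimental value, in the sense used above, when the experimental value's interval overlaps the interval (lo, hi) produced here.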
To compare with the state of the art, we introduce in the following another error model [23]. According to this model, the deviation between the predicted value and the experimental reference for each measurement is a sum of four independent terms:

eps = eps_calc + eps_exp + eps_sim + eps_feat.

The first term is the uncertainty, with known variance, from the calculation; the second is the experimental error; the third is a general error due to the simulation settings; and the last term is an error representing the presence of different features in the system (such as the error due to the force field parameters related to a particular feature). Each of these terms is represented by a normal stochastic distribution, characterized by a mean (identifying the systematic error) and a variance (identifying the random error) for each new measurement, which may or may not be in the training dataset. Thus, the total mean and variance are given as [23]

mu = mu_calc + mu_exp + mu_sim + mu_feat,
sigma^2 = sigma_calc^2 + sigma_exp^2 + sigma_sim^2 + sigma_feat^2.  (35)

Each term in equation (35) represents an average value calculated over all bootstrap iterations [23].
Comparing the state of the art of both methods: in the error model introduced here, we do not try to separate the systematic and random errors, as is done in the error model of [23]. Secondly, we did not add the experimental error to our total uncertainty, because we did not have experimental error values for all datasets. It is interesting to note that the error model given in [23] is based on uncertainty estimation by maximizing the likelihood function (given as the product of Gaussian probability densities), with the uncertainty in the most-likely average calculated using the bootstrapping technique. However, a rigorous mathematical formalism can also be derived for the calculation of the mean and variance of the predicted value using the Gaussian process to perform Bayesian training of the ANN [24] under the assumption of infinite width of the neural networks, as shown above. In all our calculations, the values predicted using the bootstrapping technique depend on the average of some quantity mu over a finite number M of ML setups. Following [50], we take this average as an integral of mu weighted with the probability distribution of mu, P(mu):

A_pred = -k_B T ln Integral dmu P(mu) exp(-mu / k_B T).  (36)

Assuming that the mu are functions of identically distributed random variables, then mu has a Gaussian distribution:

P(mu) = (2 pi Var(mu))^(-1/2) exp( -(mu - <mu>)^2 / (2 Var(mu)) ).  (37)

If we require sigma to be comparable with k_B T, we write

sigma = (Var(mu))^(1/2) ~ k_B T,  (38)

where Var(mu) denotes the variance of mu. Note that, in practice, P(mu) may be slightly different from a Gaussian distribution, but close to a Gaussian-like shape. Combining equations (36) and (37), we get

A_pred = <mu> - Var(mu) / (2 k_B T),  (39)

where the first term is the average of mu measured using the bootstrapping technique and the second term depends on the fluctuations of mu. While the first term can be positive or negative, the second one is always negative. Therefore, the accuracy in measuring A_pred depends on the balance between these two terms. We can also write equation (39) as

A_pred = <mu> - gamma,  (40)

where gamma = Var(mu)/(2 k_B T).

Thus, for Var(mu) = n k_B T, where n is an integer, we get

A_pred = <mu> - n sigma / 2.  (41)

For n = 1 (or Var(mu) = k_B T), 95% of the values of mu fall in the region <mu> +/- 2 sigma, and from equation (41) we see that A_pred = <mu> - sigma/2, which falls inside the region where most of the mu are sampled. That is, the bootstrapping approach in ML results in an accurate measure of A_pred. On the other hand, for n > 4 (or Var(mu) > 4 k_B T), more than 97% of the mu values fall in the region <mu> +/- 2 sigma, and from equation (41) we see that A_pred < <mu> - 2 sigma, which falls outside the region where most of the mu are sampled; hence, in this case, an inaccurate measure of A_pred is produced due to sampling inefficiency. However, there is no physical reason to assume that the fluctuations of mu given by equation (38) should be comparable to k_B T. That requirement is often satisfied in computer simulations (such as molecular dynamics and Monte Carlo), resulting in small statistical errors, as will be illustrated here when comparing the values predicted using computer simulations with those predicted by ML. In fact, this is because of the assumption that P(mu) in equation (37) obeys the Maxwell-Boltzmann probability distribution.

Feature selection algorithm
When the training data size is substantially larger than the number of features (the dimension of the feature description vector), the distribution of the properties over the range of the features is reasonably accurate. However, this is not always the case; for instance, in the hydration free energy database, the number of data points used is 415, while the number of features is over 1000. That is, in a training dataset, there might often be only a tiny fraction of the data with non-zero values of some features. Therefore, the computation of the joint probability distribution over all features will be inaccurate.
Estimating the marginal distribution of each term in a classified dataset (training and validation datasets) instead of the joint distribution of all parts is suggested to improve the accuracy significantly [63]. For a limited-size dataset, we identify whether a given feature appears more frequently in one class of the dataset than in the other. For that, we split the data into two categories, namely the training and validation datasets.
In the following, we describe the chi-squared-test algorithm [63] used to select the features from a collection of observed data in a database. Suppose we have N data samples in total across the two classes, and let f be a fixed feature. Here, we use standard statistics to test whether two quantities are significantly correlated. For each feature f, we label the two classes 0 and 1, namely the training and validation datasets, respectively. We denote by n_i,0 the number of data points in class i not containing the feature f, and by n_i,1 the number of data points in class i containing the feature f. In this way, we build a 2 x 2 matrix K with elements n_C,I_f:

K = ( n_00  n_01 )
    ( n_10  n_11 ),  (42)

where C and I_f are two indicator random variables, defined as

C = 1 if a randomly chosen molecule belongs to the validation dataset,
C = 0 if a randomly chosen molecule belongs to the training dataset,  (43)

and

I_f = 1 if a randomly chosen molecule contains the feature f,
I_f = 0 if a randomly chosen molecule does not contain the feature f.  (44)
Therefore, n_ij is a random variable denoting the number of observations with C = i and I_f = j. The algorithm verifies whether these random variables are independent. Note that

N = n_00 + n_01 + n_10 + n_11.  (45)

The marginal probability distributions are given as [63]

P(C = 0) = (n_00 + n_01)/N,  (46)
P(C = 1) = (n_10 + n_11)/N,  (47)
P(I_f = 0) = (n_00 + n_10)/N,  (48)
P(I_f = 1) = (n_01 + n_11)/N.  (49)

The joint probability distribution of the random variables C and I_f is

P(C = i, I_f = j) = n_ij / N,  (50)

which implies that n_ij = N P(C = i, I_f = j). Furthermore, the condition of independence requires that

P(C = i, I_f = j) = P(C = i) P(I_f = j).  (51)

In other words, the condition of independence indicates that a feature is observed regardless of the classified dataset, whether training or validation. The chi-squared test can be used to measure the deviation between the two distributions, namely P(C = i, I_f = j) and the product P(C = i) P(I_f = j) [63]:

chi2_f = sum_(i,j) ( n_ij - N P(C = i) P(I_f = j) )^2 / ( N P(C = i) P(I_f = j) ).  (52)

The value of chi2 is a measure of the confidence in the independence condition; that is, the smaller the value of chi2, the higher the confidence that the feature and the class label are independent. In the next step, we sorted the values of chi2_f obtained for each feature f in decreasing order and considered for training the features with the highest values of chi2.
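The chi-squared score for a single feature can be computed directly from the 2 x 2 table; a sketch:

```python
def chi2_feature(n00, n01, n10, n11):
    """Chi-squared statistic for the 2x2 class/feature contingency table:
    sum over cells of (observed - expected)^2 / expected, where
    expected = N * P(C=i) * P(I_f=j) under the independence hypothesis."""
    N = n00 + n01 + n10 + n11
    row = [n00 + n01, n10 + n11]   # class marginals, N * P(C = i)
    col = [n00 + n10, n01 + n11]   # feature marginals, N * P(I_f = j)
    obs = [[n00, n01], [n10, n11]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / N
            if expected > 0:
                chi2 += (obs[i][j] - expected) ** 2 / expected
    return chi2
```

Scoring every feature this way and keeping those with the largest chi2 selects the features whose occurrence deviates most from independence between the two classes.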
Also, the mathematical model presented in [64,65] is employed to maintain high diversity of the observed properties (such as hydration free energies, pKa values, heats of formation, and changes in Gibbs free energies) in the classified datasets.

Results
In this section, we show some results of the predictions using the different datasets. Table 1 summarizes the Pearson coefficient, the mean absolute error (MAE) (in kcal mol−1), the root mean square error (RMSE) (in kcal mol−1) and the matches (in %) for different sizes of the training dataset. Predictions are based on the neural network parameters optimized using only the training dataset. Our results show that an MAE as small as 1.192 kcal mol−1 is obtained on the validation dataset, corresponding to a Pearson coefficient of 0.886 and an RMSE of 1.770 kcal mol−1, for a training dataset of 350 molecules (or equivalently, 84% of the overall dataset). In that case, the percentage of matches between the predicted and experimental values is 81.5% at a 95% statistical confidence. Figure 6 presents the scatter plots of the experimental and predicted hydration free energies for two different training data sizes, N = 330 and N = 380 molecules. Errors are calculated using the bootstrapping method, and the straight line represents the function f(x) = x. In addition, we indicate the 95% bootstrapping confidence interval of the error with parallel lines. The results show that when a training dataset of 380 molecules is used to train the network, more matches are found between the predicted and the experimental values than with the network trained on a dataset of 330 molecules. That is because the distribution of the data points in the training dataset determines how well the topology of the input data is captured; thus, a larger input dataset can reveal more insight into the structure of the data distribution.

The hydration free energy database
We also used a larger dataset of 630 molecules to train and test the neural network, as shown in figure 7. In this case, we used 560 data points (89% of the dataset) to train the neural network and the remaining roughly 70 data points (about 11%) for validation. Our results show an improvement of the predictions compared with the smaller dataset used above, as indicated in figure 7, where all the validation data points lie inside the 95% bootstrapping confidence interval. We also compared the hydration free energies predicted using computer simulation techniques (such as molecular dynamics and Monte Carlo) with those predicted using the ML approach. The computer simulation values are taken from [23,26,66]. The results are shown graphically in figure 8 for a dataset of 415 molecules, with a training dataset of 380 molecules and a validation dataset of 35 molecules. For this dataset, MAE = 1.140 kcal mol−1, RMSE = 1.499 kcal mol−1 and R = 0.949. Our results indicate an excellent correlation between the values predicted by the two methods. Some discrepancies can be seen for large values of the hydration free energies; however, few experimental measurements are available in that range. We also compared the computer simulation values of the hydration free energies against the experimental values and found MAE = 1.188 kcal mol−1, RMSE = 1.588 kcal mol−1, and R = 0.941. In contrast, the comparison between the hydration free energies predicted by the method presented in this study and the experimental values for the same dataset gives MAE = 0.272 kcal mol−1, RMSE = 0.628 kcal mol−1, and a Pearson correlation coefficient of R = 0.988 (see also table 1). Table 2 presents the results of the predictions on both the training and validation datasets.
The size of the entire dataset is N = 953 pKa calculations. Our results indicate that the Pearson correlation coefficient is above 0.95 on both the training and validation datasets. The smallest MAE and RMSE on the training dataset were 0.104 and 0.164 kcal mol−1, respectively. For the validation dataset, the smallest MAE and RMSE values were 0.269 and 0.416 kcal mol−1, respectively, obtained for a training dataset of about 89% of the entire dataset. In addition, the matches between the experimental and predicted pKa values on the validation dataset reach 82% with 95% statistical confidence. Figure 9 presents the predicted and experimental pKa values as a scatter plot, together with the average 95% bootstrapping confidence interval of the error of the values predicted within the training dataset. Two scenarios are shown, with training data sizes of 84% and 89% of the entire dataset. Interestingly, our results indicate that almost all the pKa predictions for the validation data lie inside the average bootstrapping confidence interval of the error when 89% of the dataset is used to train the neural network.
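A "match" can be read as a validation point whose prediction deviates from experiment by no more than the half-width of the 95% confidence band around the identity line. A minimal sketch of counting matches under that assumption (the band half-width and the toy pKa values are invented, and this is not the published code):

```python
def percent_matches(y_exp, y_pred, half_width):
    """Percentage of points with |experimental - predicted| <= half_width,
    i.e. points lying inside a confidence band of that half-width
    around the identity line f(x) = x."""
    inside = sum(1 for e, p in zip(y_exp, y_pred) if abs(e - p) <= half_width)
    return 100.0 * inside / len(y_exp)

# Toy pKa values and an illustrative 95% band half-width
y_exp = [3.9, 4.4, 6.5, 10.4, 12.1]
y_pred = [4.1, 4.2, 6.1, 10.5, 13.0]
print(percent_matches(y_exp, y_pred, half_width=0.5))  # 4 of 5 inside -> 80.0
```

In practice the half-width would come from the bootstrapping error estimate rather than being fixed by hand.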

The pKa of amino acids in proteins
To check the influence of the dataset size on the learning efficiency, we optimized the neural network on a larger dataset of 1240 pKa calculations. We implemented three different training datasets for optimizing the neural network, to check the influence of the training set size and of the length of the entire dataset, which determine the topology of the input data. We notice that the dataset with N_train = 1100 data points for training (about 89% of the entire dataset) provided the best optimization. The results are summarized in table 2 and plotted in figure 10. Our results show that the MAE decreases by about a factor of two relative to the same percentage of training data taken from the smaller dataset, namely MAE = 0.038 kcal mol−1; a smaller RMSE of about 0.073 kcal mol−1 is also obtained for this dataset. Furthermore, the percentage of matches on the training dataset increases to 95%, with a perfect Pearson correlation between the experimental and predicted values of exactly R = 1.000. Moreover, our results show (figure 9) that the average 95% bootstrapping confidence interval of the error is larger, and all the predicted pKa values of the validation set lie within the bootstrapping confidence interval.

Quantum mechanics database
The results of the predicted heats of formation from quantum mechanics calculations using the PBE0 method are shown in figure 11. The size of the dataset is 7000 molecules. We used two different training sets, of lengths 3000 (43% of the dataset) and 5000 (approximately 71% of the dataset). Our results indicate excellent performance of the optimized neural network on the validation data: using just 43% of the entire dataset for training, only a few validation points fall outside the predicted average 95% bootstrapping confidence interval of the error, and when 71% of the data are used for training, all the tested calculations from the validation dataset lie inside the 95% bootstrapping confidence interval.
As intuitively expected, our results indicate that the length of the data influences the optimization of the neural network and the knowledge extracted from the dataset. This suggests that the topology of the input dataset plays an essential role in the experience gained from training the neural network, a point we return to in the discussion below.

Thermodynamics of proteins
Here, we show the results for the predicted changes in Gibbs free energy due to mutations, for 1063 mutations in different proteins. The ML predictions are shown in figure 12. Our results support the finding that increasing the size of the training dataset improves the bootstrapping confidence interval of the error, and hence the prediction reliability. For example, for a training dataset of 1000 points (or, equivalently, 94% of the total dataset), all the predicted Gibbs free energy values fall inside the 95% bootstrapping confidence interval of the error (see also figure 12).
In table 3, we summarize the calculated Pearson coefficient, MAE (in kcal mol−1), RMSE (in kcal mol−1), and Matches (in %) for different training dataset sizes. Predictions are based on the neural network parameters optimized using only the training dataset. As expected, for a training dataset of 94% of the entire dataset, a significant improvement in predicting the changes in Gibbs free energy is obtained. For example, the Pearson correlation coefficient between the experimental and predicted values for the testing dataset is 0.925, the MAE is 0.488 kcal mol−1, the RMSE is 0.665 kcal mol−1, and the percentage of matches is 71%.
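The Pearson coefficient, MAE, and RMSE reported in table 3 (and throughout) follow their standard definitions. A minimal self-contained sketch with invented values, not the published code:

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy experimental vs. predicted values (kcal/mol)
y_exp = [-6.3, -4.1, -2.5, -8.8, -1.2]
y_pred = [-6.0, -4.5, -2.1, -8.5, -1.6]
print(mae(y_exp, y_pred), rmse(y_exp, y_pred), pearson_r(y_exp, y_pred))
```

Note that RMSE is always at least as large as MAE, which is consistent with the paired values quoted in the tables.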
We have also compared the performance of our algorithm (BSANN) with other ML algorithms, such as the so-called DeepDDG [67], used for predicting the change in Gibbs free energy due to point mutations in proteins. The computations with the DeepDDG algorithm were performed using its online web server [67]. Figure 13 shows the results graphically. It is worth noting that none of the test dataset points was used to train the BSANN algorithm, making this a blind prediction. In this work, we considered mutations in the wild-type Barnase protein [68] (PDB Id: 1BNI) and Bacteriophage T4 Lysozyme [69] (PDB Id: 2LZM). In total, we used about 80 single amino acid mutations of the wild-type structures for the prediction test comparisons. Here, the BSANN algorithm was trained on a dataset of 900 mutations. We also calculated the MAE, RMSE, and Pearson correlation coefficient, as shown in figure 13. Our results indicate that BSANN performs better than DeepDDG on all metrics. It is interesting to note that the straight line fitted to the ∆∆G values predicted by the BSANN algorithm, y = 0.8876x, is closer to the baseline (y = x) than that of the DeepDDG algorithm, y = 0.5839x. At this point, it is difficult to say which algorithm is generally more predictive, because the two algorithms were not optimized on the same training dataset and their input description vectors differ. However, the results of this comparison indicate that BSANN is a promising ML algorithm for predicting changes in protein stability due to point mutations.
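Fitted lines through the origin such as those quoted above (y = 0.8876x and y = 0.5839x) correspond to a least-squares fit of y = kx, with k = Σxy / Σx². A sketch with invented ∆∆G values (not the actual comparison data):

```python
def slope_through_origin(x, y):
    """Least-squares slope for a line y = k*x constrained through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Toy experimental ddG values and a hypothetical set of predictions
ddg_exp = [1.0, -2.0, 0.5, 3.0, -1.5]
ddg_pred = [0.9, -1.8, 0.6, 2.7, -1.4]
k = slope_through_origin(ddg_exp, ddg_pred)
print(f"fitted line: y = {k:.4f}x")
```

A slope near 1 indicates predictions close to the ideal baseline y = x, which is the sense in which BSANN outperforms DeepDDG in figure 13.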

Discussion
In this work, we intend to establish a methodology for an automated supervised machine learning approach for predicting different (macro)molecular properties. In particular, a training dataset of molecules D is created from N_train pairs (X_i, Y_i) for i = 1, 2, . . . , N_train, where X denotes the feature descriptor vectors, of dimension N_features × N_train, and Y the reference values, of dimension N_properties × N_train. The aim is to obtain an estimate of the probability P(Y⋆|D, X⋆) of predicting the output properties of an optimized neural network for any input test data point X⋆. This calculation is now an automated process, since the black box is trained to predict the output value described by the probability P(Y⋆|D, X⋆), which makes the prediction of physical properties an efficient automated process.
However, the accuracy of the estimate of P(Y⋆|D, X⋆) is data-driven, and the prediction of Y⋆ depends on the training dataset used. In particular, it depends on the diversity of the feature descriptors for the dataset of molecules, that is, on N_features, and on the size of the dataset, N_train. The diversity of the feature descriptors and the size of the dataset are interconnected; however, a large dataset is difficult to establish in practice because of the lack of experimental data, and quantum mechanics data may be expensive to obtain. Moreover, increasing the dimensions of the input feature descriptor vector, N_features × N_train, is equivalent to increasing the amount of information processed by the computer. Based on the mass-energy-information equivalence principle [70-72], this can be expected to increase the amount of irreversible heat generated during data processing, which relates to another computing concept, namely 'big-data' processing. In [72], a formalism has been introduced that attempts to quantify the weight of big-data information using a physical interpretation of information and the mass-energy-information equivalence principle. This allows an equivalence to be established between the amount of input training data necessary for an accurate prediction of P(Y⋆|D, X⋆) and the limit on the amount of information a computer can process, considering the heat generated during processing, and hence the amount of external work necessary to process that amount of big data.
Furthermore, to characterize the stability of the input training data, the rigorous mathematical model introduced in [50] can be employed. It is important to note that in the discussion above the feature descriptor vectors of each compound are considered time-invariant; thus, they represent only two- and three-dimensional feature descriptors of the (macro)molecules. However, higher-dimensional feature descriptor vectors can also be constructed, such as four-dimensional ones in which time is the fourth component. In that case, to build the three-dimensional part of the feature descriptor vectors, different conformations of the compounds can be considered, for example as generated from molecular dynamics simulations. One can then map the three-dimensional configurations of the compounds produced by the simulations onto a three-dimensional grid, where the grid-point centers represent the average position of each atom, obtained from its fluctuations after the configurations have been aligned to remove the overall translational and rotational motion of the compounds. The feature descriptor vectors derived from these average structures, mapped onto a three-dimensional grid, are therefore translation and rotation invariant. A review of such higher-dimensional descriptor vectors is given in [73].
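The averaging step described above can be sketched as follows, assuming the conformations have already been aligned to remove overall translation and rotation. The data layout and the toy frames are invented for illustration:

```python
def average_structure(frames):
    """Average atomic positions over pre-aligned MD frames.

    frames: list of conformations, each a list of (x, y, z) atom
    coordinates; all frames must list atoms in the same order.
    Returns one averaged (x, y, z) per atom, suitable for mapping
    onto a 3D grid as a time-invariant descriptor.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    avg = []
    for a in range(n_atoms):
        sx = sum(f[a][0] for f in frames) / n_frames
        sy = sum(f[a][1] for f in frames) / n_frames
        sz = sum(f[a][2] for f in frames) / n_frames
        avg.append((sx, sy, sz))
    return avg

# Two toy aligned frames of a three-atom fragment
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.2, 0.0, 0.0), (1.0, 0.2, 0.0), (0.8, 1.0, 0.2)],
]
print(average_structure(frames))
```

The averaged coordinates would then be binned onto the three-dimensional grid discussed above to build the descriptor.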

Conclusions
This study presented a methodology for automating the prediction of (macro)molecular properties using a new algorithm integrated into an ML approach. We presented prediction results for four different databases of molecular and macromolecular system properties. Each dataset contains experimental values, including their errors when provided in the literature.
Furthermore, we showed how to create an input descriptor vector for a supervised ANN for small organic molecules and macromolecular systems. The descriptor features included both the two-dimensional (macro)molecular fingerprints and the three-dimensional structure of the systems. Moreover, we presented a statistical approach for estimating the bootstrapping confidence interval of the error.
The application of this new algorithm to our data indicated that the topological chemical spaces spanned by the molecular description vectors are crucial to the relevance or irrelevance of perturbations in the data analysis. Furthermore, we envision that persistent homology can play a role analogous to that of renormalization group theory in statistical physics applied to equilibrium phenomena, in understanding the relevant or irrelevant interactions. In this analogy, the resolution scaling factor in the topological data analysis can be considered similar to the characteristic correlation length scale that governs the treatment of strong interactions and correlations in renormalization group theory. In addition, we introduced a mathematical framework for representing the input dataset to characterize the stability of the data, together with a new algorithm for feature selection.
In addition, the comparison between the DeepDDG and BSANN algorithms indicated that BSANN can be a good ML algorithm for predicting changes in protein stability due to point mutations.

Associated content

Software
The software (written in Python programming language) is published on the following website: http://hkamberajibu.wikidot.com/machine-learning.

Datasets
The three-dimensional structures, optimized with the AMBER force field using the CHARMM program, are included at: http://hkamberajibu.wikidot.com/machine-learning.

Databases
The databases containing the information about the systems studied here and the experimental information are published on the following website: http://hkamberajibu.wikidot.com/machine-learning.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.