Simulation data for an estimation of the maximum theoretical value and confidence interval for the correlation coefficient

The data presented in this article are related to the article titled "Molecular Dynamics as a tool for in silico screening of skin permeability" (Rocco et al., 2017) [1]. Knowledge of the confidence interval and maximum theoretical value of the correlation coefficient r can prove useful to estimate the reliability of developed predictive models, in particular when there is great variability in compiled experimental datasets. In this Data in Brief article, data from purposely designed numerical simulations are presented to show how much the maximum r value is worsened by increasing the data uncertainty. The corresponding confidence interval of r is determined by using the Fisher r→Z transform.


Reduced set (Reduced_set.pdf) modified by randomly generated errors.

Data source location
Not applicable

Data accessibility
Data are contained in this article and in the files Reduced_set.pdf and simulation_data.xlsx.

Value of the data
When there is great variability in a compiled experimental dataset, considerations on the confidence interval for the correlation coefficient r and on the maximum theoretical value achievable for r can offer hints as to what to expect from a predictive model based on that set.
Numerical simulations used to generate a dataset of arbitrary average uncertainty, and to estimate a confidence interval around the correlation coefficient r and its maximum theoretical value, are easily applicable to all experimental datasets.
The data proposed here can be easily used to derive the range of r that can be pursued when the variability of a given dataset is known.
Along with well-known statistical parameters (such as r, $r^2$, $q^2$, F, SE, etc.), the confidence interval of r proposed here can become a meaningful parameter to better evaluate the reliability of a given model and to understand whether there is still room for statistical improvement.

Data
Data presented here represent maximum theoretical average values and confidence intervals for the correlation coefficient r and the determination coefficient $r^2$, as obtained through numerical simulation (Table 1). The values of r and $r^2$ correspond to different simulated levels of random error (ε) in the experimental data set.
Original data, on which the data in Table 1 are based, are contained in the files Reduced_set.pdf and simulation_data.xlsx. Reduced_set.pdf contains a set of 80 permeability coefficients $k_p$ [1] assembled as the intersection of Flynn's set [2] and the Fully Validated data set [3]. The file simulation_data.xlsx contains data from the numerical simulation described below.

Experimental design, materials and methods
Given a set of experimental data, $y_i$, we can assume that a perfect estimator ϕ for the set is known (in [1], the $y_i$ correspond to $pk_p$ values). ϕ is a mathematical function that correlates a set of variables $\{x_{ij}\}$ with the experimental value $y_i$, where $x_{ij}$ represents the j-th molecular property of the i-th molecule (Eq. (1)):
$$y_i = \phi(\{x_{ij}\}) \qquad (1)$$
The correlation, based on a perfect estimator, yields a correlation coefficient r = 1.
For every $y_i$, we introduce an error $\varepsilon \cdot c_{ik} \cdot y_i$, where $\{c_{ik}\}$ is a set of normally distributed pseudorandom numbers with zero mean and unit standard deviation (obtained by applying the Box-Muller transform [4] to a set of uniformly distributed random numbers); ε corresponds to the standard deviation of the errors, normalized by $y_i$. For the k-th simulation, Eq. (1) becomes Eq. (2):
$$y_i + \varepsilon\, c_{ik}\, y_i = \phi_k(\{x_{ij}\}) \qquad (2)$$
Since $\phi_k$, by definition, is a perfect estimator, the values of r obtained for Eq. (2) in the simulation are the maximum theoretical correlation coefficients achievable given the uncertainty introduced (ε).
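The perturbation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the values in `y` are hypothetical stand-ins (they are not the data in Reduced_set.pdf), and the uniform deviates are assumed to be drawn on the unit interval.

```python
import math
import random


def box_muller_pair(u1, u2):
    """Map two uniform deviates (u1 in (0, 1], u2 in [0, 1)) to two standard-normal deviates."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_deviates(n, rng):
    """Return n pseudorandom numbers c_ik with zero mean and unit standard deviation."""
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(u1) is finite
        u2 = rng.random()
        out.extend(box_muller_pair(u1, u2))
    return out[:n]


def perturb(y, eps, c):
    """Return y_i + eps * c_ik * y_i for every y_i (one simulation k)."""
    return [yi + eps * ci * yi for yi, ci in zip(y, c)]


# Hypothetical stand-in values, NOT the data in Reduced_set.pdf.
y = [1.2, 1.9, 2.4, 2.8, 3.1, 3.7, 4.0, 4.6, 5.3, 6.1]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
c_k = normal_deviates(len(y), rng)
y_k = perturb(y, eps=0.10, c=c_k)
for yi, yki in zip(y, y_k):
    print(f"{yi:.2f} -> {yki:.3f}")
```

Because the error is proportional to $y_i$, larger values receive larger absolute perturbations, which is what makes ε a relative (normalized) standard deviation.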
For different values of ε, the numerical simulation is repeated 99 times (l = 99), obtaining 99 correlation coefficients $r_k$ (simulation_data.xlsx). Table 1 shows how much r and $r^2$ worsen when ε increases and confirms that the formula maximum $r^2 \cong (1 - \varepsilon)$ is an approximate yet reasonable way to estimate the worsening effect of ε.
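The repeated simulation can be sketched as below, under two assumptions that are not stated in this article: each $r_k$ is taken as the Pearson correlation between the unperturbed values (which a perfect estimator would reproduce exactly) and the perturbed ones, and `random.gauss` is used in place of the Box-Muller step for brevity. The values in `y` and the ε levels are hypothetical, so the printed numbers are not expected to reproduce Table 1.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def simulate_r(y, eps, runs, rng):
    """Return `runs` correlation coefficients r_k for the relative error level eps."""
    r_values = []
    for _ in range(runs):
        # random.gauss stands in for the Box-Muller step described in the text.
        y_k = [yi * (1.0 + eps * rng.gauss(0.0, 1.0)) for yi in y]
        r_values.append(pearson_r(y, y_k))
    return r_values


# Hypothetical stand-in values, NOT the data in Reduced_set.pdf.
y = [1.2, 1.9, 2.4, 2.8, 3.1, 3.7, 4.0, 4.6, 5.3, 6.1]
rng = random.Random(42)
for eps in (0.05, 0.10, 0.20, 0.30):  # hypothetical error levels
    r_values = simulate_r(y, eps, runs=99, rng=rng)
    mean_r = sum(r_values) / len(r_values)
    mean_r2 = sum(r * r for r in r_values) / len(r_values)
    print(f"eps={eps:.2f}  mean r={mean_r:.3f}  mean r^2={mean_r2:.3f}  1-eps={1 - eps:.2f}")
```

Printing 1 − ε next to the simulated mean $r^2$ gives a quick way to see, for a given dataset, how closely the approximate formula tracks the simulation.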
As for the confidence interval around r, it can be estimated, for each value of r, by using the Fisher r→Z transform [5]:
$$Z = \frac{1}{2}\ln\frac{1 + r}{1 - r} \qquad (3)$$
We apply the Fisher r→Z transform to the $r_k$ values, obtaining 99 $Z_k$ values. Unlike r, Z tends to a normal distribution as the number of data becomes large. Therefore, the standard deviation $S_z$ can be calculated by Eq. (4). The 95% confidence interval around Z is then calculated as $(\bar{Z} - 1.96 \cdot S_z;\ \bar{Z} + 1.96 \cdot S_z)$, and the 95% confidence interval around r is obtained from it through the reverse transform (Eq. (5)):
$$r = \frac{e^{2Z} - 1}{e^{2Z} + 1} \qquad (5)$$
The confidence intervals around r and $r^2$ for different values of ε are shown in Table 1.
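A minimal sketch of the confidence-interval calculation follows. It assumes that $S_z$ is the sample standard deviation of the $Z_k$ values (denominator l − 1), which is one plausible reading and may differ from the exact form of Eq. (4); the `r_values` list is hypothetical and does not come from simulation_data.xlsx.

```python
import math
import statistics


def fisher_z(r):
    """Fisher r -> Z transform; equivalent to atanh(r)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def inverse_fisher_z(z):
    """Reverse transform Z -> r; equivalent to tanh(z)."""
    return (math.exp(2.0 * z) - 1.0) / (math.exp(2.0 * z) + 1.0)


def confidence_interval_r(r_values, z_crit=1.96):
    """Return the (lower, upper) confidence limits for r from a list of r_k values."""
    z_values = [fisher_z(r) for r in r_values]
    z_mean = statistics.mean(z_values)
    # Assumption: sample standard deviation of Z_k (l - 1 in the denominator).
    s_z = statistics.stdev(z_values)
    lower = inverse_fisher_z(z_mean - z_crit * s_z)
    upper = inverse_fisher_z(z_mean + z_crit * s_z)
    return lower, upper


# Hypothetical r_k values, NOT taken from simulation_data.xlsx.
r_values = [0.91, 0.93, 0.89, 0.94, 0.92, 0.90, 0.95, 0.92]
low, high = confidence_interval_r(r_values)
print(f"95% CI for r: ({low:.3f}, {high:.3f})")
```

Working in Z space and transforming back keeps both limits inside (−1, 1) and yields an interval that is asymmetric around r, which is appropriate when r is close to 1.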

Transparency document. Supplementary material
Transparency data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2017.07.045.