Keywords
Cloned data, Linear regression, Non-linear regression, Residuals
In situations where data are confidential and cannot be shown, there is a need for an alternative or matching set of data that can play the role of the actual data. Such an alternative or matching set of data is called cloned data. Cloned datasets therefore give a model-free way of representing confidential data. One natural use of cloned datasets is to preserve the confidentiality of sensitive data for publication purposes, where having datasets with the same fit as the original data is the main advantage. Anscombe (1973) provided four cloned datasets to show the significance of graphs in a statistical study. All these cloned datasets have identical summary statistics (e.g., mean, variance, and correlation) but different data graphics (scatter plots). Chatterjee and Firat (2007) presented a technique for producing different (bivariate) datasets with the same summary statistics but dissimilar graphs by applying a genetic algorithm-based method. The idea of generating cloned data with the same fit for simple and multiple linear regression has been explained by Haslett and Govindaraju (2009, 2012) and Govindaraju and Haslett (2008). Govindaraju and Haslett (2008) introduced the idea of cloning datasets using simple linear regression in the bivariate case. In all cases, the regression estimates are the same, and the variability decreases from one iteration to the next. Haslett and Govindaraju (2009) explained the procedure for generating matching or cloned datasets in the multivariate case.
The procedure by Haslett and Govindaraju gives a substitute way of presenting confidential data so that a multiple regression analysis has the same fit on the cloned data as on the original data, while the data themselves have been changed enough that they are no longer confidential. The advantage is that parameter estimates from the cloned data and the original data do not involve any model error. Haslett and Govindaraju (2012) addressed the issue of how to enhance the algorithm so as to produce several cloned datasets that generate the same fitted regression equations. The key observation is that the fitted slope and intercept are merely estimates, and that somewhat dissimilar datasets can still generate the same estimates. Anscombe (1973) used four different fictitious datasets (see Table 1) to show that the regression estimates and their standard errors are identical even though the graphs differ markedly (presented in Figure 1), but did not elaborate on how such data can be obtained. Chatterjee and Firat (2007) used the data given in Anscombe (1973) and showed, using a genetic algorithm, that all four datasets have identical summary statistics but different graphs. Govindaraju and Haslett (2008) explained the procedure to generate cloned data for a simple linear regression model $y_i = a + b x_i + e_i$ as follows.
1) Consider $n$ pairs of observations for $X$ and $Y$, i.e., $(x_i, y_i)$; obtain the fitted values $\hat{y}_i$ by the simple regression of $Y$ on $X$, and obtain $\hat{x}_i$ by the simple inverse regression of $X$ on $Y$.
2) The simple regression of $\hat{y}_i$ on $\hat{x}_i$ has the same parameters.
3) Further, obtain $\hat{\hat{y}}_i$ and $\hat{\hat{x}}_i$ by repeating step 1 on the fitted values, and observe that the simple regression of $\hat{\hat{y}}_i$ on $\hat{\hat{x}}_i$ has the same parameters.
Iterating steps 1 & 2 can generate several fictitious or cloned datasets. They used the datasets given in Anscombe (1973) to generate several cloned datasets and showed that all cloned datasets giving the same regression estimates for first, second, third, fifth, and tenth iteration but different scatter plots. They also identify that and . The cloned datasets generated by four fictitious datasets given in Anscombe (1973) provided the same mean of X and Y, the correlation between X and Y, coefficient of determination R2, adjusted R2, regression fit, and standard error of the slope. But the variance of X and Y, standard error of residual and standard error of intercept decreases as the iterations increase. It shows regression towards the mean, i.e. every next cloned dataset is closer to the mean. Haslett and Govindaraju (2009) explained the procedure for generating cloned or matched datasets for a multivariate case that has the same fit. They consider identically independently distributed data for multiple regression models
Haslett and Govindaraju (2009) explained the procedure for generating cloned or matched datasets in the multivariate case that have the same fit. They consider independently and identically distributed errors for the multiple regression model $Y = X\beta + \varepsilon$, where $Y$ is the vector of responses, $X = (X_1, X_2, \ldots, X_p)$ is the $n \times p$ design matrix, $\beta$ is the unknown $p \times 1$ vector of parameters, and $\varepsilon$ is the $n \times 1$ vector of errors. The OLS estimate of $\beta$ is $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$.
They used the mean-corrected forms of the response variable $y$ and the independent variables $x_1, x_2, \ldots, x_p$. Because of the mean correction, the above multiple regression model can be written, without an intercept term, as $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$.
They explained the procedure in six steps and generated $y_{\text{new}}, x_{1,\text{new}}, \ldots, x_{p,\text{new}}$, which have the same fit as the original model. The cloned dataset generated by Haslett and Govindaraju (2009) gives the same regression fit and the same sample means of $Y, X_1, X_2$, but the variances of $Y, X_1, X_2$ and the residual standard error are less than those of the raw data. Haslett and Govindaraju (2012) developed cloning algorithms for simple and multiple linear regression models. They fit the linear regression of $y$ on $x$ (where $x$ and $y$ are mean-centred) on the original data and find its estimates and residuals. The residuals are added to the data $y$ one by one to create $n^2$ data points; the linear regression of $y_{\text{new}}$ on $x_{\text{new}}$ is then fitted, resulting in estimates identical to those from the original dataset. The same cloning algorithm can also be used in the multivariate case, and in both cases the parameter estimates from the original and cloned datasets are identical. They explained the following methods to generate cloned datasets (a minimal R sketch of the residual-addition algorithm is given after this list):
1) Cloning via supplementing data by zero-mean additions (bivariate case)
2) Cloning via supplementing data by zero-mean additions (multivariate case)
3) Bivariate data cloning by regressing y on x and x on y
4) Cloning for multiple regression via pivots
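A minimal R sketch of the bivariate residual-addition algorithm, again using Anscombe's first dataset (variable names are illustrative):

```r
# Anscombe's first dataset, mean-centred
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
x <- x - mean(x)
y <- y - mean(y)
n <- length(y)

r <- residuals(lm(y ~ x))

# n^2 cloned points: each residual is added to a complete copy of y
y_new <- rep(y, times = n) + rep(r, each = n)
x_new <- rep(x, times = n)

coef(lm(y ~ x))          # original estimates
coef(lm(y_new ~ x_new))  # identical estimates on the cloned data
```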
Here, we use the model presented in Haslett and Govindaraju (2012) to provide cloned datasets for bivariate and multivariate non-linear regression models with the same non-linear regression fit.
We consider the non-linear regression of $y$ on $X$, where both $X$ and $y$ are non-mean-centred, with $n$ data points. R software was used for all analyses.
In general, the non-linear regression model is $y = h(X, \beta) + \varepsilon$,
with $y$ being the response variable, $X$ the covariate design matrix, which is often controlled by the researcher, $\beta$ the model parameters characterizing the relationship between $X$ and $y$ through the regression function $h$, and $\varepsilon$ the model errors, assumed to be normally distributed with zero mean and unknown variance $\sigma^2$.
When the regression function h is linear in the parameters β, it leads to linear regression analysis. However, linear models are not always appropriate, so one often needs to apply a non-linear regression model where h is non-linear in β.
As in linear regression, non-linear regression provides parameter estimates based on the least squares criterion. However, unlike linear regression, no explicit mathematical solution is available, and specific algorithms involving iterative numerical approximation are needed to solve the minimization problem. Here, since this is a bivariate non-linear regression on non-mean-corrected data, $X = x$ is a column vector. In general, provided $X$ is of full rank, the ordinary least squares estimate of $\beta$ is, of course, $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - h(x_i, \beta) \right)^2$.
Now add the residuals $r = y - h(x, \hat{\beta})$ from the model fit back to the data, so that the original data are replicated as a block $n$ times to create an $n^2 \times 1$ vector, and to each block one of the residuals is added. The first block is $y + \mathbf{1}r_1$, where $\mathbf{1}$ is an $n \times 1$ vector of 1's and $r_1$ is the first residual. The data are now $\mathbf{1} \otimes y + r \otimes \mathbf{1}$, and the design matrix becomes $\mathbf{1} \otimes x$. Noting that the model is still the same, i.e., a bivariate non-linear regression, if $\mathbf{1} \otimes y + r \otimes \mathbf{1}$ is now regressed on $\mathbf{1} \otimes x$, the OLS estimate $\hat{\beta}_{\text{new}}$ is equal to $\hat{\beta}$. Thus, the non-linear regression estimates for the cloned data are unchanged because the sum of the residuals, $\mathbf{1}^{T}r$, is zero. The R software has been used to obtain the numerical results.
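A minimal R sketch of this cloning step, using the power-curve data of Example 1 below (nls() carries out the iterative least squares fit; the starting values are illustrative):

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.98, 4.26, 5.21, 6.10, 6.80, 7.50)
n <- length(y)

# Fit the bivariate non-linear model y = a * x^b
fit <- nls(y ~ a * x^b, start = list(a = 1, b = 1))
r   <- residuals(fit)

# Cloned data: 1 (x) y + r (x) 1 for the response, 1 (x) x for the design
y_clone <- rep(y, times = n) + rep(r, each = n)
x_clone <- rep(x, times = n)

# Refit on the cloned data; the estimates match the original fit
fit_clone <- nls(y_clone ~ a * x_clone^b, start = list(a = 1, b = 1))
rbind(original = coef(fit), cloned = coef(fit_clone))
```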
In fact, anything can be added: if $\{a_l : l = 1, 2, \ldots, m\}$ is added to each data point in the set $\{y_i : i = 1, 2, \ldots, n\}$, the only condition is that $\sum a_l = 0$. Some additions are more useful than others.
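For example, a short sketch of cloning via arbitrary zero-mean additions (the additions below are illustrative, with m = 3):

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.98, 4.26, 5.21, 6.10, 6.80, 7.50)

a <- c(-0.3, 0.1, 0.2)   # any set of additions with sum(a) == 0

y_clone <- rep(y, times = length(a)) + rep(a, each = length(y))
x_clone <- rep(x, times = length(a))
```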
Example 1: The following cloned dataset (Table 2) is generated from the dataset $X = (1, 2, 3, 4, 5, 6)^T$ and $Y = (2.98, 4.26, 5.21, 6.10, 6.80, 7.50)^T$ for the non-linear regression model $Y = aX^b$, a geometric or power curve. The parameter estimates for this cloned dataset are summarized in Table 2b. This model can be suitable for data arising in many fields whenever the plotted data follow the form $Y = aX^b$. It can be observed that the estimates obtained by the cloning procedure in Table 2b are the same as the actual estimates.
Example 2: The cloned dataset (Table 3) is generated from the dataset $X = (0, 1, 2, 3, 4, 5, 6, 7, 8)^T$ and $Y = (0.75, 1.20, 1.75, 2.50, 3.45, 4.70, 6.20, 8.25, 11.50)^T$ for the non-linear regression model $Y = ab^X$, an exponential curve. If sensitive observed data follow the exponential curve $Y = ab^X$, this procedure can be used to clone the data. It can be observed that the estimates obtained by the cloning procedure in Table 3b are similar to the actual estimates.
| | Estimates | Std. Error | Variables | Mean | Variance | RSE | Corr. |
|---|---|---|---|---|---|---|---|
| a | 0.97 | 0.037728 | X | 4 | 7.50 | Y\|X | - |
| b | 1.36 | 0.007540 | Y | 4.5 | 12.95 | 0.139762 | 0.954 |
| a_clone | 0.96 | 0.015889 | X_clone | 4 | 6.75 | Y_clone\|X_clone | - |
| b_clone | 1.36 | 0.003210 | Y_clone | 4.5 | 11.67 | 0.177424 | 0.954 |
Example 3: The cloned dataset (Table 4) is generated from the dataset $X = (1, 2, 3, 4, 5, 6)^T$ and $Y = (1.6, 4.5, 13.8, 40.2, 125.0, 363.0)^T$ for the non-linear regression model $Y = ae^{bX}$, an exponential curve. If sensitive data follow the non-linear regression shape of $Y = ae^{bX}$, such a cloning procedure would be helpful. It can be observed that the estimates obtained by the cloning procedure in Table 4b are equal to the actual estimates.
Example 4: The cloned dataset (Table 5) is generated from the dataset $X = (0, 1, 2, 3, 4, 5)^T$ and $Y = (58, 66, 72.5, 78, 82, 85)^T$ for the non-linear regression model $Y = ka^{b^X}$, the Gompertz curve. Parameter estimates of the raw and cloned datasets are shown in Table 5b.
Example 5: The cloned dataset (Table 6) is generated from the dataset $X = (0.5, 0.5, 1, 1, 2, 2, 4, 4, 8, 8, 16, 16)^T$ and $Y = (0.96, 0.91, 0.86, 0.79, 0.63, 0.62, 0.48, 0.42, 0.17, 0.21, 0.03, 0.05)^T$ for the non-linear regression model $Y = ks^{X}b^{c^{X}}$, the Makeham curve. If observed sensitive data follow the non-linear regression shape of the Makeham curve, such a cloning procedure would be beneficial, as the estimates are close. It can be observed that the estimates obtained by the cloning procedure in Table 6b are the same as the actual estimates.
Example 6: The cloned dataset (Table 7) is generated from the dataset $X = (0, 1, 2, 3, 4, 5, 6, 7, 8)^T$ and $Y = (0.75, 1.20, 1.75, 2.50, 3.45, 4.70, 6.20, 8.25, 11.50)^T$ for the non-linear regression model $Y = k + ab^X$, a modified exponential curve. For sensitive data showing the pattern of a modified exponential curve, the procedure explained above, together with the tables and their estimates, would be beneficial. It can be observed that the estimates obtained by the cloning procedure in Table 7b are equal to the actual estimates.
Example 7: The following cloned dataset (Table 8) is generated from the dataset $X = (0, 1, 2, 3, 4, 5, 6, 7, 8)^T$ and $Y = (1225, 2879, 4994, 11525, 16190, 22573, 30677, 38517, 39003)^T$ for the non-linear regression model of the logistic curve. If the curve of the observed data is logistic in form, the Table 8 procedure for cloning the data would be suitable. It can be observed that the estimates obtained by the cloning procedure in Table 8b are identical to the actual estimates.
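A hedged R sketch of this example using R's self-starting logistic model SSlogis, which fits Asym/(1 + exp((xmid - x)/scal)); this standard parameterization may differ from the one used for Table 8, but the cloning step is identical:

```r
x <- 0:8
y <- c(1225, 2879, 4994, 11525, 16190, 22573, 30677, 38517, 39003)
n <- length(y)

# Self-starting logistic fit: no manual starting values needed
fit <- nls(y ~ SSlogis(x, Asym, xmid, scal))
r   <- residuals(fit)

# Same residual-addition cloning step as before
y_clone <- rep(y, times = n) + rep(r, each = n)
x_clone <- rep(x, times = n)

fit_clone <- nls(y_clone ~ SSlogis(x_clone, Asym, xmid, scal))
rbind(original = coef(fit), cloned = coef(fit_clone))
```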
The algebra for the bivariate non-linear regression is unaltered for multivariate non-linear regression, except that the matrix $X$ becomes $(x_1, x_2, \ldots, x_p)$, and the parameter vector and its estimate, $\beta$ and $\hat{\beta}$, become $(p + 1) \times 1$ vectors.
Example 8: The following cloned dataset (Table 9) is generated from the datasets $X_1 = (23.81, 75.83, 9.46, 5.71, 85.78, 0.37, 8.82, 8.99, 37.65)^T$, $X_2 = (11.33, 25.92, 7.03, 29.68, 21.81, 0.57, 11.25, 19.01, 75.25)^T$ and $Y = (22.76, 76.73, 8.62, 10.98, 86.77, 0.97, 11.82, 16.63, 67.40)^T$ for the non-linear regression model of the constant elasticity of substitution (CES) production function. Parameter estimates of the raw and cloned datasets are shown in Table 9b.
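As a sketch of the multivariate cloning step, the code below uses a simple two-covariate power model on synthetic data (it is not the CES fit of Table 9; model, parameters, and data are illustrative):

```r
set.seed(1)
n  <- 9
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 * x1^0.6 * x2^0.3 + rnorm(n, sd = 0.1)  # synthetic response

fit <- nls(y ~ a * x1^b * x2^c, start = list(a = 1, b = 0.5, c = 0.5))
r   <- residuals(fit)

# Replicate every column and add one residual per block of y
y_c  <- rep(y, times = n) + rep(r, each = n)
x1_c <- rep(x1, times = n)
x2_c <- rep(x2, times = n)

fit_c <- nls(y_c ~ a * x1_c^b * x2_c^c, start = list(a = 1, b = 0.5, c = 0.5))
rbind(original = coef(fit), cloned = coef(fit_c))
```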
In this article, we presented cloned datasets for bivariate and multivariate non-linear regression models with the same non-linear regression fit. The application of such cloned datasets is in maintaining the confidentiality of sensitive real data for publication purposes. In this context, new methods can be developed so that cloning is possible for non-linear regression models. A question this study addresses is how cloning techniques improve on simulation and re-sampling. The simulation approach assumes that the model is known and then generates random data from the distribution of the response variable to illustrate the sampling variability in the estimates; re-sampling estimates the precision of sample statistics by using a subset of the available data or by drawing randomly with replacement from a set of data points. Unfortunately, these approaches do not help to explain the concept of regression or the idea of 'moving towards' the mean. The methods presented in this study are intended to fill this gap by yielding a sequence of matching datasets with the same fitted regression equation, in which the variability of the response variable Y and the explanatory variable X progressively reduces. The tendency of moving towards the means, rather than the conditional mean, is also demonstrated.
All data underlying the results are available as part of the article and no additional source data are required.
This research is fully sponsored by Landmark University Centre for Research and Development, Landmark University, Omu-Aran, Nigeria.