MaskDensity14: an R package for the density approximant of a univariate based on noise multiplied data

Lin (2014) developed a framework of the method of the sample-moment-based density approximant, for estimating the probability density function of microdata based on noise multiplied data. Theoretically, it provides a promising method for data users in generating the synthetic data of the original data without accessing the original data; however, technical issues can cause problems implementing the method. In this paper, we describe a software package called MaskDensity14, written in the R language


Motivation and significance
Confidential data are not allowed to be issued to the public without certain levels of protection. A number of methods for protecting data have been recommended and used in practice (see Duncan and Lambert [1], Willenborg and De Waal [2], Oganian [3], Shlomo [4], and references therein). * Corresponding author.
The multiplicative noise method is one method for providing data protection (Kim and Jeong [5]). The method is briefly described as follows. Let y 1 , y 2 , . . . , y N (the original data) be a sample drawn from a sensitive random variable Y . Let C be a positive random variable, independent of Y . When we say the original data y 1 , y 2 , . . . , y N were masked by C, it means their masked data have the form y A key issue is how to recover the statistical information of the original data based on their noise multiplied data. Recently, many data analysis methods to recover the original information from noise multiplied data have been developed (Kim and Jeong [5]; Sinha, et al. [6]; Hwang [7] and Lin and Wise [8]). These methods are not standard, and some of them are complex to use. As a consequence, Lin [9] introduced a framework on the sample-moment-based density approximant. The framework provides a method for estimating the probability density function of the original data based on noise multiplied data. It gives a solution for generating the synthetic data of the original data without accessing the data themselves. Thus, using standard statistical inference methods to analyze the original data based on noise multiplied data becomes possible. Developing software to implement the method is desirable.
We briefly describe the computational statistical approach developed by Lin [9] as follows. Let {y i } N 1 be a set of original sample data drawn from a random variable Y . The data were masked by a noise C which yielded masked data {y * i } N 1 . Let {c i } N 1 , not the same sample used to mask {y i } N 1 , be another independent sample drawn from C. Assume that Y has moment generating function. Thus, the density function f Y of Y can be determined by its moments. The sample-moment-based density approximant of the density function of Y is defined as is a polynomial function of y, i.e. a continuous function of y (the details see Lin [9]), where a and b, used in Lin [9], are min 1≤i≤N {y i } and max 1≤i≤N {y i }, respectively. Lin [9] showed that f Y,K |{y * i ,c i } N 1 can well represent the density function of Y given that the sample size N and the upper order of moment K are appropriate (for brevity, we sometimes use "the upper order K " instead of "the upper order of moment K "). Thus, the sample drawn from f Y,K |{y * i ,c i } N 1 can be considered as synthetic data of the original data. The upper order K was determined with reference to the density function f Y of the original data in Lin [9].
To implement the technique developed by Lin [9] in practice, there are a number of technical issues to solve. Firstly, the upper order K has to be determined without the reference to f Y , as f Y is not available. Secondly, it is desirable that the boundaries a and b are determined without using the information max 1≤i≤N {y i } and min 1≤i≤N {y i } directly, as the information might be confidential.

Software description
The MaskDensity14 software presented in this paper uses the R language, which is widely used system by statisticians (see The R Package for Statistical Computing http://www.rproject.org). MaskDensity14 follows the standard manner to obtain the smoothed function of the sample-moment-based den- (y) at 512 equal distance points on [a, b]. Then, MaskDensity14 uses standard smoothing and normalizing techniques to obtain the smoothed samplemoment-based density approximant f Y,K |{y * i ,c i } N 1 . The kernel R package adopted by MaskDensity14 is ks (see http://cran.rproject.org/web/packages/ks/ks.pdf).
Determining an appropriate value for the upper order K and boundaries a and b for f Y,K |{y * i ,c i } N 1 is a critical issue. The upper order K in Lin [9] is determined by comparing the plots of f Y,K |{y * i ,c i } N 1 and f Y . Using this way to determine the value for K is impracticable as the original data are not available. The boundaries a and b used in Lin [9] are min{y i } and max{y i }, respectively. Using those values in f Y,K |{y * i ,c i } N 1 might cause problems because the values might be confidential, and the data provider might feel uncomfortable to release them to the public.
A method for determining K and boundaries a and b, without directly employing the information of the original data is essential for the software built in this paper.

Determination of the upper order
Provost [10] pointed out that, if an inappropriate upper order K is used in the density approximant f Y,K , it may cause f Y,K taking negative values. Simulation studies by Lin [9] showed that the density approximant will not be more accurate for large values of the upper order K . It is a challenge to determine an appropriate K for f Y,K |{y * i ,c i } N 1 without reference to f Y . Lin [9] used a term "the correlation between two density functions" to evaluate the similarity of two density functions. The implication of the term is related to the concept of "the probability plot correlation coefficient". The Q-Q plot method can be used to compare two probability distributions if they are close to each other. Adopting the idea of the Q-Q plot, we evaluate the value of "the correlation between density functions f 1 and f 2 " in the following manner. (1) Independently simulate two samples {x 1,i } and {x 2,i } of the same size from f 1 and f 2 , respectively. (2) Sort the two samples, and then calculate the sample correlation coefficient of the sorted samples.
Lin [9] demonstrated that the larger the value of "the correlation between f Y,K |{y * i ,c i } N 1 and f Y " is, the closer the two functions, f Y,K |{y * i ,c i } N 1 and f Y , will be. Motivated by this fact, the following steps are built in MaskDensity14 for determining the appropriate K in f Y,K |{y * i ,c i } N i without directly using the information of the original data {y i } N 1 . Consider the masked dataset {y * i }: Step 1. Set an initial upper order of moment, K = 1 and a maximum upper order of moment to be tested. The maximum upper order of moment set in MaskDensity14 is 100.
Step 2. Independently simulate a sample {c i } N 1 from C and obtain the smoothed function f Y,K |{y * i ,c i } N 1 using Eq. (1). In MaskDensity14, we assume that the data agency provides the data user with a sample of C. The size of the sample is sufficiently large (>N ) such that the sample can well represent the probability structure of the noise C. Thus, a sample drawn from the sample population can be considered as a sample drawn from C. 1 Step Step 4. Independently simulate a second sample {c ′ j } N 1 from C. Mask {y ′ j } N 1 by using this new sample of noise and yield a new masked dataset {y ′ * j } N 1 .
Step 5. Sort {y ′ * i } N 1 and {y * i } N 1 , respectively. Evaluate the correlation Cor (K ) between the two sorted datasets. Keep track of the optimum upper order of moment such that Cor (K opt ) = max k≤K Cor (k).
Step 6. Update K to K + 1 and return to Step 2 if K + 1 ≤ 100. Stop when Cor (K ) drops below a threshold taken as Step 7. Report K opt as the optimum upper order of moment used.

Remarks. (1)
Step 5 is the key step in identifying an appropriate upper order of moment for the approximant of f Y . We explain the logic used to support Step 5 as follows. If the density approximant determined by {y * i , c i } is close to the true density function f Y , {y ′ i } can be considered as an independent sample from Y . Consequently, {y ′ * i } can be considered as an independent sample from Y C. Thus, the smoothed density functions determined by {y * i } and {y ′ * i }, respectively, will be more likely close to each other.
MaskDensity14 reports the value of "the correlation between the smoothed density functions" given by {y * i } and {y ′ * i }. The higher the value is, the relatively better the approximation between f Y,K |{y * i ,c i } (y) and f Y (y) will be. (2) We set the tested maximum upper order of moment to be 100. To save time, we would not like to have the testing procedure going from K = 1 to K = 100. Our experience (also see examples in Lin [9]) shows that Cor (K ) will decrease quite rapidly if K becomes too large. It is due to the result of poor estimates of high-order moments. If the value of Cor (K ) decreases too low and according to our empirical testing, it is not necessary to carry out any further testing. Therefore, we set the threshold 1 − 10[1 − Cor (K opt )] in Step 6.

The boundaries a and b used in MaskDensity14
Example 1 shows the impact of the values of a and b on the plot of f Y,K |{y * i ,c i } . without any interference from the noise C, we let C = 1. 1 We can modify MaskDensity14 and allow the data agency to provide the data user with the probability distribution information of C, instead of a sample from C. To well represent the impact of the boundaries on the plot of The plots of f Y,K |{y * i ,c i } based on the seven pair-boundaries are presented in Fig. 1. Fig. 1 shows that the plots of the density approximants given by PB(1) and PB(2) are very different from the plot of the true density function; the plots given by PB(4) and PB(6) are reasonable. The values of Cor (K opt ) based on PB(4) and PB(6) are all around 0.9997, which are higher than those of Cor (K opt ) based on PB(1) and PB(2) (0.9873 and 0.9973, respectively). It confirms that, based on the same sample {y i }, a density approximant with a relatively higher value of Cor (K opt ) should give a better approximation of f Y . As the size of the interval [a, b] increases, the plot of the corresponding density approximant tends to be flat and gradually runs away from the plot of the true density function (see the plots given by PB (7) and PB (11) The density approximants of f Y and f Y sub can be obtained from the masked data {y * i } N 1 and {y * sub, j }, respectively. Since the probability structures of Y and Y sub might not be the same, the appropriate pair-boundaries used in the density approximants of the two populations might not be the same.
With the full knowledge on the original data {y i } N 1 , the data agency has no problems in providing the data user with an appropriate pair-boundary (a, b) for f Y,K |{y * i ,c i } N 1 . It is impossible for the data agency to provide an appropriate pairboundary (a, b) for f Y,K |{y * i ,c i } sub without knowing in which subset {y sub, j } the data user might be interested.
To ensure that the data user has more freedom in exploring the statistical information of the full/subset of the original data, it is of interest how to determine an appropriate pairboundary for f Y sub based on the information of {y * i , c i } N 1 and the appropriate pair-boundary for f Y,K |{y * i ,c i } N 1 provided by the data agency.
Taking into account the discussions above, a standard procedure for determining a and b is adopted in MaskDensity14: (1) If Y is a categorical variable taking values 1, 2, . . . , M, let a = 0, and b = M + 1 (see Section 2.3).
(2) If Y is not a categorical variable, the values of a and b are determined as follows: Step 2.1 Let a basic and b basic be the boundaries determined by the data agency. With the full knowledge of the original data Step 2.2 For each α = 0.01-0.05 with increment 0.01, 2 let where y * sub and y * 2 sub are the sample mean and the sample second moment of {y * sub, j }, respectively;c and c 2 are the sample mean and the sample second moment of the noise C, respectively; Step 2.3 For each pair-boundary (a basic , b basic ), (a α , b α ), α = 0.01, . . . , 0.05, determine the optimal upper order K for f Y,K |{y * sub, j ,c j } and record Cor (K opt ), denoted by Cor (K opt,basic ) and Cor (K opt,α ), α = 0.01, . . . , 0.05, respectively; Step 2.4 Let a = a α 0 and b = b α 0 , α 0 ∈ {basic, 0.01, 0.02, . . . , 0.05} such that Cor (K opt,α 0 ) = max{Cor (K opt,basic ), Cor (K opt,α ), α = 0.01, . . . , 0.05}.
Remarks. The logic used to support the standard procedure above is explained as follows. (ii) From Tchebichev inequality, we have Ignoring the probability α, {y sub, j } will be bounded by By taking into account the information Cor (K opt,basic ) and Cor (K opt,α ), and replacing the means by sample means in L α and U α , we should expect that [a α 0 , b α 0 ] is a reasonable domain replacing [min{y sub, j }, max{y sub, j }] based on the information of {y * sub, j , c j }. There might be other ways for determining the appropriate pair-boundary (a, b) for f Y,K |{y * sub, j ,c j } . We leave it as an open question.

MaskDensity14 for categorical data
Categorical data is the primary type of data considered in confidential microdata. With noise multiplied data, the mass function of the original data can be obtained by the method of moments. Assume that Y is a categorical variable taking values If the method of moments fails, MaskDensity14 will use the sample-moment-based approximant density method to estimate the mass function of Y .
In theory, the technique of the approximant of a density function is for continuous univariate distributions. However, MaskDensity14 will use the following way to cope with categorical data. (i) Mask the underlying categorical variable by a continuous noise. Thus, the masked data are no longer categorical data. (ii) Apply the method of sample-momentbased density approximant to the masked data and estimate the smoothed density function of the categorical variable based on the masked data. Obviously, the density approximant will have multiple centers at the levels of the categorical variable. (iii) Finally, use the existing K-means clustering R package to convert the smoothed density approximant to the mass function.

Software architecture
MaskDensity14 is built for two purposes. One purpose is to provide masked data. The other is for the data user generating synthetic data.
The flow chart for producing masked data is presented below: By default the values of a basic and b basic are min{y i } N 1 and max{y i } N 1 , respectively. By default, the sample of noise is from a mixture of two normal distributions with means randomly generated. The data agency has the option of providing the values of a basic and b basic , and the noise C used to mask the underlying original data. The binary file "noise.bin" contains a large sample from C, the values of a basic and b basic , and the information about whether the underlying data are numerical or categorical. The sample noise in "noise.bin" are generated by sampling data from the noise sample used to mask the original data. This process is run in background during the process of producing masked data. The size of the noise sample in "noise.bin" is ten times larger than that of the original data. In the process of determining the upper order K , the independent noise sample are drawn from this big noise sample. Since there is no link between the entries in the original dataset and the sample noise stored in "noise.bin", the individual entries of the original dataset cannot be identified simply knowing the masked data and the noise sample drawn from "noise.bin". The binary file is recognizable by MaskDensity14 only.
The masked dataset and the binary file "noise.bin" can be sent out to the data user, and the original data are concealed from the data user. The data user can apply MaskDensity14 to the files and obtain the synthetic data of the original data. A brief flow chart of the process is presented below:

Software functionalities
There are two main functions, mask and unmask, in MaskDensity14. Function mask is used to produce masked data and the binary file noise.bin. The outcome of unmask is a simulated sample drawn from the sample-moment-based density approximant of the original data.

Illustrative examples
Due to the focus of this paper, we only demonstrate the main functions of MaskDensity14. Simulation studies and applications will be investigated in another paper. # y is a sample drawn from Y. noise<-rmulti(n=10000, mean=c(80, 100), sd=c (5,3), p=c(0.6, 0.4)) # noise is a sample drawn from C.
After running the above R code, two files "ystar.dat" and "noise.bin" are generated and ready for the data user. Saving the two files "ystar.dat" and "noise.bin" in the same R working directory, the data user can use the following code to obtain synthetic data of {y i } 10000 1 . library(MaskDensity14) ystar <-scan("ystar.dat") y1 <-unmask(ystar, noisefile="noise.bin") sample<-y1$unmaskedVariable R output y1$unmasked V ariable gives the synthetic data of the original data {y i }. The size of the synthetic data is the same as that of the original data. The data user can apply standard R code to y1$unmasked V ariable and obtain the output data analysis of the original data. For example, the output "plot(density(y1$unmaskedVariable))" gives the plot of the density approximate of the original data; the output "summary(y1$unmaskedVariable)" gives the estimate of the summary statistics of the original data. The summary statistics for original data and synthetic data are listed in Table 1.
Example 3 gives another example where the original data are categorical. Since {y i } only takes two values 1 and 2, the boundaries a basic and b basic are 0 and 3, respectively. The R code used to produce {y * i } is as follows: library(MaskDensity14) ymask<-mask(factor(y), noisefile="noise.bin", noise, a1=0,b1=3) # using factor(y) because y is a categorical variable write(ymask$ystar, "ystar.dat") After running the above code, the files "ystar.dat" and "noise.bin" are ready for the data user.
The following code is used to obtain a set of synthetic data of {y i } with the same size and the estimated mass function of Y .

Impact
MaskDensity14 is a promising software package for estimating the density function of univariate random variables based on noise multiplied data. The advantages of MaskDensity14 can be summarized as follows: No restriction on the type of distribution of the multiplicative noise The data agency has a broad range of choices on the multiplicative noise for protecting the original data. Indeed, the noise sample used to mask the original data has a certain level impact on the accuracy of density approximant. This issue is beyond the focus of this paper.
More possibilities for data agencies to share data with the public With MaskDensity14, the process of sharing data information for the public can be simplified. The data agency might have less responsibility for data analytics. It is particularly important for data agencies that have no necessary resource for doing advanced data analysis.
Standard statistical methods for data analysis The complexity of the statistical inference based on noise multiplied data is reduced.
Lin and Wise [8] showed that the level of protection of the underlying original data can be maintained through an appropriate multiplicative noise even if the probability distribution of the noise is available to the public. To provide extra protection for the underlying original data, MaskDensity14 encrypts the information of the multiplicative noise into a binary file. There may be other better methods to replace this manner.
A critical issue in MaskDensity14 is about the decision on the upper order of moment K . It is possible to improve the accuracy of the density approximant further if there is a better way to determine the upper order K .
MaskDensity14 provides the data user with opportunities to explore the statistical information of the subset of the original dataset. The accuracy of the density approximant of a subset of data can be further improved if there is a better manner for determining the boundaries for the subset data based on the information provided by the data provider.
The method of the sample-moment-based density approximant is for univariate distributions. In real life, developing a computational statistical method for multivariate distributions is desirable. With MaskDensity14 built, it will provide help in developing software for estimating joint density functions based on noise multiplied data.
The software developed in this paper provides data agencies and data users with a totally different process in data protection and confidential data analysis. Data agencies can mainly focus on the issue of data protection, and data users can generate synthetic data by themselves rather than receiving synthetic data from data providers.

Conclusions
MaskDensity14 is applied to univariate distributions. With MaskDensity14 built, the method of sample-moment-based density approximant becomes feasible. The data user can produce (asymptotically) synthetic data of the original data by himself and carry out data analysis on the original data by using standard statistical methods without accessing the original data. Compared to existing methods, MaskDensity14 provides a different manner in data protection and data statistical information recovery.