1 Introduction

Disseminating synthetic data in place of confidential micro-level data can be an efficient way to protect the privacy of survey and clinical trial participants. Whether a set of synthetic data is ideally suitable in representing the statistical information from a confidential data source depends upon the similarity of the density function of the set of synthetic data to the density function of the confidential data source. Therefore, having a reliable and robust method for estimating the underlying density function is essential in the synthetic data generation process.

Many existing non-parametric density function estimation methods, including kernel density estimation, rely heavily on local area estimation. The low-density local areas used to construct the density function estimate of a sample space may contain very few sample observations. The values of the sample observations in those areas are more easily identified or reconstructed later from the constructed density function estimate, posing greater risk to privacy and confidentiality. Data agencies can offset the risk of disclosure by defining local areas of greater size (larger bandwidth) or through perturbation, but this may adversely impact statistical utility. Therefore, those methods relying on local area estimation may not be suitable when confidential data is involved. Existing parametric maximum likelihood and mixture distribution approaches often lack sufficient robustness to estimate atypical distributions effectively. Researchers tend to opt more for kernel estimators, maximum likelihood estimators, or increasingly, non-parametric machine learning techniques (Izenman 1991; Wang and Scott 2019). While highly effective in varying contexts, these methods have limitations, especially when dealing with atypical, incomplete data. Moreover, circumstances in which the data agency may restrict access to micro-level data will result in data-driven non-parametric approaches becoming untenable.

This paper proposes a computational approach for estimating a joint density function of continuous data. The estimated density function is expressed as a finite linear summation of orthogonal polynomial functions and only relies on sample moments. Without knowledge of the joint density function underlying the continuous data, this paper develops a feasible computational strategy that searches for the appropriate maximum order of polynomials used in the estimated density function. We show that the moment-based density approach achieves a compromise between non-parametric and parametric methods, offering a non-parametric global estimate that is sufficiently robust. We also provide an example application of moment-based density estimation in producing synthetic continuous micro-data on confidential micro-data for disclosure protection.

Although our primary motivation, the application of moment-based density estimation is not limited to use cases for confidential data. Moment-based density estimation has interesting applications in imputation for missing data more broadly and for the estimation of continuous density functions from tabular data. Furthermore, there has been recent developments in applying moment-based density approximates in stochastic models for finance (Stoyanov 2016) and image reconstruction (Mustapha and Dimitrakopoulos 2010).

Constructing unique polynomial-based series expansions for continuous functions is a common practice applied in a range of areas of mathematics, physics engineering, and statistics (Munkhammar et al. 2017). Polynomial-based expansions have been used in studies to approximate probability density functions on both closed and infinite domains (Elderton and Johnson 1969; Solomon and Stephens 1978; Kendall et al. 1987; Reid 1988; Provost 2005). Known more commonly as the moments’ problem, this area of study has been considered from as far back as the 19th century (Gram 1883). The advent of greater computational power has increased the usage of polynomial expansions. Currently, however, polynomial-based expansions are rare in multivariate density estimation applications, when compared to their use for univariate estimation. As noted by Kleiber and Stoyanov (2013), the results for the multivariate case tend to be scattered throughout specialised journals and focus on an analytical perspective.

This paper focuses on distributions with bounded marginal domains to ensure determinacy. Determinacy, and more specifically moment determinacy, refers to the existence of a unique moment-based expansion (Kleiber and Stoyanov 2013). This paper requires no specific choice of the associated reference joint density (weight) used. The restriction to a bounded domain is certainly not necessary if moment determinacy can be assured. All results presented in this paper can be generalised on unbounded domains in certain situations. Unbounded reference joint densities shown in the literature to have moment determinacy include the normal and gamma distributions (Provost 2005). Furthermore, as demonstrated in Lin (2014), which applies a univariate form of moment-based density estimation to recover information from perturbed data, assuming a compact domain can result in a reasonable estimate for density functions that exist on an unbounded domain provided the sample size is sufficiently large enough.

This paper is structured in six sections. Section 2 provides the analytical results required to construct the moment-based density estimate. Section 3 considers a measure of goodness of fit for the moment-based density estimate and outlines a sample-based strategy for parameter selection. In Sect. 4, we perform a simulation study to compare the performance of the moment-based density estimation method with existing joint density estimation methods. In Sect. 5, we outline a process of simulating synthetic micro-data from the moment-based density estimate and demonstrate this application to hypothetical confidential data. We conclude our paper in Sect. 6.

2 The multivariate moment-based density estimate

Throughout the literature, researchers have mainly focused on the usefulness and effectiveness of a particular orthogonal polynomial basis in regards to estimating certain distributions, selecting classical polynomial families as the basis for the expansion (Gram 1883; Edgeworth 1905; Charlier 1914; Szegő 1939; Cramer 1946; Alexits 1961; Kendall et al. 1987; Provost 2005). These classical polynomials are usually chosen based on the resemblance of their weight functions (derivative of the positive Borel measure by which the polynomials are orthogonal) to the canonical distributions. Orthogonal polynomials for which a density approximation has been derived in the literature include Hermite (Edgeworth and Gram-Charlier Expansions), Legendre, Jacobi and, Laguerre Polynomials (Gram 1883; Edgeworth 1905; Charlier 1914; Szegő 1939; Cramer 1946; Alexits 1961; Kendall et al. 1987; Provost 2005). They have weight functions resembling normal, uniform, beta, and gamma distributions, respectively. These polynomials can all be used to provide moment-based expressions of certain density functions and are eigenfunctions of a hypergeometric type second-order differential equation (Sánchez-Ruiz and Dehesa 1998).

Multivariate orthogonal polynomial expansions, particularly for Hermite, Jacobi, and Laguerre polynomials, have been considered in the literature (Xu 1997; Griffiths and Spano 2011). Joint density expressions for quasi-multivariate normal distributions have also been developed using a Hermite polynomial-based expansion (Mehler 1866; Withers and Nadarajah 2010). This section derives a generalised form of the multivariate density estimate and the fundamental mathematical results required to apply this density estimate in practice. This form covers many existing approaches and allows for greater flexibility in the choice of reference density and resulting orthogonal polynomials used.

2.1 The orthogonal polynomial basis

Initially, we only consider the orthonormal basis for the univariate case in a single-dimensional space \(\Omega = [a,b]\). Later this basis is expanded to the N-dimensional compact space \({\Omega }^{(N)} = \prod _{k=1}^N [a_k, b_k]\). Let \( f_{\nu }\) denote the probability density function of a random variable \(\nu \) with the domain the closed interval \(\Omega = [a,b]\). We refer to \( f_{\nu }\) as the reference distribution function, and this function will serve as the weight function on an orthonormal basis.

2.1.1 Univariate polynomial basis

Regarding the function \(f_{\nu }\), the results of Sánchez-Ruiz and Dehesa (1998) can be summarised in the following theorem.

Theorem 1

If there exist polynomial functions \(\sigma (x)\) and \(\tau (x)\) of degree no greater than 2 and 1 with coefficients \(\phi _0,\phi _1,\phi _2\) and \(\psi _0,\psi _1\) respectively for \(f_{\nu }\), where,

$$\begin{aligned} \sigma (x) = \phi _{2}x^2 + \phi _{1}x + \phi _{0}, \quad \text {and} \quad \tau (x) = \psi _{1}x + \psi _{0}, \end{aligned}$$
(1)

such that

$$\begin{aligned} \frac{d}{dx} \left[ \sigma (x) f_{\nu }(x) \right] = \tau (x) f_{\nu }(x), \end{aligned}$$
(2)

with boundary conditions, for \(i = 0,1,\ldots \)

$$\begin{aligned} \lim _{x \rightarrow a} x^i \sigma (x) f_{\nu }(x)= \lim _{x \rightarrow b} x^i \sigma (x) f_{\nu }(x) = 0, \end{aligned}$$
(3)

then there exists a set of eigenfunction solutions \(\{P_n(x)\}_{n \in \mathbb {N}_0}\), where \(\mathbb {N}_0\) denotes the set of all natural number including 0, of the hypergeometric-type differential equation,

$$\begin{aligned} \sigma (x) y''(x) + \tau (x) y'(x) - \lambda y(x)= 0. \end{aligned}$$

\(P_n(x)\) denotes the nth eigenfunction solution corresponding to \(\lambda _n = n\psi _{1} + n(n-1)\phi _{2}\) and has the following properties:

  1. (i)
    $$\begin{aligned} \lambda _n P_n(x)f_{\nu }(x) = \frac{d}{dx} \left[ \sigma (x)P_n'(x)f_{\nu }(x) \right] . \end{aligned}$$
  2. (ii)

    \(P_n(x)\) is a polynomial of degree n with Rodrigues’ formula of the form

    $$\begin{aligned} P_{n}(x) = \frac{B_{n}}{f_{\nu }(x)} \frac{d^{n}}{dx^{n}} \left[ (\sigma (x))^{n} f_{\nu }(x) \right] , \end{aligned}$$

    where \(B_{n} \in \mathbb {R}\) is the normalising constant in the Rodrigues’ formula and can be chosen arbitrarily in order to show consistency with classical orthogonal polynomial families.

  3. (iii)

    Given \(B_n\) of the Rodrigues’ formula is defined as follows,

    $$\begin{aligned}{} & {} B_0 = 1\quad \text {and} \nonumber \\{} & {} B_{n} = \sqrt{ \frac{(-1)^n \prod _{i=0}^{n-1} \left[ \frac{1}{ \psi _1 + (n + i -1)\phi _{2} }\right] }{ n! \int _{a}^{b} (\sigma (x))^{n}f_{\nu }(x) dx}, } \quad \text {for}\; n = 1,2,\dots , \nonumber \\ \end{aligned}$$
    (4)

    then,

    $$\begin{aligned} \int _{a}^{b} P_{n}(x) P_{m}(x) f_{\nu }(x) dx = \delta _{n,m}, \end{aligned}$$

    where \(\delta _{n,m}\) denotes the Kronecker delta, and \(n, m \in \mathbb {N}_0\). Hence \(\{P_n(x)\}_{n \in \mathbb {N}_0}\) are orthogonal polynomials.

Theorem 1 shows that provided an appropriate quadratic polynomial \(\sigma (x)\) and appropriate linear polynomial \(\tau (x)\) for a given reference density \(f_\nu \) can be found; then, we can obtain the aforementioned set of polynomials. Theorem 1 makes no statement about the domain of the underlying distribution and can therefore be applied to reference distributions defined on both bounded and unbounded domains. Many of the standard probability densities used in statistics meet this criterion as a reference density. Table 1 provides examples of \(\sigma \) and \(\tau \) functions for some of these densities and the connection of the resulting polynomials to the classical orthogonal polynomial families.

In the following, we list some properties of the orthogonal polynomials without specifying the distribution of the reference function \(f_{\nu }\). These properties will benefit the computational approach for moment-based joint density function estimation later developed in this paper.

Table 1 Various possible reference densities with corresponding \(\sigma (x)\) and \(\tau (x)\) functions and their domain and relation to classical orthogonal polynomials

Property 1

Regarding the polynomial function \(P_n(x)=\sum _{i=0}^{n}a_{n,i}x^i\), the coefficient \(a_{n,i}\) for fixed \(n \in \mathbb {N}_0\) is given by,

$$\begin{aligned}{} & {} a_{n,i} = \nonumber \\ {}{} & {} {\left\{ \begin{array}{ll} \frac{a_{n,i+1}(i + 1)(i \phi _1 + \psi _0) + a_{n,i+2}(i +2)(i + 1)\phi _0}{\lambda _{n} - i(i - 1)\phi _{2} - i\psi _1}, &{} i = 0,\dots ,n-2 \\ \kappa _{n} \frac{n (n -1)\phi _1 + n \psi _0}{ \psi _1 + 2(n - 1)\phi _{2}}, &{} i = n-1\\ \kappa _{n}, &{} i = n \end{array}\right. }, \end{aligned}$$
(5)

where \(\kappa _{n} = B_{n} \prod _{i=0}^{n-1} \left[ \psi _1 + (n + i -1)\phi _{2} \right] \) and \(\kappa _{0} =1\).

We note that \(\sigma (x)\) and \(\tau (x)\) are polynomials of degree no greater than 2 and 1, with coefficients \(\phi _0,\phi _1,\phi _2\) and \(\psi _0,\psi _1\) respectively (1).

Property 1 provides the Frobenius solutions to the polynomial coefficients for each of our polynomial terms. We note that this solution also applies to reference densities on the unbounded domain and can be used to derive the coefficients of all the classical polynomial expressions given in Table 1.

Denote \(\mu _{\nu }\) a measure on (\(\Omega ,\mathscr {F}\)) with \( \mu _{\nu }(A) = \int _{A} f_{\nu }(x) dx\) for \(A \in \mathscr {F} \) , where \(\mathscr {F}\) is a \(\sigma \)-field on \(\Omega \).

Property 2

  1. (i)

    The integral of polynomial \(P_n(x)\) with respect to the measure function \(\mu _\nu \) is

    $$\begin{aligned} \int P_n(x) d\mu _\nu&= \int P_n(x) f_{\nu }(x) dx \nonumber \\&={\left\{ \begin{array}{ll} \frac{\sigma (x)}{\lambda _n}P_n'(x)f_{\nu }(x) + c, &{} n = 1 ,2, \dots \\ F_{\nu }(x) + c, &{} n =0 \end{array}\right. }, \end{aligned}$$
    (6)

    where \(c \in \mathbb {R}\) is some constant and \(F_{\nu }\) is the cumulative distribution function of \(f_\nu \).

  2. (ii)
    $$\begin{aligned} \lim _{t\rightarrow a} \sigma (t)P_n'(t)f_{\nu }(t) = 0. \end{aligned}$$

Property 2 expresses the indefinite integral of the orthogonal polynomials with respect to the reference density. This expression is essential in the practical application of the density estimate as it allows for the analytical derivations of the distribution functions associated with the multivariate density estimate. However, due to length constraints, we will not give the expressions of all the associated distribution functions in this paper. Property 2 will be used in Sect. 5, to derive the expression of the moment-based conditional distribution function estimate. One of the key benefits of the polynomial-based approach to density estimation is the analytical flexibility and applicability of the resulting density estimate. Provided the reference density is continuous and differentiable, then so too is the resulting moment-based density expression and thus is more easily applied in inversion and optimisation problems.

Property 3

Szökefalvi-Nagy 1965\(\{P_n\}_{n \in \mathbb {N}_0}\) forms a complete orthogonal base in the \(L_2\) space \(L_2(\Omega ,\mu _\nu )\).

Therefore, polynomials \(\{P_n\}_{n \in \mathbb {N}_0}\) can serve as basis functions when expanding any continuous functions in \(L_2\) space \(L_2(\Omega ,\mu _\nu )\). Given this property, we extend the complete orthogonal basis to a multivariate space in the following section.

2.1.2 A multivariate polynomial basis

Consider a N-dimensional compact space given by \({\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]\). From this point on, we will consider the orthogonal polynomial bases on \({\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]\).

Let \( f_{\nu } = \prod _{k=1}^N f_{\nu _k} \). Here \(f_{\nu _k}\) is a probability density function of a random variable \(\nu _k\) with the domain the closed interval \(\Omega _k = [a_k,b_k]\), \(k=1,\dots ,N\). Random variables \(\{\nu _k\}_{k=1,\dots ,N}\) are mutually independent. We refer to \( f_{\nu }\) as the reference joint density function and this function will serve as the multivariate weight function in our orthonormal basis. We note that \(f_{\nu }: {\Omega ^{(N)}} \rightarrow \mathbb {R}\) is the joint density function of the random vector \(\nu =(\nu _1, \ldots , \nu _N) \).

Of interest is the \(L_2\) space \(L_2({\Omega ^{(N)}},\mu _{\nu })\) with measure function \(\mu _{\nu }\) defined by

$$\begin{aligned} \mu _{\nu }(A) = \int _{A} f_{\nu }(\textbf{x}) d\textbf{x} = \int _{A} \left[ \prod _{k=1}^N f_{\nu _k}(x_k) \right] d\textbf{x} \quad \text {for}\;\, A \in \mathscr {F}, \end{aligned}$$
(7)

where \(\mathscr {F}\) is the \(\sigma \)-field of \(\Omega ^{(N)}\). By taking the product of the univariate orthogonal polynomials described in Sect. 2.1.1 we obtain a set of multivariate orthogonal polynomials defined on \({\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]\). These multivariate polynomials are given as follows.

Definition 1

Let \(\{P_{k, n_k}\}_{n_k \in \mathbb {N}_0}\) the set of hypergeometric type polynomials given by the density function \(f_{\nu _k}\), \( k=1, \dots , N\). The multivariate hypergeometric type polynomial \(P_{n_1,\dots ,n_N}(\textbf{x})\) associated to \(f_{\nu }\) is given as the product of N hypergeometric type polynomials \(P_{k,n_k}\) of degree \(n_k\) for each \( k = 1,\dots ,N\), that is

$$\begin{aligned} P_{n_1,\dots ,n_N}(\textbf{x})= & {} \prod _{k=1}^N P_{k,n_k}(x_k) \nonumber \\= & {} \prod _{k=1}^N \frac{B_{k,n_k}}{f_{\nu _k}(x_k)} \frac{d^{n_k}}{dx^{n_k}}\left[ \left( \sigma _k(x_k) \right) ^{n_k} f_{\nu _k}(x_k) \right] , \end{aligned}$$
(8)

where \(\sigma _k\) and \(\tau _k\) refer to the polynomials described in (2) of the kth hypergeometric type polynomial and \(B_{k,n_k}\) refers to the kth normality constant described in (1) and (4).

We can subsequently leverage the construction of the multivariate polynomial as a product of marginal univariate polynomials to decompose the multivariate polynomial into its marginal components. In doing so, the completeness and orthogonality of the univariate marginal polynomials ensure the multivariate polynomials defined in Definition 1 are themselves orthogonal and complete in the N-dimensional space. As such, it can be said these multivariate polynomials form a complete orthonormal polynomial basis for the N-dimensional \(L_2\) product space \(L_2(\Omega ^{(N)},\mu _{\nu })\). This statement is given more formally in Theorem 2.

Theorem 2

The multivariate hypergeometric-type polynomials are orthonormal on the compact space \({\Omega ^{(N)}} = \prod _{k=1}^N \Omega _k\) and complete in the \(L_2\) space \(L_2({\Omega ^{(N)}},\mu _{\nu })\).

A detailed proof of Theorem 2 is given in “Appendix A”.

2.2 Multivariate moment-based density function

Let \(\textbf{X} = (X_1,\dots ,X_N)\) be a random vector with joint density function \(f_\textbf{X}\) on \({\Omega ^{(N)}}\). The joint density function \(f_\textbf{X}\) can be uniquely expanded into a polynomial expression, multiplied by the reference density \(f_\nu (\textbf{x})\), by use of orthonormal basis functions \(\{ P_{n_1,\dots , n_N}(\textbf{x}) \}_{n_1,\dots ,n_N \in \mathbb {N}_0} \) on the compact space \(\Omega ^{(N)}\). The resulting density expression has polynomial coefficients, which are functions of the underlying moments of the distribution. This form of the joint density is referred to in this paper as the multivariate moment-based density (MBD) expression.

We begin with the statement of the theorem outlining the multivariate moment-based density expression.

Theorem 3

The Multivariate Moment-Based Density.

Let \(\textbf{X} = (X_1,\dots ,X_N)\) be a random vector with joint density function \(f_\textbf{X}\) on space \({\Omega ^{(N)}} = \prod _{k=1}^N [a_k, b_k]\). Assume \(\textbf{X}\) has finite moments \(\int _{\Omega ^{(N)}} x_1^{i_1} \dots x_N^{i_N}f_{X}(\textbf{x}) dx_1\dots dx_N \) for all \(i_1,\dots ,i_N \in \mathbb {N}_0\), and

\(f_{X} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })\). Then \(f_{\textbf{X}}\) can be expressed as

$$\begin{aligned} f_{\textbf{X}}(\textbf{x}) = f_{\nu }(\textbf{x})\sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}), \end{aligned}$$
(9)

where the coefficient terms \(C_{n_1,\dots ,n_N}\) are given by

$$\begin{aligned} C_{n_1,\dots ,n_N}&= \mathbb {E}_{\textbf{X}}\left[ P_{n_1,\dots ,n_N}(X_1,\dots ,X_N) \right] \nonumber \\&= \int _{\Omega ^{(N)}} P_{1,n_1}(x_1) \dots P_{N,n_N}(x_N) f_{\textbf{X}}(\textbf{x}) d\textbf{x}\nonumber \\&= \sum _{i_1 = 0}^{n_1} \dots \sum _{i_N = 0}^{n_N} a_{n_1,\dots ,n_N ; i_1, \dots , i_N} \mathbb {E}_{\textbf{X}} \left[ X_1^{i_1} \dots X_N^{i_N} \right] , \end{aligned}$$
(10)

where \(a_{n_1,\dots ,n_N ; i_1, \dots , i_N} = \prod _{k=1}^N a_{n_k,i_k}\) and \(a_{n_k,i_k}\) is the \(i_k\)th coefficient of the \(n_k\)th polynomial \(P_{k,n_k}(x_k)\).

Proof of Theorem 3 can be found in “Appendix B”.

Remark

We note \(\sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 = \mathbb {E}_{\textbf{X}} \left[ \frac{f_{\textbf{X}}(X_1,\dots ,X_N)}{f_{\nu }(X_1,\dots ,X_N)} \right] < \infty \) as \(f_{\textbf{X}} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu }).\) It also follows that \(C_{n_1,\dots ,n_N} \rightarrow 0\) as \(n_1,\dots ,n_N \rightarrow \infty \).

The expansion given in Theorem 3 relates to a joint-density defined on a compact space. We imposed this restriction to ensure moment determinacy, as, on unbounded spaces, we cannot guarantee moment determinacy. Polynomials given by (8) and applied in Theorem 3 may still be extended to unbounded spaces such as in the case of \(f_{\nu _k}\), for any \(k=1,2,\dots ,N\), being defined by a normal (Hamburger case) or gamma (Stieltjes case) density. For these cases, moment determinacy can be assured by the completeness of the Laguerre and Hermite polynomial families (Provost 2005).

In the examples of these types of polynomials in the literature, it is often assumed that all marginal densities of the joint density being approximated have a consistent kind of distribution. Here we make no such assumption, and only ensure that the domain of our multivariate polynomials is consistent with that of our joint density, provided that \(f_{X} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })\) and has finite moments i.e. \(\mathbb {E}_{\textbf{X}} \left[ X_1^{i_1} \cdots X_N^{i_N}\right] < \infty ,\) for all \(i_1, \ldots , i_N\), which enables the correct estimation of joint densities with variables of different domains, a common occurrence in real-world data.

The hypergeometric polynomial-based expansion of a joint density function generally is an infinite sum of polynomials. It is impracticable to evaluate this infinite sum to obtain the complete information of the joint density function. Using finite summation terms in the expansion to approach the density function is one way to address this issue. The function of the finite summation terms is named the finite order multivariate moment-based density, which is defined as follows.

Definition 2

The Finite Order Multivariate Moment-Based Density

The finite order multivariate moment-based density approximation of max order \(\textbf{K} = (K_1,\dots ,K_N)\) is defined as

$$\begin{aligned} f_{\textbf{X};\textbf{K}}(\textbf{x}) = f_{\nu }(\textbf{x})\sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}) . \end{aligned}$$

The Weierstrass–Stone approximation theorem provides that \(f_{\textbf{X}:\textbf{K}}\) converges uniformly to \(f_{\textbf{X}}\) as the order \(\textbf{K}\) increases. Using \(f_{\textbf{X}:\textbf{K}}\) as an estimate of \(f_{\textbf{X}}\) is not practicable in real-life data analysis. One issue is the evaluation of the coefficients \(\{C_{n_1, \ldots ,n_N}\}\) in \(f_{\textbf{X}:\textbf{K}}\). The coefficient \(C_{n_1, \ldots ,n_N}\) is a linear function of joint moments of \(\textbf{X}\). The evaluation of coefficients requires the knowledge of the density function \(f_{\textbf{X}}\) while \(f_{\textbf{X}}\) is unknown. The following section introduces the finite order multivariate sample moment-based density function to address this problem.

2.3 Multivariate sample moment-based density estimation

We now consider an approximation of multivariate density functions based on sample moments.

Definition 3

Multivariate Sample Moment-Based Density Approximation

Given a random sample \(\{\textbf{X}_i\}_{i=1}^M =\{(X_{1,i},X_{2,i}, \dots , X_{N,i})\}_{i=1}^M\) of \(\textbf{X} = (X_1,X_2,\dots ,X_N)\), define

$$\begin{aligned} \hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )&= f_{\nu }(\textbf{x})\sum _{n_1 = 0}^{K_1} \nonumber \\&\quad \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}), \end{aligned}$$
(11)

where \(\textbf{K} = (K_1,\dots ,K_N) \) and \(\hat{C}_{n_1,\dots ,n_N} = (1 / M) \sum _{j=0}^M P_{n_1,\dots ,n_N}(X_{1,j},\dots ,X_{N,j}) \).

We propose using \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) as an estimator of \(f_{\textbf{X}}\) and name it the sample moment-based estimator of \(f_{\textbf{X}}\) or the moment-based estimator of \(f_{\textbf{X}}\) if there is no ambiguity.

A significant issue when estimating the multivariate joint density function is computation complexity. The use of \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) as an estimate of \(f_{\textbf{X}}\) has some computational advantages.

The marginal moment-based density estimates determined by integrating the joint moment-based density estimate are consistent with what would be achieved if the estimation were applied to the original sample’s corresponding marginal subset. Computationally we note that all coefficients of the various marginal distributions are elements of the set of joint density coefficients, meaning marginal density estimates are easily computed given the joint density coefficients. An essential factor for many density estimation schemes is the ability for these schemes to be adjusted or updated based on new data. Refitting density estimates in large sample and high-dimensional scenarios can be highly time-expensive. The moment-based density estimate is somewhat resilient to this issue, as the data agency can introduce new information to the existing model by averaging a density estimate based solely on the latest data with the original density estimate.

In privacy-sensitive situations, micro-level information may not necessarily be available due to the risk of information disclosure. Like that employed in Kernel density estimation, local density estimation methods require knowledge of the frequency of observations within a particular bandwidth. Due to privacy concerns, this knowledge may be challenging to obtain in sparsely populated areas. Perturbation techniques are also increasingly being used on privacy-sensitive statistics. Perturbation greatly reduces the statistical utility of estimates, even when such information is obtainable. By only using sample moments within the density estimate expression, we can tailor our estimate based on what information is available, with sample-moment information usually being more easily accessible, especially for lower-order moments.

As a result of the large number theory, \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) approaches to \(f_{\textbf{X};\textbf{K}} \) as M increases. At the same time, the function \(f_{\textbf{X};\textbf{K}}\) is a reasonable approximate of \(f_{\textbf{X}}\) as \(K \rightarrow \infty \). Proposing \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) as an estimator of \(f_{\textbf{X}}\) makes sense. However, \(\textbf{K}\) and \(\{\textbf{X}_i\}_{i=1}^M\) might either independently or mutually impact the performance of the estimator in estimating \(f_{\textbf{X}}\). To construct a good estimator \(\hat{f}_{\textbf{X};\textbf{K}}\) in practice, the up order \(\textbf{K}\) can work as an “adjuster” while the working sample \(\{\textbf{X}_i\}_{i=1}^M\) is fixed.

In the following sections, we propose and demonstrate a computational statistics approach for determining \(\textbf{K}\) in \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\), given M is fixed. Under the mean square distance \(L_2(\Omega ^{(N)}, \mu _{\nu })\) criterion, we expect that with the determined \(\textbf{K}\), \(\hat{f}_{\textbf{X};\textbf{K}}\) can perform well based on the available sample information from the underlying population \(\textbf{X}\).

3 The \(\mathcal {N}_{\textbf{K}}\)-norm and max polynomial order

To quantify the quality of the resulting density estimate, we propose an averaged measure of divergence from the true joint density function, referred to as the \(\mathcal {N}_{\textbf{K}}\)-Norm, which is similar to the integrated square error statistic but allows for the decomposition of the error in terms of the estimated and coefficients given in (10).

The definition of the \(\mathcal {N}_{\textbf{K}}\)-Norm of the Approximation is given below.

Definition 4

\(\mathcal {N}_{\textbf{K}}\)-Norm of the Approximation.

The \(\mathcal {N}_{\textbf{K}}\)-Norm of the sample \(\{\textbf{X}_i\}_{i=1}^M\) denoted \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is defined to be the \(L_2({\Omega ^{(N)}},\mu _{\nu })\) norm of the difference between the sample approximation \(\hat{f}_{\textbf{X};\textbf{K}}\left( \textbf{x};\{\textbf{X}_i\}_{i=1}^M \right) \) and the true joint density \(f_{\textbf{X}}(\textbf{x})\) as a ratio of the reference joint density \(f_{\nu }(\textbf{x})\). \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is expressed as

$$\begin{aligned} \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) = \int _{\Omega ^{(N)}} \left|\frac{\hat{f}_{\textbf{X};\textbf{K}} \left( \textbf{x} ; \{\textbf{X}_i\}_{i=1}^M\right) }{ f_{\nu }(\textbf{x})} - \frac{ f_{\textbf{X}}(\textbf{x}) }{ f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu }. \end{aligned}$$

\(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is a statistic. Using \(\mathcal {N}_{\textbf{K}}\) to measure the difference between \(\hat{f}_{\textbf{X};\textbf{K}}\) and \(f_\textbf{X}\) quantitatively is different from measuring the maximum point-wise difference between the two functions. A small value of \(\mathcal {N}_{\textbf{K}}\) cannot guarantee the differences between \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})\) and \(f_\textbf{X}(\textbf{x})\) are small for all \(\textbf{x}\). However, it can ensure that, on average, the differences between \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})\) and \(f_\textbf{X}(\textbf{x})\) can be depicted by the value of \(\mathcal {N}_{\textbf{K}}\). The difference is a difference of ratios. As long as \(f_{\nu }(\textbf{x})\) has a non-zero bound, such a ratio difference can be reasonable to depict the difference between \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})\) and \(f_\textbf{X}(\textbf{x})\), particularly, if \(f_{\nu }(\textbf{x})\) has a uniform distribution.

It is also important to note the similarity of the \(\mathcal {N}_{\textbf{K}}\)-Norm to the integrated square error statistic (ISE) of a density estimate \(\hat{f}(\textbf{x})\) of \(f(\textbf{x})\) expressed below,

$$\begin{aligned} {\textit{ISE}}(\hat{f}) = \int _{\Omega ^{(N)}} \left|\hat{f}(\textbf{x}) - f(\textbf{x}) \right|^2 d\textbf{x}. \end{aligned}$$
(12)

The point of difference between the \(\mathcal {N}_{\textbf{K}}\)-norm and \({\textit{ISE}}(\hat{f})\) is that the former is a weighted squared \(L_2\)-norm in the \(L_2({\Omega ^{(N)}},\mu _{\nu })\) space. In contrast, the integrated square error is measured within Euclidean space. Using this weighted measure, the expected value of the statistic, which reflects the variance and bias components of the estimate, can be expressed in terms of the corresponding coefficients in (10). This expression of \(\mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \) is given in Theorem 4.

Theorem 4

The expected value of \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is given by

$$\begin{aligned} \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right]&=\sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{n_1,\dots ,n_N} \right) \\&\quad + \mathcal {T}_{\textbf{K}}, \end{aligned}$$

where \( \mathcal {T}_{\textbf{K}} = \sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 - \sum _{n_1=0}^{K_1}\dots \sum _{n_N=0}^{K_N} C_{n_1,\dots ,n_N}^2 \) denotes the remainder term.

The proof of Theorem 4 is simple if we note that,

$$\begin{aligned}&\int _{\Omega ^{(N)}} \left|\frac{\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )}{ f_{\nu }(\textbf{x})} - \frac{ f_{\textbf{X}}(\textbf{x}) }{ f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu }\\&\quad =\sum _{n_1 = 0}^{K_1}\dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}^2 \\&\qquad -2 \sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}C_{n_1,\dots ,n_N} \\&\qquad + \sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N}^2. \end{aligned}$$

A crucial dichotomy arises when applying the moment-based density estimation method, which is demonstrated in the expression of the expected value of \(\mathcal {N}_{\textbf{K}}\)-Norm. Increasing the order of the approximation will lead to a reduction in the remainder term \(\mathcal {T}_{\textbf{K}}\). However, an increase in the order of approximation also inflates the aggregated variance of the coefficient estimators.

This tradeoff between the remainder term and the estimator variance is at the heart of determining the optimal strategy for approximating the joint density function in practice, based on sample data. The tradeoff also suggests that the usual approach to polynomial approximations used in mathematics, taking the largest order feasible based on computation and accuracy requirements, is inappropriate in the sample context. Instead, it suggests that there may be an optimal max polynomial order to the approximation.

We provide the following definition of an optimal max polynomial order.

Definition 5

An Optimal Max Polynomial Order.

An optimal max polynomial order denoted \(\textbf{K}^{*} = (K_1^{*},\dots ,K_N^{*})\) is defined as a vector of max polynomial orders in the approximation such that

$$\begin{aligned} \mathbb {E} \left[ \mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \le \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] , \end{aligned}$$

for any possible max polynomial order \(\textbf{K} = (K_1,\dots ,K_N)\).

This \(\textbf{K}^{*}\) will ensure that the mean of \(\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) takes the smallest value. It does not mean that, for all the realisations of the random sample \(\{\textbf{X}_i\}_{i=1}^M\), \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) will take the smallest value at \(\textbf{K}=\textbf{K}^*\). The variance of \(\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) will provide the information of a confidence level of a realisation of \(\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) falling in a close neighbourhood of the smallest value.

Given just the definition, it is unclear whether we can obtain such a quantity. It appears entirely feasible that taking larger and larger max polynomial orders may improve the density estimate (based on the \(\mathcal {N}_{\textbf{K}}\)-Norm). However, in the following theorem, we show that provided the estimated coefficients applied in the approximation have a non-zero variance (after some point), then the optimal max polynomial order is finite, i.e., for \(\textbf{K}^{*} = (K_1^{*},\dots ,K_N^{*})\) then \(K_k^{*} < \infty \) for \(k = 1,\dots ,N\). In practice, the estimated coefficients applied in the approximation will have a non-zero variance as the estimated coefficients are linear combinations of higher and higher order sample moments.

Theorem 5

Suppose there exists \(\epsilon _0 > 0 \) such that \( {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 \) for all \(n_1,\dots ,n_N \in \mathbb {N}_0\) where \((n_1,\dots ,n_N) \ne (0,\dots ,0)\). Then \( \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \), has a finite optimal max polynomial order.

The proof of Theorem 5 can be found in “Appendix C”.

In a practical situation, we may not necessarily know the true joint density of a sample, and hence it would be untenable to calculate the \(\mathcal {N}_{\textbf{K}}\)-Norm statistic. But this value does not explicitly need to be known to determine the optimal max polynomial order. Consider the following definition of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm.

Definition 6

The Shifted \(\mathcal {N}_{\textbf{K}}\)-Norm.

The shifted \(\mathcal {N}_{\textbf{K}}\)-Norm denoted \(\mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is defined as

$$\begin{aligned} \mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right)&= \sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}^2 \nonumber \\&\quad - 2\mathbb {E}_{\textbf{X}}\left[ \frac{\hat{f}_{\textbf{X};\textbf{K}}(\textbf{X};\{\textbf{X}_i\}_{i=1}^M)}{f_{\mathbf {\nu }}(\textbf{X})} \right] . \end{aligned}$$
(13)

The relationship between \(\mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) \) and \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is summarised as follows:

  1. (1)

    \(\begin{aligned} \mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) =\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M) - \int _{\Omega ^{(N)}}\frac{f^2_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} d\textbf{x} \end{aligned}\).

  2. (2)

    If \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M))\) take the smallest value at \(\textbf{K}^*\), \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M))\) will also take the smallest value at \(\textbf{K}^*\), vice versa.

  3. (3)

    Both of \(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M\) and \(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M \) have the same variance.

The second term of \(\mathcal {N}_\textbf{K}^*(\{\textbf{X}_i\}_{i=1}^M\) in (13) can be more easily estimated in a computational approach, i.e. for any random function \(g(\textbf{X})\), \(\mathbb {E}_{\textbf{X}}[g(\textbf{X})]\approx (1/M) \sum _{i=1}^M g(\textbf{X}_i)\). Therefore, given the relationship between \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{{\textbf{K}}}^*(\{\textbf{X}_i\}_{i=1}^M))\) and \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}} (\{\textbf{X}_i\}_{i=1}^M))\), the optimal max polynomial order \(\textbf{K}^*\) can be identified through \(\min _{\textbf{K}} \{\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M))\}\), rather than \(\min _{\textbf{K}}\{\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M))\}\).

We propose the following strategy for estimating the mean of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm given a random sample \(\{\textbf{X}_i\}_{i=1}^M\). This strategy is valid subject to a condition that the probability distribution of any subsample, which was randomly drawn from the random sample \(\{\textbf{X}_i\}_{i=1}^M\), is the same as that of \(\{\textbf{X}_i\}_{i=1}^M\). This condition is not too strict in general as long as the size of the subsample is not too small.

  1. 1.

    Randomly partition observations of the random sample into two approximately equal-size partitions \(\{\textbf{X}_i\}_{i\in \mathcal {I}}\) and \(\{\textbf{X}_i\}_{i\in \overline{\mathcal {I}}}\).

  2. 2.

    Calculate the statistic:

    $$\begin{aligned}&\hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}}}\right) \\&\quad = \sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}(\{\textbf{X}_i\}_{i\in \mathcal {I}})^2 \\&\qquad -\frac{2}{ ||\overline{\mathcal {I}} ||} \sum _{j \in \overline{\mathcal {I}} } \frac{\hat{f}_{\textbf{X};\textbf{K}}({X}_{1,j},\dots ,{X}_{N,j};\{\textbf{X}_i\}_{i\in \mathcal {I}})}{f_{\nu } ({X}_{1,j},\dots ,{X}_{N,j})} \end{aligned}$$

    where \(||\overline{\mathcal {I}} ||\) denotes the sample size of \(\{ \textbf{X}_i \}_{i \in \overline{\mathcal {I}}}\).

  3. 3.

    Repeat 1 and 2 B times to obtain the set of repeated shifted \(\mathcal {N}_{\textbf{K}}\)-Norm statistics

    $$\begin{aligned} \Big \{ \hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}_b}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}_b}}\right) \Big \}_{b=1}^B. \end{aligned}$$
  4. 4.

    Estimate \(\mathbb {E}_{\textbf{X}}[\mathcal {N}^*_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M)]\) by

    $$\begin{aligned} \hat{\mathcal {N}}_{\textbf{K};B}^{*}&\left( \{\textbf{X}_i\}_{i=1}^M\right) \nonumber \\&= (1 / B) \sum _{b=1}^B \hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}_b}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}_b}}\right) . \end{aligned}$$
    (14)

    The variance of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm statistics can be estimated through the sample variance.

With the information of the mean of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm, we can determine an appropriate upper polynomial order \(\textbf{K}=(K_1, \ldots , K_N) \) for moment-based density function estimation. Sometimes, searching for the proper upper polynomial order \(\textbf{K}\) is a complex process, especially when \(N >3\). To reduce the complexity we may restrict the search process only to those \(\textbf{K}\) with \(K_1=\cdots = K_N\) in practice. However, restricting the search process to those \(\textbf{K}\) with \(K_1=\cdots = K_N\) may not be appropriate when there are considerable differences of scale and distribution between marginal variables of the data. We use such a restricted search process in the comparison study undertaken in the next section as both marginal variables are marginally beta distributed on [0, 1]. In Sect. 5, we do not apply a restricted search process as there are large differences in distribution and scale across the marginal variables of the data.

4 Comparing density estimation methods

In the following section, we compare the performance of the moment-based density estimate and a range of other multivariate density estimation methods through a bivariate example below.

4.1 Bivariate beta distribution example

Consider the Bivariate Beta Distribution discussed in Macomber and Myers (1983), which has density function

$$\begin{aligned} f_{X_1,X_2}(x_1,x_2)&= \frac{\Gamma (\alpha _1+\alpha _2 + \alpha _3)}{\Gamma (\alpha _1)\Gamma (\alpha _2)\Gamma (\alpha _3)}x_1^{\alpha _1 - 1}x_2^{\alpha _2 - 1}\nonumber \\&\quad \times (1-x_1-x_2)^{\alpha _3 - 1}, \end{aligned}$$
(15)

for \(x_1,x_2 > 0\) and \(x_1 + x_2 \le 1\). The moments of this bivariate density function are given by Macomber and Myers (1983) as

$$\begin{aligned} E(X_1^m X_2^n) = \frac{\Gamma (\alpha _1 + \alpha _2 + \alpha _3)\Gamma (\alpha _1+m)\Gamma (\alpha _2+n)}{\Gamma (\alpha _1 + \alpha _2 + \alpha _3 + m + n)\Gamma (\alpha _1)\Gamma (\alpha _2)}. \end{aligned}$$

We perform a simulation study of 100 simulations for four scenarios \(\alpha _1 = 0.6\) and \(\alpha _2 = 0.8\), \(\alpha _1 = 1.6\) and \(\alpha _2 = 0.8\), \(\alpha _1 = 0.6\) and \(\alpha _2 = 1.8\), and \(\alpha _1 = 1.6\) and \(\alpha _2 = 1.8\) wherein every scenario \(\alpha _3 = 1\). In all the scenarios, we consider the performance of the bivariate Gaussian kernel density estimate (Duong 2007), the bivariate Gaussian copula with a marginal beta distribution (Hofert et al. 2020; Yan 2007; Kojadinovic and Yan 2010; Hofert and Mächler 2011), and the density estimation trees estimate (Meyer 2017) against two moment-based density estimates. The first moment-based density estimate (uMBD) applies a uniform reference density for marginal distributions \(\nu _1,\nu _2 \sim U[0,1]\). The other moment-based density estimate (bMBD) applies a beta reference density for both marginal distributions where the parameters of the beta distribution are taken to be the method of moments estimates of the sample data for each marginal.

We note that the bivariate Gaussian kernel density estimate (Duong 2007) is a non-parametric local estimation method and is calculated using the ks package in R and associated kde function with optimal bandwidth selector (see Duong 2018). The bivariate Gaussian copula with a marginal beta distribution (Hofert et al. 2020; Yan 2007; Kojadinovic and Yan 2010; Hofert and Mächler 2011) is a parametric global density estimate and is calculated using the copula package in R (see Hofert et al. 2020) with the marginal beta distribution parameters taken to be the corresponding method of moments. The density estimation trees estimate (Meyer 2017) is a machine learning estimate and is calculated using the detpack package in R with element refinement goodness of fit significance taken to be \(1\times 10^{-5}\) (see Meyer 2019).

For all density estimates, the integrated square error (12) with respect to the true bivariate beta density function (15) is determined by Monte Carlo integration using a uniform sample of size 100,000 on \([0,1] \times [0,1]\). The ISE statistic is averaged across all 100 simulations to give the mean integrated square error (MISE). In the case of the uMBD moment-based density estimate, the ISE statistic corresponds precisely to the \(\mathcal {N}_{\textbf{K}}\)-Norm statistic (see Definition 4), and hence the MISE is an estimate of the expected \(\mathcal {N}_{\textbf{K}}\)-Norm expressed in Theorem 4. In the case of the bMBD moment-based density estimate, the mean \(\mathcal {N}_{\textbf{K}}\)-Norm across the 100 simulations is also determined using Monte Carlo integration with a uniform sample of size 100,000 on \([0,1] \times [0,1]\).

The max polynomial order applied in the expression of the moment-based density estimates, \(\textbf{K} = (K_1, K_2)\), is set to be consistent across both marginal distributions, i.e. \(K = K_1 = K_2\). We determine the optimal max polynomial order (see Definition 5) by selecting the max polynomial order which minimises the mean \(\mathcal {N}_{\textbf{K}}\)-Norm. For uMBD this is consistent with minimising the MISE. For each comparison density estimate, we also determine the percentage of simulations in which the uMBD and bMBD achieve a smaller MISE value when the optimal max polynomial order is used.

The remainder term component of the mean \(\mathcal {N}_{\textbf{K}}\)-Norm, i.e. the \(\tau _{\textbf{K}}\) term given in Theorem 4, is also estimated by evaluating the \(\mathcal {N}_{\textbf{K}}\)-Norm when theoretical moments are used to express the moment-based density estimate. This term reflects the bias component of error in the moment-based density estimate.

We provide the simulation study results for all four scenarios in Table 2.

Table 2 Simulation study results of 100 simulations from the bivariate beta distribution (15) with four different parameterisations

The simulation study shows that both the uMBD and bMBD estimates outperform the Gaussian kernel density estimate nearly universally for all scenarios in which at least one \(\alpha _i < 1\) (\(i=1,2\)). However, the Gaussian kernel density estimate universally outperforms both the uMBD and bMBD estimate when \(\alpha _1 = 1.6\) and \(\alpha _2 = 1.8\). When \(\alpha _1 = 1.6\) and \(\alpha _2 = 1.8\), the true density \(f_{X_1,X_2}(x_1,x_2)\) achieves its maximum on the boundary condition \(x_1 + x_2 \le 1\). This boundary condition is not specifically enforced by construction for any of the density estimates. Although due to the local estimation applied by the Gaussian kernel density estimate, zero areas of the sample space can be better predicted. The polynomial curves applied in the uMBD and bMBD approaches require a high max polynomial order to model the immediate change in density values to zero adequately. This issue is not as significant in previous parameterisations as density function values near the \(x_1 + x_2 \le 1\) boundary are small. As a higher max polynomial order would also constitute greater sampling variance, the optimal max polynomial order remains relatively low, reducing the quality of the moment-based density estimates relative to the Gaussian kernel density estimate for the \(\alpha _1 = 1.6\) and \(\alpha _2 = 1.8\) parameterisation. The Gaussian copula with marginal beta distribution performs best comparatively in scenarios where \(\alpha _1 = 0.6\), outperforming the uMBD estimate in approximately 86–88% of samples. The copula estimate does not appear to perform as well as the bMBD estimate in general for any scenario, only achieving a smaller MISE in at most 1% of simulations in each scenario. The density estimation trees estimate similarly does not perform as well as the bMBD estimate across all scenarios. The uMBD estimate also outperforms the density estimation trees estimate in most simulations in all scenarios.

5 Generating synthetic micro-data example

In the following section, we provide an example of synthetic micro-data generated from a “confidential” data set. Before this example, we outline the process of simulating synthetic data from the moment-based density estimate.

5.1 Simulating synthetic data from the density estimate

Given a random sample \(\{(Y_i,\textbf{X}_i)\}_{i=1}^M =\{(Y_{i},X_{1,i}, \dots , X_{N,i})\}_{i=1}^M\) of \((Y,\textbf{X}) = (Y,X_1,X_2,\dots ,X_N)\), and corresponding moment-based density estimate \(\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )\), we define the moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\) as follows,

$$\begin{aligned}&\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\nonumber \\&\quad = \frac{\int _{a_Y}^{y} \hat{f}_{Y,\textbf{X};\textbf{K}}((t,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )dt}{ \hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )}, \end{aligned}$$
(16)

where \(\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )\) denotes the moment-based density estimate of \(\{\textbf{X}_i\}_{i=1}^M\) with max polynomial order \(\mathbf {K_X}\), and \(a_Y\) denotes the lower bound of Y, i.e. \(Y \in [a_Y,b_Y]\).

Given Property (2) in Sect. 2.1.1, we derive the following expression for the moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\), \(\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\).

$$\begin{aligned}&\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M) =\frac{f_{\nu _{\textbf{X}}}(\textbf{x})}{\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M)} \\&\quad \times \left\{ \sum _{n_Y=0}^{K_Y} \sum _{n_{X,1} = 0}^{K_{X,1}} \dots \sum _{n_{X,N} = 0}^{K_{X,N}} \hat{C}_{n_Y,n_{X,1},\dots ,n_{X,N}} \right. \\&\quad \times \left. \left( \frac{\sigma _Y(y)f_{\nu _Y}(y)}{\lambda ^{*}_{Y,n_Y}} P^{'}_{Y,n_Y}(y)+ \delta _{n_Y,0}F_{\nu _Y}(y) \right) \right. \\&\quad \times \left. P_{n_{X,1},\dots ,n_{X,N}}(\textbf{x}) \right\} , \end{aligned}$$

where \(f_{\nu _Y}\) and \(f_{\nu _{\textbf{X}}}\) are the reference densities of Y and \(\textbf{X}\) respectively, \( \hat{C}_{n_Y,n_{X,1},\dots ,n_{X,N}} \), are the estimated coefficients of the moment-based density estimate \(\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )\) with \(\textbf{K} = (K_Y,K_{X,1},\dots ,K_{X,N})\), \(\lambda ^{*}_{Y,n_Y} = {\left\{ \begin{array}{ll} \lambda _{Y,n_Y} &{} n_Y = 1,2,\dots \\ 1 &{} n_Y = 0 \end{array}\right. }\) where \(\lambda _{Y,n_Y}\) is the \(n_Y\)th eigenvalue related to the univariate polynomial \(P_{Y,n_Y}(y)\), \(F_{\nu _Y}(y)\) denotes the reference cumulative distribution function of Y, and \(\delta _{n_Y,0}\) denotes the Kronecker delta.

We make note of the following in regards to \(\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\):

  1. 1.

    We have an analytical expression of the moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\), which does not require numerical integration.

  2. 2.

    The moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\) is a univariate cumulative distribution function (in terms of Y) from which we can simulate an observation \(Y^{*}\) using inverse transform sampling. That is, given \(X_1^{*},\dots ,X_N^{*}\) from \(\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )\) and a uniform random variable \(U\sim \text {Unif}[0,1]\), then \(Y^{*} = \hat{F}_{Y {\mid }\textbf{X};\textbf{K}}^{-1}(U {\mid }X_1^{*},\dots ,X_N^{*}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\) and \(X_1^{*},\dots ,X_N^{*}\) have joint density \(\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )\).

  3. 3.

    Without loss of generality, we can express any column of our data set as \(\{Y_i\}_{i=1}^M\) and the others as \(\{\textbf{X}_i\}_{i=1}^M\).

  4. 4.

    By applying a Gibbs sampling approach, we can simulate all columns of the data to form a complete synthetic data set.

In the following section, we obtain a 5-dimensional moment-based density estimate and simulate data from a “sensitive” variable of the data (a variable that poses a risk to confidentiality if disclosed) by applying inverse transform sampling to the moment-based conditional cumulative probability function estimate.

5.2 Synthetic data example

The data we consider \(\{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\) are 9568 observations of 5 variables collected from a combined cycle power plant over six years (2006–2011) when the power plant was set to work with full load (Tüfekci 2014; Kaya and Tufekci 2012). The dataset consists of hourly average ambient variables (\(\textbf{X} = (X_1,X_2,X_3,X_4)\)): temperature (\(X_1\)), ambient pressure (\(X_2\)), relative humidity (\(X_3\)), and exhaust vacuum (\(X_4\)), that were collected to predict the net hourly electrical energy output of the plant (Y).

Suppose that the values of the plant’s net hourly electrical energy output are considered “sensitive” due to a risk of proprietary infringement or operational sabotage were the actual data released. We produce synthetic data that mimics the true values of the net hourly electrical energy output by obtaining a moment-based density estimate \(\hat{f}_{Y,\textbf{X};\textbf{K}}\) of the joint distribution of the data. For each variable, we select uniform reference densities with bounds consistent with its range plus or minus (for the maximum and minimum, respectively) a buffer of 5% of its standard deviation.

We estimate the optimal max polynomial order \(\hat{\textbf{K}^{*}} =(\hat{K}_1^{*},\dots ,\hat{K}_5^{*})\) by minimising \(\hat{\mathcal {N}}_{\textbf{K};B}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) \) (14) for \(B=100\),

$$\begin{aligned} \hat{\textbf{K}^{*}} = \underset{\textbf{K} \in \mathbb {N}_0^5}{\arg \!\min } \Big \{ \hat{\mathcal {N}}_{\textbf{K};100}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) \Big \}. \end{aligned}$$

Note that we only considered values of \(\hat{K}^{*}_k \le 35\), for \(k=1,\dots ,5\), as the statistic was diverging in all cases.

Averaged across \(B=100\) randomly sampled partitions, we achieve the estimates of the shifted \({\mathcal {N}}_\textbf{K}\)-Norm (\( \hat{\mathcal {N}}_{\textbf{K};100}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) \)) for various max polynomial orders from 1 to 35 for each variable. The estimated optimal max polynomial order is determined to be \(\hat{\textbf{K}^{*}} = (21,22,12,5,22)\).

We simulate a synthetic sample of net hourly electrical energy outputs (\(\{Y_i^{*}\}_{i=1}^{9658}\)) using inverse-transform sampling on the moment-based conditional cumulative probability function estimate \(\hat{F}_{Y {\mid }\textbf{X};\hat{\textbf{K}^{*}}}^{-1}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568})\) where \(\hat{\textbf{K}^{*}} = (21,22,12,5,22)\). We note that given uniform reference densities are used, \(\hat{F}_{Y {{\mid }} \textbf{X};\hat{\textbf{K}^{*}}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568})\) is a polynomial with respect to y. We apply the base R polynomial rooting function polyroot() (Jenkins and Traub 1972) to compute values from the inverse moment-based conditional cumulative probability function estimate.

Histograms comparing the distribution of the actual net hourly electrical energy output (\(\{Y_i\}_{i=1}^{9658}\)) and synthetic net hourly electrical energy output (\(\{Y_i^{*}\}_{i=1}^{9658}\)) are given in Fig. 1. Scatterplots showing the associations of both the actual and synthetic net hourly electrical energy outputs with each net hourly ambient variable are given in Fig. 2. Heatmaps showing the bivariate marginal density of both the actual and synthetic net hourly electrical energy outputs with each net hourly ambient variable are given in Fig. 3.

The histograms in Fig. 1 indicate that the synthetic net hourly electrical energy outputs approximate the shape of the actual net hourly electrical energy outputs well. However, synthetic net hourly electrical energy outputs appear slightly over-populated in the distribution’s tails compared to the actual net hourly electrical energy outputs, leading to a slightly “flatter” distribution overall. This over-population of synthetic net hourly electrical energy outputs in sparsely populated areas of the actual data is also apparent when comparing the bivariate associations of the net hourly electrical energy outputs with each net hourly ambient variable shown in Figs. 2 and 3 . The general pattern of association between the net hourly electrical energy outputs and each net hourly ambient variable is approximated by the synthetic data well. However, sparse areas of the actual data appear over-estimated by the moment-based density estimate.

To assess the statistical utility of the synthetic data for prediction modelling, we compare statistics obtained by a main-effects ordinary least squares regression of the net hourly ambient variables on the actual and synthetic net hourly electrical energy outputs. The results of the linear regression analysis are given in Table 3. The estimated coefficient of the ambient temperature data (\(-1.727\), 95% CI [\(-1.785, -1.669\)]) and the relative humidity data (\(-0.128\), 95% CI [\(-0.143,-0.112\)]) obtained from the synthetic data underestimated the magnitude of the corresponding estimated coefficient obtained from the actual data (\(-1.978\) and \(-0.158\) respectively). Whereas the estimated coefficients of exhaust volume (\(-0.259\), 95% CI [\(-0.287,-0.231\)]) and ambient pressure (0.093, 95% CI [0.057, 0.128]) obtained from the synthetic data better represent the estimated coefficients obtained from the actual data, with both actual coefficient estimates (\(-0.234\) and 0.062 respectively) belonging to the 95% confidence intervals of the corresponding coefficient estimates obtained from the synthetic data. All coefficient estimates obtained from the synthetic data have increased standard error estimates. The synthetic-based linear regression model has an increased residual mean square error estimate (8.804) compared to the actual-based linear regression model (4.558). A reduction in the proportion of variation explained by the model (\(R^2\)) is also seen in the synthetic-based model, with \(R^2 = 0.748\) from the synthetic-based model being less than \(R^2 = 0.929\) from the actual-based model.

Although the performance of the synthetic data appears mixed, unlike other model-based synthetic generation methods, synthetic data obtained from the moment-based density estimate is not model specific and can be used to fit any model. For instance, suppose we were to fit a full factorial linear regression model of the net hourly ambient variables on the actual and synthetic net hourly electrical energy outputs. Without re-defining our generation model or re-sampling, we obtain the results given in Table 4 for the same synthetic data used to obtain the results in Table 3. Again, all coefficient estimates obtained from the synthetic data had increased standard error estimates. Although in all cases, the estimated coefficients obtained from the actual data for the full factorial model are contained within the 95% confidence interval of the corresponding estimated coefficients obtained from the synthetic data.

Considering the disclosure risk posed by the synthetic net hourly electrical energy outputs, we note that only approximately 3% of observations in the actual data have synthetic observations within a 1% standard-deviation-based interval (Mateo-Sanz et al. 2004) of their values. Figure 2 shows that regions containing multivariate outliers in the actual data are well populated in the synthetic data. A data intruder, therefore, would find it difficult to determine which synthetic observations (if any) could be successfully attributed to an actual observation and which should be treated as erroneous based on the synthetic data alone. Given a public release of the synthetic data, we suggest that the disclosure-prone multivariate outliers present in the actual data are more greatly shielded from attribution due to the high presence of synthetic observations in and around corresponding regions of the synthetic data.

Fig. 1
figure 1

Histograms of the plant’s actual (left) and synthetic (middle) net hourly electrical energy outputs with a side-by-side comparison histogram (right)

Fig. 2
figure 2

Scatter-plots depicting the point-wise dispersion of net hourly electrical energy outputs with respect to each hourly ambient variable from the actual (top) and synthetic (bottom) data sets

Fig. 3
figure 3

2-dimensional binned histogram showing the empirical density of the actual (top) and synthetic (bottom) net hourly electrical energy outputs with respect to each hourly ambient variable

Table 3 Table of Linear Regression statistics obtained from modelling net hourly electrical energy output with respect to the net hourly ambient variables in a main effects model based on the actual and synthetic data sets
Table 4 Table of Linear Regression statistics obtained from modelling net hourly electrical energy output with respect to the net hourly ambient variables in a full factorial model based on the actual and synthetic data sets

6 Conclusion

Our specific interest in this paper is the application of moment-based density estimation in the context of sharing confidential micro-data while preserving the privacy of the respondents. The multivariate moments’ problem and its application to estimating joint density functions are often considered highly impractical in modern-day analysis. Methods that do exist in this context, often fail to account for the sampling problems faced in real-world applications due to the computational complexity and intractability of higher-order moments. One of the critical difficulties in obtaining moment-based density expressions is the breadth of variation seen in densities requiring estimation. The support of these distributions must align with the orthogonal basis used in the density approximation (Kleiber and Stoyanov 2013). By requiring consistency in sample space, we inherently arrive at three different cases: Hausdorff (density functions defined on a compact space), Stieltjes (density functions defined on a half bounded (lower or upper) space) (Stieltjes 1894; Stoyanov and Tolmatz 2005) and Hamburger (density functions defined on a completely unbounded space) (Shohat and Tamarkin 1943; Akhiezer and Kemmer 1965). We have focused mainly on the Hausdorff problem, joint-densities on a compact space, and have only referred to our ability to apply such methods on the unbounded space provided moment determinacy.

Determinacy refers to the existence of a unique moment-based expansion of the density function. Kleiber and Stoyanov (2013) extensively review this concept in the context of the multivariate moments problem. Interestingly, the paper investigates the relationship between the marginal determinacy and the determinacy of the multivariate distribution as a whole. It raises Peterson’s result (Petersen 1982), that marginal determinacy implies multivariate determinacy and conjectures that for absolutely continuous functions, multivariate determinacy implies marginal.

Therefore we can comfortably apply the estimation results presented in this paper to unbounded distributions if, at the very least on the marginal level, our choice of unbounded reference densities converges to a unique polynomial expression. Furthermore, due to the unique properties of the Gaussian and gamma distributions (and others related, e.g., Chi-Square, Weibull), it has been shown that reference densities of this variety have moment-determinacy (Sánchez-Ruiz and Dehesa 1998; Provost 2005). Therefore this approach has greater application than may appear at first glance. Furthermore, when unsure of the appropriateness of a particular polynomial basis, we can instead focus on reviewing the convergence marginally rather than on the joint density as a whole to gain an idea of moment-determinacy.

In this paper, we proposed a generalised form of a multivariate polynomial-based density expansion, making the estimate applicable to one type of polynomial basis and any hypergeometric type polynomial basis (provided determinacy is assured). This expansion is generated from a generalised multivariate polynomial basis that, by construction, allows for the estimation of joint density functions based on aggregated sufficient statistics, namely the sample moments. We derived critical statistical properties of the moment-based density estimate and suggested a computational strategy by which we can estimate the optimal max polynomial order of the expansion. Through the contributions of this paper, we have demonstrated that applying multivariate polynomial expansion to joint density estimation is beneficial for analytical formulations and can be used as a practical density estimation alternative to existing comparable methods.

Furthermore, we can extend this result to estimate both cumulative and conditional cumulative distribution functions, allowing for the simulation of data from the moment-based density estimates without requiring grid-based sampling techniques. This property has considerable ramifications when producing synthetic representative samples of the original data and imputing missing information, especially when we can only ascertain empirical moments. The production of non-parametric synthetic continuous data based on aggregated statistics is an area of great interest in data privacy research.