Moment-based density estimation of confidential micro-data: a computational statistics approach

Wakefield, Bradley; Lin, Yan-Xia; Sarathy, Rathin; Muralidhar, Krishnamurty

doi:10.1007/s11222-022-10203-1

Moment-based density estimation of confidential micro-data: a computational statistics approach

Original Paper
Open access
Published: 20 January 2023

Volume 33, article number 35, (2023)
Cite this article

Download PDF

You have full access to this open access article

Statistics and Computing Aims and scope Submit manuscript

Moment-based density estimation of confidential micro-data: a computational statistics approach

Download PDF

Bradley Wakefield ORCID: orcid.org/0000-0002-6691-162X¹,
Yan-Xia Lin¹,
Rathin Sarathy² &
…
Krishnamurty Muralidhar³

1540 Accesses
1 Altmetric
Explore all metrics

This article has been updated

Abstract

Providing access to synthetic micro-data in place of confidential data to protect the privacy of participants is common practice. For the synthetic data to be useful for analysis, it is necessary that the density function of the synthetic data closely approximate the confidential data. Hence, accurately estimating the density function based on sample micro-data is important. Existing kernel-based, copula-based, and machine learning methods of joint density estimation may not be viable. Applying the multivariate moments’ problem to sample-based density estimation has long been considered impractical due to the computational complexity and intractability of optimal parameter selection of the density estimate when the true joint density function is unknown. This paper introduces a generalised form of the sample moment-based density estimate, which can be used to estimate joint density functions when only the information of empirical moments is available. We demonstrate optimal parametrisation of the moment-based density estimate based solely on sample data by employing a computational strategy for parameter selection. We compare the performance of the moment-based estimate to that of existing non-parametric and parametric density estimation methods. The results show that using empirical moments can provide a reasonable, robust non-parametric approximation of a joint density function that is comparable to existing non-parametric methods. We provide an example of synthetic data generation from the moment-based density estimate and show that the resulting synthetic data provides a reasonable disclosure-protected alternative for public release.

Violating the normality assumption may be the lesser of two evils

Article Open access 07 May 2021

Ulrich Knief & Wolfgang Forstmeier

A simple introduction to Markov Chain Monte–Carlo sampling

Article Open access 11 March 2016

Don van Ravenzwaaij, Pete Cassey & Scott D. Brown

A Systematic Review of Hidden Markov Models and Their Applications

Article 12 May 2020

Bhavya Mor, Sunita Garhwal & Ajay Kumar

1 Introduction

Disseminating synthetic data in place of confidential micro-level data can be an efficient way to protect the privacy of survey and clinical trial participants. Whether a set of synthetic data is ideally suitable in representing the statistical information from a confidential data source depends upon the similarity of the density function of the set of synthetic data to the density function of the confidential data source. Therefore, having a reliable and robust method for estimating the underlying density function is essential in the synthetic data generation process.

Many existing non-parametric density function estimation methods, including kernel density estimation, rely heavily on local area estimation. The low-density local areas used to construct the density function estimate of a sample space may contain very few sample observations. The values of the sample observations in those areas are more easily identified or reconstructed later from the constructed density function estimate, posing greater risk to privacy and confidentiality. Data agencies can offset the risk of disclosure by defining local areas of greater size (larger bandwidth) or through perturbation, but this may adversely impact statistical utility. Therefore, those methods relying on local area estimation may not be suitable when confidential data is involved. Existing parametric maximum likelihood and mixture distribution approaches often lack sufficient robustness to estimate atypical distributions effectively. Researchers tend to opt more for kernel estimators, maximum likelihood estimators, or increasingly, non-parametric machine learning techniques (Izenman 1991; Wang and Scott 2019). While highly effective in varying contexts, these methods have limitations, especially when dealing with atypical, incomplete data. Moreover, circumstances in which the data agency may restrict access to micro-level data will result in data-driven non-parametric approaches becoming untenable.

This paper proposes a computational approach for estimating a joint density function of continuous data. The estimated density function is expressed as a finite linear summation of orthogonal polynomial functions and only relies on sample moments. Without knowledge of the joint density function underlying the continuous data, this paper develops a feasible computational strategy that searches for the appropriate maximum order of polynomials used in the estimated density function. We show that the moment-based density approach achieves a compromise between non-parametric and parametric methods, offering a non-parametric global estimate that is sufficiently robust. We also provide an example application of moment-based density estimation in producing synthetic continuous micro-data on confidential micro-data for disclosure protection.

Although our primary motivation, the application of moment-based density estimation is not limited to use cases for confidential data. Moment-based density estimation has interesting applications in imputation for missing data more broadly and for the estimation of continuous density functions from tabular data. Furthermore, there has been recent developments in applying moment-based density approximates in stochastic models for finance (Stoyanov 2016) and image reconstruction (Mustapha and Dimitrakopoulos 2010).

Constructing unique polynomial-based series expansions for continuous functions is a common practice applied in a range of areas of mathematics, physics engineering, and statistics (Munkhammar et al. 2017). Polynomial-based expansions have been used in studies to approximate probability density functions on both closed and infinite domains (Elderton and Johnson 1969; Solomon and Stephens 1978; Kendall et al. 1987; Reid 1988; Provost 2005). Known more commonly as the moments’ problem, this area of study has been considered from as far back as the 19th century (Gram 1883). The advent of greater computational power has increased the usage of polynomial expansions. Currently, however, polynomial-based expansions are rare in multivariate density estimation applications, when compared to their use for univariate estimation. As noted by Kleiber and Stoyanov (2013), the results for the multivariate case tend to be scattered throughout specialised journals and focus on an analytical perspective.

This paper focuses on distributions with bounded marginal domains to ensure determinacy. Determinacy, and more specifically moment determinacy, refers to the existence of a unique moment-based expansion (Kleiber and Stoyanov 2013). This paper requires no specific choice of the associated reference joint density (weight) used. The restriction to a bounded domain is certainly not necessary if moment determinacy can be assured. All results presented in this paper can be generalised on unbounded domains in certain situations. Unbounded reference joint densities shown in the literature to have moment determinacy include the normal and gamma distributions (Provost 2005). Furthermore, as demonstrated in Lin (2014), which applies a univariate form of moment-based density estimation to recover information from perturbed data, assuming a compact domain can result in a reasonable estimate for density functions that exist on an unbounded domain provided the sample size is sufficiently large enough.

This paper is structured in six sections. Section 2 provides the analytical results required to construct the moment-based density estimate. Section 3 considers a measure of goodness of fit for the moment-based density estimate and outlines a sample-based strategy for parameter selection. In Sect. 4, we perform a simulation study to compare the performance of the moment-based density estimation method with existing joint density estimation methods. In Sect. 5, we outline a process of simulating synthetic micro-data from the moment-based density estimate and demonstrate this application to hypothetical confidential data. We conclude our paper in Sect. 6.

2 The multivariate moment-based density estimate

Throughout the literature, researchers have mainly focused on the usefulness and effectiveness of a particular orthogonal polynomial basis in regards to estimating certain distributions, selecting classical polynomial families as the basis for the expansion (Gram 1883; Edgeworth 1905; Charlier 1914; Szegő 1939; Cramer 1946; Alexits 1961; Kendall et al. 1987; Provost 2005). These classical polynomials are usually chosen based on the resemblance of their weight functions (derivative of the positive Borel measure by which the polynomials are orthogonal) to the canonical distributions. Orthogonal polynomials for which a density approximation has been derived in the literature include Hermite (Edgeworth and Gram-Charlier Expansions), Legendre, Jacobi and, Laguerre Polynomials (Gram 1883; Edgeworth 1905; Charlier 1914; Szegő 1939; Cramer 1946; Alexits 1961; Kendall et al. 1987; Provost 2005). They have weight functions resembling normal, uniform, beta, and gamma distributions, respectively. These polynomials can all be used to provide moment-based expressions of certain density functions and are eigenfunctions of a hypergeometric type second-order differential equation (Sánchez-Ruiz and Dehesa 1998).

Multivariate orthogonal polynomial expansions, particularly for Hermite, Jacobi, and Laguerre polynomials, have been considered in the literature (Xu 1997; Griffiths and Spano 2011). Joint density expressions for quasi-multivariate normal distributions have also been developed using a Hermite polynomial-based expansion (Mehler 1866; Withers and Nadarajah 2010). This section derives a generalised form of the multivariate density estimate and the fundamental mathematical results required to apply this density estimate in practice. This form covers many existing approaches and allows for greater flexibility in the choice of reference density and resulting orthogonal polynomials used.

2.1 The orthogonal polynomial basis

Initially, we only consider the orthonormal basis for the univariate case in a single-dimensional space $\Omega = [a,b]$. Later this basis is expanded to the N-dimensional compact space ${\Omega }^{(N)} = \prod _{k=1}^N [a_k, b_k]$. Let $ f_{\nu }$ denote the probability density function of a random variable $\nu $ with the domain the closed interval $\Omega = [a,b]$. We refer to $ f_{\nu }$ as the reference distribution function, and this function will serve as the weight function on an orthonormal basis.

2.1.1 Univariate polynomial basis

Regarding the function $f_{\nu }$, the results of Sánchez-Ruiz and Dehesa (1998) can be summarised in the following theorem.

Theorem 1

If there exist polynomial functions $\sigma (x)$ and $\tau (x)$ of degree no greater than 2 and 1 with coefficients $\phi _0,\phi _1,\phi _2$ and $\psi _0,\psi _1$ respectively for $f_{\nu }$, where,

$$\begin{aligned} \sigma (x) = \phi _{2}x^2 + \phi _{1}x + \phi _{0}, \quad \text {and} \quad \tau (x) = \psi _{1}x + \psi _{0}, \end{aligned}$$

(1)

such that

$$\begin{aligned} \frac{d}{dx} \left[ \sigma (x) f_{\nu }(x) \right] = \tau (x) f_{\nu }(x), \end{aligned}$$

(2)

with boundary conditions, for $i = 0,1,\ldots $

$$\begin{aligned} \lim _{x \rightarrow a} x^i \sigma (x) f_{\nu }(x)= \lim _{x \rightarrow b} x^i \sigma (x) f_{\nu }(x) = 0, \end{aligned}$$

(3)

then there exists a set of eigenfunction solutions $\{P_n(x)\}_{n \in \mathbb {N}_0}$, where $\mathbb {N}_0$ denotes the set of all natural number including 0, of the hypergeometric-type differential equation,

$$\begin{aligned} \sigma (x) y''(x) + \tau (x) y'(x) - \lambda y(x)= 0. \end{aligned}$$

$P_n(x)$ denotes the nth eigenfunction solution corresponding to $\lambda _n = n\psi _{1} + n(n-1)\phi _{2}$ and has the following properties:

(i)
$$\begin{aligned} \lambda _n P_n(x)f_{\nu }(x) = \frac{d}{dx} \left[ \sigma (x)P_n'(x)f_{\nu }(x) \right] . \end{aligned}$$
(ii)
$P_n(x)$ is a polynomial of degree n with Rodrigues’ formula of the form
$$\begin{aligned} P_{n}(x) = \frac{B_{n}}{f_{\nu }(x)} \frac{d^{n}}{dx^{n}} \left[ (\sigma (x))^{n} f_{\nu }(x) \right] , \end{aligned}$$
where $B_{n} \in \mathbb {R}$ is the normalising constant in the Rodrigues’ formula and can be chosen arbitrarily in order to show consistency with classical orthogonal polynomial families.
(iii)
Given $B_n$ of the Rodrigues’ formula is defined as follows,
$$\begin{aligned}{} & {} B_0 = 1\quad \text {and} \nonumber \\{} & {} B_{n} = \sqrt{ \frac{(-1)^n \prod _{i=0}^{n-1} \left[ \frac{1}{ \psi _1 + (n + i -1)\phi _{2} }\right] }{ n! \int _{a}^{b} (\sigma (x))^{n}f_{\nu }(x) dx}, } \quad \text {for}\; n = 1,2,\dots , \nonumber \\ \end{aligned}$$
(4)
then,
$$\begin{aligned} \int _{a}^{b} P_{n}(x) P_{m}(x) f_{\nu }(x) dx = \delta _{n,m}, \end{aligned}$$
where $\delta _{n,m}$ denotes the Kronecker delta, and $n, m \in \mathbb {N}_0$. Hence $\{P_n(x)\}_{n \in \mathbb {N}_0}$ are orthogonal polynomials.

Theorem 1 shows that provided an appropriate quadratic polynomial $\sigma (x)$ and appropriate linear polynomial $\tau (x)$ for a given reference density $f_\nu $ can be found; then, we can obtain the aforementioned set of polynomials. Theorem 1 makes no statement about the domain of the underlying distribution and can therefore be applied to reference distributions defined on both bounded and unbounded domains. Many of the standard probability densities used in statistics meet this criterion as a reference density. Table 1 provides examples of $\sigma $ and $\tau $ functions for some of these densities and the connection of the resulting polynomials to the classical orthogonal polynomial families.

In the following, we list some properties of the orthogonal polynomials without specifying the distribution of the reference function $f_{\nu }$. These properties will benefit the computational approach for moment-based joint density function estimation later developed in this paper.

Table 1 Various possible reference densities with corresponding $\sigma (x)$ and $\tau (x)$ functions and their domain and relation to classical orthogonal polynomials

Full size table

Property 1

Regarding the polynomial function $P_n(x)=\sum _{i=0}^{n}a_{n,i}x^i$, the coefficient $a_{n,i}$ for fixed $n \in \mathbb {N}_0$ is given by,

$$\begin{aligned}{} & {} a_{n,i} = \nonumber \\ {}{} & {} {\left\{ \begin{array}{ll} \frac{a_{n,i+1}(i + 1)(i \phi _1 + \psi _0) + a_{n,i+2}(i +2)(i + 1)\phi _0}{\lambda _{n} - i(i - 1)\phi _{2} - i\psi _1}, &{} i = 0,\dots ,n-2 \\ \kappa _{n} \frac{n (n -1)\phi _1 + n \psi _0}{ \psi _1 + 2(n - 1)\phi _{2}}, &{} i = n-1\\ \kappa _{n}, &{} i = n \end{array}\right. }, \end{aligned}$$

(5)

where $\kappa _{n} = B_{n} \prod _{i=0}^{n-1} \left[ \psi _1 + (n + i -1)\phi _{2} \right] $ and $\kappa _{0} =1$.

We note that $\sigma (x)$ and $\tau (x)$ are polynomials of degree no greater than 2 and 1, with coefficients $\phi _0,\phi _1,\phi _2$ and $\psi _0,\psi _1$ respectively (1).

Property 1 provides the Frobenius solutions to the polynomial coefficients for each of our polynomial terms. We note that this solution also applies to reference densities on the unbounded domain and can be used to derive the coefficients of all the classical polynomial expressions given in Table 1.

Denote $\mu _{\nu }$ a measure on ($\Omega ,\mathscr {F}$) with $ \mu _{\nu }(A) = \int _{A} f_{\nu }(x) dx$ for $A \in \mathscr {F} $ , where $\mathscr {F}$ is a $\sigma $-field on $\Omega $.

Property 2

(i)
The integral of polynomial $P_n(x)$ with respect to the measure function $\mu _\nu $ is
$$\begin{aligned} \int P_n(x) d\mu _\nu&= \int P_n(x) f_{\nu }(x) dx \nonumber \\&={\left\{ \begin{array}{ll} \frac{\sigma (x)}{\lambda _n}P_n'(x)f_{\nu }(x) + c, &{} n = 1 ,2, \dots \\ F_{\nu }(x) + c, &{} n =0 \end{array}\right. }, \end{aligned}$$
(6)
where $c \in \mathbb {R}$ is some constant and $F_{\nu }$ is the cumulative distribution function of $f_\nu $.
(ii)
$$\begin{aligned} \lim _{t\rightarrow a} \sigma (t)P_n'(t)f_{\nu }(t) = 0. \end{aligned}$$

Property 2 expresses the indefinite integral of the orthogonal polynomials with respect to the reference density. This expression is essential in the practical application of the density estimate as it allows for the analytical derivations of the distribution functions associated with the multivariate density estimate. However, due to length constraints, we will not give the expressions of all the associated distribution functions in this paper. Property 2 will be used in Sect. 5, to derive the expression of the moment-based conditional distribution function estimate. One of the key benefits of the polynomial-based approach to density estimation is the analytical flexibility and applicability of the resulting density estimate. Provided the reference density is continuous and differentiable, then so too is the resulting moment-based density expression and thus is more easily applied in inversion and optimisation problems.

Property 3

Szökefalvi-Nagy 1965$\{P_n\}_{n \in \mathbb {N}_0}$ forms a complete orthogonal base in the $L_2$ space $L_2(\Omega ,\mu _\nu )$.

Therefore, polynomials $\{P_n\}_{n \in \mathbb {N}_0}$ can serve as basis functions when expanding any continuous functions in $L_2$ space $L_2(\Omega ,\mu _\nu )$. Given this property, we extend the complete orthogonal basis to a multivariate space in the following section.

2.1.2 A multivariate polynomial basis

Consider a N-dimensional compact space given by ${\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]$. From this point on, we will consider the orthogonal polynomial bases on ${\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]$.

Let $ f_{\nu } = \prod _{k=1}^N f_{\nu _k} $. Here $f_{\nu _k}$ is a probability density function of a random variable $\nu _k$ with the domain the closed interval $\Omega _k = [a_k,b_k]$, $k=1,\dots ,N$. Random variables $\{\nu _k\}_{k=1,\dots ,N}$ are mutually independent. We refer to $ f_{\nu }$ as the reference joint density function and this function will serve as the multivariate weight function in our orthonormal basis. We note that $f_{\nu }: {\Omega ^{(N)}} \rightarrow \mathbb {R}$ is the joint density function of the random vector $\nu =(\nu _1, \ldots , \nu _N) $.

Of interest is the $L_2$ space $L_2({\Omega ^{(N)}},\mu _{\nu })$ with measure function $\mu _{\nu }$ defined by

$$\begin{aligned} \mu _{\nu }(A) = \int _{A} f_{\nu }(\textbf{x}) d\textbf{x} = \int _{A} \left[ \prod _{k=1}^N f_{\nu _k}(x_k) \right] d\textbf{x} \quad \text {for}\;\, A \in \mathscr {F}, \end{aligned}$$

(7)

where $\mathscr {F}$ is the $\sigma $-field of $\Omega ^{(N)}$. By taking the product of the univariate orthogonal polynomials described in Sect. 2.1.1 we obtain a set of multivariate orthogonal polynomials defined on ${\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]$. These multivariate polynomials are given as follows.

Definition 1

Let $\{P_{k, n_k}\}_{n_k \in \mathbb {N}_0}$ the set of hypergeometric type polynomials given by the density function $f_{\nu _k}$, $ k=1, \dots , N$. The multivariate hypergeometric type polynomial $P_{n_1,\dots ,n_N}(\textbf{x})$ associated to $f_{\nu }$ is given as the product of N hypergeometric type polynomials $P_{k,n_k}$ of degree $n_k$ for each $ k = 1,\dots ,N$, that is

$$\begin{aligned} P_{n_1,\dots ,n_N}(\textbf{x})= & {} \prod _{k=1}^N P_{k,n_k}(x_k) \nonumber \\= & {} \prod _{k=1}^N \frac{B_{k,n_k}}{f_{\nu _k}(x_k)} \frac{d^{n_k}}{dx^{n_k}}\left[ \left( \sigma _k(x_k) \right) ^{n_k} f_{\nu _k}(x_k) \right] , \end{aligned}$$

(8)

where $\sigma _k$ and $\tau _k$ refer to the polynomials described in (2) of the kth hypergeometric type polynomial and $B_{k,n_k}$ refers to the kth normality constant described in (1) and (4).

We can subsequently leverage the construction of the multivariate polynomial as a product of marginal univariate polynomials to decompose the multivariate polynomial into its marginal components. In doing so, the completeness and orthogonality of the univariate marginal polynomials ensure the multivariate polynomials defined in Definition 1 are themselves orthogonal and complete in the N-dimensional space. As such, it can be said these multivariate polynomials form a complete orthonormal polynomial basis for the N-dimensional $L_2$ product space $L_2(\Omega ^{(N)},\mu _{\nu })$. This statement is given more formally in Theorem 2.

Theorem 2

The multivariate hypergeometric-type polynomials are orthonormal on the compact space ${\Omega ^{(N)}} = \prod _{k=1}^N \Omega _k$ and complete in the $L_2$ space $L_2({\Omega ^{(N)}},\mu _{\nu })$.

A detailed proof of Theorem 2 is given in “Appendix A”.

2.2 Multivariate moment-based density function

Let $\textbf{X} = (X_1,\dots ,X_N)$ be a random vector with joint density function $f_\textbf{X}$ on ${\Omega ^{(N)}}$. The joint density function $f_\textbf{X}$ can be uniquely expanded into a polynomial expression, multiplied by the reference density $f_\nu (\textbf{x})$, by use of orthonormal basis functions $\{ P_{n_1,\dots , n_N}(\textbf{x}) \}_{n_1,\dots ,n_N \in \mathbb {N}_0} $ on the compact space $\Omega ^{(N)}$. The resulting density expression has polynomial coefficients, which are functions of the underlying moments of the distribution. This form of the joint density is referred to in this paper as the multivariate moment-based density (MBD) expression.

We begin with the statement of the theorem outlining the multivariate moment-based density expression.

Theorem 3

The Multivariate Moment-Based Density.

Let $\textbf{X} = (X_1,\dots ,X_N)$ be a random vector with joint density function $f_\textbf{X}$ on space ${\Omega ^{(N)}} = \prod _{k=1}^N [a_k, b_k]$. Assume $\textbf{X}$ has finite moments $\int _{\Omega ^{(N)}} x_1^{i_1} \dots x_N^{i_N}f_{X}(\textbf{x}) dx_1\dots dx_N $ for all $i_1,\dots ,i_N \in \mathbb {N}_0$, and

$f_{X} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })$. Then $f_{\textbf{X}}$ can be expressed as

$$\begin{aligned} f_{\textbf{X}}(\textbf{x}) = f_{\nu }(\textbf{x})\sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}), \end{aligned}$$

(9)

where the coefficient terms $C_{n_1,\dots ,n_N}$ are given by

$$\begin{aligned} C_{n_1,\dots ,n_N}&= \mathbb {E}_{\textbf{X}}\left[ P_{n_1,\dots ,n_N}(X_1,\dots ,X_N) \right] \nonumber \\&= \int _{\Omega ^{(N)}} P_{1,n_1}(x_1) \dots P_{N,n_N}(x_N) f_{\textbf{X}}(\textbf{x}) d\textbf{x}\nonumber \\&= \sum _{i_1 = 0}^{n_1} \dots \sum _{i_N = 0}^{n_N} a_{n_1,\dots ,n_N ; i_1, \dots , i_N} \mathbb {E}_{\textbf{X}} \left[ X_1^{i_1} \dots X_N^{i_N} \right] , \end{aligned}$$

(10)

where $a_{n_1,\dots ,n_N ; i_1, \dots , i_N} = \prod _{k=1}^N a_{n_k,i_k}$ and $a_{n_k,i_k}$ is the $i_k$th coefficient of the $n_k$th polynomial $P_{k,n_k}(x_k)$.

Proof of Theorem 3 can be found in “Appendix B”.

Remark

We note $\sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 = \mathbb {E}_{\textbf{X}} \left[ \frac{f_{\textbf{X}}(X_1,\dots ,X_N)}{f_{\nu }(X_1,\dots ,X_N)} \right] < \infty $ as $f_{\textbf{X}} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu }).$ It also follows that $C_{n_1,\dots ,n_N} \rightarrow 0$ as $n_1,\dots ,n_N \rightarrow \infty $.

The expansion given in Theorem 3 relates to a joint-density defined on a compact space. We imposed this restriction to ensure moment determinacy, as, on unbounded spaces, we cannot guarantee moment determinacy. Polynomials given by (8) and applied in Theorem 3 may still be extended to unbounded spaces such as in the case of $f_{\nu _k}$, for any $k=1,2,\dots ,N$, being defined by a normal (Hamburger case) or gamma (Stieltjes case) density. For these cases, moment determinacy can be assured by the completeness of the Laguerre and Hermite polynomial families (Provost 2005).

In the examples of these types of polynomials in the literature, it is often assumed that all marginal densities of the joint density being approximated have a consistent kind of distribution. Here we make no such assumption, and only ensure that the domain of our multivariate polynomials is consistent with that of our joint density, provided that $f_{X} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })$ and has finite moments i.e. $\mathbb {E}_{\textbf{X}} \left[ X_1^{i_1} \cdots X_N^{i_N}\right] < \infty ,$ for all $i_1, \ldots , i_N$, which enables the correct estimation of joint densities with variables of different domains, a common occurrence in real-world data.

The hypergeometric polynomial-based expansion of a joint density function generally is an infinite sum of polynomials. It is impracticable to evaluate this infinite sum to obtain the complete information of the joint density function. Using finite summation terms in the expansion to approach the density function is one way to address this issue. The function of the finite summation terms is named the finite order multivariate moment-based density, which is defined as follows.

Definition 2

The Finite Order Multivariate Moment-Based Density

The finite order multivariate moment-based density approximation of max order $\textbf{K} = (K_1,\dots ,K_N)$ is defined as

$$\begin{aligned} f_{\textbf{X};\textbf{K}}(\textbf{x}) = f_{\nu }(\textbf{x})\sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}) . \end{aligned}$$

The Weierstrass–Stone approximation theorem provides that $f_{\textbf{X}:\textbf{K}}$ converges uniformly to $f_{\textbf{X}}$ as the order $\textbf{K}$ increases. Using $f_{\textbf{X}:\textbf{K}}$ as an estimate of $f_{\textbf{X}}$ is not practicable in real-life data analysis. One issue is the evaluation of the coefficients $\{C_{n_1, \ldots ,n_N}\}$ in $f_{\textbf{X}:\textbf{K}}$. The coefficient $C_{n_1, \ldots ,n_N}$ is a linear function of joint moments of $\textbf{X}$. The evaluation of coefficients requires the knowledge of the density function $f_{\textbf{X}}$ while $f_{\textbf{X}}$ is unknown. The following section introduces the finite order multivariate sample moment-based density function to address this problem.

2.3 Multivariate sample moment-based density estimation

We now consider an approximation of multivariate density functions based on sample moments.

Definition 3

Multivariate Sample Moment-Based Density Approximation

Given a random sample $\{\textbf{X}_i\}_{i=1}^M =\{(X_{1,i},X_{2,i}, \dots , X_{N,i})\}_{i=1}^M$ of $\textbf{X} = (X_1,X_2,\dots ,X_N)$, define

$$\begin{aligned} \hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )&= f_{\nu }(\textbf{x})\sum _{n_1 = 0}^{K_1} \nonumber \\&\quad \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}), \end{aligned}$$

(11)

where $\textbf{K} = (K_1,\dots ,K_N) $ and $\hat{C}_{n_1,\dots ,n_N} = (1 / M) \sum _{j=0}^M P_{n_1,\dots ,n_N}(X_{1,j},\dots ,X_{N,j}) $.

We propose using $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )$ as an estimator of $f_{\textbf{X}}$ and name it the sample moment-based estimator of $f_{\textbf{X}}$ or the moment-based estimator of $f_{\textbf{X}}$ if there is no ambiguity.

A significant issue when estimating the multivariate joint density function is computation complexity. The use of $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )$ as an estimate of $f_{\textbf{X}}$ has some computational advantages.

The marginal moment-based density estimates determined by integrating the joint moment-based density estimate are consistent with what would be achieved if the estimation were applied to the original sample’s corresponding marginal subset. Computationally we note that all coefficients of the various marginal distributions are elements of the set of joint density coefficients, meaning marginal density estimates are easily computed given the joint density coefficients. An essential factor for many density estimation schemes is the ability for these schemes to be adjusted or updated based on new data. Refitting density estimates in large sample and high-dimensional scenarios can be highly time-expensive. The moment-based density estimate is somewhat resilient to this issue, as the data agency can introduce new information to the existing model by averaging a density estimate based solely on the latest data with the original density estimate.

In privacy-sensitive situations, micro-level information may not necessarily be available due to the risk of information disclosure. Like that employed in Kernel density estimation, local density estimation methods require knowledge of the frequency of observations within a particular bandwidth. Due to privacy concerns, this knowledge may be challenging to obtain in sparsely populated areas. Perturbation techniques are also increasingly being used on privacy-sensitive statistics. Perturbation greatly reduces the statistical utility of estimates, even when such information is obtainable. By only using sample moments within the density estimate expression, we can tailor our estimate based on what information is available, with sample-moment information usually being more easily accessible, especially for lower-order moments.

As a result of the large number theory, $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )$ approaches to $f_{\textbf{X};\textbf{K}} $ as M increases. At the same time, the function $f_{\textbf{X};\textbf{K}}$ is a reasonable approximate of $f_{\textbf{X}}$ as $K \rightarrow \infty $. Proposing $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )$ as an estimator of $f_{\textbf{X}}$ makes sense. However, $\textbf{K}$ and $\{\textbf{X}_i\}_{i=1}^M$ might either independently or mutually impact the performance of the estimator in estimating $f_{\textbf{X}}$. To construct a good estimator $\hat{f}_{\textbf{X};\textbf{K}}$ in practice, the up order $\textbf{K}$ can work as an “adjuster” while the working sample $\{\textbf{X}_i\}_{i=1}^M$ is fixed.

In the following sections, we propose and demonstrate a computational statistics approach for determining $\textbf{K}$ in $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )$, given M is fixed. Under the mean square distance $L_2(\Omega ^{(N)}, \mu _{\nu })$ criterion, we expect that with the determined $\textbf{K}$, $\hat{f}_{\textbf{X};\textbf{K}}$ can perform well based on the available sample information from the underlying population $\textbf{X}$.

3 The $\mathcal {N}_{\textbf{K}}$-norm and max polynomial order

To quantify the quality of the resulting density estimate, we propose an averaged measure of divergence from the true joint density function, referred to as the $\mathcal {N}_{\textbf{K}}$-Norm, which is similar to the integrated square error statistic but allows for the decomposition of the error in terms of the estimated and coefficients given in (10).

The definition of the $\mathcal {N}_{\textbf{K}}$-Norm of the Approximation is given below.

Definition 4

$\mathcal {N}_{\textbf{K}}$-Norm of the Approximation.

The $\mathcal {N}_{\textbf{K}}$-Norm of the sample $\{\textbf{X}_i\}_{i=1}^M$ denoted $\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ is defined to be the $L_2({\Omega ^{(N)}},\mu _{\nu })$ norm of the difference between the sample approximation $\hat{f}_{\textbf{X};\textbf{K}}\left( \textbf{x};\{\textbf{X}_i\}_{i=1}^M \right) $ and the true joint density $f_{\textbf{X}}(\textbf{x})$ as a ratio of the reference joint density $f_{\nu }(\textbf{x})$. $\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ is expressed as

$$\begin{aligned} \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) = \int _{\Omega ^{(N)}} \left|\frac{\hat{f}_{\textbf{X};\textbf{K}} \left( \textbf{x} ; \{\textbf{X}_i\}_{i=1}^M\right) }{ f_{\nu }(\textbf{x})} - \frac{ f_{\textbf{X}}(\textbf{x}) }{ f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu }. \end{aligned}$$

$\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ is a statistic. Using $\mathcal {N}_{\textbf{K}}$ to measure the difference between $\hat{f}_{\textbf{X};\textbf{K}}$ and $f_\textbf{X}$ quantitatively is different from measuring the maximum point-wise difference between the two functions. A small value of $\mathcal {N}_{\textbf{K}}$ cannot guarantee the differences between $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})$ and $f_\textbf{X}(\textbf{x})$ are small for all $\textbf{x}$. However, it can ensure that, on average, the differences between $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})$ and $f_\textbf{X}(\textbf{x})$ can be depicted by the value of $\mathcal {N}_{\textbf{K}}$. The difference is a difference of ratios. As long as $f_{\nu }(\textbf{x})$ has a non-zero bound, such a ratio difference can be reasonable to depict the difference between $\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})$ and $f_\textbf{X}(\textbf{x})$, particularly, if $f_{\nu }(\textbf{x})$ has a uniform distribution.

It is also important to note the similarity of the $\mathcal {N}_{\textbf{K}}$-Norm to the integrated square error statistic (ISE) of a density estimate $\hat{f}(\textbf{x})$ of $f(\textbf{x})$ expressed below,

$$\begin{aligned} {\textit{ISE}}(\hat{f}) = \int _{\Omega ^{(N)}} \left|\hat{f}(\textbf{x}) - f(\textbf{x}) \right|^2 d\textbf{x}. \end{aligned}$$

(12)

The point of difference between the $\mathcal {N}_{\textbf{K}}$-norm and ${\textit{ISE}}(\hat{f})$ is that the former is a weighted squared $L_2$-norm in the $L_2({\Omega ^{(N)}},\mu _{\nu })$ space. In contrast, the integrated square error is measured within Euclidean space. Using this weighted measure, the expected value of the statistic, which reflects the variance and bias components of the estimate, can be expressed in terms of the corresponding coefficients in (10). This expression of $\mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] $ is given in Theorem 4.

Theorem 4

The expected value of $\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ is given by

$$\begin{aligned} \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right]&=\sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{n_1,\dots ,n_N} \right) \\&\quad + \mathcal {T}_{\textbf{K}}, \end{aligned}$$

where $ \mathcal {T}_{\textbf{K}} = \sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 - \sum _{n_1=0}^{K_1}\dots \sum _{n_N=0}^{K_N} C_{n_1,\dots ,n_N}^2 $ denotes the remainder term.

The proof of Theorem 4 is simple if we note that,

$$\begin{aligned}&\int _{\Omega ^{(N)}} \left|\frac{\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )}{ f_{\nu }(\textbf{x})} - \frac{ f_{\textbf{X}}(\textbf{x}) }{ f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu }\\&\quad =\sum _{n_1 = 0}^{K_1}\dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}^2 \\&\qquad -2 \sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}C_{n_1,\dots ,n_N} \\&\qquad + \sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N}^2. \end{aligned}$$

A crucial dichotomy arises when applying the moment-based density estimation method, which is demonstrated in the expression of the expected value of $\mathcal {N}_{\textbf{K}}$-Norm. Increasing the order of the approximation will lead to a reduction in the remainder term $\mathcal {T}_{\textbf{K}}$. However, an increase in the order of approximation also inflates the aggregated variance of the coefficient estimators.

This tradeoff between the remainder term and the estimator variance is at the heart of determining the optimal strategy for approximating the joint density function in practice, based on sample data. The tradeoff also suggests that the usual approach to polynomial approximations used in mathematics, taking the largest order feasible based on computation and accuracy requirements, is inappropriate in the sample context. Instead, it suggests that there may be an optimal max polynomial order to the approximation.

We provide the following definition of an optimal max polynomial order.

Definition 5

An Optimal Max Polynomial Order.

An optimal max polynomial order denoted $\textbf{K}^{*} = (K_1^{*},\dots ,K_N^{*})$ is defined as a vector of max polynomial orders in the approximation such that

$$\begin{aligned} \mathbb {E} \left[ \mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \le \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] , \end{aligned}$$

for any possible max polynomial order $\textbf{K} = (K_1,\dots ,K_N)$.

This $\textbf{K}^{*}$ will ensure that the mean of $\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ takes the smallest value. It does not mean that, for all the realisations of the random sample $\{\textbf{X}_i\}_{i=1}^M$, $\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ will take the smallest value at $\textbf{K}=\textbf{K}^*$. The variance of $\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ will provide the information of a confidence level of a realisation of $\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ falling in a close neighbourhood of the smallest value.

Given just the definition, it is unclear whether we can obtain such a quantity. It appears entirely feasible that taking larger and larger max polynomial orders may improve the density estimate (based on the $\mathcal {N}_{\textbf{K}}$-Norm). However, in the following theorem, we show that provided the estimated coefficients applied in the approximation have a non-zero variance (after some point), then the optimal max polynomial order is finite, i.e., for $\textbf{K}^{*} = (K_1^{*},\dots ,K_N^{*})$ then $K_k^{*} < \infty $ for $k = 1,\dots ,N$. In practice, the estimated coefficients applied in the approximation will have a non-zero variance as the estimated coefficients are linear combinations of higher and higher order sample moments.

Theorem 5

Suppose there exists $\epsilon _0 > 0 $ such that $ {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 $ for all $n_1,\dots ,n_N \in \mathbb {N}_0$ where $(n_1,\dots ,n_N) \ne (0,\dots ,0)$. Then $ \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] $, has a finite optimal max polynomial order.

The proof of Theorem 5 can be found in “Appendix C”.

In a practical situation, we may not necessarily know the true joint density of a sample, and hence it would be untenable to calculate the $\mathcal {N}_{\textbf{K}}$-Norm statistic. But this value does not explicitly need to be known to determine the optimal max polynomial order. Consider the following definition of the shifted $\mathcal {N}_{\textbf{K}}$-Norm.

Definition 6

The Shifted $\mathcal {N}_{\textbf{K}}$-Norm.

The shifted $\mathcal {N}_{\textbf{K}}$-Norm denoted $\mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) $ is defined as

$$\begin{aligned} \mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right)&= \sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}^2 \nonumber \\&\quad - 2\mathbb {E}_{\textbf{X}}\left[ \frac{\hat{f}_{\textbf{X};\textbf{K}}(\textbf{X};\{\textbf{X}_i\}_{i=1}^M)}{f_{\mathbf {\nu }}(\textbf{X})} \right] . \end{aligned}$$

(13)

The relationship between $\mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) $ and $\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) $ is summarised as follows:

(1)
$\begin{aligned} \mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) =\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M) - \int _{\Omega ^{(N)}}\frac{f^2_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} d\textbf{x} \end{aligned}$.
(2)
If $\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M))$ take the smallest value at $\textbf{K}^*$, $\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M))$ will also take the smallest value at $\textbf{K}^*$, vice versa.
(3)
Both of $\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M$ and $\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M $ have the same variance.

The second term of $\mathcal {N}_\textbf{K}^*(\{\textbf{X}_i\}_{i=1}^M$ in (13) can be more easily estimated in a computational approach, i.e. for any random function $g(\textbf{X})$, $\mathbb {E}_{\textbf{X}}[g(\textbf{X})]\approx (1/M) \sum _{i=1}^M g(\textbf{X}_i)$. Therefore, given the relationship between $\mathbb {E}_{\textbf{X}}(\mathcal {N}_{{\textbf{K}}}^*(\{\textbf{X}_i\}_{i=1}^M))$ and $\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}} (\{\textbf{X}_i\}_{i=1}^M))$, the optimal max polynomial order $\textbf{K}^*$ can be identified through $\min _{\textbf{K}} \{\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M))\}$, rather than $\min _{\textbf{K}}\{\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M))\}$.

We propose the following strategy for estimating the mean of the shifted $\mathcal {N}_{\textbf{K}}$-Norm given a random sample $\{\textbf{X}_i\}_{i=1}^M$. This strategy is valid subject to a condition that the probability distribution of any subsample, which was randomly drawn from the random sample $\{\textbf{X}_i\}_{i=1}^M$, is the same as that of $\{\textbf{X}_i\}_{i=1}^M$. This condition is not too strict in general as long as the size of the subsample is not too small.

1.
Randomly partition observations of the random sample into two approximately equal-size partitions $\{\textbf{X}_i\}_{i\in \mathcal {I}}$ and $\{\textbf{X}_i\}_{i\in \overline{\mathcal {I}}}$.
2.
Calculate the statistic:
$$\begin{aligned}&\hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}}}\right) \\&\quad = \sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}(\{\textbf{X}_i\}_{i\in \mathcal {I}})^2 \\&\qquad -\frac{2}{ ||\overline{\mathcal {I}} ||} \sum _{j \in \overline{\mathcal {I}} } \frac{\hat{f}_{\textbf{X};\textbf{K}}({X}_{1,j},\dots ,{X}_{N,j};\{\textbf{X}_i\}_{i\in \mathcal {I}})}{f_{\nu } ({X}_{1,j},\dots ,{X}_{N,j})} \end{aligned}$$
where $||\overline{\mathcal {I}} ||$ denotes the sample size of $\{ \textbf{X}_i \}_{i \in \overline{\mathcal {I}}}$.
3.
Repeat 1 and 2 B times to obtain the set of repeated shifted $\mathcal {N}_{\textbf{K}}$-Norm statistics
$$\begin{aligned} \Big \{ \hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}_b}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}_b}}\right) \Big \}_{b=1}^B. \end{aligned}$$
4.
Estimate $\mathbb {E}_{\textbf{X}}[\mathcal {N}^*_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M)]$ by
$$\begin{aligned} \hat{\mathcal {N}}_{\textbf{K};B}^{*}&\left( \{\textbf{X}_i\}_{i=1}^M\right) \nonumber \\&= (1 / B) \sum _{b=1}^B \hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}_b}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}_b}}\right) . \end{aligned}$$
(14)
The variance of the shifted $\mathcal {N}_{\textbf{K}}$-Norm statistics can be estimated through the sample variance.

With the information of the mean of the shifted $\mathcal {N}_{\textbf{K}}$-Norm, we can determine an appropriate upper polynomial order $\textbf{K}=(K_1, \ldots , K_N) $ for moment-based density function estimation. Sometimes, searching for the proper upper polynomial order $\textbf{K}$ is a complex process, especially when $N >3$. To reduce the complexity we may restrict the search process only to those $\textbf{K}$ with $K_1=\cdots = K_N$ in practice. However, restricting the search process to those $\textbf{K}$ with $K_1=\cdots = K_N$ may not be appropriate when there are considerable differences of scale and distribution between marginal variables of the data. We use such a restricted search process in the comparison study undertaken in the next section as both marginal variables are marginally beta distributed on [0, 1]. In Sect. 5, we do not apply a restricted search process as there are large differences in distribution and scale across the marginal variables of the data.

4 Comparing density estimation methods

In the following section, we compare the performance of the moment-based density estimate and a range of other multivariate density estimation methods through a bivariate example below.

4.1 Bivariate beta distribution example

Consider the Bivariate Beta Distribution discussed in Macomber and Myers (1983), which has density function

$$\begin{aligned} f_{X_1,X_2}(x_1,x_2)&= \frac{\Gamma (\alpha _1+\alpha _2 + \alpha _3)}{\Gamma (\alpha _1)\Gamma (\alpha _2)\Gamma (\alpha _3)}x_1^{\alpha _1 - 1}x_2^{\alpha _2 - 1}\nonumber \\&\quad \times (1-x_1-x_2)^{\alpha _3 - 1}, \end{aligned}$$

(15)

for $x_1,x_2 > 0$ and $x_1 + x_2 \le 1$. The moments of this bivariate density function are given by Macomber and Myers (1983) as

$$\begin{aligned} E(X_1^m X_2^n) = \frac{\Gamma (\alpha _1 + \alpha _2 + \alpha _3)\Gamma (\alpha _1+m)\Gamma (\alpha _2+n)}{\Gamma (\alpha _1 + \alpha _2 + \alpha _3 + m + n)\Gamma (\alpha _1)\Gamma (\alpha _2)}. \end{aligned}$$

We perform a simulation study of 100 simulations for four scenarios $\alpha _1 = 0.6$ and $\alpha _2 = 0.8$, $\alpha _1 = 1.6$ and $\alpha _2 = 0.8$, $\alpha _1 = 0.6$ and $\alpha _2 = 1.8$, and $\alpha _1 = 1.6$ and $\alpha _2 = 1.8$ wherein every scenario $\alpha _3 = 1$. In all the scenarios, we consider the performance of the bivariate Gaussian kernel density estimate (Duong 2007), the bivariate Gaussian copula with a marginal beta distribution (Hofert et al. 2020; Yan 2007; Kojadinovic and Yan 2010; Hofert and Mächler 2011), and the density estimation trees estimate (Meyer 2017) against two moment-based density estimates. The first moment-based density estimate (uMBD) applies a uniform reference density for marginal distributions $\nu _1,\nu _2 \sim U[0,1]$. The other moment-based density estimate (bMBD) applies a beta reference density for both marginal distributions where the parameters of the beta distribution are taken to be the method of moments estimates of the sample data for each marginal.

We note that the bivariate Gaussian kernel density estimate (Duong 2007) is a non-parametric local estimation method and is calculated using the ks package in R and associated kde function with optimal bandwidth selector (see Duong 2018). The bivariate Gaussian copula with a marginal beta distribution (Hofert et al. 2020; Yan 2007; Kojadinovic and Yan 2010; Hofert and Mächler 2011) is a parametric global density estimate and is calculated using the copula package in R (see Hofert et al. 2020) with the marginal beta distribution parameters taken to be the corresponding method of moments. The density estimation trees estimate (Meyer 2017) is a machine learning estimate and is calculated using the detpack package in R with element refinement goodness of fit significance taken to be $1\times 10^{-5}$ (see Meyer 2019).

For all density estimates, the integrated square error (12) with respect to the true bivariate beta density function (15) is determined by Monte Carlo integration using a uniform sample of size 100,000 on $[0,1] \times [0,1]$. The ISE statistic is averaged across all 100 simulations to give the mean integrated square error (MISE). In the case of the uMBD moment-based density estimate, the ISE statistic corresponds precisely to the $\mathcal {N}_{\textbf{K}}$-Norm statistic (see Definition 4), and hence the MISE is an estimate of the expected $\mathcal {N}_{\textbf{K}}$-Norm expressed in Theorem 4. In the case of the bMBD moment-based density estimate, the mean $\mathcal {N}_{\textbf{K}}$-Norm across the 100 simulations is also determined using Monte Carlo integration with a uniform sample of size 100,000 on $[0,1] \times [0,1]$.

The max polynomial order applied in the expression of the moment-based density estimates, $\textbf{K} = (K_1, K_2)$, is set to be consistent across both marginal distributions, i.e. $K = K_1 = K_2$. We determine the optimal max polynomial order (see Definition 5) by selecting the max polynomial order which minimises the mean $\mathcal {N}_{\textbf{K}}$-Norm. For uMBD this is consistent with minimising the MISE. For each comparison density estimate, we also determine the percentage of simulations in which the uMBD and bMBD achieve a smaller MISE value when the optimal max polynomial order is used.

The remainder term component of the mean $\mathcal {N}_{\textbf{K}}$-Norm, i.e. the $\tau _{\textbf{K}}$ term given in Theorem 4, is also estimated by evaluating the $\mathcal {N}_{\textbf{K}}$-Norm when theoretical moments are used to express the moment-based density estimate. This term reflects the bias component of error in the moment-based density estimate.

We provide the simulation study results for all four scenarios in Table 2.

Table 2 Simulation study results of 100 simulations from the bivariate beta distribution (15) with four different parameterisations

Full size table

The simulation study shows that both the uMBD and bMBD estimates outperform the Gaussian kernel density estimate nearly universally for all scenarios in which at least one $\alpha _i < 1$ ($i=1,2$). However, the Gaussian kernel density estimate universally outperforms both the uMBD and bMBD estimate when $\alpha _1 = 1.6$ and $\alpha _2 = 1.8$. When $\alpha _1 = 1.6$ and $\alpha _2 = 1.8$, the true density $f_{X_1,X_2}(x_1,x_2)$ achieves its maximum on the boundary condition $x_1 + x_2 \le 1$. This boundary condition is not specifically enforced by construction for any of the density estimates. Although due to the local estimation applied by the Gaussian kernel density estimate, zero areas of the sample space can be better predicted. The polynomial curves applied in the uMBD and bMBD approaches require a high max polynomial order to model the immediate change in density values to zero adequately. This issue is not as significant in previous parameterisations as density function values near the $x_1 + x_2 \le 1$ boundary are small. As a higher max polynomial order would also constitute greater sampling variance, the optimal max polynomial order remains relatively low, reducing the quality of the moment-based density estimates relative to the Gaussian kernel density estimate for the $\alpha _1 = 1.6$ and $\alpha _2 = 1.8$ parameterisation. The Gaussian copula with marginal beta distribution performs best comparatively in scenarios where $\alpha _1 = 0.6$, outperforming the uMBD estimate in approximately 86–88% of samples. The copula estimate does not appear to perform as well as the bMBD estimate in general for any scenario, only achieving a smaller MISE in at most 1% of simulations in each scenario. The density estimation trees estimate similarly does not perform as well as the bMBD estimate across all scenarios. The uMBD estimate also outperforms the density estimation trees estimate in most simulations in all scenarios.

5 Generating synthetic micro-data example

In the following section, we provide an example of synthetic micro-data generated from a “confidential” data set. Before this example, we outline the process of simulating synthetic data from the moment-based density estimate.

5.1 Simulating synthetic data from the density estimate

Given a random sample $\{(Y_i,\textbf{X}_i)\}_{i=1}^M =\{(Y_{i},X_{1,i}, \dots , X_{N,i})\}_{i=1}^M$ of $(Y,\textbf{X}) = (Y,X_1,X_2,\dots ,X_N)$, and corresponding moment-based density estimate $\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )$, we define the moment-based conditional cumulative probability function estimate of Y given $\textbf{X}$ as follows,

$$\begin{aligned}&\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\nonumber \\&\quad = \frac{\int _{a_Y}^{y} \hat{f}_{Y,\textbf{X};\textbf{K}}((t,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )dt}{ \hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )}, \end{aligned}$$

(16)

where $\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )$ denotes the moment-based density estimate of $\{\textbf{X}_i\}_{i=1}^M$ with max polynomial order $\mathbf {K_X}$, and $a_Y$ denotes the lower bound of Y, i.e. $Y \in [a_Y,b_Y]$.

Given Property (2) in Sect. 2.1.1, we derive the following expression for the moment-based conditional cumulative probability function estimate of Y given $\textbf{X}$, $\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)$.

$$\begin{aligned}&\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M) =\frac{f_{\nu _{\textbf{X}}}(\textbf{x})}{\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M)} \\&\quad \times \left\{ \sum _{n_Y=0}^{K_Y} \sum _{n_{X,1} = 0}^{K_{X,1}} \dots \sum _{n_{X,N} = 0}^{K_{X,N}} \hat{C}_{n_Y,n_{X,1},\dots ,n_{X,N}} \right. \\&\quad \times \left. \left( \frac{\sigma _Y(y)f_{\nu _Y}(y)}{\lambda ^{*}_{Y,n_Y}} P^{'}_{Y,n_Y}(y)+ \delta _{n_Y,0}F_{\nu _Y}(y) \right) \right. \\&\quad \times \left. P_{n_{X,1},\dots ,n_{X,N}}(\textbf{x}) \right\} , \end{aligned}$$

where $f_{\nu _Y}$ and $f_{\nu _{\textbf{X}}}$ are the reference densities of Y and $\textbf{X}$ respectively, $ \hat{C}_{n_Y,n_{X,1},\dots ,n_{X,N}} $, are the estimated coefficients of the moment-based density estimate $\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )$ with $\textbf{K} = (K_Y,K_{X,1},\dots ,K_{X,N})$, $\lambda ^{*}_{Y,n_Y} = {\left\{ \begin{array}{ll} \lambda _{Y,n_Y} &{} n_Y = 1,2,\dots \\ 1 &{} n_Y = 0 \end{array}\right. }$ where $\lambda _{Y,n_Y}$ is the $n_Y$th eigenvalue related to the univariate polynomial $P_{Y,n_Y}(y)$, $F_{\nu _Y}(y)$ denotes the reference cumulative distribution function of Y, and $\delta _{n_Y,0}$ denotes the Kronecker delta.

We make note of the following in regards to $\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)$:

1.
We have an analytical expression of the moment-based conditional cumulative probability function estimate of Y given $\textbf{X}$, which does not require numerical integration.
2.
The moment-based conditional cumulative probability function estimate of Y given $\textbf{X}$ is a univariate cumulative distribution function (in terms of Y) from which we can simulate an observation $Y^{*}$ using inverse transform sampling. That is, given $X_1^{*},\dots ,X_N^{*}$ from $\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )$ and a uniform random variable $U\sim \text {Unif}[0,1]$, then $Y^{*} = \hat{F}_{Y {\mid }\textbf{X};\textbf{K}}^{-1}(U {\mid }X_1^{*},\dots ,X_N^{*}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)$ and $X_1^{*},\dots ,X_N^{*}$ have joint density $\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )$.
3.
Without loss of generality, we can express any column of our data set as $\{Y_i\}_{i=1}^M$ and the others as $\{\textbf{X}_i\}_{i=1}^M$.
4.
By applying a Gibbs sampling approach, we can simulate all columns of the data to form a complete synthetic data set.

In the following section, we obtain a 5-dimensional moment-based density estimate and simulate data from a “sensitive” variable of the data (a variable that poses a risk to confidentiality if disclosed) by applying inverse transform sampling to the moment-based conditional cumulative probability function estimate.

5.2 Synthetic data example

The data we consider $\{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}$ are 9568 observations of 5 variables collected from a combined cycle power plant over six years (2006–2011) when the power plant was set to work with full load (Tüfekci 2014; Kaya and Tufekci 2012). The dataset consists of hourly average ambient variables ($\textbf{X} = (X_1,X_2,X_3,X_4)$): temperature ($X_1$), ambient pressure ($X_2$), relative humidity ($X_3$), and exhaust vacuum ($X_4$), that were collected to predict the net hourly electrical energy output of the plant (Y).

Suppose that the values of the plant’s net hourly electrical energy output are considered “sensitive” due to a risk of proprietary infringement or operational sabotage were the actual data released. We produce synthetic data that mimics the true values of the net hourly electrical energy output by obtaining a moment-based density estimate $\hat{f}_{Y,\textbf{X};\textbf{K}}$ of the joint distribution of the data. For each variable, we select uniform reference densities with bounds consistent with its range plus or minus (for the maximum and minimum, respectively) a buffer of 5% of its standard deviation.

We estimate the optimal max polynomial order $\hat{\textbf{K}^{*}} =(\hat{K}_1^{*},\dots ,\hat{K}_5^{*})$ by minimising $\hat{\mathcal {N}}_{\textbf{K};B}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) $ (14) for $B=100$,

$$\begin{aligned} \hat{\textbf{K}^{*}} = \underset{\textbf{K} \in \mathbb {N}_0^5}{\arg \!\min } \Big \{ \hat{\mathcal {N}}_{\textbf{K};100}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) \Big \}. \end{aligned}$$

Note that we only considered values of $\hat{K}^{*}_k \le 35$, for $k=1,\dots ,5$, as the statistic was diverging in all cases.

Averaged across $B=100$ randomly sampled partitions, we achieve the estimates of the shifted ${\mathcal {N}}_\textbf{K}$-Norm ($ \hat{\mathcal {N}}_{\textbf{K};100}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) $) for various max polynomial orders from 1 to 35 for each variable. The estimated optimal max polynomial order is determined to be $\hat{\textbf{K}^{*}} = (21,22,12,5,22)$.

We simulate a synthetic sample of net hourly electrical energy outputs ($\{Y_i^{*}\}_{i=1}^{9658}$) using inverse-transform sampling on the moment-based conditional cumulative probability function estimate $\hat{F}_{Y {\mid }\textbf{X};\hat{\textbf{K}^{*}}}^{-1}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568})$ where $\hat{\textbf{K}^{*}} = (21,22,12,5,22)$. We note that given uniform reference densities are used, $\hat{F}_{Y {{\mid }} \textbf{X};\hat{\textbf{K}^{*}}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568})$ is a polynomial with respect to y. We apply the base R polynomial rooting function polyroot() (Jenkins and Traub 1972) to compute values from the inverse moment-based conditional cumulative probability function estimate.

Histograms comparing the distribution of the actual net hourly electrical energy output ($\{Y_i\}_{i=1}^{9658}$) and synthetic net hourly electrical energy output ($\{Y_i^{*}\}_{i=1}^{9658}$) are given in Fig. 1. Scatterplots showing the associations of both the actual and synthetic net hourly electrical energy outputs with each net hourly ambient variable are given in Fig. 2. Heatmaps showing the bivariate marginal density of both the actual and synthetic net hourly electrical energy outputs with each net hourly ambient variable are given in Fig. 3.

The histograms in Fig. 1 indicate that the synthetic net hourly electrical energy outputs approximate the shape of the actual net hourly electrical energy outputs well. However, synthetic net hourly electrical energy outputs appear slightly over-populated in the distribution’s tails compared to the actual net hourly electrical energy outputs, leading to a slightly “flatter” distribution overall. This over-population of synthetic net hourly electrical energy outputs in sparsely populated areas of the actual data is also apparent when comparing the bivariate associations of the net hourly electrical energy outputs with each net hourly ambient variable shown in Figs. 2 and 3 . The general pattern of association between the net hourly electrical energy outputs and each net hourly ambient variable is approximated by the synthetic data well. However, sparse areas of the actual data appear over-estimated by the moment-based density estimate.

To assess the statistical utility of the synthetic data for prediction modelling, we compare statistics obtained by a main-effects ordinary least squares regression of the net hourly ambient variables on the actual and synthetic net hourly electrical energy outputs. The results of the linear regression analysis are given in Table 3. The estimated coefficient of the ambient temperature data ($-1.727$, 95% CI [$-1.785, -1.669$]) and the relative humidity data ($-0.128$, 95% CI [$-0.143,-0.112$]) obtained from the synthetic data underestimated the magnitude of the corresponding estimated coefficient obtained from the actual data ($-1.978$ and $-0.158$ respectively). Whereas the estimated coefficients of exhaust volume ($-0.259$, 95% CI [$-0.287,-0.231$]) and ambient pressure (0.093, 95% CI [0.057, 0.128]) obtained from the synthetic data better represent the estimated coefficients obtained from the actual data, with both actual coefficient estimates ($-0.234$ and 0.062 respectively) belonging to the 95% confidence intervals of the corresponding coefficient estimates obtained from the synthetic data. All coefficient estimates obtained from the synthetic data have increased standard error estimates. The synthetic-based linear regression model has an increased residual mean square error estimate (8.804) compared to the actual-based linear regression model (4.558). A reduction in the proportion of variation explained by the model ($R^2$) is also seen in the synthetic-based model, with $R^2 = 0.748$ from the synthetic-based model being less than $R^2 = 0.929$ from the actual-based model.

Although the performance of the synthetic data appears mixed, unlike other model-based synthetic generation methods, synthetic data obtained from the moment-based density estimate is not model specific and can be used to fit any model. For instance, suppose we were to fit a full factorial linear regression model of the net hourly ambient variables on the actual and synthetic net hourly electrical energy outputs. Without re-defining our generation model or re-sampling, we obtain the results given in Table 4 for the same synthetic data used to obtain the results in Table 3. Again, all coefficient estimates obtained from the synthetic data had increased standard error estimates. Although in all cases, the estimated coefficients obtained from the actual data for the full factorial model are contained within the 95% confidence interval of the corresponding estimated coefficients obtained from the synthetic data.

Considering the disclosure risk posed by the synthetic net hourly electrical energy outputs, we note that only approximately 3% of observations in the actual data have synthetic observations within a 1% standard-deviation-based interval (Mateo-Sanz et al. 2004) of their values. Figure 2 shows that regions containing multivariate outliers in the actual data are well populated in the synthetic data. A data intruder, therefore, would find it difficult to determine which synthetic observations (if any) could be successfully attributed to an actual observation and which should be treated as erroneous based on the synthetic data alone. Given a public release of the synthetic data, we suggest that the disclosure-prone multivariate outliers present in the actual data are more greatly shielded from attribution due to the high presence of synthetic observations in and around corresponding regions of the synthetic data.

Table 3 Table of Linear Regression statistics obtained from modelling net hourly electrical energy output with respect to the net hourly ambient variables in a main effects model based on the actual and synthetic data sets

Full size table

Table 4 Table of Linear Regression statistics obtained from modelling net hourly electrical energy output with respect to the net hourly ambient variables in a full factorial model based on the actual and synthetic data sets

Full size table

6 Conclusion

Our specific interest in this paper is the application of moment-based density estimation in the context of sharing confidential micro-data while preserving the privacy of the respondents. The multivariate moments’ problem and its application to estimating joint density functions are often considered highly impractical in modern-day analysis. Methods that do exist in this context, often fail to account for the sampling problems faced in real-world applications due to the computational complexity and intractability of higher-order moments. One of the critical difficulties in obtaining moment-based density expressions is the breadth of variation seen in densities requiring estimation. The support of these distributions must align with the orthogonal basis used in the density approximation (Kleiber and Stoyanov 2013). By requiring consistency in sample space, we inherently arrive at three different cases: Hausdorff (density functions defined on a compact space), Stieltjes (density functions defined on a half bounded (lower or upper) space) (Stieltjes 1894; Stoyanov and Tolmatz 2005) and Hamburger (density functions defined on a completely unbounded space) (Shohat and Tamarkin 1943; Akhiezer and Kemmer 1965). We have focused mainly on the Hausdorff problem, joint-densities on a compact space, and have only referred to our ability to apply such methods on the unbounded space provided moment determinacy.

Determinacy refers to the existence of a unique moment-based expansion of the density function. Kleiber and Stoyanov (2013) extensively review this concept in the context of the multivariate moments problem. Interestingly, the paper investigates the relationship between the marginal determinacy and the determinacy of the multivariate distribution as a whole. It raises Peterson’s result (Petersen 1982), that marginal determinacy implies multivariate determinacy and conjectures that for absolutely continuous functions, multivariate determinacy implies marginal.

Therefore we can comfortably apply the estimation results presented in this paper to unbounded distributions if, at the very least on the marginal level, our choice of unbounded reference densities converges to a unique polynomial expression. Furthermore, due to the unique properties of the Gaussian and gamma distributions (and others related, e.g., Chi-Square, Weibull), it has been shown that reference densities of this variety have moment-determinacy (Sánchez-Ruiz and Dehesa 1998; Provost 2005). Therefore this approach has greater application than may appear at first glance. Furthermore, when unsure of the appropriateness of a particular polynomial basis, we can instead focus on reviewing the convergence marginally rather than on the joint density as a whole to gain an idea of moment-determinacy.

In this paper, we proposed a generalised form of a multivariate polynomial-based density expansion, making the estimate applicable to one type of polynomial basis and any hypergeometric type polynomial basis (provided determinacy is assured). This expansion is generated from a generalised multivariate polynomial basis that, by construction, allows for the estimation of joint density functions based on aggregated sufficient statistics, namely the sample moments. We derived critical statistical properties of the moment-based density estimate and suggested a computational strategy by which we can estimate the optimal max polynomial order of the expansion. Through the contributions of this paper, we have demonstrated that applying multivariate polynomial expansion to joint density estimation is beneficial for analytical formulations and can be used as a practical density estimation alternative to existing comparable methods.

Furthermore, we can extend this result to estimate both cumulative and conditional cumulative distribution functions, allowing for the simulation of data from the moment-based density estimates without requiring grid-based sampling techniques. This property has considerable ramifications when producing synthetic representative samples of the original data and imputing missing information, especially when we can only ascertain empirical moments. The production of non-parametric synthetic continuous data based on aggregated statistics is an area of great interest in data privacy research.

Change history

27 February 2023
Missing Open Access funding information has been added in the Funding Note.

References

Akhiezer, N.I., Kemmer, N.: The Classical Moment Problem: And Some Related Questions in Analysis, vol. 5. Oliver & Boyd, Edinburgh (1965)
Google Scholar
Alexits, G.: Convergence Problems of Orthogonal Series. International Series of Monographs in Pure and Applied Mathematics, vol. 20, pp. 63–170. Pergamon Press, Oxford (1961). https://books.google.com.au/books?id=VAJRAAAAMAAJ
Charlier, C.V.L.: Frequency curves of type a in heterograde statistics. Ark. Mat. Ast. Fysik 9, 1–17 (1914)
Google Scholar
Cramer, H.: Mathematical Methods of Statistics, vol. 9, pp. 85–89. Princeton University Press, Princeton (1946)
Duong, T.: ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. J. Stat. Softw. 21(7), 1–16 (2007). https://doi.org/10.18637/jss.v021.i07
Duong, T.: Ks: Kernel Smoothing. R package version 1.11.3. https://CRAN.R-project.org/package=ks (2018)
Edgeworth, F.Y.: The law of error. Proc. Cambridge Philos. Soc. 20, 36–65 (1905)
MATH Google Scholar
Elderton, W.P., Johnson, N.L.: Systems of Frequency Curves, 1st edn., pp. 110–121. Cambridge University Press, Cambridge (1969). https://doi.org/10.1017/CBO9780511569654
Gram, J.P.: On the development of real functions in rows by the method of least squares. J. Reine Angew. Math. 94, 41–73 (1883)
Article MathSciNet MATH Google Scholar
Griffiths, R.C., Spano, D.: Multivariate Jacobi and Laguerre polynomials, infinite-dimensional extensions, and their probabilistic connections with multivariate Hahn and Meixner polynomials. Bernoulli Soc. 17, 1095–1125 (2011). https://doi.org/10.3150/10-BEJ305
Article MathSciNet MATH Google Scholar
Hofert, M., Kojadinovic, I., Maechler, M., Yan, J.: Copula: Multivariate Dependence with Copulas. R package version 1.0-1. https://CRAN.R-project.org/package=copula (2020)
Hofert, Marius: Mächler, Martin: Nested Archimedean copulas meet R: the nacopula package. J. Stat. Softw. 39(9), 1–20 (2011)
Article Google Scholar
Izenman, A.J.: Review papers: recent developments in nonparametric density estimation. J. Am. Stat. Assoc. 86(413), 205–224 (1991). https://doi.org/10.1080/01621459.1991.10475021
Article MathSciNet MATH Google Scholar
Jenkins, M.A., Traub, J.F.: Algorithm 419: zeros of a complex polynomial [c2]. Commun. ACM 15(2), 97–99 (1972)
Article Google Scholar
Kaya, H., Tufekci, P.: Local and global learning methods for predicting power of a combined gas & steam turbine, pp. 13–18 (2012)
Kendall, M.G., Stuart, A., Ord, J.K.: Kendall’s Advanced Theory of Statistics. Oxford University Press Inc, New York (1987)
MATH Google Scholar
Kleiber, C., Stoyanov, J.: Multivariate distributions and the moment problem. J. Multivar. Anal. 113, 7–18 (2013)
Article MathSciNet MATH Google Scholar
Kojadinovic, Ivan: Yan, Jun: Modeling multivariate distributions with continuous margins using the copula R package. J. Stat. Softw. 34(9), 1–20 (2010)
Article Google Scholar
Lin, Y.-X.: Density approximant based on noise multiplied data. In: Domingo-Ferrer, J. (ed.) Privacy in Statistical Databases, pp. 89–104. Springer, Cham (2014)
Chapter Google Scholar
Macomber, J.H., Myers, B.L.: The bivariate beta distribution: comparison of Monte Carlo generators and evaluation of parameter estimates. ACM SIGSIM Simul. Dig. 14(1–4), 26–35 (1983)
Article Google Scholar
Mateo-Sanz, J.M., Sebé, F., Domingo-Ferrer, J.: Outlier protection in continuous microdata masking. In: Domingo-Ferrer, J., Torra, V. (eds.) Privacy in Statistical Databases. PSD 2004. Lecture Notes in Computer Science, pp. 201–215. Springer, Berlin (2004)
Mehler, F.G.: G. reihenentwicklungen nach laplaceschen functionen hoherer ordnung. J. Reine Angew. Math. 66, 161–176 (1866)
Meyer, D.W.: Density estimation with distribution element trees. Stat. Comput. 28(3), 609–632 (2017). https://doi.org/10.1007/s11222-017-9751-9
Article MathSciNet MATH Google Scholar
Meyer, D.: Detpack: Density Estimation and Random Number Generation with Distribution Element Trees. R package version 1.1.3. https://CRAN.R-project.org/package=detpack (2019)
Munkhammar, J., Mattsson, L., Rydén, J.: Polynomial probability distribution estimation using the method of moments. PLoS ONE 12(4), 1–14 (2017). https://doi.org/10.1371/journal.pone.0174573
Article Google Scholar
Mustapha, H., Dimitrakopoulos, R.: Generalized Laguerre expansions of multivariate probability densities with moments. Comput. Math. Appl. 60(7), 2178–2189 (2010). https://doi.org/10.1016/j.camwa.2010.08.008
Petersen, L.: On the relation between the multidimensional moment problem and the one-dimensional moment problem. Math. Scand. 51(2), 361–366 (1982)
Article MathSciNet MATH Google Scholar
Provost, S.B.: Moment based density approximants. Math. J. 9, 728–756 (2005)
Google Scholar
Reid, N.: Saddlepoint methods and statistical inference. Statist. Sci. 3(2), 213–227 (1988). https://doi.org/10.1214/ss/1177012906
Article MathSciNet MATH Google Scholar
Sánchez-Ruiz, J., Dehesa, J.S.: Expansions in series of orthogonal hypergeometric polynomials. J. Comput. Appl. Math. 89(1), 155–170 (1998)
Article MathSciNet MATH Google Scholar
Shohat, J.A., Tamarkin, J.D.: The Problem of Moments, vol. 1, pp. 860–863. American Mathematical Society, New York (1943)
Solomon, H., Stephens, M.A.: Approximations to density functions using Pearson curves. J. Am. Stat. Assoc. 73(361), 153–160 (1978)
Article Google Scholar
Stieltjes, T.-J.: Recherches sur les fractions continues. Annales de la Faculté des sciences de Toulouse: Mathématiques 8(4), 1–122 (1894)
MathSciNet MATH Google Scholar
Stoyanov, J.: Moment properties of probability distributions used in stochastic financial models. In: Recent Advances in Financial Engineering 2014: Proceedings of the TMU Finance Workshop 2014, pp. 1–27, World Scientific (2016)
Stoyanov, J., Tolmatz, L.: Method for constructing Stieltjes classes for M-indeterminate probability distributions. Appl. Math. Comput. 165(3), 669–685 (2005)
MathSciNet MATH Google Scholar
Szegő, G.: Orthogonal Polynomials. American Mathematical Society, vol. 23, pp. 38–100. American Mathematical Society, Providence (1939). https://books.google.com.au/books?id=WRoPAAAAIAAJ
Szökefalvi-Nagy, B.: Introduction to Real Functions and Orthogonal Expansions, pp. 310–337. Oxford University Press, Oxford (1965)
Tüfekci, P.: Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 60, 126–140 (2014). https://doi.org/10.1016/j.ijepes.2014.02.027
Article Google Scholar
Wang, Z., Scott, D.W.: Nonparametric density estimation for high-dimensional data-algorithms and applications. Wiley Interdiscip. Rev. Comput. Stat. 11(4), 1461 (2019)
Article MathSciNet Google Scholar
Withers, C.S., Nadarajah, S.: Expansions for the multivariate normal. J. Multivar. Anal. 101(5), 1311–1316 (2010). https://doi.org/10.1016/j.jmva.2010.01.001
Article MathSciNet MATH Google Scholar
Xu, Y.: On orthogonal polynomials in several variables. Special functions, q-Series and related topics. Fields Inst. Res. Math. Sci. Commun. Ser. 14, 247–270 (1997)
Yan, Jun: Enjoy the joy of copulas: with a package copula. J. Stat. Softw. 21(4), 1–21 (2007)
Article Google Scholar

Download references

Acknowledgements

This research was partially completed with support from the Australian Government Research and Training Program.

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions.

Author information

Authors and Affiliations

National Institute for Applied Statistics Research Australia, University of Wollongong, Northfields Ave, Wollongong, NSW, 2522, Australia
Bradley Wakefield & Yan-Xia Lin
Department of Management Science & Information Systems, Oklahoma State University, Spears School of Business, Stillwater, OK, 74078, USA
Rathin Sarathy
Marketing & Supply Chain Management, University of Oklahoma, West Brooks, Norman, OK, 73019, USA
Krishnamurty Muralidhar

Authors

Bradley Wakefield
View author publications
You can also search for this author in PubMed Google Scholar
Yan-Xia Lin
View author publications
You can also search for this author in PubMed Google Scholar
Rathin Sarathy
View author publications
You can also search for this author in PubMed Google Scholar
Krishnamurty Muralidhar
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Bradley Wakefield or Yan-Xia Lin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: In the original publication of the article, Missing Open Access funding information has been added in the Funding Note.

Appendices

Appendix A: Proof of Theorem 2

Proof

Firstly as the integrals on $\Omega _k$ are all finite, then

$$\begin{aligned}&\int _{{\Omega ^{(N)}}} P_{n_1,\dots ,n_N}(\textbf{x})P_{m_1,\dots ,m_N}(\textbf{x}) d\mu _{\nu } (\textbf{x}) \\&\quad = \prod _{k=1}^N \left[ \int _{\Omega _k} P_{k,n_k}(x_k) P_{k,m_k}(x_k)f_{\nu _k}dx \right] \\&\quad = \prod _{k=1}^N\delta _{n_k,m_k}. \end{aligned}$$

Furthermore, given that for each $k=1,\dots ,N$, the set of orthonormal polynomials $\{ P_{k,n_k}\}_{n_k \in \mathbb {N}_0} $ are complete in $L_2(\Omega _k,\mu _{\nu _k})$. That is, for any function $g(x_k)$ on $ \Omega _k $, if

$$\begin{aligned} \int _{a_k}^{b_k} g(x_k) P_{k,n_k}(x_k) f_{\nu _k}(x_k) dx_k = 0,\quad \text {for all}\quad n_k \in \mathbb {N}_0, \end{aligned}$$

then, $ g(x_k) = 0$ for all $x_k \in \Omega _k$ a.s.. Therefore for $P_{n_1, \dots , n_N}( \textbf{x} ) = \prod _{k=1}^N P_{k,n_k}(x_k)$ and any function h on ${\Omega ^{(N)}} = \prod _{k=1}^N \Omega _k $, then for all $\textbf{x} \in {\Omega ^{(N)}}$, if

$$\begin{aligned} \int _{{\Omega ^{(N)}}} h(\textbf{x}) P_{n_1, \dots , n_N}( \textbf{x} ) d\mu _{\nu }(\textbf{x}) = 0\quad \text {for all}\quad n_1, \dots , n_N \in \mathbb {N}_0, \end{aligned}$$

then it follows that

$$\begin{aligned}&\int _{a_N}^{b_N} \dots \int _{a_1}^{b_1} h(x_1,\dots ,x_N)\\&\quad \times \left( \prod _{k=1}^N P_{k,n_k}(x_k) f_{\nu _k}(x_k) \right) dx_1 \dots dx_N = 0 \end{aligned}$$

for all $ n_1, \dots , n_N \in \mathbb {N}_0$. Hence,

$$\begin{aligned}&\int _{a_N}^{b_N} \left[ \int _{a_{N-1}}^{b_{N-1}} \dots \int _{a_1}^{b_1} h(x_1,\dots ,x_N)\right. \\&\quad \times \left. \left( \prod _{k=1}^{N-1} P_{k,n_k}(x_k) f_{\nu _k}(x_k) \right) dx_1 \dots dx_{N-1} \right] \\&\quad \times P_{N,n_N}(x_N) f_{\nu _N}(x_N)dx_N = 0\quad \text {for all}\quad n_1, \dots , n_N \in \mathbb {N}_0. \end{aligned}$$

Due to the fact $P_{N,n_N}(x_N)$ is complete, this infers that, for all $x_N \in \Omega _N$ a.s.,

$$\begin{aligned}&\int _{a_{N-1}}^{b_{N-1}} \dots \int _{a_1}^{b_1} h(x_1,\dots ,x_N)\\&\quad \times \left( \prod _{k=1}^{N-1} P_{k,n_k}(x_k) f_{\nu _k}(x_k) \right) dx_1 \dots dx_{N-1} = 0. \end{aligned}$$

Applying the same reasoning we can see that for all $x_i \in \Omega _i (i=3,\dots ,N)$ a.s., we have

$$\begin{aligned}&\int _{a_2}^{b_2} \left[ \int _{a_1}^{b_1} h(x_1,\dots ,x_N) P_{1,n_1}(x_1) f_{\nu _1}(x_1) dx_1 \right] \\&\quad \times P_{2,n_2}(x_2) f_{\nu _2}(x_2) dx_2 = 0, \end{aligned}$$

for all $ n_1, n_2 \in \mathbb {N}_0$, which implies for all $x_i \in \Omega _i, i=2,\ldots ,N,$ a.s.,

$$\begin{aligned} \int _{a_1}^{b_1} h(x_1,\dots ,x_N) P_{1,n_1}(x_1) f_{\nu _1}(x_1) dx_1 = 0 \quad \text {for all}\quad n_1 \in \mathbb {N}_0. \end{aligned}$$

This leads to

$$\begin{aligned} h(x_1,\dots ,x_N) = 0 \quad \text {for all}\quad \textbf{x} \in {\Omega ^{(N)}}, \; a.s.. \end{aligned}$$

Therefore we can say that for any function h on ${\Omega ^{(N)}} = \prod _{k=1}^N \Omega _k $ then

$$\begin{aligned} \text {if } \int _{{\Omega ^{(N)}}} h(\textbf{x}) P_{n_1, \dots , n_N}( \textbf{x} ) d\mu _{\nu }(\textbf{x}) = 0\quad \text {for all}\quad n_1, \dots , n_N \in \mathbb {N}_0, \end{aligned}$$

then $h(\textbf{x}) = 0\;\text {for all}\; \textbf{x} \in {\Omega ^{(N)}}$ almost surely. Therefore $\{ P_{n_1,\dots , n_N}(\textbf{x}) \}_{n_1,\dots ,n_N \in \mathbb {N}_0} $ are a complete set of multivariate orthonormal polynomials. $\square $

Appendix B: Proof of Theorem 3

Proof

By the Weierstrass–Stone Approximation (Szökefalvi-Nagy 1965), we know there exists a unique polynomial approximation that converges uniformly to the continuous function $f_{\textbf{X}}(\textbf{x}) / f_{\nu }(\textbf{x}) $. This result is held because $\{P_{n_1, \cdots , n_N}\}_{n_1,\dots ,n_N \in \mathbb {N}_0}$ is a complete orthogonal base. Let

$$\begin{aligned} \frac{f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} = \sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}). \end{aligned}$$

Given $f_{\textbf{X}} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })$, then, for fixed $m_1, \dots , m_N \in \mathbb {N}_0$, by the Cauchy-Schwarz inequality,

$$\begin{aligned}&\int _{{\Omega ^{(N)}}} \left|\sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x})P_{m_1,\dots ,m_N}(\textbf{x})\right|d\mu _{\nu }(\textbf{x}) \\&\quad \le \sqrt{\int _{{\Omega ^{(N)}}}\left|\frac{f_{\textbf{X}}(\textbf{x})}{ f_{\nu }(\textbf{x})}\right|^2 d\mu _{\nu }(\textbf{x}) \int _{{\Omega ^{(N)}}} P_{m_1,\dots ,m_N}^2(\textbf{x}) d\mu _{\nu }(\textbf{x})} \\&\quad = \sqrt{\int _{{\Omega ^{(N)}}}\left|\frac{f_{\textbf{X}}(\textbf{x})}{f_{\nu }( \textbf{x})}\right|^2 d\mu _{\nu }(\textbf{x})} < \infty . \end{aligned}$$

Therefore by the Fubini–Tonelli theorem, it follows

$$\begin{aligned}&\int _{{\Omega ^{(N)}}}\sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x})P_{m_1,\dots ,m_N}(\textbf{x}) d\mu _{\nu }(\textbf{x}) \\&\quad = \sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N} \left[ \prod _{k=1}^N\delta _{n_k,m_k} \right] = C_{m_1,\dots ,m_N}. \end{aligned}$$

Hence

$$\begin{aligned} C_{m_1,\dots ,m_N}&= \int _{{\Omega ^{(N)}}}\frac{f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} P_{m_1,\dots ,m_N}(\textbf{x}) d\mu _{\nu }(\textbf{x})\\&= \mathbb {E}_{\textbf{X}} \left[ P_{m_1,\dots ,m_N}(X_1,\dots ,X_N) \right] . \end{aligned}$$

Therefore by the uniqueness of the polynomial approximation, then,

$$\begin{aligned} f_{\textbf{X}}(\textbf{x}) = f_{\nu }(\textbf{x}) \sum _{n_1 = 0}^{\infty } \dots \sum _{n_N = 0}^{\infty } C_{n_1,\dots ,n_N}P_{n_1,\dots ,n_N}(\textbf{x}), \end{aligned}$$

where the coefficient terms $C_{n_1,\dots ,n_N}$ are given by

$$\begin{aligned} C_{n_1,\dots ,n_N}= & {} \mathbb {E}_{\textbf{X}} \left[ P_{n_1,\dots ,n_N}(X_1,\dots ,X_N) \right] \\= & {} \int _{\Omega ^{(N)}} P_{1,n_1}(x_1) \dots P_{N,n_N}(x_N) f_{\textbf{X}}(\textbf{x}) d\textbf{x}. \end{aligned}$$

$\square $

Appendix C: Proof of Theorem 5

To prove Theorem 5, we need Lemma 1 below. This Lemma provides a vital property as the convergence of the “remainder term” as described in Theorem 4.

Lemma 1

For a dimension $j \in \{1,\dots ,N\}$ as well as for any $\epsilon > 0$ and $m_1,\dots ,m_{j-1},$ $m_{j+1},\dots ,m_N \in \mathbb {N}_0$ there exists $K_j$ such that for all $n_{j} > K_{j}$,

$$\begin{aligned} \sum _{n_1 = 0}^{m_1} \dots \sum _{n_{j-1} = 0}^{m_{j-1}} \sum _{n_{j+1} = 0}^{m_{j+1}} \dots \sum _{n_N = 0}^{m_N} C_{n_1,\dots ,n_N}^2 < \epsilon . \end{aligned}$$

Proof

We will only provide the detailed proof for the case $N = 3$ for Lemma 1 and for the general N, we can achieve the proof by following the same process outlined for the proof of $N=3$.

Firstly we note that the finite order multivariate moment-based density converges to the true density as all the elements of the max polynomial order vector $\textbf{K}$ approaches infinity. This property follows as,

$$\begin{aligned}&\int _{\Omega ^{(N)}} \left|\frac{f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} - \frac{f_{\textbf{X};\textbf{K}}(\textbf{x})}{f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu } \\&\quad = \int _{\Omega ^{(N)}} \left|\frac{f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu } \\&\qquad - 2 \int _{\Omega ^{(N)}} \frac{f_{\textbf{X}}(\textbf{x}) f_{\textbf{X};\textbf{K}}(\textbf{x})}{f_{\nu }^2(\textbf{x})} d\mu _{\nu } + \int _{\Omega ^{(N)}} \left|\frac{f_{\textbf{X};\textbf{K}}(\textbf{x})}{ f_{\nu }(\textbf{x})} \right|^2d\mu _{\nu } \\&\quad = \sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 - 2\sum _{n_1=0}^{K_1} \dots \sum _{n_N=0}^{K_N} C_{n_1,\dots ,n_N}^2 \\&\qquad + \sum _{n_1=0}^{K_1} \dots \sum _{n_N=0}^{K_N} C_{n_1,\dots ,n_N}^2 \\&\quad = \sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 - \sum _{n_1=0}^{K_1} \dots \sum _{n_N=0}^{K_N} C_{n_1,\dots ,n_N}^2. \end{aligned}$$

Therefore,

$$\begin{aligned} \lim _{K_1 \rightarrow \infty \; \dots \; K_N \rightarrow \infty } \; \int _{{\Omega ^{(N)}}} \left|\frac{f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} - \frac{f_{\textbf{X};\textbf{K}}(\textbf{x})}{f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu } = 0. \end{aligned}$$

We note therefore that for any $\epsilon > 0$ there exists $\textbf{K}_0 = (K_{0,1},\dots ,K_{0,N})$ such that

$$\begin{aligned}&\int _{\Omega ^{(N)}} \left|\frac{f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} - \frac{f_{\textbf{X};\textbf{K}_0 }(\textbf{x})}{f_{\nu }(\textbf{x})} \right|^2 d\mu _{\nu } \\&\quad = \sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 - \sum _{n_1=0}^{K_{0,1}} \dots \sum _{n_N=0}^{K_{0,N}} C_{n_1,\dots ,n_N}^2 \\&\quad = \mathcal {T}_{\textbf{K}_0} < \epsilon . \end{aligned}$$

Consider for $N = 3$ and denote

$$\begin{aligned} T_0&= \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 \\ T_1&= \sum _{n_1=K_{0,1} +1}^\infty \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\ T_2&= \sum _{n_1=0}^{K_{0,1}} \sum _{n_2=K_{0,2} +1}^\infty \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\ T_3&= \sum _{n_1=K_{0,1} +1}^\infty \sum _{n_2=K_{0,2} +1}^\infty \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\ T_4&= \sum _{n_1=0}^{K_{0,1}} \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 \\ T_5&= \sum _{n_1=K_{0,1} +1}^\infty \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 \\ T_6&= \sum _{n_1=0}^{K_{0,1}} \sum _{n_2=K_{0,2} +1}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 \\ T_7&= \sum _{n_1=K_{0,1} +1}^\infty \sum _{n_2=K_{0,2} +1}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 . \\ \end{aligned}$$

Therefore, for any $\epsilon _0 > 0$ we can fix $\epsilon = \epsilon _0/2$ and note that there therefore exists a $\textbf{K}_0 = (K_{0,1},K_{0,2},K_{0,3})$ such that $ \mathcal {T}_{\textbf{K}_0} < \epsilon _0 / 2$. Thus, for such $\textbf{K}_0$,

$$\begin{aligned}&\epsilon _0 / 2 > \mathcal {T}_{\textbf{K}_0} = \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=0}^\infty C_{n_1,n_2,n_3}^2 \\&\qquad - \sum _{n_1=0}^{K_{0,1}} \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\&\quad = \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 + \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 \\&\qquad - \sum _{n_1=0}^{K_{0,1}} \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\&\quad = \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} \left( \sum _{n_1=0}^{K_{0,1}} C_{n_1,n_2,n_3}^2 + \sum _{n_1=K_{0,1} +1}^\infty C_{n_1,n_2,n_3}^2 \right) \\&\qquad + \sum _{n_1=0}^\infty \sum _{n_2=K_{0,2} +1}^\infty \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\&\qquad + \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 \\&\qquad -\sum _{n_1=0}^{K_{0,1}} \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\&\quad = \sum _{n_1=0}^{K_{0,1}} \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 + \sum _{n_1=K_{0,1} +1}^\infty \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\&\qquad + \sum _{n_1=0}^\infty \sum _{n_2=K_{0,2} +1}^\infty \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2\\&\qquad + \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2\\&\qquad -\sum _{n_1=0}^{K_{0,1}} \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\&\quad = \sum _{n_1=K_{0,1} +1}^\infty \sum _{n_2=0}^{K_{0,2}} \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2\\&\qquad + \sum _{n_1=0}^\infty \sum _{n_2=K_{0,2} +1}^\infty \sum _{n_3=0}^{K_{0,3}} C_{n_1,n_2,n_3}^2 \\&\qquad + \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 . \end{aligned}$$

As every term is positive, we note that this implies that

$$\begin{aligned} T_0 = \sum _{n_1=0}^\infty \sum _{n_2=0}^\infty \sum _{n_3=K_{0,3} +1}^\infty C_{n_1,n_2,n_3}^2 < \frac{\epsilon _0}{2}. \end{aligned}$$

Furthermore, following the same process, if we were to continue to partition all the infinite sums we would have

$$\begin{aligned} \mathcal {T}_{\textbf{K}_0} = \sum _{i=1}^7 T_i < \frac{\epsilon _0}{2} \end{aligned}$$

Now consider for any $m_1, m_2$ the sum

$$\begin{aligned} \sum _{n_1= 0 }^{m_1} \sum _{n_2=0}^{m_2} \sum _{n_3=K_{0,3}+1}^\infty C_{n_1,n_2,n_3}^2. \end{aligned}$$

We have four distinct cases of this sum:

Case 1 $m_1 \le K_{0,1}$ and $m_2 \le K_{0,2}$.

When $m_1 \le K_{0,1}$ and $m_2 \le K_{0,2}$ then it follows that

$$\begin{aligned} \sum _{n_1= 0 }^{m_1} \sum _{n_2=0}^{m_2} \sum _{n_3=K_{0,3}+1}^\infty C_{n_1,n_2,n_3}^2 \le T_4< \frac{\epsilon _0}{2} < \epsilon _0. \end{aligned}$$

Case 2 $m_1 \le K_{0,1}$ and $m_2 > K_{0,2}$.

When $m_1 \le K_{0,1}$ and $m_2 > K_{0,2}$ then it follows that

$$\begin{aligned}&\sum _{n_1= 0 }^{m_1} \sum _{n_2=0}^{m_2} \sum _{n_3=K_{0,3}+1}^\infty C_{n_1,n_2,n_3}^2 \\&\quad = \sum _{n_1= 0 }^{m_1} \sum _{n_3=K_{0,3}+1}^\infty \left( \sum _{n_2=0}^{K_{0,2}} C_{n_1,n_2,n_3}^2 + \sum _{n_2=K_{0,2}+1}^{m_2} C_{n_1,n_2,n_3}^2 \right) \\&\quad \le T_4 + T_6< \frac{\epsilon _0}{2} < \epsilon _0. \end{aligned}$$

Case 3 $m_1 > K_{0,1}$ and $m_2 \le K_{0,2}$.

When $m_1 > K_{0,1}$ and $m_2 \le K_{0,2}$ then it follows that

$$\begin{aligned}&\sum _{n_1= 0 }^{m_1} \sum _{n_2=0}^{m_2} \sum _{n_3=K_{0,3}+1}^\infty C_{n_1,n_2,n_3}^2 \\&\quad = \sum _{n_2= 0 }^{m_2} \sum _{n_3=K_{0,3}+1}^\infty \left( \sum _{n_1=0}^{K_{0,1}} C_{n_1,n_2,n_3}^2 + \sum _{n_1=K_{0,1}+1}^{m_1} C_{n_1,n_2,n_3}^2 \right) \\&\quad \le T_4 + T_5< \frac{\epsilon _0}{2} < \epsilon _0. \end{aligned}$$

Case 4 $m_1 > K_{0,1}$ and $m_2 > K_{0,2}$.

When $m_1 > K_{0,1}$ and $m_2 > K_{0,2}$ then it follows that

$$\begin{aligned}&\sum _{n_1= 0 }^{m_1} \sum _{n_2=0}^{m_2} \sum _{n_3=K_{0,3}+1}^\infty C_{n_1,n_2,n_3}^2 \\&\quad = \sum _{n_2= 0 }^{m_2} \sum _{n_3=K_{0,3}+1}^\infty \left( \sum _{n_1=0}^{K_{0,1}} C_{n_1,n_2,n_3}^2 + \sum _{n_1=K_{0,1}+1}^{m_1} C_{n_1,n_2,n_3}^2 \right) \\&\quad \le T_4 + T_6 + T_5 + T_0 < \frac{\epsilon _0}{2} + \frac{\epsilon _0}{2} = \epsilon _0. \end{aligned}$$

Therefore it follows for any $m_1, m_2$ we have

$$\begin{aligned} \sum _{n_1= 0 }^{m_1} \sum _{n_2=0}^{m_2} \sum _{n_3=K_{0,3}+1}^\infty C_{n_1,n_2,n_3}^2 < \epsilon _0. \end{aligned}$$

Similarly, repeating this process we have

$$\begin{aligned}{} & {} \text {for any}\;m_1, m_3 \quad \sum _{n_1= 0 }^{m_1} \sum _{n_2= K_{0,2}+1}^{\infty } \sum _{n_3= 0 }^{m_3} C_{n_1,n_2,n_3}^2< \epsilon _0, \\{} & {} \text {for any}\;m_2, m_3 \quad \sum _{n_1= K_{0,1}+1}^{\infty } \sum _{n_2=0}^{m_2} \sum _{n_3= 0 }^{m_3} C_{n_1,n_2,n_3}^2 < \epsilon _0. \end{aligned}$$

$\square $

Proof of Theorem 5

For a fixed $j \in \{ 1,\dots ,N \}$ we define the notation for some vector $\textbf{a} = (a_1,\dots ,a_N)$ that $\textbf{a}_{(j)} = (a_1,\dots ,a_{j-1}, a_j +1, a_{j+1},\dots a_N)$.

We assume there exists $\epsilon _0 > 0 $ such that $ {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 $ for all $n_1,\dots ,n_N \in \mathbb {N}_0$ where $(n_1,\dots ,n_N) \ne (0,\dots ,0)$.

Consider the $\epsilon _0$ and j, from Lemma 1, for any $m_1,\dots ,m_{j-1}$, $m_{j+1},\dots ,m_N \in \mathbb {N}_0$ there exists $K_j(\epsilon _0)>0$ such that for all $m_{j} > K_{j}(\epsilon _0)$,

$$\begin{aligned} \sum _{n_1 = 0}^{m_1} \dots \sum _{n_{j-1} = 0}^{m_{j-1}} \sum _{n_{j+1} = 0}^{m_{j+1}} \dots \sum _{n_N = 0}^{m_N} C_{n_1, \cdots ,n_{j-1},m_j, n_{j+1},\ldots , n_N}^2 < \epsilon _0. \end{aligned}$$

Therefore

$$\begin{aligned} \mathcal {T}_{\textbf{m}} - \mathcal {T}_{\textbf{m}_{(j)}}&= \sum _{n_1 = 0}^\infty \dots \sum _{n_N = 0}^\infty C_{n_1,\dots ,n_N}^2 \end{aligned}$$

(C1)

$$\begin{aligned}&\quad - \sum _{n_1 = 0}^{m_1} \dots \sum _{n_N = 0}^{m_N} C_{n_1,\dots ,n_N}^2 \nonumber \\&\quad - \sum _{n_1 = 0}^\infty \dots \sum _{n_N = 0}^\infty C_{n_1,\dots ,n_N}^2 \nonumber \\&\quad + \sum _{n_1 = 0}^{m_1} \dots \sum _{n_{j-1} = 0}^{m_{j-1}} \sum _{n_{j} = 0}^{m_{j}+1} \sum _{n_{j+1} = 0}^{m_{j+1}} \dots \sum _{n_N = 0}^{m_N} C_{n_1,\dots ,n_N}^2 \nonumber \\&= \sum _{n_1 = 0}^{m_1} \dots \sum _{n_{j-1} = 0}^{m_{j-1}} \sum _{n_{j} = 0}^{m_{j}+1} \sum _{n_{j+1} = 0}^{m_{j+1}} \dots \sum _{n_N = 0}^{m_N} C_{n_1,\dots ,n_N}^2 \end{aligned}$$

(C2)

$$\begin{aligned}&\quad - \sum _{n_1 = 0}^{m_1} \dots \sum _{n_N = 0}^{m_N} C_{n_1,\dots ,n_N}^2 \nonumber \\&= \sum _{n_1 = 0}^{m_1} \dots \sum _{n_{j-1} = 0}^{m_{j-1}} \sum _{n_{j+1} = 0}^{m_{j+1}} \dots \sum _{n_N = 0}^{m_N} C_{n_1,\dots ,m_j+1,\dots ,n_N}^2 \nonumber \\&< \epsilon _0. \end{aligned}$$

(C3)

Suppose, if there is a $\textbf{m} =(m_1, \ldots , m_N)$ with a $m_{j} >K_{j}(\epsilon _0)$ such that

$$\begin{aligned} \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{m}_{(j)}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \le \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{m}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] . \end{aligned}$$

We note that

$$\begin{aligned}&\mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{m}_{(j)}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \le \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{m}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \\&\quad \Rightarrow \sum _{k_1 = 0}^{m_1} \dots \sum _{k_N = 0}^{m_N} {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{k_1,\dots ,k_N} \right) \\&\qquad + \sum _{k_1 = 0}^{m_1} \dots \sum _{k_{j-1} = 0}^{m_{j-1}} \sum _{k_{j+1} = 0}^{m_{j+1}} \dots \sum _{k_N = 0}^{m_N} {\textit{Var}}_{\textbf{X}}\\&\qquad \times \left( \hat{C}_{k_1,\dots ,k_{j-1},m_j+1,k_{j+1},\dots ,k_N} \right) \\&\qquad + \mathcal {T}_{\textbf{m}_{(j)}} - \sum _{k_1 = 0}^{m_1} \dots \sum _{k_N = 0}^{m_N} {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{k_1,\dots ,k_N} \right) - \mathcal {T}_{\textbf{m}} \le 0 \\&\quad \Rightarrow \sum _{k_1 = 0}^{m_1} \dots \sum _{k_{j-1} = 0}^{m_{j-1}} \sum _{k_{j+1} = 0}^{m_{j+1}} \dots \sum _{k_N = 0}^{m_N} {\textit{Var}}_{\textbf{X}}\\&\qquad \times \left( \hat{C}_{k_1,\dots ,k_{j-1},m_j+1,k_{j+1},\dots ,k_N} \right) \le \mathcal {T}_{\textbf{m}} - \mathcal {T}_{\textbf{m}_{(j)}} \\&\quad \Rightarrow \sum _{k_1 = 0}^{m_1} \dots \sum _{k_{j-1} = 0}^{m_{j-1}} \sum _{k_{j+1} = 0}^{m_{j+1}} \dots \sum _{k_N = 0}^{m_N-1} {\textit{Var}}_{\textbf{X}}\\&\qquad \times \left( \hat{C}_{k_1,\dots ,k_{j-1},m_j+1,k_{j+1},\dots ,k_N} \right) \\&\qquad + {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{m_1,\dots ,m_j+1,\dots ,m_N} \right) \le \mathcal {T}_{\textbf{m}} - \mathcal {T}_{\textbf{m}_{(j)}} \\&\quad \Rightarrow {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{m_1,\dots ,m_j+1,\dots ,m_N} \right) \le \mathcal {T}_{\textbf{m}} - \mathcal {T}_{\textbf{m}_{(j)}} \\&\quad \text {as}\; {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{k_1,\dots ,k_N} \right) \ge 0\quad \text {for all}\quad k_1,\dots ,k_N \in \mathbb {N}_0. \end{aligned}$$

And hence as $m_j> K_j(\epsilon _0) >0 $, from (C1),

$$\begin{aligned} {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{m_1,\dots ,m_j+1,\dots ,m_N} \right) \le \mathcal {T}_{\textbf{m}} - \mathcal {T}_{\textbf{m}_{(j)}} < \epsilon _0. \end{aligned}$$

However as ${\textit{Var}}_{\textbf{X}}\left( \hat{C}_{n_1,\dots , n_N} \right) \ge \epsilon _0$ for all $\textbf{n} = (n_1, \ldots , n_N) \not = (0, \ldots , 0)$,

$$\begin{aligned} {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{m_1,\dots ,m_j+1,\dots ,m_N} \right) \ge \epsilon _0, \end{aligned}$$

which is a contradiction.

Therefore we can conclude that if there exists $\epsilon _0 > 0$ such that $ {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 $ for all $\textbf{n} = (n_1,\dots ,n_N) \ne (0,\dots ,0)$, then for any $m_1,\dots ,m_{j-1},m_{j+1},\dots ,m_N$ there exists a $K_j(\epsilon _0) >0$ such that for all $m_j$ where $m_j > K_j(\epsilon _0)$ and $\textbf{m} = (m_1,\ldots , m_N)$ then

$$\begin{aligned} \mathbb {E} \left[ \mathcal {N}_{\textbf{m}_{(j)}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] > \mathbb {E} \left[ \mathcal {N}_{\textbf{m}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] . \end{aligned}$$

As this applies to any dimension $j \in \{ 1,\dots ,N \} $ then we can say that there exists a vector $\textbf{K}_0$ whose elements are given by $K_{0,j} = K_{j}$ for each $j = 1,\dots ,N$ such that for any max polynomial order $\textbf{K} =(K_1, \ldots , K_N)$ not less than $\textbf{K}_0$. then,

$$\begin{aligned} \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] > \mathbb {E} \left[ \mathcal {N}_{\textbf{K}_0}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] . \end{aligned}$$

Hence given that $\mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] > 0$ for any $\textbf{K} \in \mathbb {N}_0^N $, then an optimal max polynomial order (as defined in Definition 5) $\textbf{K}^*$ given by $\textbf{K}^* = \arg \!\min _{\textbf{K} \in \mathbb {N}_0^N }\{ \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \} $ can also be determined by considering the finite set $\mathcal {K} = \{ \textbf{K}=(K_1,\dots ,K_N) : K_i \in \mathbb {N}_0 \text { and } K_i \le K_{0,i} \text { for each } i = 1,\dots ,N \}$ which is finite as $||\mathcal {K} ||_c = \prod _{i=1}^N (K_{0,i} + 1) < \infty $.

We note that for all $\textbf{K} \in \mathbb {N}_0^N \backslash \mathcal {K}$,

$$\begin{aligned} \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] > \mathbb {E} \left[ \mathcal {N}_{\textbf{K}_0}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] . \end{aligned}$$

Hence $\mathbb {E} \left[ \mathcal {N}_{\textbf{K}_0}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] $ is an infimum to the set $\{\mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] : \textbf{K} \in \mathbb {N}_0^N \backslash \mathcal {K} \}$ and therefore we can say as $\textbf{K}_0 \in \mathcal {K}$, that

$$\begin{aligned} \textbf{K}^*= & {} \arg \!\min _{\textbf{K} \in \mathbb {N}_0^N }\left\{ \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \right\} \\= & {} \arg \!\min _{\textbf{K} \in \mathcal {K} }\left\{ \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \right\} . \end{aligned}$$

Noting that all finite sets must have a minimum, we can then be sure that if there exists $\epsilon _0 > 0 $ such that $ {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 $ for all $n_1,\dots ,n_N \in \mathbb {N}_0$ where $(n_1,\dots ,n_N) \ne (0,\dots ,0)$, then the expected $\mathcal {N}_{\textbf{K}}$-Norm $ \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] $ (Definition 4) has a finite optimal max polynomial order $\textbf{K}^*$. $\square $

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Wakefield, B., Lin, YX., Sarathy, R. et al. Moment-based density estimation of confidential micro-data: a computational statistics approach. Stat Comput 33, 35 (2023). https://doi.org/10.1007/s11222-022-10203-1

Download citation

Received: 08 February 2022
Accepted: 30 December 2022
Published: 20 January 2023
DOI: https://doi.org/10.1007/s11222-022-10203-1

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Moment-based density estimation of confidential micro-data: a computational statistics approach

Abstract

Similar content being viewed by others

Violating the normality assumption may be the lesser of two evils

A simple introduction to Markov Chain Monte–Carlo sampling

A Systematic Review of Hidden Markov Models and Their Applications

1 Introduction

2 The multivariate moment-based density estimate

2.1 The orthogonal polynomial basis

2.1.1 Univariate polynomial basis

Theorem 1

Property 1

Property 2

Property 3

2.1.2 A multivariate polynomial basis

Definition 1

Theorem 2

2.2 Multivariate moment-based density function

Theorem 3

Remark

Definition 2

2.3 Multivariate sample moment-based density estimation

Definition 3

3 The \(\mathcal {N}_{\textbf{K}}\)-norm and max polynomial order

Definition 4

Theorem 4

Definition 5

Theorem 5

Definition 6

4 Comparing density estimation methods

4.1 Bivariate beta distribution example

5 Generating synthetic micro-data example

5.1 Simulating synthetic data from the density estimate

5.2 Synthetic data example

6 Conclusion

Change history

27 February 2023

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding authors

Additional information

Publisher's Note

Appendices

Appendix A: Proof of Theorem 2

Proof

Appendix B: Proof of Theorem 3

Proof

Appendix C: Proof of Theorem 5

Lemma 1

Proof

Proof of Theorem 5

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation