Nonparametric Multivariate Density Estimation: Case Study of Cauchy Mixture Model

Estimation of probability density functions (pdf) is considered an essential part of statistical modelling. Heteroskedasticity and outliers are problems that make data analysis harder, and the Cauchy mixture model helps to cover both of them. This paper studies five significant types of non-parametric multivariate density estimation techniques, algorithmically and empirically, without assuming that the data originate from any known parametric family of distributions. The inversion-formula method is modified by including a noise cluster in the general mixture model, and the effectiveness of this modification is demonstrated through a simulation study. The relationship between the accuracy of estimation and the complexity of multidimensional Cauchy mixture models (CMM) is analyzed using the Monte Carlo method. For larger dimensions (d ~ 5) and small samples (n ~ 50), the adaptive kernel method is recommended. If the sample size is n ~ 100, the modified inversion formula (MIDE) is recommended. For larger samples, a semi-parametric kernel estimate is better for overlapping distributions, and the modified inversion method for more isolated distributions. Under the mean absolute percentage error, the semi-parametric kernel estimate is recommended when the sample has overlapping distributions. In smaller dimensions (d = 2), the semi-parametric kernel method (SKDE) is recommended for samples with overlapping distributions, and the modified inversion formula (MIDE) for isolated distributions. The simulation of the inversion-formula algorithm shows that with a noise cluster the results of the inversion formula improve significantly.
MSC: 62G05; 62G07; 62G30


Introduction
Estimation of probability density functions (pdf) is considered an essential part of statistical modelling. It expresses random variables as functions of other variables, making it possible to detect hidden relationships in data [1]. A significant number of machine learning algorithms need to determine a previously unknown distribution density of the data. The distribution density function is applied in the Bayesian classifier [2,3], in density-based clustering algorithms [4][5][6], and in information-based feature selection algorithms [7,8]. Effective density estimators must therefore be carefully designed in advance. Nowadays, there is still much focus on developing innovative density estimation procedures [9,10]. Density estimation is an open research topic in the fast-growing area of deep learning: researchers have begun proposing robust density estimators based on neural networks such as Parzen neural networks [11], soft-constrained neural networks [12], and others [13].
Let us say that the random vector X ∈ R^d satisfies the distribution mixture model if its distribution density f(x) satisfies the equation f(x) = ∑_{k=1}^{q} p_k f_k(x) = f(x, θ). The parameter q is the number of mixture clusters, and p_k is the a priori probability of the k-th cluster; these probabilities must satisfy p_k > 0 and ∑_{k=1}^{q} p_k = 1. Each f_k(x) is a distribution density function, and θ is the multidimensional parameter of the model. Suppose X is a d-dimensional random vector with distribution density f(x), and there is a sample of independent copies of X, where X = (X(1), . . . , X(n)). It can be argued that the sample satisfies the mixture model if each X(t) has the density f(x) = ∑_{k=1}^{q} p_k f_k(x) = f(x, θ). One of the statistical tasks is to estimate the density of the observed random variable. Suppose the distribution type of the available sample is known (Normal, Poisson, and others). In that case, the distribution density of the data can be estimated simply from the mean and covariance matrix estimates, fitting them to the given distribution [14][15][16]. Thus, the standard parametric method is applied when the assumptions about the density form are met. When estimating density parametrically, the value of the multidimensional distribution parameter θ needs to be found, which is not straightforward because the number of parameters increases rapidly as the dimension d increases. For example, in the case of a mixture of Gaussian distributions, dim θ = (1/2)qd(d + 1) + qd + q − 1, and even with the small dimensions d = q = 5 the model consists of dim(θ) = 104 parameters, so finding parameter estimates may require solving an optimization problem in a 104-dimensional space. In practice, the number of clusters q may also be unknown, and it needs to be estimated. The parametric method is not appropriate when the distribution of the random variable is unknown. In this case, non-parametric methods are used to obtain density estimates of a certain form [17][18][19].
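To make the mixture model f(x) = ∑ p_k f_k(x) concrete, the sketch below evaluates a two-component bivariate Cauchy mixture. It uses SciPy's multivariate t distribution with one degree of freedom, which coincides with the multivariate Cauchy distribution; the function name and parameter values are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np
from scipy.stats import multivariate_t

def mixture_pdf(x, weights, locs, shapes):
    """Evaluate f(x) = sum_k p_k f_k(x) for a multivariate Cauchy mixture.

    Each component f_k is a multivariate t density with df=1,
    i.e., a multivariate Cauchy distribution."""
    x = np.atleast_2d(x)
    dens = np.zeros(x.shape[0])
    for p, m, S in zip(weights, locs, shapes):
        dens += p * multivariate_t(loc=m, shape=S, df=1).pdf(x)
    return dens

# illustrative two-component bivariate Cauchy mixture (q = 2, d = 2)
weights = [0.6, 0.4]                        # a priori probabilities p_k, sum to 1
locs = [np.zeros(2), np.array([3.0, 3.0])]  # location parameters
shapes = [np.eye(2), 0.5 * np.eye(2)]       # scale matrices
f = mixture_pdf([[0.0, 0.0], [3.0, 3.0]], weights, locs, shapes)
```

The heavy Cauchy tails mean both components contribute non-negligibly even far from their centres, which is exactly what makes CMM data hard for parametric estimators.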
The histogram is one of the simplest and oldest estimates of density. To the best of our knowledge, data in the form of histograms (without graphical representation) were first presented in 1661 to determine mortality probabilities in different age groups [20]. To approximate the density f(x) in an area Ω, the number of observations X(t) falling into Ω is calculated and divided by n and by the volume of the area Ω. First, the region of space containing all the observations is found; that is, the fluctuation intervals of the projections of X onto the axes X^(1), X^(2), . . . , X^(d) are found. The fluctuation intervals of the observations are divided into l partial intervals each, and in the hypercubes Ω_j (j = 1, . . . , r) bounded by them the density estimate is calculated as f̂(x) = n_{Ω_j} / (n · h_1 ··· h_d) for x ∈ Ω_j. Here n_{Ω_j} is the number of observations falling into the hypercube Ω_j, and h_j, j = 1, . . . , d, are the edge lengths of the hypercube [21,22]. For the number of hypercubes it is recommended [17,23,24] to choose r ≅ 1 + 3.32 log(n); since l = r^{1/d} has to be an integer, r is chosen as the integer power l^d closest to 1 + 3.32 log(n).
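The histogram estimate above can be sketched as follows; the helper name and the choice of l = 8 bins per axis are illustrative.

```python
import numpy as np

def histogram_density(X, l):
    """Histogram density estimate: count observations per hypercube Omega_j,
    then divide by n times the hypercube volume h_1 * ... * h_d."""
    X = np.asarray(X)
    n, d = X.shape
    counts, edges = np.histogramdd(X, bins=l)   # l equal bins per axis
    h = [e[1] - e[0] for e in edges]            # edge lengths h_j of the hypercubes
    volume = np.prod(h)
    return counts / (n * volume), edges

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
dens, edges = histogram_density(X, l=8)
# the estimate integrates to one over the sample range
total = dens.sum() * np.prod([e[1] - e[0] for e in edges])
```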
A histogram is one of the simplest means of presenting data: easy to understand and convenient. This estimate is a function that takes non-negative values, and its integral is equal to one. However, it is not continuous. That poses problems when the derivatives of the density estimate are needed, mainly when density estimation is used in intermediate steps of other methods, such as clustering with a gradient algorithm or plotting level lines of high-dimensional data. Remarkably, the histogram stood as the only non-parametric density estimator until the 1950s, when substantial and simultaneous progress was made in density estimation and spectral density estimation. In 1951, in a little-known paper, Fix and Hodges [25] introduced the basic algorithm of non-parametric density estimation; the unpublished technical report was formally published with a review by Silverman and Jones in 1989 [26]. They addressed the problem of statistical discrimination when the parametric form of the sampling density is not known. During the following decade, several general algorithms and alternative theoretical modes of analysis were introduced by Rosenblatt in 1956 [27], Parzen in 1962 [28], and Cencov in 1962 [29]. Then followed a second wave of essential and primarily theoretical papers by Watson and Leadbetter in 1963 [30], Loftsgaarden and Quesenberry in 1965 [31], Schwartz in 1967 [32], Epanechnikov in 1969 [33], Tarter and Kronmal in 1970 [34], and Kimeldorf and Wahba in 1971 [35]. Next, Cacoullos introduced the natural multivariate generalization in 1966 [36]. Finally, in the 1970s, the first papers focusing on the practical application of these methods were published by Scott et al. in 1979 [24] and Silverman in 1978 [37]. These and later multivariate applications awaited the computing revolution.
Modern data analysis uses several non-parametric methods for statistically estimating the distribution density of multivariate random variables. Kernel estimates are particularly prevalent [38,39]. Spline [40,41] and semi-parametric [42][43][44][45][46] algorithms are also quite popular. However, detailed comparisons of the effectiveness of existing popular estimates for multimodal densities are lacking. With the most popular non-parametric estimation procedures, the optimal selection of their parameters is a practical difficulty. The most crucial element in the design of kernel estimates is the smoothing bandwidth, and selecting the nodes of spline estimates is not easy either. Although several adaptive procedures for the selection of these parameters have been developed [39,[47][48][49][50][51][52]], they are not efficient enough when the sample size is not large, especially when the dimension of the observations is large. In the latter case, it is appropriate to apply data projection [53][54][55][56], because the larger the dimension of the observed random vectors, the more complex the task of parameter selection.
The main idea of this paper is to estimate the performance of different density estimators by using density mixtures that exhibit a distinct type of problem, which may result from data heteroscedasticity and outliers. The relationship between the accuracy of estimation and the complexity of multidimensional Cauchy mixture models (CMM) is analyzed using the Monte Carlo method. For example, Kalantan and Einbeck [57] used engineering data and, for computer vision, used CMM, comparing it with the Gaussian mixture model. Azzari and Foi [58] used a combination of Gaussian and heavy-tailed Cauchy components to find noise-model parameters that make outlier estimation robust when the image is dominated by texture. Finally, Teimouri [59] analyzed patients with Cushing's syndrome and their diagnostic tests, focusing on the tetrahydrocortisone urine release rate (mg/24 h) and on evaluating the parameters of the Cauchy mixture model with the EM algorithm. Scientific novelty: a comparative analysis of evaluation accuracy is made using different probability density estimation procedures. The density function estimates chosen are popular techniques of different types, which other researchers have already analyzed. This research is essential because it focuses on Cauchy distributions.

The Density Estimation Algorithms
This section aims to present the density estimation algorithms used in the study, together with their theoretical substantiation. When making a histogram, each X(t) can be imagined as a separate column of height 1/n. It then makes sense to move the centre of the column to X(t) itself and obtain the function f̂(x) = (1/(n · h_1 ··· h_d)) ∑_{t=1}^{n} 1{x ∈ C_h(X(t))}. Here C_h(X(t)) is a hypercube with centre X(t), and the lengths of its edges are h_1, . . . , h_d. In summary, instead of the indicator function, a smooth "prominence", the kernel function, can be used at each observed point. The multidimensional fixed-bandwidth estimate with kernel function K and fixed (global) kernel width parameter h, which can be used to estimate the density f(x) of multidimensional data X ∈ R^d, is then defined as follows:

f̂(x) = (1/(n h^d)) ∑_{t=1}^{n} K((x − X(t))/h). (3)

These are some of the most common non-parametric estimates of distribution density [38,39,60,61]. The kernel function is selected to satisfy K(x) ≥ 0 and ∫_{R^d} K(x) dx = 1. The standard normal distribution density function ϕ is often used as the kernel [62,63]. Often the observations are not evenly distributed in all directions. Therefore, it is desirable to scale the data by eliminating the most significant dispersion differences in the different coordinate directions. One suitable method for this [64] is data standardization, i.e., a linear transformation of the sample after which the mean of the transformed data is zero and the covariance matrix is unitary; Equation (3) is then applied to the standardized data. For example, Z = S^{−1/2}(X − X̄) is a standardized random vector, where X̄ is the empirical mean of the sample and S ∈ R^{d×d} is the empirical covariance matrix. Based on the fixed kernel width density estimate (3), a more complex standardized-data density estimate has been constructed. The optimal kernel width h* for a fixed kernel width is determined by minimizing the mean integrated squared error (MISE) [65].
For example, when the distribution of the observations is normal with a unit covariance matrix and the Gaussian kernel is used, the optimal width proposed in [65] is h* = (4/(d + 2))^{1/(d+4)} n^{−1/(d+4)}. More sophisticated kernel width selection methods (such as the least-squares cross-validation method) require more complex and lengthy calculations [66][67][68][69][70].
In practical research, the kernel width is often selected experimentally. If the value of h is small, the density function estimate has more modes that correspond to the layout of the observed data. A higher value of h means more significant smoothing of the estimate.
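A minimal sketch of the fixed kernel width estimate (3) with the Gaussian kernel, using the normal-reference width h* from above when no h is supplied; the function name and sample values are illustrative.

```python
import numpy as np

def fixed_kde(x, X, h=None):
    """Fixed-bandwidth Gaussian-kernel density estimate
    f^(x) = (1/(n h^d)) sum_t K((x - X(t))/h), K = standard normal density."""
    X = np.asarray(X)
    n, d = X.shape
    if h is None:
        # normal-reference width for (approximately) unit-covariance data
        h = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))
    diff = (np.atleast_2d(x)[:, None, :] - X[None, :, :]) / h
    kern = np.exp(-0.5 * np.sum(diff ** 2, axis=-1)) / (2 * np.pi) ** (d / 2)
    return kern.sum(axis=1) / (n * h ** d)

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 2))
f0 = fixed_kde(np.zeros((1, 2)), X)   # estimate at the origin
```

Rerunning with a much smaller or larger h illustrates the under- and over-smoothing behaviour described above.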
Although fixed kernel width density estimates are widely used for non-parametric density estimation, they often have practical drawbacks [65]. For example, a fixed kernel width cannot capture the tails of the distribution adequately without over-smoothing the main bulk of the density.

Adapted Kernel Density Estimate (AKDE)
A good improvement on the fixed kernel width density estimate is the adapted kernel density estimate [65]. It is constructed similarly to the fixed kernel width estimate: a kernel describes the density at each observed point, but the kernel width now changes when moving from one observation to another, since in areas of different smoothness it is appropriate to take different kernel widths. The method consists of two stages: estimation of the adapted kernel widths, and density estimation by the kernel method using the information obtained in the first stage. The algorithm can be summarized as follows. Step 1: The elements of sample X = (X(1), . . . , X(n)) are standardized to Z = (Z(1), . . . , Z(n)) such that Ê[Z] = 0 and Ê[ZZ'] = I. Step 2: A pilot density estimate f̃_Z(z) of the standardized data is computed by the fixed kernel width method (3).
Step 3: The local width parameters are determined as λ(t) = (f̃_Z(Z(t))/g)^{−γ}, where g is the geometric mean of the pilot values, log g = (1/n) ∑_{t=1}^{n} log f̃_Z(Z(t)), and γ is the sensitivity parameter. Step 4: An adapted kernel estimate is made with variable-width kernels:

f̂_Z(z) = (1/n) ∑_{t=1}^{n} (hλ(t))^{−d} K((z − Z(t))/(hλ(t))),

where h is the same global smoothness parameter as in Equation (3). The higher γ, the more sensitive the width selection is to the local density. Quite often, the parameter value γ = 1/2 is selected [65,71].
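The four steps can be sketched as follows. The sketch assumes the Gaussian kernel, the normal-reference global width, and γ = 1/2; the helper names are illustrative.

```python
import numpy as np

def pilot_kde(x, Z, h):
    """Step 2: fixed-width Gaussian-kernel pilot estimate on standardized data."""
    n, d = Z.shape
    diff = (np.atleast_2d(x)[:, None, :] - Z[None, :, :]) / h
    k = np.exp(-0.5 * (diff ** 2).sum(-1)) / (2 * np.pi) ** (d / 2)
    return k.sum(1) / (n * h ** d)

def adaptive_kde(x, X, gamma=0.5):
    """AKDE sketch: standardize, pilot estimate, local widths, variable kernels."""
    X = np.asarray(X)
    n, d = X.shape
    mean = X.mean(0)
    L = np.linalg.cholesky(np.cov(X, rowvar=False))   # S = L L'
    Linv = np.linalg.inv(L)
    Z = (X - mean) @ Linv.T                           # step 1: mean 0, covariance I
    h = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))
    pilot = pilot_kde(Z, Z, h)                        # step 2: pilot values at Z(t)
    g = np.exp(np.mean(np.log(pilot)))                # geometric mean of the pilot
    lam = (pilot / g) ** (-gamma)                     # step 3: local widths lambda(t)
    z = (np.atleast_2d(x) - mean) @ Linv.T
    diff = z[:, None, :] - Z[None, :, :]
    w = h * lam                                       # per-observation kernel widths
    k = np.exp(-0.5 * (diff ** 2).sum(-1) / w ** 2) / ((2 * np.pi) ** (d / 2) * w ** d)
    # step 4: kernel sum, then back-transform the density to the original scale
    return k.sum(1) / (n * np.linalg.det(L))

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 2))
f_center = adaptive_kde([[0.0, 0.0]], X)
f_tail = adaptive_kde([[6.0, 6.0]], X)
```

Points in sparse regions get wider kernels (λ(t) > 1), which is what improves the tail behaviour relative to the fixed-width estimate.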

Semi-Parametric Kernel Density Estimate (SKDE)
When data are scarce, parametric estimates are often applied even when the unknown density is not parameterized. It is therefore worth mentioning combinations of parametric and non-parametric estimates. One semi-parametric kernel density estimate was examined by F. Hoti and L. Holmström [46]. This estimate divides the random vector into two subvectors, estimates the distribution density of one of them by the kernel method, and approximates the conditional density of the other by the Normal distribution density [46]. Suppose d and s are positive integers with d ≥ 2 and 1 ≤ s ≤ d − 1. Using this method, the d-dimensional vector X ∈ R^d is decomposed into an s-dimensional and a (d − s)-dimensional subvector, X = (Y, Z), and the sample is decomposed accordingly. The evaluated density function is expressed as the product of the distribution density of the random vector Y and the conditional distribution density of the random vector Z:

f_X(x) = f_Y(y) f_{Z|Y=y}(z).

Here f_X and f_Y are the densities of X and Y, and f_{Z|Y=y} is the density of Z when Y = y. Suppose that the conditional density given Y = y is multidimensional Gaussian, but the density f_Y does not belong to any parametric family of functions. The density f_X is then obtained by estimating f_Y in a non-parametric manner and fitting a multidimensional Normal density to each f_{Z|Y=y}. The density function f_Y(y) is evaluated by the kernel method [65]. Since the sample elements are not standardized, the smoothness parameter is not the same in all directions; therefore, in the kernel method it is replaced by the s-dimensional matrix H:

f̂_Y(y) = (1/n) ∑_{t=1}^{n} K_H(y − Y(t)), K_H(y) = det(H)^{−1} K(H^{−1} y).

Usually H is chosen diagonal, H = diag(h_1, . . . , h_s), and the smoothness parameters are calculated as follows. It should be noted that this form, for s = 1, was proposed by B. W. Silverman [65].
Replacing the standard deviation σ_j of the component Y_j with its estimate σ̂_j = ((1/n) ∑_t (Y_j(t) − Ȳ_j)²)^{1/2} and noting, by the rule of D. W. Scott [39], that the first multiplier is always between 0.924 and 1.059, ĥ_j can be calculated as ĥ_j = σ̂_j n^{−1/(s+4)}. This Scott's rule is easy to generalize to the smoothness matrix H:

H = n^{−1/(s+4)} Σ̂^{1/2}.

Here Σ̂ = diag(σ̂²_1, σ̂²_2, . . . , σ̂²_s) is the diagonal matrix of the empirical variances of Y. The conditional density f_{Z|Y}(· | y) is approximated by the Gaussian distribution N(m(y), C(y)), where m(y) and C(y) denote the conditional mean and conditional covariance matrix of the vector Z: m(y) = E(Z | Y = y), C(y) = E[(Z − m(y))(Z − m(y))' | Y = y], y ∈ R^s. For the estimation of m(y) and C(y), it is proposed to apply kernel smoothing:

m̂(y) = ∑_{t=1}^{n} w_t(y) Z(t), with weights w_t(y) = K_H(y − Y(t)) / ∑_{u=1}^{n} K_H(y − Y(u)),

the sum of which is equal to one. This formula can be understood as the regression estimate of the conditional mean function m of Nadaraya and Watson [72,73]. The conditional covariance matrix can be evaluated similarly, Ĉ(y) = ∑_{t=1}^{n} w_t(y)(Z(t) − m̂(y))(Z(t) − m̂(y))', y ∈ R^s. The parametric estimate of the conditional density is then f̂_{Z|Y=y}(z) = ϕ(z; m̂(y), Ĉ(y)). The procedure described above is called the semi-parametric kernel density estimate. In practice, the method is useful even when the assumption of conditional normality of some components of the random vector holds only approximately. The decomposition dimension s and the choice of the coordinates entering Y both influence the accuracy of the density estimation results. One way to select them is to use the least-squares method or the maximum likelihood cross-validation method recommended by the original authors [46], who also propose modified smoothness matrices (denoted H2 and H3 in [46]).
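The kernel-smoothed conditional moments m̂(y) and Ĉ(y) can be sketched as follows. The sketch assumes one-dimensional Y and Z and reduces the diagonal matrix H to a scalar h; all names and the simulated dependence are illustrative.

```python
import numpy as np

def nw_conditional_moments(y, Y, Z, h):
    """Kernel-weighted (Nadaraya-Watson) estimates of the conditional mean
    m(y) = E(Z | Y = y) and conditional covariance C(y), as in the SKDE step."""
    Y = np.atleast_2d(Y)                 # (n, s) conditioning subvectors
    Z = np.atleast_2d(Z)                 # (n, d-s) remaining subvectors
    diff = (y - Y) / h
    w = np.exp(-0.5 * (diff ** 2).sum(-1))
    w = w / w.sum()                      # weights w_t(y), sum to one
    m = (w[:, None] * Z).sum(0)          # conditional mean estimate m^(y)
    R = Z - m
    C = (w[:, None, None] * R[:, :, None] * R[:, None, :]).sum(0)
    return m, C

rng = np.random.default_rng(2)
Y = rng.standard_normal((300, 1))
Z = 2.0 * Y + 0.1 * rng.standard_normal((300, 1))   # Z depends strongly on Y
m, C = nw_conditional_moments(np.array([0.5]), Y, Z, h=0.3)
```

Since Z ≈ 2Y, the estimate m̂(0.5) should be close to 1, while Ĉ(y) captures the small residual scatter around the conditional mean.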

Logspline Estimation (LSDE)
This subsection describes the logspline density estimate (LSDE). One-dimensional polynomial splines are piecewise polynomials of a certain degree; the breakpoints at which one polynomial passes into another are called nodes. Suppose the vector t = (t_1, . . . , t_K) ∈ R^K defines a set of K such points. Splines describe smooth connections between the different regions separated by the nodes [74]. The smoothness constraints are precisely defined by the number s of continuous derivatives of the piecewise polynomials. The simplest case is piecewise linear curves: if there are no restrictions, breakpoints are allowed at the nodes of these functions; if the functions are assumed to be globally continuous, the individual linear parts are required to meet at each node; if greater smoothness (continuous first-order derivatives) is required, the flexibility at the nodes is lost and the curves reduce to simple linear functions. In the literature on approximation theory, the term "linear spline" is applied to a continuous piecewise linear function. Accordingly, the term "cubic spline" is assigned to continuous piecewise cubic functions with second-order continuous derivatives and nodes that allow jumps of the third-order derivatives. If the polynomial degree is b and the vector of the nodes is t, then the set of polynomial splines with s continuous derivatives forms a linear space. For example, the set of linear splines with nodes in the sequence t is spanned by the functions

1, x, (x − t_1)_+, . . . , (x − t_K)_+, (14)

where (u)_+ = max(u, 0). We will rely on such a set as a basis of the space. The basis of the spline space of degree b and smoothness s consists of the monomials 1, x, . . . , x^b and the truncated monomials (x − t_k)^{s+j}_+, where 1 ≤ j ≤ b − s. Using this formula, in the case of classical cubic splines, where b = 3 and s = 2, the basis consists of the elements

1, x, x², x³, (x − t_1)³_+, . . . , (x − t_K)³_+. (15)

From the model point of view, this basis is convenient because the individual functions are tied to the nodes.
In expressions (14) and (15), each truncated function is associated with exactly one of the nodes, and removing this function essentially corresponds to removing the node itself. However, the numerical properties of the bases (14) and (15) are known to be poor; for example, in linear regression problems the conditioning of the design matrix deteriorates rapidly as the number of nodes increases. A practical alternative is the so-called B-spline basis [75,76]. These functions are designed to be supported on a few contiguous intervals defined by the nodes (b + 1 contiguous intervals for splines of maximal smoothness). Suppose B_1(x; t), . . . , B_J(x; t) is a B-spline basis of the space of splines of degree b with smoothness s and node sequence t, so that any function in the space can be written as

s(x) = ∑_{j=1}^{J} β_j B_j(x; t),

where β = (β_1, . . . , β_J)' is the corresponding coefficient vector. As seen from (14) and (15), for spline spaces of maximal smoothness J = K + b + 1.
According to the title of the subsection, the object of this analysis is the logarithm of the density. Suppose X is a random variable taking values in the interval (L, U); in particular cases L and U may be ±∞. For a finite interval, the parameters L and U can be set to 2t_1 − t_2 and 2t_K − t_{K−1}, respectively. The method of Kooperberg and Stone [52,[77][78][79]], known as logspline, is implemented with the cubic splines described in (15). These functions are continuously twice differentiable, with the piecewise polynomials defined by the sequence of nodes t = (t_1, . . . , t_K). In each interval [t_1, t_2], . . . , [t_{K−1}, t_K] the splines are cubic polynomials, while at the edges (L, t_1] and [t_K, U) they are linear functions. The minimum number of nodes is K ≥ 3 (otherwise a linear function or a constant would be obtained). With a basis 1, B_1(x), . . . , B_J(x), the logspline model is

log f(x; β, t) = β_1 B_1(x) + . . . + β_J B_J(x) − C(β), (16)

where C(β) is the normalizing constant ensuring that the density integrates to one. Now, having a random sample X(1), . . . , X(n) of size n from the interval (L, U) with an unknown density function f, the log-likelihood function corresponding to the logspline model (16) is

l(β, t) = ∑_{i=1}^{n} log f(X(i); β, t), (17)

the maximum likelihood estimate is β̂ = argmax_{β∈B} l(β, t), and the density estimate is f̂(x) = f(x; β̂, t). Let us say that during the stepwise node determination procedure the sequence of models is indexed by v, and the vth model has J_v basis functions. The best model is selected by a generalized Akaike Information Criterion (AIC) [80]. Suppose l̂_v denotes the maximized log-likelihood (17) for the vth model. The Akaike information criterion is defined by the equation AIC_{a,v}(t) = −2 l̂_v(t) + a J_v, with penalty parameter a. From the candidate models, the one whose value of v minimizes AIC_{a,v} is chosen. Stone [52] recommends the use of a = log n.
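A rough sketch of the logspline idea: fit log f(x) = β'B(x) − C(β) by maximum likelihood, using the truncated-power basis (15) and computing the normalizing constant C(β) numerically on a bounded support. This is a deliberately simplified illustration (fixed nodes, no tail linearity, no stepwise AIC selection), and all names and values are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def basis(x, knots):
    """Cubic truncated-power basis (15) without the constant term:
    x, x^2, x^3 and (x - t_k)^3_+ for each node t_k."""
    x = np.asarray(x, dtype=float)
    cols = [x, x ** 2, x ** 3] + [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.stack(cols, axis=-1)

def neg_loglik(beta, B, Bg, dx, n):
    """Negative logspline log-likelihood -l(beta, t); C(beta) is the log of
    the normalizing integral, computed by a Riemann sum on the grid."""
    logC = np.log(np.exp(Bg @ beta).sum() * dx)
    return -(B @ beta).sum() + n * logC

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, 200)
knots = np.quantile(X, [0.25, 0.5, 0.75])            # interior nodes t
grid = np.linspace(X.min() - 1.0, X.max() + 1.0, 400)  # bounded support (L, U)
dx = grid[1] - grid[0]
B, Bg = basis(X, knots), basis(grid, knots)
res = minimize(neg_loglik, np.zeros(B.shape[1]), args=(B, Bg, dx, len(X)))
f_hat = np.exp(Bg @ res.x)
f_hat /= f_hat.sum() * dx                            # normalized density estimate
```

Because the model is an exponential family in β, the negative log-likelihood is convex, so a general-purpose optimizer suffices for this small example.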

PPDE Algorithm: Projection Pursuit Density Estimation
The projection pursuit density estimator (PPDE) proposed by Friedman is based on targeted projection and consecutive Gaussianization of the projections. The essence of the approach of J. H. Friedman and coauthors [54,55,81] is to search for "interesting" low-dimensional data projections: directions in which the projected distribution is very different (in the sense of some projection index) from the Gaussian. Huber [82] made a heuristic proposal to consider the Gaussian distribution as the least interesting. This proposal is based on the following facts:

• The multidimensional Gaussian distribution is entirely defined by its linear structure (mean and covariance matrix). Therefore, it is desirable to detect data structure that is independent of correlation and of linear transformations such as scaling.
• All projections of a multidimensional Gaussian distribution are also Gaussian distributions. Thus, if every projection differs insignificantly from the Gaussian distribution, the distribution itself is also close to Gaussian.
• For multidimensional data with structure in only a few projection directions, most projections will have a distribution close to normal. This statement follows from the central limit theorem.
• In the case of constant variance, the Gaussian distribution is considered to be the least informative.
Friedman developed Huber's idea and proposed an algorithm called exploratory projection pursuit to estimate multidimensional non-parametric density. This procedure consists of five steps:
(1) Data standardization: the location, scale, and correlation structure are simplified;
(2) Projection index: the degree of 'interest' of various directions is determined;
(3) Optimization strategy: search for the direction in which the projection index is the largest;
(4) Data transformation: the one-dimensional density is calculated in the chosen direction, and the data are transformed (Gaussianized);
(5) Density formation: the multidimensional density is formed as a function of the calculated one-dimensional densities.
The following projection index construction has been proposed. It is known that all projections of a multidimensional Gaussian distribution are one-dimensional Gaussian distributions; hence, if the distribution in some direction is not Gaussian, the multidimensional distribution is not Gaussian either. Therefore, the projection index I(τ) shows how far the one-dimensional density f_τ(y) of the projection Y = τ'Z in the direction τ is from the Gaussian distribution, where Z is the standardized quantity [83]. The projection direction τ that maximizes I(τ) is chosen to highlight the multimodal or other nonlinear structure of the distribution. The data are transformed by the equality

R = 2Φ(Y) − 1,

where Φ is the standard normal distribution function. The distribution density of the transformed quantity R is f_R(r) = f_τ(y)/(2ϕ(y)) with y = Φ^{−1}((r + 1)/2). [55] proposed the projection index in the form of the integrated squared deviation of R from uniformity:

I(τ) = ∫_{−1}^{1} (f_R(r) − 1/2)² dr. (18)

Note that if the distribution of Y is Gaussian, then f_R(r) = 1/2 for all r, and the projection index I(τ) is zero. The more the distribution of Y differs from the normal, the higher the value of the index I(τ). Since R ∈ [−1, 1], f_R(r) can be expanded in orthogonal Legendre polynomials {ψ_j}_{j=0}^{∞}, i.e., f_R(r) = ∑_{j=0}^{∞} b_j ψ_j(r). The Legendre polynomials are defined by the iterative expression ψ_0(r) = 1, ψ_1(r) = r, and

ψ_j(r) = ((2j − 1) r ψ_{j−1}(r) − (j − 1) ψ_{j−2}(r))/j, j ≥ 2.

It follows from the orthogonality property that the coefficients can be calculated as b_j = ((2j + 1)/2) E[ψ_j(R)], and the expectation is approximated by the sample mean. Thus, the index can be written as

I(τ) ≈ ∑_{j=1}^{s} ((2j + 1)/2) (Ê[ψ_j(R)])².

It should be noted that the infinite sum has been replaced by a finite one. Such a change has advantages: the sum is calculated faster, and it gives robustness to the projection index. By summing only a finite number of terms, the slowly decaying "tails" of the projected distributions have a more negligible effect on the value of the projection index.
Therefore, it is suggested to choose 4 ≤ s ≤ 7.
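The truncated Legendre form of the index can be sketched as follows. The Gaussian and bimodal samples merely illustrate that a non-Gaussian projection yields a larger index; the function name, seed, and sample parameters are assumptions.

```python
import numpy as np
from numpy.polynomial.legendre import legval
from scipy.stats import norm

def projection_index(y, J=6):
    """Friedman-type projection index: transform standardized projections via
    r = 2*Phi(y) - 1 and measure the departure of f_R from the uniform density
    1/2 through a truncated Legendre expansion."""
    r = 2.0 * norm.cdf(y) - 1.0
    index = 0.0
    for j in range(1, J + 1):
        c = np.zeros(j + 1)
        c[j] = 1.0                       # coefficient vector selecting psi_j
        b = legval(r, c).mean()          # sample estimate of E[psi_j(R)]
        index += (2 * j + 1) / 2.0 * b ** 2
    return index

rng = np.random.default_rng(4)
gaussian = rng.standard_normal(2000)
bimodal = np.concatenate([rng.normal(-2, 0.5, 1000), rng.normal(2, 0.5, 1000)])
bimodal /= bimodal.std()                 # standardize the projection
i_gauss = projection_index(gaussian)
i_bimodal = projection_index(bimodal)
```

For the Gaussian sample the index is close to zero (sampling noise only), while the bimodal projection produces a clearly larger value, which is what the optimizer exploits.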
There are many methods of searching for "interesting" projections. The method used in this research for finding the 'most interesting' projection direction is a mixed optimization strategy [55,64,84]. After defining the analytical expression of the projection index, its gradient ∂I/∂τ with respect to the projection direction τ is obtained; it involves the derivatives of the Legendre polynomials, which are calculated by the iterative formula ψ'_1(r) = 1 and ψ'_j(r) = r ψ'_{j−1}(r) + j ψ_{j−1}(r) for j ≥ 2. Initially, an approximate optimizer is found by searching in the directions of the principal components and their combinations, so that initial convergence towards the maximum is achieved quickly. Then a steepest-ascent step optimizer quickly refines the projections required to ascend to a (local) maximum of the projection index. The projection index is used to search for 'interesting' data projections. However, a single projection is usually not enough to estimate the multidimensional density reasonably. In general, "interesting" directions do not have to be orthogonal, and more projection directions than the data dimension may be required. Therefore, when estimating density by targeted projection, the so-called removal of the data structure is applied: a nonlinear scale transformation is performed along the found projection direction, so that the distribution of the transformed data becomes normal. This operation ensures that the same direction is not found again when searching for the next projection direction.
The removal of the data structure is based on the fact that if the one-dimensional projection τ'Z has the distribution density f_τ(y) and the corresponding distribution function F_τ, then the random variable Φ^{−1}(F_τ(τ'Z)) has a standard normal distribution, where Φ^{−1} is the inverse of the standard normal distribution function. Friedman [55] proposed calculating the empirical estimate of the distribution function as F̂_τ(y) = rank(y)/n − 1/(2n), where rank(y) is the rank of y in the whole sample of size n. Unfortunately, this estimate is not accurate and often results in a very uneven density function. Denoting Z^(0) = Z, we now discuss how Z^(k) is obtained from Z^(k−1). The k-th transformation changes only the component along the chosen direction τ(k):

Z^(k) = Z^(k−1) + (Φ^{−1}(F_{τ(k)}(τ(k)' Z^(k−1))) − τ(k)' Z^(k−1)) τ(k).

The same procedure of finding the 'most interesting' projection is then performed with Z^(k), searching for a new direction. This sequence is repeated until the multidimensional distribution becomes close to the Gaussian distribution in all directions. It has been observed [55] that Gaussianization in one direction disturbs the normality in the directions processed earlier, so that their projection index I(τ) is no longer zero. However, studies show [54] that the resulting changes are minimal. The multidimensional density is then calculated from the one-dimensional density estimates.
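A sketch of one structure-removal (Gaussianization) step using Friedman's rank-based estimate of F_τ; the direction, sample, and names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def gaussianize_direction(Z, tau):
    """Structure removal: replace the component of each observation along tau
    by Phi^{-1}(F_tau(.)) of its empirical rank, so that the projection tau'Z
    becomes (approximately) standard normal."""
    tau = tau / np.linalg.norm(tau)
    y = Z @ tau                                   # projections tau'Z(t)
    n = len(y)
    ranks = np.argsort(np.argsort(y)) + 1         # rank of each projection, 1..n
    F = ranks / n - 1.0 / (2 * n)                 # Friedman's empirical F_tau
    y_new = norm.ppf(F)                           # Phi^{-1}(F_tau(y))
    return Z + np.outer(y_new - y, tau)           # change only the tau-component

rng = np.random.default_rng(5)
# strongly bimodal data along the first coordinate
Z = np.concatenate([rng.normal(-3, 1, (500, 2)), rng.normal(3, 1, (500, 2))])
tau = np.array([1.0, 0.0])
Zg = gaussianize_direction(Z, tau)
```

After the step, the projection along τ is close to standard normal, while the orthogonal coordinate is left untouched, so a subsequent search will not rediscover the same direction.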
The relationship between the multidimensional densities of Z^(k) and Z^(k−1) (where Z^(k) is obtained from Z^(k−1) by removing the structure along the k-th projection τ(k)) is

f_{Z^(k−1)}(z) = f_{Z^(k)}(z^(k)) · f_{τ(k)}(τ(k)' z) / ϕ(τ(k)' z^(k)).

Starting from the initial multidimensional data Z^(0), the Gaussianization procedure is performed for each "interesting" projection found by maximizing I(τ). After a certain number M of projections, the multidimensional data Z^(M) differ only slightly from the normal distribution, and the density of Z^(0) can be calculated as

f(z) ≈ ϕ_d(z^(M)) ∏_{k=1}^{M} f_{τ(k)}(τ(k)' z^(k−1)) / ϕ(τ(k)' z^(k)), (26)

where ϕ_d is the d-dimensional standard normal density. The one-dimensional densities of the projected data, f_{τ(k)}(τ(k)' z^(k−1)), are calculated via the Legendre expansion used in the projection index. Then, replacing the unknown one-dimensional distribution densities on the right-hand side of (26) with their statistical estimates, the targeted projection density estimate is obtained. It is calculated relatively quickly because of the form of the projection index and the iterative relationship between the polynomials.

Inversion Formula
When examining parametric approximation methods, it should be emphasized that as the data dimension increases, the number of model parameters increases rapidly, making it more difficult to find accurate parameter estimates. The density f_τ of the one-dimensional data projections X_τ = τ'X is much easier to estimate than the multidimensional data density f, and there exists a one-to-one correspondence f ↔ {f_τ, τ ∈ R^d}. It is therefore quite natural to try to find the multidimensional density f using the density estimates f̂_τ of one-dimensional projections of the observations. It should be noted that in the mixture case, when the component distributions are Gaussian, the projections of the observations are also distributed according to the (one-dimensional) Gaussian mixture model

f_τ(x) = ∑_{k=1}^{q} p_{k,τ} ϕ_{k,τ}(x),

where ϕ_{k,τ}(x) = ϕ(x; m_{k,τ}, σ²_{k,τ}) is a one-dimensional Gaussian density. The parameter θ of the multidimensional mixture and the distribution parameters of the data projections θ_τ = (p_{k,τ}, m_{k,τ}, σ²_{k,τ}), k = 1, . . . , q, are related by the equations p_{k,τ} = p_k, m_{k,τ} = τ'M_k and σ²_{k,τ} = τ'R_k τ. Using the inversion formula

f(x) = (1/(2π)^d) ∫_{R^d} e^{−i t'x} ψ(t) dt, (30)

where ψ(t) = E e^{i t'X} denotes the characteristic function of the random vector X, and marking u = |t|, τ = t/|t|, i.e., changing the variables to a spherical coordinate system, the density is written as

f(x) = (1/(2π)^d) ∫_{|τ|=1} ∫_0^∞ e^{−i u τ'x} ψ(uτ) u^{d−1} du dτ.

Here the first integral is understood as a surface integral over the unit sphere. Denoting the characteristic function of the projection of the observed random variable by ψ_τ(u) = E e^{i u τ'X}, the equality ψ(uτ) = ψ_τ(u) holds. By selecting a set T of projection directions evenly spaced on the sphere and replacing the characteristic function with its estimate, the formula

f̂(x) = (A(d)/#T) ∑_{τ∈T} ∫_0^∞ e^{−i u τ'x} ψ̂_τ(u) u^{d−1} du (32)

is obtained [85,86]. Here and below, #T denotes the number of elements in the set T.
Using the volume of the d-dimensional ball, the constant A(d), which depends on the data dimension, can be calculated. Computer simulation studies have shown that the density estimates obtained using the inversion formula are not smooth. Therefore, in formula (32), an additional multiplier e^{−hu²} is used under the integral sign. This multiplier further smooths the estimate f̂(x) in (32) with a Gaussian kernel function, and this form of the multiplier allows the value of the integral to be calculated analytically. The number of clusters and the Gaussian mixture parameters were selected using the constructive procedure and software developed at the Lithuanian Institute of Mathematics and Informatics, applying an ω²-type criterion [87]. Formula (32) can be used with various estimates of the characteristic function of the projected data; we discuss the two methods used in this work.
One of them is based on the density approximation by a Gaussian mixture model. In this case, after replacing the parameters of the Gaussian mixture with their statistical estimates (p̂_{k,τ} = p̂_k, m̂_{k,τ} = τ'M̂_k, σ̂²_{k,τ} = τ'R̂_k τ; see Page 10), the parametric estimate of the characteristic function ψ̂_τ(u) = Σ_k p̂_{k,τ} e^{ium̂_{k,τ} − u²σ̂²_{k,τ}/2} (35) is used, and substituting (35) into (32) gives the estimate f̂(x), where I_j(y) can be written in the explicit form (37). It should be noted that only the real part of the expression is taken here; the sum of the imaginary parts must equal zero, because the density estimate f̂(x) can take only real values. The chosen form of the smoothing multiplier e^{−hu²} relates the smoothing parameter h to the variances of the projection clusters: in the calculations, the variances are increased by 2h. The computation of expression (37) is given in Appendix B.
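The parametric estimate (35) can be written out directly, since the characteristic function of a one-dimensional Gaussian mixture has closed form. The sketch below implements that formula and checks it against an empirical characteristic function by Monte Carlo; the function name and the test mixture are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mixture_projection_cf(u, tau, p, M, R):
    """Parametric characteristic function (cf. (35)) of the projected Gaussian
    mixture: psi_tau(u) = sum_k p_k * exp(i*u*m_{k,tau} - u^2 * s2_{k,tau} / 2),
    with projected mean m_{k,tau} = tau'M_k and variance s2_{k,tau} = tau'R_k tau."""
    u = np.asarray(u, dtype=float)
    val = np.zeros_like(u, dtype=complex)
    for pk, Mk, Rk in zip(p, M, R):
        m = tau @ Mk                     # projected mean m_{k,tau}
        s2 = tau @ Rk @ tau              # projected variance sigma^2_{k,tau}
        val += pk * np.exp(1j * u * m - u ** 2 * s2 / 2)
    return val

# Monte Carlo check against the empirical characteristic function
rng = np.random.default_rng(2)
p = [0.4, 0.6]
M = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
R = [np.eye(2), 2.0 * np.eye(2)]         # spherical covariances for simplicity
tau = np.array([0.6, 0.8])               # unit projection direction
comp = rng.choice(2, size=100_000, p=p)
Z = rng.standard_normal((100_000, 2))
X = np.array(M)[comp] + Z * np.sqrt(np.array([1.0, 2.0]))[comp, None]
emp = np.exp(1j * 1.0 * (X @ tau)).mean()          # empirical psi_tau(1)
par = mixture_projection_cf(1.0, tau, p, M, R)     # parametric psi_tau(1)
```

At u = 0 the parametric estimate equals Σ_k p_k = 1, which is a convenient sanity check on the weights.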

Modified Density Estimate of the Inversion Formula
One disadvantage of the inversion formula method defined in (32) is that the Gaussian mixture model underlying this estimate (where f_k = ϕ_k) evaluates well only the density of observations close to it. However, when approximating the density under study with a mixture of Gaussian distributions, the inversion-formula estimate often becomes complicated due to the large number of components with low a priori probabilities. Their number can be reduced by introducing a noise cluster; this yields the modified algorithm based on the multidimensional Gaussian mixture model. Let us use the inversion formula (30). The parametric estimate of the characteristic function of the uniform distribution density can be calculated as in (38), where b is the maximum value and a is the minimum value of the uniform distribution. In the density estimate calculation formula (32), the estimate of the characteristic function is constructed as a combination of the characteristic functions of a Gaussian mixture and a uniform distribution with the corresponding a priori probabilities. Here, the second term describes a uniformly distributed noise cluster, and p̂_0 is the weight of the noise cluster. Based on the parameters of the uniform distribution and the projected data, and using notation analogous to (36), we can write the estimate, where the expression I_j(y) is the same as in (37) and its value is I_j(y) = C_j(y). By integrating, we get ∫_0^∞ e^{iyu−u²/2} · sin(zu) · u^j du = ∫_0^∞ (cos(yu) + i sin(yu)) · sin(zu) · e^{−u²/2} · u^j du; this formula uses the variables S_j(y) and C_j(y), whose calculation is given in formulas (52) and (53) in Appendix B.
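The noise-cluster construction can be sketched directly: the uniform characteristic function of (38) is (e^{iub} − e^{iua}) / (iu(b − a)), and it is mixed with the Gaussian-mixture characteristic function using the noise weight p̂_0. This is a generic illustration under those formulas; the function names and the standard-normal stand-in for the mixture part are our assumptions.

```python
import numpy as np

def uniform_cf(u, a, b):
    """Characteristic function of the uniform density on [a, b], cf. (38):
    (exp(i*u*b) - exp(i*u*a)) / (i*u*(b - a)), with value 1 at u = 0."""
    u = np.asarray(u, dtype=float)
    num = np.exp(1j * u * b) - np.exp(1j * u * a)
    den = 1j * u * (b - a) + (u == 0)        # guard makes the 0/0 case harmless
    return np.where(u == 0, 1.0 + 0j, num / den)

def noise_mixture_cf(u, p0, gauss_cf, a, b):
    """Combination used by the modified estimate: (1 - p0) times the
    Gaussian-mixture characteristic function plus p0 times the uniform
    noise-cluster term (p0 is the noise-cluster weight)."""
    return (1 - p0) * gauss_cf(u) + p0 * uniform_cf(u, a, b)

# stand-in for the Gaussian-mixture part: standard normal CF exp(-u^2/2)
gauss_cf = lambda u: np.exp(-np.asarray(u, dtype=float) ** 2 / 2) + 0j
psi_hat = noise_mixture_cf(1.0, 0.1, gauss_cf, -2.0, 2.0)
```

For a symmetric interval [−b, b] the uniform characteristic function reduces to sin(ub)/(ub), which gives a quick correctness check.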

Materials and Methods
Density estimation algorithms were presented in the previous section. The Monte Carlo method was used in this study: such a comparison of algorithms allows us to measure the true observation density values and evaluate the algorithms' efficiency. For the research, we used multidimensional (d = 2, 5, 10, 15) Cauchy mixture distributions Σ_{j=1}^{q} p_j C(x, m_j, u_j).
Additionally, C(x, m_j, u_j) is defined as follows. Calculations were performed using sample sizes of n = 50, 100, 200, 400, 800 while varying the number of distributions, their weights, and their centres (see Table 1). In each case, 100,000 samples were generated. In the cases d = 10, 15, the same weights were used as for d = 5, and the centres are located at the vertices of the hypercube.
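To make the simulation setup concrete, a Cauchy mixture sample can be generated by exploiting the fact that the spherical multivariate Cauchy distribution is a multivariate t-distribution with one degree of freedom. The sketch below is an illustration under that assumption, not the authors' generator; the function names are hypothetical.

```python
import numpy as np
from math import gamma, pi

def cauchy_density(x, m, u):
    """Spherical multivariate Cauchy density C(x, m, u): a multivariate
    t-density with 1 degree of freedom (assumed form of the mixture component)."""
    d = len(m)
    z2 = np.sum((np.asarray(x, dtype=float) - m) ** 2) / u ** 2
    c = gamma((1 + d) / 2) / (gamma(0.5) * pi ** (d / 2) * u ** d)
    return c / (1 + z2) ** ((1 + d) / 2)

def sample_cauchy_mixture(n, p, centers, scales, rng):
    """Draw n points from sum_j p_j C(x, m_j, u_j): pick a component, then
    divide a spherical normal vector by an independent |N(0, 1)| draw
    (a chi variable with one degree of freedom)."""
    centers = np.asarray(centers, dtype=float)
    scales = np.asarray(scales, dtype=float)
    comp = rng.choice(len(p), size=n, p=p)
    z = rng.standard_normal((n, centers.shape[1]))
    chi = np.abs(rng.standard_normal(n))
    return centers[comp] + scales[comp, None] * z / chi[:, None]

rng = np.random.default_rng(3)
sample = sample_cauchy_mixture(20_000, [1.0], [[2.0, 3.0]], [1.0], rng)
# the coordinate-wise sample median consistently estimates the centre (2, 3),
# whereas the sample mean does not exist for a Cauchy distribution
```

For d = 1 the density reduces to the familiar 1/(π u (1 + ((x − m)/u)²)), which checks the normalizing constant.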
Algorithms used in the research: AKDE (adaptive kernel), PPDE (projection pursuit), LSDE (logspline), SKDE (semi-parametric kernel), IFDE (inversion formula), and MIDE (inversion formula with a noise cluster). The IFDE and MIDE methods use mixture parameters calculated with a program developed at the Institute of Mathematics and Informatics (Vilnius) [87].
Selection of parameters in the density estimation procedure. In this study, the Monte Carlo method was used to compare the accuracy of the non-parametric density estimates described in the methodological sections (AKDE, PPDE, LSDE, SKDE, IFDE, MIDE). The authors of [34] propose selecting the value of the sensitivity parameter γ used in the AKDE method (see AKDE method, step 3) from the set {0.2; 0.4; 0.6; 0.8}; the specific value of the parameter is determined by probabilistic cross-validation [88,89]. In the SKDE, all possible values of the dimension s of the sub-vector Y (1 ≤ s ≤ d − 1, where d is the dimension; see page 5) and their corresponding coordinates were examined, and the smallest errors obtained were used to compare the results with the other studied methods. The LSDE method minimizes the Akaike information criterion when selecting the number of baseline spline knots [78]; the computer program for calculating this estimate is provided in an R package and was used in the study. The Akaike information criterion is AIC = −2l(t) + aJ(t), where J is the degree of the spline, a = log(n), and l is the likelihood function used to select the spline coefficients. The MIDE method has a smoothing parameter h. The chosen form of the smoothing multiplier e^{−hu²} relates the smoothing parameter h to the variances of the projection clusters. Modelling studies have shown that this method is sensitive to parameter selection: if h is set too low, the estimate becomes very rough and has large errors, whereas excessive smoothing of the density estimate does not greatly affect its quality. In the studies, it was also observed that the estimate becomes uneven when the values of the observations projected in some directions are similar, since this leads to low-weight components with small dispersions being distinguished.
The smoothing parameter h, as well as the specific value of the noise cluster weight (probability) from the set {0.05; 0.1; 0.15; 0.2; 0.3; 0.4}, is selected by least-squares cross-validation [65]. The vector of the estimate parameters is sought so as to minimize the integrated square error, where Θ is the evaluated parameter and F(x) is the distribution function of the observed random variable.
Replacing the unknown distribution function with the empirical distribution function yields an expression for the parameter estimate Θ̂, where f̂_Θ(x|t) is the value of the estimate at point x calculated with the observation X(t) removed from the sample. In addition, empirical research suggests that it is better to look for the largest local minimum point rather than the global minimum [90]. For the PPDE method, following the recommendation of paper [38], the order of the expansion was 4 ≤ s ≤ 6 (see Page 9), and the projection directions were chosen to maximize the value of the estimate of the projection index (2) recommended by J. H. Friedman.
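The least-squares cross-validation idea above can be sketched for a one-dimensional Gaussian kernel estimate: the criterion combines a closed-form expression for the integrated squared estimate with a leave-one-out term. This is the generic LSCV criterion, not the authors' exact procedure; the sample, bandwidth grid, and function name are illustrative.

```python
import numpy as np

def lscv_score(x, h):
    """Least-squares cross-validation score for a 1-D Gaussian kernel density
    estimate: integral of f-hat squared minus twice the leave-one-out average."""
    n = len(x)
    diff = x[:, None] - x[None, :]
    # closed form of the integral of f-hat^2: average of N(0, 2h^2) kernels
    # over all ordered pairs of observations
    int_f2 = np.exp(-diff ** 2 / (4 * h ** 2)).sum() / (n ** 2 * 2 * h * np.sqrt(np.pi))
    # leave-one-out term (2/n) * sum_i f-hat_{-i}(X_i)
    K = np.exp(-diff ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    loo = (K.sum() - n * K[0, 0]) / (n * (n - 1))   # drop the diagonal K[i,i]
    return int_f2 - 2 * loo

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
hs = np.linspace(0.05, 1.5, 30)
scores = [lscv_score(x, h) for h in hs]
h_best = hs[int(np.argmin(scores))]
```

The score blows up for very small h (the diagonal term of the integrated square dominates) and tends to zero for very large h, so the minimizer sits at an interior bandwidth; the same mechanics apply when the criterion is minimized jointly over h and the noise-cluster weight.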

Results and Discussion
This section presents the main results obtained during the simulations. We calculate the mean absolute error (50) and the mean absolute percentage error (51) to evaluate the accuracy.
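The two accuracy measures can be written out directly; (50) and (51) are the standard mean absolute error and mean absolute percentage error of the estimated density values against the true ones. The helper names below are ours.

```python
import numpy as np

def mean_absolute_error(f_true, f_est):
    """Mean absolute error, cf. (50): average of |f-hat(x_i) - f(x_i)|."""
    f_true, f_est = np.asarray(f_true, float), np.asarray(f_est, float)
    return np.mean(np.abs(f_est - f_true))

def mean_absolute_percentage_error(f_true, f_est):
    """Mean absolute percentage error, cf. (51): average relative error, in percent."""
    f_true, f_est = np.asarray(f_true, float), np.asarray(f_est, float)
    return 100.0 * np.mean(np.abs(f_est - f_true) / np.abs(f_true))
```

The percentage error weights each point by the inverse of the true density, which is why it ranks methods differently from the absolute error in the low-density tails.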
The result tables (Tables 2 and A1, Tables A2-A10) report the mean absolute percentage error of the density estimates over 100,000 samples. The values in parentheses give the standard deviation of the errors, and the best results in these tables are bolded and underlined. According to Table 2, when n = 100 and d = 5, the best results are obtained by the SKDE and MIDE methods. Based on Table A2, when q = 2 and n = 200, the best results are obtained by the SKDE and MIDE methods. According to Table A3, when q = 3 and n = 200, in the case of highly overlapping distributions (i = 1, 2) the best results are obtained by the SKDE method, and in the case of more isolated distributions (i ≥ 3) by the MIDE method. Based on Table A4, when q = 3 and n ≥ 400, the best results are obtained by SKDE, while the second-best method is MIDE. According to Table A5, when q = 4 and n = 400, in the case of highly overlapping distributions (i ≤ 3) the best results are obtained by the SKDE method, and in the case of more isolated distributions (i ≥ 4) by the MIDE method. Table A6 shows the results for q = 4 and n ≥ 400: in the case of highly overlapping or moderately isolated distributions (i ≤ 5) the best results are obtained by the SKDE method, and in the case of more isolated distributions (i = 6) by the MIDE method. Tables A7 and A8 show the results for q = 2 and n = 50: for all cases of highly overlapping or isolated distributions, the best results are obtained by the AKDE method, except for the most isolated distributions (i = 6) with p_1 = 0.6, p_2 = 0.4, where the MIDE method is best. Tables A9 and A10 show the results for q = 3 and n = 50; the best results are obtained by the AKDE method in all cases (highly overlapping or isolated distributions). With huge outliers (|x − m_j| > 100 u_j), the LSDE method groups them with the much larger number of values closer to the centre of the distribution.
As a result, with the calculated spline coefficients, the density at the outliers is estimated at a value close to 10^100. That is incorrect, and in such cases the use of this method is not recommended. The results for the smaller dimension (d = 2) are presented in Table A1. It can be seen that the best results are obtained using the SKDE method, in both strongly and weakly overlapping cases (i < 4). On the other hand, in the case of isolated distributions (i ≥ 5), good results were obtained by the MIDE method.
For the mean absolute percentage error, it is recommended to use the semi-parametric kernel method when the sample has overlapping distributions. In the case of two dimensions (d ~ 2), it is recommended to use the semi-parametric kernel method for samples with overlapping distributions and the adaptive kernel method for isolated distributions.

Conclusions
This paper reviewed the most popular and most often used nonparametric density estimation algorithms, and also presented the density estimation inversion formula. It was observed that when a noise cluster is included, the results of the inversion formula improve statistically significantly. Based on the mean absolute error, in the case of higher dimensions (d ~ 5) and small samples (n ~ 50), it is recommended to use the adaptive kernel method; if the sample size is n ~ 100, the modified inversion formula method showed the best results. For larger samples, it is recommended to use the semi-parametric kernel method for overlapping distributions and the modified inversion method for more isolated distributions. Based on the mean absolute percentage error, it is recommended to use the semi-parametric kernel method when the sample has overlapping distributions. In the case of two dimensions (d ~ 2), the semi-parametric kernel method is recommended for samples with overlapping distributions, and the adaptive kernel method for isolated distributions.