A deconvolution path for mixtures

We propose a class of estimators for deconvolution in mixture models based on a simple two-step "bin-and-smooth" procedure applied to histogram counts. The method is both statistically and computationally efficient: by exploiting recent advances in convex optimization, we are able to provide a full deconvolution path that shows the estimate for the mixing distribution across a range of plausible degrees of smoothness, at far less cost than a full Bayesian analysis. This enables practitioners to conduct a sensitivity analysis with minimal effort. This is especially important for applied data analysis, given the ill-posed nature of the deconvolution problem. Our results establish the favorable theoretical properties of our estimator and show that it offers state-of-the-art performance when compared to benchmark methods across a range of scenarios.


Introduction
Suppose that we observe y = (y_1, ..., y_n) from the model

y_i | µ_i ~ φ(· | µ_i), µ_i ~ f_0, i = 1, ..., n, (1)

where φ(· | µ) is a known distribution with location parameter µ, and f_0 is an unknown mixing distribution. Marginally, we have specified a mixture model for y_i:

m(y_i) = ∫ φ(y_i | µ) f_0(µ) dµ.

The problem of estimating the mixing distribution f_0 is commonly referred to as deconvolution: we observe draws from the convolution m = φ * f_0, rather than from f_0 directly, and we wish to invert this blur operation to recover the distribution of the latent location parameters. Models of this form have been used in a wide variety of applications and have attracted significant attention in the literature (e.g. Kiefer and Wolfowitz, 1956; Ferguson, 1973; Fan, 1991; Newton, 2002; Ghosal and Van Der Vaart, 2001). Yet the estimation of f_0 continues to pose both theoretical and practical challenges, making it an active area of statistical research (e.g. Delaigle and Hall, 2014; Efron, 2016).
In this paper, we propose a nonparametric method for deconvolution that is both statistically and computationally efficient. Our method can be motivated in terms of an underlying Bayesian model incorporating a prior into model (1), but it does not involve full Bayes analysis. Rather, we use a two-step "bin and smooth" procedure. In the "bin" step, we form a histogram of the sample, yielding the number of observations x j that fall into the jth histogram bin. In the "smooth" step, we use the counts x j to compute a maximum a posteriori (MAP) estimate of f 0 under a prior that encourages smoothness.
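As a concrete illustration, the "bin" step can be sketched as follows. This is a minimal sketch, not the paper's code: the function name and interface are our own, and D = 250 equal-width bins is simply the default used later in our experiments.

```python
import numpy as np

def bin_samples(y, D=250):
    """'Bin' step: histogram the sample into D equal-width bins.

    Returns the counts x_j, the bin midpoints xi_j, and the common
    bin width Delta (names follow the paper's notation loosely).
    """
    counts, edges = np.histogram(y, bins=D)
    midpoints = 0.5 * (edges[:-1] + edges[1:])
    delta = edges[1] - edges[0]
    return counts, midpoints, delta

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
x, xi, delta = bin_samples(y, D=100)
```

The "smooth" step then operates only on the D counts, which is what makes the procedure scale to very large n.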
We show that this nonparametric empirical-Bayes procedure yields excellent performance for deconvolution, at reduced computational cost compared to full nonparametric Bayesian methods. Our main theorems establish conditions under which the method yields a consistent estimate of the mixing distribution f 0 , and provide a concentration bound for recovery of the marginal distribution m. We also provide simulation evidence that the method offers practical improvements over existing state-of-the-art methods.

Connections with previous work
We first recall the work of Kiefer and Wolfowitz (1956), who consider estimating f_0 via nonparametric maximum likelihood. The Kiefer-Wolfowitz (KW) estimator has some appealing features: it is completely nonparametric and invariant to translations of the data, it requires no tuning parameters, and it is consistent under fairly general conditions. Balanced against these desirable features is a potential disadvantage when estimating a smooth true mixing density: the KW estimator is a discrete distribution involving as many as n + 1 point masses.
Our approach to deconvolution falls within a penalized likelihood framework. It has important connections with classical ideas of penalized likelihood density estimation from Good and Gaskins (1971); Silverman (1982). And at least in spirit, our work resembles recent deconvolution methods by Lee et al. (2013); Wager (2013); Efron (2016).
A detailed discussion and comparison between these methods and ours appears in Section 5.
Deconvolution has also been studied using Bayesian methods. In the context of repeated measurements, or multivariate deconvolution, we highlight recent work by Sarkar et al. (2014a,b); Staudenmayer et al. (2008). Moreover, for the one-dimensional density estimation problem, a flexible choice is the Dirichlet process (DP) studied in Ferguson (1973) and Escobar and West (1995). For a Dirichlet prior in deconvolution problems, concentration rates were recently studied in Donnet et al. (2014). Related models were considered by Do et al. (2005) and Muralidharan (2010) for finite mixtures of normals.
The DP provides a very general framework for estimating the mixing density f 0 . However, as Martin and Tokdar (2012) argue, fitting a Dirichlet process mixture does not scale well with the number of observations n. For micro-array studies, n ranges from thousands to tens of thousands, whereas for more recent studies of fMRI data or single-nucleotide polymorphisms, n can reach several hundreds of thousands (e.g. Tansey et al., 2014). For such large data sets, fitting a DP mixture model can be very time-consuming.
To overcome this difficulty, Newton (2002), Tokdar et al. (2009), Martin and Tokdar (2011), and Martin and Tokdar (2012) studied a predictive recursion (PR) algorithm. The resulting estimator scales well with large data sets while remaining reasonably accurate, thereby solving one of the main challenges faced by the fully Bayesian approach.
Finally, we note the work by Carroll and Hall (1988); Stefanski and Carroll (1990); Zhang (1990); Fan (1991); Fan and Koo (2002); Hall et al. (2007); Carroll et al. (2012); Delaigle and Hall (2014), among others, who considered kernel estimators. Their idea is motivated by (1) after taking the Fourier transform of the corresponding convolution of densities, then solving for the unknown mixing density using kernel approximations for the Fourier transform of the true marginal density. The resulting kernel estimator enjoys attractive theoretical properties: for each µ 0 ∈ R, the estimator has optimal rates of convergence towards f 0 (µ 0 ) for squared-error loss when the function f 0 belongs to a smooth class of functions (Fan, 1991).

Overview of approach
We now describe our proposed approach in detail. We study deconvolution estimators related to the variational problem

minimize over f ∈ A:  −Σ_{i=1}^n log(φ * f)(y_i) + τ J(f),

where J(f) is a known penalty functional. The choices of J we consider include ℓ1 or ℓ2 penalties on the derivatives in the log-space to encourage smoothness:

J(f) = { ∫ |(log f)^{(k+1)}(µ)|^q dµ }^{s/q},

with s = q = 1 or s = q = 2, and where (log f)^{(k+1)} is the derivative of order k + 1 of the log mixing density. The penalty involving the first derivative is an especially interpretable one, as (log f)' is the score function of the mixing density.
Note that an alternative interpretation of our approach is as a MAP estimator. To see this, we consider the (possibly improper) prior on the mixing density

p(f) ∝ exp(−τ J(f)), f ∈ A,

where A is an appropriate class of density functions. The posterior distribution is p(f | y) ∝ p(y | f) p(f), and our MAP estimator therefore solves, for an appropriate τ > 0,

minimize over f ∈ A:  −Σ_{i=1}^n log(φ * f)(y_i) + τ J(f).  (3)

This belongs to a general class of MAP estimators that have been studied in Good and Gaskins (1971) and Silverman (1982) for the classical problem of density estimation. For deconvolution problems we note the recent work by Wager (2013), which penalizes the marginal density rather than the derivatives of the mixing density as we propose. Moreover, that penalization is not motivated by smoothness; rather, it is an ℓ2 projection onto the space of acceptable marginal densities. An alternative penalized likelihood method was studied in Lee et al. (2013), in the different context where the marginal density has atoms. There the authors use the roughness penalty J(f) = ∫ |f''(µ)|² dµ, which differs from our approach: we penalize the log-mixing density, which ensures that the solutions will be positive. Also, we allow different degrees of smoothness depending on the choice of k. Moreover, while Lee et al. (2013) only considered sample sizes on the order of hundreds, we show in the next sections that our estimator can scale to much larger data sets while still enjoying attractive statistical properties.
More recently, Efron (2016) proposed a penalized approach to deconvolution that is also based on regularization but differs from ours. That approach proceeds by assuming that the mixing density is discrete with support {θ_1, ..., θ_N}. Efron (2016) then specifies a parametric model on the mixing density of the form

f_j(α) = exp( Q_{j,·} α − c(α) ),

where Q_{j,·} is the j-th row of the matrix Q ∈ R^{N×p}, for some p > 0, which is used to encourage structure on the mixing density, and c(α) is a normalizing constant ensuring that Σ_{j=1}^N f_j(α) = 1. Then, summing over the {θ_j}_{j=1}^N and considering the contributions of the different samples, Efron (2016) arrives at a minus log-likelihood l(α). The estimator from Efron (2016) solves

minimize over α:  l(α) + c_0 ‖α‖,  (6)

where c_0 is either 1 or 2.
We note that the estimator (6) is possibly limited in the following respects. First, the choice of the matrix Q, which Efron (2016) recommends taking to be a spline basis representation, may produce estimates that suffer from poor local adaptivity. Moreover, there is no theoretical support for choosing c_0 ∈ {1, 2}. In our experiments section we present experimental comparisons between the estimator (6) and our approach.
Finally, we emphasize that our approach is not, in any sense, related to the ridge-parameter deconvolution estimator from Hall et al. (2007). That estimator does not penalize the log-likelihood as we propose; rather, it is designed to avoid the need to choose a kernel function in the original kernel estimator from Fan (1991), by ridging the integral in its definition with a positive function.

Binned counts problem
Throughout this section we assume that φ corresponds to the pdf of the standard normal distribution, although the arguments can easily be generalized to other distributions.
To make estimation efficient in scenarios with thousands or even millions of observations, we actually fit a MAP estimator based on binning the data. First, we use the samples to form a histogram {I_j, x_j}_{j=1}^D with D bins, where I_j is the j-th interval in the histogram and x_j = #{y_i ∈ I_j} is the associated count. For ease of exposition, we assume that the intervals take the form I_j = ξ_j ± ∆/2, i.e. have midpoints ξ_j and width ∆, although this is not essential to our analysis. To arrive at a discrete estimator, instead of Problem (3) we consider an approximation on the grid: we reparametrize g = log f and put the penalty in the objective function with a regularization parameter τ > 0,

minimize over g:  −Σ_{j=1}^D x_j log(φ * e^g)(ξ_j) + τ J(g).  (7)

We then approximate (7) by solving

minimize over θ ∈ R^D:  l(θ) + τ ‖∆^{(k+1)} θ‖_q^s,  (8)

where θ_j plays the role of g(ξ_j), l(θ) is the negative log-likelihood of the binned counts obtained by discretizing the convolution on the grid, s = q = 1 or s = q = 2, and ∆^{(k+1)} is the discrete difference operator of order k + 1.
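For intuition, one way to realize the binned loss numerically is the following sketch. It assumes a standard normal kernel and takes l(θ) to be a Poisson-style loss built from a discretized marginal m_j(θ) = Σ_l φ(ξ_j − ξ_l) e^{θ_l} ∆; the exact form used in the paper may differ in constants, so treat this as an illustration only.

```python
import numpy as np

def neg_log_lik(theta, counts, midpoints, delta):
    """Poisson-style binned loss l(theta) for the discretized model.

    A sketch under our assumptions: theta_l stands for log f(xi_l) on
    the grid, phi is the standard normal pdf, and the fitted marginal
    at bin j is m_j = sum_l phi(xi_j - xi_l) * exp(theta_l) * delta.
    """
    phi = np.exp(-0.5 * (midpoints[:, None] - midpoints[None, :]) ** 2)
    phi /= np.sqrt(2 * np.pi)
    m = phi @ np.exp(theta) * delta          # fitted marginal per bin
    return np.sum(m - counts * np.log(m))    # Poisson neg. log-lik (up to a constant)

grid = np.linspace(-3.0, 3.0, 10)
val = neg_log_lik(np.zeros(10), np.full(10, 5), grid, grid[1] - grid[0])
```

Because the loss depends on the data only through the D counts, its cost per evaluation is independent of n.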
Interestingly, following the proof of Theorem 1 in Padilla and Scott (2015), we find that θ̂ solves the unconstrained counterpart of (8), Problem (10), if and only if θ̂ − log(n∆)1 solves (8). Hence, in practice we solve the unconstrained optimization Problem (10).

Solution algorithms
We now discuss the implementation details for solving (10) in the case s = q = 1. Motivated by the work on trend filtering for regression by Ramdas and Tibshirani (2014), we rewrite the problem as

minimize over θ, α:  l(θ) + τ ‖∆^{(1)} α‖_1  subject to  α = ∆^{(k)} θ.  (11)
Next we proceed via the alternating-direction method of multipliers (ADMM), as in Ramdas and Tibshirani (2014). (See Boyd et al., 2011, for an overview of ADMM.) By exploiting standard results we arrive at the scaled augmented Lagrangian corresponding to the constrained problem (11),

L_ρ(θ, α, u) = l(θ) + τ ‖∆^{(1)} α‖_1 + (ρ/2) ‖∆^{(k)} θ − α + u‖_2² − (ρ/2) ‖u‖_2².

This leads to the following ADMM updates at each iteration j:

θ^{j+1} = argmin_θ  l(θ) + (ρ/2) ‖∆^{(k)} θ − α^j + u^j‖_2²,
α^{j+1} = argmin_α  τ ‖∆^{(1)} α‖_1 + (ρ/2) ‖∆^{(k)} θ^{j+1} − α + u^j‖_2²,  (12)
u^{j+1} = u^j + ∆^{(k)} θ^{j+1} − α^{j+1}.

Note that in (12) the update for θ involves solving a sub-problem whose solution is not analytically available. To deal with this, we use the well-known Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which is very efficient here because the gradient of the θ sub-problem objective is available in closed form. The α-update is a one-dimensional total-variation denoising problem and can be computed in linear time by appealing to the dynamic programming algorithm from Johnson (2013).
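The structure of these updates can be sketched as follows. For simplicity this sketch splits on α = ∆^{(k+1)} θ, so that the α-update is plain soft-thresholding; the split actually used in (11), α = ∆^{(k)} θ, instead requires the total-variation prox of Johnson (2013). All function names and defaults are ours.

```python
import numpy as np
from scipy.optimize import minimize

def diff_op(D, order):
    """Discrete difference operator Delta^(order) as a dense matrix."""
    M = np.eye(D)
    for _ in range(order):
        M = np.diff(M, axis=0)
    return M

def admm_l1(loss, grad, D, k=1, tau=1.0, rho=1.0, iters=50):
    """ADMM sketch for: minimize loss(theta) + tau * ||Delta^(k+1) theta||_1."""
    A = diff_op(D, k + 1)
    theta = np.zeros(D)
    alpha = A @ theta
    u = np.zeros_like(alpha)
    for _ in range(iters):
        # theta-update: smooth sub-problem, solved with BFGS as in the paper
        obj = lambda t: loss(t) + 0.5 * rho * np.sum((A @ t - alpha + u) ** 2)
        jac = lambda t: grad(t) + rho * A.T @ (A @ t - alpha + u)
        theta = minimize(obj, theta, jac=jac, method="BFGS").x
        # alpha-update: prox of tau*||.||_1, i.e. soft-thresholding
        v = A @ theta + u
        alpha = np.sign(v) * np.maximum(np.abs(v) - tau / rho, 0.0)
        # dual update
        u += A @ theta - alpha
    return theta

# toy run with a quadratic loss standing in for l(theta)
y0 = np.concatenate([np.zeros(10), np.ones(10)])
theta_hat = admm_l1(lambda t: 0.5 * np.sum((t - y0) ** 2),
                    lambda t: t - y0, D=20, k=1, tau=0.5, iters=10)
```

The quadratic loss here is only a stand-in so the sketch is runnable; in the paper l(θ) is the binned negative log-likelihood.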
In the case s = q = 2, both components of the objective function in (10) have closed-form gradients; see the appendix. Thus we can solve the problem using any algorithm based on function and gradient calls. In particular, we use BFGS.
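A direct BFGS solve for the ℓ2 case can be sketched as follows (scipy-based; the function names are ours, and a quadratic stand-in for l(θ) keeps the sketch self-contained):

```python
import numpy as np
from scipy.optimize import minimize

def solve_l2(loss, grad, D, k=1, tau=1.0):
    """Sketch: minimize loss(theta) + tau * ||Delta^(k+1) theta||_2^2 by BFGS,
    using the closed-form gradient of both terms."""
    A = np.eye(D)
    for _ in range(k + 1):
        A = np.diff(A, axis=0)   # build Delta^(k+1)
    obj = lambda t: loss(t) + tau * np.sum((A @ t) ** 2)
    jac = lambda t: grad(t) + 2.0 * tau * A.T @ (A @ t)
    return minimize(obj, np.zeros(D), jac=jac, method="BFGS").x

y0 = np.linspace(-1.0, 1.0, 20)
theta_hat = solve_l2(lambda t: 0.5 * np.sum((t - y0) ** 2),
                     lambda t: t - y0, D=20, k=1, tau=0.1)
```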

Solution path and model selection
One of the major advantages of our approach is that it easily yields an entire deconvolution path, comprising a family of estimates f̂(τ) over a grid of smoothness parameters.
Although in principle any deconvolution algorithm can yield such a path by solving the problem for many smoothing parameters, our path is generated in a highly efficient manner, using warm starts. We initially solve (11) for a large value of τ, for which the resulting estimate is nearly constant. We then use this solution to initialize the ADMM at a slightly smaller value of τ, which dramatically reduces the computation time compared to an arbitrarily chosen starting point. We proceed iteratively until solutions have been found across a decreasing grid of τ values (which are typically spaced uniformly in log τ).
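The warm-start loop above can be sketched as a thin wrapper around any single-τ solver (the wrapper and its interface are our own, not the paper's code):

```python
import numpy as np

def deconvolution_path(solve, D, n_taus=50, tau_max=1e7, tau_min=1e-3):
    """Solve at the largest tau first, then reuse each solution to
    initialize the next; taus are uniformly spaced in log tau.

    `solve(tau, theta0)` is any single-tau solver taking a warm start.
    """
    taus = np.logspace(np.log10(tau_max), np.log10(tau_min), n_taus)
    path = []
    theta = np.zeros(D)
    for tau in taus:
        theta = solve(tau, theta)   # warm start from the previous solution
        path.append(theta.copy())
    return taus, path

# dummy solver just to exercise the wrapper
taus, path = deconvolution_path(lambda tau, t0: t0 + 1.0, D=5, n_taus=3)
```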
The resulting deconvolution path can be used to inspect a range of plausible estimates for f 0 , with varying degrees of smoothness. This allows the data analyst to bring to bear any further prior information (such as the expected number of modes in f 0 ) that was not formally incorporated into the loss function. It also enables sensitivity analysis with respect to different plausible assumptions about the smoothness of the mixing distribution.
We illustrate this approach with a real-data example in Section 4.
However, in certain cases (for example, in our simulation studies) it is necessary to select a particular value of τ using a default rule. We now briefly describe heuristics for doing so based on the ℓ1 and ℓ2 penalties with k = 1. These heuristics are used in our simulation studies. For the case of ℓ1 regularization, motivated by Tibshirani and Taylor (2012), we consider a surrogate AIC approach: we compute

AIC(τ) = l(θ̂_τ) + df(τ),

where df(τ) is the number of nonzero entries of ∆^{(k+1)} θ̂_τ plus k + 1, and choose the value of τ that minimizes this expression. Here, θ̂_τ denotes the solution given by L1-D with regularization parameter τ.
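This heuristic can be sketched as follows. The degrees-of-freedom count follows the spirit of Tibshirani and Taylor (2012), but the exact counting rule and the zero tolerance `tol` are our own choices for illustration.

```python
import numpy as np

def surrogate_aic(loss_val, theta_hat, k=1, tol=1e-8):
    """Surrogate AIC for one point on the l1 path (a sketch).

    The effective number of parameters is taken to be the number of
    nonzero (k+1)-th differences of theta_hat, plus k + 1.
    """
    A = np.eye(len(theta_hat))
    for _ in range(k + 1):
        A = np.diff(A, axis=0)
    df = int(np.sum(np.abs(A @ theta_hat) > tol)) + k + 1
    return loss_val + df   # pick the tau minimizing this over the path

# a linear theta_hat has no knots, so df = k + 1 = 2
score = surrogate_aic(10.0, np.linspace(0.0, 1.0, 10), k=1)
```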
In the case of ℓ2 regularization the situation is more difficult, since there is no intuitive notion of the number of parameters of the model. Instead, we consider an ad-hoc procedure based on cross-validation. This solves the problem for a grid of regularization parameters and chooses the parameter that minimizes

l^{held out}(θ̂_τ) + ‖∆^{(k+1)} θ̂_τ‖_1,

with θ̂_τ the solution obtained by fitting the model on the training set, which consists of 75% of the data. Here, l^{held out}(θ̂_τ) is evaluated using the counts from the held-out set, which contains the remaining 25% of the data, and n^{held out} is the number of observations in that set. Our motivation for using the additional term ‖∆^{(k+1)} θ̂_τ‖_1 is that an ℓ0 count of nonzero differences works well when the problem is formulated with ℓ1 regularization; when (10) is formulated with the ℓ2 penalty, however, the ℓ0 count is not suitable, so we use the ℓ1 norm instead. Our simulations in the experiments section will show that this rule works well in practice.

A toy example
We conclude this section by illustrating the accuracy of our regularized deconvolution approach on a toy example. In this example we draw 10^5 samples {y_i}, with the corresponding {µ_i} drawn from a mixture of three normal distributions. Figure 1 shows the samples of both the observations y_i (left panel) and the means µ_i (right panel), together with the reconstructions provided by our method. Here, we solve the ℓ2 version of problem (10) using the BFGS algorithm and choose τ using the heuristic just described.
It is clear that regularizing with an ℓ2 penalty provides an excellent fit to the marginal density. Surprisingly, it can also capture all three modes of the true mixing density, a feature which is completely obscured in the marginal. Our experiments in Section 6.1 will show in a more comprehensive way that our method far outperforms other approaches in its ability to provide accurate estimates for multi-modal mixing distributions.

Sensitivity analysis across the path
In this section, we provide an example of a sensitivity analysis using our deconvolution path estimator. We examine data originally collected and analyzed by Singh et al. (2002) on gene expression for 12,600 genes across two groups: 9 healthy patients and 25 patients with prostate tumors. The data come as a set of 12,600 t-statistics computed from gene-by-gene tests for whether the mean gene-expression score differs between the two groups.
After turning these 12,600 t-statistics into z-scores via a CDF transform, we estimate a deconvolution path assuming a Gaussian convolution kernel. We use an ℓ2 penalty and a grid of τ values evenly spaced on the logarithmic scale between 10^7 and 10^{-3}.
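The CDF transform maps each t-statistic through its null CDF and then through the standard normal quantile function. A minimal sketch: the degrees of freedom shown (32 = 9 + 25 − 2) assume the usual equal-variance two-sample t-test; the actual analysis may have used a different null.

```python
import numpy as np
from scipy import stats

def t_to_z(t_stats, df):
    """CDF transform of t-statistics to z-scores: z = Phi^{-1}(F_t(t; df))."""
    return stats.norm.ppf(stats.t.cdf(t_stats, df))

z = t_to_z(np.array([0.0, 2.0, -2.0]), df=32)
```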
Each row of Figure 2 shows five points along the deconvolution path; the regularization parameter is largest in Row A and gets smaller in each successive row. Within each row, the left column shows the estimated mixing distribution f̂ for the given value of τ.
The middle column shows the histogram of the data together with the fitted marginal density m̂ = φ * f̂. The right column shows the fitted marginal density on the log scale. This vividly demonstrates the well-known fact that deconvolution, especially with a Gaussian kernel, is a severely ill-posed inverse problem. There is little information in the data to distinguish a smooth mixing distribution from a highly multimodal one, and the model-selection heuristics described earlier are imperfect. A decision to prefer Panel B to Panel E, for instance, is almost entirely due to the effect of the prior. Yet for most common deconvolution methods, the mapping between prior assumptions and the smoothness of the estimate is far from intuitive. By providing a full deconvolution path, our method makes this mapping visually explicit.

Estimate of mixing distribution
For reference, it is interesting to compare our deconvolution path to the results of other methods. Figure 3 shows the result of using MCMC to fit a 10-component mixture of normals to the mixing distribution. The weights in the Gaussian mixture were assigned three different symmetric Dirichlet priors, with concentration parameter α ∈ {0.01, 1, 100}.
Panels 1-3 in Figure 3 show the posterior mean and posterior 95% credible envelopes for the mixing density under each prior.

Theoretical properties
In this section we establish some important theoretical properties of our estimators, by viewing them as approximations to problems involving sieves. We start by showing consistency of the mixing density estimate in the L1 norm. We do not provide convergence rates since, unlike the kernel estimator from Fan (1991) and the predictive recursion from Newton (2002), our method cannot be expressed in analytical form. Moreover, while the method from Fan (1991) remarkably attains minimax rates under squared-error loss for point estimates of the true mixing density, Carroll and Hall (1988) suggested earlier that convergence rates for Gaussian deconvolution might be overly pessimistic, given that the function classes considered have unbounded support. Such questions are beyond the scope of this paper, but we do provide evidence in the later sections that our estimator can outperform existing nonparametric methods.
Throughout we consider k ∈ N − {0} and q > 0 to be fixed. We denote by P the set of densities on R, P := {f : ∫_R f(µ) dµ = 1, f ≥ 0}, where dµ denotes Lebesgue measure. Given an arbitrary function g, we use the notation ‖g‖_∞ to indicate the usual supremum norm on the support of g. Moreover, g is called L-Lipschitz if |g(x) − g(y)| ≤ L|x − y| for all x and y.
In this section two metrics of interest will be repeatedly used. The first is the usual L1 distance, d(f, g) := ∫_R |f(µ) − g(µ)| dµ. The other metric of interest is the Hellinger distance, whose square is given by H²(f, g) := ∫_R (√f(µ) − √g(µ))² dµ. Finally, for q ∈ N, we define the functional J_{k,q}, a generalization of the usual total variation: J_{k,q}(f) := ∫_R |f^{(k+1)}(µ)|^q dµ. Next we state some assumptions for our first consistency result. Our approach is to consider the objective function in (3) restricted to a smaller domain than that of its original formulation. This allows us to prove that the new problem is well defined and that its solutions converge asymptotically to the true mixing density. We refer to Geman and Hwang (1982) for a general perspective on sieves.

Assumptions and definitions Let
A be a set of functions that satisfies the following. Assumption 1. Any function f ∈ A satisfies f ∈ P, f > 0, J k,q (log f ) < ∞, and there exists a constant t f ∈ T f . Assumption 2. For all m ∈ N, there exists a set S m ⊂ A and constants T m , K m > 0 such that for all f ∈ S m it holds that t f = T m and J k,q (log f ) ≤ K m . Moreover, for all m, the set S m induces a tight set of probability measures on (R, B(R)), satisfying S m ⊂ S m+1 . In addition, ∪ m S m is dense in A with respect to the metric d.
Assumption 3. Data model: we assume that y 1 , . . . , y n are independent draws from the density

Assumption 4. The sets A m satisfy d(f 0 , α) → 0 as m → ∞ for all α ∈ A m , where the convergence is uniform in A m .
Assumption 5. We assume that the y 1 , . . . , y n are binned into D n different intervals with frequency counts {x j } j=1,...,D n such that n^{-1} ‖x‖_∞ → 0 a.s., and we denote by ξ j an arbitrary point in interval j. Note that this trivially holds for the case where D n = n and x j = 1 for all j.
Assumption 6. There exists f m ∈ A m satisfying the approximation condition stated above. If x j = 1 and ξ j = y j for all j = 1, . . . , D n , this condition can be disregarded.
Assumptions (1)-(3) are natural for the original variational problem proposed earlier.
The Lipschitz condition, the bounds on the behavior of the functions at zero, and the tightness of distributions are merely used to ensure that the sieves will indeed be compact sets with respect to the metric d. Moreover, Assumption (4) tells us that the sieves S m are rich enough to approximate the true mixing density sufficiently well. The last two assumptions can be disregarded when the counts in the bins are all one.
We are now ready to state our first consistency result. Its proof generalizes ideas from Theorem 1 in Geman and Hwang (1982).

(Figure caption: the data have been generated as y_i ~ N(µ_i, 1), where µ_i is a draw from the mixing density. The second panel shows, for this same example, the histogram of {µ_i}_{i=1}^n (the unobserved draws from the mixing density) with the L1-D estimate of the mixing density plotted on top of it. Panels 3-6 show the respective cases of Examples 2 and 3. The last two panels show the corresponding plots for the L2-D solution and Example 4.)
In this section we show the potential gains from our penalized approaches. We start by considering the task of recovering the true mixing distribution, evaluating the performance of the methods described in Section 3, which we call L1-deconvolution (L1-D) and L2-deconvolution (L2-D). As a flexible Bayesian model we use a prior for the mixing density based on a mixture of 10 normals (MN). Here, the weights of the mixture components are drawn from a Dirichlet prior with concentration parameter 1, chosen to give a uniform prior on the simplex. For the locations of the mixture components we consider non-informative priors given as N(0, 10²), while for the variances of the mixture components we place an inverse gamma prior with shape parameter 0.01 and rate 0.01. The complete model can then be thought of as a weak-limit approximation to the Dirichlet process (Ishwaran and Zarepour, 2002). Gibbs sampling is accomplished straightforwardly by introducing a data augmentation with a variable z_i indicating the component to which µ_i belongs. The next competing method is the predictive recursion algorithm from Newton (2002), for which we choose the weights w_i as in Martin and Tokdar (2012), close to the limit of the upper bound of the convergence rate for PR given in Tokdar et al. (2009). Moreover, we average the PR estimator over 10 different permutations of the input data in order to obtain a smaller expected error (Tokdar et al., 2009).
On the other hand, for the Fourier transform kernel deconvolution method, we consider different choices of bandwidth: the rule of thumb from Fan (1991), the plug-in bandwidth from Stefanski and Carroll (1990), and the two-stage plug-in bandwidth from Delaigle and Gijbels (2002). Our estimates are obtained using the R package fDKDE available at http://www.ms.unimelb.edu.au/˜aurored/links.html, which addresses the main concerns associated with the R package decon (see Delaigle, 2014).
For the final competitor, the "g-modeling" approach from Efron (2016) (g-M), we use the newly released R package deconvolveR.
We now state the simulation setting for recovering the mixing density. Given the densities from Figure 4, we vary the number of samples n, and for each fixed n we run 100 Monte Carlo simulations. Moreover, for our methods we set D, the number of evenly spaced points in the grid, to 250. See the appendix for a sensitivity analysis of this parameter.
The results in Table 1 illustrate a clear advantage of our penalized likelihood approaches over MN, PR and FTKD, which appears even more significant for larger sample sizes. The mixing density estimated by L1-D is shown in Figure 4, where we can clearly see that L1-D captures the peaks of the unknown mixing density. Moreover, Figure 5 shows that L2-D can also capture the structure of the true density. In contrast, MN, PR and FTKD all fail to provide reliable estimates.
For our example density 2, we observe from Table 2 that in general L2-D and L1-D offer the best performance. In the case of example 3, we observe that L1-D again provides better results than the competitors across all the sample sizes considered. Even with only 10,000 samples, L1-D is closer to the true density than all the other methods are with more samples. Moreover, L2-D performs much better than PR and FTKD, and appears to be a clear competitor to MN. In the final example density 4, we observe that L2-D is the best method in all the scenarios considered.
Overall, we have shown that for estimating the mixing density, L1-D and L2-D can perform well under different settings, even when other methods exhibit notable deficiencies. The advantage is amplified by the fact that both of our methods are less computationally intensive than MN, with L2-D requiring around 40 seconds for problems with D = 250, while L1-D under the same conditions typically requires around 5 minutes for a full solution path across 50 values of the tuning parameter.

Normal means estimation
After evaluating our proposed methodology for the task of estimating the mixing density, we now, for the case of a standard normal kernel, focus on the estimation of the normal means {µ_i}. For this, we compare the best four among the methods used before, in addition to other procedures that we briefly discuss next.
As is well known (see, e.g., Efron, 2011, for description and references), assuming that the marginal density is known, one can use Tweedie's formula to estimate {µ_i}. This is the approach we take for all methods here, except for MN, for which we use the posterior means resulting from Gibbs sampling. For the methods based on a grid estimator, the number of bins is set to 250.
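For a normal kernel, Tweedie's formula reads E[µ | y] = y + σ² (log m)'(y). A minimal sketch, plugging in any estimate of the marginal on a grid and differentiating numerically (the function name and numerical-derivative choice are ours):

```python
import numpy as np

def tweedie_means(y, grid, m_hat, sigma2=1.0):
    """Tweedie's formula for normal kernels: E[mu | y] = y + sigma^2 * (log m)'(y).

    m_hat is any estimate of the marginal density evaluated on `grid`;
    the derivative of log m is taken numerically.
    """
    score = np.gradient(np.log(m_hat), grid)     # (log m)' on the grid
    return y + sigma2 * np.interp(y, grid, score)

# sanity check: if mu ~ N(0,1) and y | mu ~ N(mu,1), then m = N(0,2)
# and E[mu | y] = y/2
grid = np.linspace(-5.0, 5.0, 2001)
m_hat = np.exp(-grid ** 2 / 4.0) / np.sqrt(4.0 * np.pi)
est = tweedie_means(np.array([1.0]), grid, m_hat, sigma2=1.0)
```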
For the method of Efron (2011), we set the degree of the polynomial approximation to the logarithm of the true marginal density to 5 (we found larger values to be less numerically stable). The Poisson surrogate model is then fit in R using the command glm. We also compare against the general maximum likelihood empirical Bayes (GMLEB) estimator from Jiang and Zhang (2009), which is a discretized version of the original Kiefer-Wolfowitz estimator. For our comparisons we use the interior-point algorithm proposed in Koenker and Mizera (2014), which we denote GMLEBIP; we use the R package REBayes to obtain this estimator (Koenker, 2013). On the other hand, for the shape-constrained (SC) estimator from Koenker and Mizera (2014), we rely on a binned-count approach based on a weighted likelihood, using R code provided by the authors. Moreover, we consider the estimator from Brown and Greenshtein (2009) using the default choice of bandwidth h_n = (log n)^{-1/2}, which we refer to as BG. The final competitor is the non-linear projection (NLP) estimator from Wager (2013).
From Table 3 it is clear that the best methods for example 1 are L1-D, L2-D, GMLEBIP, and NLP. Moreover, it is not surprising that GMLEBIP provides good estimates given that the true mixing density has mixture components that have small variance.
For example 2, we can see from Table 4 that L2-D and L1-D again provide competitive estimates. The other suitable methods for this example appear to be PR and GMLEBIP. With slightly worse estimates, MN, BG and SC still provide competitive results, with SC being particularly attractive given its computational speed. We observe from Tables 5 and 6, respectively, that L1-D and L2-D are the best, or among the best, methods in terms of mean squared distance when recovering the unknown means µ_i. Table 6 also suggests that Efron's estimator is more suitable when the true mixing density is very smooth with no sharp peaks.

Discussion
In many problems in statistics and machine learning, we observe a blurred version of an unknown mixture distribution which we would like to recover via deconvolution. The main challenge is to find an approach that is computationally fast but still possesses sound statistical guarantees. We propose a two-step "bin-and-smooth" procedure that achieves both of these goals. It reduces the deconvolution problem to a Poisson-regularized model that can be solved either via standard methods for smooth optimization, or with a fast version of the alternating-direction method of multipliers (ADMM). Our approach reduces the computational cost compared to a fully Bayesian method and yields a full deconvolution path that illustrates the sensitivity of the solution to the specification of the amount of regularization. We also provide theoretical guarantees for our procedure: in particular, under suitable regularity conditions, we establish the almost-sure convergence of our estimator towards the mixing density.
There are a number of directions for future inquiry, including multivariate extensions and extensions to multiple hypothesis testing. These are active areas of current research.

A.1 Gradient expression for 2 regularization
Here we give the mathematical expression for the gradient of the objective function when performing L2 deconvolution, as in Section 3.3 of the main document. Using the notation there, we have that

A.2 Proof of Theorem 1
Proof. Motivated by Geman and Hwang (1982), given α ∈ A we define the function F (ξ, α) = (φ * α)(ξ) for ξ ∈ R. Clearly, F (·, α) is a density that induces a measure on R that is absolutely continuous with respect to Lebesgue measure. Also, we observe that if α, β ∈ A, then, for any Borel measurable set E, we have by Tonelli's theorem that the corresponding measures agree. Hence d(α, β) = 0 implies that φ * α and φ * β induce the same probability measures on (R, B(R)).
Next we verify the assumptions in Theorem 1 from Geman and Hwang (1982). This is done in several steps below. Steps 1-4 verify assumptions B1-B4 in Theorem 1 from Geman and Hwang (1982). Steps 5-6 are needed in the general case in which the data are binned. These are also related to ideas from Wald (1949).
Step 1. Given α ∈ A and ε > 0, the function appearing in condition B1 is continuous, and therefore measurable, in ξ. To see this, simply note that all the functions β ∈ S m are (‖φ‖_∞ + 1)-Lipschitz, and the claim follows.
Step 2. Define E_α(g) := ∫_R g(ξ)(φ * α)(ξ) dξ for any function g; condition B2 then holds for any α ∈ A and ε > 0.
Step 3. Next we show that S m is compact in (A, d). Throughout, we use the notation →_u to indicate uniform convergence. To show the claim, choose a sequence {α_l} in S m. Since the {(log α_l)^{(k+1)}} are T m-Lipschitz and uniformly bounded, the Arzelà-Ascoli theorem yields a sub-sequence {α_{1,l}} ⊂ {α_l} such that (log α_{1,l})^{(k+1)} →_u g_{k+1} on [−1, 1], for some function g_{k+1} : [−1, 1] → R which is also T m-Lipschitz. Applying the Arzelà-Ascoli theorem again to the sequence {α_{1,l}}, there exists a sub-sequence {α_{2,l}} ⊂ {α_{1,l}} such that (log α_{2,l})^{(k+1)} →_u g_{k+1} on [−2, 2]; thus we extend the domain of g_{k+1} as necessary.
Proceeding by induction, we conclude that for every N ∈ N there exists a sequence {α_{N,l}}_{l∈N} ⊂ {α_{N−1,l}}_{l∈N} such that (log α_{N,l})^{(k+1)} →_u g_{k+1} on [−N, N] as l → ∞. Hence, with Cantor's diagonal argument, there exists a sub-sequence {α_{l_j}} ⊂ {α_l} such that (log α_{l_j})^{(k+1)} →_u g_{k+1} on [−N, N] for all N ∈ N. Without loss of generality, we can further assume that (log α_{l_j})^{(k)} →_u g_k on [−N, N] for all N ∈ N, where the function g_k satisfies g_k' = g_{k+1}. Continuing this process, we can assume, without loss of generality, that log α_{l_j} →_u g as j → ∞ on [−N, N] for all N ∈ N, for some function g satisfying g^{(j)} = g_j for all 0 ≤ j ≤ k + 1.
Let us now prove that exp(g) ∈ S m. First, we observe by Fatou's lemma that e^g is integrable on R with respect to Lebesgue measure. Since S m is tight, we obtain that exp(g) integrates to 1, that is, exp(g) ∈ P. Note also that by Fatou's lemma we have J k,q (g) ≤ K m, and by construction g_{k+1} is T m-Lipschitz. Combining all of this, we arrive at exp(g) ∈ S m.
Step 4. This follows from Assumption (4).
Step 5. Let us show that the required integrability condition holds for all α ∈ S m. By Step 1 we have, for all ξ, that 0 ≤ −min{0, sup_{d(α,β)<ε, β∈S m} log(φ * β)(ξ)} ≤ −min{0, log(φ * α)(ξ)}, and the claim follows from the monotone convergence theorem. If x j = 1 and ξ j = y j for all j = 1, . . . , D n, the claim of Theorem 1 follows from Theorem 1 of Geman and Hwang (1982); otherwise, we continue the proof below. In either case, the solution set M n m is non-empty, since the map α ↦ (φ * α)(ξ) is continuous with respect to the metric d for any ξ.
Thus an induction argument allows us to conclude the proof.

B Simulation details
For the true mixing density we consider four densities of the form

f_0(µ) = Σ_{k=1}^K w_k N(µ; θ_k, σ_k²),

that is, mixtures of K normals with weights w, means θ, and variances σ². In all cases considered here, the observations arise as in (1) with a standard normal sampling model. In our first example we evaluate performance for samples from a density with four peaks: explicitly, K = 4, w = (0.2, 0.3, 0.3, 0.2), θ = (−3, −1.5, 1.5, 3), with small variances σ² = (0.01, 0.01, 0.01, 0.01). For the second example we consider a mixture of three normals, two of which are smooth while the other has a peak; the true parameters in this case are K = 3, w = (1/3, 1/3, 1/3), θ = (0, −2, 3) and σ² = (2, 0.1, 0.4). The next example is a mixture of K = 3 normals, one of which has very high variance; the parameters chosen are w = (0.3, 0.4, 0.3), θ = (0, 0, 0) and σ² = (0.1, 1, 9). Our final example is a mixture with K = 3 giving rise to a very smooth density; the parameters are w = (0.5, 0.4, 0.1), θ = (−1.5, 1.5, 4) and σ² = (1, 2, 2). A visualization of these examples is shown in Figure 4.

Figure 6 shows that the performance of both L1-D and L2-D generally improves as we increase D. However, based on our experience, D = 250 is a reasonable choice, especially for L1-D, whose computational burden increases more rapidly. For D = 250 it typically takes around 5 minutes to compute the solution path for L1-D with 50 values of the regularization parameter; in contrast, L2-D requires only around 40 seconds under the same setting.
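The simulation examples above can be sketched with a small sampler (the function name and interface are ours; the parameters for Example 1 are taken directly from the text):

```python
import numpy as np

def sample_example(n, w, theta, sigma2, rng=None):
    """Draw (mu_i, y_i) pairs: mu_i ~ sum_k w_k N(theta_k, sigma2_k),
    then y_i = mu_i + standard normal noise, as in model (1)."""
    rng = np.random.default_rng(rng)
    comp = rng.choice(len(w), size=n, p=w)          # mixture component labels
    mu = rng.normal(np.asarray(theta)[comp],
                    np.sqrt(np.asarray(sigma2)[comp]))
    y = mu + rng.normal(size=n)                     # standard normal kernel
    return mu, y

# Example 1: four sharp peaks
mu, y = sample_example(1000, w=[0.2, 0.3, 0.3, 0.2],
                       theta=[-3, -1.5, 1.5, 3],
                       sigma2=[0.01] * 4, rng=0)
```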