Simultaneous Estimation of Noise Variance and Number of Peaks in Bayesian Spectral Deconvolution

The heuristic identification of peaks in noisy complex spectra often leads to misunderstanding of the physical and chemical properties of matter. In this paper, we propose a framework based on Bayesian inference, which enables us to separate multipeak spectra into single peaks statistically and consists of two steps. The first step is the estimation of both the noise variance and the number of peaks as hyperparameters based on Bayes free energy, which is generally not analytically tractable. The second step is the fitting of the parameters of each peak function to the given spectrum by calculating the posterior density, which suffers from local minima and saddles since multipeak models are nonlinear and hierarchical. Our framework enables the escape from local minima or saddles by using the exchange Monte Carlo method and calculates Bayes free energy via the multiple histogram method. We demonstrate the efficiency of our framework through simulation and show that estimating both the noise variance and the number of peaks prevents overfitting, overpenalizing, and misjudging the precision of parameter estimation.


Introduction
Spectroscopy is at the heart of all sciences concerned with matter and energy. An electromagnetic spectrum indicates the electronic states and the kinetics of atoms. The quantum nature of spectra allows them to be approximately reduced to the sum of unimodal peaks (such as Lorentzian peaks, Gaussian peaks, and their convolutions), whose centers are the energy levels from the semiclassical viewpoint. 1) The peak intensity is proportional to both the population density of the atoms or molecules and their transition probabilities. The Lorentzian peak width indicates the lifetime of the eigenstate due to the time-energy uncertainty relation.
The Gaussian peak width indicates the Doppler effect caused by the kinetics of atoms and depends on temperature. These pieces of information about the electronic states or kinetics of atoms are obtained by identifying peaks from spectra.
It is generally difficult to distinguish each peak in a noisy spectrum with overlapping peaks. The simplest solution is least-squares fitting by a gradient method. 2) This type of method has the drawback that the fitting parameters are often trapped at a local minimum or a saddle even when the global minimum lies elsewhere in the parameter space. Moreover, the number of peaks is not always known in practice. Bayesian inference using a Markov chain Monte Carlo (MCMC) method provides a superior solution. [3][4][5][6][7][8][9][10][11] Although the Bayesian framework enables us to estimate the number of peaks, MCMC methods generally suffer from trapping in local minima and saddles. Nagata et al. reported 6) that the exchange Monte Carlo method 12) (or parallel tempering 13) ) can efficiently avoid local minima and saddles and provide a more accurate estimation than the reversible jump MCMC method 14) and its extension. 15) We constructed a Bayesian framework for estimating both the noise variance and the number of peaks from spectra with white Gaussian noise by expanding the previous framework of Nagata et al. 6) The noise variance and the number of peaks are respectively estimated by hyperparameter optimization and model selection. These estimations are carried out by maximizing a function called the marginal likelihood, [16][17][18] which in our framework is a conditional probability of the observed data given the noise variance and the number of peaks. We provide a straightforward and efficient scheme that calculates this bivariate function by using the exchange Monte Carlo method and the multiple histogram method. 19,20) We also demonstrate our framework through simulation and show that estimating both the noise variance and the number of peaks prevents overfitting, overpenalizing, and misjudging the precision of parameter estimation.

Models
An observed spectrum y ∈ R is represented by the sum f(x; w) of single peaks φ_k(x; µ_k, ρ_k) and additive noise ε as

y = f(x; w) + ε,   (1)
f(x; w) = Σ_{k=1}^{K} a_k φ_k(x; µ_k, ρ_k),   (2)
φ_k(x; µ_k, ρ_k) = exp( −(x − µ_k)² / (2ρ_k²) ),   (3)

where x ∈ R denotes energy, frequency, or wave number depending on the case. The parameters of the set w := {a_k, µ_k, ρ_k}_{k=1}^{K} for each k are respectively the intensity, energy level, and peak width. The Gaussian function φ_k(x) for each k should be replaced with other parametric functions, such as the Lorentzian or Voigt function, depending on the case. 1,21) If the peaks φ_k(x) are symmetric functions for all k (i.e., their values depend only on the distance from each center), the function f(x; w) is called a radial basis function network in neural networks and related fields. 6,22) This is the junction of spectral data analysis and singular learning theory. 23) If the additive noise ε is assumed to be zero-mean Gaussian with variance b^{-1} (b ≥ 0), the statistical model of the observed spectrum is represented by a conditional probability as

p(y | x, w, b) = sqrt(b / 2π) exp( −(b/2)( y − f(x; w) )² ),   (4)

where y is taken as a random variable. This Gaussian distribution p(y | x, w, b) is valid if the thermal noise is dominant. The parameter set w is also regarded as a random variable from the Bayesian viewpoint. The probability density function of w, called the prior density, is heuristically modeled as

ϕ(w | K) = Π_{k=1}^{K} ϕ(a_k) ϕ(µ_k) ϕ(ρ_k),   (5)
ϕ(a_k) = ( κ^ν / Γ(ν) ) a_k^{ν−1} exp(−κ a_k),   (6)
ϕ(µ_k) = sqrt(α / 2π) exp( −(α/2)(µ_k − µ_0)² ),   (7)

where κ > 0, µ_0 ∈ R, α > 0, and ν > 0 are hyperparameters, and ϕ(ρ_k) is a gamma density of the same form as Eq. (6). This prior density modeling is a special case of that by Nagata et al. 6) Equation (6) promotes the sparsity of a_k. Equation (7) is regarded as an almost flat prior density if α is sufficiently small. These prior density models can be replaced with any other model without loss of generality in our framework.
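To make the model concrete, the following minimal Python sketch generates a synthetic two-peak spectrum from the sum-of-Gaussians model above; the peak parameters, grid, and noise level are placeholder values, not the settings used in the paper.

```python
import numpy as np

def f(x, a, mu, rho):
    """Sum-of-Gaussians model: f(x; w) = sum_k a_k exp(-(x - mu_k)^2 / (2 rho_k^2))."""
    x = np.asarray(x, dtype=float)
    return np.sum(a * np.exp(-(x[:, None] - mu) ** 2 / (2.0 * rho ** 2)), axis=1)

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
a   = np.array([1.0, 0.8])     # intensities (placeholder values)
mu  = np.array([0.3, 0.6])     # energy levels
rho = np.array([0.05, 0.08])   # peak widths
b0 = 100.0                     # inverse noise variance; noise std is b0**-0.5
y = f(x, a, mu, rho) + rng.normal(0.0, b0 ** -0.5, size=n)
```

Replacing the Gaussian body of `f` with a Lorentzian or Voigt profile changes nothing else in the pipeline, which is the sense in which the peak shape is interchangeable.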

Bayesian formalization
The conditional probability density function of w given samples D := {(X_i, Y_i)}_{i=1}^{n}, set as X_1 < X_2 < · · · < X_n for the sake of convenience, is represented by Bayes' theorem as

p(w | D, K, b) = ϕ(w | K) Π_{i=1}^{n} p(Y_i | X_i, w, b) / Z_n(K, b),
Z_n(K, b) = ∫ dw ϕ(w | K) Π_{i=1}^{n} p(Y_i | X_i, w, b) = (b / 2π)^{n/2} ∫ dw ϕ(w | K) exp( −b E_n(w) ),
E_n(w) := (1/2) Σ_{i=1}^{n} ( Y_i − f(X_i; w) )²,

where the functions p(w | D, K, b) and Z_n(K, b) are respectively called the posterior density and marginal likelihood. We also define the normalized marginal likelihood Z̃_n(K, b) := (2π/b)^{n/2} Z_n(K, b) = ∫ dw ϕ(w | K) exp(−b E_n(w)), which satisfies Z̃_n(K, 0) = 1. Note that the function p(w | D, K, b) is a probability density but Z̃_n(K, b) is not. Bayes free energy F_n(K, b) is defined as

F_n(K, b) := −log Z_n(K, b) = −(n/2) log(b / 2π) + F̃_n(K, b),   F̃_n(K, b) := −log Z̃_n(K, b).

Note that Nagata et al. regarded F̃_n(K, b) as Bayes free energy for the sake of convenience 6) since the noise variance is treated as a known constant. We also allow the case in which there are no peaks, K = 0 (see Appendix A). In terms of the empirical Bayes (or type II maximum likelihood) approach, [16][17][18] empirical Bayes estimators of K and b are given by

(K̂, b̂) := argmin_{K, b} F_n(K, b).

The hierarchical Bayes approach 24) is also tractable in our framework (see Appendix B). The partial derivative of F_n(K, b) with respect to the variable b is obtained as

∂F_n(K, b)/∂b = −n/(2b) + ⟨E_n(w)⟩_b,

where ⟨Q⟩_b denotes the posterior mean of an arbitrary quantity Q ∈ R over p(w | D, K, b). If b = b̂ is a stationary point of F_n(K, b), then the following equation is satisfied:

⟨E_n(w)⟩_b̂ = n / (2b̂).   (21)

The Bayes estimator of w is given by ŵ := { ⟨Q′⟩_b̂ : Q′ ∈ w } for each parameter Q′ ∈ w if K̂ > 0. However, (K̂, b̂) cannot be derived analytically in this case since F_n(K, b) and ⟨E_n(w)⟩_b are analytically intractable for our model.
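In code, the misfit E_n(w) = (1/2) Σ_i (Y_i − f(X_i; w))², which drives both the posterior and the stationarity condition ⟨E_n(w)⟩_b̂ = n/(2b̂), is a one-liner; this sketch assumes the sum-of-Gaussians peak model (any other peak shape changes only the model line).

```python
import numpy as np

def energy(a, mu, rho, x, y):
    """E_n(w) = (1/2) sum_i (Y_i - f(X_i; w))^2 for a sum-of-Gaussians f."""
    fx = np.sum(a[:, None] * np.exp(-(x[None, :] - mu[:, None]) ** 2
                                    / (2.0 * rho[:, None] ** 2)), axis=0)
    r = y - fx
    return 0.5 * np.dot(r, r)
```

Up to the (b/2π)^{n/2} factor, the posterior density is then proportional to ϕ(w | K) exp(−b · energy(·)), which is the quantity the sampling scheme below works with.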

Exchange Monte Carlo method
In practice, we calculate F̃_n(K, b) and ⟨E_n(w)⟩_b by using the exchange Monte Carlo method, which efficiently enables sampling from p(w | D, K, b). The target density is a joint probability density as

p({w_l}_{l=1}^{L} | D, K, {b_l}_{l=1}^{L}) = Π_{l=1}^{L} p(w_l | D, K, b_l),

where w_l is the parameter set at b_l. Each density p(w_l | D, K, b_l) is called a replica. The sequence {b_l}_{l=1}^{L} is set as 0 = b_1 < b_2 < · · · < b_L for the sake of convenience. Note that the variable b replaces the inverse temperature β of Nagata et al.'s formulation. 6) The variable b works as a quasi-inverse temperature and varies the substantial support of the posterior density p(w | D, K, b). The state exchange between high- and low-temperature replicas enables the escape from local minima or saddles in the parameter space. The sampling procedure consists of the two following steps.

• State update in each replica
Simultaneously and independently update each state w_l subject to p(w_l | D, K, b_l) using the Metropolis algorithm. 25)
• State exchange between neighboring replicas
Exchange the states w_l and w_{l+1} at every step subject to the probability u(w_{l+1}, w_l; b_{l+1}, b_l) as

u(w_{l+1}, w_l; b_{l+1}, b_l) = min{ 1, exp( (b_{l+1} − b_l)( E_n(w_{l+1}) − E_n(w_l) ) ) },   (23)

where Eq. (23) ensures a detailed balance condition.
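The two steps above can be sketched compactly. The following toy replaces E_n(w) by a one-dimensional double-well energy so the block is self-contained; the ladder, step size, and sweep count are arbitrary choices, and the bottom of the ladder is taken slightly above zero because the toy has no prior ϕ(w) to keep the b = 0 replica proper. The exchange acceptance is min{1, exp((b_{l+1} − b_l)(E_n(w_{l+1}) − E_n(w_l)))} as in Eq. (23).

```python
import numpy as np

def exchange_mc(E, b, n_sweeps=5000, step=0.5, seed=0):
    """Exchange Monte Carlo for replicas p_l(w) ∝ exp(-b_l E(w)), scalar state w."""
    rng = np.random.default_rng(seed)
    L = len(b)
    w = rng.normal(size=L)                     # one state per replica
    chain = np.empty((n_sweeps, L))
    for t in range(n_sweeps):
        for l in range(L):                     # Metropolis update within each replica
            prop = w[l] + step * rng.normal()
            if rng.random() < np.exp(min(0.0, -b[l] * (E(prop) - E(w[l])))):
                w[l] = prop
        for l in range(L - 1):                 # exchange neighboring replicas
            acc = (b[l + 1] - b[l]) * (E(w[l + 1]) - E(w[l]))
            if rng.random() < np.exp(min(0.0, acc)):
                w[l], w[l + 1] = w[l + 1], w[l]
        chain[t] = w
    return chain

double_well = lambda v: (v * v - 1.0) ** 2     # toy stand-in for E_n(w)
b_ladder = np.array([0.05, 1.0, 4.0, 16.0, 64.0])
chain = exchange_mc(double_well, b_ladder)
```

The coldest replica (b = 64) then visits both modes w = ±1, which a single Metropolis chain at b = 64 essentially never does; this is the escape mechanism the exchange step provides.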
A straightforward way of computing F̃_n(K, b_l) via the exchange Monte Carlo method is bridge sampling, 26,27) in which F̃_n(K, b_l) is expressed as

F̃_n(K, b_l) = −Σ_{l′=1}^{l−1} log ⟨ exp( −(b_{l′+1} − b_{l′}) E_n(w_{l′}) ) ⟩_{b_{l′}},

where ⟨Q_l⟩_{b_l} for an arbitrary quantity Q_l ∈ R at the l-th replica is approximated by the sample mean (1/M) Σ_{m=1}^{M} Q_{l,m} over M snapshots. However, b̂ is not easy to calculate accurately using only the above scheme since {b_l}_{l=1}^{L} is a discrete set, whereas b is a continuous variable.
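Given snapshots of E_n(w) from each replica, the bridge-sampling telescoping product is a few lines; here `E_samples[l]` is assumed to hold the E_n values sampled at b_l, with the ladder starting at b[0] = 0 so that F̃ vanishes at the first replica.

```python
import numpy as np

def bridge_free_energy(E_samples, b):
    """Bridge-sampling estimate of F̃_n(K, b_l) along the ladder.

    Uses Z̃(b_{l+1}) / Z̃(b_l) = <exp(-(b_{l+1} - b_l) E_n)>_{b_l}
    and F̃(b_1) = 0, which holds when b[0] = 0 since Z̃_n(K, 0) = 1.
    """
    F = np.zeros(len(b))
    for l in range(len(b) - 1):
        ratio = np.mean(np.exp(-(b[l + 1] - b[l]) * np.asarray(E_samples[l])))
        F[l + 1] = F[l] - np.log(ratio)
    return F
```

This yields F̃ only at the ladder values b_l, which is exactly the limitation the multiple histogram method below removes.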

Multiple histogram method
The multiple histogram method enables us to estimate Z̃_n(K, b) at any b, not only at the ladder values b_l, from the same set of samples. The density of states is defined and estimated by

Z̃_n(K, b) = ∫ dE g(E; K) exp(−b E),
ĝ(E; K) = Σ_{l=1}^{L} N_l(E) / Σ_{l=1}^{L} M_l z_n(K, b_l)^{−1} exp(−b_l E),

where N_l(E) is the histogram of the energy E_n(w) sampled from the l-th replica and M_l := ∫ dE N_l(E) is the number of snapshots of the l-th replica; then we obtain

z̃_n(K, b) = Σ_{l=1}^{L} Σ_{m=1}^{M_l} exp(−b E_{l,m}) / Σ_{l′=1}^{L} M_{l′} z_n(K, b_{l′})^{−1} exp(−b_{l′} E_{l,m}),

where E_{l,m} is the value of E_n(w) at the m-th snapshot of the l-th replica. The values {z_n(K, b_l)}_{l=1}^{L} are determined self-consistently by iterating the above equation with b = b_l. The above procedure can be appropriately generalized to treat multidimensional histograms such as N_l(E, Q) dE dQ. 28) Then, the posterior mean of an arbitrary quantity Q is calculated as

⟨Q⟩_b = (1 / z̃_n(K, b)) Σ_{l=1}^{L} Σ_{m=1}^{M_l} Q_{l,m} exp(−b E_{l,m}) / Σ_{l′=1}^{L} M_{l′} z_n(K, b_{l′})^{−1} exp(−b_{l′} E_{l,m}),   (33)

where Q_{l,m} is the value of Q at the m-th snapshot of the l-th replica in an MCMC simulation.

[Fig. 1. The red solid line and blue dotted lines respectively show the true curve y = f(x; w_0) and the Gaussian peaks.]
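A sketch of the self-consistent iteration for {z_n(K, b_l)} and the interpolation to arbitrary b, done in log space for numerical stability. Here `E_samples[l]` holds the E_n(w) snapshots of the l-th replica, and we work directly with the snapshots rather than binned histograms (the bin-width → 0 limit of the histogram formulation).

```python
import numpy as np

def wham_log_z(E_samples, b_ladder, n_iter=200):
    """Self-consistent multiple-histogram estimate of log z_n(K, b_l).

    Returns (log z on the ladder, a function evaluating log z̃_n(K, b) at any b).
    Normalization fixes z = 1 at b_ladder[0], assumed to be 0.
    """
    E = np.concatenate([np.asarray(s, dtype=float) for s in E_samples])
    M = np.array([len(s) for s in E_samples], dtype=float)
    b = np.asarray(b_ladder, dtype=float)
    log_z = np.zeros(len(b))

    def log_den(lz):
        # log of sum_l M_l z_l^{-1} exp(-b_l E_m) for every pooled snapshot m
        return np.logaddexp.reduce(np.log(M) - lz - np.outer(E, b), axis=1)

    for _ in range(n_iter):                       # fixed-point iteration
        d = log_den(log_z)
        log_z = np.logaddexp.reduce(-np.outer(b, E) - d, axis=1)
        log_z -= log_z[0]                         # fix z(b_1) = 1

    d = log_den(log_z)
    def log_z_at(b_new):
        return float(np.logaddexp.reduce(-b_new * E - d))
    return log_z, log_z_at
```

The returned `log_z_at` gives −F̃_n(K, b) on the continuous b-axis, which is what the stationarity condition for b̂ requires.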
We calculate ⟨E_n(w)⟩_b via Eq. (33) and solve Eq. (21) numerically by the bisection method. Then, ŵ together with the standard deviation of each parameter is also calculated via Eq. (33). The posterior density of an arbitrary quantity can also be interpolated with respect to b = b′ in the same way (see Appendix C).
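The bisection for the stationary point of Eq. (21) can reuse the reweighting denominators from the multiple-histogram step. In this sketch, `E` is the pooled array of E_n snapshots and `log_den` the corresponding log denominators (both hypothetical names); the search assumes a single stationary point, at which g(b) = ⟨E_n⟩_b − n/(2b) changes sign from negative to positive.

```python
import numpy as np

def posterior_mean_E(b, E, log_den):
    """<E_n(w)>_b via multiple-histogram reweighting of pooled snapshots."""
    logw = -b * E - log_den
    w = np.exp(logw - logw.max())          # stabilized weights
    return float(np.sum(w * E) / np.sum(w))

def estimate_b(E, log_den, n, lo=1e-6, hi=1e6):
    """Bisection (on a log scale) for the root of g(b) = <E_n>_b - n/(2b),
    i.e., the stationary point of F_n(K, b) in Eq. (21)."""
    g = lambda b: posterior_mean_E(b, E, log_den) - n / (2.0 * b)
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        if g(mid) > 0.0:                   # g increases through the root
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)
```

The bracketing interval [lo, hi] is an arbitrary wide default; in practice it can be narrowed from the ladder {b_l} itself.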

Demonstration
We demonstrated how efficient our framework is through a simulation in which the same synthetic data as used by Nagata et al. 6) were used. The synthetic data (Fig. 1) were generated from the true probability density

p(y | x, w_0, b_0) = sqrt(b_0 / 2π) exp( −(b_0/2)( y − f(x; w_0) )² ),   (34)

where b_0 > 0 and w_0 := {a*_k, µ*_k, ρ*_k}_{k=1}^{K_0} are respectively the true inverse noise variance and the true parameter set, as listed in Tables I and II. Bayes free energy then asymptotically behaves as

F̃_n(K, b) = b E_n(w_0) + λ log n + O_p(log log n),

where w_0 is the parameter set that minimizes the Kullback-Leibler divergence of the statistical model from the true distribution, and λ > 0 is a rational number called the real log canonical threshold (RLCT). 29,30) The RLCT is determined by the pair of a statistical model and a true distribution, and the ones determined by Eqs. (4) and (34) are clarified for several cases of (K, K_0) with b = b_0. 23) The values E_n(w_0) and λ respectively become larger and smaller as the number of peaks K decreases.

The marginal posterior density of each peak position µ_k is expressed as

p(µ_k | D, K, b) ∝ ϕ(µ_k) ∫ dw′ ϕ(w′ | K) exp( −b E_n(w′; µ_k) ),

where w′ := w\{µ_k} and ϕ(w′ | K) := ϕ(w | K)/ϕ(µ_k), and E_n(w′; µ_k) indicates the function E_n(w) given µ_k. The support of this marginal posterior density shrinks as b increases. This effect is based on the same principle as super-resolution microscopy techniques. 31,32) For b > b̂, the posterior standard deviation sqrt( ⟨µ_k²⟩_b − ⟨µ_k⟩_b² ) for each k is also smaller than the sampling interval ∆x, whereas the support of µ_1 does not cover the true value µ*_1, which lies outside the confidence interval. An appropriate setting of b provides an appropriate precision of parameter estimation. Estimating the optimal value of b is indispensable even if the true model size K_0 is known; thus, this result also shows the validity of our framework.

Discussion and Conclusion
We constructed a framework that enables the dual estimation of the noise variance and the number of peaks and demonstrated the effectiveness of our framework through simulation.
We also warned that, without estimation of the noise variance, there are risks of overfitting, overpenalizing, and misjudging the precision of parameter estimation. Our framework is an extension of Nagata et al.'s framework and is versatile and applicable not only to spectral deconvolution but also to any other nonlinear regression with hierarchical statistical models.
Our framework is also considered as a learning scheme in radial basis function networks.
However, the goal of spectral deconvolution is not to predict any future data, which is the goal of most other learning tasks, but to identify the true model since spectral deconvolution is an inverse problem of physics. This is the reason why we do not adopt the Bayes generalization error but adopt the Bayes free energy for hyperparameter optimization and model selection.
The Akaike information criterion (AIC) 33) and Bayesian information criterion (BIC), 34) which are respectively approximations of the generalization error and Bayes free energy, do not hold for hierarchical models such as radial basis function networks: the widely applicable information criterion (WAIC) 35) and widely applicable Bayesian information criterion (WBIC) 36) generally hold for any statistical model. If the noise variance is unknown, these criteria do not lead to computational reduction since the value of the noise variance needs to be estimated, as discussed in Sect. 3. The example we gave is classified as an unrealizable and singular (or regular) case, 37) which is a difficult problem. On the other hand, the example Nagata et al.
gave 6) is classified as a realizable and singular (or regular) case, which is a relatively easy problem. Statistical hypothesis testing does not hold for singular cases. Our scheme is thus also valid and well founded from the viewpoint of statistics.

Appendix A: Bayes free energy for no-peaks model
We define the function f(x; w = φ) = 0 for K = 0, where φ is the empty set. The statistical model of the no-peaks spectrum and the marginal likelihood are expressed as

p(y | x, φ, b) = sqrt(b / 2π) exp( −(b/2) y² ),
Z_n(0, b) = (b / 2π)^{n/2} exp( −(b/2) Σ_{i=1}^{n} Y_i² ).

The main term of Bayes free energy and the posterior mean of the mean square error are also respectively expressed as

F̃_n(0, b) = (b/2) Σ_{i=1}^{n} Y_i²,
⟨E_n(w)⟩_b = (1/2) Σ_{i=1}^{n} Y_i²,

where they can be calculated without any MCMC method.
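These closed forms are immediate to code; combining them with the stationarity condition ⟨E_n⟩_b = n/(2b) gives the no-peaks estimate b̂ = n / Σ_i Y_i² (a sketch in our notation).

```python
import numpy as np

def no_peak_F(y, b):
    """F_n(0, b) = -(n/2) log(b / 2π) + (b/2) sum_i Y_i^2, exact (no MCMC)."""
    n = len(y)
    return -0.5 * n * np.log(b / (2.0 * np.pi)) + 0.5 * b * np.sum(y ** 2)

def no_peak_b_hat(y):
    """Stationary point of F_n(0, b): <E_n>_b = n/(2b) gives b̂ = n / sum_i Y_i^2."""
    return len(y) / np.sum(y ** 2)
```

Comparing F_n(0, b̂) against F_n(K, b̂) of the peak models decides whether the data support any peak at all.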

Appendix B: Hierarchical Bayes approach
In Sect. 3, we adopted the empirical Bayes (or type II maximum likelihood) approach, in which K and b are estimated by the minimization of F_n(K, b) (or the maximization of Z_n(K, b)). The hierarchical Bayes approach takes into account the posterior density of K and b,

p(K, b | D) ∝ Z_n(K, b) ϕ(K) ϕ(b),

where ϕ(K) and ϕ(b) are the prior densities of K and b, and estimates K by maximizing p(K | D) = ∫ db p(K, b | D), where the integration along the b-axis is calculated using the trapezoidal rule. Note that both approaches are feasible in practice, and there is a continuing discussion of which is better: to optimize or to integrate out. 38) The users of our framework can choose the better way in light of their own perspective.
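A sketch of the hierarchical alternative: given log Z_n(K, b) evaluated on a b-grid (e.g., via the multiple histogram method), the posterior over the number of peaks is one trapezoidal integral per candidate model. The function name and the flat-prior choice for ϕ(K) and ϕ(b) are illustrative assumptions.

```python
import numpy as np

def model_posterior(log_Z, b_grid):
    """P(K | D) ∝ ∫ db ϕ(b) Z_n(K, b) with flat ϕ(K), ϕ(b), trapezoidal rule.

    log_Z : array of shape (n_models, n_grid) of log marginal likelihoods
    """
    Z = np.exp(log_Z - log_Z.max())                  # common shift cancels on normalizing
    h = np.diff(np.asarray(b_grid, dtype=float))
    integ = np.sum((Z[:, 1:] + Z[:, :-1]) * h / 2.0, axis=1)
    return integ / integ.sum()
```

Subtracting the global maximum of log Z_n before exponentiating keeps the integrand in floating-point range; the shift cancels when the posterior is normalized.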

Appendix C: Interpolation of posterior distribution
The density of states in the i-th bin, which is the function g(E; K) given the value of µ_k in the interval [X_i, X_{i+1}], is defined and estimated as

ĝ(E; K, X_i ≤ µ_k ≤ X_{i+1}) = Σ_{l=1}^{L} N_l(E; X_i ≤ µ_k ≤ X_{i+1}) / Σ_{l=1}^{L} M_l^{(i)} z_n(K, b_l, X_i ≤ µ_k ≤ X_{i+1})^{−1} exp(−b_l E),   (C.2)

then we obtain

z̃_n(K, b, X_i ≤ µ_k ≤ X_{i+1}) = Σ_{l=1}^{L} Σ_m exp(−b E^{(i)}_{l,m}) / Σ_{l′=1}^{L} M_{l′}^{(i)} z_n(K, b_{l′}, X_i ≤ µ_k ≤ X_{i+1})^{−1} exp(−b_{l′} E^{(i)}_{l,m}),   (C.4)

where E_n(w′; X_i ≤ µ_k ≤ X_{i+1}), N_l(E; X_i ≤ µ_k ≤ X_{i+1}), and E^{(i)}_{l,m} respectively indicate E_n(w), N_l(E), and E_{l,m} restricted to the i-th bin, and M^{(i)}_l := ∫ dE N_l(E; X_i ≤ µ_k ≤ X_{i+1}). The values of {z_n(K, b_l, X_i ≤ µ_k ≤ X_{i+1})}_{l=1}^{L} for each i are determined self-consistently by iterating Eq. (C.4) with b = b_l. Given {z_n(K, b_l, X_i ≤ µ_k ≤ X_{i+1})}_{l=1}^{L} for each i, we calculate z̃_n(K, b, X_i ≤ µ_k ≤ X_{i+1}) for each i with b = b′ via Eq. (C.4) again. If ∆x is sufficiently small (or ϕ(µ_k) is almost flat), P(X_i ≤ µ_k ≤ X_{i+1} | D, K, b) is expressed as

P(X_i ≤ µ_k ≤ X_{i+1} | D, K, b) = z̃_n(K, b, X_i ≤ µ_k ≤ X_{i+1}) / Σ_{i′} z̃_n(K, b, X_{i′} ≤ µ_k ≤ X_{i′+1}).   (C.5)
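The final normalization in Eq. (C.5) is then a softmax over bins; a minimal sketch, where `log_z_bins[i]` (a hypothetical name) holds the interpolated log z̃_n value of the i-th bin at the target b.

```python
import numpy as np

def bin_probabilities(log_z_bins):
    """P(X_i <= µ_k <= X_{i+1} | D, K, b): normalize z̃ over bins,
    valid when ∆x is small or ϕ(µ_k) is almost flat, as in Eq. (C.5)."""
    lz = np.asarray(log_z_bins, dtype=float)
    p = np.exp(lz - lz.max())              # stabilized against under/overflow
    return p / p.sum()
```

The resulting histogram over bins is the interpolated marginal posterior of the peak position µ_k at any quasi-inverse temperature b.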