Bayesian Recovery of Sinusoids with Simulated Annealing

The ultimate goal of collecting data is to gain meaningful information about a physical system. However, in many situations, the quantities that we would like to determine are different from the ones which we are able to have measured. If the data we measured depends on the quantities we want, it contains at least some information about them. Therefore, our general interest is to subtract this information from data. Let the vector θ contain the parameters to be estimated from the (measurements) vector , D which is the output of the physical system that one wants to be modeled. The physical system is described by a vector function f in the form:


Introduction
The ultimate goal of collecting data is to gain meaningful information about a physical system.However, in many situations, the quantities that we would like to determine are different from the ones which we are able to have measured.If the data we measured depends on the quantities we want, it contains at least some information about them.Therefore, our general interest is to subtract this information from data.
Let the vector θ contain the parameters to be estimated from the (measurements) vector , D which is the output of the physical system that one wants to be modeled.The physical system is described by a vector function f in the form: The measurement errors () et are generally assumed to be drawn independently from a zero mean Gaussian probability distribution with a standard deviation of  .On the other hand, different signal models correspond to different choices of signal model function (, ) ftθ .In this chapter, we restrict our attention to the static1 sinusoidal model given by The goal of data analysis is usually to use the observed data , D to infer the values of parameters {,}  θω B .Besides estimating the values of the parameters, there are two additional important problems.The one is to obtain an indication of the uncertainties associated with each parameter, i.e. some measures of how far they are away from the true parameters.The other we will not consider here is to assess whether or not the model is appropriate for explaining the data.
Sinusoidal parameter estimation in additive white noise within a Bayesian framework has been an important problem in signal processing and still now is an active research area because of its wide variety of applications in multiple disciplines such as sonar, radar, digital communications and biomedical engineering.The purpose of this research is therefore to develop accurate and computationally efficient estimators for sinusoidal parameter, namely, amplitudes and frequencies.In above problem, one may or may not know the number of sinusoids.When it is unknown, it is called model selection (Andrieu and Doucet, 1999;Üstündag, 2011) and is not subject to this chapter.Under an assumption of a known number of sinusoids, several algorithms have already been used in the parameter estimation literature, such as least-square fitting (Press et al., 1995), discrete Fourier transform (Cooley & Tukey, 1964), and periodogram (Schuster, 1905).With least square fitting, the model is completely defined and the question remaining is to find the values of the parameters by minimizing the sum of squared residuals.The discrete Fourier transform has been a very powerful tool in Bayesian spectral analysis since Cooley and Tukey introduced the fast Fourier transform (FFT) technique in 1965, followed by the rapid development of computers.In 1987, Jaynes derived periodogram directly from the principles of Bayesian inference.After his work, researchers in different branches of science have given much attention to the relationship between Bayesian inference and parameter estimation and they have done excellent works in this area for last fifteen years (Bretthorst, 1990;Üstündag et al., 1989, 1991;Harney, 2003;Gregory, 2005;Üstündag & Cevri, 2008Üstündag & Cevri, , 2011;;Üstündag, 2011).
In this chapte r, we studied Bayesian rec overy of sinusoids using estimation approach proposed by Bretthorst for a general signal model equation and combined it with a simulated annealing (SA) algorithm to obtain a global maximum of the posterior probability density function (PDF)   , PI ω D for frequencies .Unfortunately, conventional algorithms (Press et al., 1995) based on the gradient direction fail to converge for this problem.Even when they converge, there is no assurance that they have found the global, rather than a local maximum.This is because the logarithm of the PDF   , PI ω D is so sharply peaked and highly nonlinear function of frequencies.In this respect, a pattern search algorithm described by Hook-Jevees (Hooke & Jevees, 1962) to overcome this problem has already been used by some researchers in literature of estimation.However, we have found out that this approach does not converge unless the starting point is much closer to the optimum so that we have developed an algorithm in which this Bayesian approach is combined with a simulated annealing (SA) algorithm (Kirkpatrick, et al., 1983;Corana et al., 1987;Goffe et al., 1994;Ingber, 1996), which is a function optimization strategy based on an analogy with the creation of crystals from their melts.This explores the entire surface of the posterior PDF for the frequencies and tries to maximize it while moving both uphill and downhill steps, whose sizes are controlled by a parameter  that plays the role of the temperature in a physical system.By slowly lowering the temperature  towards zero according to a properly chosen schedule, one can show that the globally optimal solutions are approached asymptotically.Thus, it is largely independent of the starting values, often a critical input in conventional algorithms, and also offers different approach to finding parameter values of sinusoids through a directed, but random, search of the parameter space.In this context, an algorithm of this Bayesian approach is developed and coded in Mathematica programming language (Wellin at al., 2005) and also tested for recovering noisy sinusoids with multiple frequencies.Furthermore, simulation studies on synthetic data sets of a single sinusoid under a variety of signal to noise ratio (SNR) are made for a comparison of its performance with Cramér-Rao lower bound (CRLB), known as a lower limit on variance of any unbiased estimator.The simulations results support its effectiveness.

Bayesian parameter estimation
Let us now reconsider above problem within a Bayesian context (Bernardo & Smith, 2000;Bretthorst, 1988;Gregory, 2005;Harney, 2003;Jaynes, 2003;Ruanaidh & Fitzgerald, 1996;Üstündag & Cevri, 2008).As with all Bayesian calculations, the first step is to write down Bayes' theorem for the joint PDF for all of the unknown parameters The quantity   the likelihood (Edwards, 1972) of the data , D given the model parameters.The probability function () P D|I is the evidence, which is a constant for parameter estimation and is used here for normalizing the posterior PDF   2 , ,| , P  ω BD Ι .Therefore, it can be dropped in Equation ( 5) so that the joint PDF for the unknown parameters turns out to be the following form: A key component in Bayes theorem is the likelihood function   proportional to the probability density of the noise.If its standard deviation  is assumed to be known, then the likelihood function takes on the following form: , 2 where the exponent Q is defined as follows 2 2 11 (; ) This is equivalent to where and In order to obtain the posterior PDF for ω , Equation ( 6) can be integrated with respect to the nuisance parameters B under the knowledge of 2 With the choices of an uniform prior PDF or independent Gaussians distributions with mean zero and known standard deviation for the amplitudes, the integral equations in (12) turn out to be a Gaussian integral which can be evaluated analytically (Bretthorst, 1988).To do this it is simply to convert the square matrix Ω into a special type of matrix-a so called diagonal matrix-that shares the same fundamental properties of the underlying matrix.In other words, it is equivalent to transforming the underlying system of equations into a special set of coordinate axes.Therefore, this diagonalization process (Bretthorst, 1988) effectively introduces new orthonormal model This represents the mean-square of the observed projections.If  is known, the joint posterior probability density of the parameters ω , conditional on the data and our knowledge of  is given by If there is no prior information about noise, then  is known as a nuisance parameter and must also be eliminated by integrating it out.Using Jeffreys prior (Jeffreys, 1961) 1  and integrating Equation ( 12) over  we obtain This has the form of the "Student t-distribution".As well as determining the values of the ω parameters for which the posterior PDF for ω is a maximum, it is also desirable to compute uncertainties associated with them.To do this, let us assume the case where  is known and let ω represent the estimated values of ω .Following to Bretthorst's work (Bretthorst, 1988), we can expand the function 2 h in a Taylor series at the point ω , such that By using these orthogonal variables to perform the ρ Gaussian integrals, the estimate variance for k  can be calculated: Therefore, the approximations for ω can be implemented in the form: ˆ, ( 1,2,3,..., ).
In agree with Bretthorst, the expectation values of the amplitudes are given by j j A h   .
From Equation ( 15) the expected values for the old amplitudes B becomes 2 1 1 .
Bayesian Recovery of Sinusoids with Simulated Annealing 73 The uncertainty in j A is   , so that the corresponding uncertainty in B is

Implementation of Simulated Annealing (SA) algorithm
Bayesian approach introduced by Bretthorst is briefly summarized in section 2 but, it is referred to Bretthorst's works (Bretthorst, 1988) for more detail information.Consequently, Bayesian parameter estimation turns into a global optimization problem which is a task to find the best possible solution ω for Equations ( 19) or (20) satisfying Because there is no variation at negative frequencies and the highest possible frequencies corresponds to wave that under goes a complete cycle in two unit intervals, so the lower limit on the range is 0 and all the variation is accounted for by frequencies less than  .
Over last few decades, researchers have developed many computational algorithms to address such type of global optimization problems (Metropolis et al., 1953;Jeffreys, 1961;Kirkpatrick, et al., 1983;Laarhoven & Aarts, 1987;Stefankovic et al., 2009).Although there are numerous algorithms which are suggested to achieve this goal, few of them are capable of locating it effectively.Therefore, we follow the SA algorithm, suggested by Corana (Corana et. al., 1987) and modified by Goffe (Goffe et al., 1994), which is a kind of probabilistic algorithm for finding the global optimum of Equation ( 27) although its various alternative versions have already been used in statistical applications.A brief review of the most work on many algorithms based on that of SA, together with areas of applications is provided by Ingber and Binder (Binder, 1986;Ingber, 1994).
The algorithm begins with an initial guess of the frequencies 0 ω , a trial step-length vector 0 v and a global parameter 0  (called the initial temperature).Each step of the SA algorithm replaces the current frequency with randomly generated new frequency.In other words, the next candidate point where  is a uniformly distributed random number from the interval [-1,1] and are recorded since this is the best current value of the optimum.This forces the system toward a state corresponding to a local maximum or possibly a global maximum.However, most large optimization problems, like the one given in Equation ( 27), have many local maxima and optimization algorithm is therefore often trapped in a local maximum.To get out of a local maximum, a decrease of the function value   accepted with a certain probability.This is accomplished by the Metropolis-criterion (Metropolis et al., 1953) which is based on the changes of obtaining new state with the posterior PDF of frequencies value, defined as where P  represents difference between the present and previous posterior PDF values of frequencies, e.g., p is computed and compared to p , a uniformly distributed random number from the interval   where j n is the number of accepted moves along the direction j and the parameter j c , which is initially defined by user, controls the step variation along the j th direction.The aim of these variations in a step length is to maintain average percentage of accepted moves at about one-half of the total number of moves.
where (0,1) N is a standard Normal distribution, k  represents the temperature at the k th iteration and rescales the step size and ordering j  .On the other hand, k j  is the CRLB for the j th component of angular frequencies at the   (Ireland, 2007) and provides a theoretical lower limit on how precisely parameter j  can be extracted from noisy measurements.In this respect, it is defined in the form: where the Fisher information matrix   J ω (Kay, 1993), defined by is an expectation of the second derivatives of the signal function with respect to ω .
Assuming that the matrix   J θ is diagonal for a large N so that its inversion is straightforward.In this case, the diagonal elements yield the lower bound (Stoica et al., 1989;Lagrange, 2005) for the variance of ω asymptotically and we can write, where 2  represents the estimated variance of the noise and is described in Equation ( 41).This whole cycle is then repeated n  times, after which the temperature is decreased by a factor (0,1) . This process is generally called annealing (or cooling) schedule which is the heart of the algorithm and effects the number of times the temperature is decreased.If a fast cooling takes place then the problem will be trapped in a local maximum.Therefore, there are various annealing schedules suggested by different researchers (Ingber, 1994;Stefankovic, et al., 2009) for lowering the temperature but we choose the following: Because of being exponential rather than logarithmic, it is sometimes known as simulated quenching (SQ) (Vasan et al., 2009;Aguiare et al., 2012).In case of a well conditioned estimation problem like, say, frequency estimation problem in signal processing, it is clear that the convenience of SA algorithm, together with a need for some global search over local optima, makes a strong practical case for the use of SQ.Therefore, different parameters have different finite ranges and different annealing time-dependent sensitivities.Classical annealing algorithms have distributions that sample infinite ranges and there is no decision for considering differences in each parameter dimension; e.g.different sensitivities might be necessary to use different annealing schedules.This requires the development of a new PDF to embed in the desired features (Ingber, 1994) so that it leads to variants of SA that justifies exponential temperature annealing schedules.
Termination of the algorithm occurs when average function value of the sequences of the points after each s nn   step cycle reaches a stable state: where  is a small positive number defined by user and l indicates the last four successive iteration values of the posterior PDF of the frequencies that are being stored.Further details of the algorithm initialization are problem-dependent and are given in Section 5.

Power spectral density
Before we discuss the computer simulated examples, there is something we need to say about how to display the results.The usual way the result from a spectral analysis is displayed is in the form of a power spectral density that shows the strength of the variation (energy) as a function of frequency.In Fourier transform spectroscopy this is typically taken as the squared magnitude of the discrete Fourier transform of the data.In order to display our results in the form of a power spectral density (Bretthorst, 1988;Gregory, 2005), it is necessary to give an attention to its definition that shows how much power is contained in a unit frequency.According to Bretthorst (Bretthorst, 1988) the Bayesian power spectral density is defined as the expected value of the power of the signals over the joint posterior PDF: Performing integrals analytically over 12 ,..., BB  by using these orthonormal model functions defined in section 2, the power spectral density can therefore be approximated as This function stresses information about the total energy carried by the signal and about the accuracy of each line.In the next section, we will present some numerical examples how well this technique works.

Computer simulated examples
To verify the performance of the algorithm, we generated a simulated data vector according to one, two and with five sinusoids.Here i t runs over the symmetric time interval T  to T in ( 21) 512 T  integer steps and the components of the noise i e are generated from the zero mean Gaussian distribution with a known deviation  , initially and added to the simulated data.However, one of the interests in an experiment is also to estimate noise power 2  so that it is assumed to be unknown.In agreement with Bretthorst, this is given in the following form: Clearly, it is seen that the accuracy of the estimate depends on how long we sample and the signal-to-noise ratio (SNR), defined as the ratio of the root mean square of the signal amplitude to the noise 2  .In addition, one may also get the following expression of SNR: When the standard deviation of the noise is unknown, an empirical SNR is obtained by replacing 2  in Equation ( 42) with the estimated noise variance in (41).
We then carried out the Bayesian analysis of the simulated data, assuming that we know the mathematical form of the model but not the value of the parameters.We first gave starting values to the list of frequencies to begin a multidimensional search for finding a global maximum of the posterior PDF of the frequencies ω given in Equations ( 19) or (20).As an initial estimate of the frequencies 0 ω for the maximization procedure, it is possible to take random choices from the interval   0, .However, it is better to start with the locations of the peaks chosen automatically from the Fourier power spectral density graph by using a computer code written in Mathematica.
In agreement with Corana (Corana, et al., 1987), reasonable values of the parameters that control the SA algorithm are chosen as 20 and 2, ( 1,..., ) j cj  .Then the global optimization algorithm starts at some high temperature 0 100  .Thus the sequence of points is generated until a sort of equilibrium is approached; that is a sequence of points 012 , ,, . . .ωωω whose average value of   , PI ω D reaches a stable value as iteration proceeds.During this phase the step vector v is periodically adjusted by the rule defined in Equation ( 32).The best point ω reached so far is recorded.After thermal equilibration, the temperature  is reduced by using the annealing scheduled in Equation ( 37) with a factor 0.85 and a new sequence is made starting from this best point ω , until thermal equilibrium is reached again and so on.The SA algorithm first builds up a rough view of the surface by moving large step lengths.As the temperature  falls and the step decreases, it is slowly focuses on the most promising area.Therefore it proceeds toward better maximum even in the presence of many local maxima by adjusting the step vector that can allow the algorithm to shape the space within it may move, to that within which ought to be as defined by the PDF   , PI ω D .Consequently, the process is stop at a temperature low enough that no more useful improvement can be expected, according to a stopping criterion in Equation ( 38 Once the frequencies are estimated, we then carried on calculating the amplitudes and parameter errors approximately using Equations ( 25), ( 23) and ( 26), respectively.However, an evaluation of the posterior PDF at a given point ω cannot be made analytically.It requires a numerical calculation of projections onto orthonormal model functions, related to Eigen-decomposition of the    ω .Therefore, the proposed algorithm was coded in Mathematica programming language (Wellin, P., et al., 2005), that provides a much flexible and efficient computer programming environment.Furthermore, it also contains a large collection of built-in functions so that it results much shorter computer codes than those written in C or FORTRAN programming languages.
The computer program was run on the workstation with four processors, which of each has got Intel Core 2 Quad Central Processing Unit (CPU), in two cases where the standard deviation of noise is known or not.The output of the computer simulations when 1   is illustrated in Table 1.The results when  is unknown are almost similar with that of Table 1.Parameter values are quoted as (value) ± (standard deviation).It can be seen that a single frequency and amplitudes are recovered very well.The estimated value of SNR and the standard deviation of the noise are also shown in Table 1.
In our second example, we consider a signal model with two close harmonic frequencies: In general, we consider a multiple harmonic frequency model signal: The best estimates of parameters are tabulated in Table 3. Once again, all the frequencies have been well resolved, even the third and fourth frequencies which are too closed not to be separated by the Fourier power spectral density shown in Figure 3. Actually with the Fourier spectral density when the separation of two frequencies is less than the Nyquist step, defined as 2/ N  , the two frequencies are indistinguishable.This is simply because there are no sample points in between the two frequencies in the frequency domain.If  is not large enough, the resolution will be very poor.Therefore, it is hard to tell where the two frequencies are located.This is just the inherent problem of the discrete Fourier power spectral density.In this example two frequencies are separated by 0.01, which is less than the Nyquist step size.There is no way by using Fourier power spectral density that one can resolve the closed frequencies less than the Nyquist step.However, Bayesian power spectral density shown in Figure 3 gives us very good results with high accuracy.Finally, we constructed the signal model in Equation (3), whose parameters, amplitudes and frequencies, are randomly chosen from uniform distribution in the intervals 5,5      and 0,     , respectively and used it to generate data samples of 512 N  by adding a zero mean Gaussian random noise with 1   .The proposal algorithm was rerun for recovering sinusoids from it and the results are tabulated in Table (4).It can be seen that frequencies are specially recovered with high accuracies.Ten frequencies signal model are shown in Fig. 5.   On the other hand, modifications of this algorithm (Üstündag & Cevri, 2008;2011) have already been made by generating the next candidate in Equation ( 33) from normal distribution with a mean of the current estimation whose standard deviation is a square root of the CRLB (Ireland, 2007) given in Equation ( 36), which is a lower limit to the variance of the measurement of the frequencies, so this generates a natural scale size of the search space around their estimated values.It is expected that better solutions lie close the ones that are already good and so normally distributed step size is used.Consequently, the results we obtained are comparable with or higher than those obtained previously.In addition, all the results discussed so far are also consistent with those of Bretthorst (Bretthorst, 1988) and also demonstrate the advantages of the Bayesian approach together with SA algorithm.Moreover it appears to be very reliable, in the sense that it always converged to neighborhood of the global maximum.The size of this neighborhood can be reduced by altering the control parameters of the SA algorithm, but this can be expensive in terms of CPU consumption.Moreover, we initially assumed that the values of the random noise in data were drawn from the Gaussian density with the mean 0   and the standard deviation .
 Figure 4 shows the exact and estimate probability densities of the random noise in data.It is seen that the estimated (dotted) probability density is closer to the true (solid) probability density and the histogram of the data is also much closer to the true probability density.The histogram is known as a nonparametric estimator of the probability density because it does not depend on specified parameters.
Computer simulations had been carried out to compare the performance of the method with the CRLB.To do this, we generated 64 data samples from a single real tone frequency signal 2 In order to compare the results with Bretthorst's in this example we converted 22 ,(1 , . . ., ) .   and added it to the variety of noise levels.After 50 independent trials under different SNR ratios, the mean square errors (MSE) for the estimated frequencies were obtained and their logarithmic values were plotted with respect to SNR ratios that vary between zero and 20 dB (deci Bell).It can be seen from Figure 6 that the proposed estimator has threshold about 3 dB of the SNR and follows nicely the CRLB after this value.As expected, larger SNR gives smaller MSE.However, many of existing methods in signal processing literature have a MSE that is close to the CRLB when the SNR is more than 20 dB and they usually perform poorly when the SNR is decreased.The computational complexity of the algorithm is dependent upon a few parameters such as annealing schedule, data samples and parameters.Under using same annealing schedule, Fig. 7 shows only CPU time of different simulations in a variety of number of data samples.It can be clearly seen that an increase in these numbers causes larger consumption of CPU time.With fixed size of components set and specifically annealing schedule of SA algorithm, the overall execution time of the cooling and decision is almost constant, but the runtime of the first two stages (move and evaluate) mostly depends on complicated design constraints and objective functions.Because the move and the evaluation process in the SA algorithm play an important role in CPU resource usage, improving the calculation ability for these stages will be the most feasible approach for an optimizing SA so that parallel computing is one of best approaches for this goal.

Conclusion
In this work we have partially developed a Bayesian approach combined with SA algorithm and applied it to spectral analysis and parameter algorithm for estimating parameters of sinusoids from noisy data.Overall, results presented here show that it provides rational approach for estimating the parameters, namely the frequencies and the amplitudes, can be recovered from the experimental data and the prior information with high accuracy, especially the frequency, which is the most important parameter in spectral analysis.A significant advantage of this approach comes from the very large posterior probabilities, which are sharply peaked in the neighborhood of the best fit.This helps us to simplify the problem of choosing starting values for the iteration and it provides a rational approach for estimating, in an optimal way, values of parameters by performing a random search.On the other hand, for sufficiently high SNR, MSEs of frequencies will attain CRLB so that it justifies the accuracy of the frequency estimation.Although the SA algorithm spends large consumption of CPU time, it is competitive when compared to the multiple runs often used with conventional algorithms to test different starting values.As expected, parallel implementation of SA algorithm reduces CPU resource usage.
Data analysis given in this chapter has also been applied to more complicated models and conditions, such as signals decay, periodic but non-harmonic signals, signals with nonstationary, etc.,.In general, we have also not addressed the problem of model selection which is the part of spectral analysis.In this case, one has enough prior information in a given experiment to select the best model among a finite set of model functions so that Bayesian inference helps us to accomplish it.Therefore, it will deserve further investigations.

Nomenclature
represent the amplitudes of the signal model and frequency case.For an arbitrary model the matrix jk b cannot be calculated analytically; however, it can be evaluated numerically.The calculations of the mean and standard deviations for ω parameters require for evaluating Gaussian integrals by first changing to the orthogonal variables as was done above with the amplitudes.Let j  and kj u represent the th j eigenvalue and eigenvector of the matrix jk b , respectively.Then the new the   1 k  th iteration point, it is replaced with k ω and algorithm moves uphill.If  

Figure 1 .
Figure 1.Recovering signal from noisy data produced from a single harmonic frequency signal model.

Figure 2 .
Figure 2. Recovering signals from noisy data produced from two closed harmonic frequency signal model.

Figure 3 .
Figure 3. Spectral analysis of multiple frequency signal model

Figure 4 .
Figure 4. Comparison of exact and estimate probability densities of noise in data   1   .

Figure 5 .
Figure 5. Spectral analysis of ten frequency signal model.

Figure 6 .
Figure 6.The calculated MSE of the proposed method compared with CRLB versus different SNR with a white noise.

Figure 7 .
Figure 7. CPU times versus with number of parameters and data samples.

:
a function or random variable (.;.) G : Model functions that contains sinus and cosines terms h : Projections of data onto orthonormal model functions (j th component of normalized Eigenvalues of jk  kj  : j th component of k th normalized Eigenvector of jk 

Table 1 .
).Computer simulations for a single harmonic frequency model

Table 2 .
In a similar way, we produced the same size data corrupted by the zero mean Gaussian noise with 1   .We run our Mathematica code again in the case where the deviation of noise is unknown.The results, shown in Table2, indicate that all values of the parameters within the calculated accuracy are clearly recovered.Computer simulations for two closed harmonic frequency model

Table 3 .
Computer simulations for a multiple harmonic frequency model 2

Table 4 .
Computer simulations for ten harmonic frequency signal model