Quantitative assessment of linear noise-reduction filters for spectroscopy

Linear noise-reduction filters used in spectroscopy must balance two conflicting requirements: reducing noise and preserving lineshapes.


Introduction
Reducing noise in spectra, optical or otherwise, is easy; reducing noise without compromising information is not. The challenge is particularly acute in optical spectroscopy, where the information is contained in lineshapes and the poles that give rise to features in them. While in principle noise can be reduced by taking better data, this is not always possible in practice. Given the significant advantages of working with clean spectra, noise-reduction procedures are not likely to be abandoned any time soon.
An enormous number of procedures have been developed to meet this need [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18]. A major reason for this proliferation of approaches is the absence of an adequate method of assessing them quantitatively. A common measure is the mean-square error (MSE), which is defined as the square of the difference between the filtered lineshape and the data, summed over the spectral range [19][20][21]. As this involves (nominally) small differences between two relatively large sets of relatively large numbers projected onto a single value, a direct-space (DS) calculation makes the origins of these differences difficult to identify. The situation is aggravated in practice because the lineshapes themselves are generally unknown.
Here, we show that significant insight can be obtained by taking advantage of Parseval's (the Power) Theorem to cast these calculations into reciprocal-(Fourier-) space (RS). It has long been appreciated that intelligent filtering can only be done by considering the behavior of data in RS [2,3,8,9]. We show that the same advantages are realized with performance measures as well.
Parseval's Theorem states that the sum of the absolute squares of a function defined at discrete, equally spaced points is equal to the sum of the absolute squares of its Fourier coefficients. Being mean-square quantities, both MSEs and noise satisfy the Parseval condition. When combined with the convolution theorem, the RS version of the MSE calculation reduces to a simple expression that shows not only how MSE errors arise, but also how they are apportioned between approximations and noise, both of which can be estimated accurately without knowing the lineshape itself. Taking advantage of the insight provided, we design a linear filter that outperforms the previous best, the extended-Gauss [9] or Gauss-Hermite (GH) filter [7], with a computational load that is significantly lower.
Conceptually, linear noise-reduction filters capitalize on the fact that information enters as point-to-point correlations and noise as point-to-point fluctuations. Thus in RS, information and noise are essentially separated into low- and high-order Fourier coefficients, respectively.
Nominally, the perfect linear filter multiplies all information-dominated coefficients by 1 and all noise-dominated coefficients by zero [22,23]. This rectangular, or so-called "ideal" or "brick-wall" (BW), filter is rarely used, partly because its DS Fourier transform is wide, inhibiting its use in convolution, and partly because the Gibbs oscillations [24] resulting from its abrupt cutoff in RS compromise reconstructed lineshapes. As a result, efforts have been directed nearly universally to compromises such as the binary [12], Savitzky-Golay (SG) [15], and above-mentioned GH [7] filters, to name a few. These approximate the ideal filter via the Butterworth approach [25], i.e., eliminate as many derivatives as practical in a Taylor-series expansion of the transfer function about the lowest-index coefficient $C_0$, followed by a rolloff to zero near the white-noise cutoff $C_{n_c}$. The results presented here illustrate the problem with these filters: the transfer functions mentioned above drop below 1 well before $n_c$ is reached, thereby compromising information.
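The RS picture described above, multiplying Fourier coefficients by a transfer function, can be sketched in a few lines. The grid, the Lorentzian linewidth, and the Gaussian transfer function below are illustrative assumptions, not the filters analyzed in this work:

```python
import numpy as np

# Sketch of linear filtering in reciprocal space (RS): transform the data,
# multiply the coefficients by a low-pass transfer function B(k), transform
# back. All parameter values here are illustrative.
x = np.linspace(-50.0, 50.0, 1024, endpoint=False)
gamma = 2.0
f = (gamma / np.pi) / (x**2 + gamma**2)          # unit-area Lorentzian "data"

F = np.fft.fft(f)                                # RS (Fourier) coefficients
k = np.fft.fftfreq(x.size, d=x[1] - x[0]) * 2.0 * np.pi

B = np.exp(-(k / 1.0) ** 2)                      # illustrative low-pass transfer function
f_filtered = np.fft.ifft(B * F).real             # multiply coefficients, return to DS
```

Because $B(0) = 1$, the $k = 0$ coefficient, and hence the integrated area of the line, is preserved exactly; the low-order coefficients that carry the lineshape are left nearly untouched while the high-order ones are suppressed.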
This is particularly damaging because data coefficients tend to decrease either exponentially or as Gaussians up to the white-noise cutoff. From this perspective, it may not be surprising to learn that the variation of the brick-wall filter that we present here, which is obtained by working from the high-index end, is better than any other linear filter proposed to date. Further, a parallel calculation that also takes advantage of Parseval's Theorem shows that the pass-through rms noise is only about 10% greater than that of the running-average (RA) filter, which, being rectangular in DS, has the lowest noise of any filter but severely distorts lineshapes. Our purposes here are to develop the theory and to provide examples.

Theory
We start by establishing the scope of the work. Continuum and discrete approaches represent two different but related methods of analysis. The continuum approach is more convenient mathematically, since it deals with integrations instead of summations, and leads to simpler analytic expressions. It also avoids complications due to the constraint of periodicity. However, data are discrete, and with procedures [26,27] available for suppressing endpoint-discontinuity artifacts that can mask information in standard Fourier analysis, the digital approach is more relevant for applications. Our objectives here, however, are to assess methods. Consequently, we work primarily with the continuum, leaving discrete analysis for a following paper. Continuum analysis also allows direct comparisons between the two "extreme" cases, the RA and BW filters, because their continuum versions allow their RS ranges to be adjusted so their DS cutoffs are identical. This is necessary for quantitative comparisons.
Proceeding, we assume that the data are represented by a square-integrable continuous function $f(x)$, that is, $\int dx\,|f(x)|^2$ is finite. Next, we define the Fourier transform $F(k)$ of $f(x)$ in the usual manner as
$$F(k) = \int_{-\infty}^{\infty} dx\, e^{-ikx} f(x). \qquad (1)$$
Parseval's Theorem, discussed below, ensures that $F(k)$ is square integrable if $f(x)$ is square integrable. Now using Eq. (1), it follows that
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, e^{ikx} F(k). \qquad (2)$$
Next, define the filtered (noise-averaged) function $\bar f(x)$ as
$$\bar f(x) = \int_{-\infty}^{\infty} dx'\, b(x - x')\, f(x'), \qquad (5)$$
where the filter function $b(x)$ is unitary, that is,
$$\int_{-\infty}^{\infty} dx\, b(x) = 1. \qquad (6)$$
Given Eq. (6), we define the corresponding transfer function $B(k)$ of $b(x)$ as
$$B(k) = \int_{-\infty}^{\infty} dx\, e^{-ikx} b(x), \qquad (7)$$
which ensures that $B(0) = 1$, as required for a low-pass filter. The respective inverse transform is
$$b(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, e^{ikx} B(k). \qquad (8)$$
The convolution theorem is needed next. Substituting the respective Fourier transforms into Eq. (5), we obtain
$$\bar F(k) = B(k)\, F(k). \qquad (10)$$
Thus, as is well known, the Fourier coefficients of the result of a convolution are the products of the Fourier coefficients of the convolving functions.
Finally, consider the integral of $|f(x)|^2$. Using Eqs. (1) and (2),
$$\int_{-\infty}^{\infty} dx\, |f(x)|^2 = \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, |F(k)|^2. \qquad (11)$$
This is Parseval's Theorem.
We now consider the mean-square error (MSE). This is defined as
$$\mathrm{MSE}^2 = \int_{-\infty}^{\infty} dx\, \left|\bar f(x) - f(x)\right|^2.$$
With appropriate substitutions,
$$\mathrm{MSE}^2 = \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, \left|\bar F(k) - F(k)\right|^2 = \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, \left|1 - B(k)\right|^2 \left|F(k)\right|^2. \qquad (14e)$$
The first equality uses Parseval's Theorem, and the second the convolution theorem.
Equation (14e) is one of the main results of this work. It shows that $\mathrm{MSE}^2$ is determined by a combination of the properties of both data and filter. With $B(k) \approx 1$ at small $k$ for a well-designed low-pass filter, one might expect that the dominant contribution to $\mathrm{MSE}^2$ comes from the high-$k$ end as $B(k)$ rolls off to zero. While high-$k$ behavior is certainly a factor, for typical data $F(k)$ decreases approximately exponentially or even faster with $k$, so unless $B(k)$ is accurately 1 for low $k$, this contribution to $\mathrm{MSE}^2$ can be surprisingly large. Because $F(k)$ is directly calculated from the data, accurate estimates of $\mathrm{MSE}^2$, along with its dominant contributions, can be obtained even if very little is known about the DS lineshape itself.
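Equation (14e) is easy to verify on a discrete grid, where Parseval's Theorem holds exactly for the FFT. The Lorentzian data and Gaussian transfer function below are illustrative assumptions:

```python
import numpy as np

# Numerical check that the DS and RS forms of the MSE agree (Eq. (14e)).
# For numpy's FFT convention, Parseval reads sum|g|^2 = (1/N) * sum|G|^2.
N = 2048
x = np.linspace(-100.0, 100.0, N, endpoint=False)
gamma = 2.0
f = (gamma / np.pi) / (x**2 + gamma**2)          # illustrative Lorentzian data

k = np.fft.fftfreq(N, d=x[1] - x[0]) * 2.0 * np.pi
F = np.fft.fft(f)
B = np.exp(-(k / 0.8) ** 2)                      # illustrative low-pass filter

f_bar = np.fft.ifft(B * F).real
mse_ds = np.sum((f_bar - f) ** 2)                            # direct space
mse_rs = np.sum(np.abs(1.0 - B) ** 2 * np.abs(F) ** 2) / N   # reciprocal space
```

The RS form makes the origin of the error visible coefficient by coefficient, which is the point of Eq. (14e).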
We now consider white noise, which we assume enters as fluctuations $\delta f(x)$ of $f(x)$. Although the infinities that appear can be managed, it is easier to avoid them completely by performing the derivation in discrete form, then converting the result to the continuum at the end. Noise is introduced by replacing $f_j$ with $f_j + \delta f_j$ at each point $x_j$. Expanding the filtered result, the lowest-order terms are ignored because they are of no interest in the following. Assuming that the fluctuations are random, the cross terms can be ignored as well because they vanish on ensemble averaging. In the remaining double sum only the diagonal terms $j = j'$ survive, and there is only one per $j$, so the sum over $j'$ can be eliminated. The same argument applied to the right side leaves only $j'' = j'''$. We are therefore left with
$$\langle \delta \bar f^2 \rangle = \langle \delta f^2 \rangle \sum_j b_j^2,$$
where in the final step we have assumed that the $\langle \delta f_j^2 \rangle$ are independent of $j$, that is, that the same uncertainty applies to all points. As expected with the assumption that the noise is the same everywhere, the result is independent of the data. This would obviously not be the case if the noise were a function of $x$.
Multiplying the sum by the length increment $\Delta x$ between points and passing to the continuum yields
$$\langle \delta \bar f^2 \rangle = \langle \delta f^2 \rangle\, \Delta x \int_{-\infty}^{\infty} dx\, b^2(x). \qquad (20)$$
By Eq. (11) we can also write this result as
$$\langle \delta \bar f^2 \rangle = \langle \delta f^2 \rangle\, \frac{\Delta x}{2\pi} \int_{-\infty}^{\infty} dk\, |B(k)|^2, \qquad (21)$$
casting the calculation into RS.
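The discrete noise result can be checked by Monte Carlo. The running-average window and noise level below are arbitrary illustrative choices:

```python
import numpy as np

# Monte-Carlo check of the white-noise result: the variance of filtered
# white noise equals sigma**2 * sum(b_j**2) for a unitary filter b.
# A running average is used as the illustrative filter.
rng = np.random.default_rng(0)
sigma = 0.5
M = 5                                            # half-width: 2*M + 1 = 11 points
b = np.full(2 * M + 1, 1.0 / (2 * M + 1))        # unitary (sums to 1) RA filter

noise = rng.normal(0.0, sigma, size=400_000)
filtered = np.convolve(noise, b, mode="valid")

predicted = sigma**2 * np.sum(b**2)              # = sigma**2 / (2*M + 1) here
measured = filtered.var()
```

As the text notes, the predicted variance is independent of the data; only the filter coefficients enter.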
Equations (20) and (21) show that white noise enters Eq. (14e) as an additive constant:
$$\langle \varepsilon^2 \rangle = \frac{1}{2\pi} \int_{-\infty}^{\infty} dk \left[\, \left|1 - B(k)\right|^2 |F(k)|^2 + \langle \delta f^2 \rangle\, \Delta x\, |B(k)|^2 \right], \qquad (22)$$
where for the moment we view $F(k)$ as representing the information. Thus over any range where $B(k) = 1$, noise as well as information is passed unattenuated. Because $|F(k)|^2$ typically decreases exponentially or even faster with increasing $k$, Eq. (22) shows that noise constitutes a much smaller fraction of low-index coefficients than of high-index coefficients.
However, the white-noise contribution continues unabated, so white noise must eventually dominate. Thus, to prevent $\mathrm{MSE}^2$ from increasing without limit, a cutoff is essential. We define this noise cutoff $k_N$ by
$$|F(k_N)|^2 = \langle \delta f^2 \rangle\, \Delta x. \qquad (23)$$
That is, $k_N$ is the value of $k$ where the absolute square $|F(k)|^2$ of the information equals that of the noise. In filter terms, the cutoff of $B(k)$ should be placed near $k_N$, noting that $B(0) = 1$. With this definition nearly all of the information is included and the avoidable noise contribution is minimized. Once $k_N$ is established, the cutoffs of different filters can be adjusted so that their DS cutoffs are identical, using as reference the RA filter, as noted above. For example, this determines the RS cutoff of the BW filter discussed below.
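A minimal sketch of locating $k_N$ from noisy data follows. The line parameters, noise level, and median-based floor estimator are all illustrative assumptions:

```python
import numpy as np

# Sketch of estimating the noise cutoff k_N: the index where the power
# spectrum of the data first falls to the white-noise floor, which is
# itself estimated from the noise-dominated high-k tail.
rng = np.random.default_rng(1)
N = 4096
x = np.linspace(-200.0, 200.0, N, endpoint=False)
gamma = 5.0
f = (gamma / np.pi) / (x**2 + gamma**2)
sigma = 1e-4
data = f + rng.normal(0.0, sigma, size=N)

P = np.abs(np.fft.rfft(data)) ** 2          # one-sided power spectrum
floor = np.median(P[N // 4:])               # floor estimate from the tail

# First index where the lightly smoothed spectrum drops to the floor:
smooth = np.convolve(P, np.ones(9) / 9, mode="same")
k_N = int(np.argmax(smooth <= floor))
```

No model of the DS lineshape is needed; only the measured coefficients and the noise floor enter, as the text emphasizes.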
Finally, cutoffs generate Gibbs oscillations [24]. This is especially true for high-performance linear filters. Although not a direct measure of performance, these artifacts can be estimated.
Given Eqs. (2) and (5), it is seen that the "missing" contribution is given by
$$\delta f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, \left[1 - B(k)\right] F(k)\, e^{ikx}.$$
These lineshape errors can be eliminated, or at least greatly reduced, by replacing the attenuated coefficients with their unattenuated equivalents, obtained by extrapolating trends established by the low-order coefficients into the white-noise region. This can be done either by model-dependent curvefitting or by model-independent maximum-entropy analysis [2]. That both procedures are nonlinear is easily proven: neither can be represented as a DS convolution. Because Gibbs oscillations are relatively easy to eliminate this way, we consider them to be a non-issue. As a result we conclude that in any given situation a combination of linear and nonlinear filtering will outperform linear filtering alone. This is the subject of a subsequent paper.

Running-average and brick-wall filters
We now consider specific linear filters, starting with the RA and BW filters, which are rectangular in DS and RS, respectively, and hence represent extreme cases. To compare filters, Eq. (14e) shows that data are necessary. We represent these as a single Lorentzian (lifetime-broadened) absorption line of full width at half maximum $2\Gamma$, which yields relatively simple but relevant analytic solutions.
Results are expressed as ratios, using the RA filter as reference. The RA and BW filters are defined as
$$b_{\mathrm{RA}}(x) = \frac{1}{2\Delta} \;\; \mathrm{for}\; |x| \le \Delta, \quad 0 \;\mathrm{otherwise};$$
$$B_{\mathrm{BW}}(k) = 1 \;\; \mathrm{for}\; |k| \le k_c, \quad 0 \;\mathrm{otherwise}.$$
Their respective Fourier transforms are
$$B_{\mathrm{RA}}(k) = \frac{\sin(k\Delta)}{k\Delta}; \qquad b_{\mathrm{BW}}(x) = \frac{\sin(k_c x)}{\pi x}.$$
To place the two filters on a common footing, we fix the DS cutoff by requiring that $b(\Delta)$ be half the value of $b(0)$. For the BW filter this gives $k_c \Delta \approx 1.90$. The DS lineshapes of the two limiting filters are shown in Fig. 1, and their transfer functions in Fig. 2. We next consider figures of merit, starting with noise, which for uniform noise per data point is independent of the data. Substituting the above expressions into Eqs. (20) and (21) shows that, with the same DS cutoffs, the BW filter exhibits only about 10% more rms noise than the RA filter. This is surprisingly low, given that $b(x)$ for the BW filter has a relatively long range in DS. To compare MSEs, we assume as noted above that the data are represented by a Lorentzian lineshape normalized to unit area:
$$f(x) = \frac{\Gamma/\pi}{x^2 + \Gamma^2}, \qquad F(k) = e^{-\Gamma |k|}.$$
For the BW filter, Eq. (14e) then yields
$$\mathrm{MSE}^2_{\mathrm{BW}} = \frac{e^{-2\Gamma k_c}}{2\pi\Gamma}. \qquad (37)$$
For $\Gamma \to 0$, that is, in the limit where the Lorentzian effectively reduces to a delta function, both the RA and BW expressions diverge as $1/\Gamma$, a consequence of peak values becoming infinite even though the integrated areas remain finite. Accordingly, it is more instructive to consider the ratio of the two MSEs. Figure 3 shows that the RA filter is initially better, but near $\Gamma \approx 1$ (in the normalized units of the figure) the BW filter starts to dominate, after which the difference increases exponentially. Figure 4 shows a more typical range encountered in practice. For these values of $\Gamma$ the exponential dependence has taken over completely, and the BW filter is better by orders of magnitude. Thus, of the two extreme cases, the BW result, Eq. (37), is the useful benchmark.
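These comparisons can be reproduced numerically. The bisection for the cutoff-matching condition and the parameter choices ($\Delta = 1$, $\Gamma = 2\Delta$) are illustrative:

```python
import numpy as np

# Numerical sketch of the RA/BW comparison for the unit-area Lorentzian,
# |F(k)|^2 = exp(-2*Gamma*|k|). Delta sets the scale; Gamma is illustrative.
delta = 1.0

# Match DS cutoffs: find u = k_c * Delta with sin(u)/u = 1/2 by bisection.
u_lo, u_hi = 1.0, 2.0
for _ in range(60):
    u = 0.5 * (u_lo + u_hi)
    if np.sin(u) / u > 0.5:
        u_lo = u
    else:
        u_hi = u
k_c = u / delta                                  # approximately 1.90 / Delta

# rms-noise ratio BW/RA from Eqs. (20)-(21): the integrals of b**2 are
# k_c/pi (BW) and 1/(2*Delta) (RA).
noise_ratio = np.sqrt((k_c / np.pi) / (1.0 / (2.0 * delta)))

# MSE of each filter from Eq. (14e), one-sided integral over k >= 0.
gamma = 2.0 * delta
k = np.linspace(0.0, 50.0, 200_001)
dk = k[1] - k[0]
F2 = np.exp(-2.0 * gamma * k)
kd = np.where(k == 0.0, 1.0, k * delta)
B_ra = np.where(k == 0.0, 1.0, np.sin(k * delta) / kd)
mse_ra = np.sum((1.0 - B_ra) ** 2 * F2) * dk / np.pi
mse_bw = np.sum(F2[k > k_c]) * dk / np.pi        # |1 - B| = 1 above k_c only
```

For this linewidth the BW filter already has the smaller MSE, at the cost of a roughly 10% rms-noise penalty, consistent with the discussion above.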

 
Fig. 3. Ratio of the RA and BW mean-square errors: beyond the crossover, the BW filter is superior. Fig. 4. As Fig. 3, but for a more typical range of $\Gamma$ used in practice. Here, the BW filter is superior by orders of magnitude, with only a 10% penalty in rms noise.

Gauss-Hermite and cosine-terminated filters
In this section we consider the Gauss-Hermite (GH) [7] or extended-Gauss (EG) [9] filter and the cosine-terminated (CT) filter as alternatives to the BW filter. The CT filter is developed here. Both offer better MSE performance than the BW filter, but they approach filter design from opposite directions. The GH filter, introduced by Hoffman et al. [7], follows the standard Butterworth approach. The CT filter is defined by its transfer function, in which $k_1$ establishes the onset of the cutoff, fixed for comparison purposes by the cutoff-matching requirement described above; $a$ sets the steepness of the cutoff, and $\Delta k$ its spread. The parameter $a$ ranges from $1/2$ to infinity.
The value $a = 1/2$ yields the Tukey filter [28,31-33], characterized by a symmetric half-cycle cosine termination approximating sigmoid behavior. At $k_1 = 0$ and $a = 1$, the result approximates a Welch window, which is an inverted parabola [30]. As $a$ approaches infinity, $k_2$ becomes equal to $k_1$, and the result is the BW filter. The general characteristics of the CT cutoffs are shown in Figs. 7a and 7b. Figure 7a illustrates the cutoff as a function of the width $\Delta k$ for $a = 5$, and Fig. 7b as a function of $a$ for $\Delta k = 0.5$. Figure 7a also shows a cutoff for the EG filter with $M = 100$. Generally speaking, we find the best performance with values of $a$ of the order of 5 and $\Delta k$ of the order of 0.5. As can be seen, for these values the filter retains $B(k) = 1$ until very near cutoff and, as the cutoff begins, favors values of $B(k)$ as close to 1 as possible before dropping rapidly to zero afterward. This functional form capitalizes on the exponential decrease of $|F(k)|^2$ with increasing $k$.
As noted above, the cutoff-matching condition defines $k_2$ as a function of $a$, $\Delta k$, and $k_1$, as given by Eq. (42). Numerical evaluation of the resulting expressions yields the MSE ratios for the CT filter for $a = 5$ and various $\Delta k$ shown in Fig. 8. The CT and GH filters are compared for various $\Delta k$ and $M$ in Fig. 9, with $M$ selected to give nearly similar functional dependences on $\Gamma$. The results of the two filters are comparable. The Tukey filter produces results that are very close to those of the GH filter, as can be expected, since for GH filters of large $M$ the two filters have similar characteristics. The main difference between the GH and CT filters is computational speed: for comparable values of $\Delta k$ and $M$, the CT timings are better by at least two orders of magnitude.
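As an illustration, the sketch below implements the Tukey ($a = 1/2$) member of the family, whose half-cycle cosine taper is standard; the onset, spread, linewidth, and noise floor `eta` are assumed values, and the error functional follows Eq. (22):

```python
import numpy as np

# Illustrative cosine-terminated cutoff: the Tukey (a = 1/2) member,
# flat to k1 and tapering to zero at k2 with a half-cycle cosine.
def b_tukey(k, k1, k2):
    """Transfer function: 1 below k1, cosine taper to 0 at k2, 0 beyond."""
    t = np.clip((k - k1) / (k2 - k1), 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))

gamma = 1.0
eta = np.exp(-10.0)                      # assumed noise floor, = |F(5)|**2 here
k = np.linspace(0.0, 40.0, 800_001)
dk = k[1] - k[0]
F2 = np.exp(-2.0 * gamma * k)            # Lorentzian |F(k)|**2

B_ct = b_tukey(k, 4.0, 6.0)
B_bw = (k <= 5.0).astype(float)          # brick wall at the taper midpoint

def total_error(B):
    # Lineshape error plus passed noise, per Eq. (22) (one-sided integral).
    return np.sum((1.0 - B) ** 2 * F2 + eta * B ** 2) * dk / np.pi

err_ct = total_error(B_ct)
err_bw = total_error(B_bw)
```

With these assumed parameters the tapered cutoff yields a smaller total error than the abrupt one, in line with the behavior described above; the text's best-performing CT filters use a steeper taper ($a \approx 5$) than this Tukey special case.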

Discussion
As noted above, our calculations are done in the continuum limit to obtain relatively simple analytic expressions and hence to investigate linear filters without the additional complications introduced by discrete analysis. While summations can often be done in closed form, these are significantly more complicated than the continuum equivalents given above, even though they reduce properly in the continuum limit. Digital Fourier analysis also brings in the additional constraint of periodicity. This typically results in endpoint-discontinuity artifacts, although methods are now available to eliminate them [26,27].
Many of the expressions given above can be converted to their discrete equivalents for $2N + 1$ data points by replacing the continuous variables with their discrete counterparts. Equation (49c) is also the maximum-entropy result for the spectrum of a single Lorentzian oscillator [34]. In the limit where the arguments are small enough that second-order expansions are applicable, Eq. (49b) reduces to the Lorentzian form, Eq. (49d). Thus the periodic pseudo-Lorentzian of Eq. (49b) reduces to the Lorentzian in the limit of large $N$ and small arguments. Some filtering algorithms are digital only, for example the binary algorithm [12] and the entire set of weighting coefficients of Savitzky and Golay [15] and subsequent contributors [17]. The major advantage of these approaches is that they are finite and, in the SG case, involve mainly single-digit or low-double-digit numbers, an additional advantage when calculations had to be done manually. As calculations are now done by computer, this is less of a consideration. In our experience, the GH and CT filters outperform the digital versions of the SG filters in all respects. This aspect will be covered elsewhere.
As noted above, owing to their relatively abrupt cutoffs, all high-performance linear filters generate Gibbs or cutoff oscillations when their RS coefficients are transformed to DS for use as convolution filters. An example is given in Fig. 10. From Eq. (22), the amplitude and period of these oscillations are basically determined by the last Fourier coefficient retained in the filtering process. As is also seen, making cutoffs less abrupt improves the DS convergence, but the tradeoff between rapid convergence and low MSE values remains. Gibbs oscillations also appear in reconstructions of filtered lineshapes. An example is given in Fig. 11. As the dominant contribution of high-order coefficients occurs when they add coherently, it is not surprising that the greatest effects are seen at extrema such as peak heights.
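The effect of cutoff abruptness on these oscillations can be demonstrated directly. The grid, cutoff, and taper width below are illustrative assumptions:

```python
import numpy as np

# Demonstration that an abrupt RS cutoff produces ringing (Gibbs
# oscillations) in DS, and that a cosine taper suppresses it.
N = 4096
x = np.linspace(-200.0, 200.0, N, endpoint=False)
gamma = 5.0
f = (gamma / np.pi) / (x**2 + gamma**2)

F = np.fft.fft(f)
k = np.fft.fftfreq(N, d=x[1] - x[0]) * 2.0 * np.pi

B_bw = (np.abs(k) <= 1.0).astype(float)              # abrupt (brick-wall) cutoff
t = np.clip((np.abs(k) - 0.8) / 0.4, 0.0, 1.0)       # cosine taper over [0.8, 1.2]
B_ct = 0.5 * (1.0 + np.cos(np.pi * t))

f_bw = np.fft.ifft(B_bw * F).real
f_ct = np.fft.ifft(B_ct * F).real

# Measure ringing in the far wings, where the Lorentzian is nearly flat:
wing = np.abs(x) > 100.0
ripple_bw = np.ptp(f_bw[wing] - f[wing])
ripple_ct = np.ptp(f_ct[wing] - f[wing])
```

The tapered filter's wing ripple is far smaller, at the cost of attenuating coefficients just below the nominal cutoff, which is the convergence/MSE tradeoff discussed above.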
These artifacts are therefore a measure of the effectiveness of a filter. Again, they can be reduced significantly or even eliminated by nonlinear methods.

Conclusion
It has long been appreciated that intelligent filtering requires that the data be examined in RS [2,3,8,9]. We find that similar advantages result if the measures that assess the filters, mean-square error and noise, are also considered in RS. We accomplish this by taking advantage of Parseval's Theorem to express these assessments in RS. The resulting perspective leads to insights that cannot be achieved from DS considerations alone, and opens up additional possibilities in linear filtering.
Using this approach, we quantitatively analyze various filters in terms of their performance using a representative Lorentzian line as data. From this perspective the RS-rectangular ("ideal" or "brick-wall") filter is significantly better than its reputation would suggest, a consequence of rigorously preserving low-order coefficients where the major fraction of spectral information resides. The BW filter can be improved by modifying its abrupt cutoff. The best previous filter that we have found to do this is the Gauss-Hermite filter, although to achieve this goal higher orders must be used than recommended in the original publication [7]. By taking advantage of the information provided by the RS formulation, we develop a cosine-terminated filter that outperforms the GH filter with a significantly reduced computation load.
We have not addressed methods that process data as well as remove noise. Processing includes operations such as interpolation, scale change, scale inversion, and differentiation. With minor modifications, the approach developed here can be used to assess performance of algorithms that do multiple operations as well.