A Semi-Blind Source Separation Method for Differential Optical Absorption Spectroscopy of Atmospheric Gas Mixtures

Differential optical absorption spectroscopy (DOAS) is a powerful tool for detecting and quantifying trace gases in atmospheric chemistry \cite{Platt_Stutz08}. DOAS spectra consist of a linear combination of complex multi-peak multi-scale structures. Most DOAS analysis routines in use today are based on least squares techniques, for example, the approach developed in the 1970s uses polynomial fits to remove a slowly varying background, and known reference spectra to retrieve the identity and concentrations of reference gases. An open problem is to identify unknown gases in the fitting residuals for complex atmospheric mixtures. In this work, we develop a novel three step semi-blind source separation method. The first step uses a multi-resolution analysis to remove the slow-varying and fast-varying components in the DOAS spectral data matrix $X$. The second step decomposes the preprocessed data $\hat{X}$ in the first step into a linear combination of the reference spectra plus a remainder, or $\hat{X} = A\,S + R$, where columns of matrix $A$ are known reference spectra, and the matrix $S$ contains the unknown non-negative coefficients that are proportional to concentration. The second step is realized by a convex minimization problem $S = \mathrm{arg} \min \mathrm{norm}\,(\hat{X} - A\,S)$, where the norm is a hybrid $\ell_1/\ell_2$ norm (Huber estimator) that helps to maintain the non-negativity of $S$. The third step performs a blind independent component analysis of the remainder matrix $R$ to extract remnant gas components. We first illustrate the proposed method in processing a set of DOAS experimental data by a satisfactory blind extraction of an a-priori unknown trace gas (ozone) from the remainder matrix. Numerical results also show that the method can identify multiple trace gases from the residuals.


Introduction
Trace gases play an important role in climate change and air quality of the Earth's atmosphere. Spectroscopic techniques are widely used today for measurements of many trace species, and have evolved over the past century from the first use of the sun as a light source to identify atmospheric trace gases. Many different light sources (e.g., infrared and UV-visible lamps, lasers, and natural sources such as the sun) are now conventionally used to identify light-absorbing species as well as determine their concentrations using Lambert-Beer's law, where I 0 (λ) is the initial intensity of light, I(λ) is its intensity after traveling through a sample of path length, L, with concentration, ρ. Each species has its characteristic absorption cross section, σ(λ), a measure of its ability to absorb light that varies with wavelength. The use of (1.1) is convenient for multi-component samples in laboratory spectrometers, but it is more difficult to determine the value of I 0 (λ) in the atmosphere over a large wavelength range.
A new method, differential optical absorption spectroscopy (DOAS), was introduced in the 1970s [14,15,16,17] to analyze atmospheric trace gas concentrations. DOAS analysis separates the trace gas absorptions, which typically vary quickly with wavelength, from features that vary slowly with wavelength, e.g., light scattering processes by molecules and aerosols. Differential cross sections are then defined relative to this new broad background in place of the true I 0 (λ). Several important trace gases were measured for the first time with DOAS, e.g., HONO, NO 3 , BrO, ClO in the troposphere, and OClO and BrO in the stratosphere. A large number of other molecules absorb in the UV and the visible wavelength region and most aromatic hydrocarbons can also be detected. An advantage of DOAS is the ability to measure absolute trace gas concentrations in situ. DOAS is therefore especially useful for measuring highly reactive species such as the free radicals OH, NO 3 , or BrO, and it provides a powerful tool for studying emissions, transformation and transport of chemicals throughout the troposphere and stratosphere. It can also help to understand the influence of atmospheric chemistry on climate and air quality. A detailed description of DOAS can be found elsewhere [18].
In general, DOAS spectra contain overlapping absorption structures which consist of complex multiple scales and peaks. They must be separated by the analysis routine to retrieve the concentrations of the trace gases. Least squares techniques are most often used for analysis of DOAS spectra, with the use of high pass filters to fit or separate out the slowly varying components. For example, the approach described in [18] applies a polynomial fit to remove the broad (slow-varying) spectral features, and known reference spectra to retrieve the concentrations of reference gases. However, the existing DOAS approaches have two limitations: 1) the condition of least squares (that errors are normally distributed ) is often violated. This suggests that a different norm other than ℓ 2 (least squares) norm should be used; 2) the fitting residuals for atmospheric samples are in most cases not pure noise due to imperfect references, atmospheric turbulence, instrument effects, and unknown trace gases. Among other interesting problems, the identification of gas structures in the fitting residuals is of great importance. The method in this paper has been developed to address these is-sues, in the hope of providing a tool for atmospheric chemists to analyze the residuals for possible hidden trace gases. The method is designed to deal with the following three major challenges. First, DOAS spectra are complex multi-scale multi-peak structural data containing slow-varying features, structured signals due to the trace gases, and instrumental noise. Hence a multi-resolution analysis tool is needed for scale decomposition. Second, the identification of gases from the residuals is actually a problem of blind source separation (BSS) as both the trace gases (including their numbers) and mixing process are not known. A major problem is to find a working assumption on the source (hidden trace gas) signals and effective BSS algorithms. Third, the new objective function for data fitting should not only overcome the limitations of least squares fitting, but also help to maintain the non-negativity. To tackle these problems, we have made an initial attempt of developing a semi-blind approach which contains three steps. The first step uses multi-resolution analysis to remove the very slow (e.g. scattering) and very fast components (noise) in the DOAS spectral data matrix X. The second step decomposes the preprocessed dataX in the first step into a linear combination of the reference spectra plus a remainder, orX = A S + R, where columns of matrix A are known reference spectra, the matrix S contains the unknown non-negative coefficients. The second step is carried out by solving a convex minimization problem S = arg min norm (X − A S), where the norm is a hybrid ℓ 1 /ℓ 2 norm that helps to maintain the non-negativity of S. The third step performs a blind independent component analysis of the remainder matrix R to extract remnant gas components. Our method can be useful for separating unknown sources from the residuals after any known reference spectra have been first deployed to fit the data.
The paper is organized as follows. In section 2, we review the essentials of DOAS and the existing approach, then introduce our method. In section 3, we illustrate the proposed method in processing a set of DOAS experimental data, and show satisfactory numerical results. Concluding remarks are in section 4.

DOAS and Fitting Methods
A typical experimental setup for a DOAS instrument consists of a continuous light source, e.g., a Xe-arc lamp, a light-absorbing sample (the atmosphere or gases in a chamber), a grating spectrometer, and an optical detector. It is also possible to use the light from the sun or moon, or scattered sun light as light sources [14,15]. The typical length of the light path in the atmosphere ranges from several hundred meters to many kilometers and < 100 m in laboratory DOAS experiments. The light of intensity I 0 (λ) passes through the sample, is typically dispersed by a grating spectrometer and is measured by a detector. During its way through the sample the light undergoes extinction due to absorption processes by trace gases and scattering by air molecules and aerosol particles. In the atmosphere, the intensity I(λ) at the end of the light path is given by Lambert-Beer's law, where σ ABS j is the absorption cross section of a trace gas j, ρ j is its number density. L is the length of the light path. The Rayleigh extinction by gases and Mie extinction by aerosols are described here by ε R and ε M . N(λ) is the measurement noise. The basic idea of DOAS is the separation of the cross section σ ABS j represents broad spectral features and the differential cross section σ ′ j represents narrow spectral structures that are of interest for identification and quantification of the trace gases. If one considers only σ ′ j , interferences with Rayleigh and Mie extinction are avoided. The mathematical description of this process is a convolution of I(λ) with the instrument function H of the spectrometer, describes the broad spectral structures due to the characteristics of the light source I 0 , the Rayleigh and Mie's extinction, and the broad absorption by trace gases. I ′ 0 is a slow-varying function of wavelength, so I * (λ) can be approximated by where x(λ) = ln I * (λ) and B ′ (λ) = ln I ′ 0 (λ) represents the broad spectral features. The noise N ′ (λ) = ln N(λ). In the experiment, the wavelength range is mapped onto p discrete pixels of the detector. The sampled data points form p-dimensional column vectors of a data matrix X. Suppose there are n measurements, then X = [x 1 , x 2 , . . . , x n ] ∈ R p×n whose column vectors are recorded DOAS data points. Equation (2.2) in matrix form is: where the columns of matrix A correspond to the reference spectra of the known trace gases; the matrix S contains non-negative coefficients; the matrix B includes the slowvarying components, and the matrix N contains the noise components. Most DOAS approaches use the least squares methods to calculate S due to its computational simplicity. To deal with the problem that the data contain both broad and narrow spectral features, a high pass filter is needed to remove the broad spectral features. It is common to use polynomials to model and filter out the slowly varying parts from the narrow trace gas absorption. Equation (2.3) is written as follows in [18].
where the polynomial P models the broad spectral structures. Given the order of polynomial and the known reference spectra, (2.4) can be solved with a least squares method. The polynomial fitting however has the following drawbacks: (1) the order of the polynomial is determined empirically and different orders might be used for different data; 2) the non-negativity of the concentration is not guaranteed during the fitting. An open problem after the fitting is how to identify and extract trace gases from the fitting residuals besides the noise. To address these issues, we propose a three step method in the next section.

Proposed Semi-Blind Source Separation Method
DOAS data can cover a range of scales and contain high frequency (<1 nm) artifact structures, for example due to pixel-to-pixel variability in the detector, while the reference spectra of the trace gases contain fewer peaks and peak widths on the order of several nm. Hence it is helpful to remove the fast varying artifacts from the spectra data by multi-resolution analysis. In addition, the broad features (slow-varying parts) in the data need to be eliminated in order to fit the reference spectra of the known trace gases. We propose to use the empirical mode decomposition (EMD) to extract these components. The detailed description of EMD can be found in [10]. The EMD method does not assume anything about the data, contrary to Fourier methods where data is assumed linear and stationary. EMD handles also non-stationary and nonlinear data.

Multi-Resolution Analysis
The concept of EMD has been developed rapidly in many areas of science and engineering since Huang et al. [10] invented EMD. Its key feature is to decompose a signal into so-called intrinsic mode function (IMF). The essential step extracting an IMF is to identify an oscillation embedded in a signal from local time scale. Considering a signal s(t) between two consecutive local extrema (say, two minima at times t 1 and t 2 ), we can heuristically define a (local) high frequency part where d(t) corresponds to the oscillation terminating at the two minima and passing through the maximum in between. For the picture to be complete, we also identify the corresponding (local) low-frequency part, or local trend, m(t) so that we have Assuming that this is done in some proper way for all the oscillations in the entire signal, we get an intrinsic mode function as well as a residual consisting of all local trends. The procedure can then be repeated on the residual, and constitutive components of a signal can be iteratively extracted.
The EMD method decomposes the DOAS data into a finite number of components of different frequencies. The advantage of EMD is that it is completely data-driven (no need to specify a parameter such as the order of a fitting polynomial), fast and automatic. Fig. 1 shows a typical DOAS spectrum of a trace gas mixture containing HONO, whose reference spectrum is also shown in the bottom panel. Fig. 2 shows the fast and slow components extracted from the DOAS data in Fig. 1. The EMD preprocessed (high-passed) dataX satisfies the following model where the columns of matrix A are the reference spectra of the known trace gases, and those of S matrix contains their concentrations, and R is the fitting residual which might contain the instrument noise, hidden trace gas structures, etc. For the estimation of the concentration matrix S, we minimize the following constrained objective function: min for a proper choice of the norm.

Huber Estimator and Robust Data Fitting
There are many kinds of norms available, e.g., ℓ 2 (least squares), ℓ 1 (least absolute deviations). The regular least squares method (ignoring the non-negative constraint on S) is the conventional choice, if the unknown noise N is assumed to be Gaussian. However, it is rather sensitive to the outliers in the data, even one outlier can drastically change the estimation. The least absolute deviations (ℓ 1 norm) is more robust to outliers in the data, however it is less effective if the peaks in N are not isolated (or sparse). We find that a hybrid ℓ 2 and ℓ 1 norm (ℓ 2 on small peaks and ℓ 1 on large peaks), or a Huber estimator [11], is able to resist the influence of outliers, and maintain non-negativity of S for our data fitting task. The Huber norm is both regular and convex. The corresponding nonlinear function H = H(x) is a parabola (ℓ 2 ) in the vicinity of zero, and increases linearly (ℓ 1 ) at |x| > k for any positive constant k. More precisely, The non-negativity of S under the Huber norm indicates that the choice fits the empirical distribution of the noise N arising from detection and photon statistics [18].    to each observation; the weights for the Huber estimator decline when |x| > k (see the weight functions in Fig. 3). The value k for the Huber estimator is called a tuning constant; smaller values of k produce more resistance to outliers, but at the expense of lower efficiency when the errors are normally distributed. The tuning constant is generally selected to yield high efficiency in the normal case; in particular, k = 1.345σ (where σ is the standard deviation of the errors). The minimization of Huber's objective can be achieved by the method of iterative re-weighted least squares. For the DOAS data in our numerical experiments, we found that the Huber's estimator produced non-negative solutions S without explicit enforcement of non-negativity constraint. The theoretical underpinning is under further study. In the residual of Huber estimation, there might be spectral structures (one or many) of trace gases buried in noise, or just random noise. In either case, we decompose the residuals in a blind fashion due to the lack of the knowledge of the hidden trace gases. The source signal assumption required for the decomposition is that the spectra of different trace gases are statistically independent (orthogonal). This appears to be a reasonable working assumption for many trace gases. Independent component analysis (ICA) can now be readily applied.

Independent Component Analysis
ICA is a useful and generic tool for solving blind source separation problems (BSS), which arise when one attempts to recover source signals from their mixtures without knowing the mixing process [5,6]. ICA finds the independent components in the mixtures by maximizing the statistical independence (minimizing mutual information) of the estimated components. Mathematically, given the mixture matrix R ∈ R p×n and the number of independent source components d, ICA finds a full rank matrix  W ∈ R d×n such that the output matrix U ∈ R d×p given by contains columns (recovered source signals) as independent from each other as possible. Here n is the number of residuals from the data fitting, p is the number of wavelength pixels. The columns of U correspond to the decomposed independent source signals. We may choose one of many ways to approximate independence, and this choice governs the form of the ICA algorithm. The two broadest definitions of independence for ICA are: (1) minimization of mutual information; (2) maximization of non-Gaussianity. The non-Gaussianity family of ICA algorithms use kurtosis and negentropy. The minimization of mutual information family of ICA algorithms use the Kullback-Leibler divergence and maximum-entropy, however, the knowledge of source signal probability distribution function (PDF) is needed. Algorithms for ICA include infomax [1], FastICA [12], and JADE [2]. We opt for JADE because JADE is based on cumulants (2nd and 4th order statistics) and the approximate joint diagonalization of cumulant matrices (hence does not rely on PDF information of source signals). For moderate number of sources, it is more direct and stable than iterative methods such as infomax [1] and FastICA [12]. It was recently found [13] that the infomax method [1] may even diverge and that it only converges in a weak sense under proper rescaling and soft dynamic control of the iterations. The most attractive aspect of JADE is that it does not require parameter tuning (e.g. choosing the learning parameter in the iterative methods). In general, ICA algorithms cannot identify the actual number of source signals, so this number needs to be found by other means, for example by human evaluation of the end results. In our decomposition of Huber residuals, we tested a range for this number, and pinpointed the one with the most reliable and meaningful outcomes when calibrated with the knowledge of the existing trace gas spectral properties.

Experimental Setup
Spectra of chemical mixtures were collected using an environmental chamber [7] for which DOAS is one of the analytical techniques used to measure species during experiments. Fig. 4 shows a simplified schematic of the chamber and optical arrangement for DOAS. The chamber is 561 L in volume and can be evacuated to a pressure of ∼ 10 −2 Torr for collection of true I 0 (λ) spectra. Spectra can also be collected before and after addition of ultrapure air and each gaseous analyte of interest through various ports.
The DOAS instrumentation consists of a high pressure Xe arc lamp (Oriel, Model 6263) as the UV-visible light source. The light beam enters the chamber through a quartz window and undergoes multiple reflections using White cell mirrors [21] through the gas mixture in the chamber. The multiple reflections increase the path length of the light beam through the sample to a total path length of L = 52 m. The light beam exits the chamber through the quartz window and is focused on the entrance slit of a monochromator (Jobin Yvon-Spex, Model HR460) with a diode array detector (Princeton Instruments, model PDA-1024 ST121). The grating (1200 grooves mm −1 blazed at 330 nm) gives a dispersion of ∼ 0.043 nm /pixel and the detector has 1024 channels giving each spectrum a total wavelength range of ∼ 44 nm. Spectra can be collected in different wavelength ranges by moving the grating motor. Changes in grating position as well as temperature lead to changes in dispersion of the light beam on the detector. This is taken into account in the least squares analysis by allowing for shifting or linear compression/expansion in one or more reference spectra along the wavelength axis to obtain the best fit. The use of such techniques is standard and user controlled to correlate wavelengths with channels of the detector. Absolute dispersion and wavelength were calibrated using a mercury lamp spectrum that was recorded daily and at the beginning of each experiment.
The analytes added to the chamber were NO 2 and O 3 at a total pressure of ∼ 1 atm at room temperature in dry ultrapure air (Scott-Marrin, Riverside, CA). The wavelength range typically used to measure NO 2 by DOAS is 340 -380 nm. Although the air was dry (relative humidity < 0.8%), even small amounts of water react with NO 2 to form HONO [8]. As a result HONO is almost always present in detectable quantities with NO 2 . HONO is also typically measured using the 340 -380 nm wavelength range, thus the mixture of HONO and NO 2 was used as a convenient test case for the new DOAS analysis technique. The addition of O 3 leads to formation of NO 3 radicals (NO 2 + O 3 → NO 2 + O 2 ). Analysis for O 3 is typically carried out in a different wavelength range, 290 -330 nm. It should be noted that a wavelength range for analysis is usually that in which the cross sections are highest for that analyte in order to optimize the detection limits. O 3 continues to absorb at wavelengths > 330 nm, albeit with absorption cross sections that are lower by a factor of 100 or more [20] compared to those at shorter wavelengths. Another test of the technique was to determine if it could identify the presence of this third component, O 3 , in the 340 -380 nm range where its detection is not optimal. NO 3 analysis was carried out in a different range, 600 -640 and 640 -680 nm, and is not discussed here.
In addition to the new DOAS analysis technique introduced in this work, the typical linear least squares analysis was carried out on HONO and NO 2 using MFC [9] for which reference spectra are needed. A reference spectrum for NO 2 was generated by adding a known quantity of NO 2 to the chamber and collecting DOAS spectra with the instrumentation described above. Pure samples of HONO are difficult to generate without the presence of NO 2 , thus HONO reference spectra were generated from published cross sections [3,4] which were convoluted to the dispersion and resolution of our spectrometer. Reference spectra for O 3 were generated from published cross sections [20] also converted to the dispersion and resolution of the spectrometer.
Chemicals used in these experiments are as follows: Gaseous NO 2 was synthesized by reaction of gaseous NO (Matheson, 99%) which was first passed through a cold trap at 195 K to remove impurities such as HNO 3 , with an excess of O 2 (Oxygen Services Co., 99.993%). The mixture was allowed to react for 2 hrs. and then purified by condensing the NO 2 at 195 K to pump away excess O 2 . Gaseous O 3 was generated as a mixture in O 2 using a commercial ozonizer (Polymetrics, Model T-816).

Computational Results
We report here the computational results for the proposed method. In the first example, we fit the known reference spectra of NO 2 and HONO to the DOAS data. The method identifies an a-priori unknown trace gas O 3 (ozone) from the fitting residuals. The results are shown in a series of plots, Fig. 2-Fig. 6. We use 11 sets of data corresponding to different reaction times and hence different gas concentrations (X has 11 columns) from the experiment. Fig. 2 illustrates the data preprocessing (EMD) described earlier which removes the fastest and slowest varying components. The Huber fitting results are presented in Fig. 5 which shows the coefficients of NO 2 and HONO in the 11 mixtures in comparison with the concentrations determined using the least squares fitting technique. These coefficients determined using the hybrid ℓ 1 /ℓ 2 fitting technique are all non-negative as well as in very good quantitative agreement with values from least squares fitting. The fitting residual is in the third plot of Fig. 5. Though some structure can be seen in the residuals, it is not clear if there are other spectral structures embedded in the fitting residuals. Then further identification was done by JADE. For the 11 residuals, we vary the number of independent components in the JADE computation. We observed that the structure of the first plot in Fig. 6 remains approximately invariant as the number varies. This invariance implies that it should be a hidden trace gas in the fitting residual. It can be seen that the identified structure resembles O 3 in many peak locations, especially the region 340-350 nm. It should be noted that the ℓ 1 /ℓ 2 fitting technique currently does not incorporate shifting and squeezing of spectra to optimize fitting, but this can be implemented in the future. Spectral shifts and squeezes are often used in DOAS analysis routines to account for changes in grating dispersion due to temperature fluctuations and grating positioning accuracy [18,19]. The second example uses the same set of data (11 mixtures), however we only use the reference spectrum of NO 2 to fit the data. Ideally, we should recover O 3 and HONO from the residuals. The two identified hidden spectral signals are in Fig. 7 and Fig. 8. The recovered fits are recognizable as HONO and O 3 upon comparison with reference spectra, demonstrating the ability of the technique to identify absorption features without the use of reference spectra during the fitting procedure. While the ℓ 1 /ℓ 2 technique is demonstrated here for laboratory DOAS data with three components, its utility lies in the analysis of atmospheric DOAS spectra, which are more complex. The least squares method works best when reference spectra for all known absorbing species are used to carry out the fitting, i.e., when the fit residuals are unstructured and do not vary considerably with wavelength. Given that this condition is rarely satisfied for complex atmospheric measurements, the method described here is  [20] for comparison complementary in that it can identify species that are either not known to be present or do not yet have available published cross sections. In addition, the results in Fig.  5 show its value as an alternative stand-alone technique for analyzing DOAS spectra with the use of appropriate reference spectra.

Concluding Remarks
We developed a semi-blind source separation method for retrieving the concentrations and performing identifications of trace gases from DOAS spectra. The method is designed to identify potentially hidden trace gases after fitting the known trace gases to the data, which is a challenging problem. Our method can be useful for separating unknown source signals from the residuals after any known reference spectra have been first deployed to fit the data. The first novelty of the method is to employ the multi-resolution analysis (EMD) to remove the slowest varying component from the data. The removal of such components relies on a polynomial fit in the existing methods. Different polynomials may produce different results, and the degree of the fitting polynomial is often empirically defined. The multi-resolution approach avoids specifying the order of polynomial, and it extracts the slow component in an automatic fashion. The second novelty is to use a hybrid ℓ 1 /ℓ 2 interpolated norm (Huber function) to fit the data, which reduced the effects of outliers and kept the concentrations non-negative. Lastly, a multi-channel signal decomposition method (JADE) produced encouraging results on extracting hidden source signals from the fitting residuals. While use of the least squares fitting procedure for atmospheric data can quantify several trace species simultaneously, typical fit residuals often suggest there are remaining absorbers. In some cases, species can be inferred based on known atmospheric chemistry, e.g., HONO is often present in NO 2 mixtures. The major strength of the technique described here is its ability to be used either with existing published reference spectra for quantification or without references for identification of new absorbers. Numerical results on DOAS data show the promising potential of our method on both trace gas recovery and quantification.